Wednesday 20 May 2015

From the blogosphere - batch effects in sequence data

Posted by: Jean M Macklaim

We've performed a number of sequencing runs lately in replicate (replicated across PCR amplification, library preparation, sequencing run, and sequencing platform) to explore technical variation, including batch effects. We know technical variation can change your results in big ways.

A good example of a batch effect has been brought to attention recently:

http://simplystatistics.org/2015/05/20/is-it-species-or-is-it-batch-they-are-confounded-so-we-cant-know/

http://www.the-scientist.com/?articles.view/articleNo/43003/title/Batch-Effect-Behind-Species-Specific-Results-/

In summary, two major papers (1 and 2) published nearly a decade apart report the same conclusion: a given human tissue is more similar in gene expression to any other human tissue than to the equivalent tissue in mouse.

Whoa, what?

So human lung tissue is more similar to human eye tissue than to mouse lung?

As it turns out, this likely isn't true. The observation is due to a confounding batch effect. In other words, what the authors attributed to differences between species was actually confounded with differences in platforms and analyses (in the first case, for example, a different microarray with a different set of probes was used for mouse versus human). The more recent 2014 study has also been challenged in at least one publication for the same kind of confounded batch effect.

So how does one deal with confounding variables?

The most robust way is to sufficiently randomize your samples so that technical parameters don't line up with the biological parameters you are testing, and to maintain a consistent protocol throughout.

This, of course, does not guarantee you won't confound your data, but it should help prevent a systematic bias in sample preparation. If you process all your controls on Tuesdays and your affected samples on Fridays, you won't be able to determine whether the effect is due to the condition or the day of the week. As another example, it's important to randomize sample loading so you don't line up all of condition1 in row 1 of your 96-well plate, condition2 in row 2, and so on. This is where it is essential to collect sufficient metadata about your samples so you can explicitly examine confounding variables.
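As a rough illustration of that idea, here is a minimal sketch (not from any of the papers above) of randomizing samples across a 96-well plate and then checking whether condition lines up with a technical variable. The sample names, conditions, prep days, and the use of a chi-square test are all assumptions made for the example.

```python
# Randomize plate layout and check for confounding between condition
# and technical variables (hypothetical samples and metadata).
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)

# 48 control and 48 affected samples (made up for illustration)
samples = pd.DataFrame({
    "sample_id": [f"S{i:03d}" for i in range(96)],
    "condition": ["control"] * 48 + ["affected"] * 48,
})

# Shuffle the sample order before assigning plate positions,
# instead of filling the plate condition-by-condition
samples = samples.iloc[rng.permutation(len(samples))].reset_index(drop=True)
rows = list("ABCDEFGH")                          # 8 rows x 12 columns
samples["plate_row"] = [rows[i // 12] for i in range(96)]
samples["plate_col"] = [i % 12 + 1 for i in range(96)]

# Record technical metadata alongside the design so it can be examined later
samples["prep_day"] = rng.choice(["Tue", "Fri"], size=96)

# Check whether condition is associated with each technical variable;
# a very small p-value would suggest the design is confounded
for tech in ["plate_row", "prep_day"]:
    table = pd.crosstab(samples["condition"], samples[tech])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"condition vs {tech}: chi-square p = {p:.3f}")
```

The point of the check isn't the particular test; it's that the metadata table makes the technical variables visible at all, so any alignment with the biology can be caught before sequencing.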

And of course, ALWAYS look at your data.
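One quick way to do that is an ordination of your samples coloured by batch: if samples cluster by run or prep day rather than by condition, you have a problem. A small sketch with simulated counts (an assumption for illustration, not data from the post):

```python
# PCA of a log-transformed count table, coloured by batch, to see whether
# samples cluster by sequencing run rather than by biology (simulated data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical count matrix: 24 samples x 2000 features, two sequencing runs
counts = rng.poisson(lam=20, size=(24, 2000)).astype(float)
batch = np.array(["run1"] * 12 + ["run2"] * 12)
counts[batch == "run2"] *= 1.3          # simulate a modest batch shift

# Log-transform and project onto the first two principal components
log_counts = np.log2(counts + 1)
pcs = PCA(n_components=2).fit_transform(log_counts)

# Plot PC1 vs PC2 coloured by batch; clear separation suggests a batch effect
for b, colour in [("run1", "tab:blue"), ("run2", "tab:orange")]:
    sel = batch == b
    plt.scatter(pcs[sel, 0], pcs[sel, 1], label=b, color=colour)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend(title="sequencing run")
plt.tight_layout()
plt.show()
```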

Related articles:

Leek, Jeffrey T., et al. "Tackling the widespread and critical impact of batch effects in high-throughput data." Nature Reviews Genetics 11.10 (2010): 733-739. doi:10.1038/nrg2825

Akey, Joshua M., et al. "On the design and analysis of gene expression studies in human populations." Nature Genetics 39 (2007): 807-808. doi:10.1038/ng0707-807
