Wednesday 20 May 2015

From the blogosphere - batch effects in sequence data

Posted by: Jean M Macklaim

We've recently performed a number of replicated sequencing runs (replicating PCR amplification, library preparation, sequencing run, and sequencing platform) to explore technical variation, including batch effects. We know technical variation can change your results in big ways.

A good example of a batch effect was brought to attention recently:

http://simplystatistics.org/2015/05/20/is-it-species-or-is-it-batch-they-are-confounded-so-we-cant-know/

http://www.the-scientist.com/?articles.view/articleNo/43003/title/Batch-Effect-Behind-Species-Specific-Results-/

In summary, two major papers (1 and 2) published nearly a decade apart report the same conclusion: a given human tissue is more similar in gene expression to any other human tissue than to the equivalent tissue in mouse.

Whoa, what?

So human lung tissue is more similar to human eye tissue than to mouse lung?

As it turns out, this likely isn't true. The observation is due to a confounding batch effect. In other words, what the authors attributed to differences between species was actually confounded by differences in platforms and analyses (in the first case, for example, a different microarray with a different set of probes was used for mouse than for human). The more recent 2014 study has also been challenged for a confounded batch effect in at least one publication.

So how does one deal with confounding variables?

The most robust way is to sufficiently randomize your samples so technical parameters don't line up with the biological parameters you are testing, and ensure you maintain a consistent protocol throughout.

This, of course, does not guarantee you won't confound your data, but it should help prevent a systematic bias in sample preparation. If you process all your controls on Tuesdays and your affected samples on Fridays, you won't be able to determine whether the effect is due to the condition or the day of the week. As another example, it's important to randomize sample loadings so you don't line up all of condition 1 in row 1 of your 96-well plate, condition 2 in row 2, and so on. This is where it is essential to collect sufficient metadata about your samples so that you can explicitly examine confounding variables.
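To make the plate example concrete, here is a minimal base-R sketch of a randomized plate layout. The sample names, condition labels, and plate size are hypothetical placeholders; the point is simply that recording a seed and shuffling the assignment keeps conditions from lining up with rows, columns, or processing batches.

# A minimal sketch of randomizing a 96-well plate layout (base R).
# 'samples' and 'condition' below are hypothetical placeholders.

set.seed(42)  # record the seed so the randomized layout is reproducible

samples   <- paste0("sample_", 1:48)
condition <- rep(c("control", "treated"), each = 24)

wells <- paste0(rep(LETTERS[1:8], each = 12), 1:12)  # A1 .. H12

# Randomly assign samples to wells so conditions do not track rows or columns
layout <- data.frame(sample    = samples,
                     condition = condition,
                     well      = sample(wells, length(samples)))

# Quick check: conditions should be spread across plate rows
table(substr(layout$well, 1, 1), layout$condition)

Keeping this layout table alongside the rest of your sample metadata is what later lets you test explicitly whether plate position or processing day explains any of the signal.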

And of course, ALWAYS look at your data

Tuesday 12 May 2015

The count you get is just a guideline - part 2

Posted by: Greg Gloor

In previous posts I have shown the effect of sequencing on the underlying shape of the data, and argued that the data are better thought of as likelihoods of observing a particular count for a particular gene or OTU. Now I'm going to show the effect of combining those two ideas into an approach for examining datasets of this type that minimizes false positives while maintaining power.

The two-step method is implemented in the ALDEx2 package on Bioconductor. Step one is to model the technical variance by sampling instances from the Dirichlet distribution. This distribution is the conjugate prior of the multinomial, and is a good estimator for the underlying technical variation that we have seen in previous blog posts. Step two is to use the centered log-ratio transformation to recapitulate the underlying shape of the data. These two steps are implemented in the aldex.clr function.
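To make the two steps concrete, here is a base-R sketch of a single Dirichlet instance followed by the centered log-ratio transformation. The counts and feature names are made up, and this is only an illustration of the idea, not a reproduction of how aldex.clr is implemented internally.

# One Monte Carlo instance of the two-step transformation (illustrative sketch).
# 'counts' is a hypothetical vector of read counts for a single sample.
counts <- c(geneA = 120, geneB = 3, geneC = 0, geneD = 55)

# Step 1: sample one instance from the Dirichlet distribution.
# A Dirichlet draw can be generated by normalizing independent gamma variates;
# a small prior (0.5 here) keeps zero counts in the analysis.
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}
p <- rdirichlet_one(counts + 0.5)

# Step 2: the centered log-ratio (clr) transformation,
# clr(p) = log(p) - mean(log(p)),
# i.e. the log of each proportion relative to the geometric mean of the instance.
clr <- log(p) - mean(log(p))
clr

In practice step one is repeated many times per sample (ALDEx2's default is 128 instances) and all of the instances are carried forward into the statistical tests.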

We can now use these transformed instances for standard statistical tests. Below is an example for a dataset where we are interested in knowing which variants differ in abundance between two conditions. What we want is to find out which variants will give us the same answer regardless of the underlying technical variation. The way we do this is to calculate P values and Benjamini-Hochberg adjusted P values for each Dirichlet instance, and then to tabulate the mean values. The mean P value for random data should converge on 0.5. The mean P value (and BH adjusted value) for real differences should be relatively invariant to technical replication. That is, we should reach the same conclusions whether we sequence today or tomorrow.
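Here is a base-R sketch of that averaging step, assuming you already have a list of clr-transformed instances. Within ALDEx2 this bookkeeping is handled for you; the sample sizes, conditions, and random matrices below are stand-ins for illustration only.

set.seed(1)
conds <- factor(rep(c("A", "B"), each = 7))   # hypothetical two-condition design

# Stand-in for the clr-transformed Dirichlet instances:
# a list of matrices (100 variants x 14 samples), one matrix per instance.
clr_instances <- replicate(16, matrix(rnorm(100 * 14), nrow = 100), simplify = FALSE)

# Raw and BH-adjusted P values for every variant, within each instance
per_instance <- lapply(clr_instances, function(m) {
  p <- apply(m, 1, function(v) t.test(v ~ conds)$p.value)
  cbind(p = p, p.BH = p.adjust(p, method = "BH"))
})

# Average across instances; for random data the mean P value sits near 0.5
mean.p    <- rowMeans(sapply(per_instance, function(x) x[, "p"]))
mean.p.BH <- rowMeans(sapply(per_instance, function(x) x[, "p.BH"]))

sum(mean.p.BH < 0.05)   # number of variants still called 'significant'

With real data, the variants whose mean BH-adjusted P value stays below the threshold are the ones that are robust to the inferred technical variation.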

The figure shows four plots of what we get from different numbers of Dirichlet instances (number on the left), coupled with a histogram of the P values and the number of BH-adjusted values that pass a threshold of <0.05 (number on the right). P values <0.05 are in orange, BH <0.05 are in red. As you can see, we start off with 138 'significant' variants with a single instance, which is also about what we get if we just use our raw data. After only 5 Dirichlet instances we have reduced this to 80 'significant' variants, and by about 100 instances we stabilize at around 75.


Below is an animated gif that walks through 2000 instances. It is pretty clear that inferred technical variation has a large effect on what is called 'significant', and that we can remove at least some of this stochastic noise quite simply. It is also clear that the number of significant variants stabilizes and is not affected by further random sampling.

(Click the image to restart the gif animation)