Tuesday 12 May 2015

The count you get is just a guideline - part 2

Posted by: Greg Gloor

In previous posts I have shown the effect of sequencing on the underlying shape of the data, and argued that the data are better thought of as likelihoods of observing a particular count for a particular gene or OTU. Now I'm going to show the effect of combining those two ideas into an approach for examining datasets of this type that minimizes false positives while maintaining power.

The two-step method is implemented in the ALDEx2 package on Bioconductor. Step one is to model the technical variance by drawing Monte Carlo instances from the Dirichlet distribution. The Dirichlet is the multivariate generalization of the beta distribution, and drawing from it is a good estimator for the underlying technical variation that we have seen in previous blog posts. Step two is to use the centered log-ratio transformation to recapitulate the underlying shape of the data. These two steps are implemented in the aldex.clr function.
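The two steps can be sketched in Python. This is a minimal illustration of the technique under stated assumptions (a counts matrix with features as rows and samples as columns; the function name and the 0.5 prior are my own choices for the sketch), not the ALDEx2 implementation itself:

```python
import numpy as np

rng = np.random.default_rng(42)

def dirichlet_clr_instances(counts, n_instances=128):
    """Draw Monte Carlo instances of each sample's composition from a
    Dirichlet distribution (counts plus a small prior, so zero counts
    get non-zero probability), then apply the centered log-ratio (CLR)
    transformation to each instance.

    counts: (n_features, n_samples) matrix of non-negative counts.
    Returns: (n_instances, n_features, n_samples) array of CLR values.
    """
    counts = np.asarray(counts, dtype=float)
    n_features, n_samples = counts.shape
    out = np.empty((n_instances, n_features, n_samples))
    for j in range(n_samples):
        # Each Dirichlet draw is one plausible set of true proportions
        # consistent with the observed counts for this sample.
        p = rng.dirichlet(counts[:, j] + 0.5, size=n_instances)
        logp = np.log(p)
        # CLR: center each instance on its mean log proportion, which
        # recovers the underlying shape of the data on a log scale.
        out[:, :, j] = logp - logp.mean(axis=1, keepdims=True)
    return out
```

Each of the resulting instances is one plausible realization of the data given the technical variation, and downstream tests are run on every one of them.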

We can now use these transformed instances for standard statistical tests. Below is an example for a dataset where we are interested in knowing which variants differ in abundance between two conditions. What we want to find is the set of variants that gives us the same answer regardless of the underlying technical variation. The way we do this is to calculate P values, and Benjamini-Hochberg adjusted P values, for each Dirichlet instance, and then to tabulate the mean values. The mean P value for random data should converge on 0.5. The mean P value (and BH adjusted value) for real differences should be relatively invariant to technical replication. That is, we should draw the same conclusions whether we sequence today or tomorrow.
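That tabulation can be sketched as follows, assuming an array of CLR-transformed instances of shape (n_instances, n_features, n_samples). The function names are mine, not ALDEx2's; bh_adjust implements the standard Benjamini-Hochberg step-up procedure, and Welch's t-test stands in for whichever per-feature test is used:

```python
import numpy as np
from scipy import stats

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P values for a 1-D array."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest P value downwards.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(n)
    adj[order] = np.clip(scaled, 0.0, 1.0)
    return adj

def mean_pvalues(instances, group_a, group_b):
    """Run Welch's t-test per feature within each Dirichlet instance,
    BH-adjust within each instance, and return the raw and adjusted
    P values averaged across all instances."""
    n_inst, n_feat, _ = instances.shape
    p = np.empty((n_inst, n_feat))
    bh = np.empty((n_inst, n_feat))
    for i in range(n_inst):
        res = stats.ttest_ind(instances[i][:, group_a],
                              instances[i][:, group_b],
                              axis=1, equal_var=False)
        p[i] = res.pvalue
        bh[i] = bh_adjust(p[i])
    return p.mean(axis=0), bh.mean(axis=0)
```

A variant is then called significant only if its mean BH adjusted P value stays below the chosen threshold, which is what makes the call robust to the simulated technical replication.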

The figure shows four plots of what we get for different numbers of Dirichlet instances (number on the left), coupled with a histogram of the P values and the number of BH adjusted values that pass a threshold of <0.05 (number on the right). P values <0.05 are in orange, BH adjusted values <0.05 are in red. As you can see, we start off with 138 'significant' variants with a single instance, which is also about what we get if we just use our raw data. After only 5 Dirichlet instances we have reduced this to 80 'significant' variants, and by about 100 instances we stabilize at roughly 75.


Below is an animated gif that walks through 2000 instances. It is pretty clear that inferred technical variation has a large effect on what is called 'significant', and that we can remove at least some of this stochastic noise quite simply. Also it is clear that the number of significant variants stabilizes and is not affected by further random sampling.

(Click the image to restart the gif animation)
