Thursday 7 April 2016

Sequencing is not counting!

Posted by: Greg Gloor

The idea of using compositional data analysis (CoDa) approaches to examine microbiome samples appears to be gaining traction. There has been a recent spate of papers (and not just from us) that are going in the right direction by acknowledging that we have compositional data.

However, to my mind, most papers do not go far enough, and still make strong assumptions about the data. I want to briefly outline why we must embrace the idea of CoDa in its entirety.

Fundamentally, sequencing is not counting. To see this, let's set up a very simple thought experiment. In ecology, it is common to count species within some area, and then to use these counts for the analysis.

In the counting example, we have 5 tigers and 20 ladybugs in some defined area which is a random sample of the environment. We can assume that the animals are free to move both within the box and between the box and the environment, and that the density of animals inside and outside the box is about the same. Therefore, if one more tiger should wander in from the outside, that would not alter the number of ladybugs, and vice versa. Moreover, if a space alien happens to land in the counting box, it will likely not alter the count of tigers or ladybugs (unless it lands on and squishes one!). We can see that the abundances of tigers and ladybugs can essentially be treated as uncorrelated. In addition, we can normalize for different box sizes rather simply by scaling the counts by the size of the box. So if one student measures a box of size 1, and another a box of size 2, we can adjust for sampling effort fairly simply. This is the rationale behind normalizing for sequencing depth, and if we had true counts, it would work marvelously for sequencing data.
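
The box-size adjustment is simple arithmetic. Here is a minimal Python sketch of it; the function name `density` and the second student's counts are made up for illustration, and it assumes the animal density really is the same in both boxes.

```python
def density(counts, box_size):
    """Convert raw counts into animals per unit of box area."""
    return {species: n / box_size for species, n in counts.items()}

# Student A counts a box of size 1; student B counts a box of size 2.
student_a = density({"tiger": 5, "ladybug": 20}, box_size=1)
student_b = density({"tiger": 10, "ladybug": 40}, box_size=2)  # hypothetical counts

# After dividing by box size, the two students agree.
print(student_a == student_b)
```

This only works because counting in a box places no cap on the total number of animals.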


But, this is not the case with sequencing. Every DNA sequencing instrument has a fixed upper limit on the number of reads that can be delivered, and this is illustrated by having exactly 25 cells inside the box. Visualized in this way, we can see the difference in how the same counts of 5 tigers and 20 ladybugs must behave in the face of a perturbation. Either a tiger or a ladybug must be displaced if another tiger wanders onto the sequencing grid. Another way of saying this is that the tiger and ladybug numbers observed are correlated, since an increase in one necessarily involves a decrease in the other. If the space alien now enters the sequencing grid, we see that again we must displace a tiger or a ladybug. This causes all manner of problems for traditional approaches because of spurious correlation, negative correlation bias and sub-compositional correlation effects.
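
The forced correlation can be seen in a small simulation. The sketch below is an illustration under stated assumptions, not a model of any real instrument: tiger and ladybug abundances in the environment are drawn independently, and a hypothetical `fill_grid` helper then fills exactly 25 cells in proportion to those abundances.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def fill_grid(abundance, n_cells, rng):
    """Fill a fixed number of cells, each drawn in proportion to abundance."""
    species = list(abundance)
    weights = [abundance[s] for s in species]
    draws = rng.choices(species, weights=weights, k=n_cells)
    return {s: draws.count(s) for s in species}

rng = random.Random(0)  # fixed seed so the run is repeatable
true_t, true_l, obs_t, obs_l = [], [], [], []
for _ in range(500):
    # Assumption: the two abundances vary independently in the environment.
    env = {"tiger": rng.randint(1, 50), "ladybug": rng.randint(1, 50)}
    grid = fill_grid(env, 25, rng)
    true_t.append(env["tiger"])
    true_l.append(env["ladybug"])
    obs_t.append(grid["tiger"])
    obs_l.append(grid["ladybug"])

r_true = pearson(true_t, true_l)  # correlation in the environment
r_obs = pearson(obs_t, obs_l)     # correlation on the 25-cell grid
print(r_true, r_obs)
```

With only two species on a full grid, the ladybug count is always 25 minus the tiger count, so the observed correlation is -1 no matter how the animals behave in the environment.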

This problem generalizes to any number of species, and is not alleviated by increasing the number of cells in the grid. Attempts to normalize read counts, which is the same as normalizing the number of squares in the grid, are doomed to fail because of this property.
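
To make the grid-size point concrete, here is a small Python sketch using expected cell counts rather than random draws. The scenario is hypothetical: the ladybugs double in the environment while the tigers stay at 5, and `expected_cells` is an illustrative helper, not part of any package.

```python
from fractions import Fraction

def expected_cells(abundance, n_cells):
    """Expected cells per species when the grid is filled in proportion to abundance."""
    total = sum(abundance.values())
    return {s: Fraction(n_cells * a, total) for s, a in abundance.items()}

before = {"tiger": 5, "ladybug": 20}
after = {"tiger": 5, "ladybug": 40}  # hypothetical: ladybugs double, tigers unchanged

def tiger_fold_change(n_cells):
    """Apparent change in tigers on a grid with n_cells cells."""
    return expected_cells(after, n_cells)["tiger"] / expected_cells(before, n_cells)["tiger"]

small_grid = tiger_fold_change(25)
large_grid = tiger_fold_change(25_000_000)
print(small_grid, large_grid)  # prints "5/9 5/9"
```

The tigers appear to drop to 5/9 of their former number on both grids, even though not a single tiger left the environment. A million times more cells changes nothing.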

So what can we do? We have to embrace the idea that the actual numbers we observe from a sequencing run are irrelevant, and the only information available is relative information: that is, the ratios between the ladybugs and tigers are the only measurable property. We have begun the process of adapting the full measure of CoDa tools to microbiome studies and our first paper is just out in Annals of Epidemiology: "It's all relative: analyzing microbiome data as compositions", and I hope you give it a read. There is also a supplement that contains all the code used to generate the paper.
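
As a rough illustration of what "only the ratios matter" means, the Python sketch below compares a shallow and a deep run of the same sample. It is not taken from the paper's supplement; it uses the centered log-ratio (clr), one standard CoDa transform, and assumes there are no zero counts, since the logarithm of zero is undefined.

```python
import math

def clr(counts):
    """Centered log-ratio: log of each count minus the mean of the logs."""
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

shallow = [5, 20]      # tigers, ladybugs on a 25-cell grid
deep = [500, 2000]     # same proportions, 100 times more cells (hypothetical)

ratio_shallow = shallow[0] / shallow[1]
ratio_deep = deep[0] / deep[1]

clr_shallow = clr(shallow)
clr_deep = clr(deep)
print(ratio_shallow, ratio_deep)
print(clr_shallow, clr_deep)
```

The raw numbers differ a hundredfold between the two runs, but the tiger-to-ladybug ratio and the clr values are the same, which is why log-ratios make a sensible starting point for analysis.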

Although there are many issues to be resolved, mainly because sparse data can be problematic with the CoDa approach, I think that it is worth checking out for your datasets. In a later post, I will outline the general approach we use to account for the sparsity problem.