Wednesday 15 April 2015

The count we see for an OTU or ORF is just a guideline

Posted by: Greg Gloor

A problem with normalizing by the geometric mean is that it is not defined if any value in a sample is 0 (see ref 1). However, we must keep in mind that observing a value of 0 for a given operational taxonomic unit (OTU) in this experiment does not necessarily mean that the true value of the OTU is actually 0.
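To make the zero problem concrete, here is a minimal sketch of a geometric mean computed the usual way, as the exponential of the mean of the logs. The function name and the example counts are made up for illustration; they are not taken from any particular package or dataset.

```python
import math

def geometric_mean(counts):
    """Geometric mean as exp(mean(log(x))); needs every count to be > 0."""
    if any(c <= 0 for c in counts):
        # log(0) does not exist, so the log-based calculation breaks down
        raise ValueError("geometric mean via logs is undefined when any count is 0")
    return math.exp(sum(math.log(c) for c in counts) / len(counts))

# Hypothetical OTU counts for two samples
sample_ok = [12, 3, 40, 7]
sample_with_zero = [12, 0, 40, 7]

print(geometric_mean(sample_ok))
try:
    geometric_mean(sample_with_zero)
except ValueError as err:
    print("cannot normalize:", err)
```

A single 0 in the sample is enough to stop the calculation, which is why the meaning of an observed 0 matters so much.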

Values of 0 for an OTU can arise for three reasons: because the OTU sequence could not occur in the experiment, because the OTU could exist in one group but not the other, or because the OTU was very rare in one or more samples, making its selection from the library subject to chance. In the first case, the OTU would not be found in any sample, and that OTU could simply be deleted from the dataset without effect. In the second case, the OTU would be represented in one group but not the other. In the third case, where an OTU has at least one count in at least one sample, a value of 0 could arise in other samples because of sampling and sequencing depth. In the latter two cases, it is possible that the OTU would have been detected if more reads per sample had been obtained or if more replicates of the library had been sequenced.

Current practice in 16S rRNA gene sequencing studies is to assume that an observed value of 0 in a sample represents the actual value. In RNA-seq it is common to remove all genes where the total sum across all samples is small (usually with a mean of 2 or less and no more than about 10 counts in any sample). In either case, the assumption is that variables with very low counts are generally irrelevant (although there are some exceptions that I will point out in later posts).

Figure: The distribution of counts for an OTU in one replicate when the same OTU has a count between 0 and 3 in the other replicate. The same library of sixteen different 16S rRNA gene amplification samples was sequenced on two different Illumina HiSeq runs. OTUs that had values of 0 to 3 in one replicate, shown by the black bar, had their counts tabulated for the other replicate, shown by the grey bar. Sequencing depth for each replicate was within 10% of the other replicate.

This assumption was tested by sequencing the exact same library from 16 different samples on two separate Illumina runs, and then determining the OTU count in one run if the OTU had a count of 0, 1, 2, or 3 in the other run. The figure shows the result: the count observed for an OTU in one replicate is often very different from the count observed in the other replicate. Similar observations hold for RNA-seq (2). It is clear that the absolute number of counts observed varies between pure technical replicates and, as expected, the underlying distributions approximate what would be expected for a random sample of the input library.

Another way of saying this is that, given a set of counts for a sample, there are many other possible sets of counts that would be just as likely if we sequenced the same sample again tomorrow.
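The random-sampling idea can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the analysis used for the figure: the library below is hypothetical (300 OTUs with steeply declining abundance), the depth of 5000 reads is arbitrary, and the function names are my own. Two "technical replicates" are drawn from the same library, and the counts in the second replicate are tabulated for OTUs that had 0 to 3 reads in the first.

```python
import random
from collections import Counter, defaultdict

def sequence_library(proportions, depth, rng):
    """Draw `depth` reads at random from a library with the given OTU weights."""
    otus = range(len(proportions))
    reads = rng.choices(otus, weights=proportions, k=depth)
    counts = Counter(reads)
    return [counts.get(i, 0) for i in otus]

def tabulate_low_counts(rep_a, rep_b, max_count=3):
    """For OTUs with 0..max_count reads in rep_a, collect their rep_b counts."""
    table = defaultdict(list)
    for a, b in zip(rep_a, rep_b):
        if a <= max_count:
            table[a].append(b)
    return dict(table)

rng = random.Random(2015)  # fixed seed so the sketch is repeatable
# Hypothetical library: 300 OTUs, abundance falling off steeply with rank
weights = [1.0 / (rank + 1) ** 1.5 for rank in range(300)]
rep_a = sequence_library(weights, depth=5000, rng=rng)
rep_b = sequence_library(weights, depth=5000, rng=rng)

# For each low count in replicate A, show how often each count occurs in B
for a_count, b_counts in sorted(tabulate_low_counts(rep_a, rep_b).items()):
    print(a_count, sorted(Counter(b_counts).items()))
```

Each printed row gives a count in the first replicate followed by the spread of counts seen for those same OTUs in the second replicate, which is the same kind of tabulation the figure describes.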

What we need is some way to estimate what a count of 0 (or any other count for that matter) actually represents when we observe it in our datasets.

 1) Wikipedia geometric mean
 2) ANOVA-Like Differential Expression (ALDEx) Analysis for Mixed Population RNA-Seq
