Posted by: Greg Gloor
A problem when normalizing by the geometric mean is that the geometric mean is not defined if any of the values in a sample are 0 (see ref 1). However, we must keep in mind that just because we see a value of 0 for a given operational taxonomic unit (OTU) in this experiment, it does not necessarily mean that the true value of the OTU is actually 0.
Values of 0 for an OTU can arise because the OTU sequence could not occur in the experiment, or because the OTU could exist in one group but not the other, or because the OTU was very rare in one or more samples making its selection from the library subject to chance. In the first case, the OTU would not be found in any sample, and that OTU could simply be deleted from the dataset without effect. In the second case, the OTU would be represented in one group but not the other. In the third case, where an OTU has at least one count in at least one sample, a value of 0 could arise in other samples because of sampling and sequencing depth. In the latter two cases, it is possible that the OTU could have been detected if more reads per sample were obtained or if more replicates of the library were sequenced. Current practice in 16S rRNA gene sequencing studies is to assume that an observed value of 0 in a sample represents the actual value. In RNA-seq it is common to remove all genes where the total sum across all samples is small (usually with a mean of 2 or less and no more than about 10 counts in any sample). In either case, the assumption is that variables with very low counts are generally irrelevant (Although there are some exceptions that I will point out in later posts).
Another way of saying this is that: Given a set of counts for a sample, there are many possible sets of counts that are as likely if we sequenced them tomorrow.
What we need is some way to estimate what a count of 0 (or any other count for that matter) actually represents when we observe it in our datasets.
1) Wikipedia geometric mean
2) ANOVA-Like Differential Expression (ALDEx) Analysis for Mixed Population RNA-Seq
Values of 0 for an OTU can arise because the OTU sequence could not occur in the experiment, or because the OTU could exist in one group but not the other, or because the OTU was very rare in one or more samples making its selection from the library subject to chance. In the first case, the OTU would not be found in any sample, and that OTU could simply be deleted from the dataset without effect. In the second case, the OTU would be represented in one group but not the other. In the third case, where an OTU has at least one count in at least one sample, a value of 0 could arise in other samples because of sampling and sequencing depth. In the latter two cases, it is possible that the OTU could have been detected if more reads per sample were obtained or if more replicates of the library were sequenced. Current practice in 16S rRNA gene sequencing studies is to assume that an observed value of 0 in a sample represents the actual value. In RNA-seq it is common to remove all genes where the total sum across all samples is small (usually with a mean of 2 or less and no more than about 10 counts in any sample). In either case, the assumption is that variables with very low counts are generally irrelevant (Although there are some exceptions that I will point out in later posts).
Another way of saying this is that: Given a set of counts for a sample, there are many possible sets of counts that are as likely if we sequenced them tomorrow.
What we need is some way to estimate what a count of 0 (or any other count for that matter) actually represents when we observe it in our datasets.
1) Wikipedia geometric mean
2) ANOVA-Like Differential Expression (ALDEx) Analysis for Mixed Population RNA-Seq
No comments:
Post a Comment