Friday 10 April 2015

Getting your high-throughput sequencing data in shape

Posted by: Greg Gloor

High throughput sequencing data is often spoken of as "counts" (RNA-seq) or as "relative abundances" (16S rRNA gene sequencing, metagenomics). What does this really mean, and how is the data derived?

Whenever we conduct a HTS experiment we are usually interested in counting and comparing the relative number of molecules in an input sample. In other words, comparing the abundance of molecules in a tube treated one way to the abundance of the same set of molecules treated in a different way. We then make a library of those DNA or RNA molecules, apply them to the sequencer, collect and map the results into a table of counts where the counts per gene (OTU, feature) are tabulated. These counts are normalized for sequencing depth in one of two major ways. First, by an RNA-seq type normalization that seeks to find correction factors that give the same number of counts per sample, or second by converting the values to proportions or percentages. Then statistical tests are done to determine what, if anything is different between treatments or groups.

A simple thought experiment shows why this approach can give incorrect inferences. Let us take an original set of molecules containing 100 different features (genes or OTUs), and ensure that 99 of them do not change. We let one molecule increase in abundance by 2-fold  increments for 20 time increments. This would be similar to an experiment where a single bacterial species overgrows in an environment until it becomes the dominant organism, but where the other organisms are neither killed nor eliminated. An equivalent in RNA-Seq would be an experiment where the expression of one gene is induced from an extremely strong promoter. Such an environment would look like the left panel of the figure labelled as Input. This is the original data if we could physically count the molecules, and any analysis should be done on data with the same shape.




When we make a library and sequence these 20 samples the sequencer can provide only a fixed upper number of reads per sample. In the case of an Illumina HiSeq with 200 million reads, this would work out to approximately 10 million reads per sample. Thus, the molecules from  each time point now represent a random sample of the original input molecules, and the number of reads per feature is essentially reduced to reads per 10 million reads. This changes the shape of the data as shown in the middle panel labeled  proportion. This is the shape of the data as analyzed by almost all bioinformatic tools.

If we compare different time points can we make the correct inference? Here is where it gets interesting. If we are comparing time point 1 to time point 5, we would probably be correct in that the black sequence has increased in abundance, and the remainder are unchanged. However, if we compare later time points, say point 15 to  point 20, we could be very wrong. We could easily infer that black is unchanged, and that the others decreased!

Data transformation to the rescue. For this reason, data transformations on proportions are often suggested, and the transformation that recaptures the shape of the input data the best is the centered log-ratio transformation shown in the right panel. In this transformation all values in a sample are divided by the geometric mean abundance of the features and then the logarithm of the ratio is taken (I will deal with the issue of 0 values in a later post). This transformation, pioneered by Aitchison in 1986, is a simple, general purpose tool that will recapitulate the accurate shape of the data provided most of the data points are either unchanged, or change at random. Values are now expressed as the ratio to the geometric mean abundance, so the red feature is rarer than average in all samples, and the black feature goes from slightly more abundant than average to substantially more abundant than anything else in the sample. In this case, it should be clear that we would likely make the correct statistical inference regarding abundance changes for each feature regardless of the time increments compared.


No comments:

Post a Comment