Thursday, 23 April 2015

Weighted UniFrac in all its forms

Posted by: Ruth Grace Wong

It turns out, weighted UniFrac is slightly different depending on what pipeline you’re using. Here’s the lowdown:

Original weighted UniFrac

The formula used in the original UniFrac paper (Lozupone et al. 2005) is thus:
Given a rooted tree for which reads are associated with the leaves on the tree. For each edge of the tree, you can determine the proportional abundance of the reads associated with all its descendant leaves. The weighted UniFrac measure is calculated by multiplying each branch length of the tree with the difference in proportional abundance of its descendants between the two samples, and taking the sum.


The supplementary material includes a proof that weighted UniFrac is a distance metric, by showing that, given samples a, b, and c, the distance between a and b must always be less than or equal to the sum of the distance between a to c and c to b. This satisfies the triangle inequality requirement. The same tree is used for all three distances calculated in the proof, and the distances can range from zero to the total branch lengths * the height of the phylogenetic tree (which is greater than one). Also, as stated in the UniFrac Revealed blog post, this form of weighted UniFrac produces the same distance matrix no matter how the phylogenetic tree is rooted.

Normalized weighted UniFrac

A later publication by the same group (Lozupone et al. 2007) describes a normalized weighted UniFrac, which outputs values between 0 and 1:

The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:



If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows: This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows:


Some may prefer the normalization so that the range of possible distance values is the same between experiments. However, this disrupts the triangle inequality property. Take the following example data set with three samples and 12 taxa:

          0   1     2   5    4    8    7   13   10   11   18   15
sample1 1141 330 12382 477  483  504 1809 1373   96   78 4514   63
sample2  697 260  6340 544 1700 2869  469  424   28   62   86 3066
sample3 1696 332  1078 290 2795 6125  434  121 3056 2788   63 5772

The corresponding tree, in Newick format, is

(((7:0.48127,(2:0.01401,13:0.02789)0.840:0.07613)0.626:0.04214,(8:0.21777,5:0.03645)0.881:0.37634)0.747:0.16584,(((((0:0.1161,4:0.20244)0.876:0.07231,1:0.05902)0.726:0.25695,15:0.47267)0.336:0.04988,18:0.01402)0.871:0.23856,(10:0.05189,11:0.03639)0.925:0.20682)0.553:0.00433);

If UniFrac is calculated with the normalization between all three samples, the distance between sample 1 and 3 ( 0.5884264) is 0.0202271 larger than the sum of the distances between samples 1& 2 ( 0.3404242) and 2 & 3 ( 0.2277751). If UniFrac is calculated without the extra division, the distances are: 1 to 2 = 0.09914178, 2 to 3 = 0.0835136, and 1 to 3 = 0.1814773, and the triangle inequality is satisfied, as expected.

Implementation in QIIME (FastUniFrac)

True to form, the Qiime implementation is identical to the original formula

return (branch_lengths * abs((i/float(i_sum))-(j/float(j_sum)))).sum()

If you look at a distance matrix output by Qiime, you will probably find at least one value greater than one. Qiime functions can also be run with normalized weighted UniFrac by using the dist_weighted_normalized_unifrac parameter rather than dist_weighted_normalized_unifrac (see the Qiime documentation for details).

Implementation in Mothur

The mothur weighted unifrac code implements normalized weighted UniFrac, as noted in the Discussion section of the Mothur weighted UniFrac algorithm page.

Implementation in GUniFrac

The GUniFrac paper (Chen et al. 2012) erroneously cites the normalized weighted formula as follows:

The (normalized) weighted UniFrac distance (Lozupone et al., 2007) weights the branch length with abundance difference and is defined as


Instead of using the average distance of each sequence to the root in the denominator, they’ve used the branch length.

They say in their paper that normalized UniFrac overweights long branch lengths and isn’t very sensitive to differential proportional abundance on short branch lengths. To fix this, they propose their own normalized GUniFrac formula:

To attenuate the weight on branches with large proportions, we may instead use the relative difference |piA − piB|/(piA + piB) (∈ [0,1]) in the formulation. We denote this distance measure as


This formulation does not fit the triangle inequality either. When run with the three example microbial communities used above, it yields the same distances as the normalized weighted UniFrac measurement.

Closing  thoughts

To summarize, there are three different types of weighted UniFrac out there (original, normalized, and GUniFrac). The original and normalized version are the most common, and are usually analyzed with Euclidean statistical methods, such as PCoA plots, even though the normalized version is not a Euclidean metric.

Thursday, 16 April 2015

Metabolic markers of the vaginal microbiome

Posted by: Jean M Macklaim

Most of my PhD work was on the vaginal microbiome. At the end of my PhD another graduate student, Amy McMillan, started to uncover the metabolic markers of the vaginal microbiome and conditions like bacterial vaginosis (BV). Her paper has been under review at PNAS, and a draft was recently made available on the arXiv: A multi-platform metabolomics approach identifies novel biomarkers associated with bacterial diversity in the human vagina (McMillan et al, 2015)

Unfortunately, someone beat us to the punch. A paper was published this week in mBio by Srinivasan et al.: Metabolic Signatures of Bacterial Vaginosis (Srinivasan et al., 2015. mBio)

Although with similar aims, our paper not only differentiates BV from health with metabolic markers, but we explore the effect of pregnancy in women from Rwanda.

In a positive spin (if there is any light side to getting scooped), Srinivasan et al. did observe a number of metabolic markers we had previously predicted using metatranscriptomics: Comparative meta-RNA-seq of the vaginal microbiota and differential expression by Lactobacillus iners in health and dysbiosis (Macklaim et al., 2013. Microbiome Journal). Notably, we predicted elevated glycerol and glycerol-3-P utilizing enzymes under BV conditions, and Srinivasan et al. reported both these molecules depleted in BV.

In some ways it's very reassuring to see your previous work validated in another way/by another group. But, unfortunately, our metatranscriptome paper was not cited for this finding.


Figure 3 from Macklaim et al., 2013. Microbiome Journal, Comparative meta-RNA-seq of the vaginal microbiota and differential expression by Lactobacillus iners in health and dysbiosis


Wednesday, 15 April 2015

The count we see for an OTU or ORF is just a guideline

Posted by: Greg Gloor

A problem when normalizing by the geometric mean is that the geometric mean is not defined if any of the values in a sample are 0 (see ref 1). However, we must keep in mind that just because we see a value of 0 for a given operational taxonomic unit (OTU) in this experiment, it does not necessarily mean that the true value of the OTU is actually 0.

Values of 0 for an OTU can arise because the OTU sequence could not occur in the experiment, or because the OTU could exist in one group but not the other, or because the OTU was very rare in one or more samples making its selection from the library subject to chance. In the first case, the OTU would not be found in any sample, and that OTU could simply be deleted from the dataset without effect. In the second case, the OTU would be represented in one group but not the other. In the third case, where an OTU has at least one count in at least one sample, a value of 0 could arise in other samples because of sampling and sequencing depth. In the latter two cases, it is possible that the OTU could have been detected if more reads per sample were obtained or if more replicates of the library were sequenced. Current practice in 16S rRNA gene sequencing studies is to assume that an observed value of 0 in a sample represents the actual value. In RNA-seq it is common to remove all genes where the total sum across all samples is small (usually with a mean of 2 or less and no more than about 10 counts in any sample). In either case, the assumption is that variables with very low counts are generally irrelevant (Although there are some exceptions that I will point out in later posts).

The distribution of counts in a replicate OTU when the first OTU has counts between 0 and 3. The same library of sixteen different 16S rRNA gene amplification samples were sequenced on two different Illumina HiSeq runs, and the count of OTUs that had values of 0 to 3 in one replicate, shown by the black bar, were tabulated for the other replicate, shown by the grey bar. Sequencing depth for each replicate was within 10% of the the other replicate. 
This assumption was tested by sequencing the exact same library from 16 different samples on two individual Illumina runs, and then determining the OTU count in one run if the OTU had a count of 0, 1, 2, or 3 in the other run. The figure shows the result, and it can be seen that the count observed for an OTU in one replicate is often very different from the count observed for the other replicate. Similar observations hold for RNA-seq (2) It is clear that the absolute number of counts observed varies between pure technical replicates and as expected the underlying distributions approximate what would be expected for random sample of the input library.

Another way of saying this is that: Given a set of counts for a sample, there are many possible sets of counts that are as likely if we sequenced them tomorrow.

What we need is some way to estimate what a count of 0 (or any other count for that matter) actually represents when we observe it in our datasets.

 1) Wikipedia geometric mean
 2) ANOVA-Like Differential Expression (ALDEx) Analysis for Mixed Population RNA-Seq

Friday, 10 April 2015

Getting your high-throughput sequencing data in shape

Posted by: Greg Gloor

High throughput sequencing data is often spoken of as "counts" (RNA-seq) or as "relative abundances" (16S rRNA gene sequencing, metagenomics). What does this really mean, and how is the data derived?

Whenever we conduct a HTS experiment we are usually interested in counting and comparing the relative number of molecules in an input sample. In other words, comparing the abundance of molecules in a tube treated one way to the abundance of the same set of molecules treated in a different way. We then make a library of those DNA or RNA molecules, apply them to the sequencer, collect and map the results into a table of counts where the counts per gene (OTU, feature) are tabulated. These counts are normalized for sequencing depth in one of two major ways. First, by an RNA-seq type normalization that seeks to find correction factors that give the same number of counts per sample, or second by converting the values to proportions or percentages. Then statistical tests are done to determine what, if anything is different between treatments or groups.

A simple thought experiment shows why this approach can give incorrect inferences. Let us take an original set of molecules containing 100 different features (genes or OTUs), and ensure that 99 of them do not change. We let one molecule increase in abundance by 2-fold  increments for 20 time increments. This would be similar to an experiment where a single bacterial species overgrows in an environment until it becomes the dominant organism, but where the other organisms are neither killed nor eliminated. An equivalent in RNA-Seq would be an experiment where the expression of one gene is induced from an extremely strong promoter. Such an environment would look like the left panel of the figure labelled as Input. This is the original data if we could physically count the molecules, and any analysis should be done on data with the same shape.




When we make a library and sequence these 20 samples the sequencer can provide only a fixed upper number of reads per sample. In the case of an Illumina HiSeq with 200 million reads, this would work out to approximately 10 million reads per sample. Thus, the molecules from  each time point now represent a random sample of the original input molecules, and the number of reads per feature is essentially reduced to reads per 10 million reads. This changes the shape of the data as shown in the middle panel labeled  proportion. This is the shape of the data as analyzed by almost all bioinformatic tools.

If we compare different time points can we make the correct inference? Here is where it gets interesting. If we are comparing time point 1 to time point 5, we would probably be correct in that the black sequence has increased in abundance, and the remainder are unchanged. However, if we compare later time points, say point 15 to  point 20, we could be very wrong. We could easily infer that black is unchanged, and that the others decreased!

Data transformation to the rescue. For this reason, data transformations on proportions are often suggested, and the transformation that recaptures the shape of the input data the best is the centered log-ratio transformation shown in the right panel. In this transformation all values in a sample are divided by the geometric mean abundance of the features and then the logarithm of the ratio is taken (I will deal with the issue of 0 values in a later post). This transformation, pioneered by Aitchison in 1986, is a simple, general purpose tool that will recapitulate the accurate shape of the data provided most of the data points are either unchanged, or change at random. Values are now expressed as the ratio to the geometric mean abundance, so the red feature is rarer than average in all samples, and the black feature goes from slightly more abundant than average to substantially more abundant than anything else in the sample. In this case, it should be clear that we would likely make the correct statistical inference regarding abundance changes for each feature regardless of the time increments compared.