Thursday 23 April 2015

Weighted UniFrac in all its forms

Posted by: Ruth Grace Wong

It turns out, weighted UniFrac is slightly different depending on what pipeline you’re using. Here’s the lowdown:

Original weighted UniFrac

The formula used in the original UniFrac paper (Lozupone et al. 2005) is thus:
Given a rooted tree for which reads are associated with the leaves on the tree. For each edge of the tree, you can determine the proportional abundance of the reads associated with all its descendant leaves. The weighted UniFrac measure is calculated by multiplying each branch length of the tree with the difference in proportional abundance of its descendants between the two samples, and taking the sum.


The supplementary material includes a proof that weighted UniFrac is a distance metric, by showing that, given samples a, b, and c, the distance between a and b must always be less than or equal to the sum of the distance between a to c and c to b. This satisfies the triangle inequality requirement. The same tree is used for all three distances calculated in the proof, and the distances can range from zero to the total branch lengths * the height of the phylogenetic tree (which is greater than one). Also, as stated in the UniFrac Revealed blog post, this form of weighted UniFrac produces the same distance matrix no matter how the phylogenetic tree is rooted.

Normalized weighted UniFrac

A later publication by the same group (Lozupone et al. 2007) describes a normalized weighted UniFrac, which outputs values between 0 and 1:

The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:



If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows: This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows:


Some may prefer the normalization so that the range of possible distance values is the same between experiments. However, this disrupts the triangle inequality property. Take the following example data set with three samples and 12 taxa:

          0   1     2   5    4    8    7   13   10   11   18   15
sample1 1141 330 12382 477  483  504 1809 1373   96   78 4514   63
sample2  697 260  6340 544 1700 2869  469  424   28   62   86 3066
sample3 1696 332  1078 290 2795 6125  434  121 3056 2788   63 5772

The corresponding tree, in Newick format, is

(((7:0.48127,(2:0.01401,13:0.02789)0.840:0.07613)0.626:0.04214,(8:0.21777,5:0.03645)0.881:0.37634)0.747:0.16584,(((((0:0.1161,4:0.20244)0.876:0.07231,1:0.05902)0.726:0.25695,15:0.47267)0.336:0.04988,18:0.01402)0.871:0.23856,(10:0.05189,11:0.03639)0.925:0.20682)0.553:0.00433);

If UniFrac is calculated with the normalization between all three samples, the distance between sample 1 and 3 ( 0.5884264) is 0.0202271 larger than the sum of the distances between samples 1& 2 ( 0.3404242) and 2 & 3 ( 0.2277751). If UniFrac is calculated without the extra division, the distances are: 1 to 2 = 0.09914178, 2 to 3 = 0.0835136, and 1 to 3 = 0.1814773, and the triangle inequality is satisfied, as expected.

Implementation in QIIME (FastUniFrac)

True to form, the Qiime implementation is identical to the original formula

return (branch_lengths * abs((i/float(i_sum))-(j/float(j_sum)))).sum()

If you look at a distance matrix output by Qiime, you will probably find at least one value greater than one. Qiime functions can also be run with normalized weighted UniFrac by using the dist_weighted_normalized_unifrac parameter rather than dist_weighted_normalized_unifrac (see the Qiime documentation for details).

Implementation in Mothur

The mothur weighted unifrac code implements normalized weighted UniFrac, as noted in the Discussion section of the Mothur weighted UniFrac algorithm page.

Implementation in GUniFrac

The GUniFrac paper (Chen et al. 2012) erroneously cites the normalized weighted formula as follows:

The (normalized) weighted UniFrac distance (Lozupone et al., 2007) weights the branch length with abundance difference and is defined as


Instead of using the average distance of each sequence to the root in the denominator, they’ve used the branch length.

They say in their paper that normalized UniFrac overweights long branch lengths and isn’t very sensitive to differential proportional abundance on short branch lengths. To fix this, they propose their own normalized GUniFrac formula:

To attenuate the weight on branches with large proportions, we may instead use the relative difference |piA − piB|/(piA + piB) (∈ [0,1]) in the formulation. We denote this distance measure as


This formulation does not fit the triangle inequality either. When run with the three example microbial communities used above, it yields the same distances as the normalized weighted UniFrac measurement.

Closing  thoughts

To summarize, there are three different types of weighted UniFrac out there (original, normalized, and GUniFrac). The original and normalized version are the most common, and are usually analyzed with Euclidean statistical methods, such as PCoA plots, even though the normalized version is not a Euclidean metric.

2 comments:

  1. Nice post Ruth. Makes it clear why we don't get the same answer with different tools for weighted unifrac.

    ReplyDelete
    Replies
    1. The formulation of normalized weighted UniFrac from Chen et al, 2012 and Lozupone et al., 2007 are mathematically equivalent. The denominator in Chen et al, 2012 has a nicer interpretation as the sum of the weights, while the numerator is a weighted average of branch-specific distances.

      Delete