Gloor lab musings: Weighted UniFrac in all its forms

Posted by: Ruth Grace Wong

It turns out, weighted UniFrac is slightly different depending on what pipeline you’re using. Here’s the lowdown:

Original weighted UniFrac

The formula used in the original UniFrac paper (Lozupone et al. 2005) is thus:

Given a rooted tree for which reads are associated with the leaves on the tree. For each edge of the tree, you can determine the proportional abundance of the reads associated with all its descendant leaves. The weighted UniFrac measure is calculated by multiplying each branch length of the tree with the difference in proportional abundance of its descendants between the two samples, and taking the sum.

The supplementary material includes a proof that weighted UniFrac is a distance metric, by showing that, given samples a, b, and c, the distance between a and b must always be less than or equal to the sum of the distance between a to c and c to b. This satisfies the triangle inequality requirement. The same tree is used for all three distances calculated in the proof, and the distances can range from zero to the total branch lengths * the height of the phylogenetic tree (which is greater than one). Also, as stated in the UniFrac Revealed blog post, this form of weighted UniFrac produces the same distance matrix no matter how the phylogenetic tree is rooted.

Normalized weighted UniFrac

A later publication by the same group (Lozupone et al. 2007) describes a normalized weighted UniFrac, which outputs values between 0 and 1:

The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:

If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows: This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows:

Some may prefer the normalization so that the range of possible distance values is the same between experiments. However, this disrupts the triangle inequality property. Take the following example data set with three samples and 12 taxa:

0 1 2 5 4 8 7 13 10 11 18 15

sample1 1141 330 12382 477 483 504 1809 1373 96 78 4514 63

sample2 697 260 6340 544 1700 2869 469 424 28 62 86 3066

sample3 1696 332 1078 290 2795 6125 434 121 3056 2788 63 5772

The corresponding tree, in Newick format, is

(((7:0.48127,(2:0.01401,13:0.02789)0.840:0.07613)0.626:0.04214,(8:0.21777,5:0.03645)0.881:0.37634)0.747:0.16584,(((((0:0.1161,4:0.20244)0.876:0.07231,1:0.05902)0.726:0.25695,15:0.47267)0.336:0.04988,18:0.01402)0.871:0.23856,(10:0.05189,11:0.03639)0.925:0.20682)0.553:0.00433);

If UniFrac is calculated with the normalization between all three samples, the distance between sample 1 and 3 ( 0.5884264) is 0.0202271 larger than the sum of the distances between samples 1& 2 ( 0.3404242) and 2 & 3 ( 0.2277751). If UniFrac is calculated without the extra division, the distances are: 1 to 2 = 0.09914178, 2 to 3 = 0.0835136, and 1 to 3 = 0.1814773, and the triangle inequality is satisfied, as expected.

Implementation in QIIME (FastUniFrac)

True to form, the Qiime implementation is identical to the original formula

return (branch_lengths * abs((i/float(i_sum))-(j/float(j_sum)))).sum()

If you look at a distance matrix output by Qiime, you will probably find at least one value greater than one. Qiime functions can also be run with normalized weighted UniFrac by using the dist_weighted_normalized_unifrac parameter rather than dist_weighted_normalized_unifrac (see the Qiime documentation for details).

Implementation in Mothur

The mothur weighted unifrac code implements normalized weighted UniFrac, as noted in the Discussion section of the Mothur weighted UniFrac algorithm page.

Implementation in GUniFrac

The GUniFrac paper (Chen et al. 2012) erroneously cites the normalized weighted formula as follows:

The (normalized) weighted UniFrac distance (Lozupone et al., 2007) weights the branch length with abundance difference and is defined as

Instead of using the average distance of each sequence to the root in the denominator, they’ve used the branch length.

They say in their paper that normalized UniFrac overweights long branch lengths and isn’t very sensitive to differential proportional abundance on short branch lengths. To fix this, they propose their own normalized GUniFrac formula:

To attenuate the weight on branches with large proportions, we may instead use the relative difference |piA − piB|/(piA + piB) (∈ [0,1]) in the formulation. We denote this distance measure as

This formulation does not fit the triangle inequality either. When run with the three example microbial communities used above, it yields the same distances as the normalized weighted UniFrac measurement.

Closing thoughts

To summarize, there are three different types of weighted UniFrac out there (original, normalized, and GUniFrac). The original and normalized version are the most common, and are usually analyzed with Euclidean statistical methods, such as PCoA plots, even though the normalized version is not a Euclidean metric.

Gloor lab musings

Thursday, 23 April 2015

Weighted UniFrac in all its forms