Thursday 9 February 2017

Random paper found in my pile: measuring distance in compositions

Posted by: Greg Gloor

I've been thinking a bit about distance metrics for compositional data, and in particular about the Jensen-Shannon distance (JSD) metric used in the famous Enterotypes paper. While, in my opinion and that of others, enterotypes are the result of measuring abundance, not variance, the Jensen-Shannon distance has been of interest to me because it is built on the Kullback-Leibler divergence, which contains a log-ratio term. Others have noted that the JSD is not quite as consistent as the Aitchison distance for compositional data, but I decided to look for myself. There is a nice tutorial on how to implement the JSD and identify enterotypes, which was posted to convince people of the validity of the JSD and of the approach used to determine enterotypes.
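
For reference, the Jensen-Shannon divergence between two probability vectors P and Q is usually written in terms of the Kullback-Leibler divergence of each against their average M, with the Jensen-Shannon distance taken as the square root of that divergence; the log(p_i/m_i) term is the log-ratio in question:

    \mathrm{JSD}(P, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q)

    \mathrm{KL}(P \,\|\, M) = \sum_i p_i \log\!\left(\frac{p_i}{m_i}\right)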

I've written before about why high-throughput sequencing data are compositions, and was thinking about how to test whether the JSD is a valid compositional distance metric when I came across this working paper from the Proceedings of the IAMG in 1998. The paper gives a very simple set of tests, and I thought they could easily be applied to the JSD and to the Bray-Curtis dissimilarity.

The approach is very simple: it depends on the distance metric giving the same answer when the data are permuted (rearranged), scaled, perturbed (rotated), or subset (one OTU left out). A useful distance metric should be unaffected by any of these alterations. The test is simple. Given two pairs of compositions:

x1 = [0.1,0.2,0.7] and x2 = [0.2,0.1,0.7]
                 or
x3 = [0.3,0.4,0.3] and x4 = [0.4,0.3,0.3]

and their subsets s1, s2, s3, s4 (the first two parts of each composition), or perturbed versions p1, p2, p3, p4 (element-wise multiplication by a constant vector, which rotates the data), the distance should be the same.
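
To make the setup concrete, here is a minimal sketch in Python of how these vectors could be built. The closure helper is just a convenience, and the perturbing vector is an arbitrary choice for illustration; a different constant vector will give different perturbed distances, so the d(p) columns in the table below will not necessarily be reproduced by this particular sketch.

    import numpy as np

    def closure(x):
        """Re-close a vector so that its parts sum to 1."""
        x = np.asarray(x, dtype=float)
        return x / x.sum()

    x1, x2 = np.array([0.1, 0.2, 0.7]), np.array([0.2, 0.1, 0.7])
    x3, x4 = np.array([0.3, 0.4, 0.3]), np.array([0.4, 0.3, 0.3])

    # subsets: keep only the first two parts of each composition
    # (re-closing these to sum to 1 would not change the Aitchison distance)
    s1, s2, s3, s4 = (x[:2] for x in (x1, x2, x3, x4))

    # perturbation: element-wise multiplication by a constant vector, then re-closure
    const = np.array([1.0, 5.0, 2.0])   # an arbitrary perturbing vector
    p1, p2, p3, p4 = (closure(x * const) for x in (x1, x2, x3, x4))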

The paper tested a broad range of distances, including the Aitchison distance, which is the Euclidean distance calculated on the centred log-ratio (clr) transformed data. The conclusion of the paper was that the Aitchison distance was the only one that gave consistent answers. In the table below, d() is the distance between a pair of compositions, x is the original composition, s is the subset, and p is the perturbed data.
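
In code, the Aitchison distance is just the Euclidean distance after a clr transform. A minimal sketch (the function names are mine):

    import numpy as np

    def clr(x):
        """Centred log-ratio transform: log(x) minus the mean of log(x)."""
        logx = np.log(np.asarray(x, dtype=float))
        return logx - logx.mean()

    def aitchison(x, y):
        """Aitchison distance: Euclidean distance between clr-transformed vectors."""
        return np.linalg.norm(clr(x) - clr(y))

    # the first pair above comes out at about 0.98, as in the table below
    print(round(aitchison([0.1, 0.2, 0.7], [0.2, 0.1, 0.7]), 2))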

On to the test:

             d(x1,x2)  d(s1,s2)  d(p1,p2)  d(x3,x4)  d(s3,s4)  d(p3,p4)
Aitchison        0.98      0.98      0.98      0.41      0.41      0.41
JSD              0.13      0.13      0.15      0.08      0.08      0.06
Bray-Curtis      0.10      0.33      0.20      0.10      0.14      0.07
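
For completeness, here is roughly how the other two measures can be computed (function names are mine). Taking natural logs and the square root of the Jensen-Shannon divergence reproduces the 0.13 and 0.10 in the first column, so this appears to match the calculation above:

    import numpy as np

    def kl(p, q):
        """Kullback-Leibler divergence: sum_i p_i * log(p_i / q_i)."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return np.sum(p * np.log(p / q))

    def jsd(p, q):
        """Jensen-Shannon distance: square root of the JS divergence."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        m = (p + q) / 2
        return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

    def bray_curtis(u, v):
        """Bray-Curtis dissimilarity: sum |u_i - v_i| / sum (u_i + v_i)."""
        u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        return np.sum(np.abs(u - v)) / np.sum(u + v)

    # the first pair: JSD is about 0.13 and Bray-Curtis about 0.10, as in the table
    x1, x2 = np.array([0.1, 0.2, 0.7]), np.array([0.2, 0.1, 0.7])
    print(round(jsd(x1, x2), 2), round(bray_curtis(x1, x2), 2))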

So what does this mean?

The Aitchison distance, which is used for compositional biplots or for clustering, gives a consistent answer whether the data are complete, subset, or rotated: it is the same for d(x), d(s), and d(p). Yay!

The JSD used in the enterotypes paper, and in others by the same group, does pretty well, but is not as consistent as the Aitchison distance. The JSD gives the same answer for the complete data and for the subset, but a different answer when the data are perturbed. So changes that alter the scale of the data in one or more dimensions, or that rotate the data around one or more parts, will give a different answer with this metric.

The Bray-Curtis dissimilarity fails miserably when the variables are altered in any way. This metric is not reproducible in a compositional context, and so can give different answers when the data are subset (maybe let's remove those rare OTUs, or leave them in - which plot looks better?) or rotated (maybe we had a different amplification efficiency in this tube than in that one?). This surprised me a bit, since we had previously shown that the Bray-Curtis metric was fairly consistent when the samples were subset.

In the end, this confirms in my mind that the Aitchison distance is the one we should be using whenever possible for distance-based methods like clustering on compositional data such as 16S rRNA gene sequencing data, transcriptome data, metagenomic data, etc. We have seen in the past that this is true empirically, e.g. in the cross-sectional study of microbiome vs. age in China.