Monday, 1 March 2021

Measuring effect size in ALDEx2


I'm in the throes of submitting a paper on effect sizes in ALDEx2, and so I thought I would take a stab at a nice concise blog post to summarize and simplify. 

Effect sizes are standardized mean differences between groups, and are designed to convey what the experimentalist wants to know: how different are my groups, given the treatment? In contrast, a very un-nuanced interpretation of a P-value is that it measures whether the observed difference is due to chance or not.

More formally, effect sizes measure the standardized mean difference between groups and are equivalent to a Z score: how many standard deviations separate the mean of group 1 from the mean of group 2? Not surprisingly, effect size measures are very sensitive to the underlying distribution of the data. In datasets where normality cannot be assumed, effect size measures are very difficult to generate and interpret.

High throughput sequencing data are often not Normally distributed, as shown in the 'Group Distribution' plot, which shows the density of the S:E:W:E feature in the ALDEx2 selex test dataset. The problem is how to find a sensible effect size for these two ugly data distributions.

Enter the ALDEx2 effect size measure, which gives an estimate of the median standardized difference between groups. Note the wording: we take the median of the standardized group difference. How does this work?

If we have two distributions contained in two vectors 

a = [a1, a2, ... an]
b = [b1, b2, ... bm]

then the difference between the two groups is simply a new vector, 

diff = a-b.

If the two vectors are different sizes, R will simply recycle the shorter vector. The distribution of the difference vector is shown in the 'Btw Grp Difference' density plot.

Taking the median of diff gives us a non-parametric measure of the between-group difference, and here it is about -2. Note that this is different from asking for the difference between the medians of the two groups, although for infinitely large datasets these measures would be the same.

How do we scale this? One way would be to estimate a summary statistic of dispersion, such as the pooled median absolute deviation, but this would not use all of the information in the distributions and would make assumptions. So what we do in ALDEx2, of course, is to generate a dispersion vector,

disp = max(a-Pa, b-Pb),

where P denotes a random permutation of vector a or b. The distribution of the dispersion is shown in the 'Win Grp Dispersion' density plot. Taking the median of disp gives a non-parametric measure of the dispersion, and here it is just over 2. If the dispersions of the two vectors are roughly equal, then the estimate is derived from both jointly. However, if they are not equal, then the dispersion estimate is skewed toward the more disperse vector.

So now we have what we need. 

eff = median(diff/disp),

and we have a nice clean summary effect size statistic that makes no assumptions about the distribution of the data. The distribution of effect sizes is shown in the 'Effect size' density plot, and its median value is about 0.9 in this example.
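Here is a minimal sketch of the calculation in plain R on two made-up vectors of equal length. It follows the formulas above rather than the package internals (the real function works per Dirichlet Monte-Carlo instance and handles unequal group sizes), and taking the absolute within-group differences is my simplification to keep the dispersion positive:

    set.seed(13)
    a <- rnorm(1000, mean = 1, sd = 2)^3   # deliberately ugly, non-normal data
    b <- rnorm(1000, mean = 3, sd = 1)

    # between-group difference vector
    diff <- a - b

    # within-group dispersion: each vector minus a permutation of itself,
    # keeping the larger of the two values elementwise
    disp <- pmax(abs(a - sample(a)), abs(b - sample(b)))

    median(diff)          # analogous to diff.btw
    median(disp)          # analogous to diff.win
    median(diff / disp)   # analogous to effect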

Modelling shows that eff is approximately 1.4 times that of Cohen's d, as expected for a well-behaved non-parametric estimator, and importantly it gives sensible answers when the underlying data are drawn from Normal, uniform, skewed, Cauchy, or multimodal distributions.

These measures are what you see when you look at the aldex.effect output:

diff.btw = median(diff)
diff.win = median(disp)
effect = eff

ALDEx2 will even give you the 95% confidence interval of the effect size estimate if you ask it; try aldex.effect(x, CI=TRUE).
 

Sunday, 21 February 2021

Speeding up aldex.effect()

 

I've been optimizing various functions in ALDEx2 on Bioconductor to make it more efficient. One bottleneck has been the aldex.effect() function, which calculates an effect size for the difference between distributions. I will write a separate post on how this calculation works, but I'm happy to say that the new effect size calculation is about 3X faster than the old one.

I was aided in this by the fabulous profvis package, which gives a graphical overview of where the time and space bottlenecks are in R code. I discovered several.
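If you want to see the profile yourself, a minimal sketch looks something like this; it assumes ALDEx2 and the selex data, set up exactly as in the code block at the end of this post:

    # install.packages("profvis")   # if needed
    library(profvis)
    library(ALDEx2)

    data(selex)
    conds <- c(rep("N", 7), rep("S", 7))
    x <- aldex.clr(selex, conds)

    # opens an interactive view of time and memory use, line by line
    profvis(x.e <- aldex.effect(x))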

First, when the original version was written for R v2.8, we had substantial memory issues that were solved by adding gc() calls to free up memory. In the current R v4.0, garbage collection occurs in the background, as seen in the profvis() output. Removing these legacy calls resulted in a large speedup.

Second, the calculations done in the background are mostly non-parametric, such as finding maximum or minimum values, or medians. Loops that found maxima were replaced with the pmax() function from base R, and apply() calls were replaced with the corresponding rowMedians() and colMedians() functions from the Rfast package on CRAN.
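As a rough illustration of the kind of replacement, on a made-up matrix rather than the actual ALDEx2 internals:

    # a made-up matrix roughly the size of the problem described below
    m <- matrix(rnorm(1600 * 14 * 128), nrow = 1600)

    # row medians via apply(): one median() call per row
    system.time(med.apply <- apply(m, 1, median))

    # the same values from the Rfast package, computed in compiled code
    system.time(med.rfast <- Rfast::rowMedians(m))
    all.equal(as.vector(med.apply), as.vector(med.rfast))

    # pmax() gives elementwise maxima of two vectors without a loop
    pmax(c(1, 5, 2), c(3, 4, 6))   # 3 5 6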

These simple changes result in quite remarkable speed increases. The function is now able to process 128 Dirichlet MC replicates from 14 samples and 1600 rows in about 1.6 seconds, down from about 4.7 seconds; about a 3X speedup. As a side effect, the memory footprint is cut in half from about 1500 MB to about 700 MB, making the entire function more efficient in time and space.

You can try it yourself by installing the development version of ALDEx2 from Bioconductor and running the following code.

    # install the development version of ALDEx2 from Bioconductor
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install(version = 'devel')
    BiocManager::install("ALDEx2")

    # time aldex.effect() on the selex test data (7 vs 7 samples)
    library(ALDEx2)
    data(selex)
    conds <- c(rep("N", 7), rep("S", 7))
    x <- aldex.clr(selex, conds)
    system.time(x.e <- aldex.effect(x))


Thursday, 9 February 2017

Random paper found in my pile: measuring distance in compositions

Posted by: Greg Gloor

I've been thinking a bit about distance metrics for compositional data, and in particular about the Jensen-Shannon distance (JSD) metric used in the famous Enterotypes paper. While, in my opinion and that of others, enterotypes are the result of measuring abundance rather than variance, the Jensen-Shannon distance has been of interest to me because it is based on the Kullback-Leibler divergence, which has log-ratios as a term. Others have noted that the JSD is not quite as consistent as the Aitchison distance for compositional data, but I decided to look for myself. There is a nice tutorial about how to implement JSD and identify enterotypes, which was posted to convince people of the validity of the JSD and the approach used to determine enterotypes.

I've written before about why high throughput sequencing data are compositions, and was thinking about how to test if JSD is a valid compositional distance metric when I came across this working paper from the Proceedings of the IAMG in 1998. The paper gives a very simple set of tests, and I thought that they could easily be applied to JSD and Bray-Curtis dissimilarity.

The approach is very simple. It depends on the distance metric giving the same answer when the data are permuted (rearranged), scaled, perturbed (rotated), or subset (one OTU left out). A useful distance metric should give the same answer regardless of these alterations. The test is simple. Given two pairs of compositions:

x1 = [0.1,0.2,0.7] and x2 = [0.2,0.1,0.7]
                 or
x3 = [0.3,0.4,0.3] and x4 = [0.4,0.3,0.3]

and the subsets of them, s1, s2, s3, s4 (the first two parts of the compositions), or the perturbations of them, p1, p2, p3, p4 (multiplication by a constant, which rotates the data), the distance should be the same.

The paper tested a broad range of distances, including the Aitchison distance, which is the Euclidean distance determined on the centred log-ratio transformed data. The conclusion of the paper was that the Aitchison distance was the only one that gave consistent answers. In the table below, d() is the distance for the pairs of compositions, x is the original composition, s is the subset and p is the perturbed data.

On to the test:

             d(x1,x2)  d(s1,s2)  d(p1,p2)  d(x3,x4)  d(s3,s4)  d(p3,p4)
Aitchison        0.98      0.98      0.98      0.41      0.41      0.41
JSD              0.13      0.13      0.15      0.08      0.08      0.06
Bray-Curtis      0.10      0.33      0.20      0.10      0.14      0.07

So what does this mean?

The Aitchison distance, which is used for compositional biplots or for clustering, gives a consistent answer whether the data are complete, subset or rotated: it is the same for d(x), d(s) and d(p). Yay!
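You can check the Aitchison row yourself in a few lines of R. This is a quick sketch with my own helper functions (not from any package), and the perturbation vector is one I made up:

    # centred log-ratio transform and the Euclidean distance on it
    clr <- function(x) log(x) - mean(log(x))
    aitchison <- function(x, y) sqrt(sum((clr(x) - clr(y))^2))

    x1 <- c(0.1, 0.2, 0.7); x2 <- c(0.2, 0.1, 0.7)

    # subset: keep the first two parts and re-close to sum to 1
    s1 <- x1[1:2] / sum(x1[1:2]); s2 <- x2[1:2] / sum(x2[1:2])

    # perturbation: multiply by a constant vector and re-close
    p  <- c(2, 1, 1)
    p1 <- x1 * p / sum(x1 * p); p2 <- x2 * p / sum(x2 * p)

    aitchison(x1, x2)   # ~0.98
    aitchison(s1, s2)   # ~0.98
    aitchison(p1, p2)   # ~0.98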

The JSD used in the enterotypes paper, and in others by the same group, does pretty well, but is not as consistent as the Aitchison distance. The JSD gives the same answer for the complete data and for the subset, but a different answer when the data are perturbed. So changes in the scale of the data in one or more dimensions, or rotations of the data around one or more parts, will give a different answer with this metric.

The Bray-Curtis dissimilarity fails miserably when the variables are altered in any way. This metric is not reproducible in a compositional context, and so will give potentially different answers when the data are subset (maybe let's remove those rare OTUs or leave them in - which plot looks better?), or rotated (maybe we had a different amplification efficiency in this tube than that one?). This surprised me a bit, since we had previously shown that the Bray-Curtis metric was fairly consistent when the samples were subset.

In the end, this confirms in my mind that the Aitchison distance is the one we should be using whenever possible for compositional data such as 16S rRNA gene sequencing data, transcriptome data, metagenomic data, etc., for distance-based methods like clustering. We have seen in the past that this is true empirically, e.g. in the cross-sectional study of microbiome vs. age in China.

Thursday, 7 April 2016

Sequencing is not counting!

Posted by: Greg Gloor

The idea of using compositional data analysis approaches (CoDa) to examine microbiome samples appears to be gaining traction. There have been a recent spate of papers (and not just from us) that are going in the right direction by acknowledging that we have compositional data.

However, to my mind, most papers do not go far enough, and still make strong assumptions about the data. I want to briefly outline why we must embrace the idea of CoDa in its entirety.

Fundamentally, sequencing is not counting. To see this, let's set up a very simple thought experiment. In ecology, it is common to count species within some area, and then to use these counts for the analysis.

In the counting example, we have 5 tigers and 20 ladybugs in some defined area which is a random sample of the environment. We can assume that the animals are free to move both within the box and between the box and the environment, and that the density of animals inside and outside the box is about the same. Therefore, if one more tiger should wander in from the outside, that does not alter the number of ladybugs, and vice versa. Moreover, if a space alien happens to land in the counting box, it will likely not alter the count of tigers or ladybugs (unless it lands on and squishes one!). We can see that the abundances of tigers and ladybugs can essentially be treated as uncorrelated. In addition, we can normalize for different box sizes rather simply by scaling by the size of the box. So if one student measures a box of size 1, and another a box of size 2, we can adjust for sampling effort fairly simply. This is the rationale behind normalizing for sequencing depth, and if we had counts it would work marvelously for sequencing data.


But, this is not the case with sequencing. Every DNA sequencing instrument has a fixed upper limit on the number of reads that can be delivered, and this is illustrated by having exactly 25 cells inside the box. Visualized in this way, we can see the difference in how the same counts of 5 tigers and 20 ladybugs must behave in the face of a perturbation. Either a tiger or a ladybug must be displaced if another tiger wanders onto the sequencing grid. Another way of saying this is that the tiger and ladybug numbers observed are correlated, since an increase in one necessarily involves a decrease in the other. If the space alien now enters the sequencing grid, we see that again we must displace a tiger or a ladybug. This causes all manner of problems for traditional approaches because of spurious correlation, negative correlation bias and sub-compositional correlation effects.
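A toy simulation makes the point; all of the numbers here are made up. Let the true abundances of the two taxa vary independently, then 'sequence' each sample by drawing a fixed number of reads:

    set.seed(1)
    tigers   <- rpois(1000, 50)    # independent true abundances
    ladybugs <- rpois(1000, 200)
    cor(tigers, ladybugs)          # near zero: counting in an open area

    reads <- 25                    # the instrument's fixed capacity
    obs <- t(sapply(seq_along(tigers), function(i)
        rmultinom(1, reads, prob = c(tigers[i], ladybugs[i]))))

    # with only two taxa and a fixed total, the observed counts are
    # perfectly negatively correlated
    cor(obs[, 1], obs[, 2])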

This problem generalizes to any number of species, and is not alleviated by increasing the number of cells in the grid. Attempts to normalize read counts, which amount to changing the number of squares in the grid, are doomed to fail because of this property.

So what can we do? We have to embrace the idea that the actual numbers we observe from a sequencing run are irrelevant, and that the only information available is relative information: that is, the ratio between the ladybugs and tigers is the only measurable property. We have begun the process of adapting the full measure of CoDa tools to microbiome studies, and our first paper is just out in Annals of Epidemiology: It's all relative: analyzing microbiome data as compositions, and I hope you give it a read. There is also a supplement that contains all the code used to generate the paper.

Although there are many issues to be resolved, mainly because sparse data can be problematic with a CoDa approach, I think it is worth checking out for your datasets. In a later post, I will outline the general approach we use to account for the sparsity problem.

Tuesday, 27 October 2015

Wading through the deluge of scientific literature

Posted by: Jean M Macklaim

The amount of science published every day is massive, making it extremely difficult to keep up with literature in your given field. I used to rely on keyword searches and RSS feeds or email updates on specific topics ("microbiome", "bacterial vaginosis", "probiotics", etc.), but this becomes a chore to keep up with, and missing a day or two leaves you with dozens of emails to catch up on. Not only that, but each weekly update on a topic will consist of 50-80 article titles that you have to read and sort to determine which are relevant enough to follow up on. I stumbled across this article in Nature (How to tame the flood of literature, by Elizabeth Gibney) outlining some of these issues and suggesting solutions.

So *is* there a better solution? Maybe not a catch-all tool, but here are some tools and software that have helped me out.

Automated recommendation tools


Google Scholar (online)
Google Scholar is my primary tool for searching and finding publications. Once you make a Scholar profile (linked to your Google account), you'll get an automated list of your own publications and citations (see mine here), including some metrics on your citation rate. Once you have a profile, Google will automatically suggest publications that are related to your work (they appear under "My Updates"). I find these updates are the most relevant to my research and what I am interested in.



Additionally, you can save articles you find via Google Scholar to "My Library". This keeps a list with links to the articles, along with citation metrics and information. However, this does not replace a citation manager as there is currently no way to export your list of articles, or automatically insert citations into a manuscript.

PubChase (online)
This is a new tool I am using, but I already see the potential. Like Google Scholar, it uses information about what you've published, but with the addition of articles you've saved to your Library as a resource to recommend articles relevant to you. Your Library can be populated by uploading a BibTeX (.bib) file, which nearly any reference manager will export for you.

PubChase has some other useful features I haven't yet explored. It automatically links your library entries to PDF files (where possible) so you can read papers directly from your account. They also provide tablet, mobile, and computer apps for OSX, iOS, and Android, allowing you to read and access your library from any device. You can also make your library public for other members to follow, or follow someone else's library.


Reference managers

A reference manager is essential for writing a manuscript and keeping track of citations. I don't have a favourite yet, but here are a few I've tried and liked:

Papers (Mac and iOS only; $)
An app you can install on your Mac and iPhone/iPad to access and view citations, and to make notes and mark up PDF articles. It has a nice interface, and easily links to most word processing software (citations can be added to Word, plain text apps, Google Docs, etc.). The one major downside is that this software has a cost, although there is a student discount rate.

Zotero (Mac, Windows, Linux; free)
Cross-platform, and free. Does the job, but the interface is not as clean as I'd like (seems a little 90s Windows-esque). Ability to insert citations and build reference lists for most document types.

Mendeley (Mac, Windows, Linux, iOS, Android; free)
Another nice looking application. It manages papers and citations. Has more functionality than I need though, and I find myself lost in menus sometimes. Seems to currently only support adding citations and building reference lists in Microsoft Word documents.


Other tools

Evernote (Mac, iOS, Windows, Linux, Android; free)
A multi-multi-purpose digital notebook. I use this for everything in my life, and it has become my digital lab book. You can add notes, PDFs, images, code, basically anything you want. You can compile notes into notebooks and tag topics. You can share with your workmates. It basically keeps my scientific life organized. Did I mention it's FREE? (I've found no need to pay for the premium subscription.)


Blogs and social networking sites

Another source I use to keep up-to-date. A number of researchers maintain blogs with current information about their research or research in the field. I recommend using an RSS reader to receive the posts as they become live. Some I recommend:

CommaFeed (online)

Feedly (online, and available apps)

Also, the current Safari and Chrome browsers have RSS functionality built in.

Other sources include Facebook, Twitter, ResearchGate, and Google+. I find it's only useful to follow these if you are already an avid user. It's not feasible for me to check with every social networking site every day, and so I don't rely on these to keep up.


In case you are interested, here are some of the blogs I follow:



Simply Statistics (Jeff Leek, Roger Peng, and Rafa Irizarry)

Bits of DNA (Lior Pachter)




Please comment below on any other software tools that make your research life easier!

Monday, 24 August 2015

Science is hard

Posted by: Jean M Macklaim


An interesting article by Christie Aschwanden on FiveThirtyEight - worth a read!