Monday 1 March 2021

Measuring effect size in ALDEx2


I'm in the throes of submitting a paper on effect sizes in ALDEx2, and so I thought I would take a stab at a nice concise blog post to summarize and simplify. 

Effect sizes are standardized mean differences between groups, and are designed to convey what the experimentalist wants to know: how different are my groups, given the treatment? In contrast, a very un-nuanced interpretation of a P-value is that it measures whether the observed difference is due to chance or not. 

More formally, effect sizes measure the standardized mean difference between groups and are equivalent to a Z score: how many standard deviations separate the mean of group 1 from the mean of group 2? Not surprisingly, effect size measures are very sensitive to the underlying distribution of the data. In datasets where it is not possible to assume normality, effect size measures are very difficult to generate and interpret.
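
For reference, here is a toy sketch of the parametric version, Cohen's d, on made-up Normal data (this is just illustration, not ALDEx2 code): the difference in group means is scaled by the pooled standard deviation.

    # toy data, not ALDEx2 code: Cohen's d for two Normal samples
    set.seed(1)
    g1 <- rnorm(50, mean = 1)
    g2 <- rnorm(50, mean = 0)
    # pooled standard deviation
    sp <- sqrt(((length(g1) - 1) * var(g1) + (length(g2) - 1) * var(g2)) /
               (length(g1) + length(g2) - 2))
    (mean(g1) - mean(g2)) / sp   # how many standard deviations separate the means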

High throughput sequencing data are often not Normally distributed, as shown in the 'Group Distribution' plot, which shows the density of the S:E:W:E feature in the ALDEx2 selex test dataset. The problem is how to find a sensible effect size for these two ugly data distributions. 

Enter the ALDEx2 effect size measure, which gives an estimate of the median standardized difference between groups. Note the wording: we take the median of the standardized group differences. How does this work?

If we have two distributions contained in two vectors 

a = [a1, a2, ... an]
b = [b1, b2, ... bm]

then the difference between the two groups is simply a new vector, 

diff = a-b.

If the two vectors are different sizes, R will simply recycle the shorter vector. The distribution of the difference vector is shown in the 'Btw Grp Difference' density plot.

Taking the median of diff gives us a non-parametric measure of the between-group difference, and here it is about -2. Note that this is different from asking for the difference between the medians of the two groups, although for infinitely large datasets the two measures would be the same. 
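
In R this is only a couple of lines. Here is a toy sketch on made-up data (a and b below are simulated stand-ins, not the selex feature, so the numbers will not match the plots):

    # toy stand-ins for the two group distributions of one feature
    set.seed(42)
    a <- rnorm(1000, mean = -1, sd = 2)
    b <- rnorm(1000, mean =  1, sd = 2)

    diff <- a - b            # element-wise; R recycles the shorter vector if needed
    median(diff)             # the non-parametric between-group difference
    median(a) - median(b)    # the difference of the medians, for comparison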

How do we scale this? One way would be to estimate a summary statistic of dispersion, such as the pooled median absolute deviation, but this would not use all of the information in the distributions and would impose assumptions. So what we do in ALDEx2, of course, is to generate a dispersion vector, 

disp = max(a-Pa, b-Pb),

where P denotes a random permutation of vector a or b, and the max is taken element-wise. The distribution of the dispersion is shown in the 'Win Grp Dispersion' density plot. Taking the median of disp gives a non-parametric measure of the dispersion, and here it is just over 2. If the dispersions of the two vectors are roughly equal, then the estimate is derived from both jointly. However, if they are not equal, then the dispersion estimate is skewed toward the more dispersed vector.
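
Continuing the toy sketch from above, one way to write this in R is with sample() for the permutation and pmax() for the element-wise maximum (again, simulated data, so the numbers are not those in the plots):

    # continuing the toy sketch: permute each group against itself
    set.seed(42)
    a <- rnorm(1000, mean = -1, sd = 2)
    b <- rnorm(1000, mean =  1, sd = 2)

    disp <- pmax(a - sample(a), b - sample(b))   # element-wise maximum
    median(disp)                                 # the non-parametric dispersion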

So now we have what we need. 

eff = median(diff/disp),

and we have a nice clean summary effect size statistic that makes no assumptions about the distribution of the data. The distribution of the standardized differences is shown in the 'Effect size' density plot, and its median, eff, is about 0.9 in this example.
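
Putting the pieces together as a single function makes the arithmetic explicit. Note that effect_sketch() below is a hypothetical helper written for this post, not the ALDEx2 code: the real aldex.effect() works on clr-transformed Dirichlet Monte Carlo instances of every feature and summarizes across them.

    # hypothetical helper, not ALDEx2 code: the arithmetic for one pair of vectors
    effect_sketch <- function(a, b) {
      diff <- a - b                                # between-group difference
      disp <- pmax(a - sample(a), b - sample(b))   # within-group dispersion
      c(diff.btw = median(diff),
        diff.win = median(disp),
        effect   = median(diff / disp))
    }

    set.seed(42)
    effect_sketch(rnorm(1000, -1, 2), rnorm(1000, 1, 2))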

Modelling shows that eff is approximately 1.4 times that of Cohen's d, as expected for a well-behaved non-parametric estimator, and importantly it gives sensible answers when the underlying data are derived from Normal, uniform, skewed, Cauchy, or multimodal distributions. 

These measures are what you see when you look at the aldex.effect output:

diff.btw = median(diff)
diff.win = median(disp)
effect = eff

ALDEx2 will even give you the 95% confidence interval of the effect size estimate if you ask it; try aldex.effect(x, CI=TRUE).
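
For example, on the selex test data (the same setup as the code at the end of the post below) you can pull out exactly these columns; this assumes ALDEx2 is already installed:

    # assumes ALDEx2 is installed; selex ships with the package
    library(ALDEx2)
    data(selex)
    conds <- c(rep("N",7), rep("S",7))
    x <- aldex.clr(selex, conds)
    x.e <- aldex.effect(x, CI=TRUE)    # CI=TRUE adds the 95% interval
    head(x.e[, c("diff.btw", "diff.win", "effect")])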
 

Sunday 21 February 2021

Speeding up aldex.effect()

 

I've been optimizing various functions in ALDEx2 on Bioconductor to make it more efficient. One bottleneck has been the aldex.effect() function, which calculates an effect size for the difference between distributions. I will write a separate post on how this calculation works, but I'm happy to say that the new effect size calculation is about 3X faster than the old one. 

I was aided in this by the fabulous profvis package, which gives a graphical overview of where the time and space bottlenecks are in R code. It turned up several of them. 
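
If you haven't used profvis, the pattern is simply to wrap the call of interest; something like the sketch below, reusing the selex setup from the code at the end of this post:

    # sketch: profile aldex.effect() with profvis (profvis is on CRAN)
    library(profvis)
    library(ALDEx2)
    data(selex)
    conds <- c(rep("N",7), rep("S",7))
    x <- aldex.clr(selex, conds)
    profvis({
      x.e <- aldex.effect(x)    # opens an interactive time and memory profile
    })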

First, when the original version was written for R v2.8 we had substantial memory issues, which were solved by adding gc() calls to free up memory. In the current R v4.0, garbage collection occurs in the background, as seen in the profvis() output. Removing these legacy calls resulted in a large speedup. 

Second, the calculations done in the background are usually non-parametric, such as finding maximum or minimum values, or medians. The apply() calls that did this work were replaced with the pmax() function from base R and with the corresponding rowMedians() and colMedians() functions from the Rfast package on CRAN.
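
As a toy illustration of the kind of replacement (a made-up matrix, not the ALDEx2 internals):

    # toy matrix, not ALDEx2 internals
    library(Rfast)
    m <- matrix(rnorm(1600 * 128), nrow = 1600)
    n <- matrix(rnorm(1600 * 128), nrow = 1600)

    med.apply <- apply(m, 1, median)     # row medians via apply(): slow
    med.rfast <- Rfast::rowMedians(m)    # the same medians from Rfast: fast
    all.equal(as.numeric(med.apply), as.numeric(med.rfast))

    mx <- pmax(m, n)                     # element-wise maxima without apply()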

These simple changes result in quite remarkable speed increases. The function is now able to process 128 Dirichlet MC replicates from 14 samples and 1600 rows in about 1.6 seconds, down from about 4.7 seconds; about a 3X speedup. As a side effect, the memory footprint is cut in half from about 1500 MB to about 700 MB, making the entire function more efficient in time and space.

You can try it yourself by installing the development version of ALDEx2 from Bioconductor and running the following code.

    # install the development (devel) branch of Bioconductor, then ALDEx2
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install(version='devel')
    BiocManager::install("ALDEx2")

    # clr-transform the selex test data and time the effect size calculation
    library(ALDEx2)
    data(selex)
    conds <- c(rep("N",7),rep("S",7))
    x <- aldex.clr(selex, conds)
    system.time(x.e <- aldex.effect(x))