Tuesday 27 October 2015

Wading through the deluge of scientific literature

Posted by: Jean M Macklaim

The amount of science published every day is massive, making it extremely difficult to keep up with literature in your given field. I used to rely on keyword searches and RSS feeds or email updates on specific topics ("microbiome", "bacterial vaginosis", "probiotics", etc.) but this becomes a chore to keep up, and missing a day or two leave you with dozens of emails to catch up with. Not only that, but each weekly update on a topic will consist of 50-80 article titles that you have to read and sort and determine which are relevant enough to follow up on. I stumbled across this article in Nature (How to tame the flood of literature, Elizabeth Gibney) outlining some of these issues and suggested solutions.

So *is* there a better solution? Maybe not a catch-all tool, but here are some tools and software that have helped me out.

Automated recommendation tools


Google Scholar (online)
Google scholar is my primary tool for searching and finding publications. Once you make a scholar profile (linked to your Google account),  you'll get an automated list of your own publications and citations (see mine here), including some metrics on your citation rate. Once you have a profile, Google will automatically suggest publications that are related to your work (they appear under "My Updates"). I find these updates are the most relevant to my research and what I am interested in.



Additionally, you can save articles you find via Google Scholar to "My Library". This keeps a list with links to the articles, along with citation metrics and information. However, this does not replace a citation manager as there is currently no way to export your list of articles, or automatically insert citations into a manuscript.

PubChase (online)
This is a new tool I am using, but I already see the potential. Like Google Scholar, it uses information about what you've published, but with the addition of articles you've saved to your Library as a resource to recommend articles relevant to you. Your Library can be populated by uploading a BibTeX (.bib) file, which nearly any reference manager will export for you.

Pubchase has some other useful features I haven't yet explored. It automatically links your library entries to PDF files (where possible) so you can read papers directly from your account. They also provide tablet, mobile, and computer apps for OSX, iOS, and Android, allowing you to read and access your library from any device. You can also make your library public for other members to follow, or follow someone else's library.


Reference managers

A reference manager is essential for writing a manuscript and keeping track of citations. I don't have a favourite yet, but here are a few I've tried and liked:

Papers (Mac and iOS only; $)
An app you can install on your Mac and iPhone/iPad to access and view citations, and make notes and mark up PDF articles. It had a nice interface, and easily links to most word processing software (citations can be added to Word, plain text apps, Google Docs, etc.). The one major downside is this software has a cost, although there is a student discount rate.

Zotero (Mac, Windows, Linux; free)
Cross-platform, and free. Does the job, but the interface is not as clean as I'd like (seems a little 90s Windows-esque). Ability to insert citations and build reference lists for most document types.

Mendeley (Mac, Windows, Linux, iOS, Android; free)
Another nice looking application. It manages papers and citations. Has more functionality than I need though, and I find myself lost in menus sometimes. Seems to currently only support adding citations and building reference lists in Microsoft Word documents.


Other tools

Evernote (Mac, iOS, Windows, Linux, Android; free)
A multi-multi-purpose digital notebook. I use this for everything in my life, and it has become my digital lab book. You can add notes, PDFs, images, code, basically anything you want. You can compile notes into notebooks and tag topics. You can share with your workmates. It basically keeps my scientific life organized. Did I mention its FREE (I've found no need to pay for the premium subscriptions).


Blogs and social networking sites

Another source I use to keep up-to-date. A number of researchers maintain blogs with current information about their research or research in the field. I recommend using an RSS reader to receive the posts as they become live. Some I recommend:

CommaFeed (online)

Feedly (online, and available apps)

Also, the current Safari and Chrome browsers have RSS functionality built in.

Other sources include Facebook, Twitter, ResearchGate, and Google+. I find it's only useful to follow these if you are already an avid user. It's not feasible for me to check with every social networking site every day, and so I don't rely on these to keep up.


In case you are interested, here are some of the blogs I follow:



Simply Statistics (Jeff Leek, Roger Peng, and Rafa Irizarry)

Bits of DNA (Lior Pachter)




Please comment below on any other software tools that make your research life easier!

Monday 24 August 2015

Science is hard

Posted by: Jean M Macklaim


An interesting article on Christie Aschwanden on FiveThirtyEight - worth a read!

Wednesday 20 May 2015

From the blogosphere - batch effects in sequence data

Posted by: Jean M Macklaim

We've performed a number of sequencing runs lately in replicates (replicated in PCR amplification, library preparation, sequencing run, and sequencing platform) to explore the technical variation, including batch effects. We know technical variation can change your results in big ways.

A good example of batch effect has been brought to attention recently:

http://simplystatistics.org/2015/05/20/is-it-species-or-is-it-batch-they-are-confounded-so-we-cant-know/

http://www.the-scientist.com/?articles.view/articleNo/43003/title/Batch-Effect-Behind-Species-Specific-Results-/

In summary, two major papers (1 and 2) nearly a decade apart report the same conclusion: human tissues are more similar in gene expression to any other human tissue than to the equivalent tissue in mouse.

Whoa, what?

So human lung tissue is more similar to human eye tissue than to mouse lung?

As it turns out, this likely isn't true. The observation is due to the confounding batch effect. In other words, what the authors thought was explained by differences in species was actually confounded by the differences in platforms/analyses (in the first case e.g. a different microarray with a different set of probes were used for mouse vs human). The most recent 2014 study has been refuted for confounding batch effect too in at least one publication.

So how does one deal with confounding variables?

The most robust way is to sufficiently randomize your samples so technical parameters don't line up with the biological parameters you are testing, and ensure you maintain a consistent protocol throughout.

This, of course, does not guarantee you won't confound your data, but it should help prevent a systemic bias in sample preparation. If you process all your controls on Tuesdays and your affected samples on Fridays, you won't be able to determine if the effect is due to the condition or the day of the week. As another example, it's important to randomize sample loadings to ensure you don't line up all of condition1 in row 1 of your 96-well plate, and condition2 in row 2 etc. This is where is is essential to collect sufficient metadata about your samples in order to explicitly examine confounding variables.

And of course, ALWAYS look at your data

Tuesday 12 May 2015

The count you get is just a guideline - part 2

Posted by: Greg Gloor

In previous posts I have shown the effect of sequencing on the underlying shape of the data, and that the data are better thought of as likelihoods of observing a particular count for a particular gene or OTU. Now I'm going to show you the effect of combining those two ideas into an approach to examine datasets of this type that minimizes false positives while maintaining power.

The two-step method is implemented in the ALDEx2 package on Bioconductor. Step one is to model the technical variance by sampling instances from the Dirichlet distribution. This distribution is essentially a multivariate Poisson distribution, and is a good estimator for the underlying technical variation that we have seen in previous blog posts. Step two is to use the centered log-ratio transformation to re-capitulate the underlying shape of the data. These two steps are implemented in the aldex.clr function.

We can now use these transformed instances for standard statistical tests. Below shows an example for a dataset where we are interested in knowing which variants differ in abundance between two conditions. What we want is to find out which variants will give us the same answer regardless of the underlying technical variation. The way we do this is to calculate P values, and Benjamin-Hochberg adjusted P values for each Dirichlet instance, and then to tabulate the mean values. The mean P value for random data should converge on 0.5. The mean P value (and BH adjusted values) for real differences should be relatively invariant to technical replication. That is, we should make the same conclusions if we sequence today or tomorrow.

The figure shows four plots of what we get from a different numbers of Dirichlet instances (number on the left), coupled with a histogram of the P values and the number of BH adjusted values that pass a threshold of <0.05 (number on the right). P values <0.05 are in orange, BH <0.05 are in red. As you can see we start off with 138 'significant' variants with a single instance, this is also about what we get if we just use our raw data. After only 5 Dirichlet instances we have reduced this to 80 'significant' variants, and by 100 or so, we stabilize on a value of 75 or so.


Below is an animated gif that walks through 2000 instances. It is pretty clear that inferred technical variation has a large effect on what is called 'significant', and that we can remove at least some of this stochastic noise quite simply. Also it is clear that the number of significant variants stabilizes and is not affected by further random sampling.

(Click the image to restart the gif animation)

Thursday 23 April 2015

Weighted UniFrac in all its forms

Posted by: Ruth Grace Wong

It turns out, weighted UniFrac is slightly different depending on what pipeline you’re using. Here’s the lowdown:

Original weighted UniFrac

The formula used in the original UniFrac paper (Lozupone et al. 2005) is thus:
Given a rooted tree for which reads are associated with the leaves on the tree. For each edge of the tree, you can determine the proportional abundance of the reads associated with all its descendant leaves. The weighted UniFrac measure is calculated by multiplying each branch length of the tree with the difference in proportional abundance of its descendants between the two samples, and taking the sum.


The supplementary material includes a proof that weighted UniFrac is a distance metric, by showing that, given samples a, b, and c, the distance between a and b must always be less than or equal to the sum of the distance between a to c and c to b. This satisfies the triangle inequality requirement. The same tree is used for all three distances calculated in the proof, and the distances can range from zero to the total branch lengths * the height of the phylogenetic tree (which is greater than one). Also, as stated in the UniFrac Revealed blog post, this form of weighted UniFrac produces the same distance matrix no matter how the phylogenetic tree is rooted.

Normalized weighted UniFrac

A later publication by the same group (Lozupone et al. 2007) describes a normalized weighted UniFrac, which outputs values between 0 and 1:

The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:



If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows: This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows:


Some may prefer the normalization so that the range of possible distance values is the same between experiments. However, this disrupts the triangle inequality property. Take the following example data set with three samples and 12 taxa:

          0   1     2   5    4    8    7   13   10   11   18   15
sample1 1141 330 12382 477  483  504 1809 1373   96   78 4514   63
sample2  697 260  6340 544 1700 2869  469  424   28   62   86 3066
sample3 1696 332  1078 290 2795 6125  434  121 3056 2788   63 5772

The corresponding tree, in Newick format, is

(((7:0.48127,(2:0.01401,13:0.02789)0.840:0.07613)0.626:0.04214,(8:0.21777,5:0.03645)0.881:0.37634)0.747:0.16584,(((((0:0.1161,4:0.20244)0.876:0.07231,1:0.05902)0.726:0.25695,15:0.47267)0.336:0.04988,18:0.01402)0.871:0.23856,(10:0.05189,11:0.03639)0.925:0.20682)0.553:0.00433);

If UniFrac is calculated with the normalization between all three samples, the distance between sample 1 and 3 ( 0.5884264) is 0.0202271 larger than the sum of the distances between samples 1& 2 ( 0.3404242) and 2 & 3 ( 0.2277751). If UniFrac is calculated without the extra division, the distances are: 1 to 2 = 0.09914178, 2 to 3 = 0.0835136, and 1 to 3 = 0.1814773, and the triangle inequality is satisfied, as expected.

Implementation in QIIME (FastUniFrac)

True to form, the Qiime implementation is identical to the original formula

return (branch_lengths * abs((i/float(i_sum))-(j/float(j_sum)))).sum()

If you look at a distance matrix output by Qiime, you will probably find at least one value greater than one. Qiime functions can also be run with normalized weighted UniFrac by using the dist_weighted_normalized_unifrac parameter rather than dist_weighted_normalized_unifrac (see the Qiime documentation for details).

Implementation in Mothur

The mothur weighted unifrac code implements normalized weighted UniFrac, as noted in the Discussion section of the Mothur weighted UniFrac algorithm page.

Implementation in GUniFrac

The GUniFrac paper (Chen et al. 2012) erroneously cites the normalized weighted formula as follows:

The (normalized) weighted UniFrac distance (Lozupone et al., 2007) weights the branch length with abundance difference and is defined as


Instead of using the average distance of each sequence to the root in the denominator, they’ve used the branch length.

They say in their paper that normalized UniFrac overweights long branch lengths and isn’t very sensitive to differential proportional abundance on short branch lengths. To fix this, they propose their own normalized GUniFrac formula:

To attenuate the weight on branches with large proportions, we may instead use the relative difference |piA − piB|/(piA + piB) (∈ [0,1]) in the formulation. We denote this distance measure as


This formulation does not fit the triangle inequality either. When run with the three example microbial communities used above, it yields the same distances as the normalized weighted UniFrac measurement.

Closing  thoughts

To summarize, there are three different types of weighted UniFrac out there (original, normalized, and GUniFrac). The original and normalized version are the most common, and are usually analyzed with Euclidean statistical methods, such as PCoA plots, even though the normalized version is not a Euclidean metric.

Thursday 16 April 2015

Metabolic markers of the vaginal microbiome

Posted by: Jean M Macklaim

Most of my PhD work was on the vaginal microbiome. At the end of my PhD another graduate student, Amy McMillan, started to uncover the metabolic markers of the vaginal microbiome and conditions like bacterial vaginosis (BV). Her paper has been under review at PNAS, and a draft was recently made available on the arXiv: A multi-platform metabolomics approach identifies novel biomarkers associated with bacterial diversity in the human vagina (McMillan et al, 2015)

Unfortunately, someone beat us to the punch. A paper was published this week in mBio by Srinivasan et al.: Metabolic Signatures of Bacterial Vaginosis (Srinivasan et al., 2015. mBio)

Although with similar aims, our paper not only differentiates BV from health with metabolic markers, but we explore the effect of pregnancy in women from Rwanda.

In a positive spin (if there is any light side to getting scooped), Srinivasan et al. did observe a number of metabolic markers we had previously predicted using metatranscriptomics: Comparative meta-RNA-seq of the vaginal microbiota and differential expression by Lactobacillus iners in health and dysbiosis (Macklaim et al., 2013. Microbiome Journal). Notably, we predicted elevated glycerol and glycerol-3-P utilizing enzymes under BV conditions, and Srinivasan et al. reported both these molecules depleted in BV.

In some ways it's very reassuring to see your previous work validated in another way/by another group. But, unfortunately, our metatranscriptome paper was not cited for this finding.


Figure 3 from Macklaim et al., 2013. Microbiome Journal, Comparative meta-RNA-seq of the vaginal microbiota and differential expression by Lactobacillus iners in health and dysbiosis


Wednesday 15 April 2015

The count we see for an OTU or ORF is just a guideline

Posted by: Greg Gloor

A problem when normalizing by the geometric mean is that the geometric mean is not defined if any of the values in a sample are 0 (see ref 1). However, we must keep in mind that just because we see a value of 0 for a given operational taxonomic unit (OTU) in this experiment, it does not necessarily mean that the true value of the OTU is actually 0.

Values of 0 for an OTU can arise because the OTU sequence could not occur in the experiment, or because the OTU could exist in one group but not the other, or because the OTU was very rare in one or more samples making its selection from the library subject to chance. In the first case, the OTU would not be found in any sample, and that OTU could simply be deleted from the dataset without effect. In the second case, the OTU would be represented in one group but not the other. In the third case, where an OTU has at least one count in at least one sample, a value of 0 could arise in other samples because of sampling and sequencing depth. In the latter two cases, it is possible that the OTU could have been detected if more reads per sample were obtained or if more replicates of the library were sequenced. Current practice in 16S rRNA gene sequencing studies is to assume that an observed value of 0 in a sample represents the actual value. In RNA-seq it is common to remove all genes where the total sum across all samples is small (usually with a mean of 2 or less and no more than about 10 counts in any sample). In either case, the assumption is that variables with very low counts are generally irrelevant (Although there are some exceptions that I will point out in later posts).

The distribution of counts in a replicate OTU when the first OTU has counts between 0 and 3. The same library of sixteen different 16S rRNA gene amplification samples were sequenced on two different Illumina HiSeq runs, and the count of OTUs that had values of 0 to 3 in one replicate, shown by the black bar, were tabulated for the other replicate, shown by the grey bar. Sequencing depth for each replicate was within 10% of the the other replicate. 
This assumption was tested by sequencing the exact same library from 16 different samples on two individual Illumina runs, and then determining the OTU count in one run if the OTU had a count of 0, 1, 2, or 3 in the other run. The figure shows the result, and it can be seen that the count observed for an OTU in one replicate is often very different from the count observed for the other replicate. Similar observations hold for RNA-seq (2) It is clear that the absolute number of counts observed varies between pure technical replicates and as expected the underlying distributions approximate what would be expected for random sample of the input library.

Another way of saying this is that: Given a set of counts for a sample, there are many possible sets of counts that are as likely if we sequenced them tomorrow.

What we need is some way to estimate what a count of 0 (or any other count for that matter) actually represents when we observe it in our datasets.

 1) Wikipedia geometric mean
 2) ANOVA-Like Differential Expression (ALDEx) Analysis for Mixed Population RNA-Seq

Friday 10 April 2015

Getting your high-throughput sequencing data in shape

Posted by: Greg Gloor

High throughput sequencing data is often spoken of as "counts" (RNA-seq) or as "relative abundances" (16S rRNA gene sequencing, metagenomics). What does this really mean, and how is the data derived?

Whenever we conduct a HTS experiment we are usually interested in counting and comparing the relative number of molecules in an input sample. In other words, comparing the abundance of molecules in a tube treated one way to the abundance of the same set of molecules treated in a different way. We then make a library of those DNA or RNA molecules, apply them to the sequencer, collect and map the results into a table of counts where the counts per gene (OTU, feature) are tabulated. These counts are normalized for sequencing depth in one of two major ways. First, by an RNA-seq type normalization that seeks to find correction factors that give the same number of counts per sample, or second by converting the values to proportions or percentages. Then statistical tests are done to determine what, if anything is different between treatments or groups.

A simple thought experiment shows why this approach can give incorrect inferences. Let us take an original set of molecules containing 100 different features (genes or OTUs), and ensure that 99 of them do not change. We let one molecule increase in abundance by 2-fold  increments for 20 time increments. This would be similar to an experiment where a single bacterial species overgrows in an environment until it becomes the dominant organism, but where the other organisms are neither killed nor eliminated. An equivalent in RNA-Seq would be an experiment where the expression of one gene is induced from an extremely strong promoter. Such an environment would look like the left panel of the figure labelled as Input. This is the original data if we could physically count the molecules, and any analysis should be done on data with the same shape.




When we make a library and sequence these 20 samples the sequencer can provide only a fixed upper number of reads per sample. In the case of an Illumina HiSeq with 200 million reads, this would work out to approximately 10 million reads per sample. Thus, the molecules from  each time point now represent a random sample of the original input molecules, and the number of reads per feature is essentially reduced to reads per 10 million reads. This changes the shape of the data as shown in the middle panel labeled  proportion. This is the shape of the data as analyzed by almost all bioinformatic tools.

If we compare different time points can we make the correct inference? Here is where it gets interesting. If we are comparing time point 1 to time point 5, we would probably be correct in that the black sequence has increased in abundance, and the remainder are unchanged. However, if we compare later time points, say point 15 to  point 20, we could be very wrong. We could easily infer that black is unchanged, and that the others decreased!

Data transformation to the rescue. For this reason, data transformations on proportions are often suggested, and the transformation that recaptures the shape of the input data the best is the centered log-ratio transformation shown in the right panel. In this transformation all values in a sample are divided by the geometric mean abundance of the features and then the logarithm of the ratio is taken (I will deal with the issue of 0 values in a later post). This transformation, pioneered by Aitchison in 1986, is a simple, general purpose tool that will recapitulate the accurate shape of the data provided most of the data points are either unchanged, or change at random. Values are now expressed as the ratio to the geometric mean abundance, so the red feature is rarer than average in all samples, and the black feature goes from slightly more abundant than average to substantially more abundant than anything else in the sample. In this case, it should be clear that we would likely make the correct statistical inference regarding abundance changes for each feature regardless of the time increments compared.