4 Case studies

To better understand the potential impact of taxonomic bias on DA analysis in practice, we conducted several case studies spanning a range of biological scenarios and sequencing technologies.

4.1 Foliar fungi experiment

Measurement of control communities along with the primary experimental samples can enable researchers to directly measure and remove the effect of bias prior to or concurrent with downstream DA analysis (McLaren, Willis, and Callahan (2019)). Gnotobiotic community experiments are well suited to this form of calibration via community controls since, unlike most natural ecosystems, it is possible to assemble ‘mock communities’ containing all species in known relative abundances.

Leopold and Busby (2020) are the first to use mock-based calibration in a gnotobiotic microbiome experiment. In a study of host-commensal-pathogen interactions, Leopold and Busby (2020) inoculated plants with 8 commensal fungal species and subsequently exposed plants to a fungal pathogen. To investigate the joint effect of colonization order and host genotype, plants varied in source location and in which commensal colonized first. The authors used ITS amplicon sequencing to measure communities before and after pathogen infection, along with mock communities with quantified genome concentrations of DNA from the 9 species. These DNA mock communities allowed the authors to directly measure the bias due to the sequencing workflow following DNA extraction and correct for it when analyzing the primary experimental samples following the method of McLaren, Willis, and Callahan (2019). Although extraction is not accounted for, considerable bias might still be expected due to preferential PCR binding and amplification (CITE) and the substantial ribosomal copy-number-variation (CNV) in fungi (Lofgren et al. (2019)). The authors found a 13X difference between the most and least efficiently measured commensal, while the pathogen was measured 40X more efficiently than the least efficiently measured commensal. We reanalyzed this experiment to understand the impact that bias might have had on the study’s analyses were it not accounted for.

Leopold and Busby (2020) used the ITS sequencing profiles of the pre-infection samples to understand the impact of host genotype and an experimental treatment—which species was allowed to colonize first—on the commensal community composition prior to infection. Two DA analyses were conducted to address these questions. First, the authors used linear regression of log species proportion (via negative-binomial regression to account for random read-sampling error) to ask whether host genotype, treatment, and their interaction significantly impacted overall community composition. Second, the authors quantified the strength of priority effects—the advantage of being allowed to colonize first—in different host genotypes by the LFC in the proportion of a species when allowed to colonize first versus later with the majority of commensals.

Since these analysis are both based on proportions, bias could in principle impact their results. We re-ran both analyses with and without bias correction, finding nearly identical results regardless of whether bias is accounted for⁶. To understand why, we compared the multiplicative variation taxa proportions with that of the mean efficiency. The proportions of each taxon varied substantially across samples, much more so than the mean efficiency, which had a geometric range of 1.57X and a geometric standard deviation of just 1.06X (SI Figure D.1). Why does the mean efficiency vary so little despite substantial multiplicative variation in the efficiency among taxa and in the taxa proportions across samples? This answer relates to the fact that the mean efficiency is a weighted arithmetic average of species proportions and so is insensitive to large multiplicative variation in low-proportion species. The pre-infection samples in the Leopold and Busby (2020) study are always dominated by the three species with the highest efficiencies, whose efficiencies vary only by 3X (and just 1.5X between the two most dominant). The lowest efficiency species, despite having large variation in their proportion on the log scale, are always rare (proportions \(\ll 1\), even after bias correction) and hence don’t significantly affect the mean efficiency. Because the log mean efficiency varies much less than the log proportions for each taxon, there is little room for it to distort the results of DA analyses based on variation in log proportions.

We additionally considered the potential for bias to impact analysis of how commensal taxa responded to infection. In particular, it is interesting to consider whether any commensals are able to increase in absolute concentration in response to infection. The experiment performed by Leopold and Busby (2020) does not allow absolute-abundance measurement; nevertheless, the results of the previous sections allow us to consider the impact that bias would have on an analysis of absolute genome concentration measured using the total-community normalization method (Equation (2.8)). It is first useful to consider how bias impacts the estimated change in log proportion of each commensal following infection. The pathogen is absent in pre-infection samples but tends to dominate the community post-infection (SI Figure D.2), leading to an average increase in mean efficiency ranging from 3.5X to 5X across host genotypes and an expectation that commensals will typically decrease in proportion due to the sum-to-one constraint. Without bias correction, the log proportion of commensals decrease in each host genotype, often substantially (log-e fold changes less than -2); however, bias leads to overly negative LFCs due to the higher mean efficiency in post-infection samples. Bias correction substantially lessens the magnitude of LFC estimates, in some cases leading to estimates that are approximately 0 or slightly positive (SI Figure D.3).

Now, consider the impact this shift in mean efficiency might have on a regression analysis with log absolute genome concentration as the response, if measured using the total-community normalization method (Equation (2.8)). We consider two scenarios. Scenario 1: Total genome concentration in each sample is perfectly known and used to measure the concentrations of individual taxa via Equation (2.8). In this case, bias in the MGS measurements would create absolute errors in the LFCs for genome concentration identical to those for proportions. The scientific error, however, may be much worse. Since the sum-to-one constraint of proportions no longer applies, the LFCs will necessarily be larger (less negative), such that the absolute error caused by the change in mean efficiency could plausibly create substantial biological errors where taxa that persist or even increase appear to decrease. Scenario 2: Total ITS concentration is measured by qPCR with the same primers and PCR protocol used for ITS sequencing, and used without any correction to measure individual taxa concentration with the total-community normalization method. In this case, we should expect the increase in mean efficiency to have little to no impact on LFCs, since it would be offset by the increase in mean efficiency of the total-concentration measurement. Moreover, since the total ITS concentration and ITS sequencing are both performed on the extracted DNA, we can also expect the effect of taxonomic bias caused by DNA sequencing to be offset. Thus, the systematic shift in mean efficiency following infection may or may not significantly impact the LFC estimates depending on how the total-community measurement is made prior to normalization.

4.2 Vaginal microbiomes of pregant women

The vaginal microbiome during pregnancy has been a source of intensive study due to its apparent connection with the health of both mother and her developing child. Many MGS studies have found associations of specific microbial species and community characteristics with rates of urinary tract and sexually-transmitted infections, bacterial vaginosis (BV), and preterm birth. Yet these associations vary across studies of different populations and using different MGS methods. DA analyses of the vaginal microbiome are commonly based on proportions, creating an opportunity for taxonomic bias to impact results. A number of studies have experimentally demonstrated substantial taxonomic bias among MGS protocols and individual steps (such as extraction and PCR amplification) in vaginal samples or in vitro samples of vaginally-associated species (Yuan et al. (2012),Brooks et al. (2015),Gill et al. (2016),Graspeuntner et al. (2018)). Bias has been proposed as a potential explanation for discrepancies across studies (Callahan et al. (2017), Others?), but there has so far been little quantitative analysis of this possibility. Here we use empirical bias measurements from control samples to investigate the role of bias in proportion-based DA analyses of vaginal microbiomes from a recent large-cohort study of pregnant women.

As part of the Multi-Omic Microbiome Study: Pregnancy Initiative (MOMS-PI) study, Fettweis et al. (2019) collected longitudinal samples from over 1500 pregnant women, including nearly 600 that were measured by amplicon sequencing of the 16S V1-V3 hypervariable region to yield species-level bacterial taxonomic profiles. Taxonomic bias of this MGS protocol was previously investigated by Brooks et al. (2015) and McLaren, Willis, and Callahan (2019), using measurements by Brooks et al. (2015) of cellular mock communities of seven common, clinically-relevant vaginal bacterial species. Of these, Lactobacillus iners had the highest efficiency, which was nearly 30X larger than that of the species with the lowest efficiency, Gardnerella vaginalis. A second Lactobacillus species, L. crispatus, had an efficiency that was approximately 2X less than L. iners and 15X greater than G. vaginalis. These species, along with the unculturable Lachnospiraceae BVAB1, are the most common top (most-abundant) species (SI analysis) and can reach high proportions in individual samples, indicating that shifts between them might drive large changes in the mean efficiency which might in turn impact DA results.

We sought to assess this possibility using a joint analysis of the control measurements from Brooks et al. (2015) and the microbiome measurements from the MOMS-PI study. To obtain the taxonomic bias among all species identified in the MOMS-PI measurements, we used the taxonomic relationships with the seven control species to impute the efficiencies for the remaining species. We used these imputed efficiencies to calibrate (correct the effect of bias in) the MOMS-PI measurements, examine variation in the mean efficiency varied across samples, and compare the results of a DA analysis with and without bias correction.

The mean efficiency varies substantially across vaginal samples (Figure 4.1. This variation appears to be primarily driven by variation in which species is most abundant, samples in which a Lactobacillus species is most abundant (after calibration) typically have an efficiency that is 3-20X greater than samples in which G. vaginalis is most abundant. Shifts between Lactobacillus-dominance and Gardnerella-dominance are common in between-women comparisons and occasionally occur between consecutive visits in individual women (SI Figure D.4). These shifts typically result in substantial fold changes in mean efficiency (SI Figure D.4) and can cause spurious fold changes in the trajectories of lower-abundance species (SI Figure D.5).

Figure 4.1: The mean efficiency in vaginal samples from the MOMS-PI study varies with the most abundant species.

We next considered whether the observed variation in mean efficiency could cause systematic error in a DA analysis. In particular, we hypothesized that DA analysis of species proportions versus a covariate that is associated with Lactobacillus and/or Gardnerella would be particularly prone to spurious results. Patient metadata was not available due to privacy restrictions; we therefore sought a clinically relevant covariate to use in a regression analysis that could be determined directly from the microbiome profiles and that we a priori expected to be associated with the proportions of Lactobacillus and Gardnerella. Alpha diversity metrics such as species richness, the Shannon index, and the (Inverse) Simpson index have been repeatedly found to be strongly positively associated with bacterial vaginosis (BV; Srinivasan et al. (2012), Cartwright et al. (2018)). Cartwright et al. (2018) found that an observed Simpson Diversity Index of 0.82 (corresponding to an order-2 effective number of species of 5.6) classified high diversity samples as BV positive with a sensitivity of 100% and specificity of 85.1%. In addition, it is commonly observed that samples from women with and without BV that are dominated by Lactobacillus spp. tend to have higher diversity, whereas samples dominated by Gardnerella tend to have lower diversity. We therefore chose to perform a DA analysis of species proportion versus alpha diversity, hypothesizing that Lactobacillus and Gardnerella would drive a negative association of mean efficiency with diversity and thereby distort DA estimates for all species. We split samples into low, medium, and high diversity groups based on Shannon diversity in observed (uncalibrated) microbiome profiles. We then estimated the LFC in proportion from low- to high-diversity samples for a 30 common species by simple linear regression, using calibrated (bias-corrected) and observed (uncorrected) microbiome profiles. As expected, the mean efficiency was higher in low-diversity samples due to a larger fraction of samples dominated by Lactobacillus and a lower fraction with samples dominated by Gardnerella. The decline in mean efficiency in the high diversity group caused a concomitant increase in the LFCs of all species when bias was not accounted for. This error due to bias resulted in serious magnitude errors in nearly all species and sign errors in over one third, with errors of both types seen for several clinically relevant species.

4.2.1 Notes

It might be cleaner story-wise to redo the diversity partitioning using the threshold identified by Cartwright et al. (2018)
For the DA analysis of proportions vs. diversity, I tried other types of DA analysis methods (Gamma-Poisson regression and regression on ranks, similar to Wilcoxon test) to understand how the impact of bias differs across methods. In each case bias has a big effect but the species whose estimates are most affected varies. I would need to revise and verify this analysis before including its results. One reason for doing so would be to show that our qualitative finding about how bias distorts results applies beyond the basic linear regression model.
Decreases in mean efficiency during transitions from Lactobacillus to Gardnerella dominance can be expected to be even more extreme for commonly-used vaginal microbiome primers that fail to amplify Gardnerella.

4.3 Gut microbiome

(Has not been updated)

It is clear that an individual species with an unusually high or low efficiency can drive large changes in the mean efficiency when it shifts from a very large (\(\sim 1\)) to very small proportion (\(\ll 1\)) of the sample. Consideration of dynamics in the human gut illustrates another a way in which large shifts in higher-level taxa might create similar shifts, due to the fact that efficiencies do not vary independently among closely related species. Gut microbiomes are dominated by two phyla: The Bacteroidetes and the Firmicutes, the ratio of which varies substantially across individuals. Within a sample, there are usually multiple species contributing to the abundance of each phylum, with the median Inverse Simpson diversity being around 4 for each phylum in the HMP2 IBD study. Thus changes from Bacteroidetes to Firmicutes dominance are unlikely to be driven by a single species. However, if Firmicutes species tend to have systematically different efficiencies than Bacteroidetes species, then changes in phylum dominance can be expected to drive substantial changes in the mean efficiency even if many species contribute. Indeed, several studies have found DNA extraction protocols to more efficiently lyse Gram-negative Bacteroidetes species than Gram-positive Firmicutes species (though substantial variation within phyla also occurs, McLaren, Willis, and Callahan (2019)).

Systematic error in DA regression results requires that the variation in log mean efficiency is correlated with the covariate. The examples above each suggest plausible biologically-significant scenarios in which such correlations might arise. The Bacteroidetes-to-Firmicutes ratio has also been linked to host conditions. Yet in the HMP2 IBD dataset, the large variation in this ratio is largely independent of disease status, suggesting that the primary effect of any variation in mean efficiency driven by these phyla would be to increase noise rather than create systematic error in DA analysis.

4.4 Microbial growth in marine sediments

Our route for mean efficiency to become associated with the covariate is if the covariate quantifies a biological process that preferentially selects for a microbial trait that also tends to increase or decrease measurement efficiency. We illustrate this potential mechanism with a study of microbial growth in marine sediments.

The surface layers of marine sediments harbor diverse and abundant microbiomes. Total cell density and species richness decrease with depth as resources are consumed in older, deeper sediment layers; however, some taxa are able to grow and increase in density as they are slowly buried. Lloyd et al. (2020) performed a systematic assessment of growth rates of bacterial and archaeal taxa over a depth of 10 cm (corresponding to ~40 years of burial time) in sediment of the White Oak River estuary. To estimate growth rate, the authors first measured absolute cell density of microbial taxa using the total-community normalization method (Equation (2.8)), with taxa proportions measured with 16S amplicon sequencing and total community density measured by directly counting cells using epifluorescence microscopy (CARD-FISH). The authors refer to these absolute densities as FRAxC measurements, for ‘fraction of 16S reads times total cell counts’. The FRAxC measurements were used to infer growth rate from the slope of a simple linear regression of log cell density against burial time over the first 3 cm below the bioirrigation layer (corresponding to \(\sim 8\) years of burial). To validate the inference of positive growers, the authors compared the growth rates from FRAxC-based inference from two sediment cores, qPCR measurements in these cores for a few reference taxa, and FRAxC-based inference in two replicate laboratory incubation experiments.

Taxonomic bias could lead to systematic error in FRAxC-derived growth rates if sample mean efficiency tends to systematically vary with burial time (or equivalently, depth). One possibility is that microbes with tougher cell walls tend to persist longer (alive or dead) in the sediment, while at the same time being more difficult to extract DNA from than microbes with weaker cell walls. In this case, we would expect the relative abundance of tougher species to increase with depth and hence the mean extraction efficiency to decrease, which in turn would lead to inflated growth-rate estimates for all taxa; a possible end result would be that taxa that decay sufficiently slowly would be mistakenly inferred to have positive growth.

We can test the hypothesis that systematic variation in log mean efficiency with depth distorts inferred growth rates using the qPCR measurements of two clades also collected by Lloyd et al. (2020). In Section B, we argue that taxonomic bias in targeted qPCR measurements is expected to create a roughly constant fold error and yield fold changes in absolute density that are relatively unaffected by changes in mean efficiency. Thus comparing qPCR to FRAxC growth rates on the same reference taxa allows us to estimate the systematic error in FRAxC growth rates. Because the systematic error in regression slopes due to bias is the same for each taxon (Equation (3.3)), these comparisons allow us to draw conclusions about the accuracy of FRAxC growth rates for all taxa. The first soil core included qPCR measurements of a single archaeal clade, Bathyarchaeota, for which growth rates by qPCR and FRAxC were nearly identical (doubling rates of 0.099/yr by FRAxC and 0.097/yr by qPCR). The second soil core included qPCR measurements of Bathyarchaeota and a second clade, Thermoprofundales/MBG-D. In this core, FRAxC and qPCR growth rates differed more substantially, with growth rates from FRAxC being larger by 0.012/yr for Bathyarchaeota (0.112/yr by FRAxC and 0.1/yr by qPCR) and by 0.086/yr for Thermoprofundales/MBG-D (0.294/yr by FRAxC and 0.208/yr by qPCR). A low number of experimental samples and noise in both the FRAxC and qPCR measurements place significant uncertainty in these measurements; however, the fact that FRAxC-derived growth rates are larger than qPCR-derived rates in all three cases is consistent with our hypothesis that mean efficiency decreases with depth in a manner that systematically biases FRAxC-derived rates to higher values. The differences in growth rate are small in absolute terms; however, the maximum observed difference of 0.086/yr suggests an error large enough to impact results for some taxa classified as positive growers, whose FRAxC growth rates ranged between 0.04/yr and 0.5/yr. Overall, the comparison between FRAxC and qPCR measurements gives support to the study conclusions, but suggest that species at the lower end of this range of positive FRAxC-derived rates may in fact be merely persisting or even slowly declining in abundance.

4.5 Conclusions

References

Brooks, J Paul, David J Edwards, Michael D Harwich, Maria C Rivera, Jennifer M Fettweis, Myrna G Serrano, Robert A Reris, et al. 2015. “The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies.” BMC Microbiol. BioMed Central. https://doi.org/10.1186/s12866-015-0351-6.

Callahan, Benjamin J, Daniel B DiGiulio, Daniela S Aliaga Goltsman, Christine L Sun, Elizabeth K Costello, Pratheepa Jeganathan, Joseph R Biggio, et al. 2017. “Replication and refinement of a vaginal microbial signature of preterm birth in two racially distinct cohorts of US women.” Proc. Natl. Acad. Sci. U. S. A. 114 (37): 9966–71. https://doi.org/10.1073/pnas.1705899114.

Cartwright, Charles P., Amanda J. Pherson, Ayla B. Harris, Matthew S. Clancey, and Melinda B. Nye. 2018. “Multicenter study establishing the clinical validity of a nucleic-acid amplification–based assay for the diagnosis of bacterial vaginosis.” Diagn. Microbiol. Infect. Dis. 92 (3): 173–78. https://doi.org/10.1016/j.diagmicrobio.2018.05.022.

Fettweis, Jennifer M., Myrna G. Serrano, Jamie Paul Brooks, David J. Edwards, Philippe H. Girerd, Hardik I. Parikh, Bernice Huang, et al. 2019. “The vaginal microbiome and preterm birth.” Nat. Med. 25 (6): 1012–21. https://doi.org/10.1038/s41591-019-0450-2.

Gill, Christina, Janneke H. H. M. van de Wijgert, Frances Blow, and Alistair C. Darby. 2016. “Evaluation of Lysis Methods for the Extraction of Bacterial DNA for Analysis of the Vaginal Microbiota.” Edited by Peter E. Larsen. PLoS One 11 (9): e0163148. https://doi.org/10.1371/journal.pone.0163148.

Graspeuntner, Simon, Nathalie Loeper, Sven Künzel, John F. Baines, and Jan Rupp. 2018. “Selection of validated hypervariable regions is crucial in 16S-based microbiota studies of the female genital tract.” Sci. Rep. 8 (1): 9678. https://doi.org/10.1038/s41598-018-27757-8.

Leopold, Devin R, and Posy E Busby. 2020. “Host Genotype and Colonist Arrival Order Jointly Govern Plant Microbiome Composition and Function.” Curr. Biol. 30 (16): 3260–3266.e5. https://doi.org/10.1016/j.cub.2020.06.011.

Lloyd, Karen G., Jordan T. Bird, Joy Buongiorno, Emily Deas, Richard Kevorkian, Talor Noordhoek, Jacob Rosalsky, and Taylor Roy. 2020. “Evidence for a Growth Zone for Deep-Subsurface Microbial Clades in Near-Surface Anoxic Sediments.” Appl. Environ. Microbiol. 86 (19): 1–15. https://doi.org/10.1128/AEM.00877-20.

Lofgren, Lotus A., Jessie K. Uehling, Sara Branco, Thomas D. Bruns, Francis Martin, and Peter G. Kennedy. 2019. “Genome‐based estimates of fungal rDNA copy number variation across phylogenetic scales and ecological lifestyles.” Mol. Ecol. 28 (4): 721–30. https://doi.org/10.1111/mec.14995.

McLaren, Michael R, Amy D Willis, and Benjamin J Callahan. 2019. “Consistent and correctable bias in metagenomic sequencing experiments.” Elife 8 (September): 46923. https://doi.org/10.7554/eLife.46923.

Srinivasan, Sujatha, Noah G. Hoffman, Martin T. Morgan, Frederick A. Matsen, Tina L. Fiedler, Robert W. Hall, Frederick J. Ross, et al. 2012. “Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria.” Edited by Adam J. Ratner. PLoS One 7 (6): e37818. https://doi.org/10.1371/journal.pone.0037818.

Yuan, Sanqing, Dora B. Cohen, Jacques Ravel, Zaid Abdo, and Larry J. Forney. 2012. “Evaluation of Methods for the Extraction and Purification of DNA from the Human Microbiome.” Edited by Jack Anthony Gilbert. PLoS One 7 (3): e33865. https://doi.org/10.1371/journal.pone.0033865.

The two sets of analyses can be seen at the following links: negative-binomial regression analysis; priority-effects analysis ↩︎