4 Case studies  

To better understand the potential impact of taxonomic bias on DA analysis in practice, we conducted several case studies spanning a range of biological scenarios and sequencing technologies.

4.1 Foliar fungi experiment

The taxonomic bias species present in ‘mock communities’ of known composition can be directly measured, and the measured bias can then be used to correct downstream DA analysis (McLaren, Willis, and Callahan (2019)). In practice it is difficult to construct control communities that span the many species present in complex natural communities. However, gnotobiotic community experiments are well suited to this form of calibration via community controls since it is possible to assemble mock communities containing all species in known relative abundances.

In a study of host-commensal-pathogen interactions, Leopold and Busby (2020) inoculated plants with 8 commensal fungal species and subsequently exposed plants to a fungal pathogen. The authors used ITS amplicon sequencing to measure communities before and after pathogen infection. Motivated by the substantial ribosomal copy-number variation (CNV) in fungi (Lofgren et al. (2019)), the authors also performed control measurements of mock communities that they constructed from quantified genomic DNA of the 9 species in the experiment; these controls were used to measure taxonomic bias with the method of McLaren, Willis, and Callahan (2019). The authors found a 13-fold difference between the most and least efficiently measured commensal, while the pathogen was measured 40-fold more efficiently than the least efficiently measured commensal.

Leopold and Busby (2020) performed two related DA analyses on the pre-infection communities: the first characterized the relative importance of host genetics and species arrival order on species relative abundances in the fully-established community, and the second quantified the strength of ‘priority effects’—the advantage gained by a species from being allowed to colonize first. Both analyses were based on fold differences (FDs) in species proportions and so in principle were sensitive to taxonomic bias. To improve accuracy, the authors incorporated the bias measured from the control samples into analysis-specific calibration procedures.

We repeated the two DA analyses of Leopold and Busby (2020) with and without calibration and found that the results did not meaningfully differ (SI Analysis of host genetics and arrival order, SI Analysis of priority effects). To understand why, we examined the variation in species proportions and the mean efficiency across the pre-infection communities (SI Analysis, SI Figure B.1). Despite the 13-fold variation in the efficiencies among species, the mean efficiency hardly varied across samples (SI Figure B.1C), having a geometric range of 1.62 and a geometric standard deviation of 1.05. This consistency in the mean efficiency was despite the fact that each species showed substantial multiplicative variation (SI Figure B.1A). But the pre-infection samples were always dominated by the three species with the highest efficiencies, which varied by 3-fold and by just 1.5-fold between the two most dominant. The mean efficiency, equal to the proportion-weighted arithmetic average of species efficiencies, is insensitive to species present at low relative abundance and so remained relatively constant across samples. Because the multiplicative variation in the mean efficiency was much smaller than that in the proportions of individual species, it had a negligible impact on the inferred FDs and the DA analyses based on them.

We performed an additional DA analysis on the data from Leopold and Busby (2020) to investigate whether any commensals increased in absolute concentration in response to infection (SI Analysis). The pathogen is absent in pre-infection samples but tends to dominate the community post-infection, resulting in a substantially higher mean efficiency in post-infection samples (SI Figure B.2). Across different host genotypes, the average post-infection increase in mean efficiency ranged from 2.5-fold to 5.2-fold. Using gamma-Poisson regression of the observed read counts, we estimated the average LFD in commensal species proportions following infection with and without calibration (SI Figure B.3). The commensals’ proportions typically decreased post-infection, which is expected given the pathogen’s growth to most abundant community member and the sum-to-one constraint faced by proportions. However, failure to account for bias caused the decrease to be overestimated, by an amount corresponding to the inverse change in log mean efficiency. For commensal-host pairs with relatively small observed decreases, bias correction greatly reduced the magnitude of the negative LFDs or, in several cases, resulted in LFDs that were near 0 or slightly positive.

Although Leopold and Busby (2020) did not include absolute-abundance measurements, we can consider the impact taxonomic bias would have on an absolute DA analysis in a simple scenario in which total genome concentration of each pre- and post-infection sample is perfectly known. The total genomic concentrations of each sample, combined with the species proportions measured by MGS, can be combined into a measurement of species concentrations using the total-abundance normalization method (Equation (2.11)). In this case, the bias in the MGS measurements will create absolute errors in the estimated LFDs of genome concentration of equal magnitude to the errors in the LFD estimates for proportions. The scientific error, however, might become worse. Suppose that total abundance increased by 2-fold due to pathogen growth; in this case, species whose proportions remained approximately constant would have increased in absolute abundance by around 2-fold. Bias shifts FD estimates downwards by 2.5-fold to 5.2-fold (depending on host genotype). Hence, without bias correction, we would instead conclude that these species decreased.

4.2 Vaginal microbiomes of pregnant women

A growing number of MGS studies have used DA analysis to describe associations between members of the vaginal microbiome and health conditions including urinary tract infections, sexually-transmitted infections, bacterial vaginosis, and preterm birth. However, these associations often vary between studies. DA analyses of the vaginal microbiome are commonly based on proportions, creating an opportunity for taxonomic bias to impact results. Substantial taxonomic bias has been experimentally demonstrated in MGS protocols applied to vaginal samples or in vitro samples of vaginally-associated species (Yuan et al. (2012), Brooks et al. (2015), Gill et al. (2016), Graspeuntner et al. (2018)). The different biases of different MGS protocols have been proposed as one explanation for discrepancies in DA results across studies (Callahan et al. (2017)).

As part of the Multi-Omic Microbiome Study: Pregnancy Initiative (MOMS-PI) study, Fettweis et al. (2019) collected longitudinal samples from over 1500 pregnant women, including nearly 600 that were taxonomically profiled by amplicon sequencing of the 16S V1-V3 hypervariable region. The taxonomic bias of the MOMS-PI MGS protocol was previously quantified by McLaren, Willis, and Callahan (2019) using control measurements by Brooks et al. (2015) of cellular mock communities of seven common, clinically-relevant vaginal bacterial species. Of these, Lactobacillus iners had the highest efficiency, which was nearly 30-fold larger than that of the species with the lowest efficiency, Gardnerella vaginalis. A second Lactobacillus species, L. crispatus, had an efficiency that was approximately 2-fold less than L. iners and 15-fold greater than G. vaginalis. These species, along with the unculturable Lachnospiraceae BVAB1, are most frequently the most abundant species in a sample and commonly reach high proportions of over 70% (SI Analysis). Therefore, shifts between these dominant species could drive large changes in the sample mean efficiency, which might in turn distort DA analyses.

We sought to understand the potential impact of bias on DA analysis of vaginal microbiome profiles from the MOMS-PI study. We first examined variation in the mean efficiency across samples under the assumption that bias in the MOMS-PI study was accurately represented by the Brooks et al. (2015) control measurements. Using taxonomic relatedness to impute the efficiencies of species not in the controls, we were able to calibrate the MOMS-PI profiles and estimate the mean efficiency of each sample (SI Analysis). The mean efficiency varies substantially across vaginal samples (Figure 4.1), with samples in which a Lactobacillus species is most abundant typically having a mean efficiency that is 3-fold to 20-fold greater than samples in which G. vaginalis is most abundant. The vaginal microbiome sometimes shifted between Lactobacillus-dominance and Gardnerella-dominance between consecutive visits in individual women (SI Figure B.4), which caused magnitude and direction errors in the trajectories of lower-abundance species (SI Figure B.5).

The mean efficiency in vaginal samples from the MOMS-PI study varies with the most abundant species.

Figure 4.1: The mean efficiency in vaginal samples from the MOMS-PI study varies with the most abundant species.

These observations suggest that a proportion-based DA analysis of a health condition that is associated with Lactobacillus and/or Gardnerella would be prone to spurious results. To demonstrate this possibility, we performed a DA analysis of the MOMS-PI samples with respect to a simulated health condition (SI Analysis). Bacterial vaginosis (BV) has been repeatedly found to be associated with a dramatically reduced proportion of Lactobacillus spp., an increased proportion of G. vaginalis, and higher within-community (alpha) diversity (Srinivasan et al. (2012), Cartwright et al. (2018)). As a proxy for clinical BV, we split samples into high diversity (BV-like), intermediate diversity, and low diversity (non-BV-like) groups based on Shannon diversity in observed (uncalibrated) microbiome profiles. The mean efficiency was lower in the high-diversity BV-like samples (mean LFD of -1.6) due to a lower abundance of Lactobacillus spp.. We estimated the LFD in proportion from non-BV-like to BV-like samples for 30 prevalent bacterial species, with and without bias correction, using gamma-Poisson (or negative-binomial) regression of the observed read counts to fit a linear model to log species proportions. We accounted for bias by including an offset term in the linear model equal to the log ratio of the species’ efficiency to the sample mean efficiency. As expected, bias correction reduced the estimated LFD from non-BV-like to BV-like groups for every species tested, though the importance of this effect varied among species (SI Figure B.6).

4.3 Human gut microbiomes

We wondered whether DA studies of the human gut microbiome might be less affected by bias than studies of the vaginal microbiome, due to the greater alpha diversity observed within gut microbiome samples. Recall that the mean efficiency is an abundance-weighted average over the species in a sample. Thus, all else equal, we expect the mean efficiency to vary less across samples from ecosystems where samples have higher alpha (within-sample) diversity, as is the case for stool samples relative to vaginal samples (Huttenhower et al. (2012)). On the other hand, the proportion of individual species may also vary less less across samples when alpha diversity is higher, in which case the effect of bias on DA results need not diminish.

To investigate this question, we analyzed bacterial species profiles of vaginal and stool samples derived from shotgun sequencing in the Human Microbiome Project (Huttenhower et al. (2012), SI Analysis). On average, stool profiles had substantially greater order-2 alpha diversity than vaginal profiles (SI Figure B.7A; geometric mean (GM) \(\md\) geometric standard deviation (GSD) of 6.3 \(\md\) 1.7 in stool samples and 1.4 \(\md\) 1.6 in vaginal samples). To assess the potential importance of bias for proportion-based DA analyses in the two ecosystems, we quantified the multiplicative variation in the mean efficiency across samples for a large number of possible taxonomic biases under the assumption that the measured profiles reflected the truth. Across simulation replicates, the GSD in the mean efficiency was typically lower in stool than in vaginal samples (SI Figure B.7B; ratio of GSD in vaginal samples to GSD in stool samples of 1.6 \(\md\) 1.5 across 1000 simulations). Notably, however, the multiplicative variation in species also tended to be lower (SI Figure B.7C); the GSD in the proportion of a gut species across gut samples was an average of 1.75-fold lower than that of a vaginal species across vaginal samples. Recall that the importance of bias estimation of the FD in a species’ proportion depends on GSD in the mean efficiency relative the GSD in the species’ proportion (Section 3). Thus these results suggest that although the mean efficiency likely varies less across stool than vaginal samples, bias may be just as problematic for proportion-based DA analyses.

4.4 Microbial growth in marine sediments

The surface layers of marine sediments harbor diverse and abundant microbiomes. Total cell density and species richness decrease with depth as resources are consumed in older, deeper sediment layers; however, some taxa are able to grow and increase in density as they are slowly buried. Lloyd et al. (2020) performed a systematic assessment of growth rates of bacterial and archaeal taxa over a depth of 10 cm (corresponding to ~40 years of burial time) in sediment of the White Oak River estuary. Taxa proportions were measured with 16S amplicon sequencing and total community densities were measured by counting cells using epifluorescence microscopy. The authors then calculated cell concentrations for each taxon by multiplying the taxon’s proportion by the total concentration, a method they referred to as FRAxC measurements for ‘fraction of 16S reads times total cell counts’. Taxon-specific growth rates were inferred from the slope of a simple linear regression of the log of the calculated concentration of each taxon against burial time over the first 3 cm below the bioirrigation layer (corresponding to ~8 years of burial).

The FRAxC concentration measurements made by Lloyd et al. (2020) correspond to the total-abundance approach to inferring absolute abundance (Equation (2.11)). Accordingly, taxonomic bias could lead to systematic error in FRAxC-derived growth rates if sample mean efficiency tends to systematically vary with burial time. Such systematic variation could occur if microbes with tougher cell walls both persist longer in the sediment and are more difficult to extract DNA from than microbes with weaker cell walls. In this scenario, the relative abundance of tougher species will increase with burial time, as the cells of weaker species degrade. This shifting composition will cause the sample mean efficiency to decrease with time, leading to inflated growth-rate estimates for all taxa. Taxa that decay with depth sufficiently slowly would mistakenly be inferred to have positive growth from the FRAxC-calculated abundances.

Importantly, Lloyd et al. (2020) also used qPCR to measure the absolute abundance of two taxa from these same samples. By comparing qPCR to FRAxC growth rates for these taxa, we can estimate the systematic error in FRAxC growth rates (SI Analysis) Because the systematic error that bias causes in the regression slope is the same for each taxon (Section 3), these comparisons allow us to draw conclusions about the accuracy of FRAxC growth rates for all taxa.

The first soil core included qPCR measurements of a single archaeal taxon, Bathyarchaeota, for which growth rates by qPCR and FRAxC were nearly identical (doubling rates of 0.099/yr by FRAxC and 0.097/yr by qPCR). The second soil core included qPCR measurements of Bathyarchaeota and a second taxon, Thermoprofundales/MBG-D. In this core, FRAxC and qPCR growth rates differed more substantially, with growth rates from FRAxC being larger by 0.012/yr for Bathyarchaeota (0.112/yr by FRAxC and 0.1/yr by qPCR) and by 0.086/yr for Thermoprofundales/MBG-D (0.294/yr by FRAxC and 0.208/yr by qPCR). Uncertainty in the FRAxC- and qPCR-derived growth rates is not reported and is likely substantial; however, the fact that the FRAxC-derived rates are larger than qPCR-derived rates in all three cases is consistent with our hypothesis that mean efficiency decreased with depth in a manner that systematically biased FRAxC-derived rates to higher values. The differences in growth rate are small in absolute terms; however, the maximum observed difference of 0.086/yr suggests an error large enough to impact results for some taxa classified as positive growers, whose FRAxC growth rates ranged between 0.04/yr and 0.5/yr. In summary, the comparison between FRAxC and qPCR measurements supports the overall study conclusions, but also suggests that species at the low end of positive FRAxC-derived rates may be merely persisting or even slowly declining in abundance.

4.5 Summary and discussion

The effect that consistent taxonomic bias has on proportion-based DA analyses depends on the MGS protocol, the biological system, and the sample comparisons under investigation. Our case studies explore a limited range of possibilities; however, we can begin to form some general conclusions by placing them within the context of the three scenarios described in Section 3.

We saw several cases where the mean efficiency is stable across samples, so that bias does not affect LFD estimates (Scenario 1). In the pre-infection fungal microbiome samples of Leopold and Busby (2020), we observed that the mean efficiency remained stable despite substantial bias and large multiplicative variation in species proportions among samples. Because the variation in the mean efficiency was much less than that of the individual species, the impact of bias on DA analysis was negligible. In the vaginal case study, we also observed that the mean efficiency was relative stable across vaginal microbiomes that were dominated by the same species, despite substantial bias and large LFDs in non-dominant species. In both cases, the stability of the mean efficiency can be understood by the fact that it is an additive average over species and so is primarily determined by the most abundant species in the sample.

Yet we also observed cases where the mean efficiency varied substantially. In several cases, the (log) mean efficiency co-varied sufficiently strongly with the condition of interest to cause substantial systematic errors in LFD estimates (Scenario 3). Examples include the comparison of foliar fungal microbiomes pre- and post-infection, for which the mean efficiency substantially increased post-infection, and the comparison of vaginal microbiomes with low versus high diversity in the MOMS-PI study, for which the mean efficiency was typically lower in high-diversity samples. Our analysis of marine sediment communities is consistent with a systematic decline of the mean efficiency with burial time that is sufficient to substantially inflate the estimated growth rates of slowly changing species. The mean efficiency cannot differ by more than the species proportions constituting it. Therefore, the error in the estimated LFDs tend to be practically significant only for species whose LFDs are substantially smaller in magnitude than the largest LFD.

Variation in the mean efficiency may also be substantial, yet unassociated with the condition of interest (Scenario 2). Although we did not directly observe this scenario directly, we have reason to think that it may be common in real microbiome studies. Our simulation analysis of gut microbiomes suggest that even in diverse ecosystems, bias will often cause multiplicative variation in the mean efficiency that is comparable to that of individual species. This variation is much less problematic for DA results when it not associated with the condition under study, since any loss in precision can in principle be offset by an increase in sample size. For certain applications, however, it will be important to remember that the LFDs between individual samples remain unreliable.

Under what conditions should we expect the problematic third scenario? Our case studies suggest one prominent mechanism for causing systematic variation in the mean efficiency: the existence of one or more species with unusually high or low efficiencies that tend to dominate the community more often in one sample condition than another. This mechanism was responsible for the negative effect of bias in our DA analyses of foliar fungal microbiomes following infection and of vaginal microbiomes with low versus high diversity. In experimental systems where no one species frequently forms a large fraction of the community, systematic variation of the mean efficiency can still arise through the collective change of many species that are associated with the condition of interest. We described a plausible instance in the marine-sediment case study, where increased burial time selects for lysis-resistant (and so lower efficiency) species. As another example, treatment with an antibiotic might selectively kill Gram negative species, which also tend to be easier to lyse, thereby decreasing the mean efficiency in fecal samples collected after treatment. A generic mechanism which might spawn such condition-efficiency associations stems from the evolutionary relationships among species. Species with recent common ancestry are expected to share a number of traits that determine measurement efficiency, including cell wall structure, genome size, ribosomal copy number, and PCR binding sequence. They are also expected to share traits related to the condition of interest. If related species tend to have similar efficiencies and similar associations with the condition, then they may drive an association of the mean efficiency with the condition.

These observations provide reasons to both worry and hope. It seems likely that in many studies, the mean efficiency is consistent (first scenario) or is at least not associated with the condition (second scenario), so that DA inferences remain valid. Yet it is not obvious which scenario any study falls into. There are plausible mechanisms leading to systematic variation of the mean efficiency even in ecosystems with high species diversity. Moreover, our explorations suggest that random variation in the mean efficiency is common and distorts comparisons between individual samples. Thus while we should not discount the large set of existing DA results, we should seek ways to better assess the robustness of results from previous studies and to measure and correct the error caused by bias in future ones.

References

Brooks, J Paul, David J Edwards, Michael D Harwich, Maria C Rivera, Jennifer M Fettweis, Myrna G Serrano, Robert A Reris, et al. 2015. The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies.” BMC Microbiol. 15 (1): 66. https://doi.org/10.1186/s12866-015-0351-6.
Callahan, Benjamin J, Daniel B DiGiulio, Daniela S Aliaga Goltsman, Christine L Sun, Elizabeth K Costello, Pratheepa Jeganathan, Joseph R Biggio, et al. 2017. Replication and refinement of a vaginal microbial signature of preterm birth in two racially distinct cohorts of US women.” Proc. Natl. Acad. Sci. U. S. A. 114 (37): 9966–71. https://doi.org/10.1073/pnas.1705899114.
Cartwright, Charles P., Amanda J. Pherson, Ayla B. Harris, Matthew S. Clancey, and Melinda B. Nye. 2018. Multicenter study establishing the clinical validity of a nucleic-acid amplification–based assay for the diagnosis of bacterial vaginosis.” Diagn. Microbiol. Infect. Dis. 92 (3): 173–78. https://doi.org/10.1016/j.diagmicrobio.2018.05.022.
Fettweis, Jennifer M., Myrna G. Serrano, Jamie Paul Brooks, David J. Edwards, Philippe H. Girerd, Hardik I. Parikh, Bernice Huang, et al. 2019. The vaginal microbiome and preterm birth.” Nat. Med. 25 (6): 1012–21. https://doi.org/10.1038/s41591-019-0450-2.
Gill, Christina, Janneke H. H. M. van de Wijgert, Frances Blow, and Alistair C. Darby. 2016. Evaluation of Lysis Methods for the Extraction of Bacterial DNA for Analysis of the Vaginal Microbiota.” Edited by Peter E. Larsen. PLoS One 11 (9): e0163148. https://doi.org/10.1371/journal.pone.0163148.
Graspeuntner, Simon, Nathalie Loeper, Sven Künzel, John F. Baines, and Jan Rupp. 2018. Selection of validated hypervariable regions is crucial in 16S-based microbiota studies of the female genital tract.” Sci. Rep. 8 (1): 9678. https://doi.org/10.1038/s41598-018-27757-8.
Huttenhower, Curtis, Dirk Gevers, Rob Knight, Sahar Abubucker, Jonathan H. Badger, Asif T. Chinwalla, Heather H. Creasy, et al. 2012. Structure, function and diversity of the healthy human microbiome.” Nature 486 (7402): 207–14. https://doi.org/10.1038/nature11234.
Leopold, Devin R, and Posy E Busby. 2020. Host Genotype and Colonist Arrival Order Jointly Govern Plant Microbiome Composition and Function.” Curr. Biol. 30 (16): 3260–3266.e5. https://doi.org/10.1016/j.cub.2020.06.011.
Lloyd, Karen G., Jordan T. Bird, Joy Buongiorno, Emily Deas, Richard Kevorkian, Talor Noordhoek, Jacob Rosalsky, and Taylor Roy. 2020. Evidence for a Growth Zone for Deep-Subsurface Microbial Clades in Near-Surface Anoxic Sediments.” Appl. Environ. Microbiol. 86 (19): 1–15. https://doi.org/10.1128/AEM.00877-20.
Lofgren, Lotus A., Jessie K. Uehling, Sara Branco, Thomas D. Bruns, Francis Martin, and Peter G. Kennedy. 2019. Genome‐based estimates of fungal rDNA copy number variation across phylogenetic scales and ecological lifestyles.” Mol. Ecol. 28 (4): 721–30. https://doi.org/10.1111/mec.14995.
McLaren, Michael R, Amy D Willis, and Benjamin J Callahan. 2019. Consistent and correctable bias in metagenomic sequencing experiments.” Elife 8 (September): 46923. https://doi.org/10.7554/eLife.46923.
Srinivasan, Sujatha, Noah G. Hoffman, Martin T. Morgan, Frederick A. Matsen, Tina L. Fiedler, Robert W. Hall, Frederick J. Ross, et al. 2012. Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria.” Edited by Adam J. Ratner. PLoS One 7 (6): e37818. https://doi.org/10.1371/journal.pone.0037818.
Yuan, Sanqing, Dora B. Cohen, Jacques Ravel, Zaid Abdo, and Larry J. Forney. 2012. Evaluation of Methods for the Extraction and Purification of DNA from the Human Microbiome.” Edited by Jack Anthony Gilbert. PLoS One 7 (3): e33865. https://doi.org/10.1371/journal.pone.0033865.