4 Implications for real-world inference
Are biologically significant errors in DA analyses likely in practice? We can begin to answer this question using a combination of theoretical analysis, consideration of hypothetical scenarios, and analysis of case studies for which some information about taxonomic bias is available.
4.1 Systematic error in slope or LFC estimates
Recall from Section 3 that the error in the slope or LFC in a DA regression analysis is proportional to the covariance of the log mean efficiency with the covariate. This covariance can be split into two components: the standard deviation of (log) mean efficiency and its correlation with the covariate. Hence when considering whether taxonomic bias creates large error, it can be useful to separately ask whether the mean efficiency is likely to vary across samples and whether it is likely to be correlated with a covariate of interest.
One approach to studying the mean efficiency empirically is to analyze data from studies for which control measurements allow direct measurement of species’ efficiencies. To characterize bias in a 16S sequencing protocol developed for the Vaginal Human Microbiome Project (VaHMP), Brooks et al. (2015) performed 16S rRNA gene sequencing of mock communities of seven bacterial species thought to play a critical role in the human vaginal microbiome. This protocol was later used for the VaHMP’s Multi-Omic Microbiome Study: Pregnancy Initiative (MOMS-PI) study (Fettweis et al. (2019)). We used the method for measuring and correcting bias developed by McLaren, Willis, and Callahan (2019) to measure bias from the mock communities and analyze its potential impact on DA analyses that use the MOMSPI measurements (SI Analysis MOMS-PI). The vaginal microbiome during pregnancy is characterized by relatively low diversity, with a single species often forming a vast majority of calibrated or uncalibrated read counts. The three species that most often dominate the calibrated profiles are Gardnerella vaginalis, Lactobacillus iners, and Lactobacillus crispatus. Our mock-community analysis indicate G. vaginalis has the lowest efficiency of the seven species, roughly 20X lower than L. crispatus and 30X lower than L. iners (the most efficiently measured). As a result, the mean efficiency varies substantially across samples, with samples dominated by a Lactobacillus species typically having an efficiency that is 3-20X greater than samples dominated by Gardnerella (Figure 4.1). Shifts between Lactobacillus and Gardnerella dominance are common in between-women comparisons, and also occasionally occur between subsequent sampling points in individual women (SI Figure D.1). Examination of the multiplicative trajectories in the proportions of individual species indicates that such transitions can cause spurious fold changes in lower-abundance species (SI Figure TODO). Decreases in mean efficiency during transitions from Lactobacillus to Gardnerella dominance can be expected to be even more extreme for commonly-used vaginal microbiome primers that fail to amplify Gardnerella.
Figure 4.1: The mean efficiency in vaginal samples from the MOMS-PI study varies with the most abundant species.
A second study in which control communities make it possible to directly analyze the impact of taxonomic bias in experimental samples was performed by Leopold and Busby (2020). To study the interactions between a host plant and its fungal commensals and pathogen, the authors inoculated trees with a fungal synthetic community and later exposed plants to a fungal pathogen. The authors used ITS amplicon sequencing to measure communities before and after infection, along with mock communities which they used to estimate the species efficiencies with the method of McLaren, Willis, and Callahan (2019). (DNA mocks were used, so that bias due to DNA extraction is not included, but other major bias sources such as PCR and ribosomal copy-number-variation are included.) We reanalyzed this experiment to consider the impact that bias has on different possible DA analyses (SI Analysis Fungi). As noted by Leopold and Busby (2020), there was a 13X difference between the most and least efficiently measured commensals, while the pathogen was measured 40X more efficiently than the least efficiently measured commensal. Yet despite this 13X variation among commensals, the mean efficiency is remarkably stable across communities prior to pathogen colonization. Consequently, taxonomic bias creates little systematic error in proportion-based DA analysis of these pre-colonization samples. The high efficiency of the pathogen did, however, typically lead to a roughly 6X increase in the mean efficiency of post-colonization samples. Hence DA analysis of how commensals responded to the pathogen did not first discard pathogen reads would therefore tend to suggest that the commensals had substantially smaller fold changes than they actually did, unless bias were accounted for.
It is clear that an individual species with an unusually high or low efficiency can drive large changes in the mean efficiency when it shifts from a very large (\(\sim 1\)) to very small proportion (\(\ll 1\)) of the sample. Consideration of dynamics in the human gut illustrates another a way in which large shifts in higher-level taxa might create similar shifts, due to the fact that efficiencies do not vary independently among closely related species. Gut microbiomes are dominated by two phyla: The Bacteroidetes and the Firmicutes, the ratio of which varies substantially across individuals. Within a sample, there are usually multiple species contributing to the abundance of each phylum, with the median Inverse Simpson diversity being around 4 for each phylum in the HMP2 IBD study. Thus changes from Bacteroidetes to Firmicutes dominance are unlikely to be driven by a single species. However, if Firmicutes species tend to have systematically different efficiencies than Bacteroidetes species, then changes in phylum dominance can be expected to drive substantial changes in the mean efficiency even if many species contribute. Indeed, several studies have found DNA extraction protocols to more efficiently lyse Gram-negative Bacteroidetes species than Gram-positive Firmicutes species (though substantial variation within phyla also occurs, McLaren, Willis, and Callahan (2019)).
Systematic error in DA results requires that the variation in log mean efficiency is correlated with the covariate. The examples above each suggest plausible biologically-significant scenarios in which such correlations might arise. For example, in the vaginal microbiome a decline in Lactobacillus and rise of species including Gardnerella vaginalis is associated with preterm birth in pregnant women; hence the variation in mean efficiency driven by these species may also create an association of mean efficiency with preterm status. In the plant-fungal experiment of Leopold and Busby (2020), the increase in pathogen proportion after infection leads to a large increase in the mean efficiency over time, which systematic increases the observed decline in the proportions of commensal species (SI Analysis). The Bacteroidetes-to-Firmicutes ratio has also been linked to host conditions. Yet in the HMP2 IBD dataset, the large variation in this ratio is largely independent of disease status, suggesting that the primary effect of any variation in mean efficiency driven by these phyla would be to increase noise rather than create systematic error in DA analysis.
Are there general mechanisms by which we might expect associations to arise between mean efficiency and the covariates commonly of interest in microbiome studies? One possibility is that the ecology and measurement efficiency of species may become correlated simply due to shared evolutionary history. A species’ efficiency is heavily influenced by genetically determined traits such as cell-wall structure, ribosomal-operon copy number, sequence in a primer-targeted region, GC content, and even whether the species is present in a given taxonomic database. Shared evolutionary history creates positive trait associations among more closely related species, such that we should expect a significant degree of phylogenetic conservation in efficiency. Meanwhile, the same shared evolutionary history creates phylogenetic conservation in ecological traits that drives species abundance dynamics. In this way, we may expect evolution to lead to a situation in which groups of species with similar efficiencies have similar shifts in abundance between conditions, creating the potential for commensurate shifts in mean efficiency. For example, a change in diet that favors Bacteroidetes species relative to Firmicutes based on the resulting gut conditions will also increase the relative abundance of easy-to-lyse species.
Associations between efficiency and abundance can also arise more directly when a single microbial trait affects both. For example, microbes at the ocean floor are slowly buried, sinking into a low nutrient, low oxygen environment. Lloyd et al. (2020) estimate the log fold change in the estimated absolute abundance of various taxa with sediment depth (as a proxy for time) to determine which taxa are able to persist and even grow in this difficult environment. It is plausible that microbes with tougher cell walls would tend to persist longer (alive or dead) in the sediment, while at the same time being more difficult to extract DNA from than microbes with weaker cell walls. As the relative abundance of tougher species increases with depth, the mean extraction efficiency might therefore decrease. This decrease would increase the inferred log fold changes and could lead to inferred growth of taxa that are actually just persisting or even slowly dying off. Additional measurements by Lloyd et al. (2020) make it possible to investigate this possibility explicitly. The primary analysis, which used proportion-based density estimates, was supported by targeted measurements (qPCR) of 16S concentration for particular taxa. For instance, in Core 30, the doubling time estimated for the Archaeal phylum Bathyarchaeota was estimated to be 10.1 years using the proportion-based densities and 10.3 years by qPCR (Lloyd et al. (2020) Table 1 and Figure 4). These doubling times correspond to highly similar estimated slopes of log density per year of \(1/10.1 \approx 0.99\) and \(1/10.3\approx 0.97\). The small difference between these slopes suggests that there is little correlation of log mean efficiency with burial time and hence little systematic error in the estimated doubling times derived from the proportion-based density estimates.
Another example of a trait that might simultaneously affect efficiency and relative abundance is ribosomal copy number, which increases the measurement efficiency in ribosomal amplicon experiments and is also linked to differences in ecology and population dynamics among species.
The biological significance of the error in \(\hat \beta\) depends on the relative magnitude of the LFC (or covariance with the covariate \(x\)) in the focal species vs in the mean efficiency. It may generally be true that the largest species LFCs in a DA analysis, which are often of primary interest in a microbiome study, will tend to be larger than the LFC of the mean efficiency, at least in high-diversity settings. Further investigation into this question may improve our confidence in the top hits identified in published DA analyses meeting certain conditions. However, we must remain weary of cases (as in the vaginal-microbiome and plant-fungal-interactions examples above) in which a shift in a single species can drive large changes in mean efficiency and lead to significant errors for other species. The experimental and computational methods discussed in Section 5 can mitigate taxonomic bias in these clearly-problematic cases while also helping to amass empirical evidence on the conditions in which bias is unlikely to cause major inferential errors.
4.2 Loss of precision and power
Another way in which taxonomic bias hinders DA analysis is by increasing the noise in abundance measurements, leading to a reduction in the precision of LFC estimates and thus the power to detect true associations.
There are two distinct mechanisms by which such reductions occur. The first was described in Section 3: Variation in the mean efficiency that is random (in the sense of being unassociated with the covariates in a regression analysis) tends to increase the standard errors in estimated regression coefficients. As noted above, this effect may be the dominant one in cases like the HMP2 IBD dataset where large variation in dominant phylum is independent of disease status. Correcting for the variation in mean efficiency using the calibration methods described in the next section can reverse this effect, thereby increasing the precision as well reducing the systematic error in DA inference.
The second mechanism involves the counting noise that stems from the random sampling of reads in the sequencing experiment. This counting noise is expected to be the dominant source of noise for species whose expected proportion after accounting for taxonomic bias is \(\lesssim 1 / \text{total reads}\) for a given sample. Having an atypically low efficiency makes a species much more likely to be near this threshold and hence to receive highly uncertain measurements. Such imprecise measurements can make it near impossible to make biologically meaningful conclusions. For example, in the experiment of Leopold and Busby (2020) described above, the median read counts in samples taken after pathogen colonization for the 5 commensals with the lowest efficiencies were 0, 0, 0, 1, and 11. These species’ efficiencies ranged from 40X-10X lower than the pathogen’s efficiency; thus, in the absence of bias the counts may have been much higher and it might have been possible to make much more precise conclusions about how their relative abundances changed during pathogen invasion. The literature on the vaginal microbiome during pregnancy provides us with another example: Disagreement among studies over whether Gardnerella is positively associated with preterm birth seems to be in part driven by the fact that several studies used 16S primers that are extremely poor at amplifying Gardnerella and hence lack the precision necessary to find associations even were they to exist. Unfortunately, even perfect knowledge of the measurement efficiencies will not allow us to overcome the low precision caused by small expected read counts; however, the methods described in the section provide principled ways to assess whether the low counts are due to taxonomic bias rather than truly low frequencies in a species’ abundance.