3 Taxonomic bias can cause errors in differential-abundance results

How do the errors caused by taxonomic bias in the abundances measured for individual samples impact DA analysis? Although there are many ways to quantify the changes in abundance that form the basis of a DA analysis, we focus on DA analyses of the (log) fold changes in proportions, ratios, and absolute abundance; these multiplicative difference measures are common (though not ubiquitous) in extant DA analyses and have more direct ecological interpretations (via the processes of exponential growth and decay) than non-multiplicative measures.

3.1 Fold changes between a pair of samples

The building blocks of a DA analysis are the fold changes (FCs) in abundance between individual pairs of samples.

The impact of bias on the measured FCs in species proportions and ratios follows directly from the results of Section 2 for the error in individual-sample measurements. From Equation (2.7), it follows that the measured FC in the proportion of species $i$ from sample $a$ to sample $b$ is \[\begin{align} \tag{3.1} % \tag*{Fold change in proportion} \underbrace{\frac{\widehat{\text{prop}}_{i}(b)}{\widehat{\text{prop}}_{i}(a)}} _\text{measured FC} &= \frac {\text{prop}_{i}(b) \cdot \cancel{\text{efficiency}_{i}} / {\text{efficiency}_S(b)}} {\text{prop}_{i}(a) \cdot \cancel{\text{efficiency}_{i}} / {\text{efficiency}_S(a)}} \\[0.5ex] &= \underbrace{\frac{\text{prop}_{i}(b)}{\text{prop}_{i}(a)}}_\text{actual FC} \cdot \underbrace{\left[\frac{\text{efficiency}_S(b)}{\text{efficiency}_S(a)}\right]^{-1}}_\text{fold error} . \end{align}\] The sample-independent efficiency factor cancels, but the sample-dependent mean efficiency does not, leaving an error equal to the inverse of the change in the mean efficiency. In contrast, when we use Equation (2.9) to compute the error in the FC in the ratio between two species $i$ and $j$, we find that the constant error $\text{efficiency}_{i} / \text{efficiency}_{j}$ exactly cancels, so that the measured FCs remain accurate regardless of whether the mean efficiency varies.

Figure 2.1 (bottom row) illustrates these two different behaviors for the error in the FCs of proportions versus ratios when the mean efficiency varies between samples. Here the mean efficiency decreases by a factor of 2.6 (FC of 0.4X) from Sample 1 to Sample 2, which causes the FC of the proportion of each species to be measured as 2.6X larger than its true value. Though the fold error for all species is the same, the implications depend on the actual FC and correspond to three distinct types of error: an increase in magnitude, a decrease in magnitude, and a change in direction; we refer to the latter as a sign error since in reference to the change in sign in the corresponding log fold change (LFC). We can see each type of error in Figure 2.1. For Species 1, which increases and thus moves in the opposite direction of the mean efficiency, we see an increase in magnitude of the measured FC (actual FC: 2.3X, measured FC: 6.5X). For Species 2, which decreases and thus moves in the same direction as the mean efficiency but by a larger factor, we see an decrease in magnitude (actual FC: 0.15X, measured FC: 0.44X). For Species 2, which decreases by a smaller factor than the mean efficiency, we see a change in direction (actual FC: 0.6X, measured FC: 1.8X), such that the species actually appears to increase (a sign error). In contrast, the fold error in Equation (2.9) completely cancels when we divide the ratio measured for one sample $a$ by another sample $b$.

These differences in the FCs of proportions versus ratios are mirrored when we compare FCs in absolute abundance made with normalization to a total-abundance measurement versus a reference-species measurement. FCs computed from total-abundance normalization are subject to error when the ratio of the two mean efficiencies (that for the MGS measurement and for the total-abundance measurement vary across samples). In contrast, FCs computed from reference-species normalization are not subject to error, so long as the (fold) error in the assumed or measured abundance of the reference species is constant.

3.2 Regression analysis of many samples

DA analysis of multiple samples can be framed as a regression problem in which we analyze the relationship between a microbial response variable, such as the log abundance of species $i$, with one or more covariates, such as the pH of the sampled environment or whether the sample is from a healthy or sick person. The simplest regression analysis uses the simple linear regression model, \[\begin{align} \tag{3.2} \log \widehat{\text{abun}}_i(a) = \alpha_i + \beta_i x(a) + \varepsilon_i(a), \end{align}\] where $x$ is a continuous covariate (e.g. pH) or a binary covariate (e.g. $x=1$ for treated patients and $x=0$ for controls), $\alpha_i$ and $\beta_i$ are regression coefficients, and $\varepsilon_i(a)$ is a mean-zero random variable that reflects the residual (unexplained) variation in the response. In a DA analysis, we are interested in the slope coefficient $\beta$, which describes how the species’ abundance changes with $x$. We are generally uninterested in the intercept coefficient, $\alpha$, which captures differences in the baseline abundance among species and—it is usually hoped—study- and species-specific systematic error such as that created by taxonomic bias.

How does taxonomic bias in fact impact estimates of the slope coefficient, $\beta$? Appendix C derives the effect of bias on the estimated coefficients for the simple linear regression model under estimation via Ordinary Least Squares (OLS) or Maximum Likelihood Estimation (MLE). The result has a simple interpretation that provides intuition for what to expect in more complicated generalized linear regression models used in popular DA methods. Consider the regression (3.2) fit to the measured log abundance of species $i$ across all samples. The measurement error in the response variable, \[\begin{align} \text{error}_{i}(a) \equiv \log \widehat{\text{abun}}_i(a) - \log \text{abun}_i(a) = \log \frac{\widehat{\text{abun}}_i(a)}{\text{abun}_i(a)}, \end{align}\] can be thought of as having three components:

Error that is constant across samples affects the intercept coefficient, $\alpha_i$.
Error that varies systematically with $x$ affects the slope coefficient, $\beta_i$.
Error that varies and is uncorrelated with $x$ affects the residual variation, $\varepsilon_{i}(s)$.

We saw in Section 2 that the error from taxonomic bias has a species-specific, sample-independent component that is constant across samples; this component causes a species-specific systematic error in the estimated intercept, $\alpha_{i}$. The error also has a sample-specific, species-independent component. The portion that is correlated with the covariate $x$ creates a systematic error in the slope coefficient $\beta_{i}$ that is the same for every species and opposite in sign to how the error varies with $x$; the portion that is uncorrelated with $x$ affects the precision (standard errors) of the slope estimates in a species-specific manor.

Figure 3.1 illustrates with a simulated data for the case of log abundance measured using the total-abundance normalization method (Equation (2.10)), assuming that total community abundance is measured perfectly accurately. In this case, the variable component of the error is due entirely to variation in the log mean efficiency of the MGS measurement. Variation in the log mean efficiency that is associated with the covariate $x$ creates a systematic error in the estimated slope $\hat \beta$ equal to the negative of the (scaled) covariance of log mean efficiency with $x$. The absolute error is the same for all species; however, its relative value depends on the magnitude of the covariance of the log mean efficiency with $x$ relative to that of the response (here, $\log \text{abun}_{i}$) with $x$ or, equivalently, the relative magnitudes of their slopes. As in the case of fold changes between pairs of samples, the net effect can be decreases in magnitude (Species 9, 10, and 1 in Figure 3.1), changes in sign (Species 5), or increases in magnitude (remaining species) depending on these relative values. Variation in the log mean efficiency that is uncorrelated with $x$ does not systematically distort $\hat \beta$ but does affect its precision, typically leading to increased standard errors as the variation in log mean efficiency effectively acts as an additional source of noise in measured abundance (Figure 3.1 D). The exception is for species whose residual variation is strongly positively correlated with that of log mean efficiency (here, Species 9), which can appear to have less random variation and receive standard errors that are too small. Decreased magnitudes and increased standard errors can both cause associations to be missed that would otherwise have been detected (Species 10 and 1), while increased magnitudes can turn weak or statistically insignificant associations into strong and statistically significant ones (Species 7, 6 and 4).

Overall, we find that variation in measurement error created by variation in the mean efficiency of the MGS measurement can negatively affect certain DA analyses by creating systematic errors in the slope coefficient estimates and/or reducing their precision. Similar effects occur for regressions of log proportions and of log abundances derived by total-abundance normalization when the total abundance is accurately measured; however, as noted in Section 2, total-abundance measurements that are taxonomically biased in a complementary manner can mitigate these effects. Regression analyses of log-ratios and log abundances derived from reference-species normalization remain unaffected—the errors are constant across samples and so only impact the intercept coefficients.

$Taxonomic bias distorts multi-sample differential abundance inference when the mean efficiency of samples is associated with the covariate of interest. This figure shows the results of a regression analysis of simulated microbiomes consisting of 50 samples and 10 species from two environmental conditions indexed by $x=0$ and $x=1$. In this simulation, the species with the largest efficiency (Species 9) also has the largest positive LFC, which drives the positive association of the log mean efficiency with the condition (shown in Panels A and B). This positive LFC in the log mean efficiency induces a systematic negative shift in the estimated LFCs of all species (Panels C and D). Panel D shows the mean LFC (points) with 95% confidence intervals (CIs), for each species estimated from either the actual or the measured densities. The error (difference in LFC estimates on measured and actual) equals the negative LFC of the mean efficiency (shown in Panel B).$

Figure 3.1: Taxonomic bias distorts multi-sample differential abundance inference when the mean efficiency of samples is associated with the covariate of interest. This figure shows the results of a regression analysis of simulated microbiomes consisting of 50 samples and 10 species from two environmental conditions indexed by $x=0$ and $x=1$. In this simulation, the species with the largest efficiency (Species 9) also has the largest positive LFC, which drives the positive association of the log mean efficiency with the condition (shown in Panels A and B). This positive LFC in the log mean efficiency induces a systematic negative shift in the estimated LFCs of all species (Panels C and D). Panel D shows the mean LFC (points) with 95% confidence intervals (CIs), for each species estimated from either the actual or the measured densities. The error (difference in LFC estimates on measured and actual) equals the negative LFC of the mean efficiency (shown in Panel B).