3 How taxonomic bias affects differential-abundance analysis
How do the measurement errors described in the previous section impact our ability to estimate the changes in microbial abundances across samples or between different host and environmental conditions? Though there are many ways to quantitatively define such change, here we restrict our attention to inferring the multiplicative or (equivalently) log fold change in proportions, ratios, and cell densities, which have direct ecological interpretations (via the processes of exponential growth and death) and are ubiquitous in microbiome DA analysis.
3.1 Change between a pair of samples
Before considering the stereotypical many-sample DA analysis, it is instructive to consider simplest analysis of differential abundance: the measurement of fold changes in abundance between a pair of samples. This case is relevant for understanding common visualizations for comparing abundances across individual samples, such as the ubiquitous proportion bar plot and abundances-through-time trajectories within a single host or environment, and conceptually bridges the single-sample results of the previous section to the many-sample case.
The composition-dependent effect of bias on fold error in proportions leads to error in fold-change measurements that is proportional to the inverse change in mean efficiency. From the error in an individual sample (Equation (2.5)), it follows that the measured fold change in proportion of taxon \(i\) from sample \(a\) to sample \(b\) is \[\begin{align} \tag{3.1} % \tag*{Fold change in proportion} \underbrace{\frac{\widehat{\text{prop}}_{i}(b)}{\widehat{\text{prop}}_{i}(a)}} _\text{measured FC} &= \frac {\text{prop}_{i}(b) \cdot \cancel{\text{efficiency}_{i}} / {\text{mean efficiency}(b)}} {\text{prop}_{i}(a) \cdot \cancel{\text{efficiency}_{i}} / {\text{mean efficiency}(a)}} \\[0.5ex] &= \underbrace{\frac{\text{prop}_{i}(b)}{\text{prop}_{i}(a)}}_\text{actual FC} \cdot \underbrace{\left[\frac{\text{mean efficiency}(b)}{\text{mean efficiency}(a)}\right]^{-1}}_\text{fold error} . \end{align}\] The sample-independent efficiency factor of the error cancels, but the sample-dependent mean efficiency does not, leaving an error equal to the inverse of the change in the mean efficiency from \(s\) to \(t\).
The bottom row of Figure 2.1 illustrates how variation in mean efficiency leads to error in the inferred fold changes between a pair of samples. The mean efficiency decreases by a factor of 2.6 (FC: 0.4X) from Sample 1 to Sample 2. Consequently, the FC of each species is measured to be 2.6X larger than the true value. Though the fold error for all species is the same, the implications depend on the actual FC and correspond to three distinct types of error: an increase in magnitude, a decrease in magnitude, and a change in direction (or sign). We can see each type of error in Figure 2.1. For Species 1, which increases and thus moves in the opposite direction of the mean efficiency, we see an increase in magnitude of the measured FC (actual FC: 2.3X, measured FC: 6.5X). For Species 2, which decreases and thus moves in the same direction as the mean efficiency but by a larger factor, we see an decrease in magnitude (actual FC: 0.15X, measured FC: 0.44X). For Species 2, which decreases by a smaller factor than the mean efficiency, we see a change in direction (actual FC: 0.6X, measured FC: 1.8X), such that the species actually appears to increase!
In contrast, because species ratios are distorted by a constant factor, their measured fold changes remain accurate. The fold error in Equation (2.7) completely cancels when we divide the ratio measured for one sample \(a\) by another sample \(b\).
If other error sources remain negligible (or constant across samples), than this dichotomy continues to apply to proportion- and ratio-based density measurements. The inferred FCs in proportion-based density measurements will be incorrect by a factor equal to the inverse fold change in the mean efficiency, creating magnitude and/or directional errors. Moreover, any species with a constant density will appear to vary inversely with the mean efficiency. In contrast, fold changes in ratio-based density measurements will remain accurate.
3.2 Regression analysis of many samples
DA analysis of many samples across host or environmental conditions can typically be framed as a regression problem, in which we analyze the relationship between a microbial response variable, such as log density of some focal species \(i\), and one or more covariates, such the pH or temperature of the sampled environment or whether the sample is from a healthy or sick person. A large fraction of DA analyses use elaborations on the simple linear regression model, which for a response of log species density can be written \[\begin{align} \tag{3.2} \log \text{density}_i(a) = \alpha + \beta x(a) + \varepsilon_i(a). \end{align}\] Here \(x\) is a continuous covariate (e.g. pH) or a binary covariate (e.g. \(x=1\) for treated patients and \(x=0\) for controls), \(\alpha\) and \(\beta\) are regression coefficients, and \(\varepsilon_i(a)\) is a mean-zero random variable that reflects the residual (unexplained) variation in the response (log density of species \(i\)). Our interest is usually in the coefficient \(\beta\) (slope or average difference between conditions) that describes how the species’ abundance changes with \(x\), while the intercept \(\alpha\) captures differences in the baseline abundance and—we hope—measurement efficiency among species. How does taxonomic bias under our measurement model affect estimates of \(\beta\) in the simple linear regression?
Consider the case where the response is log density that has been estimated using proportion-based density estimation (Equation (2.8)) with error-free estimates of the total density. If the true log density follows the regression Equation (3.2), then it follows from Equation (2.9) the estimated log density equals \[\begin{align} \tag{3.3} \log \widehat{\text{density}}_i(a) % &= \alpha + \beta x(a) + \varepsilon_i(a) + \log \text{efficiency}_i - \log \text{mean efficiency}(a) = [\alpha + \log \text{efficiency}_i] + [\beta - \log \text{mean efficiency}(a)] x(a) + \varepsilon_i(a). \end{align}\] This equation shows that the species-specific portion of the error affects the intercept term while the sample-specific portion (log mean efficiency) affects the slope term. Thus, as in the case of fold changes between two samples (Equation (3.1)), it is the variation in the (log) mean efficiency across samples that we must worry about distorting our DA results.
The variation in measurement error created by the log mean efficiency impacts the point estimate and precision of the regression coefficient \(\beta\). Box 3.3 mathematically describes this effect for estimation via Ordinary Least Squares (OLS) or Maximum Likelihood Estimation (MLE), which Figure 3.1 illustrates using simulated data. Variation in the log mean efficiency that is associated with the covariate \(x\) creates a systematic error in the estimated slope \(\hat \beta\) equal to the negative of the (scaled) covariance of log mean efficiency with \(x\). The absolute error is the same for all species; however, its relative value depends on the magnitude of the covariance of the log mean efficiency with \(x\) relative to that of the response (here, \(\log \text{density}_{i}\)) with \(x\) or, equivalently, the relative magnitudes of their slopes. As in the case of fold changes between pairs of samples, the net effect can be decreases in magnitude (Species 9, 10, and 1 in Figure 3.1), changes in sign (Species 5), or increases in magnitude (remaining species) depending on these relative values. Variation in the log mean efficiency that is uncorrelated with \(x\) does not systematically distort \(\hat \beta\) but does affect its precision, typically leading to increased standard errors as the variation in log mean efficiency effectively acts as an additional source of noise in measured abundance (Figure 3.1 D). The exception is for species whose residual variation is strongly positively correlated with that of log mean efficiency (here, Species 9), which can appear to have less random variation and receive standard errors that are too small. Decreased magnitudes and increased standard errors can both cause associations to be missed that would otherwise have been detected (Species 10 and 1), while increased magnitudes can turn weak or statistically insignificant associations into strong and statistically significant ones (Species 7, 6 and 4).
With this understanding in place, we briefly summarize how taxonomic bias affects estimation of the simple linear model in other abundance types. The results for proportion-based density estimates with accurate total densities also apply to LFC analysis of proportions. Similar results apply to microbiome regression tools (such as corncob; Martin, Witten, and Willis (2020)) that perform regression on the logit (instead of log) proportion of a species; however, the mean efficiency of the entire sample must instead be replaced with the mean efficiency among all species excluding the focal species, which causes the absolute error in regression coefficients to vary somewhat across species. Because ratios and ratio-based densities are subject to consistent multiplicative error, in analysis of log ratios and the log densities derived from them, only the estimated intercept \(\hat \alpha\) is affected by taxonomic bias, while the point estimate and standard error of the estimated slope \(\hat \beta\) remain unaffected.
Figure 3.1: Taxonomic bias distorts multi-sample differential abundance inference when the mean efficiency of samples is associated with the covariate of interest. This figure shows the results of a regression analysis of simulated microbiomes consisting of 50 samples and 10 species from two environmental conditions indexed by \(x=0\) and \(x=1\). In this simulation, the species with the largest efficiency (Species 9) also has the largest positive LFC, which drives the positive association of the log mean efficiency with the condition (shown in Panels A and B). This positive LFC in the log mean efficiency induces a systematic negative shift in the estimated LFCs of all species (Panels C and D). Panel D shows the mean LFC (points) with 95% confidence intervals (CIs), for each species estimated from either the actual or the measured densities. The error (difference in LFC estimates on measured and actual) equals the negative LFC of the mean efficiency (shown in Panel B).
3.3 Box: Error in estimated regression coefficients and standard errors
Results from statistics on regression under measurement error can help us understand the effect of bias on DA regression analysis.
Consider Equation (3.2) for the simple linear regression of log density of a species \(i\) on covariate \(x\). Let \(y\) stand for the actual log density of a focal species \(i\), \(z\) stand for the log density that we’ve measured, and \(d = z - y\) equal the difference between the two, which here is the log efficiency of species \(i\) minus the log mean efficiency of the sample. Let \(s_{xy}\) denote the sample covariance between variables \(x\) and \(y\), \(s^{2}_{x} = s_{xx}\) and \(s_{x}\) denote the sample variance and standard deviation, and \(r_{xy} = s_{xy}/(s_{x}s_{y})\) denote the sample correlation.
The ordinary least squares (OLS) and maximum likelihood (MLE) estimates of the slope of \(z\) equals the sample covariance of \(z\) and \(x\) divided by the sample variance in \(x\) or, equivalently, the sample correlation of \(z\) and \(x\) multiplied by the ratio of their sample standard deviations, \[\begin{align} \hat \beta_z = \frac{s_{zx}}{s^2_x} = r_{zx} \cdot \frac{s_y}{s_x}. \end{align}\] From the (bi)linearity of sample covariances it follows that \[\begin{align} \hat \beta_z = \frac{s_{yx}}{s^2_x} + \frac{s_{dx}}{s^2_x} = \frac{r_{yx} s_y}{s_x} + \frac{r_{dx} s_d}{s_x} = \hat \beta_y + \hat \beta_d, \end{align}\] where \(\hat \beta_y\) and \(\hat \beta_d\) denote the slope estimates for \(y\) and \(d\) (were these values to be known). The absolute error in the estimate \(\hat \beta_{z}\) of \(\hat \beta_{y}\) is therefore \(\hat \beta_{d}\); it is large in a practical sense when \(\hat \beta_{d}\) is large (in absolute value) compared to \(\hat \beta_{y}\), which corresponds to the covariance of \(d\) with \(x\) being large compared to the covariance of \(y\) with \(x\).
In our case, the covariance of \(d\) equals the negative of the covariance of log mean efficiency with \(x\). The absolute error (i.e., the bias in a statistical sense) in \(\hat \beta\) is equals the negative covariance of log mean efficiency scaled by the variance in \(x\). This absolute error is the same for all species; however, its practical significance varies depending on its magnitude relative to that of the slope of the actual log densities. For species that covary with \(x\) more strongly than the log mean efficiency, the error will be relatively small. This situation might occur either because the mean efficiency varies relatively little across samples or because its variation is relatively less correlated with \(x\) compared to the log density of the focal species.
We can similarly understand the impact of measurement error on the precision of our slope estimates. The OLS and MLE estimated standard error in \(\hat \beta\) are both approximately \[\begin{align} \hat{\text{se}}(\hat \beta) \approx \frac{s_{\hat \varepsilon}}{s_x \sqrt{n}}, \end{align}\] where \(s_{\hat \varepsilon}\) is the sample standard deviation of the residuals (Wasserman (2004) Chapter 13). The sample residuals of \(z\), \(y\), and \(d\) have a similar relationship to the regression coefficient estimates, \[\begin{align} \hat \varepsilon_z &\equiv z - \hat \beta_z x \\&= (y + d) - (\hat \beta_y + \hat \beta_d) x \\&= (y - \hat \beta_y x) + (d - \hat \beta_d x) \\&= \hat \varepsilon_y + \hat \varepsilon_d. \end{align}\] (Note, here I’ve omitted the subscript indicating the dependence on the sample.) It follows that the sample variances of the residuals of \(z\), \(y\), and \(d\) are related through \[\begin{align} s^2_{\hat \varepsilon_{z}} = s^2_{\hat \varepsilon_{y} + \hat \varepsilon_{d}} = s^2_{\hat \varepsilon_{y}} + s^2_{\hat \varepsilon_{d}} + 2 s_{\hat \varepsilon_{y} \hat \varepsilon_{d}}. \end{align}\] The standard deviation of the \(z\) residuals is increased above that of the \(y\) residuals when the \(d\) residuals are either uncorrelated or positively correlated with the \(y\) residuals, but may be decreased when the \(y\) and \(d\) residuals are negatively correlated.
In our case, the \(d\) residuals equal the negative residuals of the log mean efficiency. It is plausible that for most species, their residual variation will have a small covariance with log mean efficiency and the net effect of variation in the mean efficiency will be to increase the estimated standard errors, as occurs with most species in Figure 3.1. However, high-efficiency species that vary substantially in proportion across samples may be strongly positively correlated with log mean efficiency such that the estimated standard errors decrease, as we see with Species 9 in Figure 3.1.