3 DA methods vary in their robustness to bias

This section turns to how these errors in individual sample measurements described in the previous section affect the results of cross-sample comparisons and DA analyses of many samples. Microbiome researchers employ many measures for the change in species’ abundance across samples. We focus on analyses of multiplicative or (log) fold changes, which are common across study types and have more direct ecological interpretations than other measures (via the processes of exponential growth and decay). In addition, we briefly consider non-parametric rank-based analyses that are common in microbiome-wide association studies.

3.1 Fold changes between a pair of samples

The building blocks of a multiplicative DA analysis are the fold changes (FCs) in abundance between individual pairs of samples. Intuitively, only abundance measurements that have proportional errors will have FCs that are completely robust to bias.

The phenomenon by which non-proportional errors due to bias distort measured FCs can be seen most simply for species proportions. It follows from Equation (2.7) that the measured FC in the proportion of a species \(i\) from samples \(a\) to \(b\) is \[\begin{align} \tag{3.1} \underbrace{\frac{\widehat{\text{prop}}_{i}(b)}{\widehat{\text{prop}}_{i}(a)}} _\text{measured FC} &= \frac {\text{prop}_{i}(b) \cdot \cancel{\text{efficiency}_{i}} / {\text{efficiency}_S(b)}} {\text{prop}_{i}(a) \cdot \cancel{\text{efficiency}_{i}} / {\text{efficiency}_S(a)}} \\[0.5ex] &= \underbrace{\frac{\text{prop}_{i}(b)}{\text{prop}_{i}(a)}}_\text{actual FC} \cdot \underbrace{\left[\frac{\text{efficiency}_S(b)}{\text{efficiency}_S(a)}\right]^{-1}}_\text{fold error} . \end{align}\] The sample-independent efficiency factor cancels, but the sample-dependent mean efficiency does not, leaving an error in the measured FC equal to the inverse change in mean efficiency. In contrast, the ratio between species \(i\) and \(j\) has proportional error and so its FC is unaffected by bias. From Equation (2.9), \[\begin{align} \tag{3.2} \underbrace{\frac{\widehat{\text{ratio}}_{i/j}(b)}{\widehat{\text{ratio}}_{i/j}(a)}} _\text{measured FC} &= \frac {\text{ratio}_{i/j}(b) \cdot \cancel{\text{efficiency}_{i} / {\text{efficiency}_j}}} {\text{ratio}_{i/j}(a) \cdot \cancel{\text{efficiency}_{i} / {\text{efficiency}_j}}} \\[0.5ex] &= \underbrace{\frac{\text{ratio}_{i/j}(b)}{\text{ratio}_{i/j}(a)}}_\text{actual FC} ; \end{align}\] the constant error, equal to the ratio between the species’ efficiencies, completely cancels, leaving an accurately measured FC.

Figure 2.1 (bottom row) illustrates these two different behaviors for the error in the FCs of proportions versus ratios when the mean efficiency varies between samples. Here the mean efficiency decreases by a factor of 2.6 (FC of 0.4X) from Sample 1 to Sample 2, which causes the FC of the proportion of each species to be measured as 2.6X larger than its true value. Though the fold error for all species is the same, the implications depend on the actual FC and correspond to three distinct types of error: an increase in magnitude, a decrease in magnitude, and a change in direction; we refer to the latter as a sign error in reference to the corresponding log fold change (LFC). We can see each type of error in Figure 2.1. For Species 1, which increases and thus moves in the opposite direction of the mean efficiency, we see an increase in magnitude of the measured FC (actual FC: 2.3X, measured FC: 6.5X). For Species 2, which decreases and thus moves in the same direction as the mean efficiency but by a larger factor, we see an decrease in magnitude (actual FC: 0.15X, measured FC: 0.44X). For Species 2, which decreases by a smaller factor than the mean efficiency, we see a change in direction (actual FC: 0.6X, measured FC: 1.8X), such that the species actually appears to increase (a sign error). In contrast, the fold error in Equation (2.9) completely cancels when we divide the ratio measured for one sample \(a\) by another sample \(b\).

Absolute-abundance measurements for which bias creates non-proportional errors are subject to the same limitation as proportions. In particular, abundance measured by normalization of MGS proportions to a total-abundance measurement will yield errors in FCs given by the inverse change in the mean efficiency, unless this change is offset by error in the total-abundance measurement itself. On the other hand, abundances measured by normalization to one or more reference species are capable of having proportional errors that cancel in FC calculations.

3.2 Regression analysis of many samples

DA analysis of multiple samples can be framed as a regression problem in which we analyze the relationship between a microbial response variable, such as the log abundance of species \(i\), with one or more covariates, such as the pH of the sampled environment or whether the sample is from a healthy or sick person. The simplest regression analysis uses the simple linear regression model, \[\begin{align} \tag{3.3} \log \widehat{\text{abun}}_i(a) = \alpha_i + \beta_i x(a) + \varepsilon_i(a), \end{align}\] where \(x\) is a continuous covariate (e.g. pH) or a binary covariate (e.g. \(x=1\) for treated patients and \(x=0\) for controls), \(\alpha_i\) and \(\beta_i\) are regression coefficients, and \(\varepsilon_i(a)\) is a mean-zero random variable that reflects the residual (unexplained) variation in the response. In a DA analysis, we are interested in the slope coefficient \(\beta\), which describes how the species’ abundance changes with \(x\). We are generally uninterested in the intercept coefficient, \(\alpha\), which captures differences in the baseline abundance among species and—it is usually hoped—study- and species-specific systematic error such as that created by taxonomic bias.

How does taxonomic bias in fact impact estimates of the slope coefficient, \(\beta\)? Appendix C derives the effect of bias on the estimated coefficients for the simple linear regression model under estimation via Ordinary Least Squares (OLS) or Maximum Likelihood Estimation (MLE). The result has a simple interpretation that provides intuition for what to expect in more complicated generalized linear regression models used in popular DA methods. Consider the regression (3.3) fit to the measured log abundance of species \(i\) across all samples. The measurement error in the response variable, \[\begin{align} \text{error}_{i}(a) \equiv \log \widehat{\text{abun}}_i(a) - \log \text{abun}_i(a) = \log \frac{\widehat{\text{abun}}_i(a)}{\text{abun}_i(a)}, \end{align}\] can be thought of as having three components:

  • Error that is constant across samples affects the intercept coefficient, \(\alpha_i\).
  • Error that varies systematically with \(x\) affects the slope coefficient, \(\beta_i\).
  • Error that varies and is uncorrelated with \(x\) affects the residual variation, \(\varepsilon_{i}(s)\).

We saw in Section 2 that the error from taxonomic bias has a species-specific, sample-independent component that is constant across samples; this component causes a species-specific systematic error in the estimated intercept, \(\alpha_{i}\). The error also has a sample-specific, species-independent component. The portion that is correlated with the covariate \(x\) creates a systematic error in the slope coefficient \(\beta_{i}\) that is the same for every species and opposite in sign to how the error varies with \(x\); the portion that is uncorrelated with \(x\) affects the precision (standard errors) of the slope estimates in a species-specific manor.

Figure 3.1 illustrates with a simulated data for the case of log abundance measured using the total-abundance normalization method (Equation (2.11)), assuming that total community abundance is measured perfectly accurately. In this case, the variable component of the error is due entirely to variation in the log mean efficiency of the MGS measurement. Variation in the log mean efficiency that is associated with the covariate \(x\) creates a systematic error in the estimated slope \(\hat \beta\) equal to the negative of the (scaled) covariance of log mean efficiency with \(x\). The absolute error is the same for all species; however, its relative value depends on the magnitude of the covariance of the log mean efficiency with \(x\) relative to that of the response (here, \(\log \text{abun}_{i}\)) with \(x\) or, equivalently, the relative magnitudes of their slopes. As in the case of fold changes between pairs of samples, the net effect can be decreases in magnitude (Species 9, 10, and 1 in Figure 3.1), changes in sign (Species 5), or increases in magnitude (remaining species) depending on these relative values. Variation in the log mean efficiency that is uncorrelated with \(x\) does not systematically distort \(\hat \beta\) but does affect its precision, typically leading to increased standard errors as the variation in log mean efficiency effectively acts as an additional source of noise in measured abundance (Figure 3.1 D). The exception is for species whose residual variation is strongly positively correlated with that of log mean efficiency (here, Species 9), which can appear to have less random variation and receive standard errors that are too small. Decreased magnitudes and increased standard errors can both cause associations to be missed that would otherwise have been detected (Species 10 and 1), while increased magnitudes can turn weak or statistically insignificant associations into strong and statistically significant ones (Species 7, 6 and 4).

Overall, we find that variation in measurement error created by variation in the mean efficiency of the MGS measurement can negatively affect certain DA analyses by creating systematic errors in the slope coefficient estimates and/or reducing their precision. Similar effects occur for regressions of log proportions and of log abundances derived by total-abundance normalization when the total abundance is accurately measured; however, as noted in Section 2, total-abundance measurements that are taxonomically biased in a complementary manner can mitigate these effects. Regression analyses of log-ratios and log abundances derived from reference-species normalization remain unaffected—the errors are constant across samples and so only impact the intercept coefficients.

Taxonomic bias distorts multi-sample differential abundance inference when the mean efficiency of samples is associated with the covariate of interest. This figure shows the results of a regression analysis of simulated microbiomes consisting of 50 samples and 10 species from two environmental conditions indexed by \(x=0\) and \(x=1\). In this simulation, the species with the largest efficiency (Species 9) also has the largest positive LFC, which drives the positive association of the log mean efficiency with the condition (shown in Panels A and B). This positive LFC in the log mean efficiency induces a systematic negative shift in the estimated LFCs of all species (Panels C and D). Panel D shows the mean LFC (points) with 95% confidence intervals (CIs), for each species estimated from either the actual or the measured densities. The error (difference in LFC estimates on measured and actual) equals the negative LFC of the mean efficiency (shown in Panel B).

Figure 3.1: Taxonomic bias distorts multi-sample differential abundance inference when the mean efficiency of samples is associated with the covariate of interest. This figure shows the results of a regression analysis of simulated microbiomes consisting of 50 samples and 10 species from two environmental conditions indexed by \(x=0\) and \(x=1\). In this simulation, the species with the largest efficiency (Species 9) also has the largest positive LFC, which drives the positive association of the log mean efficiency with the condition (shown in Panels A and B). This positive LFC in the log mean efficiency induces a systematic negative shift in the estimated LFCs of all species (Panels C and D). Panel D shows the mean LFC (points) with 95% confidence intervals (CIs), for each species estimated from either the actual or the measured densities. The error (difference in LFC estimates on measured and actual) equals the negative LFC of the mean efficiency (shown in Panel B).

3.2.1 [TODO] Rank-based analyses