3 How bias affects DA results

We now turn to how the errors in MGS-based abundance measurements affect DA results. Our focus is on multiplicative DA—the fold differences (FD) in the relative or absolute abundance of species between samples or between experimental conditions. Multiplicative DA has a direct ecological interpretation via the multiplicative processes of exponential growth and decay and forms the basis for several popular DA methods (REF DESeq2, ALDEx2(?), and corncob; a LV paper; fido). We also consider non-parametric rank-based methods; these methods are popular in case-control studies for which ecological interpretability is often of secondary importance to discovering stable associations between species and the condition of interest (REFs).

3.1 Fold change between a pair of samples

The building blocks of a multiplicative DA analysis are the measured FDs in a species’ abundance between pairs of individual samples. If the fold error (FE) in a species’ abundance measurement is constant across samples, then it will not affect the measured FDs between samples. In Section 2, we showed that consistent taxonomic bias creates a constant FE in the ratio between two species $i$ and $j$, equal to the ratio in their efficiencies (Equation (2.9)). This error completely cancels when calculating the FD of this ratio from sample $a$ to $b$, \[\begin{align} \tag{3.1} \underbrace{\frac{\tilde R_{i/j}^{(b)}}{\tilde R_{i/j}^{(a)}}}_\text{measured FD} = \frac {R_{i/j}^{(b)} \cdot \cancel{B_i / B_j}} {R_{i/j}^{(a)} \cdot \cancel{B_i / B_j}} = \underbrace{\frac{R_{i/j}^{(b)}}{R_{i/j}^{(a)}}}_\text{actual FD} . \end{align}\] In contrast, bias created a FE in the proportion of a species (Equation (2.7)) that varied inversely with the mean efficiency of the sample. Therefore, this error does not cancel when calculating the FD in the proportion, \[\begin{align} \tag{3.2} \underbrace{\frac{\tilde P_i^{(b)}}{\tilde P_i^{(a)}}}_\text{measured FD} = \frac {P_i(b) \cdot \cancel{B_i} / \bar B^{(b)}} {P_i(a) \cdot \cancel{B_i} / \bar B^{(a)}} = \underbrace{\frac{P_i^{(b)}}{P_i^{(a)}}}_\text{actual FD} \cdot \underbrace{\left[\frac{\bar B^{(b)}}{\bar B^{(a)}}\right]^{-1}}_\text{FE} . \end{align}\] The varying mean efficiency does not cancel, leaving an FE in the FD of the species’ proportion that is equal to the inverse FD in the mean efficiency. Notably, this FE is the same for every species.

The different effects of taxonomic bias on the measured FDs in proportions versus ratios are illustrated in a hypothetical example in Figure 2.1 (bottom row). In this example, the mean efficiency decreases 2.6-fold from Sample 1 to Sample 2 (FD of 0.4), causing the FD in the proportion of each species to appear 2.6-fold larger than its true value. Though the FE in the FD is the same for each species, the biological implications of the error vary. In particular, there are three distinct types of error: an increase in the magnitude of the measured FD, a decrease in that magnitude, and a change in the direction of the FD. We refer to the latter as a sign error in reference to the sign of the measured log fold difference (LFD). We can see each type of error in Figure 2.1. For Species 1, which increases in abundance and thus moves in the opposite direction of the mean efficiency, we see an increase in magnitude of the measured FD (actual FD: 2.3, measured FD: 6.5). For Species 2, which decreases and thus moves in the same direction as the mean efficiency but by a larger factor, we see a decrease in magnitude (actual FD: 0.15, measured FD: 0.44). For Species 2, which decreases by a smaller factor than the mean efficiency, we see a change in direction (actual FD: 0.6, measured FD: 1.8), such that the species actually appears to increase (a sign error). In contrast, the ratios between species change by the same factors in the actual and measured taxonomic profiles. For example, the ratio of Species 2 to Species 3 changes from $1/1$ to $1/4$ in the actual profiles and from $3$ to $3/4$ in the measured profiles, a decrease by 4-fold either way.

Taxonomic bias affects the measured FDs in a species’ absolute abundance in a similar fashion. First, consider a species’ abundance measurement $\tilde A_i^{(a)}$ derived from a non-MGS measurement of total-community abundance using Equation (2.11). Equation (2.13) indicates that the FE in this measurement equals the (constant) species efficiency multiplied by ${B^{{\text{[tot]}}(a)}}/{\bar B^{(a)}}$, the ratio of the mean efficiency of the total-abundance measurement to that of the MGS measurement. This ratio can vary, creating error in the measured FD between samples. Notably, if the FDs in the mean efficiency of the total-abundance measurement mirrors that of the MGS measurement, then the two offset each other and lead to more stable FEs (and hence more accurate FDs) in $\tilde A_i^{(a)}$. We discuss how this possibility might be exploited in real experimental workflows in Section 5.

Now consider a species’ abundance derived from a reference species using Equation (2.14). Equation (2.15) indicates that the FE equals a constant ratio in species’ efficiencies multiplied by the FE in the measurement of the reference species’ abundance. If the abundance of the reference species can determined up to constant FE, then the FE in $\tilde A_i^{(a)}$ will also be constant, leading to accurate FDs.

3.2 Regression analysis of many samples

Most DA analyses can be understood as a regression analysis. For example, the simple linear regression model, \[\begin{align} \tag{3.3} y^{(a)} = \alpha + \beta x^{(a)} + \varepsilon^{(a)}, \end{align}\] can be used to relate a microbial abundance variable $y$ to a covariate $x$ and the residual (or unexplained) error $\varepsilon$. For example, $y$ may be the log absolute abundance of species $i$ ($\log A_i^{(a)}$) and $x$ may be a continuous variable (such as pH) or a binary variable (such as $x=1$ for treated patients and $x=0$ for untreated controls). The regression coefficients $\alpha$ and $\beta$ are parameters to be estimated from the data. A DA analysis is primarily interested in the slope coefficient $\beta$, which determines how the average or expected value of $y$ increases with $x$. For a binary covariate, $\beta$ equals the average difference in $y$ between samples with $x=1$ versus those with $x=0$. Taking the response $y$ to be the logarithm of a species’ relative or absolute abundance makes the DA analysis multiplicative, with the coefficient $\beta$ corresponding to the average log fold difference (LFD) in abundance between conditions or for a unit increase in $x$.

Appendix B derives mathematical results for the effect that measurement error in the response caused by taxonomic bias has on LFDs estimated via simple linear regression. Here we illustrate these results with an example. Suppose that we wish to estimate the average LFD in the absolute abundance of every species between samples from two conditions, such as treated and untreated patients. Further suppose that we have measured absolute abundances by multiplying the proportions from MGS with an accurate measurement of total-community abundance. For each species $i$, we set $y^{(a)}$ in Equation (3.3) equal to $\log \tilde A_{i}^{(a)}$, and use either Ordinary Least Squares (OLS) or Maximum Likelihood Estimation (MLE) to estimate the coefficients $\alpha$ and $\beta$. From Equation (2.13), we know that the FE in the abundance of each species equals its efficiency relative to the mean efficiency of the sample.

To understand how the FE in species abundances affects the estimated LFD for a given species, it is useful to consider three possible scenarios for how the mean efficiency varies across samples. In the first scenario, the mean efficiency remains stable across samples, such that the FE in $\log \tilde A_{i}^{(a)}$ is approximately constant. In this case, the FE induced by bias mostly cancels and does not substantially affect the estimated LFD. In the second scenario, the mean efficiency varies substantially but is not associated with the condition, creating random FEs in the individual sample measurements. In this case, the LFDs between individual samples will be highly distorted, but these errors will effectively average out in the regression and the estimated LFD remains reliable. Although the random FEs caused by bias do not create a systematic error in the LFD estimate, they will tend to reduce the precision (i.e., increase the standard error) of the estimate. In the third scenario, the mean efficiency systematically varies with the condition, causing a systematic error in the estimated LFD regardless of the number of samples. In particular, the portion of additive variation in the log mean efficiency that is linearly associated with $x$ creates an equal but opposite additive error in the estimated LFD for each species. Although the absolute additive error in the LFDs is the same for all species, its biological significance varies, depending on its magnitude relative to that of the species’ true LFD.

Figure 3.1 shows a simulated example in which the mean efficiency varies systematically with the condition of interest. Here, the average log mean efficiency is higher in the condition corresponding to $x=1$, due to an increase of the high-efficiency Species 9. This systematic increase in the log mean efficiency causes an artificial decrease in the estimated LFD for each species by an equal amount. As we saw for the FD between individual samples, the biological significance of the systematic error in the LFD estimates varies by species, despite the absolute error being constant across species. We see decreases in the magnitude of the LFD estimate (Species 9, 10, and 1 in Figure 3.1), changes in sign (Species 5), or increases in magnitude (remaining species) depending the value of the actual LFD relative to the error. The portion of the variation in the log mean efficiency that is uncorrelated with $x$ increases the standard errors for most species (Figure 3.1 D). The exception is Species 9, which is the primary driver of the mean efficiency dynamics in this example. Decreased magnitudes and increased standard errors can both cause associations to be missed that would otherwise have been detected (Species 10 and 1), while increased magnitudes can turn weak or statistically insignificant associations into strong and statistically significant ones (Species 7, 6 and 4).

$Taxonomic bias distorts multi-sample differential abundance inference when the mean efficiency of samples is associated with the covariate of interest. This figure shows the results of a regression analysis of simulated microbiomes consisting of 50 samples and 10 species from two environmental conditions indexed by $x=0$ and $x=1$. In this simulation, the species with the largest efficiency (Species 9) also has the largest positive LFC, which drives the positive association of the log mean efficiency with the condition (shown in Panels A and B). This positive LFC in the log mean efficiency induces a systematic negative shift in the estimated LFCs of all species (Panels C and D). Panel D shows the mean LFC (points) with 95% confidence intervals (CIs), for each species estimated from either the actual or the measured densities. The error (difference in LFC estimates on measured and actual) equals the negative LFC of the mean efficiency (shown in Panel B).$

Figure 3.1: Taxonomic bias distorts multi-sample differential abundance inference when the mean efficiency of samples is associated with the covariate of interest. This figure shows the results of a regression analysis of simulated microbiomes consisting of 50 samples and 10 species from two environmental conditions indexed by $x=0$ and $x=1$. In this simulation, the species with the largest efficiency (Species 9) also has the largest positive LFC, which drives the positive association of the log mean efficiency with the condition (shown in Panels A and B). This positive LFC in the log mean efficiency induces a systematic negative shift in the estimated LFCs of all species (Panels C and D). Panel D shows the mean LFC (points) with 95% confidence intervals (CIs), for each species estimated from either the actual or the measured densities. The error (difference in LFC estimates on measured and actual) equals the negative LFC of the mean efficiency (shown in Panel B).

3.3 Rank-based analyses

Rank-based or ‘nonparametric’ DA methods work by first applying a cross-sample rank-transformation to the abundance variable $y$: The sample in which $y$ has the smallest value receives a rank of 1, the sample with the second smallest value receives a rank of 2, etc. These methods then analyze how the average rank varies across conditions. Rank-based methods commonly used in microbiome DA analysis include the Wilcoxon-Mann-Whitney and Kruskal-Wallis tests for a discrete covariate (e.g. case vs control) and Spearman’s rank correlation for a continuous covariate (e.g. pH). They have been applied to species proportions (REFs), ratios (REFs), and absolute abundances (REFs).

Multiplying a variable by a constant factor does not change its cross-sample ranks; thus rank-based DA methods are unaffected by constant FE in the abundance variable $y$. We therefore conclude that consistent taxonomic bias does not affect rank-based DA analyses of species ratios or absolute abundances derived from a reference species. In contrast, variable FE in $y$ (driven by variation in the mean efficiency) can affect the cross-sample ranks; if the FE is associated with the covariate $x$, it can cause systematic error in the DA result. Thus consistent bias may impact rank-based analyses of species proportions and absolute abundances derived from a total-community abundance measurement. Future work should consider whether rank-based DA methods are significantly more robust than multiplicative DA methods to variation in the mean efficiency.