2 How taxonomic bias affects abundance measurements
To understand how bias affects the measured differential abundance between samples, we first consider how it affects measurements of individual samples.
2.1 Model of MGS measurement
Our primary tool for understanding the impact of taxonomic bias on MGS measurement is the theoretical model of MGS measurement developed and empirically validated by McLaren, Willis, and Callahan (2019). This model is the simplest model of MGS measurement that includes multiplicative taxonomic bias while respecting the compositional nature of sequencing measurements, in which the total read count for a sample is unrelated to its total cell number or density (Gloor et al. (2017)). We consider a set of microbial species that are measured by a specific MGS protocol that extracts, sequences, and bioinformatically analyzes samples to assign reads to various microbial species. We assume that assigned reads have been correctly assigned to the given species and sample. Thus we ignore the possibility of contamination or false-positive taxonomic assignment, though we allow that many reads may not receive a taxonomic assignment and so ultimately be discarded. Unless otherwise stated, we treat the sequencing measurement as deterministic, ignoring the ‘noise’ or random error due to the random sampling of sequencing reads and other aspects of the MGS process.
Our model stipulates that the assigned read count of species \(i\) in sample \(a\) equals its cell density multiplied by a species-specific factor and a sample-specific factor, \[\begin{align} \tag{2.1} \text{reads}_{i}(a) = \text{density}_{i}(a) \quad \cdot \underbrace{\text{efficiency}_{i}}_{\substack{\text{species specific,} \\ \text{sample independent}}} \cdot \quad \underbrace{\text{sequencing effort}(a)}_{\substack{\text{species independent,} \\ \text{sample specific}}}. \end{align}\] The species-specific factor is the relative measurement efficiency (or efficiency for short) of the species, which represents how much more easily that species is measured (converted from cells to assigned reads) relative to an abitrarily chosen reference species (McLaren, Willis, and Callahan (2019)). The sample-specific factor, which we call the sequencing effort of the sample, reflects the fact that the number of reads per unit cell density varies among samples even in the absence of taxonomic bias due to variation in total density, library normalization, and total sequencing-run output. Equation (2.1) implies that the total reads from all species in the sample equal \[\begin{align} \tag{2.2} \text{total reads}(a) = \text{total density}(a) \cdot \text{mean efficiency}(a) \cdot \text{sequencing effort}(a), \end{align}\] where \[\begin{align} \tag{2.3} \text{mean efficiency}(a) \equiv \frac{\sum_{j}\text{density}_j(a)\cdot \text{efficiency}_j}{\text{total density}(a)} \end{align}\] is the average or mean efficiency of cells in the sample.
2.2 Relative abundances (proportions and ratios)
We distinguish between two types of species-level relative abundances. The first type consists of the ratios among the abundances of two or more species, such as a ratio of 5:4:2 among three species. The second type consists of the proportion of an individual species, equal to its abundance divided by that of a particular root taxon (e.g. Prokaryotes). DA analyses of proportions consider how the proportion of a species changes indepently of other species, whereas DA analyses of ratios consider the conserted change in relative abundance among two or more species. Although proportion-based analysis is dominant, microbiome researchers increasingly use ratio-based analyses adapted from the statistical framework of Compositional Data Analysis (CoDA). The different manner in which taxonomic bias affects ratios vs. proportions (McLaren, Willis, and Callahan (2019)) forms the basis in subsequent sections for understanding the extent to which its effects cancel in DA analysis.
The standard measurement of the proportion of species \(i\) in sample \(a\) is given by the ratio of its read count to the total, \[\begin{align} \tag{2.4} \widehat{\text{prop}}_{i}(a) = \frac{\text{reads}_i(a)}{\text{total reads}(a)}. \end{align}\] From Equations (2.1), (2.2), and (2.4), it follows that \[\begin{align} \tag{2.5} \widehat{\text{prop}}_{i}(a) &= \text{prop}_{i}(a) \cdot \frac{\text{efficiency}_{i}}{\text{mean efficiency}(a)}, \end{align}\] where \(\text{prop}_{i}(a) = \text{density}_{i}(a) / \text{total density}(a)\) is the actual proportion. Equation (2.5) says that taxonomic bias creates a multiplicative error equal to the ratio of the species’ efficiency to the mean efficiency in the sample. Consequently, the same species can be over- or under-measured depending on the composition of the given sample (Figure 2.1). For instance, in samples from two hypothetical communities in Figure 2.1, Species 3 has an efficiency of 6 and is under-measured in Sample 1 (which has a mean efficiency of 8.33) but over-measured in Sample 2 (which has a mean efficiency of 3.15).
The measured ratio between species \(i\) and \(j\) is given by the ratio of their read counts, \[\begin{align} \tag{2.6} \widehat{\text{ratio}}_{i/j}(a) = \frac{\text{reads}_i(a)}{\text{reads}_j(a)}. \end{align}\] From Equations (2.1) and (2.6), it follows that \[\begin{align} \tag{2.7} \widehat{\text{ratio}}_{i/j}(a) % \frac{\text{reads}_{i}(a)}{\text{reads}_{j}(a)} &= \frac{\text{density}_{i}(a)}{\text{density}_{j}(a)} \cdot \frac{\text{efficiency}_{i}}{\text{efficiency}_{j}}. \end{align}\] The fold error in the ratio between two species simply equals the ratio in their efficiencies and is therefore constant across samples. For instance, in the example of Figure 2.1, the ratio of Species 3 (with an efficiency of 6) to Species 1 (with an efficiency of 1) is over-estimated by a factor of 6 in both communities despite their varying compositions.
Figure 2.1: Taxonomic bias creates sample-dependent multiplicative errors in species proportions, which can lead to inaccurate fold changes between samples. Top row: Error in proportions measured by MGS in two microbiome samples that contain different relative abundances of three species. Bottom row: Error in the measured fold-change in the third species that is derived from these measurements.
Difference between species proportions and species ratios: Consistent taxonomic bias at the species level creates a consistent fold error in the ratios among species, but a varying fold error in the proportions of individual species. But a proportion is, after all, just a ratio between two taxa in which the denominator taxon is the entire taxonomic domain under study. The difference in behavior from that of ratios between species arises because the efficiency of higher-level taxa, which consist of the aggregate of many species, varies across samples with the changing relative abundances among its constituent species.
2.3 Absolute abundances (cell densities)
A wide range of experimental methods exist for converting the relative-abundance information in MGS measurements into measurements of (absolute) cell densities; however, these methods can be generally partitioned into two types, based on whether the absolute-abundance information derives from knowing the density of the total community or knowing the density of one or more individual species.
Methods using the density of the total community: The total microbial density of a community can be assayed by a variety of methods (reviewed in Appendix B). Once the total density is known, the density of individual species can be estimated by multiplying the total density by the species’ proportions (as measured by MGS), \[\begin{align} \tag{2.8} \widehat{\text{density}}_{i}(a) &= \widehat{\text{prop}}_{i}(a) \cdot \widehat{\text{total density}}(a) \\&= \text{reads}_{i}(a) \cdot \frac{\widehat{\text{total density}}(a)}{\text{total reads}(a)}. \end{align}\] We can rewrite the species’ density measurements to account for the effect of taxonomic bias on the MGS proportions (Equation (2.5)) and any error in the total density measurement, \[\begin{align} \tag{2.9} \widehat{\text{density}}_{i}(a) = \text{density}_{i}(a) \cdot \frac{\text{efficiency}_{i}}{\text{mean efficiency}(a)} % \cdot \frac{\widehat{\text{total density}}(a)}{\text{total density}(a)}. \cdot \begin{array}{c} \text{fold error in} \\ \widehat{\text{total density}}(a) \end{array}. \end{align}\] The variable fold error that bias creates in the measured proportions directly transfers to the density measurements.
Taxonomic bias can also create systematic error in the total-density measurements (Appendix A.2). For now, we suppose that the total density can be measured accurately; later, Section 5 describes how experiments might be designed such that taxonomic bias in the total density measurement offsets that in the MGS proportions, leading to a (more) consistent fold error in the measured densities.
Methods using the density of one or more reference species: The density of one or more species can be directly measured using targeted measurement methods like species-specific qPCR; alternatively, the densities of one or more species may be thought to be constant across samples either from background knowledge or because they have been spiked in at a fixed density (these situations are reviewed in Appendix B). These reference species with known or constant density can be used to convert read counts to densities for all species. For instance, given the density of a reference species \(r\), we can estimate the density of a focal species \(i\) by \[\begin{align} \tag{2.10} % \widehat{\text{density}}_{i}(a) = \frac{\text{reads}_{i}(a)}{\text{reads}_{r}(a)} \cdot \widehat{\text{density}}_{r}(a). \widehat{\text{density}}_{i}(a) = \text{reads}_{i}(a) \cdot \frac{\widehat{\text{density}}_{r}(a)}{\text{reads}_{r}(a)}. \end{align}\] Alternatively, if we only know that the reference species is constant, we can set \(\widehat{\text{density}}_{r}(a)\) in Equation (2.10) equal to 1 to estimate the density in unknown but fixed units, which is sufficient for multiplicative DA analysis. (Appendix A considers possible extensions to multiple reference species.) It follows from Equation (2.7) that the error caused by taxonomic bias in this density measurement is \[\begin{align} \tag{2.11} \widehat{\text{density}}_{i}(a) = \text{density}_{i}(a) \cdot \frac{\text{efficiency}_{i}}{\text{efficiency}_{r}} \cdot \begin{array}{c} \text{fold error in} \\ \widehat{\text{density}_r}(a) \end{array}. \end{align}\] Here, it is the constant error in the measured ratio of species \(i\) to \(r\) that propagates into the density measurement. There will generally be systematic error in the density of the reference species; however, if the systematic fold error is constant across samples, so will be that of the densities of other species.
Differences between the two approaches: Equations (2.9) and (2.11) show that the manner in which taxonomic bias in the sequencing measurement impacts the two approaches mirrors that of proportions and ratios. If the fold error in the total-density or reference species measurement is negligible (or at least constant) then the fold errors in species densities by the total-density approach vary across samples with the mean efficiency, whereas the fold errors by the reference-species approach are constant. The fundamental determinant of whether constant error is obtained is not which of these two equations is ultimately used, but rather whether densities of individual species, which (we assume) have constant efficiency, or the total community, which has varying efficiency, provide the absolute-abundance information. Studies using spike-ins as constant reference species will often, instead of using Equation (2.10), first use the spike-in species to estimate the total density, and then estimate the density of focal species with Equation (2.8). If the calculation is done carefully, however, this method still yields constant fold errors so long as the species efficiencies (including those of the spike-in) are constant across samples (Appendix A.3).