2 How taxonomic bias affects abundance measurements
To understand how bias affects the measured differential abundance between samples, we first consider how it affects measurements of individual samples.
2.1 Model of MGS measurement
Our primary tool for understanding the impact of taxonomic bias MGS measurement is the theoretical model of MGS measurement developed and empirically validated by McLaren, Willis, and Callahan (2019). This model is the simplest model of MGS measurement that includes multiplicative taxonomic bias while respecting the compositional nature of sequencing measurements, in which the total read count for a sample is unrelated to its total cell number or density (Gloor et al. (2017)). We consider a set of microbial species that are measured by a specific MGS protocol. The protocol assigns each sequencing read to at most one sample andsample and species, and we ignore the possibility of contamination or false-positive taxonomic assignments—reads assigned to a given species and sample really do come from that species and sample. Our model stipulates that the assigned read count of species \(i\) in sample \(a\) equals its cell density multiplied by a species-specific, sample-independent factor and a species-independent, sample-specific factor, \[\begin{align} \tag{2.1} \text{reads}_{i}(a) = \text{density}_{i}(a) \quad \cdot \underbrace{\text{efficiency}_{i}}_{\substack{\text{species specific,} \\ \text{sample independent}}} \cdot \quad \underbrace{\text{sequencing effort}(a)}_{\substack{\text{species independent,} \\ \text{sample specific}}}. \end{align}\] The species-specific factor is the relative measurement efficiency (or efficiency for short) of the species, which represents how much more easily that species is measured (converted from cells to assigned reads) relative to an abitrarily chosen reference species (McLaren, Willis, and Callahan (2019)). The sample-specific factor, which we call the sequencing effort of the sample, reflects the fact that the number of reads per unit cell density varies among samples even in the absence of taxonomic bias due to variation in total density, library normalization, and total sequencing-run output. Equation (2.1) implies that the total reads from all species in the sample equal \[\begin{align} \tag{2.2} \text{total reads}(a) = \text{total density}(a) \cdot \text{mean efficiency}(a) \cdot \text{sequencing effort}(a), \end{align}\] where \[\begin{align} \tag{2.3} \text{mean efficiency}(a) \equiv \frac{\sum_{j}\text{density}_j(a)\cdot \text{efficiency}_j}{\text{total density}(a)} \end{align}\] is the average or mean efficiency of cells in the sample.
2.2 Relative abundances (proportions and ratios)
The term relative abundance carries multiple meanings in microbiome science. Microbiome researchers often use the relative abundance of a species in a sample to refer to its proportion, or number of cells from that species divided by the total number of cells from the taxonomic domain of interest (e.g. Prokaryotes). A second view, drawn from the statistical framework of Compositional Data Analysis, equates the relative abundances of species within a sample with the ratios of cell densities among two or more species (e.g. a ratio of 5:4:2 among three species). By this view, it is meaningless to refer to the relative abundance of an individual species. Although other interpretations exist6, we focus on the proportion and ratio views as they underly most relative and absolute DA methods and the differences between them clarify when and why the effects of bias cancel in a DA analysis.
The standard measurement of the proportion of species \(i\) in sample \(a\) is given by the ratio of its read count to the total, \[\begin{align} \tag{2.4} \widehat{\text{prop}}_{i}(a) = \frac{\text{reads}_i(a)}{\text{total reads(a)}}. \end{align}\] From Equations (2.1), (2.2), and (2.4), it follows that \[\begin{align} \tag{2.5} \widehat{\text{prop}}_{i}(a) &= \text{prop}_{i}(a) \cdot \frac{\text{efficiency}_{i}}{\text{mean efficiency}(a)}, \end{align}\] where \(\text{prop}_{i}(a) = \text{density}_{i}(a) / \text{total density}(a)\) is the actual proportion. Equation (2.5) says that taxonomic bias creates a multiplicative error equal to the ratio of the species’ efficiency to the mean efficiency in the sample. Consequently, the same species can be over- or under-measured depending on the composition of the given sample (Figure 2.1). For instance, in samples from two hypothetical communities in Figure 2.1, Species 3 has an efficiency of 6 and is under-measured in Sample 1 (which has a mean efficiency of 8.33) but over-measured in Sample 2 (which has a mean efficiency of 3.15).
The measured ratio between species \(i\) and \(j\) is given by the ratio of their read counts, \[\begin{align} \tag{2.6} \widehat{\text{ratio}}_{i/j}(a) = \frac{\text{reads}_i(a)}{\text{reads}_j(a)}. \end{align}\] From Equations (2.1) and (2.6), it follows that \[\begin{align} \tag{2.7} \widehat{\text{ratio}}_{i/j}(a) % \frac{\text{reads}_{i}(a)}{\text{reads}_{j}(a)} &= \frac{\text{density}_{i}(a)}{\text{density}_{j}(a)} \cdot \frac{\text{efficiency}_{i}}{\text{efficiency}_{j}}. \end{align}\] The fold error in the ratio between two species simply equals the ratio in their efficiencies and is therefore constant across samples. For instance, in the example of Figure 2.1, the ratio of Species 3 (with an efficiency of 6) to Species 1 (with an efficiency of 1) is over-estimated by a factor of 6 in both communities despite their varying compositions.
Figure 2.1: Taxonomic bias creates sample-dependent multiplicative errors in species proportions, which can lead to inaccurate fold changes between samples. Top row: Error in proportions measured by MGS in two microbiome samples that contain different relative abundances of three species. Bottom row: Error in the measured fold-change in the third species that is derived from these measurements.
2.3 Absolute abundances (cell densities)
A range of experimental and computational methods can be used to convert the relative-abundance information in the read counts into measurements of (absolute) cell densities; however, these methods can largely be partitioned into those that use the measured proportions or the measured ratios provided by MGS. The error in density measurements due to taxonomic bias therefore follows that of the proportion and ratio measurements described above.
Proportion-based density measurement is based on the idea that if we know the (true) proportion of a species and the (true) total density of all species, we can multiply these numbers obtain the species’ density. This method therefore involves multiplying the proportion of a species measured from MGS with a measurement of total microbial density obtain through other means (e.g., through direct cell counting or qPCR of a universal marker gene) to measure its density, \[\begin{align} \tag{2.8} \widehat{\text{density}}_{i}(a) = \widehat{\text{prop}}_{i}(a) \cdot \widehat{\text{total density}}(a). \end{align}\] Substituting the expression (2.5) for the measured proportion and allowing for error in the total density measurement, leads to the expression \[\begin{align} \tag{2.9} \widehat{\text{density}}_{i}(a) = \text{density}_{i}(a) \cdot \frac{\text{efficiency}_{i}}{\text{mean efficiency}(a)} \cdot \begin{array}{c} \text{fold error in} \\ \widehat{\text{total density}}(a) \end{array}. \end{align}\] The multiplicative error consists of two factors: the error in the estimated proportion (due to taxonomic bias in the MGS measurement), and any error in the estimated total density. In the case of perfect knowledge of the total density, the fold error in proportion-based density estimates varies across samples inversely with the mean efficiency.
Ratio-based density estimation follows from the fact that, in the absence of taxonomic bias, the conversion rate from reads to density is just the sequencing effort of the sample and is thus the same for all species (Equation (2.1)). Knowing the density of at least one reference species therefore enables us to convert from reads to density for all species. One approach to obtaining such reference species is to add one or more extraneous species in known (typically constant) abundance (a “spike-in”) to each sample prior to sequencing (REFs). A second approach is to determine a set of naturally occurring species that are thought to have a constant, though typically unknown, abundance across samples; in this case we can only estimate species’ densities relative to these “housekeeping species” (REFs). A third (seemingly untried) possibility is to measure the density of one or more naturally occurring species using a method such as targeted CFU counting, fluorescence flow cytometry or ddPCR directly on cells, or ddPCR or qPCR on extracted DNA. Given the estimate of the density of reference taxon \(r\), we estimate the density of taxon \(i\) by multipling the ratio of \(i\) to \(r\) in the reads by the density of \(r\), \[\begin{align} \tag{2.10} \widehat{\text{density}}_{i}(a) = \frac{\text{reads}_{i}(a)}{\text{reads}_{r}(a)} \cdot \widehat{\text{density}}_{r}(a) \end{align}\] (Appendix ??? describes how this equation can be extended to multiple reference species.) For a constant reference species with unknown density, we can set \(\widehat{\text{density}}_{r}(a)\) equal to 1 to estimate density in units of the density of the reference species. The error in this ratio-based density estimate is \[\begin{align} \tag{2.11} \widehat{\text{density}}_{i}(a) = \text{density}_{i}(a) \cdot \frac{\text{efficiency}_{i}}{\text{efficiency}_{r}} \cdot \begin{array}{c} \text{fold error in} \\ \widehat{\text{density}_r}(a) \end{array}. \end{align}\] Again, the multiplicative error consists of two factors: the error in the estimated ratio (due to taxonomic bias in the MGS measurement), and error in the estimated density of the reference taxa. So long as the fold error in the reference density is constant across samples, so will be the error in the estimated density of the focal species \(i\).
- Note: The current distinction between proportion- and ratio-based density estimation is incomplete, as the example of spike-in normalization shows: Whether the density of species \(i\) is estimated by Equation (2.10), or by first estimating total density from the ratio of non-spike-in to spike-in reads and then using Equation (2.8), the results are mathematically identical. Future revisions should clarify this point.
References
e.g. the rank order within a sample, or value associated with a species after a centered-log-ratio transform↩︎