2 How taxonomic bias affects abundance measurements

We begin by considering how taxonomic bias affects the relative- and absolute-abundance measurements which serve as the input to microbiome DA analyses.

Our primary tool for understanding the impact of taxonomic bias on MGS measurement is the theoretical model of MGS measurement developed and empirically validated by McLaren, Willis, and Callahan (2019). This model is the simplest that respects the multiplicative nature of taxonomic bias and the compositional nature of MGS measurements, in which the total read count for a sample is unrelated to its total cell number or density (Gloor et al. (2017)). We consider a set of microbiome samples measured by a specific MGS protocol that extracts and sequences DNA and then bioinformatically analyzes the reads, assigning them taxonomically to a set of microbial species \(S\). Various forms of taxonomic assignment are possible; for simplicity, we suppose that reads are assigned to the species level, with reads that cannot be uniquely assigned being discarded. We ignore the possibility that reads are assigned to the wrong species or sample. Unless otherwise stated, we treat the sequencing measurement as deterministic, ignoring the ‘random’ variation in read counts that arises from the sampling of sequencing reads and other aspects of the MGS process.

In our model, the assigned read count of a species \(i\) in a sample \(a\) equals its abundance multiplied by species-specific and sample-specific factors, \[\begin{align} \tag{2.1} \text{reads}_{i}(a) = \text{abun}_{i}(a) \quad \cdot \underbrace{\text{efficiency}_{i}}_{\substack{\text{species specific,} \\ \text{sample independent}}} \cdot \quad \underbrace{\text{effort}(a)}_{\substack{\text{species independent,} \\ \text{sample specific}}}. \end{align}\] The species-specific factor, \(\text{efficiency}_{i}\), equals the relative measurement efficiency (or simply efficiency) of the species: how much more easily that species is measured (converted from cells to assigned reads) relative to an arbitrary fixed reference species (McLaren, Willis, and Callahan (2019)). We assume the efficiencies of particular species are consistent across samples. The variation in efficiency among species corresponds to the taxonomic bias of the MGS protocol. The sample-specific factor, \(\text{effort}(a)\), we call the sequencing effort for that sample; it captures the variation in the total number of assigned reads due to experimental features such as library normalization and total sequencing-run output. Equation (2.1) implies that the total number of assigned reads in sample \(a\) equals \[\begin{align} \tag{2.2} \text{reads}_S(a) = \text{abun}_S(a) \cdot \text{efficiency}_S(a) \cdot \text{effort}(a), \end{align}\] where \(\text{abun}_{S}(a) \equiv \sum_{j\in S}\text{abun}_j(a)\) is the total abundance of species in \(S\) and \[\begin{align} \tag{2.3} \text{efficiency}_S(a) \equiv \frac{\sum_{j\in S}\text{abun}_j(a) \cdot \text{efficiency}_j}{\text{abun}_S(a)} \end{align}\] is the abundance-weighted mean efficiency over all species in \(S\).
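
As a minimal numerical sketch of Equations (2.1) to (2.3), the following Python snippet computes read counts for a single sample and checks that the total from Equation (2.2) matches the sum of the species-level counts. The species names, abundances, efficiencies, and effort value are all hypothetical and chosen only for illustration; they are not taken from any dataset.

```python
# Hypothetical inputs for one sample; none of these numbers come from real data.
abun = {"sp1": 100.0, "sp2": 50.0, "sp3": 250.0}    # cells of each species (assumed)
efficiency = {"sp1": 1.0, "sp2": 18.0, "sp3": 6.0}  # relative efficiencies (assumed)
effort = 0.5                                        # sequencing effort (assumed)


def assigned_reads(abun, efficiency, effort):
    """Equation (2.1): reads_i = abun_i * efficiency_i * effort."""
    return {sp: abun[sp] * efficiency[sp] * effort for sp in abun}


def mean_efficiency(abun, efficiency):
    """Equation (2.3): abundance-weighted mean efficiency over the species."""
    total = sum(abun.values())
    return sum(abun[sp] * efficiency[sp] for sp in abun) / total


reads = assigned_reads(abun, efficiency, effort)
total_reads = sum(reads.values())

# Equation (2.2): the same total, computed from the aggregate quantities.
total_from_means = sum(abun.values()) * mean_efficiency(abun, efficiency) * effort

print(reads)
print(total_reads, total_from_means)
```

The two totals agree because the mean efficiency in Equation (2.3) is weighted by abundance, which is exactly the weighting that the sum over species in Equation (2.1) produces.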

2.1 Relative abundance (proportions and ratios)

We distinguish between two types of species-level relative abundances within a sample. The proportion of species \(i\) in sample \(a\) equals its abundance relative to the total abundance of all species in \(S\), \[\begin{align} \tag{2.4} \text{prop}_{i}(a) &\equiv \frac{\text{abun}_i(a)}{\text{abun}_S(a)}. \end{align}\] The ratio between two species \(i\) and \(j\) equals the abundance of \(i\) relative to that of \(j\), \[\begin{align} \tag{2.5} \text{ratio}_{i/j}(a) = \frac{\text{abun}_i(a)}{\text{abun}_j(a)}. \end{align}\] Proportions and ratios each form the basis for popular DA methods; ratio-based methods are commonly referred to as Compositional Data Analysis (CoDA) methods.

Taxonomic bias has a different effect on species proportions versus species ratios. The proportion of a species is typically measured by its proportion of assigned reads, \[\begin{align} \tag{2.6} \widehat{\text{prop}}_{i}(a) = \frac{\text{reads}_i(a)}{\text{reads}_S(a)}. \end{align}\] We can rewrite the right-hand side (using Equations (2.1), (2.2), and (2.4)) to find \[\begin{align} \tag{2.7} \widehat{\text{prop}}_{i}(a) &= \text{prop}_{i}(a) \cdot \underbrace{\frac{\text{efficiency}_{i}}{\text{efficiency}_S(a)}}_{\substack{\text{variable} \\ \text{fold error}}}. \end{align}\] Taxonomic bias creates a multiplicative error in the species’ proportion equal to its efficiency relative to the mean efficiency in the sample. Consequently, the species’ proportion is measured as too high in samples that are dominated by species with lower efficiencies, and measured as too low in samples that are dominated by species with higher efficiencies. This phenomenon is illustrated in two hypothetical communities in Figure 2.1. Species 3 has an efficiency of 6; it is under-measured in Sample 1, which has a mean efficiency of 8.33, but over-measured in Sample 2, which has a mean efficiency of 3.15.
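
The sample dependence of the fold error in Equation (2.7) can be sketched numerically. In the Python snippet below, the efficiencies of Species 1 and Species 3 follow the example in the text (1 and 6), while the efficiency of Species 2 and both sample compositions are assumptions made for illustration; they are not read off Figure 2.1.

```python
# Efficiencies for three species. Species 1 (1.0) and Species 3 (6.0) follow
# the text; the value for Species 2 is an assumption.
efficiency = [1.0, 18.0, 6.0]

# Actual proportions in two hypothetical samples (assumed compositions).
samples = {
    "sample1": [1 / 3, 1 / 3, 1 / 3],
    "sample2": [0.70, 0.05, 0.25],
}


def measured_proportions(prop, efficiency):
    """Proportions of assigned reads, Equation (2.6).

    The sequencing effort cancels, so we can work with prop * efficiency.
    Returns the measured proportions and the sample's mean efficiency.
    """
    weighted = [p * e for p, e in zip(prop, efficiency)]
    mean_eff = sum(weighted)  # Equation (2.3), since the proportions sum to 1
    return [w / mean_eff for w in weighted], mean_eff


fold_error = {}
for name, prop in samples.items():
    measured, mean_eff = measured_proportions(prop, efficiency)
    # Fold error = measured / actual, which Equation (2.7) predicts to be
    # efficiency_i / mean efficiency of the sample.
    fold_error[name] = [m / p for m, p in zip(measured, prop)]
    print(name, round(mean_eff, 2), [round(f, 2) for f in fold_error[name]])
```

With these assumed inputs, the fold error for Species 3 is below 1 in the first sample and above 1 in the second, because the mean efficiency lies above 6 in one sample and below 6 in the other.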

The measured ratio between species \(i\) and \(j\) is given by the ratio of their read counts, \[\begin{align} \tag{2.8} \widehat{\text{ratio}}_{i/j}(a) = \frac{\text{reads}_i(a)}{\text{reads}_j(a)}. \end{align}\] From Equations (2.1) and (2.8), it follows that \[\begin{align} \tag{2.9} \widehat{\text{ratio}}_{i/j}(a) &= \text{ratio}_{i/j}(a) \cdot \underbrace{\frac{\text{efficiency}_{i}}{\text{efficiency}_{j}}}_{\substack{\text{constant} \\ \text{fold error}}}. \end{align}\] Taxonomic bias creates a multiplicative measurement error in the ratio that is equal to the ratio of the two species’ efficiencies; the error is therefore constant across samples. For instance, in Figure 2.1, the ratio of Species 3 (with an efficiency of 6) to Species 1 (with an efficiency of 1) is over-estimated by a factor of 6 in both communities despite their varying compositions.
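
A short numerical check of Equation (2.9) is sketched below. As before, the efficiency of Species 2, the cell counts, and the effort values are hypothetical; only the efficiencies of Species 1 and Species 3 follow the example in the text.

```python
# Efficiencies: Species 1 and 3 follow the text; Species 2 is an assumption.
efficiency = [1.0, 18.0, 6.0]

# Hypothetical cell counts and sequencing efforts for two samples.
samples = {
    "sample1": {"abun": [100.0, 100.0, 100.0], "effort": 2.0},
    "sample2": {"abun": [700.0, 50.0, 250.0], "effort": 0.5},
}


def ratio_fold_error(abun, efficiency, effort, i, j):
    """Measured ratio (Eq. 2.8) divided by the actual ratio (Eq. 2.5)."""
    reads = [a * e * effort for a, e in zip(abun, efficiency)]  # Eq. (2.1)
    measured = reads[i] / reads[j]
    actual = abun[i] / abun[j]
    return measured / actual


# Ratio of Species 3 (index 2) to Species 1 (index 0) in each sample.
errors = {
    name: ratio_fold_error(s["abun"], efficiency, s["effort"], 2, 0)
    for name, s in samples.items()
}
for name, err in errors.items():
    print(name, round(err, 6))
```

Both samples give a fold error equal to the efficiency ratio of the two species, regardless of composition or sequencing effort, because the effort and all other species drop out of the ratio.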

Figure 2.1: Taxonomic bias creates sample-dependent multiplicative errors in species proportions, which can lead to inaccurate fold changes between samples. Top row: Error in proportions measured by MGS in two hypothetical microbiome samples that contain different relative abundances of three species. Bottom row: Error in the measured fold-change in the third species that is derived from these measurements. Species’ proportions may be measured as too high or too low depending on sample composition. For instance, Species 3 has an efficiency of 6 and is under-measured in Sample 1 (which has a mean efficiency of 8.33) but over-measured in Sample 2 (which has a mean efficiency of 3.15).

The fact that the multiplicative error in proportions varies while that in ratios is constant forms the basis for understanding the effect of bias on different methods for measuring absolute abundances from MGS and the extent to which bias cancels in different DA analyses.

2.2 Absolute abundance

Researchers often want to know how the absolute abundance of a species changes. In this context, absolute means not relative to other species; typically, the abundance of interest is relative to non-microbial entities such as the volume, mass, or amount of host DNA in the sample. A wide range of experimental methods have been used to convert the relative abundances from MGS measurements into measurements of absolute abundances. Yet there has so far been little consideration of how the resulting measurements are affected by taxonomic bias in the MGS measurements or in the supplemental measurements, such as flow cytometry and qPCR, that they often involve. Here and in Appendix B we describe how various methods for measuring absolute abundance are predicted to be affected by taxonomic bias, with an eye towards determining whether the measurement error is variable or constant.

Our discussion supposes the ‘abundance’ of interest is cell number, though it applies to other quantities, such as biomass and genome copy number, that may be more relevant in different biological contexts.

We consider two general classes of methods: those in which absolute-abundance information is derived from measurement of the aggregate abundance of the total community, and those in which it derives from targeted measurement (or prior knowledge) of one or more particular species.

Normalization by total abundance: Measurements of species absolute abundances are often obtained by making a (non-MGS) measurement of total community abundance, equating it with the aggregate abundance of the species \(S\) measured by MGS, and multiplying this total by the proportions from MGS, \[\begin{align} \tag{2.10} \widehat{\text{abun}}_{i}(a) &= \widehat{\text{prop}}_{i}(a) \cdot \widehat{\text{abun}}_S(a). \end{align}\] This measurement is affected by taxonomic bias in the MGS measurement as well as in the total-abundance measurement. For example, measurement of total community abundance by 16S qPCR is affected by variation among species in extraction efficiency, 16S copy number, and PCR binding and amplification bias. If the species-level efficiencies of the total-abundance measurement are constant across samples and we neglect other sources of error, then we can express the total-abundance measurement as \[\begin{align} \tag{2.11} \widehat{\text{abun}}_S(a) &= \sum_{i\in S} \text{abun}_i(a) \cdot \text{efficiency}^{\text{tot}}_i \\&= \text{abun}_S(a) \cdot \text{efficiency}^{\text{tot}}_S(a), \end{align}\] where \(\text{efficiency}^{\text{tot}}_{i}\) is the absolute measurement efficiency of species \(i\) for the total-abundance measurement and \(\text{efficiency}^{\text{tot}}_S(a)\) is the mean efficiency of the total-abundance measurement in the sample. Equations (2.7) and (2.11) for the error in the proportion and total-abundance measurements imply that the species abundance measurement (2.10) has error given by \[\begin{align} \tag{2.12} \widehat{\text{abun}}_{i}(a) = \text{abun}_{i}(a) \cdot \frac{\text{efficiency}_{i} \cdot \text{efficiency}^{\text{tot}}_S(a)}{\text{efficiency}_S(a)}. \end{align}\]

Equation (2.12) indicates that the multiplicative error in the measured absolute abundance of a species equals its MGS efficiency relative to the mean MGS efficiency in the sample, multiplied by the mean efficiency of the total-abundance measurement. As in the case of proportions (Equation (2.7)), the error depends on sample composition through the two mean efficiency terms and so may vary across samples. On the other hand, if the mean efficiency of the total-abundance measurement mirrors that of the MGS measurement, the two can offset and lead to a relatively stable error. We discuss how this possibility might apply to real experimental workflows below.
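
The behavior described by Equation (2.12) can be illustrated with a small simulation. The sketch below compares two hypothetical total-abundance measurements: one that measures every species equally well, and one whose species-level efficiencies are proportional to the MGS efficiencies (one simple way in which the two mean efficiencies could mirror each other). All efficiencies and cell counts are assumptions made for illustration.

```python
# All numbers are hypothetical and chosen only for illustration.
efficiency_mgs = [1.0, 18.0, 6.0]  # MGS efficiencies (assumed)
samples = {
    "sample1": [100.0, 100.0, 100.0],  # cell counts per species (assumed)
    "sample2": [700.0, 50.0, 250.0],
}


def fold_errors(abun, eff_mgs, eff_tot):
    """Fold error in each species' abundance measured via Equation (2.10)."""
    reads = [a * e for a, e in zip(abun, eff_mgs)]         # Eq. (2.1); effort cancels
    prop_hat = [r / sum(reads) for r in reads]             # Eq. (2.6)
    total_hat = sum(a * e for a, e in zip(abun, eff_tot))  # Eq. (2.11)
    abun_hat = [p * total_hat for p in prop_hat]           # Eq. (2.10)
    return [ah / a for ah, a in zip(abun_hat, abun)]


def predicted_errors(abun, eff_mgs, eff_tot):
    """Fold error predicted by Equation (2.12)."""
    total = sum(abun)
    mean_mgs = sum(a * e for a, e in zip(abun, eff_mgs)) / total
    mean_tot = sum(a * e for a, e in zip(abun, eff_tot)) / total
    return [e * mean_tot / mean_mgs for e in eff_mgs]


cases = {
    # Total measurement with equal efficiency for every species.
    "uniform": [1.0, 1.0, 1.0],
    # Total measurement whose efficiencies are proportional to the MGS ones.
    "mirrored": [0.1 * e for e in efficiency_mgs],
}

errors = {}
for case, eff_tot in cases.items():
    for name, abun in samples.items():
        errors[(case, name)] = fold_errors(abun, efficiency_mgs, eff_tot)
        print(case, name, [round(x, 3) for x in errors[(case, name)]])
```

Under these assumptions, the fold errors in the uniform case differ between the two samples, whereas in the mirrored case the two mean efficiencies cancel and each species has the same fold error in both samples.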

Normalization by one or more reference species: Suppose we had a measurement of the (absolute) abundance of a reference species \(r\). In the absence of taxonomic bias, all species are expected to have the same ratio of reads to abundance in a sample (Equation (2.1)). Thus the abundance-to-reads ratio for species \(r\) can serve as a conversion factor allowing us to obtain the abundance of an arbitrary species \(i\), \[\begin{align} \tag{2.13} \widehat{\text{abun}}_{i}(a) = \text{reads}_{i}(a) \cdot \frac{\widehat{\text{abun}}_{r}(a)}{\text{reads}_{r}(a)}. \end{align}\] The abundance of one or more reference species can be directly measured using targeted measurement methods like species-specific qPCR and used with Equation (2.13) to obtain absolute abundances for all species. To our knowledge, this approach has not previously been suggested. Instead, reference-based measurement has been used in the context of spike-in experiments or normalization to a species (such as the host) that is treated as having a constant abundance. In a spike-in experiment, one or more reference species are added to each sample in known (up to experimental error) abundances. When normalizing to a species that is treated as having a constant but unknown abundance, we set \(\widehat{\text{abun}}_{r}(a)\) in Equation (2.13) equal to 1 and interpret the abundances as having fixed but unknown units, which is sufficient for multiplicative DA analysis.

The error in the abundance measurement (2.13) due to taxonomic bias in the MGS measurement is given by \[\begin{align} \tag{2.14} \widehat{\text{abun}}_{i}(a) = \text{abun}_{i}(a) \cdot \frac{\text{efficiency}_{i}}{\text{efficiency}_{r}} \cdot \begin{array}{c} \text{fold error in} \\ \widehat{\text{abun}}_{r}(a) \end{array}. \end{align}\] The constant error in the measured ratio of species \(i\) to \(r\) (see Equation (2.9)) propagates to the abundance measurement. There will generally also be systematic error in the measured abundance of the reference species; however, if this systematic fold error is constant across samples, then so is the fold error in the abundances of the other species.
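
The following sketch applies Equation (2.13) to two simulated samples and compares the resulting fold errors with Equation (2.14). The species names, efficiencies, cell counts, efforts, and the constant fold error assumed for the targeted measurement of the reference species (`REF_FOLD_ERROR`) are all hypothetical.

```python
# All values are hypothetical. "ref" is the reference species r.
efficiency = {"sp1": 1.0, "sp2": 18.0, "sp3": 6.0, "ref": 3.0}
samples = {
    "sample1": {
        "abun": {"sp1": 100.0, "sp2": 100.0, "sp3": 100.0, "ref": 20.0},
        "effort": 2.0,
    },
    "sample2": {
        "abun": {"sp1": 700.0, "sp2": 50.0, "sp3": 250.0, "ref": 80.0},
        "effort": 0.5,
    },
}
# Assumed constant fold error in the targeted measurement of the reference.
REF_FOLD_ERROR = 1.2


def reference_normalized(reads, ref, abun_ref_hat):
    """Equation (2.13): scale reads by the reference's abundance-to-reads ratio."""
    factor = abun_ref_hat / reads[ref]
    return {sp: r * factor for sp, r in reads.items() if sp != ref}


fold_error = {}
for name, s in samples.items():
    # Simulate MGS read counts with Equation (2.1).
    reads = {sp: s["abun"][sp] * efficiency[sp] * s["effort"] for sp in s["abun"]}
    # Simulate the (biased) targeted measurement of the reference species.
    abun_ref_hat = s["abun"]["ref"] * REF_FOLD_ERROR
    abun_hat = reference_normalized(reads, "ref", abun_ref_hat)
    fold_error[name] = {sp: abun_hat[sp] / s["abun"][sp] for sp in abun_hat}
    print(name, {sp: round(f, 3) for sp, f in fold_error[name].items()})
```

In both simulated samples, each species has a fold error equal to its efficiency divided by that of the reference, times the assumed fold error in the reference measurement, even though the samples differ in composition and sequencing effort.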

Spike-ins are sometimes instead used to measure absolute abundances via Equation (2.10). In this case, an intermediate step is taken in which the total community abundance is first measured by the ratio of non-spike-in to spike-in reads. If done correctly, this calculation yields results that are identical to directly applying Equation (2.13) (Appendix A.3).
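
The equivalence can be checked numerically under one plausible reading of the two-step calculation, which is an assumption here since Appendix A.3 is not reproduced in this section: the total abundance is estimated as the known spike-in abundance times the ratio of non-spike-in to spike-in reads, and proportions are computed over non-spike-in reads only. The read counts and spike-in abundance below are hypothetical.

```python
def via_reference(reads, spike, spike_abun):
    """Direct use of Equation (2.13) with the spike-in as the reference."""
    factor = spike_abun / reads[spike]
    return {sp: r * factor for sp, r in reads.items() if sp != spike}


def via_total(reads, spike, spike_abun):
    """Two-step route through Equation (2.10).

    Assumption: total abundance is estimated from the ratio of non-spike-in
    to spike-in reads, and proportions exclude the spike-in reads.
    """
    native = {sp: r for sp, r in reads.items() if sp != spike}
    native_total = sum(native.values())
    total_hat = spike_abun * native_total / reads[spike]
    return {sp: (r / native_total) * total_hat for sp, r in native.items()}


# Hypothetical read counts and known spike-in abundance for one sample.
reads = {"sp1": 350.0, "sp2": 450.0, "sp3": 750.0, "spike": 120.0}
spike_abun = 80.0

direct = via_reference(reads, "spike", spike_abun)
two_step = via_total(reads, "spike", spike_abun)
max_diff = max(abs(direct[sp] - two_step[sp]) for sp in direct)
print(direct)
print(two_step)
print(max_diff)
```

The total of non-spike-in reads appears once in the numerator and once in the denominator of the two-step route, so it cancels and the two estimates coincide up to floating-point rounding.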

Difference between the two approaches: Reference-species normalization yields constant fold errors because it is based on species-level read counts and abundances and we assume that efficiencies are constant at the species level. In contrast, total-abundance normalization is based on aggregates of species (for the calculation of proportion and the total-abundance measurement) and so depends on mean efficiencies, which can vary across samples.

References

Gloor, Gregory B., Jean M. Macklaim, Vera Pawlowsky-Glahn, and Juan J. Egozcue. 2017. “Microbiome Datasets Are Compositional: And This Is Not Optional.” Front. Microbiol. 8 (November): 2224. https://doi.org/10.3389/fmicb.2017.02224.

McLaren, Michael R., Amy D. Willis, and Benjamin J. Callahan. 2019. “Consistent and Correctable Bias in Metagenomic Sequencing Experiments.” eLife 8 (September): e46923. https://doi.org/10.7554/eLife.46923.