A Model details

This appendix provides a more detailed description of our deterministic model of the microbiome measurement process, which it uses to derive some of the additional claims of the main text.

A.1 MGS measurement

This section expands the description of the model in the main text to more explicitly lay out our assumptions and to consider the effect of variation in measurement efficiencies within a taxonomic group on its MGS measurement.

A taxon can be any group of organisms. For any taxon \(I\) and sample \(a\), we define the absolute efficiency of that taxon and that sample as the number of reads assigned to that taxon divided by the number of cells of that taxon in the given sample. We assume that these assigned reads really do come from that particular taxon and sample (though this assumption is often violated in practice).

The efficiency of a taxon \(I\) is the average number of reads produced by each of its constituent cells. This number varies due to

Genetic variation among cells
Non-genetic variation among cells
Randomness in the experimental process

Variation encoded in the genome of cells determines properties that affect its efficiency, such as how easy it is to lyse, how easily PCR primers will bind to their target locus, how many copies of a marker gene are in the genome, and whether its sequencing reads can be taxonomically assigned. This variation created by genetically encoded traits is what we refer to as ‘taxonomic bias’ and is the primary focus of this article. In principle, even a single nucleotide change (e.g. in a primer-binding site) could affect the efficiency of a cell. Nevertheless, cells will more similar genomes will tend to have more similar genetically-encoded efficiencies. For the purposes of our analysis, we suppose that cells within the same species have approximately identical genetically-encoded efficiencies. The impact of violations in this assumption can be understood using the approach we describe below for taxa above the species level. More generally, our results can be read as applying to the finest taxonomic unit that is reported by the given MGS method.

Genetically identical cells can differ in cell state such that their efficiencies vary significantly. An extreme example is the effect of sporulation: Spores are generally harder to lyse than vegetative cells, and VanInsberghe et al. (2020) found that the efficiency of a common DNA extraction protocol was 800X on vegetative cells vs spores of a strain of Clostridium difficile. Physiological differences might also arise simply due where they are in the cell cycle. To the extent that the cells of a given genotype tend to have a similar distribution of cell states across samples, the efficiency of the genotype will remain stable. We assume this stability throughout our analysis. Violations in this assumption can again be understood using an approach similar to that we use to considering taxa above the species level.

The final number of reads also varies due to idiosyncratic events that we typically think of (and model) as ‘random’, such as the precise handling of a sample during pipetting or loading of a sequencing flow cell. Such variation may or may not be dwarfed by the ‘random’ variation in the sample composition (i.e., the variation not explained by the covariates in a regression analysis). For simplicity, we ignore this random variation when considering the impact of taxonomic bias on MGS measurement. The effect of random variation in regression analysis is addressed in Appendix C and the effect of random counting error is discussed in Section 4.

The absolute efficiency of a taxon can also vary simply because more effort is put towards sequencing that sample, either intentionally or unintentionally. The sequencing effort might depend on the taxonomic composition and bias of a sample. For example, a sample with higher output DNA will often be diluted more prior to amplification or sequencing. The critical assumption we make regarding sequencing effort is that it affects all species in the sample equally and hence doesn’t affect their relative abundances.

With these assumptions in place, we can partition the multiplicative difference between the reads assigned to a taxon and its density in the source sample into a taxon-specific factor and a sample-specific factor. There is an extra degree of freedom, which we settle by equating the taxon-specific factors with relative measurement efficiencies and arbitrarily choose the relative efficiency of a particular species to equal 1.

Our model for the read counts of a species \(i\) in sample \(a\) can thus be expressed mathematically as \[\begin{align} \text{reads}_{i}(a) = \text{density}_{i}(a) \cdot \text{efficiency}_{i} \cdot \text{effort}(a). \end{align}\] Note that the species’ efficiencies are properties of the species \(i\) but not the sample \(a\). Similarly, let \(\text{density}_{I}(a)\), \(\text{reads}_{I}(a)\), and \(\text{efficiency}_{I}(a)\) denote the density, read count, and relative efficiency of an arbitrary group of species \(I\) in sample \(a\). The read count for taxon \(I\) in sample \(a\) can be written \[\begin{align} \tag{A.1} \text{reads}_{I}(a) = \text{density}_{I}(a) \cdot \text{efficiency}_{I}(a) \cdot \text{effort}(a). \end{align}\] The efficiency of taxon \(I\) in sample \(a\) equals the weighted average of its constituent species, \[\begin{align} \tag{A.2} \text{efficiency}_I(a) \equiv \frac{\sum_{i\in I}\text{density}_i(a)\cdot \text{efficiency}_i}{\text{density}_I(a)}. \end{align}\] The ratio between the read counts of two taxa \(I\) and \(J\) is \[\begin{align} \tag{A.3} \widehat{\text{ratio}}_{I/J}(a) &= \frac{\text{density}_{I}(a)}{\text{density}_{J}(a)} \cdot \frac{\text{efficiency}_{I}(a)}{\text{efficiency}_{J}(a)}. \end{align}\] Equation (2.7) for the error in a species’ proportion and Equation (2.9) for the error in the ratio between two species both follow directly from this more general expression.

A.2 Taxonomic bias in total-density measurement

To understand the impact of taxonomic bias in the total-density measurement on community normalization, we model error in total-density measurements similarly to that for MGS measurements. Let \(\widehat{\text{density}}_{S}(a)\) be the measurement of the density of all species \(S\). We suppose that this number is the sum of the (unobserved) contributions from each species. The contribution of a species \(i\) is proportional to its actual density \(\text{density}_{i}(a)\) and its absolute efficiency for the total-density measurement \(\text{efficiency}_{i}^{\text{tot}}(a)\). The contributions may also be multiplied by a common sample-specific error factor to account for non-linearity of extraction and/or random error, which we ignore for now. The measured total density therefore equals \[\begin{align} \widehat{\text{density}}_S(a) &= \sum_{i\in S} \text{density}_i(a) \cdot \text{efficiency}^{\text{tot}}_i \\&= \text{density}_S(a) \cdot \text{efficiency}^{\text{tot}}_S(a) \end{align}\] where \[\begin{align} \tag{A.4} \text{efficiency}^{\text{tot}}_S(a) \equiv \frac{\sum_{i \in S}\text{density}_j(a)\cdot \text{efficiency}^{\text{tot}}_i}{\text{density}_S(a)} \end{align}\] is the mean efficiency in the sample with respect to the total-density measurement.

The error in the total density depends on the species composition through the mean efficiency of the sample; hence spurious changes in measured density can occur simply due to shifts from high- to low-efficiency species (or vice versa).

How does this error affect the measured densities of individual species when the total is used for normalization? Substituting the mean efficiency of the total measurement for the error term in Equation (2.12) gives \[\begin{align} \tag{A.5} \widehat{\text{density}}_{i}(a) = \text{density}_{i}(a) \cdot \frac{\text{efficiency}_{i} \cdot \text{efficiency}^{\text{tot}}_S(a)}{\text{efficiency}_S(a)}. \end{align}\] The mean efficiency of the total measurement appears in the numerator, while the mean efficiency of the MGS measurement is in the denominator. To the extent that these two are positively associated across samples (specifically, their logarithms are positively linearly correlated), then the variation in one will tend to be offset by the variation in the other. In this case the (geometric) variation in the ratio will be reduced, and the fold error in the species density measurement will tend to be more consistent than if the total density measurement was taxonomically unbiased (i.e. \(\text{efficiency}^\text{tot}_S(a) = 1\) for all samples). The error in species densities for individual samples may not be reduced; but the error in fold changes and regression analysis across samples will be.

The extent to which the log mean efficiency of the MGS measurement and the total measurement are correlated depends on the species efficiencies as well as the changes in species composition across samples, making it difficult to make general statements about this effect. We can get a sense for the effect by considering the correlation between log efficiencies across species between the two measurement types. Suppose that distribution over species of log efficiencies of the MGS measurement has variance \(\sigma^{2}\), the log efficiencies for the total density measurement has variance \(\sigma_{\text{tot}}^{2}\), and the correlation between the two is \(\rho\). The variance in the difference \(\log \text{efficiency}_{i}^{\text{tot}} - \log \text{efficiency}_{i}\) is thus \(\sigma^{2} + \sigma_{\text{tot}}^{2} - \rho \sigma \sigma_{\text{tot}}\). All else equal, the smaller this variance, the smaller we expect the variation in the error factor \(\text{efficiency}^{\text{tot}}_S(a)/ \text{efficiency}_S(a)\).

Unclassified reads: It is common that a given MGS protocol generates sequencing reads from particular taxa that the bioinformatics portion entirely discards, due to an inability to taxonomically assign the reads to the chosen taxonomic units. For instance, 16S data that is analyzed with closed-reference OTU assignment and shotgun data analyzed by mapping to a database of reference genomes will be unable to assign sequences that are less than a chosen similarity cutoff (e.g. 97%) from the reference sequences. These taxa may form an appreciable fraction of the community and thus lead to a large fraction of the total sequenced reads for a given sample may remain unassigned.

So far, we have defined the efficiency of the MGS protocol in terms of taxonomically classified reads and so have treated such taxa as having an MGS efficiency of 0. If such species are included in the total density measurement, then we have a mismatch between the two measurement types. However, we may be able to remove this mismatch by altering the equation we use to form species density measurements. In particular, we can compute the proportion of the focal species from the ratio of its assigned reads to the total sequenced reads, rather than to the total set of assigned reads. This approach raises additional questions around how to treat reads from assignable species that are filtered during quality control steps.

Suppose that the set of species that can be classified by our bioinformatics pipeline, \(S\), is a proper subset of a larger set of species \(S'\) that are present and yield sequencing reads in our sample. Suppose that there is no additional taxonomic bias in our MGS measurement; that is, all species in \(S\) have an efficiency of 1, and the species in \(S' \setminus S\) have an efficiency of 0. Moreover, suppose that all reads from the \(S\) species are classified; that is, there is no loss of reads due to routine QC filtering. Further suppose that we are capable of perfectly measuring the total density of all cells from the full set of species \(S'\); that is, \(\text{efficiency}^{\text{tot}}_{i} = 1\) for all species \(i \in S'\).

In this case, we can overcome the mismatch between MGS and total measurement when we use them to measure species densities, by using the total reads rather than the assigned reads to compute the species proportions. That is, instead of \[\begin{align} \widehat{\text{prop}}_{i}(a) = \frac{\text{reads}_i(a)}{\text{reads}_S(a)}, \end{align}\] we take \[\begin{align} \widehat{\text{prop}}_{i}(a) &= \frac{\text{reads}_i(a)}{\text{total reads}(a)} \\&= \frac{\text{reads}_i(a)}{\text{reads}_{S'}(a)}. \end{align}\] By accounting for the contribution of unknown species when computing proportions, we are able to resolve the mismatch and obtain accurate species densities.

A.3 Using reference species for total-density normalization

Constant reference species are sometimes used to measure total density of \(S\) by the ratio of \(S\) reads to \(R\) reads. For example, a study of Arabidopsis microbiomes used the ratio of bacterial to host reads in shotgun sequencing as a proxy for total bacterial density, which they then used for total-community normalization of 16S amplicon sequencing measurements (Karasov et al. (2020), Regalado et al. (2020)). Chng et al. (2020) similarly used the ratio of bacterial to host or diet reads in shotgun sequencing of mouse fecal samples as a proxy for total bacterial density (though they did not use this measurement for community normalization). Smets et al. (2016) similarly used the ratio of non-spike-in to spike-in reads to estimate total density.

What is the impact of taxonomic bias on these total density estimates and the species densities derived from them? The measured density of \(S\) is \[\begin{align} \widehat{\text{density}}_{S}(a) &= \frac{\text{reads}_S(a)}{\text{reads}_{R}(a)} \\&= \frac{\text{density}_S(a)}{\text{density}_{R}} \cdot \frac{\text{efficiency}_S(a)}{\text{efficiency}_{R}} \\&= \text{density}_S(a) \cdot \text{efficiency}_S(a) \cdot \text{constant}. \end{align}\] Hence variation in the mean efficiency among the species \(S\) across samples creates a variable fold error in the density measurement, similar to direct measurements of total density.

The impact of this variable error for species density measurement via Equation (2.10) depends on whether the species proportions are derived from the same or a different MGS measurement.

If proportions from the same MGS measurement are used, then the measured species densities are \[\begin{align} \widehat{\text{density}}_{i}(a) &= \frac{\text{reads}_i(a)}{\text{reads}_{S}(a)} \cdot \widehat{\text{density}}_{S}(a) \\&= \frac{\text{reads}_{i}(a)}{\text{reads}_{R}(a)}. \end{align}\] where the second line follows by substitution of the previous equation. The variable error in the total density cancels with that in the proportion, yielding a constant fold error. In fact, the measurement we obtain is exactly the same as that from reference normalization (Equation (2.13)).

The situation differs when the total density and community composition are estimated using different sequencing and/or bioinformatic methods, as in the plant microbiome example above. In this case, the mean efficiency of the total density measurement will not equal that in the proportions, and the more general case of total-density normalization discussed in Appendix A.2 applies.

Although we only consider constant reference species, similar behavior occurs if for varying reference species assayed by targeted measurements.

A.4 Linearity of extraction

Some popular absolute-abundance methods use post-extraction DNA (or RNA) density as a proxy for cell density in the original sample; in particular, using florescence-based total DNA quantification or using qPCR or ddPCR to quantify a marker-gene. In addition, some have suggested normalization of species to spike-ins of extraneous DNA added after extraction. Both of these approaches presuppose that DNA extraction is linear in the sense that the DNA concentration is proportional to the cell concentration in the original sample (possibly after correction by known dilution or concentration factors). Deviations from linearity can occur due to random and systematic variation in DNA yields. Here we consider the impact of these deviations on species density measurements.

One source of systematic variation in yield is taxonomic bias. We already explored the impact of taxonomic bias in extraction.

NOTE: This might not be a correct way to talk about this; should perhaps make a more explicit model to verify this argument.

To understand the impact of non-linearity after bias has been accounted for, let \(C(a)\) be a sample-specific factor that accounts for DNA extraction non-linearity after controlling for bias. From Equation (2.12), we have the more general equation \[\begin{align} \tag{A.6} \widehat{\text{density}}_{i}(a) = \text{density}_{i}(a) \cdot \frac{\text{efficiency}_{i} \cdot \text{efficiency}^{\text{tot}}_S(a)}{\text{efficiency}_S(a)} \cdot C(a). \end{align}\] \(C(a)\) might fluctuate randomly among samples, or might vary systematically — for example, if DNA yield saturates at high concentrations then \(C(a)\) will be smaller in higher-concentration samples. If DNA extraction is not linear, then measures of cell density might give more accurate fold changes even if they share less bias with the MGS measurement.

Post-extraction spike-ins and targeted measurements face a similar problem. These perfectly control for bias in fold change estimation if DNA extraction is linear. However, if not then the fold error in \(\widehat{\text{density}_r}(a)\) for a species with a targeted measurement will vary with \(C(a)\) and cause error in fold changes. The same problem applies to DNA spike-ins added after DNA extraction (although different accounting is needed since \(\text{density}_r(a)\) no longer has meaning).

References

Chng, Kern Rei, Tarini Shankar Ghosh, Yi Han Tan, Tannistha Nandi, Ivor Russel Lee, Amanda Hui Qi Ng, Chenhao Li, et al. 2020. “Metagenome-wide association analysis identifies microbial determinants of post-antibiotic ecological recovery in the gut.” Nat. Ecol. Evol. 4 (9): 1256–67. https://doi.org/10.1038/s41559-020-1236-0.

Karasov, Talia L., Manuela Neumann, Alejandra Duque-Jaramillo, Sonja Kersten, Ilja Bezrukov, Birgit Schröppel, Efthymia Symeonidi, et al. 2020. “The relationship between microbial population size and disease in the Arabidopsis thaliana phyllosphere.” bioRxiv. https://doi.org/10.1101/828814.

Regalado, Julian, Derek S. Lundberg, Oliver Deusch, Sonja Kersten, Talia Karasov, Karin Poersch, Gautam Shirsekar, and Detlef Weigel. 2020. “Combining whole-genome shotgun sequencing and rRNA gene amplicon analyses to improve detection of microbe–microbe interaction networks in plant leaves.” ISME J., May, 823492. https://doi.org/10.1038/s41396-020-0665-8.

Smets, Wenke, Jonathan W. Leff, Mark A. Bradford, Rebecca L. McCulley, Sarah Lebeer, and Noah Fierer. 2016. “A method for simultaneous measurement of soil bacterial abundances and community composition via 16S rRNA gene sequencing.” Soil Biol. Biochem. 96: 145–51. https://doi.org/10.1016/j.soilbio.2016.02.003.

VanInsberghe, David, Joseph A. Elsherbini, Bernard Varian, Theofilos Poutahidis, Susan Erdman, and Martin F. Polz. 2020. “Diarrhoeal events can trigger long-term Clostridium difficile colonization with recurrent blooms.” Nat. Microbiol. 5 (4): 642–50. https://doi.org/10.1038/s41564-020-0668-2.