class: center, middle, inverse, title-slide

# Calibration without controls
## True microbial change from taxonomically biased microbiome measurements
### Michael McLaren

---

<!-- Setup {{{ -->

<script>
window.MathJax = {
  tex: {
    packages: ['base', 'ams', 'cancel', 'color']
  },
  loader: {
    load: ['[tex]/cancel', '[tex]/color']
  }
};
</script>

<!-- }}} -->
<!-- Opening {{{ -->

# Community sequencing lets us measure all microbes or genes in a sample

<br>
<br>
<br>
<br>

.right-column[
![Community sequencing](shared_figures/illustrations/mgs-measurement-simple.svg)
]

.footnote[Fecal sample from j4p4n on openclipart.org]

???

Community sequencing is often used to simultaneously measure the relative or absolute abundance of all microbes or microbial genes in a microbiome.

The problem is that...

---

# But the relative abundances we measure are _wrong_

.left-column[
Cellular mock communities measured by 16S sequencing by Brooks, et al. (2015),<br>
re-analyzed in McLaren, et al. (2019)
]

.right-column[
<img src="shared_figures/papers/mclaren2019/brooks-A-all.svg" width="85%" />
]

<!-- .footnote[McLaren, et al. (2019)] -->

???

But the measured relative abundances are _wrong_

TODO explain the brooks data

- here I'm showing a figure from our reanalysis of data from Brooks et al., in which X mock community mixtures of 7 vaginally-derived species were measured by 16S sequencing
- in it, we can see that across samples and species, there is no direct relationship between the observed and actual proportion of a species

info

- 7 species, 71 samples, 58 unique mixtures

---

# Changes in a single species across samples can also be inaccurate

.left-column[
Cellular mock communities measured by 16S sequencing by Brooks, et al. (2015),<br>
re-analyzed in McLaren, et al.
(2019)
]

.right-column[
<img src="slides_files/figure-html/unnamed-chunk-4-1.png" width="80%" />
]

---
layout: true

# This error is caused by taxonomic bias in the sequencing measurement

<br><br><br>

---

<!-- Taxonomic bias definition -->

.right-column[
.preview[
**Taxonomic bias**

Systematic variation among species in how efficiently they are measured (that is, converted from cells to classified reads)
]
]

???

this error in the experiment I showed you can be explained by... (read slide)

- bias arises from many sources, notably including DNA extraction, making it also a problem for shotgun sequencing

---

.right-column[
.preview[
Taxonomic bias is

- **Universal**: It affects all amplicon _and_ <br> shotgun metagenomics protocols
- **Protocol-specific**: Measurements from different<br>protocols are quantitatively incomparable
]
]

???

---
layout: false

# Can we mitigate bias to obtain accurate and reproducible microbiome measurements?

.right-column[
<br>
<img src="shared_figures/papers/mclaren2019/title.svg" width="95%" />

<img src="figures/collabs-elife.svg" width="75%" />
]

???

transition: Understanding how we can counter taxonomic bias, to obtain accurate and reproducible microbiome measurements, has been the primary aim of my postdoc work with Ben Callahan at NC State and our ongoing collaboration with the Willis lab at UW

---

# Taxonomic bias can be quantified in terms of species-specific measurement efficiencies

.left-column[
McLaren, et al. (2019)
]

.right-column[
<img src="shared_figures/papers/mclaren2019/brooks-A-estimate.svg" width="85%" />
]

???

In 2019, we showed that taxonomic bias can be quantified in terms of species-specific measurement efficiencies. The relative efficiencies of various species can be measured from one or more control communities with known relative abundances of those species. We measured these in the Brooks experiment, shown here (explain with an example).
We then showed that from just these 7 numbers, we could explain most of the error, going from (show) to (show).

<!-- the relative efficiencies of various species can be measured from one or more community controls with known relative abundances of those species. -->

- This error can be largely explained by taxonomic bias in sequencing measurements
- define bias, and its problem

---

# Variation in measurement efficiency explains the systematic error in mock communities

.left-column[
McLaren, et al. (2019)
]

.right-column[
<img src="shared_figures/papers/mclaren2019/brooks-B-estimate.svg" width="85%" />
]

???

Variation in measurement efficiency across species can explain the systematic error in mock communities

---
layout: true

# The error due to taxonomic bias can be corrected, given a suitable community control

.left-column[
McLaren, et al. (2019)
]

---

.right-column[
<img src="figures/poster-estimation.svg" width="65%" />
]

???

From one control, we can measure these efficiencies, and either predict or correct the error across other samples, allowing us to go from this (observed) to this (model prediction).

---

.right-column[
<img src="figures/poster-calibration.svg" width="71%" />
]

???

---
layout: false

<br><br><br><br><br><br><br>

.right-column[
.preview[
# But the necessary control communities do not yet exist.
]]

???

details (can skip)

- Most of us don't have control communities with representative species from the communities we are studying.
- And the community controls that do exist haven't actually been validated for the purpose of calibrating natural samples.

---

<br><br><br><br><br><br><br>

.right-column[
.preview[
# Can we account for bias without community controls?
]]

--

.right-column[
.preview[
Yes, at least for measuring the _(log) fold change_<br>in relative or absolute abundance.
]]

???

[main message] Yes, at least for the _(log) fold change_ in the relative or absolute abundance of a species.
---

# Read / follow this work on [GitHub](https://github.com/mikemc/differential-abundance-theory)

.right-column[
<img src="figures/mikemc_differential-abundance-theory.svg" width="90%" />
]

???

[github.com/mikemc/differential-abundance-theory](https://github.com/mikemc/differential-abundance-theory)

---
layout: false

# The problem, and a possible solution

<br>
<br>
<br>
<br>

.right-column[
.preview[
1. Consistent bias can create spurious change<br> in species' **proportions**
1. Fold changes in species **ratios** are unaffected
1. Robust fold change in **absolute abundance**<br>can be obtained from ratios
]
]

???

- Consistent bias can create spurious fold change in species' proportions, which are the most common way that relative abundance is measured.
- However, an alternative approach to measuring change in relative abundance, the fold changes in the ratios among species, is unaffected by taxonomic bias.
- We can use this fact to obtain robust changes in the absolute abundance of microbial species, which is often what biologists are most interested in.

---

<!-- }}} -->
<!-- Body {{{ -->

# Why doesn't protocol standardization control for taxonomic bias?
--

.right-column[
### The hope:

`\begin{align}
\definecolor{measured}{RGB}{0,158,115}
% \definecolor{actual}{RGB}{0,114,178}
\definecolor{actual}{RGB}{86,180,223}
\definecolor{error}{RGB}{213,94,0}
\underbrace{{\color{measured}\widehat{\text{proportion}}(\text{species})}}_{\text{measured}}
&= \underbrace{{\color{actual}\text{proportion}(\text{species})}}_{\text{actual}}
\;\; \times
\underbrace{{\color{error}\text{efficiency}(\text{species})}}_{\text{fold error is constant}}
\end{align}`
]

--

.right-column[
### The reality:

`\begin{align}
\underbrace{ {\color{measured}\widehat{\text{proportion}}(\text{species})} }_{\text{measured}}
&= \underbrace{{\color{actual}\text{proportion}(\text{species})}}_{\text{actual}}
\times
\underbrace{{\color{error}\frac{\text{efficiency}(\text{species})}{\text{mean efficiency}(\text{sample})}}}_{\text{fold error varies by sample}}
\end{align}`
]

--

.right-column[
### `\(\rightarrow\)` Consistent taxonomic bias creates inconsistent errors in proportions
]

???

If the hope were true, then when we look at fold changes in proportion, the error would completely cancel out: although we would have an erroneous view of the proportion in an individual sample, we could accurately infer the fold changes between samples.

---

# In proportion-based analyses, the effect of bias only partially cancels

.right-column[
<br>
<br>
<br>

`\begin{flalign}
{\color{measured}\widehat{\text{proportion}}(\text{species})}
&= {{\color{actual}\text{proportion}(\text{species})}}
\times
\underbrace{\frac{\overbrace{{\color{error}\text{efficiency}(\text{species})}}^{\text{constant, so cancels}}}{{\color{error}\text{mean efficiency}(\text{sample})}}}_{\text{varies, so does not}} &
\end{flalign}`
]

--

.right-column[
<br><br>
### `\(\rightarrow\)` Shifts in the mean efficiency can cause spurious changes
]

???
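A minimal numerical sketch of the "reality" equation, in Python. The three species, their efficiencies, and the two sample compositions are hypothetical values chosen only for illustration, and the mean efficiency is assumed to be the average of the species' efficiencies weighted by their actual proportions. It shows every species' fold change being multiplied by the same factor, the fold decrease in mean efficiency.

```python
# All numbers below are hypothetical, chosen only to illustrate the equation.
efficiency = {"high_eff": 20.0, "low_eff": 1.0, "focal": 2.0}

# Actual proportions in two samples; each sums to 1.
actual_1 = {"high_eff": 0.90, "low_eff": 0.07, "focal": 0.03}
actual_2 = {"high_eff": 0.07, "low_eff": 0.92, "focal": 0.01}


def mean_efficiency(actual, efficiency):
    """Efficiency averaged over species, weighted by actual proportion (assumed definition)."""
    return sum(actual[s] * efficiency[s] for s in actual)


def measured_proportions(actual, efficiency):
    """measured = actual * efficiency(species) / mean efficiency(sample)."""
    mean_eff = mean_efficiency(actual, efficiency)
    return {s: actual[s] * efficiency[s] / mean_eff for s in actual}


measured_1 = measured_proportions(actual_1, efficiency)
measured_2 = measured_proportions(actual_2, efficiency)

# Fold change (FC) from sample 1 to sample 2, actual versus measured.
actual_fc = {s: actual_2[s] / actual_1[s] for s in efficiency}
measured_fc = {s: measured_2[s] / measured_1[s] for s in efficiency}

# Every measured FC is off by the same factor: the fold decrease in mean efficiency.
fold_error = mean_efficiency(actual_1, efficiency) / mean_efficiency(actual_2, efficiency)

for s in efficiency:
    print(f"{s}: actual FC = {actual_fc[s]:.2f}, measured FC = {measured_fc[s]:.2f}")
print(f"shared fold error = {fold_error:.2f}")
```

With these made-up values the focal species truly falls about 3-fold but is measured as rising, because the mean efficiency falls by more than 3-fold between the two samples.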
Shifts in the mean efficiency across samples can cause spurious changes in individual species

transition: we can find examples of this occurring in practice in the human vaginal microbiome during pregnancy

---
layout: false
background-image: url('http://vmc.vcu.edu/static/img/myImgs/VMC_home/Diego_Spitaleri_PTB_web-01.jpg')
background-size: cover

.footnote[Image source: Vaginal Microbiome Consortium, http://vmc.vcu.edu/]

???

- To illustrate the types of problems that can arise, we took the bias we measured in our previous analysis of vaginal mock communities, and considered its impact on inferences about the real vaginal microbiomes of pregnant women that were measured using the same sequencing protocol.

---
layout: true

# Example: Dynamics of the vaginal community during pregnancy

.left-column[
16S measurements from Fettweis, et al. (2019), calibrated following McLaren, et al. (2019)
]

---

.right-column[
<img src="slides_files/figure-html/unnamed-chunk-12-1.png" width="80%" />
]

???

- Fettweis et al.: the Multi-Omic Microbiome Study-Pregnancy Initiative (MOMS-PI) tracked the trajectories of the vaginal microbiomes of pregnant women over multiple visits during pregnancy using 16S sequencing

The community transitions from the high-efficiency L. iners to the low-efficiency Gardnerella

---

.right-column[
<img src="slides_files/figure-html/unnamed-chunk-13-1.png" width="80%" />
]

???

causing the mean efficiency to drop by 10X, which causes the fold changes (FCs) of all species to be inflated by 10X.

---

.right-column[
<img src="slides_files/figure-html/unnamed-chunk-14-1.png" width="80%" />
]

???

As a result, Prevotella bivia, which actually decreased by 3X, instead appears to increase by 3X.

---
layout: false

# How problematic is mean-efficiency variation for microbiome studies?
<br><br>

--

.right-column[.large[
Large shifts in mean efficiency can be driven by

- large changes in individual species _or_
- correlated changes in many species
]]

--

.right-column[.large[
Systematic error in case-control analysis occurs if<br> the mean efficiency is correlated with case vs. control
]]

--

.right-column[.large[
We don't know if these problems are common
]]

???

- also affects regression over many samples; e.g. case v control; see manuscript
- don't know how big of a problem this is; manuscript describes considerations
- transition
  - we made this observation in our eLife paper, and considered whether there are other types of analyses that might be more robust to bias. We hit upon one approach, based on the observation that the (next slide)

---
layout: true

# Fold errors in species ratios are consistent across samples

.left-column[
McLaren, et al. (2019)
]

---

.right-column[
<img src="figures/brooks-ratios-error.svg" width="65%" />
]

???

Here we're seeing the same mock community data as before, in which the error in species proportions is highly inconsistent. We noted that the error in the ratio between each pair of species was highly consistent. So, for example, the ratio of G. vaginalis to L. crispatus at the far left is always around 20X below its true value.

(low priority) TODO: make a version with an arrow pointing to G. vaginalis and L. crispatus

---

<br><br>
<br><br>
<br><br>

.right-column[
`\begin{align}
\underbrace{{\color{measured}\frac{\text{reads}(G. vaginalis)}{\text{reads}(L. iners)}}}_{\text{measured ratio}}
=
\underbrace{{\color{actual} \frac{\text{density}(G. vaginalis)}{\text{density}(L. iners)}}}_{\text{actual ratio}}
\times
\underbrace{{\color{error} \frac{\text{efficiency}(G. vaginalis)}{\text{efficiency}(L. iners)}}}_{\text{fold error is constant}}
\end{align}`
]

???
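One way to spell out why a constant fold error leaves fold changes in a ratio intact: a short derivation, assuming the efficiencies of two species I and J are the same in both samples (consistent bias). The sample subscripts a and b are notation introduced here for illustration and do not appear on the slide.

```latex
\begin{align}
\frac{\text{reads}_b(I) / \text{reads}_b(J)}{\text{reads}_a(I) / \text{reads}_a(J)}
&= \frac{\dfrac{\text{density}_b(I)}{\text{density}_b(J)} \times \dfrac{\text{efficiency}(I)}{\text{efficiency}(J)}}
        {\dfrac{\text{density}_a(I)}{\text{density}_a(J)} \times \dfrac{\text{efficiency}(I)}{\text{efficiency}(J)}} \\
&= \frac{\text{density}_b(I) / \text{density}_b(J)}{\text{density}_a(I) / \text{density}_a(J)}
\end{align}
```

The efficiency ratio appears in both numerator and denominator, so it cancels and the measured fold change in the ratio equals the actual fold change.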
Mathematically, what is happening is that (explain slide)

NOTE: prob better to replace these species names with G vag and L crisp since that matches the example

---
layout: false

# Changes in ratios are robust to consistent taxonomic bias

.left-column[
]

.right-column[
<img src="slides_files/figure-html/unnamed-chunk-16-1.png" width="80%" />
]

???

- recall example before in which the proportion of P bivia spuriously appeared to increase
- Can see that the ratio of P bivia to L iners is correctly observed to increase
- More generally, any analysis that depends on the fold change in ratios is robust to consistent taxonomic bias
- This includes various forms of differential relative-abundance analysis that have been adopted from compositional data analysis, including those that look at how various log ratio transformations vary across samples or conditions.
- But often we aren't interested in ratios or proportions, but rather in how the absolute cell numbers or densities of species are changing

---

# What if we care about change in _absolute_ abundance?

<br>

.right-column[.large[
Examples of "absolute" abundance include

- Microbial cell density per unit sample mass or volume
- Microbial cells per host cell
]]

--

.right-column[.large[
Absolute-abundance measurements are still subject to taxonomic bias!
]]

???
- title; then slide
- transition: Absolute abundances can be derived from proportions _or_ ratios

---

# The same problems apply to absolute abundances derived from proportions

<br>

.right-column[
### Proportion-based density measurement:

`\begin{align}
{\color{measured}\begin{array}{l} \text{Cell density}\\ \text{of species } I \end{array}}
\;\; = \quad
{\color{measured}\begin{array}{l} \text{Proportion of species } I \\ \text{from sequencing} \end{array}}
\quad \times \quad
{\color{measured}\begin{array}{l} \text{Total density} \\ \text{from cytometry} \end{array}}
\end{align}`
]

--

.right-column[
### Biased proportions `\(\rightarrow\)` Biased densities
]

--

.right-column[
### Variation in mean efficiency `\(\rightarrow\)` Error in fold changes
]

???

- The same problem applies to absolute abundances based on proportions
- Consider the approach some have called 'quantitative microbiome profiling', of estimating absolute cell densities by multiplying the proportion of a species from sequencing by the total cell density measured by flow cytometry or microscopy.
- The error in the proportions is directly inherited by the densities
- and again, variation in the mean efficiency can cause errors in the change of individual species across samples

transition: Can ratio-based inference again help in this situation?
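A rough sketch of the proportion-based density measurement, in Python, under stated assumptions: the species names, efficiencies, and densities are hypothetical; reads are taken to be proportional to density times efficiency; and the total density is assumed to be measured without error. The focal species has the same actual density in both samples, yet its estimated density changes because the rest of the community shifts.

```python
# Hypothetical efficiencies and actual cell densities (cells per mL), for illustration only.
efficiency = {"high_eff": 10.0, "low_eff": 1.0, "focal": 1.0}
density_1 = {"high_eff": 8.0e8, "low_eff": 1.9e8, "focal": 1.0e7}
density_2 = {"high_eff": 1.0e8, "low_eff": 8.9e8, "focal": 1.0e7}


def proportion_based_density(density, efficiency):
    """Measured proportion from sequencing times total density from cytometry.

    Assumes the total density is measured without error, so that all of the
    error comes from taxonomic bias in the sequencing proportions.
    """
    total = sum(density.values())
    # Reads are taken to be proportional to density * efficiency.
    weighted = {s: density[s] * efficiency[s] for s in density}
    weighted_total = sum(weighted.values())
    return {s: weighted[s] / weighted_total * total for s in density}


estimate_1 = proportion_based_density(density_1, efficiency)
estimate_2 = proportion_based_density(density_2, efficiency)

# The focal species does not change, but its estimated density does.
actual_fc = density_2["focal"] / density_1["focal"]
measured_fc = estimate_2["focal"] / estimate_1["focal"]
print(f"focal species: actual FC = {actual_fc:.2f}, measured FC = {measured_fc:.2f}")
```

The spurious fold change equals the fold decrease in mean efficiency between the two samples (8.2 to 1.9 with these made-up numbers), the same error that affects the proportions themselves.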
---

# But we can instead measure absolute abundance from ratio to a reference species

<br>

.right-column[
### Ratio-based density measurement:

`\begin{align}
{\color{measured}\begin{array}{l} \text{Cell density}\\ \text{of species } I \end{array}}
\;\; = \quad
{\color{measured}\frac{\text{Reads of species } I}{\text{Reads of species } R}}
% {\color{measured}\begin{array}{l} \text{Ratio of species } I \\ \text{to species } R\end{array}}
\quad \times \quad
{\color{measured}\begin{array}{l} \text{Density of}\\ \text{species } R \end{array}}
\end{align}`
]

--

.right-column[
### `\(\rightarrow\)` Bias creates constant error
]

--

.right-column[
### `\(\rightarrow\)` Fold changes are accurate
]

???

By estimating the density of a focal species by its ratio to

---

# The reference species can be any species with a known or measurable abundance

<br>

--

.right-column[.large[
- 'Housekeeping species'
  - e.g. the host, pepper mild mottle virus
]]

--

.right-column[.large[
- Spike-in of extraneous species
  - cells or DNA, pre- or post-extraction
]]

--

.right-column[.large[
- Species measured by targeted methods
  - via qPCR, direct ddPCR, CFU counting, etc.
]]

???

---
layout: true

# Example: Calibration to a reference species corrects fold-change measurements

.left-column[
Cellular mock communities from Brooks, et al. (2015)
]

---

.right-column[
<img src="slides_files/figure-html/unnamed-chunk-17-1.png" width="85%" />
]

---

.right-column[
<img src="slides_files/figure-html/unnamed-chunk-18-1.png" width="85%" />
]

---

.right-column[
<img src="slides_files/figure-html/unnamed-chunk-19-1.png" width="85%" />
]

???

- TODO: script this

<!-- }}} -->

---
layout: false

# We now have multiple methods to address taxonomic bias in differential abundance

.left-column[
]

.right-column[
<img src="slides_files/figure-html/unnamed-chunk-20-1.png" width="85%" />
]

???
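The ratio-based density measurement (cell density of species I = reads of I / reads of R × density of R) can be sketched numerically in Python. This is an illustration under assumptions, not the manuscript's implementation: the species, efficiencies, densities, and sequencing depth factors are all hypothetical, reads are taken to be proportional to density times efficiency, and the reference density is assumed to be known without error. The helper names `expected_reads` and `ratio_based_density` are made up for this sketch.

```python
# Hypothetical efficiencies and actual densities (cells per mL), for illustration only.
efficiency = {"focal": 1.0, "reference": 4.0, "other": 10.0}
density_1 = {"focal": 1.0e7, "reference": 2.0e6, "other": 8.0e8}
density_2 = {"focal": 3.0e7, "reference": 5.0e6, "other": 1.0e8}


def expected_reads(density, efficiency, depth_factor):
    """Reads proportional to density * efficiency; depth_factor varies by sample."""
    return {s: depth_factor * density[s] * efficiency[s] for s in density}


def ratio_based_density(reads, reference_density, focal="focal", reference="reference"):
    """Cell density of species I = reads of I / reads of R * density of R."""
    return reads[focal] / reads[reference] * reference_density


reads_1 = expected_reads(density_1, efficiency, depth_factor=1e-4)
reads_2 = expected_reads(density_2, efficiency, depth_factor=5e-4)

# The reference density is assumed known without error
# (e.g. from a spike-in or a targeted assay).
estimate_1 = ratio_based_density(reads_1, density_1["reference"])
estimate_2 = ratio_based_density(reads_2, density_2["reference"])

# The fold error in each sample is efficiency(focal) / efficiency(reference).
fold_error_1 = estimate_1 / density_1["focal"]
fold_error_2 = estimate_2 / density_2["focal"]

actual_fc = density_2["focal"] / density_1["focal"]
measured_fc = estimate_2 / estimate_1
print(f"fold error: sample 1 = {fold_error_1:.2f}, sample 2 = {fold_error_2:.2f}")
print(f"focal species: actual FC = {actual_fc:.2f}, measured FC = {measured_fc:.2f}")
```

Each estimate is wrong by the same factor, the ratio of the focal to the reference efficiency, so the fold change between samples comes out right even though the rest of the community and the sequencing depth both change.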
- although taxonomic bias may create spurious differential-abundance results, we can address it using
  - control communities, in cases where these are practical
      - a particularly straightforward case is syncom experiments
  - or using ratio-based methods, which apply to relative and absolute abundance analysis
      - in particular, for experiments aiming to consider the change in absolute abundance of all species, using the typical proportion-based approach, there seems little excuse not to also make targeted measurements of a couple of species to check against, and if needed to correct, the community-wide measurements.
- Maybe: we introduce two other purely computational approaches in our manuscript

---

# Join me in turning these ideas into accessible, well-tested solutions!

<!-- # Conclusion / close -->

.right-column[
<img src="figures/mikemc_differential-abundance-theory-short.svg" width="90%" />

- Homepage: [github.com/mikemc/differential-abundance-theory](https://github.com/mikemc/differential-abundance-theory)
- See latest and past manuscript versions and annotate with [Hypothesis](https://web.hypothes.is/)
- Discuss manuscript with GitHub Issues or Discussions
- Use GitHub or email me to discuss new tests and tools
]

???

Yet there remains a large gap in terms of experimental protocols, statistical tools, and empirical testing before these methods offer truly accessible and well-tested solutions.

So I'll end by inviting any of you who are interested to engage online by reading and commenting on the manuscript, starting a GitHub Discussion, or reaching out to me privately, so that we can together usher in a world in which taxonomic bias and other forms of metagenomics measurement error are directly accounted for, to enable the confidence in microbiome studies needed to turn statistical findings into true scientific knowledge.
---

# Many thanks to the Callahan lab and all my collaborators

<br>

.left-column[Work funded by NIH grant R35GM133745]

.right-column[
![Collaborators](figures/collabs-with-authors.svg)
]

???

while you are all thinking about questions, I want to take the chance to thank the Callahan lab and my collaborators, and particularly the authors of our differential abundance manuscript, whom I've boxed here

TODO: add a note that the box denotes (co)author of manuscript

---

# To learn more or get in touch

.left-column[
<!-- **Email**<br>m.mclaren42@gmail.com -->
**Email**<br>mike@mikemc.cc

**GitHub**<br>@mikemc

**Twitter**<br>@mikemc423
]

.right-column[
<img src="shared_figures/papers/mclaren2019/title.svg" width="90%" />
<br>
<img src="figures/mikemc_differential-abundance-theory-short.svg" width="90%" />

[github.com/mikemc/differential-abundance-theory](https://github.com/mikemc/differential-abundance-theory)
]

---

# References

Brooks, J. P. et al. (2015). "The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies". In: _BMC Microbiol._ 15.1, p. 66. DOI: [10.1186/s12866-015-0351-6](https://doi.org/10.1186%2Fs12866-015-0351-6). URL: [http://www.biomedcentral.com/1471-2180/15/66](http://www.biomedcentral.com/1471-2180/15/66).

Fettweis, J. M. et al. (2019). "The vaginal microbiome and preterm birth". In: _Nat. Med._ 25.6, pp. 1012-1021. DOI: [10.1038/s41591-019-0450-2](https://doi.org/10.1038%2Fs41591-019-0450-2). URL: [http://www.nature.com/articles/s41591-019-0450-2](http://www.nature.com/articles/s41591-019-0450-2).

McLaren, M. R. et al. (2019). "Consistent and correctable bias in metagenomic sequencing experiments". In: _eLife_ 8, e46923. DOI: [10.7554/eLife.46923](https://doi.org/10.7554%2FeLife.46923). URL: [https://elifesciences.org/articles/46923](https://elifesciences.org/articles/46923).