class: center, middle, inverse, title-slide

# callahan lab meeting
## bias inference in natural samples
### michael mclaren
### 2019-10-04

---

# problem: different protocols yield different and inaccurate measurements

<img src="figures/costea-phase3-ordination.svg" width="80%">

---
class: middle

## goal: gain a quantitative understanding of bias and use it to create computational and experimental tools to make metagenomics measurements and inferences accurate and reproducible

---

# what we've done so far

### McLaren, Willis, Callahan 2019, eLife

- proposed a simple model that explains
  - bias: actual -> observed
  - bias: observed(Protocol P) -> observed(Protocol R)
- validated the model on defined (mock) bacterial communities
- described key implications of the model, including how to
  + measure bias from control samples
  + calibrate from control measurements
  + analyze how bias affects downstream analyses

---

# questions we still need to answer

- is bias consistent across natural specimens?
- is bias consistent between natural and artificial specimens?
- is bias consistent across labs using the same protocol?
- is bias consistent among closely related taxa?

# tools we want to develop

"user-friendly" statistical machinery to

- better measure bias for protocol optimization
- evaluate how bias affects downstream analyses
- perform calibration

---

# outline

1. motivating data
2. theory: generative statistical models for differential-bias inference
3. two fitting procedures I'm currently testing
4. where to go from here

???

goal: help figure out the next steps; trying to evaluate the candidate frameworks, pick one to go with for short-term projects, and create a vignette for it. Perhaps say this in the intro.
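---

# aside: bias measurement and calibration in miniature

a toy numerical sketch of "measure bias from control samples" and "calibrate" (all numbers are made up; a single noise-free control, with bias estimated as the observed/actual ratio)

```r
# control sample with known (mock) composition
control.actual <- c(T1 = 1, T2 = 1, T3 = 1) / 3
control.obs    <- c(T1 = 0.278, T2 = 0.667, T3 = 0.056)  # as measured
bias.hat <- control.obs / control.actual  # estimated bias, up to a constant

# calibrate a new measurement of the same taxa
sample.obs <- c(T1 = 0.2, T2 = 0.5, T3 = 0.3)
calibrated <- sample.obs / bias.hat
calibrated / sum(calibrated)  # corrected proportions
```

the arbitrary constant in `bias.hat` cancels in the final renormalization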
---

# Costea et al 2017 Phase 2 dataset

4 labs `\(\times\)` 3 extraction protocols `\(\times\)` 2 fecal specimens (ish)

```r
sam %>% count(Lab, Protocol)
```

```
## # A tibble: 11 x 3
##    Lab   Protocol     n
##    <fct> <fct>    <int>
##  1 1     H            8
##  2 1     Q            9
##  3 1     W            6
##  4 6     H            6
##  5 6     Q            6
##  6 6     W            6
##  7 9     H            6
##  8 9     Q           10
##  9 15    H            6
## 10 15    Q           12
## 11 15    W            7
```

???

ish: lab 9 is only showing protocols H and Q; need to look into that.

---

# Costea et al 2017 Phase 2 dataset

```r
sam %>% filter(Lab == 1) %>% count(Protocol, Specimen)
```

```
## # A tibble: 6 x 3
##   Protocol Specimen     n
##   <fct>    <fct>    <int>
## 1 H        A            4
## 2 H        B            4
## 3 Q        A            6
## 4 Q        B            3
## 5 W        A            3
## 6 W        B            3
```

---

# Costea et al 2017 Phase 2 dataset

## analysis goals

- is bias consistent between Specimen A and Specimen B?
- effect of lab versus protocol?
- is bias consistent among closely related taxa?

---

# (deterministic) model of bias

<!-- <img src="figures/bias-model.svg" width="100%"> -->
<img src="figures/compositional-dependence.svg" width="100%">

???

note that compositional vectors are determined only up to an arbitrary constant; we often want to pin them down, e.g. here we chose relative to taxon 1. But we don't have to, and zeros are ok.

---

# differential bias between protocols

- Protocol `\(P\)` observes `\(O^{(P)} \sim A \cdot B^{(P)}\)`
- Protocol `\(R\)` observes `\(O^{(R)} \sim A \cdot B^{(R)}\)`
- their difference is
$$
O^{(P)} / \: O^{(R)} \sim A \cdot B^{(P)} \: / \: (A \cdot B^{(R)}) = B^{(P)} / \: B^{(R)}
$$
- `\(B^{(P/R)} \equiv B^{(P)} / B^{(R)}\)` is the differential bias of Protocol `\(P\)` relative to Protocol `\(R\)`
- can treat the reference observation as "actual":
$$
O^{(P)} \sim O^{(R)} \cdot B^{(P/R)}
$$

---

# switch to matrix notation

- `\(n\)` samples and `\(Q\)` taxa
- `\(O\)` is an `\(n \times Q\)` matrix
- `\(O_{ij}\)` is the abundance in sample `\(i\)` of taxon `\(j\)`
- `\(O_i\)` is a `\(Q \times 1\)` column vector giving the composition of sample `\(i\)`

???
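---

# differential bias, numerically

a minimal numerical check of the identity above, with made-up efficiencies (illustrative numbers, not estimates): the ratio of two protocols' observations of the same specimen recovers their differential bias, up to the usual arbitrary constant

```r
actual <- c(T1 = 0.5, T2 = 0.3, T3 = 0.2)  # hypothetical specimen
bias.P <- c(T1 = 1.0, T2 = 4.0, T3 = 0.5)  # hypothetical bias of Protocol P
bias.R <- c(T1 = 1.0, T2 = 2.0, T3 = 2.5)  # hypothetical bias of Protocol R
obs.P <- actual * bias.P / sum(actual * bias.P)  # observed proportions, P
obs.R <- actual * bias.R / sum(actual * bias.R)  # observed proportions, R
ratio <- obs.P / obs.R
# after pinning both vectors to T1, the ratio equals B^(P/R):
all.equal(ratio / ratio[["T1"]],
          (bias.P / bias.R) / (bias.P / bias.R)[["T1"]])
#> TRUE
```

note that the actual composition cancels, so any specimen gives the same answer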
---

# adding noise

### random multiplicative error

`\(O_i \sim A_i \cdot B_i \cdot \epsilon_i\)` or `\(\log O_i \sim \log A_i + \log B_i + \log \epsilon_i\)`

`\(\text{cen}[\epsilon_i] \sim 1\)` or `\(E[\log \epsilon_i] \sim 0\)`

### random counting error (multinomial)

read counts `\(W_i\)` such that `\(M_i = \sum_j W_{ij}\)` and `\(O_i\)` normalized to proportions

`\(W_i \mid M_i \sim \text{Multinomial} \left(M_i, O_i \right)\)`

### random counting error (Poisson)

`\(W_{ij} \sim \text{Poisson} \left(M_i O_{ij} \right)\)`

???

Note the overloaded use of `\(\sim\)`.

---

# generalized linear regression formulation

- distinguish between samples (aliquots) and specimens (communities)
- `\(n\)` samples, `\(Q\)` taxa, and `\(p\)` parameters
- `\(X\)` is an `\(n \times p\)` matrix of sample covariates
<!-- - `\(X_i'\)` is a `\(1 \times p\)` row vector of covariates for sample `\(i\)` -->
- `\(X_i\)` is a `\(p \times 1\)` column vector of covariates for sample `\(i\)`
- `\(R\)` is a `\(p \times Q\)` matrix of composition and bias parameters

???

in a differential-bias experiment, we'll use X to encode information about the specimen

---

# regression formulation for differential bias

observed composition of sample `\(i\)` is

$$
\log O_i \sim X_i' R + \log \epsilon_i
$$

where `X = model.matrix(~ 0 + Specimen + Protocol, sam)`

for example, if sample `\(i\)` is an observation of Specimen S2 by Protocol P2,

$$
E[\log O_i] \sim `\begin{bmatrix} 0 & 1 & 0 & 1 & 0 \end{bmatrix}`
`\begin{bmatrix} \text{S1 as measured by P1} \\ \text{S2 as measured by P1} \\ \text{S3 as measured by P1} \\ \text{log bias of P2 to P1} \\ \text{log bias of P3 to P1} \\ \end{bmatrix}`
$$

$$
\qquad \quad = `\begin{bmatrix} \text{S2 as measured by P1} + \text{log bias of P2 to P1} \end{bmatrix}`
$$

???
where `\(X\)` encodes the _specimen_ and _protocol_ information, and `\(R\)` has parameters for the compositions of the specimens as observed by the reference protocol and for the differential bias relative to the reference protocol

---

# metacal::complm( ) - compositional linear regression via least-squares estimation

suppose

$$
P^{(i)} \log W_i = P^{(i)} [ X'_i R + \log \epsilon_i]
$$

where `\(P^{(i)}\)` projects `\(\log W_i\)` onto the subcomposition of taxa present in sample `\(i\)`.

find `\(\hat R\)` that minimizes

`$$\text{RSS} = \sum_i || P^{(i)} [ \log W_i - X'_i \hat R ] ||^2$$`

???

can solve using some tricks from matrix algebra

---

# metacal::complm( ) demo on Phase 2 data

subset to Lab 1 to speed up the demonstration,

```r
sam.lab1 <- sam %>% filter(Lab == 1)
obs.pseudo.lab1 <- obs.pseudo[sam.lab1$Sample, ]
```

fit with `complm()` with a `formula` interface,

```r
fit.mc <- complm(obs.pseudo.lab1 ~ 0 + Specimen + Protocol, sam.lab1)
```

apply R model generics to the `complm` object,

```r
coef(fit.mc) %>% corner(r = Inf, c = 3) %>% round(3)
```

```
##           mOTU1 mOTU2  mOTU3
## SpecimenA 0.965 0.000 -0.304
## SpecimenB 0.511 1.077  0.357
## ProtocolQ 0.456 0.020 -0.232
## ProtocolW 0.367 0.063 -0.134
```

---

# pros and cons of metacal::complm( )

.pull-left[
pros

- extends the original `metacal` estimator to a flexible regression framework
- relatively simple
- handles zeros
- indirectly accounts for multiplicative random error
- super-fast implementation should be possible
]

.pull-right[
cons

- not a generative model (?)
- requires the user to specify taxa presence
<!-- + maybe ok because reasons -->
- ignores counting noise
- requires pseudocounts for count zeros
- currently slow on big datasets
- only a point estimate
<!-- + possible fixes include bootstrapping or permutation testing -->
- no shrinkage / regularization
]

???
.pull-right[
cons

- requires the user to specify taxa presence
  + maybe ok because reasons
- ignores the count nature of the data **and** requires pseudocounts to deal with zeros
  + could instead use the ALDEx2 approach
- current implementation is slow on big datasets
  + but a faster implementation should be feasible
- only a point estimate
  + possible fixes include bootstrapping or permutation testing
- no shrinkage / regularization
]

---

# nnet::multinom( ) - multinomial logistic regression via neural networks

maximum likelihood estimation of a neural-net classifier

```r
fit.nnet <- nnet::multinom(obs ~ 0 + Specimen + Protocol, sam)
```

```
## # weights:  630 (500 variable)
## initial  value 3639147.373960
## iter  10 value 3170291.741136
## iter  20 value 3157522.534885
## iter  30 value 3138826.782369
## iter  40 value 3089259.774705
## iter  50 value 3049018.610474
## iter  60 value 3025528.398309
## iter  70 value 3015826.458654
## iter  80 value 3013122.607025
## iter  90 value 3003711.712341
## iter 100 value 2996230.167348
## final  value 2996230.167348
## stopped after 100 iterations
```

---

# nnet::multinom( ) - multinomial logistic regression via neural networks

coefficients are relative to mOTU1

```r
coef(fit.nnet) %>% corner(r = 3, c = Inf) %>% round(3)
```

```
##       SpecimenA SpecimenB ProtocolQ ProtocolW
## mOTU2    -3.375     0.235     0.415     1.010
## mOTU3    -2.002    -1.258     0.098     0.476
## mOTU4    -1.153    -3.941     0.345    -0.115
```

---

# nnet::multinom( ) - multinomial logistic regression via neural networks

hypothesis testing using model generics

```r
fit.nnet0 <- nnet::multinom(obs ~ 0 + Specimen + Protocol, sam)
fit.nnet1 <- nnet::multinom(obs ~ 0 + Specimen + Protocol + Lab, sam,
  MaxNWts = 3000)
```

```r
anova(fit.nnet1, fit.nnet0)
```

```
## Likelihood ratio tests of Multinomial Models
## 
## Response: obs
##                           Model Resid. df Resid. Dev   Test    Df LR stat.
## 1       0 + Specimen + Protocol      9750    5992460                      
## 2 0 + Specimen + Protocol + Lab      9375    5955625 1 vs 2   375 36835.61
##   Pr(Chi)
## 1        
## 2       0
```

???
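---

# from nnet coefficients to differential bias

a sketch of recovering differential bias on the efficiency scale (assuming the `fit.nnet` object from the previous slides): each protocol column of the coefficient matrix is log differential bias relative to protocol H, with mOTU1 as the implicit reference taxon

```r
b <- coef(fit.nnet)  # rows are taxa; coefficients are relative to mOTU1

# add back the reference taxon (log bias 0) and exponentiate
bias.Q <- exp(c(mOTU1 = 0, b[, "ProtocolQ"]))  # bias of Q relative to H
bias.W <- exp(c(mOTU1 = 0, b[, "ProtocolW"]))  # bias of W relative to H

head(bias.Q, 4) %>% round(3)
```

like any compositional bias vector, these are defined only up to a constant; pinning to mOTU1 is one arbitrary but convenient choice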
---

# pros and cons of nnet::multinom( )

.pull-left[
pros

- base R solution
- fast
- simple model
- handles zeros without pseudocounts
- maximum likelihood inference
]

.pull-right[
cons

- needs more testing regarding zeros
- doesn't capture multiplicative error
- no shrinkage / regularization
- maximum likelihood inference
]

???

not sure if exact inference is possible, or if the likelihood ratio test requires a large-n assumption

---

# other approaches

- glmnet - multinomial regression with regularization
- multinomial regression with random effects
- Poisson regression with random effects
- Bayesian multinomial logistic-normal models
  + Stan
  + `pibble()` from Justin Silverman's `stray` package
- bootstrapping - hierarchical, or of residuals
- permutation testing (Glen Satten)

???

Roughly, complm models multiplicative error and multinom models counting error. Ideally we'd have both; the random-effects and Bayesian multinomial logistic-normal models can be thought of as modeling both.
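---

# appendix: bootstrap uncertainty for complm( )

one possible fix for the "only a point estimate" con: a hedged, untested sketch of a within-group bootstrap around `complm()`, assuming the `sam.lab1` and `obs.pseudo.lab1` objects from the demo; the resampling scheme shown is one of several reasonable choices

```r
library(dplyr)

boot.coef <- replicate(200, {
  # resample samples with replacement within each Specimen x Protocol group
  idx <- sam.lab1 %>%
    group_by(Specimen, Protocol) %>%
    sample_frac(replace = TRUE) %>%
    pull(Sample)
  boot.sam <- sam.lab1[match(idx, sam.lab1$Sample), ]
  coef(complm(obs.pseudo.lab1[idx, ] ~ 0 + Specimen + Protocol, boot.sam))
}, simplify = FALSE)

# e.g. bootstrap standard errors for the ProtocolQ coefficients:
apply(sapply(boot.coef, function(co) co["ProtocolQ", ]), 1, sd)
```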