class: center, middle, inverse, title-slide

# callahan lab meeting
## bias inference in natural samples
### michael mclaren
### 2019-10-04

---

# problem: different protocols yield different and inaccurate measurements

<img src="figures/costea-phase3-ordination.svg" width="80%">

---
class: middle

## goal: gain a quantitative understanding of bias and use it to create computational and experimental tools to make metagenomics measurements and inferences accurate and reproducible

---

# what we've done so far

### McLaren, Willis, Callahan 2019, eLife

- proposed a simple model that explains
  - bias: actual -> observed
  - bias: observed(Protocol P) -> observed(Protocol R)
- validated the model on defined (mock) bacterial communities
- described key implications of the model, including how to
  + measure bias from control samples
  + calibrate from control measurements
  + analyze how bias affects downstream analyses

---

# questions we still need to answer

- is bias consistent across natural specimens?
- is bias consistent between natural and artificial specimens?
- is bias consistent across labs using the same protocol?
- is bias consistent among closely related taxa?

# tools we want to develop

"user-friendly" statistical machinery to

- better measure bias for protocol optimization
- evaluate how bias affects downstream analyses
- perform calibration

---

# outline

1. motivating data
2. theory: generative statistical models for differential-bias inference
3. two fitting procedures I'm currently testing
4. where to go from here

???

goal: help figure out the next steps; trying to evaluate the candidate frameworks, pick one to go with for short-term projects, and create a vignette for it. Perhaps say this in the intro.
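---

# aside: bias measurement and calibration in miniature

a toy numerical sketch of "measure bias from control samples" and "calibrate" (all numbers are made up; a single noise-free control, with bias estimated as the observed/actual ratio)

```r
# control sample with known (mock) composition
control.actual <- c(T1 = 1, T2 = 1, T3 = 1) / 3
control.obs    <- c(T1 = 0.278, T2 = 0.667, T3 = 0.056)  # as measured
bias.hat <- control.obs / control.actual  # estimated bias, up to a constant

# calibrate a new measurement of the same taxa
sample.obs <- c(T1 = 0.2, T2 = 0.5, T3 = 0.3)
calibrated <- sample.obs / bias.hat
calibrated / sum(calibrated)  # corrected proportions
```

the arbitrary constant in `bias.hat` cancels in the final renormalization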
---

# Costea et al 2017 Phase 2 dataset

4 labs `\(\times\)` 3 extraction protocols `\(\times\)` 2 fecal specimens (ish)

```r
sam %>% count(Lab, Protocol)
```

```
## # A tibble: 11 x 3
##    Lab   Protocol     n
##    <fct> <fct>    <int>
##  1 1     H            8
##  2 1     Q            9
##  3 1     W            6
##  4 6     H            6
##  5 6     Q            6
##  6 6     W            6
##  7 9     H            6
##  8 9     Q           10
##  9 15    H            6
## 10 15    Q           12
## 11 15    W            7
```

???

ish: lab 9 is only showing protocols H and Q; need to look into that.

---

# Costea et al 2017 Phase 2 dataset

```r
sam %>% filter(Lab == 1) %>% count(Protocol, Specimen)
```

```
## # A tibble: 6 x 3
##   Protocol Specimen     n
##   <fct>    <fct>    <int>
## 1 H        A            4
## 2 H        B            4
## 3 Q        A            6
## 4 Q        B            3
## 5 W        A            3
## 6 W        B            3
```

---

# Costea et al 2017 Phase 2 dataset

## analysis goals

- is bias consistent between Specimen A and Specimen B?
- effect of lab versus protocol?
- is bias consistent among closely related taxa?

---

# (deterministic) model of bias

<!-- <img src="figures/bias-model.svg" width="100%"> -->
<img src="figures/compositional-dependence.svg" width="100%">

???

note that compositional vectors are determined only up to an arbitrary constant; we often want to pin them down, e.g. here we chose relative to taxon 1. But we don't have to, and zeros are ok.

---

# differential bias between protocols

- Protocol `\(P\)` observes `\(O^{(P)} \sim A \cdot B^{(P)}\)`
- Protocol `\(R\)` observes `\(O^{(R)} \sim A \cdot B^{(R)}\)`
- their difference is
$$
O^{(P)} / \: O^{(R)} \sim A \cdot B^{(P)} \: / \: (A \cdot B^{(R)}) = B^{(P)} / \: B^{(R)}
$$
- `\(B^{(P/R)} \equiv B^{(P)} / B^{(R)}\)` is the differential bias of Protocol `\(P\)` relative to Protocol `\(R\)`
- can treat the reference observation as "actual":
$$
O^{(P)} \sim O^{(R)} \cdot B^{(P/R)}
$$

---

# switch to matrix notation

- `\(n\)` samples and `\(Q\)` taxa
- `\(O\)` is an `\(n \times Q\)` matrix
- `\(O_{ij}\)` is the abundance in sample `\(i\)` of taxon `\(j\)`
- `\(O_i\)` is a `\(Q \times 1\)` column vector giving the composition of sample `\(i\)`

???
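---

# differential bias, numerically

a minimal numerical check of the identity above, with made-up efficiencies (illustrative numbers, not estimates): the ratio of two protocols' observations of the same specimen recovers their differential bias, up to the usual arbitrary constant

```r
actual <- c(T1 = 0.5, T2 = 0.3, T3 = 0.2)  # hypothetical specimen
bias.P <- c(T1 = 1.0, T2 = 4.0, T3 = 0.5)  # hypothetical bias of Protocol P
bias.R <- c(T1 = 1.0, T2 = 2.0, T3 = 2.5)  # hypothetical bias of Protocol R
obs.P <- actual * bias.P / sum(actual * bias.P)  # observed proportions, P
obs.R <- actual * bias.R / sum(actual * bias.R)  # observed proportions, R
ratio <- obs.P / obs.R
# after pinning both vectors to T1, the ratio equals B^(P/R):
all.equal(ratio / ratio[["T1"]],
          (bias.P / bias.R) / (bias.P / bias.R)[["T1"]])
#> TRUE
```

note that the actual composition cancels, so any specimen gives the same answer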
---

# adding noise

### random multiplicative error

`\(O_i \sim A_i \cdot B_i \cdot \epsilon_i\)` or `\(\log O_i \sim \log A_i + \log B_i + \log \epsilon_i\)`

`\(\text{cen}[\epsilon_i] \sim 1\)` or `\(E[\log \epsilon_i] \sim 0\)`

### random counting error (multinomial)

read counts `\(W_i\)` such that `\(M_i = \sum_j W_{ij}\)` and `\(O_i\)` normalized to proportions

`\(W_i \mid M_i \sim \text{Multinomial} \left(M_i, O_i \right)\)`

### random counting error (Poisson)

`\(W_{ij} \sim \text{Poisson} \left(M_i O_{ij} \right)\)`

???

Note the overloaded use of `\(\sim\)`.

---

# generalized linear regression formulation

- distinguish between samples (aliquots) and specimens (communities)
- `\(n\)` samples, `\(Q\)` taxa, and `\(p\)` parameters
- `\(X\)` is an `\(n \times p\)` matrix of sample covariates
<!-- - `\(X_i'\)` is a `\(1 \times p\)` row vector of covariates for sample `\(i\)` -->
- `\(X_i\)` is a `\(p \times 1\)` column vector of covariates for sample `\(i\)`
- `\(R\)` is a `\(p \times Q\)` matrix of composition and bias parameters

???

in a differential-bias experiment, we'll use X to encode information about the specimen

---

# regression formulation for differential bias

observed composition of sample `\(i\)` is

$$
\log O_i \sim X_i' R + \log \epsilon_i
$$

where `X = model.matrix(~ 0 + Specimen + Protocol, sam)`

for example, if sample `\(i\)` is an observation of Specimen S2 by Protocol P2,

$$
E[\log O_i] \sim `\begin{bmatrix} 0 & 1 & 0 & 1 & 0 \end{bmatrix}`
`\begin{bmatrix} \text{S1 as measured by P1} \\ \text{S2 as measured by P1} \\ \text{S3 as measured by P1} \\ \text{log bias of P2 to P1} \\ \text{log bias of P3 to P1} \\ \end{bmatrix}`
$$

$$
\qquad \quad = `\begin{bmatrix} \text{S2 as measured by P1} + \text{log bias of P2 to P1} \end{bmatrix}`
$$

???
where `\(X\)` encodes the _specimen_ and _protocol_ information, and `\(R\)` has parameters for the compositions of the specimens as observed by the reference protocol and for the differential bias relative to the reference protocol

---

# metacal::complm( ) - compositional linear regression via least-squares estimation

suppose

$$
P^{(i)} \log W_i = P^{(i)} [ X'_i R + \log \epsilon_i]
$$

where `\(P^{(i)}\)` projects `\(\log W_i\)` onto the subcomposition of taxa present in sample `\(i\)`.

find `\(\hat R\)` that minimizes

`$$\text{RSS} = \sum_i || P^{(i)} [ \log W_i - X'_i \hat R ] ||^2$$`

???

can solve using some tricks from matrix algebra

---

# metacal::complm( ) demo on Phase 2 data

subset to Lab 1 to speed up the demonstration,

```r
sam.lab1 <- sam %>% filter(Lab == 1)
obs.pseudo.lab1 <- obs.pseudo[sam.lab1$Sample, ]
```

fit with `complm()` with a `formula` interface,

```r
fit.mc <- complm(obs.pseudo.lab1 ~ 0 + Specimen + Protocol, sam.lab1)
```

apply R model generics to the `complm` object,

```r
coef(fit.mc) %>% corner(r = Inf, c = 3) %>% round(3)
```

```
##           mOTU1 mOTU2  mOTU3
## SpecimenA 0.965 0.000 -0.304
## SpecimenB 0.511 1.077  0.357
## ProtocolQ 0.456 0.020 -0.232
## ProtocolW 0.367 0.063 -0.134
```

---

# pros and cons of metacal::complm( )

.pull-left[
pros

- extends the original `metacal` estimator to a flexible regression framework
- relatively simple
- handles zeros
- indirectly accounts for multiplicative random error
- super-fast implementation should be possible
]

.pull-right[
cons

- not a generative model (?)
- requires the user to specify taxa presence
<!-- + maybe ok because reasons -->
- ignores counting noise
- requires pseudocounts for count zeros
- currently slow on big datasets
- only a point estimate
<!-- + possible fixes include bootstrapping or permutation testing -->
- no shrinkage / regularization
]

???
.pull-right[
cons

- requires the user to specify taxa presence
  + maybe ok because reasons
- ignores the count nature of the data **and** requires pseudocounts to deal with zeros
  + could instead use the ALDEx2 approach
- current implementation is slow on big datasets
  + but a faster implementation should be feasible
- only a point estimate
  + possible fixes include bootstrapping or permutation testing
- no shrinkage / regularization
]

---

# nnet::multinom( ) - multinomial logistic regression via neural networks

maximum likelihood estimation of a neural-net classifier

```r
fit.nnet <- nnet::multinom(obs ~ 0 + Specimen + Protocol, sam)
```

```
## # weights:  630 (500 variable)
## initial  value 3639147.373960
## iter  10 value 3170291.741136
## iter  20 value 3157522.534885
## iter  30 value 3138826.782369
## iter  40 value 3089259.774705
## iter  50 value 3049018.610474
## iter  60 value 3025528.398309
## iter  70 value 3015826.458654
## iter  80 value 3013122.607025
## iter  90 value 3003711.712341
## iter 100 value 2996230.167348
## final  value 2996230.167348
## stopped after 100 iterations
```

---

# nnet::multinom( ) - multinomial logistic regression via neural networks

coefficients are relative to mOTU1

```r
coef(fit.nnet) %>% corner(r = 3, c = Inf) %>% round(3)
```

```
##       SpecimenA SpecimenB ProtocolQ ProtocolW
## mOTU2    -3.375     0.235     0.415     1.010
## mOTU3    -2.002    -1.258     0.098     0.476
## mOTU4    -1.153    -3.941     0.345    -0.115
```

---

# nnet::multinom( ) - multinomial logistic regression via neural networks

hypothesis testing using model generics

```r
fit.nnet0 <- nnet::multinom(obs ~ 0 + Specimen + Protocol, sam)
fit.nnet1 <- nnet::multinom(obs ~ 0 + Specimen + Protocol + Lab, sam,
  MaxNWts = 3000)
```

```r
anova(fit.nnet1, fit.nnet0)
```

```
## Likelihood ratio tests of Multinomial Models
## 
## Response: obs
##                           Model Resid. df Resid. Dev   Test    Df LR stat.
## 1       0 + Specimen + Protocol      9750    5992460                      
## 2 0 + Specimen + Protocol + Lab      9375    5955625 1 vs 2   375 36835.61
##   Pr(Chi)
## 1        
## 2       0
```

???
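---

# from nnet coefficients to differential bias

a sketch of recovering differential bias on the efficiency scale (assuming the `fit.nnet` object from the previous slides): each protocol column of the coefficient matrix is log differential bias relative to protocol H, with mOTU1 as the implicit reference taxon

```r
b <- coef(fit.nnet)  # rows are taxa; coefficients are relative to mOTU1

# add back the reference taxon (log bias 0) and exponentiate
bias.Q <- exp(c(mOTU1 = 0, b[, "ProtocolQ"]))  # bias of Q relative to H
bias.W <- exp(c(mOTU1 = 0, b[, "ProtocolW"]))  # bias of W relative to H

head(bias.Q, 4) %>% round(3)
```

like any compositional bias vector, these are defined only up to a constant; pinning to mOTU1 is one arbitrary but convenient choice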
---

# pros and cons of nnet::multinom( )

.pull-left[
pros

- base R solution
- fast
- simple model
- handles zeros without pseudocounts
- maximum likelihood inference
]

.pull-right[
cons

- needs more testing regarding zeros
- doesn't capture multiplicative error
- no shrinkage / regularization
- maximum likelihood inference
]

???

not sure if exact inference is possible, or if the likelihood ratio test requires a large-n assumption

---

# other approaches

- glmnet - multinomial regression with regularization
- multinomial regression with random effects
- Poisson regression with random effects
- Bayesian multinomial logistic-normal models
  + Stan
  + `pibble()` from Justin Silverman's `stray` package
- bootstrapping - hierarchical, or of residuals
- permutation testing (Glen Satten)

???

Roughly, complm models multiplicative error and multinom models counting error. Ideally we'd have both; the random-effects and Bayesian multinomial logistic-normal models can be thought of as modeling both.
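---

# appendix: bootstrap uncertainty for complm( )

one possible fix for the "only a point estimate" con: a hedged, untested sketch of a within-group bootstrap around `complm()`, assuming the `sam.lab1` and `obs.pseudo.lab1` objects from the demo; the resampling scheme shown is one of several reasonable choices

```r
library(dplyr)

boot.coef <- replicate(200, {
  # resample samples with replacement within each Specimen x Protocol group
  idx <- sam.lab1 %>%
    group_by(Specimen, Protocol) %>%
    sample_frac(replace = TRUE) %>%
    pull(Sample)
  boot.sam <- sam.lab1[match(idx, sam.lab1$Sample), ]
  coef(complm(obs.pseudo.lab1[idx, ] ~ 0 + Specimen + Protocol, boot.sam))
}, simplify = FALSE)

# e.g. bootstrap standard errors for the ProtocolQ coefficients:
apply(sapply(boot.coef, function(co) co["ProtocolQ", ]), 1, sd)
```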