Genome statistics

NCBI genome lengths and 16S copy numbers

## # A tibble: 9 x 4
##   ncbi_species         ncbi_ssu_count ncbi_total_length ncbi_refseq_catego…
##   <chr>                         <dbl>             <dbl> <chr>              
## 1 Atopobium vaginae                 1           1449613 representative gen…
## 2 Atopobium vaginae                 1           1430526 representative gen…
## 3 Gardnerella vaginal…              2           1667350 reference genome   
## 4 Gardnerella vaginal…              2           1617545 representative gen…
## 5 Lactobacillus crisp…              4           2043161 representative gen…
## 6 Lactobacillus iners               1           1277649 representative gen…
## 7 Prevotella bivia                  4           2521238 representative gen…
## 8 Sneathia amnii                    3           1330224 representative gen…
## 9 Streptococcus agala…              7           2160267 reference genome

An NCBI annotation of a single 16S copy may simply mean that the genome assembly did not resolve the separate 16S copies, so A. vaginae and L. iners in particular warrant further investigation.

rrnDB: 16S copy number

## # A tibble: 3 x 6
##   ncbi_species                 n  mean median   min   max
##   <chr>                    <int> <dbl>  <dbl> <dbl> <dbl>
## 1 Gardnerella vaginalis        5  2         2     2     2
## 2 Sneathia amnii               1  3         3     3     3
## 3 Streptococcus agalactiae    57  6.82      7     5     8

These look like reliable numbers for these three species, and they agree with the NCBI annotations.
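The agreement can be checked directly by joining the two tables on species. The following is a minimal sketch in base R, using only the values printed above; the data frame names (`ncbi`, `rrndb`, `comparison`) are illustrative and not taken from the original analysis code.

```r
# NCBI SSU counts as printed in the NCBI table above
# (both G. vaginalis assemblies report 2, so one row is kept per species).
ncbi <- data.frame(
  ncbi_species = c("Gardnerella vaginalis", "Sneathia amnii",
                   "Streptococcus agalactiae"),
  ncbi_ssu_count = c(2, 3, 7)
)

# rrnDB medians as printed in the rrnDB summary above.
rrndb <- data.frame(
  ncbi_species = c("Gardnerella vaginalis", "Sneathia amnii",
                   "Streptococcus agalactiae"),
  rrndb_median = c(2, 3, 7)
)

# Join on species and flag whether the two sources agree.
comparison <- merge(ncbi, rrndb, by = "ncbi_species")
comparison$agree <- comparison$ncbi_ssu_count == comparison$rrndb_median
print(comparison)
```

Comparing against the median rather than the mean keeps the check robust to the few S. agalactiae records with 5 or 8 copies.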

Yuan2012: 16S copy numbers

Yuan S, Cohen DB, Ravel J, Abdo Z, Forney LJ. 2012. Evaluation of Methods for the Extraction and Purification of DNA from the Human Microbiome. PLoS One 7:e33865.

They determined copy numbers for Atopobium vaginae and Lactobacillus iners by pulsed-field gel electrophoresis and found:

Species              16S CN
Atopobium vaginae    2
Lactobacillus iners  5

(see their Table 4 and Methods)

Check relatives in the rrnDB

## # A tibble: 2 x 2
## # Groups:   is.na(ncbi_species) [2]
##   `is.na(ncbi_species)`      n
##   <lgl>                  <int>
## 1 FALSE                 101078
## 2 TRUE                   24165

Tree restricted to tips that have an NCBI species:

Lactobacillus

Get the tree for the clade descending from the MRCA of L. crispatus and L. iners.

Let’s take a look at where L. iners and L. crispatus fall on the tree:

These groupings of L. iners and L. crispatus agree with those of Duar2017 (Figure 2).

Next, we will look for nearby species in the rrnDB. For L. iners, let’s define an “L. iners group” consisting of all species descending from the MRCA of L. iners and L. gasseri:

## [1] "Lactobacillus iners"       "Lactobacillus hominis"    
## [3] "Lactobacillus taiwanensis" "Lactobacillus johnsonii"  
## [5] "Lactobacillus gasseri"
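Defining a group as all descendants of an MRCA can be sketched with the `ape` package. This is an illustration under stated assumptions, not the code behind the output above: the toy Newick tree and its tip labels are made up, and the real analysis uses a much larger tree.

```r
library(ape)

# Hypothetical toy tree; topology and labels are for illustration only.
toy_tree <- read.tree(
  text = "(((L_iners,(L_gasseri,L_johnsonii)),(L_crispatus,L_acidophilus)),Outgroup);"
)

# Node number of the most recent common ancestor of the two anchor tips.
mrca_node <- getMRCA(toy_tree, c("L_iners", "L_gasseri"))

# Subtree containing every descendant of that node.
iners_group <- extract.clade(toy_tree, mrca_node)

# Tip labels of the subtree are the members of the group.
print(iners_group$tip.label)
```

The same two calls, with different anchor tips, would yield the other groups defined in this section.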

Check the copy numbers of these species in the rrnDB:

## # A tibble: 12 x 3
##    ncbi_scientific_name          x16s_gene_count evidence                  
##    <chr>                                   <dbl> <chr>                     
##  1 Lactobacillus gasseri                       4 Machine processing of NCB…
##  2 Lactobacillus gasseri ATCC 3…               6 Machine processing of NCB…
##  3 Lactobacillus gasseri DSM 14…               6 Machine processing of NCB…
##  4 Lactobacillus johnsonii                     7 Machine processing of NCB…
##  5 Lactobacillus johnsonii                     7 Machine processing of NCB…
##  6 Lactobacillus johnsonii                     7 Machine processing of NCB…
##  7 Lactobacillus johnsonii                     7 Machine processing of NCB…
##  8 Lactobacillus johnsonii                     7 Machine processing of NCB…
##  9 Lactobacillus johnsonii DPC …               4 Machine processing of NCB…
## 10 Lactobacillus johnsonii FI97…               4 Machine processing of NCB…
## 11 Lactobacillus johnsonii N6.2                4 Machine processing of NCB…
## 12 Lactobacillus johnsonii NCC …               6 Machine processing of NCB…
## # A tibble: 1 x 5
##       n  mean median   min   max
##   <int> <dbl>  <dbl> <dbl> <dbl>
## 1    12  5.75      6     4     7

These numbers are consistent with the copy number of 5 found in Yuan2012. Given that L. iners is quite distant from its relatives, I will go with the estimate of 5 for L. iners determined experimentally by Yuan2012.
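The summary row for the L. iners group can be reproduced from the twelve counts listed in the table. A minimal base R sketch, with the values typed in by hand in row order (the variable names are illustrative):

```r
# 16S gene counts for the 12 rrnDB records of the L. iners group, in row order.
cn <- c(4, 6, 6, 7, 7, 7, 7, 7, 4, 4, 4, 6)

# Same summary statistics as in the printed tibble.
cn_summary <- data.frame(
  n      = length(cn),
  mean   = mean(cn),
  median = median(cn),
  min    = min(cn),
  max    = max(cn)
)
print(cn_summary)
```

The experimentally determined value of 5 lies within the observed range of 4 to 7, although none of the listed records has exactly 5 copies.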

Now let’s do the same for L. crispatus, defining its group somewhat broadly to include all descendants of the MRCA of L. crispatus and L. acidophilus:

## [1] "Lactobacillus acidophilus"     "Lactobacillus gallinarum"     
## [3] "Lactobacillus helveticus"      "Lactobacillus crispatus"      
## [5] "Lactobacillus ultunensis"      "Lactobacillus kitasatonis"    
## [7] "Lactobacillus amylovorus"      "Lactobacillus kefiranofaciens"

Check the copy numbers of these species in the rrnDB:

## # A tibble: 26 x 3
##    ncbi_scientific_name    x16s_gene_count evidence                        
##    <chr>                             <dbl> <chr>                           
##  1 Lactobacillus acidophi…               4 Machine processing of NCBI geno…
##  2 Lactobacillus acidophi…               4 Machine processing of NCBI geno…
##  3 Lactobacillus acidophi…               4 Machine processing of NCBI geno…
##  4 Lactobacillus acidophi…               4 Machine processing of NCBI geno…
##  5 Lactobacillus acidophi…               4 Machine processing of NCBI geno…
##  6 Lactobacillus acidophi…               4 Machine processing of NCBI geno…
##  7 Lactobacillus amylovor…               4 Machine processing of NCBI geno…
##  8 Lactobacillus amylovor…               5 Machine processing of NCBI geno…
##  9 Lactobacillus amylovor…               4 Machine processing of NCBI geno…
## 10 Lactobacillus gallinar…               4 estimated based on Southern hyb…
## 11 Lactobacillus gallinar…               4 estimated based on Southern hyb…
## 12 Lactobacillus gallinar…               5 Machine processing of NCBI geno…
## 13 Lactobacillus helvetic…               4 Machine processing of NCBI geno…
## 14 Lactobacillus helvetic…               4 Machine processing of NCBI geno…
## 15 Lactobacillus helvetic…               5 Machine processing of NCBI geno…
## 16 Lactobacillus helvetic…               5 Machine processing of NCBI geno…
## 17 Lactobacillus helvetic…               5 Machine processing of NCBI geno…
## 18 Lactobacillus helvetic…               4 Machine processing of NCBI geno…
## 19 Lactobacillus helvetic…               4 Machine processing of NCBI geno…
## 20 Lactobacillus helvetic…               4 Machine processing of NCBI geno…
## 21 Lactobacillus helvetic…               4 Machine processing of NCBI geno…
## 22 Lactobacillus helvetic…               4 Machine processing of NCBI geno…
## 23 Lactobacillus helvetic…               4 Machine processing of NCBI geno…
## 24 Lactobacillus helvetic…               4 Machine processing of NCBI geno…
## 25 Lactobacillus helvetic…               4 Machine processing of NCBI geno…
## 26 Lactobacillus kefirano…               4 Machine processing of NCBI geno…
## # A tibble: 1 x 5
##       n  mean median   min   max
##   <int> <dbl>  <dbl> <dbl> <dbl>
## 1    26  4.19      4     4     5

These numbers are consistent with the estimate of 4 from the NCBI genome annotation.
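How strongly the group supports a value of 4 is easier to see from a frequency table than from the mean. A short base R sketch using the 26 counts from the table above, entered by hand in row order (the variable names are illustrative):

```r
# 16S gene counts for the 26 rrnDB records of the L. crispatus group, in row order.
cn <- c(4, 4, 4, 4, 4, 4,          # rows 1-6
        4, 5, 4,                   # rows 7-9
        4, 4, 5,                   # rows 10-12
        4, 4, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,  # rows 13-25
        4)                         # row 26

# Number of records at each copy number.
cn_counts <- table(cn)
print(cn_counts)

# Fraction of records with exactly 4 copies.
frac_four <- mean(cn == 4)
print(frac_four)
```

Of the 26 records, 21 report 4 copies and 5 report 5 copies.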

Prevotella bivia

Tree of the clade containing all “Prevotella” NCBI genomes:

Check where P. bivia falls:

So we can use the MRCA of P. bivia and P. melaninogenica to get a set of related species to query the rrnDB.

##  [1] "Prevotella bivia"          "Prevotella amnii"         
##  [3] "Prevotella histicola"      "Prevotella multiformis"   
##  [5] "Prevotella denticola"      "Prevotella fusca"         
##  [7] "Prevotella melaninogenica" "Prevotella jejuni"        
##  [9] "Prevotella scopos"         "Prevotella veroralis"

Check the copy numbers of these species in the rrnDB:

## # A tibble: 6 x 3
##   ncbi_scientific_name        x16s_gene_count evidence                     
##   <chr>                                 <dbl> <chr>                        
## 1 Prevotella denticola                      4 Machine processing of NCBI g…
## 2 Prevotella denticola F0289                4 Machine processing of NCBI g…
## 3 Prevotella fusca JCM 17724                4 Machine processing of NCBI g…
## 4 Prevotella jejuni                         4 Machine processing of NCBI g…
## 5 Prevotella melaninogenica                 4 Machine processing of NCBI g…
## 6 Prevotella melaninogenica …               4 Machine processing of NCBI g…
## # A tibble: 1 x 5
##       n  mean median   min   max
##   <int> <dbl>  <dbl> <dbl> <dbl>
## 1     6     4      4     4     4

These numbers are consistent with the estimate of 4 from the RefSeq genome.

Atopobium vaginae

Tree of the clade containing all “Atopobium” NCBI genomes:

Check where A. vaginae falls:

Check for any of these three genera in the rrnDB:

## # A tibble: 4 x 3
##   ncbi_scientific_name     x16s_gene_count evidence                        
##   <chr>                              <dbl> <chr>                           
## 1 Atopobium parvulum DSM …               1 Machine processing of NCBI geno…
## 2 Olsenella sp. Marseille…               2 Machine processing of NCBI geno…
## 3 Olsenella sp. oral taxo…               2 Machine processing of NCBI geno…
## 4 Olsenella uli DSM 7084                 1 Machine processing of NCBI geno…

These results suggest that a value of 1 or 2 is reasonable, but they do not clearly favor either. Since the value of 1 in the NCBI annotation for A. vaginae could be an assembly or annotation error, I will go with the larger value of 2 found by Yuan2012.

Final table

## # A tibble: 7 x 3
##   Taxon                    Genome_size Copy_number
##   <chr>                          <dbl>       <dbl>
## 1 Atopobium_vaginae           1440070.           2
## 2 Gardnerella_vaginalis       1642448.           2
## 3 Lactobacillus_crispatus     2043161            4
## 4 Lactobacillus_iners         1277649            5
## 5 Prevotella_bivia            2521238            4
## 6 Sneathia_amnii              1330224            3
## 7 Streptococcus_agalactiae    2160267            7
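The fractional genome sizes for A. vaginae and G. vaginalis are consistent with averaging the total lengths of the two NCBI assemblies listed for each species. The sketch below shows that calculation in base R; it is an inference from the printed numbers rather than the original code, and the data frame and column names are illustrative.

```r
# Total lengths of the two assemblies per species, from the NCBI table above.
assemblies <- data.frame(
  species = c("Atopobium vaginae", "Atopobium vaginae",
              "Gardnerella vaginalis", "Gardnerella vaginalis"),
  total_length = c(1449613, 1430526, 1667350, 1617545)
)

# Mean genome length per species.
genome_size <- aggregate(total_length ~ species, data = assemblies, FUN = mean)
print(genome_size)
```

The means are 1440069.5 and 1642447.5, which match the tabulated 1440070 and 1642448 after rounding for display.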

Save for use in the main analysis:

## ✔ Setting active project to '/home/michael/Dropbox/research/2019-bias-manuscript'
## ✔ Saving 'brooks2015_species_info' to 'data/brooks2015_species_info.rda'

Also make a LaTeX version for use as a supplemental table.