2 How taxonomic bias affects abundance measurements

Taxonomic bias is fundamentally multiplicative. Species vary in how easily their cells are lysed, and the number of lysed cells for a given species following DNA extraction equals the number of input cells multiplied by the probability of lysis per cell, or lysis efficiency, of the species. Similarly, prokaryotic species vary in their number of 16S genes, such that the numbers of 16S sequencing reads for various species in a 16S amplicon experiment is expected to be proportional to the numbers of 16S genes in their genomes. Nevertheless, the nature of MGS experiments are such that consistent taxonomic bias creates non-proportional error in common forms of relative and absolute abundance measurements—error that varies across samples and hence distort cross-sample comparisons. This section reviews the theoretical results of McLaren, Willis, and Callahan (2019) for how taxonomic bias affects MGS measurements of relative abundance and extends them to include measurements of absolute abundance. We show that only some of the measurement methods currently in use have proportional errors, making these methods intrinsically more robust to bias than the others which do not.

2.1 Model of MGS measurement

Our primary tool for understanding the impact of taxonomic bias on MGS measurement is the theoretical model of MGS measurement developed and empirically validated by McLaren, Willis, and Callahan (2019). This model is the simplest that respects the multiplicative nature of taxonomic bias and the compositional nature of MGS measurements, in which the total read count for a sample is unrelated to its total cell number or density (Gloor et al. (2017)). The model as described by McLaren, Willis, and Callahan (2019) considers only relative abundances; here we present a straightforward extension that allows us to also consider absolute abundances.

We consider a set of microbiome samples measured by a specific MGS protocol that extracts, sequences, and taxonomically assigns reads to a set of microbial species \(S\). We make several simplifying assumptions to facilitate our analysis and presentation. First, we consider only species-level assignment, and suppose that reads that cannot be uniquely assigned to a single species in \(S\) are discarded. Second, we ignore the possibility that reads are misassigned to either the wrong species or sample. Third, we suppose that taxonomic bias acts consistently across samples at the level of species—that is, a given species is always measured more efficiently than another to the same degree, regardless of variation in changes in sub-species-level composition or the sample matrix. Finally, unless otherwise stated, we treat the sequencing measurement as deterministic, ignoring the ‘random’ variation in read counts that arise from the sampling of sequencing reads and other aspects of the MGS process. These assumptions, though unrealistic descriptions of most MGS experiments, serve the purpose of clearly demonstrating when and why consistent taxonomic bias creates errors in DA analysis.

Our model describes the mathematical relationship between the read counts obtained by MGS and the true absolute abundances of each species in a sample. For concreteness, we take absolute abundance, or simply abundance, to refer to cell concentration, or number of cells per unit volume, in a sample. Thus we take ‘absolute’ simply to mean not relative to other species; however, our results also apply were ‘absolute’ taken to mean all cells in a sample or in a sampled ecosystem. Similarly, our results apply to abundance units other than cells, such as biomass and genome copy number. Our model stipulates that the assigned read count of a species \(i\) in a sample \(a\) equals its abundance multiplied by species-specific and sample-specific factors, \[\begin{align} \tag{2.1} \text{reads}_{i}(a) = \text{abun}_{i}(a) \quad \cdot \underbrace{\text{efficiency}_{i}}_{\substack{\text{species specific,} \\ \text{sample independent}}} \cdot \quad \underbrace{\text{effort}(a)}_{\substack{\text{species independent,} \\ \text{sample specific}}}. \end{align}\] The species-specific factor, \(\text{efficiency}_{i}\), equals the relative measurement efficiency (or simply efficiency) of the species—how much more easily that species is measured (converted from cells to assigned reads) relative to a arbitrary fixed reference species (McLaren, Willis, and Callahan (2019)). We assume the efficiencies of particular species are consistent across samples. The variation in efficiency among species corresponds to the taxonomic bias of the MGS protocol. The sample-specific factor, \(\text{effort}(a)\), quantifies the effective experimental effort per unit abundance in measuring the sample. Equation (2.1) implies that the total number of assigned reads in sample \(a\) equals \[\begin{align} \tag{2.2} \text{reads}_S(a) = \text{abun}_S(a) \cdot \text{efficiency}_S(a) \cdot \text{effort}(a), \end{align}\] where \(\text{abun}_{S}(a) \equiv \sum_{j\in S}\text{abun}_j(a)\) is the total abundance of species in \(S\) and \[\begin{align} \tag{2.3} \text{efficiency}_S(a) \equiv \frac{\sum_{j\in S}\text{abun}_j(a) \cdot \text{efficiency}_j}{\text{abun}_S(a)} \end{align}\] is the mean efficiency over all species in \(S\) weighted by their abundance.

2.2 Relative abundance (proportions and ratios)

We distinguish between two types of species-level relative abundances within a sample. The proportion of species \(i\) in sample \(a\) equals its abundance relative to the total abundance of all species in \(S\), \[\begin{align} \tag{2.4} \text{prop}_{i}(a) &\equiv \frac{\text{abun}_i(a)}{\text{abun}_S(a)}. % \\&= \frac{\text{abun}_i(a)}{\sum_{i \in S}\text{abun}_i(a)}. \end{align}\] The ratio between two species \(i\) and \(j\) equals the abundance of \(i\) relative to that of \(j\), \[\begin{align} \tag{2.5} \text{ratio}_{i/j}(a) = \frac{\text{abun}_i(a)}{\text{abun}_j(a)}. \end{align}\] Proportions and ratios each form the basis for popular relative DA methods.

The proportion of a species is typically measured by its proportion of assigned reads, \[\begin{align} \tag{2.6} \widehat{\text{prop}}_{i}(a) = \frac{\text{reads}_i(a)}{\text{reads}_S(a)}. \end{align}\] We can rewrite the right-hand side (using Equations (2.1), (2.2), and (2.6)) to find \[\begin{align} \tag{2.7} \widehat{\text{prop}}_{i}(a) &= \text{prop}_{i}(a) \cdot \underbrace{\frac{\text{efficiency}_{i}}{\text{efficiency}_S(a)}}_{\substack{\text{variable} \\ \text{fold error}}}. \end{align}\] Taxonomic bias creates a multiplicative error in the species’ proportion equal to its efficiency relative to the mean efficiency in the sample. Consequently, the species’s proportion is measured as too high in samples that are dominated by species with lower efficiencies, and measured as too low in samples that are dominated by species with higher efficiencies. This phenomenon is illustrated in two hypothetical communities in Figure 2.1. Species 3 has an efficiency of 6; it is under-measured in Sample 1, which has a mean efficiency of 8.33, but over-measured in Sample 2, which has a mean efficiency of 3.15. A demonstration in artificially constructed (‘mock’) communities of bacterial cells is shown in Figure 3C of McLaren, Willis, and Callahan (2019).

The measured ratio between species \(i\) and \(j\) is given by the ratio of their read counts, \[\begin{align} \tag{2.8} \widehat{\text{ratio}}_{i/j}(a) = \frac{\text{reads}_i(a)}{\text{reads}_j(a)}. \end{align}\] From Equations (2.1) and (2.8), it follows that \[\begin{align} \tag{2.9} \widehat{\text{ratio}}_{i/j}(a) % &= \frac{\text{abun}_{i}(a)}{\text{abun}_{j}(a)} &= \text{ratio}_{i/j}(a) \cdot \underbrace{\frac{\text{efficiency}_{i}}{\text{efficiency}_{j}}}_{\substack{\text{constant} \\ \text{fold error}}}. \end{align}\] Taxonomic bias creates a multiplicative measurement error in the ratio that is equal to the ratio in their efficiencies; the error is therefore constant across samples. For instance, in Figure 2.1, the ratio of Species 3 (with an efficiency of 6) to Species 1 (with an efficiency of 1) is over-estimated by a factor of 6 in both communities despite their varying compositions. A demonstration in artificially constructed (‘mock’) communities of bacterial cells is shown in Figure 3D of McLaren, Willis, and Callahan (2019).

Thus taxonomic bias creates variable fold errors in species proportions, but constant fold errors in species ratios. Consequently, a perfect cancellation of errors can only occur in ratio-based DA analysis.

Taxonomic bias creates sample-dependent multiplicative errors in species proportions, which can lead to inaccurate fold changes between samples. Top row: Error in proportions measured by MGS in two hypothetical microbiome samples that contain different relative abundances of three species. Bottom row: Error in the measured fold-change in the third species that is derived from these measurements. Species’ proportions may be measured as too high or too low depending on sample composition. For instance, Species 3 has an efficiency of 6 and is under-measured in Sample 1 (which has a mean efficiency of 8.33) but over-measured in Sample 2 (which has a mean efficiency of 3.15).

Figure 2.1: Taxonomic bias creates sample-dependent multiplicative errors in species proportions, which can lead to inaccurate fold changes between samples. Top row: Error in proportions measured by MGS in two hypothetical microbiome samples that contain different relative abundances of three species. Bottom row: Error in the measured fold-change in the third species that is derived from these measurements. Species’ proportions may be measured as too high or too low depending on sample composition. For instance, Species 3 has an efficiency of 6 and is under-measured in Sample 1 (which has a mean efficiency of 8.33) but over-measured in Sample 2 (which has a mean efficiency of 3.15).

Higher-order taxa: How does bias affect the proportions and ratios of higher-order taxa, such as a genera, phyla, or even all organisms? More generally, we can identify an higher-order taxon \(I\) with a set of species, \(\{i \in I\}\). The abundance and reads of a taxon \(I\) in a sample \(a\) equal the sums over its constituent species, \(\text{abun}_{I}(a) = \sum_{i \in I}\text{abun}_{i}(a)\) and \(\text{reads}_{I}(a) = \sum_{i \in I}\text{reads}_{i}(a)\). The efficiency of taxon \(I\) equals the weighted average of its constituent species, \[\begin{align} \tag{2.10} \text{efficiency}_I(a) \equiv \frac{\sum_{i\in I}\text{abun}_i(a)\cdot \text{efficiency}_i}{\text{abun}_I(a)}. \end{align}\] Thus the efficiency of a higher-level taxon may vary across samples (McLaren, Willis, and Callahan (2019)). An exception is if the constituent species have the same efficiency or always appear in the same abundances relative to each other, in which case the efficiency of the higher-order taxon would stay constant. As an example, suppose that Species 1 and Species 2 in Figure 2.1 were in the same phylum. The efficiency of the phylum would then be \(\tfrac{1}{2} \cdot 1 + \tfrac{1}{2} \cdot 18 = 9.5\) in Sample 1 and \(\tfrac{15}{16} \cdot 1 + \tfrac{1}{16} \cdot 18 \approx 2.1\) in Sample 2. Equations (2.7) and (2.9) continue to describe the measurement error in proportions and ratios involving higher-order taxa, so long as the sample-dependent taxon efficiency defined by Equation (2.10) is used. In this way, we can see that both proportions and ratios among higher-order taxa may have inconsistent fold errors.

2.3 Absolute abundance

Various methods can be used to convert the relative abundances from MGS into absolute abundances. These methods are subject to taxonomic bias in the MGS measurements and in any supplemental, non-MGS measurements that may be used for the conversion. Appendix [REF] develops a general framework to understand how bias affects absolute-abundance measurements for species and higher-order taxa across these methods. Here we present our main results as they apply to currently-used methods.

2.3.1 Methods that measure total-community abundance

The definition (2.4) for the proportion of a species indicates that we can convert proportions to absolute abundances were we to know the absolute abundance of the total community. Given our measured proportions from MGS and a measure of total abundance from a supplemental, non-MGS measurement, \(\widehat{\text{abun}}_{S}(a)\), one can measure the abundances of individual species by \[\begin{align} \tag{2.11} \widehat{\text{abun}}_{i}(a) &= \widehat{\text{prop}}_{i}(a) \cdot \widehat{\text{abun}}_S(a). \end{align}\] Total-abundance measurements that have recently used for this purpose include counting cells with microscopy (Lloyd et al. (2020)) or flow cytometry (Props et al. (2017),Vandeputte et al. (2017),Galazzo et al. (2020)), measuring the concentration of a marker-gene with qPCR or ddPCR (Zhang et al. (2017),Barlow, Bogatyrev, and Ismagilov (2020), Galazzo et al. (2020),Tettamanti Boshier et al. (2020)), and measuring bulk DNA concentration with a florescence-based DNA quantification method (Contijoch et al. (2019)).

Importantly, these methods of measuring total abundance are themselves subject to taxonomic bias. Flow cytometry may, for example, yield lower cell counts for species whose cells tend to clump together or are prone to lysis during steps involved in sample collection, storage, and preparation. The marker-gene concentration measured by qPCR is affected by variation among species in extraction efficiency, marker-gene copy number, and PCR binding and amplification efficiency. We can easily understand the impact of taxonomic bias on the total-abundance measurement under simplifying assumptions analogous to those in our MGS model. Suppose that each species has an absolute efficiency for the total-abundance measurement that is constant across samples. Neglecting other sources of random and systematic error, the total-abundance measurement equals \[\begin{align} \tag{2.12} \widehat{\text{abun}}_S(a) &= \sum_{i\in S} \text{abun}_i(a) \cdot \text{efficiency}^{\text{tot}}_i \\&= \text{abun}_S(a) \cdot \text{efficiency}^{\text{tot}}_S(a), \end{align}\] where \(\text{efficiency}_{i}^{\text{tot}}(a)\) is the absolute measurement efficiency of species \(i\) for the total-abundance measurement and \(\text{efficiency}^{\text{tot}}_S(a)\) is the mean efficiency of the total-abundance measurement in the sample.

The species abundance measurements are affected by taxonomic bias in both the MGS and total-abundance measurement. We can compute this effect by substituting Equations (2.7) and (2.12) into Equation (2.11), yielding \[\begin{align} \tag{2.13} \widehat{\text{abun}}_{i}(a) = \text{abun}_{i}(a) \cdot \frac{\text{efficiency}_{i} \cdot \text{efficiency}^{\text{tot}}_S(a)}{\text{efficiency}_S(a)}. \end{align}\] Equation (2.13) indicates that the multiplicative error in the measured absolute abundance of a species equals its MGS efficiency relative to the mean MGS efficiency in the sample, multiplied by the mean efficiency of the total measurement. As in the case of proportions (Equation (2.7)), the error depends on sample composition through the two mean efficiency terms and so may vary across samples. On the other hand, if the mean efficiency of the total-abundance measurement mirrors that of the MGS measurement, the two can offset and lead to more stable errors. We discuss how this possibility might be exploited in real experimental workflows in Section 5.

2.3.2 Methods that determine or measure a reference species

The definition (2.5) for the ratio between two species suggests another method to convert relative to absolute abundances: By knowing the ratio of species \(i\) to a reference species \(r\) and the abundance of species \(r\), we can determine the abundance of species \(i\). Thus given an MGS measurement and a measurement \(\widehat{\text{abun}}_{r}(a)\) of the abundance of species \(r\), we can measure the abundance of an arbitrary species \(i\) as \[\begin{align} \tag{2.14} \widehat{\text{abun}}_{i}(a) = \text{reads}_{i}(a) \cdot \frac{\widehat{\text{abun}}_{r}(a)}{\text{reads}_{r}(a)}. \end{align}\] Here we consider some practical applications of this method and how their measurements are affected by bias in the MGS measurement and error in \(\widehat{\text{abun}}_{r}(a)\). Appendix [REF] describes the general case where there is a set \(R\) of multiple reference species; here we restrict our attention to a single reference species \(r\).

This approach has so far been solely or primarily used with reference species that have been determined to have an abundance that is constant across samples. In a spike-in experiment, researchers deliberately add (or ‘spike in’) one or more reference species at a known, constant abundance. For a single spike-in species \(r\), the abundances of other species can be estimated by Equation (2.14). In the absence of spike-ins, researchers may be able to determine one or more native species whose abundances are thought to be constant across samples; we refer to such species as housekeeping species by analogy with the housekeeping genes used for absolute-abundance conversion in studies of gene expression. Housekeeping species can sometimes be identified using prior scientific knowledge; for example, in shotgun sequencing experiments, researchers have used sequencing reads from the plant or animal host as a reference. A related approach involves computationally identifying species that are constant between pairs of samples (David et al. (2014)) or between sample conditions (Mandal et al. (2015), Kumar et al. (2018)) under added statistical assumptions about the existence of such species. The abundance of a housekeeping species is typically unknown; therefore, to estimate the abundances of other species, we simply set \(\widehat{\text{abun}}_{r}(a)\) to 1 in Equation (2.14). The resulting abundance measurements have unknown but fixed units, which is sufficient for measuring fold changes across samples.

Equation (2.14) makes clear that the reference species need not be constant so long as we can directly measure it. This observation suggests a new approach to obtaining species absolute abundances, by combining MGS measurements of the community with targeted measurements of the absolute abundance of one or more individual species. Common targeted measurements include using qPCR or ddPCR to measure the concentration of a marker-gene in the extracted DNA, using primers tailored to a particular species. Other methods that are capable of measuring cells (rather than extracted DNA) include applying ddPCR directly to cells, performing CFU counting on selective media, and using flow cytometry with species-specific florescent probes. The measurements from any of these methods could be used for \(\widehat{\text{abun}}_{r}(a)\) in Equation (2.14) to obtain the absolute abundance for all species. Appendix [REF] describes how using multiple reference species and/or statistical modeling can address the fact that any one native reference species is unlikely to be found in all samples.

Despite the diversity of these methods, each can be represented by Equation \tag{2.14}, making it straightforward to understand the impact of taxonomic bias in the MGS measurement. It follows from Equation (2.9) that the measured abundance of species \(i\) is \[\begin{align} \tag{2.15} \widehat{\text{abun}}_{i}(a) = \text{abun}_{i}(a) \cdot \frac{\text{efficiency}_{i}}{\text{efficiency}_{r}} \cdot \underbrace{\frac{\widehat{\text{abun}_r}(a)}{\text{abun}_r(a)}}_{\substack{\text{fold error in} \\ \text{ref. species}}}. % \cdot \begin{array}{c} \text{fold error in} \\ \widehat{\text{abun}_r}(a) \end{align}\] The error thus consists of two terms: the ratio of the efficiency between species \(i\) and species \(r\), and any error in the abundance of the reference species. The ratio of species efficiencies is constant by assumption; thus the total error is constant if the error in the reference abundance is. Note, if a higher-order taxon is used as a reference—for example, using targeted measurements of a genus or family—then the efficiency of the reference taxon may vary across samples, leading to non-proportional errors.

In practice, researchers commonly use a less-direct but equivalent approach to measure species abundances from a reference such as a spike-in. Set \(S\) to be the species of interest excluding the spike-in species \(r\). In this approach, one measures the total abundance of the community \(S\) from the ratio of non-spike-in to spike-in reads, \[\begin{align} \tag{2.16} \widehat{\text{abun}}_S(a) &= \frac{\text{reads}_S(a) \cdot \widehat{\text{abun}}_R(a)}{\text{reads}_r(a)}. \end{align}\] Then, one uses this measurement \(\widehat{\text{abun}}_{S}(a)\) along with the MGS proportions in Equation (2.11) to determine the abundance of individual species. This calculation yields measurements that are identical to those from directly applying Equation (2.14) (Appendix A.3). This observation might seem paradoxical: Equation (2.15) indicates proportional error for the direct approach, whereas (2.13) suggests non-proportional errors for the indirect approach. This apparent contradiction is resolved by the fact the measurement \(\widehat{\text{abun}}_{S}(a)\) by this method has errors that are proportional to \(\text{efficiency}_S(a)\) and thus exactly offset the proportionality with \(1/\text{efficiency}_S(a)\) in the MGS proportions, such that the measured species abundances have constant fold errors.

2.4 Summary

Taxonomic bias in MGS measurements makes all MGS-based measures of relative and absolute abundance inaccurate. For some measures—but not all—consistent bias at the species level creates proportional errors in species abundances. Of two types of relative abundance, the ratio between a pair of species has proportional errors whereas the proportion of an individual species does not. Instead, the multiplicative error in a species’ proportion varies with the mean efficiency across samples. Methods for measuring species absolute abundance that involve measuring the abundance of the total community likewise experience non-proportional errors which vary with the sample mean efficiency; a notable exception is if the mean efficiency of the total-abundance measurement mirrors that of the MGS measurement. Absolute-abundance methods that instead involve measuring or spiking individual reference species yield proportional errors, provided that the abundance of the reference species can be determined up to a proportional error. The efficiency of higher-order taxa may vary across samples even if the efficiencies of their constituent species are constant; as a result, all standard abundance measures can have non-proportional errors at taxonomic resolutions higher than that at which bias is conserved.

References

Barlow, Jacob T., Said R. Bogatyrev, and Rustem F. Ismagilov. 2020. A quantitative sequencing framework for absolute abundance measurements of mucosal and lumenal microbial communities.” Nat. Commun. 11 (1): 1–13. https://doi.org/10.1038/s41467-020-16224-6.
Contijoch, Eduardo J, Graham J Britton, Chao Yang, Ilaria Mogno, Zhihua Li, Ruby Ng, Sean R Llewellyn, et al. 2019. Gut microbiota density influences host physiology and is shaped by host and microbial factors.” Elife 8 (January). https://doi.org/10.7554/eLife.40553.
David, Lawrence A, Arne C Materna, Jonathan Friedman, Maria I Campos-Baptista, Matthew C Blackburn, Allison Perrotta, Susan E Erdman, and Eric J Alm. 2014. Host lifestyle affects human microbiota on daily timescales.” Genome Biol. 15 (7): R89. https://doi.org/10.1186/gb-2014-15-7-r89.
Galazzo, Gianluca, Niels van Best, Birke J. Benedikter, Kevin Janssen, Liene Bervoets, Christel Driessen, Melissa Oomen, et al. 2020. How to Count Our Microbes? The Effect of Different Quantitative Microbiome Profiling Approaches.” Front. Cell. Infect. Microbiol. 10 (August). https://doi.org/10.3389/fcimb.2020.00403.
Gloor, Gregory B., Jean M. Macklaim, Vera Pawlowsky-Glahn, and Juan J. Egozcue. 2017. Microbiome Datasets Are Compositional: And This Is Not Optional.” Front. Microbiol. 8 (November): 2224. https://doi.org/10.3389/fmicb.2017.02224.
Kumar, M. Senthil, Eric V. Slud, Kwame Okrah, Stephanie C. Hicks, Sridhar Hannenhalli, and Héctor Corrada Bravo. 2018. Analysis and correction of compositional bias in sparse sequencing count data.” BMC Genomics 19 (1): 799. https://doi.org/10.1186/s12864-018-5160-5.
Lloyd, Karen G., Jordan T. Bird, Joy Buongiorno, Emily Deas, Richard Kevorkian, Talor Noordhoek, Jacob Rosalsky, and Taylor Roy. 2020. Evidence for a Growth Zone for Deep-Subsurface Microbial Clades in Near-Surface Anoxic Sediments.” Appl. Environ. Microbiol. 86 (19): 1–15. https://doi.org/10.1128/AEM.00877-20.
Mandal, Siddhartha, Will Van Treuren, Richard A. White, Merete Eggesbø, Rob Knight, and Shyamal D. Peddada. 2015. Analysis of composition of microbiomes: a novel method for studying microbial composition.” Microb. Ecol. Heal. Dis. 26 (1): 27663. https://doi.org/10.3402/mehd.v26.27663.
McLaren, Michael R, Amy D Willis, and Benjamin J Callahan. 2019. Consistent and correctable bias in metagenomic sequencing experiments.” Elife 8 (September): 46923. https://doi.org/10.7554/eLife.46923.
Props, Ruben, Frederiek-Maarten Kerckhof, Peter Rubbens, Jo De Vrieze, Emma Hernandez Sanabria, Willem Waegeman, Pieter Monsieurs, Frederik Hammes, and Nico Boon. 2017. Absolute quantification of microbial taxon abundances.” ISME J. 11 (2): 584–87. https://doi.org/10.1038/ismej.2016.117.
Tettamanti Boshier, Florencia A., Sujatha Srinivasan, Anthony Lopez, Noah G. Hoffman, Sean Proll, David N. Fredricks, and Joshua T. Schiffer. 2020. Complementing 16S rRNA Gene Amplicon Sequencing with Total Bacterial Load To Infer Absolute Species Concentrations in the Vaginal Microbiome.” mSystems 5 (2): 1–14. https://doi.org/10.1128/mSystems.00777-19.
Vandeputte, Doris, Gunter Kathagen, Kevin D’hoe, Sara Vieira-Silva, Mireia Valles-Colomer, João Sabino, Jun Wang, et al. 2017. Quantitative microbiome profiling links gut community variation to microbial load.” Nature 551 (7681): 507. https://doi.org/10.1038/nature24460.
Zhang, Zhaojing, Yuanyuan Qu, Shuzhen Li, Kai Feng, Shang Wang, Weiwei Cai, Yuting Liang, et al. 2017. Soil bacterial quantification approaches coupling with relative abundances reflecting the changes of taxa.” Sci. Rep. 7 (1): 1–11. https://doi.org/10.1038/s41598-017-05260-w.