Consistent and correctable bias in metagenomic sequencing experiments, bioRxiv, 2019-02-25
Marker-gene and metagenomic sequencing have profoundly expanded our ability to measure biological communities. But the measurements they provide differ from the truth, often dramatically, because these experiments are biased towards detecting some taxa over others. This experimental bias makes the taxon or gene abundances measured by different protocols quantitatively incomparable and can lead to spurious biological conclusions. We propose a mathematical model for how bias distorts community measurements based on the properties of real experiments. We validate this model with 16S rRNA gene and shotgun metagenomics data from defined bacterial communities. Our model better fits the experimental data despite being simpler than previous models. We illustrate how our model can be used to evaluate protocols, to understand the effect of bias on downstream statistical analyses, and to measure and correct bias given suitable calibration controls. These results illuminate new avenues towards truly quantitative and reproducible metagenomics measurements.
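The abstract does not spell out the model, so the following is only a minimal sketch under a stated assumption: that bias multiplies each taxon's true abundance by a taxon-specific efficiency before the measurement is renormalized to proportions. The function names (`apply_bias`, `estimate_efficiency`, `correct`) and the example numbers are hypothetical. The sketch shows how, under that assumption, one calibration control of known composition is enough to recover relative efficiencies and undo the distortion.

```python
def normalize(weights):
    """Scale a dict of non-negative weights so the values sum to 1."""
    total = sum(weights.values())
    return {taxon: w / total for taxon, w in weights.items()}


def apply_bias(actual, efficiency):
    """Observed proportions under the ASSUMED multiplicative bias model."""
    return normalize({t: actual[t] * efficiency[t] for t in actual})


def estimate_efficiency(observed, actual, reference):
    """Efficiencies relative to `reference`, from one control sample.

    Requires every taxon to be present (non-zero) in the control.
    """
    ratios = {t: observed[t] / actual[t] for t in actual}
    return {t: r / ratios[reference] for t, r in ratios.items()}


def correct(observed, efficiency):
    """Divide out the estimated efficiencies and renormalize."""
    return normalize({t: observed[t] / efficiency[t] for t in observed})


# Hypothetical calibration control with known composition.
control_actual = {"A": 0.5, "B": 0.3, "C": 0.2}
true_efficiency = {"A": 1.0, "B": 4.0, "C": 0.5}  # illustrative values only
control_observed = apply_bias(control_actual, true_efficiency)

estimated = estimate_efficiency(control_observed, control_actual, reference="A")
recovered = correct(control_observed, estimated)
```

Only ratios of efficiencies are identifiable from proportions, which is why the estimate is expressed relative to a reference taxon.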
biorxiv bioinformatics 100-200-users 2019
The Linked Selection Signature of Rapid Adaptation in Temporal Genomic Data, bioRxiv, 2019-02-25
Populations can adapt over short, ecological timescales via standing genetic variation. Genomic data collected over tens of generations in both natural and lab populations is increasingly used to find selected loci underpinning such rapid adaptation. Although selection on large effect loci may be detectable in such data, often the fitness differences between individuals have a polygenic architecture, such that selection at any one locus leads to allele frequency changes that are too subtle to distinguish from genetic drift. However, one promising signal comes from the fact that selection on polygenic traits leads to heritable fitness backgrounds that neutral alleles can become stochastically associated with. These associations perturb neutral allele frequency trajectories, creating autocovariance across generations that can be directly measured from temporal genomic data. We develop theory that predicts the magnitude of these temporal autocovariances, showing that it is determined by the level of additive genetic variation, recombination, and linkage disequilibria in a region. Furthermore, by using analytic expressions for the temporal variances and autocovariances in allele frequency, we demonstrate one can estimate the additive genetic variation for fitness and the drift-effective population size from temporal genomic data. Finally, we also show how the proportion of total variation in allele frequency change due to linked selection can be estimated from temporal data. Temporal genomic data offers strong opportunities to identify the role linked selection has on genome-wide diversity over short timescales, and can help bridge population genetic and quantitative genetic studies of adaptation.
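To make "autocovariance across generations" concrete, here is a minimal sketch of measuring it from a table of allele frequencies. It computes the naive empirical covariance, across loci, between allele frequency changes in different time intervals. It is not the authors' estimator: it ignores the sampling-noise corrections that real sequencing data would need, and the function names and input layout (`freqs[t][l]` for timepoint `t` and locus `l`) are assumptions made for illustration.

```python
def frequency_changes(freqs):
    """Per-locus allele frequency change between consecutive timepoints.

    freqs[t][l] is the frequency at timepoint t and locus l.
    Returns changes[t][l] = freqs[t + 1][l] - freqs[t][l].
    """
    return [
        [later - earlier for earlier, later in zip(f0, f1)]
        for f0, f1 in zip(freqs, freqs[1:])
    ]


def temporal_covariance(freqs):
    """Naive covariance matrix of frequency changes, taken across loci.

    Diagonal entries are temporal variances; off-diagonal entries are
    temporal autocovariances between different time intervals.
    """
    changes = frequency_changes(freqs)
    n_loci = len(freqs[0])
    means = [sum(row) / n_loci for row in changes]
    k = len(changes)
    cov = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            cov[i][j] = sum(
                (changes[i][l] - means[i]) * (changes[j][l] - means[j])
                for l in range(n_loci)
            ) / (n_loci - 1)
    return cov


# Toy data: 3 timepoints, 3 loci whose changes persist in direction.
toy_freqs = [
    [0.1, 0.5, 0.9],
    [0.2, 0.5, 0.8],
    [0.3, 0.5, 0.7],
]
toy_cov = temporal_covariance(toy_freqs)
```

In the toy data each locus keeps moving in the same direction, so the off-diagonal entry is positive; under drift alone the expected autocovariance between non-overlapping intervals is zero.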
biorxiv evolutionary-biology 100-200-users 2019
A comparison of three programming languages for a full-fledged next-generation sequencing tool, bioRxiv, 2019-02-23
Background: elPrep is an established multi-threaded framework for preparing SAM and BAM files in sequencing pipelines. To achieve good performance, its software architecture makes only a single pass through a SAM/BAM file for multiple preparation steps, and keeps sequencing data as much as possible in main memory. Similar to other SAM/BAM tools, management of heap memory is a complex task in elPrep, and it became a serious productivity bottleneck in its original implementation language during recent further development of elPrep. We therefore investigated three alternative programming languages: Go and Java using a concurrent, parallel garbage collector on the one hand, and C++17 using reference counting on the other hand for handling large amounts of heap objects. We reimplemented elPrep in all three languages and benchmarked their runtime performance and memory use. Results: The Go implementation performs best, yielding the best balance between runtime performance and memory use. While the Java benchmarks report a somewhat faster runtime than the Go benchmarks, the memory use of the Java runs is significantly higher. The C++17 benchmarks run significantly slower than both Go and Java, while using somewhat more memory than the Go runs. Our analysis shows that concurrent, parallel garbage collection is better at managing a large heap of objects than reference counting in our case. Conclusions: Based on our benchmark results, we selected Go as our new implementation language for elPrep, and recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM data as well.
biorxiv bioinformatics 100-200-users 2019
Cooler: scalable storage for Hi-C data and other genomically-labeled arrays, bioRxiv, 2019-02-23
Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis. We developed a file format called cooler, based on a sparse data model, that can support genomically-labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns, and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium. Cooler is cross-platform, BSD-licensed, and can be installed from the Python Package Index or the bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.
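The storage argument above can be illustrated with a small sketch in plain Python. It shows the general idea of a sparse representation for a symmetric binned contact matrix, keeping only the non-zero upper-triangle entries as `(bin1, bin2, count)` records. This is not the cooler file layout or the cooler library's API; the function names and the toy matrix are hypothetical.

```python
def to_sparse_pixels(dense):
    """Non-zero upper-triangle entries of a symmetric matrix.

    Returns a list of (bin1, bin2, count) tuples with bin1 <= bin2.
    """
    n = len(dense)
    return [
        (i, j, dense[i][j])
        for i in range(n)
        for j in range(i, n)
        if dense[i][j] != 0
    ]


def to_dense(pixels, n):
    """Rebuild the full symmetric n x n matrix from sparse pixels."""
    dense = [[0] * n for _ in range(n)]
    for i, j, count in pixels:
        dense[i][j] = count
        dense[j][i] = count  # mirror into the lower triangle
    return dense


# Toy 4-bin contact map: mostly zeros away from the diagonal.
toy_matrix = [
    [5, 2, 0, 0],
    [2, 7, 1, 0],
    [0, 1, 4, 0],
    [0, 0, 0, 3],
]
toy_pixels = to_sparse_pixels(toy_matrix)
```

Dense storage grows with the square of the number of bins, while the sparse list grows only with the number of non-zero entries, which is the property that makes fine resolutions affordable.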
biorxiv bioinformatics 0-100-users 2019
Crowdfunded whole-genome sequencing of the celebrity cat Lil BUB identifies causal mutations for her osteopetrosis and polydactyly, bioRxiv, 2019-02-23
Rare diseases and their underlying molecular causes are often poorly studied, posing challenges for patient diagnosis and prognosis. The development of next-generation sequencing and its decreasing costs promises to alleviate such issues by supplying personal genomic information at a moderate price. Here, we used crowdfunding as an alternative funding source to sequence the genome of Lil BUB, a celebrity cat affected by rare disease phenotypes characterized by supernumerary digits, osteopetrosis and dwarfism, all phenotypic traits that also occur in human patients. We discovered that Lil BUB is affected by two distinct mutations: a heterozygous mutation in the limb enhancer of the Sonic hedgehog gene, previously associated with polydactyly in Hemingway cats; and a novel homozygous frameshift deletion affecting the TNFRSF11A (RANK) gene, which has been linked to osteopetrosis in humans. We communicated the progress of this project to a large online audience, detailing the 'inner workings' of personalized whole genome sequencing with the aim of improving genetic literacy. Our results highlight the importance of genomic analysis in the identification of disease-causing mutations and support crowdfunding as a means to fund low-budget projects and as a platform for scientific communication.
biorxiv genetics 200-500-users 2019
Genomic analysis reveals a functional role for myocardial trabeculae in adults, bioRxiv, 2019-02-23
Since being first described by Leonardo da Vinci in 1513, it has remained an enigma why the endocardial surfaces of the adult heart retain a complex network of muscular trabeculae - with their persistence thought to be a vestige of embryonic development. For causative physiological inference we harness population genomics, image-based intermediate phenotyping and in silico modelling to determine the effect of this complex cardiovascular trait on function. Using deep learning-based image analysis we identified genetic associations with trabecular complexity in 18,097 UK Biobank participants, which were replicated in an independently measured cohort of 1,129 healthy adults. Genes in these associated regions are enriched for expression in the fetal heart or vasculature and implicate loci associated with haemodynamic phenotypes and developmental pathways. A causal relationship between increasing trabecular complexity and both ventricular performance and electrical activity is supported by complementary biomechanical simulations and Mendelian randomisation studies. These findings show that myocardial trabeculae are a previously-unrecognised determinant of cardiovascular physiology in adult humans.
biorxiv genomics 0-100-users 2019