The Linked Selection Signature of Rapid Adaptation in Temporal Genomic Data, bioRxiv, 2019-02-25

Populations can adapt over short, ecological timescales via standing genetic variation. Genomic data collected over tens of generations in both natural and lab populations is increasingly used to find selected loci underpinning such rapid adaptation. Although selection on large effect loci may be detectable in such data, often the fitness differences between individuals have a polygenic architecture, such that selection at any one locus leads to allele frequency changes that are too subtle to distinguish from genetic drift. However, one promising signal comes from the fact that selection on polygenic traits leads to heritable fitness backgrounds that neutral alleles can become stochastically associated with. These associations perturb neutral allele frequency trajectories, creating autocovariance across generations that can be directly measured from temporal genomic data. We develop theory that predicts the magnitude of these temporal autocovariances, showing that it is determined by the level of additive genetic variation, recombination, and linkage disequilibria in a region. Furthermore, by using analytic expressions for the temporal variances and autocovariances in allele frequency, we demonstrate one can estimate the additive genetic variation for fitness and the drift-effective population size from temporal genomic data. Finally, we also show how the proportion of total variation in allele frequency change due to linked selection can be estimated from temporal data. Temporal genomic data offers strong opportunities to identify the role linked selection has on genome-wide diversity over short timescales, and can help bridge population genetic and quantitative genetic studies of adaptation.
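The temporal autocovariance described in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical input `freqs[t][l]` holding the frequency of locus `l` at timepoint `t`, and computes the naive sample covariance, across loci, between allele frequency changes in different intervals. The function names are made up for illustration. This is not the authors' estimator, and it applies no correction for sampling noise.

```python
def allele_frequency_changes(freqs):
    """Return deltas[t][l] = freqs[t + 1][l] - freqs[t][l] for each interval t."""
    return [
        [later - earlier for earlier, later in zip(freqs[t], freqs[t + 1])]
        for t in range(len(freqs) - 1)
    ]


def covariance(xs, ys):
    """Unbiased sample covariance between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def temporal_covariance_matrix(freqs):
    """Covariance, taken across loci, between frequency changes in intervals t and s.

    Diagonal entries are variances of allele frequency change; off-diagonal
    entries are the temporal autocovariances. (Assumption: no bias correction.)
    """
    deltas = allele_frequency_changes(freqs)
    k = len(deltas)
    return [[covariance(deltas[t], deltas[s]) for s in range(k)] for t in range(k)]
```

Under pure drift the off-diagonal entries would be expected to scatter around zero, so consistently positive values between nearby intervals are the kind of signal the abstract attributes to linked selection.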

biorxiv evolutionary-biology 100-200-users 2019

A comparison of three programming languages for a full-fledged next-generation sequencing tool, bioRxiv, 2019-02-23

Background: elPrep is an established multi-threaded framework for preparing SAM and BAM files in sequencing pipelines. To achieve good performance, its software architecture makes only a single pass through a SAM/BAM file for multiple preparation steps, and keeps sequencing data as much as possible in main memory. Similar to other SAM/BAM tools, management of heap memory is a complex task in elPrep, and it became a serious productivity bottleneck in its original implementation language during recent further development of elPrep. We therefore investigated three alternative programming languages: Go and Java using a concurrent, parallel garbage collector on the one hand, and C++17 using reference counting on the other hand for handling large amounts of heap objects. We reimplemented elPrep in all three languages and benchmarked their runtime performance and memory use. Results: The Go implementation performs best, yielding the best balance between runtime performance and memory use. While the Java benchmarks report a somewhat faster runtime than the Go benchmarks, the memory use of the Java runs is significantly higher. The C++17 benchmarks run significantly slower than both Go and Java, while using somewhat more memory than the Go runs. Our analysis shows that concurrent, parallel garbage collection is better at managing a large heap of objects than reference counting in our case. Conclusions: Based on our benchmark results, we selected Go as our new implementation language for elPrep, and recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM data as well.
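The single-pass idea mentioned in the abstract, where several preparation steps are applied to each record in one traversal of in-memory data, can be sketched roughly as follows. This is a toy illustration and not elPrep's implementation: the step functions and `single_pass` are hypothetical names, records are plain tab-separated SAM lines, and the only SAM detail relied on is that the second field is FLAG and bit 0x4 marks an unmapped read.

```python
def drop_unmapped(record):
    """Return None for reads whose FLAG has the 0x4 (unmapped) bit set."""
    fields = record.split("\t")
    return None if int(fields[1]) & 0x4 else record


def add_read_group(read_group):
    """Build a step that appends an RG:Z optional field (hypothetical helper)."""
    def step(record):
        return f"{record}\tRG:Z:{read_group}"
    return step


def single_pass(records, steps):
    """Apply every step to each record in one traversal, keeping survivors in memory."""
    kept = []
    for record in records:
        for step in steps:
            record = step(record)
            if record is None:  # a step filtered this record out
                break
        else:
            kept.append(record)
    return kept
```

Each record is visited once regardless of how many steps are configured, which is the property the abstract credits for good performance. The cost is that the surviving records all live on the heap at once, which is where the memory-management question in the abstract comes from.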

biorxiv bioinformatics 100-200-users 2019

Cooler: scalable storage for Hi-C data and other genomically-labeled arrays, bioRxiv, 2019-02-23

Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis. We developed a file format called cooler, based on a sparse data model, that can support genomically-labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns, and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium. Cooler is cross-platform, BSD-licensed, and can be installed from the Python Package Index or the bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.
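Why a sparse data model helps with both storage and random access can be seen in a small sketch. The code below is a toy illustration in plain Python and does not use the cooler library or reproduce its file layout. The function names are hypothetical, and the input is assumed to be a symmetric matrix of contact counts over genomic bins.

```python
from bisect import bisect_left


def dense_to_sparse_pixels(matrix):
    """Keep only nonzero upper-triangle entries as sorted (row, col, value) records."""
    pixels = []
    for i, row in enumerate(matrix):
        for j in range(i, len(row)):
            if row[j] != 0:
                pixels.append((i, j, row[j]))
    return pixels


def build_row_offsets(pixels, n_bins):
    """offsets[i] is the position of the first pixel whose row is >= i."""
    rows = [pixel[0] for pixel in pixels]
    return [bisect_left(rows, i) for i in range(n_bins + 1)]


def fetch_row(pixels, offsets, i):
    """Random access to the stored pixels of row i without scanning the whole list."""
    return pixels[offsets[i]:offsets[i + 1]]
```

Storage grows with the number of nonzero pixels, not with the square of the number of bins, and the offset index turns a row lookup into a slice. Dense storage loses both properties as resolution increases.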

biorxiv bioinformatics 0-100-users 2019


Created with the audiences framework by Jedidiah Carlson
