Relic DNA is abundant in soil and obscures estimates of soil microbial diversity, bioRxiv, 2016-03-17
Abstract: It is implicitly assumed that the microbial DNA recovered from soil originates from living cells. However, because relic DNA (DNA from dead cells) can persist in soil for weeks to years, it could impact DNA-based analyses of microbial diversity. We examined a wide range of soils and found that, on average, 40% of prokaryotic and fungal DNA was derived from the relic DNA pool. Relic DNA inflated the observed prokaryotic and fungal diversity by as much as 55%, and caused misestimation of taxon abundances, including taxa integral to key ecosystem processes. These findings imply that relic DNA can obscure treatment effects, spatiotemporal patterns, and relationships between taxa and environmental conditions. Moreover, relic DNA may represent a historical record of microbes formerly living in soil.
One Sentence Summary: Soils can harbor substantial amounts of DNA from dead microbial cells; this ‘relic’ DNA inflates estimates of microbial diversity and obscures assessments of community structure.
biorxiv ecology 100-200-users 2016
Topologically associated domains are ancient features that coincide with Metazoan clusters of extreme noncoding conservation, bioRxiv, 2016-03-16
Abstract: In vertebrates and other Metazoa, developmental genes are found surrounded by dense clusters of highly conserved noncoding elements (CNEs). CNEs exhibit extreme levels of sequence conservation of unexplained origin, with many acting as long-range enhancers during development. Clusters of CNEs, termed genomic regulatory blocks (GRBs), define the span of regulatory interactions for many important developmental regulators. The function and genomic distribution of these elements close to important regulatory genes raises the question of how they relate to the 3D conformation of these loci. We show that GRBs, defined using clusters of CNEs, coincide strongly with the patterns of topological organisation in metazoan genomes, predicting the boundaries of topologically associating domains (TADs) at hundreds of loci. The set of TADs that are associated with high levels of non-coding conservation exhibit distinct properties compared to TADs called in chromosomal regions devoid of extreme non-coding conservation. The correspondence between GRBs and TADs suggests that TADs around developmental genes are ancient, slowly evolving genomic structures, many of which have had conserved spans for hundreds of millions of years. This relationship also explains the difference in TAD numbers and sizes between genomes. While the close correspondence between extreme conservation and the boundaries of this subset of TADs does not reveal the mechanism leading to the conservation of these elements, it provides a functional framework for studying the role of TADs in long-range transcriptional regulation.
biorxiv genomics 0-100-users 2016
Computational Pan-Genomics: Status, Promises and Challenges, bioRxiv, 2016-03-13
Abstract: Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic datasets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this paper, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies, and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.
biorxiv genomics 0-100-users 2016
Sensitive red protein calcium indicators for imaging neural activity, bioRxiv, 2016-03-01
Genetically encoded calcium indicators (GECIs) allow measurement of activity in large populations of neurons and in small neuronal compartments, over times of milliseconds to months. Although GFP-based GECIs are widely used for in vivo neurophysiology, GECIs with red-shifted excitation and emission spectra have advantages for in vivo imaging because of reduced scattering and absorption in tissue, and a consequent reduction in phototoxicity. However, current red GECIs are inferior to the state-of-the-art GFP-based GCaMP6 indicators for detecting and quantifying neural activity. Here we present improved red GECIs based on mRuby (jRCaMP1a, b) and mApple (jRGECO1a), with sensitivity comparable to GCaMP6. We characterized the performance of the new red GECIs in cultured neurons and in mouse, Drosophila, zebrafish and C. elegans in vivo. Red GECIs facilitate deep-tissue imaging, dual-color imaging together with GFP-based reporters, and the use of optogenetics in combination with calcium imaging.
biorxiv neuroscience 0-100-users 2016
VcfR: a package to manipulate and visualize VCF format data in R, bioRxiv, 2016-02-27
Abstract: Software to call single nucleotide polymorphisms or related genetic variants has converged on the variant call format (VCF) as the output format of choice. This has created a need for tools to work with VCF files. While an increasing number of software exists to read VCF data, many only extract the genotypes without including the data associated with each genotype that describes its quality. We created the R package vcfR to address this issue. We developed a VCF file exploration tool implemented in the R language because R provides an interactive experience and an environment that is commonly used for genetic data analysis. Functions to read and write VCF files into R as well as functions to extract portions of the data and to plot summary statistics of the data are implemented. VcfR further provides the ability to visualize how various parameterizations of the data affect the results. Additional tools are included to integrate sequence (FASTA) and annotation data (GFF) for visualization of genomic regions such as chromosomes. Conversion functions translate data from the vcfR data structure to formats used by other R genetics packages. Computationally intensive functions are implemented in C++ to improve performance. Use of these tools is intended to facilitate VCF data exploration, including intuitive methods for data quality control and easy export to other R packages for further analysis. VcfR thus provides essential, novel tools currently not available in R.
biorxiv bioinformatics 0-100-users 2016
Model-based projections of Zika virus infections in childbearing women in the Americas, bioRxiv, 2016-02-13
Abstract: Zika virus is a mosquito-borne pathogen that is rapidly spreading across the Americas. Due to associations between Zika virus infection and a range of fetal maladies [1,2], the epidemic trajectory of this viral infection poses a significant concern for the nearly 15 million children born in the Americas each year. Ascertaining the portion of this population that is truly at risk is an important priority. One recent estimate [3] suggested that 5.42 million childbearing women live in areas of the Americas that are suitable for Zika occurrence. To improve on that estimate, which did not take into account the protective effects of herd immunity, we developed a new approach that combines classic results from epidemiological theory with seroprevalence data and highly spatially resolved data about drivers of transmission to make location-specific projections of epidemic attack rates. Our results suggest that 1.65 (1.45–2.06) million childbearing women and 93.4 (81.6–117.1) million people in total could become infected before the first wave of the epidemic concludes. Based on current estimates of rates of adverse fetal outcomes among infected women [2,4,5], these results suggest that tens of thousands of pregnancies could be negatively impacted by the first wave of the epidemic. These projections constitute a revised upper limit of populations at risk in the current Zika epidemic, and our approach offers a new way to make rapid assessments of the threat posed by emerging infectious diseases more generally.
biorxiv ecology 0-100-users 2016
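The Zika abstract above does not say which "classic results from epidemiological theory" it uses. As an illustration only, the sketch below assumes the textbook final-size relation for a homogeneously mixing SIR-type epidemic, A = 1 - exp(-R0 * A), which links the basic reproduction number R0 to the attack rate A and shows how herd immunity ends an epidemic before everyone is infected. The function name `attack_rate` and the example R0 values are hypothetical and are not taken from the paper.

```python
import math


def attack_rate(r0: float, iterations: int = 200) -> float:
    """Solve A = 1 - exp(-r0 * A) for the non-trivial root by bisection.

    Assumption: a closed, homogeneously mixing, fully susceptible
    population (the textbook SIR final-size setting), not the
    location-specific model described in the abstract.
    """
    if r0 <= 1.0:
        # Below the epidemic threshold no large outbreak occurs.
        return 0.0

    def f(a: float) -> float:
        # expm1 keeps precision when r0 * a is small.
        return -math.expm1(-r0 * a) - a

    # The herd-immunity threshold 1 - 1/r0 is a valid lower bracket
    # (f > 0 there), and f(1) = -exp(-r0) < 0 is a valid upper bracket.
    lo, hi = 1.0 - 1.0 / r0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for r0 in (1.5, 2.0, 4.0):  # hypothetical values, for illustration
        print(f"R0={r0}: threshold={1 - 1 / r0:.3f}, attack rate={attack_rate(r0):.3f}")
```

Under these assumptions the attack rate always exceeds the herd-immunity threshold 1 - 1/R0 but stays below 1. That is the sense in which accounting for herd immunity lowers a projection relative to counting everyone who lives in a suitable area.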