Topologically associated domains are ancient features that coincide with Metazoan clusters of extreme noncoding conservation, bioRxiv, 2016-03-16
Abstract: In vertebrates and other Metazoa, developmental genes are found surrounded by dense clusters of highly conserved noncoding elements (CNEs). CNEs exhibit extreme levels of sequence conservation of unexplained origin, with many acting as long-range enhancers during development. Clusters of CNEs, termed genomic regulatory blocks (GRBs), define the span of regulatory interactions for many important developmental regulators. The function and genomic distribution of these elements close to important regulatory genes raise the question of how they relate to the 3D conformation of these loci. We show that GRBs, defined using clusters of CNEs, coincide strongly with the patterns of topological organisation in metazoan genomes, predicting the boundaries of topologically associating domains (TADs) at hundreds of loci. The set of TADs that are associated with high levels of non-coding conservation exhibits distinct properties compared to TADs called in chromosomal regions devoid of extreme non-coding conservation. The correspondence between GRBs and TADs suggests that TADs around developmental genes are ancient, slowly evolving genomic structures, many of which have had conserved spans for hundreds of millions of years. This relationship also explains the difference in TAD numbers and sizes between genomes. While the close correspondence between extreme conservation and the boundaries of this subset of TADs does not reveal the mechanism leading to the conservation of these elements, it provides a functional framework for studying the role of TADs in long-range transcriptional regulation.
biorxiv genomics 0-100-users 2016
Computational Pan-Genomics: Status, Promises and Challenges, bioRxiv, 2016-03-13
Abstract: Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic datasets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this paper, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies, and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.
biorxiv genomics 0-100-users 2016
Sensitive red protein calcium indicators for imaging neural activity, bioRxiv, 2016-03-01
Genetically encoded calcium indicators (GECIs) allow measurement of activity in large populations of neurons and in small neuronal compartments, over times of milliseconds to months. Although GFP-based GECIs are widely used for in vivo neurophysiology, GECIs with red-shifted excitation and emission spectra have advantages for in vivo imaging because of reduced scattering and absorption in tissue, and a consequent reduction in phototoxicity. However, current red GECIs are inferior to the state-of-the-art GFP-based GCaMP6 indicators for detecting and quantifying neural activity. Here we present improved red GECIs based on mRuby (jRCaMP1a, b) and mApple (jRGECO1a), with sensitivity comparable to GCaMP6. We characterized the performance of the new red GECIs in cultured neurons and in mouse, Drosophila, zebrafish and C. elegans in vivo. Red GECIs facilitate deep-tissue imaging, dual-color imaging together with GFP-based reporters, and the use of optogenetics in combination with calcium imaging.
biorxiv neuroscience 0-100-users 2016
VcfR: a package to manipulate and visualize VCF format data in R, bioRxiv, 2016-02-27
Abstract: Software to call single nucleotide polymorphisms or related genetic variants has converged on the variant call format (VCF) as the output format of choice. This has created a need for tools to work with VCF files. While an increasing number of software tools exist to read VCF data, many only extract the genotypes without including the data associated with each genotype that describes its quality. We created the R package vcfR to address this issue. We developed a VCF file exploration tool implemented in the R language because R provides an interactive experience and an environment that is commonly used for genetic data analysis. Functions to read and write VCF files into R as well as functions to extract portions of the data and to plot summary statistics of the data are implemented. VcfR further provides the ability to visualize how various parameterizations of the data affect the results. Additional tools are included to integrate sequence (FASTA) and annotation data (GFF) for visualization of genomic regions such as chromosomes. Conversion functions translate data from the vcfR data structure to formats used by other R genetics packages. Computationally intensive functions are implemented in C++ to improve performance. Use of these tools is intended to facilitate VCF data exploration, including intuitive methods for data quality control and easy export to other R packages for further analysis. VcfR thus provides essential, novel tools currently not available in R.
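To make the point about per-genotype quality data concrete, here is a minimal Python sketch (not the vcfR API, which is an R package) of how one VCF data line carries both genotypes and the quality fields attached to them. The function name `parse_vcf_record`, the sample names, and the example record are hypothetical; the column layout (eight fixed fields, then FORMAT, then one column per sample) follows the VCF convention.

```python
def parse_vcf_record(line, sample_names):
    """Split one tab-delimited VCF data line into fixed fields and per-sample data.

    Assumes the line has a FORMAT column (column 9) followed by one column
    per sample, as in a multi-sample VCF file.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    format_keys = fields[8].split(":")  # e.g. ["GT", "GQ", "DP"]
    samples = {}
    for name, raw in zip(sample_names, fields[9:]):
        # Pair each FORMAT key with this sample's colon-separated value.
        samples[name] = dict(zip(format_keys, raw.split(":")))
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "REF": ref,
        "ALT": alt.split(","),
        "samples": samples,
    }


# Hypothetical record: two samples with genotype (GT), genotype quality (GQ), depth (DP).
line = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=17\tGT:GQ:DP\t0/1:48:14\t1/1:12:3"
record = parse_vcf_record(line, ["s1", "s2"])

# Flag genotypes whose quality falls below an illustrative threshold of 20.
# Real files may use "." for missing values, which this sketch does not handle.
low_quality = [name for name, s in record["samples"].items() if int(s["GQ"]) < 20]
```

A reader that kept only the GT values would lose the GQ and DP fields used in the last step, which is the gap the abstract describes.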
biorxiv bioinformatics 0-100-users 2016
Model-based projections of Zika virus infections in childbearing women in the Americas, bioRxiv, 2016-02-13
Abstract: Zika virus is a mosquito-borne pathogen that is rapidly spreading across the Americas. Due to associations between Zika virus infection and a range of fetal maladies [1,2], the epidemic trajectory of this viral infection poses a significant concern for the nearly 15 million children born in the Americas each year. Ascertaining the portion of this population that is truly at risk is an important priority. One recent estimate [3] suggested that 5.42 million childbearing women live in areas of the Americas that are suitable for Zika occurrence. To improve on that estimate, which did not take into account the protective effects of herd immunity, we developed a new approach that combines classic results from epidemiological theory with seroprevalence data and highly spatially resolved data about drivers of transmission to make location-specific projections of epidemic attack rates. Our results suggest that 1.65 (1.45–2.06) million childbearing women and 93.4 (81.6–117.1) million people in total could become infected before the first wave of the epidemic concludes. Based on current estimates of rates of adverse fetal outcomes among infected women [2,4,5], these results suggest that tens of thousands of pregnancies could be negatively impacted by the first wave of the epidemic. These projections constitute a revised upper limit of populations at risk in the current Zika epidemic, and our approach offers a new way to make rapid assessments of the threat posed by emerging infectious diseases more generally.
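One classic result linking transmission intensity to the attack rate is the final-size relation of a homogeneously mixing epidemic model, A = 1 - exp(-R0 * A), where A is the fraction eventually infected and R0 is the basic reproduction number. The abstract does not state which results the authors used, so treating this relation as representative is an assumption, and the sketch below ignores the spatial and seroprevalence data their method combines. The function name `final_size` is hypothetical.

```python
import math


def final_size(r0, tol=1e-12, max_iter=10000):
    """Solve A = 1 - exp(-r0 * A) for the attack rate A by fixed-point iteration.

    Assumes a single, well-mixed, fully susceptible population.
    Returns 0.0 when r0 <= 1, where no large epidemic occurs.
    """
    if r0 <= 1.0:
        return 0.0
    a = 0.5  # any starting value in (0, 1] converges to the nonzero root
    for _ in range(max_iter):
        a_next = 1.0 - math.exp(-r0 * a)
        if abs(a_next - a) < tol:
            return a_next
        a = a_next
    return a  # best estimate if the tolerance was not reached


# Illustrative values only; these are not estimates from the paper.
attack_rate = final_size(2.0)
```

Because the attack rate stays below 1 for any finite R0, some of the population escapes infection, which is the herd-immunity effect that the earlier at-risk estimate did not account for.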
biorxiv ecology 0-100-users 2016
Shannon: An Information-Optimal de Novo RNA-Seq Assembler, bioRxiv, 2016-02-10
De novo assembly of short RNA-Seq reads into transcripts is challenging due to sequence similarities in transcriptomes arising from gene duplications and alternative splicing of transcripts. We present Shannon, an RNA-Seq assembler with an optimality guarantee derived from principles of information theory: Shannon reconstructs nearly all information-theoretically reconstructable transcripts. Shannon is based on a theory we develop for de novo RNA-Seq assembly that reveals differing abundances among transcripts to be the key, rather than the barrier, to effective assembly. The assembly problem is formulated as a sparsest-flow problem on a transcript graph, and the heart of Shannon is a novel iterative flow-decomposition algorithm. This algorithm provably solves the information-theoretically reconstructable instances in linear time even though the general sparsest-flow problem is NP-hard. Shannon also incorporates several additional new algorithmic advances: a new error-correction algorithm based on successive cancelation, a multi-bridging algorithm that carefully utilizes read information in the k-mer de Bruijn graph, and an approximate graph partitioning algorithm to split the transcriptome de Bruijn graph into smaller components. In tests on large RNA-Seq datasets, Shannon obtains significant increases in sensitivity along with improvements in specificity in comparison to state-of-the-art assemblers.
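To illustrate what decomposing a flow on a transcript graph means, the sketch below applies a generic greedy path decomposition to a small directed acyclic graph. This is a textbook technique, not Shannon's iterative algorithm, and it makes no attempt to find the sparsest decomposition. The function name `decompose_flow`, the node labels, and the edge values are hypothetical; each recovered path stands in for a transcript and its weight for an abundance.

```python
def decompose_flow(flow, source, sink):
    """Greedily split a flow on a DAG into weighted source-to-sink paths.

    flow maps directed edges (u, v) to non-negative values and is assumed to
    satisfy flow conservation at every node other than source and sink.
    """
    residual = {edge: value for edge, value in flow.items() if value > 0}
    paths = []
    while True:
        # Walk from the source along any edge that still carries flow.
        path, node = [source], source
        while node != sink:
            nxt = next(
                (v for (u, v), f in residual.items() if u == node and f > 0),
                None,
            )
            if nxt is None:
                break
            path.append(nxt)
            node = nxt
        if node != sink:
            break  # no remaining flow leaves the source
        edges = list(zip(path, path[1:]))
        weight = min(residual[e] for e in edges)  # bottleneck along the path
        for e in edges:
            residual[e] -= weight
        paths.append((path, weight))
    return paths


# Hypothetical graph: a shared prefix s->a followed by two alternative branches.
flow = {("s", "a"): 5, ("a", "b"): 2, ("a", "c"): 3, ("b", "t"): 2, ("c", "t"): 3}
paths = decompose_flow(flow, "s", "t")
```

In this example the two branches carry different amounts of flow, so the decomposition is unambiguous, which mirrors the abstract's remark that differing abundances help rather than hinder assembly.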
biorxiv genomics 0-100-users 2016