Vibrio natriegens, a new genomic powerhouse, bioRxiv, 2016-06-13
Recombinant DNA technology has revolutionized biomedical research with continual innovations advancing the speed and throughput of molecular biology. Nearly all these tools, however, are reliant on Escherichia coli as a host organism, and its slow growth increasingly dominates experimental time. Here we report the development of Vibrio natriegens, a free-living bacterium with the fastest generation time known, into a genetically tractable host organism. We systematically characterize its growth properties to establish basic laboratory culturing conditions. We provide the first complete Vibrio natriegens genome, consisting of two chromosomes of 3,248,023 bp and 1,927,310 bp that together encode 4,578 open reading frames. We reveal genetic tools and techniques for working with Vibrio natriegens. These foundational resources will usher in an era of advanced genomics to accelerate biological, biotechnological, and medical discoveries.
biorxiv genomics 0-100-users 2016
Detecting DNA Methylation using the Oxford Nanopore Technologies MinION sequencer, bioRxiv, 2016-04-05
Nanopore sequencing instruments measure the change in electric current caused by DNA transiting through the pore. In experimental and prototype nanopore sequencing devices it has been shown that the electrolytic current signals are sensitive to base modifications, such as 5-methylcytosine. Here we quantify the strength of this effect for the Oxford Nanopore Technologies MinION sequencer. Using synthetically methylated DNA we are able to train a hidden Markov model to distinguish 5-methylcytosine from unmethylated cytosine in DNA. We demonstrate by sequencing natural human DNA, without any special library preparation, that global patterns of methylation can be detected from low-coverage sequencing and that the methylation status of CpG islands can be reliably predicted from single MinION reads. Our trained model and prediction software are open source and freely available to the community under the MIT license.
biorxiv genomics 200-500-users 2016
LeafCutter: annotation-free quantification of RNA splicing, bioRxiv, 2016-03-17
The excision of introns from pre-mRNA is an essential step in mRNA processing. We developed LeafCutter to study sample and population variation in intron splicing. LeafCutter identifies variable intron splicing events from short-read RNA-seq data and finds alternative splicing events of high complexity. Our approach obviates the need for transcript annotations and circumvents the challenges in estimating relative isoform or exon usage in complex splicing events. LeafCutter can be used both for detecting differential splicing between sample groups, and for mapping splicing quantitative trait loci (sQTLs). Compared to contemporary methods, we find 1.4–2.1 times more sQTLs, many of which help us ascribe molecular effects to disease-associated variants. Strikingly, transcriptome-wide associations between LeafCutter intron quantifications and 40 complex traits increased the number of associated disease genes at 5% FDR by an average of 2.1-fold as compared to using gene expression levels alone. LeafCutter is fast, scalable, easy to use, and available at https://github.com/davidaknowles/leafcutter.
biorxiv genomics 100-200-users 2016
Topologically associated domains are ancient features that coincide with Metazoan clusters of extreme noncoding conservation, bioRxiv, 2016-03-16
In vertebrates and other Metazoa, developmental genes are found surrounded by dense clusters of highly conserved noncoding elements (CNEs). CNEs exhibit extreme levels of sequence conservation of unexplained origin, with many acting as long-range enhancers during development. Clusters of CNEs, termed genomic regulatory blocks (GRBs), define the span of regulatory interactions for many important developmental regulators. The function and genomic distribution of these elements close to important regulatory genes raise the question of how they relate to the 3D conformation of these loci. We show that GRBs, defined using clusters of CNEs, coincide strongly with the patterns of topological organisation in metazoan genomes, predicting the boundaries of topologically associating domains (TADs) at hundreds of loci. The set of TADs that are associated with high levels of non-coding conservation exhibits distinct properties compared to TADs called in chromosomal regions devoid of extreme non-coding conservation. The correspondence between GRBs and TADs suggests that TADs around developmental genes are ancient, slowly evolving genomic structures, many of which have had conserved spans for hundreds of millions of years. This relationship also explains the difference in TAD numbers and sizes between genomes. While the close correspondence between extreme conservation and the boundaries of this subset of TADs does not reveal the mechanism leading to the conservation of these elements, it provides a functional framework for studying the role of TADs in long-range transcriptional regulation.
biorxiv genomics 0-100-users 2016
Computational Pan-Genomics: Status, Promises and Challenges, bioRxiv, 2016-03-13
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic datasets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this paper, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies, and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.
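The shift from string references to graph references mentioned above can be illustrated with a toy example. This is only a minimal sketch, not any tool discussed in the review: the node labels, the edge list, and the helper name `spell_paths` are all hypothetical. Each node holds a short sequence, a single-base variant forms a "bubble" of two alternative nodes, and each start-to-end walk spells one haplotype.

```python
def spell_paths(nodes, edges, start, end):
    """Enumerate every sequence spelled by a start-to-end walk in a DAG.

    nodes: dict mapping node id -> sequence fragment (hypothetical toy data)
    edges: dict mapping node id -> list of successor node ids
    """
    results = []

    def walk(node, prefix):
        prefix = prefix + nodes[node]
        if node == end:
            results.append(prefix)
            return
        for nxt in edges.get(node, []):
            walk(nxt, prefix)

    walk(start, "")
    return results


# Toy graph: a shared prefix, a T/C variant bubble, and a shared suffix.
nodes = {1: "ACG", 2: "T", 3: "C", 4: "GGA"}
edges = {1: [2, 3], 2: [4], 3: [4]}

haplotypes = spell_paths(nodes, edges, 1, 4)
print(haplotypes)  # ['ACGTGGA', 'ACGCGGA']
```

A linear reference would store only one of these two strings; the graph stores the shared sequence once and keeps both alleles addressable.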
biorxiv genomics 0-100-users 2016
Shannon: An Information-Optimal de Novo RNA-Seq Assembler, bioRxiv, 2016-02-10
De novo assembly of short RNA-Seq reads into transcripts is challenging due to sequence similarities in transcriptomes arising from gene duplications and alternative splicing of transcripts. We present Shannon, an RNA-Seq assembler with an optimality guarantee derived from principles of information theory: Shannon reconstructs nearly all information-theoretically reconstructable transcripts. Shannon is based on a theory we develop for de novo RNA-Seq assembly that reveals differing abundances among transcripts to be the key, rather than the barrier, to effective assembly. The assembly problem is formulated as a sparsest-flow problem on a transcript graph, and the heart of Shannon is a novel iterative flow-decomposition algorithm. This algorithm provably solves the information-theoretically reconstructable instances in linear time even though the general sparsest-flow problem is NP-hard. Shannon also incorporates several additional new algorithmic advances: a new error-correction algorithm based on successive cancelation, a multi-bridging algorithm that carefully utilizes read information in the k-mer de Bruijn graph, and an approximate graph partitioning algorithm to split the transcriptome de Bruijn graph into smaller components. In tests on large RNA-Seq datasets, Shannon obtains significant increases in sensitivity along with improvements in specificity in comparison to state-of-the-art assemblers.
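To make the idea of decomposing a flow into transcripts concrete, here is a minimal sketch of a generic greedy path decomposition on a small directed acyclic graph. It is not Shannon's iterative algorithm, which the abstract does not specify; the function name `decompose_flow`, the toy graph, and the integer abundances are all assumptions made for illustration. The input flow is assumed to satisfy conservation at every internal node.

```python
from collections import defaultdict


def decompose_flow(edge_flow, source, sink):
    """Greedily split a conserved flow on a DAG into weighted source-to-sink paths.

    edge_flow: dict mapping (u, v) -> non-negative integer flow on that edge.
    Returns a list of (path, weight) pairs. This is a generic heuristic and
    does not guarantee the sparsest (fewest-path) decomposition.
    """
    residual = dict(edge_flow)
    paths = []
    while True:
        # Successors reachable through edges that still carry flow.
        adj = defaultdict(list)
        for (u, v), f in residual.items():
            if f > 0:
                adj[u].append(v)
        if not adj[source]:
            break  # all flow leaving the source has been explained
        # Walk from the source, always taking the heaviest remaining edge.
        path = [source]
        while path[-1] != sink:
            current = path[-1]
            if not adj[current]:
                raise ValueError("flow is not conserved at node %r" % (current,))
            path.append(max(adj[current], key=lambda v: residual[(current, v)]))
        # The path weight is its bottleneck; subtract it along the path.
        steps = list(zip(path, path[1:]))
        weight = min(residual[e] for e in steps)
        for e in steps:
            residual[e] -= weight
        paths.append((path, weight))
    return paths


# Two hypothetical transcripts share the prefix s -> a but differ in abundance:
# s-a-b-t with abundance 5 and s-a-c-t with abundance 2.
flow = {("s", "a"): 7, ("a", "b"): 5, ("b", "t"): 5, ("a", "c"): 2, ("c", "t"): 2}
transcripts = decompose_flow(flow, "s", "t")
print(transcripts)  # [(['s', 'a', 'b', 't'], 5), (['s', 'a', 'c', 't'], 2)]
```

In this toy case the two paths are separable precisely because their abundances differ, which loosely mirrors the abstract's point that differing abundances help rather than hinder assembly.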
biorxiv genomics 0-100-users 2016