Topologically associated domains are ancient features that coincide with Metazoan clusters of extreme noncoding conservation, bioRxiv, 2016-03-16

In vertebrates and other Metazoa, developmental genes are found surrounded by dense clusters of highly conserved noncoding elements (CNEs). CNEs exhibit extreme levels of sequence conservation of unexplained origin, with many acting as long-range enhancers during development. Clusters of CNEs, termed genomic regulatory blocks (GRBs), define the span of regulatory interactions for many important developmental regulators. The function and genomic distribution of these elements close to important regulatory genes raise the question of how they relate to the 3D conformation of these loci. We show that GRBs, defined using clusters of CNEs, coincide strongly with the patterns of topological organisation in metazoan genomes, predicting the boundaries of topologically associating domains (TADs) at hundreds of loci. The set of TADs that are associated with high levels of non-coding conservation exhibits distinct properties compared to TADs called in chromosomal regions devoid of extreme non-coding conservation. The correspondence between GRBs and TADs suggests that TADs around developmental genes are ancient, slowly evolving genomic structures, many of which have had conserved spans for hundreds of millions of years. This relationship also explains the difference in TAD numbers and sizes between genomes. While the close correspondence between extreme conservation and the boundaries of this subset of TADs does not reveal the mechanism leading to the conservation of these elements, it provides a functional framework for studying the role of TADs in long-range transcriptional regulation.

biorxiv genomics 0-100-users 2016

Computational Pan-Genomics: Status, Promises and Challenges, bioRxiv, 2016-03-13

Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic datasets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this paper, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies, and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

biorxiv genomics 0-100-users 2016

Shannon: An Information-Optimal de Novo RNA-Seq Assembler, bioRxiv, 2016-02-10

De novo assembly of short RNA-Seq reads into transcripts is challenging due to sequence similarities in transcriptomes arising from gene duplications and alternative splicing of transcripts. We present Shannon, an RNA-Seq assembler with an optimality guarantee derived from principles of information theory: Shannon reconstructs nearly all information-theoretically reconstructable transcripts. Shannon is based on a theory we develop for de novo RNA-Seq assembly that reveals differing abundances among transcripts to be the key, rather than the barrier, to effective assembly. The assembly problem is formulated as a sparsest-flow problem on a transcript graph, and the heart of Shannon is a novel iterative flow-decomposition algorithm. This algorithm provably solves the information-theoretically reconstructable instances in linear time, even though the general sparsest-flow problem is NP-hard. Shannon also incorporates several additional new algorithmic advances: a new error-correction algorithm based on successive cancelation, a multi-bridging algorithm that carefully utilizes read information in the k-mer de Bruijn graph, and an approximate graph partitioning algorithm to split the transcriptome de Bruijn graph into smaller components. In tests on large RNA-Seq datasets, Shannon obtains significant increases in sensitivity along with improvements in specificity in comparison to state-of-the-art assemblers.
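For readers unfamiliar with the k-mer de Bruijn graph mentioned in the abstract, the following is a minimal generic sketch of how such a graph can be built from reads. It is not Shannon's implementation: the function name `build_de_bruijn`, the toy reads, and the choice of k are all illustrative assumptions, and none of Shannon's error correction, multi-bridging, or flow decomposition is included.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a k-mer de Bruijn graph (hypothetical helper, not from Shannon).

    Nodes are (k-1)-mers. Each k-mer in a read adds one edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix; edge weights count how many
    times that k-mer was observed across all reads.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    # Convert to plain dicts so lookups of missing nodes do not create them.
    return {node: dict(successors) for node, successors in graph.items()}


# Toy input, assumed for illustration only.
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=4)
```

In this toy graph the edge weights play the role of observed abundances; the abstract's point is that differences in such abundances among transcripts can help, rather than hinder, the decomposition of the graph into transcripts.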

biorxiv genomics 0-100-users 2016

 
