Computational Pan-Genomics: Status, Promises and Challenges, bioRxiv, 2016-03-13

Abstract: Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic datasets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this paper, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies, and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.
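
The shift from string to graph references can be illustrated with a minimal sketch. The `SequenceGraph` class, its method names, and the toy sequences below are hypothetical and are not taken from the paper; the sketch only assumes that nodes carry sequence segments and that each path through the graph spells one haplotype.

```python
from collections import defaultdict


class SequenceGraph:
    """Toy sequence graph: nodes carry sequence, directed edges join them."""

    def __init__(self):
        self.seq = {}                    # node id -> sequence segment
        self.edges = defaultdict(list)   # node id -> list of successor ids

    def add_node(self, node_id, sequence):
        self.seq[node_id] = sequence

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def spell_paths(self, start, end):
        """Return every sequence spelled by a path from start to end.

        Assumes the graph is acyclic; a cyclic graph would recurse forever.
        """
        if start == end:
            return [self.seq[start]]
        spelled = []
        for nxt in self.edges[start]:
            for suffix in self.spell_paths(nxt, end):
                spelled.append(self.seq[start] + suffix)
        return spelled


# A single-nucleotide variant (A or G) flanked by shared sequence.
graph = SequenceGraph()
graph.add_node(1, "ACGT")
graph.add_node(2, "A")    # reference allele
graph.add_node(3, "G")    # alternative allele
graph.add_node(4, "TTC")
for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    graph.add_edge(src, dst)

haplotypes = sorted(graph.spell_paths(1, 4))
print(haplotypes)
```

A linear reference would store only one of the two haplotypes; the graph keeps both while storing the shared flanks once.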

biorxiv genomics 0-100-users 2016

VcfR: a package to manipulate and visualize VCF format data in R, bioRxiv, 2016-02-27

Abstract: Software to call single nucleotide polymorphisms or related genetic variants has converged on the variant call format (VCF) as the output format of choice. This has created a need for tools to work with VCF files. While a growing number of programs can read VCF data, many only extract the genotypes without including the data associated with each genotype that describes its quality. We created the R package vcfR to address this issue. We developed a VCF file exploration tool implemented in the R language because R provides an interactive experience and an environment that is commonly used for genetic data analysis. The package implements functions to read VCF files into R and write them out again, to extract portions of the data, and to plot summary statistics. VcfR further provides the ability to visualize how various parameterizations of the data affect the results. Additional tools are included to integrate sequence (FASTA) and annotation data (GFF) for visualization of genomic regions such as chromosomes. Conversion functions translate data from the vcfR data structure to formats used by other R genetics packages. Computationally intensive functions are implemented in C++ to improve performance. Use of these tools is intended to facilitate VCF data exploration, including intuitive methods for data quality control and easy export to other R packages for further analysis. VcfR thus provides essential, novel tools currently not available in R.
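
The point about keeping the per-genotype quality data can be made concrete with a small sketch. It is written in Python and is not vcfR's API; the function name `parse_vcf_line` and the example record are hypothetical. It relies only on the standard VCF layout, in which the FORMAT column names the colon-separated fields carried by each sample column.

```python
def parse_vcf_line(line, sample_names):
    """Split one VCF data line into fixed fields and per-sample FORMAT fields."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual = cols[:6]
    format_keys = cols[8].split(":")          # e.g. ["GT", "DP", "GQ"]
    samples = {}
    for name, raw in zip(sample_names, cols[9:]):
        # Keep every FORMAT field, not just the genotype (GT).
        samples[name] = dict(zip(format_keys, raw.split(":")))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt.split(","),
        "qual": qual,
        "samples": samples,
    }


# Made-up header and record for illustration only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
record = "chr1\t101\t.\tA\tG\t50\tPASS\t.\tGT:DP:GQ\t0/1:12:99\t1/1:3:20"

sample_names = header.split("\t")[9:]
parsed = parse_vcf_line(record, sample_names)

# Read depth (DP) travels with each genotype, so low-coverage calls can be flagged.
low_depth = [name for name, fields in parsed["samples"].items() if int(fields["DP"]) < 5]
print(low_depth)
```

A reader that kept only the GT field would lose the depth and genotype-quality values needed for this kind of filtering.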

biorxiv bioinformatics 0-100-users 2016

Model-based projections of Zika virus infections in childbearing women in the Americas, bioRxiv, 2016-02-13

Abstract: Zika virus is a mosquito-borne pathogen that is rapidly spreading across the Americas. Due to associations between Zika virus infection and a range of fetal maladies [1,2], the epidemic trajectory of this viral infection poses a significant concern for the nearly 15 million children born in the Americas each year. Ascertaining the portion of this population that is truly at risk is an important priority. One recent estimate [3] suggested that 5.42 million childbearing women live in areas of the Americas that are suitable for Zika occurrence. To improve on that estimate, which did not take into account the protective effects of herd immunity, we developed a new approach that combines classic results from epidemiological theory with seroprevalence data and highly spatially resolved data about drivers of transmission to make location-specific projections of epidemic attack rates. Our results suggest that 1.65 (1.45–2.06) million childbearing women and 93.4 (81.6–117.1) million people in total could become infected before the first wave of the epidemic concludes. Based on current estimates of rates of adverse fetal outcomes among infected women [2,4,5], these results suggest that tens of thousands of pregnancies could be negatively impacted by the first wave of the epidemic. These projections constitute a revised upper limit of populations at risk in the current Zika epidemic, and our approach offers a new way to make rapid assessments of the threat posed by emerging infectious diseases more generally.
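
One classic result that links transmission intensity to an epidemic attack rate is the final-size relation A = 1 - exp(-R0 * A). The abstract does not say which results were used, so treating this relation as the relevant one is an assumption, and the function name `attack_rate` and the R0 values below are illustrative only.

```python
import math


def attack_rate(r0, tol=1e-12):
    """Solve A = 1 - exp(-r0 * A) for the nonzero root by bisection.

    Returns 0.0 when r0 <= 1, where the relation has no positive root.
    """
    if r0 <= 1.0:
        return 0.0
    lo, hi = 1e-9, 1.0
    # f(A) = 1 - exp(-r0 * A) - A is positive below the root and negative above it.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 1 - math.exp(-r0 * mid) - mid > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Illustrative R0 values, not estimates from the paper.
for r0 in (0.9, 1.5, 2.0):
    print(r0, round(attack_rate(r0), 4))
```

Because the attack rate stays below 100% even for large R0, an estimate built on it is lower than a simple count of everyone living in a suitable area, which is the sense in which herd immunity is protective.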

biorxiv ecology 0-100-users 2016

Shannon: An Information-Optimal de Novo RNA-Seq Assembler, bioRxiv, 2016-02-10

De novo assembly of short RNA-Seq reads into transcripts is challenging due to sequence similarities in transcriptomes arising from gene duplications and alternative splicing of transcripts. We present Shannon, an RNA-Seq assembler with an optimality guarantee derived from principles of information theory: Shannon reconstructs nearly all information-theoretically reconstructable transcripts. Shannon is based on a theory we develop for de novo RNA-Seq assembly that reveals differing abundances among transcripts to be the key, rather than the barrier, to effective assembly. The assembly problem is formulated as a sparsest-flow problem on a transcript graph, and the heart of Shannon is a novel iterative flow-decomposition algorithm. This algorithm provably solves the information-theoretically reconstructable instances in linear time even though the general sparsest-flow problem is NP-hard. Shannon also incorporates several additional new algorithmic advances: a new error-correction algorithm based on successive cancelation, a multi-bridging algorithm that carefully utilizes read information in the k-mer de Bruijn graph, and an approximate graph partitioning algorithm to split the transcriptome de Bruijn graph into smaller components. In tests on large RNA-Seq datasets, Shannon obtains significant increases in sensitivity along with improvements in specificity in comparison to state-of-the-art assemblers.
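
To give a feel for what decomposing a flow into transcripts means, here is a generic greedy path-peeling heuristic on a toy graph. It is not Shannon's iterative algorithm and carries none of its guarantees; the function name, node labels, and abundances are hypothetical.

```python
def greedy_flow_decomposition(edge_flow, source, sink):
    """Peel off source-to-sink paths, always following the heaviest remaining edge.

    edge_flow maps (u, v) -> flow on a DAG that satisfies flow conservation,
    with source != sink. Returns a list of (path, weight) pairs.
    """
    flow = dict(edge_flow)
    paths = []
    while True:
        path, node = [source], source
        while node != sink:
            out = [(f, v) for (u, v), f in flow.items() if u == node and f > 0]
            if not out:
                break
            _, node = max(out)          # heaviest outgoing edge
            path.append(node)
        if node != sink:
            break                        # no flow left to explain
        edges = list(zip(path, path[1:]))
        weight = min(flow[e] for e in edges)
        for e in edges:
            flow[e] -= weight            # remove the explained abundance
        paths.append((path, weight))
    return paths


# Two hypothetical isoforms share the segment s -> a but have abundances 5 and 2.
edge_flow = {
    ("s", "a"): 7,
    ("a", "b"): 5, ("b", "t"): 5,
    ("a", "c"): 2, ("c", "t"): 2,
}
transcripts = greedy_flow_decomposition(edge_flow, "s", "t")
print(transcripts)
```

In this toy case the differing abundances (5 versus 2) are what let the two paths be told apart after the shared segment, which echoes the abstract's remark that abundance differences help rather than hinder assembly.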

biorxiv genomics 0-100-users 2016

Transmission dynamics of Zika virus in island populations: a modelling analysis of the 2013-14 French Polynesia outbreak, bioRxiv, 2016-02-08

Abstract: Between October 2013 and April 2014, more than 30,000 cases of Zika virus (ZIKV) disease were estimated to have attended healthcare facilities in French Polynesia. ZIKV has also been reported in Africa and Asia, and in 2015 the virus spread to South America and the Caribbean. Infection with ZIKV has been associated with neurological complications including Guillain-Barré Syndrome (GBS) and microcephaly, which led the World Health Organization to declare a Public Health Emergency of International Concern in February 2016. To better understand the transmission dynamics of ZIKV, we used a mathematical model to examine the 2013–14 outbreak on the six major archipelagos of French Polynesia. Our median estimates for the basic reproduction number ranged from 2.6 to 4.8, with an estimated 11.5% (95% CI 7.32–17.9%) of total infections reported. As a result, we estimated that 94% (95% CI 91–97%) of the total population of the six archipelagos were infected during the outbreak. Based on the demography of French Polynesia, our results imply that if ZIKV infection provides complete protection against future infection, it would take 12–20 years before there are a sufficient number of susceptible individuals for ZIKV to reemerge, which is on the same timescale as the circulation of dengue virus serotypes in the region. Our analysis suggests that ZIKV may exhibit similar dynamics to dengue virus in island populations, with transmission characterized by large, sporadic outbreaks with a high proportion of asymptomatic or unreported cases.

Author Summary: Since the first reported major outbreak of Zika virus disease in Micronesia in 2007, the virus has caused outbreaks throughout the Pacific and South America. Transmitted by the Aedes species of mosquitoes, the virus has been linked to possible neurological complications including Guillain-Barré Syndrome and microcephaly. To improve our understanding of the transmission dynamics of Zika virus in island populations, we analysed the 2013–14 outbreak on the six major archipelagos of French Polynesia. We found evidence that Zika virus infected the majority of the population, but only around 12% of total infections on the archipelagos were reported as cases. If infection with Zika virus generates lifelong immunity, we estimate that it would take at least 15–20 years before there are enough susceptible people for the virus to reemerge. Our results suggest that Zika virus could exhibit similar dynamics to dengue virus in the Pacific, producing large but sporadic outbreaks in small island populations.
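
The reasoning behind a waiting time before reemergence can be sketched as follows. A standard result is that an outbreak can take off only once the susceptible fraction exceeds 1/R0. The replenishment model below (constant population turnover at a fixed yearly rate) and both function names are assumptions made for illustration; this is not the paper's demographic model and is not expected to reproduce its 12–20 year range.

```python
import math


def critical_susceptible_fraction(r0):
    """Susceptible fraction above which an outbreak can grow (standard 1/R0 result)."""
    return 1.0 / r0


def years_until_threshold(s0, r0, turnover_rate):
    """Years for the susceptible fraction to climb from s0 to 1/r0.

    Assumes (hypothetically) that immune individuals are replaced by susceptible
    newborns at a constant yearly rate, so S(t) = 1 - (1 - s0) * exp(-rate * t).
    """
    threshold = critical_susceptible_fraction(r0)
    if s0 >= threshold:
        return 0.0
    return math.log((1.0 - s0) / (1.0 - threshold)) / turnover_rate


# 6% susceptible after the outbreak follows from the abstract's 94% infected;
# the R0 of 4.0 and the 2% yearly turnover rate are illustrative assumptions only.
years = years_until_threshold(s0=0.06, r0=4.0, turnover_rate=0.02)
print(round(years, 1))
```

Under these assumptions the waiting time shrinks as R0 grows, because a larger R0 lowers the susceptible fraction needed for the virus to spread again.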

biorxiv ecology 0-100-users 2016

 

Created with the audiences framework by Jedidiah Carlson
