Differential analysis of RNA-Seq incorporating quantification uncertainty, bioRxiv, 2016-06-11
We describe a novel method for the differential analysis of RNA-Seq data that utilizes bootstrapping in conjunction with response error linear modeling to decouple biological variance from inferential variance. The method is implemented in an interactive shiny app called sleuth that utilizes kallisto quantifications and bootstraps for fast and accurate analysis of RNA-Seq experiments.
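The variance decoupling described above can be illustrated with a minimal sketch. This is not sleuth's implementation: the function name `decompose_variance`, its inputs, and the simple subtraction estimator are assumptions made for illustration only. The idea shown is that the variance observed across replicates mixes biological and inferential variance, and that bootstrap replicates of the quantification give an estimate of the inferential part.

```python
from statistics import mean, variance


def decompose_variance(point_estimates, bootstraps_per_sample):
    """Split observed variance for one transcript into two parts.

    point_estimates: one abundance estimate per replicate (hypothetical input,
        e.g. on a log scale).
    bootstraps_per_sample: for each replicate, a list of bootstrap abundance
        estimates (hypothetical input).
    Returns (total, inferential, biological) variance estimates.
    """
    # Variance of the point estimates across replicates (total observed variance).
    total_var = variance(point_estimates)
    # Average within-replicate bootstrap variance approximates inferential variance.
    inferential_var = mean(variance(b) for b in bootstraps_per_sample)
    # Whatever remains is attributed to biology; clamp at zero because the
    # difference of two noisy estimates can be negative.
    biological_var = max(total_var - inferential_var, 0.0)
    return total_var, inferential_var, biological_var
```

Calling it with three replicates and two bootstraps each, for example `decompose_variance([10.0, 12.0, 14.0], [[9.0, 11.0], [11.0, 13.0], [13.0, 15.0]])`, returns the three variance estimates as a tuple. A real analysis would fit this per transcript within a linear model and typically shrink the variance estimates across transcripts, which this sketch omits.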
biorxiv bioinformatics 0-100-users 2016
Phased Diploid Genome Assembly with Single Molecule Real-Time Sequencing, bioRxiv, 2016-06-04
While genome assembly projects have been successful in a number of haploid or inbred species, one of the current main challenges is assembling non-inbred or rearranged heterozygous genomes. To address this critical need, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON) to assemble Single Molecule Real-Time (SMRT®) Sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We demonstrate the quality of this approach by assembling new reference sequences for three heterozygous samples that have challenged short-read assembly approaches: an F1 hybrid of the model species Arabidopsis thaliana, the widely cultivated V. vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata. The FALCON-based assemblies were substantially more contiguous and complete than alternate short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structures and heterozygosities between the homologous chromosomes, including identifying widespread heterozygous structural variations within the coding sequences.
biorxiv bioinformatics 100-200-users 2016
Accurate prediction of single-cell DNA methylation states using deep learning, bioRxiv, 2016-05-28
Recent technological advances have enabled assaying DNA methylation at single-cell resolution. Current protocols are limited by incomplete CpG coverage, and hence methods to predict missing methylation states are critical to enable genome-wide analyses. Here, we report DeepCpG, a computational approach based on deep neural networks to predict DNA methylation states from DNA sequence and incomplete methylation profiles in single cells. We evaluated DeepCpG on single-cell methylation data from five cell types generated using alternative sequencing protocols, finding that DeepCpG yields substantially more accurate predictions than previous methods. Additionally, we show that the parameters of our model can be interpreted, thereby providing insights into the effect of sequence composition on methylation variability.
biorxiv bioinformatics 100-200-users 2016
LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis, bioRxiv, 2016-05-04
Motivation: LD score regression is a reliable and efficient method of using genome-wide association study (GWAS) summary-level results data to estimate the SNP heritability of complex traits and diseases, partition this heritability into functional categories, and estimate the genetic correlation between different phenotypes. Because the method relies on summary-level results data, LD score regression is computationally tractable even for very large sample sizes. However, publicly available GWAS summary-level data are typically stored in different databases and have different formats, making it difficult to apply LD score regression to estimate genetic correlations across many different traits simultaneously.
Results: In this manuscript, we describe LD Hub, a centralized database of summary-level GWAS results for 177 diseases/traits from different publicly available resources/consortia, and a web interface that automates the LD score regression analysis pipeline. To demonstrate functionality and validate our software, we replicated previously reported LD score regression analyses of 49 traits/diseases using LD Hub, and estimated SNP heritability and the genetic correlation across the different phenotypes. We also present new results obtained by uploading a recent atopic dermatitis GWAS meta-analysis to examine the genetic correlation between the condition and other potentially related traits. In response to the growing availability of publicly accessible GWAS summary-level results data, our database and the accompanying web interface will ensure maximal uptake of the LD score regression methodology, provide a useful database for the public dissemination of GWAS results, and provide a method for easily screening hundreds of traits for overlapping genetic aetiologies.
Availability and implementation: The web interface and instructions for using LD Hub are available at http://ldsc.broadinstitute.org
biorxiv bioinformatics 0-100-users 2016
Impact of knowledge accumulation on pathway enrichment analysis, bioRxiv, 2016-04-20
Pathway-based interpretation of gene lists is a staple of genome analysis. It depends on frequently updated gene annotation databases. We analyzed the evolution of gene annotations over the past seven years and found that the vocabulary of pathways and processes has doubled. This strongly impacts practical analysis of genes: 80% of publications we surveyed in 2015 used outdated software that only captured 20% of pathway enrichments apparent in current annotations.
biorxiv bioinformatics 200-500-users 2016
plasmidSPAdes: Assembling Plasmids from Whole Genome Sequencing Data, bioRxiv, 2016-04-16
Motivation: Plasmids are stably maintained extra-chromosomal genetic elements that replicate independently from the host cell's chromosomes. Although plasmids harbor biomedically important genes (such as genes involved in virulence and antibiotic resistance), there is a shortage of specialized software tools for extracting and assembling plasmid data from whole genome sequencing projects.
Results: We present the plasmidSPAdes algorithm and software tool for assembling plasmids from whole genome sequencing data and benchmark its performance on a diverse set of bacterial genomes.
Availability and implementation: plasmidSPAdes is publicly available at http://spades.bioinf.spbau.ru/plasmidSPAdes
Contact: d.antipov@spbu.ru
biorxiv bioinformatics 0-100-users 2016