Comparative assessment of long-read error-correction software applied to RNA-sequencing data, bioRxiv, 2018-11-23
Abstract:
Motivation: Long-read sequencing technologies offer promising alternatives to high-throughput short-read sequencing, especially in the context of RNA-sequencing. However, these technologies are currently hindered by high error rates in the output data that affect analyses such as the identification of isoforms, exon boundaries, open reading frames, and the creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed and options for the error-correction of RNA-sequencing long reads remain limited.
Results: In this article, we evaluate the extent to which existing long-read DNA error-correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that not only reports classical error-correction metrics but also the effect of correction on gene families, isoform diversity, bias towards the major isoform, and splice site detection. We find that long-read error-correction tools that were originally developed for DNA are also suitable for the correction of RNA-sequencing data, especially in terms of increasing base-pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error-correction tools should be used, depending on the application type.
Benchmarking software: https://gitlab.com/leoisl/LR_EC_analyser
biorxiv bioinformatics 0-100-users 2018
Tracing diagnosis trajectories over millions of inpatients reveal an unexpected association between schizophrenia and rhabdomyolysis, bioRxiv, 2018-11-20
Abstract: While it has been technically feasible to create longitudinal representations of individual health at a nationwide scale, the use of these techniques to identify novel disease associations for the risk stratification of patients has had limited success. Here, we created a large-scale US longitudinal disease network of traced readmission patterns (i.e., disease trajectories), merging data from over 10.4 million inpatients from 350 California hospitals through the Healthcare Cost and Utilization Project between 1980 and 2010. We were able to create longitudinal representations of disease progression mapping over 300 common diseases, including the well-known complication of heart failure after acute myocardial infarction. Surprisingly, among these generated disease trajectories, we discovered a previously unknown association between schizophrenia, a chronic mental disorder, and rhabdomyolysis, a rare disease of muscle breakdown. We found that 92 of 3674 patients (2.5%) with schizophrenia were readmitted for rhabdomyolysis (relative risk, 2.21 [95% confidence interval, 1.80–2.71]; P-value, 9.54E-15), a condition with a general population incidence of 1 in 10,000. We validated this association using independent electronic health records from over 830,000 patients treated over seven years at the University of California, San Francisco (UCSF) medical center. A case review of 29 patients at UCSF who were treated for schizophrenia and who went on to develop rhabdomyolysis demonstrated that the majority of cases (62%) are idiopathic, which suggests a biological connection between these two diseases. Together, these findings demonstrate the power of using public disease registries in combination with electronic medical records to discover novel disease associations.
One Sentence Summary: Based on the longitudinal health records from millions of California inpatient discharges, we created a temporal network that enabled us to understand statewide patterns of hospital readmissions, which led to the novel finding that hospitalization for schizophrenia is significantly associated with rehospitalization for rhabdomyolysis.
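The abstract reports a relative risk with a 95% confidence interval but does not say how the interval was computed. As a rough illustration only, the sketch below uses the common log-normal (Katz) approximation. The function name `relative_risk` is hypothetical, and the comparison-group counts in the usage line are placeholders invented for the example, not figures from the study; only the 92 of 3674 count comes from the abstract.

```python
import math


def relative_risk(exposed_events, exposed_total,
                  control_events, control_total, z=1.96):
    """Return (rr, lower, upper) using the log-normal (Katz) approximation.

    z=1.96 corresponds to a 95% confidence interval.
    """
    risk_exposed = exposed_events / exposed_total
    risk_control = control_events / control_total
    rr = risk_exposed / risk_control
    # Standard error of log(RR) for two independent binomial samples.
    se_log = math.sqrt(
        1 / exposed_events - 1 / exposed_total
        + 1 / control_events - 1 / control_total
    )
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# 92 of 3674 is taken from the abstract; the comparison counts
# (1000 of 100000) are HYPOTHETICAL placeholders, not study data.
rr, lower, upper = relative_risk(92, 3674, 1000, 100000)
print(f"RR = {rr:.2f} [{lower:.2f}, {upper:.2f}]")
```

Because the comparison counts are placeholders, the printed numbers will not match the values reported in the abstract.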
biorxiv bioinformatics 0-100-users 2018
The Barcode, UMI, Set format and BUStools, bioRxiv, 2018-11-19
Abstract: We introduce the Barcode-UMI-Set format (BUS) for representing pseudoalignments of reads from single-cell RNA-seq experiments. The format can be used with all single-cell RNA-seq technologies, and we show that BUS files can be efficiently generated. BUStools is a suite of tools for working with BUS files and facilitates rapid quantification and analysis of single-cell RNA-seq data. The BUS format therefore makes possible the development of modular, technology-specific, and robust workflows for single-cell RNA-seq analysis.
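To make the barcode, UMI, and set triple more concrete, here is a minimal sketch of how such records might be held in memory and collapsed into per-cell counts. It is not the actual BUS file layout or the BUStools API: the `BusRecord` type, its field names, and the `count_unique_umis` helper are all hypothetical, and the "set" is modelled simply as a frozenset of transcript identifiers compatible with a read.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, NamedTuple, Tuple


class BusRecord(NamedTuple):
    """Hypothetical in-memory record: one pseudoaligned read."""
    barcode: str                 # cell barcode
    umi: str                     # unique molecular identifier
    targets: FrozenSet[str]      # transcripts compatible with the read


def count_unique_umis(
    records: Iterable[BusRecord],
) -> Dict[Tuple[str, FrozenSet[str]], int]:
    """Count distinct UMIs per (barcode, target set) pair.

    Reads sharing a barcode, UMI, and target set are treated as
    duplicates of one molecule and counted once.
    """
    umis = defaultdict(set)
    for rec in records:
        umis[(rec.barcode, rec.targets)].add(rec.umi)
    return {key: len(seen) for key, seen in umis.items()}


# Illustrative records only; barcodes, UMIs, and transcript names are made up.
records = [
    BusRecord("AAAC", "GGT", frozenset({"tx1", "tx2"})),
    BusRecord("AAAC", "GGT", frozenset({"tx1", "tx2"})),  # duplicate read
    BusRecord("AAAC", "CCA", frozenset({"tx1", "tx2"})),
    BusRecord("TTTG", "GGT", frozenset({"tx3"})),
]
counts = count_unique_umis(records)
```

Keeping the record independent of any one sequencing technology is what lets downstream steps such as this counting pass stay the same across protocols.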
biorxiv bioinformatics 100-200-users 2018
OrthoFinder: phylogenetic orthology inference for comparative genomics, bioRxiv, 2018-11-08
Abstract: Here, we present a major advance of the OrthoFinder method. This extends OrthoFinder’s high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomic statistics. Each output is benchmarked on appropriate real or simulated datasets and, where comparable methods exist, OrthoFinder is equivalent to or outperforms these methods. Furthermore, OrthoFinder is the most accurate ortholog inference method on the Quest for Orthologs benchmark test. Finally, OrthoFinder’s comprehensive phylogenetic analysis is achieved with equivalent speed and scalability to the fastest, score-based heuristic methods. OrthoFinder is available at https://github.com/davidemms/OrthoFinder.
biorxiv bioinformatics 200-500-users 2018
AnnoTree: visualization and exploration of a functionally annotated microbial tree of life, bioRxiv, 2018-11-06
Abstract: Bacterial genomics has revolutionized our understanding of the microbial tree of life; however, mapping and visualizing the distribution of functional traits across bacteria remains a challenge. Here, we introduce AnnoTree, an interactive, functionally annotated bacterial tree of life that integrates taxonomic, phylogenetic, and functional annotation data from nearly 24,000 bacterial genomes. AnnoTree enables visualization of millions of precomputed genome annotations across the bacterial phylogeny, thereby allowing users to explore gene distributions as well as patterns of gene gain and loss across bacteria. Using AnnoTree, we examined the phylogenomic distributions of 28,311 gene/protein families, and measured their phylogenetic conservation, patchiness, and lineage-specificity. Our analyses revealed widespread phylogenetic patchiness among bacterial gene families, reflecting the dynamic evolution of prokaryotic genomes. Genes involved in phage infection/defense, mobile elements, and antibiotic resistance dominated the list of most patchy traits, as well as numerous intriguing metabolic enzymes that appear to have undergone frequent horizontal transfer. We anticipate that AnnoTree will be a valuable resource for exploring gene histories across bacteria, and will act as a catalyst for biological and evolutionary hypothesis generation.
biorxiv bioinformatics 100-200-users 2018
Exploring neighborhoods in large metagenome assembly graphs reveals hidden sequence diversity, bioRxiv, 2018-11-06
Genomes computationally inferred from large metagenomic data sets are often incomplete and may be missing functionally important content and strain variation. We introduce an information retrieval system for large metagenomic data sets that exploits the sparsity of DNA assembly graphs to efficiently extract subgraphs surrounding an inferred genome. We apply this system to recover missing content from genome bins and show that substantial genomic sequence variation is present in a real metagenome. Our software implementation is available at https://github.com/spacegraphcats/spacegraphcats under the 3-Clause BSD License.
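The idea of extracting a subgraph surrounding a set of query nodes can be illustrated with a plain breadth-first search over an adjacency map. This is a generic sketch and not the method used by spacegraphcats, whose indexing strategy the abstract does not describe; the `neighborhood` function, the dictionary-of-lists graph representation, and the `radius` parameter are assumptions made for the example.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List, Set


def neighborhood(
    graph: Dict[Hashable, List[Hashable]],
    seeds: Iterable[Hashable],
    radius: int,
) -> Set[Hashable]:
    """Return all nodes within `radius` edges of any seed node.

    `graph` maps each node to a list of its neighbours (undirected
    edges must be listed in both directions).
    """
    # Seeds that are absent from the graph are ignored.
    dist = {s: 0 for s in seeds if s in graph}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue  # do not expand beyond the requested radius
        for nb in graph.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return set(dist)


# Toy path graph a - b - c - d; node names are illustrative only.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
around_b = neighborhood(toy, {"b"}, radius=1)
```

On a sparse graph each node has few neighbours, so the work done by a search like this grows with the size of the extracted neighborhood and not with the size of the whole graph.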
biorxiv bioinformatics 100-200-users 2018