Machine Learning-based state-of-the-art methods for the classification of RNA-Seq data, bioRxiv, 2017-03-27
Abstract: RNA-Seq measures expression levels of several transcripts simultaneously. The identified reads can be assigned to genes, exons, or other regions of interest. Various computational tools have been developed for studying pathogens or viruses from RNA-Seq data by classifying them, according to their attributes, into several predefined classes, but tools and approaches for analyzing complex datasets are still lacking. The development of classification models is highly recommended for disease diagnosis and classification, disease monitoring at the molecular level, and the search for potential disease biomarkers. In this chapter, we discuss various machine learning approaches for RNA-Seq data classification and their implementation. Advancements in bioinformatics, along with developments in machine learning-based classification, would provide powerful toolboxes for classifying transcriptome information available through RNA-Seq data.
biorxiv bioinformatics 100-200-users 2017
UpSetR: An R Package for the Visualization of Intersecting Sets and their Properties, bioRxiv, 2017-03-26
Abstract: Venn and Euler diagrams are a popular yet inadequate solution for quantitative visualization of set intersections. A scalable alternative to Venn and Euler diagrams for visualizing intersecting sets and their properties is needed. We developed UpSetR, an open source R package that employs a scalable matrix-based visualization to show intersections of sets, their size, and other properties. UpSetR is available at https://cran.r-project.org/package=UpSetR and released under the MIT License. A Shiny app is available at https://gehlenborglab.shinyapps.io/upsetr.
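The quantity a matrix-based intersection plot displays can be sketched in a few lines. The following Python snippet is an illustration only, not UpSetR's implementation (UpSetR is an R package). It assumes that each element is counted once, under the exact combination of sets that contain it. The function name `exclusive_intersections` and the example sets are hypothetical.

```python
from collections import Counter


def exclusive_intersections(sets):
    """Count elements by the exact combination of sets that contain them.

    `sets` maps a set name to a Python set of elements. The result maps a
    frozenset of set names to the number of elements belonging to exactly
    those sets and no others.
    """
    # For every element, record the names of all sets it appears in.
    membership = {}
    for name, members in sets.items():
        for element in members:
            membership.setdefault(element, set()).add(name)

    # Each distinct combination of names is one column of the matrix view.
    return Counter(frozenset(names) for names in membership.values())


if __name__ == "__main__":
    example = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 5}}
    counts = exclusive_intersections(example)
    # List the combinations from largest to smallest, as a bar chart would.
    for combo, size in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(sorted(combo), size)
```

Counting by exact membership keeps the bars disjoint, so their sizes sum to the number of distinct elements across all sets.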
biorxiv bioinformatics 200-500-users 2017
Assembling metagenomes, one community at a time, bioRxiv, 2017-03-25
Abstract:
Background: Metagenomics allows unprecedented access to uncultured environmental microorganisms. The analysis of metagenomic sequences facilitates gene prediction and annotation, and enables the assembly of draft genomes, including uncultured members of a community. However, while several platforms have been developed for this critical step, there is currently no clear framework for the assembly of metagenomic sequence data.
Results: To assist with selection of an appropriate metagenome assembler we evaluated the capabilities of nine prominent assembly tools on nine publicly-available environmental metagenomes, as well as three simulated datasets. Overall, we found that SPAdes provided the largest contigs and highest N50 values across 6 of the 9 environmental datasets, followed by MEGAHIT and metaSPAdes. MEGAHIT emerged as a computationally inexpensive alternative to SPAdes, assembling the most complex dataset using less than 500 GB of RAM and within 10 hours.
Conclusions: We found that assembler choice ultimately depends on the scientific question, the available resources and the bioinformatic competence of the researcher. We provide a concise workflow for the selection of the best assembly tool.
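For readers unfamiliar with the N50 statistic used to compare assemblers, here is a minimal Python sketch of the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name `n50` and the example lengths are illustrative and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")

    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths (in bp) for illustration only.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

A higher N50 indicates that more of the assembly sits in long contigs, but it says nothing about correctness, so it is usually read alongside other metrics.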
biorxiv bioinformatics 100-200-users 2017
RNA viruses drove adaptive introgressions between Neanderthals and modern humans, bioRxiv, 2017-03-25
Abstract: Neanderthals and modern humans came in contact with each other and interbred at least twice in the past 100,000 years. Such contact and interbreeding likely led both to the transmission of viruses novel to either species and to the exchange of adaptive alleles that provided resistance against the same viruses. Here, we show that viruses were responsible for dozens of adaptive introgressions between Neanderthals and modern humans. We identify RNA viruses, specifically lentiviruses and orthomyxoviruses, as likely drivers of introgressions from Neanderthals to Europeans. Our results imply that many introgressions between Neanderthals and modern humans were adaptive, and that host genetic variation can be used to understand ancient viral epidemics, potentially providing important insights regarding current and future epidemics.
One Sentence Summary: Once out of Africa, modern humans inherited from Neanderthals dozens of genes already adapted against viruses present in their new environment.
biorxiv evolutionary-biology 100-200-users 2017
STAR-Fusion: Fast and Accurate Fusion Transcript Detection from RNA-Seq, bioRxiv, 2017-03-25
Abstract:
Motivation: Fusion genes created by genomic rearrangements can be potent drivers of tumorigenesis. However, accurate identification of functional fusion genes from genomic sequencing requires whole genome sequencing, since exonic sequencing alone is often insufficient. Transcriptome sequencing provides a direct, highly effective alternative for capturing molecular evidence of expressed fusions in the precision medicine pipeline, but current methods tend to be inefficient or insufficiently accurate, lacking in sensitivity or predicting large numbers of false positives. Here, we describe STAR-Fusion, a method that is both fast and accurate in identifying fusion transcripts from RNA-Seq data.
Results: We benchmarked STAR-Fusion’s fusion detection accuracy using both simulated and genuine Illumina paired-end RNA-Seq data, and show that it has superior performance compared to popular alternative fusion detection methods.
Availability and implementation: STAR-Fusion is implemented in Perl, freely available as open source software at http://star-fusion.github.io, and supported on Linux.
Contact: bhaas@broadinstitute.org
biorxiv bioinformatics 0-100-users 2017
Visualizing Structure and Transitions for Biological Data Exploration, bioRxiv, 2017-03-25
Abstract: With the advent of high-throughput technologies measuring high-dimensional biological data, there is a pressing need for visualization tools that reveal the structure and emergent patterns of data in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure in data by an information-geometric distance between datapoints. We perform extensive comparison between PHATE and other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data including continual progressions, branches, and clusters. We define a manifold preservation metric DEMaP to show that PHATE produces quantitatively better denoised embeddings than existing visualization methods. We show that PHATE is able to gain unique insight from a newly generated scRNA-seq dataset of human germ layer differentiation. Here, PHATE reveals a dynamic picture of the main developmental branches in unparalleled detail, including the identification of three novel subpopulations. Finally, we show that PHATE is applicable to a wide variety of datatypes including mass cytometry, single-cell RNA-sequencing, Hi-C, and gut microbiome data, where it can generate interpretable insights into the underlying systems.
biorxiv bioinformatics 0-100-users 2017