UpSetR: An R Package for the Visualization of Intersecting Sets and their Properties, bioRxiv, 2017-03-26
Abstract. Venn and Euler diagrams are a popular yet inadequate solution for quantitative visualization of set intersections. A scalable alternative to Venn and Euler diagrams for visualizing intersecting sets and their properties is needed. We developed UpSetR, an open source R package that employs a scalable matrix-based visualization to show intersections of sets, their size, and other properties. UpSetR is available at https://cran.r-project.org/package=UpSetR and released under the MIT License. A Shiny app is available at https://gehlenborglab.shinyapps.io/upsetr.
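To illustrate the kind of data a matrix-based intersection plot summarizes, here is a minimal Python sketch that counts how many elements fall into each exact combination of sets. It is not UpSetR code and does not use the package's API; the function name `exclusive_intersections` and the example sets are assumptions made for illustration only.

```python
from collections import Counter


def exclusive_intersections(sets):
    """Count elements by the exact combination of sets they belong to.

    `sets` maps a set name to a Python set of elements. Each element is
    counted once, under the full list of sets that contain it.
    """
    membership = {}
    for name, members in sets.items():
        for element in members:
            membership.setdefault(element, set()).add(name)

    counts = Counter(frozenset(names) for names in membership.values())

    # Largest intersections first; ties are broken by set names.
    return sorted(
        ((tuple(sorted(combo)), size) for combo, size in counts.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical example data, not taken from the paper.
example_sets = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {4, 5, 6, 7},
}
rows = exclusive_intersections(example_sets)
for combo, size in rows:
    print(" & ".join(combo), size)
```

Each output row corresponds to one column of such a plot: the combination of sets would be drawn as filled dots in the matrix, and the size as the bar above it.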
biorxiv bioinformatics 200-500-users 2017
Assembling metagenomes, one community at a time, bioRxiv, 2017-03-25
Abstract. Background: Metagenomics allows unprecedented access to uncultured environmental microorganisms. The analysis of metagenomic sequences facilitates gene prediction and annotation, and enables the assembly of draft genomes, including uncultured members of a community. However, while several platforms have been developed for this critical step, there is currently no clear framework for the assembly of metagenomic sequence data. Results: To assist with selection of an appropriate metagenome assembler we evaluated the capabilities of nine prominent assembly tools on nine publicly-available environmental metagenomes, as well as three simulated datasets. Overall, we found that SPAdes provided the largest contigs and highest N50 values across 6 of the 9 environmental datasets, followed by MEGAHIT and metaSPAdes. MEGAHIT emerged as a computationally inexpensive alternative to SPAdes, assembling the most complex dataset using less than 500 GB of RAM and within 10 hours. Conclusions: We found that assembler choice ultimately depends on the scientific question, the available resources and the bioinformatic competence of the researcher. We provide a concise workflow for the selection of the best assembly tool.
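For readers unfamiliar with the N50 statistic mentioned above, the following Python sketch computes it under the standard definition: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the example lengths are illustrative assumptions, not values from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the cumulative length first reaches half of the total.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")

    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2

    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in base pairs.
example_lengths = [100, 80, 60, 40, 20]
print(n50(example_lengths))
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is commonly reported alongside the largest contig size when comparing assemblers.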
biorxiv bioinformatics 100-200-users 2017
STAR-Fusion: Fast and Accurate Fusion Transcript Detection from RNA-Seq, bioRxiv, 2017-03-25
Abstract. Motivation: Fusion genes created by genomic rearrangements can be potent drivers of tumorigenesis. However, accurate identification of functional fusion genes from genomic sequencing requires whole genome sequencing, since exonic sequencing alone is often insufficient. Transcriptome sequencing provides a direct, highly effective alternative for capturing molecular evidence of expressed fusions in the precision medicine pipeline, but current methods tend to be inefficient or insufficiently accurate, lacking in sensitivity or predicting large numbers of false positives. Here, we describe STAR-Fusion, a method that is both fast and accurate in identifying fusion transcripts from RNA-Seq data. Results: We benchmarked STAR-Fusion’s fusion detection accuracy using both simulated and genuine Illumina paired-end RNA-Seq data, and show that it has superior performance compared to popular alternative fusion detection methods. Availability and implementation: STAR-Fusion is implemented in Perl, freely available as open source software at http://star-fusion.github.io, and supported on Linux. Contact: bhaas@broadinstitute.org
biorxiv bioinformatics 0-100-users 2017
Visualizing Structure and Transitions for Biological Data Exploration, bioRxiv, 2017-03-25
Abstract. With the advent of high-throughput technologies measuring high-dimensional biological data, there is a pressing need for visualization tools that reveal the structure and emergent patterns of data in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure in data by an information-geometric distance between datapoints. We perform extensive comparison between PHATE and other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data including continual progressions, branches, and clusters. We define a manifold preservation metric DEMaP to show that PHATE produces quantitatively better denoised embeddings than existing visualization methods. We show that PHATE is able to gain unique insight from a newly generated scRNA-seq dataset of human germ layer differentiation. Here, PHATE reveals a dynamic picture of the main developmental branches in unparalleled detail, including the identification of three novel subpopulations. Finally, we show that PHATE is applicable to a wide variety of datatypes including mass cytometry, single-cell RNA-sequencing, Hi-C, and gut microbiome data, where it can generate interpretable insights into the underlying systems.
biorxiv bioinformatics 0-100-users 2017
Multiplexing droplet-based single cell RNA-sequencing using natural genetic barcodes, bioRxiv, 2017-03-21
Droplet-based single-cell RNA-sequencing (dscRNA-seq) has enabled rapid, massively parallel profiling of transcriptomes from tens of thousands of cells. Multiplexing samples for single cell capture and library preparation in dscRNA-seq would enable cost-effective designs of differential expression and genetic studies while avoiding technical batch effects, but its implementation remains challenging. Here, we introduce an in silico algorithm, demuxlet, that harnesses natural genetic variation to discover the sample identity of each cell and identify droplets containing two cells. These capabilities enable multiplexed dscRNA-seq experiments where cells from unrelated individuals are pooled and captured at higher throughput than standard workflows. To demonstrate the performance of demuxlet, we sequenced 3 pools of peripheral blood mononuclear cells (PBMCs) from 8 lupus patients. Given genotyping data for each individual, demuxlet correctly recovered the sample identity of > 99% of singlets, and identified doublets at rates consistent with previous estimates. In PBMCs, we demonstrate the utility of multiplexed dscRNA-seq in two applications: characterizing cell type specificity and inter-individual variability of cytokine response from 8 lupus patients, and mapping genetic variants associated with cell type specific gene expression from 23 donors. Demuxlet is fast, accurate, scalable and could be extended to other single cell datasets that incorporate natural or synthetic DNA barcodes.
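The idea of using natural genetic variation to recover sample identity can be sketched with a deliberately simplified likelihood model. The Python example below is not demuxlet's implementation and makes several assumptions not stated in the abstract: reads at each SNP are treated as independent draws, a single fixed sequencing error rate is used, and doublets are not modeled. The function name `assign_sample`, the error rate, and all example data are hypothetical.

```python
import math


def assign_sample(allele_counts, genotypes, error_rate=0.01):
    """Pick the donor whose genotypes best explain one cell's reads.

    allele_counts: {snp_id: (ref_reads, alt_reads)} observed in the cell.
    genotypes: {donor: {snp_id: alt_dosage}} with alt_dosage in {0, 1, 2}.
    Returns the best-scoring donor and the log-likelihood for every donor.
    """
    scores = {}
    for donor, dosage_by_snp in genotypes.items():
        log_lik = 0.0
        for snp, (ref_reads, alt_reads) in allele_counts.items():
            alt_fraction = dosage_by_snp[snp] / 2
            # Probability of reading an alternate allele, allowing for errors.
            p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
            log_lik += alt_reads * math.log(p_alt) + ref_reads * math.log(1 - p_alt)
        scores[donor] = log_lik

    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical genotypes for two donors and read counts for one cell.
donor_genotypes = {
    "donor1": {"snp1": 0, "snp2": 2, "snp3": 1},
    "donor2": {"snp1": 2, "snp2": 0, "snp3": 1},
}
cell_counts = {"snp1": (3, 0), "snp2": (0, 2), "snp3": (1, 1)}

best, scores = assign_sample(cell_counts, donor_genotypes)
print(best)
```

Only SNPs where donors differ in genotype carry information: in this toy example `snp3` contributes equally to both donors, so the assignment is driven entirely by `snp1` and `snp2`.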
biorxiv bioinformatics 0-100-users 2017
xCell: Digitally portraying the tissue cellular heterogeneity landscape, bioRxiv, 2017-03-07
Abstract. Tissues are complex milieus consisting of numerous cell types. Numerous recent methods attempt to enumerate cell subsets from transcriptomes. However, available methods used limited sources for training and displayed only a partial portrayal of the full cellular landscape. Here we present xCell, a novel gene-signature based method for inferring 64 immune and stroma cell types. We harmonized 1,822 pure human cell-type transcriptomes from various sources, employed a curve-fitting approach for linear comparison of cell types, and introduced a novel spillover compensation technique for separating between cell types. Using extensive in silico analyses and comparison to cytometry immunophenotyping, we show that xCell outperforms other methods (http://xCell.ucsf.edu).
biorxiv bioinformatics 0-100-users 2017