Assembling metagenomes, one community at a time, bioRxiv, 2017-03-25
Abstract
Background: Metagenomics allows unprecedented access to uncultured environmental microorganisms. The analysis of metagenomic sequences facilitates gene prediction and annotation, and enables the assembly of draft genomes, including uncultured members of a community. However, while several platforms have been developed for this critical step, there is currently no clear framework for the assembly of metagenomic sequence data.
Results: To assist with selection of an appropriate metagenome assembler we evaluated the capabilities of nine prominent assembly tools on nine publicly-available environmental metagenomes, as well as three simulated datasets. Overall, we found that SPAdes provided the largest contigs and highest N50 values across 6 of the 9 environmental datasets, followed by MEGAHIT and metaSPAdes. MEGAHIT emerged as a computationally inexpensive alternative to SPAdes, assembling the most complex dataset using less than 500 GB of RAM and within 10 hours.
Conclusions: We found that assembler choice ultimately depends on the scientific question, the available resources and the bioinformatic competence of the researcher. We provide a concise workflow for the selection of the best assembly tool.
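For readers unfamiliar with the N50 statistic used to compare assemblers in this abstract, here is a minimal sketch of how it is commonly computed. N50 is the contig length L such that contigs of length L or longer together cover at least half of the total assembly length. The function name `n50` and the example lengths are illustrative assumptions, not taken from the preprint or from any assembler's code.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (hypothetical helper).

    N50 is the length of the shortest contig in the smallest set of the
    longest contigs that together cover at least half of the assembly.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length


# Made-up example lengths (in bp), for illustration only.
example = [100, 80, 60, 40, 20]
print(n50(example))
```

A higher N50 indicates that more of the assembly sits in long contigs, but it says nothing about correctness, so it is usually read alongside other measures.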
biorxiv bioinformatics 100-200-users 2017
RNA viruses drove adaptive introgressions between Neanderthals and modern humans, bioRxiv, 2017-03-25
Abstract: Neanderthals and modern humans came in contact with each other and interbred at least twice in the past 100,000 years. Such contact and interbreeding likely led both to the transmission of viruses novel to either species and to the exchange of adaptive alleles that provided resistance against the same viruses. Here, we show that viruses were responsible for dozens of adaptive introgressions between Neanderthals and modern humans. We identify RNA viruses, specifically lentiviruses and orthomyxoviruses, as likely drivers of introgressions from Neanderthals to Europeans. Our results imply that many introgressions between Neanderthals and modern humans were adaptive, and that host genetic variation can be used to understand ancient viral epidemics, potentially providing important insights regarding current and future epidemics.
One Sentence Summary: Once out of Africa, modern humans inherited from Neanderthals dozens of genes already adapted against viruses present in their new environment.
biorxiv evolutionary-biology 100-200-users 2017
STAR-Fusion: Fast and Accurate Fusion Transcript Detection from RNA-Seq, bioRxiv, 2017-03-25
Abstract
Motivation: Fusion genes created by genomic rearrangements can be potent drivers of tumorigenesis. However, accurate identification of functional fusion genes from genomic sequencing requires whole genome sequencing, since exonic sequencing alone is often insufficient. Transcriptome sequencing provides a direct, highly effective alternative for capturing molecular evidence of expressed fusions in the precision medicine pipeline, but current methods tend to be inefficient or insufficiently accurate, lacking in sensitivity or predicting large numbers of false positives. Here, we describe STAR-Fusion, a method that is both fast and accurate in identifying fusion transcripts from RNA-Seq data.
Results: We benchmarked STAR-Fusion's fusion detection accuracy using both simulated and genuine Illumina paired-end RNA-Seq data, and show that it has superior performance compared to popular alternative fusion detection methods.
Availability and implementation: STAR-Fusion is implemented in Perl, freely available as open source software at http://star-fusion.github.io, and supported on Linux.
Contact: bhaas@broadinstitute.org
biorxiv bioinformatics 0-100-users 2017
Visualizing Structure and Transitions for Biological Data Exploration, bioRxiv, 2017-03-25
Abstract: With the advent of high-throughput technologies measuring high-dimensional biological data, there is a pressing need for visualization tools that reveal the structure and emergent patterns of data in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure in data by an information-geometric distance between datapoints. We perform extensive comparison between PHATE and other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data including continual progressions, branches, and clusters. We define a manifold preservation metric DEMaP to show that PHATE produces quantitatively better denoised embeddings than existing visualization methods. We show that PHATE is able to gain unique insight from a newly generated scRNA-seq dataset of human germ layer differentiation. Here, PHATE reveals a dynamic picture of the main developmental branches in unparalleled detail, including the identification of three novel subpopulations. Finally, we show that PHATE is applicable to a wide variety of datatypes including mass cytometry, single-cell RNA-sequencing, Hi-C, and gut microbiome data, where it can generate interpretable insights into the underlying systems.
biorxiv bioinformatics 0-100-users 2017
Light Sheet Theta Microscopy for High-resolution Quantitative Imaging of Large Biological Systems, bioRxiv, 2017-03-23
Abstract: Advances in tissue clearing and molecular labelling methods are enabling unprecedented optical access to large intact biological systems. These advances fuel the need for high-speed microscopy approaches to image large samples quantitatively and at high resolution. While Light Sheet Microscopy (LSM), with its high planar imaging speed and low photo-bleaching, can be effective, scaling up to larger imaging volumes has been hindered by the use of orthogonal light-sheet illumination. To address this fundamental limitation, we have developed Light Sheet Theta Microscopy (LSTM), which uniformly illuminates samples from the same side as the detection objective, thereby eliminating limits on lateral dimensions without sacrificing imaging resolution, depth and speed. We present a detailed characterization of LSTM, and show that this approach achieves rapid imaging of large intact samples with more uniform high resolution than LSM. LSTM is a significant step in high-resolution quantitative mapping of structure and function of large intact biological systems.
biorxiv neuroscience 0-100-users 2017
Single-cell RNA-seq of human induced pluripotent stem cells reveals cellular heterogeneity and cell state transitions between subpopulations, bioRxiv, 2017-03-23
Abstract: Heterogeneity of cell states represented in pluripotent cultures has not been described at the transcriptional level. Since gene expression is highly heterogeneous between cells, single-cell RNA sequencing can be used to identify how individual pluripotent cells function. Here, we present results from the analysis of single-cell RNA sequencing data from 18,787 individual WTC CRISPRi human induced pluripotent stem cells. We developed an unsupervised clustering method, and through this identified four subpopulations distinguishable on the basis of their pluripotent state: a core pluripotent population (48.3%), proliferative (47.8%), early-primed for differentiation (2.8%) and late-primed for differentiation (1.1%). For each subpopulation we were able to identify the genes and pathways that define differences in pluripotent cell states. Our method identified four transcriptionally distinct predictor gene sets comprising 165 unique genes that denote the specific pluripotency states; and using these sets, we developed a multigenic machine learning prediction method to accurately classify single cells into each of the subpopulations. Compared against a set of established pluripotency markers, our method increases prediction accuracy by 10%, specificity by 20%, and explains a substantially larger proportion of deviance (up to 3-fold) from the prediction model. Finally, we developed an innovative method to predict cells transitioning between subpopulations, and support our conclusions with results from two orthogonal pseudotime trajectory methods.
biorxiv genomics 0-100-users 2017