bioinformatics | audiences

Graphmap2 - splice-aware RNA-seq mapper for long reads, bioRxiv, 2019-07-31

Abstract: In this paper we present Graphmap2, a splice-aware mapper built on our previously developed DNA mapper Graphmap. Graphmap2 is tailored for long reads produced by Pacific Biosciences and Oxford Nanopore devices. It uses several newly developed algorithms which enable higher precision and recall of correctly detected transcripts and exon boundaries. We compared its performance with the state-of-the-art tools Minimap2 and Gmap. On both simulated and real datasets Graphmap2 achieves higher mappability and more correctly recognized exons and their ends. In addition, we present an analysis of the potential of splice-aware mappers and long reads for the identification of previously unknown isoforms and even genes. The Graphmap2 tool is publicly available at https://github.com/lbcb-sci/graphmap2.

biorxiv bioinformatics 0-100-users 2019

Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit, bioRxiv, 2019-07-26

Abstract: Present workflows for producing human genome assemblies from long-read technologies have cost and production time bottlenecks that prohibit efficient scaling to large cohorts. We demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. The sequencing, performed on one machine in nine days, achieved an average 63x coverage, 42 Kb read N50, 90% median read identity and 6.5x coverage in 100 Kb+ reads using just three flow cells per sample. To assemble these data we introduce new computational tools: Shasta, a de novo long-read assembler, and MarginPolish & HELEN, a suite of nanopore assembly polishing algorithms. On a single commercial compute node Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. We evaluate assembly performance for diploid, haploid and trio-binned human samples in terms of accuracy, cost, and time and demonstrate improvements relative to current state-of-the-art methods in all areas. We further show that addition of proximity ligation (Hi-C) sequencing yields near chromosome-level scaffolds for all eleven genomes.
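
For readers unfamiliar with the two summary statistics quoted in this abstract, here is a minimal Python sketch of how a read N50 and a Phred-scaled quality value (QV) are conventionally computed. The function names `read_n50` and `identity_to_qv` are illustrative assumptions and do not come from Shasta, MarginPolish or HELEN; the sketch only shows why 99.9% identity corresponds to roughly QV30.

```python
import math


def read_n50(lengths):
    """Return the N50 of a collection of read lengths.

    N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def identity_to_qv(identity):
    """Convert a fractional identity (e.g. 0.999) to a Phred-scaled QV.

    QV = -10 * log10(error rate), where error rate = 1 - identity.
    """
    return -10.0 * math.log10(1.0 - identity)


if __name__ == "__main__":
    print(read_n50([2, 3, 4, 5, 6]))      # toy read lengths, not real data
    print(round(identity_to_qv(0.999)))   # 99.9% identity on the Phred scale
```

Under this definition each additional "nine" of identity adds ten QV points, so 99% is about QV20 and 99.9% about QV30.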

biorxiv bioinformatics 200-500-users 2019

On the discovery of population-specific state transitions from multi-sample multi-condition single-cell RNA sequencing data, bioRxiv, 2019-07-26

Abstract: Single-cell RNA sequencing (scRNA-seq) has quickly become an empowering technology to profile the transcriptomes of individual cells on a large scale. Many early analyses of differential expression have aimed at identifying differences between subpopulations, and thus are focused on finding subpopulation markers either in a single sample or across multiple samples. More generally, such methods can compare expression levels in multiple sets of cells, thus leading to cross-condition analyses. However, given the emergence of replicated multi-condition scRNA-seq datasets, an area of increasing focus is making sample-level inferences, termed here as differential state analysis. For example, one could investigate the condition-specific responses of cell subpopulations measured from patients from each condition; however, it is not clear which statistical framework best handles this situation. In this work, we surveyed the methods available to perform cross-condition differential state analyses, including cell-level mixed models and methods based on aggregated “pseudobulk” data. We developed a flexible simulation platform that mimics both single and multi-sample scRNA-seq data and provide robust tools for multi-condition analysis within the muscat R package.
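
The "pseudobulk" idea mentioned in this abstract can be illustrated with a short Python sketch: per-cell counts are summed within each (sample, subpopulation) group so that downstream tests operate on one profile per sample rather than one per cell. This is an assumption-laden illustration, not muscat's implementation (muscat is an R package); the function name `pseudobulk`, the input layout and the choice of summation as the aggregation are all hypothetical.

```python
def pseudobulk(counts, sample_ids, cluster_ids):
    """Sum per-cell gene counts within each (sample, cluster) group.

    counts      -- list of per-cell count vectors (one list of ints per cell)
    sample_ids  -- sample label for each cell
    cluster_ids -- subpopulation label for each cell
    Returns a dict mapping (sample, cluster) to the summed count vector.
    """
    sums = {}
    for cell, sample, cluster in zip(counts, sample_ids, cluster_ids):
        key = (sample, cluster)
        if key not in sums:
            sums[key] = [0] * len(cell)
        # Element-wise addition of this cell's counts to the group total.
        sums[key] = [a + b for a, b in zip(sums[key], cell)]
    return sums


if __name__ == "__main__":
    # Toy data: four cells, two genes (values are made up for illustration).
    counts = [[1, 0], [2, 3], [0, 5], [4, 4]]
    samples = ["s1", "s1", "s2", "s1"]
    clusters = ["T", "T", "T", "B"]
    for key, vector in sorted(pseudobulk(counts, samples, clusters).items()):
        print(key, vector)
```

Aggregating this way yields one observation per sample and subpopulation, which is what makes sample-level (rather than cell-level) inference possible.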

biorxiv bioinformatics 100-200-users 2019

RootNav 2.0: Deep Learning for Automatic Navigation of Complex Plant Root Architectures, bioRxiv, 2019-07-20

Abstract: We present a new image analysis approach that provides fully-automatic extraction of complex root system architectures from a range of plant species in varied imaging setups. Driven by modern deep-learning approaches, RootNav 2.0 replaces previously manual and semi-automatic feature extraction with an extremely deep multi-task Convolutional Neural Network architecture. The network has been designed to explicitly combine local pixel information with global scene information in order to accurately segment small root features across high-resolution images. In addition, the network simultaneously locates seeds, and first and second order root tips to drive a search algorithm seeking optimal paths throughout the image, extracting accurate architectures without user interaction. The proposed method is evaluated on images of wheat (Triticum aestivum L.) from a seedling assay. The results are compared with semi-automatic analysis via the original RootNav tool, demonstrating comparable accuracy, with a 10-fold increase in speed. We then demonstrate the ability of the network to adapt to different plant species via transfer learning, offering similar accuracy when transferred to an Arabidopsis thaliana plate assay. We transfer for a final time to images of Brassica napus from a hydroponic assay, and still demonstrate good accuracy despite many fewer training images. The tool outputs root architectures in the widely accepted RSML standard, for which numerous analysis packages exist (http://rootsystemml.github.io), as well as segmentation masks compatible with other automated measurement tools.

biorxiv bioinformatics 0-100-users 2019

Human Genome Assembly in 100 Minutes, bioRxiv, 2019-07-17

Abstract: De novo genome assembly provides comprehensive, unbiased genomic information and makes it possible to gain insight into new DNA sequences not present in reference genomes. Many de novo human genomes have been published in the last few years, leveraging a combination of inexpensive short-read and single-molecule long-read technologies. As long-read DNA sequencers become more prevalent, the computational burden of generating assemblies persists as a critical factor. The most common approach to long-read assembly, using an overlap-layout-consensus (OLC) paradigm, requires all-to-all read comparisons, which scale quadratically in computational complexity with the number of reads. We assert that recent achievements in sequencing technology (i.e. with accuracy ~99% and read length ~10-15k) enable a fundamentally better strategy for OLC that is effectively linear rather than quadratic. Our genome assembly implementation, Peregrine, uses sparse hierarchical minimizers (SHIMMER) to index reads, thereby avoiding the need for an all-to-all read comparison step. Peregrine can assemble 30x human PacBio CCS read datasets in less than 30 CPU hours and around 100 wall-clock minutes to a high contiguity assembly (N50 > 20Mb). The continued advance of sequencing technologies coupled with the Peregrine assembler enables routine generation of human de novo assemblies. This will allow for population scale measurements of more comprehensive genomic variations -- beyond SNPs and small indels -- as well as novel applications requiring rapid access to de novo assemblies.
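
To make the indexing idea in this abstract more concrete, the following Python sketch computes plain window minimizers: in every window of `w` consecutive k-mers, the smallest k-mer is kept as a marker. This is only the basic, single-level technique under simplifying assumptions (lexicographic ordering, one strand, no hashing). It is not Peregrine's code, and the sparse, hierarchical reduction that SHIMMER adds on top is not reproduced here. The function name `minimizers` and its parameters are hypothetical.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs of window minimizers.

    For each window of w consecutive k-mers, the lexicographically
    smallest k-mer is selected (ties broken by leftmost position).
    Adjacent windows often share a minimizer, so the output is a
    sparse subset of all k-mers.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add(best)
    return [(i, kmers[i]) for i in sorted(picked)]


if __name__ == "__main__":
    # Toy sequence for illustration only.
    for position, kmer in minimizers("ACGTACGT", k=3, w=3):
        print(position, kmer)
```

Because two reads that overlap share the same minimizers in the overlapping region, candidate overlaps can be found by looking up shared markers in an index instead of comparing every read against every other read.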

biorxiv bioinformatics 100-200-users 2019

Evaluating probabilistic programming and fast variational Bayesian inference in phylogenetics, bioRxiv, 2019-07-15

Abstract: Recent advances in statistical machine learning techniques have led to the creation of probabilistic programming frameworks. These frameworks enable probabilistic models to be rapidly prototyped and fit to data using scalable approximation methods such as variational inference. In this work, we explore the use of the Stan language for probabilistic programming in application to phylogenetic models. We show that many commonly used phylogenetic models including the general time reversible (GTR) substitution model, rate heterogeneity among sites, and a range of coalescent models can be implemented using a probabilistic programming language. The posterior probability distributions obtained via the black box variational inference engine in Stan were compared to those obtained with reference implementations of Markov chain Monte Carlo (MCMC) for phylogenetic inference. We find that black box variational inference in Stan is less accurate than MCMC methods for phylogenetic models, but requires far less compute time. Finally, we evaluate a custom implementation of mean-field variational inference on the Jukes-Cantor substitution model and show that a specialized implementation of variational inference can be two orders of magnitude faster and more accurate than a general purpose probabilistic implementation.
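
As background for the Jukes-Cantor model named in this abstract, here is a small Python sketch of its standard closed-form transition probabilities and the matching distance estimate. It is not the paper's variational inference code and is unrelated to Stan; the function names `jc69_transition` and `jc69_distance` are assumptions introduced for illustration. Branch length `t` is measured in expected substitutions per site.

```python
import math


def jc69_transition(t):
    """Return (p_same, p_diff) under the Jukes-Cantor model.

    p_same -- probability a site holds the same base after branch length t
    p_diff -- probability it holds one specific different base
    """
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def jc69_distance(p):
    """Estimate branch length from the observed fraction p of differing sites.

    Valid for 0 <= p < 0.75; inverts the transition probabilities above.
    """
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    p_same, p_diff = jc69_transition(0.1)
    print(p_same + 3 * p_diff)          # the four probabilities sum to one
    print(jc69_distance(3 * p_diff))    # recovers the branch length
```

Because every substitution is equally likely under this model, a single parameter (the branch length) determines the whole transition matrix, which is part of what makes it a convenient test case for a specialized inference implementation.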

biorxiv bioinformatics 0-100-users 2019