bioinformatics | audiences

Oxford Nanopore Sequencing, Hybrid Error Correction, and de novo Assembly of a Eukaryotic Genome, bioRxiv, 2015-01-15

Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, and we used it to sequence the S. cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm, Nanocorr (https://github.com/jgurtowski/nanocorr), specifically for Oxford Nanopore reads, as existing packages were incapable of assembling the long read lengths (5-50 kbp) at such high error rates (between ~5 and 40% error). With this new method we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: the contig N50 length is more than ten times greater than that of an Illumina-only assembly (678 kbp versus 59.9 kbp), and the assembly has greater than 99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.
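
The contiguity figures above are contig N50 values: the length L such that contigs of length L or longer contain at least half of the assembled bases. The following is a minimal sketch of that calculation in Python. The function name `n50` and the toy contig lengths are illustrative assumptions, not part of Nanocorr or of the assemblies reported here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical, not data from the paper).
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # prints 70: 80 + 70 = 150 covers half of 290
```

A larger N50 means fewer, longer contigs for the same amount of assembled sequence, which is why it is used here to compare the hybrid and Illumina-only assemblies.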

biorxiv bioinformatics 100-200-users 2015

SpeedSeq: Ultra-fast personal genome analysis and interpretation, bioRxiv, 2014-12-05

Comprehensive interpretation of human genome sequencing data is a challenging bioinformatic problem that typically requires weeks of analysis, with extensive hands-on expert involvement. This informatics bottleneck inflates genome sequencing costs, poses a computational burden for large-scale projects, and impedes the adoption of time-critical clinical applications such as personalized cancer profiling and newborn disease diagnosis, where the actionable timeframe can be measured in hours or days. We developed SpeedSeq, an open-source genome analysis platform that vastly reduces computing time. SpeedSeq accomplishes read alignment, duplicate removal, variant detection and functional annotation of a 50X human genome in <24 hours, even using one low-cost server. SpeedSeq offers competitive or superior performance to current methods for detecting germline and somatic single nucleotide variants (SNVs), indels, and structural variants (SVs) and includes novel functionality for SV genotyping, SV annotation, fusion gene detection, and rapid identification of actionable mutations. SpeedSeq will help bring timely genome analysis into the clinical realm.

biorxiv bioinformatics 0-100-users 2014

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing, bioRxiv, 2014-08-15

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
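
The overlap step described above depends on MinHash, which estimates how many k-mers two reads share by comparing small fixed-size sketches instead of the full k-mer sets. The following is a minimal sketch of that idea in Python, not the MHAP implementation. The function names, the k-mer size, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """64-bit hash of a k-mer; each seed gives a different hash function."""
    digest = hashlib.blake2b(
        kmer.encode("ascii"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=8, num_hashes=128):
    """Keep, for each hash function, the minimum hash over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree estimates Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Two toy reads that share a long substring (hypothetical data).
read_a = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCGTAGCTAGCTAACG"
read_b = "GCCGATAGCTAGGCTTAACGGATCCGTAGCTAGCTAACGTTAGGCATC"
estimate = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
print(f"estimated k-mer Jaccard similarity: {estimate:.2f}")
```

Because each sketch has a fixed size regardless of read length, candidate overlaps can be screened by comparing sketches rather than aligning every pair of reads, which is the property that makes this kind of hashing attractive for large genomes.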

biorxiv bioinformatics 100-200-users 2014

Poretools: a toolkit for analyzing nanopore sequence data, bioRxiv, 2014-07-24

Motivation: Nanopore sequencing may be the next disruptive technology in genomics. Nanopore sequencing has many attractive properties including the ability to detect single DNA molecules without prior amplification, the lack of reliance on expensive optical components, and the ability to sequence very long fragments. The MinION from Oxford Nanopore Technologies (ONT) is the first nanopore sequencer to be commercialised and made available to early-access users. The MinION(TM) is a USB-connected, portable nanopore sequencer which permits real-time analysis of streaming event data. A cloud-based service is available to translate events into nucleotide base calls. However, software support to deal with such data is limited, and the community lacks a standardized toolkit for the analysis of nanopore datasets. Results: We introduce poretools, a flexible toolkit for manipulating and exploring datasets generated by nanopore sequencing devices from MinION for the purposes of quality control and downstream analysis. Poretools operates directly on the native FAST5 (a variant of the HDF5 standard) file format produced by ONT and provides a wealth of format conversion utilities and data exploration and visualization tools. Availability and implementation: Poretools is open source software and is written in Python as both a suite of command line utilities and a Python application programming interface. Source code and user documentation are freely available on GitHub at https://github.com/arq5x/poretools. Contact: n.j.loman@bham.ac.uk, aaronquinlan@gmail.com. Supplementary information: An IPython notebook demonstrating the use and functionality of poretools in greater detail is available from the GitHub repository.

biorxiv bioinformatics 0-100-users 2014

Flexible analysis of transcriptome assemblies with Ballgown, bioRxiv, 2014-03-31

We have built a statistical package called Ballgown for estimating differential expression of genes, transcripts, or exons from RNA sequencing experiments. Ballgown is designed to work with the popular Cufflinks transcript assembly software and uses well-motivated statistical methods to provide estimates of changes in expression. It permits statistical analysis at the transcript level for a wide variety of experimental designs, allows adjustment for confounders, and handles studies with continuous covariates. Ballgown provides improved statistical significance estimates as compared to the Cuffdiff differential expression tool included with Cufflinks. We demonstrate the flexibility of the Ballgown package by re-analyzing 667 samples from the GEUVADIS study to identify transcript-level eQTLs and identify non-linear artifacts in transcript data. Our package is freely available from https://github.com/alyssafrazee/ballgown.

biorxiv bioinformatics 0-100-users 2014

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2, bioRxiv, 2014-02-20

In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq data, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data. DESeq2 uses shrinkage estimation for dispersions and fold changes to improve stability and interpretability of the estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression and facilitates downstream tasks such as gene ranking and visualization. DESeq2 is available as an R/Bioconductor package.
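
The effect of shrinking fold changes can be illustrated with a toy calculation. The following Python sketch is not DESeq2's estimator: it assumes a simple normal approximation in which an observed log2 fold change with a known standard error is combined with a zero-centered normal prior, so the posterior mean is the observed value scaled by a weight between 0 and 1. The function name `shrink_lfc` and all numbers are hypothetical.

```python
def shrink_lfc(lfc, se, prior_sd):
    """Posterior mean of a log fold change under a N(0, prior_sd^2) prior.

    Assumes the observed estimate is normally distributed around the
    true value with standard error `se` (a simplifying assumption).
    """
    if se < 0 or prior_sd <= 0:
        raise ValueError("se must be >= 0 and prior_sd must be > 0")
    # Noisy estimates (large se) get a small weight and move toward zero.
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return weight * lfc


# Same observed fold change, different amounts of information (toy values).
well_measured = shrink_lfc(lfc=2.0, se=0.2, prior_sd=1.0)
poorly_measured = shrink_lfc(lfc=2.0, se=2.0, prior_sd=1.0)
print(f"well measured:   {well_measured:.2f}")
print(f"poorly measured: {poorly_measured:.2f}")
```

In this sketch a gene with a precise estimate keeps most of its fold change, while a gene with a noisy estimate is pulled strongly toward zero. That difference is what lets a ranking by shrunken fold change reflect the strength of the effect and not only the noise in low-count genes.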

biorxiv bioinformatics 0-100-users 2014