SpeedSeq: Ultra-fast personal genome analysis and interpretation, bioRxiv, 2014-12-05
Comprehensive interpretation of human genome sequencing data is a challenging bioinformatic problem that typically requires weeks of analysis, with extensive hands-on expert involvement. This informatics bottleneck inflates genome sequencing costs, poses a computational burden for large-scale projects, and impedes the adoption of time-critical clinical applications such as personalized cancer profiling and newborn disease diagnosis, where the actionable timeframe can measure in hours or days. We developed SpeedSeq, an open-source genome analysis platform that vastly reduces computing time. SpeedSeq accomplishes read alignment, duplicate removal, variant detection and functional annotation of a 50X human genome in <24 hours, even using one low-cost server. SpeedSeq offers competitive or superior performance to current methods for detecting germline and somatic single nucleotide variants (SNVs), indels, and structural variants (SVs) and includes novel functionality for SV genotyping, SV annotation, fusion gene detection, and rapid identification of actionable mutations. SpeedSeq will help bring timely genome analysis into the clinical realm.
biorxiv bioinformatics 0-100-users 2014
When to use Quantile Normalization?, bioRxiv, 2014-12-05
Normalization and preprocessing are essential steps for the analysis of high-throughput data including next-generation sequencing and microarrays. Multi-sample global normalization methods, such as quantile normalization, have been successfully used to remove technical variation from noisy data. These methods rely on the assumption that observed global changes across samples are due to unwanted technical variability. Transforming the data to remove these differences has the potential to remove interesting biologically driven global variation and therefore may not be appropriate depending on the type and source of variation. Currently, it is up to the subject matter experts, for example biologists, to determine if the stated assumptions are appropriate or not. Here, we propose a data-driven method to test for the assumptions of global normalization methods. We demonstrate the utility of our method (quantro) by applying it to multiple gene expression and DNA methylation datasets and show examples of when global normalization methods are not appropriate. We also perform a Monte Carlo simulation study to illustrate how our method generally outperforms the current approach. An R package implementing our method is available on Bioconductor (http://www.bioconductor.org/packages/release/bioc/html/quantro.html).
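To make the assumption concrete, here is a minimal sketch of plain quantile normalization in Python. It is not the quantro test and not taken from the preprint; the function name `quantile_normalize` and the toy values are illustrative assumptions. Each sample is forced onto a shared reference distribution (the mean value at each rank), so any global shift between samples is erased, whether it was technical or biological.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length samples (one list of values per sample).

    Hypothetical helper for illustration only; ties are broken by position.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across samples at each rank.
    reference = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]

    normalized = []
    for col in columns:
        # Indices of this sample's values, ordered from smallest to largest.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace each value by the reference at its rank
        normalized.append(out)
    return normalized


# Toy data (made up): sample_b is globally shifted upward relative to sample_a.
sample_a = [5.0, 2.0, 3.0]
sample_b = [40.0, 10.0, 60.0]
normalized = quantile_normalize([sample_a, sample_b])
```

After normalization both samples contain exactly the same set of values, which is why the method is only appropriate when global differences are believed to be technical.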
biorxiv genomics 0-100-users 2014
Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants, bioRxiv, 2014-10-15
We compared whole-exome sequencing (WES) and whole-genome sequencing (WGS) in six unrelated individuals. In the regions targeted by WES capture (81.5% of the consensus coding genome), the mean numbers of single-nucleotide variants (SNVs) and small insertions/deletions (indels) detected per sample were 84,192 and 13,325, respectively, for WES, and 84,968 and 12,702, respectively, for WGS. For both SNVs and indels, the distributions of coverage depth, genotype quality, and minor read ratio were more uniform for WGS than for WES. After filtering, a mean of 74,398 (95.3%) high-quality (HQ) SNVs and 9,033 (70.6%) HQ indels were called by both platforms. A mean of 105 coding HQ SNVs and 32 indels were identified exclusively by WES, whereas 692 HQ SNVs and 105 indels were identified exclusively by WGS. We Sanger sequenced a random selection of these exclusive variants. For SNVs, the proportion of false-positive variants was higher for WES (78%) than for WGS (17%). The estimated mean number of real coding SNVs (656, ~3% of all coding HQ SNVs) identified by WGS and missed by WES was greater than the number of SNVs identified by WES and missed by WGS (26). For indels, the proportions of false-positive variants were similar for WES (44%) and WGS (46%). Finally, WES was not reliable for the detection of copy number variations, almost all of which extended beyond the targeted regions. Although currently more expensive, WGS is more powerful than WES for detecting potential disease-causing mutations within WES regions, particularly those due to SNVs.
biorxiv genetics 0-100-users 2014
Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing, bioRxiv, 2014-08-15
We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
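The core idea behind MinHash-based overlap detection can be sketched in a few lines of Python. This is a generic illustration of the technique, not MHAP's implementation: the function names, the k-mer size, the sketch size, and the use of salted BLAKE2b as the hash family are all assumptions made for the example. Each read is reduced to a short sketch (the minimum hash of its k-mers under several hash functions), and the fraction of matching sketch positions estimates the Jaccard similarity of the two reads' k-mer sets.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_sketch(seq, k=4, num_hashes=64):
    """Sketch a sequence as the minimum k-mer hash under each of num_hashes salted hashes.

    k and num_hashes are illustrative defaults, not values from the preprint.
    """
    kmer_set = kmers(seq, k)
    sketch = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        sketch.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(), "big")
            for km in kmer_set))
    return sketch


def jaccard_estimate(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; estimates k-mer set Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Toy reads (made up): read_b overlaps the second half of read_a.
read_a = "ACGTTGCATGCCGATAGCTAGGCTA"
read_b = "CCGATAGCTAGGCTATTGACCGTAA"
similarity = jaccard_estimate(minhash_sketch(read_a), minhash_sketch(read_b))
```

Comparing fixed-size sketches instead of full reads is what makes all-versus-all overlapping tractable for large genomes; candidate pairs with a high estimated similarity would then be passed to a more careful alignment step.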
biorxiv bioinformatics 100-200-users 2014
Poretools: a toolkit for analyzing nanopore sequence data, bioRxiv, 2014-07-24
Motivation: Nanopore sequencing may be the next disruptive technology in genomics. Nanopore sequencing has many attractive properties including the ability to detect single DNA molecules without prior amplification, the lack of reliance on expensive optical components, and the ability to sequence very long fragments. The MinION from Oxford Nanopore Technologies (ONT) is the first nanopore sequencer to be commercialised and made available to early-access users. The MinION(TM) is a USB-connected, portable nanopore sequencer which permits real-time analysis of streaming event data. A cloud-based service is available to translate events into nucleotide base calls. However, software support to deal with such data is limited, and the community lacks a standardized toolkit for the analysis of nanopore datasets. Results: We introduce poretools, a flexible toolkit for manipulating and exploring datasets generated by nanopore sequencing devices from MinION for the purposes of quality control and downstream analysis. Poretools operates directly on the native FAST5 (a variant of the HDF5 standard) file format produced by ONT and provides a wealth of format conversion utilities and data exploration and visualization tools. Availability and implementation: Poretools is open source software and is written in Python as both a suite of command line utilities and a Python application programming interface. Source code and user documentation are freely available on GitHub at https://github.com/arq5x/poretools. Contact: n.j.loman@bham.ac.uk, aaronquinlan@gmail.com. Supplementary information: An IPython notebook demonstrating the use and functionality of poretools in greater detail is available from the GitHub repository.
biorxiv bioinformatics 0-100-users 2014
Reagent contamination can critically impact sequence-based microbiome analyses, bioRxiv, 2014-07-17
The study of microbial communities has been revolutionised in recent years by the widespread adoption of culture-independent analytical techniques such as 16S rRNA gene sequencing and metagenomics. One potential confounder of these sequence-based approaches is the presence of contamination in DNA extraction kits and other laboratory reagents. In this study we demonstrate that contaminating DNA is ubiquitous in commonly used DNA extraction kits, varies greatly in composition between different kits and kit batches, and that this contamination critically impacts results obtained from samples containing a low microbial biomass. Contamination impacts both PCR-based 16S rRNA gene surveys and shotgun metagenomics. These results suggest that caution should be advised when applying sequence-based techniques to the study of microbiota present in low biomass environments. We provide an extensive list of potential contaminating genera, and guidelines on how to mitigate the effects of contamination. Concurrent sequencing of negative control samples is strongly advised.
biorxiv molecular-biology 100-200-users 2014