Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation, bioRxiv, 2015-08-30
RNA-seq technology is widely used in biomedical and basic science research. These studies rely on complex computational methods that quantify expression levels for observed transcripts. We find that current computational methods can lead to hundreds of false positive results related to alternative isoform usage. This flaw in the current methodology stems from a lack of modeling sample-specific bias that leads to drops in coverage and is related to sequence features like fragment GC content and GC stretches. By incorporating features that explain this bias into transcript expression models, we greatly increase the specificity of transcript expression estimates, with more than a four-fold reduction in the number of false positives for reported changes in expression. We introduce alpine, a method for estimation of bias-corrected transcript abundance. The method is available as a Bioconductor package that includes data visualization tools useful for bias discovery.
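The two sequence features named in this abstract, fragment GC content and GC stretches, can be computed as in the minimal sketch below. It only illustrates the features themselves; the function names are hypothetical and this is not the alpine package's bias model or API.

```python
def gc_content(fragment):
    """Fraction of G/C bases in a fragment (0.0 for an empty string)."""
    seq = fragment.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_gc_stretch(fragment):
    """Length of the longest uninterrupted run of G/C bases."""
    longest = current = 0
    for base in fragment.upper():
        # Extend the current run on G/C, otherwise reset it.
        current = current + 1 if base in "GC" else 0
        longest = max(longest, current)
    return longest
```

A bias model along the lines the abstract describes would presumably use such per-fragment values as covariates; how alpine does so is not stated here.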
biorxiv bioinformatics 100-200-users 2015
Salmon provides accurate, fast, and bias-aware transcript expression estimates using dual-phase inference, bioRxiv, 2015-06-28
We introduce Salmon, a new method for quantifying transcript abundance from RNA-seq reads that is highly accurate and very fast. Salmon is the first transcriptome-wide quantifier to model and correct for fragment GC content bias, which we demonstrate substantially improves the accuracy of abundance estimates and the reliability of subsequent differential expression analysis compared to existing methods that do not account for these biases. Salmon achieves its speed and accuracy by combining a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. These innovations yield both exceptional accuracy and order-of-magnitude speed benefits over alignment-based methods.
biorxiv bioinformatics 100-200-users 2015
SAM/BAM format v1.5 extensions for de novo assemblies, bioRxiv, 2015-05-30
Summary: The plain text Sequence Alignment/Map (SAM) file format and its companion binary form (BAM) are a generic alignment format for storing read alignments against reference sequences (and unmapped reads) together with structured meta-data (Li et al., 2009). Driven by the needs of the 1000 Genomes Project, which sequenced many individual human genomes, early SAM/BAM usage focused on pairwise alignments of reads to a reference. However, through the CIGAR P operator, multiple sequence alignments can also be preserved. Herein we describe clarifications and additions in version 1.5 of the specification to facilitate storing de novo sequence alignments: padded reference sequences (with gap characters), annotation of reads or regions of the reference, and the option of embedding the reference sequence within the file. Availability: The latest public release of the specification is at http://samtools.sourceforge.net/SAM1.pdf, with in-development drafts at https://github.com/samtools/hts-specs under version control.
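As a rough illustration of the CIGAR P (padding) operator mentioned in this abstract, the sketch below tallies how many query bases, reference bases, and padding columns a CIGAR string describes. The operator table follows the SAM specification (P consumes neither query nor reference); the function name `cigar_lengths` is hypothetical and the sketch does not validate malformed strings.

```python
import re

# CIGAR operators that consume query and/or reference bases (per the SAM specification).
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def cigar_lengths(cigar):
    """Return (query_length, reference_length, padding_columns) for a CIGAR string."""
    query = reference = padding = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in CONSUMES_QUERY:
            query += n
        if op in CONSUMES_REFERENCE:
            reference += n
        if op == "P":
            # Padding marks a column that is a gap in both this read and the
            # unpadded reference, e.g. where another read has an insertion.
            padding += n
    return query, reference, padding
```

For example, `cigar_lengths("3M1P2I4M")` counts one padding column that contributes to neither the read length nor the reference span.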
biorxiv bioinformatics 0-100-users 2015
Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinION™ sequencing, bioRxiv, 2015-05-16
The recently introduced Oxford Nanopore MinION platform generates DNA sequence data in real-time. This opens immense potential to shorten the sample-to-results time and is likely to lead to enormous benefits in rapid diagnosis of bacterial infection and identification of drug resistance. However, there are very few tools available for streaming analysis of real-time sequencing data. Here, we present a framework for streaming analysis of MinION real-time sequence data, together with probabilistic streaming algorithms for species typing, multi-locus strain typing, gene presence strain-typing and antibiotic resistance profile identification. Using three culture isolate samples as well as a mixed-species sample, we demonstrate that bacterial species and strain information can be obtained within 30 minutes of sequencing and using about 500 reads, initial drug-resistance profiles within two hours, and complete resistance profiles within 10 hours. Multi-locus strain typing required more than 15x coverage to generate confident assignments, whereas gene-presence typing could detect the presence of a known strain with 0.5x coverage. We also show that our pipeline can process over 100 times more data than the current throughput of the MinION on a desktop computer.
biorxiv bioinformatics 100-200-users 2015
An evaluation of the accuracy and speed of metagenome analysis tools, bioRxiv, 2015-04-10
Metagenome studies are becoming increasingly widespread, yielding important insights into microbial communities covering diverse environments from terrestrial and aquatic ecosystems to human skin and gut. With the advent of high-throughput sequencing platforms, the use of large scale shotgun sequencing approaches is now commonplace. However, a thorough independent benchmark comparing state-of-the-art metagenome analysis tools is lacking. Here, we present a benchmark where the most widely used tools are tested on complex, realistic data sets. Our results clearly show that the most widely used tools are not necessarily the most accurate, that the most accurate tool is not necessarily the most time consuming, and that there is a high degree of variability between available tools. These findings are important as the conclusions of any metagenomics study are affected by errors in the predicted community composition. Data sets and results are freely available from http://www.ucbioinformatics.org/metabenchmark.html
biorxiv bioinformatics 100-200-users 2015
A complete bacterial genome assembled de novo using only nanopore sequencing data, bioRxiv, 2015-02-20
A method for de novo assembly of data from the Oxford Nanopore MinION instrument is presented which is able to reconstruct the sequence of an entire bacterial chromosome in a single contig. Initially, overlaps between nanopore reads are detected. Reads are then subjected to one or more rounds of error correction by a multiple alignment process employing partial order graphs. After correction, reads are assembled using the Celera assembler. Finally, the assembly is polished using signal-level data from the nanopore employing a novel hidden Markov model. We show that this method is able to assemble nanopore reads from Escherichia coli K-12 MG1655 into a single contig of length 4.6 Mb, permitting a full reconstruction of gene order. The resulting draft assembly has 98.4% nucleotide identity compared to the finished reference genome. After polishing the assembly with our signal-level HMM, the nucleotide identity is improved to 99.4%. We show that MinION sequencing data can be used to reconstruct genomes without the need for a reference sequence or data from other sequencing platforms.
biorxiv bioinformatics 100-200-users 2015