Accelerating Sequence Alignment to Graphs, bioRxiv, 2019-05-28
AbstractAligning DNA sequences to an annotated reference is a key step for genotyping in biology. Recent scientific studies have demonstrated improved inference by aligning reads to a variation graph, i.e., a reference sequence augmented with known genetic variations. Given a variation graph in the form of a directed acyclic string graph, the sequence to graph alignment problem seeks to find the best matching path in the graph for an input query sequence. Solving this problem exactly using a sequential dynamic programming algorithm takes quadratic time in terms of the graph size and query length, making it difficult to scale to high throughput DNA sequencing data. In this work, we propose the first parallel algorithm for computing sequence to graph alignments that leverages multiple cores and single-instruction multiple-data (SIMD) operations. We take advantage of the available inter-task parallelism, and provide a novel blocked approach to compute the score matrix while ensuring high memory locality. Using a 48-core Intel Xeon Skylake processor, the proposed algorithm achieves peak performance of 317 billion cell updates per second (GCUPS), and demonstrates near linear weak and strong scaling on up to 48 cores. It delivers significant performance gains compared to existing algorithms, and results in run-time reduction from multiple days to three hours for the problem of optimally aligning high coverage long (PacBioONT) or short (Illumina) DNA reads to an MHC human variation graph containing 10 million vertices.AvailabilityThe implementation of our algorithm is available at <jatsext-link xmlnsxlink=httpwww.w3.org1999xlink ext-link-type=uri xlinkhref=httpsgithub.comParBLiSSPaSGAL>httpsgithub.comParBLiSSPaSGAL<jatsext-link>. Data sets used for evaluation are accessible using <jatsext-link xmlnsxlink=httpwww.w3.org1999xlink ext-link-type=uri xlinkhref=httpsalurulab.cc.gatech.eduPaSGAL>httpsalurulab.cc.gatech.eduPaSGAL<jatsext-link>.
biorxiv bioinformatics 100-200-users 2019RNASeqR an R package for automated two-group RNA-Seq analysis workflow, bioRxiv, 2019-05-28
RNA-Seq analysis has revolutionized researchers' understanding of the transcriptome in biological research. Assessing the differences in transcriptomic profiles between tissue samples or patient groups enables researchers to explore the underlying biological impact of transcription. RNA-Seq analysis requires multiple processing steps and huge computational capabilities. There are many well-developed R packages for individual steps; however, there are few RBioconductor packages that integrate existing software tools into a comprehensive RNA-Seq analysis and provide fundamental end-to-end results in pure R environment so that researchers can quickly and easily get fundamental information in big sequencing data. To address this need, we have developed the open source RBioconductor package, RNASeqR. It allows users to run an automated RNA-Seq analysis with only six steps, producing essential tabular and graphical results for further biological interpretation. The features of RNASeqR include six-step analysis, comprehensive visualization, background execution version, and the integration of both R and command-line software. RNASeqR provides fast, light-weight, and easy-to-run RNA-Seq analysis pipeline in pure R environment. It allows users to efficiently utilize popular software tools, including both RBioconductor and command-line tools, without predefining the resources or environments. RNASeqR is freely available for Linux and macOS operating systems from Bioconductor (httpsbioconductor.orgpackagesreleasebiochtmlRNASeqR.html).
biorxiv bioinformatics 100-200-users 2019The advantages and disadvantages of short- and long-read metagenomics to infer bacterial and eukaryotic community composition, bioRxiv, 2019-05-27
AbstractBackgroundThe first step in understanding ecological community diversity and dynamics is quantifying community membership. An increasingly common method for doing so is through metagenomics. Because of the rapidly increasing popularity of this approach, a large number of computational tools and pipelines are available for analysing metagenomic data. However, the majority of these tools have been designed and benchmarked using highly accurate short read data (i.e. illumina), with few studies benchmarking classification accuracy for long error-prone reads (PacBio or Oxford Nanopore). In addition, few tools have been benchmarked for non-microbial communities.ResultsHere we use simulated error prone Oxford Nanopore and high accuracy Illumina read sets to systematically investigate the effects of sequence length and taxon type on classification accuracy for metagenomic data from both microbial and non-microbial communities. We show that very generally, classification accuracy is far lower for non-microbial communities, even at low taxonomic resolution (e.g. family rather than genus).ConclusionsWe then show that for two popular taxonomic classifiers, long error-prone reads can significantly increase classification accuracy, and this is most pronounced for non-microbial communities. This work provides insight on the expected accuracy for metagenomic analyses for different taxonomic groups, and establishes the point at which read length becomes more important than error rate for assigning the correct taxon.
biorxiv bioinformatics 100-200-users 2019Algorithms for efficiently collapsing reads with Unique Molecular Identifiers, bioRxiv, 2019-05-25
AbstractBackgroundUnique Molecular Identifiers (UMI) are used in many experiments to find and remove PCR duplicates. Although there are many tools for solving the problem of deduplicating reads based on their finding reads with the same alignment coordinates and UMIs, many tools either cannot handle substitution errors, or require expensive pairwise UMI comparisons that do not efficiently scale to larger datasets.ResultsWe formulate the problem of deduplicating UMIs in a manner that enables optimizations to be made, and more efficient data structures to be used. We implement our data structures and optimizations in a tool called UMICollapse, which is able to deduplicate over one million unique UMIs of length 9 at a single alignment position in around 26 seconds.ConclusionsWe present a new formulation of the UMI deduplication problem, and show that it can be solved faster, with more sophisticated data structures.
biorxiv bioinformatics 100-200-users 2019EpiScanpy integrated single-cell epigenomic analysis, bioRxiv, 2019-05-25
ABSTRACTEpigenetic single-cell measurements reveal a layer of regulatory information not accessible to single-cell transcriptomics, however single-cell-omics analysis tools mainly focus on gene expression data. To address this issue, we present epiScanpy, a computational framework for the analysis of single-cell DNA methylation and single-cell ATAC-seq data. EpiScanpy makes the many existing RNA-seq workflows from scanpy available to large-scale single-cell data from other -omics modalities. We introduce and compare multiple feature space constructions for epigenetic data and show the feasibility of common clustering, dimension reduction and trajectory learning techniques. We benchmark epiScanpy by interrogating different single-cell brain mouse atlases of DNA methylation, ATAC-seq and transcriptomics. We find that differentially methylated and differentially open markers between cell clusters enrich transcriptome-based cell type labels by orthogonal epigenetic information.
biorxiv bioinformatics 0-100-users 2019Novel Rhabdovirus and an almost complete drain fly transcriptome recovered from two independent contaminations of clinical samples, bioRxiv, 2019-05-24
AbstractMetagenomic approaches enable an open exploration of microbial communities without requiring a priori knowledge of a sample’s composition by shotgun sequencing the total RNA or DNA of the sample. Such an approach is valuable for exploratory diagnostics of novel pathogens in clinical practice. Yet, one may also identify surprising off-target findings. Here we report a mostly complete transcriptome from a drain fly (likely Psychoda alternata) as well as a novel Rhabdovirus-like virus recovered from two independent contaminations of RNA sequencing libraries from clinical samples of cerebral spinal fluid (CSF) and serum, out of a total of 724 libraries sequenced at the same laboratory during a 2-year time span. This drain fly genome shows a considerable divergence from previously sequenced insects, which may obscure common clinical metagenomic analyses not expecting such contaminations. The classification of these contaminant sequences allowed us to identify infected drain flies as the likely origin of the novel Rhabdovirus-like sequence, which could have been erroneously linked to human pathology, had they been ignored.
biorxiv bioinformatics 0-100-users 2019