HASLR: Fast Hybrid Assembly of Long Reads, bioRxiv, 2020-01-28
Abstract: Third generation sequencing technologies from platforms such as Oxford Nanopore Technologies and Pacific Biosciences have paved the way for building more contiguous assemblies and complete reconstruction of genomes. The larger effective length of the reads generated with these technologies has provided a means to overcome the challenges of short to mid-range repeats. Currently, accurate long read assemblers are computationally expensive, while faster methods are not as accurate. Therefore, there is still an unmet need for tools that are both fast and accurate for reconstructing small and large genomes. Despite the recent advances in third generation sequencing, researchers tend to generate second generation reads for many analysis tasks. Here, we present HASLR, a hybrid assembler which uses both second and third generation sequencing reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest of the tested assemblers but also the one with the lowest number of misassemblies on all the samples. Furthermore, in terms of contiguity and accuracy, the generated assemblies are on par with those of the other tools on most of the samples.
Availability: HASLR is an open source tool available at https://github.com/vpc-ccg/haslr.
biorxiv bioinformatics 0-100-users 2020
Comparison of visualisation tools for single-cell RNAseq data, bioRxiv, 2020-01-26
In the last decade, single cell RNAseq (scRNAseq) datasets have grown from a single cell to millions of cells. Because of their high dimensionality, scRNAseq data contain a lot of valuable information; however, it is not always feasible to visualise and share them in a scientific report or article. Recently, many interactive analysis and visualisation tools have been developed to address this issue and facilitate knowledge transfer in the scientific community. In this study, we review and compare several of the currently available analysis and visualisation tools and benchmark those that allow users to visualise scRNAseq data on the web and share them with others. To address the problem of format compatibility for most visualisation tools, we have also developed a user-friendly R package, sceasy, which allows users to convert their own scRNAseq datasets into a specific data format for visualisation.
biorxiv bioinformatics 0-100-users 2020
BlastFrost: Fast querying of 100,000s of bacterial genomes in Bifrost graphs, bioRxiv, 2020-01-23
Abstract: BlastFrost is a highly efficient method for querying 100,000s of genome assemblies. It builds on Bifrost, a recently developed dynamic data structure for compacted and colored de Bruijn graphs from bacterial genomes. BlastFrost queries a Bifrost data structure for sequences of interest and extracts local subgraphs, thereby enabling the efficient identification of the presence or absence of individual genes or single nucleotide sequence variants. Here we describe the algorithms and implementation of BlastFrost. We also present two exemplar practical applications. In the first, we determined the presence of the individual genes within the SPI-2 Salmonella pathogenicity island within a collection of 926 representative genomes in minutes. In the second application, we determined the existence of known single nucleotide polymorphisms associated with fluoroquinolone resistance in the genes gyrA, gyrB and parE among 190,209 Salmonella genomes. BlastFrost is available for download at https://github.com/nluhmann/BlastFrost.
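As a rough illustration of the kind of presence/absence query described above, the following minimal Python sketch builds a toy "colored" k-mer index, in which each k-mer maps to the set of genomes that contain it, and reports which genomes contain a query sequence. It is not BlastFrost's algorithm or Bifrost's data structure: it performs no graph compaction, ignores reverse complements, and extracts no subgraphs. The names `build_index`, `query`, and `min_fraction`, and the toy sequences, are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every substring of length k in seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(genomes, k):
    """Map each k-mer to the set of genome names (its 'colors') containing it."""
    index = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def query(index, genomes, query_seq, k, min_fraction=1.0):
    """Return {genome: fraction of query k-mers found} for genomes at or above min_fraction."""
    query_kmers = list(kmers(query_seq, k))
    if not query_kmers:
        raise ValueError("query sequence is shorter than k")
    hits = {}
    for name in genomes:
        # index.get avoids inserting empty entries into the defaultdict.
        found = sum(1 for km in query_kmers if name in index.get(km, ()))
        fraction = found / len(query_kmers)
        if fraction >= min_fraction:
            hits[name] = fraction
    return hits


# Toy data (assumption): two short "genomes" differing by one base.
toy_genomes = {"g1": "ACGTACGTGA", "g2": "ACGTTCGTGA"}
toy_index = build_index(toy_genomes, k=3)
print(query(toy_index, toy_genomes, "GTACG", k=3))
```

Lowering `min_fraction` lets the query tolerate a few missing k-mers, which is one simple way a single nucleotide variant would show up in such an index.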
biorxiv bioinformatics 0-100-users 2020
Post-prediction Inference, bioRxiv, 2020-01-23
Abstract: Many modern problems in medicine and public health leverage machine learning methods to predict outcomes based on observable covariates [1, 2, 3, 4]. In an increasingly wide array of settings, these predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes [1, 5, 6, 7, 8, 9]. We call inference with predicted outcomes post-prediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with an arbitrary machine learning method. Rather than trying to derive the correction from first principles for each machine learning tool, we make the observation that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for post-prediction inference that naturally fits into the standard machine learning framework. We estimate the relationship between the observed and predicted outcomes on the testing set and use that model to correct inference on the validation set and subsequent statistical models. We show our postpi approach can correct bias and improve variance estimation (and thus subsequent statistical inference) with predicted outcome data. To show the broad range of applicability of our approach, we show postpi can improve inference in two totally distinct fields: modeling predicted phenotypes in repurposed gene expression data [10] and modeling predicted causes of death in verbal autopsy data [11]. We have made our method available through an open-source R package [https://github.com/SiruoWang/postpi].
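The two-step idea in this abstract (model the relationship between observed and predicted outcomes on the testing set, then use it to correct inference on the validation set) can be sketched in a heavily simplified form. The Python below assumes both relationships are simple straight lines and corrects only the point estimate of a slope; it does not address variance estimation and is not the postpi package's API. The function names and the toy numbers are hypothetical.

```python
def ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def corrected_line(test_pred, test_obs, valid_x, valid_pred):
    """Correct a validation-set regression of predicted outcomes on a covariate.

    Step 1 (testing set):    observed  ~ a + b * predicted
    Step 2 (validation set): predicted ~ c + d * x
    Under these linearity assumptions, observed ~ (a + b * c) + (b * d) * x.
    """
    a, b = ols(test_pred, test_obs)
    c, d = ols(valid_x, valid_pred)
    return a + b * c, b * d


# Toy data (assumption): noiseless lines chosen so the arithmetic is easy to follow.
test_pred = [1.0, 2.0, 3.0, 4.0]
test_obs = [3.0, 5.0, 7.0, 9.0]      # observed = 1 + 2 * predicted
valid_x = [0.0, 1.0, 2.0, 3.0]
valid_pred = [1.0, 1.5, 2.0, 2.5]    # predicted = 1 + 0.5 * x

intercept, slope = corrected_line(test_pred, test_obs, valid_x, valid_pred)
print(intercept, slope)
```

In this toy case, regressing the predicted outcomes directly on `x` would give a slope of 0.5, while the corrected slope accounts for the fact that observed outcomes change twice as fast as the predictions.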
biorxiv bioinformatics 0-100-users 2020
Algorithmic Learning for Auto-deconvolution of GC-MS Data to Enable Molecular Networking within GNPS, bioRxiv, 2020-01-15
Abstract: Gas chromatography-mass spectrometry (GC-MS) represents an analytical technique with significant practical societal impact. Spectral deconvolution is an essential step for interpreting GC-MS data. No public GC-MS repositories that also enable repository-scale analysis exist, in part because deconvolution requires significant user input. We therefore engineered a scalable machine learning workflow for the Global Natural Product Social Molecular Networking (GNPS) analysis platform to enable the mass spectrometry community to store, process, share, annotate, compare, and perform molecular networking of GC-MS data. The workflow performs auto-deconvolution of compound fragmentation patterns via unsupervised non-negative matrix factorization, using a Fast Fourier Transform-based strategy to overcome scalability limitations. We introduce a “balance score” that quantifies the reproducibility of fragmentation patterns across all samples. We demonstrate the utility of the platform with breathomics analysis applied to the early detection of oesophago-gastric cancer, and by creating the first molecular spatial map of the human volatilome.
biorxiv bioinformatics 0-100-users 2020
Gapless assembly of maize chromosomes using long read technologies, bioRxiv, 2020-01-15
Abstract: Creating gapless telomere-to-telomere assemblies of complex genomes is one of the ultimate challenges in genomics. We used long read technologies and an optical map-based approach to produce a maize genome assembly composed of only 63 contigs. The B73-Ab10 genome includes gapless assemblies of chromosome 3 (236 Mb) and chromosome 9 (162 Mb), multiple highly repetitive centromeres and heterochromatic knobs, and 53 Mb of the Ab10 meiotic drive haplotype.
biorxiv bioinformatics 0-100-users 2020