Assembly of Long Error-Prone Reads Using Repeat Graphs, bioRxiv, 2018-01-13
Abstract: The problem of genome assembly is ultimately linked to the problem of characterizing all repeat families in a genome as a repeat graph. A key reason the de Bruijn graph emerged as a popular short-read assembly approach is that it offers an elegant representation of all repeats in a genome that reveals their mosaic structure. However, most algorithms for assembling long error-prone reads use an alternative overlap-layout-consensus (OLC) approach that does not provide a repeat characterization. We present the Flye algorithm for constructing an assembly graph from long error-prone reads; in contrast to the k-mer-based de Bruijn graph, Flye builds an alignment-based A-Bruijn graph. Unlike existing assemblers, Flye does not attempt to construct accurate contigs (at least at the initial assembly stage) but instead simply generates arbitrary paths in the (unknown) assembly graph and then constructs an assembly graph from these paths. Counter-intuitively, this fast but seemingly reckless approach results in the same graph as the assembly graph constructed from accurate contigs. Flye constructs (overlapping) contigs with possible assembly errors at the initial stage, combines them into an accurate assembly graph, resolves repeats in the assembly graph using small variations between repeat instances that were left unresolved during the initial stage, constructs a new, less tangled assembly graph based on the resolved repeats, and finally outputs accurate contigs as paths in this graph. We benchmark Flye against several state-of-the-art Single Molecule Sequencing assemblers and demonstrate that it generates better or comparable assemblies for all analyzed datasets.
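The repeat representation that makes the de Bruijn graph attractive can be illustrated with a toy construction (a minimal Python sketch of the classic k-mer-based graph, not Flye's alignment-based A-Bruijn construction; the function name is illustrative):

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Repeats longer than k-1 collapse onto shared nodes and edges,
    which is what makes the graph a compact representation of all
    repeat families and reveals their mosaic structure.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

# The 4-mer TGGC occurs twice in this read, so the node "TGG"
# carries two parallel edges to "GGC" -- a collapsed repeat.
g = de_bruijn_graph(["ATGGCGTGGCA"], k=4)
```

Long error-prone reads break this picture because exact k-mers rarely survive a 10-15% error rate, which is why Flye replaces exact k-mer matching with alignment-based node identification.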
biorxiv bioinformatics 0-100-users 2018

Fast automated reconstruction of genome-scale metabolic models for microbial species and communities, bioRxiv, 2018-01-13
Abstract: Genome-scale metabolic models are instrumental in uncovering the operating principles of cellular metabolism and in model-guided re-engineering. Recent applications of metabolic models have also demonstrated their usefulness in unraveling cross-feeding within microbial communities. Yet, the application of genome-scale models, especially to microbial communities, lags far behind the availability of sequenced genomes. This is largely due to the time-consuming steps of manual curation required to obtain good-quality models and thus physiologically meaningful simulation results. Here, we present an automated tool, CarveMe, for the reconstruction of species- and community-level metabolic models. We introduce the concept of a universal model, which is manually curated and simulation-ready. Starting with this universal model and annotated genome sequences, CarveMe uses a top-down approach to build single-species and community models in a fast and scalable manner. We build reconstructions for two model organisms, Escherichia coli and Bacillus subtilis, as well as a collection of human gut bacteria, and show that CarveMe models perform similarly to manually curated models in reproducing experimental phenotypes. Finally, we demonstrate the scalability of CarveMe by reconstructing 5587 bacterial models. Overall, CarveMe provides an open-source and user-friendly tool for broadening the use of metabolic modeling in studying microbial species and communities.
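The top-down "carving" idea can be sketched as pruning a universal reaction set down to the reactions supported by genome annotation (a hypothetical simplification with made-up reaction names; CarveMe's actual algorithm scores reactions from gene annotations and solves an optimization problem rather than applying a plain threshold):

```python
def carve(universal_reactions, gene_scores, threshold=0.5):
    """Keep reactions from the universal model whose gene-annotation
    evidence score passes a threshold; everything else is carved
    away. Reactions with no annotated genes default to score 0.
    """
    return {
        rxn for rxn in universal_reactions
        if gene_scores.get(rxn, 0.0) >= threshold
    }

# Toy universal model: two well-supported reactions, one unsupported.
universal = {"glycolysis_pgi", "tca_mdh", "exotic_pathway_x"}
scores = {"glycolysis_pgi": 0.9, "tca_mdh": 0.8, "exotic_pathway_x": 0.1}
model = carve(universal, scores)
```

Because carving only ever removes content from a curated, simulation-ready universal model, the resulting species model inherits its consistency, which is the key to skipping per-model manual curation.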
biorxiv bioinformatics 100-200-users 2018

Limitation of alignment-free tools in total RNA-seq quantification, bioRxiv, 2018-01-12
Abstract
Background: Alignment-free RNA quantification tools have significantly increased the speed of RNA-seq analysis. However, it is unclear whether these state-of-the-art RNA-seq analysis pipelines can quantify small RNAs as accurately as they do long RNAs in the context of total RNA quantification.
Results: We comprehensively tested and compared four RNA-seq pipelines on the accuracy of gene quantification and fold-change estimation using a novel total RNA benchmarking dataset, in which small non-coding RNAs are highly represented along with other long RNAs. The four pipelines comprised two commonly used alignment-free pipelines and two variants of alignment-based pipelines. We found that all pipelines quantified the expression of long and highly abundant genes with high accuracy. However, the alignment-free pipelines performed systematically worse at quantifying lowly abundant and small RNAs.
Conclusion: We have shown that alignment-free and traditional alignment-based quantification methods perform similarly for common gene targets, such as protein-coding genes. However, we identified a potential pitfall in analyzing and quantifying lowly expressed genes and small RNAs with alignment-free pipelines, especially when these small RNAs contain mutations.
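A fold-change benchmark of the kind described can be reduced to comparing per-gene log2 fold changes against a ground truth (an illustrative sketch; the gene names, counts, and pseudocount handling are made up, not the paper's actual evaluation code):

```python
import math

def log2_fold_changes(counts_a, counts_b, pseudocount=1.0):
    """Per-gene log2 fold change between two conditions; the
    pseudocount stabilizes ratios for lowly expressed genes,
    where quantification error is largest."""
    return {
        gene: math.log2((counts_b[gene] + pseudocount) /
                        (counts_a[gene] + pseudocount))
        for gene in counts_a
    }

def mean_abs_error(estimated, truth):
    """Mean absolute deviation of estimated fold changes from the
    known truth; misquantified small RNAs inflate this metric."""
    return sum(abs(estimated[g] - truth[g]) for g in truth) / len(truth)

truth = {"mRNA_long": 1.0, "miRNA_small": 2.0}
est = log2_fold_changes({"mRNA_long": 100, "miRNA_small": 10},
                        {"mRNA_long": 201, "miRNA_small": 43})
```

With counts this clean the estimates match the truth exactly; the paper's point is that alignment-free counts for small, mutated RNAs are systematically off, which propagates directly into such fold-change errors.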
biorxiv bioinformatics 100-200-users 2018

Comparison of computational methods for imputing single-cell RNA-sequencing data, bioRxiv, 2018-01-01
Abstract: Single-cell RNA-sequencing (scRNA-seq) is a recent breakthrough technology that paves the way for measuring RNA levels at single-cell resolution to study precise biological functions. One of the main challenges in analyzing scRNA-seq data is the presence of zeros, or dropout events, which may mislead downstream analyses. To compensate for the dropout effect, several methods have been developed to impute gene expression since the first Bayesian-based method was proposed in 2016. However, these methods show very diverse characteristics in terms of model hypotheses and imputation performance, so a large-scale comparison and evaluation of these methods is urgently needed. To this end, we compared eight imputation methods, evaluated their power in recovering the original real data, and performed broad analyses to explore their effects on clustering cell types, detecting differentially expressed genes, and reconstructing lineage trajectories in the context of both simulated and real data. The simulated datasets and case studies highlight that no single method performs best in all situations. Shortcomings of these methods, such as limited scalability, robustness, and availability in some situations, need to be addressed in future studies.
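The imputation task itself can be illustrated with a naive gene-wise averaging scheme: zeros (candidate dropouts) are replaced by the mean of the nonzero values of the same gene in other cells (a toy sketch, far simpler than any of the eight compared methods, which must also decide which zeros are true biological absence rather than dropout):

```python
def impute_zeros(matrix):
    """Replace zero entries in a cells x genes expression matrix
    with the mean of the nonzero values of that gene across cells.
    Genes observed in no cell are left untouched."""
    n_genes = len(matrix[0])
    imputed = [row[:] for row in matrix]  # don't mutate the input
    for g in range(n_genes):
        observed = [row[g] for row in matrix if row[g] > 0]
        if not observed:
            continue
        fill = sum(observed) / len(observed)
        for row in imputed:
            if row[g] == 0:
                row[g] = fill
    return imputed

# Three cells, two genes; cell 0 has a dropout for gene 1,
# which becomes the mean of the observed values (2 + 4) / 2 = 3.
cells = [[5.0, 0.0], [3.0, 2.0], [4.0, 4.0]]
result = impute_zeros(cells)
```

The benchmarked methods differ precisely in how they improve on this baseline: Bayesian models, cell-similarity weighting, and low-rank structure all aim to avoid filling in zeros that reflect genuine non-expression.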
biorxiv bioinformatics 0-100-users 2018

DeepGS: Predicting phenotypes from genotypes using Deep Learning, bioRxiv, 2018-01-01
Abstract
Motivation: Genomic selection (GS) is a new breeding strategy in which the phenotypes of quantitative traits are predicted from genome-wide genotypic markers, usually with conventional statistical models. However, GS prediction models typically make strong assumptions and perform linear regression, limiting their accuracy because they do not capture the complex, non-linear relationships within genotypes, and between genotypes and phenotypes.
Results: We present a deep learning method, named DeepGS, to predict phenotypes from genotypes. Using a deep convolutional neural network, DeepGS learns hidden variables that jointly represent features of the genotypic markers when making predictions; it also employs convolution, sampling, and dropout strategies to reduce the complexity of high-dimensional marker data. We used a large GS dataset to train DeepGS and compared its performance with other methods. In terms of mean normalized discounted cumulative gain, DeepGS achieves an increase of 27.70% to 246.34% over a conventional neural network in selecting the top-ranked 1% of individuals with high phenotypic values for the eight tested traits. Additionally, compared with the widely used RR-BLUP method, DeepGS still yields a relative improvement ranging from 1.44% to 65.24%. Through extensive simulation experiments, we also demonstrate the effectiveness and robustness of DeepGS in the absence of outlier individuals and with subsets of genotypic markers. Finally, we illustrate the complementarity of DeepGS and RR-BLUP with an ensemble learning approach that further improves prediction performance.
Availability: DeepGS is provided as an open-source R package available at https://github.com/cma2015/DeepGS.
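The evaluation metric used above, normalized discounted cumulative gain, rewards ranking high-phenotype individuals near the top of the predicted list. A standard textbook formulation (this is not code from the DeepGS package; DeepGS itself is an R package):

```python
import math

def ndcg_at_k(predicted_scores, true_values, k):
    """NDCG@k: rank individuals by predicted score, sum the true
    phenotypic values of the top k discounted by log2 of rank,
    and normalize by the ideal (best achievable) ordering."""
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    dcg = sum(true_values[order[r]] / math.log2(r + 2) for r in range(k))
    ideal = sorted(true_values, reverse=True)
    idcg = sum(ideal[r] / math.log2(r + 2) for r in range(k))
    return dcg / idcg

# A prediction that ranks individuals in the true order scores 1.0;
# any misordering of the top-k selection scores strictly less.
perfect = ndcg_at_k([0.9, 0.5, 0.1], [10.0, 5.0, 1.0], k=2)
```

Selecting the "top-ranked 1%" corresponds to setting k to 1% of the population size, so the metric directly measures how much phenotypic value a breeder captures by trusting the model's ranking.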
biorxiv bioinformatics 0-100-users 2018

Reproducible Bioinformatics Project: A community for reproducible bioinformatics analysis pipelines, bioRxiv, 2017-12-27
Abstract
Background: Reproducibility is a key element of modern science and is mandatory for any industrial application. It represents the ability to replicate an experiment independently of location and operator. A study can therefore be considered reproducible only if all the data used are available and the computational analysis workflow is clearly described. However, for a complex bioinformatics analysis, the raw data and a list of the tools used in the workflow may not be enough to guarantee that the results can be reproduced: different releases of the same tools and/or of the system libraries they depend on can lead to subtle reproducibility issues.
Results: To address this challenge, we established the Reproducible Bioinformatics Project (RBP), a non-profit, open-source project whose aim is to provide a schema and an infrastructure, based on Docker images and an R package, for producing reproducible results in bioinformatics. One or more Docker images are defined for each workflow (typically one per task), while the workflow implementation is handled via R functions embedded in a package available in a GitHub repository. A bioinformatician participating in the project therefore first integrates her/his workflow modules into Docker image(s), building on an Ubuntu Docker image developed ad hoc by RBP to make this task easier. Second, the workflow implementation must be realized in R according to an R skeleton function made available by RBP, to guarantee homogeneity and reusability across RBP functions. Moreover, she/he has to provide an R vignette explaining the package functionality, together with an example dataset that can be used to improve user confidence in the workflow.
Conclusions: The Reproducible Bioinformatics Project provides a general schema and an infrastructure for distributing robust and reproducible workflows. It thus guarantees end users the ability to consistently repeat any analysis, independently of the UNIX-like architecture used.
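The reproducibility guarantee rests on running every task against a pinned container image, so that tool and library versions cannot drift between operators. A minimal sketch of that idea (the image name, tag, and paths are hypothetical; RBP's real workflows are R functions that launch Docker containers, and this sketch only composes the command rather than executing it):

```python
def docker_run_command(image, tag, workdir, script):
    """Compose a `docker run` invocation that pins a specific image
    tag, so every user executes the analysis with identical tool
    and library versions regardless of the host UNIX-like system."""
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/data",          # mount the project data
        f"{image}:{tag}",                  # pinned tag, never 'latest'
        "Rscript", f"/data/{script}",      # the R workflow entry point
    ]

cmd = docker_run_command("repbioinfo/ubuntu-rnaseq", "v1.0.2",
                         "/home/user/project", "workflow.R")
```

Pinning the tag (and, more strictly, the image digest) is what turns "a list of tools" into an actually reproducible environment: the container freezes the tool releases and system libraries that the abstract identifies as the source of subtle reproducibility issues.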
biorxiv bioinformatics 0-100-users 2017