Assembly of Long Error-Prone Reads Using Repeat Graphs, bioRxiv, 2018-01-13

Abstract: The problem of genome assembly is ultimately linked to the problem of characterizing all repeat families in a genome as a repeat graph. The key reason the de Bruijn graph emerged as a popular short-read assembly approach is that it offers an elegant representation of all repeats in a genome that reveals their mosaic structure. However, most algorithms for assembling long error-prone reads use an alternative overlap-layout-consensus (OLC) approach that does not provide a repeat characterization. We present the Flye algorithm for constructing an assembly graph from long error-prone reads; in contrast to the k-mer-based de Bruijn graph, Flye assembles genomes using an alignment-based A-Bruijn graph. Unlike existing assemblers, Flye does not attempt to construct accurate contigs (at least at the initial assembly stage) but instead simply generates arbitrary paths in the (unknown) assembly graph and then constructs an assembly graph from these paths. Counter-intuitively, this fast but seemingly reckless approach results in the same graph as the assembly graph constructed from accurate contigs. Flye constructs (overlapping) contigs with possible assembly errors at the initial stage, combines them into an accurate assembly graph, resolves repeats in the assembly graph using small variations between repeat copies that were left unresolved during the initial assembly stage, constructs a new, less tangled assembly graph based on the resolved repeats, and finally outputs accurate contigs as paths in this graph. We benchmark Flye against several state-of-the-art Single Molecule Sequencing assemblers and demonstrate that it generates better or comparable assemblies for all analyzed datasets.
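The counter-intuitive claim above can be illustrated with a toy k-mer de Bruijn graph (Flye itself uses an alignment-based A-Bruijn graph over error-prone reads, so this is only a sketch of the principle): any two collections of paths that cover the same k-mers of a genome, whether they are "accurate" contigs respecting repeat boundaries or arbitrary chunks crossing them, produce the identical graph.

```python
def de_bruijn_edges(sequences, k):
    """Return the edge set of the de Bruijn graph: each k-mer contributes
    an edge from its (k-1)-prefix to its (k-1)-suffix."""
    edges = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges

# Toy genome with a two-copy repeat (lowercase in the original string):
genome = "ATGGCTTAcgcgcgTTAGGCcgcgcgCCATGA".upper()
k = 4

# "Accurate" contigs: two pieces overlapping by k-1 characters.
accurate = [genome[:20], genome[17:]]
# "Reckless" arbitrary paths: three different pieces, also overlapping by k-1.
reckless = [genome[:10], genome[7:25], genome[22:]]

# Both coverings spell out every k-mer of the genome, so the graphs coincide.
assert de_bruijn_edges(accurate, k) == de_bruijn_edges(reckless, k)
```

The repeat copies collapse onto the same edges in both cases, which is exactly why the graph, unlike the individual paths, is insensitive to how recklessly the paths were chosen.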

biorxiv bioinformatics 0-100-users 2018

Fast automated reconstruction of genome-scale metabolic models for microbial species and communities, bioRxiv, 2018-01-13

Abstract: Genome-scale metabolic models are instrumental in uncovering the operating principles of cellular metabolism and in model-guided re-engineering. Recent applications of metabolic models have also demonstrated their usefulness in unraveling cross-feeding within microbial communities. Yet the application of genome-scale models, especially to microbial communities, is lagging far behind the availability of sequenced genomes. This is largely due to the time-consuming manual curation required to obtain good-quality models and thus physiologically meaningful simulation results. Here we present an automated tool, CarveMe, for the reconstruction of species- and community-level metabolic models. We introduce the concept of a universal model, which is manually curated and simulation-ready. Starting from this universal model and annotated genome sequences, CarveMe uses a top-down approach to build single-species and community models in a fast and scalable manner. We build reconstructions for two model organisms, Escherichia coli and Bacillus subtilis, as well as a collection of human gut bacteria, and show that CarveMe models perform similarly to manually curated models in reproducing experimental phenotypes. Finally, we demonstrate the scalability of CarveMe by reconstructing 5587 bacterial models. Overall, CarveMe provides an open-source and user-friendly tool for broadening the use of metabolic modeling in the study of microbial species and communities.
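The top-down "carving" idea can be sketched on a toy network. This is a deliberately simplified greedy illustration (CarveMe's actual carving solves an optimization problem over reaction scores); the reaction names, evidence scores, and connectivity check below are all invented for the example.

```python
# Toy top-down carving: start from a "universal" reaction set and prune
# reactions lacking genomic evidence, while checking that the carved model
# can still convert a carbon source into biomass.
universal = {
    "R1": ("glc", "g6p"),      # glucose -> glucose-6-phosphate
    "R2": ("g6p", "pyr"),      # g6p -> pyruvate
    "R3": ("pyr", "biomass"),  # pyruvate -> biomass precursor
    "R4": ("glc", "xyl"),      # side branch with no gene evidence
}
evidence = {"R1": 0.9, "R2": 0.8, "R3": 0.7, "R4": 0.0}

def carve(universal, evidence, source="glc", target="biomass", cutoff=0.5):
    """Keep reactions whose annotation score clears the cutoff, then verify
    the pruned network still reaches the biomass target from the source."""
    keep = {r for r, score in evidence.items() if score >= cutoff}
    reachable, changed = {source}, True
    while changed:                       # simple fixed-point reachability
        changed = False
        for r in keep:
            a, b = universal[r]
            if a in reachable and b not in reachable:
                reachable.add(b)
                changed = True
    assert target in reachable, "carved model lost biomass production"
    return keep

carved = carve(universal, evidence)      # -> keeps R1, R2, R3; drops R4
```

In the real tool the universal model is a curated, simulation-ready genome-scale network and the pruning preserves flux-balance feasibility rather than mere reachability, but the direction of the procedure, from everything down to the evidence-supported core, is the same.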

biorxiv bioinformatics 100-200-users 2018

DeepGS: Predicting phenotypes from genotypes using Deep Learning, bioRxiv, 2018-01-01

Abstract: Motivation: Genomic selection (GS) is a new breeding strategy in which the phenotypes of quantitative traits are predicted from genome-wide genotype markers, usually with conventional statistical models. However, GS prediction models typically make strong assumptions and perform linear regression analysis, limiting their accuracy because they do not capture the complex, non-linear relationships within genotypes and between genotypes and phenotypes. Results: We present a deep learning method, named DeepGS, to predict phenotypes from genotypes. Using a deep convolutional neural network, DeepGS learns hidden variables that jointly represent features of the genotypic markers when making predictions; it also employs convolution, sampling, and dropout strategies to reduce the complexity of high-dimensional marker data. We used a large GS dataset to train DeepGS and compare its performance with other methods. In terms of mean normalized discounted cumulative gain, DeepGS achieves an increase of 27.70% to 246.34% over a conventional neural network in selecting the top-ranked 1% of individuals with high phenotypic values for the eight tested traits. Compared with the widely used method RR-BLUP, DeepGS still yields a relative improvement ranging from 1.44% to 65.24%. Through extensive simulation experiments, we also demonstrate the effectiveness and robustness of DeepGS in the absence of outlier individuals and with subsets of genotypic markers. Finally, we illustrate the complementarity of DeepGS and RR-BLUP with an ensemble learning approach that further improves prediction performance. Availability: DeepGS is provided as an open-source R package available at https://github.com/cma2015/DeepGS.
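The convolution-sampling-dropout pipeline over a marker vector can be sketched as a single forward pass in NumPy. The layer widths, random weights, and dropout rate here are illustrative assumptions, not the trained DeepGS network (which is an R/MXNet model with multiple layers):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernel):
    """Valid 1-D cross-correlation of a marker vector with one kernel."""
    n = len(x) - len(kernel) + 1
    return np.array([x[i:i + len(kernel)] @ kernel for i in range(n)])

def max_pool(x, width):
    """Non-overlapping max pooling; this is the 'sampling' step."""
    n = len(x) // width
    return x[:n * width].reshape(n, width).max(axis=1)

markers = rng.integers(0, 3, size=100).astype(float)  # toy 0/1/2-coded genotypes
kernel = rng.normal(size=18)                          # one convolutional filter

h = np.maximum(conv1d(markers, kernel), 0.0)  # convolution + ReLU: 100 -> 83
h = max_pool(h, 4)                            # sampling shrinks it: 83 -> 20
h = h * (rng.random(h.shape) > 0.2)           # dropout mask (training time only)
yhat = float(h @ rng.normal(size=h.shape[0])) # dense layer -> predicted phenotype
```

The point of the sketch is the shape arithmetic: convolution and pooling collapse a high-dimensional marker vector into a compact representation before the dense prediction layer, which is how the method sidesteps fitting one weight per marker as linear models such as RR-BLUP do.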

biorxiv bioinformatics 0-100-users 2018

Reproducible Bioinformatics Project: A community for reproducible bioinformatics analysis pipelines, bioRxiv, 2017-12-27

Abstract: Background: Reproducibility is a key element of modern science and is mandatory for any industrial application. It is the ability to replicate an experiment independently of location and operator; a study can therefore be considered reproducible only if all the data used are available and the computational analysis workflow is clearly described. However, for a complex bioinformatics analysis, the raw data and a list of the tools used in the workflow may not be enough to guarantee reproducibility of the results: different releases of the same tools and/or of the system libraries they depend on can lead to subtle reproducibility issues. Results: To address this challenge, we established the Reproducible Bioinformatics Project (RBP), a non-profit, open-source project whose aim is to provide a schema and an infrastructure, based on Docker images and an R package, for producing reproducible results in bioinformatics. One or more Docker images are defined for a workflow (typically one per task), while the workflow implementation is handled via R functions embedded in a package available in a GitHub repository. A bioinformatician joining the project therefore first integrates her/his workflow modules into Docker image(s), starting from an Ubuntu Docker image developed ad hoc by RBP to ease this task. Second, the workflow implementation must be realized in R according to an R-skeleton function made available by RBP to guarantee homogeneity and reusability among different RBP functions. Moreover, she/he has to provide an R vignette explaining the package functionality, together with an example dataset that can be used to improve user confidence in the workflow. Conclusions: The Reproducible Bioinformatics Project provides a general schema and an infrastructure to distribute robust and reproducible workflows, guaranteeing that final users can repeat any analysis consistently, independently of the UNIX-like architecture used.
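The one-Docker-image-per-task pattern described above can be sketched as a small wrapper that pins the whole software stack to a fixed image tag. RBP's actual wrappers are R functions and its images live under their own repository; the image name, script path, and function below are hypothetical stand-ins:

```python
import shlex
import subprocess

def run_in_docker(image, command, data_dir, dry_run=True):
    """Build (and optionally execute) a 'docker run' invocation.
    Pinning a versioned image tag freezes the tool and all its system
    libraries, so the task behaves identically on any UNIX-like host."""
    args = [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:/data",   # shared input/output folder inside the container
        image,                       # versioned tag, never 'latest'
    ] + shlex.split(command)
    if dry_run:
        return " ".join(args)
    subprocess.run(args, check=True)  # fail loudly if the task fails

# Hypothetical task wrapper mirroring the one-image-per-task layout:
cmd = run_in_docker("repbioinfo/example-task:1.0", "bash /data/run.sh", "/tmp/analysis")
```

The design choice worth noting is the version-pinned tag: rerunning the same wrapper years later pulls byte-identical tool and library versions, which is precisely the "different releases" failure mode the project targets.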

biorxiv bioinformatics 0-100-users 2017


Created with the audiences framework by Jedidiah Carlson

Powered by Hugo