
Assembly of Long Error-Prone Reads Using Repeat Graphs, bioRxiv, 2018-01-13

Abstract: The problem of genome assembly is ultimately linked to the problem of characterizing all repeat families in a genome as a repeat graph. The key reason the de Bruijn graph emerged as a popular short-read assembly approach is that it offers an elegant representation of all repeats in a genome that reveals their mosaic structure. However, most algorithms for assembling long error-prone reads use an alternative overlap-layout-consensus (OLC) approach that does not provide a repeat characterization. We present the Flye algorithm for constructing the A-Bruijn (assembly) graph from long error-prone reads; in contrast to the k-mer-based de Bruijn graph, it assembles genomes using an alignment-based A-Bruijn graph. Unlike existing assemblers, Flye does not attempt to construct accurate contigs (at least at the initial assembly stage) but instead simply generates arbitrary paths in the (unknown) assembly graph and then constructs an assembly graph from these paths. Counter-intuitively, this fast but seemingly reckless approach results in the same graph as the assembly graph constructed from accurate contigs. Flye constructs (overlapping) contigs with possible assembly errors at the initial stage, combines them into an accurate assembly graph, resolves repeats in the assembly graph using small variations between repeat instances that were left unresolved during the initial assembly stage, constructs a new, less tangled assembly graph based on the resolved repeats, and finally outputs accurate contigs as paths in this graph. We benchmark Flye against several state-of-the-art Single Molecule Sequencing assemblers and demonstrate that it generates better or comparable assemblies for all analyzed datasets.
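To illustrate the point that a de Bruijn graph exposes repeats, here is a minimal sketch of the classic k-mer-based construction that the abstract contrasts with Flye. It is not Flye's alignment-based A-Bruijn algorithm. The toy sequence, the value of k, and the function names `de_bruijn_graph` and `branching_nodes` are all assumptions made for this illustration.

```python
from collections import defaultdict


def de_bruijn_graph(sequence, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in the sequence."""
    graph = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Each k-mer contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def branching_nodes(graph):
    """Return nodes with more than one distinct successor (where paths diverge)."""
    return sorted(node for node, succ in graph.items() if len(set(succ)) > 1)


# Hypothetical toy genome: the repeat "GCTGA" occurs twice in different contexts.
toy_genome = "TTGCTGACAGCTGAGG"
graph = de_bruijn_graph(toy_genome, k=4)

# Both copies of the repeat collapse onto the same nodes, so the edge
# GCT -> CTG is traversed twice and the graph branches where the repeat ends.
print(graph["GCT"])
print(branching_nodes(graph))
```

In this toy case the two copies of the repeat share one path through the graph, and the single branching node marks the point where the flanking sequences diverge. That collapsed-repeat view is what the abstract refers to as a repeat characterization.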

biorxiv bioinformatics 0-100-users 2018

Fast automated reconstruction of genome-scale metabolic models for microbial species and communities, bioRxiv, 2018-01-13

Abstract: Genome-scale metabolic models are instrumental in uncovering operating principles of cellular metabolism and in model-guided re-engineering. Recent applications of metabolic models have also demonstrated their usefulness in unraveling cross-feeding within microbial communities. Yet the application of genome-scale models, especially to microbial communities, is lagging far behind the availability of sequenced genomes. This is largely due to the time-consuming manual curation required to obtain good-quality models and thus physiologically meaningful simulation results. Here, we present an automated tool, CarveMe, for reconstruction of species-level and community-level metabolic models. We introduce the concept of a universal model, which is manually curated and simulation-ready. Starting with this universal model and annotated genome sequences, CarveMe uses a top-down approach to build single-species and community models in a fast and scalable manner. We build reconstructions for two model organisms, Escherichia coli and Bacillus subtilis, as well as a collection of human gut bacteria, and show that CarveMe models perform similarly to manually curated models in reproducing experimental phenotypes. Finally, we demonstrate the scalability of CarveMe by reconstructing 5587 bacterial models. Overall, CarveMe provides an open-source and user-friendly tool for broadening the use of metabolic modeling in studying microbial species and communities.

biorxiv bioinformatics 100-200-users 2018

Accurate allele frequencies from ultra-low coverage pool-seq samples in evolve-and-resequence experiments, bioRxiv, 2018-01-12

Abstract: Evolve-and-resequence (E+R) experiments leverage next-generation sequencing technology to track the allele frequency dynamics of populations as they evolve. While previous work has shown that adaptive alleles can be detected by comparing frequency trajectories from many replicate populations, this power comes at the expense of high-coverage (>100x) sequencing of many pooled samples, which can be cost-prohibitive. Here, we show that accurate estimates of allele frequencies can be achieved with very shallow sequencing depths (<5x) via inference of known founder haplotypes in small genomic windows. This technique can be used to efficiently estimate frequencies for any number of bi-allelic SNPs in populations of any model organism founded with sequenced homozygous strains. Using both experimentally-pooled and simulated samples of Drosophila melanogaster, we show that haplotype inference can improve allele frequency accuracy by orders of magnitude for up to 50 generations of recombination, and is robust to moderate levels of missing data, as well as different selection regimes. Finally, we show that a simple linear model generated from these simulations can predict the accuracy of haplotype-derived allele frequencies in other model organisms and experimental designs. To make these results broadly accessible for use in E+R experiments, we introduce HAF-pipe, an open-source software tool for calculating haplotype-derived allele frequencies from raw sequencing data. Ultimately, by reducing sequencing costs without sacrificing accuracy, our method facilitates E+R designs with higher replication and resolution, and thereby, increased power to detect adaptive alleles.
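The idea of a haplotype-derived allele frequency can be sketched in a few lines. The sketch assumes the founder haplotype frequencies in a genomic window have already been inferred (that inference step is not shown) and that founders are homozygous, so each founder carries allele 0 or 1 at every bi-allelic SNP. The function name, data layout, and numbers below are hypothetical and do not reflect HAF-pipe's actual interface.

```python
def haplotype_derived_frequencies(founder_genotypes, haplotype_freqs):
    """Allele frequency at each SNP as the frequency-weighted sum of founder alleles.

    founder_genotypes: one row per SNP, one 0/1 entry per founder strain.
    haplotype_freqs: estimated frequency of each founder haplotype in the window.
    """
    if abs(sum(haplotype_freqs) - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    return [
        sum(allele * freq for allele, freq in zip(snp_row, haplotype_freqs))
        for snp_row in founder_genotypes
    ]


# Hypothetical window: 3 founders, 3 SNPs (illustrative values only).
founder_genotypes = [
    [1, 0, 0],  # SNP 1: alternate allele carried by founder A only
    [0, 1, 1],  # SNP 2: carried by founders B and C
    [1, 1, 0],  # SNP 3: carried by founders A and B
]
haplotype_freqs = [0.5, 0.3, 0.2]  # assumed output of the inference step

allele_freqs = haplotype_derived_frequencies(founder_genotypes, haplotype_freqs)
print(allele_freqs)
```

Because every SNP in the window shares the same few haplotype frequencies, reads from all sites in the window contribute to each estimate, which is one way to see why shallow coverage can still give accurate per-SNP frequencies.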

biorxiv evolutionary-biology 0-100-users 2018

Direct imaging of the circular chromosome in a live bacterium, bioRxiv, 2018-01-12

New assays for quantitative imaging [1–6] and sequencing [7–11] have yielded great progress towards understanding the organizational principles of chromosomes. Yet, even for the well-studied model bacterium Escherichia coli, many basic questions remain unresolved regarding chromosomal (sub-)structure [2,11], its mechanics [1,2,12] and dynamics [13,14], and the link between structure and function [1,15,16]. Here we resolve the spatial organization of the circular chromosome of bacteria by directly imaging the chromosome in live E. coli cells with a broadened cell shape. The chromosome was observed to exhibit a torus topology with a 4.2 μm toroidal length and 0.4 μm bundle thickness. On average, the DNA density along the chromosome shows dense right and left arms that branch from a lower-density origin of replication, and are connected at the terminus of replication by an ultrathin flexible string of DNA. At the single-cell level, the DNA density along the torus is found to be strikingly heterogeneous, with blob-like Mbp-size domains that undergo major dynamic rearrangements, splitting and merging at a minute timescale. We show that prominent domain boundaries at the terminus and origin of replication are induced by MatP proteins, while weaker transient domain boundaries are facilitated by the global transcription regulators HU and Fis. These findings provide an architectural basis for the understanding of the spatial organization of bacterial genomes.

biorxiv microbiology 100-200-users 2018

High accuracy haplotype-derived allele frequencies from ultra-low coverage pool-seq samples, bioRxiv, 2018-01-12

Abstract: Evolve-and-resequence experiments leverage next-generation sequencing technology to track allele frequency dynamics of populations as they evolve. While previous work has shown that adaptive alleles can be detected by comparing frequency trajectories from many replicate populations, this power comes at the expense of high-coverage (>100x) sequencing of many pooled samples, which can be cost-prohibitive. Here we show that accurate estimates of allele frequencies can be achieved with very shallow sequencing depths (<5x) via inference of known founder haplotypes in small genomic windows. This technique can be used to efficiently estimate frequencies for any number of alleles in any model system. Using both experimentally-pooled and simulated samples of Drosophila melanogaster, we show that haplotype inference can improve allele frequency accuracy by orders of magnitude, and that high accuracy is maintained after up to 200 generations of recombination, even in the presence of missing data or incomplete founder knowledge. By reducing sequencing costs without sacrificing accuracy, our method enables analysis of samples from more time-points and replicates, increasing the statistical power to detect adaptive alleles.

biorxiv evolutionary-biology 0-100-users 2018

Limitation of alignment-free tools in total RNA-seq quantification, bioRxiv, 2018-01-12

Abstract: Background: Alignment-free RNA quantification tools have significantly increased the speed of RNA-seq analysis. However, it is unclear whether these state-of-the-art RNA-seq analysis pipelines can quantify small RNAs as accurately as they quantify long RNAs in the context of total RNA quantification. Results: We comprehensively tested and compared four RNA-seq pipelines on the accuracy of gene quantification and fold-change estimation using a novel total RNA benchmarking dataset, in which small non-coding RNAs are highly represented along with other long RNAs. The four pipelines comprised two commonly used alignment-free pipelines and two variants of alignment-based pipelines. We found that all pipelines showed high accuracy in quantifying the expression of long and highly abundant genes. However, alignment-free pipelines performed systematically worse in quantifying low-abundance and small RNAs. Conclusion: We have shown that alignment-free and traditional alignment-based quantification methods performed similarly for common gene targets, such as protein-coding genes. However, we identified a potential pitfall in analyzing and quantifying lowly expressed genes and small RNAs with alignment-free pipelines, especially when these small RNAs contain mutations.

biorxiv bioinformatics 100-200-users 2018