A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases, bioRxiv, 2017-01-28
Abstract: Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥ 5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and > 60,000 genomes.
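The two ingredients named in the abstract, minimizer sampling and MinHash-based identity estimation, can be illustrated with a small Python sketch. This is not the authors' implementation: the function names, the BLAKE2b hash, and the window and sketch sizes are all assumptions made for illustration, and the conversion from Jaccard similarity to identity uses the Mash distance formula as a stand-in, which may differ from the estimator derived in the paper.

```python
import hashlib
import math


def kmer_hash(kmer: str) -> int:
    """Stable 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int, w: int) -> set:
    """Return {(hash, position)} of the smallest-hash k-mer in every
    window of w consecutive k-mers (leftmost k-mer wins ties)."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        picked.add((smallest, start + window.index(smallest)))
    return picked


def minhash_jaccard(hashes_a, hashes_b, s: int) -> float:
    """Estimate Jaccard similarity from the s smallest hashes of the union."""
    a, b = set(hashes_a), set(hashes_b)
    union_sketch = sorted(a | b)[:s]
    if not union_sketch:
        return 0.0
    shared = sum(1 for h in union_sketch if h in a and h in b)
    return shared / len(union_sketch)


def identity_from_jaccard(j: float, k: int) -> float:
    """Convert a Jaccard estimate to an identity estimate with the
    Mash distance formula (assumed here, not taken from the paper)."""
    if j <= 0.0:
        return 0.0
    per_base_error = -math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, 1.0 - per_base_error)


if __name__ == "__main__":
    read = "ACGTACGTTGCAACGTTAGCCGATAGCTAGGCTA"
    ref = "ACGTACGTTGCAACGTTAGCCGATAGCTAGGCTA"
    read_hashes = {h for h, _ in minimizers(read, k=8, w=5)}
    ref_hashes = {h for h, _ in minimizers(ref, k=8, w=5)}
    j = minhash_jaccard(read_hashes, ref_hashes, s=16)
    print(j, identity_from_jaccard(j, k=8))
```

Sampling minimizers first keeps the number of hashes per sequence small, and the bottom-s sketch then compares only a fixed number of them, which is the general reason this family of methods scales to large references.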
biorxiv bioinformatics 100-200-users 2017

Rapid de novo assembly of the European eel genome from nanopore sequencing reads, bioRxiv, 2017-01-21
Abstract: We have sequenced the genome of the endangered European eel using the MinION by Oxford Nanopore, and assembled these data using a novel algorithm specifically designed for large eukaryotic genomes. For this 860 Mbp genome, the entire computational process takes two days on a single CPU. The resulting genome assembly significantly improves on a previous draft based on short reads only, both in terms of contiguity (N50 1.2 Mbp) and structural quality. This combination of affordable nanopore sequencing and lightweight assembly promises to make high-quality genomic resources accessible for many non-model plants and animals.
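The contiguity figure quoted above is an N50 value. As a reminder of how that statistic is conventionally computed, here is a minimal Python sketch; the function name `n50` and the example lengths are hypothetical, and the sketch says nothing about how the authors computed their own figure.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths: the length L such that
    contigs of length >= L together cover at least half of the assembly."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths, not data from the preprint.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is commonly reported as a contiguity measure.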
biorxiv genomics 100-200-users 2017

Genome Graphs, bioRxiv, 2017-01-19
Abstract: There is increasing recognition that a single, monoploid reference genome is a poor universal reference structure for human genetics, because it represents only a tiny fraction of human variation. Adding this missing variation results in a structure that can be described as a mathematical graph: a genome graph. We demonstrate that, in comparison to the existing reference genome (GRCh38), genome graphs can substantially improve the fractions of reads that map uniquely and perfectly. Furthermore, we show that this fundamental simplification of read mapping transforms the variant calling problem from one in which many non-reference variants must be discovered de novo to one in which the vast majority of variants are simply re-identified within the graph. Using standard benchmarks as well as a novel reference-free evaluation, we show that a simplistic variant calling procedure on a genome graph can already call variants at least as well as, and in many cases better than, a state-of-the-art method on the linear human reference genome. We anticipate that graph-based references will supplant linear references in humans and in other applications where cohorts of sequenced individuals are available.
biorxiv bioinformatics 100-200-users 2017

Direct visualization of transcriptional activation by physical enhancer–promoter proximity, bioRxiv, 2017-01-12
A long-standing question in metazoan gene regulation is how remote enhancers communicate with their target promoters over long distances. Combining genome editing and quantitative live imaging we simultaneously visualize physical enhancer–promoter communication and transcription in Drosophila embryos. Enhancers regulating pair rule stripes of even-skipped expression activate transcription of a reporter gene over a distance of 150 kb. We show in individual cells that activation only occurs after the enhancer comes into close proximity with its regulatory target and that upon dissociation transcription ceases almost immediately. We further observe distinct topological conformations of the eve locus, depending on the spatial identity of the activating stripe enhancer. In addition, long-range activation results in transcriptional competition at the endogenous eve locus, causing corresponding developmental defects. Overall, we demonstrate that sustained physical proximity and enhancer–promoter engagement are required for enhancer action, and we provide a path to probe the implications of long-range regulation on cellular fates.
biorxiv biophysics 100-200-users 2017

Critical Assessment of Metagenome Interpretation – a benchmark of computational metagenomics software, bioRxiv, 2017-01-10
Abstract: In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performance, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions.
biorxiv bioinformatics 100-200-users 2017

Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples, bioRxiv, 2017-01-10
Genome sequencing has become a powerful tool for studying emerging infectious diseases; however, genome sequencing directly from clinical samples without isolation remains challenging for viruses such as Zika, where metagenomic sequencing methods may generate insufficient numbers of viral reads. Here we present a protocol for generating coding-sequence complete genomes comprising an online primer design tool, a novel multiplex PCR enrichment protocol, optimised library preparation methods for the portable MinION sequencer (Oxford Nanopore Technologies) and the Illumina range of instruments, and a bioinformatics pipeline for generating consensus sequences. The MinION protocol does not require an internet connection for analysis, making it suitable for field applications with limited connectivity. Our method relies on multiplex PCR for targeted enrichment of viral genomes from samples containing as few as 50 genome copies per reaction. Viral consensus sequences can be achieved starting with clinical samples in 1-2 days following a simple laboratory workflow. This method has been successfully used by several groups studying Zika virus evolution and is facilitating an understanding of the spread of the virus in the Americas.
biorxiv genomics 100-200-users 2017