Fast and Accurate Genomic Analyses using Genome Graphs, bioRxiv, 2017-09-28
Abstract: The human reference genome serves as the foundation for genomics by providing a scaffold for alignment of sequencing reads, but currently only reflects a single consensus haplotype, which impairs read alignment and downstream analysis accuracy. Reference genome structures incorporating known genetic variation have been shown to improve the accuracy of genomic analyses, but have so far remained computationally prohibitive for routine large-scale use. Here we present a graph genome implementation that enables read alignment across 2,800 diploid genomes encompassing 12.6 million SNPs and 4.0 million indels. Our Graph Genome Pipeline requires 6.5 hours to process a 30x coverage WGS sample on a system with 36 CPU cores compared with 11 hours required by the GATK Best Practices pipeline. Using complementary benchmarking experiments based on real and simulated data, we show that using a graph genome reference improves read mapping sensitivity and produces a 0.5% increase in variant calling recall, or about 20,000 additional variants being detected per sample, while variant calling specificity is unaffected. Structural variations (SVs) incorporated into a graph genome can be genotyped accurately under a unified framework. Finally, we show that iterative augmentation of graph genomes yields incremental gains in variant calling accuracy. Our implementation is a significant advance towards fulfilling the promise of graph genomes to radically enhance the scalability and accuracy of genomic analyses.
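The figures quoted in this abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the 0.5% recall gain and the roughly 20,000 extra variants refer to the same per-sample truth set, which the abstract does not state explicitly.

```python
def speedup(baseline_hours: float, new_hours: float) -> float:
    """Ratio of baseline runtime to new runtime (values above 1 mean faster)."""
    return baseline_hours / new_hours


def implied_total_variants(extra_variants: float, recall_gain: float) -> float:
    """Truth-set size implied by an absolute recall gain.

    Assumes recall_gain is a fraction (0.005 for 0.5%) of the same
    truth set that the extra variants are counted against.
    """
    return extra_variants / recall_gain


if __name__ == "__main__":
    # 11 h (GATK Best Practices) versus 6.5 h (Graph Genome Pipeline), as quoted.
    print(f"speedup: {speedup(11, 6.5):.2f}x")
    # About 20,000 extra variants at a 0.5% recall gain.
    print(f"implied variants per sample: {implied_total_variants(20_000, 0.005):,.0f}")
```

Under these assumptions the quoted runtimes correspond to a speedup of roughly 1.7x, and the recall figures imply on the order of 4 million variants per sample.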
biorxiv bioinformatics 100-200-users 2017
CUT&RUN: Targeted in situ genome-wide profiling with high efficiency for low cell numbers, bioRxiv, 2017-09-25
SUMMARY: Cleavage Under Targets and Release Using Nuclease (CUT&RUN) is an epigenomic profiling strategy in which antibody-targeted controlled cleavage by micrococcal nuclease releases specific protein-DNA complexes into the supernatant for paired-end DNA sequencing. As only the targeted fragments enter into solution, and the vast majority of DNA is left behind, CUT&RUN has exceptionally low background levels. CUT&RUN outperforms the most widely used Chromatin Immunoprecipitation (ChIP) protocols in resolution, signal-to-noise, and depth of sequencing required. In contrast to ChIP, CUT&RUN is free of solubility and DNA accessibility artifacts and can be used to profile insoluble chromatin and to detect long-range 3D contacts without cross-linking. Here we present an improved CUT&RUN protocol that does not require isolation of nuclei and provides high-quality data starting with only 100 cells for a histone modification and 1000 cells for a transcription factor. From cells to purified DNA, CUT&RUN requires less than a day at the lab bench.
biorxiv genomics 100-200-users 2017
Multi-platform discovery of haplotype-resolved structural variation in human genomes, bioRxiv, 2017-09-24
ABSTRACT: The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, and strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three human parent–child trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per human genome. We also discover 156 inversions per genome, most of which previously escaped detection. Fifty-eight of the inversions we discovered intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The method and the dataset serve as a gold standard for the scientific community and we make specific recommendations for maximizing structural variation sensitivity for future large-scale genome sequencing studies.
biorxiv genomics 100-200-users 2017
Strelka2: Fast and accurate variant calling for clinical sequencing applications, bioRxiv, 2017-09-24
We describe Strelka2 (https://github.com/Illumina/strelka), an open-source small variant calling method for clinical germline and somatic sequencing applications. Strelka2 introduces a novel mixture-model based estimation of indel error parameters from each sample, an efficient tiered haplotype modeling strategy and a normal sample contamination model to improve liquid tumor analysis. For both germline and somatic calling, Strelka2 substantially outperforms current leading tools on both variant calling accuracy and compute cost.
biorxiv bioinformatics 100-200-users 2017
Ancient genomes from North Africa evidence prehistoric migrations to the Maghreb from both the Levant and Europe, bioRxiv, 2017-09-22
ABSTRACT: The extent to which prehistoric migrations of farmers influenced the genetic pool of western North Africans remains unclear. Archaeological evidence suggests the Neolithization process may have happened through the adoption of innovations by local Epipaleolithic communities, or by demic diffusion from the Eastern Mediterranean shores or Iberia. Here, we present the first analysis of individuals’ genome sequences from early and late Neolithic sites in Morocco, as well as Early Neolithic individuals from southern Iberia. We show that Early Neolithic Moroccans are distinct from any other reported ancient individuals and possess an endemic element retained in present-day Maghrebi populations, confirming a long-term genetic continuity in the region. Among ancient populations, Early Neolithic Moroccans are distantly related to Levantine Natufian hunter-gatherers (∼9,000 BCE) and Pre-Pottery Neolithic farmers (∼6,500 BCE). Although an expansion in Early Neolithic times is also plausible, the high divergence observed in Early Neolithic Moroccans suggests a long-term isolation and an early arrival in North Africa for this population. This scenario is consistent with early Neolithic traditions in North Africa deriving from Epipaleolithic communities who adopted certain innovations from neighbouring populations. Late Neolithic (∼3,000 BCE) Moroccans, in contrast, share an Iberian component, supporting theories of trans-Gibraltar gene flow. Finally, the southern Iberian Early Neolithic samples share the same genetic composition as the Cardial Mediterranean Neolithic culture that reached Iberia ∼5,500 BCE. The cultural and genetic similarities of the Iberian Neolithic cultures with that of North African Neolithic sites further reinforce the model of an Iberian migration into the Maghreb.
SIGNIFICANCE STATEMENT: The acquisition of agricultural techniques during the so-called Neolithic revolution has been one of the major steps forward in human history. Using next-generation sequencing and ancient DNA techniques, we directly test if Neolithization in North Africa occurred through the transmission of ideas or by demic diffusion. We show that Early Neolithic Moroccans are composed of an endemic Maghrebi element still retained in present-day North African populations and distantly related to Epipaleolithic communities from the Levant. However, late Neolithic individuals from North Africa are admixed, with a North African and a European component. Our results support the idea that the Neolithization of North Africa might have involved both the development of Epipaleolithic communities and the migration of people from Europe.
biorxiv genetics 100-200-users 2017
Updating the 97% identity threshold for 16S ribosomal RNA OTUs, bioRxiv, 2017-09-22
Abstract: The 16S ribosomal RNA (rRNA) gene is widely used to survey microbial communities. Sequences are often clustered into Operational Taxonomic Units (OTUs) as proxies for species. The canonical clustering threshold is 97% identity, which was proposed in 1994 when few 16S rRNA sequences were available, motivating a reassessment on current data. Using a large set of high-quality 16S rRNA sequences from finished genomes, I assessed the correspondence of OTUs to species for five representative clustering algorithms using four accuracy metrics. All algorithms had comparable accuracy when tuned to a given metric. Optimal identity thresholds that best approximated species were ∼99% for full-length sequences and ∼100% for the V4 hypervariable region.
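To illustrate how an identity threshold shapes OTUs, here is a minimal sketch of greedy centroid-based clustering. It is not one of the five algorithms assessed in the preprint. It assumes the sequences are already aligned to equal length and defines identity as the fraction of matching positions, and the names `identity` and `greedy_otus` are hypothetical.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs: List[str], threshold: float = 0.97) -> List[List[str]]:
    """Assign each sequence to the first centroid it matches at >= threshold.

    A sequence that matches no existing centroid founds a new OTU and
    becomes that OTU's centroid. The result depends on input order.
    """
    centroids: List[str] = []
    otus: List[List[str]] = []
    for seq in seqs:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                otus[i].append(seq)
                break
        else:
            centroids.append(seq)
            otus.append([seq])
    return otus


if __name__ == "__main__":
    a = "A" * 100
    b = "A" * 98 + "CC"       # 98% identical to a
    c = "A" * 90 + "C" * 10   # 90% identical to a
    print(len(greedy_otus([a, b, c], threshold=0.97)))  # a and b share an OTU
    print(len(greedy_otus([a, b, c], threshold=0.99)))  # every sequence is its own OTU
```

Raising the threshold from 97% to 99% splits the toy sequences `a` and `b` into separate OTUs, which is the kind of effect the reassessed thresholds would have on real data.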
biorxiv bioinformatics 100-200-users 2017