Best Practices for Benchmarking Germline Small Variant Calls in Human Genomes, bioRxiv, 2018-02-24
Abstract: Assessing accuracy of NGS variant calling is immensely facilitated by a robust benchmarking strategy and tools to carry it out in a standard way. Benchmarking variant calls requires careful attention to definitions of performance metrics, sophisticated comparison approaches, and stratification by variant type and genome context. The Global Alliance for Genomics and Health (GA4GH) Benchmarking Team has developed standardized performance metrics and tools for benchmarking germline small variant calls. This team includes representatives from sequencing technology developers, government agencies, academic bioinformatics researchers, clinical laboratories, and commercial technology and bioinformatics developers for whom benchmarking variant calls is essential to their work. Benchmarking variant calls is a challenging problem for many reasons:
- Evaluating variant calls requires complex matching algorithms and standardized counting because the same variant may be represented differently in truth and query callsets.
- Defining and interpreting resulting metrics such as precision (aka positive predictive value = TP/(TP+FP)) and recall (aka sensitivity = TP/(TP+FN)) requires standardization to draw robust conclusions about comparative performance for different variant calling methods.
- Performance of NGS methods can vary depending on variant types and genome context; and as a result understanding performance requires meaningful stratification.
- High-confidence variant calls and regions that can be used as “truth” to accurately identify false positives and negatives are difficult to define, and reliable calls for the most challenging regions and variants remain out of reach.
We have made significant progress on standardizing comparison methods, metric definitions and reporting, as well as developing and using truth sets.
Our methods are publicly available on GitHub (https://github.com/ga4gh/benchmarking-tools) and in a web-based app on precisionFDA, which allow users to compare their variant calls against truth sets and to obtain a standardized report on their variant calling performance. Our methods have been piloted in the precisionFDA variant calling challenges to identify the best-in-class variant calling methods within high-confidence regions. Finally, we recommend a set of best practices for using our tools and critically evaluating the results.
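To make the metric definitions in this abstract concrete, here is a minimal Python sketch of precision = TP/(TP+FP) and recall = TP/(TP+FN), computed per stratum. It is only an illustration: the `Counts` class, the function names, the stratum labels, and all counts are hypothetical, and it does not reproduce the matching or counting logic of the GA4GH benchmarking tools.

```python
from dataclasses import dataclass
from math import nan


@dataclass(frozen=True)
class Counts:
    """Hypothetical container for per-stratum comparison counts."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def precision(c: Counts) -> float:
    """Precision (positive predictive value) = TP / (TP + FP)."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else nan


def recall(c: Counts) -> float:
    """Recall (sensitivity) = TP / (TP + FN)."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else nan


# Made-up counts, stratified by variant type (illustrative only).
strata = {
    "SNV": Counts(tp=90, fp=10, fn=30),
    "indel": Counts(tp=40, fp=10, fn=40),
}

for name, counts in strata.items():
    print(f"{name}: precision={precision(counts):.3f} recall={recall(counts):.3f}")
```

Reporting the two metrics per stratum rather than as one pooled number is what keeps a weak category, such as indels in this made-up example, from being hidden by a strong one.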
biorxiv genomics 100-200-users 2018
Emerging Evidence of Chromosome Folding by Loop Extrusion, bioRxiv, 2018-02-17
Abstract: Chromosome organization poses a remarkable physical problem with many biological consequences: how can molecular interactions between proteins at the nanometer scale organize micron-long chromatinized DNA molecules, insulating or facilitating interactions between specific genomic elements? The mechanism of active loop extrusion holds great promise for explaining interphase and mitotic chromosome folding, yet remains difficult to assay directly. We discuss predictions from our polymer models of loop extrusion with barrier elements, and review recent experimental studies that provide strong support for loop extrusion, focusing on perturbations to CTCF and cohesin assayed via Hi-C in interphase. Finally, we discuss a likely molecular mechanism of loop extrusion by SMC complexes.
biorxiv genomics 100-200-users 2018
On the design of CRISPR-based single cell molecular screens, bioRxiv, 2018-01-30
Abstract: Several groups recently reported coupling CRISPR/Cas9 perturbations and single cell RNA-seq as a potentially powerful approach for forward genetics. Here we demonstrate that vector designs for such screens that rely on cis linkage of guides and distally located barcodes suffer from swapping of intended guide-barcode associations at rates approaching 50% due to template switching during lentivirus production, greatly reducing sensitivity. We optimize a published strategy, CROP-seq, that instead uses a Pol II transcribed copy of the sgRNA sequence itself, doubling the rate at which guides are assigned to cells to 94%. We confirm this strategy performs robustly and further explore experimental best practices for CRISPR/Cas9-based single cell molecular screens.
biorxiv genomics 100-200-users 2018
The Juicebox Assembly Tools module facilitates de novo assembly of mammalian genomes with chromosome-length scaffolds for under $1000, bioRxiv, 2018-01-29
Hi-C contact maps are valuable for genome assembly (Lieberman-Aiden, van Berkum et al. 2009; Burton et al. 2013; Dudchenko et al. 2017). Recently, we developed Juicebox, a system for the visual exploration of Hi-C data (Durand, Robinson et al. 2016), and 3D-DNA, an automated pipeline for using Hi-C data to assemble genomes (Dudchenko et al. 2017). Here, we introduce “Assembly Tools,” a new module for Juicebox, which provides a point-and-click interface for using Hi-C heatmaps to identify and correct errors in a genome assembly. Together, 3D-DNA and the Juicebox Assembly Tools greatly reduce the cost of accurately assembling complex eukaryotic genomes. To illustrate, we generated de novo assemblies with chromosome-length scaffolds for three mammals: the wombat, Vombatus ursinus (3.3Gb), the Virginia opossum, Didelphis virginiana (3.3Gb), and the raccoon, Procyon lotor (2.5Gb). The only inputs for each assembly were Illumina reads from a short insert DNA-Seq library (300 million Illumina reads, maximum length 2x150 bases) and an in situ Hi-C library (100 million Illumina reads, maximum read length 2x150 bases), which cost <$1000.
biorxiv genomics 100-200-users 2018
Resistance gene discovery and cloning by sequence capture and association genetics, bioRxiv, 2018-01-16
Genetic resistance is the most economic and environmentally sustainable approach for crop disease protection. Disease resistance (R) genes from wild relatives are a valuable resource for breeding resistant crops. However, introgression of R genes into crops is a lengthy process often associated with co-integration of deleterious linked genes [1, 2], and pathogens can rapidly evolve to overcome R genes when deployed singly [3]. Introducing multiple cloned R genes into crops as a stack would avoid linkage drag and delay emergence of resistance-breaking pathogen races [4]. However, current R gene cloning methods require segregating or mutant progenies [5–10], which are difficult to generate for many wild relatives due to poor agronomic traits. We exploited natural pan-genome variation in a wild diploid wheat by combining association genetics with R gene enrichment sequencing (AgRenSeq) to clone four stem rust resistance genes in <6 months. RenSeq combined with diversity panels is therefore a major advance in isolating R genes for engineering broad-spectrum resistance in crops.
biorxiv genomics 100-200-users 2018
Easy Hi-C: A simple efficient protocol for 3D genome mapping in small cell populations, bioRxiv, 2018-01-11
Despite the growing interest in studying mammalian genome organization, it is still challenging to map DNA contacts genome-wide. Here we present easy Hi-C (eHi-C), a highly efficient method for unbiased mapping of 3D genome architecture. The eHi-C protocol involves only a series of enzymatic reactions and maximizes the recovery of DNA products from proximity ligation. We show that eHi-C can be performed with 0.1 million cells and yields high-quality libraries comparable to Hi-C.
biorxiv genomics 0-100-users 2018