Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes, bioRxiv, 2019-03-11
AbstractMulti-nucleotide variants (MNVs), defined as two or more nearby variants existing on the same haplotype in an individual, are a clinically and biologically important class of genetic variation. However, existing tools for variant interpretation typically do not accurately classify MNVs, and understanding of their mutational origins remains limited. Here, we systematically survey MNVs in 125,748 whole exomes and 15,708 whole genomes from the Genome Aggregation Database (gnomAD). We identify 1,996,125 MNVs across the genome with constituent variants falling within 2 bp distance of one another, of which 31,510 exist within the same codon, including 405 predicted to result in gain of a nonsense mutation, 1,818 predicted to rescue a nonsense mutation event that would otherwise be caused by one of the constituent variants, and 16,481 additional variants predicted to alter protein sequences. We show that the distribution of MNVs is highly non-uniform across the genome, and that this non-uniformity can be largely explained by a variety of known mutational mechanisms, such as CpG deamination, replication error by polymerase zeta, or polymerase slippage at repeat junctions. We also provide an estimate of the dinucleotide mutation rate caused by polymerase zeta. Finally, we show that differential CpG methylation drives MNV differences across functional categories. Our results demonstrate the importance of incorporating haplotype-aware annotation for accurate functional interpretation of genetic variation, and refine our understanding of genome-wide mutational mechanisms of MNVs.
biorxiv genomics 0-100-users 2019Whole exome sequencing and characterization of coding variation in 49,960 individuals in the UK Biobank, bioRxiv, 2019-03-09
SUMMARYThe UK Biobank is a prospective study of 502,543 individuals, combining extensive phenotypic and genotypic data with streamlined access for researchers around the world. Here we describe the first tranche of large-scale exome sequence data for 49,960 study participants, revealing approximately 4 million coding variants (of which ~98.4% have frequency < 1%). The data includes 231,631 predicted loss of function variants, a >10-fold increase compared to imputed sequence for the same participants. Nearly all genes (>97%) had ≥1 predicted loss of function carrier, and most genes (>69%) had ≥10 loss of function carriers. We illustrate the power of characterizing loss of function variation in this large population through association analyses across 1,741 phenotypes. In addition to replicating a range of established associations, we discover novel loss of function variants with large effects on disease traits, including PIEZO1 on varicose veins, COL6A1 on corneal resistance, MEPE on bone density, and IQGAP2 and GMPR on blood cell traits. We further demonstrate the value of exome sequencing by surveying the prevalence of pathogenic variants of clinical significance in this population, finding that 2% of the population has a medically actionable variant. Additionally, we leverage the phenotypic data to characterize the relationship between rare BRCA1 and BRCA2 pathogenic variants and cancer risk. Exomes from the first 49,960 participants are now made accessible to the scientific community and highlight the promise offered by genomic sequencing in large-scale population-based studies.
biorxiv genomics 200-500-users 2019A rapid and simple method for assessing and representing genome sequence relatedness, bioRxiv, 2019-03-08
AbstractCoherent genomic groups are frequently used as a proxy for bacterial species delineation through computation of overall genome relatedness indices (OGRI). Average nucleotide identity (ANI) is a widely employed method for estimating relatedness between genomic sequences. However, pairwise comparisons of genome sequences based on ANI is relatively computationally intensive and therefore precludes analyses of large datasets composed of thousand genome sequences.In this work we evaluated an alternative OGRI based on k-mers counts to study prokaryotic species delimitation. A dataset containing more than 3,500 Pseudomonas genome sequences was successfully classified in few hours with the same precision as ANI. A new visualization method based on zoomable circle packing was employed for assessing relationships between among the 350 cliques generated. Amendment of databases with these Pseudomonas cliques greatly improved the classification of metagenomic read sets with k-mers-based classifier.The developed workflow was integrated in the user-friendly KI-S tool that is available at the following address <jatsext-link xmlnsxlink=httpwww.w3.org1999xlink ext-link-type=uri xlinkhref=httpsiris.angers.inra.frgalaxypub-cfbp>httpsiris.angers.inra.frgalaxypub-cfbp<jatsext-link>.
biorxiv genomics 0-100-users 2019Mapping Histone Modifications in Low Cell Number and Single Cells Using Antibody-guided Chromatin Tagmentation (ACT-seq), bioRxiv, 2019-03-08
ABSTRACTModern next-generation sequencing-based methods have empowered researchers to assay the epigenetic states of individual cells. Existing techniques for profiling epigenetic marks in single cells often require the use and optimization of time-intensive procedures such as drop fluidics, chromatin fragmentation, and end repair. Here we describe ACT-seq, a novel and streamlined method for mapping genome-wide distributions of histone tail modifications, histone variants, and chromatin-binding proteins in a small number of or single cells. ACT-seq utilizes a fusion of Tn5 transposase to Protein A that is targeted to chromatin by a specific antibody, allowing chromatin fragmentation and sequence tag insertion specifically at genomic sites presenting the relevant antigen. The Tn5 transposase enables the use of an index multiplexing strategy (iACT-seq), which enables construction of thousands of single-cell libraries in one day by a single researcher without the need for drop-based fluidics or visual sorting. We conclude that ACT-seq present an attractive alternative to existing techniques for mapping epigenetic marks in single cells.
biorxiv genomics 0-100-users 2019CUT&Tag for efficient epigenomic profiling of small samples and single cells, bioRxiv, 2019-03-07
AbstractMany chromatin features play critical roles in regulating gene expression. A complete understanding of gene regulation will require the mapping of specific chromatin features in small samples of cells at high resolution. Here we describe Cleavage Under Targets and Tagmentation (CUT&Tag), an enzyme-tethering strategy that provides efficient high-resolution sequencing libraries for profiling diverse chromatin components. In CUT&Tag, a chromatin protein is bound in situ by a specific antibody, which then tethers a protein A-Tn5 transposase fusion protein. Activation of the transposase efficiently generates fragment libraries with high resolution and exceptionally low background. All steps from live cells to sequencing-ready libraries can be performed in a single tube on the benchtop or a microwell in a high-throughput pipeline, and the entire procedure can be performed in one day. We demonstrate the utility of CUT&Tag by profiling histone modifications, RNA Polymerase II and transcription factors on low cell numbers and single cells.
biorxiv genomics 200-500-users 2019Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program, bioRxiv, 2019-03-07
Summary paragraphThe Trans-Omics for Precision Medicine (TOPMed) program seeks to elucidate the genetic architecture and disease biology of heart, lung, blood, and sleep disorders, with the ultimate goal of improving diagnosis, treatment, and prevention. The initial phases of the program focus on whole genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here, we describe TOPMed goals and design as well as resources and early insights from the sequence data. The resources include a variant browser, a genotype imputation panel, and sharing of genomic and phenotypic data via dbGaP. In 53,581 TOPMed samples, >400 million single-nucleotide and insertiondeletion variants were detected by alignment with the reference genome. Additional novel variants are detectable through assembly of unmapped reads and customized analysis in highly variable loci. Among the >400 million variants detected, 97% have frequency <1% and 46% are singletons. These rare variants provide insights into mutational processes and recent human evolutionary history. The nearly complete catalog of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and non-coding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and extends the reach of nearly all genome-wide association studies to include variants down to ~0.01% in frequency.
biorxiv genomics 100-200-users 2019