Characterising the loss-of-function impact of 5' untranslated region variants in whole genome sequence data from 15,708 individuals, bioRxiv, 2019-02-08
Upstream open reading frames (uORFs) are important tissue-specific cis-regulators of protein translation. Although isolated case reports have shown that variants that create or disrupt uORFs can cause disease, genetic sequencing approaches typically focus on protein-coding regions and ignore these variants. Here, we describe a systematic genome-wide study of variants that create and disrupt human uORFs, and explore their role in human disease using 15,708 whole genome sequences collected by the Genome Aggregation Database (gnomAD) project. We show that 14,897 variants that create new start codons upstream of the canonical coding sequence (CDS), and 2,406 variants disrupting the stop site of existing uORFs, are under strong negative selection. Furthermore, variants creating uORFs that overlap the CDS show signals of selection equivalent to coding loss-of-function variants, and uORF-perturbing variants are under strong selection when arising upstream of known disease genes and genes intolerant to loss-of-function variants. Finally, we identify specific genes where perturbation of uORFs is likely to represent an important disease mechanism, and report a novel uORF frameshift variant upstream of NF2 in families with neurofibromatosis. Our results highlight uORF-perturbing variants as an important and under-recognised functional class that can contribute to penetrant human disease, and demonstrate the power of large-scale population sequencing data to study the deleteriousness of specific classes of non-coding variants.
biorxiv genomics 200-500-users 2019
Characterization of prevalence and health consequences of uniparental disomy in four million individuals from the general population, bioRxiv, 2019-02-06
Meiotic nondisjunction and resulting aneuploidy can lead to severe health consequences in humans. Aneuploidy rescue can restore euploidy but may result in uniparental disomy (UPD), the inheritance of both homologs of a chromosome from one parent with no representative copy from the other. Current understanding of UPD is limited to ~3,300 cases for which UPD was associated with clinical presentation due to imprinting disorders or recessive diseases. Thus, the prevalence of UPD and its phenotypic consequences in the general population are unknown. We searched for instances of UPD in over four million consented research participants from the personal genetics company 23andMe, Inc., and 431,094 UK Biobank participants. Using computationally detected DNA segments identical-by-descent (IBD) and runs of homozygosity (ROH), we identified 675 instances of UPD across both databases. Here we present the first characterization of UPD prevalence in the general population, a machine-learning framework to detect UPD using ROH, and a novel association between autism and UPD of chromosome 22.
biorxiv genomics 0-100-users 2019
Comparative analysis of commercially available single-cell RNA sequencing platforms for their performance in complex human tissues, bioRxiv, 2019-02-06
The past five years have witnessed tremendous growth in single-cell RNA-seq methodologies. Currently, there are three major commercial platforms for single-cell RNA-seq: Fluidigm C1, Clontech iCell8 (formerly Wafergen) and 10x Genomics Chromium. Here, we provide a systematic comparison of the throughput, sensitivity, cost and other performance statistics for these three platforms using single cells from primary human islets. Primary human islets represent a complex biological system where multiple cell types coexist, with varying cellular abundance, diverse transcriptomic profiles and differing total RNA contents. We apply standard pipelines optimized for each system to derive gene expression matrices. We further evaluate the performance of each system by benchmarking single-cell data against bulk RNA-seq data from sorted cell fractions. Our analyses can be generalized to a variety of complex biological systems and serve as a guide for newcomers to the field of single-cell RNA-seq when selecting platforms.
biorxiv genomics 100-200-users 2019
The functional landscape of the human phosphoproteome, bioRxiv, 2019-02-06
Protein phosphorylation is a key post-translational modification regulating protein function in almost all cellular processes. While tens of thousands of phosphorylation sites have been identified in human cells to date, the extent and functional importance of the phosphoproteome remain largely unknown. Here, we have analyzed 6,801 publicly available phospho-enriched mass spectrometry proteomics experiments, creating a state-of-the-art phosphoproteome containing 119,809 human phosphosites. To prioritize functional sites, 59 features indicative of proteomic, structural, regulatory or evolutionary relevance were integrated into a single functional score using machine learning. We demonstrate how this prioritization identifies regulatory phosphosites across different molecular mechanisms and pinpoint genetic susceptibilities at a genomic scale. Several novel regulatory phosphosites were experimentally validated, including a role in neuronal differentiation for phosphosites present in the SWI/SNF complex member SMARCC2. The scored reference phosphoproteome and its annotations identify the most relevant phosphorylations for a given process or disease, addressing a major bottleneck in cell signaling studies.
biorxiv genomics 0-100-users 2019
Supervised classification enables rapid annotation of cell atlases, bioRxiv, 2019-02-05
Single cell technologies for profiling tissues or even entire organisms are rapidly being adopted. However, the manual process by which cell types are typically annotated in the resulting data is labor-intensive and increasingly rate-limiting for the field. Here we describe Garnett, an algorithm and accompanying software for rapidly annotating cell types in scRNA-seq and scATAC-seq datasets, based on an interpretable, hierarchical markup language of cell type-specific genes. Garnett successfully classifies cell types in tissue and whole organism datasets, as well as across species.
biorxiv genomics 200-500-users 2019