An atlas of genetic associations in UK Biobank, bioRxiv, 2017-08-17
Abstract: Genome-wide association studies have revealed many loci contributing to the variation of complex traits, yet the majority of loci that contribute to the heritability of complex traits remain elusive. Large study populations with sufficient statistical power are required to detect the small effect sizes of the yet unidentified genetic variants. However, the analysis of huge cohorts, like UK Biobank, is complicated by incidental structure present when collecting such large cohorts. For instance, UK Biobank comprises 107,162 third degree or closer related participants. Traditionally, GWAS have removed related individuals because they comprised an insignificant proportion of the overall sample size; however, removing related individuals in UK Biobank would entail a substantial loss of power. Furthermore, modelling such structure using linear mixed models is computationally expensive and requires a computational infrastructure that may not be accessible to all researchers. Here we present an atlas of genetic associations for 118 non-binary and 599 binary traits of 408,455 related and unrelated UK Biobank participants of White-British descent. Results are compiled in a publicly accessible database that allows querying genome-wide association summary results for 623,944 genotyped and HapMap2 imputed SNPs, as well as downloading whole GWAS summary statistics for over 30 million imputed SNPs from the Haplotype Reference Consortium panel. Our atlas of associations (GeneATLAS, http://geneatlas.roslin.ed.ac.uk) will help researchers to query UK Biobank results in an easy way without the need to incur high computational costs.
biorxiv genomics 100-200-users 2017
Frequent lack of repressive capacity of promoter DNA methylation identified through genome-wide epigenomic manipulation, bioRxiv, 2017-08-17
Abstract: It is widely assumed that the addition of DNA methylation at CpG-rich gene promoters silences gene transcription. However, this conclusion is largely drawn from the observation that promoter DNA methylation inversely correlates with gene expression in natural conditions. The effect of induced DNA methylation on endogenous promoters has yet to be comprehensively assessed. Here, we induced the simultaneous methylation of thousands of promoters in the genome of human cells using an engineered zinc finger-DNMT3A fusion protein, enabling assessment of the effect of forced DNA methylation upon transcription, histone modifications, and DNA methylation persistence after the removal of the fusion protein. We find that DNA methylation is frequently insufficient to transcriptionally repress promoters. Furthermore, DNA methylation deposited at promoter regions associated with H3K4me3 is rapidly erased after removal of the zinc finger-DNMT3A fusion protein. Finally, we demonstrate that induced DNA methylation can exist simultaneously on promoter nucleosomes that possess the active histone modification H3K4me3, or DNA bound by the initiated form of RNA polymerase II. These findings suggest that promoter DNA methylation is not generally sufficient for transcriptional inactivation, with implications for the emerging field of epigenome engineering. One Sentence Summary: Genome-wide epigenomic manipulation of thousands of human promoters reveals that induced promoter DNA methylation is unstable and frequently does not function as a primary instructive biochemical signal for gene silencing and chromatin reconfiguration.
biorxiv genomics 500+-users 2017
Extracting a Biologically Relevant Latent Space from Cancer Transcriptomes with Variational Autoencoders, bioRxiv, 2017-08-12
The Cancer Genome Atlas (TCGA) has profiled over 10,000 tumors across 33 different cancer-types for many genomic features, including gene expression levels. Gene expression measurements capture substantial information about the state of each tumor. Certain classes of deep neural network models are capable of learning a meaningful latent space. Such a latent space could be used to explore and generate hypothetical gene expression profiles under various types of molecular and genetic perturbation. For example, one might wish to use such a model to predict a tumor’s response to specific therapies or to characterize complex gene expression activations existing in differential proportions in different tumors. Variational autoencoders (VAEs) are a deep neural network approach capable of generating meaningful latent spaces for image and text data. In this work, we sought to determine the extent to which a VAE can be trained to model cancer gene expression, and whether or not such a VAE would capture biologically-relevant features. In the following report, we introduce a VAE trained on TCGA pan-cancer RNA-seq data, identify specific patterns in the VAE encoded features, and discuss potential merits of the approach. We name our method “Tybalt” after an instigative, cat-like character who sets a cascading chain of events in motion in Shakespeare’s “Romeo and Juliet”. From a systems biology perspective, Tybalt could one day aid in cancer stratification or predict specific activated expression patterns that would result from genetic changes or treatment effects.
biorxiv bioinformatics 0-100-users 2017
MSPminer: abundance-based reconstitution of microbial pan-genomes from shotgun meta-genomic data, bioRxiv, 2017-08-09
Abstract. Motivation: Analysis toolkits for shotgun metagenomic data achieve strain-level characterization of complex microbial communities by capturing intra-species gene content variation. Yet, these tools are hampered by the extent of reference genomes that are far from covering all microbial variability, as many species are still not sequenced or have only few strains available. Binning co-abundant genes obtained from de novo assembly is a powerful reference-free technique to discover and reconstitute gene repertoire of microbial species. While current methods accurately identify species core parts, they miss many accessory genes or split them into small gene groups that remain unassociated to core clusters. Results: We introduce MSPminer, a computationally efficient software tool that reconstitutes Metagenomic Species Pan-genomes (MSPs) by binning co-abundant genes across metagenomic samples. MSPminer relies on a new robust measure of proportionality coupled with an empirical classifier to group and distinguish not only species core genes but accessory genes also. Applied to a large scale metagenomic dataset, MSPminer successfully delineates in a few hours the gene repertoires of 1,661 microbial species with similar specificity and higher sensitivity than existing tools. The taxonomic annotation of MSPs reveals microorganisms hitherto unknown and brings coherence in the nomenclature of the species of the human gut microbiota. The provided MSPs can be readily used for taxonomic profiling and biomarkers discovery in human gut metagenomic samples. In addition, MSPminer can be applied on gene count tables from other ecosystems to perform similar analyses. Availability: The binary is freely available for non-commercial users at enterome.fr/site/downloads. Contact: florian.plaza-onate@inra.fr. Supplementary information: Available in the file named Supplementary Information.pdf
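To make the idea of binning co-abundant genes concrete, here is a minimal sketch in Python. It is not MSPminer's method: the abstract does not specify its proportionality measure or classifier, so this example swaps in a simple stand-in, the median absolute deviation of per-sample log-ratios, which is zero when one gene's counts are an exact multiple of another's. The function names, the `max_spread` threshold, and the toy count table are all hypothetical.

```python
import math
from statistics import median

def log_ratio_spread(x, y):
    """Median absolute deviation of log(y/x) across samples.

    Only samples where both genes have positive counts are used.
    Returns None when fewer than 3 such samples exist.
    Stand-in measure, NOT the one used by MSPminer.
    """
    ratios = [math.log(b / a) for a, b in zip(x, y) if a > 0 and b > 0]
    if len(ratios) < 3:
        return None
    centre = median(ratios)
    return median([abs(r - centre) for r in ratios])

def bin_with_seed(seed, counts, max_spread=0.1):
    """Return genes whose counts are roughly proportional to the seed gene.

    `max_spread` is an arbitrary illustrative threshold.
    """
    binned = []
    for name, vec in counts.items():
        spread = log_ratio_spread(counts[seed], vec)
        if spread is not None and spread <= max_spread:
            binned.append(name)
    return binned

# Toy gene count table: one list of counts per gene, one entry per sample.
counts = {
    "geneA": [10, 20, 0, 40, 80],
    "geneB": [5, 10, 0, 20, 40],   # exactly half of geneA in every sample
    "geneC": [30, 2, 15, 1, 60],   # unrelated abundance pattern
}
members = bin_with_seed("geneA", counts)
print(members)
```

In this toy table, geneB follows geneA at a constant ratio and is grouped with it, while geneC is left out. A real tool would also need to handle sampling noise at low counts and accessory genes that are absent from some strains, which this sketch ignores.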
biorxiv bioinformatics 0-100-users 2017
Genome-wide association study identifies 30 Loci Associated with Bipolar Disorder, bioRxiv, 2017-08-08
Abstract: Bipolar disorder is a highly heritable psychiatric disorder that features episodes of mania and depression. We performed the largest genome-wide association study to date, including 20,352 cases and 31,358 controls of European descent, with follow-up analysis of 822 sentinel variants at loci with P < 1×10⁻⁴ in an independent sample of 9,412 cases and 137,760 controls. In the combined analysis, 30 loci reached genome-wide significant evidence for association, of which 20 were novel. These significant loci contain genes encoding ion channels and neurotransmitter transporters (CACNA1C, GRIN2A, SCN2A, SLC4A1), synaptic components (RIMS1, ANK3), and immune and energy metabolism components. Bipolar disorder type I (depressive and manic episodes; ~73% of our cases) is strongly genetically correlated with schizophrenia whereas bipolar disorder type II (depressive and hypomanic episodes; ~17% of our cases) is more strongly correlated with major depressive disorder. These findings address key clinical questions and provide potential new biological mechanisms for bipolar disorder.
biorxiv genetics 100-200-users 2017
Bioinformatics Core Competencies for Undergraduate Life Sciences Education, bioRxiv, 2017-08-04
Abstract: Bioinformatics is becoming increasingly central to research in the life sciences. However, despite its importance, bioinformatics skills and knowledge are not well integrated in undergraduate biology education. This curricular gap prevents biology students from harnessing the full potential of their education, limiting their career opportunities and slowing genomic research innovation. To advance the integration of bioinformatics into life sciences education, a framework of core bioinformatics competencies is needed. To that end, we here report the results of a survey of life sciences faculty in the United States about teaching bioinformatics to undergraduate life scientists. Responses were received from 1,260 faculty representing institutions in all fifty states with a combined capacity to educate hundreds of thousands of students every year. Results indicate strong, widespread agreement that bioinformatics knowledge and skills are critical for undergraduate life scientists, as well as considerable agreement about which skills are necessary. Perceptions of the importance of some skills varied with the respondent’s degree of training, time since degree earned, and/or the Carnegie classification of the respondent’s institution. To assess which skills are currently being taught, we analyzed syllabi of courses with bioinformatics content submitted by survey respondents. Finally, we used the survey results, the analysis of syllabi, and our collective research and teaching expertise to develop a set of bioinformatics core competencies for undergraduate life sciences students. These core competencies are intended to serve as a guide for institutions as they work to integrate bioinformatics into their life sciences curricula.
Significance Statement: Bioinformatics, an interdisciplinary field that uses techniques from computer science and mathematics to store, manage, and analyze biological data, is becoming increasingly central to modern biology research. Given the widespread use of bioinformatics and its impacts on societal problem-solving (e.g., in healthcare, agriculture, and natural resources management), there is a growing need for the integration of bioinformatics competencies into undergraduate life sciences education. Here, we present a set of bioinformatics core competencies for undergraduate life scientists developed using the results of a large national survey and the expertise of our working group of bioinformaticians and educators. We also present results from the survey on the importance of bioinformatics skills and the current state of integration of bioinformatics into biology education.
biorxiv bioinformatics 200-500-users 2017