Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts, bioRxiv, 2019-04-12
Abstract. Motivation: Genome-wide profiles of chromatin accessibility and gene expression in diverse cellular contexts are critical to decipher the dynamics of transcriptional regulation. Recently, convolutional neural networks (CNNs) have been used to learn predictive cis-regulatory DNA sequence models of context-specific chromatin accessibility landscapes. However, these context-specific regulatory sequence models cannot generalize predictions across cell types. Results: We introduce multi-modal, residual neural network architectures that integrate cis-regulatory sequence and context-specific expression of trans-regulators to predict genome-wide chromatin accessibility profiles across cellular contexts. We show that the average accessibility of a genomic region across training contexts can be a surprisingly powerful predictor. We leverage this feature and employ novel strategies for training models to enhance genome-wide prediction of shared and context-specific chromatin accessible sites across cell types. We interpret the models to reveal insights into cis and trans regulation of chromatin dynamics across 123 diverse cellular contexts. Availability: The code is available at https://github.com/kundajelab/ChromDragoNN. Contact: akundaje@stanford.edu
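The average-accessibility predictor mentioned in the abstract can be illustrated with a minimal Python sketch. The matrix layout (regions by contexts), the binary accessibility calls, and the function name are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def mean_accessibility_baseline(train_labels: np.ndarray) -> np.ndarray:
    """Score each region by its mean accessibility across training contexts.

    train_labels: array of shape (n_regions, n_train_contexts) holding 0/1
    accessibility calls (assumed layout, not taken from the paper).
    Returns one score per region, reused as the prediction for any
    held-out context.
    """
    return train_labels.mean(axis=1)

# Toy data: 4 regions x 3 training contexts (hypothetical values).
train = np.array([
    [1, 1, 1],   # accessible everywhere
    [0, 0, 0],   # closed everywhere
    [1, 0, 1],   # accessible in two of three contexts
    [0, 1, 0],   # accessible in one of three contexts
])
baseline = mean_accessibility_baseline(train)
```

Such a baseline ignores the held-out context entirely, which is why it can only capture shared sites and not context-specific ones.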
biorxiv genomics 100-200-users 2019
chromoMap: An R package for Interactive Visualization and Annotation of Chromosomes, bioRxiv, 2019-04-11
Abstract. Summary: chromoMap is an R package for constructing interactive visualizations of chromosomes/chromosomal regions, and mapping of chromosomal elements (like genes) onto them, of any living organism. The package takes separate tab-delimited files (BED-like) to specify the genomic coordinates of the chromosomes and the elements to annotate. Each rendered chromosome is composed of continuous loci of specific ranges where each locus, on hover, displays detailed information about the elements annotated within that locus range. By just tweaking parameters of a single function, users can generate a variety of plots that can either be saved as a static image or shared as HTML documents. Users can utilize the various prominent features of chromoMap including, but not limited to, visualizing polyploidy, creating chromosome heatmaps, mapping groups of elements, adding hyperlinks to elements, and multi-species chromosome visualization. Availability and implementation: The R package chromoMap is available under the GPL-3 Open Source license. It is included with a vignette for comprehensive understanding of its various features, and is freely available from https://CRAN.R-project.org/package=chromoMap. Contact: lakshayanand15@gmail.com Supplementary information: Supplementary data are available online.
biorxiv bioinformatics 100-200-users 2019
Comparing within- and between-family polygenic score prediction, bioRxiv, 2019-04-11
Abstract. Polygenic scores are a popular tool for prediction of complex traits. However, prediction estimates in samples of unrelated participants can include effects of population stratification, assortative mating and environmentally mediated parental genetic effects, a form of genotype-environment correlation (rGE). Comparing genome-wide polygenic score (GPS) predictions in unrelated individuals with predictions between siblings in a within-family design is a powerful approach to identify these different sources of prediction. Here, we compared within- to between-family GPS predictions of eight life outcomes (anthropometric, cognitive, personality and health) for eight corresponding GPSs. The outcomes were assessed in up to 2,366 dizygotic (DZ) twin pairs from the Twins Early Development Study from age 12 to age 21. To account for family clustering, we used mixed-effects modelling, simultaneously estimating within- and between-family effects for target- and cross-trait GPS prediction of the outcomes. There were three main findings: (1) DZ twin GPS differences predicted DZ differences in height, BMI, intelligence, educational achievement and ADHD symptoms; (2) target and cross-trait analyses indicated that GPS prediction estimates for cognitive traits (intelligence and educational achievement) were on average 60% greater between families than within families, but this was not the case for non-cognitive traits; and (3) this within- and between-family difference for cognitive traits disappeared after controlling for family socio-economic status (SES), suggesting that SES is a source of between-family prediction through rGE mechanisms. These results provide novel insights into the patterns by which rGE contributes to GPS prediction, while ruling out confounding due to population stratification and assortative mating.
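One common way to estimate within- and between-family effects simultaneously is to split each twin's GPS into the family mean (between-family component) and the deviation from that mean (within-family component). The abstract does not spell out its parameterization, so the Python sketch below assumes this decomposition; the function name and toy values are hypothetical.

```python
import numpy as np

def split_within_between(gps, family_id):
    """Decompose polygenic scores into between- and within-family parts.

    between[i] = mean GPS of individual i's family
    within[i]  = gps[i] - between[i]
    Both inputs are 1-D sequences of equal length (assumed layout).
    """
    gps = np.asarray(gps, dtype=float)
    family_id = np.asarray(family_id)
    # Map each individual to a compact family index 0..n_families-1.
    _, inverse = np.unique(family_id, return_inverse=True)
    family_sums = np.bincount(inverse, weights=gps)
    family_sizes = np.bincount(inverse)
    between = (family_sums / family_sizes)[inverse]
    within = gps - between
    return between, within

# Toy example: two DZ twin pairs (hypothetical standardized scores).
between, within = split_within_between(
    gps=[0.5, 1.5, -1.0, 1.0],
    family_id=[1, 1, 2, 2],
)
```

In a mixed-effects model the two components would then enter as separate fixed-effect predictors, with a random intercept per family to account for clustering.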
biorxiv genomics 100-200-users 2019
Not just one p: Multivariate GWAS of psychiatric disorders and their cardinal symptoms reveal two dimensions of cross-cutting genetic liabilities, bioRxiv, 2019-04-10
Abstract. A single dimension of general psychopathology, p, has been hypothesized to represent a general liability that spans multiple types of psychiatric disorders and non-clinical variation in psychiatric symptoms across the lifespan. We conducted genome-wide association analyses of lifetime symptoms of mania, psychosis, irritability in 124,952 to 208,315 individuals from UK Biobank, and then applied Genomic SEM to model the genetic relationships between these psychiatric symptoms and clinically-defined psychiatric disorders (schizophrenia, bipolar disorder, major depressive disorder). Two dimensions of cross-cutting genetic liability emerged: general vulnerability to self-reported symptoms (pself) versus transdiagnostic vulnerability to clinically-diagnosed disease (pclinician). These were only modestly correlated (rg = .344). Multivariate GWAS identified 145 and 11 independent and genome-wide significant loci for pclinician and pself, respectively, and improved polygenic prediction, relative to univariate GWAS, in hold-out samples. Despite the severe impairments in occupational and educational functioning seen in patients with schizophrenia and bipolar disorder, pself showed stronger and more pervasive genetic correlations with facets of socioeconomic disadvantage (educational attainment, income, and neighborhood deprivation), whereas pclinician was more strongly associated with medical disorders unrelated to the brain. Genetic variance in pclinician that was unrelated to general vulnerability to psychiatric symptoms was associated with less socioeconomic disadvantage, suggesting positive selection biases in clinical samples used in psychiatric GWAS. These findings inform criticisms of psychiatric nosology by suggesting that cross-disorder genetic liabilities identified in GWASs of clinician-defined psychiatric disease are relatively distinct from genetic liabilities operating on self-reported symptom variation in the general population.
biorxiv genetics 100-200-users 2019
Basal Contamination of Bulk Sequencing: Lessons from the GTEx dataset, bioRxiv, 2019-04-09
Abstract. Background: One of the challenges of next generation sequencing (NGS) is contaminating reads from other samples. We used the Genotype-Tissue Expression (GTEx) project, a large, diverse, and robustly generated dataset, as a useful resource to understand the factors that contribute to contamination. Results: We obtained 11,340 RNA-Seq samples, DNA variant call files (VCF) of 635 individuals, and technical metadata from GTEx as well as read count data from the Human Protein Atlas (HPA) and a pharmacogenetics study. We analyzed 48 tissues in GTEx. Of these, 24 had variant co-expression clusters of four known highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and CELA3A). Fifteen additional highly expressed genes from other tissues were also indicative of contamination (KRT4, KRT13, PGC, CPA1, GP2, PRL, LIPF, CTRB2, FGA, HP, CKM, FGG, MYBPC1, MYH2, ZG16B). Sample contamination by non-native genes was highly associated with a sample being sequenced on the same day as a tissue that natively has high levels of those genes. This was highly significant for both pancreas genes (p = 2.7E-75) and esophagus genes (p = 8.9E-154). We used genetic polymorphism differences between individuals as validation of the contamination. Specifically, 11 SNPs in five genes shown to contaminate non-native tissues demonstrated allelic differences between DNA-based genotypes and contaminated sample RNA-based genotypes. Low-level contamination affected 1,841 (15.8%) samples (defined as ≥500 PRSS1 read counts). It also led to eQTL assignments in inappropriate tissues among these 19 genes. In support of this type of contamination occurring widely, pancreas gene contamination (PRSS1) was also observed in the HPA dataset, where pancreas samples were sequenced, but not in the pharmacogenomics dataset, where they were not. Conclusions: Highly expressed, tissue-enriched genes basally contaminate the GTEx dataset, impacting on some downstream GTEx data analyses. This type of contamination is not unique to GTEx, being shared with other datasets. Awareness of this process will reduce assigning variable, contaminating low-level gene expression to disease processes.
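The read-count definition of low-level contamination (≥500 PRSS1 read counts) can be sketched as a simple filter. The record layout, the field names, and the exclusion of pancreas samples (where PRSS1 is natively expressed) are assumptions for illustration; the abstract does not state how the authors applied the threshold.

```python
def flag_contaminated(samples, threshold=500, native_tissue="Pancreas"):
    """Return IDs of samples whose PRSS1 read count meets the threshold.

    samples: list of dicts with hypothetical keys "id", "tissue" and
    "prss1_counts". Samples from the native tissue are skipped because
    high PRSS1 expression there is expected, not contamination
    (an assumption of this sketch).
    """
    return [
        s["id"]
        for s in samples
        if s["tissue"] != native_tissue and s["prss1_counts"] >= threshold
    ]

# Toy records (hypothetical values, not GTEx data).
toy_samples = [
    {"id": "S1", "tissue": "Pancreas", "prss1_counts": 250000},
    {"id": "S2", "tissue": "Lung", "prss1_counts": 812},
    {"id": "S3", "tissue": "Lung", "prss1_counts": 12},
    {"id": "S4", "tissue": "Heart", "prss1_counts": 500},
]
flagged = flag_contaminated(toy_samples)
```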
biorxiv genomics 100-200-users 2019
Basal Contamination of Sequencing: Lessons from the GTEx dataset, bioRxiv, 2019-04-09
Abstract. One of the challenges of next generation sequencing (NGS) is read contamination. We used the Genotype-Tissue Expression (GTEx) project, a large, diverse, and robustly generated dataset, to understand the factors that contribute to contamination. We obtained GTEx datasets and technical metadata and validating RNA-Seq from other studies. Of 48 analyzed tissues in GTEx, 26 had variant co-expression clusters of four known highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicated contamination. Sample contamination by non-native genes was associated with a sample being sequenced on the same day as a tissue that natively expressed those genes. This was highly significant for pancreas and esophagus genes (linear model, p = 9.5e-237 and p = 5e-260, respectively). Nine SNPs in four genes shown to contaminate non-native tissues demonstrated allelic differences between DNA-based genotypes and contaminated sample RNA-based genotypes, validating the contamination. Low-level contamination affected 4,497 (39.6%) samples (defined as ≥10 PRSS1 TPM). It also led to eQTL assignments in inappropriate tissues among these 18 genes. We note this type of contamination occurs widely, impacting bulk and single cell data set analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets, impacting analyses. Awareness of this process is necessary to avoid assigning inaccurate importance to low-level gene expression in inappropriate tissues and cells.
biorxiv genomics 100-200-users 2019