Revealing multi-scale population structure in large cohorts, bioRxiv, 2018-09-23
Genetic structure in large cohorts results from technical, sampling and demographic variation. Visualisation is therefore a first step in most genomic analyses. However, existing data exploration methods struggle with unbalanced sampling and the many scales of population structure. We investigate an approach to dimension reduction of genomic data that combines principal components analysis (PCA) with uniform manifold approximation and projection (UMAP) to succinctly illustrate population structure in large cohorts and capture their relationships on local and global scales. Using data from large-scale genomic datasets, we demonstrate that PCA-UMAP effectively clusters closely related individuals while placing them in a global continuum of genetic variation. This approach reveals previously overlooked subpopulations within the American Hispanic population and fine-scale relationships between geography, genotypes, and phenotypes in the UK population. This opens new lines of investigation for demographic research and statistical genetics. Given its small computational cost, PCA-UMAP also provides a general-purpose approach to exploratory analysis in population-scale datasets.
biorxiv genomics 100-200-users 2018
Genomic prediction of cognitive traits in childhood and adolescence, bioRxiv, 2018-09-18
Recent advances in genomics are producing powerful DNA predictors of complex traits, especially cognitive abilities. Here, we leveraged summary statistics from the most recent genome-wide association studies of intelligence and educational attainment to build prediction models of general cognitive ability and educational achievement. To this end, we compared the performances of multi-trait genomic and polygenic scoring methods. In a representative UK sample of 7,026 children at ages 12 and 16, we show that we can now predict up to 11 percent of the variance in intelligence and 16 percent in educational achievement. We also show that predictive power increases from age 12 to age 16 and that genomic predictions do not differ for girls and boys. Multivariate genomic methods were effective in boosting predictive power and, even though prediction accuracy varied across polygenic score approaches, results were similar using different multivariate and polygenic score methods. Polygenic scores for educational attainment and intelligence are the most powerful predictors in the behavioural sciences and exceed predictions that can be made from parental phenotypes such as educational attainment and occupational status.
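As a rough illustration of what "percent of the variance predicted" means here, the sketch below computes the squared Pearson correlation between a score and a phenotype. It is not the authors' pipeline: the function name `variance_explained` and the toy numbers are assumptions made for this example, and real analyses usually report incremental R² after adjusting for covariates.

```python
from typing import Sequence


def variance_explained(score: Sequence[float], phenotype: Sequence[float]) -> float:
    """Squared Pearson correlation between a score and a phenotype.

    Hypothetical helper for illustration only; it ignores covariates.
    """
    n = len(score)
    if n != len(phenotype) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_s = sum(score) / n
    mean_p = sum(phenotype) / n
    # Sums of cross-products and squared deviations (the 1/n factors cancel).
    cov = sum((s - mean_s) * (p - mean_p) for s, p in zip(score, phenotype))
    var_s = sum((s - mean_s) ** 2 for s in score)
    var_p = sum((p - mean_p) ** 2 for p in phenotype)
    if var_s == 0 or var_p == 0:
        raise ValueError("inputs must not be constant")
    return cov * cov / (var_s * var_p)


# Toy data (made up): four individuals with a score and a measured trait.
toy_score = [1.0, 2.0, 3.0, 4.0]
toy_trait = [1.0, 3.0, 2.0, 4.0]
r_squared = variance_explained(toy_score, toy_trait)
```

An `r_squared` of 0.11 would correspond to the "11 percent of the variance" figure quoted for intelligence.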
biorxiv genomics 100-200-users 2018
Coupled single-cell CRISPR screening and epigenomic profiling reveals causal gene regulatory networks, bioRxiv, 2018-09-17
Here we present Perturb-ATAC, a method which combines multiplexed CRISPR interference or knockout with genome-wide chromatin accessibility profiling in single cells, based on the simultaneous detection of CRISPR guide RNAs and open chromatin sites by assay of transposase-accessible chromatin with sequencing (ATAC-seq). We applied Perturb-ATAC to transcription factors (TFs), chromatin-modifying factors, and noncoding RNAs (ncRNAs) in ∼4,300 single cells, encompassing more than 63 unique genotype-phenotype relationships. Perturb-ATAC in human B lymphocytes uncovered regulators of chromatin accessibility, TF occupancy, and nucleosome positioning, and identified a hierarchical organization of TFs that govern B cell state, variation, and disease-associated cis-regulatory elements. Perturb-ATAC in primary human epidermal cells revealed three sequential modules of cis-elements that specify keratinocyte fate, orchestrated by the TFs JUNB, KLF4, ZNF750, CEBPA, and EHF. Combinatorial deletion of all pairs of these TFs uncovered their epistatic relationships and highlighted genomic co-localization as a basis for synergistic interactions. Thus, Perturb-ATAC is a powerful and general strategy to dissect gene regulatory networks in development and disease.
Highlights:
1. A new method for simultaneous measurement of CRISPR perturbations and chromatin state in single cells.
2. Perturb-ATAC reveals regulatory factors that control cis-element accessibility, trans-factor occupancy, and nucleosome positioning.
3. Perturb-ATAC reveals regulatory modules of coordinated trans-factor activity in B lymphoblasts.
4. Keratinocyte differentiation is orchestrated by synergistic activities of co-binding TFs on cis-elements.
biorxiv genomics 100-200-users 2018
A guide to performing Polygenic Risk Score analyses, bioRxiv, 2018-09-16
The application of polygenic risk scores (PRS) has become routine across genetic research. Among a range of applications, PRS are exploited to assess shared aetiology between phenotypes, to evaluate the predictive power of genetic data for use in clinical settings, and as part of experimental studies in which, for example, experiments are performed on individuals, or their biological samples (e.g. tissues, cells), at the tails of the PRS distribution and contrasted. As GWAS sample sizes increase and PRS become more powerful, they are set to play a key role in personalised medicine. However, despite the growing application and importance of PRS, there are limited guidelines for performing PRS analyses, which can lead to inconsistency between studies and misinterpretation of results. Here we provide detailed guidelines for performing polygenic risk score analyses relevant to different methods for their calculation, outlining standard quality control steps and offering recommendations for best practice. We also discuss different methods for the calculation of PRS, common misconceptions regarding the interpretation of results and future challenges.
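For readers new to PRS, the basic quantity can be sketched as a weighted sum of effect-allele counts, with weights taken from GWAS effect sizes. The snippet below is a minimal illustration under that assumption, not the guide's recommended workflow: the variant IDs, weights, genotypes, and the function name `polygenic_score` are all hypothetical, and it omits the quality control, clumping, and thresholding steps a real analysis needs.

```python
from typing import Dict


def polygenic_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum of (effect size x effect-allele dosage) over variants in both inputs.

    Hypothetical helper: variants missing from either input are skipped.
    """
    shared = dosages.keys() & weights.keys()
    return sum(weights[v] * dosages[v] for v in shared)


# Made-up per-variant effect sizes (e.g. log odds ratios from a GWAS).
weights = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.20}

# Made-up genotypes coded as counts (0, 1, or 2) of the effect allele.
cohort = {
    "ind_A": {"rs1": 2, "rs2": 0, "rs3": 1},
    "ind_B": {"rs1": 0, "rs2": 2, "rs3": 0},
}

scores = {name: polygenic_score(geno, weights) for name, geno in cohort.items()}
```

Individuals would then be ranked by `scores`, which is how the "tails of the PRS distribution" mentioned above are identified.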
biorxiv genomics 100-200-users 2018
A comprehensive analysis of RNA sequences reveals macroscopic somatic clonal expansion across normal tissues, bioRxiv, 2018-09-14
Cancer genome studies have significantly advanced our knowledge of somatic mutations. However, how these mutations accumulate in normal cells and whether they promote pre-cancerous lesions remain poorly understood. Here we perform a comprehensive analysis of normal tissues by utilizing RNA sequencing data from ~6,700 samples across 29 normal tissues collected as part of the Genotype-Tissue Expression (GTEx) project. We identify somatic mutations using a newly developed pipeline, RNA-MuTect, for calling somatic mutations directly from RNA-seq samples and their matched-normal DNA. When applied to the GTEx dataset, we detect multiple variants across different tissues and find that mutation burden is associated with both the age of the individual and tissue proliferation rate. We also detect hotspot cancer mutations that share tissue specificity with their matched cancer type. This study is the first to analyze a large number of samples across multiple normal tissues, identifying clones with genomic aberrations observed in cancer.
biorxiv genomics 200-500-users 2018
Dating genomic variants and shared ancestry in population-scale sequencing data, bioRxiv, 2018-09-14
The origin and fate of new mutations within species is the fundamental process underlying evolution. However, while previous efforts have been focused on characterizing the presence, frequency, and phenotypic impact of genetic variation, the evolutionary histories of most variants are largely unexplored. We have developed a non-parametric approach for estimating the date of origin of genetic variants that can be applied to large-scale genomic variation data sets. We demonstrate the accuracy and robustness of the approach through simulation and apply it to over 16 million single nucleotide polymorphisms (SNPs) from two publicly available human genomic diversity resources. We characterize the differential relationship between variant frequency and age in different geographical regions and demonstrate the value of allele age in interpreting variants of known functional and selective importance. Finally, we use allele age estimates to power a rapid approach for inferring the genealogical history of a single genome or a group of individuals.
biorxiv genomics 100-200-users 2018