The PAGE Study: How Genetic Diversity Improves Our Understanding of the Architecture of Complex Traits, bioRxiv, 2017-09-16
Abstract: Genome-wide association studies (GWAS) have laid the foundation for investigations into the biology of complex traits, drug development, and clinical guidelines. However, the dominance of European-ancestry populations in GWAS creates a biased view of the role of human variation in disease, and hinders the equitable translation of genetic associations into clinical and public health applications. The Population Architecture using Genomics and Epidemiology (PAGE) study conducted a GWAS of 26 clinical and behavioral phenotypes in 49,839 non-European individuals. Using strategies designed for analysis of multi-ethnic and admixed populations, we confirm 574 GWAS catalog variants across these traits, and find 38 secondary signals in known loci and 27 novel loci. Our data show strong evidence of effect-size heterogeneity across ancestries for published GWAS associations, substantial benefits for fine-mapping using diverse cohorts, and insights into clinical implications. We strongly advocate for continued, large genome-wide efforts in diverse populations to reduce health disparities.
biorxiv genetics 200-500-users 2017
Global determinants of navigation ability, bioRxiv, 2017-09-15
Summary: Countries vary in their geographical and cultural properties. Only a few studies have explored how such variations influence how humans navigate or reason about space [1–7]. We predicted that these variations impact human cognition, resulting in an organized spatial distribution of cognition at a planetary-wide scale. To test this hypothesis we developed a mobile-app-based cognitive task, measuring non-verbal spatial navigation ability in more than 2.5 million people, sampling populations in every nation state. We focused on spatial navigation due to its universal requirement across cultures. Using a clustering approach, we find that navigation ability is clustered into five distinct, yet geographically related, groups of countries. Specifically, the economic wealth of a nation was predictive of the average navigation ability of its inhabitants, and gender inequality was predictive of the size of performance difference between males and females. Thus, cognitive abilities, at least for spatial navigation, are clustered according to economic wealth and gender inequalities globally, which has significant implications for cross-cultural studies and multi-centre clinical trials using cognitive testing.
biorxiv neuroscience 200-500-users 2017
Massive Mining of Publicly Available RNA-seq Data from Human and Mouse, bioRxiv, 2017-09-15
RNA-sequencing (RNA-seq) is currently the leading technology for genome-wide transcript quantification. While the volume of RNA-seq data is rapidly increasing, the currently publicly available RNA-seq data is provided mostly in raw form, with small portions processed non-uniformly. This is mainly because the computational demand, particularly for the alignment step, is a significant barrier for global and integrative retrospective analyses. To address this challenge, we developed all RNA-seq and ChIP-seq sample and signature search (ARCHS4), a web resource that makes the majority of previously published RNA-seq data from human and mouse freely available at the gene count level. Such uniformly processed data enables easy integration for downstream analyses. For developing the ARCHS4 resource, all available FASTQ files from RNA-seq experiments were retrieved from the Gene Expression Omnibus (GEO) and aligned using a cloud-based infrastructure. In total 137,792 samples are accessible through ARCHS4 with 72,363 mouse and 65,429 human samples. Through efficient use of cloud resources and dockerized deployment of the sequencing pipeline, the alignment cost per sample is reduced to less than one cent. ARCHS4 is updated automatically by adding newly published samples to the database as they become available. Additionally, the ARCHS4 web interface provides intuitive exploration of the processed data through querying tools, interactive visualization, and gene landing pages that provide average expression across cell lines and tissues, top co-expressed genes, and predicted biological functions and protein-protein interactions for each gene based on prior knowledge combined with co-expression.
Benchmarking the quality of these predictions, co-expression correlation data created from ARCHS4 outperforms co-expression data created from other major gene expression data repositories such as GTEx and CCLE. ARCHS4 is freely accessible at http://amp.pharm.mssm.edu/archs4
biorxiv bioinformatics 200-500-users 2017
Minor allele frequency thresholds dramatically affect population structure inference with genomic datasets, bioRxiv, 2017-09-15
Abstract: One common method of minimizing errors in large DNA sequence datasets is to drop variable sites with a minor allele frequency below some specified threshold. Though widespread, this procedure has the potential to alter downstream population genetic inferences and has received relatively little rigorous analysis. Here we use simulations and an empirical SNP dataset to demonstrate the impacts of minor allele frequency (MAF) thresholds on inference of population structure. We find that model-based inference of population structure is confounded when singletons are included in the alignment, and that both model-based and multivariate analyses infer less distinct clusters when more stringent MAF cutoffs are applied. We propose that this behavior is caused by the combination of a drop in the total size of the data matrix and by correlations between allele frequencies and mutational age. We recommend a set of best practices for applying MAF filters in studies seeking to describe population structure with genomic data.
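The filtering step described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes biallelic sites, diploid individuals, and genotypes coded as alternate-allele counts (0, 1, 2, or `None` for missing). The function names `minor_allele_frequency` and `filter_by_maf` are hypothetical and chosen only for this example.

```python
from typing import List, Optional, Sequence

Genotypes = Sequence[Optional[int]]  # alt-allele counts per individual; None = missing


def minor_allele_frequency(genotypes: Genotypes) -> float:
    """Return the MAF of one biallelic site (assumes diploid genotype calls)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0  # no called genotypes: treat the site as uninformative
    alt_freq = sum(called) / (2 * len(called))  # two allele copies per individual
    return min(alt_freq, 1.0 - alt_freq)


def filter_by_maf(sites: Sequence[Genotypes], min_maf: float) -> List[Genotypes]:
    """Keep only sites whose MAF is at least min_maf."""
    return [site for site in sites if minor_allele_frequency(site) >= min_maf]


# Toy data (illustrative only): four sites genotyped in four individuals.
sites = [
    [0, 0, 0, 1],  # singleton: one alt copy among 8 chromosomes, MAF 0.125
    [0, 1, 1, 2],  # MAF 0.5
    [2, 2, 2, 2],  # fixed for the alt allele, MAF 0.0
    [0, 1, 0, 1],  # MAF 0.25
]
kept = filter_by_maf(sites, min_maf=0.2)
print(len(kept))  # number of sites that survive the threshold
```

With four diploid individuals, a singleton has a MAF of 1/8, so any threshold above 0.125 removes singletons along with invariant sites. Raising `min_maf` further shrinks the data matrix, which is one of the two effects the abstract proposes as a cause of less distinct clusters.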
biorxiv genomics 100-200-users 2017
A zombie LIF gene in elephants is up-regulated by TP53 to induce apoptosis in response to DNA damage, bioRxiv, 2017-09-13
Abstract: Large-bodied organisms have more cells that can potentially turn cancerous than small-bodied organisms with fewer cells, imposing an increased risk of developing cancer. This expectation predicts a positive correlation between body size and cancer risk; however, there is no correlation between body size and cancer risk across species (‘Peto’s Paradox’). Here we show that elephants and their extinct relatives (Proboscideans) may have resolved Peto’s Paradox in part through re-functionalizing a leukemia inhibitory factor pseudogene (LIF6) with pro-apoptotic functions. The LIF6 gene is transcriptionally up-regulated by TP53 in response to DNA damage, and translocates to the mitochondria where it induces apoptosis. Phylogenetic analyses of living and extinct Proboscidean LIF6 genes indicate that its TP53 response element evolved coincident with the evolution of large body sizes in the Proboscidean stem-lineage. These results suggest that re-functionalizing of a pro-apoptotic LIF pseudogene may have been permissive (though not sufficient) for the evolution of large body sizes in Proboscideans.
biorxiv evolutionary-biology 100-200-users 2017
No major flaws in “Identification of individuals by trait prediction using whole-genome sequencing data”, bioRxiv, 2017-09-12
Abstract: In a recently published PNAS article, we studied the identifiability of genomic samples using machine learning methods [Lippert et al., 2017]. In a response, Erlich [2017] argued that our work contained major flaws. The main technical critique of Erlich [2017] builds on a simulation experiment that shows that our proposed algorithm, which uses only a genomic sample for identification, performed no better than a strategy that uses demographic variables. Below, we show why this comparison is misleading and provide a detailed discussion of the key critical points in our analyses that have been brought up in Erlich [2017] and in the media. Further, not only faces but also a wide range of phenotypes and demographic variables may be derived from DNA. In this light, the main contribution of Lippert et al. [2017] is an algorithm that identifies genomes of individuals by combining multiple DNA-based predictive models for a myriad of traits.
biorxiv genomics 100-200-users 2017