A primer on deep learning in genomics, Nature Genetics, 2018-11-21
Deep learning methods are a class of machine learning techniques capable of identifying highly complex patterns in large datasets. Here, we provide a perspective and primer on deep learning applications for genome analysis. We discuss successful applications in the fields of regulatory genomics, variant calling and pathogenicity scores. We include general guidance for how to effectively use deep learning methods as well as a practical guide to tools and resources. This primer is accompanied by an interactive online tutorial.
nature genetics genetics 500+-users 2018
Factors associated with sharing email information and mental health survey participation in large population cohorts, bioRxiv, 2018-11-20
People who opt to participate in scientific studies tend to be healthier, wealthier, and more educated than the broader population. While selection bias does not always pose a problem for analysing the relationships between exposures and diseases or other outcomes, it can lead to biased effect size estimates. Biased estimates may weaken the utility of genetic findings because the goal is often to make inferences in a new sample (such as in polygenic risk score analysis). We used data from UK Biobank and Generation Scotland and conducted phenotypic and genome-wide association analyses on two phenotypes that reflected mental health data availability: (1) whether participants were contactable by email for follow-up, and (2) whether participants responded to a follow-up survey of mental health. We identified nine genetic loci associated with email contact and 25 loci associated with mental health survey completion. Both phenotypes were positively genetically correlated with higher educational attainment and better health and negatively genetically correlated with psychological distress and schizophrenia. Recontact availability and follow-up participation can act as further genetic filters for data on mental health phenotypes.
biorxiv genetics 100-200-users 2018
Epigenetically reprogrammed methylation landscape drives the DNA self-assembly and serves as a universal cancer biomarker, Nature Communications, 2018-11-15
Epigenetic reprogramming in cancer genomes creates a distinct methylation landscape encompassing clustered methylation at regulatory regions separated by large intergenic tracks of hypomethylated regions. This methylation landscape, which we refer to as Methylscape, is displayed by most cancer types and may thus serve as a universal cancer biomarker. To date, most research has focused on the biological consequences of DNA Methylscape changes, whereas its impact on DNA physicochemical properties remains unexplored. Herein, we examine the effect of levels and genomic distribution of methylcytosines on the physicochemical properties of DNA to detect the Methylscape biomarker. We find that DNA polymeric behaviour is strongly affected by differential patterning of methylcytosine, leading to fundamental differences in DNA solvation and DNA-gold affinity between cancerous and normal genomes. We exploit these Methylscape differences to develop simple, highly sensitive and selective electrochemical or colorimetric one-step assays for the detection of cancer. These assays are quick (analysis time ≤10 minutes) and require minimal sample preparation and small DNA input.
nature communications genetics 200-500-users 2018
Assembly of a pan-genome from deep sequencing of 910 humans of African descent, Nature Genetics, 2018-11-13
We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.
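The "~10%" figure in this abstract can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the novel base-pair total and contig count come from the abstract, while the reference genome size of roughly 3.1 billion bp is an assumption that the abstract does not state, and the function names are hypothetical.

```python
# Figures quoted in the abstract.
NOVEL_BP = 296_485_284   # total novel sequence, in base pairs
N_CONTIGS = 125_715      # number of distinct novel contigs

# ASSUMPTION: approximate size of the GRCh38 reference, not given in the abstract.
REFERENCE_BP = 3.1e9


def novel_fraction(novel_bp: float, reference_bp: float) -> float:
    """Return novel sequence as a fraction of the reference genome size."""
    return novel_bp / reference_bp


def mean_contig_length(total_bp: float, n_contigs: int) -> float:
    """Return the average contig length in base pairs."""
    return total_bp / n_contigs


fraction = novel_fraction(NOVEL_BP, REFERENCE_BP)
mean_len = mean_contig_length(NOVEL_BP, N_CONTIGS)

print(f"Novel sequence relative to reference: {fraction:.1%}")
print(f"Mean novel contig length: {mean_len:.0f} bp")
```

Under that assumed reference size, the novel sequence comes to a little under 10% of the reference, in line with the abstract's rounded figure, and the average novel contig is a few kilobases long.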
nature genetics genetics 500+-users 2018
An atlas of polygenic risk score associations to highlight putative causal relationships across the human phenome, bioRxiv, 2018-11-11
The age of large-scale genome-wide association studies (GWAS) has provided us with an unprecedented opportunity to evaluate the genetic liability of complex disease using polygenic risk scores (PRS). In this study, we have analysed 162 PRS (P < 5×10⁻⁵) derived from GWAS and 551 heritable traits from the UK Biobank study (N=334,398). Findings can be investigated using a web application (http://mrcieu.mrsoftware.org/PRS_atlas), which we envisage will help uncover both known and novel mechanisms which contribute towards disease susceptibility. To demonstrate this, we have investigated the results from a phenome-wide evaluation of schizophrenia genetic liability. Among the findings were inverse associations with measures of cognitive function, for which extensive follow-up analyses using Mendelian randomization (MR) provided evidence of a causal relationship. We have also investigated the effect of multiple risk factors on disease using mediation and multivariable MR frameworks. Our atlas provides a resource for future endeavours seeking to unravel the causal determinants of complex disease.
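For readers unfamiliar with polygenic risk scores, the following is a minimal sketch of the general idea: a weighted sum of risk-allele counts over variants that pass a GWAS p-value threshold. It is not the authors' pipeline, which the abstract does not describe and which would typically include further steps such as pruning correlated variants. The function name, variant identifiers, effect sizes and genotypes are all hypothetical toy values; only the threshold of 5×10⁻⁵ comes from the abstract.

```python
# Threshold quoted in the abstract; everything else below is illustrative.
P_THRESHOLD = 5e-5


def polygenic_risk_score(dosages, variants, p_threshold=P_THRESHOLD):
    """Sum effect size times allele dosage over variants passing the threshold.

    dosages:  maps variant ID -> number of effect alleles carried (0, 1 or 2)
    variants: maps variant ID -> (effect size, GWAS p-value)
    """
    score = 0.0
    for variant_id, (beta, p_value) in variants.items():
        # Skip variants that fail the threshold or were not genotyped.
        if p_value < p_threshold and variant_id in dosages:
            score += beta * dosages[variant_id]
    return score


# Hypothetical GWAS summary statistics: (effect size, p-value).
toy_variants = {
    "snp_a": (0.10, 1e-8),
    "snp_b": (-0.05, 2e-6),
    "snp_c": (0.30, 0.01),  # fails the threshold, so it is ignored
}

# Hypothetical genotype for one individual.
toy_dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 2}

toy_score = polygenic_risk_score(toy_dosages, toy_variants)
print(f"Toy polygenic risk score: {toy_score:.2f}")
```

In this toy case only the first two variants contribute, giving 0.10 × 2 + (-0.05) × 1 = 0.15.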
biorxiv genetics 100-200-users 2018