Pan-cancer analysis of whole genomes, bioRxiv, 2017-07-13
We report the integrative analysis of more than 2,600 whole cancer genomes and their matching normal tissues across 39 distinct tumour types. By studying whole genomes we have been able to catalogue non-coding cancer driver events, study patterns of structural variation, infer tumour evolution, probe the interactions among variants in the germline genome, the tumour genome and the transcriptome, and derive an understanding of how coding and non-coding variations together contribute to driving individual patients' tumours. This work represents the most comprehensive look at cancer whole genomes to date. NOTE TO READERS: This is an incomplete draft of the marker paper for the Pan-Cancer Analysis of Whole Genomes Project, and is intended to provide the background information for a series of in-depth papers that will be posted to bioRxiv during the summer of 2017.
biorxiv cancer-biology 0-100-users 2017
Sequential regulatory activity prediction across chromosomes with convolutional neural networks, bioRxiv, 2017-07-11
Models for predicting phenotypic outcomes from genotypes have important applications to understanding genomic function and improving human health. Here, we develop a machine-learning system to predict cell type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. Using convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. We show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable fine mapping of disease loci.
biorxiv genomics 0-100-users 2017
Single nucleus analysis of the chromatin landscape in mouse forebrain development, bioRxiv, 2017-07-05
Genome-wide analysis of chromatin accessibility in primary tissues has uncovered millions of candidate regulatory sequences in the human and mouse genomes [1–4]. However, the heterogeneity of biological samples used in previous studies has prevented a precise understanding of the dynamic chromatin landscape in specific cell types. Here, we show that analysis of the transposase-accessible chromatin in single nuclei isolated from frozen tissue samples can resolve cellular heterogeneity and delineate transcriptional regulatory sequences in the constituent cell types. Our strategy is based on a combinatorial-barcoding-assisted single-cell assay for transposase-accessible chromatin [5] and is optimized for nuclei from flash-frozen primary tissue samples (snATAC-seq). We used this method to examine the mouse forebrain at seven developmental stages and in adults. From snATAC-seq profiles of more than 15,000 high-quality nuclei, we identify 20 distinct cell populations corresponding to major neuronal and non-neuronal cell types in foetal and adult forebrains. We further define cell-type-specific cis regulatory sequences and infer potential master transcriptional regulators of each cell population. Our results demonstrate the feasibility of a general approach for identifying cell-type-specific cis regulatory sequences in heterogeneous tissue samples, and provide a rich resource for understanding forebrain development in mammals.
biorxiv genomics 0-100-users 2017
A simple high-throughput approach identifies actionable drug sensitivities in patient-derived tumor organoids, bioRxiv, 2017-06-29
There is increasing interest in developing 3D tumor organoid models for drug development and personalized medicine applications. While tumor organoids are in principle amenable to high-throughput drug screenings, progress has been hampered by technical constraints and extensive manipulations required by current methodologies. Here, we introduce a miniaturized, fully automatable, flexible high-throughput method using a simplified geometry to rapidly establish 3D organoids from cell lines and primary tissue and robustly assay drug responses. As proof of principle, we use our miniring approach to establish organoids of high-grade serous tumors and one carcinosarcoma of the ovaries and screen hundreds of protein kinase compounds currently FDA-approved or in clinical development. In all cases we could identify drugs causing significant reduction in cell viability, number and size of organoids within a week from surgery, a timeline compatible with therapeutic decision making.
biorxiv cancer-biology 0-100-users 2017
Biological classification with RNA-Seq data: Can alternative splicing enhance machine learning classifier?, bioRxiv, 2017-06-19
The extent to which genes are expressed in the cell can be simplistically defined as a function of one or more factors of the environment, lifestyle, and genetics. RNA sequencing (RNA-Seq) is becoming a prevalent approach to quantifying gene expression, and is expected to provide better insight into a number of biological and biomedical questions than DNA microarrays. Most importantly, RNA-Seq allows expression to be quantified at both the gene and alternative splicing isoform levels. However, leveraging RNA-Seq data requires the development of new data mining and analytics methods. Supervised machine learning methods are commonly used for biological data analysis, and have recently gained attention for their applications to RNA-Seq data. In this work, we assess the utility of supervised learning methods trained on RNA-Seq data for a diverse range of biological classification tasks. We hypothesize that isoform-level expression data are more informative for biological classification tasks than gene-level expression data. Our large-scale assessment utilizes multiple datasets, organisms, lab groups, and RNA-Seq analysis pipelines. Overall, we performed and assessed 61 biological classification problems that leverage three independent RNA-Seq datasets and include over 2,000 samples that come from multiple organisms, lab groups, and RNA-Seq analyses. These 61 problems include predictions of the tissue type, sex, or age of the sample, healthy or cancerous phenotypes, and the pathological tumor stage for samples from cancerous tissue. For each classification problem, the performance of three normalization techniques and six machine learning classifiers was explored. We find that for every single classification problem, the isoform-based classifiers outperform or are comparable with gene expression based methods. The top-performing supervised learning techniques reached a near perfect classification accuracy, demonstrating the utility of supervised learning for RNA-Seq based data analysis.
biorxiv bioinformatics 0-100-users 2017
Punctuated evolution shaped modern vertebrate diversity, bioRxiv, 2017-06-19
The relative importance of different modes of evolution in shaping phenotypic diversity remains a hotly debated question. Fossil data suggest that stasis may be a common mode of evolution, while modern data suggest very fast rates of evolution. One way to reconcile these observations is to imagine that evolution is punctuated, rather than gradual, on geological time scales. To test this hypothesis, we developed a novel maximum likelihood framework for fitting Lévy processes to comparative morphological data. This class of stochastic processes includes both a gradual and punctuated component. We found that a plurality of modern vertebrate clades examined are best fit by punctuated processes over models of gradual change, gradual stasis, and adaptive radiation. When we compare our results to theoretical expectations of the rate and speed of regime shifts for models that detail fitness landscape dynamics, we find that our quantitative results are broadly compatible with both microevolutionary models and with observations from the fossil record.
biorxiv evolutionary-biology 0-100-users 2017