Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks, bioRxiv, 2018-09-14
Algorithms that accurately predict gene structure from primary sequence alone were transformative for annotating the human genome. Can we also predict the expression levels of genes based solely on genome sequence? Here we sought to apply deep convolutional neural networks towards this goal. Surprisingly, a model that includes only promoter sequences and features associated with mRNA stability explains 59% and 71% of variation in steady-state mRNA levels in human and mouse, respectively. This model, which we call Xpresso, more than doubles the accuracy of alternative sequence-based models, and isolates rules as predictive as models relying on ChIP-seq data. Xpresso recapitulates genome-wide patterns of transcriptional activity and predicts the influence of enhancers, heterochromatic domains, and microRNAs. Model interpretation reveals that promoter-proximal CpG dinucleotides strongly predict transcriptional activity. Looking forward, we propose the accurate prediction of cell type-specific gene expression based solely on primary sequence as a grand challenge for the field.
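The abstract's two sequence-level ideas, feeding raw promoter sequence to a convolutional network and the predictive weight of promoter-proximal CpG dinucleotides, can be illustrated with a minimal sketch. It assumes the conventional one-hot encoding used for sequence-based CNNs and the standard observed/expected CpG ratio. The helper names and the toy sequence are hypothetical and are not taken from Xpresso itself.

```python
# Illustrative only: helper names and the toy sequence are assumptions,
# not taken from the Xpresso code base.

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as one 4-element row per base (A, C, G, T).
    Any other symbol (e.g. N) becomes an all-zero row."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

def cpg_count(seq):
    """Count CpG dinucleotides: a C immediately followed by a G."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return cpg_count(seq) * len(seq) / (c * g)

promoter = "ACGCGTTACGNA"      # toy sequence, not a real promoter
encoded = one_hot(promoter)    # would be the input tensor of a CNN
n_cpg = cpg_count(promoter)
ratio = cpg_obs_exp(promoter)
print(len(encoded), n_cpg, ratio)
```

A matrix like `encoded` is the kind of input a convolutional layer scans for motifs, while `n_cpg` and `ratio` are simple hand-crafted summaries of the CpG signal that the abstract says model interpretation recovered.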
biorxiv genomics 200-500-users 2018
Cardelino: Integrating whole exomes and single-cell transcriptomes to reveal phenotypic impact of somatic variants, bioRxiv, 2018-09-12
Decoding the clonal substructures of somatic tissues sheds light on cell growth, development and differentiation in health, ageing and disease. DNA-sequencing, either using bulk or using single-cell assays, has enabled the reconstruction of clonal trees from frequency and co-occurrence patterns of somatic variants. However, approaches to systematically characterize phenotypic and functional variations between individual clones are not established. Here we present cardelino (https://github.com/PMBio/cardelino), a computational method for inferring the clone of origin of individual cells that have been assayed using single-cell RNA-seq (scRNA-seq). After validating our model using simulations, we apply cardelino to matched scRNA-seq and exome sequencing data from 32 human dermal fibroblast lines, identifying hundreds of differentially expressed genes between cells from different somatic clones. These genes are frequently enriched for cell cycle and proliferation pathways, indicating a key role for cell division genes in non-neutral somatic evolution.
Key findings:
- A novel approach for integrating DNA-seq and single-cell RNA-seq data to reconstruct clonal substructure for single-cell transcriptomes.
- Evidence for non-neutral evolution of clonal populations in human fibroblasts.
- Proliferation and cell cycle pathways are commonly distorted in mutated clonal populations.
biorxiv genomics 100-200-users 2018
Resource: Scalable whole genome sequencing of 40,000 single cells identifies stochastic aneuploidies, genome replication states and clonal repertoires, bioRxiv, 2018-09-07
Essential features of cancer tissue cellular heterogeneity such as negatively selected genome topologies, sub-clonal mutation patterns and genome replication states can only effectively be studied by sequencing single-cell genomes at scale and high fidelity. Using an amplification-free single-cell genome sequencing approach implemented on commodity hardware (DLP+) coupled with a cloud-based computational platform, we define a resource of 40,000 single-cell genomes characterized by their genome states, across a wide range of tissue types and conditions. We show that shallow sequencing across thousands of genomes permits reconstruction of clonal genomes to single-nucleotide resolution through aggregation analysis of cells sharing higher-order genome structure. From large-scale population analysis over thousands of cells, we identify rare cells exhibiting mitotic mis-segregation of whole chromosomes. We observe that tissue-derived scWGS libraries exhibit lower rates of whole-chromosome aneuploidy than cell lines, and that loss of p53 results in a shift in event type, but not overall prevalence, in breast epithelium. Finally, we demonstrate that the replication states of genomes can be identified, allowing the number and proportion of replicating cells, as well as the chromosomal pattern of replication, to be unambiguously identified in single-cell genome sequencing experiments. The combined annotated resource and approach provide a re-implementable large-scale platform for studying lineages and tissue heterogeneity.
biorxiv genomics 100-200-users 2018
Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure, bioRxiv, 2018-08-16
SNP datasets are high-dimensional, often with thousands to millions of SNPs and hundreds to thousands of samples or individuals. Accordingly, PCA graphs are frequently used to provide a low-dimensional visualization in order to display and discover patterns in SNP data from humans, animals, plants, and microbes, especially to elucidate population structure. Given the popularity of PCA, one might expect that PCA is understood well and applied effectively. However, our literature survey of 125 representative articles that apply PCA to SNP data shows that three choices have usually been made poorly: PCA graph, SNP coding, and PCA variant. Our three main recommendations are simple and easily implemented: use PCA biplots, SNP coding 1 for the rare allele and 0 for the common allele, and double-centered PCA (or AMMI1 if main effects are of interest). The ultimate benefit from informed and optimal choices of PCA graph, SNP coding, and PCA variant is expected to be discovery of more biology, and thereby acceleration of medical, agricultural, and other vital applications.
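Two of the recommendations above, coding the rare allele as 1 and double-centering before PCA, are mechanical enough to sketch. The example below is a minimal illustration that assumes a small samples-by-SNPs matrix already coded 0/1 with an arbitrary allele as 1. The function names and the toy genotypes are hypothetical and do not come from the paper.

```python
# Illustrative sketch: function names and the toy matrix are assumptions.

def recode_rare_as_one(matrix):
    """Flip any SNP column whose 1-allele is the more frequent one,
    so that 1 always marks the rarer allele."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [row[:] for row in matrix]
    for j in range(n_cols):
        freq = sum(row[j] for row in matrix) / n_rows
        if freq > 0.5:
            for row in out:
                row[j] = 1 - row[j]
    return out

def double_center(matrix):
    """Subtract row means and column means and add back the grand mean,
    so that every row and every column of the result sums to zero."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    grand = sum(row_means) / n_rows
    return [[matrix[i][j] - row_means[i] - col_means[j] + grand
             for j in range(n_cols)] for i in range(n_rows)]

# Rows are samples, columns are SNPs (toy data).
genotypes = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
coded = recode_rare_as_one(genotypes)   # first column is flipped
centered = double_center(coded)
```

Double-centered PCA then amounts to a singular value decomposition of `centered`. Because the matrix is centered in both directions, sample scores and SNP loadings can be drawn together in a biplot, which is the graph type the abstract recommends.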
biorxiv genomics 0-100-users 2018
MULTI-seq: Scalable sample multiplexing for single-cell RNA sequencing using lipid-tagged indices, bioRxiv, 2018-08-08
We describe MULTI-seq: a rapid, modular, and universal scRNA-seq sample multiplexing strategy using lipid-tagged indices. MULTI-seq reagents can barcode any cell type from any species with an accessible plasma membrane. The method is compatible with enzymatic tissue dissociation, and also preserves viability and endogenous gene expression patterns. We leverage these features to multiplex the analysis of multiple solid tissues comprising human and mouse cells isolated from patient-derived xenograft mouse models. We also utilize MULTI-seq’s modular design to perform a 96-plex perturbation experiment with human mammary epithelial cells. MULTI-seq also enables robust doublet identification, which improves data quality and increases scRNA-seq cell throughput by minimizing the negative effects of Poisson loading. We anticipate that the sample throughput and reagent savings enabled by MULTI-seq will expand the purview of scRNA-seq and democratize the application of these technologies within the scientific community.
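The "negative effects of Poisson loading" can be made concrete with a short calculation. The sketch below assumes that the number of cells captured per droplet follows a Poisson distribution with mean `lam`. The loading values are illustrative assumptions, not figures from the paper.

```python
import math

def doublet_fraction(lam):
    """Fraction of cell-containing droplets that hold two or more cells,
    assuming cells per droplet ~ Poisson(lam)."""
    p0 = math.exp(-lam)         # empty droplet
    p1 = lam * math.exp(-lam)   # exactly one cell
    return (1.0 - p0 - p1) / (1.0 - p0)

# Illustrative loading densities (mean cells per droplet); assumed values.
for lam in (0.05, 0.1, 0.3):
    print(f"mean cells/droplet = {lam:.2f}: "
          f"multiplet fraction = {doublet_fraction(lam):.3f}")
```

The multiplet fraction rises with loading density, so loading more cells per run normally costs data quality. If doublets can be recognized from their sample barcodes and discarded, as the abstract describes, heavier loading becomes tolerable.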
biorxiv genomics 100-200-users 2018
Pooled optical screens in human cells, bioRxiv, 2018-08-03
Large-scale genetic screens play a key role in the systematic discovery of genes underlying cellular phenotypes. Pooling of genetic perturbations greatly increases screening throughput, but has so far been limited to screens of enrichments defined by cell fitness and flow cytometry, or to comparatively low-throughput single cell gene expression profiles. Although microscopy is a rich source of spatial and temporal information about mammalian cells, high-content imaging screens have been restricted to much less efficient arrayed formats. Here, we introduce an optical method to link perturbations and their phenotypic outcomes at the single-cell level in a pooled setting. Barcoded perturbations are read out by targeted in situ sequencing following image-based phenotyping. We apply this technology to screen a focused set of 952 genes across >3 million cells for involvement in NF-κB activation by imaging the translocation of RelA (p65) to the nucleus, recovering 20 known pathway components and 3 novel candidate positive regulators of IL-1β and TNFα-stimulated immune responses.
biorxiv genomics 200-500-users 2018