Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression, bioRxiv, 2019-03-14
Abstract: Single-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from 'regularized negative binomial regression', where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation, and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform, with a direct interface to our single-cell toolkit Seurat.
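To make the idea of negative binomial Pearson residuals concrete, here is a minimal Python sketch. It is not the sctransform implementation: it assumes a fixed, shared dispersion `theta` (a hypothetical default of 100) and an expected count proportional to each cell's sequencing depth, and it skips the regularization step in which parameters are pooled across genes of similar abundance. The function name `nb_pearson_residuals` is illustrative only.

```python
import numpy as np

def nb_pearson_residuals(counts, theta=100.0):
    """Pearson residuals under a simplified negative binomial model.

    counts: cells x genes matrix of UMI counts.
    theta:  assumed shared NB dispersion (illustrative value, not fitted).
    """
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)        # total UMIs per cell
    gene_total = counts.sum(axis=0, keepdims=True)   # total UMIs per gene
    # Expected count if a gene's share of each cell's depth were constant.
    mu = depth * gene_total / counts.sum()
    # Negative binomial variance: mu + mu^2 / theta.
    var = mu + mu**2 / theta
    # Genes with no counts have zero variance; leave their residuals at 0.
    return np.divide(counts - mu, np.sqrt(var),
                     out=np.zeros_like(counts), where=var > 0)

# Toy usage: 2 cells x 3 genes.
residuals = nb_pearson_residuals([[3, 0, 1], [1, 0, 5]])
```

In this simplified form, a gene whose counts scale exactly with sequencing depth gets residuals of zero, so depth-driven variation is removed while departures from that expectation remain.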
biorxiv genomics 200-500-users 2019
Feature Selection and Dimension Reduction for Single Cell RNA-Seq based on a Multinomial Model, bioRxiv, 2019-03-12
Abstract: Single cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero-inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform current practice in a downstream clustering assessment using ground-truth datasets.
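As an illustration of deviance-based feature selection, the sketch below ranks genes by a per-gene binomial deviance against a null model in which each gene takes a constant share of every cell's total count. The abstract does not give a formula, so this particular form (a binomial approximation to the multinomial) and the function names are assumptions.

```python
import numpy as np

def _a_log_a_over_b(a, b):
    """Elementwise a * log(a / b), using the convention 0 * log(0) = 0."""
    out = np.zeros_like(a, dtype=float)
    mask = a > 0
    out[mask] = a[mask] * np.log(a[mask] / b[mask])
    return out

def binomial_deviance(counts):
    """Per-gene deviance under a constant-proportion null model.

    counts: cells x genes matrix of UMI counts with a nonzero grand total.
    Returns one deviance per gene; larger means more informative.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)             # total UMIs per cell
    pi = counts.sum(axis=0, keepdims=True) / n.sum()  # null share per gene
    expected = n * pi                                 # null expected counts
    dev = _a_log_a_over_b(counts, expected)
    dev += _a_log_a_over_b(n - counts, n - expected)
    return 2.0 * dev.sum(axis=0)

# Toy usage: rank genes from most to least informative.
toy = np.array([[5, 0, 5], [5, 10, 5], [5, 0, 5], [5, 10, 5]])
ranking = np.argsort(binomial_deviance(toy))[::-1]
```

Genes whose counts track cell totals exactly get a deviance of zero, while genes that vary beyond what the null model predicts rank highest.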
biorxiv genomics 200-500-users 2019
Genetic analysis identifies molecular systems and biological pathways associated with household income, bioRxiv, 2019-03-12
Abstract: Socio-economic position (SEP) is a multi-dimensional construct reflecting (and influencing) multiple socio-cultural, physical, and environmental factors. Previous genome-wide association studies (GWAS) using household income as a marker of SEP have shown that common genetic variants account for 11% of its variation. Here, in a sample of 286,301 participants from UK Biobank, we identified 30 independent genome-wide significant loci, 29 novel, that are associated with household income. Using a recently-developed method to meta-analyze data that leverages power from genetically-correlated traits, we identified an additional 120 income-associated loci. These loci showed clear evidence of functional enrichment, with transcriptional differences identified across multiple cortical tissues, in addition to links with GABAergic and serotonergic neurotransmission. We identified neurogenesis and the components of the synapse as candidate biological systems that are linked with income. By combining our GWAS on income with data from eQTL studies and chromatin interactions, 24 genes were prioritized for follow up, 18 of which were previously associated with cognitive ability. Using Mendelian Randomization, we identified cognitive ability as one of the causal, partly-heritable phenotypes that bridges the gap between molecular genetic inheritance and phenotypic consequence in terms of income differences. Significant differences between genetic correlations indicated that the genetic variants associated with income are related to better mental health than those linked to educational attainment (another commonly-used marker of SEP). Finally, we were able to predict 2.5% of income differences using genetic data alone in an independent sample. These results are important for understanding the observed socioeconomic inequalities in Great Britain today.
biorxiv genetics 200-500-users 2019
Clades of huge phage from across Earth’s ecosystems, bioRxiv, 2019-03-11
Phage typically have small genomes and depend on their bacterial hosts for replication. DNA sequenced from many diverse ecosystems revealed hundreds of huge phage genomes, between 200 kbp and 716 kbp in length. Thirty-four genomes were manually curated to completion, including the largest phage genomes yet reported. Expanded genetic repertoires include diverse and new CRISPR-Cas systems, tRNAs, tRNA synthetases, tRNA modification enzymes, translation initiation and elongation factors, and ribosomal proteins. Phage CRISPR-Cas systems have the capacity to silence host transcription factors and translational genes, potentially as part of a larger interaction network that intercepts translation to redirect biosynthesis to phage-encoded functions. In addition, some phage may repurpose bacterial CRISPR-Cas systems to eliminate competing phage. We phylogenetically define major clades of huge phage from human and other animal microbiomes, oceans, lakes, sediments, soils and the built environment. We conclude that their large gene inventories reflect a conserved biological strategy, observed over a broad bacterial host range and across Earth’s ecosystems.
biorxiv microbiology 200-500-users 2019
Blue light induces neuronal-activity-regulated gene expression in the absence of optogenetic proteins, bioRxiv, 2019-03-09
Optogenetics is widely used to control diverse cellular functions with light, requiring experimenters to expose cells to bright light. Because extended exposure to visible light can be toxic to cells, it is important to characterize the effects of light stimulation on cellular function in the absence of exogenous optogenetic proteins. Here we exposed cultured mouse cortical neurons that did not express optogenetic proteins to several hours of flashing blue, red, or green light. We found that exposing neurons to as short as one hour of blue, but not red or green, light results in the induction of neuronal-activity-regulated genes without inducing neuronal activity. Our findings suggest blue light stimulation is ill-suited to long-term optogenetic experiments, especially those that measure transcription.
Significance Statement: Optogenetics is widely used to control cellular functions using light. In neuroscience, channelrhodopsins, exogenous light-sensitive channels, are used to achieve light-dependent control of neuronal firing. This optogenetic control of neuronal firing requires exposing neurons to high-powered light. We ask how this light exposure, in the absence of channelrhodopsin, affects the expression of neuronal-activity-regulated genes, i.e., the genes that are transcribed in response to neuronal stimuli. Surprisingly, we find that neurons without channelrhodopsin express neuronal-activity-regulated genes in response to as short as an hour of blue, but not red or green, light exposure. These findings suggest that experimenters wishing to achieve longer-term (an hour or more) optogenetic control over neuronal firing should avoid using systems that require blue light.
biorxiv neuroscience 200-500-users 2019
Whole exome sequencing and characterization of coding variation in 49,960 individuals in the UK Biobank, bioRxiv, 2019-03-09
Summary: The UK Biobank is a prospective study of 502,543 individuals, combining extensive phenotypic and genotypic data with streamlined access for researchers around the world. Here we describe the first tranche of large-scale exome sequence data for 49,960 study participants, revealing approximately 4 million coding variants (of which ~98.4% have frequency < 1%). The data includes 231,631 predicted loss of function variants, a >10-fold increase compared to imputed sequence for the same participants. Nearly all genes (>97%) had ≥1 predicted loss of function carrier, and most genes (>69%) had ≥10 loss of function carriers. We illustrate the power of characterizing loss of function variation in this large population through association analyses across 1,741 phenotypes. In addition to replicating a range of established associations, we discover novel loss of function variants with large effects on disease traits, including PIEZO1 on varicose veins, COL6A1 on corneal resistance, MEPE on bone density, and IQGAP2 and GMPR on blood cell traits. We further demonstrate the value of exome sequencing by surveying the prevalence of pathogenic variants of clinical significance in this population, finding that 2% of the population has a medically actionable variant. Additionally, we leverage the phenotypic data to characterize the relationship between rare BRCA1 and BRCA2 pathogenic variants and cancer risk. Exomes from the first 49,960 participants are now made accessible to the scientific community and highlight the promise offered by genomic sequencing in large-scale population-based studies.
biorxiv genomics 200-500-users 2019