Gene regulatory network reconstruction using single-cell RNA sequencing of barcoded genotypes in diverse environments, bioRxiv, 2019-03-19
AbstractUnderstanding how gene expression programs are controlled requires identifying regulatory relationships between transcription factors and target genes. Gene regulatory networks are typically constructed from gene expression data acquired following genetic perturbation or environmental stimulus. Single-cell RNA sequencing (scRNAseq) captures the gene expression state of thousands of individual cells in a single experiment, offering advantages in combinatorial experimental design, large numbers of independent measurements, and accessing the interaction between the cell cycle and environmental responses that is hidden by population-level analysis of gene expression. To leverage these advantages, we developed a method for transcriptionally barcoding gene deletion mutants and performing scRNAseq in budding yeast (Saccharomyces cerevisiae). We pooled diverse genotypes in 11 different environmental conditions and determined their expression state by sequencing 38,285 individual cells. We developed, and benchmarked, a framework for learning gene regulatory networks from scRNAseq data that incorporates multitask learning and constructed a global gene regulatory network comprising 12,018 interactions. Our study establishes a general approach to gene regulatory network reconstruction from scRNAseq data that can be employed in any organism.
biorxiv genomics 0-100-users 2019Ribosome profiling at isoform level reveals an evolutionary conserved impact of differential splicing on the proteome, bioRxiv, 2019-03-19
AbstractThe differential production of transcript isoforms from gene loci is a key cellular mechanism. Yet, its impact in protein production remains an open question. Here, we describe ORQAS (ORF quantification pipeline for alternative splicing) a new pipeline for the translation quantification of individual transcript isoforms using ribosome-protected mRNA fragments (Ribosome profiling). We found evidence of translation for 40-50% of the expressed transcript isoforms in human and mouse, with 53% of the expressed genes having more than one translated isoform in human, 33% in mouse. Differential analysis revealed that about 40% of the splicing changes at RNA level were concordant with changes in translation, with 21.7% of changes at RNA level and 17.8% at translational level conserved between human and mouse. Furthermore, orthologous cassette exons preserving the directionality of the change were found enriched in microexons in a comparison between glia and glioma, and were conserved between human and mouse. ORQAS leverages ribosome profiling to uncover a widespread and evolutionary conserved impact of differential splicing on the translation of isoforms and in particular, of microexon-containing ones. ORQAS is available at <jatsext-link xmlnsxlink=httpwww.w3.org1999xlink ext-link-type=uri xlinkhref=httpsgithub.comcomprnaorqas>httpsgithub.comcomprnaorqas<jatsext-link>
biorxiv genomics 100-200-users 2019An open resource of structural variation for medical and population genetics, bioRxiv, 2019-03-15
SUMMARYStructural variants (SVs) rearrange the linear and three-dimensional organization of the genome, which can have profound consequences in evolution, diversity, and disease. As national biobanks, human disease association studies, and clinical genetic testing are increasingly reliant on whole-genome sequencing, population references for small variants (i.e., SNVs & indels) in protein-coding genes, such as the Genome Aggregation Database (gnomAD), have become integral for the evaluation and interpretation of genomic variation. However, no comparable large-scale reference maps for SVs exist to date. Here, we constructed a reference atlas of SVs from deep whole-genome sequencing (WGS) of 14,891 individuals across diverse global populations (54% non-European) as a component of gnomAD. We discovered a rich landscape of 498,257 unique SVs, including 5,729 multi-breakpoint complex SVs across 13 mutational subclasses, and examples of localized chromosome shattering, like chromothripsis, in the general population. The mutation rates and densities of SVs were non-uniform across chromosomes and SV classes. We discovered strong correlations between constraint against predicted loss-of-function (pLoF) SNVs and rare SVs that both disrupt and duplicate protein-coding genes, suggesting that existing per-gene metrics of pLoF SNV constraint do not simply reflect haploinsufficiency, but appear to capture a gene’s general sensitivity to dosage alterations. The average genome in gnomAD-SV harbored 8,202 SVs, and approximately eight genes altered by rare SVs. When incorporating these data with pLoF SNVs, we estimate that SVs comprise at least 25% of all rare pLoF events per genome. We observed large (≥1Mb), rare SVs in 3.1% of genomes (∼132 individuals), and a clinically reportable pathogenic incidental finding from SVs in 0.24% of genomes (∼1417 individuals). We also estimated the prevalence of previously reported pathogenic recurrent CNVs associated with genomic disorders, which highlighted differences in frequencies across populations and confirmed that WGS-based analyses can readily recapitulate these clinically important variants. In total, gnomAD-SV includes at least one CNV covering 57% of the genome, while the remaining 43% is significantly enriched for CNVs found in tumors and individuals with developmental disorders. However, current sample sizes remain markedly underpowered to establish estimates of SV constraint on the level of individual genes or noncoding loci. The gnomAD-SV resources have been integrated into the gnomAD browser (<jatsext-link xmlnsxlink=httpwww.w3.org1999xlink ext-link-type=uri xlinkhref=httpsgnomad.broadinstitute.org>httpsgnomad.broadinstitute.org<jatsext-link>), where users can freely explore this dataset without restrictions on reuse, which will have broad utility in population genetics, disease association, and diagnostic screening.
biorxiv genomics 200-500-users 2019A systematic evaluation of the design, orientation, and sequence context dependencies of massively parallel reporter assays, bioRxiv, 2019-03-14
ABSTRACTMassively parallel reporter assays (MPRAs) functionally screen thousands of sequences for regulatory activity in parallel. Although MPRAs have been applied to address diverse questions in gene regulation, there has been no systematic comparison of how differences in experimental design influence findings. Here, we screen a library of 2,440 sequences, representing candidate liver enhancers and controls, in HepG2 cells for regulatory activity using nine different approaches (including conventional episomal, STARR-seq, and lentiviral MPRA designs). We identify subtle but significant differences in the resulting measurements that correlate with epigenetic and sequence-level features. We also test this library in both orientations with respect to the promoter, validating en masse that enhancer activity is robustly independent of orientation. Finally, we develop and apply a novel method to assemble and functionally test libraries of the same putative enhancers as 192-mers, 354-mers, and 678-mers, and observe surprisingly large differences in functional activity. This work provides a framework for the experimental design of high-throughput reporter assays, suggesting that the extended sequence context of tested elements, and to a lesser degree the precise assay, influence MPRA results.
biorxiv genomics 0-100-users 2019Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression, bioRxiv, 2019-03-14
AbstractSingle-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from ’regularized negative binomial regression’, where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation, and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform, with a direct interface to our single-cell toolkit Seurat.
biorxiv genomics 200-500-users 2019Feature Selection and Dimension Reduction for Single Cell RNA-Seq based on a Multinomial Model, bioRxiv, 2019-03-12
AbstractSingle cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero-inflation. Current normalization pro-cedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We pro-pose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform current practice in a downstream clustering assessment using ground-truth datasets.
biorxiv genomics 200-500-users 2019