Valid post-clustering differential analysis for single-cell RNA-Seq, bioRxiv, 2018-11-05
Summary: Single-cell computational pipelines involve two critical steps: organizing cells (clustering) and identifying the markers driving this organization (differential expression analysis). State-of-the-art pipelines perform differential analysis after clustering on the same dataset. We observe that because clustering forces separation, reusing the same dataset generates artificially low p-values and hence false discoveries. We introduce a valid post-clustering differential analysis framework which corrects for this problem. We provide software at https://github.com/jessemzhang/tn_test.
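The problem described in this summary can be illustrated with a small simulation. The sketch below is not the authors' framework or their software; it only shows, under simplifying assumptions, how defining groups and testing them on the same values yields a misleadingly small p-value. The data are one-dimensional draws from a single Gaussian, the "clustering" is a cut at the median, and the test is a large-sample z approximation. All function and variable names are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def z_test_p(a, b):
    """Two-sided p-value for a difference in means (large-sample z approximation)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def split_at_median(values):
    """Toy stand-in for clustering: cut one-dimensional data in two at the median."""
    cut = sorted(values)[len(values) // 2]
    low = [v for v in values if v < cut]
    high = [v for v in values if v >= cut]
    return low, high


rng = random.Random(0)

# One homogeneous population: there is no real group structure to discover.
expression = [rng.gauss(0.0, 1.0) for _ in range(400)]

# Naive pipeline: define groups from the data, then test the same data.
low, high = split_at_median(expression)
p_naive = z_test_p(low, high)

# Control: group labels assigned at random, ignoring the values.
shuffled = expression[:]
rng.shuffle(shuffled)
p_random = z_test_p(shuffled[:200], shuffled[200:])

print(f"p-value after clustering on the same data: {p_naive:.3g}")
print(f"p-value with random group labels:          {p_random:.3g}")
```

Because the split itself forces the two groups apart, the first p-value is extremely small even though every value came from the same distribution, while the random-label control behaves like an ordinary null test.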
biorxiv bioinformatics 100-200-users 2018
Systematic identification of human SNPs affecting regulatory element activity, bioRxiv, 2018-11-04
Abstract: Most of the millions of single-nucleotide polymorphisms (SNPs) in the human genome are non-coding, and many overlap with putative regulatory elements. Genome-wide association studies have linked many of these SNPs to human traits or to gene expression levels, but rarely with sufficient resolution to identify the causal SNPs. Functional screens based on reporter assays have previously been of insufficient throughput to test the vast space of SNPs for possible effects on enhancer and promoter activity. Here, we have leveraged the throughput of the SuRE reporter technology to survey a total of 5.9 million SNPs, including 57% of the known common SNPs. We identified more than 30 thousand SNPs that alter the activity of putative regulatory elements, often in a cell-type-specific manner. These data indicate that a large proportion of human non-coding SNPs may affect gene regulation. Integration of these SuRE data with genome-wide association studies may help pinpoint SNPs that underlie human traits.
biorxiv genomics 100-200-users 2018
Ultra-sensitive sequencing for cancer detection reveals progressive clonal selection in normal tissue over a century of human lifespan, bioRxiv, 2018-11-04
Abstract: High-accuracy next-generation DNA sequencing promises a paradigm shift in early cancer detection by enabling the identification of mutant cancer molecules in minimally invasive body fluid samples. We demonstrate 80% sensitivity for ovarian cancer detection using ultra-accurate Duplex Sequencing to identify TP53 mutations in uterine lavage. However, in addition to tumor DNA, we also detect low-frequency TP53 mutations in nearly all lavages from women with and without cancer. These mutations increase with age and share the selection traits of clonal TP53 mutations commonly found in human tumors. We show that low-frequency TP53 mutations exist in multiple healthy tissues, from newborn to centenarian, and progressively increase in abundance and pathogenicity with older age across tissue types. Our results illustrate that subclonal cancer evolutionary processes are a ubiquitous part of normal human aging and that great care must be taken to distinguish tumor-derived from age-associated mutations in high-sensitivity clinical cancer diagnostics.
biorxiv cancer-biology 200-500-users 2018
Stem cell differentiation trajectories in Hydra resolved at single-cell resolution, bioRxiv, 2018-11-03
Abstract: The adult Hydra polyp continuously renews all of its cells using three separate stem cell populations, but the genetic pathways enabling homeostatic tissue maintenance are not well understood. We used Drop-seq to sequence transcriptomes of 24,985 single Hydra cells and identified the molecular signatures of a broad spectrum of cell states, from stem cells to terminally differentiated cells. We constructed differentiation trajectories for each cell lineage and identified the transcription factors expressed along these trajectories, thus creating a comprehensive molecular map of all developmental lineages in the adult animal. We unexpectedly found that neuron and gland cell differentiation transits through a common progenitor state, suggesting a shared evolutionary history for these secretory cell types. Finally, we have built the first gene expression map of the Hydra nervous system. By producing a comprehensive molecular description of the adult Hydra polyp, we have generated a resource for addressing fundamental questions regarding the evolution of developmental processes and nervous system function.
biorxiv developmental-biology 100-200-users 2018
Comprehensive integration of single cell data, bioRxiv, 2018-11-02
Single cell transcriptomics (scRNA-seq) has transformed our ability to discover and annotate cell types and states, but deep biological understanding requires more than a taxonomic listing of clusters. As new methods arise to measure distinct cellular modalities, including high-dimensional immunophenotypes, chromatin accessibility, and spatial positioning, a key analytical challenge is to integrate these datasets into a harmonized atlas that can be used to better understand cellular identity and function. Here, we develop a computational strategy to “anchor” diverse datasets together, enabling us to integrate and compare single cell measurements not only across scRNA-seq technologies, but across different modalities as well. After demonstrating substantial improvement over existing methods for data integration, we anchor scRNA-seq experiments with scATAC-seq datasets to explore chromatin differences in closely related interneuron subsets, and project single cell protein measurements onto a human bone marrow atlas to annotate and characterize lymphocyte populations. Lastly, we demonstrate how anchoring can harmonize in-situ gene expression and scRNA-seq datasets, allowing for the transcriptome-wide imputation of spatial gene expression patterns, and the identification of spatial relationships between mapped cell types in the visual cortex. Our work presents a strategy for comprehensive integration of single cell data, including the assembly of harmonized references, and the transfer of information across datasets. Availability: Installation instructions, documentation, and tutorials are available at https://www.satijalab.org/seurat
biorxiv genomics 200-500-users 2018
Inferring the ancestry of everyone, bioRxiv, 2018-11-01
Abstract: A central problem in evolutionary biology is to infer the full genealogical history of a set of DNA sequences. This history contains rich information about the forces that have influenced a sexually reproducing species. However, existing methods are limited: the most accurate is unable to cope with more than a few dozen samples. With modern genetic data sets rapidly approaching millions of genomes, there is an urgent need for efficient inference methods to exploit such rich resources. We introduce an algorithm to infer whole-genome history which has comparable accuracy to the state-of-the-art but can process around four orders of magnitude more sequences. Additionally, our method results in an “evolutionary encoding” of the original sequence data, enabling efficient access to genealogies and calculation of genetic statistics over the data. We apply this technique to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the genealogies we estimate are both rich in biological signal and efficient to process.
biorxiv evolutionary-biology 200-500-users 2018