Continuity and admixture in the last five millennia of Levantine history from ancient Canaanite and present-day Lebanese genome sequences, bioRxiv, 2017-05-27
The Canaanites inhabited the Levant region during the Bronze Age and established a culture which became influential in the Near East and beyond. However, the Canaanites, unlike most other ancient Near Easterners of this period, left few surviving textual records and thus their origin and relationship to ancient and present-day populations remain unclear. In this study, we sequenced five whole-genomes from ~3,700-year-old individuals from the city of Sidon, a major Canaanite city-state on the Eastern Mediterranean coast. We also sequenced the genomes of 99 individuals from present-day Lebanon to catalogue modern Levantine genetic diversity. We find that a Bronze Age Canaanite-related ancestry was widespread in the region, shared among urban populations inhabiting the coast (Sidon) and inland populations (Jordan) who likely lived in farming societies or were pastoral nomads. This Canaanite-related ancestry derived from mixture between local Neolithic populations and eastern migrants genetically related to Chalcolithic Iranians. We estimate, using linkage-disequilibrium decay patterns, that admixture occurred 6,600-3,550 years ago, coinciding with massive population movements in the mid-Holocene triggered by aridification ~4,200 years ago. We show that present-day Lebanese derive most of their ancestry from a Canaanite-related population, which therefore implies substantial genetic continuity in the Levant since at least the Bronze Age. In addition, we find Eurasian ancestry in the Lebanese not present in Bronze Age or earlier Levantines. We estimate this Eurasian ancestry arrived in the Levant around 3,750-2,170 years ago during a period of successive conquests by distant populations such as the Persians and Macedonians.
biorxiv genetics 0-100-users 2017scImpute Accurate And Robust Imputation For Single Cell RNA-Seq Data, bioRxiv, 2017-05-25
The emerging single cell RNA sequencing (scRNA-seq) technologies enable the investigation of transcriptomic landscapes at single-cell resolution. The analysis of scRNA-seq data is complicated by excess zero or near zero counts, the so-called dropouts due to the low amounts of mRNA sequenced within individual cells. Downstream analysis of scRNA-seq would be severely biased if the dropout events are not properly corrected. We introduce scImpute, a statistical method to accurately and robustly impute the dropout values in scRNA-seq data. ScImpute automatically identifies gene expression values affected by dropout events, and only perform imputation on these values without introducing new bias to the rest data. ScImpute also detects outlier or rare cells and excludes them from imputation. Evaluation based on both simulated and real scRNA-seq data on mouse embryos, mouse brain cells, human blood cells, and human embryonic stem cells suggests that scImpute is an effective tool to recover transcriptome dynamics masked by dropout events. scImpute is shown to correct false zero counts, enhance the clustering of cell populations and subpopulations, improve the accuracy of differential expression analysis, and aid the study of gene expression dynamics.
biorxiv bioinformatics 0-100-users 2017A comparison between single cell RNA sequencing and single molecule RNA FISH for rare cell analysis, bioRxiv, 2017-05-19
AbstractThe development of single cell RNA sequencing technologies has emerged as a powerful means of profiling the transcriptional behavior of single cells, leveraging the breadth of sequencing measurements to make inferences about cell type. However, there is still little understanding of how well these methods perform at measuring single cell variability for small sets of genes and what “transcriptome coverage” (e.g. genes detected per cell) is needed for accurate measurements. Here, we use single molecule RNA FISH measurements of 26 genes in thousands of melanoma cells to provide an independent reference dataset to assess the performance of the DropSeq and Fluidigm single cell RNA sequencing platforms. We quantified the Gini coefficient, a measure of rare-cell expression variability, and find that the correspondence between RNA FISH and single cell RNA sequencing for Gini, unlike for mean, increases markedly with per-cell library complexity up to a threshold of ∼2000 genes detected. A similar complexity threshold also allows for robust assignment of multi-genic cell states such as cell cycle phase. Our results provide guidelines for selecting sequencing depth and complexity thresholds for single cell RNA sequencing. More generally, our results suggest that if the number of genes whose expression levels are required to answer any given biological question is small, then greater transcriptome complexity per cell is likely more important than obtaining very large numbers of cells.
biorxiv genomics 0-100-users 2017Cohesin loss eliminates all loop domains, leading to links among superenhancers and downregulation of nearby genes, bioRxiv, 2017-05-19
SUMMARYThe human genome folds to create thousands of intervals, called “contact domains,” that exhibit enhanced contact frequency within themselves. “Loop domains” form because of tethering between two loci - almost always bound by CTCF and cohesin – lying on the same chromosome. “Compartment domains” form when genomic intervals with similar histone marks co-segregate. Here, we explore the effects of degrading cohesin. All loop domains are eliminated, but neither compartment domains nor histone marks are affected. Loci in different compartments that had been in the same loop domain become more segregated. Loss of loop domains does not lead to widespread ectopic gene activation, but does affect a significant minority of active genes. In particular, cohesin loss causes superenhancers to co-localize, forming hundreds of links within and across chromosomes, and affecting the regulation of nearby genes. Cohesin restoration quickly reverses these effects, consistent with a model where loop extrusion is rapid.
biorxiv genomics 0-100-users 2017SAVER Gene expression recovery for UMI-based single cell RNA sequencing, bioRxiv, 2017-05-17
AbstractRapid advances in massively parallel single cell RNA sequencing (scRNA-seq) is paving the way for high-resolution single cell profiling of biological samples. In most scRNA-seq studies, only a small fraction of the transcripts present in each cell are sequenced. The efficiency, that is, the proportion of transcripts in the cell that are sequenced, can be especially low in highly parallelized experiments where the number of reads allocated for each cell is small. This leads to unreliable quantification of lowly and moderately expressed genes, resulting in extremely sparse data and hindering downstream analysis. To address this challenge, we introduce SAVER (Single-cell Analysis Via Expression Recovery), an expression recovery method for scRNA-seq that borrows information across genes and cells to impute the zeros as well as to improve the expression estimates for all genes. We show, by comparison to RNA fluorescence in situ hybridization (FISH) and by data down-sampling experiments, that SAVER reliably recovers cell-specific gene expression concentrations, cross-cell gene expression distributions, and gene-to-gene and cell-to-cell correlations. This improves the power and accuracy of any downstream analysis involving genes with low to moderate expression.
biorxiv genomics 0-100-users 2017A Next Generation Connectivity Map L1000 Platform And The First 1,000,000 Profiles, bioRxiv, 2017-05-11
SUMMARYWe previously piloted the concept of a Connectivity Map (CMap), whereby genes, drugs and disease states are connected by virtue of common gene-expression signatures. Here, we report more than a 1,000-fold scale-up of the CMap as part of the NIH LINCS Consortium, made possible by a new, low-cost, high throughput reduced representation expression profiling method that we term L1000. We show that L1000 is highly reproducible, comparable to RNA sequencing, and suitable for computational inference of the expression levels of 81% of non-measured transcripts. We further show that the expanded CMap can be used to discover mechanism of action of small molecules, functionally annotate genetic variants of disease genes, and inform clinical trials. The 1.3 million L1000 profiles described here, as well as tools for their analysis, are available at <jatsext-link xmlnsxlink=httpwww.w3.org1999xlink ext-link-type=uri xlinkhref=httpsclue.io>httpsclue.io<jatsext-link>.HIGHLIGHTS<jatslist list-type=bullet><jatslist-item>A new gene expression profiling method, L1000, dramatically lowers cost<jatslist-item><jatslist-item>The Connectivity Map database now includes 1.3 million publicly accessible L1000 perturbational profiles<jatslist-item><jatslist-item>This expanded Connectivity Map facilitates discovery of small molecule mechanism of action and functional annotation of genetic variants<jatslist-item><jatslist-item>The work establishes feasibility and utility of a truly comprehensive Connectivity Map<jatslist-item>
biorxiv genomics 0-100-users 2017