Visualizing Structure and Transitions for Biological Data Exploration, bioRxiv, 2017-03-25
Abstract: With the advent of high-throughput technologies measuring high-dimensional biological data, there is a pressing need for visualization tools that reveal the structure and emergent patterns of data in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure in data by an information-geometric distance between datapoints. We perform extensive comparison between PHATE and other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data including continual progressions, branches, and clusters. We define a manifold preservation metric DEMaP to show that PHATE produces quantitatively better denoised embeddings than existing visualization methods. We show that PHATE is able to gain unique insight from a newly generated scRNA-seq dataset of human germ layer differentiation. Here, PHATE reveals a dynamic picture of the main developmental branches in unparalleled detail, including the identification of three novel subpopulations. Finally, we show that PHATE is applicable to a wide variety of datatypes including mass cytometry, single-cell RNA-sequencing, Hi-C, and gut microbiome data, where it can generate interpretable insights into the underlying systems.
biorxiv bioinformatics 0-100-users 2017
Light Sheet Theta Microscopy for High-resolution Quantitative Imaging of Large Biological Systems, bioRxiv, 2017-03-23
Abstract: Advances in tissue clearing and molecular labelling methods are enabling unprecedented optical access to large intact biological systems. These advances fuel the need for high-speed microscopy approaches to image large samples quantitatively and at high resolution. While Light Sheet Microscopy (LSM), with its high planar imaging speed and low photo-bleaching, can be effective, scaling up to larger imaging volumes has been hindered by the use of orthogonal light-sheet illumination. To address this fundamental limitation, we have developed Light Sheet Theta Microscopy (LSTM), which uniformly illuminates samples from the same side as the detection objective, thereby eliminating limits on lateral dimensions without sacrificing imaging resolution, depth and speed. We present detailed characterization of LSTM, and show that this approach achieves rapid high-resolution imaging of large intact samples, with more uniform high resolution than LSM. LSTM is a significant step in high-resolution quantitative mapping of structure and function of large intact biological systems.
biorxiv neuroscience 0-100-users 2017
Single-cell RNA-seq of human induced pluripotent stem cells reveals cellular heterogeneity and cell state transitions between subpopulations, bioRxiv, 2017-03-23
Abstract: Heterogeneity of cell states represented in pluripotent cultures has not been described at the transcriptional level. Since gene expression is highly heterogeneous between cells, single-cell RNA sequencing can be used to identify how individual pluripotent cells function. Here, we present results from the analysis of single-cell RNA sequencing data from 18,787 individual WTC CRISPRi human induced pluripotent stem cells. We developed an unsupervised clustering method, and through this identified four subpopulations distinguishable on the basis of their pluripotent state: a core pluripotent population (48.3%), proliferative (47.8%), early-primed for differentiation (2.8%) and late-primed for differentiation (1.1%). For each subpopulation we were able to identify the genes and pathways that define differences in pluripotent cell states. Our method identified four transcriptionally distinct predictor gene sets, comprising 165 unique genes, that denote the specific pluripotency states; and using these sets, we developed a multigenic machine learning prediction method to accurately classify single cells into each of the subpopulations. Compared against a set of established pluripotency markers, our method increases prediction accuracy by 10%, specificity by 20%, and explains a substantially larger proportion of deviance (up to 3-fold) from the prediction model. Finally, we developed an innovative method to predict cells transitioning between subpopulations, and support our conclusions with results from two orthogonal pseudotime trajectory methods.
biorxiv genomics 0-100-users 2017
Multiplexing droplet-based single cell RNA-sequencing using natural genetic barcodes, bioRxiv, 2017-03-21
Droplet-based single-cell RNA-sequencing (dscRNA-seq) has enabled rapid, massively parallel profiling of transcriptomes from tens of thousands of cells. Multiplexing samples for single cell capture and library preparation in dscRNA-seq would enable cost-effective designs of differential expression and genetic studies while avoiding technical batch effects, but its implementation remains challenging. Here, we introduce an in silico algorithm, demuxlet, that harnesses natural genetic variation to discover the sample identity of each cell and identify droplets containing two cells. These capabilities enable multiplexed dscRNA-seq experiments where cells from unrelated individuals are pooled and captured at higher throughput than standard workflows. To demonstrate the performance of demuxlet, we sequenced 3 pools of peripheral blood mononuclear cells (PBMCs) from 8 lupus patients. Given genotyping data for each individual, demuxlet correctly recovered the sample identity of > 99% of singlets, and identified doublets at rates consistent with previous estimates. In PBMCs, we demonstrate the utility of multiplexed dscRNA-seq in two applications: characterizing cell type specificity and inter-individual variability of cytokine response from 8 lupus patients, and mapping genetic variants associated with cell type specific gene expression from 23 donors. Demuxlet is fast, accurate, scalable and could be extended to other single cell datasets that incorporate natural or synthetic DNA barcodes.
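The idea of using known genotypes to recover the sample identity of a cell can be illustrated with a toy likelihood calculation. This is a simplified sketch and not the demuxlet implementation: the function name `assign_sample`, the donor labels, the fixed `error_rate`, and the input layout are all hypothetical, and doublet detection is left out.

```python
import math

def assign_sample(reads, genotypes, error_rate=0.01):
    """Pick the donor whose genotypes best explain a cell's reads.

    reads:     list of (snp_index, allele), allele 0 = reference, 1 = alternate
    genotypes: dict mapping donor name -> alternate-allele count (0, 1, 2) per SNP
    Assumes independent reads and a single, fixed sequencing error rate.
    """
    scores = {}
    for donor, dosages in genotypes.items():
        loglik = 0.0
        for snp, allele in reads:
            # Probability of observing an alternate read given the genotype,
            # allowing for a small chance that the base call is wrong.
            p_alt = dosages[snp] / 2.0
            p_alt = p_alt * (1 - error_rate) + (1 - p_alt) * error_rate
            loglik += math.log(p_alt if allele == 1 else 1 - p_alt)
        scores[donor] = loglik
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical genotypes for two donors at four SNPs, and reads from one cell.
genotypes = {"donor_A": [0, 2, 1, 0], "donor_B": [2, 0, 1, 2]}
reads = [(0, 0), (1, 1), (3, 0), (2, 1)]
best, scores = assign_sample(reads, genotypes)
print(best, scores)
```

In this toy input the reads match donor_A's homozygous sites, so donor_A receives the higher log-likelihood; a real tool would also need to model droplets containing cells from two donors.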
biorxiv bioinformatics 0-100-users 2017
Scaling up DNA data storage and random access retrieval, bioRxiv, 2017-03-08
Current storage technologies can no longer keep pace with exponentially growing amounts of data. 1 Synthetic DNA offers an attractive alternative due to its potential information density of ~10^18 B/mm^3, 10^7 times denser than magnetic tape, and potential durability of thousands of years.2 Recent advances in DNA data storage have highlighted technical challenges, in particular, coding and random access, but have stored only modest amounts of data in synthetic DNA. 3,4,5 This paper demonstrates an end-to-end approach toward the viability of DNA data storage with large-scale random access. We encoded and stored 35 distinct files, totaling 200MB of data, in more than 13 million DNA oligonucleotides (about 2 billion nucleotides in total) and fully recovered the data with no bit errors, representing an advance of almost an order of magnitude compared to prior work. 6 Our data curation focused on technologically advanced data types and historical relevance, including the Universal Declaration of Human Rights in over 100 languages,7 a high-definition music video of the band OK Go,8 and a CropTrust database of the seeds stored in the Svalbard Global Seed Vault.9 We developed a random access methodology based on selective amplification, for which we designed and validated a large library of primers, and successfully retrieved arbitrarily chosen items from a subset of our pool containing 10.3 million DNA sequences. Moreover, we developed a novel coding scheme that dramatically reduces the physical redundancy (sequencing read coverage) required for error-free decoding to a median of 5x, while maintaining levels of logical redundancy comparable to the best prior codes. We further stress-tested our coding approach by successfully decoding a file using the more error-prone nanopore-based sequencing.
We provide a detailed analysis of errors in the process of writing, storing, and reading data from synthetic DNA at a large scale, which helps characterize DNA as a storage medium and justify our coding approach. Thus, we have demonstrated a significant improvement in data volume, random access, and encoding/decoding schemes that contribute to a whole-system vision for DNA data storage.
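The headline figures in this abstract (200MB, more than 13 million oligonucleotides, about 2 billion nucleotides) imply some rough per-oligonucleotide quantities. The back-of-envelope sketch below assumes 1 MB = 10^6 bytes and treats the rounded figures as exact, so the results are only approximate; the variable names are illustrative.

```python
# Rounded figures taken from the abstract (treated as exact for this estimate).
total_bytes = 200e6   # 200 MB, assuming 1 MB = 10**6 bytes
oligos = 13e6         # "more than 13 million" oligonucleotides
nucleotides = 2e9     # "about 2 billion" nucleotides in total

# Average oligonucleotide length in nucleotides.
nt_per_oligo = nucleotides / oligos

# Net payload density across all synthesized nucleotides; this includes
# whatever overhead (primers, addressing, redundancy) the design carries.
bits_per_nt = total_bytes * 8 / nucleotides

# Average payload bytes attributable to each oligonucleotide.
bytes_per_oligo = total_bytes / oligos

print(f"{nt_per_oligo:.0f} nt per oligo")
print(f"{bits_per_nt:.2f} bits per nucleotide")
print(f"{bytes_per_oligo:.1f} bytes per oligo")
```

Under these assumptions the estimate comes to roughly 150 nucleotides per oligonucleotide and about 0.8 bits of payload per synthesized nucleotide, well below the 2 bits per nucleotide that an overhead-free four-letter code could carry.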
biorxiv bioengineering 0-100-users 2017
Parallel paleogenomic transects reveal complex genetic history of early European farmers, bioRxiv, 2017-03-07
Ancient DNA studies have established that European Neolithic populations were descended from Anatolian migrants who received a limited amount of admixture from resident hunter-gatherers. Many open questions remain, however, about the spatial and temporal dynamics of population interactions and admixture during the Neolithic period. Using the highest-resolution genome-wide ancient DNA data set assembled to date, a total of 177 samples (127 newly reported here) from the Neolithic and Chalcolithic of Hungary (6000–2900 BCE, n = 98), Germany (5500–3000 BCE, n = 42), and Spain (5500–2200 BCE, n = 37), we investigate the population dynamics of Neolithization across Europe. We find that genetic diversity was shaped predominantly by local processes, with varied sources and proportions of hunter-gatherer ancestry among the three regions and through time. Admixture between groups with different ancestry profiles was pervasive and resulted in observable population transformation across almost all cultural transitions. Our results shed new light on the ways that gene flow reshaped European populations throughout the Neolithic period and demonstrate the potential of time-series-based sampling and modeling approaches to elucidate multiple dimensions of historical population interactions.
biorxiv genetics 0-100-users 2017