Pushing the limits of de novo genome assembly for complex prokaryotic genomes harboring very long, near identical repeats, bioRxiv, 2018-04-12
Abstract: Generating a complete, de novo genome assembly for prokaryotes is often considered a solved problem. However, we here show that Pseudomonas koreensis P19E3 harbors multiple, near identical repeat pairs up to 70 kilobase pairs in length. Beyond long repeats, the P19E3 assembly was further complicated by a shufflon region. Its complex genome could not be de novo assembled with long reads produced by Pacific Biosciences’ technology, but required very long reads from Oxford Nanopore Technologies. Another important factor for a full genomic resolution was the choice of assembly algorithm. Importantly, a repeat analysis indicated that very complex bacterial genomes represent a general phenomenon beyond Pseudomonas. Roughly 10% of 9331 complete bacterial and a handful of 293 complete archaeal genomes represented this dark matter for de novo genome assembly of prokaryotes. Several of these dark matter genome assemblies contained repeats far beyond the resolution of the sequencing technology employed and likely contain errors; other genomes were closed employing labor-intensive steps like cosmid libraries, primer walking or optical mapping. Using very long sequencing reads in combination with assemblers capable of resolving long, near identical repeats will bring most prokaryotic genomes within reach of fast and complete de novo genome assembly.
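The point about repeats lying "beyond the resolution of the sequencing technology" can be illustrated with a rough length check: a read can only disambiguate a repeat if it covers the whole repeat plus some unique sequence on both sides. The sketch below is not the authors' repeat analysis; the function names, the 1 kb anchor length, and the example read lengths are assumptions made purely for illustration.

```python
def spanning_reads(read_lengths, repeat_length, anchor=1000):
    """Return the read lengths that could span a repeat.

    A read qualifies if it is at least as long as the repeat plus
    `anchor` bases of unique flanking sequence on each side.
    The default anchor of 1000 bp is an illustrative assumption.
    """
    required = repeat_length + 2 * anchor
    return [length for length in read_lengths if length >= required]


def is_resolvable(read_lengths, repeat_length, anchor=1000, min_reads=1):
    """True if at least `min_reads` reads are long enough to span the repeat."""
    return len(spanning_reads(read_lengths, repeat_length, anchor)) >= min_reads


# Hypothetical read lengths (bp) checked against a 70 kb repeat.
shorter_reads = [15_000, 40_000]
longer_reads = [15_000, 40_000, 95_000]

print(is_resolvable(shorter_reads, 70_000))
print(is_resolvable(longer_reads, 70_000))
```

Sufficient read length is a necessary condition only: a read must also happen to start upstream of the repeat and end downstream of it, so in practice several such reads are needed.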
biorxiv genomics 0-100-users 2018
Bayesian Inference for a Generative Model of Transcriptome Profiles from Single-cell RNA Sequencing, bioRxiv, 2018-03-30
Abstract: Transcriptome profiles of individual cells reflect true and often unexplored biological diversity, but are also affected by noise of biological and technical nature. This raises the need to explicitly model the resulting uncertainty and take it into account in any downstream analysis, such as dimensionality reduction, clustering, and differential expression. Here, we introduce Single-cell Variational Inference (scVI), a scalable framework for probabilistic representation and analysis of gene expression in single cells. Our model uses variational inference and stochastic optimization of deep neural networks to approximate the parameters that govern the distribution of expression values of each gene in every cell, using a non-linear mapping between the observations and a low-dimensional latent space. By doing so, scVI pools information between similar cells or genes while taking nuisance factors of variation such as batch effects and limited sensitivity into account. To evaluate scVI, we conducted a comprehensive comparison with existing methods for distributional modeling and dimensionality reduction, all of which rely on generalized linear models. We first show that scVI scales to over one million cells, whereas competing algorithms can process at most tens of thousands of cells. Next, we show that scVI fits unseen data more closely and can impute missing data more accurately, both indicative of a better generalization capacity. We then utilize scVI to conduct a set of fundamental analysis tasks (including batch correction, visualization, clustering and differential expression) and demonstrate its accuracy in comparison to the state-of-the-art tools in each task. scVI is publicly available, and can be readily used as a principled and inclusive solution for multiple tasks of single-cell RNA sequencing data analysis.
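For orientation, variational inference of this kind usually maximizes an evidence lower bound on the data likelihood. The generic form is given below; it is the textbook bound, not necessarily the exact scVI objective. The symbols are assumptions introduced here: $x_n$ is the expression vector of cell $n$, $z_n$ its low-dimensional latent representation, $p_\theta$ the generative (decoder) network, and $q_\phi$ the approximate posterior (encoder) network.

```latex
\log p_\theta(x_n) \;\ge\;
\mathbb{E}_{q_\phi(z_n \mid x_n)}\!\left[\log p_\theta(x_n \mid z_n)\right]
\;-\; \mathrm{KL}\!\left(q_\phi(z_n \mid x_n)\,\middle\|\,p(z_n)\right)
```

The first term rewards accurate reconstruction of the observed counts; the second keeps the approximate posterior close to the prior over the latent space. Nuisance factors such as batch would enter as additional conditioning variables.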
biorxiv bioinformatics 0-100-users 2018Ancient Fennoscandian genomes reveal origin and spread of Siberian ancestry in Europe, bioRxiv, 2018-03-22
Abstract: European history has been shaped by migrations of people, and their subsequent admixture. Recently, evidence from ancient DNA has brought new insights into migration events that could be linked to the advent of agriculture, and possibly to the spread of Indo-European languages. However, little is known so far about the ancient population history of north-eastern Europe, in particular about populations speaking Uralic languages, such as Finns and Saami. Here we analyse ancient genomic data from 11 individuals from Finland and Northwest Russia. We show that the specific genetic makeup of northern Europe traces back to migrations from Siberia that began at least 3,500 years ago. This ancestry was subsequently admixed into many modern populations in the region, in particular populations speaking Uralic languages today. In addition, we show that ancestors of modern Saami inhabited a larger territory during the Iron Age than today, which adds to historical and linguistic evidence for the population history of Finland.
biorxiv genomics 0-100-users 2018
Marionette: E. coli containing 12 highly-optimized small molecule sensors, bioRxiv, 2018-03-21
Cellular processes are carried out by many interacting genes, and their study and optimization require multiple levers by which they can be independently controlled. The most common method is via a genetically-encoded sensor that responds to a small molecule (an “inducible system”). However, these sensors are often suboptimal, exhibiting high background expression and low dynamic range. Further, using multiple sensors in one cell is limited by cross-talk and the taxing of cellular resources. Here, we have developed a directed evolution strategy to simultaneously select for low background, high dynamic range, increased sensitivity, and low crosstalk. Libraries of the regulatory protein and output promoter are built based on random and rationally-guided mutations. This is applied to generate a set of 12 high-performance sensors, which exhibit >100-fold induction with low background and cross-reactivity. These are combined to build a single “sensor array” and inserted into the genomes of E. coli MG1655 (wild-type), DH10B (cloning), and BL21 (protein expression). These “Marionette” strains allow for the independent control of gene expression using 2,4-diacetylphloroglucinol (DAPG), cuminic acid (Cuma), 3-oxohexanoyl-homoserine lactone (OC6), vanillic acid (Van), isopropyl β-D-1-thiogalactopyranoside (IPTG), anhydrotetracycline (aTc), L-arabinose (Ara), choline chloride (Cho), naringenin (Nar), 3,4-dihydroxybenzoic acid (DHBA), sodium salicylate (Sal), and 3-hydroxytetradecanoyl-homoserine lactone (OHC14).
biorxiv synthetic-biology 0-100-users 2018
Profiling of pluripotency factors in individual stem cells and early embryos, bioRxiv, 2018-03-21
Summary: Major cell fate decisions are governed by sequence-specific transcription factors (TFs) that act in small cell populations within developing embryos. To understand how TFs regulate cell fate it is important to identify their genomic binding sites in these populations. However, current methods cannot profile TFs genome-wide at or near the single cell level. Here we adapt the CUT&RUN method to profile chromatin proteins in low cell numbers, mapping TF-DNA interactions in single cells and individual pre-implantation embryos for the first time. Using this method, we demonstrate that the pluripotency TF NANOG is significantly more dependent on the SWI/SNF family ATPase BRG1 for association with its genomic targets in vivo than in cultured cells, a finding that could not have been made using traditional approaches. Ultra-low input CUT&RUN (uliCUT&RUN) enables interrogation of TF binding from low cell numbers, with broad applicability to rare cell populations of importance in development or disease.
biorxiv genomics 0-100-users 2018
Third-generation in situ hybridization chain reaction: multiplexed, quantitative, sensitive, versatile, robust, bioRxiv, 2018-03-20
Abstract: In situ hybridization based on the mechanism of hybridization chain reaction (HCR) has addressed multi-decade challenges to imaging mRNA expression in diverse organisms, offering a unique combination of multiplexing, quantitation, sensitivity, resolution, and versatility. Here, with third-generation in situ HCR, we augment these capabilities using probes and amplifiers that combine to provide automatic background suppression throughout the protocol, ensuring that even if reagents bind non-specifically within the sample they will not generate amplified background. Automatic background suppression dramatically enhances performance and robustness, combining the benefits of higher signal-to-background with the convenience of using unoptimized probe sets for new targets and organisms. In situ HCR v3.0 enables multiplexed quantitative mRNA imaging with subcellular resolution in the anatomical context of whole-mount vertebrate embryos, multiplexed quantitative mRNA flow cytometry for high-throughput single-cell expression profiling, and multiplexed quantitative single-molecule mRNA imaging in thick autofluorescent samples.
Summary: In situ hybridization chain reaction (HCR) v3.0 exploits automatic background suppression to enable multiplexed quantitative mRNA imaging and flow cytometry with dramatically enhanced ease-of-use and performance.
biorxiv developmental-biology 0-100-users 2018