The art of using t-SNE for single-cell transcriptomics, bioRxiv, 2018-10-26
Abstract: Single-cell transcriptomics yields ever growing data sets containing RNA expression levels for thousands of genes from up to millions of cells. Common data analysis pipelines include a dimensionality reduction step for visualising the data in two dimensions, most frequently performed using t-distributed stochastic neighbour embedding (t-SNE). It excels at revealing local structure in high-dimensional data, but naive applications often suffer from severe shortcomings, e.g. the global structure of the data is not represented accurately. Here we describe how to circumvent such pitfalls, and develop a protocol for creating more faithful t-SNE visualisations. It includes PCA initialisation, a high learning rate, and multi-scale similarity kernels; for very large data sets, we additionally use exaggeration and downsampling-based initialisation. We use published single-cell RNA-seq data sets to demonstrate that this protocol yields superior results compared to the naive application of t-SNE.
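As a rough illustration of two ingredients named in the abstract above (PCA initialisation and a high learning rate), here is a minimal NumPy sketch. The function names, the target standard deviation of 1e-4, and the "number of cells divided by 12, but at least 200" learning-rate rule are assumptions made for illustration; the abstract does not state these values.

```python
import numpy as np

def pca_init(X, n_components=2, target_std=1e-4):
    """Project X onto its leading principal components and rescale.

    target_std is an assumed value: the embedding is shrunk so that the
    first component has this standard deviation before optimisation starts.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # centre each gene/feature
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    Y = U[:, :n_components] * S[:n_components]   # PCA scores
    return Y / Y[:, 0].std() * target_std

def heuristic_learning_rate(n_cells, exaggeration=12.0, floor=200.0):
    """Assumed heuristic: learning rate grows with the number of cells."""
    return max(floor, n_cells / exaggeration)

# Toy data standing in for a cells x genes expression matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
Y0 = pca_init(X)
```

The array `Y0` could then be passed as the initial embedding to a t-SNE implementation that accepts a user-supplied initialisation.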
biorxiv bioinformatics 100-200-users 2018
Automated optimized parameters for t-distributed stochastic neighbor embedding improve visualization and allow analysis of large datasets, bioRxiv, 2018-10-25
Accurate and comprehensive extraction of information from high-dimensional single cell datasets necessitates faithful visualizations to assess biological populations. A state-of-the-art algorithm for non-linear dimension reduction, t-SNE, requires multiple heuristics and fails to produce clear representations of datasets when millions of cells are projected. We developed opt-SNE, an automated toolkit for t-SNE parameter selection that utilizes Kullback-Leibler divergence evaluation in real time to tailor the early exaggeration and overall number of gradient descent iterations in a dataset-specific manner. The precise calibration of early exaggeration together with opt-SNE adjustment of gradient descent learning rate dramatically improves computation time and enables high-quality visualization of large cytometry and transcriptomics datasets, overcoming limitations of analysis tools with hard-coded parameters that often produce poorly resolved or misleading maps of fluorescent and mass cytometry data. In summary, opt-SNE enables superior data resolution in t-SNE space and thereby more accurate data interpretation.
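The idea of monitoring Kullback-Leibler divergence during optimisation can be sketched in a few lines of plain Python. This is not the opt-SNE implementation: the function names and the relative tolerance of 1e-4 are hypothetical, and the stopping rule shown (stop once the per-iteration improvement becomes negligible relative to the current value) is only one plausible way to turn a KL trace into a dataset-specific iteration count.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two discrete distributions given as sequences."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

def should_stop(kl_history, rel_tol=1e-4):
    """True once the latest KL improvement is below rel_tol * current KL.

    Also returns True if the divergence has stopped decreasing.
    rel_tol is an assumed value chosen for illustration.
    """
    if len(kl_history) < 2:
        return False
    prev, curr = kl_history[-2], kl_history[-1]
    return (prev - curr) < rel_tol * curr
```

In a gradient descent loop, one would append the current divergence to `kl_history` after each iteration and break out of the loop when `should_stop(kl_history)` becomes true.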
biorxiv bioinformatics 100-200-users 2018
Inference of recombination maps from a single pair of genomes and its application to archaic samples, bioRxiv, 2018-10-25
Abstract: Understanding the causes and consequences of recombination rate evolution is a fundamental goal in genetics that requires recombination maps from across the tree of life. Since statistical inference of recombination maps typically depends on large samples, extending such studies to non-model organisms requires alternative tools. Here we extend the sequentially Markovian coalescent model to jointly infer demography and the variation in recombination along a pair of genomes. Using extensive simulations and sequence data from humans, fruit-flies and a fungal pathogen, we demonstrate that iSMC accurately infers recombination maps under a wide range of scenarios – remarkably, even from a single pair of unphased genomes. We exploit this possibility and reconstruct the recombination maps of archaic hominids. We report that the evolution of the recombination landscape follows the established phylogeny of Neandertals, Denisovans and modern human populations, as expected if the genomic distribution of crossovers in hominids is largely neutral.
biorxiv evolutionary-biology 0-100-users 2018
Independent domestication events in the blue-cheese fungus Penicillium roqueforti, bioRxiv, 2018-10-24
Abstract: Domestication provides an excellent framework for studying adaptive divergence. Using population genomics and phenotypic assays, we reconstructed the domestication history of the blue cheese mold Penicillium roqueforti. We showed that this fungus was domesticated twice independently. The population used in Roquefort originated from an old domestication event associated with weak bottlenecks and exhibited traits beneficial for pre-industrial cheese production (slower growth in cheese and greater spore production on bread, the traditional multiplication medium). The other cheese population originated more recently from the selection of a single clonal lineage, was associated with all types of blue cheese worldwide except Roquefort, and displayed phenotypes better suited to industrial cheese production (high lipolytic activity, efficient cheese cavity colonization ability and salt tolerance). We detected genomic regions affected by recent positive selection and putative horizontal gene transfers. This study sheds light on the processes of rapid adaptation and raises questions about genetic resource conservation.
biorxiv evolutionary-biology 0-100-users 2018
Microbiota profiling with long amplicons using Nanopore sequencing: full-length 16S rRNA gene and whole rrn operon, bioRxiv, 2018-10-24
Background: Profiling the microbiome of low-biomass samples is challenging for metagenomics, since these samples are prone to contain DNA from other sources, such as the host or the environment. The usual approach is to sequence specific hypervariable regions of the 16S rRNA gene, which fails to assign taxonomy at the genus and species level. Here, we aim to assess long-amplicon PCR-based approaches for assigning taxonomy at the genus and species level. We use Nanopore sequencing with two different markers: the full-length 16S rRNA gene (~1,500 bp) and the whole rrn operon (16S rRNA gene - ITS - 23S rRNA gene; 4,500 bp).
Methods: We sequenced a clinical isolate of Staphylococcus pseudintermedius, two mock communities (HM-783D, Bei Resources; D6306, ZymoBIOMICS) and two pools of low-biomass samples (dog skin). Nanopore sequencing was performed on MinION (Oxford Nanopore Technologies) using the 1D PCR barcoding kit. Sequences were pre-processed, and data were analyzed using the WIMP workflow on EPI2ME (ONT) or Minimap2 software with the rrn database.
Results: Full-length 16S rRNA and the rrn operon retrieved the microbiota composition from the bacterial isolate, the mock communities and the complex skin samples, even at the genus and species level. For the Staphylococcus pseudintermedius isolate, when using EPI2ME, the amplicons were assigned to the correct bacterial species in ~98% of the cases with the rrn operon as the marker, and in ~68% of the cases with the 16S rRNA gene. In both skin microbiota samples, we detected many species with an environmental origin. In chin skin, we found different Pseudomonas species in high abundance, whereas in dorsal skin there were more taxa with lower abundances.
Conclusions: Both full-length 16S rRNA and the rrn operon retrieved the microbiota composition of simple and complex microbial communities, even from low-biomass samples such as dog skin. For increased resolution at the species level, the rrn operon would be the best choice.
biorxiv microbiology 100-200-users 2018
Origins and Evolution of the Global RNA Virome, bioRxiv, 2018-10-24
Abstract: Viruses with RNA genomes dominate the eukaryotic virome, reaching enormous diversity in animals and plants. The recent advances of metaviromics prompted us to perform a detailed phylogenomic reconstruction of the evolution of the dramatically expanded global RNA virome. The only universal gene among RNA viruses is the RNA-dependent RNA polymerase (RdRp). We developed an iterative computational procedure that alternates the RdRp phylogenetic tree construction with refinement of the underlying multiple sequence alignments. The resulting tree encompasses 4,617 RNA virus RdRps and consists of 5 major branches, 2 of which include positive-sense RNA viruses, 1 is a mix of positive-sense (+) RNA and double-stranded (ds) RNA viruses, and 2 consist of dsRNA and negative-sense (−) RNA viruses, respectively. This tree topology implies that dsRNA viruses evolved from +RNA viruses on at least two independent occasions, whereas −RNA viruses evolved from dsRNA viruses. Reconstruction of RNA virus evolution using the RdRp tree as the scaffold suggests that the last common ancestors of the major branches of +RNA viruses encoded only the RdRp and a single jelly-roll capsid protein. Subsequent evolution involved independent capture of additional genes, particularly, those encoding distinct RNA helicases, enabling replication of larger RNA genomes and facilitating virus genome expression and virus-host interactions. Phylogenomic analysis reveals extensive gene module exchange among diverse viruses and horizontal virus transfer between distantly related hosts. Although the network of evolutionary relationships within the RNA virome is bound to further expand, the present results call for a thorough reevaluation of the RNA virus taxonomy.
Importance: The majority of the diverse viruses infecting eukaryotes have RNA genomes, including numerous human, animal, and plant pathogens. Recent advances of metagenomics have led to the discovery of many new groups of RNA viruses in a wide range of hosts. These findings enable a far more complete reconstruction of the evolution of RNA viruses than what was attainable previously. This reconstruction reveals the relationships between different Baltimore Classes of viruses and indicates extensive transfer of viruses between distantly related hosts, such as plants and animals. These results call for a major revision of the existing taxonomy of RNA viruses.
biorxiv microbiology 100-200-users 2018