K-nearest neighbor smoothing for high-throughput single-cell RNA-Seq data, bioRxiv, 2017-12-06
High-throughput single-cell RNA-Seq (scRNA-Seq) is a powerful approach for studying heterogeneous tissues and dynamic cellular processes. However, compared to bulk RNA-Seq, single-cell expression profiles are extremely noisy, as they only capture a fraction of the transcripts present in the cell. Here, we propose the k-nearest neighbor smoothing (kNN-smoothing) algorithm, designed to reduce noise by aggregating information from similar cells (neighbors) in a computationally efficient and statistically tractable manner. The algorithm is based on the observation that across protocols, the technical noise exhibited by UMI-filtered scRNA-Seq data closely follows Poisson statistics. Smoothing is performed by first identifying the nearest neighbors of each cell in a step-wise fashion, based on partially smoothed and variance-stabilized expression profiles, and then aggregating their transcript counts. We show that kNN-smoothing greatly improves the detection of clusters of cells and co-expressed genes, and clearly outperforms other smoothing methods on simulated data. To accurately perform smoothing for datasets containing highly similar cell populations, we propose the kNN-smoothing 2 algorithm, in which neighbors are determined after projecting the partially smoothed data onto the first few principal components. We show that unlike its predecessor, kNN-smoothing 2 can accurately distinguish between cells from different T cell subsets, and enables their identification in peripheral blood using unsupervised methods. Our work facilitates the analysis of scRNA-Seq data across a broad range of applications, including the identification of cell populations in heterogeneous tissues and the characterization of dynamic processes such as cellular differentiation. Reference implementations of our algorithms can be found at https://github.com/yanailab/knn-smoothing.
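The step-wise procedure described in this abstract can be illustrated with a minimal sketch. It is not the authors' reference implementation: the function names, the doubling schedule for the neighbor count, the median-based normalization, and the choice of the Freeman-Tukey transform for variance stabilization are all assumptions made for illustration.

```python
import numpy as np


def freeman_tukey(x):
    # Variance-stabilizing transform for Poisson-like counts (assumed choice).
    return np.sqrt(x) + np.sqrt(x + 1.0)


def knn_smooth_sketch(counts, k):
    """Hypothetical sketch of step-wise kNN-smoothing.

    counts: genes x cells array of UMI counts (every cell needs >= 1 count).
    k: number of neighbors to aggregate with each cell.
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[1]
    k = min(k, n_cells - 1)
    smoothed = counts.copy()
    # Grow the neighborhood step by step: 1, 3, 7, ... up to k (assumed schedule).
    num_steps = int(np.ceil(np.log2(k + 1))) if k > 0 else 0
    for step in range(1, num_steps + 1):
        k_step = min(2 ** step - 1, k)
        # Scale each partially smoothed profile to the median total count.
        totals = smoothed.sum(axis=0)
        scaled = smoothed * (np.median(totals) / totals)
        profiles = freeman_tukey(scaled).T  # cells x genes
        # Pairwise squared Euclidean distances between cells.
        sq = (profiles ** 2).sum(axis=1)
        dist = sq[:, None] + sq[None, :] - 2.0 * profiles @ profiles.T
        np.fill_diagonal(dist, -np.inf)  # each cell is always its own nearest
        nearest = np.argsort(dist, axis=1)[:, : k_step + 1]
        # Aggregate the raw counts of each cell and its current neighbors.
        smoothed = np.stack(
            [counts[:, idx].sum(axis=1) for idx in nearest], axis=1
        )
    return smoothed
```

Neighbors are re-identified at every step from the partially smoothed profiles, while the aggregation always sums the original counts, so each output column is a sum of k + 1 raw profiles.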
biorxiv bioinformatics 0-100-users 2017
Testing the parasite mass burden effect on host behaviour alteration in the Schistocephalus-stickleback system, bioRxiv, 2017-12-06
Many parasites with complex life cycles modify their intermediate host’s behaviour, which has been proposed to increase transmission to their definitive host. This behavioural change could result from the parasite actively manipulating its host, but could also be explained by a mechanical effect, where the parasite’s physical presence affects host behaviour. We created an artificial internal parasite using silicone injections in the body cavity to test this mechanical effect hypothesis. We used the Schistocephalus solidus - threespine stickleback (Gasterosteus aculeatus) system, as this cestode can reach up to 92% of its fish host mass. Our results suggest that the mass burden brought by this macroparasite alone is not sufficient to cause behavioural changes in its host. Furthermore, our results show that wall-hugging (thigmotaxis), a measure of anxiety in vertebrates, is significantly reduced in Schistocephalus-infected sticklebacks, unveiling a new altered component of behaviour that may result from manipulation by this macroparasite.
biorxiv animal-behavior-and-cognition 0-100-users 2017
A quantitative model for characterizing the evolutionary history of mammalian gene expression, bioRxiv, 2017-12-05
Characterizing the evolutionary history of a gene’s expression profile is a critical component for understanding the relationship between genotype, expression, and phenotype. However, it is not well-established how best to distinguish the different evolutionary forces acting on gene expression. Here, we use RNA-seq across 7 tissues from 17 mammalian species to show that expression evolution across mammals is accurately modeled by the Ornstein-Uhlenbeck (OU) process. This stochastic process models expression trajectories across time as Gaussian distributions whose variance is parameterized by the rate of genetic drift and strength of stabilizing selection. We use these mathematical properties to identify expression pathways under neutral, stabilizing, and directional selection, and quantify the extent of selective pressure on a gene’s expression. We further detect deleterious expression levels outside expected evolutionary distributions in expression data from individual patients. Our work provides a statistical framework for interpreting expression data across species and in disease. One Sentence Summary: We demonstrate the power of a stochastic model for quantifying selective pressure on expression and estimating evolutionary distributions of optimal gene expression.
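The way an OU process yields Gaussian expression distributions can be sketched with the standard textbook parameterization dX = alpha (theta - X) dt + sigma dW. The mapping of sigma to the rate of drift, alpha to the strength of stabilizing selection, and theta to the optimal expression level is an assumption made for illustration, as are all function names; the abstract does not give the authors' notation or code.

```python
import math
import random


def ou_mean(t, x0, theta, alpha):
    # Expected expression after time t, starting from x0 and pulled toward theta.
    return theta + (x0 - theta) * math.exp(-alpha * t)


def ou_variance(t, alpha, sigma):
    # Variance after time t; it saturates at sigma^2 / (2 * alpha) as t grows.
    return sigma ** 2 / (2.0 * alpha) * (1.0 - math.exp(-2.0 * alpha * t))


def simulate_ou(x0, theta, alpha, sigma, t, n_lineages, seed=0):
    """Draw expression values for independent lineages after time t.

    Uses the exact Gaussian transition distribution of the OU process,
    so no time discretization is involved.
    """
    rng = random.Random(seed)
    mean = ou_mean(t, x0, theta, alpha)
    sd = math.sqrt(ou_variance(t, alpha, sigma))
    return [rng.gauss(mean, sd) for _ in range(n_lineages)]
```

Under this parameterization, stronger stabilizing selection (larger alpha) narrows the long-run distribution around theta, while faster drift (larger sigma) widens it, which is the sense in which the variance reflects both forces.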
biorxiv genomics 0-100-users 2017
Interpretation of biological experiments changes with evolution of Gene Ontology and its annotations, bioRxiv, 2017-12-04
Gene Ontology (GO) enrichment analysis is ubiquitously used for interpreting high-throughput molecular data and generating hypotheses about underlying biological phenomena of experiments. However, the two building blocks of this analysis, the ontology and the annotations, evolve rapidly. We used gene signatures derived from 104 disease analyses to systematically evaluate how enrichment analysis results were affected by evolution of the GO over a decade. We found low consistency between enrichment analysis results obtained with early and more recent GO versions. Furthermore, there continues to be strong annotation bias in the GO annotations, where 58% of the annotations are for 16% of the human genes. Our analysis suggests that GO evolution may have affected the interpretation and possibly reproducibility of experiments over time. Hence, researchers must exercise caution when interpreting GO enrichment analyses and should reexamine previous analyses with the most recent GO version.
biorxiv bioinformatics 0-100-users 2017
Phylogenomics places orphan protistan lineages in a novel eukaryotic super-group, bioRxiv, 2017-12-04
Recent phylogenetic analyses position certain ‘orphan’ protist lineages deep in the tree of eukaryotic life, but their exact placements are poorly resolved. We conducted phylogenomic analyses that incorporate deeply sequenced transcriptomes from representatives of collodictyonids (diphylleids), rigifilids, Mantamonas and ancyromonads (planomonads). Analyses of 351 genes, using site-heterogeneous mixture models, strongly support a novel supergroup-level clade that includes collodictyonids, rigifilids and Mantamonas, which we name ‘CRuMs’. Further, they robustly place CRuMs as the closest branch to Amorphea (including animals and fungi). Ancyromonads are strongly inferred to be more distantly related to Amorphea than are CRuMs. They emerge either as sister to malawimonads, or as a separate deeper branch. CRuMs and ancyromonads represent two distinct major groups that branch deeply on the lineage that includes animals, near the most commonly inferred root of the eukaryote tree. This makes both groups crucial in examinations of the deepest-level history of extant eukaryotes.
biorxiv evolutionary-biology 0-100-users 2017
Model-based detection and analysis of introgressed Neanderthal ancestry in modern humans, bioRxiv, 2017-12-02
Genetic evidence has revealed that the ancestors of modern human populations outside of Africa and their hominin sister groups, notably the Neanderthals, exchanged genetic material in the past. The distribution of these introgressed sequence tracts along modern-day human genomes provides insight into the ancient structure and migration patterns of these archaic populations. Furthermore, it facilitates studying the selective processes that lead to the accumulation or depletion of introgressed genetic variation. Recent studies have developed methods to localize these introgressed regions, reporting long regions that are depleted of Neanderthal introgression and enriched in genes, suggesting negative selection against the Neanderthal variants. On the other hand, enriched Neanderthal ancestry in hair- and skin-related genes suggests that some introgressed variants facilitated adaptation to new environments. Here, we present a model-based method called diCal-admix and apply it to detect tracts of Neanderthal introgression in modern humans. We demonstrate its efficiency and accuracy through extensive simulations. We use our method to detect introgressed regions in modern human individuals from the 1000 Genomes Project, using a high-coverage genome from a Neanderthal individual from the Altai mountains as reference. Our introgression detection results and findings concerning their functional implications are largely concordant with previous studies, and are consistent with weak selection against Neanderthal ancestry. We find some evidence that selection against Neanderthal ancestry was due to higher genetic load in Neanderthals, resulting from small effective population size, rather than Dobzhansky-Müller incompatibilities. Finally, we investigate the role of the X-chromosome in the divergence between Neanderthals and modern humans.
biorxiv evolutionary-biology 0-100-users 2017