K-nearest neighbor smoothing for high-throughput single-cell RNA-Seq data, bioRxiv, 2017-12-06
High-throughput single-cell RNA-Seq (scRNA-Seq) is a powerful approach for studying heterogeneous tissues and dynamic cellular processes. However, compared to bulk RNA-Seq, single-cell expression profiles are extremely noisy, as they only capture a fraction of the transcripts present in the cell. Here, we propose the k-nearest neighbor smoothing (kNN-smoothing) algorithm, designed to reduce noise by aggregating information from similar cells (neighbors) in a computationally efficient and statistically tractable manner. The algorithm is based on the observation that across protocols, the technical noise exhibited by UMI-filtered scRNA-Seq data closely follows Poisson statistics. Smoothing is performed by first identifying the nearest neighbors of each cell in a step-wise fashion, based on partially smoothed and variance-stabilized expression profiles, and then aggregating their transcript counts. We show that kNN-smoothing greatly improves the detection of clusters of cells and co-expressed genes, and clearly outperforms other smoothing methods on simulated data. To accurately perform smoothing for datasets containing highly similar cell populations, we propose the kNN-smoothing 2 algorithm, in which neighbors are determined after projecting the partially smoothed data onto the first few principal components. We show that unlike its predecessor, kNN-smoothing 2 can accurately distinguish between cells from different T cell subsets, and enables their identification in peripheral blood using unsupervised methods. Our work facilitates the analysis of scRNA-Seq data across a broad range of applications, including the identification of cell populations in heterogeneous tissues and the characterization of dynamic processes such as cellular differentiation. Reference implementations of our algorithms can be found at https://github.com/yanailab/knn-smoothing.
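The step-wise smoothing described in this abstract can be illustrated with a minimal sketch. It is not the authors' reference implementation. The function names, the median-based normalization, the Freeman-Tukey transform as the variance-stabilizing step, and the schedule in which the neighbor count grows as 1, 3, 7, ... up to k are all assumptions made for illustration; the abstract only states that neighbors are found step-wise on partially smoothed, variance-stabilized profiles and that transcript counts are then aggregated.

```python
import math


def freeman_tukey(x):
    """Variance-stabilizing transform for Poisson-like counts (assumed choice)."""
    return math.sqrt(x) + math.sqrt(x + 1.0)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def knn_smooth(counts, k):
    """Hypothetical sketch of step-wise kNN smoothing.

    counts: list of cells, each a list of per-gene UMI counts.
    Returns, for every cell, the summed raw counts of the cell itself
    and its k nearest neighbors.
    """
    n_cells = len(counts)
    n_genes = len(counts[0]) if n_cells else 0
    k = min(k, n_cells - 1)
    smoothed = [list(cell) for cell in counts]
    step_k = 0
    while step_k < k:
        # Assumed schedule: the neighbor count grows 1, 3, 7, ... capped at k.
        step_k = min(2 * step_k + 1, k)
        # Scale the partially smoothed profiles to a common total, then
        # apply the variance-stabilizing transform.
        totals = [sum(cell) for cell in smoothed]
        target = sorted(totals)[n_cells // 2]
        profiles = []
        for cell, total in zip(smoothed, totals):
            scale = target / total if total > 0 else 0.0
            profiles.append([freeman_tukey(v * scale) for v in cell])
        # Find neighbors on the transformed profiles, but always aggregate
        # the original raw counts.
        updated = []
        for i in range(n_cells):
            others = [j for j in range(n_cells) if j != i]
            others.sort(key=lambda j: (squared_distance(profiles[i], profiles[j]), j))
            group = [i] + others[:step_k]
            updated.append([sum(counts[j][g] for j in group) for g in range(n_genes)])
        smoothed = updated
    return smoothed


# Four cells, two genes: cells 0 and 1 resemble each other, as do cells 2 and 3.
example = [[10, 0], [12, 1], [0, 9], [1, 11]]
print(knn_smooth(example, 1))  # -> [[22, 1], [22, 1], [1, 20], [1, 20]]
```

Aggregating raw counts rather than transformed values keeps the output on the count scale, which is consistent with the abstract's emphasis on Poisson-distributed technical noise: a sum of independent Poisson counts is again Poisson.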
biorxiv bioinformatics 0-100-users 2017
Rethinking phylogenetic comparative methods, bioRxiv, 2017-12-06
As a result of the process of descent with modification, closely related species tend to be similar to one another in a myriad different ways. In statistical terms, this means that traits measured on one species will not be independent of traits measured on others. Since their introduction in the 1980s, phylogenetic comparative methods (PCMs) have been framed as a solution to this problem. In this paper, we argue that this way of thinking about PCMs is deeply misleading. Not only has this sowed widespread confusion in the literature about what PCMs are doing, but it has also led us to develop methods that are susceptible to the very thing we sought to build defenses against: unreplicated evolutionary events. Through three Case Studies, we demonstrate that the susceptibility to singular events is indeed a recurring problem in comparative biology that links several seemingly unrelated controversies. In each Case Study we propose a potential solution to the problem. While the details of our proposed solutions differ, they share a common theme: unifying hypothesis testing with data-driven approaches (which we term phylogenetic natural history) to disentangle the impact of singular evolutionary events from that of the factors we are investigating. More broadly, we argue that our field has, at times, been sloppy when weighing evidence in support of causal hypotheses. We suggest that one way to refine our inferences is to re-imagine phylogenies as probabilistic graphical models; adopting this way of thinking will help clarify precisely what we are testing and what evidence supports our claims.
biorxiv evolutionary-biology 100-200-users 2017
Testing the parasite mass burden effect on host behaviour alteration in the Schistocephalus-stickleback system, bioRxiv, 2017-12-06
Many parasites with complex life cycles modify their intermediate host’s behaviour, which has been proposed to increase transmission to their definitive host. This behavioural change could result from the parasite actively manipulating its host, but could also be explained by a mechanical effect, where the parasite’s physical presence affects host behaviour. We created an artificial internal parasite using silicone injections in the body cavity to test this mechanical effect hypothesis. We used the Schistocephalus solidus - threespine stickleback (Gasterosteus aculeatus) system, as this cestode can reach up to 92% of its fish host mass. Our results suggest that the mass burden brought by this macroparasite alone is not sufficient to cause behavioural changes in its host. Furthermore, our results show that wall-hugging (thigmotaxis), a measure of anxiety in vertebrates, is significantly reduced in Schistocephalus-infected sticklebacks, unveiling a new altered component of behaviour that may result from manipulation by this macroparasite.
biorxiv animal-behavior-and-cognition 0-100-users 2017
A quantitative model for characterizing the evolutionary history of mammalian gene expression, bioRxiv, 2017-12-05
Characterizing the evolutionary history of a gene’s expression profile is a critical component for understanding the relationship between genotype, expression, and phenotype. However, it is not well-established how best to distinguish the different evolutionary forces acting on gene expression. Here, we use RNA-seq across 7 tissues from 17 mammalian species to show that expression evolution across mammals is accurately modeled by the Ornstein-Uhlenbeck (OU) process. This stochastic process models expression trajectories across time as Gaussian distributions whose variance is parameterized by the rate of genetic drift and strength of stabilizing selection. We use these mathematical properties to identify expression pathways under neutral, stabilizing, and directional selection, and quantify the extent of selective pressure on a gene’s expression. We further detect deleterious expression levels outside expected evolutionary distributions in expression data from individual patients. Our work provides a statistical framework for interpreting expression data across species and in disease.
One Sentence Summary: We demonstrate the power of a stochastic model for quantifying selective pressure on expression and estimating evolutionary distributions of optimal gene expression.
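To make the Gaussian description of the OU process concrete, the sketch below computes the textbook mean and variance of an OU process after time t. The parameterization is an assumption for illustration and is not taken from the abstract: theta is the optimal expression level, alpha the strength of stabilizing selection, sigma the drift rate, and x0 the ancestral expression level. The function name is hypothetical.

```python
import math


def ou_mean_variance(x0, theta, alpha, sigma, t):
    """Mean and variance of an OU process at time t, starting from x0.

    Assumed model: dX = alpha * (theta - X) dt + sigma dW.
    With alpha == 0 the process reduces to Brownian motion (pure drift).
    """
    mean = theta + (x0 - theta) * math.exp(-alpha * t)
    if alpha == 0:
        variance = sigma ** 2 * t
    else:
        variance = sigma ** 2 / (2.0 * alpha) * (1.0 - math.exp(-2.0 * alpha * t))
    return mean, variance


# Under stabilizing selection the variance saturates at sigma^2 / (2 * alpha);
# under pure drift it keeps growing linearly with time.
print(ou_mean_variance(x0=0.0, theta=1.0, alpha=0.5, sigma=1.0, t=10.0))
print(ou_mean_variance(x0=0.0, theta=1.0, alpha=0.0, sigma=1.0, t=10.0))
```

In this parameterization the long-run variance sigma^2 / (2 * alpha) shrinks as selection strengthens, which is one way to read the abstract's statement that the variance is governed jointly by drift and stabilizing selection.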
biorxiv genomics 0-100-users 2017
Assessing the Landscape of U.S. Postdoctoral Salaries, bioRxiv, 2017-12-04
Purpose: Postdocs make up a significant portion of the biomedical workforce. However, data about the postdoctoral position are generally scarce, including salary data. The purpose of this study was to request, obtain and interpret actual salaries, and the associated job titles, for postdocs at U.S. public institutions. Methodology: Freedom of Information Act requests were submitted to U.S. public institutions estimated to have at least 300 postdocs according to the National Science Foundation’s Survey of Graduate Students and Postdocs. Salaries and job titles of postdoctoral employees as of December 1st, 2016 were requested. Findings: Salaries and job titles for over 13,000 postdocs at 52 public U.S. institutions and 1 private institution around the date of December 1st, 2016 were received, and individual postdoc names were also received for approximately 7,000 postdocs. This study shows evidence of gender-related salary discrepancies, a significant influence of job title description on postdoc salary, and a complex relationship between salaries and the level of institutional NIH funding. Value: These results provide insights into the ability of institutions to collate actual payroll-type data related to their postdocs, highlighting difficulties faced in tracking and reporting data on this population. Ultimately, these types of efforts, aimed at increasing transparency, may lead to improved tracking and support for postdocs at all U.S. institutions.
biorxiv scientific-communication-and-education 100-200-users 2017
Interpretation of biological experiments changes with evolution of Gene Ontology and its annotations, bioRxiv, 2017-12-04
Gene Ontology (GO) enrichment analysis is ubiquitously used for interpreting high throughput molecular data and generating hypotheses about underlying biological phenomena of experiments. However, the two building blocks of this analysis (the ontology and the annotations) evolve rapidly. We used gene signatures derived from 104 disease analyses to systematically evaluate how enrichment analysis results were affected by evolution of the GO over a decade. We found low consistency between enrichment analysis results obtained with early and more recent GO versions. Furthermore, there continues to be strong annotation bias in the GO annotations, where 58% of the annotations are for 16% of the human genes. Our analysis suggests that GO evolution may have affected the interpretation and possibly reproducibility of experiments over time. Hence, researchers must exercise caution when interpreting GO enrichment analyses and should reexamine previous analyses with the most recent GO version.
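One way to see why changing annotations can change an enrichment result is a small over-representation calculation. The sketch below uses the one-sided hypergeometric test, a common choice for GO enrichment; the abstract does not say which test the authors used, so this is an assumption. The function name and all gene counts are hypothetical and serve only to illustrate the effect.

```python
from math import comb


def enrichment_p_value(n_background, n_annotated, n_signature, n_overlap):
    """One-sided hypergeometric p-value for term over-representation.

    n_background: genes in the background set
    n_annotated:  background genes annotated to the GO term
    n_signature:  genes in the signature being tested
    n_overlap:    signature genes annotated to the term
    Returns the probability of an overlap at least this large by chance.
    """
    upper = min(n_annotated, n_signature)
    tail = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_signature - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / comb(n_background, n_signature)


# Same hypothetical signature and overlap, scored against two hypothetical
# GO releases in which the term's annotated gene set has grown.
early = enrichment_p_value(n_background=1000, n_annotated=20, n_signature=50, n_overlap=5)
later = enrichment_p_value(n_background=1000, n_annotated=80, n_signature=50, n_overlap=5)
print(early, later)
```

With the overlap held fixed, a term that gains annotations between releases yields a larger p-value, so a term called significant under an older release may no longer pass the same threshold under a newer one.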
biorxiv bioinformatics 0-100-users 2017