Correcting batch effects in single-cell RNA sequencing data by matching mutual nearest neighbours, bioRxiv, 2017-07-19
Abstract: The presence of batch effects is a well-known problem in experimental data analysis, and single-cell RNA sequencing (scRNA-seq) is no exception. Large-scale scRNA-seq projects that generate data from different laboratories and at different times are rife with batch effects that can fatally compromise integration and interpretation of the data. In such cases, computational batch correction is critical for eliminating uninteresting technical factors and obtaining valid biological conclusions. However, existing methods assume that the composition of cell populations is either known or the same across batches. Here, we present a new strategy for batch correction based on the detection of mutual nearest neighbours in the high-dimensional expression space. Our approach does not rely on pre-defined or equal population compositions across batches, only requiring that a subset of the population be shared between batches. We demonstrate the superiority of our approach over existing methods on a range of simulated and real scRNA-seq data sets. We also show how our method can be applied to integrate scRNA-seq data from two separate studies of early embryonic development.
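The mutual nearest neighbour idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy coordinates, the use of Euclidean distance, and the choice of `k` are all assumptions made for illustration. A pair of cells (one from each batch) is kept only when each cell is among the other's `k` nearest neighbours in the opposite batch, so a population present in only one batch tends to stay unpaired.

```python
import math

def k_nearest(query, reference, k):
    """For each point in `query`, return the set of indices of its
    k nearest points in `reference` (Euclidean distance, assumed here)."""
    result = []
    for q in query:
        order = sorted(range(len(reference)),
                       key=lambda j: math.dist(q, reference[j]))
        result.append(set(order[:k]))
    return result

def mutual_nearest_pairs(batch_a, batch_b, k=1):
    """Return (i, j) index pairs where cell i of batch_a and cell j of
    batch_b are each within the other's k nearest neighbours."""
    a_to_b = k_nearest(batch_a, batch_b, k)  # neighbours of A cells in B
    b_to_a = k_nearest(batch_b, batch_a, k)  # neighbours of B cells in A
    return [(i, j)
            for i in range(len(batch_a))
            for j in sorted(a_to_b[i])
            if i in b_to_a[j]]

# Toy data: batch B repeats two of batch A's populations with a small
# shift; the third cell in batch A has no counterpart in batch B.
batch_a = [(0.0, 0.0), (10.0, 10.0), (50.0, 50.0)]
batch_b = [(1.0, 1.0), (11.0, 11.0)]

pairs = mutual_nearest_pairs(batch_a, batch_b, k=1)
print(pairs)  # [(0, 0), (1, 1)]
```

In this toy case the third cell of batch A picks a neighbour in batch B, but that neighbour does not pick it back, so no pair is formed. This mirrors the abstract's point that only a subset of the population needs to be shared between batches.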
biorxiv bioinformatics 0-100-users 2017
Integrated analysis of single cell transcriptomic data across conditions, technologies, and species, bioRxiv, 2017-07-19
Abstract: Single cell RNA-seq (scRNA-seq) has emerged as a transformative tool to discover and define cellular phenotypes. While computational scRNA-seq methods are currently well suited for experiments representing a single condition, technology, or species, analyzing multiple datasets simultaneously raises new challenges. In particular, traditional analytical workflows struggle to align subpopulations that are present across datasets, limiting the possibility for integrated or comparative analysis. Here, we introduce a new computational strategy for scRNA-seq alignment, utilizing common sources of variation to identify shared subpopulations between datasets as part of our R toolkit Seurat. We demonstrate our approach by aligning scRNA-seq datasets of PBMCs under resting and stimulated conditions, hematopoietic progenitors sequenced across two profiling technologies, and pancreatic cell ‘atlases’ generated from human and mouse islets. In each case, we learn distinct or transitional cell states jointly across datasets, and can identify subpopulations that could not be detected by analyzing datasets independently. We anticipate that these methods will serve not only to correct for batch or technology-dependent effects, but also to facilitate general comparisons of scRNA-seq datasets, potentially deepening our understanding of how distinct cell states respond to perturbation, disease, and evolution. Availability: Installation instructions, documentation, and tutorials are available at http://www.satijalab.org/seurat
biorxiv genomics 100-200-users 2017
Genomics of Mesolithic Scandinavia reveal colonization routes and high-latitude adaptation, bioRxiv, 2017-07-18
Abstract: Scandinavia was one of the last geographic areas in Europe to become habitable for humans after the last glaciation. However, the origin(s) of the first colonizers and their migration routes remain unclear. We sequenced the genomes, up to 57x coverage, of seven hunter-gatherers excavated across Scandinavia and dated to 9,500-6,000 years before present. Surprisingly, among the Scandinavian Mesolithic individuals, the genetic data display an east-west genetic gradient that opposes the pattern seen in other parts of Mesolithic Europe. This result suggests that Scandinavia was initially colonized following two different routes: one from the south, the other from the northeast. The latter followed the ice-free Norwegian north Atlantic coast, along which novel and advanced pressure-blade stone-tool techniques may have spread. These two groups met and mixed in Scandinavia, creating a genetically diverse population, which shows patterns of genetic adaptation to high latitude environments. These adaptations include high frequencies of low pigmentation variants and a gene-region associated with physical performance, which shows strong continuity into modern-day northern Europeans.
biorxiv genomics 0-100-users 2017
A comprehensive map of genetic variation in the world’s largest ethnic group - Han Chinese, bioRxiv, 2017-07-14
Abstract: As are most non-European populations around the globe, the Han Chinese are relatively understudied in population and medical genetics studies. From low-coverage whole-genome sequencing of 11,670 Han Chinese women we present a catalog of 25,057,223 variants, including 548,401 novel variants that are seen at least 10 times in our dataset. Individuals from our study come from 19 out of 22 provinces across China, allowing us to study population structure, genetic ancestry, and local adaptation in Han Chinese. We identify previously unrecognized population structure along the East-West axis of China and report unique signals of admixture across geographical space, such as European influences among the Northwestern provinces of China. Finally, we identified a number of highly differentiated loci, indicative of local adaptation in the Han Chinese. In particular, we detected extreme differentiation among the Han Chinese at MTHFR, ADH7, and FADS loci, suggesting that these loci may not be specifically selected in Tibetan and Inuit populations as previously suggested. On the other hand, we find that Neandertal ancestry does not vary significantly across the provinces, consistent with admixture prior to the dispersal of modern Han Chinese. Furthermore, contrary to a previous report, Neandertal ancestry does not explain a significant amount of heritability in depression. Our findings provide the largest genetic data set so far made available for Han Chinese and provide insights into the history and population structure of the world’s largest ethnic group.
biorxiv genetics 100-200-users 2017
Pan-cancer analysis of whole genomes, bioRxiv, 2017-07-13
We report the integrative analysis of more than 2,600 whole cancer genomes and their matching normal tissues across 39 distinct tumour types. By studying whole genomes we have been able to catalogue non-coding cancer driver events, study patterns of structural variation, infer tumour evolution, probe the interactions among variants in the germline genome, the tumour genome and the transcriptome, and derive an understanding of how coding and non-coding variations together contribute to driving individual patients' tumours. This work represents the most comprehensive look at cancer whole genomes to date. NOTE TO READERS: This is an incomplete draft of the marker paper for the Pan-Cancer Analysis of Whole Genomes Project, and is intended to provide the background information for a series of in-depth papers that will be posted to bioRxiv during the summer of 2017.
biorxiv cancer-biology 0-100-users 2017
Why Does the Neocortex Have Columns? A Theory of Learning the Structure of the World, bioRxiv, 2017-07-13
Abstract: Neocortical regions are organized into columns and layers. Connections between layers run mostly perpendicular to the surface, suggesting a columnar functional organization. Some layers have long-range excitatory lateral connections, suggesting interactions between columns. Similar patterns of connectivity exist in all regions but their exact roles remain a mystery. In this paper, we propose a network model composed of columns and layers that performs robust object learning and recognition. Each column integrates its changing input over time to learn complete predictive models of observed objects. Excitatory lateral connections across columns allow the network to more rapidly infer objects based on the partial knowledge of adjacent columns. Because columns integrate input over time and space, the network learns models of complex objects that extend well beyond the receptive field of individual cells. Our network model introduces a new feature to cortical columns. We propose that a representation of location relative to the object being sensed is calculated within the sub-granular layers of each column. The location signal is provided as an input to the network, where it is combined with sensory data. Our model contains two layers and one or more columns. Simulations show that, using Hebbian-like learning rules, small single-column networks can learn to recognize hundreds of objects, with each object containing tens of features. Multi-column networks recognize objects with significantly fewer movements of the sensory receptors. Given the ubiquity of columnar and laminar connectivity patterns throughout the neocortex, we propose that columns and regions have more powerful recognition and modeling capabilities than previously assumed.
biorxiv neuroscience 100-200-users 2017