High throughput single cell RNA-seq of developing mouse kidney and human kidney organoids reveals a roadmap for recreating the kidney, bioRxiv, 2017-12-17
Abstract: Recent advances in our capacity to differentiate human pluripotent stem cells to human kidney tissue are moving the field closer to novel approaches for renal replacement. Such protocols have relied upon our current understanding of the molecular basis of mammalian kidney morphogenesis. To date this has depended upon population-based profiling of non-homogeneous cellular compartments. In order to improve our resolution of individual cell transcriptional profiles during kidney morphogenesis, we have performed 10x Chromium single cell RNA-seq on over 6000 cells from the E18.5 developing mouse kidney, as well as more than 7000 cells from human iPSC-derived kidney organoids. We identified 16 clusters of cells representing all major cell lineages in the E18.5 mouse kidney. The differentially expressed genes from individual murine clusters were then used to guide the classification of 16 cell clusters within human kidney organoids, revealing the presence of distinguishable stromal, endothelial, nephron, podocyte and nephron progenitor populations. Despite the congruence between developing mouse and human organoid, our analysis suggested limited nephron maturation and the presence of ‘off target’ populations in human kidney organoids, including unidentified stromal populations and evidence of neural clusters. This may reflect unique human kidney populations, mixed cultures or aberrant differentiation in vitro. Analysis of clusters within the mouse data revealed novel insights into progenitor maintenance and cellular maturation in the major renal lineages and will serve as a roadmap to refine directed differentiation approaches in human iPSC-derived kidney organoids.
biorxiv developmental-biology 0-100-users 2017
Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-seq, bioRxiv, 2017-12-17
Abstract: Systematic measurement biases make data normalization an essential preprocessing step in single-cell RNA sequencing (scRNA-seq) analysis. There may be multiple, competing considerations behind the assessment of normalization performance, some of them study-specific. Because normalization can have a large impact on downstream results (e.g., clustering and differential expression), it is critically important that practitioners assess the performance of competing methods. We have developed scone, a flexible framework for assessing normalization performance based on a comprehensive panel of data-driven metrics. Through graphical summaries and quantitative reports, scone summarizes performance trade-offs and ranks large numbers of normalization methods by aggregate panel performance. The method is implemented in the open-source Bioconductor R software package scone. We demonstrate the effectiveness of scone on a collection of scRNA-seq datasets, generated with different protocols, including Fluidigm C1 and 10x platforms. We show that top-performing normalization methods lead to better agreement with independent validation data.
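To make "ranks ... by aggregate panel performance" concrete, here is a minimal sketch of one generic way to aggregate a panel of metrics: rank the methods within each metric, then sort by mean rank. It is an illustration only and is not taken from the scone package. The function names, the method names and the scores are all made up, and every metric is assumed to be oriented so that higher is better.

```python
from statistics import mean


def rank_descending(values):
    """Return 1-based ranks where the largest value gets rank 1.

    Tied values share the average of the ranks they span.
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all entries tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def rank_methods(scores):
    """Rank methods by their mean rank across a panel of metrics.

    scores: dict mapping a (hypothetical) method name to a list of metric
    scores, one per metric, all oriented so that higher is better.
    Returns (methods sorted best-first, dict of mean ranks).
    """
    methods = list(scores)
    n_metrics = len(scores[methods[0]])
    per_metric = [
        rank_descending([scores[m][k] for m in methods]) for k in range(n_metrics)
    ]
    mean_rank = {m: mean(r[i] for r in per_metric) for i, m in enumerate(methods)}
    return sorted(methods, key=lambda m: mean_rank[m]), mean_rank


# Made-up scores for three hypothetical methods on three metrics.
toy_scores = {
    "method_a": [0.9, 0.2, 0.5],
    "method_b": [0.7, 0.8, 0.6],
    "method_c": [0.1, 0.5, 0.4],
}
ranking, mean_rank = rank_methods(toy_scores)
print(ranking)
```

Averaging ranks rather than raw scores keeps metrics on different scales from dominating the aggregate, at the cost of discarding how large the differences between methods are.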
biorxiv genomics 100-200-users 2017
Phylofactorization: a graph-partitioning algorithm to identify phylogenetic scales of ecological data, bioRxiv, 2017-12-17
Abstract: The problem of pattern and scale is a central challenge in ecology. The problem of scale is central to community ecology, where functional ecological groups are aggregated and treated as a unit underlying an ecological pattern, such as aggregation of “nitrogen fixing trees” into a total abundance of a trait underlying ecosystem physiology. With the emergence of massive community ecological datasets, from microbiomes to breeding bird surveys, there is a need to objectively identify the scales of organization pertaining to well-defined patterns in community ecological data. The phylogeny is a scaffold for identifying key phylogenetic scales associated with macroscopic patterns. Phylofactorization was developed to objectively identify phylogenetic scales underlying patterns in relative abundance data. However, many ecological data, such as presence-absences and counts, are not relative abundances, yet it is still desirable and informative to identify phylogenetic scales underlying a pattern of interest. Here, we generalize phylofactorization beyond relative abundances to a graph-partitioning algorithm for any community ecological data. Generalizing phylofactorization connects many tools from data analysis to phylogenetically-informed analysis of community ecological data. Two-sample tests identify three phylogenetic factors of mammalian body mass which arose during the K-Pg extinction event, consistent with other analyses of mammalian body mass evolution. Projection of data onto coordinates defined by the phylogeny yields a phylogenetic principal components analysis which refines our understanding of the major sources of variation in the human gut microbiome. These same coordinates allow generalized additive modeling of microbes in Central Park soils and confirm that a large clade of Acidobacteria thrives in neutral soils. Generalized linear and additive modeling of exponential family random variables can be performed by phylogenetically-constrained reduced-rank regression or stepwise factor contrasts. We finish with a discussion of how phylofactorization produces an ecological species concept with a phylogenetic constraint. All of these tools can be implemented with a new R package available online.
biorxiv ecology 0-100-users 2017
Amplicon sequencing of the 16S-ITS-23S rRNA operon with long-read technology for improved phylogenetic classification of uncultured prokaryotes, bioRxiv, 2017-12-16
Abstract: Amplicon sequencing of the 16S rRNA gene is the predominant method to quantify microbial compositions of environmental samples and to discover previously unknown lineages. Its unique structure of interspersed conserved and variable regions is an excellent target for PCR and allows for classification of reads at all taxonomic levels. However, the relatively few phylogenetically informative sites prevent confident phylogenetic placements of novel lineages that are deep branching relative to reference taxa. This problem is exacerbated when only short 16S rRNA gene fragments are sequenced. To resolve their placement, it is common practice to gather more informative sites by combining multiple conserved genes into concatenated datasets. This, however, requires genomic data which may be obtained through relatively expensive metagenome sequencing and computationally demanding analyses. Here we develop a protocol that amplifies a large part of 16S and 23S rRNA genes within the rRNA operon, including the ITS region, and sequences the amplicons with PacBio long-read technology. We tested our method with a synthetic mock community and developed a read curation pipeline that reduces the overall error rate to 0.18%. Applying our method on four diverse environmental samples, we were able to capture near full-length rRNA operon amplicons from a large diversity of prokaryotes. Phylogenetic trees constructed with these sequences showed an increase in statistical support compared to trees inferred with shorter, Illumina-like sequences using only the 16S rRNA gene (250 bp). Our method is a cost-effective solution to generate high quality, near full-length 16S and 23S rRNA gene sequences from environmental prokaryotes.
biorxiv microbiology 0-100-users 2017
High accuracy measurements of nanometer-scale distances between fluorophores at the single-molecule level, bioRxiv, 2017-12-16
To uncover the mechanisms of molecular machines it is useful to probe their structural conformations. Single-molecule Förster resonance energy transfer (smFRET) is a powerful tool for measuring intra-molecular shape changes of single molecules, but is confined to distances of 2-8 nm. Current super-resolution measurements are error prone at <25 nm. Thus, reliable high-throughput distance information between 8-25 nm is currently difficult to achieve. Here, we describe methods that utilize information about localization and imaging errors to measure distances between two differently colored fluorophores with ∼1 nm accuracy at any distance >2 nm, using a standard TIRF microscope and open-source software. We applied our two-color localization method to uncover a ∼4 nm conformational change in the “stalk” of the motor protein dynein, revealing unexpected flexibility in this antiparallel coiled-coil domain. These new methods enable high-accuracy distance measurements of single molecules that can be used over a wide range of length scales.
biorxiv biophysics 0-100-users 2017
Real-time search of all bacterial and viral genomic data, bioRxiv, 2017-12-16
Genome sequencing of pathogens is now ubiquitous in microbiology, and the sequence archives are effectively no longer searchable for arbitrary sequences. Furthermore, the exponential increase of these archives is likely to be further spurred by automated diagnostics. To unlock their use for scientific research and real-time surveillance we have combined knowledge about bacterial genetic variation with ideas used in web-search, to build a DNA search engine for microbial data that can grow incrementally. We indexed the complete global corpus of bacterial and viral whole genome sequence data (447,833 genomes), using four orders of magnitude less storage than previous methods. The method allows future scaling to millions of genomes. This renders the global archive accessible to sequence search, which we demonstrate with three applications: ultra-fast search for resistance genes MCR1-3, analysis of host-range for 2827 plasmids, and quantification of the rise of antibiotic resistance prevalence in the sequence archives.
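As a rough illustration of what borrowing "ideas used in web-search" for sequence data can look like, the sketch below builds a plain inverted index from k-mers to genome names and supports adding genomes one at a time. The abstract does not describe the authors' data structure, so this is an assumed, simplified stand-in: it is not their method and would not achieve the storage savings they report. All names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def add_genome(index, name, seq, k):
    """Add one genome to the index in place (supports incremental growth)."""
    for km in kmers(seq, k):
        index[km].add(name)


def search(index, query, k):
    """Return names of genomes that contain every k-mer of the query.

    Note: sharing all k-mers is necessary but not sufficient for the query
    to occur as one contiguous stretch, so hits are candidates to verify.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        return set()
    hits = None
    for km in query_kmers:
        found = index.get(km, set())
        hits = set(found) if hits is None else hits & found
        if not hits:
            return set()
    return hits


# Toy sequences, invented for illustration only.
K = 4
index = defaultdict(set)
toy_genomes = {
    "g1": "ACGTACGTTAGC",
    "g2": "TTAGCCGATACG",
    "g3": "GGGGCCCCAAAA",
}
for name, seq in toy_genomes.items():
    add_genome(index, name, seq, K)

before = search(index, "ACGTA", K)
add_genome(index, "g4", "CCGATACGTA", K)  # index grows without a rebuild
after = search(index, "ACGTA", K)
print(sorted(before), sorted(after))
```

The design choice to key the index on k-mers rather than on whole genomes is what lets a query touch only the entries for its own k-mers, in the same way a text search engine looks up only the query's words.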
biorxiv bioinformatics 500+-users 2017