Quantitative analysis of population-scale family trees using millions of relatives, bioRxiv, 2017-02-08
Abstract: Family trees have vast applications in multiple fields from genetics to anthropology and economics. However, the collection of extended family trees is tedious and usually relies on resources with limited geographical scope and complex data usage restrictions. Here, we collected 86 million profiles from publicly-available online data from genealogy enthusiasts. After extensive cleaning and validation, we obtained population-scale family trees, including a single pedigree of 13 million individuals. We leveraged the data to partition the genetic architecture of longevity by inspecting millions of relative pairs and to provide insights to population genetics theories on the dispersion of families. We also report a simple digital procedure to overlay other datasets with our resource in order to empower studies with population-scale genealogical data.
One Sentence Summary: Using massive crowd-sourced genealogy data, we created a population-scale family tree resource for scientific studies.
biorxiv genomics 100-200-users 2017
Comprehensive single cell transcriptional profiling of a multicellular organism by combinatorial indexing, bioRxiv, 2017-02-03
Abstract: Conventional methods for profiling the molecular content of biological samples fail to resolve heterogeneity that is present at the level of single cells. In the past few years, single cell RNA sequencing has emerged as a powerful strategy for overcoming this challenge. However, its adoption has been limited by a paucity of methods that are at once simple to implement and cost effective to scale massively. Here, we describe a combinatorial indexing strategy to profile the transcriptomes of large numbers of single cells or single nuclei without requiring the physical isolation of each cell (Single cell Combinatorial Indexing RNA-seq or sci-RNA-seq). We show that sci-RNA-seq can be used to efficiently profile the transcriptomes of tens-of-thousands of single cells per experiment, and demonstrate that we can stratify cell types from these data. Key advantages of sci-RNA-seq over contemporary alternatives such as droplet-based single cell RNA-seq include sublinear cost scaling, a reliance on widely available reagents and equipment, the ability to concurrently process many samples within a single workflow, compatibility with methanol fixation of cells, cell capture based on DNA content rather than cell size, and the flexibility to profile either cells or nuclei. As a demonstration of sci-RNA-seq, we profile the transcriptomes of 42,035 single cells from C. elegans at the L2 stage, effectively 50-fold “shotgun cellular coverage” of the somatic cell composition of this organism at this stage. We identify 27 distinct cell types, including rare cell types such as the two distal tip cells of the developing gonad, estimate consensus expression profiles and define cell-type specific and selective genes. Given that C. elegans is the only organism with a fully mapped cellular lineage, these data represent a rich resource for future methods aimed at defining cell types and states. They will advance our understanding of developmental biology, and constitute a major step towards a comprehensive, single-cell molecular atlas of a whole animal.
biorxiv genomics 200-500-users 2017
Scaling single cell transcriptomics through split pool barcoding, bioRxiv, 2017-02-03
Constructing an atlas of cell types in complex organisms will require a collective effort to characterize billions of individual cells. Single cell RNA sequencing (scRNA-seq) has emerged as the main tool for characterizing cellular diversity, but current methods use custom microfluidics or microwells to compartmentalize single cells, limiting scalability and widespread adoption. Here we present Split Pool Ligation-based Transcriptome sequencing (SPLiT-seq), a scRNA-seq method that labels the cellular origin of RNA through combinatorial indexing. SPLiT-seq is compatible with fixed cells, scales exponentially, uses only basic laboratory equipment, and costs one cent per cell. We used this approach to analyze 109,069 single cell transcriptomes from an entire postnatal day 5 mouse brain, providing the first global snapshot at this stage of development. We identified 13 main populations comprising different types of neurons, glia, immune cells, endothelia, as well as types in the blood-brain-barrier. Moreover, we resolve substructure within these clusters corresponding to cells at different stages of development. As sequencing capacity increases, SPLiT-seq will enable profiling of billions of cells in a single experiment.
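The claim that combinatorial indexing "scales exponentially" can be illustrated with a little arithmetic: each round of split-pool barcoding multiplies the number of distinct barcode combinations by the number of barcodes used in that round. The following is a minimal Python sketch, not the authors' code. The 96 barcodes per round, the number of rounds, and the uniform random assignment of cells to barcodes are illustrative assumptions that do not appear in the abstract; only the cell count of 109,069 is taken from it.

```python
def barcode_combinations(barcodes_per_round: int, rounds: int) -> int:
    """Number of distinct barcode combinations after split-pool labeling.

    Assumes (hypothetically) the same number of barcodes in every round.
    """
    return barcodes_per_round ** rounds


def expected_collision_fraction(n_cells: int, n_combinations: int) -> float:
    """Probability that a given cell shares its barcode combination with
    at least one other cell, assuming each cell draws a combination
    uniformly and independently at random (a simplifying assumption).
    """
    if n_cells < 1 or n_combinations < 1:
        raise ValueError("n_cells and n_combinations must be positive")
    return 1.0 - (1.0 - 1.0 / n_combinations) ** (n_cells - 1)


if __name__ == "__main__":
    n_cells = 109_069  # cell count reported in the abstract
    for rounds in (2, 3, 4):  # illustrative round counts, not from the abstract
        combos = barcode_combinations(96, rounds)  # 96 is an assumed plate size
        frac = expected_collision_fraction(n_cells, combos)
        print(f"{rounds} rounds: {combos:,} combinations, "
              f"collision fraction ~ {frac:.4f}")
```

Under these assumptions, adding one round multiplies the barcode space by 96 while the expected fraction of cells sharing a barcode combination drops sharply, which is why the approach can reach very large cell numbers with only basic plates and pipetting.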
biorxiv genomics 100-200-users 2017
High-throughput annotation of full-length long noncoding RNAs with Capture Long-Read Sequencing, bioRxiv, 2017-02-02
Abstract: Accurate annotation of genes and their transcripts is a foundation of genomics, but no annotation technique presently combines throughput and accuracy. As a result, reference gene collections remain incomplete: many gene models are fragmentary, while thousands more remain uncatalogued, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), combining targeted RNA capture with third-generation long-read sequencing. We present an experimental re-annotation of the GENCODE intergenic lncRNA population in matched human and mouse tissues, resulting in novel transcript models for 3574 and 561 gene loci, respectively. CLS approximately doubles the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enable us to definitively characterize the genomic features of lncRNAs, including promoter- and gene-structure, and protein-coding potential. Thus CLS removes a longstanding bottleneck of transcriptome annotation, generating manual-quality full-length transcript models at high-throughput scales.
Abbreviations: bp, base pair; FL, full length; nt, nucleotide; ROI, read of insert, i.e. PacBio read; SJ, splice junction; SMRT, single-molecule real-time; TM, transcript model.
biorxiv genomics 0-100-users 2017
SvABA: Genome-wide detection of structural variants and indels by local assembly, bioRxiv, 2017-02-02
Abstract: Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at-scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA’s performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs, and substantially improved detection performance for variants in the 20-300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (< 1,000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types, and found that templated-sequence insertions occur in ~4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized SVs.
biorxiv genomics 0-100-users 2017
Mass-spectrometry of single mammalian cells quantifies proteome heterogeneity during cell differentiation, bioRxiv, 2017-01-25
Cellular heterogeneity is important to biological processes, including cancer and development. However, proteome heterogeneity is largely unexplored because of the limitations of existing methods for quantifying protein levels in single cells. To alleviate these limitations, we developed Single Cell ProtEomics by Mass Spectrometry (SCoPE-MS), and validated its ability to identify distinct human cancer cell types based on their proteomes. We used SCoPE-MS to quantify over a thousand proteins in differentiating mouse embryonic stem (ES) cells. The single-cell proteomes enabled us to deconstruct cell populations and infer protein abundance relationships. Comparison between single-cell proteomes and transcriptomes indicated coordinated mRNA and protein covariation. Yet many genes exhibited functionally concerted and distinct regulatory patterns at the mRNA and the protein levels, suggesting that post-transcriptional regulatory mechanisms contribute to proteome remodeling during lineage specification, especially for developmental genes. SCoPE-MS is broadly applicable to measuring proteome configurations of single cells and linking them to functional phenotypes, such as cell type and differentiation potentials.
biorxiv genomics 200-500-users 2017