Efficient long single molecule sequencing for cost effective and accurate sequencing, haplotyping, and de novo assembly, bioRxiv, 2018-05-17
Obtaining accurate sequences from long DNA molecules is very important for genome assembly and other applications. Here we describe single tube long fragment read (stLFR), a technology that enables this a low cost. It is based on adding the same barcode sequence to sub-fragments of the original long DNA molecule (DNA co-barcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process up to 3.6 billion unique barcode sequences were generated on beads, enabling practically non-redundant co-barcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique co-barcoding of over 8 million 20-300 kb genomic DNA fragments. Analysis of the genome of the human genome NA12878 with stLFR demonstrated high quality variant calling and phasing into contigs up to N50 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries and their construction did not significantly add to the time or cost of whole genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.
biorxiv genomics 0-100-users 2018The genetic prehistory of the Greater Caucasus, bioRxiv, 2018-05-16
AbstractArchaeogenetic studies have described the formation of Eurasian ‘steppe ancestry’ as a mixture of Eastern and Caucasus hunter-gatherers. However, it remains unclear when and where this ancestry arose and whether it was related to a horizon of cultural innovations in the 4th millennium BCE that subsequently facilitated the advance of pastoral societies likely linked to the dispersal of Indo-European languages. To address this, we generated genome-wide SNP data from 45 prehistoric individuals along a 3000-year temporal transect in the North Caucasus. We observe a genetic separation between the groups of the Caucasus and those of the adjacent steppe. The Caucasus groups are genetically similar to contemporaneous populations south of it, suggesting that – unlike today – the Caucasus acted as a bridge rather than an insurmountable barrier to human movement. The steppe groups from Yamnaya and subsequent pastoralist cultures show evidence for previously undetected farmer-related ancestry from different contact zones, while Steppe Maykop individuals harbour additional Upper Palaeolithic Siberian and Native American related ancestry.
biorxiv genomics 200-500-users 2018Whole-genome deep learning analysis reveals causal role of noncoding mutations in autism, bioRxiv, 2018-05-11
AbstractWe address the challenge of detecting the contribution of noncoding mutations to disease with a deep-learning-based framework that predicts specific regulatory effects and deleterious disease impact of genetic variants. Applying this framework to 1,790 Autism Spectrum Disorder (ASD) simplex families reveals autism disease causality of noncoding mutations by demonstrating that ASD probands harbor transcriptional (TRDs) and post-transcriptional (RRDs) regulation-disrupting mutations of significantly higher functional impact than unaffected siblings. Importantly, we detect this significant noncoding contribution at each level, transcriptional and post-transcriptional, independently and after multiple hypothesis correction. Further analysis suggests involvement of noncoding mutations in synaptic transmission and neuronal development, and reveals a convergent genetic landscape of coding and noncoding (TRD and RRD) de novo mutations in ASD. We demonstrate that sequences carrying prioritized proband de novo mutations possess transcriptional regulatory activity and drive expression differentially, and highlight a link between noncoding mutations and IQ heterogeneity in ASD probands. Our predictive genomics framework illuminates the role of noncoding mutations in ASD, prioritizes high impact transcriptional and post-transcriptional regulatory mutations for further study, and is broadly applicable to complex human diseases.
biorxiv genomics 100-200-users 2018Highly Multiplexed Single-Cell RNA-seq for Defining Cell Population and Transcriptional Spaces, bioRxiv, 2018-05-05
AbstractWe describe a universal sample multiplexing method for single-cell RNA-seq in which cells are chemically labeled with identifying DNA oligonucleotides. Analysis of a 96-plex perturbation experiment revealed changes in cell population structure and transcriptional states that cannot be discerned from bulk measurements, establishing a cost effective means to survey cell populations from large experiments and clinical samples with the depth and resolution of single-cell RNA-seq.
biorxiv genomics 200-500-users 2018crisprQTL mapping as a genome-wide association framework for cellular genetic screens, bioRxiv, 2018-05-04
AbstractExpression quantitative trait locus (eQTL) and genome-wide association studies (GWAS) are powerful paradigms for mapping the determinants of gene expression and organismal phenotypes, respectively. However, eQTL mapping and GWAS are limited in scope (to naturally occurring, common genetic variants) and resolution (by linkage disequilibrium). Here, we present crisprQTL mapping, a framework in which large numbers of CRISPRCas9 perturbations are introduced to each cell on an isogenic background, followed by single-cell RNA-seq (scRNA-seq). crisprQTL mapping is analogous to conventional human eQTL studies, but with individual humans replaced by individual cells; genetic variants replaced by unique combinations of ‘unlinked’ guide RNA (gRNA)-programmed perturbations per cell; and tissue-level RNA-seq of many individuals replaced by scRNA-seq of many cells. By randomly introducing gRNAs, a single population of cells can be leveraged to test for association between each perturbation and the expression of any potential target gene, analogous to how eQTL studies leverage populations of humans to test millions of genetic variants for associations with expression in a genome-wide manner. However, crisprQTL mapping is neither limited to naturally occurring, common genetic variants nor by linkage disequilibrium. As a proof-of-concept, we applied crisprQTL mapping to evaluate 1,119 candidate enhancers with no strong a priori hypothesis as to their target gene(s). Perturbations were made by a nuclease-dead Cas9 (dCas9) tethered to KRAB, and introduced at a mean ‘allele frequency’ of 1.1% into a population of 47,650 profiled human K562 cells (median of 15 gRNAs identified per cell). We tested for differential expression of all genes within 1 megabase (Mb) of each candidate enhancer, effectively evaluating 17,584 potential enhancer-target gene relationships within a single experiment. At an empirical false discovery rate (FDR) of 10%, we identify 128 cis crisprQTLs (11%) whose targeting resulted in downregulation of 105 nearby genes. crisprQTLs were strongly enriched for proximity to their target genes (median 34.3 kilobases (Kb)) and the strength of H3K27ac, p300, and lineage-specific transcription factor (TF) ChIP-seq peaks. Our results establish the power of the eQTL mapping paradigm as applied to programmed variation in populations of cells, rather than natural variation in populations of individuals. We anticipate that crisprQTL mapping will facilitate the comprehensive elucidation of the cis-regulatory architecture of the human genome.
biorxiv genomics 200-500-users 2018Comparative analysis of droplet-based ultra-high-throughput single-cell RNA-seq systems, bioRxiv, 2018-05-02
SummarySince its establishment in 2009, single-cell RNA-seq has been a major driver behind progress in biomedical research. In developmental biology and stem cell studies, the ability to profile single cells confers particular benefits. While most studies still focus on individual tissues or organs, the recent development of ultra-high-throughput single-cell RNA-seq has demonstrated potential power in characterizing more complex systems or even the entire body. However, although multiple ultra-high-throughput single-cell RNA-seq systems have attracted attention, no systematic comparison of these systems has been performed. Here, we focus on three widely used droplet-based ultra-high-throughput single-cell RNA-seq systems, inDrop, Drop-seq, and 10X Genomics Chromium. While each system is capable of profiling single-cell transcriptomes, their detailed comparison revealed the distinguishing features and suitable applications for each system.
biorxiv genomics 0-100-users 2018