Systematic identification of human SNPs affecting regulatory element activity, bioRxiv, 2018-11-04
Abstract: Most of the millions of single-nucleotide polymorphisms (SNPs) in the human genome are non-coding, and many overlap with putative regulatory elements. Genome-wide association studies have linked many of these SNPs to human traits or to gene expression levels, but rarely with sufficient resolution to identify the causal SNPs. Functional screens based on reporter assays have previously been of insufficient throughput to test the vast space of SNPs for possible effects on enhancer and promoter activity. Here, we have leveraged the throughput of the SuRE reporter technology to survey a total of 5.9 million SNPs, including 57% of the known common SNPs. We identified more than 30 thousand SNPs that alter the activity of putative regulatory elements, often in a cell-type-specific manner. These data indicate that a large proportion of human non-coding SNPs may affect gene regulation. Integration of these SuRE data with genome-wide association studies may help pinpoint SNPs that underlie human traits.
biorxiv genomics 100-200-users 2018

Stem cell differentiation trajectories in Hydra resolved at single-cell resolution, bioRxiv, 2018-11-03
Abstract: The adult Hydra polyp continuously renews all of its cells using three separate stem cell populations, but the genetic pathways enabling homeostatic tissue maintenance are not well understood. We used Drop-seq to sequence transcriptomes of 24,985 single Hydra cells and identified the molecular signatures of a broad spectrum of cell states, from stem cells to terminally differentiated cells. We constructed differentiation trajectories for each cell lineage and identified the transcription factors expressed along these trajectories, thus creating a comprehensive molecular map of all developmental lineages in the adult animal. We unexpectedly found that neuron and gland cell differentiation transits through a common progenitor state, suggesting a shared evolutionary history for these secretory cell types. Finally, we have built the first gene expression map of the Hydra nervous system. By producing a comprehensive molecular description of the adult Hydra polyp, we have generated a resource for addressing fundamental questions regarding the evolution of developmental processes and nervous system function.
biorxiv developmental-biology 100-200-users 2018

A complete Cannabis chromosome assembly and adaptive admixture for elevated cannabidiol (CBD) content, bioRxiv, 2018-10-31
Abstract: Cannabis has been cultivated for millennia with distinct cultivars providing either fiber and grain or tetrahydrocannabinol. Recent demand for cannabidiol rather than tetrahydrocannabinol has favored the breeding of admixed cultivars with extremely high cannabidiol content. Despite several draft Cannabis genomes, the genomic structure of cannabinoid synthase loci has remained elusive. A genetic map derived from a tetrahydrocannabinol/cannabidiol segregating population and a complete chromosome assembly from a high-cannabidiol cultivar together resolve the linkage of cannabidiolic and tetrahydrocannabinolic acid synthase gene clusters, which are associated with transposable elements. High-cannabidiol cultivars appear to have been generated by integrating hemp-type cannabidiolic acid synthase gene clusters into a background of marijuana-type cannabis. Quantitative trait locus mapping suggests that overall drug potency, however, is associated with other genomic regions needing additional study. Resources available online at http://cannabisgenome.org
Summary: A complete chromosome assembly and an ultra-high-density linkage map together identify the genetic mechanism responsible for the ratio of tetrahydrocannabinol (THC) to cannabidiol (CBD) in Cannabis cultivars, allowing paradigms for the evolution and inheritance of drug potency to be evaluated.
biorxiv genomics 100-200-users 2018

Data Denoising with transfer learning in single-cell transcriptomics, bioRxiv, 2018-10-31
Single-cell RNA sequencing (scRNA-seq) data is noisy and sparse. Here, we show that transfer learning across datasets remarkably improves data quality. By coupling a deep autoencoder with a Bayesian model, SAVER-X extracts transferable gene-gene relationships across data from different labs, varying conditions, and divergent species to denoise new target datasets.
biorxiv bioinformatics 100-200-users 2018

Personalized and graph genomes reveal missing signal in epigenomic data, bioRxiv, 2018-10-31
Abstract
Background: Epigenomic studies that use next-generation sequencing experiments typically rely on the alignment of reads to a reference sequence. However, because of genetic diversity and the diploid nature of the human genome, we hypothesized that using a generic reference could lead to incorrectly mapped reads and bias downstream results.
Results: We show that accounting for genetic variation using a modified reference genome (MPG) or a de novo assembled genome (DPG) can alter histone H3K4me1 and H3K27ac ChIP-seq peak calls by either creating new personal peaks or by the loss of reference peaks. MPGs are found to alter approximately 1% of peak calls, while DPGs alter up to 5% of peaks. We also show statistically significant differences in the number of reads observed in regions associated with the new, altered and unchanged peaks. We report that short insertions and deletions (indels), followed by single nucleotide variants (SNVs), have the highest probability of modifying peak calls. A counter-balancing factor is peak width, with wider calls being less likely to be altered. Next, because high-quality DPGs remain hard to obtain, we show that using a graph personalized genome (GPG) represents a reasonable compromise between MPGs and DPGs and alters about 2.5% of peak calls. Finally, we demonstrate that altered peaks have a genomic distribution typical of other peaks. For instance, for H3K4me1, 518 personal-only peaks were replicated using at least two of three approaches, 394 of which were inside or within 10 kb of a gene.
Conclusions: Analysing epigenomic datasets with personalized and graph genomes allows the recovery of new peaks enriched for indels and SNVs. These altered peaks are more likely to differ between individuals and, as such, could be relevant in the study of various human phenotypes.
biorxiv bioinformatics 100-200-users 2018

Transfer learning in single-cell transcriptomics improves data denoising and pattern discovery, bioRxiv, 2018-10-31
Although single-cell RNA sequencing (scRNA-seq) technologies have shed light on the role of cellular diversity in human pathophysiology [1–3], the resulting data remains noisy and sparse, making reliable quantification of gene expression challenging. Here, we show that a deep autoencoder coupled to a Bayesian model remarkably improves UMI-based scRNA-seq data quality by transfer learning across datasets. This new technology, SAVER-X, outperforms existing state-of-the-art tools. The deep learning model in SAVER-X extracts transferable gene expression features across data from different labs, generated by varying technologies, and obtained from divergent species. Through this framework, we explore the limits of transfer learning in a diverse testbed and demonstrate that future human sequencing projects will unequivocally benefit from the accumulation of publicly available data. We further show, through examples in immunology and neurodevelopment, that SAVER-X can harness existing public data to enhance downstream analysis of new data, such as those collected in clinical settings.
biorxiv bioinformatics 100-200-users 2018