Comparative genomics of the major parasitic worms, bioRxiv, 2017-12-21
Abstract: Parasitic nematodes (roundworms) and platyhelminths (flatworms) cause debilitating chronic infections of humans and animals, decimate crop production and are a major impediment to socioeconomic development. Here we compare the genomes of 81 nematode and platyhelminth species, including those of 76 parasites. From 1.4 million genes, we identify gene family births and hundreds of large expanded gene families at key nodes in the phylogeny that are relevant to parasitism. Examples include gene families that modulate host immune responses, enable parasite migration through host tissues or allow the parasite to feed. We use a wide-ranging in silico screen to identify and prioritise new potential drug targets and compounds for testing. We also uncover lineage-specific differences in core metabolism and in protein families historically targeted for drug development. This is the broadest comparative study to date of the genomes of parasitic and non-parasitic worms. It provides a transformative new resource for the research community to understand and combat the diseases that parasitic worms cause.
biorxiv genomics 0-100-users 2017
Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-seq, bioRxiv, 2017-12-17
Abstract: Systematic measurement biases make data normalization an essential preprocessing step in single-cell RNA sequencing (scRNA-seq) analysis. There may be multiple, competing considerations behind the assessment of normalization performance, some of them study-specific. Because normalization can have a large impact on downstream results (e.g., clustering and differential expression), it is critically important that practitioners assess the performance of competing methods. We have developed scone, a flexible framework for assessing normalization performance based on a comprehensive panel of data-driven metrics. Through graphical summaries and quantitative reports, scone summarizes performance trade-offs and ranks large numbers of normalization methods by aggregate panel performance. The method is implemented in the open-source Bioconductor R software package scone. We demonstrate the effectiveness of scone on a collection of scRNA-seq datasets, generated with different protocols, including Fluidigm C1 and 10x platforms. We show that top-performing normalization methods lead to better agreement with independent validation data.
biorxiv genomics 100-200-users 2017
Sequence variation aware genome references and read mapping with the variation graph toolkit, bioRxiv, 2017-12-16
Abstract: Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references are fundamentally limited in that they represent only one version of each locus, whereas the population may contain multiple variants. When the reference represents an individual’s genome poorly, it can impact read mapping and introduce bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation, including large-scale structural variation such as inversions and duplications [1]. Equivalent structures are produced by de novo genome assemblers [2,3]. Here we present vg, a toolkit of computational methods for creating, manipulating, and utilizing these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays [4], with improved accuracy over alignment to a linear reference, creating data structures to support downstream variant calling and genotyping. These capabilities make using variation graphs as reference structures for DNA sequencing practical at the scale of vertebrate genomes, or at the topological complexity of new species assemblies.
biorxiv genomics 200-500-users 2017
Specificity of RNAi, LNA and CRISPRi as loss-of-function methods in transcriptional analysis, bioRxiv, 2017-12-16
Abstract: Loss-of-function (LOF) methods, such as RNA interference (RNAi), antisense oligonucleotides or CRISPR-based genome editing, provide unparalleled power for studying the biological function of genes of interest. When coupled with transcriptomic analyses, LOF methods allow researchers to dissect networks of transcriptional regulation. However, a major concern is nonspecific targeting, which involves depletion of transcripts other than those intended. The off-target effects of each of these common LOF methods have yet to be compared at the whole-transcriptome level. Here, we systematically and experimentally compared non-specific activity of RNAi, antisense oligonucleotides and CRISPR interference (CRISPRi). All three methods yielded non-negligible off-target effects in gene expression, with CRISPRi exhibiting clonal variation in the transcriptional profile. As an illustrative example, we evaluated the performance of each method for deciphering the role of a long noncoding RNA (lncRNA) with unknown function. Although all LOF methods reduced expression of the candidate lncRNA, each method yielded different sets of differentially expressed genes upon knockdown as well as a different cellular phenotype. Therefore, to definitively confirm the functional role of a transcriptional regulator, we recommend the simultaneous use of at least two different LOF methods and the inclusion of multiple, specifically designed negative controls.
biorxiv genomics 0-100-users 2017
Resolving the Full Spectrum of Human Genome Variation using Linked-Reads, bioRxiv, 2017-12-09
Abstract: Large-scale population-based analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short-read whole genome sequencing. However, standard short-read approaches, used primarily due to accuracy, throughput and costs, fail to give a complete picture of a genome. They struggle to identify large, balanced structural events, cannot access repetitive regions of the genome and fail to resolve the human genome into its two haplotypes. Here we describe an approach that retains long-range information while harnessing the advantages of short reads. Starting from only ~1 ng of DNA, we produce barcoded short-read libraries. The use of novel informatic approaches allows for the barcoded short reads to be associated with the long molecules of origin, producing a novel datatype known as ‘Linked-Reads’. This approach allows for simultaneous detection of small and large variants from a single Linked-Read library. We have previously demonstrated the utility of whole genome Linked-Reads (lrWGS) for performing diploid, de novo assembly of individual genomes (Weisenfeld et al. 2017). In this manuscript, we show the advantages of Linked-Reads over standard short-read approaches for reference-based analysis. We demonstrate the ability of Linked-Reads to reconstruct megabase-scale haplotypes and to recover parts of the genome that are typically inaccessible to short reads, including phenotypically important genes such as STRC, SMN1 and SMN2. We demonstrate the ability of both lrWGS and Linked-Read Whole Exome Sequencing (lrWES) to identify complex structural variations, including balanced events, single exon deletions, and single exon duplications. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.
biorxiv genomics 0-100-users 2017
A quantitative model for characterizing the evolutionary history of mammalian gene expression, bioRxiv, 2017-12-05
Abstract: Characterizing the evolutionary history of a gene’s expression profile is a critical component for understanding the relationship between genotype, expression, and phenotype. However, it is not well-established how best to distinguish the different evolutionary forces acting on gene expression. Here, we use RNA-seq across 7 tissues from 17 mammalian species to show that expression evolution across mammals is accurately modeled by the Ornstein-Uhlenbeck (OU) process. This stochastic process models expression trajectories across time as Gaussian distributions whose variance is parameterized by the rate of genetic drift and strength of stabilizing selection. We use these mathematical properties to identify expression pathways under neutral, stabilizing, and directional selection, and quantify the extent of selective pressure on a gene’s expression. We further detect deleterious expression levels outside expected evolutionary distributions in expression data from individual patients. Our work provides a statistical framework for interpreting expression data across species and in disease.
One Sentence Summary: We demonstrate the power of a stochastic model for quantifying selective pressure on expression and estimating evolutionary distributions of optimal gene expression.
biorxiv genomics 0-100-users 2017
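The Ornstein-Uhlenbeck model named in the last abstract above can be illustrated with its standard textbook form, dX = alpha * (theta - X) dt + sigma dW, where the expression level X is pulled toward an optimum theta. The following is a minimal sketch under that assumption only; it is not the authors' implementation, and the function names and the parameters `alpha` (strength of stabilizing selection), `sigma` (drift rate) and `theta` (optimal expression) are illustrative choices not taken from the preprint.

```python
import math


def ou_moments(x0, theta, alpha, sigma, t):
    """Mean and variance of X_t given X_0 = x0 for the textbook OU process
    dX = alpha * (theta - X) dt + sigma dW (hypothetical parameter names)."""
    if alpha <= 0 or sigma < 0 or t < 0:
        raise ValueError("require alpha > 0, sigma >= 0 and t >= 0")
    decay = math.exp(-alpha * t)
    # The mean relaxes exponentially from the starting value to the optimum.
    mean = theta + (x0 - theta) * decay
    # The variance grows from 0 and saturates at sigma^2 / (2 * alpha).
    var = sigma ** 2 / (2.0 * alpha) * (1.0 - decay ** 2)
    return mean, var


def stationary_variance(alpha, sigma):
    """Long-run variance: stronger selection (larger alpha) narrows it,
    faster drift (larger sigma) widens it."""
    if alpha <= 0:
        raise ValueError("require alpha > 0")
    return sigma ** 2 / (2.0 * alpha)


def z_score(x, mean, var):
    """Standardized distance of an observed level from the expected Gaussian
    distribution (an illustrative way to flag unusual values)."""
    if var <= 0:
        raise ValueError("require var > 0")
    return (x - mean) / math.sqrt(var)
```

For example, `ou_moments(0.0, 1.0, 1.0, 1.0, 50.0)` returns a mean very close to the optimum 1.0 and a variance very close to `stationary_variance(1.0, 1.0)`, which shows how selection and drift together fix the width of the long-run Gaussian.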