Determining sufficient sequencing depth in RNA-Seq differential expression studies, bioRxiv, 2019-05-13
Abstract: RNA-Seq studies require a sufficient read depth to detect biologically important genes. Sequencing below this threshold will reduce statistical power, while sequencing above it will provide only marginal improvements in power and incur unnecessary sequencing costs. Although existing methodologies can help assess whether there is sufficient read depth, they are unable to guide how many additional reads should be sequenced to reach this threshold. We provide a new method called superSeq that models the relationship between statistical power and read depth. We apply the superSeq framework to 393 RNA-Seq experiments (1,021 total contrasts) in the Expression Atlas and find the model accurately predicts the increase in statistical power gained by increasing the read depth. Based on our analysis, we find that most published studies (> 70%) are undersequenced, i.e., their statistical power can be improved by increasing the sequencing read depth. In addition, the extent of saturation is highly dependent on statistical methodology: only 9.5%, 29.5%, and 26.6% of contrasts are saturated when using DESeq2, edgeR, and limma, respectively. Finally, we also find that there is no clear minimum per-transcript read depth to guarantee saturation for an entire technology. Therefore, our framework not only delineates key differences among methods and their impact on determining saturation, but will also be needed even as technology improves and the read depth of experiments increases. Researchers can thus use superSeq to calculate the read depth needed to achieve the required statistical power while avoiding unnecessary sequencing costs.
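To make the idea of relating power to read depth concrete, the sketch below shows one generic way to simulate a shallower experiment from an existing count table: binomial thinning, where each read is kept independently with a fixed probability. This is an illustration only and is not taken from superSeq; the function name `thin_counts` and its parameters are hypothetical. Rerunning a differential expression method on tables thinned to several proportions would give the points that a power-versus-depth curve could be fitted to.

```python
import random


def thin_counts(counts, proportion, seed=0):
    """Simulate lower sequencing depth by binomial thinning.

    counts     : list of non-negative integer read counts (one per gene)
    proportion : fraction of reads to keep, between 0.0 and 1.0
    seed       : seed for the random generator, for reproducibility

    Hypothetical helper for illustration; not part of superSeq.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = random.Random(seed)
    # Each read is kept independently with probability `proportion`.
    # The per-read loop is simple but slow for very deep libraries.
    return [
        sum(1 for _ in range(count) if rng.random() < proportion)
        for count in counts
    ]


if __name__ == "__main__":
    gene_counts = [120, 0, 35, 980]  # made-up example counts
    for p in (0.25, 0.5, 1.0):
        print(p, thin_counts(gene_counts, p, seed=1))
```

Thinned counts can never exceed the original counts, and a proportion of 1.0 returns the table unchanged, which makes the helper easy to sanity-check before using it in a larger analysis.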
biorxiv genomics 100-200-users 2019
Dsuite - fast D-statistics and related admixture evidence from VCF files, bioRxiv, 2019-05-11
Abstract: Summary: The D-statistic, also known as the ABBA-BABA statistic, and related statistics are commonly used to assess evidence of gene flow between populations or closely related species. While the calculations are not computationally intensive, currently available implementations require custom file formats and are impractical for evaluating all gene flow hypotheses across datasets that include many populations or species. Dsuite is a fast C++ implementation, allowing genome-scale calculations of the D-statistic across all combinations of tens or even hundreds of populations or species directly from a variant call format (VCF) file. Furthermore, the program can estimate the admixture fraction and provide evidence of whether introgression is confined to specific loci. Thus Dsuite facilitates assessment of gene flow across large genomic datasets. Availability and implementation: Source code and documentation are available at https://github.com/millanek/Dsuite
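For readers unfamiliar with the ABBA-BABA statistic, the following is a minimal sketch of the commonly used frequency-based form for a four-taxon arrangement (((P1, P2), P3), Outgroup). It assumes each site is given as derived-allele frequencies in the four groups; the function name and input layout are hypothetical, and this is not Dsuite's code. D is the normalized difference between the summed ABBA and BABA pattern weights, so values near zero are consistent with no excess allele sharing.

```python
def d_statistic(sites):
    """Compute Patterson's D from per-site derived-allele frequencies.

    sites : iterable of (p1, p2, p3, p_out) tuples, each value in [0, 1],
            for populations P1, P2, P3 and the outgroup.

    Illustrative helper only; not taken from Dsuite.
    """
    abba = 0.0
    baba = 0.0
    for p1, p2, p3, p_out in sites:
        # ABBA: derived allele shared by P2 and P3, ancestral in P1 and outgroup.
        abba += (1.0 - p1) * p2 * p3 * (1.0 - p_out)
        # BABA: derived allele shared by P1 and P3, ancestral in P2 and outgroup.
        baba += p1 * (1.0 - p2) * p3 * (1.0 - p_out)
    total = abba + baba
    if total == 0.0:
        raise ValueError("no informative ABBA or BABA sites")
    return (abba - baba) / total


if __name__ == "__main__":
    # Made-up frequencies: P2 shares more derived alleles with P3 than P1 does.
    example_sites = [
        (0.0, 1.0, 1.0, 0.0),
        (0.1, 0.6, 0.9, 0.0),
        (0.5, 0.4, 0.8, 0.0),
    ]
    print(d_statistic(example_sites))
```

A positive value here indicates excess sharing between P2 and P3, a negative value excess sharing between P1 and P3. Assessing significance (for example with a block jackknife) is a separate step that this sketch does not attempt.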
biorxiv genomics 0-100-users 2019
Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads, bioRxiv, 2019-05-11
Abstract: The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective stand-alone technology for de novo assembly of human genomes.
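The NG50 figure quoted above can be read as follows: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the reference (or estimated) genome size. The sketch below implements that general definition and checks the reported fold change from the two values in the abstract. The function name and the toy contig lengths are illustrative assumptions, not the authors' pipeline.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly.

    contig_lengths : iterable of contig lengths (same unit as genome_size)
    genome_size    : reference or estimated size of the region or genome

    Returns 0 when the contigs together cover less than half of genome_size.
    Illustrative helper only.
    """
    target = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


if __name__ == "__main__":
    # Toy example: cumulative lengths 50, 90, 120 reach the target of 100 at 30.
    print(ng50([10, 20, 30, 40, 50], genome_size=200))

    # Fold change implied by the pericentromeric NG50 values in the abstract.
    hifi_kbp, clr_kbp = 480.6, 191.5
    print(round(hifi_kbp / clr_kbp, 1))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses a fixed genome size, so two assemblies of the same region are compared against the same target.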
biorxiv genomics 0-100-users 2019
Paragraph: A graph-based structural variant genotyper for short-read sequence data, bioRxiv, 2019-05-11
Abstract: Accurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, a fast and accurate genotyper that models SVs using sequence graphs and SV annotations produced by a range of methods and technologies. We demonstrate the accuracy of Paragraph on whole-genome sequence data from a control sample with both short- and long-read sequencing data available, and then apply it at scale to a cohort of 100 samples of diverse ancestry sequenced with short reads. Comparative analyses indicate that Paragraph has better accuracy than other existing genotypers. The Paragraph software is open-source and available at https://github.com/Illumina/paragraph
biorxiv genomics 100-200-users 2019
Quantifying genetic regulatory variation in human populations improves transcriptome analysis in rare disease patients, bioRxiv, 2019-05-10
Abstract: Transcriptome data holds substantial promise for better interpretation of rare genetic variants in basic research and clinical settings. Here, we introduce ANalysis of Expression VAriation (ANEVA) to quantify genetic variation in gene dosage from allelic expression (AE) data in a population. Application to GTEx data showed that this variance estimate is robust across datasets and is correlated with selective constraint in a gene. We next used ANEVA variance estimates in a Dosage Outlier Test (ANEVA-DOT) to identify genes in an individual that are affected by a rare regulatory variant with an unusually strong effect. Applying ANEVA-DOT to AE data from 70 Mendelian muscular disease patients showed high accuracy in detecting genes with pathogenic variants in previously resolved cases, and led to one confirmed and several potential new diagnoses in previously unresolved cases. Using our reference estimates from GTEx data, ANEVA-DOT can be readily incorporated in rare disease diagnostic pipelines to better utilize RNA-seq data. One Sentence Summary: New statistical framework for modelling allelic expression characterizes genetic regulatory variation in populations and informs diagnosis in rare disease patients.
biorxiv genomics 0-100-users 2019
Stress-driven transposable element de-repression dynamics in a fungal pathogen, bioRxiv, 2019-05-10
Abstract: Transposable elements (TEs) are drivers of genome evolution and affect the expression landscape of the host genome. Stress is a major factor inducing TE activity; however, the regulatory mechanisms underlying de-repression are poorly understood. Key unresolved questions are whether different types of stress differentially induce TE activity and whether different TEs respond differently to the same stress. Plant pathogens are excellent models to dissect the impact of stress on TEs, because lifestyle transitions on and off the host impose exposure to a variety of stress conditions. We analyzed the TE expression landscape of four well-characterized strains of the major wheat pathogen Zymoseptoria tritici. We experimentally exposed strains to nutrient starvation and host infection stress. Contrary to expectations, we show that the two distinct conditions induce the expression of different sets of TEs. In particular, the most highly expressed TEs, including MITE and LTR-Gypsy elements, show highly distinct de-repression across stress conditions. Both the genomic context of TEs and the genetic background (i.e., different strains harboring the same TEs) were major predictors of de-repression dynamics under stress. Genomic defenses inducing point mutations in repetitive regions were largely ineffective at preventing TE de-repression. Consistent with TE de-repression being governed by epigenetic effects, we found that gene expression profiles under stress varied significantly depending on the proximity to the closest TEs. The unexpected complexity in TE responsiveness to stress across genetic backgrounds and genomic locations shows that species harbor substantial genetic variation to control TEs.
biorxiv genomics 0-100-users 2019