Basal Contamination of Sequencing Lessons from the GTEx dataset, bioRxiv, 2019-04-09

AbstractOne of the challenges of next generation sequencing (NGS) is read contamination. We used the Genotype-Tissue Expression (GTEx) project, a large, diverse, and robustly generated dataset, to understand the factors that contribute to contamination. We obtained GTEx datasets and technical metadata and validating RNA-Seq from other studies. Of 48 analyzed tissues in GTEx, 26 had variant co-expression clusters of four known highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, andor CELA3A). Fourteen additional highly expressed genes from other tissues also indicated contamination. Sample contamination by non-native genes was associated with a sample being sequenced on the same day as a tissue that natively expressed those genes. This was highly significant for pancreas and esophagus genes (linear model, p=9.5e-237 and p=5e-260 respectively). Nine SNPs in four genes shown to contaminate non-native tissues demonstrated allelic differences between DNA-based genotypes and contaminated sample RNA-based genotypes, validating the contamination. Low-level contamination affected 4,497 (39.6%) samples (defined as 10 PRSS1 TPM). It also led ≥ to eQTL assignments in inappropriate tissues among these 18 genes. We note this type of contamination occurs widely, impacting bulk and single cell data set analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets impacting analyses. Awareness of this process is necessary to avoid assigning inaccurate importance to low-level gene expression in inappropriate tissues and cells.

biorxiv genomics 100-200-users 2019

Whole genome phylogenies reflect long-tailed distributions of recombination rates in many bacterial species, bioRxiv, 2019-04-08

AbstractAlthough homologous recombination is accepted to be common in bacteria, so far it has been challenging to accurately quantify its impact on genome evolution within bacterial species. We here introduce methods that use the statistics of single-nucleotide polymorphism (SNP) splits in the core genome alignment of a set of strains to show that, for many bacterial species, recombination dominates genome evolution. Each genomic locus has been overwritten so many times by recombination that it is impossible to reconstruct the clonal phylogeny and, instead of a consensus phylogeny, the phylogeny typically changes many thousands of times along the core genome alignment.We also show how SNP splits can be used to quantify the relative rates with which different subsets of strains have recombined in the past. We find that virtually every strain has a unique pattern of recombination frequencies with other strains and that the relative rates with which different subsets of strains share SNPs follow long-tailed distributions. Our findings show that bacterial populations are neither clonal nor freely recombining, but structured such that recombination rates between different lineages vary along a continuum spanning several orders of magnitude, with a unique pattern of rates for each lineage. Thus, rather than reflecting clonal ancestry, whole genome phylogenies reflect these long-tailed distributions of recombination rates.

biorxiv evolutionary-biology 200-500-users 2019

Pooled-parent exome sequencing to prioritise de novo variants in genetic disease, bioRxiv, 2019-04-07

AbstractIn the clinical setting, exome sequencing has become standard-of-care in diagnosing rare genetic disorders, however many patients remain unsolved. Trio sequencing has been demonstrated to produce a higher diagnostic yield than singleton (proband-only) sequencing. Parental sequencing is especially useful when a disease is suspected to be caused by a de novo variant in the proband, because parental data provide a strong filter for the majority of variants that are shared by the proband and their parents. However the additional cost of sequencing the parents makes the trio strategy uneconomical for many clinical situations. With two thirds of the sequencing budget being spent on parents, these are funds that could be used to sequence more probands. For this reason many clinics are reluctant to sequence parents.Here we propose a pooled-parent strategy for exome sequencing of individuals with likely de novo disease. In this strategy, DNA from all the parents of a cohort of unrelated probands is pooled together into a single exome capture and sequencing run. Variants called in the proband can then be filtered if they are also found in the parent pool, resulting in a shorter list of prioritised variants. To evaluate the pooled-parent strategy we performed a series of simulations by combining reads from individual exomes to imitate sample pooling. We assessed the recall and false positive rate and investigated the trade-off between pool size and recall rate. We compared the performance of GATK HaplotypeCaller individual and joint calling, and FreeBayes to genotype pooled samples. Finally, we applied a pooled-parent strategy to a set of real unsolved cases and showed that the parent pool is a powerful filter that is complementary to other commonly used variant filters such as population variant frequencies.

biorxiv bioinformatics 0-100-users 2019

 

Created with the audiences framework by Jedidiah Carlson

Powered by Hugo