Measuring and Mitigating PCR Bias in Microbiome Data, bioRxiv, 2019-04-10
Abstract: PCR amplification plays a central role in the measurement of mixed microbial communities via high-throughput sequencing. Yet PCR is also known to be a common source of bias in microbiome data. Here we present a paired modeling and experimental approach to characterize and mitigate PCR bias in microbiome studies. We use experimental data from mock bacterial communities to validate our approach and human gut microbiota samples to characterize PCR bias under real-world conditions. Our results suggest that PCR can bias estimates of microbial relative abundances by a factor of 2-4 but that this bias can be mitigated using simple Bayesian multinomial logistic-normal linear models.
Author summary: High-throughput sequencing is often used to profile host-associated microbial communities. Many processing steps are required to transform a community of bacteria into a pool of DNA suitable for sequencing. One important step is amplification where, to create enough DNA for sequencing, DNA from many different bacteria is repeatedly copied using a technique called Polymerase Chain Reaction (PCR). However, PCR is known to introduce bias, as DNA from some bacteria is copied more efficiently than DNA from others. Here we introduce an experimental procedure that allows this bias to be measured and computational techniques that allow this bias to be mitigated in sequencing data.
biorxiv genomics 0-100-users 2019
Not just one p: Multivariate GWAS of psychiatric disorders and their cardinal symptoms reveal two dimensions of cross-cutting genetic liabilities, bioRxiv, 2019-04-10
Abstract: A single dimension of general psychopathology, p, has been hypothesized to represent a general liability that spans multiple types of psychiatric disorders and non-clinical variation in psychiatric symptoms across the lifespan. We conducted genome-wide association analyses of lifetime symptoms of mania, psychosis, and irritability in 124,952 to 208,315 individuals from UK Biobank, and then applied Genomic SEM to model the genetic relationships between these psychiatric symptoms and clinically-defined psychiatric disorders (schizophrenia, bipolar disorder, major depressive disorder). Two dimensions of cross-cutting genetic liability emerged: general vulnerability to self-reported symptoms (pself) versus transdiagnostic vulnerability to clinically-diagnosed disease (pclinician). These were only modestly correlated (rg = .344). Multivariate GWAS identified 145 and 11 independent and genome-wide significant loci for pclinician and pself, respectively, and improved polygenic prediction, relative to univariate GWAS, in hold-out samples. Despite the severe impairments in occupational and educational functioning seen in patients with schizophrenia and bipolar disorder, pself showed stronger and more pervasive genetic correlations with facets of socioeconomic disadvantage (educational attainment, income, and neighborhood deprivation), whereas pclinician was more strongly associated with medical disorders unrelated to the brain. Genetic variance in pclinician that was unrelated to general vulnerability to psychiatric symptoms was associated with less socioeconomic disadvantage, suggesting positive selection biases in clinical samples used in psychiatric GWAS. These findings inform criticisms of psychiatric nosology by suggesting that cross-disorder genetic liabilities identified in GWASs of clinician-defined psychiatric disease are relatively distinct from genetic liabilities operating on self-reported symptom variation in the general population.
biorxiv genetics 100-200-users 2019
Basal Contamination of Bulk Sequencing: Lessons from the GTEx dataset, bioRxiv, 2019-04-09
Abstract
Background: One of the challenges of next generation sequencing (NGS) is contaminating reads from other samples. We used the Genotype-Tissue Expression (GTEx) project, a large, diverse, and robustly generated dataset, as a useful resource to understand the factors that contribute to contamination.
Results: We obtained 11,340 RNA-Seq samples, DNA variant call files (VCF) of 635 individuals, and technical metadata from GTEx as well as read count data from the Human Protein Atlas (HPA) and a pharmacogenetics study. We analyzed 48 tissues in GTEx. Of these, 24 had variant co-expression clusters of four known highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and CELA3A). Fifteen additional highly expressed genes from other tissues were also indicative of contamination (KRT4, KRT13, PGC, CPA1, GP2, PRL, LIPF, CTRB2, FGA, HP, CKM, FGG, MYBPC1, MYH2, ZG16B). Sample contamination by non-native genes was highly associated with a sample being sequenced on the same day as a tissue that natively has high levels of those genes. This was highly significant for both pancreas genes (p = 2.7E-75) and esophagus genes (p = 8.9E-154). We used genetic polymorphism differences between individuals as validation of the contamination. Specifically, 11 SNPs in five genes shown to contaminate non-native tissues demonstrated allelic differences between DNA-based genotypes and contaminated sample RNA-based genotypes. Low-level contamination affected 1,841 (15.8%) samples (defined as ≥500 PRSS1 read counts). It also led to eQTL assignments in inappropriate tissues among these 19 genes. In support of this type of contamination occurring widely, pancreas gene contamination (PRSS1) was also observed in the HPA dataset, where pancreas samples were sequenced, but not in the pharmacogenomics dataset, where they were not.
Conclusions: Highly expressed, tissue-enriched genes basally contaminate the GTEx dataset, impacting some downstream GTEx data analyses. This type of contamination is not unique to GTEx, being shared with other datasets. Awareness of this process will reduce assigning variable, contaminating low-level gene expression to disease processes.
biorxiv genomics 100-200-users 2019
Basal Contamination of Sequencing: Lessons from the GTEx dataset, bioRxiv, 2019-04-09
Abstract: One of the challenges of next generation sequencing (NGS) is read contamination. We used the Genotype-Tissue Expression (GTEx) project, a large, diverse, and robustly generated dataset, to understand the factors that contribute to contamination. We obtained GTEx datasets and technical metadata and validating RNA-Seq from other studies. Of 48 analyzed tissues in GTEx, 26 had variant co-expression clusters of four known highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicated contamination. Sample contamination by non-native genes was associated with a sample being sequenced on the same day as a tissue that natively expressed those genes. This was highly significant for pancreas and esophagus genes (linear model, p = 9.5e-237 and p = 5e-260, respectively). Nine SNPs in four genes shown to contaminate non-native tissues demonstrated allelic differences between DNA-based genotypes and contaminated sample RNA-based genotypes, validating the contamination. Low-level contamination affected 4,497 (39.6%) samples (defined as ≥10 PRSS1 TPM). It also led to eQTL assignments in inappropriate tissues among these 18 genes. We note this type of contamination occurs widely, impacting bulk and single cell data set analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets, impacting analyses. Awareness of this process is necessary to avoid assigning inaccurate importance to low-level gene expression in inappropriate tissues and cells.
biorxiv genomics 100-200-users 2019
Native molecule sequencing by nano-ID reveals synthesis and stability of RNA isoforms, bioRxiv, 2019-04-08
Abstract: Eukaryotic genes often generate a variety of RNA isoforms that can lead to functionally distinct protein variants. The synthesis and stability of RNA isoforms is, however, poorly characterized. The reason for this is that current methods to quantify RNA metabolism use ‘short-read’ sequencing that cannot detect RNA isoforms. Here we present nanopore sequencing-based Isoform Dynamics (nano-ID), a method that detects newly synthesized RNA isoforms and monitors isoform metabolism. nano-ID combines metabolic RNA labeling, ‘long-read’ nanopore sequencing of native RNA molecules, and machine learning. Application of nano-ID to the heat shock response in human cells reveals that many RNA isoforms change their synthesis rate, stability, and splicing pattern. nano-ID also shows that the metabolism of individual RNA isoforms differs strongly from that estimated for the combined RNA signal at a specific gene locus. And although combined RNA stability correlates with poly(A)-tail length, individual RNA isoforms can deviate significantly. nano-ID enables studies of RNA metabolism on the level of single RNA molecules and isoforms in different cell states and conditions.
biorxiv molecular-biology 0-100-users 2019
Tunability of DNA polymerase stability during eukaryotic DNA replication, bioRxiv, 2019-04-08
Summary: Structural and biochemical studies have revealed the basic principles of how the replisome duplicates genomic DNA, but little is known about its dynamics during DNA replication. We reconstitute the 34 proteins needed to form the S. cerevisiae replisome and show how changing local concentrations of the key DNA polymerases tunes the ability of the complex to efficiently recycle these proteins or to dynamically exchange them. Particularly, we demonstrate redundancy of the Pol α DNA polymerase activity in replication and show that Pol α primase and the lagging-strand Pol δ can be re-used within the replisome to support the synthesis of large numbers of Okazaki fragments. This unexpected malleability of the replisome might allow it to deal with barriers and resource challenges during replication of large genomes.
biorxiv biophysics 100-200-users 2019