Uncertainty in RNA-seq gene expression data, bioRxiv, 2018-10-17
RNA-sequencing data is widely used to identify disease biomarkers and therapeutic targets. Here, using data from five RNA-seq processing pipelines applied to 6,690 human tumor and normal tissues, we show that for >12% of protein-coding genes, in at least 1% of samples, current best-in-class RNA-seq processing pipelines differ in their abundance estimates by more than four-fold using the same samples and the same set of RNA-seq reads, raising clinical concern.
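The headline figure above (genes whose pipeline estimates differ by more than four-fold in at least 1% of samples) can be made concrete with a minimal sketch. The function name, the input layout, and the pseudocount used to avoid division by zero are assumptions for illustration; the abstract does not describe how the authors computed the statistic.

```python
def discordant_gene_fraction(estimates, fold=4.0, min_sample_frac=0.01, pseudocount=1.0):
    """Fraction of genes whose pipelines disagree by more than `fold`
    in at least `min_sample_frac` of samples.

    estimates: {gene: [[value per pipeline for sample 0], [sample 1], ...]}
    pseudocount: assumed stabiliser so zero estimates do not divide by zero.
    """
    flagged = 0
    for gene, samples in estimates.items():
        # Count samples where the largest and smallest pipeline estimates
        # differ by more than the fold threshold.
        n_discordant = sum(
            1
            for values in samples
            if (max(values) + pseudocount) / (min(values) + pseudocount) > fold
        )
        if n_discordant / len(samples) >= min_sample_frac:
            flagged += 1
    return flagged / len(estimates)


# Toy data: two genes, two samples, three pipelines (values are made up).
toy = {
    "GENE_A": [[10, 11, 9], [100, 20, 100]],
    "GENE_B": [[5, 5, 5], [6, 6, 6]],
}
fraction = discordant_gene_fraction(toy)
```

With the toy values, only GENE_A has a sample in which the pipelines disagree by more than four-fold, so half of the genes are flagged.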
biorxiv bioinformatics 0-100-users 2018
Using long-read sequencing to detect imprinted DNA methylation, bioRxiv, 2018-10-17
Systematic variation in the methylation of cytosines at CpG sites plays a critical role in early development of humans and other mammals. Of particular interest are regions of differential methylation between parental alleles, as these often dictate monoallelic gene expression, resulting in parent of origin specific control of the embryonic transcriptome and subsequent development, in a phenomenon known as genomic imprinting. Using long-read nanopore sequencing we show that, with an average genomic coverage of approximately ten, it is possible to determine both the level of methylation of CpG sites and the haplotype from which each read arises. The long-read property is exploited to characterise, using novel methods, both methylation and haplotype for reads that have reduced basecalling precision compared to Sanger sequencing. We validate the analysis both through comparison of nanopore-derived methylation patterns with those from Reduced Representation Bisulfite Sequencing data and through comparison with previously reported data. Our analysis successfully identifies known imprinting control regions as well as some novel differentially methylated regions which, due to their proximity to hitherto unknown monoallelically expressed genes, may represent new imprinting control regions.
biorxiv genomics 0-100-users 2018
A membrane-depolarising toxin substrate of the Staphylococcus aureus Type VII protein secretion system targets eukaryotes and bacteria, bioRxiv, 2018-10-16
Summary: The type VII protein secretion system (T7SS) is conserved across Staphylococcus aureus strains and plays important roles in virulence and interbacterial competition. To date only one T7SS substrate protein, encoded in a subset of strains, has been functionally characterized. Here, using an unbiased proteomic approach, we identify TspA as a further T7SS substrate. TspA, encoded distantly from the T7SS gene cluster, is found across all S. aureus strains. Heterologous expression of TspA indicates that it has a toxic C-terminal domain that depolarizes membranes. The membrane depolarizing activity is alleviated by co-production of the TsaI immunity protein. Using a zebrafish hindbrain ventricle infection model, we demonstrate that the T7SS of strain RN6390 contributes to zebrafish mortality, and deletion of tspA leads to increased bacterial clearance in vivo. The toxin domain of TspA is highly polymorphic and S. aureus strains encode multiple tsaI homologues at the tspA locus, suggestive of additional roles in intra-species competition. In agreement, we demonstrate TspA-dependent growth inhibition of RN6390 by strain COL in the zebrafish infection model, that is alleviated by the presence of TsaI homologues. This is the first T7SS substrate protein shown to have activity against both eukaryotes and prokaryotes.
biorxiv microbiology 0-100-users 2018
ForestQC quality control on genetic variants from next-generation sequencing data using random forest, bioRxiv, 2018-10-16
Abstract: Next-generation sequencing technology (NGS) enables discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in sequencing technology or in variant calling algorithms. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove those variants with poor quality as they may cause spurious findings. In this paper, we present a statistical approach for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our method uses information on sequencing quality such as sequencing depth, genotyping quality, and GC contents to predict whether a certain variant is likely to contain errors. To evaluate our method, we applied it to two whole-genome sequencing datasets where one dataset consists of related individuals from families while the other consists of unrelated individuals. Results indicate that our method outperforms widely used methods for performing quality control on variants such as VQSR of GATK by considerably improving the quality of variants to be included in the analysis. Our approach is also very efficient, and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained with sequencing quality information and the filtering approach is an effective approach to perform quality control on genetic variants from sequencing data.
Author Summary: Genetic disorders can be caused by many types of genetic mutations, including common and rare single nucleotide variants, structural variants, insertions and deletions. Nowadays, next generation sequencing (NGS) technology allows us to identify various genetic variants that are associated with diseases. However, variants detected by NGS might have poor sequencing quality due to biases and errors in sequencing technologies and analysis tools. Therefore, it is critical to remove variants with low quality, which could cause spurious findings in follow-up analyses. Previously, people applied either hard filters or machine learning models for variant quality control (QC), which failed to filter out those variants accurately. Here, we developed a statistical tool, ForestQC, for variant QC by combining a filtering approach and a machine learning approach. We applied ForestQC to one family-based whole genome sequencing (WGS) dataset and one general case-control WGS dataset, to evaluate our method. Results show that ForestQC outperforms widely used methods for variant QC by considerably improving the quality of variants. Also, ForestQC is very efficient and scalable to large-scale sequencing datasets. Our study indicates that combining filtering approaches and machine learning approaches enables effective variant QC.
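The abstract does not say how the filtering and machine learning steps are combined. One plausible arrangement, assumed here purely for illustration, is that hard filters label the clear-cut variants and leave the ambiguous ones for a trained classifier. The sketch below shows only that filter-based triage step; the thresholds, field names, and the `triage` function are hypothetical and are not taken from ForestQC.

```python
# Hypothetical hard filters on the quality metrics named in the abstract.
# The thresholds are illustrative assumptions, not ForestQC's values.
FILTERS = {
    "depth": lambda v: v >= 20,
    "genotype_quality": lambda v: v >= 50,
    "gc_content": lambda v: 0.3 <= v <= 0.7,
}


def triage(variant):
    """Label a variant from the number of hard filters it fails.

    Returns "good" if it passes every filter, "bad" if it fails every
    filter, and "undetermined" otherwise. In the assumed design, the
    "good" and "bad" variants would train a classifier (for example a
    random forest) that then scores the "undetermined" ones.
    """
    failed = sum(1 for name, passes in FILTERS.items() if not passes(variant[name]))
    if failed == 0:
        return "good"
    if failed == len(FILTERS):
        return "bad"
    return "undetermined"


# Made-up example variants.
clean = {"depth": 35, "genotype_quality": 80, "gc_content": 0.45}
poor = {"depth": 4, "genotype_quality": 10, "gc_content": 0.85}
mixed = {"depth": 35, "genotype_quality": 10, "gc_content": 0.45}
labels = [triage(v) for v in (clean, poor, mixed)]
```

Keeping the ambiguous variants out of the training set is the point of this design choice: the classifier learns only from cases the filters agree on.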
biorxiv bioinformatics 0-100-users 2018
Substantial Batch Effects in TCGA Exome Sequences Undermine Pan-Cancer Analysis of Germline Variants, bioRxiv, 2018-10-16
Abstract
Background: In recent years, research on cancer predisposition germline variants has emerged as a prominent field. The identity of somatic mutations is based on a reliable mapping of the patient germline variants. In addition, the statistics of germline variants frequencies in healthy individuals and cancer patients is the basis for seeking candidates for cancer predisposition genes. The Cancer Genome Atlas (TCGA) is one of the main sources of such data, providing a diverse collection of molecular data including deep sequencing for more than 30 types of cancer from >10,000 patients.
Methods: Our hypothesis in this study is that whole exome sequences from healthy blood samples of cancer patients are not expected to show systematic differences among cancer types. To test this hypothesis, we analyzed common and rare germline variants across six cancer types, covering 2,241 samples from TCGA. In our analysis we accounted for inherent variables in the data including the different variant calling protocols, sequencing platforms, and ethnicity.
Results: We report on substantial batch effects in germline variants associated with cancer types. We attribute the effect to the specific sequencing centers that produced the data. Specifically, we measured 30% variability in the number of reported germline variants per sample across sequencing centers. The batch effect is further expressed in nucleotide composition and variant frequencies. Importantly, the batch effect causes substantial differences in germline variant distribution patterns across numerous genes, including prominent cancer predisposition genes such as BRCA1, RET, MAX, and KRAS. For most of known cancer predisposition genes, we found a distinct batch-dependent difference in germline variants.
Conclusion: TCGA germline data is exposed to strong batch effects with substantial variabilities among TCGA sequencing centers. We claim that those batch effects are consequential for numerous TCGA pan-cancer studies. In particular, these effects may compromise the reliability and the potency to detect new cancer predisposition genes. Furthermore, interpretation of pan-cancer analyses should be revisited in view of the source of the genomic data after accounting for the reported batch effects.
biorxiv cancer-biology 0-100-users 2018
4D imaging and analysis of multicellular tumour spheroid cell migration and invasion, bioRxiv, 2018-10-15
Studying and characterising tumour cell migration is critical for understanding disease progression and for assessing drug efficacy. Whilst tumour cell migration occurs fundamentally in 3 spatial dimensions (3D), for practical reasons, most migration studies to date have performed analysis in 2D. Here we imaged live multicellular tumour spheroids with lightsheet fluorescence microscopy to determine cellular migration and invasion in 3D over time (4D). We focused on glioblastoma, which are aggressive brain tumours, where cell invasion into the surrounding normal brain remains a major clinical challenge. We developed a workflow for analysing complex 3D cell movement, taking into account migration within the spheroid as well as invasion into the surrounding matrix. This provided metrics characterising cell motion, which we used to evaluate the efficacy of chemotherapeutics on invasion. These rich datasets open avenues for further studies on drug efficacy, microenvironment composition, as well as collective cell migration and metastatic potential.
biorxiv cell-biology 0-100-users 2018