A membrane-depolarising toxin substrate of the Staphylococcus aureus Type VII protein secretion system targets eukaryotes and bacteria, bioRxiv, 2018-10-16

Summary: The type VII protein secretion system (T7SS) is conserved across Staphylococcus aureus strains and plays important roles in virulence and interbacterial competition. To date only one T7SS substrate protein, encoded in a subset of strains, has been functionally characterized. Here, using an unbiased proteomic approach, we identify TspA as a further T7SS substrate. TspA, encoded distantly from the T7SS gene cluster, is found across all S. aureus strains. Heterologous expression of TspA indicates that it has a toxic C-terminal domain that depolarizes membranes. The membrane depolarizing activity is alleviated by co-production of the TsaI immunity protein. Using a zebrafish hindbrain ventricle infection model, we demonstrate that the T7SS of strain RN6390 contributes to zebrafish mortality, and deletion of tspA leads to increased bacterial clearance in vivo. The toxin domain of TspA is highly polymorphic and S. aureus strains encode multiple tsaI homologues at the tspA locus, suggestive of additional roles in intra-species competition. In agreement, we demonstrate TspA-dependent growth inhibition of RN6390 by strain COL in the zebrafish infection model, that is alleviated by the presence of TsaI homologues. This is the first T7SS substrate protein shown to have activity against both eukaryotes and prokaryotes.

biorxiv microbiology 0-100-users 2018

ForestQC quality control on genetic variants from next-generation sequencing data using random forest, bioRxiv, 2018-10-16

Abstract: Next-generation sequencing technology (NGS) enables discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in sequencing technology or in variant calling algorithms. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove those variants with poor quality as they may cause spurious findings. In this paper, we present a statistical approach for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our method uses information on sequencing quality such as sequencing depth, genotyping quality, and GC contents to predict whether a certain variant is likely to contain errors. To evaluate our method, we applied it to two whole-genome sequencing datasets where one dataset consists of related individuals from families while the other consists of unrelated individuals. Results indicate that our method outperforms widely used methods for performing quality control on variants such as VQSR of GATK by considerably improving the quality of variants to be included in the analysis. Our approach is also very efficient, and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained with sequencing quality information and the filtering approach is an effective approach to perform quality control on genetic variants from sequencing data.

Author Summary: Genetic disorders can be caused by many types of genetic mutations, including common and rare single nucleotide variants, structural variants, insertions and deletions. Nowadays, next generation sequencing (NGS) technology allows us to identify various genetic variants that are associated with diseases. However, variants detected by NGS might have poor sequencing quality due to biases and errors in sequencing technologies and analysis tools. Therefore, it is critical to remove variants with low quality, which could cause spurious findings in follow-up analyses. Previously, people applied either hard filters or machine learning models for variant quality control (QC), which failed to filter out those variants accurately. Here, we developed a statistical tool, ForestQC, for variant QC by combining a filtering approach and a machine learning approach. We applied ForestQC to one family-based whole genome sequencing (WGS) dataset and one general case-control WGS dataset, to evaluate our method. Results show that ForestQC outperforms widely used methods for variant QC by considerably improving the quality of variants. Also, ForestQC is very efficient and scalable to large-scale sequencing datasets. Our study indicates that combining filtering approaches and machine learning approaches enables effective variant QC.
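The idea of combining a filtering step with a learned model can be illustrated with a minimal sketch. It is not the ForestQC implementation: the thresholds, the toy variants, and the function names (`hard_filter`, `classify`) are all hypothetical, and the abstract does not say how the two steps are wired together. The sketch assumes that hard filters label the clear-cut variants and that a model trained on those labels decides the rest. To stay dependency-free, a nearest-centroid classifier stands in for the random forest named in the title.

```python
from statistics import mean

FEATURES = ("depth", "gq", "gc")  # sequencing depth, genotype quality, GC content

# Hypothetical thresholds, chosen only for illustration.
GOOD_MIN = {"depth": 20.0, "gq": 50.0}
BAD_MAX = {"depth": 8.0, "gq": 20.0}


def hard_filter(v):
    """Step 1: label a variant 'good', 'bad', or 'undetermined' by thresholds."""
    if v["depth"] >= GOOD_MIN["depth"] and v["gq"] >= GOOD_MIN["gq"]:
        return "good"
    if v["depth"] < BAD_MAX["depth"] or v["gq"] < BAD_MAX["gq"]:
        return "bad"
    return "undetermined"


def classify(variants):
    """Step 2: resolve 'undetermined' variants with a model fit on step-1 labels."""
    labels = [hard_filter(v) for v in variants]
    good = [v for v, lab in zip(variants, labels) if lab == "good"]
    bad = [v for v, lab in zip(variants, labels) if lab == "bad"]
    if not good or not bad:
        return labels  # nothing to learn from; leave labels unchanged

    # Scale each feature by its observed range so no single feature dominates.
    span = {}
    for f in FEATURES:
        values = [v[f] for v in variants]
        span[f] = (max(values) - min(values)) or 1.0

    centroids = {
        "good": {f: mean(v[f] for v in good) for f in FEATURES},
        "bad": {f: mean(v[f] for v in bad) for f in FEATURES},
    }

    def distance(v, c):
        return sum(((v[f] - c[f]) / span[f]) ** 2 for f in FEATURES)

    resolved = []
    for v, lab in zip(variants, labels):
        if lab == "undetermined":
            # Stand-in for the random forest: pick the nearer class centroid.
            lab = min(centroids, key=lambda name: distance(v, centroids[name]))
        resolved.append(lab)
    return resolved


# Toy variants (made-up numbers, not from either dataset in the study).
variants = [
    {"id": "v1", "depth": 30, "gq": 90, "gc": 0.42},
    {"id": "v2", "depth": 28, "gq": 80, "gc": 0.45},
    {"id": "v3", "depth": 5, "gq": 15, "gc": 0.70},
    {"id": "v4", "depth": 6, "gq": 10, "gc": 0.68},
    {"id": "v5", "depth": 22, "gq": 45, "gc": 0.44},
    {"id": "v6", "depth": 10, "gq": 25, "gc": 0.66},
]
result = classify(variants)
for v, lab in zip(variants, result):
    print(v["id"], lab)
```

In this toy run, `v5` and `v6` pass through the filters as undetermined and are then assigned by the model. Swapping the centroid step for a real random forest would leave the two-stage structure unchanged.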

biorxiv bioinformatics 0-100-users 2018

Substantial Batch Effects in TCGA Exome Sequences Undermine Pan-Cancer Analysis of Germline Variants, bioRxiv, 2018-10-16

Abstract

Background: In recent years, research on cancer predisposition germline variants has emerged as a prominent field. The identity of somatic mutations is based on a reliable mapping of the patient germline variants. In addition, the statistics of germline variants frequencies in healthy individuals and cancer patients is the basis for seeking candidates for cancer predisposition genes. The Cancer Genome Atlas (TCGA) is one of the main sources of such data, providing a diverse collection of molecular data including deep sequencing for more than 30 types of cancer from >10,000 patients.

Methods: Our hypothesis in this study is that whole exome sequences from healthy blood samples of cancer patients are not expected to show systematic differences among cancer types. To test this hypothesis, we analyzed common and rare germline variants across six cancer types, covering 2,241 samples from TCGA. In our analysis we accounted for inherent variables in the data including the different variant calling protocols, sequencing platforms, and ethnicity.

Results: We report on substantial batch effects in germline variants associated with cancer types. We attribute the effect to the specific sequencing centers that produced the data. Specifically, we measured 30% variability in the number of reported germline variants per sample across sequencing centers. The batch effect is further expressed in nucleotide composition and variant frequencies. Importantly, the batch effect causes substantial differences in germline variant distribution patterns across numerous genes, including prominent cancer predisposition genes such as BRCA1, RET, MAX, and KRAS. For most of known cancer predisposition genes, we found a distinct batch-dependent difference in germline variants.

Conclusion: TCGA germline data is exposed to strong batch effects with substantial variabilities among TCGA sequencing centers. We claim that those batch effects are consequential for numerous TCGA pan-cancer studies. In particular, these effects may compromise the reliability and the potency to detect new cancer predisposition genes. Furthermore, interpretation of pan-cancer analyses should be revisited in view of the source of the genomic data after accounting for the reported batch effects.

biorxiv cancer-biology 0-100-users 2018

 

