ForestQC: quality control on genetic variants from next-generation sequencing data using random forest, bioRxiv, 2018-10-16
ABSTRACT
Next-generation sequencing technology (NGS) enables discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in sequencing technology or in variant calling algorithms. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove those variants with poor quality as they may cause spurious findings. In this paper, we present a statistical approach for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our method uses information on sequencing quality such as sequencing depth, genotyping quality, and GC content to predict whether a certain variant is likely to contain errors. To evaluate our method, we applied it to two whole-genome sequencing datasets where one dataset consists of related individuals from families while the other consists of unrelated individuals. Results indicate that our method outperforms widely used methods for performing quality control on variants such as VQSR of GATK by considerably improving the quality of variants to be included in the analysis. Our approach is also very efficient, and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained with sequencing quality information and the filtering approach is an effective approach to perform quality control on genetic variants from sequencing data.
Author Summary
Genetic disorders can be caused by many types of genetic mutations, including common and rare single nucleotide variants, structural variants, insertions and deletions. Nowadays, next generation sequencing (NGS) technology allows us to identify various genetic variants that are associated with diseases. However, variants detected by NGS might have poor sequencing quality due to biases and errors in sequencing technologies and analysis tools. Therefore, it is critical to remove variants with low quality, which could cause spurious findings in follow-up analyses. Previously, people applied either hard filters or machine learning models for variant quality control (QC), which failed to filter out those variants accurately. Here, we developed a statistical tool, ForestQC, for variant QC by combining a filtering approach and a machine learning approach. We applied ForestQC to one family-based whole genome sequencing (WGS) dataset and one general case-control WGS dataset, to evaluate our method. Results show that ForestQC outperforms widely used methods for variant QC by considerably improving the quality of variants. Also, ForestQC is very efficient and scalable to large-scale sequencing datasets. Our study indicates that combining filtering approaches and machine learning approaches enables effective variant QC.
biorxiv bioinformatics 0-100-users 2018
Programmed DNA elimination of germline development genes in songbirds, bioRxiv, 2018-10-16
Genomes can vary within individual organisms. Programmed DNA elimination leads to dramatic changes in genome organisation during the germline-soma differentiation of ciliates, lampreys, nematodes, and various other eukaryotes. A particularly remarkable example of tissue-specific genome differentiation is the germline-restricted chromosome (GRC) in the zebra finch which is consistently absent from somatic cells. Although the zebra finch is an important animal model system, molecular evidence from its large GRC (>150 megabases) is limited to a short intergenic region and a single mRNA. Here, we combined cytogenetic, genomic, transcriptomic, and proteomic evidence to resolve the evolutionary origin and functional significance of the GRC. First, by generating tissue-specific de-novo linked-read genome assemblies and re-sequencing two additional germline and soma samples, we found that the GRC contains at least 115 genes which are paralogous to single-copy genes on 18 autosomes and the Z chromosome. We detected an amplification of 38 GRC-linked genes into high copy numbers (up to 308 copies) but, surprisingly, no enrichment of transposable elements on the GRC. Second, transcriptome and proteome data provided evidence for functional expression of GRC genes at the RNA and protein levels in testes and ovaries. Interestingly, the GRC is enriched for genes with highly expressed orthologs in chicken gonads and gene ontologies involved in female gonad development. Third, we detected evolutionary strata of GRC-linked genes. Genes such as bicc1 and trim71 have resided on the GRC for tens of millions of years, whereas dozens have become GRC-linked very recently. The GRC is thus likely widespread in songbirds (half of all bird species) and its rapid evolution may have contributed to their diversification. 
Together, our results demonstrate a highly dynamic evolutionary history of the songbird GRC leading to dramatic germline-soma genome differences as a novel mechanism to minimize genetic conflict between germline and soma.
biorxiv evolutionary-biology 200-500-users 2018
Substantial Batch Effects in TCGA Exome Sequences Undermine Pan-Cancer Analysis of Germline Variants, bioRxiv, 2018-10-16
ABSTRACT
Background: In recent years, research on cancer predisposition germline variants has emerged as a prominent field. The identity of somatic mutations is based on a reliable mapping of the patient germline variants. In addition, the statistics of germline variant frequencies in healthy individuals and cancer patients are the basis for seeking candidates for cancer predisposition genes. The Cancer Genome Atlas (TCGA) is one of the main sources of such data, providing a diverse collection of molecular data including deep sequencing for more than 30 types of cancer from >10,000 patients.
Methods: Our hypothesis in this study is that whole exome sequences from healthy blood samples of cancer patients are not expected to show systematic differences among cancer types. To test this hypothesis, we analyzed common and rare germline variants across six cancer types, covering 2,241 samples from TCGA. In our analysis we accounted for inherent variables in the data including the different variant calling protocols, sequencing platforms, and ethnicity.
Results: We report on substantial batch effects in germline variants associated with cancer types. We attribute the effect to the specific sequencing centers that produced the data. Specifically, we measured 30% variability in the number of reported germline variants per sample across sequencing centers. The batch effect is further expressed in nucleotide composition and variant frequencies. Importantly, the batch effect causes substantial differences in germline variant distribution patterns across numerous genes, including prominent cancer predisposition genes such as BRCA1, RET, MAX, and KRAS. For most known cancer predisposition genes, we found a distinct batch-dependent difference in germline variants.
Conclusion: TCGA germline data is exposed to strong batch effects with substantial variability among TCGA sequencing centers. We claim that those batch effects are consequential for numerous TCGA pan-cancer studies. In particular, these effects may compromise the reliability and the potency to detect new cancer predisposition genes. Furthermore, interpretation of pan-cancer analyses should be revisited in view of the source of the genomic data after accounting for the reported batch effects.
biorxiv cancer-biology 0-100-users 2018
4D imaging and analysis of multicellular tumour spheroid cell migration and invasion, bioRxiv, 2018-10-15
Studying and characterising tumour cell migration is critical for understanding disease progression and for assessing drug efficacy. Whilst tumour cell migration occurs fundamentally in 3 spatial dimensions (3D), for practical reasons, most migration studies to date have performed analysis in 2D. Here we imaged live multicellular tumour spheroids with lightsheet fluorescence microscopy to determine cellular migration and invasion in 3D over time (4D). We focused on glioblastoma, which are aggressive brain tumours, where cell invasion into the surrounding normal brain remains a major clinical challenge. We developed a workflow for analysing complex 3D cell movement, taking into account migration within the spheroid as well as invasion into the surrounding matrix. This provided metrics characterising cell motion, which we used to evaluate the efficacy of chemotherapeutics on invasion. These rich datasets open avenues for further studies on drug efficacy, microenvironment composition, as well as collective cell migration and metastatic potential.
biorxiv cell-biology 0-100-users 2018
Brain-wide cellular resolution imaging of Cre transgenic zebrafish lines for functional circuit-mapping, bioRxiv, 2018-10-15
Abstract
Decoding the functional connectivity of the nervous system is facilitated by transgenic methods that express a genetically encoded reporter or effector in specific neurons; however, most transgenic lines show broad spatiotemporal and cell-type expression. Increased specificity can be achieved using intersectional genetic methods which restrict reporter expression to cells that co-express multiple drivers, such as Gal4 and Cre. To facilitate intersectional targeting in zebrafish, we have generated more than 50 new Cre lines, and co-registered brain expression images with the Zebrafish Brain Browser, a cellular resolution atlas of 264 transgenic lines. Lines labeling neurons of interest can be identified using a web-browser to perform a 3D spatial search (zbbrowser.com). This resource facilitates the design of intersectional genetic experiments and will advance a wide range of precision circuit-mapping studies.
biorxiv neuroscience 0-100-users 2018
Harnessing the Anti-Cancer Natural Product Nimbolide for Targeted Protein Degradation, bioRxiv, 2018-10-15
Abstract
Nimbolide, a terpenoid natural product derived from the Neem tree, impairs cancer pathogenicity across many types of human cancers; however, the direct targets and mechanisms by which nimbolide exerts its effects are poorly understood. Here, we used activity-based protein profiling (ABPP) chemoproteomic platforms to discover that nimbolide reacts with a novel functional cysteine crucial for substrate recognition in the E3 ubiquitin ligase RNF114. Nimbolide impairs breast cancer cell proliferation in part by disrupting RNF114 substrate recognition, leading to inhibition of ubiquitination and degradation of tumor suppressors such as p21, resulting in their rapid stabilization. We further demonstrate that nimbolide can be harnessed to recruit RNF114 as an E3 ligase in targeted protein degradation applications and show that synthetically simpler scaffolds are also capable of accessing this unique reactive site. Our study highlights the utility of ABPP platforms in uncovering unique druggable modalities accessed by natural products for cancer therapy and targeted protein degradation applications.
biorxiv cancer-biology 0-100-users 2018