ForestQC quality control on genetic variants from next-generation sequencing data using random forest, bioRxiv, 2018-10-16

ABSTRACTNext-generation sequencing technology (NGS) enables discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in sequencing technology or in variant calling algorithms. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove those variants with poor quality as they may cause spurious findings. In this paper, we present a statistical approach for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our method uses information on sequencing quality such as sequencing depth, genotyping quality, and GC contents to predict whether a certain variant is likely to contain errors. To evaluate our method, we applied it to two whole-genome sequencing datasets where one dataset consists of related individuals from families while the other consists of unrelated individuals. Results indicate that our method outperforms widely used methods for performing quality control on variants such as VQSR of GATK by considerably improving the quality of variants to be included in the analysis. Our approach is also very efficient, and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained with sequencing quality information and the filtering approach is an effective approach to perform quality control on genetic variants from sequencing data.Author SummaryGenetic disorders can be caused by many types of genetic mutations, including common and rare single nucleotide variants, structural variants, insertions and deletions. Nowadays, next generation sequencing (NGS) technology allows us to identify various genetic variants that are associated with diseases. However, variants detected by NGS might have poor sequencing quality due to biases and errors in sequencing technologies and analysis tools. Therefore, it is critical to remove variants with low quality, which could cause spurious findings in follow-up analyses. Previously, people applied either hard filters or machine learning models for variant quality control (QC), which failed to filter out those variants accurately. Here, we developed a statistical tool, ForestQC, for variant QC by combining a filtering approach and a machine learning approach. We applied ForestQC to one family-based whole genome sequencing (WGS) dataset and one general case-control WGS dataset, to evaluate our method. Results show that ForestQC outperforms widely used methods for variant QC by considerably improving the quality of variants. Also, ForestQC is very efficient and scalable to large-scale sequencing datasets. Our study indicates that combining filtering approaches and machine learning approaches enables effective variant QC.

biorxiv bioinformatics 0-100-users 2018

Bazam A rapid method for read extraction and realignment of high throughput sequencing data, bioRxiv, 2018-10-04

AbstractBackgroundAs costs of high throughput sequencing have fallen, we are seeing vast quantities of short read genomic data being generated. Often, the data is exchanged and stored as aligned reads, which provides high compression and convenient access for many analyses. However, aligned data becomes outdated as new reference genomes and alignment methods become available. Moreover, some applications cannot utilise pre-aligned reads at all, necessitating conversion back to raw format (FASTQ) before they can be used. In both cases, the process of extraction and realignment is expensive and time consuming.FindingsWe describe Bazam, a tool that efficiently extracts the original paired FASTQ from reads stored in aligned form (BAM or CRAM format). Bazam extracts reads in a format that directly allows realignment with popular aligners with high concurrency. Through eliminating steps and increasing the accessible concurrency, Bazam facilitates up to a 90% reduction in the time required for realignment compared to standard methods. Bazam can support selective extraction of read pairs from focused genomic regions, further increasing efficiency for targeted analyses. Bazam is additionally suitable as a base for other applications that require efficient paired read information, such as quality control, structural variant calling and alignment comparison.ConclusionsBazam offers significant improvements for users needing to realign genomic data.

biorxiv bioinformatics 200-500-users 2018

 

Created with the audiences framework by Jedidiah Carlson

Powered by Hugo