Uncertainty in RNA-seq gene expression data, bioRxiv, 2018-10-17
RNA-sequencing data is widely used to identify disease biomarkers and therapeutic targets. Here, using data from five RNA-seq processing pipelines applied to 6,690 human tumor and normal tissues, we show that for >12% of protein-coding genes, in at least 1% of samples, current best-in-class RNA-seq processing pipelines differ in their abundance estimates by more than four-fold using the same samples and the same set of RNA-seq reads, raising clinical concern.
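The headline statistic can be made concrete with a small sketch. This is a minimal illustration, not the authors' code: the nested-dictionary layout, the function names, the pseudocount, and the toy numbers are all assumptions made for this example. It flags a gene as discordant when, in at least 1% of samples, the highest and lowest pipeline estimates differ by more than four-fold.

```python
def max_fold_difference(estimates, pseudocount=1.0):
    """Ratio of the largest to the smallest pipeline estimate for one gene in one sample.

    The pseudocount is an assumption of this sketch; it avoids division by zero.
    """
    return (max(estimates) + pseudocount) / (min(estimates) + pseudocount)


def discordant_genes(abundance, fold=4.0, min_sample_fraction=0.01):
    """Return genes whose pipelines disagree by more than `fold` in enough samples.

    abundance: {gene: {sample: [estimate from each pipeline]}} (hypothetical layout)
    """
    flagged = []
    for gene, per_sample in abundance.items():
        if not per_sample:
            continue  # no samples for this gene, nothing to compare
        n_discordant = sum(
            1 for estimates in per_sample.values()
            if max_fold_difference(estimates) > fold
        )
        if n_discordant / len(per_sample) >= min_sample_fraction:
            flagged.append(gene)
    return flagged


# Toy data: two genes, three samples, five hypothetical pipelines per sample.
toy = {
    "GENE_A": {
        "s1": [10.0, 11.0, 9.5, 10.2, 10.8],
        "s2": [20.0, 19.0, 21.0, 20.5, 22.0],
        "s3": [5.0, 5.5, 4.8, 5.1, 5.3],
    },
    "GENE_B": {
        "s1": [2.0, 30.0, 3.0, 2.5, 2.2],  # one pipeline is far from the rest
        "s2": [8.0, 8.5, 7.9, 8.1, 8.3],
        "s3": [1.0, 1.2, 0.9, 1.1, 1.0],
    },
}

print(discordant_genes(toy))
```

With the toy data only GENE_B is flagged, because one of its three samples shows a pipeline disagreement well above four-fold.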
biorxiv bioinformatics 0-100-users 2018
ForestQC: quality control on genetic variants from next-generation sequencing data using random forest, bioRxiv, 2018-10-16
Abstract: Next-generation sequencing (NGS) technology enables discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in sequencing technology or in variant calling algorithms. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove poor-quality variants, as they may cause spurious findings. In this paper, we present a statistical approach for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our method uses information on sequencing quality, such as sequencing depth, genotyping quality, and GC content, to predict whether a given variant is likely to contain errors. To evaluate our method, we applied it to two whole-genome sequencing datasets, one consisting of related individuals from families and the other of unrelated individuals. Results indicate that our method outperforms widely used methods for variant quality control, such as VQSR of GATK, by considerably improving the quality of the variants included in the analysis. Our approach is also very efficient and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained on sequencing quality information with the filtering approach is an effective way to perform quality control on genetic variants from sequencing data.
Author Summary: Genetic disorders can be caused by many types of genetic mutations, including common and rare single nucleotide variants, structural variants, insertions and deletions. Next-generation sequencing (NGS) technology now allows us to identify various genetic variants that are associated with diseases. However, variants detected by NGS might have poor sequencing quality due to biases and errors in sequencing technologies and analysis tools. Therefore, it is critical to remove low-quality variants, which could cause spurious findings in follow-up analyses. Previously, either hard filters or machine learning models were applied for variant quality control (QC), and neither filtered out those variants accurately. Here, we developed a statistical tool, ForestQC, for variant QC by combining a filtering approach and a machine learning approach. To evaluate our method, we applied ForestQC to one family-based whole-genome sequencing (WGS) dataset and one general case-control WGS dataset. Results show that ForestQC outperforms widely used methods for variant QC by considerably improving the quality of variants. ForestQC is also very efficient and scales to large sequencing datasets. Our study indicates that combining filtering approaches and machine learning approaches enables effective variant QC.
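One plausible way to combine a filter with a learned model can be sketched as follows. This is not ForestQC's code or interface: the abstract does not say how the two stages are connected, so the three-way triage, the single missing-rate filter, its thresholds, the dictionary fields, and the toy variants are all assumptions. To keep the sketch dependency-free, a nearest-centroid classifier stands in for the random forest; only the feature names (depth, genotype quality, GC content) come from the abstract.

```python
from statistics import mean

# Sequencing-quality features named in the abstract.
FEATURES = ("depth", "genotype_quality", "gc_content")


def triage(variant, max_missing_good=0.01, max_missing_bad=0.10):
    """Hard-filter step: label a variant 'good', 'bad', or 'undetermined'.

    The filter (missing-genotype rate) and both thresholds are hypothetical.
    """
    if variant["missing_rate"] <= max_missing_good:
        return "good"
    if variant["missing_rate"] >= max_missing_bad:
        return "bad"
    return "undetermined"


def centroid(variants):
    """Mean feature vector of a non-empty group of variants."""
    return [mean(v[f] for v in variants) for f in FEATURES]


def classify_undetermined(variants):
    """Train on filter-labelled variants, then predict the undetermined ones.

    Assumes the filters yield at least one 'good' and one 'bad' variant.
    """
    groups = {"good": [], "bad": [], "undetermined": []}
    for v in variants:
        groups[triage(v)].append(v)
    centroids = {label: centroid(groups[label]) for label in ("good", "bad")}

    def predict(v):
        # Nearest centroid by squared Euclidean distance on unscaled features;
        # this is a simple stand-in for a random forest, not the real model.
        distance = {
            label: sum((v[f] - c) ** 2 for f, c in zip(FEATURES, cen))
            for label, cen in centroids.items()
        }
        return min(distance, key=distance.get)

    return {v["id"]: predict(v) for v in groups["undetermined"]}


toy_variants = [
    {"id": "v1", "missing_rate": 0.000, "depth": 30, "genotype_quality": 95, "gc_content": 0.45},
    {"id": "v2", "missing_rate": 0.005, "depth": 32, "genotype_quality": 97, "gc_content": 0.50},
    {"id": "v3", "missing_rate": 0.200, "depth": 8, "genotype_quality": 20, "gc_content": 0.75},
    {"id": "v4", "missing_rate": 0.150, "depth": 10, "genotype_quality": 25, "gc_content": 0.70},
    {"id": "v5", "missing_rate": 0.050, "depth": 29, "genotype_quality": 90, "gc_content": 0.48},
    {"id": "v6", "missing_rate": 0.040, "depth": 9, "genotype_quality": 22, "gc_content": 0.72},
]

print(classify_undetermined(toy_variants))
```

In the toy data, v5 resembles the filter-passing variants and v6 resembles the filter-failing ones, so they are assigned to "good" and "bad" respectively. A real implementation would scale the features and use an ensemble classifier in place of the centroid rule.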
biorxiv bioinformatics 0-100-users 2018
Selene: a PyTorch-based deep learning library for biological sequence-level data, bioRxiv, 2018-10-10
Abstract: To enable the application of deep learning in biology, we present Selene (https://selene.flatironinstitute.org), a PyTorch-based deep learning library for fast and easy development, training, and application of deep learning model architectures for any biological sequences. We demonstrate how Selene allows researchers to easily train a published architecture on new data, develop and evaluate a new architecture, and use a trained model to answer biological questions of interest.
biorxiv bioinformatics 100-200-users 2018
MetaCell: analysis of single cell RNA-seq data using k-NN graph partitions, bioRxiv, 2018-10-09
Abstract: Single cell RNA-seq (scRNA-seq) has become the method of choice for analyzing mRNA distributions in heterogeneous cell populations. scRNA-seq only partially samples the cells in a tissue and the RNA in each cell, resulting in sparse data that challenge analysis. We develop a methodology that addresses scRNA-seq’s sparsity by partitioning the data into metacells: disjoint, homogeneous and highly compact groups of cells, each exhibiting only sampling variance. Metacells constitute local building blocks for clustering and quantitative analysis of gene expression, while not enforcing any global structure on the data, thereby maintaining statistical control and minimizing biases. We illustrate the MetaCell framework by re-analyzing cell type and transcriptional gradients in peripheral blood and whole organism scRNA-seq maps. Our algorithms are implemented in the new MetaCell R/C++ software package.
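The idea of grouping cells through a k-NN graph can be illustrated with a deliberately simplified sketch. This is not the MetaCell algorithm or its R/C++ interface: connected components of a mutual k-NN graph are a much cruder stand-in for the paper's graph partitioning, and the function names, the Euclidean distance, and the toy expression vectors are assumptions made for the example.

```python
from math import dist


def knn_graph(cells, k):
    """Map each cell name to the set of its k nearest neighbours (Euclidean distance)."""
    graph = {}
    for name, vec in cells.items():
        others = sorted(
            (other for other in cells if other != name),
            key=lambda other: dist(vec, cells[other]),
        )
        graph[name] = set(others[:k])
    return graph


def mutual_knn_components(cells, k):
    """Group cells into connected components of the mutual k-NN graph."""
    graph = knn_graph(cells, k)
    # Keep an edge only when each cell is in the other's neighbour set.
    mutual = {a: {b for b in nbrs if a in graph[b]} for a, nbrs in graph.items()}

    seen, groups = set(), []
    for start in cells:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(mutual[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Toy expression vectors (three hypothetical genes per cell).
toy_cells = {
    "c1": (10, 0, 1), "c2": (11, 1, 0), "c3": (9, 0, 2),
    "c4": (0, 12, 8), "c5": (1, 11, 9), "c6": (0, 13, 7),
}

print(mutual_knn_components(toy_cells, k=2))
```

On the toy data the two well-separated groups of three cells come out as two disjoint components. Real scRNA-seq data are sparse and noisy, which is why the paper relies on a dedicated partitioning procedure instead of plain connected components.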
biorxiv bioinformatics 0-100-users 2018
Combining Gene Ontology with Deep Neural Networks to Enhance the Clustering of Single Cell RNA-Seq Data, bioRxiv, 2018-10-07
Abstract
Background: Single cell RNA sequencing (scRNA-seq) is applied to assay the individual transcriptomes of large numbers of cells. Gene expression at the single-cell level provides an opportunity for better understanding of cell function and for new discoveries in biomedical areas. To ensure that single-cell gene expression data are interpreted appropriately, it is crucial to develop new computational methods.
Results: In this article, we construct the structure of neural networks based on prior knowledge from Gene Ontology (GO). By integrating GO with both unsupervised and supervised models, we propose two novel methods for clustering scRNA-seq data, named GOAE (Gene Ontology AutoEncoder) and GONN (Gene Ontology Neural Network) respectively.
Conclusions: The evaluation results show that the proposed models outperform some state-of-the-art approaches. Furthermore, by incorporating GO, we provide an opportunity to interpret the underlying biological mechanism behind the neural network-based model.
biorxiv bioinformatics 0-100-users 2018
Bazam: A rapid method for read extraction and realignment of high throughput sequencing data, bioRxiv, 2018-10-04
Abstract
Background: As costs of high throughput sequencing have fallen, we are seeing vast quantities of short read genomic data being generated. Often, the data is exchanged and stored as aligned reads, which provides high compression and convenient access for many analyses. However, aligned data becomes outdated as new reference genomes and alignment methods become available. Moreover, some applications cannot utilise pre-aligned reads at all, necessitating conversion back to raw format (FASTQ) before they can be used. In both cases, the process of extraction and realignment is expensive and time consuming.
Findings: We describe Bazam, a tool that efficiently extracts the original paired FASTQ from reads stored in aligned form (BAM or CRAM format). Bazam extracts reads in a format that directly allows realignment with popular aligners with high concurrency. Through eliminating steps and increasing the accessible concurrency, Bazam facilitates up to a 90% reduction in the time required for realignment compared to standard methods. Bazam can support selective extraction of read pairs from focused genomic regions, further increasing efficiency for targeted analyses. Bazam is additionally suitable as a base for other applications that require efficient paired read information, such as quality control, structural variant calling and alignment comparison.
Conclusions: Bazam offers significant improvements for users needing to realign genomic data.
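The core difficulty in recovering paired FASTQ from aligned data is that the two mates of a pair rarely sit next to each other in a coordinate-sorted file. The sketch below shows one simple way to handle that, under stated assumptions: it is not Bazam's implementation, the dictionary-based read records and function names are hypothetical, interleaved output is an assumed format, and secondary or supplementary alignments are ignored. It buffers each read until its mate arrives and restores reverse-strand reads to their original orientation, since aligned formats store them reverse-complemented.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def original_orientation(read):
    """Undo the reverse-complementing applied to reads aligned to the reverse strand."""
    if read["reverse"]:
        return read["seq"].translate(COMPLEMENT)[::-1], read["qual"][::-1]
    return read["seq"], read["qual"]


def fastq_record(read, mate_number):
    """Format one read as a four-line FASTQ record."""
    seq, qual = original_orientation(read)
    return f"@{read['name']}/{mate_number}\n{seq}\n+\n{qual}\n"


def interleaved_pairs(reads):
    """Yield interleaved FASTQ text for each pair once both mates have been seen."""
    waiting = {}  # read name -> first-encountered mate
    for read in reads:
        mate = waiting.pop(read["name"], None)
        if mate is None:
            waiting[read["name"]] = read  # hold until the mate turns up
            continue
        first, second = (read, mate) if read["first"] else (mate, read)
        yield fastq_record(first, 1) + fastq_record(second, 2)


# Toy reads in coordinate order; mates of r1 and r2 are interleaved.
toy_reads = [
    {"name": "r1", "first": True, "reverse": False, "seq": "AACC", "qual": "IIHH"},
    {"name": "r2", "first": False, "reverse": True, "seq": "GGTA", "qual": "ABCD"},
    {"name": "r1", "first": False, "reverse": True, "seq": "TTGC", "qual": "FFGG"},
    {"name": "r2", "first": True, "reverse": False, "seq": "CATG", "qual": "JJJJ"},
]

for pair in interleaved_pairs(toy_reads):
    print(pair, end="")
```

Because pairs are emitted as soon as they are complete, the output could be streamed straight into an aligner that accepts interleaved input, without writing an intermediate FASTQ file. The buffer grows with the number of reads whose mates are far away, which is the main memory cost of this approach.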
biorxiv bioinformatics 200-500-users 2018