A Simple Deep Learning Approach for Detecting Duplications and Deletions in Next-Generation Sequencing Data, bioRxiv, 2019-06-03
AbstractCopy number variants (CNV) are associated with phenotypic variation in several species. However, properly detecting changes in copy numbers of sequences remains a difficult problem, especially in lower quality or lower coverage next-generation sequencing data. Here, inspired by recent applications of machine learning in genomics, we describe a method to detect duplications and deletions in short-read sequencing data. In low coverage data, machine learning appears to be more powerful in the detection of CNVs than the gold-standard methods or coverage estimation alone, and of equal power in high coverage data. We also demonstrate how replicating training sets allows a more precise detection of CNVs, even identifying novel CNVs in two genomes previously surveyed thoroughly for CNVs using long read data.Available at <jatsext-link xmlnsxlink=httpwww.w3.org1999xlink ext-link-type=uri xlinkhref=httpsgithub.comtomh1llldudeml>httpsgithub.comtomh1llldudeml<jatsext-link>
biorxiv bioinformatics 100-200-users 2019Alignment and mapping methodology influence transcript abundance estimation, bioRxiv, 2019-06-03
AbstractThe accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy.We investigate the effect of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential gene expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large, and can affect downstream analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally-acquired samples. We discuss best practices regarding alignment for the purposes of quantification, and also introduce a new hybrid alignment methodology, called selective alignment (SA), to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment.
biorxiv bioinformatics 100-200-users 2019Minimum epistasis interpolation for sequence-function relationships, bioRxiv, 2019-06-02
AbstractMassively parallel phenotyping assays have provided unprecedented insight into how multiple mutations combine to determine biological function. While these assays can measure phenotypes for thousands to millions of genotypes in a single experiment, in practice these measurements are not exhaustive, so that there is a need for techniques to impute values for genotypes whose phenotypes are not directly assayed. Here we present a method based on the idea of inferring the least epistatic possible sequence-function relationship compatible with the data. In particular, we infer the reconstruction in which mutational effects change as little as possible across adjacent genetic backgrounds. Although this method is highly conservative and has no tunable parameters, it also makes no assumptions about the form that genetic interactions take, resulting in predictions that can behave in a very complicated manner where the data require it but which are nearly additive where data is sparse or absent. We apply this method to analyze a fitness landscape for protein G, showing that our technique can provide a substantially less epistatic fit to the landscape than standard methods with little loss in predictive power. Moreover, our analysis reveals that the complex structure of epistasis observed in this dataset can be well-understood in terms of a simple qualitative model consisting of three fitness peaks where the landscape is locally additive in the vicinity of each peak.
biorxiv bioinformatics 0-100-users 20193D RNA-seq - a powerful and flexible tool for rapid and accurate differential expression and alternative splicing analysis of RNA-seq data for biologists, bioRxiv, 2019-06-01
AbstractRNA-sequencing (RNA-seq) analysis of gene expression and alternative splicing should be routine and robust but is often a bottleneck for biologists because of different and complex analysis programs and reliance on skilled bioinformaticians to perform the analysis. To overcome these issues, we have developed the “3D RNA-seq” App, an R shiny App which provides an easy-to-use, flexible and powerful tool for the three-way differential analysis Differential Expression (DE), Differential Alternative Splicing (DAS) and Differential Transcript Usage (DTU) of RNA-seq data. The full analysis is extremely rapidand can be done within hours. The program integrates Limma, a state-of-the-art, highly rated differential expression analysis tool and adopts best practice for RNA-seq analysis. It runs the analysis through a user-friendly graphical interface, can handle complex experimental designs, allows user setting of statistical parameters, visualizes the results through graphics and tables, and generates publication quality figures such as heat-maps, expression profiles and GO enrichment plots. The utility of 3D RNA-seq is illustrated by analysis of Arabidopsis and mouse RNA-seq data. The program is designed to be run by biologists with minimal bioinformatics experience (or by bioinformaticians) allowing lab scientists to take control of the analysis of their RNA-seq data.
biorxiv bioinformatics 100-200-users 2019Deep learning does not outperform classical machine learning for cell-type annotation, bioRxiv, 2019-05-31
AbstractDeep learning has revolutionized image analysis and natural language processing with remarkable accuracies in prediction tasks, such as image labeling or word identification. The origin of this revolution was arguably the deep learning approach by the Hinton lab in 2012, which halved the error rate of existing classifiers in the then 2-year-old ImageNet database1. In hindsight, the combination of algorithmic and hardware advances with the appearance of large and well-labeled datasets has led up to this seminal contribution.The emergence of large amounts of data from single-cell RNA-seq and the recent global effort to chart all cell types in the Human Cell Atlas has attracted an interest in deep-learning applications. However, all current approaches are unsupervised, i.e., learning of latent spaces without using any cell labels, even though supervised learning approaches are often more powerful in feature learning and the most popular approach in the current AI revolution by far.Here, we ask why this is the case. In particular we ask whether supervised deep learning can be used for cell annotation, i.e. to predict cell-type labels from single-cell gene expression profiles. After evaluating 6 classification methods across 14 datasets, we notably find that deep learning does not outperform classical machine-learning methods in the task. Thus, cell-type prediction based on gene-signature derived cell-type labels is potentially too simplistic a task for complex non-linear methods, which demands better labels of functional single-cell readouts. We, therefore, are still waiting for the “ImageNet moment” in single-cell genomics.
biorxiv bioinformatics 100-200-users 2019Genomic diversity affects the accuracy of bacterial SNP calling pipelines, bioRxiv, 2019-05-31
AbstractBackgroundAccurately identifying SNPs from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained.This study evaluates the performance of 41 SNP calling pipelines using simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally-sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia and Klebsiella.ResultsWe evaluated the performance of 41 SNP calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic, bacteria such as Escherichia coli, but less dominant for clonal species such as Mycobacterium tuberculosis.ConclusionsThe accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest performing pipelines was NovoalignGATK. However, across the full range of (divergent) genomes, among the consistently highest-performing pipelines was Snippy.
biorxiv bioinformatics 100-200-users 2019