Deep learning does not outperform classical machine learning for cell-type annotation, bioRxiv, 2019-05-31

Abstract

Deep learning has revolutionized image analysis and natural language processing with remarkable accuracies in prediction tasks, such as image labeling or word identification. The origin of this revolution was arguably the deep learning approach by the Hinton lab in 2012, which halved the error rate of existing classifiers in the then 2-year-old ImageNet database [1]. In hindsight, the combination of algorithmic and hardware advances with the appearance of large and well-labeled datasets has led up to this seminal contribution.

The emergence of large amounts of data from single-cell RNA-seq and the recent global effort to chart all cell types in the Human Cell Atlas has attracted an interest in deep-learning applications. However, all current approaches are unsupervised, i.e., learning of latent spaces without using any cell labels, even though supervised learning approaches are often more powerful in feature learning and the most popular approach in the current AI revolution by far.

Here, we ask why this is the case. In particular we ask whether supervised deep learning can be used for cell annotation, i.e. to predict cell-type labels from single-cell gene expression profiles. After evaluating 6 classification methods across 14 datasets, we notably find that deep learning does not outperform classical machine-learning methods in the task. Thus, cell-type prediction based on gene-signature derived cell-type labels is potentially too simplistic a task for complex non-linear methods, which demands better labels of functional single-cell readouts. We, therefore, are still waiting for the “ImageNet moment” in single-cell genomics.

biorxiv bioinformatics 100-200-users 2019

Genomic diversity affects the accuracy of bacterial SNP calling pipelines, bioRxiv, 2019-05-31

Abstract

Background: Accurately identifying SNPs from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained.

This study evaluates the performance of 41 SNP calling pipelines using simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally-sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia and Klebsiella.

Results: We evaluated the performance of 41 SNP calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic, bacteria such as Escherichia coli, but less dominant for clonal species such as Mycobacterium tuberculosis.

Conclusions: The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest performing pipelines was Novoalign/GATK. However, across the full range of (divergent) genomes, among the consistently highest-performing pipelines was Snippy.
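The abstract reports pipeline sensitivity and precision without defining them. As a rough illustration only, the sketch below computes both from a set of called SNPs and a known truth set, as one might with simulated data. The function name `snp_call_accuracy` and the representation of a SNP as a `(position, alternate_base)` tuple are assumptions made for this example; the study's actual evaluation procedure may differ.

```python
def snp_call_accuracy(called, truth):
    """Return (precision, sensitivity) of a SNP call set against a truth set.

    Both arguments are iterables of (position, alternate_base) tuples.
    This representation is hypothetical and chosen only for illustration.
    """
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)    # SNPs that were called and are real
    false_pos = len(called - truth)   # SNPs that were called but are not real
    false_neg = len(truth - called)   # real SNPs that the pipeline missed

    # Guard against empty call sets or empty truth sets.
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    sensitivity = true_pos / (true_pos + false_neg) if truth else 0.0
    return precision, sensitivity


truth = {(10, "A"), (20, "C"), (30, "G"), (40, "T")}
called = {(10, "A"), (20, "C"), (50, "G")}
precision, sensitivity = snp_call_accuracy(called, truth)
```

Here two of the three calls are correct and two of the four true SNPs are recovered, so precision is 2/3 and sensitivity is 1/2.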

biorxiv bioinformatics 100-200-users 2019

Accelerating Sequence Alignment to Graphs, bioRxiv, 2019-05-28

Abstract

Aligning DNA sequences to an annotated reference is a key step for genotyping in biology. Recent scientific studies have demonstrated improved inference by aligning reads to a variation graph, i.e., a reference sequence augmented with known genetic variations. Given a variation graph in the form of a directed acyclic string graph, the sequence to graph alignment problem seeks to find the best matching path in the graph for an input query sequence. Solving this problem exactly using a sequential dynamic programming algorithm takes quadratic time in terms of the graph size and query length, making it difficult to scale to high throughput DNA sequencing data. In this work, we propose the first parallel algorithm for computing sequence to graph alignments that leverages multiple cores and single-instruction multiple-data (SIMD) operations. We take advantage of the available inter-task parallelism, and provide a novel blocked approach to compute the score matrix while ensuring high memory locality. Using a 48-core Intel Xeon Skylake processor, the proposed algorithm achieves peak performance of 317 billion cell updates per second (GCUPS), and demonstrates near linear weak and strong scaling on up to 48 cores. It delivers significant performance gains compared to existing algorithms, and results in run-time reduction from multiple days to three hours for the problem of optimally aligning high coverage long (PacBio/ONT) or short (Illumina) DNA reads to an MHC human variation graph containing 10 million vertices.

Availability: The implementation of our algorithm is available at https://github.com/ParBLiSS/PaSGAL. Data sets used for evaluation are accessible using https://alurulab.cc.gatech.edu/PaSGAL.
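To make the sequential dynamic program mentioned in the abstract concrete, here is a minimal sketch of local alignment of a query against a directed acyclic graph with one character per vertex. It is not the parallel, blocked algorithm the authors propose. The function name `align_to_dag`, the input layout (vertices numbered in topological order, with a predecessor list per vertex), and the scoring values are all assumptions made for this example. Each cell of the score matrix looks at every predecessor of its vertex, so the work grows with the product of graph size (vertices plus edges) and query length.

```python
def align_to_dag(labels, preds, query, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of `query` against any path in a DAG.

    labels[v] is the character on vertex v; vertices are assumed to be
    numbered in topological order, so every u in preds[v] satisfies u < v.
    The scoring parameters are illustrative defaults, not the paper's.
    """
    m = len(query)
    # H[v][j]: best score of an alignment that ends at vertex v having
    # consumed the first j query characters (0 means "start here").
    H = [[0] * (m + 1) for _ in labels]
    best = 0
    for v, ch in enumerate(labels):
        for j in range(1, m + 1):
            sub = match if ch == query[j - 1] else mismatch
            # Align vertex v to query[j-1], extending from a predecessor
            # (or starting a fresh alignment when there is none).
            diag = max((H[u][j - 1] for u in preds[v]), default=0)
            # Vertex v is skipped by the query (gap in the query).
            skip_vertex = max((H[u][j] for u in preds[v]), default=0)
            H[v][j] = max(
                0,                  # local alignment: never go below zero
                diag + sub,
                skip_vertex + gap,
                H[v][j - 1] + gap,  # query[j-1] is skipped (gap in the path)
            )
            best = max(best, H[v][j])
    return best


# A small variation graph: A -> (C or G) -> T, i.e. a single SNP bubble.
labels = ["A", "C", "G", "T"]
preds = [[], [0], [0], [1, 2]]
score_g_allele = align_to_dag(labels, preds, "AGT")
score_c_allele = align_to_dag(labels, preds, "ACT")
```

Both queries match one path through the bubble exactly, so each scores 3 under the assumed scoring. A linear reference containing only one of the two alleles would score the other query lower, which is the motivation for aligning to a variation graph.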

biorxiv bioinformatics 100-200-users 2019

Estimations of the weather effects on brain functions using functional MRI – a cautionary tale, bioRxiv, 2019-05-28

Abstract

The influences of environmental factors such as weather on the human brain are still largely unknown. A few neuroimaging studies have demonstrated seasonal effects, but were limited by their cross-sectional design or sample sizes. Most importantly, the stability of the MRI scanner, which may also be affected by the environment, has not been taken into account. In the current study, we analyzed longitudinal resting-state functional MRI (fMRI) data from eight individuals, where the participants were scanned over months to years. We applied machine learning regression to use different resting-state parameters, including amplitude of low-frequency fluctuations (ALFF), regional homogeneity (ReHo), and functional connectivity matrix, to predict different weather and environmental parameters. As a careful control, the raw EPI and the anatomical images were also used in the prediction analysis. We first found that daylight length and temperatures could be reliably predicted from resting-state parameters under cross-validation. However, similar prediction accuracies could also be achieved by using one frame of EPI image, and even higher accuracies could be achieved by using segmented or even the raw anatomical images. Finally, we verified that the signals outside of the brain in the anatomical images and signals in phantom scans could also achieve higher prediction accuracies, suggesting that the predictability may be due to the baseline signals of the MRI scanner. Overall, we did not identify detectable influences of weather on brain functions other than the influences on the stability of MRI scanners. The results highlight the difficulty of studying long-term effects on the brain using MRI.

biorxiv neuroscience 100-200-users 2019

RNASeqR: an R package for automated two-group RNA-Seq analysis workflow, bioRxiv, 2019-05-28

RNA-Seq analysis has revolutionized researchers' understanding of the transcriptome in biological research. Assessing the differences in transcriptomic profiles between tissue samples or patient groups enables researchers to explore the underlying biological impact of transcription. RNA-Seq analysis requires multiple processing steps and huge computational capabilities. There are many well-developed R packages for individual steps; however, there are few R/Bioconductor packages that integrate existing software tools into a comprehensive RNA-Seq analysis and provide fundamental end-to-end results in a pure R environment so that researchers can quickly and easily get fundamental information from big sequencing data. To address this need, we have developed the open source R/Bioconductor package, RNASeqR. It allows users to run an automated RNA-Seq analysis with only six steps, producing essential tabular and graphical results for further biological interpretation. The features of RNASeqR include six-step analysis, comprehensive visualization, a background execution version, and the integration of both R and command-line software. RNASeqR provides a fast, lightweight, and easy-to-run RNA-Seq analysis pipeline in a pure R environment. It allows users to efficiently utilize popular software tools, including both R/Bioconductor and command-line tools, without predefining the resources or environments. RNASeqR is freely available for Linux and macOS operating systems from Bioconductor (https://bioconductor.org/packages/release/bioc/html/RNASeqR.html).

biorxiv bioinformatics 100-200-users 2019

 

Created with the audiences framework by Jedidiah Carlson
