bioinformatics | audiences

Lineage calling can identify antibiotic resistant clones within minutes, bioRxiv, 2018-08-29

Introductory ParagraphSurveillance of circulating drug resistant bacteria is essential for healthcare providers to deliver effective empiric antibiotic therapy. However, the results of surveillance may not be available on a timescale that is optimal for guiding patient treatment. Here we present a method for inferring characteristics of an unknown bacterial sample by identifying the presence of sequence variation across the genome that is linked to a phenotype of interest, in this case drug resistance. We demonstrate an implementation of this principle using sequence k-mer content, matched to a database of known genomes. We show this technique can be applied to data from an Oxford Nanopore device in real time and is capable of identifying the presence of a known resistant strain in 5 minutes, even from a complex metagenomic sample. This flexible approach has wide application to pathogen surveillance and may be used to greatly accelerate diagnoses of resistant infections.

biorxiv bioinformatics 200-500-users 2018

Rapid heuristic inference of antibiotic resistance and susceptibility by genomic neighbor typing, bioRxiv, 2018-08-29

AbstractSurveillance of drug-resistant bacteria is essential for healthcare providers to deliver effective empiric antibiotic therapy. However, traditional molecular epidemiology does not typically occur on a timescale that could impact patient treatment and outcomes. Here we present a method called ‘genomic neighbor typing’ for inferring the phenotype of a bacterial sample by identifying its closest relatives in a database of genomes with metadata. We show that this technique can infer antibiotic susceptibility and resistance for both S. pneumoniae and N. gonorrhoeae. We implemented this with rapid k-mer matching, which, when used on Oxford Nanopore MinION data, can run in real time. This resulted in determination of resistance within ten minutes (sensspec 91%100% for S. pneumoniae and 81%100% N. gonorrhoeae from isolates with a representative database) of sequencing starting, and for clinical metagenomic sputum samples (75%100% for S. pneumoniae), within four hours of sample collection. This flexible approach has wide application to pathogen surveillance and may be used to greatly accelerate appropriate empirical antibiotic treatment.

biorxiv bioinformatics 200-500-users 2018

Fast Batch Alignment of Single Cell Transcriptomes Unifies Multiple Mouse Cell Atlases into an Integrated Landscape, bioRxiv, 2018-08-21

AbstractIncreasing numbers of large scale single cell RNA-Seq projects are leading to a data explosion, which can only be fully exploited through data integration. Therefore, efficient computational tools for combining diverse datasets are crucial for biology in the single cell genomics era. A number of methods have been developed to assist data integration by removing technical batch effects, but most are computationally intensive. To overcome the challenge of enormous datasets, we have developed BBKNN, an extremely fast graph-based data integration method. We illustrate the power of BBKNN for dimensionalityreduced visualisation and clustering in multiple biological scenarios, including a massive integrative study over several murine atlases. BBKNN successfully connects cell populations across experimentally heterogeneous mouse scRNA-Seq datasets, which reveals global markers of cell type and organspecificity and provides the foundation for inferring the underlying transcription factor network. BBKNN is available at <jatsext-link xmlnsxlink=httpwww.w3.org1999xlink ext-link-type=uri xlinkhref=httpsgithub.comTeichlabbbknn>httpsgithub.comTeichlabbbknn<jatsext-link>.

biorxiv bioinformatics 0-100-users 2018

One read per cell per gene is optimal for single-cell RNA-Seq, bioRxiv, 2018-08-10

An underlying question for virtually all single-cell RNA sequencing experiments is how to allocate the limited sequencing budget deep sequencing of a few cells or shallow sequencing of many cells? A mathematical framework reveals that, for estimating many important gene properties, the optimal allocation is to sequence at the depth of one read per cell per gene. Interestingly, the corresponding optimal estimator is not the widely-used plugin estimator but one developed via empirical Bayes.

biorxiv bioinformatics 100-200-users 2018

Parameter tuning is a key part of dimensionality reduction via deep variational autoencoders for single cell RNA transcriptomics, bioRxiv, 2018-08-06

Single-cell RNA sequencing (scRNA-seq) is a powerful tool to profile the transcriptomes of a large number of individual cells at a high resolution. These data usually contain measurements of gene expression for many genes in thousands or tens of thousands of cells, though some datasets now reach the million-cell mark. Projecting high-dimensional scRNA-seq data into a low dimensional space aids downstream analysis and data visualization. Many recent preprints accomplish this using variational autoencoders (VAE), generative models that learn underlying structure of data by compress it into a constrained, low dimensional space. The low dimensional spaces generated by VAEs have revealed complex patterns and novel biological signals from large-scale gene expression data and drug response predictions. Here, we evaluate a simple VAE approach for gene expression data, Tybalt, by training and measuring its performance on sets of simulated scRNA-seq data. We find a number of counter-intuitive performance features i.e., deeper neural networks can struggle when datasets contain more observations under some parameter configurations. We show that these methods are highly sensitive to parameter tuning when tuned, the performance of the Tybalt model, which was not optimized for scRNA-seq data, outperforms other popular dimension reduction approaches – PCA, ZIFA, UMAP and t-SNE. On the other hand, without tuning performance can also be remarkably poor on the same data. Our results should discourage authors and reviewers from relying on self-reported performance comparisons to evaluate the relative value of contributions in this area at this time. Instead, we recommend that attempts to compare or benchmark autoencoder methods for scRNA-seq data be performed by disinterested third parties or by methods developers only on unseen benchmark data that are provided to all participants simultaneously because the potential for performance differences due to unequal parameter tuning is so high.

biorxiv bioinformatics 0-100-users 2018

Moving beyond P values Everyday data analysis with estimation plots, bioRxiv, 2018-07-26

Over the past 75 years, a number of statisticians have advised that the data-analysis method known as null-hypothesis significance testing (NHST) should be deprecated (Berkson, 1942; Halsey et al., 2015; Wasserstein et al., 2019). The limitations of NHST have been extensively discussed, with a broad consensus that current statistical practice in the biological sciences needs reform. However, there is less agreement on reform’s specific nature, with vigorous debate surrounding what would constitute a suitable alternative (Altman et al., 2000; Benjamin et al., 2017; Cumming and Calin-Jageman, 2016). An emerging view is that a more complete analytic technique would use statistical graphics to estimate effect sizes and evaluate their uncertainty (Cohen, 1994; Cumming and Calin-Jageman, 2016). As these estimation methods require only minimal statistical retraining, they have great potential to shift the current data-analysis culture away from dichotomous thinking towards quantitative reasoning (Claridge-Chang and Assam, 2016). The evolution of statistics has been inextricably linked to the development of quantitative displays that support complex visual reasoning (Tufte, 2001). We consider that the graphic we describe here as estimation plot is the most intuitive way to display the complete statistical information about experimental data sets. However, a major obstacle to adopting estimation plots is accessibility to suitable software. To lower this hurdle, we have developed free software that makes high-quality estimation plotting available to all. Here, we explain the rationale for estimation plots by contrasting them with conventional charts used to display data with NHST results, and describe how the use of these graphs affords five major analytical advantages.

biorxiv bioinformatics 500+-users 2018