DeepMHC: Deep Convolutional Neural Networks for High-performance peptide-MHC Binding Affinity Prediction, bioRxiv, 2017-12-25
Abstract: Convolutional neural networks (CNN) have been shown to outperform conventional methods in DNA-protein binding specificity prediction. However, whether this success can be transferred to protein-peptide binding affinity prediction depends on appropriate design of the CNN architecture, which calls for a thorough understanding of how to match the architecture to the problem. Here we propose DeepMHC, a deep convolutional neural network (CNN) based protein-peptide binding prediction algorithm for achieving better performance in MHC-I peptide binding affinity prediction than conventional algorithms. Our model takes only raw binding peptide sequences as input, without needing any human-designed features or other physicochemical or evolutionary information about the amino acids. Our CNN models are shown to be able to learn non-linear relationships among the amino acid positions of the peptides to achieve highly competitive performance on most of the IEDB benchmark datasets with a single model architecture and without using any consensus or composite ensemble classifier models. By systematically exploring the best CNN architecture, we identified critical design considerations in CNN architecture development for peptide-MHC binding prediction.
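To make "raw peptide sequences as input" concrete, the sketch below turns a peptide into a position-by-residue matrix that a convolutional layer could consume. The abstract does not state which encoding DeepMHC uses, so one-hot encoding is an assumption here, and the names `AMINO_ACIDS`, `AA_INDEX`, and `one_hot_peptide` as well as the example 8-mer are hypothetical illustrations rather than the authors' code.

```python
# Assumption: one-hot encoding over the 20 standard amino acids.
# The abstract does not specify DeepMHC's actual input encoding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_peptide(peptide):
    """Return one row per peptide position, each a 20-element 0/1 list."""
    rows = []
    for aa in peptide.upper():
        row = [0] * len(AMINO_ACIDS)
        # Raises KeyError for residues outside the 20 standard letters.
        row[AA_INDEX[aa]] = 1
        rows.append(row)
    return rows


# Hypothetical example input: an 8-residue peptide.
encoded = one_hot_peptide("SIINFEKL")
print(len(encoded), len(encoded[0]))
```

Each row carries no physicochemical or evolutionary information, only residue identity, so any relationship between positions would have to be learned by the network itself.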
biorxiv bioinformatics 100-200-users 2017
NanoPack: visualizing and processing long read sequencing data, bioRxiv, 2017-12-22
Abstract: Summary: Here we describe NanoPack, a set of tools developed for visualization and processing of long read sequencing data from Oxford Nanopore Technologies and Pacific Biosciences. Availability and Implementation: The NanoPack tools are written in Python3 and released under the GNU GPL3.0 Licence. The source code can be found at https://github.com/wdecoster/nanopack, together with links to separate scripts and their documentation. The scripts are compatible with Linux, Mac OS and the MS Windows 10 subsystem for Linux and are available as a graphical user interface, a web service at http://nanoplot.bioinf.be and command line tools. Contact: wouter.decoster@molgen.vib-ua.be. Supplementary information: Supplementary tables and figures are available at Bioinformatics online.
biorxiv bioinformatics 100-200-users 2017
bioSyntax: Syntax Highlighting For Computational Biology, bioRxiv, 2017-12-21
Abstract: Computational biology requires the reading and comprehension of biological data files. Plain-text formats such as SAM, VCF, GTF, PDB and FASTA often contain critical information that is obfuscated by the complexity of the data structures. bioSyntax (http://bioSyntax.org) is a freely available suite of syntax highlighting packages for vim, gedit, Sublime, and less, which helps computational scientists parse and work with their data more efficiently.
biorxiv bioinformatics 0-100-users 2017
Content-Aware Image Restoration: Pushing the Limits of Fluorescence Microscopy, bioRxiv, 2017-12-20
Fluorescence microscopy is a key driver of discoveries in the life-sciences, with observable phenomena being limited by the optics of the microscope, the chemistry of the fluorophores, and the maximum photon exposure tolerated by the sample. These limits necessitate trade-offs between imaging speed, spatial resolution, light exposure, and imaging depth. In this work we show how image restoration based on deep learning extends the range of biological phenomena observable by microscopy. On seven concrete examples we demonstrate how microscopy images can be restored even if 60-fold fewer photons are used during acquisition, how near isotropic resolution can be achieved with up to 10-fold under-sampling along the axial direction, and how tubular and granular structures smaller than the diffraction limit can be resolved at 20-times higher frame-rates compared to state-of-the-art methods. All developed image restoration methods are freely available as open source software in Python, FIJI, and KNIME.
biorxiv bioinformatics 500+-users 2017
Exploring Single-Cell Data with Deep Multitasking Neural Networks, bioRxiv, 2017-12-20
Abstract: Biomedical researchers are generating high-throughput, high-dimensional single-cell data at a staggering rate. As costs of data generation decrease, experimental design is moving towards measurement of many different single-cell samples in the same dataset. These samples can correspond to different patients, conditions, or treatments. While scalability of methods to datasets of these sizes is a challenge on its own, dealing with large-scale experimental design presents a whole new set of problems, including batch effects and sample comparison issues. Currently, there are no computational tools that can both handle large amounts of data in a scalable manner (many cells) and at the same time deal with many samples (many patients or conditions). Moreover, data analysis currently involves the use of different tools that each operate on their own data representation, not guaranteeing a synchronized analysis pipeline. For instance, data visualization methods can be disjoint and mismatched with the clustering method. For this purpose, we present SAUCIE, a deep neural network that leverages the high degree of parallelization and scalability offered by neural networks, as well as the deep representation of data that can be learned by them, to perform many single-cell data analysis tasks, all on a unified representation. A well-known limitation of neural networks is their limited interpretability. Our key contributions here are newly formulated regularizations (penalties) that render features learned in hidden layers of the neural network interpretable. When large multi-patient datasets are fed into SAUCIE, the various hidden layers contain denoised and batch-corrected data, a low dimensional visualization, unsupervised clustering, as well as other information that can be used to explore the data. We show this capability by analyzing a newly generated 180-sample dataset consisting of T cells from dengue patients in India, measured with mass cytometry.
We show that SAUCIE, for the first time, can batch correct and process this 11-million-cell dataset to identify cluster-based signatures of acute dengue infection and create a patient manifold, stratifying immune response to dengue on the basis of single-cell measurements.
biorxiv bioinformatics 0-100-users 2017
Real-time search of all bacterial and viral genomic data, bioRxiv, 2017-12-16
Genome sequencing of pathogens is now ubiquitous in microbiology, and the sequence archives are effectively no longer searchable for arbitrary sequences. Furthermore, the exponential increase of these archives is likely to be further spurred by automated diagnostics. To unlock their use for scientific research and real-time surveillance we have combined knowledge about bacterial genetic variation with ideas used in web-search, to build a DNA search engine for microbial data that can grow incrementally. We indexed the complete global corpus of bacterial and viral whole genome sequence data (447,833 genomes), using four orders of magnitude less storage than previous methods. The method allows future scaling to millions of genomes. This renders the global archive accessible to sequence search, which we demonstrate with three applications: ultra-fast search for resistance genes MCR1-3, analysis of host-range for 2827 plasmids, and quantification of the rise of antibiotic resistance prevalence in the sequence archives.
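As a rough illustration of what sequence search over a growing collection can look like, the sketch below builds a plain k-mer inverted index, the textbook structure behind keyword search. The abstract does not describe the actual index, and nothing here reproduces its storage savings; the function names, the toy genomes, and the choice of k = 4 are all hypothetical.

```python
from collections import defaultdict

# Assumption: a simple k-mer inverted index, NOT the paper's data structure.


def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def add_genome(index, genome_id, sequence, k):
    """Add one genome to the index; earlier entries are left untouched."""
    for kmer in kmers(sequence, k):
        index[kmer].add(genome_id)


def search(index, query, k):
    """Return ids of genomes that contain every k-mer of the query."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return set()  # query shorter than k: nothing to match on
    hit_sets = [index.get(kmer, set()) for kmer in query_kmers]
    return set.intersection(*hit_sets)


K = 4  # toy value chosen for the example only
index = defaultdict(set)
toy_genomes = {"g1": "ACGTACGTGA", "g2": "TTTTACGTGA", "g3": "GGGGGGGGGG"}
for genome_id, sequence in toy_genomes.items():
    add_genome(index, genome_id, sequence, K)

print(sorted(search(index, "ACGTGA", K)))
print(sorted(search(index, "CGTACG", K)))
```

Because `add_genome` only appends to existing postings, new genomes can be indexed without rebuilding, which loosely mirrors the incremental growth the abstract mentions. Matching every k-mer is a necessary but not sufficient condition for containing the full query, so a real system would still need a verification step.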
biorxiv bioinformatics 500+-users 2017