WhatsHap: fast and accurate read-based phasing, bioRxiv, 2016-11-03
Abstract: Read-based phasing reconstructs the haplotypes of a sample purely from sequencing reads. While phasing is an important step for answering questions about population genetics and compound heterozygosity, and for aiding clinical decision making, accurate, usable and standards-based software has been lacking. WhatsHap is a production-ready tool for highly accurate read-based phasing. It was designed from the beginning to leverage third-generation sequencing technologies, whose long reads can span many variants and are therefore ideal for phasing. WhatsHap also works well with second-generation data, is easy to use, and phases not only SNVs but also indels and other variants. It is unique in its ability to combine read-based with pedigree-based phasing, which further improves accuracy when multiple related samples are provided.
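A minimal sketch of the problem WhatsHap addresses, reducing read-based phasing to partitioning reads over heterozygous variants into two haplotype groups, in the spirit of the weighted minimum-error-correction formulation. This greedy toy is not the WhatsHap algorithm or API, and the reads below are invented:

    # Greedy read-partitioning toy: not WhatsHap, just the core idea.
    def phase_reads(reads, n_variants):
        """reads: list of dicts mapping variant index -> observed allele (0/1)."""
        hap = [None] * n_variants          # consensus alleles for haplotype A
        groups = {"A": [], "B": []}

        for read in sorted(reads, key=len, reverse=True):
            # Compare the read against positions already decided on haplotype A.
            agree = sum(1 for pos, a in read.items() if hap[pos] == a)
            clash = sum(1 for pos, a in read.items()
                        if hap[pos] is not None and hap[pos] != a)
            side = "A" if agree >= clash else "B"
            groups[side].append(read)
            for pos, a in read.items():
                if hap[pos] is None:
                    hap[pos] = a if side == "A" else 1 - a

        hap_a = [a if a is not None else 0 for a in hap]
        hap_b = [1 - a for a in hap_a]
        return hap_a, hap_b, groups

    if __name__ == "__main__":
        # Four toy reads covering five heterozygous variants.
        reads = [{0: 0, 1: 0, 2: 1}, {1: 0, 2: 1, 3: 1},
                 {0: 1, 1: 1, 2: 0}, {2: 0, 3: 0, 4: 0}]
        print(phase_reads(reads, 5))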
biorxiv bioinformatics 0-100-users 2016
Pavian: Interactive analysis of metagenomics data for microbiomics and pathogen identification, bioRxiv, 2016-11-01
Abstract: Summary: Pavian is a web application for exploring metagenomics classification results, with a special focus on infectious disease diagnosis. Pinpointing pathogens in metagenomics classification results is often complicated by host and laboratory contaminants as well as many non-pathogenic microbiota. With Pavian, researchers can analyze, display and transform results from the Kraken and Centrifuge classifiers using interactive tables, heatmaps and flow diagrams. Pavian also provides an alignment viewer for validation of matches to a particular genome. Availability and implementation: Pavian is implemented in the R language and based on the Shiny framework. It can be hosted on Windows, Mac OS X and Linux systems, and used with any contemporary web browser. It is freely available under a GPL-3 license from https://github.com/fbreitwieser/pavian. Furthermore, a Docker image is provided at https://hub.docker.com/r/florianbw/pavian. Contact: fbreitw1@jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
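As a rough illustration of the kind of input Pavian visualizes, the following sketch parses a Kraken-style classification report and lists the most abundant genera. Pavian itself is an R/Shiny application; this standalone Python toy only assumes the standard Kraken report column layout, and the file name is a placeholder:

    # Not Pavian code: a toy reader for a Kraken-style report.
    import csv

    def read_kraken_report(path):
        rows = []
        with open(path, newline="") as fh:
            for pct, clade_reads, taxon_reads, rank, taxid, name in csv.reader(fh, delimiter="\t"):
                rows.append({
                    "percent": float(pct),
                    "clade_reads": int(clade_reads),
                    "rank": rank,
                    "taxid": taxid,
                    "name": name.strip(),   # names are indented by tree depth
                })
        return rows

    def top_taxa(rows, rank="G", n=10):
        """Return the n taxa of the given rank with the most clade reads."""
        ranked = [r for r in rows if r["rank"] == rank]
        return sorted(ranked, key=lambda r: r["clade_reads"], reverse=True)[:n]

    if __name__ == "__main__":
        rows = read_kraken_report("sample.kreport")   # placeholder path
        for r in top_taxa(rows):
            print(f'{r["clade_reads"]:>10}  {r["name"]}')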
biorxiv bioinformatics 100-200-users 2016
Leveraging uncertainty information from deep neural networks for disease detection, bioRxiv, 2016-10-29
Abstract: Deep learning (DL) has revolutionized the field of computer vision and image processing. In medical imaging, algorithmic solutions based on DL have been shown to achieve high performance on tasks that previously required medical experts. However, DL-based solutions for disease detection have been proposed without methods to quantify and control their uncertainty in a decision. In contrast, a physician knows whether she is uncertain about a case and will consult more experienced colleagues if needed. Here we evaluate dropout-based Bayesian uncertainty measures for DL in diagnosing diabetic retinopathy (DR) from fundus images and show that they capture uncertainty better than straightforward alternatives. Furthermore, we show that uncertainty-informed decision referral can improve diagnostic performance. Experiments across different networks, tasks and datasets show robust generalization. Depending on network capacity and task/dataset difficulty, we surpass 85% sensitivity and 80% specificity as recommended by the NHS when referring 0%–20% of the most uncertain decisions for further inspection. We analyse causes of uncertainty by relating intuitions from 2D visualizations to the high-dimensional image space. While uncertainty is sensitive to clinically relevant cases, sensitivity to unfamiliar data samples is task dependent, but can be rendered more robust.
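A minimal sketch of the dropout-based uncertainty and decision-referral idea described above. The stochastic forward passes are simulated with random numbers here; in the paper they come from repeated passes through a network with dropout kept active at test time. None of this is the authors' code:

    # Monte Carlo dropout uncertainty + decision referral, with simulated passes.
    import numpy as np

    rng = np.random.default_rng(0)
    n_cases, T = 1000, 20

    labels = rng.integers(0, 2, n_cases)
    # Simulated per-pass predicted probabilities (placeholder for T real MC passes).
    per_pass_noise = rng.normal(0, 0.15, (T, n_cases))
    mc_probs = np.clip(labels + rng.normal(0, 0.25, n_cases) + per_pass_noise, 0, 1)

    mean_prob = mc_probs.mean(axis=0)        # predictive mean over passes
    uncertainty = mc_probs.std(axis=0)       # predictive standard deviation

    def accuracy_after_referral(refer_fraction):
        """Refer the most uncertain cases and score the automated remainder."""
        n_refer = int(refer_fraction * n_cases)
        keep = np.argsort(uncertainty)[: n_cases - n_refer]
        preds = (mean_prob[keep] >= 0.5).astype(int)
        return (preds == labels[keep]).mean()

    for frac in (0.0, 0.1, 0.2):
        print(f"refer {frac:.0%}: accuracy {accuracy_after_referral(frac):.3f}")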
biorxiv bioinformatics 0-100-users 2016
FIDDLE: An integrative deep learning framework for functional genomic data inference, bioRxiv, 2016-10-19
Abstract: Numerous advances in sequencing technologies have revolutionized genomics by generating many types of genomic functional data. Statistical tools have been developed to analyze individual data types, but strategies for integrating disparate datasets under a unified framework are lacking. Moreover, most analysis techniques rely heavily on feature selection and data preprocessing, which increase the difficulty of addressing biological questions through the integration of multiple datasets. Here, we introduce FIDDLE (Flexible Integration of Data with Deep LEarning), an open-source, data-agnostic, flexible integrative framework that learns a unified representation from multiple data types to infer another data type. As a case study, we use multiple Saccharomyces cerevisiae genomic datasets to predict global transcription start sites (TSS) through the simulation of TSS-seq data. We demonstrate that one data type can be inferred from other data types without manually specifying the relevant features and preprocessing. We show that models built from multiple genome-wide datasets perform profoundly better than models built from individual datasets. Thus FIDDLE learns the complex synergistic relationships within individual datasets and, importantly, across datasets.
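A toy sketch of the general multiple-inputs-to-one-output idea, written as a small PyTorch model with one encoder per data type feeding a shared decoder. The layer sizes, names and random data are invented for illustration and do not reflect the published FIDDLE architecture or code:

    # One encoder per input data type, a shared representation, one decoded target.
    import torch
    import torch.nn as nn

    class MultiTrackNet(nn.Module):
        def __init__(self, in_dims, hidden=64, out_dim=100):
            super().__init__()
            # One small encoder per input data type (e.g. two genomic tracks).
            self.encoders = nn.ModuleList(
                nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in in_dims
            )
            # The concatenated representation feeds a decoder for the target track.
            self.decoder = nn.Sequential(
                nn.Linear(hidden * len(in_dims), hidden), nn.ReLU(),
                nn.Linear(hidden, out_dim),
            )

        def forward(self, inputs):
            joint = torch.cat([enc(x) for enc, x in zip(self.encoders, inputs)], dim=1)
            return self.decoder(joint)

    if __name__ == "__main__":
        model = MultiTrackNet(in_dims=[100, 100])
        xs = [torch.randn(8, 100), torch.randn(8, 100)]   # two synthetic input tracks
        target = torch.randn(8, 100)                      # synthetic target track
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(5):                                # a few toy training steps
            loss = nn.functional.mse_loss(model(xs), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        print(loss.item())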
biorxiv bioinformatics 0-100-users 2016
Reconstructing cell cycle and disease progression using deep learning, bioRxiv, 2016-10-19
Abstract: We show that deep convolutional neural networks combined with non-linear dimension reduction enable reconstructing biological processes based on raw image data. We demonstrate this by reconstructing the cell cycle of Jurkat cells and disease progression in diabetic retinopathy. In further analysis of Jurkat cells, we detect and separate a subpopulation of dead cells in an unsupervised manner and, in classifying discrete cell cycle stages, we reach a 6-fold reduction in error rate compared to a recent approach based on boosting on image features. In contrast to previous methods, deep-learning-based predictions are fast enough for on-the-fly analysis in an imaging flow cytometer.
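A minimal sketch of the two-stage idea, assuming per-cell feature vectors stand in for activations of a trained CNN and a non-linear embedding (t-SNE here) reveals the underlying continuous process. The features are synthetic; this is not the authors' pipeline:

    # Synthetic "CNN features" embedded with a non-linear dimension reduction.
    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    stage = rng.uniform(0, 2 * np.pi, 500)                  # hidden cyclic "cell cycle" phase
    features = np.column_stack([np.cos(stage), np.sin(stage)])
    # Lift to 64 dimensions and add noise, mimicking high-dimensional activations.
    features = features @ rng.normal(size=(2, 64)) + rng.normal(0, 0.1, (500, 64))

    embedding = TSNE(n_components=2, random_state=0).fit_transform(features)
    print(embedding.shape)   # (500, 2): a 2-D map in which the cycle can be traced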
biorxiv bioinformatics 100-200-users 2016
Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation, bioRxiv, 2016-08-25
Abstract: Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either PacBio or Oxford Nanopore technologies, and achieves a contig NG50 of greater than 21 Mbp on both human and Drosophila melanogaster PacBio datasets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
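A toy illustration of tf-idf-style k-mer weighting in the spirit of Canu's overlapper: frequent, repeat-derived k-mers are down-weighted so that read overlaps are scored mainly on informative k-mers. This sketch computes a weighted similarity exactly rather than with a true MinHash sketch, and the reads are invented:

    # Weighted k-mer similarity with idf down-weighting of common k-mers.
    import math
    from collections import Counter

    def kmers(seq, k=5):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def idf_weights(read_kmer_sets):
        """Rarer k-mers (small document frequency) get larger weights."""
        df = Counter(km for s in read_kmer_sets for km in s)
        n = len(read_kmer_sets)
        return {km: math.log(n / dfreq) + 1.0 for km, dfreq in df.items()}

    def weighted_similarity(a, b, w):
        inter = sum(w[km] for km in a & b)
        union = sum(w[km] for km in a | b)
        return inter / union if union else 0.0

    if __name__ == "__main__":
        reads = ["ACGTACGTACGTTTGA", "CGTACGTTTGACCA", "GGGGGGGGGGGGGG"]
        sets = [kmers(r) for r in reads]
        w = idf_weights(sets)
        print(weighted_similarity(sets[0], sets[1], w))   # overlapping pair
        print(weighted_similarity(sets[0], sets[2], w))   # unrelated pair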
biorxiv bioinformatics 100-200-users 2016