Interpretable multimodal deep learning for real-time pan-tissue pan-disease pathology search on social media, bioRxiv, 2018-08-21
Abstract
Background: Pathologists are responsible for rapidly providing a diagnosis on critical health issues, from infection to malignancy. Challenging cases benefit from additional opinions of pathologist colleagues. In addition to on-site colleagues, there is an active worldwide community of pathologists on social media for complementary opinions. Such access to pathologists worldwide has the capacity to (i) improve diagnostic accuracy and (ii) generate broader consensus on next steps in patient care.
Methods and findings: From Twitter we curate 13,626 images from 6,351 tweets from 25 pathologists from 13 countries. We supplement the Twitter data with 113,161 images from 1,074,484 PubMed articles. We develop machine learning and deep learning models to (i) accurately identify histopathology stains, (ii) discriminate between tissues, and (iii) differentiate disease states. For deep learning, we derive novel regularization and activation functions for set representations related to set cardinality and the Heaviside step function. Area Under Receiver Operating Characteristic is 0.805-0.996 for these tasks. We repurpose the disease classifier to search for similar disease states given an image and clinical covariates. We report precision@k=1 = 0.701±0.003 (chance 0.397±0.004, mean±stdev). The classifiers find texture and tissue are important clinico-visual features of disease. For search, deep features and cell nuclei features are less important. We implement a social media bot (@pathobot on Twitter) to use the trained classifiers to aid pathologists in obtaining real-time feedback on challenging cases. The bot activates when mentioned in a social media post containing pathology text and images.
The bot generates quantitative predictions of disease state (normal/artifact/infection/injury/nontumor, pre-neoplastic/benign/low-grade-malignant-potential, or malignant) and provides a ranked list of similar cases across social media and PubMed.
Conclusions: Our project has become a globally distributed expert system that facilitates pathological diagnosis and brings expertise to underserved regions or hospitals with less expertise in a particular disease. This is the first pan-tissue pan-disease (i.e. from infections to malignancy) method for prediction and search on social media, and the first pathology study prospectively tested in public on social media. We expect our project to cultivate a more connected world of physicians and improve patient care worldwide.
Author summary
Why was this study done?
- No publicly available pan-tissue pan-disease dataset exists for computational pathology. This limits the general application of machine learning in histopathology.
- Pathologists use social media to obtain both (i) opinions for challenging patient cases and (ii) continuing education. Connecting pathologists and linking to similar cases leads to more informative exchanges than computational predictions, e.g. to diagnose best, pathologists may discuss patient history and next tests to order. Additionally, pathologists seek the most interesting rare cases and new articles.
What did the researchers do and find?
- We generated a pan-tissue, pan-disease dataset comprising 10,000+ images from social media and 100,000+ images from PubMed. Classifiers applied to social media data suggest texture and tissue are important clinico-visual features of disease. Learning from both clinical covariates (e.g. tissue type or marker mentions) and visual features (e.g. local binary patterns or deep learning image features), these classifiers are multimodal.
- These data and classifiers power the first social media bot for pathology. It responds to pathologists in real time, searches for similar cases, and encourages collaboration.
What do these findings mean?
- This diverse dataset will be a critical test for machine learning in computational pathology, e.g. search for cures of rare diseases.
- Interpretable real-time classifiers can be successfully applied to images on social media and PubMed to find similar diseases and generate disease predictions. Going forward, similar methods may elucidate important clinico-visual features of specific diseases.
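The precision@k=1 figure reported in the abstract above can be made concrete with a small sketch. The function names and the toy labels below are illustrative assumptions, not the authors' code or data; the snippet only shows how precision@k is conventionally computed for a ranked list of search results.

```python
from typing import List, Sequence, Tuple


def precision_at_k(query_label: str, ranked_labels: Sequence[str], k: int = 1) -> float:
    """Fraction of the top-k retrieved cases whose label matches the query's label."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_labels[:k]
    hits = sum(1 for label in top_k if label == query_label)
    return hits / k


def mean_precision_at_k(queries: List[Tuple[str, Sequence[str]]], k: int = 1) -> float:
    """Average precision@k over (query_label, ranked_result_labels) pairs."""
    if not queries:
        raise ValueError("at least one query is required")
    return sum(precision_at_k(q, ranked, k) for q, ranked in queries) / len(queries)


# Hypothetical toy queries: each pairs a query's disease state with the
# disease states of its ranked search results (made-up values).
toy_queries = [
    ("malignant", ["malignant", "benign"]),
    ("benign", ["malignant", "benign"]),
    ("nontumor", ["nontumor", "nontumor"]),
]

print(mean_precision_at_k(toy_queries, k=1))
```

With k=1, the score is simply the fraction of queries whose single top-ranked result shares the query's disease state, which is why it is compared against a chance baseline in the abstract.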
biorxiv pathology 100-200-users 2018
Nanopore-based genome assembly and the evolutionary genomics of basmati rice, bioRxiv, 2018-08-20
Abstract
Background: The circum-basmati group of cultivated Asian rice (Oryza sativa) contains many iconic varieties and is widespread in the Indian subcontinent. Despite its economic and cultural importance, a high-quality reference genome is currently lacking, and the group’s evolutionary history is not fully resolved. To address these gaps, we used long-read nanopore sequencing and assembled the genomes of two circum-basmati rice varieties, Basmati 334 and Dom Sufid.
Results: We generated two high-quality, chromosome-level reference genomes that represented the 12 chromosomes of Oryza. The assemblies showed a contig N50 of 6.32 Mb and 10.53 Mb for Basmati 334 and Dom Sufid, respectively. Using our highly contiguous assemblies we characterized structural variations segregating across circum-basmati genomes. We discovered repeat expansions not observed in japonica (the rice group most closely related to circum-basmati), as well as presence/absence variants of over 20 Mb, one of which was a circum-basmati-specific deletion of a gene regulating awn length. We further detected strong evidence of admixture between the circum-basmati and circum-aus groups. This gene flow had its greatest effect on chromosome 10, causing both structural variation and single nucleotide polymorphism to deviate from genome-wide history. Lastly, population genomic analysis of 78 circum-basmati varieties showed three major geographically structured genetic groups: (1) Bhutan/Nepal group, (2) India/Bangladesh/Myanmar group, and (3) Iran/Pakistan group.
Conclusion: Availability of high-quality reference genomes from nanopore sequencing allowed functional and evolutionary genomic analyses, providing genome-wide evidence for gene flow between circum-aus and circum-basmati, the nature of circum-basmati structural variation, and the presence/absence of genes in this important and iconic rice variety group.
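The contig N50 values quoted in the abstract above follow a standard definition: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The sketch below illustrates that definition only; the function name and the contig lengths are made up for the example and are not taken from the Basmati 334 or Dom Sufid assemblies.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare 2 * running against total to stay in integer arithmetic.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the running sum always reaches the total")


# Hypothetical contig lengths in Mb (illustrative only).
example_lengths = [2, 4, 6, 8, 10]
print(contig_n50(example_lengths))
```

A higher N50 means that fewer, longer contigs account for half of the assembly, which is why the metric is used as a shorthand for contiguity.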
biorxiv evolutionary-biology 100-200-users 2018
Large-scale whole-genome sequencing of three diverse Asian populations in Singapore, bioRxiv, 2018-08-11
Abstract
Asian populations are currently underrepresented in human genetics research. Here we present whole-genome sequencing data of 4,810 Singaporeans from three diverse ethnic groups: 2,780 Chinese, 903 Malays, and 1,127 Indians. Despite a medium depth of 13.7×, we achieved essentially perfect (>99.8%) sensitivity and accuracy for detecting common variants and good sensitivity (>89%) for detecting extremely rare variants with <0.1% allele frequency. We found 89.2 million single-nucleotide polymorphisms (SNPs) and 9.1 million small insertions and deletions (INDELs), more than half of which have not been cataloged in dbSNP. In particular, we found 126 common deleterious mutations (MAF>0.01) that were absent in the existing public databases, highlighting the importance of local population reference for genetic diagnosis. We describe fine-scale genetic structure of Singapore populations and their relationship to worldwide populations from the 1000 Genomes Project. In addition to revealing noticeable amounts of admixture among three Singapore populations and a Malay-related novel ancestry component that has not been captured by the 1000 Genomes Project, our analysis also identified some fine-scale features of genetic structure consistent with two waves of prehistoric migration from south China to Southeast Asia. Finally, we demonstrate that our data can substantially improve genotype imputation not only for Singapore populations, but also for populations across Asia and Oceania. These results highlight the genetic diversity in Singapore and the potential impacts of our data as a resource to empower human genetics discovery in a broad geographic region.
biorxiv genetics 100-200-users 2018
One read per cell per gene is optimal for single-cell RNA-Seq, bioRxiv, 2018-08-10
An underlying question for virtually all single-cell RNA sequencing experiments is how to allocate the limited sequencing budget: deep sequencing of a few cells or shallow sequencing of many cells? A mathematical framework reveals that, for estimating many important gene properties, the optimal allocation is to sequence at the depth of one read per cell per gene. Interestingly, the corresponding optimal estimator is not the widely used plugin estimator but one developed via empirical Bayes.
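As a rough arithmetic illustration of the trade-off described above, the sketch below takes the "one read per cell per gene" rule at face value and converts a fixed read budget into a per-cell depth and a cell count. It assumes, as a simplification not stated in the abstract, that a gene of interest receives a known fraction of each cell's reads; the function name, the parameter names, and the numbers are hypothetical, and this is not the paper's mathematical framework or its empirical Bayes estimator.

```python
import math
from typing import Tuple


def allocate_budget(
    total_reads: int,
    gene_read_fraction: float,
    target_reads_per_cell_per_gene: float = 1.0,
) -> Tuple[float, int]:
    """Split a read budget into (reads per cell, number of cells).

    Assumption: the gene of interest receives `gene_read_fraction` of the
    reads in every cell, so hitting the target depth for that gene needs
    target / fraction reads per cell.
    """
    if not 0.0 < gene_read_fraction <= 1.0:
        raise ValueError("gene_read_fraction must be in (0, 1]")
    if total_reads <= 0 or target_reads_per_cell_per_gene <= 0:
        raise ValueError("budget and target depth must be positive")
    reads_per_cell = target_reads_per_cell_per_gene / gene_read_fraction
    n_cells = math.floor(total_reads / reads_per_cell)
    return reads_per_cell, n_cells


# Hypothetical budget: one billion reads, gene takes 1 in 10,000 reads.
reads_per_cell, n_cells = allocate_budget(1_000_000_000, 1e-4)
print(f"reads per cell: {reads_per_cell:.0f}, cells: {n_cells}")
```

Under these assumptions, sequencing deeper than the target depth spends reads that could instead have been used to profile additional cells.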
biorxiv bioinformatics 100-200-users 2018
Rapid Diagnosis of Lower Respiratory Infection using Nanopore-based Clinical Metagenomics, bioRxiv, 2018-08-09
Abstract
Lower respiratory infections (LRIs) accounted for three million deaths worldwide in 2016, the leading infectious cause of mortality. The “gold standard” for investigation of bacterial LRIs is culture, which has poor sensitivity and is too slow to guide early antibiotic therapy. Metagenomic sequencing potentially could replace culture, providing rapid, sensitive and comprehensive results. We developed a metagenomics pipeline for the investigation of bacterial LRIs using saponin-based host DNA depletion combined with rapid nanopore sequencing. The first iteration of the pipeline was tested on respiratory samples from 40 patients. It was then refined to reduce turnaround and increase sensitivity, before testing a further 41 samples. The refined method was 96.6% concordant with culture for detection of pathogens and could accurately detect resistance genes with a turnaround time of six hours. This study demonstrates that nanopore metagenomics can rapidly and accurately characterise bacterial LRIs when combined with efficient human DNA depletion.
biorxiv microbiology 100-200-users 2018
MULTI-seq: Scalable sample multiplexing for single-cell RNA sequencing using lipid-tagged indices, bioRxiv, 2018-08-08
Abstract
We describe MULTI-seq: a rapid, modular, and universal scRNA-seq sample multiplexing strategy using lipid-tagged indices. MULTI-seq reagents can barcode any cell type from any species with an accessible plasma membrane. The method is compatible with enzymatic tissue dissociation, and also preserves viability and endogenous gene expression patterns. We leverage these features to multiplex the analysis of multiple solid tissues comprising human and mouse cells isolated from patient-derived xenograft mouse models. We also utilize MULTI-seq’s modular design to perform a 96-plex perturbation experiment with human mammary epithelial cells. MULTI-seq also enables robust doublet identification, which improves data quality and increases scRNA-seq cell throughput by minimizing the negative effects of Poisson loading. We anticipate that the sample throughput and reagent savings enabled by MULTI-seq will expand the purview of scRNA-seq and democratize the application of these technologies within the scientific community.
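The "negative effects of Poisson loading" mentioned above can be illustrated with the textbook Poisson model of droplet occupancy: if cells land in droplets independently with mean λ cells per droplet, the share of cell-containing droplets that hold two or more cells grows as λ increases. The sketch below computes that share; the function name and the loading rates are illustrative assumptions, not values or code from the MULTI-seq study.

```python
import math


def multiplet_fraction(mean_cells_per_droplet: float) -> float:
    """Fraction of non-empty droplets containing two or more cells.

    Assumes droplet occupancy is Poisson with the given mean (lambda):
    P(0) = exp(-lambda), P(1) = lambda * exp(-lambda).
    """
    lam = mean_cells_per_droplet
    if lam <= 0:
        raise ValueError("mean_cells_per_droplet must be positive")
    p_empty = math.exp(-lam)
    p_single = lam * p_empty
    p_multi = 1.0 - p_empty - p_single
    return p_multi / (1.0 - p_empty)


# Hypothetical loading rates: heavier loading raises throughput but also
# the share of droplets that capture more than one cell.
for lam in (0.05, 0.1, 0.5):
    print(f"lambda={lam}: multiplet fraction={multiplet_fraction(lam):.3f}")
```

This is why reliable doublet identification matters for throughput: if multiplets can be flagged and removed afterwards, droplets can be loaded more heavily without the extra multiplets contaminating the data.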
biorxiv genomics 100-200-users 2018