biorxiv | audiences

Interpretable multimodal deep learning for real-time pan-tissue pan-disease pathology search on social media, bioRxiv, 2018-08-21

Abstract

Background: Pathologists are responsible for rapidly providing a diagnosis on critical health issues, from infection to malignancy. Challenging cases benefit from additional opinions of pathologist colleagues. In addition to on-site colleagues, there is an active worldwide community of pathologists on social media for complementary opinions. Such access to pathologists worldwide has the capacity to (i) improve diagnostic accuracy and (ii) generate broader consensus on next steps in patient care.

Methods and findings: From Twitter we curate 13,626 images from 6,351 tweets from 25 pathologists from 13 countries. We supplement the Twitter data with 113,161 images from 1,074,484 PubMed articles. We develop machine learning and deep learning models to (i) accurately identify histopathology stains, (ii) discriminate between tissues, and (iii) differentiate disease states. For deep learning, we derive novel regularization and activation functions for set representations related to set cardinality and the Heaviside step function. Area Under Receiver Operating Characteristic is 0.805-0.996 for these tasks. We repurpose the disease classifier to search for similar disease states given an image and clinical covariates. We report precision@k=1 = 0.701±0.003 (chance 0.397±0.004, mean±stdev). The classifiers find texture and tissue are important clinico-visual features of disease. For search, deep features and cell nuclei features are less important. We implement a social media bot (@pathobot on Twitter) to use the trained classifiers to aid pathologists in obtaining real-time feedback on challenging cases. The bot activates when mentioned in a social media post containing pathology text and images. The bot generates quantitative predictions of disease state (normal/artifact/infection/injury/nontumor, pre-neoplastic/benign/low-grade-malignant-potential, or malignant) and provides a ranked list of similar cases across social media and PubMed.

Conclusions: Our project has become a globally distributed expert system that facilitates pathological diagnosis and brings expertise to underserved regions or hospitals with less expertise in a particular disease. This is the first pan-tissue pan-disease (i.e. from infections to malignancy) method for prediction and search on social media, and the first pathology study prospectively tested in public on social media. We expect our project to cultivate a more connected world of physicians and improve patient care worldwide.

Author summary

Why was this study done?
- No publicly available pan-tissue pan-disease dataset exists for computational pathology. This limits the general application of machine learning in histopathology.
- Pathologists use social media to obtain both (i) opinions for challenging patient cases and (ii) continuing education. Connecting pathologists and linking to similar cases leads to more informative exchanges than computational predictions; e.g. to diagnose best, pathologists may discuss patient history and next tests to order. Additionally, pathologists seek the most interesting rare cases and new articles.

What did the researchers do and find?
- We generated a pan-tissue, pan-disease dataset comprising 10,000+ images from social media and 100,000+ images from PubMed. Classifiers applied to social media data suggest texture and tissue are important clinico-visual features of disease. Learning from both clinical covariates (e.g. tissue type or marker mentions) and visual features (e.g. local binary patterns or deep learning image features), these classifiers are multimodal.
- These data and classifiers power the first social media bot for pathology. It responds to pathologists in real time, searches for similar cases, and encourages collaboration.

What do these findings mean?
- This diverse dataset will be a critical test for machine learning in computational pathology, e.g. search for cures of rare diseases.
- Interpretable real-time classifiers can be successfully applied to images on social media and PubMed to find similar diseases and generate disease predictions. Going forward, similar methods may elucidate important clinico-visual features of specific diseases.
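
The search metric quoted in the abstract, precision@k=1, can be illustrated with a minimal sketch. This is not the authors' code: the function name `precision_at_1` and the toy labels are hypothetical, and the sketch assumes each query and each top-ranked search result carries a single disease-state label.

```python
def precision_at_1(query_labels, top_hit_labels):
    """Fraction of queries whose top-ranked search result has the same label.

    Hypothetical helper for illustration; both arguments are equal-length
    sequences of disease-state labels (one per query).
    """
    if len(query_labels) != len(top_hit_labels):
        raise ValueError("need exactly one top hit per query")
    if not query_labels:
        raise ValueError("need at least one query")
    matches = sum(q == hit for q, hit in zip(query_labels, top_hit_labels))
    return matches / len(query_labels)


# Toy data (made up): four query images and the label of each top search hit.
queries = ["malignant", "low-grade", "nontumor", "malignant"]
top_hits = ["malignant", "malignant", "nontumor", "malignant"]
score = precision_at_1(queries, top_hits)
```

On this toy data three of the four top hits match their query, so `score` is 0.75. The chance level reported in the abstract would correspond to the same calculation with randomly ranked results.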

biorxiv pathology 100-200-users 2018

Nanopore-based genome assembly and the evolutionary genomics of basmati rice, bioRxiv, 2018-08-20

ABSTRACT

BACKGROUND: The circum-basmati group of cultivated Asian rice (Oryza sativa) contains many iconic varieties and is widespread in the Indian subcontinent. Despite its economic and cultural importance, a high-quality reference genome is currently lacking, and the group’s evolutionary history is not fully resolved. To address these gaps, we used long-read nanopore sequencing and assembled the genomes of two circum-basmati rice varieties, Basmati 334 and Dom Sufid.

RESULTS: We generated two high-quality, chromosome-level reference genomes that represented the 12 chromosomes of Oryza. The assemblies showed a contig N50 of 6.32 Mb and 10.53 Mb for Basmati 334 and Dom Sufid, respectively. Using our highly contiguous assemblies we characterized structural variations segregating across circum-basmati genomes. We discovered repeat expansions not observed in japonica (the rice group most closely related to circum-basmati), as well as presence/absence variants of over 20 Mb, one of which was a circum-basmati-specific deletion of a gene regulating awn length. We further detected strong evidence of admixture between the circum-basmati and circum-aus groups. This gene flow had its greatest effect on chromosome 10, causing both structural variation and single nucleotide polymorphism to deviate from genome-wide history. Lastly, population genomic analysis of 78 circum-basmati varieties showed three major geographically structured genetic groups: (1) Bhutan/Nepal group, (2) India/Bangladesh/Myanmar group, and (3) Iran/Pakistan group.

CONCLUSION: Availability of high-quality reference genomes from nanopore sequencing allowed functional and evolutionary genomic analyses, providing genome-wide evidence for gene flow between circum-aus and circum-basmati, the nature of circum-basmati structural variation, and the presence/absence of genes in this important and iconic rice variety group.
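
For readers unfamiliar with the contig N50 statistic quoted above, here is a minimal sketch of the standard definition: the length of the shortest contig among the longest contigs that together cover at least half of the assembly. The function name `contig_n50` and the toy lengths are hypothetical and are not taken from the preprint.

```python
def contig_n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer test for "at least 50%"
            return length


# Toy assembly (made-up lengths, in Mb): total 20, half is 10.
toy_lengths = [8, 6, 4, 2]
n50 = contig_n50(toy_lengths)
```

Here the longest contig (8) covers less than half of the 20 Mb total, while the two longest together (14) cover more than half, so the N50 is 6.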

biorxiv evolutionary-biology 100-200-users 2018

Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure, bioRxiv, 2018-08-16

Abstract: SNP datasets are high-dimensional, often with thousands to millions of SNPs and hundreds to thousands of samples or individuals. Accordingly, PCA graphs are frequently used to provide a low-dimensional visualization in order to display and discover patterns in SNP data from humans, animals, plants, and microbes, especially to elucidate population structure. Given the popularity of PCA, one might expect that PCA is understood well and applied effectively. However, our literature survey of 125 representative articles that apply PCA to SNP data shows that three choices have usually been made poorly: PCA graph, SNP coding, and PCA variant. Our main three recommendations are simple and easily implemented: use PCA biplots, SNP coding 1 for the rare allele and 0 for the common allele, and double-centered PCA (or AMMI1 if main effects are of interest). The ultimate benefit from informed and optimal choices of PCA graph, SNP coding, and PCA variant is expected to be discovery of more biology, and thereby acceleration of medical, agricultural, and other vital applications.
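
Two of the recommendations above (rare-allele coding and double centering) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure: it assumes a biallelic 0/1 matrix with SNPs as rows and samples as columns, it does not handle diploid 0/1/2 genotype counts or missing data, and the function names `recode_rare_as_one` and `double_center` are hypothetical. The PCA itself (a singular value decomposition of the centered matrix) is left out.

```python
def recode_rare_as_one(snp_rows):
    """Recode each SNP (row) so that 1 marks the rarer allele in that row."""
    recoded = []
    for row in snp_rows:
        ones = sum(row)
        if 2 * ones > len(row):
            # 1 currently marks the common allele, so flip the coding.
            recoded.append([1 - allele for allele in row])
        else:
            recoded.append(list(row))
    return recoded


def double_center(matrix):
    """Subtract row and column means and add back the grand mean."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    grand_mean = sum(row_means) / n_rows
    return [
        [matrix[i][j] - row_means[i] - col_means[j] + grand_mean
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy data (made up): 3 SNPs x 4 samples with arbitrary 0/1 coding.
raw = [[1, 1, 1, 0],
       [0, 1, 0, 0],
       [1, 0, 1, 1]]
coded = recode_rare_as_one(raw)
centered = double_center(coded)
```

After double centering, every row and every column of `centered` sums to zero (up to floating-point error), which is the property that distinguishes this variant from PCA centered on SNPs only or on samples only.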

biorxiv genomics 0-100-users 2018

High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution, bioRxiv, 2018-08-16

Abstract: Targeted PCR amplification and high-throughput sequencing (amplicon sequencing) of 16S rRNA gene fragments is widely used to profile microbial communities. New long-read sequencing technologies can sequence the entire 16S rRNA gene, but higher error rates have limited their attractiveness when accuracy is important. Here we present a high-throughput amplicon sequencing methodology based on PacBio circular consensus sequencing and the DADA2 sample inference method that measures the full-length 16S rRNA gene with single-nucleotide resolution and a near-zero error rate.

In two artificial communities of known composition, our method recovered the full complement of full-length 16S sequence variants from expected community members without residual errors. The measured abundances of intra-genomic sequence variants were in the integral ratios expected from the genuine allelic variants within a genome. The full-length 16S gene sequences recovered by our approach allowed E. coli strains to be correctly classified to the O157:H7 and K12 sub-species clades. In human fecal samples, our method showed strong technical replication and was able to recover the full complement of 16S rRNA alleles in several E. coli strains.

There are likely many applications beyond microbial profiling for which high-throughput amplicon sequencing of complete genes with single-nucleotide resolution will be of use.

biorxiv microbiology 200-500-users 2018

Large-scale whole-genome sequencing of three diverse Asian populations in Singapore, bioRxiv, 2018-08-11

Abstract: Asian populations are currently underrepresented in human genetics research. Here we present whole-genome sequencing data of 4,810 Singaporeans from three diverse ethnic groups: 2,780 Chinese, 903 Malays, and 1,127 Indians. Despite a medium depth of 13.7×, we achieved essentially perfect (>99.8%) sensitivity and accuracy for detecting common variants and good sensitivity (>89%) for detecting extremely rare variants with <0.1% allele frequency. We found 89.2 million single-nucleotide polymorphisms (SNPs) and 9.1 million small insertions and deletions (INDELs), more than half of which have not been cataloged in dbSNP. In particular, we found 126 common deleterious mutations (MAF>0.01) that were absent in the existing public databases, highlighting the importance of local population reference for genetic diagnosis. We describe fine-scale genetic structure of Singapore populations and their relationship to worldwide populations from the 1000 Genomes Project. In addition to revealing noticeable amounts of admixture among three Singapore populations and a Malay-related novel ancestry component that has not been captured by the 1000 Genomes Project, our analysis also identified some fine-scale features of genetic structure consistent with two waves of prehistoric migration from south China to Southeast Asia. Finally, we demonstrate that our data can substantially improve genotype imputation not only for Singapore populations, but also for populations across Asia and Oceania. These results highlight the genetic diversity in Singapore and the potential impacts of our data as a resource to empower human genetics discovery in a broad geographic region.

biorxiv genetics 100-200-users 2018

One read per cell per gene is optimal for single-cell RNA-Seq, bioRxiv, 2018-08-10

An underlying question for virtually all single-cell RNA sequencing experiments is how to allocate the limited sequencing budget: deep sequencing of a few cells or shallow sequencing of many cells? A mathematical framework reveals that, for estimating many important gene properties, the optimal allocation is to sequence at the depth of one read per cell per gene. Interestingly, the corresponding optimal estimator is not the widely-used plugin estimator but one developed via empirical Bayes.
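
The budget trade-off can be made concrete with a little arithmetic. The sketch below is a simplification and not the preprint's framework: it reads "one read per cell per gene" as an average over all genes counted in each cell, which is an assumption, and the function name `cells_at_target_depth` and the example numbers are hypothetical.

```python
def cells_at_target_depth(total_reads, n_genes, reads_per_cell_per_gene=1.0):
    """Number of cells a fixed read budget covers at a chosen average depth.

    Assumes depth is averaged over n_genes genes per cell (a simplification).
    """
    if total_reads <= 0 or n_genes <= 0 or reads_per_cell_per_gene <= 0:
        raise ValueError("all arguments must be positive")
    reads_per_cell = n_genes * reads_per_cell_per_gene
    return int(total_reads // reads_per_cell)


# Made-up example: a budget of 400 million reads and 20,000 genes.
budget = 400_000_000
genes = 20_000
shallow = cells_at_target_depth(budget, genes, reads_per_cell_per_gene=1.0)
deep = cells_at_target_depth(budget, genes, reads_per_cell_per_gene=10.0)
```

With these numbers the same budget covers 20,000 cells at one read per cell per gene but only 2,000 cells at ten reads per cell per gene, which is the kind of trade-off the abstract describes.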

biorxiv bioinformatics 100-200-users 2018