2018 | audiences

Widespread methane formation by Cyanobacteria in aquatic and terrestrial ecosystems, bioRxiv, 2018-08-25

Abstract: Evidence is accumulating to challenge the paradigm that biogenic methanogenesis, traditionally considered a strictly anaerobic process, is exclusive to Archaea. Here we demonstrate that Cyanobacteria living in marine, freshwater and terrestrial environments produce methane at substantial rates under light and dark, oxic and anoxic conditions, forming a link between light-driven primary productivity and methane production in a globally relevant group of phototrophs. Biogenic methane production was enhanced during oxygenic photosynthesis and directly attributed to the cyanobacteria by applying stable isotope labelling techniques. We suggest that formation of methane by Cyanobacteria may contribute to methane accumulation in oxygen-saturated surface waters of marine and freshwater ecosystems. Moreover, cyanobacterial blooms already occur in these environments and might become more frequent under future global warming, and thus have a direct feedback on climate change. We further highlight that cyanobacterial methane production not only affects recent and future global methane budgets, but also has implications for inferences on Earth’s methane budget for the last 3.5 billion years, when this phylum is thought to have first evolved.

biorxiv microbiology 200-500-users 2018

Fast Batch Alignment of Single Cell Transcriptomes Unifies Multiple Mouse Cell Atlases into an Integrated Landscape, bioRxiv, 2018-08-21

Abstract: Increasing numbers of large-scale single cell RNA-Seq projects are leading to a data explosion, which can only be fully exploited through data integration. Therefore, efficient computational tools for combining diverse datasets are crucial for biology in the single cell genomics era. A number of methods have been developed to assist data integration by removing technical batch effects, but most are computationally intensive. To overcome the challenge of enormous datasets, we have developed BBKNN, an extremely fast graph-based data integration method. We illustrate the power of BBKNN for dimensionality-reduced visualisation and clustering in multiple biological scenarios, including a massive integrative study over several murine atlases. BBKNN successfully connects cell populations across experimentally heterogeneous mouse scRNA-Seq datasets, which reveals global markers of cell type and organ specificity and provides the foundation for inferring the underlying transcription factor network. BBKNN is available at https://github.com/Teichlab/bbknn.
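
The abstract above does not spell out how BBKNN builds its graph. The following is a minimal sketch, assuming (as the name "batch balanced k nearest neighbours" suggests) that each cell picks its nearest neighbours separately within every batch, so that every batch is represented in each cell's neighbourhood. The function name, the toy coordinates, and the choice to exclude the cell itself are illustrative assumptions, not the package's API.

```python
import math
from collections import defaultdict


def batch_balanced_neighbors(points, batches, k=2):
    """For each cell, pick its k nearest neighbours within every batch.

    points  : list of coordinate tuples (e.g. cells in a reduced space)
    batches : list of batch labels, one per cell
    Returns a dict mapping cell index -> list of neighbour indices.
    Hypothetical helper for illustration only.
    """
    # Group cell indices by their batch label.
    by_batch = defaultdict(list)
    for idx, label in enumerate(batches):
        by_batch[label].append(idx)

    neighbors = {}
    for i, point in enumerate(points):
        picked = []
        for label in sorted(by_batch):
            # Candidates from this batch, excluding the cell itself
            # (a simplification made for this sketch).
            candidates = [j for j in by_batch[label] if j != i]
            candidates.sort(key=lambda j: math.dist(point, points[j]))
            picked.extend(candidates[:k])
        neighbors[i] = picked
    return neighbors


# Toy data: two batches that are shifted relative to each other.
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6)]
toy_batches = ["a", "a", "b", "b"]
toy_graph = batch_balanced_neighbors(toy_points, toy_batches, k=1)
```

With `k=1` and two batches, every cell ends up with two neighbours, one per batch, which is what links populations across batches even when a batch effect shifts them apart.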

biorxiv bioinformatics 0-100-users 2018

Interpretable multimodal deep learning for real-time pan-tissue pan-disease pathology search on social media, bioRxiv, 2018-08-21

Abstract: Background: Pathologists are responsible for rapidly providing a diagnosis on critical health issues, from infection to malignancy. Challenging cases benefit from additional opinions of pathologist colleagues. In addition to on-site colleagues, there is an active worldwide community of pathologists on social media for complementary opinions. Such access to pathologists worldwide has the capacity to (i) improve diagnostic accuracy and (ii) generate broader consensus on next steps in patient care.
Methods and findings: From Twitter we curate 13,626 images from 6,351 tweets from 25 pathologists from 13 countries. We supplement the Twitter data with 113,161 images from 1,074,484 PubMed articles. We develop machine learning and deep learning models to (i) accurately identify histopathology stains, (ii) discriminate between tissues, and (iii) differentiate disease states. For deep learning, we derive novel regularization and activation functions for set representations related to set cardinality and the Heaviside step function. Area Under Receiver Operating Characteristic is 0.805-0.996 for these tasks. We repurpose the disease classifier to search for similar disease states given an image and clinical covariates. We report precision@k=1 = 0.701±0.003 (chance 0.397±0.004, mean±stdev). The classifiers find texture and tissue are important clinico-visual features of disease. For search, deep features and cell nuclei features are less important. We implement a social media bot (@pathobot on Twitter) to use the trained classifiers to aid pathologists in obtaining real-time feedback on challenging cases. The bot activates when mentioned in a social media post containing pathology text and images. The bot generates quantitative predictions of disease state (normal/artifact/infection/injury/nontumor, pre-neoplastic/benign/low-grade-malignant-potential, or malignant) and provides a ranked list of similar cases across social media and PubMed.
Conclusions: Our project has become a globally distributed expert system that facilitates pathological diagnosis and brings expertise to underserved regions or hospitals with less expertise in a particular disease. This is the first pan-tissue pan-disease (i.e. from infections to malignancy) method for prediction and search on social media, and the first pathology study prospectively tested in public on social media. We expect our project to cultivate a more connected world of physicians and improve patient care worldwide.
Author summary
Why was this study done?
- No publicly available pan-tissue pan-disease dataset exists for computational pathology. This limits the general application of machine learning in histopathology.
- Pathologists use social media to obtain both (i) opinions for challenging patient cases and (ii) continuing education. Connecting pathologists and linking to similar cases leads to more informative exchanges than computational predictions (e.g. to diagnose best, pathologists may discuss patient history and next tests to order). Additionally, pathologists seek the most interesting rare cases and new articles.
What did the researchers do and find?
- We generated a pan-tissue, pan-disease dataset comprising 10,000+ images from social media and 100,000+ images from PubMed. Classifiers applied to social media data suggest texture and tissue are important clinico-visual features of disease. Learning from both clinical covariates (e.g. tissue type or marker mentions) and visual features (e.g. local binary patterns or deep learning image features), these classifiers are multimodal.
- These data and classifiers power the first social media bot for pathology. It responds to pathologists in real time, searches for similar cases, and encourages collaboration.
What do these findings mean?
- This diverse dataset will be a critical test for machine learning in computational pathology, e.g. search for cures of rare diseases.
- Interpretable real-time classifiers can be successfully applied to images on social media and PubMed to find similar diseases and generate disease predictions. Going forward, similar methods may elucidate important clinico-visual features of specific diseases.
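
The search metric reported above, precision@k, can be illustrated with a short sketch. It assumes the usual definition: the fraction of the top k retrieved cases whose disease state matches that of the query. The function name and the toy labels are hypothetical and are not taken from the study's data or code.

```python
def precision_at_k(ranked_labels, query_label, k=1):
    """Fraction of the top-k retrieved labels that match the query label.

    ranked_labels : disease-state labels of retrieved cases, best match first
    query_label   : disease-state label of the query case
    Assumes at least k results were retrieved (simplification for this sketch).
    """
    top = ranked_labels[:k]
    hits = sum(1 for label in top if label == query_label)
    return hits / k


# Toy ranking for a query labelled "malignant" (made-up labels).
toy_ranking = ["malignant", "benign", "malignant", "nontumor"]
p_at_1 = precision_at_k(toy_ranking, "malignant", k=1)
p_at_4 = precision_at_k(toy_ranking, "malignant", k=4)
```

At k=1 the metric simply asks whether the single best match shares the query's disease state, so averaging it over many queries gives the kind of figure quoted in the abstract.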

biorxiv pathology 100-200-users 2018

Nanopore-based genome assembly and the evolutionary genomics of basmati rice, bioRxiv, 2018-08-20

Abstract: BACKGROUND: The circum-basmati group of cultivated Asian rice (Oryza sativa) contains many iconic varieties and is widespread in the Indian subcontinent. Despite its economic and cultural importance, a high-quality reference genome is currently lacking, and the group’s evolutionary history is not fully resolved. To address these gaps, we used long-read nanopore sequencing and assembled the genomes of two circum-basmati rice varieties, Basmati 334 and Dom Sufid. RESULTS: We generated two high-quality, chromosome-level reference genomes that represented the 12 chromosomes of Oryza. The assemblies showed a contig N50 of 6.32 Mb and 10.53 Mb for Basmati 334 and Dom Sufid, respectively. Using our highly contiguous assemblies we characterized structural variations segregating across circum-basmati genomes. We discovered repeat expansions not observed in japonica, the rice group most closely related to circum-basmati, as well as presence/absence variants of over 20 Mb, one of which was a circum-basmati-specific deletion of a gene regulating awn length. We further detected strong evidence of admixture between the circum-basmati and circum-aus groups. This gene flow had its greatest effect on chromosome 10, causing both structural variation and single nucleotide polymorphism to deviate from genome-wide history. Lastly, population genomic analysis of 78 circum-basmati varieties showed three major geographically structured genetic groups: (1) Bhutan/Nepal group, (2) India/Bangladesh/Myanmar group, and (3) Iran/Pakistan group. CONCLUSION: Availability of high-quality reference genomes from nanopore sequencing allowed functional and evolutionary genomic analyses, providing genome-wide evidence for gene flow between circum-aus and circum-basmati, the nature of circum-basmati structural variation, and the presence/absence of genes in this important and iconic rice variety group.
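
The contig N50 statistic quoted above can be sketched as follows, assuming the standard definition: the length of the contig at which, going from longest to shortest, the running total first reaches half of the assembly size. The function name and the toy contig lengths are illustrative and are not taken from the assemblies.

```python
def contig_n50(lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the cumulative length reaches at least half of the
    total assembly length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional arithmetic.
        if 2 * running >= total:
            return length


# Toy assembly with made-up contig lengths (arbitrary units).
toy_n50 = contig_n50([2, 3, 4, 5, 6])
```

A larger N50 means that half of the assembly sits in longer contigs, which is why it is commonly reported as a measure of contiguity.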

biorxiv evolutionary-biology 100-200-users 2018

Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure, bioRxiv, 2018-08-16

Abstract: SNP datasets are high-dimensional, often with thousands to millions of SNPs and hundreds to thousands of samples or individuals. Accordingly, PCA graphs are frequently used to provide a low-dimensional visualization in order to display and discover patterns in SNP data from humans, animals, plants, and microbes, especially to elucidate population structure. Given the popularity of PCA, one might expect that PCA is understood well and applied effectively. However, our literature survey of 125 representative articles that apply PCA to SNP data shows that three choices have usually been made poorly: PCA graph, SNP coding, and PCA variant. Our main three recommendations are simple and easily implemented: use PCA biplots, SNP coding 1 for the rare allele and 0 for the common allele, and double-centered PCA (or AMMI1 if main effects are of interest). The ultimate benefit from informed and optimal choices of PCA graph, SNP coding, and PCA variant is expected to be discovery of more biology, and thereby acceleration of medical, agricultural, and other vital applications.
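
Two of the recommendations above, rare-allele coding and double-centering, can be sketched in a few lines. This is a minimal illustration under stated assumptions: "rare" is decided per SNP by allele count within the dataset, ties are broken alphabetically, and double-centering subtracts row and column means and adds back the grand mean. The decomposition step of PCA itself is omitted, and the function names and toy genotypes are hypothetical.

```python
def code_rare_allele(genotypes):
    """Code each SNP row as 1 for its rarer allele and 0 for the common one.

    genotypes : list of rows (one per SNP), each a list of allele labels
                (one per individual).
    Monomorphic rows are coded as all zeros (assumption for this sketch).
    """
    coded = []
    for row in genotypes:
        counts = {allele: row.count(allele) for allele in sorted(set(row))}
        if len(counts) < 2:
            coded.append([0] * len(row))
            continue
        rare = min(counts, key=counts.get)  # first minimum wins on ties
        coded.append([1 if allele == rare else 0 for allele in row])
    return coded


def double_center(matrix):
    """Subtract row and column means and add back the grand mean."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    grand_mean = sum(row_means) / n_rows
    return [[matrix[i][j] - row_means[i] - col_means[j] + grand_mean
             for j in range(n_cols)]
            for i in range(n_rows)]


# Toy data: 2 SNPs (rows) by 3 individuals (columns), made-up alleles.
toy_coded = code_rare_allele([["A", "A", "G"], ["C", "T", "T"]])
toy_centered = double_center(toy_coded)
```

After double-centering, every row and every column of the matrix sums to zero, so neither SNP main effects nor individual main effects remain in the data passed to the decomposition.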

biorxiv genomics 0-100-users 2018

High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution, bioRxiv, 2018-08-16

Abstract: Targeted PCR amplification and high-throughput sequencing (amplicon sequencing) of 16S rRNA gene fragments is widely used to profile microbial communities. New long-read sequencing technologies can sequence the entire 16S rRNA gene, but higher error rates have limited their attractiveness when accuracy is important. Here we present a high-throughput amplicon sequencing methodology based on PacBio circular consensus sequencing and the DADA2 sample inference method that measures the full-length 16S rRNA gene with single-nucleotide resolution and a near-zero error rate. In two artificial communities of known composition, our method recovered the full complement of full-length 16S sequence variants from expected community members without residual errors. The measured abundances of intra-genomic sequence variants were in the integral ratios expected from the genuine allelic variants within a genome. The full-length 16S gene sequences recovered by our approach allowed E. coli strains to be correctly classified to the O157:H7 and K12 sub-species clades. In human fecal samples, our method showed strong technical replication and was able to recover the full complement of 16S rRNA alleles in several E. coli strains. There are likely many applications beyond microbial profiling for which high-throughput amplicon sequencing of complete genes with single-nucleotide resolution will be of use.

biorxiv microbiology 200-500-users 2018