MSPminer: abundance-based reconstitution of microbial pan-genomes from shotgun metagenomic data, bioRxiv, 2017-08-09
Abstract
Motivation: Analysis toolkits for shotgun metagenomic data achieve strain-level characterization of complex microbial communities by capturing intra-species gene content variation. Yet, these tools are hampered by the extent of reference genomes that are far from covering all microbial variability, as many species are still not sequenced or have only a few strains available. Binning co-abundant genes obtained from de novo assembly is a powerful reference-free technique to discover and reconstitute the gene repertoire of microbial species. While current methods accurately identify species core parts, they miss many accessory genes or split them into small gene groups that remain unassociated with core clusters.
Results: We introduce MSPminer, a computationally efficient software tool that reconstitutes Metagenomic Species Pan-genomes (MSPs) by binning co-abundant genes across metagenomic samples. MSPminer relies on a new robust measure of proportionality coupled with an empirical classifier to group and distinguish not only species core genes but accessory genes also. Applied to a large-scale metagenomic dataset, MSPminer successfully delineates in a few hours the gene repertoires of 1 661 microbial species with similar specificity and higher sensitivity than existing tools. The taxonomic annotation of MSPs reveals microorganisms hitherto unknown and brings coherence to the nomenclature of the species of the human gut microbiota. The provided MSPs can be readily used for taxonomic profiling and biomarker discovery in human gut metagenomic samples. In addition, MSPminer can be applied to gene count tables from other ecosystems to perform similar analyses.
Availability: The binary is freely available for non-commercial users at enterome.fr/site/downloads
Contact: florian.plaza-onate@inra.fr
Supplementary information: Available in the file named Supplementary Information.pdf
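The co-abundance idea in the abstract above rests on genes of one species having counts that are proportional to each other across samples. The abstract does not define MSPminer's robust proportionality measure, so the sketch below swaps in plain cosine similarity as a stand-in. The function name and the toy counts are hypothetical and are not taken from the tool.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two abundance profiles.

    The value is 1 exactly when y = c * x for some c > 0, i.e. when the
    two genes are perfectly proportional across samples. This is NOT
    MSPminer's measure, only a simple, non-robust illustration.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical counts of three genes across five samples.
core_gene = [10, 0, 40, 20, 5]
same_species = [20, 0, 80, 40, 10]   # exactly twice core_gene
other_species = [0, 30, 5, 0, 50]    # unrelated abundance profile

sim_same = cosine_similarity(core_gene, same_species)
sim_other = cosine_similarity(core_gene, other_species)
print(f"same species:  {sim_same:.3f}")
print(f"other species: {sim_other:.3f}")
```

A measure like this is sensitive to outlier samples and to accessory genes that are absent from some strains, which is presumably why the authors describe their own measure as robust.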
biorxiv bioinformatics 0-100-users 2017
Bioinformatics Core Competencies for Undergraduate Life Sciences Education, bioRxiv, 2017-08-04
Abstract: Bioinformatics is becoming increasingly central to research in the life sciences. However, despite its importance, bioinformatics skills and knowledge are not well integrated in undergraduate biology education. This curricular gap prevents biology students from harnessing the full potential of their education, limiting their career opportunities and slowing genomic research innovation. To advance the integration of bioinformatics into life sciences education, a framework of core bioinformatics competencies is needed. To that end, we here report the results of a survey of life sciences faculty in the United States about teaching bioinformatics to undergraduate life scientists. Responses were received from 1,260 faculty representing institutions in all fifty states with a combined capacity to educate hundreds of thousands of students every year. Results indicate strong, widespread agreement that bioinformatics knowledge and skills are critical for undergraduate life scientists, as well as considerable agreement about which skills are necessary. Perceptions of the importance of some skills varied with the respondent’s degree of training, time since degree earned, and/or the Carnegie classification of the respondent’s institution. To assess which skills are currently being taught, we analyzed syllabi of courses with bioinformatics content submitted by survey respondents. Finally, we used the survey results, the analysis of syllabi, and our collective research and teaching expertise to develop a set of bioinformatics core competencies for undergraduate life sciences students. These core competencies are intended to serve as a guide for institutions as they work to integrate bioinformatics into their life sciences curricula.
Significance Statement: Bioinformatics, an interdisciplinary field that uses techniques from computer science and mathematics to store, manage, and analyze biological data, is becoming increasingly central to modern biology research.
Given the widespread use of bioinformatics and its impacts on societal problem-solving (e.g., in healthcare, agriculture, and natural resources management), there is a growing need for the integration of bioinformatics competencies into undergraduate life sciences education. Here, we present a set of bioinformatics core competencies for undergraduate life scientists developed using the results of a large national survey and the expertise of our working group of bioinformaticians and educators. We also present results from the survey on the importance of bioinformatics skills and the current state of integration of bioinformatics into biology education.
biorxiv bioinformatics 200-500-users 2017
Accurate detection of complex structural variations using single molecule sequencing, bioRxiv, 2017-07-29
Abstract: Structural variations (SVs) are the largest source of genetic variation, but remain poorly understood because of limited genomics technology. Single molecule long read sequencing from Pacific Biosciences and Oxford Nanopore has the potential to dramatically advance the field, although their high error rates challenge existing methods. Addressing this need, we introduce open-source methods for long read alignment (NGMLR, https://github.com/philres/ngmlr) and SV identification (Sniffles, https://github.com/fritzsedlazeck/Sniffles) that enable unprecedented SV sensitivity and precision, including within repeat-rich regions and of complex nested events that can have significant impact on human disorders. Examining several datasets, including healthy and cancerous human genomes, we discover thousands of novel variants using long reads and categorize systematic errors in short-read approaches. NGMLR and Sniffles are further able to automatically filter false events and operate on low amounts of coverage to address the cost factor that has hindered the application of long reads in clinical and research settings.
biorxiv bioinformatics 100-200-users 2017
Correcting batch effects in single-cell RNA sequencing data by matching mutual nearest neighbours, bioRxiv, 2017-07-19
Abstract: The presence of batch effects is a well-known problem in experimental data analysis, and single-cell RNA sequencing (scRNA-seq) is no exception. Large-scale scRNA-seq projects that generate data from different laboratories and at different times are rife with batch effects that can fatally compromise integration and interpretation of the data. In such cases, computational batch correction is critical for eliminating uninteresting technical factors and obtaining valid biological conclusions. However, existing methods assume that the composition of cell populations is either known or the same across batches. Here, we present a new strategy for batch correction based on the detection of mutual nearest neighbours in the high-dimensional expression space. Our approach does not rely on pre-defined or equal population compositions across batches, only requiring that a subset of the population be shared between batches. We demonstrate the superiority of our approach over existing methods on a range of simulated and real scRNA-seq data sets. We also show how our method can be applied to integrate scRNA-seq data from two separate studies of early embryonic development.
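To make the pairing idea in the abstract above concrete, here is a minimal sketch of finding mutual nearest neighbours between two batches. It covers only the pair detection step under simple assumptions (Euclidean distance, brute-force search, toy 2-D "expression" values). The function name and data are hypothetical, and the abstract does not say how the authors turn these pairs into a correction, so that step is not shown.

```python
import numpy as np

def mutual_nearest_neighbours(batch_a, batch_b, k=1):
    """Return sorted (i, j) pairs where cell i of batch_a and cell j of
    batch_b are each among the other's k nearest neighbours."""
    # Pairwise squared Euclidean distances, shape (n_a, n_b).
    diff = batch_a[:, None, :] - batch_b[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    # k nearest cells in B for every cell in A, and in A for every cell in B.
    nn_a_to_b = np.argsort(dist, axis=1)[:, :k]
    nn_b_to_a = np.argsort(dist, axis=0)[:k, :].T
    pairs = []
    for i in range(batch_a.shape[0]):
        for j in nn_a_to_b[i]:
            if i in nn_b_to_a[j]:  # keep the pair only if it is mutual
                pairs.append((i, int(j)))
    return sorted(pairs)

# Toy data: the first two cells of each batch come from a shared population
# (batch B is shifted along the first axis); the last cell of each batch
# belongs to a population that is absent from the other batch.
batch_a = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
batch_b = np.array([[1.0, 0.0], [1.0, 1.0], [20.0, 0.0]])

pairs = mutual_nearest_neighbours(batch_a, batch_b, k=1)
print(pairs)  # [(0, 0), (1, 1)]
```

Only cells from the shared population are paired. The batch-specific cells each have a nearest neighbour in the other batch, but the relation is not mutual, which is why the approach needs only a subset of the population to be shared between batches.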
biorxiv bioinformatics 0-100-users 2017
Text mining of 15 million full-text scientific articles, bioRxiv, 2017-07-12
Abstract: Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
biorxiv bioinformatics 100-200-users 2017
Privacy-preserving generative deep neural networks support clinical data sharing, bioRxiv, 2017-07-06
Abstract
Background: Data sharing accelerates scientific progress, but sharing individual-level data while preserving patient privacy presents a barrier.
Methods and Results: Using pairs of deep neural networks, we generated simulated, synthetic “participants” that closely resemble participants of the SPRINT trial. We showed that such paired networks can be trained with differential privacy, a formal privacy framework that limits the likelihood that queries of the synthetic participants’ data could identify a real participant in the trial. Machine-learning predictors built on the synthetic population generalize to the original dataset. This finding suggests that the synthetic data can be shared with others, enabling them to perform hypothesis-generating analyses as though they had the original trial data.
Conclusions: Deep neural networks that generate synthetic participants facilitate secondary analyses and reproducible investigation of clinical datasets by enhancing data sharing while preserving participant privacy.
biorxiv bioinformatics 200-500-users 2017