A global perspective on bioinformatics training needs, bioRxiv, 2017-02-28
Abstract: In the last decade, life-science research has become increasingly data-intensive and computational. Nevertheless, basic bioinformatics and data stewardship are still only rarely taught in life-science degree programmes, creating a widening skills gap that spans educational levels and career roles. To better understand this situation, we ran surveys to determine how the skills dearth is affecting the need for bioinformatics training worldwide. Perhaps unsurprisingly, we found that respondents wanted more short courses to help boost their expertise and confidence in data analysis and interpretation. However, it was evident that most respondents appreciated their need for training only after designing their experiments and collecting their data. This is clearly rather late in the research workflow, and suboptimal from a training perspective, as skills acquired to address a specific need at a particular time are seldom retained, engendering a cycle of low confidence in trainees. To ensure that such skill gaps do not continue to create barriers to the progress of research, we argue that universities should strive to bring their life-science curricula into the digital-data era. Meanwhile, the demand for point-of-need training in bioinformatics and data stewardship will grow. While this situation persists, international groups like GOBLET are increasing their efforts to enlarge the community of trainers and quench the global thirst for bioinformatics training.
biorxiv scientific-communication-and-education 100-200-users 2017

Enabling cross-study analysis of RNA-Sequencing data, bioRxiv, 2017-02-28
Abstract: Driven by the recent advances of next generation sequencing (NGS) technologies and an urgent need to decode complex human diseases, a multitude of large-scale studies were conducted recently that have resulted in an unprecedented volume of whole transcriptome sequencing (RNA-seq) data. While these data offer new opportunities to identify the mechanisms underlying disease, the comparison of data from different sources poses a great challenge, due to differences in sample and data processing. Here, we present a pipeline that processes and unifies RNA-seq data from different studies, which includes uniform realignment and gene expression quantification as well as batch effect removal. We find that uniform alignment and quantification is not sufficient when combining RNA-seq data from different sources and that the removal of other batch effects is essential to facilitate data comparison. We have processed data from the Genotype Tissue Expression project (GTEx) and The Cancer Genome Atlas (TCGA) and have successfully corrected for study-specific biases, enabling comparative analysis across studies. The normalized data are available for download via GitHub (at https://github.com/mskcc/RNAseqDB).
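The abstract does not name the batch-correction method, so the following is only a minimal sketch of the general idea of removing a study-specific bias. It assumes log-scale expression values and uses per-gene mean centring within each study as a simple stand-in for a real batch-correction procedure. The function name, sample names, and study labels are all hypothetical.

```python
from collections import defaultdict


def center_by_study(expr, study_of):
    """Shift each gene so that every study has the same mean for that gene.

    expr:     {sample: {gene: log_expression}}  (every sample has every gene)
    study_of: {sample: study_label}
    This is an illustrative simplification, not the pipeline from the paper.
    """
    genes = sorted({gene for values in expr.values() for gene in values})
    corrected = {sample: {} for sample in expr}
    for gene in genes:
        # Mean of this gene over all samples, regardless of study.
        all_values = [expr[sample][gene] for sample in expr]
        grand_mean = sum(all_values) / len(all_values)
        # Mean of this gene within each study.
        by_study = defaultdict(list)
        for sample in expr:
            by_study[study_of[sample]].append(expr[sample][gene])
        study_mean = {st: sum(v) / len(v) for st, v in by_study.items()}
        # Remove the study offset, then restore the overall level.
        for sample in expr:
            offset = study_mean[study_of[sample]] - grand_mean
            corrected[sample][gene] = expr[sample][gene] - offset
    return corrected


# Hypothetical toy data: two studies measuring the same gene with an offset.
expr = {
    "a1": {"GENE": 1.0}, "a2": {"GENE": 3.0},
    "b1": {"GENE": 6.0}, "b2": {"GENE": 8.0},
}
study_of = {"a1": "study_a", "a2": "study_a", "b1": "study_b", "b2": "study_b"}
corrected = center_by_study(expr, study_of)
```

Mean centring of this kind also removes genuine biological differences when the studies differ in composition (for example, normal versus tumour tissue), which is one reason dedicated batch-correction methods model known covariates explicitly.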
biorxiv bioinformatics 0-100-users 2017

MAGIC: A diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data, bioRxiv, 2017-02-26
Abstract: Single-cell RNA-sequencing is fast becoming a major technology that is revolutionizing biological discovery in fields such as development, immunology and cancer. The ability to simultaneously measure thousands of genes at single cell resolution allows, among other prospects, for the possibility of learning gene regulatory networks at large scales. However, scRNA-seq technologies suffer from many sources of significant technical noise, the most prominent of which is ‘dropout’ due to inefficient mRNA capture. This results in data that has a high degree of sparsity, with typically only ~10% non-zero values. To address this, we developed MAGIC (Markov Affinity-based Graph Imputation of Cells), a method for imputing missing values, and restoring the structure of the data. After MAGIC, we find that two- and three-dimensional gene interactions are restored and that MAGIC is able to impute complex and non-linear shapes of interactions. MAGIC also retains cluster structure, enhances cluster-specific gene interactions and restores trajectories, as demonstrated in mouse retinal bipolar cells, hematopoiesis, and our newly generated epithelial-to-mesenchymal transition dataset.
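The abstract gives only the name of the approach (Markov affinity-based graph imputation), not its details. As a rough illustration of what diffusion over a cell-cell affinity graph can look like, here is a minimal sketch. It assumes a Gaussian kernel on Euclidean distances, row normalisation into a Markov matrix, and a fixed number of diffusion steps. The function name and the parameters `sigma` and `t` are hypothetical, and this is not the authors' implementation.

```python
import math


def diffuse_impute(data, sigma=1.0, t=2):
    """Smooth a cells-by-genes matrix by diffusing values over a cell-cell graph.

    sigma (kernel width) and t (number of diffusion steps) are illustrative
    parameters, not values taken from the paper.
    """
    n_cells, n_genes = len(data), len(data[0])

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Gaussian affinity between every pair of cells (assumed kernel).
    affinity = [[math.exp(-sq_dist(data[i], data[j]) / (2 * sigma ** 2))
                 for j in range(n_cells)] for i in range(n_cells)]
    # Row-normalise so each row is a probability distribution (Markov matrix).
    markov = [[a / sum(row) for a in row] for row in affinity]

    # Each step replaces a cell's profile with a weighted average of all cells,
    # weighted by the transition probabilities.
    imputed = [row[:] for row in data]
    for _ in range(t):
        imputed = [[sum(markov[i][k] * imputed[k][g] for k in range(n_cells))
                    for g in range(n_genes)] for i in range(n_cells)]
    return imputed


# Hypothetical toy data: the second cell has a "dropout" zero in the second gene.
cells = [[5.0, 4.0], [5.0, 0.0], [4.5, 3.5]]
smoothed = diffuse_impute(cells, sigma=2.0, t=1)
```

In this toy example the zero in the second cell is pulled towards the values of its neighbours, because every smoothed entry is a weighted average of the observed values for that gene. Larger `t` smooths more aggressively and can blur real differences between cells.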
biorxiv bioinformatics 100-200-users 2017

Multiplexed confocal and super-resolution fluorescence imaging of cytoskeletal and neuronal synapse proteins, bioRxiv, 2017-02-26
Abstract: Neuronal synapses contain dozens of protein species whose expression levels and localizations are key determinants of synaptic transmission and plasticity. The spectral properties of fluorophores used in conventional microscopy limit the number of measured proteins to four species within a given sample. The ability to perform high-throughput confocal or super-resolution imaging of many proteins simultaneously without limitation in target number imposed by this spectral limit would enable large-scale characterization of synaptic protein networks in situ. Here, we introduce PRISM: Probe-based Imaging for Sequential Multiplexing, a method that sequentially utilizes either high affinity Locked Nucleic Acid (LNA) or low affinity DNA probes to enable diffraction-limited confocal and PAINT-based super-resolution imaging. High-affinity LNA probes offer high-throughput, confocal-based imaging compared with PAINT, which uses low affinity probes to realize localization-based super-resolution imaging. Simultaneous immunostaining of all targets is performed prior to imaging, followed by sequential LNA/DNA probe exchange that requires only minutes under mild wash conditions. We apply PRISM to quantify the co-expression levels and nanometer-scale organization of one dozen cytoskeletal and synaptic proteins within individual neuronal synapses. Our approach is scalable to dozens of target proteins and is compatible with high-content screening platforms commonly used to interrogate phenotypic changes associated with genetic and drug perturbations in a variety of cell types.
biorxiv bioengineering 0-100-users 2017

Modern machine learning outperforms GLMs at predicting spikes, bioRxiv, 2017-02-25
Abstract: Neuroscience has long focused on finding encoding models that effectively ask “what predicts neural spiking?” and generalized linear models (GLMs) are a typical approach. It is often unknown how much of explainable neural activity is captured, or missed, when fitting a GLM. Here we compared the predictive performance of GLMs to three leading machine learning methods: feedforward neural networks, gradient boosted trees (using XGBoost), and stacked ensembles that combine the predictions of several methods. We predicted spike counts in macaque motor (M1) and somatosensory (S1) cortices from standard representations of reaching kinematics, and in rat hippocampal cells from open field location and orientation. In general, the modern methods (particularly XGBoost and the ensemble) produced more accurate spike predictions and were less sensitive to the preprocessing of features. This discrepancy in performance suggests that standard feature sets may often relate to neural activity in a nonlinear manner not captured by GLMs. Encoding models built with machine learning techniques, which can be largely automated, more accurately predict spikes and can offer meaningful benchmarks for simpler models.
biorxiv neuroscience 100-200-users 2017

A practical guide for inferring reliable dominance hierarchies and estimating their uncertainty, bioRxiv, 2017-02-24
Abstract: Many animal social structures are organized hierarchically, with dominant individuals monopolizing resources. Dominance hierarchies have received great attention from behavioural and evolutionary ecologists. As a result, there are many methods for inferring hierarchies from social interactions. Yet, there are no clear guidelines about how many observed dominance interactions (i.e. sampling effort) are necessary for inferring reliable dominance hierarchies, nor are there any established tools for quantifying their uncertainty. In this study, we simulated interactions (winners and losers) in scenarios of varying steepness (the probability that a dominant defeats a subordinate based on their difference in rank). Using these data, we (1) quantify how the number of interactions recorded and hierarchy steepness affect the performance of three methods, (2) propose an amendment that improves the performance of a popular method, and (3) suggest two easy procedures to measure uncertainty in the inferred hierarchy. First, we found that the ratio of interactions to individuals required to infer reliable hierarchies is surprisingly low, but depends on the hierarchy steepness and method used. We then show that David’s score and our novel randomized Elo-rating are the two best methods, whereas the original Elo-rating and the recently described ADAGIO perform less well. Finally, we propose two simple methods to estimate uncertainty at the individual and group level. These uncertainty measures further make it possible to differentiate non-existent, very flat and highly uncertain hierarchies from intermediate, steep and certain hierarchies. Overall, we find that the methods for inferring dominance hierarchies are relatively robust, even when the ratio of observed interactions to individuals is as low as 10 to 20. However, we suggest that implementing simple procedures for estimating uncertainty will benefit researchers, and quantifying the shape of the dominance hierarchies will provide new insights into the study organisms.
Highlights:
- David’s score and the randomized Elo-rating perform best.
- Method performance depends on hierarchy steepness and sampling effort.
- Generally, inferring dominance hierarchies requires relatively few observations.
- The R package “aniDom” allows easy estimation of hierarchy uncertainty.
- Hierarchy uncertainty provides insights into the shape of the dominance hierarchy.
biorxiv animal-behavior-and-cognition 0-100-users 2017
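
The dominance-hierarchy abstract above mentions a randomized Elo-rating without describing it. A minimal sketch of one plausible reading follows: it assumes that the randomization consists of shuffling the order of the observed interactions many times, computing ordinary Elo ratings for each order, and averaging the results. The update rule is the standard chess-style Elo formula. The function names, the constants `k` and `scale`, and the number of randomizations are assumptions for illustration, and this is not the code of the "aniDom" package.

```python
import random


def elo_ratings(interactions, k=100.0, start=0.0, scale=400.0):
    """Ordinary Elo ratings from an ordered list of (winner, loser) pairs."""
    ratings = {}
    for winner, loser in interactions:
        r_win = ratings.setdefault(winner, start)
        r_lose = ratings.setdefault(loser, start)
        # Probability that the eventual winner was expected to win.
        expected = 1.0 / (1.0 + 10 ** ((r_lose - r_win) / scale))
        # Unexpected wins move the ratings more than expected ones.
        ratings[winner] = r_win + k * (1.0 - expected)
        ratings[loser] = r_lose - k * (1.0 - expected)
    return ratings


def randomized_elo(interactions, n_rand=200, seed=0):
    """Average Elo ratings over many random orderings of the interactions."""
    rng = random.Random(seed)
    order = list(interactions)
    totals = {}
    for _ in range(n_rand):
        rng.shuffle(order)  # assumed randomization: shuffle the sequence
        for individual, rating in elo_ratings(order).items():
            totals[individual] = totals.get(individual, 0.0) + rating
    return {ind: total / n_rand for ind, total in totals.items()}


# Hypothetical toy data: A always beats B and C, and B always beats C.
interactions = [("A", "B")] * 5 + [("B", "C")] * 5 + [("A", "C")] * 5
mean_ratings = randomized_elo(interactions)
```

Averaging over orderings removes the dependence of the final ratings on the sequence in which interactions happened to be recorded, which only makes sense when the hierarchy is assumed to be stable over the observation period.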