Enabling cross-study analysis of RNA-Sequencing data, bioRxiv, 2017-02-28
AbstractDriven by the recent advances of next generation sequencing (NGS) technologies and an urgent need to decode complex human diseases, a multitude of large-scale studies were conducted recently that have resulted in an unprecedented volume of whole transcriptome sequencing (RNA-seq) data. While these data offer new opportunities to identify the mechanisms underlying disease, the comparison of data from different sources poses a great challenge, due to differences in sample and data processing. Here, we present a pipeline that processes and unifies RNA-seq data from different studies, which includes uniform realignment and gene expression quantification as well as batch effect removal. We find that uniform alignment and quantification is not sufficient when combining RNA-seq data from different sources and that the removal of other batch effects is essential to facilitate data comparison. We have processed data from the Genotype Tissue Expression project (GTEx) and The Cancer Genome Atlas (TCGA) and have successfully corrected for study-specific biases, enabling comparative analysis across studies. The normalized data are available for download via GitHub (at <jatsext-link xmlnsxlink=httpwww.w3.org1999xlink ext-link-type=uri xlinkhref=httpsgithub.commskccRNAseqDB>httpsgithub.commskccRNAseqDB<jatsext-link>).
biorxiv bioinformatics 0-100-users 2017MAGIC A diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data, bioRxiv, 2017-02-26
ABSTRACTSingle-cell RNA-sequencing is fast becoming a major technology that is revolutionizing biological discovery in fields such as development, immunology and cancer. The ability to simultaneously measure thousands of genes at single cell resolution allows, among other prospects, for the possibility of learning gene regulatory networks at large scales. However, scRNA-seq technologies suffer from many sources of significant technical noise, the most prominent of which is ‘dropout’ due to inefficient mRNA capture. This results in data that has a high degree of sparsity, with typically only ~10% non-zero values. To address this, we developed MAGIC (Markov Affinity-based Graph Imputation of Cells), a method for imputing missing values, and restoring the structure of the data. After MAGIC, we find that two- and three-dimensional gene interactions are restored and that MAGIC is able to impute complex and non-linear shapes of interactions. MAGIC also retains cluster structure, enhances cluster-specific gene interactions and restores trajectories, as demonstrated in mouse retinal bipolar cells, hematopoiesis, and our newly generated epithelial-to-mesenchymal transition dataset.
biorxiv bioinformatics 100-200-users 2017Granatum a graphical single-cell RNA-Seq analysis pipeline for genomics scientists, bioRxiv, 2017-02-23
AbstractBackgroundSingle-cell RNA sequencing (scRNA-Seq) is an increasingly popular platform to study heterogeneity at the single-cell level.Computational methods to process scRNA-Seq have limited accessibility to bench scientists as they require significant amounts of bioinformatics skills.ResultsWe have developed Granatum, a web-based scRNA-Seq analysis pipeline to make analysis more broadly accessible to researchers. Without a single line of programming code, users can click through the pipeline, setting parameters and visualizing results via the interactive graphical interface Granatum conveniently walks users through various steps of scRNA-Seq analysis. It has a comprehensive list of modules, including plate merging and batch-effect removal, outlier-sample removal, gene filtering, geneexpression normalization, cell clustering, differential gene expression analysis, pathwayontology enrichment analysis, protein-networ interaction visualization, and pseudo-time cell series construction.ConclusionsGranatum enables broad adoption of scRNA-Seq technology by empowering the bench scientists with an easy-to-use graphical interface for scRNA-Seq data analysis. The package is freely available for research use at <jatsext-link xmlnsxlink=httpwww.w3.org1999xlink ext-link-type=uri xlinkhref=httpgarmiregroup.orggranatumapp>httpgarmiregroup.orggranatumapp<jatsext-link>
biorxiv bioinformatics 0-100-users 2017W2RAP a pipeline for high quality, robust assemblies of large complex genomes from short read data, bioRxiv, 2017-02-23
AbstractProducing high-quality whole-genome shotgun de novo assemblies from plant and animal species with large and complex genomes using low-cost short read sequencing technologies remains a challenge. But when the right sequencing data, with appropriate quality control, is assembled using approaches focused on robustness of the process rather than maximization of a single metric such as the usual contiguity estimators, good quality assemblies with informative value for comparative analyses can be produced. Here we present a complete method described from data generation and qc all the way up to scaffold of complex genomes using Illumina short reads and its application to data from plants and human datasets. We show how to use the w2rap pipeline following a metric-guided approach to produce cost-effective assemblies. The assemblies are highly accurate, provide good coverage of the genome and show good short range contiguity. Our pipeline has already enabled the rapid, cost-effective generation of de novo genome assemblies from large, polyploid crop species with a focus on comparative genomics.Availabilityw2rap is available under MIT license, with some subcomponents under GPL-licenses. A ready-to-run docker with all software pre-requisites and example data is also available.<jatsext-link xmlnsxlink=httpwww.w3.org1999xlink ext-link-type=uri xlinkhref=httpgithub.combioinfologicsw2rap>httpgithub.combioinfologicsw2rap<jatsext-link><jatsext-link xmlnsxlink=httpwww.w3.org1999xlink ext-link-type=uri xlinkhref=httpgithub.combioinfologicsw2rap-contigger>httpgithub.combioinfologicsw2rap-contigger<jatsext-link>
biorxiv bioinformatics 0-100-users 2017mixOmics an R package for ‘omics feature selection and multiple data integration, bioRxiv, 2017-02-15
AbstractThe advent of high throughput technologies has led to a wealth of publicly available ‘omics data coming from different sources, such as transcriptomics, proteomics, metabolomics. Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have been focusing on identifying small subsets of molecules (a ‘molecular signature’) to explain or predict biological conditions, but mainly for a single type of ‘omics. In addition, commonly used methods are univariate and consider each biological feature independently.We introducemixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation. By adopting a system biology approach, the toolkit provides a wide range of methods that statistically integrate several data sets at once to probe relationships between heterogeneous ‘omics data sets. Our recent methods extend Projection to Latent Structure (PLS) models for discriminant analysis, for data integration across multiple ‘omics data or across independent studies, and for the identification of molecular signatures. We illustrate our latestmixOmicsintegrative frameworks for the multivariate analyses of ‘omics data available from the package.
biorxiv bioinformatics 100-200-users 2017A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases, bioRxiv, 2017-01-28
AbstractEmerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥ 5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and > 60, 000 genomes.
biorxiv bioinformatics 100-200-users 2017