scRNA-seq mixology: towards better benchmarking of single cell RNA-seq analysis methods, bioRxiv, 2018-10-04
Abstract: Single cell RNA sequencing (scRNA-seq) technology has undergone rapid development in recent years, bringing with it new challenges in data processing and analysis. This has led to an explosion of tailored analysis methods for scRNA-seq data to address various biological questions. However, the current lack of gold-standard benchmark datasets makes it difficult for researchers to systematically evaluate the performance of the many methods available. Here, we designed and carried out a realistic benchmark experiment that included mixtures of single cells or ‘pseudo cells’ created by sampling admixtures of cells or RNA from up to 5 distinct cancer cell lines. Altogether we generated 14 datasets using droplet and plate-based scRNA-seq protocols and compared multiple data analysis methods in combination for tasks ranging from normalization and imputation to clustering, trajectory analysis and data integration. Evaluation across 3,913 analyses (methods × benchmark dataset combinations) revealed pipelines suited to different types of data for different tasks. Our dataset and analysis present a comprehensive comparison framework for benchmarking the most common scRNA-seq analysis tasks.
biorxiv bioinformatics 100-200-users 2018
Structural variants identified by Oxford Nanopore PromethION sequencing of the human genome, bioRxiv, 2018-10-04
Abstract: We sequenced the Yoruban NA19240 genome on the long read sequencing platform Oxford Nanopore PromethION for benchmarking and evaluation of recently published aligners and structural variant calling tools. In this work, we determine precision and recall, present high-confidence and high-sensitivity call sets of variants, and discuss optimal parameters. The aligner Minimap2 and structural variant caller Sniffles are both the most accurate and the most computationally efficient tools in our study. We describe our scalable workflow for identification, annotation, and characterization of tens of thousands of structural variants from long read genome sequencing of an individual or population. By discussing the results for this genome, we provide an approximation of what can be expected in future long read sequencing studies aiming for structural variant identification.
biorxiv bioinformatics 0-100-users 2018
NanoJ: a high-performance open-source super-resolution microscopy toolbox, bioRxiv, 2018-10-02
Super-resolution microscopy has become essential for the study of nanoscale biological processes. This type of imaging often requires the use of specialised image analysis tools to process a large volume of recorded data and extract quantitative information. In recent years, our team has built an open-source image analysis framework for super-resolution microscopy designed to combine high performance and ease of use. We named it NanoJ, a reference to the popular ImageJ software it was developed for. In this paper, we highlight the current capabilities of NanoJ for several essential processing steps: spatio-temporal alignment of raw data (NanoJ-Core), super-resolution image reconstruction (NanoJ-SRRF), image quality assessment (NanoJ-SQUIRREL), structural modelling (NanoJ-VirusMapper) and control of the sample environment (NanoJ-Fluidics). We expect to expand NanoJ in the future through the development of new tools designed to improve quantitative data analysis and measure the reliability of fluorescent microscopy studies.
biorxiv bioinformatics 100-200-users 2018
An introduction to MPEG-G, the new ISO standard for genomic information representation, bioRxiv, 2018-09-27
Abstract: The MPEG-G standardization initiative is a coordinated international effort to specify a compressed data format that enables large scale genomic data to be processed, transported and shared. The standard consists of a set of specifications (i.e., a book) describing i) a normative format syntax, and ii) a normative decoding process to retrieve the information coded in a compliant file or bitstream. This decoding process enables the use of leading-edge compression technologies that have exhibited significant compression gains over currently used formats for storage of unaligned and aligned sequencing reads. Additionally, the standard provides a wealth of much needed functionality, such as selective access, data aggregation, application programming interfaces to the compressed data, standard interfaces to support data protection mechanisms, support for streaming and a procedure to assess the conformance of implementations. ISO/IEC is engaged in supporting the maintenance and availability of the standard specification, which guarantees the perenniality of applications using MPEG-G. Finally, the standard ensures interoperability and integration with existing genomic information processing pipelines by providing support for conversion from the FASTQ/SAM/BAM file formats. In this paper we provide an overview of the MPEG-G specification, with particular focus on the main advantages and novel functionality it offers. As the standard only specifies the decoding process, encoding performance, both in terms of speed and compression ratio, can vary depending on specific encoder implementations, and will likely improve during the lifetime of MPEG-G. Hence, the performance statistics provided here are only indicative baseline examples of the technologies included in the standard.
biorxiv bioinformatics 100-200-users 2018
Parliament2: Fast Structural Variant Calling Using Optimized Combinations of Callers, bioRxiv, 2018-09-23
Abstract: Here we present Parliament2, a structural variant caller which combines multiple best-in-class structural variant callers to create a highly accurate callset. This captures more events than the individual callers achieve independently. Parliament2 uses a call-overlap-genotype approach that is highly extensible to new methods and gives users the choice to run some or all of Breakdancer, Breakseq, CNVnator, Delly, Lumpy, and Manta. Parliament2 applies an additional parallelization framework to speed up certain callers and executes them in parallel, taking advantage of their different resource requirements to complete structural variant calling much faster than running the programs individually. Parliament2 is available as a Docker container, which pre-installs all required dependencies. This allows users to run any caller with easy installation and execution. This Docker container can easily be deployed in cloud or local environments and is available as an app on DNAnexus.
biorxiv bioinformatics 0-100-users 2018
Dissecting heterogeneous cell-populations across signaling and disease conditions with PopAlign, bioRxiv, 2018-09-21
Abstract: Single-cell measurement techniques can now probe gene expression in heterogeneous cell populations from the human body across a range of environmental and physiological conditions. However, new mathematical and computational methods are required to represent and analyze gene expression changes that occur in complex mixtures of single cells as they respond to signals, drugs, or disease states. Here, we introduce a mathematical modeling platform, PopAlign, that automatically identifies subpopulations of cells within a heterogeneous mixture, and tracks gene expression and cell abundance changes across subpopulations by constructing and comparing probabilistic models. We apply PopAlign to discover specific categories of signaling responses within primary human immune cells as well as patient-specific disease signatures in multiple myeloma that are obscured by techniques like tSNE. PopAlign scales to comparisons involving tens to hundreds of samples, enabling large-scale studies of natural and engineered cell populations as they respond to drugs, signals or physiological change.
biorxiv bioinformatics 0-100-users 2018