bioinformatics | audiences

Generalizing RNA velocity to transient cell states through dynamical modeling, bioRxiv, 2019-10-29

Abstract: The introduction of RNA velocity in single cells has opened up new ways of studying cellular differentiation. The originally proposed framework obtains velocities as the deviation of the observed ratio of spliced and unspliced mRNA from an inferred steady state. Errors in velocity estimates arise if the central assumptions of a common splicing rate and the observation of the full splicing dynamics with steady-state mRNA levels are violated. With scVelo (https://scvelo.org), we address these restrictions by solving the full transcriptional dynamics of splicing kinetics using a likelihood-based dynamical model. This generalizes RNA velocity to a wide variety of systems comprising transient cell states, which are common in development and in response to perturbations. We infer gene-specific rates of transcription, splicing and degradation, and recover the latent time of the underlying cellular processes. This latent time represents the cell’s internal clock and is based only on its transcriptional dynamics. Moreover, scVelo allows us to identify regimes of regulatory changes such as stages of cell fate commitment and, therein, systematically detects putative driver genes. We demonstrate that scVelo enables disentangling heterogeneous subpopulation kinetics with unprecedented resolution in hippocampal dentate gyrus neurogenesis and pancreatic endocrinogenesis. We anticipate that scVelo will greatly facilitate the study of lineage decisions, gene regulation, and pathway activity identification.
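The steady-state idea described in the abstract can be illustrated with a minimal sketch for a single gene. It is not scVelo's implementation and not the dynamical model: it assumes the steady-state ratio is fitted as a least-squares line through the origin over all supplied cells, and the function names `fit_steady_state_ratio` and `velocity` are hypothetical.

```python
def fit_steady_state_ratio(spliced, unspliced):
    """Least-squares slope through the origin for unspliced ~ gamma * spliced.

    Assumption: every supplied cell is treated as being at steady state.
    """
    denom = sum(s * s for s in spliced)
    if denom == 0:
        raise ValueError("spliced counts are all zero; ratio is undefined")
    return sum(s * u for s, u in zip(spliced, unspliced)) / denom


def velocity(spliced, unspliced, gamma):
    """Deviation of one cell's unspliced count from the steady-state line."""
    return unspliced - gamma * spliced


# Toy values for one gene (made up for illustration).
gamma = fit_steady_state_ratio([1.0, 2.0, 4.0], [0.5, 1.0, 2.0])
induced = velocity(2.0, 3.0, gamma)    # more unspliced than expected: positive
repressed = velocity(4.0, 1.0, gamma)  # less unspliced than expected: negative
```

A positive value suggests the gene is being up-regulated in that cell and a negative value suggests down-regulation. The sketch inherits exactly the steady-state assumption that the abstract identifies as a source of error.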

biorxiv bioinformatics 200-500-users 2019

Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis, bioRxiv, 2019-10-26

Abstract: Here we use deep transfer learning to quantify histopathological patterns across 17,396 H&E-stained histopathology image slides from 28 cancer types and correlate these with underlying genomic and transcriptomic data. Pan-cancer computational histopathology (PC-CHiP) classifies the tissue origin across organ sites and provides highly accurate, spatially resolved tumor and normal distinction within a given slide. The learned computational histopathological features correlate with a large range of recurrent genetic aberrations, including whole genome duplications (WGDs), arm-level copy number gains and losses, focal amplifications and deletions as well as driver gene mutations within a range of cancer types. WGDs can be predicted in 25/27 cancer types (mean AUC=0.79) including those that were not part of model training. Similarly, we observe associations with 25% of mRNA transcript levels, which enables learning and localising histopathological patterns of molecularly defined cell types on each slide. Lastly, we find that computational histopathology provides prognostic information augmenting histopathological subtyping and grading in the majority of cancers assessed, which pinpoints prognostically relevant areas such as necrosis or infiltrating lymphocytes on each tumour section. Taken together, these findings highlight the large potential of PC-CHiP to discover new molecular and prognostic associations, which can augment diagnostic workflows and lay out a rationale for integrating molecular and histopathological data.

Key points:
- Pan-cancer computational histopathology analysis with deep learning extracts histopathological patterns and accurately discriminates 28 cancer and 14 normal tissue types
- Computational histopathology predicts whole genome duplications, focal amplifications and deletions, as well as driver gene mutations
- Widespread correlations with gene expression indicative of immune infiltration and proliferation
- Prognostic information augments conventional grading and histopathology subtyping in the majority of cancers

biorxiv bioinformatics 500+-users 2019

Molecular Cross-Validation for Single-Cell RNA-seq, bioRxiv, 2019-10-01

Single-cell RNA sequencing enables researchers to study the gene expression of individual cells. However, in high-throughput methods the portrait of each individual cell is noisy, representing thousands of the hundreds of thousands of mRNA molecules originally present. While many methods for denoising single-cell data have been proposed, a principled procedure for selecting and calibrating the best method for a given dataset has been lacking. We present “molecular cross-validation,” a statistically principled and data-driven approach for estimating the accuracy of any denoising method without the need for ground truth. We validate this approach for three denoising methods (principal component analysis, network diffusion, and a deep autoencoder) on a dataset of deeply sequenced neurons. We show that molecular cross-validation correctly selects the optimal parameters for each method and identifies the best method for the dataset.
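A rough sketch of how such a cross-validation could be set up: the observed molecules of each cell are split at random into a training and a validation part, a candidate denoiser is applied to the training part, and its output is scored against the held-out part. Everything specific here is an assumption made for illustration, not the authors' procedure: the 50/50 split, the squared-error loss, the toy denoiser `shrink_to_gene_mean` (standing in for PCA, diffusion, or an autoencoder), and the made-up count matrix.

```python
import random


def split_molecules(counts, rng, p=0.5):
    """Assign each observed molecule independently to train (prob. p) or validation."""
    train, val = [], []
    for row in counts:
        kept = [sum(rng.random() < p for _ in range(c)) for c in row]
        train.append(kept)
        val.append([c - k for c, k in zip(row, kept)])
    return train, val


def shrink_to_gene_mean(matrix, weight):
    """Toy denoiser: blend every entry with its gene's mean across cells."""
    n_cells = len(matrix)
    means = [sum(col) / n_cells for col in zip(*matrix)]
    return [[(1 - weight) * x + weight * m for x, m in zip(row, means)]
            for row in matrix]


def mcv_loss(counts, denoise, rng):
    """Mean squared error of the denoised training half against the held-out half.

    With p = 0.5 both halves have the same expected depth, so no rescaling is applied.
    """
    train, val = split_molecules(counts, rng)
    pred = denoise(train)
    n = len(counts) * len(counts[0])
    return sum((a - b) ** 2
               for pr, vr in zip(pred, val)
               for a, b in zip(pr, vr)) / n


# Made-up cells x genes count matrix.
counts = [[10, 0, 3], [12, 1, 2], [9, 0, 4], [11, 2, 3]]
weights = [0.0, 0.25, 0.5, 0.75, 1.0]
# The same seed gives every candidate the same molecule split.
losses = {w: mcv_loss(counts, lambda m, w=w: shrink_to_gene_mean(m, w), random.Random(0))
          for w in weights}
best_weight = min(losses, key=losses.get)
```

Because the two halves are independent samples of the same underlying molecules, a denoiser that overfits sampling noise in the training half should score worse on the validation half, which is what makes the loss usable for parameter selection without ground truth.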

biorxiv bioinformatics 200-500-users 2019

GeneRax: A tool for species tree-aware maximum likelihood based gene tree inference under gene duplication, transfer, and loss, bioRxiv, 2019-09-27

Abstract: Inferring gene trees is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species tree-aware methods seek to use information from a putative species tree. However, there are few methods available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data pre-processing (e.g., computing bootstrap trees), and rely on approximations and heuristics that limit the exploration of tree space. Here we present GeneRax, the first maximum likelihood species tree-aware gene tree inference software. It simultaneously accounts for substitutions at the sequence level and gene level events, such as duplication, transfer and loss, and uses established maximum likelihood optimization algorithms. GeneRax can infer rooted gene trees for an arbitrary number of gene families, directly from the per-gene sequence alignments and a rooted, but undated, species tree. We show that compared to competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson-Foulds distance. On empirical datasets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score under our model. GeneRax completed tree inferences and reconciliations for 1099 Cyanobacteria families in eight minutes on 512 CPU cores. Thus, its advanced parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax.
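For readers unfamiliar with the evaluation metric, here is a minimal sketch of a relative Robinson-Foulds distance. It uses a rooted, clade-based variant on trees written as nested tuples of leaf names, which is an assumption made for brevity; the abstract does not say which variant or normalization the authors used, and the helper names are hypothetical.

```python
def clades(tree):
    """Non-trivial clades (frozensets of leaf names) of a nested-tuple rooted tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    root = leaves(tree)
    found.discard(root)  # the clade containing every leaf carries no information
    return found


def relative_rf(tree_a, tree_b):
    """Clades present in exactly one tree, divided by the total number of clades."""
    a, b = clades(tree_a), clades(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0


true_tree = ((("A", "B"), "C"), "D")
inferred = ((("A", "C"), "B"), "D")
distance = relative_rf(true_tree, inferred)
```

The two trees share the clade {A, B, C} but disagree on which pair groups first, so two of the four clades differ and the distance is 0.5. Identical trees score 0.0.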

biorxiv bioinformatics 0-100-users 2019

Tximeta: reference sequence checksums for provenance identification in RNA-seq, bioRxiv, 2019-09-26

Abstract: Correct annotation metadata is critical for reproducible and accurate RNA-seq analysis. When files are shared publicly or among collaborators with incorrect or missing annotation metadata, it becomes difficult or impossible to reproduce bioinformatic analyses from raw data. It also makes it more difficult to locate the transcriptomic features, such as transcripts or genes, in their proper genomic context, which is necessary for overlapping expression data with other datasets. We provide a solution in the form of an R/Bioconductor package tximeta that performs numerous annotation and metadata gathering tasks automatically on behalf of users during the import of transcript quantification files. The correct reference transcriptome is identified via a hashed checksum stored in the quantification output, and key transcript databases are downloaded and cached locally. The computational paradigm of automatically adding annotation metadata based on reference sequence checksums can greatly facilitate genomic workflows, by helping to reduce overhead during bioinformatic analyses, preventing costly bioinformatic mistakes, and promoting computational reproducibility. The tximeta package is available at https://bioconductor.org/packages/tximeta.
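The checksum idea can be sketched in a few lines. This is an illustration of the paradigm only, written in Python rather than R: it is not tximeta's code, the digest below is not claimed to match the one any quantification tool writes, and the lookup table with its `ExampleDB` entry is entirely hypothetical.

```python
import hashlib


def reference_checksum(fasta_text):
    """SHA-256 over the sequence lines of a FASTA string.

    Header lines are ignored and bases are upper-cased, so renamed transcripts
    or different line wrapping produce the same digest. Simplification: sequences
    are concatenated without a separator.
    """
    digest = hashlib.sha256()
    for line in fasta_text.splitlines():
        line = line.strip()
        if line and not line.startswith(">"):
            digest.update(line.upper().encode("ascii"))
    return digest.hexdigest()


def identify_reference(checksum, known_references):
    """Return the metadata registered for a checksum, or None if it is unknown."""
    return known_references.get(checksum)


original = ">tx1 gene=A\nACGT\nACGT\n>tx2\nGGCC\n"
shared_copy = ">ENST0001\nACGTACGT\n>ENST0002\nggcc\n"

# Hypothetical registry mapping digests to annotation metadata.
known = {reference_checksum(original): {"source": "ExampleDB", "release": "v1"}}
match = identify_reference(reference_checksum(shared_copy), known)
```

Since the digest depends only on the sequences, a collaborator's copy with different headers still resolves to the same registry entry, while any change to the sequences themselves yields an unknown checksum.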

biorxiv bioinformatics 0-100-users 2019

centroFlye: Assembling Centromeres with Long Error-Prone Reads, bioRxiv, 2019-09-17

Abstract: Although variations in centromeres have been linked to cancer and infertility, centromeres still represent the “dark matter of the human genome” and remain an enigma for both biomedical and evolutionary studies. Since centromeres have withstood all previous attempts to develop an automated tool for their assembly and since their assembly using short reads is viewed as intractable, recent efforts attempted to manually assemble centromeres using long error-prone reads. We describe the centroFlye algorithm for centromere assembly using long error-prone reads, apply it to assemble the human X centromere, and use the constructed assembly to gain insights into centromere evolution. Our analysis reveals putative breakpoints in the previous manual reconstruction of the human X centromere and opens the possibility of automatically closing the remaining multi-megabase gaps in the reference human genome.

biorxiv bioinformatics 100-200-users 2019