bioSyntax: Syntax Highlighting For Computational Biology, bioRxiv, 2017-12-21
Abstract: Computational biology requires the reading and comprehension of biological data files. Plain-text formats such as SAM, VCF, GTF, PDB and FASTA often contain critical information that is obfuscated by the complexity of the data structures. bioSyntax (http://bioSyntax.org) is a freely available suite of syntax highlighting packages for vim, gedit, Sublime, and less, which helps computational scientists parse and work with their data more efficiently.
biorxiv bioinformatics 0-100-users 2017
Comparative genomics of the major parasitic worms, bioRxiv, 2017-12-21
Abstract: Parasitic nematodes (roundworms) and platyhelminths (flatworms) cause debilitating chronic infections of humans and animals, decimate crop production and are a major impediment to socioeconomic development. Here we compare the genomes of 81 nematode and platyhelminth species, including those of 76 parasites. From 1.4 million genes, we identify gene family births and hundreds of large expanded gene families at key nodes in the phylogeny that are relevant to parasitism. Examples include gene families that modulate host immune responses, enable parasite migration through host tissues or allow the parasite to feed. We use a wide-ranging in silico screen to identify and prioritise new potential drug targets and compounds for testing. We also uncover lineage-specific differences in core metabolism and in protein families historically targeted for drug development. This is the broadest comparative study to date of the genomes of parasitic and non-parasitic worms. It provides a transformative new resource for the research community to understand and combat the diseases that parasitic worms cause.
biorxiv genomics 0-100-users 2017
Single-cell transcriptomic characterization of 20 organs and tissues from individual mice creates a Tabula Muris, bioRxiv, 2017-12-21
The Tabula Muris Consortium
We have created a compendium of single cell transcriptome data from the model organism Mus musculus comprising more than 100,000 cells from 20 organs and tissues. These data represent a new resource for cell biology, revealing gene expression in poorly characterized cell populations and allowing for direct and controlled comparison of gene expression in cell types shared between tissues, such as T-lymphocytes and endothelial cells from distinct anatomical locations. Two distinct technical approaches were used for most tissues: one approach, microfluidic droplet-based 3’-end counting, enabled the survey of thousands of cells at relatively low coverage, while the other, FACS-based full length transcript analysis, enabled characterization of cell types with high sensitivity and coverage. The cumulative data provide the foundation for an atlas of transcriptomic cell biology.
biorxiv cell-biology 200-500-users 2017
Candidate cancer driver mutations in superenhancers and long-range chromatin interaction networks, bioRxiv, 2017-12-20
Abstract: A comprehensive catalogue of the mutations that drive tumorigenesis and progression is essential to understanding tumor biology and developing therapies. Protein-coding driver mutations have been well-characterized by large exome-sequencing studies; however, many tumors have no mutations in protein-coding driver genes. Non-coding mutations are thought to explain many of these cases, but few non-coding drivers besides the TERT promoter are known. To fill this gap, we analyzed 150,000 cis-regulatory regions in 1,844 whole cancer genomes from the ICGC-TCGA PCAWG project. Using our new method, ActiveDriverWGS, we found 41 frequently mutated regulatory elements (FMREs) enriched in non-coding SNVs and indels (FDR<0.05) characterized by aging-associated mutation signatures and frequent structural variants. Most FMREs are distal from genes, reported here for the first time and also recovered by additional driver discovery methods. FMREs were enriched in super-enhancers, H3K27ac enhancer marks of primary tumors and long-range chromatin interactions, suggesting that the mutations drive cancer by distally controlling gene expression through three-dimensional genome organization. In support of this hypothesis, the chromatin interaction network of FMREs and target genes revealed associations of mutations and differential gene expression of known and novel cancer genes (e.g., CNNB1IP1, RCC1), activation of immune response pathways and altered enhancer marks. Thus distal genomic regions may include additional, infrequently mutated drivers that act on target genes via chromatin loops. Our study is an important step towards finding such regulatory regions and deciphering the somatic mutation landscape of the non-coding genome.
biorxiv cancer-biology 0-100-users 2017
Content-Aware Image Restoration: Pushing the Limits of Fluorescence Microscopy, bioRxiv, 2017-12-20
Fluorescence microscopy is a key driver of discoveries in the life-sciences, with observable phenomena being limited by the optics of the microscope, the chemistry of the fluorophores, and the maximum photon exposure tolerated by the sample. These limits necessitate trade-offs between imaging speed, spatial resolution, light exposure, and imaging depth. In this work we show how image restoration based on deep learning extends the range of biological phenomena observable by microscopy. On seven concrete examples we demonstrate how microscopy images can be restored even if 60-fold fewer photons are used during acquisition, how near isotropic resolution can be achieved with up to 10-fold under-sampling along the axial direction, and how tubular and granular structures smaller than the diffraction limit can be resolved at 20-times higher frame-rates compared to state-of-the-art methods. All developed image restoration methods are freely available as open source software in Python, FIJI, and KNIME.
biorxiv bioinformatics 500+-users 2017
Exploring Single-Cell Data with Deep Multitasking Neural Networks, bioRxiv, 2017-12-20
Abstract: Biomedical researchers are generating high-throughput, high-dimensional single-cell data at a staggering rate. As costs of data generation decrease, experimental design is moving towards measurement of many different single-cell samples in the same dataset. These samples can correspond to different patients, conditions, or treatments. While scalability of methods to datasets of these sizes is a challenge on its own, dealing with large-scale experimental design presents a whole new set of problems, including batch effects and sample comparison issues. Currently, there are no computational tools that can both handle large amounts of data in a scalable manner (many cells) and at the same time deal with many samples (many patients or conditions). Moreover, data analysis currently involves the use of different tools that each operate on their own data representation, not guaranteeing a synchronized analysis pipeline. For instance, data visualization methods can be disjoint and mismatched with the clustering method. For this purpose, we present SAUCIE, a deep neural network that leverages the high degree of parallelization and scalability offered by neural networks, as well as the deep representation of data that can be learned by them to perform many single-cell data analysis tasks, all on a unified representation.
A well-known limitation of neural networks is their limited interpretability. Our key contributions here are newly formulated regularizations (penalties) that render features learned in hidden layers of the neural network interpretable. When large multi-patient datasets are fed into SAUCIE, the various hidden layers contain denoised and batch-corrected data, a low dimensional visualization, unsupervised clustering, as well as other information that can be used to explore the data. We show this capability by analyzing a newly generated 180-sample dataset consisting of T cells from dengue patients in India, measured with mass cytometry.
We show that SAUCIE, for the first time, can batch correct and process this 11-million-cell dataset to identify cluster-based signatures of acute dengue infection and create a patient manifold, stratifying immune response to dengue on the basis of single-cell measurements.
biorxiv bioinformatics 0-100-users 2017