Millefy: visualizing cell-to-cell heterogeneity in read coverage of single-cell RNA sequencing datasets, bioRxiv, 2019-02-02
Background: Read coverage of RNA sequencing data reflects gene expression and RNA processing events. Single-cell RNA sequencing (scRNA-seq) methods, particularly full-length ones, provide read coverage of many individual cells and have the potential to reveal cellular heterogeneity in RNA transcription and processing. However, visualization tools suited to highlighting cell-to-cell heterogeneity in read coverage are still lacking. Results: Here, we have developed Millefy, a tool for visualizing read coverage of scRNA-seq data in genomic contexts. Millefy is designed to show read coverage of all individual cells at once in genomic contexts and to highlight cell-to-cell heterogeneity in read coverage. By visualizing read coverage of all cells as a heat map and dynamically reordering cells based on diffusion maps, Millefy facilitates discovery of local region-specific, cell-to-cell heterogeneity in read coverage, including variability of transcribed regions. Conclusions: Millefy simplifies the examination of cellular heterogeneity in RNA transcription and processing events using scRNA-seq data. Millefy is available as an R package (https://github.com/yuifu/millefy) and as a Docker image for running Millefy in a Jupyter notebook (https://hub.docker.com/r/yuifu/datascience-notebook-millefy).
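The reordering idea in this abstract can be illustrated with a small sketch. This is not Millefy's code (Millefy is an R package); it is a minimal pure-Python diffusion map that orders cells by their first non-trivial diffusion coordinate. The function name `diffusion_order`, the toy coverage matrix, and the bandwidth choice (mean squared distance between cells) are all assumptions made for illustration.

```python
import math

def diffusion_order(coverage, n_iter=200):
    """Return cell indices sorted by the first non-trivial diffusion coordinate.

    coverage: list of per-cell coverage vectors (one row per cell, one column per bin).
    """
    n = len(coverage)
    # Squared Euclidean distance between every pair of cells.
    d2 = [[sum((a - b) ** 2 for a, b in zip(coverage[i], coverage[j]))
           for j in range(n)] for i in range(n)]
    # Kernel bandwidth: mean squared distance over distinct pairs (an assumption).
    eps = sum(d2[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    eps = eps or 1.0  # guard against all cells being identical
    # Gaussian kernel and its row sums (degrees).
    k = [[math.exp(-d2[i][j] / eps) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in k]
    # Symmetric normalisation S = D^-1/2 K D^-1/2; same eigenvalues as the Markov matrix D^-1 K.
    s = [[k[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Trivial top eigenvector of S (eigenvalue 1), unit length.
    total = sum(deg)
    v0 = [math.sqrt(d / total) for d in deg]
    # Power iteration with deflation of v0 converges to the second eigenvector.
    x = [float(i) for i in range(n)]
    for _ in range(n_iter):
        x = [sum(s[i][j] * x[j] for j in range(n)) for i in range(n)]
        proj = sum(a * b for a, b in zip(x, v0))
        x = [a - proj * b for a, b in zip(x, v0)]
        norm = math.sqrt(sum(a * a for a in x))
        if norm == 0:
            break
        x = [a / norm for a in x]
    # Convert back to an eigenvector of the Markov matrix and sort cells by it.
    psi = [x[i] / math.sqrt(deg[i]) for i in range(n)]
    return sorted(range(n), key=lambda i: psi[i])

# Toy data: six cells, four bins; cells 0, 2, 4 cover the left bins, cells 1, 3, 5 the right.
cells = [[10, 9, 0, 0], [0, 1, 10, 10], [9, 10, 0, 1],
         [1, 0, 9, 10], [10, 10, 1, 0], [0, 0, 10, 9]]
print(diffusion_order(cells))
```

Rows of a coverage heat map drawn in the returned order place cells with similar local coverage next to each other. The sign of the diffusion coordinate is arbitrary, so either group may come first.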
biorxiv bioinformatics 0-100-users 2019
Distinct characteristics of genes associated with phenome-wide variation in maize (Zea mays), bioRxiv, 2019-01-30
Naturally occurring functional genetic variation is often employed to identify genetic loci that regulate specific traits. Existing approaches to link functional genetic variation to quantitative phenotypic outcomes typically evaluate one or several traits at a time. Advances in high-throughput phenotyping now enable datasets that include information on dozens or hundreds of traits scored across multiple environments. Here, we develop an approach to use data from many phenotypic traits simultaneously to identify causal genetic loci. Using data for 260 traits scored across a maize diversity panel, we demonstrate that a distinct set of genes is identified relative to conventional genome-wide association. The genes identified using this many-trait approach are more likely to be independently validated than the genes identified by conventional analysis of the same dataset. Genes identified by the new many-trait approach share a number of molecular, population genetic, and evolutionary features with a gold standard set of genes characterized through forward genetics. These features, as well as substantially stronger functional enrichment and purification, separate them from both genes identified by conventional genome-wide association and from the overall population of annotated gene models. These results are consistent with a large subset of annotated gene models in maize playing little or no role in determining organismal phenotypes.
biorxiv bioinformatics 0-100-users 2019
Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes, bioRxiv, 2019-01-27
Illumina sequencing allows rapid, cheap and accurate whole genome bacterial analyses, but short reads (<300 bp) do not usually enable complete genome assembly. Long read sequencing greatly assists with resolving complex bacterial genomes, particularly when combined with short-read Illumina data (hybrid assembly); however, it is not clear how different long-read sequencing methods impact on assembly accuracy. Relative automation of the assembly process is also crucial to facilitating high-throughput complete bacterial genome reconstruction, avoiding multiple bespoke filtering and data manipulation steps. In this study, we compared hybrid assemblies for 20 bacterial isolates, including two reference strains, using Illumina sequencing and long reads from either Oxford Nanopore Technologies (ONT) or from SMRT Pacific Biosciences (PacBio) sequencing platforms. We chose isolates from the Enterobacteriaceae family, as these frequently have highly plastic, repetitive genetic structures and complete genome reconstruction for these species is relevant for a precise understanding of the epidemiology of antimicrobial resistance. We de novo assembled genomes using the hybrid assembler Unicycler and compared different read processing strategies. Both strategies facilitate high-quality genome reconstruction. Combining ONT and Illumina reads fully resolved most genomes without additional manual steps, and at a lower cost per isolate in our setting. Automated hybrid assembly is a powerful tool for complete and accurate bacterial genome assembly.
biorxiv bioinformatics 100-200-users 2019
Fast and accurate long-read assembly with wtdbg2, bioRxiv, 2019-01-27
Existing long-read assemblers require tens of thousands of CPU hours to assemble a human genome and are being outpaced by sequencing technologies in terms of both throughput and cost. We developed a novel long-read assembler wtdbg2 that, for human data, is tens of times faster than published tools while achieving comparable contiguity and accuracy. It represents a significant algorithmic advance and paves the way for population-scale long-read assembly in future.
biorxiv bioinformatics 200-500-users 2019
phyloFlash — Rapid SSU rRNA profiling and targeted assembly from metagenomes Supplementary Information, bioRxiv, 2019-01-17
The SSU rRNA gene is the key marker in molecular ecology for all domains of life, but is largely absent from metagenome-assembled genomes that often are the only resource available for environmental microbes. Here we present phyloFlash, a pipeline to overcome this gap with rapid, SSU rRNA-centered taxonomic classification, targeted assembly, and graph-based binning of full metagenomic assemblies. We show that a cleanup of artifacts is pivotal even with a curated reference database. With such a filtered database, the general-purpose mapper BBmap extracts SSU rRNA reads five times faster than the rRNA-specialized tool SortMeRNA with similar sensitivity and higher selectivity on simulated metagenomes. Reference-based targeted assemblers yielded either highly fragmented assemblies or high levels of chimerism, so we employ the general-purpose genomic assembler SPAdes. Our optimized implementation is independent of reference database composition and has satisfactory levels of chimera formation. Using the phyloFlash workflow we could recover the first complete genomes of several enigmatic taxa, including Marinamargulisbacteria from surface ocean seawater. phyloFlash quickly processes Illumina (meta)genomic data, is straightforward to use, even as part of high-throughput quality control, and has user-friendly output reports. The software is available at https://github.com/HRGV/phyloFlash (GPL3 license) and is documented with an online manual.
biorxiv bioinformatics 0-100-users 2019
BEHST: genomic set enrichment analysis enhanced through integration of chromatin long-range interactions, bioRxiv, 2019-01-16
Transforming data from genome-scale assays into knowledge of affected molecular functions and pathways is a key challenge in biomedical research. Using vocabularies of functional terms and databases annotating genes with these terms, pathway enrichment methods can identify terms enriched in a gene list. With data that can refer to intergenic regions, however, one must first connect the regions to the terms, which are usually annotated only to genes. To make these connections, existing pathway enrichment approaches apply unwarranted assumptions such as annotating non-coding regions with the terms from adjacent genes. We developed a computational method that instead links genomic regions to annotations using data on long-range chromatin interactions. Our method, Biological Enrichment of Hidden Sequence Targets (BEHST), finds Gene Ontology (GO) terms enriched in genomic regions more precisely and accurately than existing methods. We demonstrate BEHST's ability to retrieve more pertinent and less ambiguous GO terms associated with results of in vivo mouse enhancer screens or enhancer RNA assays for multiple tissue types. BEHST will accelerate the discovery of affected pathways mediated through long-range interactions that explain non-coding hits in genome-wide association study (GWAS) or genome editing screens. BEHST is free software with a command-line interface for Linux or macOS and a web interface (http://behst.hoffmanlab.org).
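The contrast between adjacent-gene annotation and interaction-based linking can be sketched as follows. This is an illustrative toy, not BEHST's implementation: the function names, the half-open `(chrom, start, end)` interval convention, and the gene names and coordinates are all hypothetical. A query region is linked to a gene only if the region overlaps one anchor of a chromatin interaction and the gene's transcription start site (TSS) falls in the partner anchor.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def link_regions_to_genes(regions, interactions, tss):
    """Map each query region to genes reached through a chromatin interaction.

    regions:      list of (chrom, start, end) query intervals
    interactions: list of (anchor_a, anchor_b) pairs, each anchor an interval
    tss:          dict of gene name -> (chrom, position)
    """
    linked = {}
    for region in regions:
        genes = set()
        for anchor_a, anchor_b in interactions:
            # An interaction is undirected, so try both orientations.
            for near, far in ((anchor_a, anchor_b), (anchor_b, anchor_a)):
                if overlaps(region, near):
                    genes.update(
                        gene for gene, (chrom, pos) in tss.items()
                        if overlaps((chrom, pos, pos + 1), far)
                    )
        linked[region] = sorted(genes)
    return linked

# Hypothetical toy data.
regions = [("chr1", 100, 200), ("chr2", 50, 60)]
interactions = [(("chr1", 150, 250), ("chr1", 10000, 10500))]
tss = {"GENE_A": ("chr1", 10200), "GENE_B": ("chr1", 160)}
print(link_regions_to_genes(regions, interactions, tss))
```

In this toy case the first region is linked to the distal GENE_A rather than to GENE_B, whose TSS lies inside the region itself, and the second region is linked to nothing. The resulting gene lists could then be passed to any standard GO enrichment test.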
biorxiv bioinformatics 100-200-users 2019