URMAP, an ultra-fast read mapper, bioRxiv, 2020-01-14
Abstract: Mapping of reads to reference sequences is an essential step in a wide range of biological studies. The large size of datasets generated with next-generation sequencing technologies motivates the development of fast mapping software. Here, I describe URMAP, a new read mapping algorithm. URMAP is an order of magnitude faster than BWA and Bowtie2 with comparable accuracy on a benchmark test using simulated paired 150 nt reads of a well-studied human genome. Software is freely available at https://drive5.com/urmap.
biorxiv bioinformatics 100-200-users 2020
Using Natural Language Processing to Learn the Grammar of Glycans, bioRxiv, 2020-01-12
Abstract: While nucleic acids and proteins receive ample attention, progress on understanding the structural and functional roles of carbohydrates has lagged behind. Here, we develop a language model for glycans, SweetTalk, taking into account glycan connectivity and composition. We use this model to investigate motifs in glycan substructures, classify them according to their O-/N-linkage, and predict their immunogenicity with an accuracy of ∼92%, opening up the potential for rational glycoengineering.
biorxiv bioinformatics 0-100-users 2020
The predictive power of the microbiome exceeds that of genome-wide association studies in the discrimination of complex human disease, bioRxiv, 2020-01-02
Abstract: Over the past decade, studies of the human genome and microbiome have deepened our understanding of the connections between human genes, environments, microbes, and disease. For example, the sheer number of indicators of the microbiome and human genetic common variants associated with disease has been immense, but clinical utility has been elusive. Here, we compared the predictive capabilities of the human microbiome versus human genomic common variants across 13 common diseases. We concluded that microbiomic indicators outperform human genetics in predicting host phenotype (overall Microbiome-Association-Study [MAS] area under the curve [AUC] = 0.79 [SE = 0.03], overall Genome-Wide-Association-Study [GWAS] AUC = 0.67 [SE = 0.02]). Our results, while preliminary and focused on a subset of the totality of disease, demonstrate the relative predictive ability of the microbiome, indicating that it may outperform human genetics in discriminating human disease cases and controls. They additionally motivate the need for population-level microbiome sequencing resources, akin to the UK Biobank, to further improve and reproduce metagenomic models of disease.
biorxiv bioinformatics 200-500-users 2020
Data structures based on k-mers for querying large collections of sequencing datasets, bioRxiv, 2019-12-06
High-throughput sequencing datasets are usually deposited in public repositories, e.g. the European Nucleotide Archive, to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not support online sequence searches; yet such a feature would be highly useful to investigators. Towards this goal, in the last few years several computational approaches have been introduced to index and query large collections of datasets. Here we propose an accessible survey of these approaches, which are generally based on representing datasets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
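To make the "datasets as sets of k-mers" idea concrete, here is a minimal sketch. It is not any of the surveyed tools: it uses plain Python sets where real systems use compressed or probabilistic structures, and the function names, the toy reads, and the 0.8 containment threshold are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(datasets: Dict[str, Iterable[str]], k: int) -> Dict[str, Set[str]]:
    """Map each dataset name to the union of the k-mers of its reads."""
    index: Dict[str, Set[str]] = {}
    for name, reads in datasets.items():
        kmer_set: Set[str] = set()
        for read in reads:
            kmer_set |= kmers(read, k)
        index[name] = kmer_set
    return index


def query(index: Dict[str, Set[str]], seq: str, k: int,
          threshold: float = 0.8) -> List[str]:
    """Return names of datasets holding at least `threshold` of the query's k-mers."""
    q = kmers(seq, k)
    if not q:  # query shorter than k: nothing to look up
        return []
    hits = [name for name, kmer_set in index.items()
            if len(q & kmer_set) / len(q) >= threshold]
    return sorted(hits)


# Toy collection (hypothetical data, for illustration only).
datasets = {
    "sample_A": ["ACGTACGTGA", "TTGACCA"],
    "sample_B": ["GGGCCCAAAT"],
}
index = build_index(datasets, k=4)
print(query(index, "ACGTACG", k=4))
```

A query is reported as present in a dataset when enough of its k-mers occur there. Exact sets like these do not scale to petabytes, which is the motivation for the space-efficient structures such surveys compare.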
biorxiv bioinformatics 100-200-users 2019
Another look at microbe–metabolite interactions: how scale invariant correlations can outperform a neural network, bioRxiv, 2019-12-02
Abstract: Many scientists are now interested in studying the correlative relationships between microbes and metabolites. However, these kinds of analyses are complicated by the compositional (i.e., relative) nature of the data. Recently, Morton et al. proposed a neural network architecture called mmvec to predict metabolite abundances from microbe presence. They introduce this method as a scale invariant solution to the integration of multi-omics compositional data, and claim that “mmvec is the only method robust to scale deviations”. We do not doubt the utility of mmvec, but write in defense of simple linear statistics. In fact, when used correctly, correlation and proportionality can actually outperform the mmvec neural network.
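As a rough illustration of what "scale invariant" can mean for a simple statistic, the sketch below computes a proportionality coefficient on centered log-ratio (clr) transformed data and checks that rescaling each sample leaves it unchanged. The abstract does not say which formulation the authors use; the rho definition here (1 minus the variance of the clr difference over the sum of the clr variances) is one common choice and is an assumption, as are the function names and the toy counts.

```python
import math
from statistics import mean, variance


def clr(sample):
    """Centered log-ratio transform of one sample (all parts must be > 0)."""
    logs = [math.log(x) for x in sample]
    center = mean(logs)
    return [v - center for v in logs]


def rho(samples, i, j):
    """Proportionality between features i and j across samples (assumed definition)."""
    transformed = [clr(s) for s in samples]
    a = [t[i] for t in transformed]
    b = [t[j] for t in transformed]
    diff = [x - y for x, y in zip(a, b)]
    return 1.0 - variance(diff) / (variance(a) + variance(b))


# Toy data (hypothetical): rows are samples, columns are features.
# Feature 1 is always twice feature 0, so the two are exactly proportional.
counts = [
    [10, 20, 70],
    [30, 60, 10],
    [5, 10, 85],
    [40, 80, 30],
]

# Rescale every sample by a different factor, mimicking unknown sequencing depth.
factors = [1, 10, 0.5, 3]
rescaled = [[x * f for x in s] for s, f in zip(counts, factors)]

print(rho(counts, 0, 1), rho(rescaled, 0, 1))
print(rho(counts, 0, 2), rho(rescaled, 0, 2))
```

Multiplying a sample by a constant adds the same value to every log, and the clr centering subtracts it again, so both columns of output agree up to floating-point error.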
biorxiv bioinformatics 0-100-users 2019
SparK: A Publication-quality NGS Visualization Tool, bioRxiv, 2019-11-17
Abstract: While there are sophisticated resources available for displaying NGS data, including the Integrative Genomics Viewer (IGV) and the UCSC genome browser, exporting regions and assembling figures for publication remains challenging. In particular, customizing track appearance and overlaying track replicates is a manual and time-consuming process. Here, we present SparK, a tool which auto-generates publication-ready, high-resolution, true vector graphic figures from any NGS-based tracks, including RNA-seq, ChIP-seq, and ATAC-seq. Novel functions of SparK include averaging of replicates, plotting standard deviation tracks, and highlighting significantly changed areas. SparK is written in Python 3, making it executable on any major OS platform. Generating figures from command line prompts makes later changes very easy. For instance, if the genomic region of the plot needs to be changed, or tracks need to be added or removed, the figure can easily be re-generated within seconds without the manual process of re-exporting and re-assembling everything. After plotting with SparK, changes to the output SVG vector graphic files are simple to make, including text, lines, and colors. SparK is publicly available on GitHub at https://github.com/harbourlab/SparK.
biorxiv bioinformatics 100-200-users 2019