Algorithms for efficiently collapsing reads with Unique Molecular Identifiers, bioRxiv, 2019-05-25
Abstract: Background. Unique Molecular Identifiers (UMIs) are used in many experiments to find and remove PCR duplicates. Although many tools deduplicate reads by finding reads with the same alignment coordinates and UMIs, many of them either cannot handle substitution errors or require expensive pairwise UMI comparisons that do not scale efficiently to larger datasets. Results. We formulate the problem of deduplicating UMIs in a manner that enables optimizations to be made and more efficient data structures to be used. We implement our data structures and optimizations in a tool called UMICollapse, which is able to deduplicate over one million unique UMIs of length 9 at a single alignment position in around 26 seconds. Conclusions. We present a new formulation of the UMI deduplication problem and show that it can be solved faster with more sophisticated data structures.
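To make the problem concrete, the following is a minimal sketch of the kind of naive pairwise baseline the abstract contrasts itself with. It is not UMICollapse's method. The function names (`hamming`, `collapse_umis`), the `(position, umi)` input format, and the greedy "most frequent UMI first" merging rule are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(reads, max_dist=1):
    """Greedy pairwise UMI collapsing (illustrative baseline only).

    reads: iterable of (alignment_position, umi) tuples (assumed format).
    Returns {position: {representative_umi: total_read_count}}.
    """
    # Group reads by alignment position and count each distinct UMI.
    by_pos = defaultdict(Counter)
    for pos, umi in reads:
        by_pos[pos][umi] += 1

    result = {}
    for pos, counts in by_pos.items():
        reps = {}
        # Visit UMIs from most to least frequent; ties broken alphabetically.
        for umi, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            for rep in reps:
                # Merge into the first representative within max_dist substitutions.
                if hamming(umi, rep) <= max_dist:
                    reps[rep] += n
                    break
            else:
                # No close representative found: this UMI starts a new group.
                reps[umi] = n
        result[pos] = reps
    return result


reads = (
    [(100, "AAAA")] * 3
    + [(100, "AAAT")]      # one substitution away from AAAA
    + [(100, "CCCC")] * 2
    + [(200, "AAAA")]      # same UMI, different position: kept separate
)
collapsed = collapse_umis(reads)
```

Each UMI is compared against every existing representative, so the work at one position grows roughly quadratically with the number of distinct UMIs. That cost is what the abstract describes as not scaling to larger datasets.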
biorxiv bioinformatics 100-200-users 2019
Consistent metagenome-derived metrics verify and define bacterial species boundaries, bioRxiv, 2019-05-25
Abstract: Longstanding questions relate to the existence of naturally distinct bacterial species and genetic approaches to distinguish them. Bacterial genomes in public databases form distinct groups, but these databases are subject to isolation and deposition biases. We compared 5,203 bacterial genomes from 1,457 environmental metagenomic samples to test for distinct clouds of diversity, and evaluated metrics that could be used to define the species boundary. Bacterial genomes from the human gut, soil, and the ocean all exhibited gaps in whole-genome average nucleotide identities (ANI) near the previously suggested species threshold of 95% ANI. While genome-wide ratios of non-synonymous and synonymous nucleotide differences (dN/dS) decrease until ANI values approach ∼98%, estimates for homologous recombination approached zero at ∼95% ANI, supporting breakdown of recombination due to sequence divergence as a species-forming force. We evaluated 107 genome-based metrics for their ability to distinguish species when full genomes are not recovered. Full-length 16S rRNA genes were least useful because they were under-recovered from metagenomes, but many ribosomal proteins displayed both high metagenomic recoverability and species-discrimination power. Taken together, our results verify the existence of sequence-discrete microbial species in metagenome-derived genomes and highlight the usefulness of ribosomal genes for gene-level species discrimination.
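As a rough illustration of the 95% ANI threshold mentioned above, the sketch below averages percent identity over pre-aligned fragment pairs. It is a simplification and not the authors' pipeline: real ANI estimates come from dedicated genome-alignment tools. The function names and the assumption of equal-length, already-aligned fragments (with `-` marking gaps) are hypothetical.

```python
def percent_identity(a, b):
    """Percent of matching bases between two pre-aligned, equal-length fragments.

    Columns containing a gap character ('-') in either fragment are skipped.
    """
    if len(a) != len(b):
        raise ValueError("fragments must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no comparable columns")
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)


def average_nucleotide_identity(fragment_pairs):
    """Mean percent identity over all aligned fragment pairs (simplified ANI)."""
    identities = [percent_identity(a, b) for a, b in fragment_pairs]
    return sum(identities) / len(identities)


def same_species(ani, threshold=95.0):
    """Apply the 95% ANI species threshold quoted in the abstract."""
    return ani >= threshold


pairs = [
    ("ACGTACGTAC", "ACGTACGTAC"),  # identical fragment
    ("ACGTACGTAC", "ACGTACGTAA"),  # one mismatch in ten bases
]
ani = average_nucleotide_identity(pairs)
```

In this toy input the two fragments have identities of 100% and 90%, so the pair of genomes sits exactly at the threshold.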
biorxiv microbiology 100-200-users 2019
EpiScanpy: integrated single-cell epigenomic analysis, bioRxiv, 2019-05-25
Abstract: Epigenetic single-cell measurements reveal a layer of regulatory information not accessible to single-cell transcriptomics; however, single-cell-omics analysis tools mainly focus on gene expression data. To address this issue, we present epiScanpy, a computational framework for the analysis of single-cell DNA methylation and single-cell ATAC-seq data. EpiScanpy makes the many existing RNA-seq workflows from scanpy available to large-scale single-cell data from other -omics modalities. We introduce and compare multiple feature space constructions for epigenetic data and show the feasibility of common clustering, dimension reduction and trajectory learning techniques. We benchmark epiScanpy by interrogating different single-cell mouse brain atlases of DNA methylation, ATAC-seq and transcriptomics. We find that differentially methylated and differentially open markers between cell clusters enrich transcriptome-based cell type labels with orthogonal epigenetic information.
biorxiv bioinformatics 0-100-users 2019
The genetic makeup of the electrocardiogram, bioRxiv, 2019-05-25
Abstract: Since its original description in 1893 by Willem Einthoven, the electrocardiogram (ECG) has been instrumental in the recognition of a wide array of cardiac disorders [1,2]. Although many electrocardiographic patterns have been well described, the underlying biology is incompletely understood. Genetic associations of particular features of the ECG have been identified by genome-wide studies. This snapshot approach provides only fragmented information about the underlying genetic makeup of the ECG. Here, we follow the effects of individual genetic variants through the complete cardiac cycle the ECG represents. We found that genetic variants have unique morphological signatures not identified by previous analyses. By exploiting identified aberrations of these morphological signatures, we show that novel genetic loci can be identified for cardiac disorders. Our results demonstrate how an integrated approach to analysing high-dimensional data can further our understanding of the ECG, adding to the earlier snapshot analyses of individual ECG components. We anticipate that our comprehensive resource will fuel in silico explorations of the biological mechanisms underlying cardiac traits and disorders represented on the ECG. For example, known disease-causing variants can be used to identify novel morphological ECG signatures, which in turn can be used to prioritize genetic variants or genes for functional validation. Furthermore, the ECG plays a major role in the development of drugs; a genetic assessment of the entire ECG can drive such developments.
biorxiv genetics 0-100-users 2019
A single bacterial genus maintains root development in a complex microbiome, bioRxiv, 2019-05-24
Abstract: Plants grow within a complex web of species interacting with each other and with the plant. Many of these interactions are governed by a wide repertoire of chemical signals, and the resulting chemical landscape of the rhizosphere can strongly affect root health and development. To understand how microbe-microbe interactions influence root development in Arabidopsis, we established a model system for plant-microbe-microbe-environment interactions. We inoculated seedlings with a 185-member bacterial synthetic community (SynCom), manipulated the abiotic environment, and measured bacterial colonization of the plant. This enabled classification of the SynCom into four modules of co-occurring strains. We deconstructed the SynCom based on these modules, identifying microbe-microbe interactions that determine root phenotypes. These interactions primarily involve a single bacterial genus, Variovorax, which completely reverts severe root growth inhibition (RGI) induced by a wide diversity of bacterial strains as well as by the entire 185-member community. We demonstrate that Variovorax manipulate plant hormone levels to balance this ecologically realistic root community’s effects on root development. We identify a novel auxin degradation operon in the Variovorax genome that is necessary and sufficient for RGI reversion. Therefore, metabolic signal interference shapes bacteria-plant communication networks and is essential for maintaining the root’s developmental program. Optimizing the feedbacks that shape chemical interaction networks in the rhizosphere provides a promising new ecological strategy towards the development of more resilient and productive crops.
biorxiv microbiology 100-200-users 2019
Can education be personalised using pupils’ genetic data?, bioRxiv, 2019-05-24
Abstract: The predictive power of polygenic scores for some traits now rivals that of more classical phenotypic measures, and as such they have been promoted as a potential tool for genetically informed policy. However, how predictive polygenic scores are conditional on other easily available phenotypic data is not well understood. Using data from a UK cohort study, the Avon Longitudinal Study of Parents and Children, we investigated how well polygenic scores for education predict individuals’ realised attainment over and above phenotypic data available to schools. Across our sample, children’s polygenic scores predicted their educational outcomes almost as well as parents’ socioeconomic position or education. There was high overlap between the polygenic score and attainment distributions, leading to weak predictive accuracy at the individual level. Furthermore, conditional on prior attainment, the polygenic score was not predictive of later attainment. Our results suggest that polygenic scores are informative for identifying group-level differences, but they currently have limited use in predicting individual attainment.
biorxiv genetics 100-200-users 2019