Ohana: detecting selection in multiple populations by modelling ancestral admixture components, bioRxiv, 2019-02-15
One of the most powerful and commonly used methods for detecting local adaptation in the genome is the identification of extreme allele frequency differences between populations. In this paper, we present a new maximum likelihood method for finding regions under positive selection. The method is based on a Gaussian approximation to allele frequency changes and it incorporates admixture between populations. The method can analyze multiple populations simultaneously and retains power to detect selection signatures specific to ancestry components that are not representative of any extant populations. We evaluate the method using simulated data and compare it to related methods based on summary statistics. We also apply it to human genomic data and identify loci with extreme genetic differentiation between major geographic groups. Many of the genes identified are previously known selected loci relating to hair pigmentation and morphology, skin and eye pigmentation. We also identify new candidate regions, including various selected loci in the Native American component of admixed Mexican-Americans. These involve diverse biological functions, like immunity, fat distribution, food intake, vision and hair development.
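As a rough illustration of the summary-statistic idea the abstract starts from (flagging loci with extreme allele frequency differences between two populations), the sketch below ranks loci by absolute frequency difference. It is not the Ohana likelihood model: the function name `frequency_outliers`, the frequencies, and the cutoff `top_n` are all made up for the example.

```python
# Illustrative only: ranks loci by absolute allele-frequency difference.
# This is NOT the Ohana maximum likelihood method; names and data are hypothetical.

def frequency_outliers(freq_a, freq_b, top_n):
    """Return indices of the top_n loci with the largest |p_a - p_b|."""
    if len(freq_a) != len(freq_b):
        raise ValueError("frequency lists must have the same length")
    diffs = [abs(a - b) for a, b in zip(freq_a, freq_b)]
    # Sort locus indices from most to least differentiated.
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return ranked[:top_n]

# Made-up allele frequencies at four loci in two populations.
pop_a = [0.10, 0.50, 0.95, 0.30]
pop_b = [0.12, 0.45, 0.05, 0.70]
print(frequency_outliers(pop_a, pop_b, top_n=2))
```

A ranking like this ignores admixture and drift, which is the gap the abstract says a model-based method is meant to close.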
biorxiv bioinformatics 100-200-users 2019
SyRI: identification of syntenic and rearranged regions from whole-genome assemblies, bioRxiv, 2019-02-12
We present SyRI, an efficient tool for genome-wide identification of structural rearrangements (SR) from genome graphs, which are built up from pair-wise whole-genome alignments. Instead of searching for differences, SyRI starts by finding all co-linear regions between the genomes. As all remaining regions are SRs by definition, they can be classified as inversions, translocations, or duplications based on their positions in convoluted networks of repetitive alignments. Finally, SyRI reports local variations like SNPs and indels within syntenic and rearranged regions. We show SyRI’s broad applicability to multiple species and genetically validate the presence of ~100 translocations identified in Arabidopsis.
biorxiv bioinformatics 100-200-users 2019
Socru: Typing of genome level order and orientation in bacteria, bioRxiv, 2019-02-10
Genome rearrangements occur in bacteria between repeat sequences and impact growth and gene expression. Homologous recombination can occur between ribosomal operons, which are found in multiple copies in many bacteria. Inversion between indirect repeats and excision/translocation between direct repeats enable structural genome rearrangement. To identify what these rearrangements are by sequencing, reads of several thousand bases are required to span the ribosomal operons. With long read sequencing aiding the routine generation of complete bacterial assemblies, we have developed socru, a typing method for the order and orientation of genome fragments between ribosomal operons, defined against species-specific baselines. It allows for a single identifier to convey the order and orientation of genome level structure, and 434 of the most common bacterial species are supported. Additionally, socru can be used to identify large scale misassemblies. Availability and implementation: Socru is written in Python 3, runs on Linux and OSX systems and is available under the open source license GNU GPL 3 from https://github.com/quadram-institute-bioscience/socru.
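To make "a single identifier for order and orientation" concrete, here is a minimal sketch of one way such a signature could be built. It is an assumption for illustration, not socru's actual encoding or output format: fragments are numbered 1..n against a baseline, a negative sign marks an inverted fragment, the chromosome is treated as circular, and the helper names are hypothetical.

```python
# Hypothetical encoding, not socru's real format: signed integers for fragments
# between ribosomal operons, negative meaning inverted relative to a baseline.

def canonical_arrangement(fragments):
    """Rotate a circular, signed fragment order so fragment 1 is first and forward."""
    labels = [abs(f) for f in fragments]
    if sorted(labels) != list(range(1, len(fragments) + 1)):
        raise ValueError("expected each fragment 1..n exactly once")
    start = labels.index(1)
    rotated = fragments[start:] + fragments[:start]
    if rotated[0] < 0:
        # Read the circle from the other strand: reverse the order, flip the signs.
        flipped = [-f for f in reversed(rotated)]
        # Fragment 1 is now last; rotate it back to the front.
        rotated = flipped[-1:] + flipped[:-1]
    return rotated

def signature(fragments):
    """Render the canonical arrangement as a string, marking inversions with a prime."""
    canon = canonical_arrangement(fragments)
    return " ".join(f"{abs(f)}'" if f < 0 else str(f) for f in canon)

print(signature([2, 3, 1]))   # same circular order as the baseline
print(signature([3, -1, 2]))  # assembly reported on the opposite strand
```

Canonicalising before comparison means two assemblies of the same structure that start at different points, or on different strands, map to the same string.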
biorxiv bioinformatics 0-100-users 2019
Object Detection Networks and Augmented Reality for Cellular Detection in Fluorescence Microscopy Acquisition and Analysis, bioRxiv, 2019-02-09
In this paper we demonstrate the application of object detection networks for the classification and localization of cells in fluorescence microscopy. We benchmark two leading object detection algorithms across multiple challenging 2-D microscopy datasets, and we develop and demonstrate an algorithm which can localize and image cells in 3-D, in real-time. Furthermore, we exploit the fast processing of these algorithms and develop a simple and effective Augmented Reality (AR) system for fluorescence microscopy systems. Object detection networks are well-known high performance networks famously applied to the task of identifying and localizing objects in photographic images. Here we show their application and efficiency for localizing cells in fluorescence microscopy images. Object detection algorithms are typically trained on many thousands of images, which can be prohibitive within the biological sciences due to the cost of imaging and annotating large amounts of data. Taking different cell types and assays as examples, we show that with some careful considerations it is possible to achieve very high performance with datasets of as few as 26 images. Using our approach, it is possible for relatively non-skilled users to automate detection of cell classes with a variety of appearances and enable new avenues for automation of conventionally manual fluorescence microscopy acquisition pipelines.
biorxiv bioinformatics 0-100-users 2019
Performance of neural network basecalling tools for Oxford Nanopore sequencing, bioRxiv, 2019-02-08
Background: Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT). Here we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rules consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish. Results: Training basecallers on taxon-specific data results in a significant boost in consensus accuracy, mostly due to the reduction of errors in methylation motifs. A larger neural network is able to improve both read and consensus accuracy, but at a cost to speed. Improving consensus sequences (‘polishing’) with Nanopolish somewhat negates the accuracy differences in basecallers, but pre-polish accuracy does have an effect on post-polish accuracy. Conclusions: Basecalling accuracy has seen significant improvements over the last two years. The current version of ONT’s Guppy basecaller performs well overall, with good accuracy and fast performance. If higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.
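The distinction between read-level accuracy and majority-rules consensus accuracy can be shown with a toy example. The sketch below assumes the reads are already aligned to a common length (with `-` for gaps) and uses a naive position-by-position identity; the function names and the reads are hypothetical, and real evaluations rely on proper alignment rather than this simplification.

```python
from collections import Counter

def majority_consensus(aligned_reads):
    """Column-wise majority vote over equal-length aligned reads ('-' is a gap)."""
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":  # a majority gap contributes no base
            consensus.append(base)
    return "".join(consensus)

def identity(seq, reference):
    """Naive identity: matching positions over the longer length (no realignment)."""
    matches = sum(a == b for a, b in zip(seq, reference))
    return matches / max(len(seq), len(reference))

reference = "ACGTA"
reads = ["ACGTA", "ACGAA", "AC-TA"]  # made-up reads, each aligned to 5 columns
consensus = majority_consensus(reads)
print(consensus, identity(consensus, reference))
print([identity(read, reference) for read in reads])
```

In this toy case two of the three reads contain an error, yet the errors fall at different positions, so the vote recovers the reference. Systematic errors that recur at the same position in most reads would survive the vote instead.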
biorxiv bioinformatics 100-200-users 2019
Assessing graph-based read mappers against a novel baseline approach highlights strengths and weaknesses of the current generation of methods, bioRxiv, 2019-02-02
Graph-based reference genomes have become popular as they allow read mapping and follow-up analyses in settings where the exact haplotypes underlying a high-throughput sequencing experiment are not precisely known. Two recent papers show that mapping to graph-based reference genomes can improve accuracy as compared to methods using linear references. Both of these methods index the sequences for most paths up to a certain length in the graph in order to enable direct mapping of reads containing common variants. However, the combinatorial explosion of possible paths through nearby variants also leads to a huge search space and an increased chance of false positive alignments to highly variable regions. We here assess three prominent graph-based read mappers against a novel hybrid baseline approach that combines an initial path determination with a tuned linear read mapping method. We show, using a previously proposed benchmark, that this simple approach is able to improve accuracy of read-mapping to graph-based reference genomes. Our method is implemented in a tool, Two-step Graph Mapper, which is available at https://github.com/uio-bmi/two_step_graph_mapper along with data and scripts for reproducing the experiments.
biorxiv bioinformatics 0-100-users 2019