Fast and accurate large multiple sequence alignments using root-to-leave regressive computation, bioRxiv, 2018-12-08
Abstract Inferences derived from large multiple alignments of biological sequences are critical to many areas of biology, including evolution, genomics, biochemistry, and structural biology. However, the complexity of the alignment problem imposes the use of approximate solutions. The most common is the progressive algorithm, which starts by aligning the most similar sequences, incorporating the remaining ones following the order imposed by a guide-tree. We developed and validated on protein sequences a regressive algorithm that works the other way around, aligning first the most dissimilar sequences. Our algorithm produces more accurate alignments than non-regressive methods, especially on datasets larger than 10,000 sequences. By design, it can run any existing alignment method in linear time, thus allowing the scale-up required for extremely large genomic analyses. One Sentence Summary Initiating alignments with the most dissimilar sequences allows slow and accurate methods to be used on large datasets.
biorxiv bioinformatics 200-500-users 2018
New methods to calculate concordance factors for phylogenomic datasets, bioRxiv, 2018-12-05
Summary We introduce and implement two measures for quantifying genealogical concordance in phylogenomic datasets: the gene concordance factor (gCF) and the site concordance factor (sCF). For every branch of a reference tree, gCF is defined as the percentage of decisive gene trees containing that branch. This measure is already in wide usage, but here we introduce a package that calculates it while accounting for variable taxon coverage among gene trees. sCF is a new measure defined as the percentage of decisive sites supporting a branch in the reference tree. gCF and sCF complement classical measures of branch support in phylogenetics by providing a full description of underlying disagreement among loci and sites. Availability and Implementation An easy to use implementation and tutorial is freely available in the IQ-TREE software (http://www.iqtree.org). Supplementary information Data are available at https://doi.org/10.5281/zenodo.1949290
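The gCF definition above can be illustrated with a minimal sketch. It is not the IQ-TREE implementation: the representation of a branch as a split of taxon names, the helper names `make_split` and `gcf`, and the decisiveness rule used here (a gene tree counts as decisive when it retains at least two taxa on each side of the branch) are all simplifying assumptions made for illustration.

```python
def make_split(side_a, side_b):
    """Build an unordered split (bipartition) from two iterables of taxon names."""
    return frozenset({frozenset(side_a), frozenset(side_b)})


def gcf(ref_split, gene_trees):
    """Percentage of decisive gene trees that contain the reference branch.

    ref_split  -- split of the reference tree, as built by make_split
    gene_trees -- list of (taxa, splits) pairs, one per gene tree, where
                  taxa is a frozenset of names and splits is a set of splits
    """
    side_a, side_b = ref_split  # the order of the two sides does not matter
    decisive = concordant = 0
    for taxa, splits in gene_trees:
        # Restrict the reference branch to the taxa sampled in this gene tree.
        kept_a, kept_b = side_a & taxa, side_b & taxa
        # Assumed decisiveness rule: the restricted split must be non-trivial.
        if len(kept_a) < 2 or len(kept_b) < 2:
            continue
        decisive += 1
        if frozenset({kept_a, kept_b}) in splits:
            concordant += 1
    if decisive == 0:
        return float("nan")
    return 100.0 * concordant / decisive


# Toy data: three gene trees with different taxon coverage.
ref = make_split("AB", "CDE")
trees = [
    (frozenset("ABCDE"), {make_split("AB", "CDE"), make_split("ABC", "DE")}),
    (frozenset("ABCD"), {make_split("AC", "BD")}),  # decisive, discordant
    (frozenset("ACD"), set()),                      # not decisive: B is missing
]
print(gcf(ref, trees))
```

In this toy input the third gene tree is skipped rather than counted as discordant, which is the effect of restricting the calculation to decisive gene trees when taxon coverage varies.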
biorxiv bioinformatics 100-200-users 2018
Ultra-deep, long-read nanopore sequencing of mock microbial community standards, bioRxiv, 2018-12-04
Background Long sequencing reads are information-rich, aiding de novo assembly and reference mapping, and consequently have great potential for the study of microbial communities. However, the best approaches for analysis of long-read metagenomic data are unknown. Additionally, rigorous evaluation of bioinformatics tools is hindered by a lack of long-read data from validated samples with known composition. Methods We sequenced two commercially-available mock communities containing ten microbial species (ZymoBIOMICS Microbial Community Standards) with Oxford Nanopore GridION and PromethION. Isolates from the same mock community were sequenced individually with Illumina HiSeq. Data We generated 14 and 16 Gbp from GridION flowcells and 146 and 148 Gbp from PromethION flowcells for the even and log communities, respectively. Read length N50 was 5.3 Kbp and 5.2 Kbp for the even and log communities, respectively. Basecalls and corresponding signal data are made available (4.2 TB in total). Results Alignment to Illumina-sequenced isolates demonstrated the expected microbial species at anticipated abundances, with the limit of detection for the lowest abundance species below 50 cells (GridION). De novo assembly of metagenomes recovered long contiguous sequences without the need for pre-processing techniques such as binning. Conclusions We present ultra-deep, long-read nanopore datasets from a well-defined mock community. These datasets will be useful for those developing bioinformatics methods for long-read metagenomics and for the validation and comparison of current laboratory and software pipelines.
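The read length N50 quoted above is the length L such that reads of length L or longer together contain at least half of all sequenced bases. A minimal sketch of that calculation follows; the function name `n50` and the toy read lengths are illustrative and are not taken from the dataset.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    Reads are visited from longest to shortest; the N50 is the length of the
    read at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("n50 needs at least one read length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: 20 bases in total; the two longest reads (6 + 5) cover half.
print(n50([2, 3, 4, 5, 6]))
```

Because a few very long reads can carry a large share of the bases, N50 is usually higher than the median read length of the same run.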
biorxiv bioinformatics 100-200-users 2018
Linked-read sequencing of gametes allows efficient genome-wide analysis of meiotic recombination, bioRxiv, 2018-12-01
ABSTRACT Meiotic crossovers (COs) ensure proper chromosome segregation and redistribute the genetic variation that is transmitted to the next generation. Existing methods for CO identification are challenged by large populations and the demand for genome-wide and fine-scale resolution. Taking advantage of linked-read sequencing, we developed a highly efficient method for genome-wide identification of COs at kilobase resolution in pooled recombinants. We first tested this method using a pool of Arabidopsis F2 recombinants, and obtained results that recapitulated those identified from the same plants using individual whole-genome sequencing. By applying this method to a pool of pollen DNA from a single F1 plant, we established a highly accurate CO landscape without generating or sequencing a single recombinant plant. The simplicity of this approach now enables the simultaneous generation and analysis of multiple CO landscapes and thereby allows for efficient comparison of genotypic and environmental effects on recombination, accelerating the pace at which the mechanisms for the regulation of recombination can be elucidated.
biorxiv bioinformatics 100-200-users 2018
Generative modeling and latent space arithmetics predict single-cell perturbation response across cell types, studies and species, bioRxiv, 2018-11-30
Abstract Accurately modeling cellular response to perturbations is a central goal of computational biology. While such modeling has been proposed based on statistical, mechanistic and machine learning models in specific settings, no generalization of predictions to phenomena absent from training data (‘out-of-sample’) has yet been demonstrated. Here, we present scGen, a model combining variational autoencoders and latent space vector arithmetics for high-dimensional single-cell gene expression data. In benchmarks across a broad range of examples, we show that scGen accurately models dose and infection response of cells across cell types, studies and species. In particular, we demonstrate that scGen learns cell type and species specific response, implying that it captures features that distinguish responding from non-responding genes and cells. With the upcoming availability of large-scale atlases of organs in healthy state, we envision scGen to become a tool for experimental design through in silico screening of perturbation response in the context of disease and drug treatment.
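The latent space vector arithmetic mentioned above can be sketched in a few lines. This is a toy illustration, not the scGen code: it assumes the cells have already been encoded into latent vectors by some autoencoder, omits the encoder and decoder entirely, and uses hypothetical helper names (`mean_vector`, `perturbation_delta`, `shift`) and made-up two-dimensional latent values. The idea shown is to estimate a perturbation direction as the difference between the mean latent vectors of perturbed and control cells, then add that direction to control cells from a held-out group.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def perturbation_delta(z_control, z_perturbed):
    """Latent direction from the control mean to the perturbed mean."""
    mean_c = mean_vector(z_control)
    mean_p = mean_vector(z_perturbed)
    return [p - c for p, c in zip(mean_p, mean_c)]


def shift(latents, delta):
    """Add the perturbation direction to every latent vector."""
    return [[z + d for z, d in zip(row, delta)] for row in latents]


# Made-up latent vectors for a cell type observed in both conditions.
z_control = [[0.0, 1.0], [2.0, 1.0]]
z_perturbed = [[2.0, 4.0], [4.0, 4.0]]
delta = perturbation_delta(z_control, z_perturbed)

# Control cells of a held-out cell type, shifted along the same direction.
# In a full model these shifted vectors would be passed through the decoder.
z_new_control = [[10.0, 0.0]]
print(delta, shift(z_new_control, delta))
```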
biorxiv bioinformatics 0-100-users 2018
Naught all zeros in sequence count data are the same, bioRxiv, 2018-11-27
Abstract Due to the advent and utility of high-throughput sequencing, modern biomedical research abounds with multivariate count data. Yet such sequence count data is often extremely sparse; that is, much of the data is zero values. Such zero values are well known to cause problems for statistical analyses. In this work we provide a systematic description of different processes that can give rise to zero values as well as the types of methods for addressing zeros in sequence count studies. Importantly, we systematically review how various models perform on each type of zero generating process. Our results demonstrate that zero-inflated models can have substantial biases in both simulated and real data settings. Additionally, we find that zeros due to biological absences can, for many applications, be approximated as originating from under sampling. Beyond these results, this work provides a paired categorization scheme for models and zero generating processes to facilitate discussions and future research into the analysis of sequence count data.
biorxiv bioinformatics 100-200-users 2018