Phylofactorization: a graph-partitioning algorithm to identify phylogenetic scales of ecological data, bioRxiv, 2017-12-17
Abstract: The problem of pattern and scale is a central challenge in ecology. The problem of scale is central to community ecology, where functional ecological groups are aggregated and treated as a unit underlying an ecological pattern, such as aggregation of “nitrogen fixing trees” into a total abundance of a trait underlying ecosystem physiology. With the emergence of massive community ecological datasets, from microbiomes to breeding bird surveys, there is a need to objectively identify the scales of organization pertaining to well-defined patterns in community ecological data. The phylogeny is a scaffold for identifying key phylogenetic scales associated with macroscopic patterns. Phylofactorization was developed to objectively identify phylogenetic scales underlying patterns in relative abundance data. However, many ecological data, such as presence-absences and counts, are not relative abundances, yet it is still desirable and informative to identify phylogenetic scales underlying a pattern of interest. Here, we generalize phylofactorization beyond relative abundances to a graph-partitioning algorithm for any community ecological data. Generalizing phylofactorization connects many tools from data analysis to phylogenetically-informed analysis of community ecological data. Two-sample tests identify three phylogenetic factors of mammalian body mass which arose during the K-Pg extinction event, consistent with other analyses of mammalian body mass evolution. Projection of data onto coordinates defined by the phylogeny yields a phylogenetic principal components analysis which refines our understanding of the major sources of variation in the human gut microbiome. These same coordinates allow generalized additive modeling of microbes in Central Park soils and confirm that a large clade of Acidobacteria thrives in neutral soils.
Generalized linear and additive modeling of exponential family random variables can be performed by phylogenetically-constrained reduced-rank regression or stepwise factor contrasts. We finish with a discussion of how phylofactorization produces an ecological species concept with a phylogenetic constraint. All of these tools can be implemented with a new R package available online.
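The graph-partitioning idea with a two-sample test can be sketched roughly as follows. This is a minimal illustration, not the R package's API: the toy tree, the trait values, the choice of a Welch t-statistic, and all function names are assumptions made for the example. Each edge of the phylogeny splits the species into two groups; the edge with the largest test statistic becomes a factor, the tree is cut there, and the search repeats within the resulting pieces.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical toy phylogeny: internal node -> children; anything else is a tip.
TREE = {"root": ["A", "B"], "A": ["s1", "s2"], "B": ["C", "s5"], "C": ["s3", "s4"]}
# Hypothetical trait values (for example, log body mass) for each tip.
TRAIT = {"s1": 1.0, "s2": 1.2, "s3": 5.0, "s4": 5.3, "s5": 4.8}


def tips(node):
    """Return the set of tip labels descending from `node`."""
    if node not in TREE:
        return {node}
    found = set()
    for child in TREE[node]:
        found |= tips(child)
    return found


def welch_t(x, y):
    """Absolute Welch t-statistic; groups with fewer than 2 members score 0."""
    if len(x) < 2 or len(y) < 2:
        return 0.0
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return abs(mean(x) - mean(y)) / se if se > 0 else float("inf")


def best_split(pool):
    """Scan every edge and return (score, clade) for the best split of `pool`."""
    best = (0.0, frozenset())
    for children in TREE.values():
        for child in children:
            clade = tips(child) & pool
            rest = pool - clade
            if not clade or not rest:
                continue  # this edge does not separate the pool
            score = welch_t([TRAIT[s] for s in sorted(clade)],
                            [TRAIT[s] for s in sorted(rest)])
            if score > best[0]:
                best = (score, frozenset(clade))
    return best


def phylofactorize(n_factors):
    """Greedily cut the tree along the most significant edges."""
    bins = [frozenset(TRAIT)]
    factors = []
    for _ in range(n_factors):
        candidates = [(best_split(b), b) for b in bins]
        (score, clade), pool = max(candidates, key=lambda c: c[0][0])
        if score == 0.0:
            break  # no remaining edge yields a testable split
        bins.remove(pool)
        bins += [clade, pool - clade]
        factors.append((clade, pool - clade))
    return factors


factors = phylofactorize(2)
```

On this toy input only one factor is found, separating the two low-valued tips from the rest; the remaining groups are too small for the test. Swapping `welch_t` for another objective is how the same loop would extend to other data types, in the spirit of the generalization described above.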
biorxiv ecology 0-100-users 2017
Amplicon sequencing of the 16S-ITS-23S rRNA operon with long-read technology for improved phylogenetic classification of uncultured prokaryotes, bioRxiv, 2017-12-16
Abstract: Amplicon sequencing of the 16S rRNA gene is the predominant method to quantify microbial compositions of environmental samples and to discover previously unknown lineages. Its unique structure of interspersed conserved and variable regions is an excellent target for PCR and allows for classification of reads at all taxonomic levels. However, the relatively few phylogenetically informative sites prevent confident phylogenetic placements of novel lineages that are deep branching relative to reference taxa. This problem is exacerbated when only short 16S rRNA gene fragments are sequenced. To resolve their placement, it is common practice to gather more informative sites by combining multiple conserved genes into concatenated datasets. This, however, requires genomic data, which may be obtained through relatively expensive metagenome sequencing and computationally demanding analyses. Here we develop a protocol that amplifies a large part of 16S and 23S rRNA genes within the rRNA operon, including the ITS region, and sequences the amplicons with PacBio long-read technology. We tested our method with a synthetic mock community and developed a read curation pipeline that reduces the overall error rate to 0.18%. Applying our method to four diverse environmental samples, we were able to capture near full-length rRNA operon amplicons from a large diversity of prokaryotes. Phylogenetic trees constructed with these sequences showed an increase in statistical support compared to trees inferred with shorter, Illumina-like sequences using only the 16S rRNA gene (250 bp). Our method is a cost-effective solution to generate high-quality, near full-length 16S and 23S rRNA gene sequences from environmental prokaryotes.
biorxiv microbiology 0-100-users 2017
High accuracy measurements of nanometer-scale distances between fluorophores at the single-molecule level, bioRxiv, 2017-12-16
To uncover the mechanisms of molecular machines, it is useful to probe their structural conformations. Single-molecule Förster resonance energy transfer (smFRET) is a powerful tool for measuring intra-molecular shape changes of single molecules, but is confined to distances of 2-8 nm. Current super-resolution measurements are error-prone at <25 nm. Thus, reliable high-throughput distance information between 8 and 25 nm is currently difficult to achieve. Here, we describe methods that utilize information about localization and imaging errors to measure distances between two differently colored fluorophores with ∼1 nm accuracy at any distance >2 nm, using a standard TIRF microscope and open-source software. We applied our two-color localization method to uncover a ∼4 nm conformational change in the “stalk” of the motor protein dynein, revealing unexpected flexibility in this antiparallel coiled-coil domain. These new methods enable high-accuracy distance measurements of single molecules that can be used over a wide range of length scales.
biorxiv biophysics 0-100-users 2017
Real-time search of all bacterial and viral genomic data, bioRxiv, 2017-12-16
Genome sequencing of pathogens is now ubiquitous in microbiology, and the sequence archives are effectively no longer searchable for arbitrary sequences. Furthermore, the exponential increase of these archives is likely to be further spurred by automated diagnostics. To unlock their use for scientific research and real-time surveillance, we have combined knowledge about bacterial genetic variation with ideas used in web search to build a DNA search engine for microbial data that can grow incrementally. We indexed the complete global corpus of bacterial and viral whole genome sequence data (447,833 genomes), using four orders of magnitude less storage than previous methods. The method allows future scaling to millions of genomes. This renders the global archive accessible to sequence search, which we demonstrate with three applications: ultra-fast search for resistance genes MCR1-3, analysis of host range for 2827 plasmids, and quantification of the rise of antibiotic resistance prevalence in the sequence archives.
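The abstract does not say which index structure the search engine uses, so the following is only a generic illustration of a web-search-style idea applied to DNA: an inverted index from k-mers to genome names that can grow one genome at a time. The class name, the value of `K`, and the example sequences are all hypothetical, and a plain Python dictionary is far less compact than what the abstract's storage figures imply.

```python
from collections import defaultdict

K = 5  # hypothetical k-mer length, chosen small for the toy example


def kmers(seq, k=K):
    """Return the set of all length-k substrings of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class KmerIndex:
    """Inverted index: k-mer -> names of genomes that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, name, seq):
        """Index one more genome without rebuilding existing entries."""
        for km in kmers(seq):
            self.postings[km].add(name)

    def search(self, query):
        """Return genomes containing every k-mer of `query` (candidate hits)."""
        hits = None
        for km in kmers(query):
            found = self.postings.get(km, set())
            hits = set(found) if hits is None else hits & found
        return hits or set()


index = KmerIndex()
index.add("g1", "ACGTACGTTAGC")
index.add("g2", "TTTTACGTTAGG")
shared = index.search("ACGTTAG")
only_g1 = index.search("TTAGC")
```

Sharing every k-mer does not prove the query occurs contiguously in a genome, so hits from a sketch like this would normally be treated as candidates for a follow-up alignment.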
biorxiv bioinformatics 500+-users 2017
Sequence variation aware genome references and read mapping with the variation graph toolkit, bioRxiv, 2017-12-16
Abstract: Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references are fundamentally limited in that they represent only one version of each locus, whereas the population may contain multiple variants. When the reference represents an individual’s genome poorly, it can impact read mapping and introduce bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation, including large-scale structural variation such as inversions and duplications [1]. Equivalent structures are produced by de novo genome assemblers [2,3]. Here we present vg, a toolkit of computational methods for creating, manipulating, and utilizing these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays [4], with improved accuracy over alignment to a linear reference, creating data structures to support downstream variant calling and genotyping. These capabilities make using variation graphs as reference structures for DNA sequencing practical at the scale of vertebrate genomes, or at the topological complexity of new species assemblies.
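To make the notion of a bidirected sequence graph concrete, here is a small sketch that does not use vg itself; the node sequences, path names, and helper functions are invented for illustration. Nodes carry DNA segments, and a path visits nodes in either forward or reverse orientation, so one graph can hold a reference allele, a SNP, and an inversion at once.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


# Hypothetical graph: node id -> sequence segment.
NODES = {1: "GATT", 2: "A", 3: "C", 4: "ACAT"}

# Each path is a list of (node id, traversed in reverse?) steps.
PATHS = {
    "ref": [(1, False), (2, False), (4, False)],
    "snp": [(1, False), (3, False), (4, False)],  # node 3 replaces node 2
    "inv": [(1, False), (2, False), (4, True)],   # node 4 read on the other strand
}


def spell(path):
    """Concatenate node sequences along a path, honoring orientation."""
    return "".join(revcomp(NODES[n]) if rev else NODES[n] for n, rev in path)


def edges(paths):
    """Collect the oriented node-to-node adjacencies implied by the paths."""
    found = set()
    for steps in paths.values():
        found.update(zip(steps, steps[1:]))
    return found


haplotypes = {name: spell(steps) for name, steps in PATHS.items()}
```

Here the three haplotypes share nodes 1 and 4, which is the compactness the abstract refers to: shared sequence is stored once and variants appear as alternative routes through the graph.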
biorxiv genomics 200-500-users 2017
Specificity of RNAi, LNA and CRISPRi as loss-of-function methods in transcriptional analysis, bioRxiv, 2017-12-16
Abstract: Loss-of-function (LOF) methods, such as RNA interference (RNAi), antisense oligonucleotides or CRISPR-based genome editing, provide unparalleled power for studying the biological function of genes of interest. When coupled with transcriptomic analyses, LOF methods allow researchers to dissect networks of transcriptional regulation. However, a major concern is nonspecific targeting, which involves depletion of transcripts other than those intended. The off-target effects of each of these common LOF methods have yet to be compared at the whole-transcriptome level. Here, we systematically and experimentally compared nonspecific activity of RNAi, antisense oligonucleotides and CRISPR interference (CRISPRi). All three methods yielded non-negligible off-target effects in gene expression, with CRISPRi exhibiting clonal variation in the transcriptional profile. As an illustrative example, we evaluated the performance of each method for deciphering the role of a long noncoding RNA (lncRNA) with unknown function. Although all LOF methods reduced expression of the candidate lncRNA, each method yielded different sets of differentially expressed genes upon knockdown as well as a different cellular phenotype. Therefore, to definitively confirm the functional role of a transcriptional regulator, we recommend the simultaneous use of at least two different LOF methods and the inclusion of multiple, specifically designed negative controls.
biorxiv genomics 0-100-users 2017