A natural encoding of genetic variation in a Burrows-Wheeler Transform to enable mapping and genome inference, bioRxiv, 2016-06-16
We show how positional markers can be used to encode genetic variation within a Burrows-Wheeler Transform (BWT), and use this to construct a generalisation of the traditional “reference genome”, incorporating known variation within a species. Our goal is to support the inference of the closest mosaic of previously known sequences to the genome(s) under analysis. Our scheme results in an increased alphabet size, and by using a wavelet tree encoding of the BWT we reduce the performance impact on rank operations. We give a specialised form of the backward search that allows variation-aware exact matching. We implement this, and demonstrate the cost of constructing an index of the whole human genome with 8 million genetic variants is 25 GB of RAM. We also show that inferring a closer reference can close large kilobase-scale coverage gaps in P. falciparum.
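For readers unfamiliar with the backward search mentioned in the abstract above, here is a minimal sketch of the ordinary, variation-unaware version on a single string. It is not the paper's variation-aware algorithm and it uses plain prefix-count tables where the paper uses a wavelet tree for rank queries. The function names `bwt`, `build_index` and `count_matches` are illustrative assumptions, not taken from the paper's implementation.

```python
from collections import Counter

def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations; text must end with '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def build_index(bwt_str):
    """Build the two tables that backward search needs."""
    counts = Counter(bwt_str)
    # C[c] = number of characters in the text that sort strictly before c
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = occurrences of c in bwt_str[:i] (a naive rank structure)
    occ = {c: [0] * (len(bwt_str) + 1) for c in counts}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ

def count_matches(pattern, C, occ, n):
    """Count exact occurrences of pattern by extending it right to left."""
    lo, hi = 0, n  # half-open interval of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

transformed = bwt("banana$")
C, occ = build_index(transformed)
print(transformed)                                     # annb$aa
print(count_matches("ana", C, occ, len(transformed)))  # 2
```

Each step of the loop costs two rank lookups, which is why a larger alphabet makes the choice of rank structure matter.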
biorxiv bioinformatics 200-500-users 2016
Adapterama I: Universal stubs and primers for 384 unique dual-indexed or 147,456 combinatorially-indexed Illumina libraries (iTru & iNext), bioRxiv, 2016-06-16
Next-generation DNA sequencing (NGS) offers many benefits, but major factors limiting NGS include reducing costs of 1) start-up (i.e., doing NGS for the first time); 2) buy-in (i.e., getting the smallest possible amount of data from a run); and 3) sample preparation. Reducing sample preparation costs is commonly addressed, but start-up and buy-in costs are rarely addressed. We present dual-indexing systems to address all three of these issues. By breaking the library construction process into universal, re-usable, combinatorial components, we reduce all costs, while increasing the number of samples and the variety of library types that can be combined within runs. We accomplish this by extending the Illumina TruSeq dual-indexing approach to 768 (384 + 384) indexed primers that produce 384 unique dual-indexes or 147,456 (384 × 384) unique combinations. We maintain eight nucleotide indexes, with many that are compatible with Illumina index sequences. We synthesized these indexing primers, purifying them with only standard desalting and placing small aliquots in replicate plates. In qPCR validation tests, 206 of 208 primers tested passed (99% success). We then created hundreds of libraries in various scenarios. Our approach reduces start-up and per-sample costs by requiring only one universal adapter that works with indexed PCR primers to uniquely identify samples. Our approach reduces buy-in costs because 1) relatively few oligonucleotides are needed to produce a large number of indexed libraries; and 2) the large number of possible primers allows researchers to use unique primer sets for different projects, which facilitates pooling of samples during sequencing. Our libraries make use of standard Illumina sequencing primers and index sequence length and are demultiplexed with standard Illumina software, thereby minimizing customization headaches.
In subsequent Adapterama papers, we use these same primers with different adapter stubs to construct amplicon and restriction-site associated DNA libraries, but their use can be expanded to any type of library sequenced on Illumina platforms.
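The index counts quoted in the abstract above can be reproduced with a short sketch. The primer names below are hypothetical placeholders, and the `i5`/`i7` labels are an assumption borrowed from standard Illumina terminology, since the abstract does not name the two primer sets.

```python
from itertools import product

PRIMERS_PER_SIDE = 384  # from the abstract: 384 + 384 = 768 indexed primers

# Hypothetical primer names; the real primer names and sequences are not given here.
i5_primers = [f"i5_{k:03d}" for k in range(1, PRIMERS_PER_SIDE + 1)]
i7_primers = [f"i7_{k:03d}" for k in range(1, PRIMERS_PER_SIDE + 1)]

# Unique dual-indexing: every primer is used in exactly one pair.
unique_dual = list(zip(i5_primers, i7_primers))

# Combinatorial indexing: every i5 primer may be paired with every i7 primer.
combinatorial = list(product(i5_primers, i7_primers))

print(len(i5_primers) + len(i7_primers))  # 768
print(len(unique_dual))                   # 384
print(len(combinatorial))                 # 147456
```

The contrast shows why relatively few oligonucleotides go a long way: 768 primers yield 147,456 combinations, at the price of each individual index being shared across many libraries.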
biorxiv genomics 0-100-users 2016
Towards an integration of deep learning and neuroscience, bioRxiv, 2016-06-14
Neuroscience has focused on the detailed implementation of computation, studying neural codes, dynamics and circuits. In machine learning, however, artificial neural networks tend to eschew precisely designed codes, dynamics or circuits in favor of brute force optimization of a cost function, often using simple and relatively uniform initial architectures. Two recent developments have emerged within machine learning that create an opportunity to connect these seemingly divergent perspectives. First, structured architectures are used, including dedicated systems for attention, recursion and various forms of short- and long-term memory storage. Second, cost functions and training procedures have become more complex and are varied across layers and over time. Here we think about the brain in terms of these ideas. We hypothesize that (1) the brain optimizes cost functions, (2) these cost functions are diverse and differ across brain locations and over development, and (3) optimization operates within a pre-structured architecture matched to the computational problems posed by behavior. Such a heterogeneously optimized system, enabled by a series of interacting cost functions, serves to make learning data-efficient and precisely targeted to the needs of the organism. We suggest directions by which neuroscience could seek to refine and test these hypotheses.
biorxiv neuroscience 200-500-users 2016
Vibrio natriegens, a new genomic powerhouse, bioRxiv, 2016-06-13
Recombinant DNA technology has revolutionized biomedical research with continual innovations advancing the speed and throughput of molecular biology. Nearly all these tools, however, are reliant on Escherichia coli as a host organism, and its slow growth increasingly dominates experimental time. Here we report the development of Vibrio natriegens, a free-living bacterium with the fastest generation time known, into a genetically tractable host organism. We systematically characterize its growth properties to establish basic laboratory culturing conditions. We provide the first complete Vibrio natriegens genome, consisting of two chromosomes of 3,248,023 bp and 1,927,310 bp that together encode 4,578 open reading frames. We reveal genetic tools and techniques for working with Vibrio natriegens. These foundational resources will usher in an era of advanced genomics to accelerate biological, biotechnological, and medical discoveries.
biorxiv genomics 0-100-users 2016
Differential analysis of RNA-Seq incorporating quantification uncertainty, bioRxiv, 2016-06-11
We describe a novel method for the differential analysis of RNA-Seq data that utilizes bootstrapping in conjunction with response error linear modeling to decouple biological variance from inferential variance. The method is implemented in an interactive shiny app called sleuth that utilizes kallisto quantifications and bootstraps for fast and accurate analysis of RNA-Seq experiments.
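One simple way to read the decoupling described in the abstract above is as a variance subtraction. The spread of point estimates across replicates is treated as the total variance, the average within-sample bootstrap variance as the inferential part, and the remainder as the biological part. The sketch below illustrates that idea on toy numbers. It is an assumption-laden illustration, not sleuth's actual estimator, and the function name `decompose_variance` is hypothetical.

```python
from statistics import mean, variance

def decompose_variance(point_estimates, bootstraps):
    """Split the variance of one transcript's abundance into two parts.

    point_estimates: one (log) abundance per biological replicate.
    bootstraps: for each replicate, a list of bootstrap (log) abundances.
    """
    total = variance(point_estimates)                # spread across replicates
    inferential = mean(variance(b) for b in bootstraps)  # quantification noise
    biological = max(total - inferential, 0.0)       # clamp at zero
    return total, inferential, biological

# Toy data: three replicates, two bootstrap draws each (made-up values).
estimates = [1.0, 2.0, 3.0]
boots = [[0.9, 1.1], [1.9, 2.1], [2.9, 3.1]]
total, inferential, biological = decompose_variance(estimates, boots)
print(total, inferential, biological)
```

Clamping at zero keeps the biological estimate non-negative when quantification noise alone would account for the observed spread.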
biorxiv bioinformatics 0-100-users 2016
Thanatotranscriptome: genes actively expressed after organismal death, bioRxiv, 2016-06-11
A continuing enigma in the study of biological systems is what happens to highly ordered structures, far from equilibrium, when their regulatory systems suddenly become disabled. In life, genetic and epigenetic networks precisely coordinate the expression of genes, but in death, it is not known if gene expression diminishes gradually or abruptly stops or if specific genes are involved. We investigated the unwinding of the clock by identifying upregulated genes, assessing their functions, and comparing their transcriptional profiles through postmortem time in two species, mouse and zebrafish. We found transcriptional abundance profiles of 1,063 genes were significantly changed after death of healthy adult animals in a time series spanning from life to 48 or 96 h postmortem. Ordination plots revealed non-random patterns in profiles by time. While most thanatotranscriptome (thanatos-, Greek defn. death) transcript levels increased within 0.5 h postmortem, some increased only at 24 and 48 h. Functional characterization of the most abundant transcripts revealed the following categories: stress, immunity, inflammation, apoptosis, transport, development, epigenetic regulation, and cancer. The increase of transcript abundance was presumably due to thermodynamic and kinetic controls encountered such as the activation of epigenetic modification genes responsible for unraveling the nucleosomes, which enabled transcription of previously silenced genes (e.g., development genes). The fact that new molecules were synthesized at 48 to 96 h postmortem suggests sufficient energy and resources to maintain self-organizing processes. A step-wise shutdown occurs in organismal death that is manifested by the apparent upregulation of genes with various abundance maxima and durations. The results are of significance to transplantology and molecular biology.
biorxiv systems-biology 200-500-users 2016