High-throughput annotation of full-length long noncoding RNAs with Capture Long-Read Sequencing, bioRxiv, 2017-02-02
Abstract: Accurate annotation of genes and their transcripts is a foundation of genomics, but no annotation technique presently combines throughput and accuracy. As a result, reference gene collections remain incomplete: many gene models are fragmentary, while thousands more remain uncatalogued, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), combining targeted RNA capture with third-generation long-read sequencing. We present an experimental re-annotation of the GENCODE intergenic lncRNA population in matched human and mouse tissues, resulting in novel transcript models for 3574 and 561 gene loci, respectively. CLS approximately doubles the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enable us to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential. Thus CLS removes a longstanding bottleneck of transcriptome annotation, generating manual-quality full-length transcript models at high-throughput scales.
Abbreviations: bp, base pair; FL, full length; nt, nucleotide; ROI, read of insert, i.e. PacBio read; SJ, splice junction; SMRT, single-molecule real-time; TM, transcript model.
biorxiv genomics 0-100-users 2017
SvABA: Genome-wide detection of structural variants and indels by local assembly, bioRxiv, 2017-02-02
Abstract: Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at scale, genome-wide, for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA’s performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs, and substantially improved detection performance for variants in the 20-300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (< 1,000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types, and found that templated-sequence insertions occur in ~4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized SVs.
biorxiv genomics 0-100-users 2017
A computational toolbox and step-by-step tutorial for the analysis of neuronal population dynamics in calcium imaging data, bioRxiv, 2017-01-29
The development of new imaging and optogenetics techniques to study the dynamics of large neuronal circuits is generating datasets of unprecedented volume and complexity, demanding the development of appropriate analysis tools. We present a tutorial for the use of a comprehensive computational toolbox for the analysis of neuronal population activity imaging. It consists of tools for image pre-processing and segmentation, estimation of significant single-neuron single-trial signals, mapping event-related neuronal responses, detection of activity-correlated neuronal clusters, exploration of population dynamics, and analysis of clusters’ features against surrogate control datasets. They are integrated in a modular and versatile processing pipeline, adaptable to different needs. The clustering module is capable of detecting flexible, dynamically activated neuronal assemblies, consistent with the distributed population coding of the brain. We demonstrate the suitability of the toolbox for a variety of calcium imaging datasets, and provide a case study to explain its implementation.
biorxiv neuroscience 0-100-users 2017
A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases, bioRxiv, 2017-01-28
Abstract: Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long-read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥ 5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and > 60,000 genomes.
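The combination of minimizer sampling and MinHash identity estimation mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the hash function, the window and sketch sizes, the function names, and the Jaccard-to-identity conversion (the Mash distance formula, used here as a stand-in) are all assumptions made for illustration.

```python
import hashlib
import math


def kmer_hash(kmer: str) -> int:
    # Hypothetical choice of hash: a deterministic 64-bit value from BLAKE2b.
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def minimizers(seq: str, k: int, w: int) -> set:
    """Return the hashes of the (w, k)-minimizers of seq.

    A minimizer is the smallest k-mer hash in each window of w consecutive k-mers.
    """
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def jaccard_estimate(a: set, b: set, s: int) -> float:
    """Bottom-s MinHash estimate of the Jaccard similarity of two hash sets."""
    smallest = sorted(a | b)[:s]
    if not smallest:
        return 0.0
    shared = sum(1 for h in smallest if h in a and h in b)
    return shared / len(smallest)


def identity_from_jaccard(j: float, k: int) -> float:
    """Convert a Jaccard estimate to an approximate sequence identity.

    Assumption: uses the Mash distance d = -(1/k) * ln(2j / (1 + j)) and
    reports 1 - d, clamped at 0.
    """
    if j <= 0.0:
        return 0.0
    return max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k)


if __name__ == "__main__":
    # Made-up toy sequences, differing by a single substitution.
    read = "ACGTTGCATGCCGATTAGCTAGGCTAACGTAGCTAGCTAGGATCCGATCGATCG"
    ref = "ACGTTGCATGCCGATTAGCTAGGCTTACGTAGCTAGCTAGGATCCGATCGATCG"
    k, w, s = 8, 5, 20
    j = jaccard_estimate(minimizers(read, k, w), minimizers(ref, k, w), s)
    print("Jaccard estimate:", j)
    print("Approximate identity:", identity_from_jaccard(j, k))
```

Sampling only minimizers keeps the number of compared hashes small, which is what makes this style of estimate cheap compared with base-level alignment.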
biorxiv bioinformatics 100-200-users 2017
A randomized placebo-controlled trial on the antidepressant effects of the psychedelic ayahuasca in treatment-resistant depression, bioRxiv, 2017-01-28
Abstract: Recent open-label trials show that psychedelics, such as ayahuasca, hold promise as fast-onset antidepressants in treatment-resistant depression. In order to further test the antidepressant effects of ayahuasca, we conducted a parallel-arm, double-blind randomized placebo-controlled trial in 29 patients with treatment-resistant depression. Patients received a single dose of either ayahuasca or placebo. Changes in depression severity were assessed with the Montgomery–Åsberg Depression Rating Scale (MADRS) and the Hamilton Depression Rating Scale (HAM-D). Assessments were made at baseline, and at one (D1), two (D2) and seven (D7) days after dosing. We observed significant antidepressant effects of ayahuasca when compared to placebo at all timepoints. MADRS scores were significantly lower in the ayahuasca group compared to placebo (at D1 and D2 p=0.04; and at D7 p<0.0001). Between-group effect sizes increased from D1 to D7 (D1 Cohen’s d=0.84; D2 Cohen’s d=0.84; D7 Cohen’s d=1.49). Response rates were high for both groups at D1 and D2, and significantly higher in the ayahuasca group at D7 (64% vs. 27%; p=0.04), while remission rate was marginally significant at D7 (36% vs. 7%, p=0.054). To our knowledge, this is the first controlled trial to test a psychedelic substance in treatment-resistant depression. Overall, this study brings new evidence supporting the safety and therapeutic value of ayahuasca, dosed within an appropriate setting, to help treat depression.
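For readers unfamiliar with the between-group effect sizes quoted above, the following is a minimal sketch of how Cohen's d is commonly computed, using the pooled standard deviation. The abstract does not state which variant the authors used, so the pooled-SD form is an assumption, and the scores below are made-up numbers, not trial data.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (one common variant).

    Returns (mean of group_a - mean of group_b) / pooled SD.
    """
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


if __name__ == "__main__":
    # Hypothetical rating-scale scores for illustration only.
    placebo_scores = [20, 24, 28]
    treated_scores = [10, 14, 18]
    print("Cohen's d:", cohens_d(placebo_scores, treated_scores))
```

By convention, values around 0.8 or above are usually described as large effects.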
biorxiv clinical-trials 0-100-users 2017
Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture, bioRxiv, 2017-01-28
Abstract: Recently, Hi-C has been used to probe the 3D chromatin architecture of multiple organisms and cell types. The resulting collections of pairwise contacts across the genome have connected chromatin architecture to many cellular phenomena, including replication timing and gene regulation. However, high-resolution (10 kb or finer) contact maps remain scarce due to the expense and time required for collection. A computational method for predicting pairwise contacts without the need to run a Hi-C experiment would be invaluable in understanding the role that 3D chromatin architecture plays in genome biology. We describe Rambutan, a deep convolutional neural network that predicts Hi-C contacts at 1 kb resolution using nucleotide sequence and DNaseI assay signal as inputs. Specifically, Rambutan identifies locus pairs that engage in high-confidence contacts according to Fit-Hi-C, a previously described method for assigning statistical confidence estimates to Hi-C contacts. We first demonstrate Rambutan’s performance across chromosomes at 1 kb resolution in the GM12878 cell line. Subsequently, we measure Rambutan’s performance across six cell types. In this setting, the model achieves an area under the receiver operating characteristic curve between 0.7662 and 0.8246 and an area under the precision-recall curve between 0.3737 and 0.9008. We further demonstrate that the predicted contacts exhibit expected trends relative to histone modification ChIP-seq data, replication timing measurements, and annotations of functional elements such as promoters and enhancers. Finally, we predict Hi-C contacts for 53 human cell types and show that the predictions cluster by cellular function.
[NOTE: After our original submission we discovered an error in our calling of statistically significant contacts. Briefly, when calculating the prior probability of a contact, we used the number of contacts at a certain genomic distance in a chromosome but divided by the total number of bins in the full genome. When we corrected this mistake we noticed that the Rambutan model, as it currently stands, did not outperform simply using the GM12878 contact map that Rambutan was trained on as the predictor in other cell types. While we investigate these new results, we ask that readers treat this manuscript skeptically.]
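The scope mismatch described in the note can be illustrated with a small sketch. The note only says that a per-chromosome contact count was divided by a genome-wide bin count; it does not give the corrected formula. The chromosome-scoped denominator below (the number of bin pairs at the given distance within the chromosome) is therefore an assumption, as are the function names and all numbers.

```python
def bin_pairs_at_distance(n_bins: int, distance_bins: int) -> int:
    """Number of bin pairs (i, i + distance_bins) within a chromosome of n_bins bins."""
    return max(n_bins - distance_bins, 0)


def contact_prior(contacts_at_distance: int, denominator: int) -> float:
    """Prior probability of a contact as a simple ratio of counts."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return contacts_at_distance / denominator


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    chrom_bins = 1_000       # bins in one chromosome
    genome_bins = 30_000     # bins in the full genome
    distance = 10            # genomic distance, in bins
    contacts = 50            # contacts observed at that distance in the chromosome

    # Numerator and denominator drawn from different scopes (the kind of mismatch described).
    mismatched = contact_prior(contacts, genome_bins)
    # Numerator and denominator both scoped to the chromosome (assumed correction).
    matched = contact_prior(contacts, bin_pairs_at_distance(chrom_bins, distance))

    print("mismatched prior:", mismatched)
    print("chromosome-scoped prior:", matched)
```

Because the genome-wide bin count is much larger than any per-chromosome count, the mismatched ratio understates the prior, which in turn changes which contacts are called statistically significant.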
biorxiv bioinformatics 0-100-users 2017