Scaling computational genomics to millions of individuals with GPUs, bioRxiv, 2018-11-14
Current genomics methods were designed to handle tens to thousands of samples, but will soon need to scale to millions to keep up with the pace of data and hypothesis generation in biomedical science. Moreover, costs associated with processing these growing datasets will become prohibitive without improving the computational efficiency and scalability of methods. Here, we show that recently developed machine-learning libraries (TensorFlow and PyTorch) facilitate implementation of genomics methods for GPUs and significantly accelerate computations. To demonstrate this, we re-implemented methods for two commonly performed computational genomics tasks: QTL mapping and Bayesian non-negative matrix factorization. Our implementations ran >200 times faster than current CPU-based versions, and these analyses are ~5-10 fold cheaper on GPUs due to the vastly shorter runtimes. We anticipate that the accessibility of these libraries and the improvements in runtime will lead to a transition to GPU-based implementations for a wide range of computational genomics methods.
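The abstract above does not spell out why QTL mapping suits GPUs, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the association test is a Pearson correlation between each variant and each phenotype, which reduces to a single matrix product of row-standardized matrices. The function names, matrix sizes, and simulated data are all hypothetical, and NumPy on the CPU stands in for a GPU array library.

```python
import numpy as np


def standardize(x):
    """Center each row to mean 0 and scale it to unit L2 norm."""
    x = x - x.mean(axis=1, keepdims=True)
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def qtl_correlations(genotypes, phenotypes):
    """Pearson r for every (variant, phenotype) pair via one matrix product.

    genotypes:  (n_variants, n_samples) dosage matrix
    phenotypes: (n_phenotypes, n_samples) phenotype matrix
    """
    return standardize(genotypes) @ standardize(phenotypes).T


def t_statistics(r, n_samples):
    """Convert correlations to t-statistics with n_samples - 2 degrees of freedom."""
    dof = n_samples - 2
    return r * np.sqrt(dof / (1.0 - r**2))


# Hypothetical simulated data: 50 variants, 10 phenotypes, 200 samples.
rng = np.random.default_rng(0)
n_samples = 200
genotypes = rng.integers(0, 3, size=(50, n_samples)).astype(float)
phenotypes = rng.normal(size=(10, n_samples))
phenotypes[0] += 0.8 * genotypes[3]  # plant one true association

r = qtl_correlations(genotypes, phenotypes)
t = t_statistics(r, n_samples)
print("top variant for phenotype 0:", int(np.argmax(np.abs(r[:, 0]))))
```

In this formulation the dense matrix product dominates the work, and that is the kind of operation GPU libraries accelerate. On the cost claim, a >200-fold speedup with a 5-10 fold cost reduction would imply that the GPU hourly price was roughly 20-40 times the CPU price; that is an inference from the two quoted figures, not a number stated in the abstract.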
biorxiv genomics 200-500-users 2018
Nanopore native RNA sequencing of a human poly(A) transcriptome, bioRxiv, 2018-11-10
High throughput cDNA sequencing technologies have dramatically advanced our understanding of transcriptome complexity and regulation. However, these methods lose information contained in biological RNA because the copied reads are often short and because modifications are not carried forward in cDNA. We address these limitations using a native poly(A) RNA sequencing strategy developed by Oxford Nanopore Technologies (ONT). Our study focused on poly(A) RNA from the human cell line GM12878, generating 9.9 million aligned sequence reads. These native RNA reads had an aligned N50 length of 1294 bases, and a maximum aligned length of over 21,000 bases. A total of 78,199 high-confidence isoforms were identified by combining long nanopore reads with short higher accuracy Illumina reads. We describe strategies for assessing 3′ poly(A) tail length, base modifications and transcript haplotypes from nanopore RNA data. Together, these nanopore-based techniques are poised to deliver new insights into RNA biology.
DISCLOSURES: MA holds shares in Oxford Nanopore Technologies (ONT). MA is a paid consultant to ONT. REW, WT, TG, JRT, JQ, NJL, JTS, NS, AB, MA, HEO, MJ, and ML received reimbursement for travel, accommodation and conference fees to speak at events organised by ONT. NL has received an honorarium to speak at an ONT company meeting. WT has two patents (8,748,091 and 8,394,584) licensed to Oxford Nanopore. JTS, ML and MA received research funding from ONT.
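The nanopore abstract above reports an aligned N50 length without defining it. As a minimal sketch, assuming the usual definition (the read length at which reads of that length or longer contain at least half of all bases), N50 can be computed as below. The function name and the toy read lengths are hypothetical and are not data from the study.

```python
def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest read down, accumulating bases until half are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


# Hypothetical toy read lengths (30 bases in total).
toy_lengths = [2, 3, 4, 5, 6, 10]
print(n50(toy_lengths))  # prints 6: the reads of length 10 and 6 cover 16 of 30 bases
```

Because the metric is weighted by bases rather than by read count, a handful of very long reads can raise N50 well above the median read length.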
biorxiv genomics 200-500-users 2018
Expanding the CITE-seq tool-kit: Detection of proteins, transcriptomes, clonotypes and CRISPR perturbations with multiplexing, in a single assay, bioRxiv, 2018-11-09
Rapid technological progress in recent years has allowed the high-throughput interrogation of different types of biomolecules from single cells. Combining several of these readouts into integrated multi-omic assays is essential to comprehensively understand and model cellular processes. Here, we report the development of Expanded CRISPR-compatible Cellular Indexing of Transcriptomes and Epitopes by sequencing (ECCITE-seq) for the high-throughput characterization of at least five modalities of information from each single cell: transcriptome, immune receptor clonotypes, surface markers, sample identity and sgRNAs. We demonstrate the use of ECCITE-seq to directly and efficiently capture sgRNA molecules and measure their effects on gene expression and protein levels, opening the possibility of performing high-throughput single-cell CRISPR screens with multimodal readout using existing libraries and commonly used vectors. Finally, by utilizing the combined phenotyping of clonotype and cell surface markers in immune cells, we apply ECCITE-seq to study a lymphoma sample to discriminate cells and define molecular signatures of malignant cells within a heterogeneous population.
biorxiv genomics 100-200-users 2018
A genome-wide algal mutant library reveals a global view of genes required for eukaryotic photosynthesis, bioRxiv, 2018-11-07
Photosynthetic organisms provide food and energy for nearly all life on Earth, yet half of their protein-coding genes remain uncharacterized [1,2]. Characterization of these genes could be greatly accelerated by new genetic resources for unicellular organisms that complement the use of multicellular plants by enabling higher-throughput studies. Here, we generated a genome-wide, indexed library of mapped insertion mutants for the flagship unicellular alga Chlamydomonas reinhardtii (Chlamydomonas hereafter). The 62,389 mutants in the library, covering 83% of nuclear, protein-coding genes, are available to the community. Each mutant contains unique DNA barcodes, allowing the collection to be screened as a pool. We leveraged this feature to perform a genome-wide survey of genes required for photosynthesis, which identified 303 candidate genes. Characterization of one of these genes, the conserved predicted phosphatase CPL3, showed it is important for accumulation of multiple photosynthetic protein complexes. Strikingly, 21 of the 43 highest-confidence genes are novel, opening new opportunities for advances in our understanding of this biogeochemically fundamental process. This library is the first genome-wide mapped mutant resource in any unicellular photosynthetic organism, and will accelerate the characterization of thousands of genes in algae, plants and animals.
biorxiv genomics 0-100-users 2018
Comparative analysis of sequencing technologies platforms for single-cell transcriptomics, bioRxiv, 2018-11-06
All single-cell RNA-seq protocols and technologies require library preparation prior to sequencing on a platform such as Illumina. Here, we present the first report to utilize the BGISEQ-500 platform for scRNA-seq, and compare the sensitivity and accuracy to Illumina sequencing. We generate a scRNA-seq resource of 468 unique single-cells and 1,297 matched single cDNA samples, performing SMARTer and Smart-seq2 protocols on mESCs and K562 cells with RNA spike-ins. We sequence these libraries on both BGISEQ-500 and Illumina HiSeq platforms using single- and paired-end reads. The two platforms have comparable sensitivity and accuracy in terms of quantification of gene expression, and low technical variability. Our study provides a standardised scRNA-seq resource to benchmark new scRNA-seq library preparation protocols and sequencing platforms.
biorxiv genomics 0-100-users 2018
Systematic identification of human SNPs affecting regulatory element activity, bioRxiv, 2018-11-04
Most of the millions of single-nucleotide polymorphisms (SNPs) in the human genome are non-coding, and many overlap with putative regulatory elements. Genome-wide association studies have linked many of these SNPs to human traits or to gene expression levels, but rarely with sufficient resolution to identify the causal SNPs. Functional screens based on reporter assays have previously been of insufficient throughput to test the vast space of SNPs for possible effects on enhancer and promoter activity. Here, we have leveraged the throughput of the SuRE reporter technology to survey a total of 5.9 million SNPs, including 57% of the known common SNPs. We identified more than 30 thousand SNPs that alter the activity of putative regulatory elements, often in a cell-type specific manner. These data indicate that a large proportion of human non-coding SNPs may affect gene regulation. Integration of these SuRE data with genome-wide association studies may help pinpoint SNPs that underlie human traits.
biorxiv genomics 100-200-users 2018