Creating a universal SNP and small indel variant caller with deep neural networks, bioRxiv, 2016-12-15

Abstract: Next-generation sequencing (NGS) is a rapidly evolving set of technologies that can be used to determine the sequence of an individual’s genome [1] by calling genetic variants present in an individual using billions of short, errorful sequence reads [2]. Despite more than a decade of effort and thousands of dedicated researchers, the hand-crafted and parameterized statistical models used for variant calling still produce thousands of errors and missed variants in each genome [3,4]. Here we show that a deep convolutional neural network [5] can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships (likelihoods) between images of read pileups around putative variant sites and ground-truth genotype calls. This approach, called DeepVariant, outperforms existing tools, even winning the “highest performance” award for SNPs in an FDA-administered variant calling challenge. The learned model generalizes across genome builds and even to other mammalian species, allowing non-human sequencing projects to benefit from the wealth of human ground truth data. We further show that, unlike existing tools, which perform well on only a specific technology, DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, from deep whole genomes from 10X Genomics to Ion Ampliseq exomes. DeepVariant represents a significant step from expert-driven statistical modeling towards more automatic deep learning approaches for developing software to interpret biological instrumentation data.

biorxiv genomics 200-500-users 2016

Genetic determinants of chromatin accessibility in T cell activation across humans, bioRxiv, 2016-12-03

Abstract: Over 90% of genetic variants associated with complex human traits map to non-coding regions, but little is understood about how they modulate gene regulation in health and disease. One possible mechanism is that genetic variants affect the activity of one or more cis-regulatory elements, leading to gene expression variation in specific cell types. To identify such cases, we analyzed Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq) and RNA-seq profiles from activated CD4+ T cells of up to 105 healthy donors. We found that regions of accessible chromatin (ATAC-peaks) are co-accessible at kilobase and megabase resolution, in patterns consistent with the 3D organization of chromosomes measured by in situ Hi-C in T cells. 15% of genetic variants located within ATAC-peaks affected the accessibility of the corresponding peak through disrupting binding sites for transcription factors important for T cell differentiation and activation. These ATAC quantitative trait nucleotides (ATAC-QTNs) have the largest effects on co-accessible peaks, are associated with gene expression from the same aliquot of cells, rarely affect core binding motifs, and are enriched for autoimmune disease variants. Our results provide insights into how natural genetic variants modulate cis-regulatory elements, in isolation or in concert, to influence gene expression in primary immune cells that play a key role in many human diseases.

biorxiv genomics 100-200-users 2016

Nanopore DNA Sequencing and Genome Assembly on the International Space Station, bioRxiv, 2016-09-28

Abstract: The emergence of nanopore-based sequencers greatly expands the reach of sequencing into low-resource field environments, enabling in situ molecular analysis. In this work, we evaluated the performance of the MinION DNA sequencer (Oxford Nanopore Technologies) in-flight on the International Space Station (ISS), and benchmarked its performance off-Earth against the MinION, Illumina MiSeq, and PacBio RS II sequencing platforms in terrestrial laboratories. Samples contained mixtures of genomic DNA extracted from lambda bacteriophage, Escherichia coli (strain K12) and Mus musculus (BALB/c). The in-flight sequencing experiments generated more than 80,000 total reads with mean 2D accuracies of 85–90%, mean 1D accuracies of 75–80%, and median read lengths of approximately 6,000 bases. We were able to construct directed assemblies of the ~4.7 Mb E. coli genome, ~48.5 kb lambda genome, and a representative M. musculus sequence (the ~16.3 kb mitochondrial genome), at 100%, 100%, and 96.7% pairwise identity, respectively, and de novo assemblies of the lambda and E. coli genomes generated solely from nanopore reads yielded 100% and 99.8% genome coverage, respectively, at 100% and 98.5% pairwise identity. Across all surveyed metrics (base quality, throughput, stays/base, skips/base), no decrease in MinION performance was observed while sequencing DNA in space. Simulated runs of in-flight nanopore data using an automated bioinformatic pipeline and cloud or laptop based genomic assembly demonstrated the feasibility of real-time sequencing analysis and direct microbial identification in space. Applications of sequencing for space exploration include infectious disease diagnosis, environmental monitoring, evaluating biological responses to spaceflight, and even potentially the detection of extraterrestrial life on other planetary bodies.

biorxiv genomics 100-200-users 2016

 

Created with the audiences framework by Jedidiah Carlson
