Direct determination of diploid genome sequences, bioRxiv, 2016-08-20
Determining the genome sequence of an organism is challenging, yet fundamental to understanding its biology. Over the past decade, thousands of human genomes have been sequenced, contributing deeply to biomedical research. In the vast majority of cases, these have been analyzed by aligning sequence reads to a single reference genome, biasing the resulting analyses and, in general, failing to capture sequences novel to a given genome. Some de novo assemblies have been constructed, free of reference bias, but nearly all were constructed by merging homologous loci into single ‘consensus’ sequences, generally absent from nature. These assemblies do not correctly represent the diploid biology of an individual. In exactly two cases, true diploid de novo assemblies have been made, at great expense. One was generated using Sanger sequencing and one using thousands of clone pools. Here we demonstrate a straightforward and low-cost method for creating true diploid de novo assemblies. We make a single library from ~1 ng of high molecular weight DNA, using the 10x Genomics microfluidic platform to partition the genome. We applied this technique to seven human samples, generating low-cost HiSeq X data, then assembled these using a new ‘pushbutton’ algorithm, Supernova. Each computation took two days on a single server. Each yielded contigs longer than 100 kb, phase blocks longer than 2.5 Mb, and scaffolds longer than 15 Mb. Our method provides a scalable capability for determining the actual diploid genome sequence in a sample, opening the door to new approaches in genomic biology and medicine.
biorxiv genomics 0-100-users 2016
DNA damage is a major cause of sequencing errors, directly confounding variant identification, bioRxiv, 2016-08-20
Pervasive mutations in somatic cells generate a heterogeneous genomic population within an organism and may result in serious medical conditions. While cancer is the most studied disease associated with somatic variations, recent advances in single-cell and ultra-deep sequencing indicate that a number of phenotypes and pathologies are impacted by cell-specific variants. Currently, the accurate identification of low allelic frequency somatic variants relies on a combination of deep sequencing coverage and multiple lines of evidence for the presence of variants. However, in this study we show that false positive variants can account for more than 70% of identified somatic variations, rendering conventional detection methods inadequate for accurate determination of low allelic frequency variants. Interestingly, these false positive variants primarily originate from mutagenic DNA damage, which directly confounds determination of genuine somatic mutations. Furthermore, we developed and validated a simple metric to measure mutagenic DNA damage, and demonstrated that mutagenic DNA damage is the leading cause of sequencing errors in widely used resources including the 1000 Genomes Project and The Cancer Genome Atlas.
biorxiv genomics 0-100-users 2016
Phenome-wide Heritability Analysis of the UK Biobank, bioRxiv, 2016-08-19
Heritability estimation provides important information about the relative contribution of genetic and environmental factors to phenotypic variation, and provides an upper bound for the utility of genetic risk prediction models. Recent technological and statistical advances have enabled the estimation of additive heritability attributable to common genetic variants (SNP heritability) across a broad phenotypic spectrum. However, assessing the comparative heritability of multiple traits estimated in different cohorts may be misleading due to the population-specific nature of heritability. Here we report the SNP heritability for 551 complex traits derived from the large-scale, population-based UK Biobank, comprising both quantitative phenotypes and disease codes, and examine the moderating effect of three major demographic variables (age, sex and socioeconomic status) on the heritability estimates. Our study represents the first comprehensive phenome-wide heritability analysis in the UK Biobank, and underscores the importance of considering population characteristics in comparing and interpreting heritability.
biorxiv genetics 100-200-users 2016
Highly parallel direct RNA sequencing on an array of nanopores, bioRxiv, 2016-08-13
Ribonucleic acid sequencing can allow us to monitor the RNAs present in a sample. This enables us to detect the presence and nucleotide sequence of viruses, or to build a picture of how active transcriptional processes are changing – information that is useful for understanding the status and function of a sample. Oxford Nanopore Technologies’ sequencing technology is capable of electronically analysing a sample’s DNA directly, and in real-time. In this manuscript we demonstrate the ability of an array of nanopores to sequence RNA directly, and we apply it to a range of biological situations. Nanopore technology is the only available sequencing technology that can sequence RNA directly, rather than depending on reverse transcription and PCR. There are several potential advantages of this approach over other RNA-seq strategies, including the absence of amplification and reverse transcription biases, the ability to detect nucleotide analogues and the ability to generate full-length, strand-specific RNA sequences. Direct RNA sequencing is a completely new way of analysing the sequence of RNA samples and it will improve the ease and speed of RNA analysis, while yielding richer biological information.
biorxiv genomics 100-200-users 2016
recount: A large-scale resource of analysis-ready RNA-seq expression data, bioRxiv, 2016-08-09
recount is a resource of processed and summarized expression data spanning nearly 60,000 human RNA-seq samples from the Sequence Read Archive (SRA). The associated recount Bioconductor package provides a convenient API for querying, downloading, and analyzing the data. Each processed study consists of meta/phenotype data, the expression levels of genes and their underlying exons and splice junctions, and corresponding genomic annotation. We also provide data summarization types for quantifying novel transcribed sequence including base-resolution coverage and potentially unannotated splice junctions. We present workflows illustrating how to use recount to perform differential expression analysis including meta-analysis, annotation-free base-level analysis, and replication of smaller studies using data from larger studies. recount provides a valuable and user-friendly resource of processed RNA-seq datasets to draw additional biological insights from existing public data. The resource is available at https://jhubiostatistics.shinyapps.io/recount.
biorxiv genomics 100-200-users 2016
Cytoarchitectonic similarity is a wiring principle of the human connectome, bioRxiv, 2016-08-07
Understanding the wiring diagram of the human cerebral cortex is a fundamental challenge in neuroscience. Elemental aspects of its organization remain elusive. Here we examine which structural traits of cortical regions, particularly their cytoarchitecture and thickness, relate to the existence and strength of inter-regional connections. We use the architecture data from the classic work of von Economo and Koskinas and state-of-the-art diffusion-based connectivity data from the Human Connectome Project. Our results reveal a prominent role of the cytoarchitectonic similarity of supragranular layers for predicting the existence and strength of connections. In contrast, cortical thickness similarity was not related to the existence or strength of connections. These results are in line with findings for non-human mammalian cerebral cortices, suggesting overarching wiring principles of the mammalian cerebral cortex. The results invite hypotheses about evolutionarily conserved neurobiological mechanisms that give rise to the relation of cytoarchitecture and connectivity in the human cerebral cortex.
biorxiv neuroscience 100-200-users 2016