Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage, bioRxiv, 2015-10-17
Abstract: Genome assemblies that are accurate, complete, and contiguous are essential for identifying important structural and functional elements of genomes and for identifying genetic variation. Nevertheless, most recent genome assemblies remain incomplete and fragmented. While long molecule sequencing promises to deliver more complete genome assemblies with fewer gaps, concerns about error rates, low yields, stringent DNA requirements, and uncertainty about best practices may discourage many investigators from adopting this technology. Here, in conjunction with the platinum standard Drosophila melanogaster reference genome, we analyze recently published long molecule sequencing data to identify what governs completeness and contiguity of genome assemblies. We also present a hybrid meta-assembly approach that achieves remarkable assembly contiguity for both Drosophila and human assemblies with only modest long molecule sequencing coverage. Our results motivate a set of preliminary best practices for obtaining accurate and contiguous assemblies, a “missing manual” that guides key decisions in building high quality de novo genome assemblies, from DNA isolation to polishing the assembly.
biorxiv genomics 100-200-users 2015
TP53 copy number expansion is associated with the evolution of increased body size and an enhanced DNA damage response in elephants, bioRxiv, 2015-10-07
Summary: A major constraint on the evolution of large body sizes in animals is an increased risk of developing cancer. There is no correlation, however, between body size and cancer risk. This lack of correlation is often referred to as ‘Peto’s Paradox’. Here we show that the elephant genome encodes 20 copies of the tumor suppressor gene TP53 and that the increase in TP53 copy number occurred coincident with the evolution of large body sizes, the evolution of extreme sensitivity to genotoxic stress, and a hyperactive TP53 signaling pathway in the elephant (Proboscidean) lineage. Furthermore, we show that several of the TP53 retrogenes (TP53RTGs) are transcribed and likely translated. While TP53RTGs do not appear to function directly as transcription factors, they do contribute to the enhanced sensitivity of elephant cells to DNA damage and the induction of apoptosis by regulating the activity of the TP53 signaling pathway. These results suggest that an increase in the copy number of TP53 may have played a direct role in the evolution of very large body sizes and the resolution of Peto’s Paradox in Proboscideans.
biorxiv genetics 200-500-users 2015
Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks, bioRxiv, 2015-10-06
Abstract: The complex language of eukaryotic gene expression remains incompletely understood. Despite the importance suggested by many noncoding variants statistically associated with human disease, nearly all such variants have unknown mechanism. Here, we address this challenge using an approach based on a recent machine learning advance, deep convolutional neural networks (CNNs). We introduce an open source package Basset (https://github.com/davek44/Basset) to apply CNNs to learn the functional activity of DNA sequences from genomics data. We trained Basset on a compendium of accessible genomic sites mapped in 164 cell types by DNaseI-seq and demonstrate far greater predictive accuracy than previous methods. Basset predictions for the change in accessibility between variant alleles were far greater for GWAS SNPs that are likely to be causal relative to nearby SNPs in linkage disequilibrium with them. With Basset, a researcher can perform a single sequencing assay in their cell type of interest and simultaneously learn that cell’s chromatin accessibility code and annotate every mutation in the genome with its influence on present accessibility and latent potential for accessibility. Thus, Basset offers a powerful computational approach to annotate and interpret the noncoding genome.
biorxiv genomics 0-100-users 2015
A genome-wide resource for the analysis of protein localisation in Drosophila, bioRxiv, 2015-10-05
The Drosophila genome contains >13,000 protein coding genes, the majority of which remain poorly investigated. Important reasons include the lack of antibodies or reporter constructs to visualise these proteins. Here we present a genome-wide fosmid library of ≈10,000 GFP-tagged clones, comprising tagged genes and most of their regulatory information. For 880 tagged proteins we have created transgenic lines and for a total of 207 lines we have assessed protein expression and localisation in ovaries, embryos, pupae or adults by stainings and live imaging approaches. Importantly, we can visualise many proteins at endogenous expression levels and find a large fraction of them localising to subcellular compartments. Using complementation tests we demonstrate that two-thirds of the tagged proteins are fully functional. Moreover, our clones enable interaction proteomics from developing pupae and adult flies. Taken together, this resource will enable systematic analysis of protein expression and localisation in various cellular and developmental contexts.
biorxiv genomics 0-100-users 2015
Investigation of the cellular reprogramming phenomenon referred to as stimulus-triggered acquisition of pluripotency (STAP), bioRxiv, 2015-09-29
In January 2014, it was reported that strong external stimuli, such as a transient low-pH stressor, were capable of inducing the reprogramming of mammalian somatic cells, resulting in the generation of pluripotent cells (Obokata et al. 2014a, b). This cellular reprogramming event was designated 'stimulus-triggered acquisition of pluripotency' (STAP) by the authors of these reports. However, after multiple instances of scientific misconduct in the handling and presentation of the data were brought to light, both reports were retracted. To investigate the actual scientific significance of the purported STAP phenomenon, we sought to repeat the original experiments based on the methods presented in the retracted manuscripts and other relevant information. As a result, we have concluded that the STAP phenomenon as described in the original studies is not reproducible.
biorxiv cell-biology 0-100-users 2015
Extensive sequencing of seven human genomes to characterize benchmark reference materials, bioRxiv, 2015-09-16
The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.
biorxiv genomics 100-200-users 2015