Text mining of 15 million full-text scientific articles, bioRxiv, 2017-07-12
Abstract: Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
biorxiv bioinformatics 100-200-users 2017
The evolutionary history of 2,658 cancers, bioRxiv, 2017-07-12
Summary: Cancer develops through a process of somatic evolution. Here, we use whole-genome sequencing of 2,778 tumour samples from 2,658 donors to reconstruct the life history, evolution of mutational processes, and driver mutation sequences of 39 cancer types. The early phases of oncogenesis are driven by point mutations in a small set of driver genes, often including biallelic inactivation of tumour suppressors. Early oncogenesis is also characterised by specific copy number gains, such as trisomy 7 in glioblastoma or isochromosome 17q in medulloblastoma. By contrast, increased genomic instability, a nearly four-fold diversification of driver genes, and an acceleration of point mutation processes are features of later stages. Copy-number alterations often occur in mitotic crises leading to simultaneous gains of multiple chromosomal segments. Timing analysis suggests that driver mutations often precede diagnosis by many years, and in some cases decades, providing a window of opportunity for early cancer detection.
biorxiv cancer-biology 200-500-users 2017
Sequential regulatory activity prediction across chromosomes with convolutional neural networks, bioRxiv, 2017-07-11
Abstract: Models for predicting phenotypic outcomes from genotypes have important applications to understanding genomic function and improving human health. Here, we develop a machine-learning system to predict cell type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. Using convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. We show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable fine mapping of disease loci.
biorxiv genomics 0-100-users 2017
Speed breeding: a powerful tool to accelerate crop research and breeding, bioRxiv, 2017-07-10
The growing human population and a changing environment have raised significant concern for global food security, with the current improvement rate of several important crops inadequate to meet future demand [1]. This slow improvement rate is attributed partly to the long generation times of crop plants. Here we present a method called ‘speed breeding’, which greatly shortens generation time and accelerates breeding and research programs. Speed breeding can be used to achieve up to 6 generations per year for spring wheat (Triticum aestivum), durum wheat (T. durum), barley (Hordeum vulgare), chickpea (Cicer arietinum), and pea (Pisum sativum) and 4 generations for canola (Brassica napus), instead of 2-3 under normal glasshouse conditions. We demonstrate that speed breeding in fully-enclosed controlled-environment growth chambers can accelerate plant development for research purposes, including phenotyping of adult plant traits, mutant studies, and transformation. The use of supplemental lighting in a glasshouse environment allows rapid generation cycling through single seed descent and potential for adaptation to larger-scale crop improvement programs. Cost-saving through LED supplemental lighting is also outlined. We envisage great potential for integrating speed breeding with other modern crop breeding technologies, including high-throughput genotyping, genome editing, and genomic selection, accelerating the rate of crop improvement.
biorxiv plant-biology 200-500-users 2017
Enhanced proofreading governs CRISPR-Cas9 targeting accuracy, bioRxiv, 2017-07-07
The RNA-guided CRISPR-Cas9 nuclease from Streptococcus pyogenes (SpCas9) has been widely repurposed for genome editing [1-4]. High-fidelity (SpCas9-HF1) and enhanced specificity (eSpCas9(1.1)) variants exhibit substantially reduced off-target cleavage in human cells, but the mechanism of target discrimination and the potential to further improve fidelity were unknown [5-9]. Using single-molecule Förster resonance energy transfer (smFRET) experiments, we show that both SpCas9-HF1 and eSpCas9(1.1) are trapped in an inactive state [10] when bound to mismatched targets. We find that a non-catalytic domain within Cas9, REC3, recognizes target mismatches and governs the HNH nuclease to regulate overall catalytic competence. Exploiting this observation, we identified residues within REC3 involved in mismatch sensing and designed a new hyper-accurate Cas9 variant (HypaCas9) that retains robust on-target activity in human cells. These results offer a more comprehensive model to rationalize and modify the balance between target recognition and nuclease activation for precision genome editing.
biorxiv biochemistry 100-200-users 2017
Robust and Bright Genetically Encoded Fluorescent Markers for Highlighting Structures and Compartments in Mammalian Cells, bioRxiv, 2017-07-07
To increase our understanding of cells, there is a need for specific markers to identify biomolecules, cellular structures and compartments. One type of marker comprises genetically encoded fluorescent probes that are linked with protein domains, peptides and/or signal sequences. These markers are encoded on a plasmid, and they allow straightforward, convenient labeling of cultured mammalian cells by introducing the plasmid into the cells. Ideally, the fluorescent marker combines favorable spectroscopic properties (brightness, photostability) with specific labeling of the structure or compartment of interest. Here, we report on our ongoing efforts to generate robust and bright genetically encoded fluorescent markers for highlighting structures and compartments in living cells.
biorxiv cell-biology 200-500-users 2017