Deep Convolutional Neural Networks for Breast Cancer Histology Image Analysis, bioRxiv, 2018-02-08
Abstract: Breast cancer is one of the main causes of cancer death worldwide. Early diagnosis significantly increases the chances of correct treatment and survival, but the process is tedious and often leads to disagreement between pathologists. Computer-aided diagnosis systems have shown potential for improving diagnostic accuracy. In this work, we develop a computational approach based on deep convolutional neural networks for breast cancer histology image classification. The hematoxylin and eosin stained breast histology microscopy image dataset is provided as part of the ICIAR 2018 Grand Challenge on Breast Cancer Histology Images. Our approach utilizes several deep neural network architectures and a gradient boosted trees classifier. For the 4-class classification task, we report 87.2% accuracy. For the 2-class classification task to detect carcinomas, we report 93.8% accuracy, AUC 97.3%, and sensitivity/specificity 96.5/88.0% at the high-sensitivity operating point. To our knowledge, this approach outperforms other common methods in automated histopathological image classification. The source code for our approach is made publicly available at https://github.com/alexander-rakhlin/ICIAR2018
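The sensitivity/specificity pair quoted at a "high-sensitivity operating point" can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy labels and scores, and the rule for picking the threshold (the largest threshold that still reaches a target sensitivity) are all assumptions made for illustration.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Return (sensitivity, specificity) when score >= threshold means 'carcinoma'."""
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def high_sensitivity_threshold(y_true, scores, min_sensitivity):
    """Largest observed score usable as a threshold with sensitivity >= min_sensitivity.

    Hypothetical selection rule: sensitivity can only grow as the threshold
    drops, so the first threshold (scanning downward) that meets the target
    is the largest one that does.
    """
    for threshold in sorted(set(scores), reverse=True):
        sens, _ = sensitivity_specificity(y_true, scores, threshold)
        if sens >= min_sensitivity:
            return threshold
    raise ValueError("no threshold reaches the requested sensitivity")


# Toy data (invented): 1 = carcinoma, 0 = non-carcinoma.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

threshold = high_sensitivity_threshold(y_true, scores, min_sensitivity=1.0)
sens, spec = sensitivity_specificity(y_true, scores, threshold)
print(threshold, sens, spec)
```

Lowering the threshold trades specificity for sensitivity, which is why a high-sensitivity operating point typically reports a lower specificity than a balanced one.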
biorxiv pathology 100-200-users 2018
Integrating Hi-C links with assembly graphs for chromosome-scale assembly, bioRxiv, 2018-02-08
Abstract: Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, few open-source tools are available. Errors, particularly inversions and fusions across chromosomes, remain more frequent than with alternative scaffolding technologies. We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA
Author summary: Hi-C technology was originally proposed to study the 3D organization of a genome. Recently, it has also been applied to assemble large eukaryotic genomes into chromosome-scale scaffolds. Despite this, there are few open-source methods to generate these assemblies. Existing methods are also prone to small inversion errors due to noise in the Hi-C data. In this work, we address these challenges and develop a method named SALSA2. SALSA2 uses sequence overlap information from an assembly graph to correct inversion errors and provide accurate chromosome-scale assemblies.
biorxiv bioinformatics 100-200-users 2018
A proposal for a standardized bacterial taxonomy based on genome phylogeny, bioRxiv, 2018-01-31
Abstract: Taxonomy is a fundamental organizing principle of biology, which ideally should be based on evolutionary relationships. Microbial taxonomy has been greatly restricted by the inability to obtain most microorganisms in pure culture and, to a lesser degree, the historical use of phenotypic properties as the basis for classification. However, we are now at the point of obtaining genome sequences broadly representative of microbial diversity by using culture-independent techniques, which provide the opportunity to develop a comprehensive genome-based taxonomy. Here we propose a standardized bacterial taxonomy based on a concatenated protein phylogeny that conservatively removes polyphyletic groups and normalizes ranks based on relative evolutionary divergence. From 94,759 bacterial genomes, 99 phyla are described, including six major normalized monophyletic units from the subdivision of the Proteobacteria and the amalgamation of the Candidate Phyla Radiation into the single phylum Patescibacteria. In total, 73% of taxa had one or more changes to their existing taxonomy.
biorxiv microbiology 200-500-users 2018
Genome-wide Analysis of Insomnia (N=1,331,010) Identifies Novel Loci and Functional Pathways, bioRxiv, 2018-01-31
Abstract: Insomnia is the second-most prevalent mental disorder, with no sufficient treatment available. Despite a substantial role of genetic factors, only a handful of genes have been implicated, and insight into the associated neurobiological pathways remains limited. Here, we use an unprecedentedly large genetic association sample (N=1,331,010) to allow detection of a substantial number of genetic variants and gain insight into the biological functions, cell types and tissues involved in insomnia complaints. We identify 202 genome-wide significant loci implicating 956 genes through positional, eQTL and chromatin interaction mapping. We show involvement of the axonal part of neurons, of specific cortical and subcortical tissues, and of two specific cell types in insomnia: striatal medium spiny neurons and hypothalamic neurons. These cell types have been implicated previously in the regulation of reward processing, sleep and arousal in animal studies, but have never been genetically linked to insomnia in humans. We found weak genetic correlations with other sleep-related traits, but strong genetic correlations with psychiatric and metabolic traits. Mendelian randomization identified causal effects of insomnia on specific psychiatric and metabolic traits. Our findings reveal key brain areas and cells implicated in the neurobiology of insomnia and its related disorders, and provide novel targets for treatment.
biorxiv genetics 100-200-users 2018
On the design of CRISPR-based single cell molecular screens, bioRxiv, 2018-01-30
Abstract: Several groups recently reported coupling CRISPR/Cas9 perturbations and single cell RNA-seq as a potentially powerful approach for forward genetics. Here we demonstrate that vector designs for such screens that rely on cis linkage of guides and distally located barcodes suffer from swapping of intended guide-barcode associations at rates approaching 50% due to template switching during lentivirus production, greatly reducing sensitivity. We optimize a published strategy, CROP-seq, that instead uses a Pol II transcribed copy of the sgRNA sequence itself, doubling the rate at which guides are assigned to cells to 94%. We confirm this strategy performs robustly and further explore experimental best practices for CRISPR/Cas9-based single cell molecular screens.
biorxiv genomics 100-200-users 2018
The Juicebox Assembly Tools module facilitates de novo assembly of mammalian genomes with chromosome-length scaffolds for under $1000, bioRxiv, 2018-01-29
Hi-C contact maps are valuable for genome assembly (Lieberman-Aiden, van Berkum et al. 2009; Burton et al. 2013; Dudchenko et al. 2017). Recently, we developed Juicebox, a system for the visual exploration of Hi-C data (Durand, Robinson et al. 2016), and 3D-DNA, an automated pipeline for using Hi-C data to assemble genomes (Dudchenko et al. 2017). Here, we introduce “Assembly Tools,” a new module for Juicebox, which provides a point-and-click interface for using Hi-C heatmaps to identify and correct errors in a genome assembly. Together, 3D-DNA and the Juicebox Assembly Tools greatly reduce the cost of accurately assembling complex eukaryotic genomes. To illustrate, we generated de novo assemblies with chromosome-length scaffolds for three mammals: the wombat, Vombatus ursinus (3.3 Gb), the Virginia opossum, Didelphis virginiana (3.3 Gb), and the raccoon, Procyon lotor (2.5 Gb). The only inputs for each assembly were Illumina reads from a short-insert DNA-Seq library (300 million Illumina reads, maximum read length 2x150 bases) and an in situ Hi-C library (100 million Illumina reads, maximum read length 2x150 bases), which together cost <$1000.
biorxiv genomics 100-200-users 2018