Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome, bioRxiv, 2018-03-01
Abstract
Motivation: Identifying transcription factor binding sites is the first step in pinpointing non-coding mutations that disrupt the regulatory function of transcription factors and promote disease. ChIP-seq is the most common method for identifying binding sites, but performing it on patient samples is hampered by the amount of available biological material and the cost of the experiment. Existing methods for computational prediction of regulatory elements primarily predict binding in genomic regions with sequence similarity to known transcription factor sequence preferences. This has limited efficacy since most binding sites do not resemble known transcription factor sequence motifs, and many transcription factors are not even sequence-specific.
Results: We developed Virtual ChIP-seq, which predicts binding of individual transcription factors in new cell types using an artificial neural network that integrates ChIP-seq results from other cell types and chromatin accessibility data in the new cell type. Virtual ChIP-seq also uses learned associations between gene expression and transcription factor binding at specific genomic regions. This approach outperforms methods that predict TF binding solely based on sequence preference, predicting binding for 36 transcription factors (Matthews correlation coefficient > 0.3).
Availability: The datasets we used for training and validation are available at https://virchip.hoffmanlab.org. We have deposited in Zenodo the current version of our software (http://doi.org/10.5281/zenodo.1066928), datasets (http://doi.org/10.5281/zenodo.823297), predictions for 36 transcription factors on Roadmap Epigenomics cell types (http://doi.org/10.5281/zenodo.1455759), and predictions for Cistrome as well as the ENCODE-DREAM in vivo TF Binding Site Prediction Challenge (http://doi.org/10.5281/zenodo.1209308).
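The abstract above reports performance as a Matthews correlation coefficient (MCC) threshold of 0.3. As a rough illustration of what that metric measures, the sketch below computes MCC from binary bound/unbound labels over genomic bins. The function names and the toy labels are hypothetical and are not taken from the Virtual ChIP-seq software; returning 0.0 when the denominator is zero is a common convention assumed here, not something the abstract specifies.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for two equal-length sequences of 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when any marginal total is zero (assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy data: 1 = TF bound in a genomic bin, 0 = unbound.
observed = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

counts = confusion_counts(observed, predicted)
mcc = matthews_corrcoef(*counts)
print(f"TP, TN, FP, FN = {counts}; MCC = {mcc:.3f}")
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which is why it is often preferred over accuracy when bound bins are much rarer than unbound ones.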
biorxiv bioinformatics 200-500-users 2018
End-to-end deep image reconstruction from human brain activity, bioRxiv, 2018-02-28
Abstract: Deep neural networks (DNNs) have recently been applied successfully to brain decoding and image reconstruction from functional magnetic resonance imaging (fMRI) activity. However, direct training of a DNN with fMRI data is often avoided because the size of available data is thought to be insufficient to train a complex network with numerous parameters. Instead, a pre-trained DNN has served as a proxy for hierarchical visual representations, and fMRI data were used to decode individual DNN features of a stimulus image using a simple linear model; the decoded features were then passed to a reconstruction module. Here, we present our attempt to directly train a DNN model with fMRI data and the corresponding stimulus images to build an end-to-end reconstruction model. We trained a generative adversarial network with an additional loss term defined in a high-level feature space (feature loss) using up to 6,000 training data points (natural images and the corresponding fMRI responses). The trained deep generator network was tested on an independent dataset, directly producing a reconstructed image given an fMRI pattern as the input. The reconstructions obtained from the proposed method resembled both natural and artificial test stimuli. The accuracy increased as a function of the training data size, though it did not outperform the decoded feature-based method with the available data size. Ablation analyses indicated that the feature loss played a critical role in achieving accurate reconstruction. Our results suggest a potential for the end-to-end framework to learn a direct mapping between brain activity and perception given even larger datasets.
biorxiv neuroscience 200-500-users 2018
Population Replacement in Early Neolithic Britain, bioRxiv, 2018-02-19
The roles of migration, admixture and acculturation in the European transition to farming have been debated for over 100 years. Genome-wide ancient DNA studies indicate predominantly Anatolian ancestry for continental Neolithic farmers, but also variable admixture with local Mesolithic hunter-gatherers [1–9]. Neolithic cultures first appear in Britain c. 6000 years ago (kBP), a millennium after they appear in adjacent areas of northwestern continental Europe. However, the pattern and process of the British Neolithic transition remain unclear [10–15]. We assembled genome-wide data from six Mesolithic and 67 Neolithic individuals found in Britain, dating from 10.5–4.5 kBP, a dataset that includes 22 newly reported individuals and the first genomic data from British Mesolithic hunter-gatherers. Our analyses reveal persistent genetic affinities between Mesolithic British and Western European hunter-gatherers over a period spanning Britain’s separation from continental Europe. We find overwhelming support for agriculture being introduced by incoming continental farmers, with small and geographically structured levels of additional hunter-gatherer introgression. We find genetic affinity between British and Iberian Neolithic populations, indicating that British Neolithic people derived much of their ancestry from Anatolian farmers who originally followed the Mediterranean route of dispersal and likely entered Britain from northwestern mainland Europe.
biorxiv evolutionary-biology 200-500-users 2018
End-to-end differentiable learning of protein structure, bioRxiv, 2018-02-15
Abstract: Predicting protein structure from sequence is a central challenge of biochemistry. Co‐evolution methods show promise, but an explicit sequence‐to‐structure map remains elusive. Advances in deep learning that replace complex, human‐designed pipelines with differentiable models optimized end‐to‐end suggest the potential benefits of similarly reformulating structure prediction. Here we report the first end‐to‐end differentiable model of protein structure. The model couples local and global protein structure via geometric units that optimize global geometry without violating local covalent chemistry. We test our model using two challenging tasks: predicting novel folds without co‐evolutionary data and predicting known folds without structural templates. In the first task the model achieves state‐of‐the‐art accuracy, and in the second it comes within 1‐2 Å. Competing methods using co‐evolution and experimental templates have been refined over many years, and it is likely that the differentiable approach has substantial room for further improvement, with applications ranging from drug discovery to protein design.
biorxiv bioinformatics 200-500-users 2018
Genome-wide association analyses of risk tolerance and risky behaviors in over 1 million individuals identify hundreds of loci and shared genetic influences, bioRxiv, 2018-02-12
Abstract: Humans vary substantially in their willingness to take risks. In a combined sample of over one million individuals, we conducted genome-wide association studies (GWAS) of general risk tolerance, adventurousness, and risky behaviors in the driving, drinking, smoking, and sexual domains. We identified 611 approximately independent genetic loci associated with at least one of our phenotypes, including 124 with general risk tolerance. We report evidence of substantial shared genetic influences across general risk tolerance and risky behaviors: 72 of the 124 general risk tolerance loci contain a lead SNP for at least one of our other GWAS, and general risk tolerance is moderately to strongly genetically correlated (genetic correlations of up to 0.50) with a range of risky behaviors. Bioinformatics analyses imply that genes near general-risk-tolerance-associated SNPs are highly expressed in brain tissues and point to a role for glutamatergic and GABAergic neurotransmission. We find no evidence of enrichment for genes previously hypothesized to relate to risk tolerance.
biorxiv genetics 200-500-users 2018
A proposal for a standardized bacterial taxonomy based on genome phylogeny, bioRxiv, 2018-01-31
Abstract: Taxonomy is a fundamental organizing principle of biology, which ideally should be based on evolutionary relationships. Microbial taxonomy has been greatly restricted by the inability to obtain most microorganisms in pure culture and, to a lesser degree, the historical use of phenotypic properties as the basis for classification. However, we are now at the point of obtaining genome sequences broadly representative of microbial diversity by using culture-independent techniques, which provide the opportunity to develop a comprehensive genome-based taxonomy. Here we propose a standardized bacterial taxonomy based on a concatenated protein phylogeny that conservatively removes polyphyletic groups and normalizes ranks based on relative evolutionary divergence. From 94,759 bacterial genomes, 99 phyla are described, including six major normalized monophyletic units from the subdivision of the Proteobacteria, and amalgamation of the Candidate Phyla Radiation into the single phylum Patescibacteria. In total, 73% of taxa had one or more changes to their existing taxonomy.
biorxiv microbiology 200-500-users 2018