biorxiv | audiences

Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome, bioRxiv, 2018-03-01

Abstract. Motivation: Identifying transcription factor binding sites is the first step in pinpointing non-coding mutations that disrupt the regulatory function of transcription factors and promote disease. ChIP-seq is the most common method for identifying binding sites, but performing it on patient samples is hampered by the amount of available biological material and the cost of the experiment. Existing methods for computational prediction of regulatory elements primarily predict binding in genomic regions with sequence similarity to known transcription factor sequence preferences. This has limited efficacy since most binding sites do not resemble known transcription factor sequence motifs, and many transcription factors are not even sequence-specific. Results: We developed Virtual ChIP-seq, which predicts binding of individual transcription factors in new cell types using an artificial neural network that integrates ChIP-seq results from other cell types and chromatin accessibility data in the new cell type. Virtual ChIP-seq also uses learned associations between gene expression and transcription factor binding at specific genomic regions. This approach outperforms methods that predict TF binding solely based on sequence preference, predicting binding for 36 transcription factors (Matthews correlation coefficient > 0.3). Availability: The datasets we used for training and validation are available at https://virchip.hoffmanlab.org.
We have deposited in Zenodo the current version of our software (http://doi.org/10.5281/zenodo.1066928), datasets (http://doi.org/10.5281/zenodo.823297), predictions for 36 transcription factors on Roadmap Epigenomics cell types (http://doi.org/10.5281/zenodo.1455759), and predictions in Cistrome as well as the ENCODE-DREAM in vivo TF Binding Site Prediction Challenge (http://doi.org/10.5281/zenodo.1209308).
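
The abstract above reports performance as a Matthews correlation coefficient (MCC). As a minimal sketch of how that metric is computed from a binary confusion matrix, the snippet below uses the standard MCC formula; the function name and the example counts are hypothetical and are not taken from the Virtual ChIP-seq software or its results.

```python
import math


def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """Return the MCC for binary predictions given confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention when one
    row or column of the confusion matrix is empty).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for one transcription factor over genomic bins:
# bound bins are rare, so true negatives dominate.
mcc = matthews_corrcoef_from_counts(tp=300, tn=9000, fp=400, fn=300)
print(f"MCC = {mcc:.3f}")
print("passes the 0.3 threshold" if mcc > 0.3 else "below the 0.3 threshold")
```

MCC is a reasonable choice for heavily imbalanced problems such as binding-site prediction, because it accounts for all four cells of the confusion matrix rather than only the positive class.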

biorxiv bioinformatics 200-500-users 2018

End-to-end deep image reconstruction from human brain activity, bioRxiv, 2018-02-28

Abstract. Deep neural networks (DNNs) have recently been applied successfully to brain decoding and image reconstruction from functional magnetic resonance imaging (fMRI) activity. However, direct training of a DNN with fMRI data is often avoided because the size of available data is thought to be insufficient to train a complex network with numerous parameters. Instead, a pre-trained DNN has served as a proxy for hierarchical visual representations, and fMRI data were used to decode individual DNN features of a stimulus image using a simple linear model, which were then passed to a reconstruction module. Here, we present our attempt to directly train a DNN model with fMRI data and the corresponding stimulus images to build an end-to-end reconstruction model. We trained a generative adversarial network with an additional loss term defined in a high-level feature space (feature loss) using up to 6,000 training data points (natural images and the fMRI responses). The trained deep generator network was tested on an independent dataset, directly producing a reconstructed image given an fMRI pattern as the input. The reconstructions obtained from the proposed method resembled both natural and artificial test stimuli. The accuracy increased as a function of the training data size, though it did not outperform the decoded feature-based method with the available data size. Ablation analyses indicated that the feature loss played a critical role in achieving accurate reconstruction. Our results suggest a potential for the end-to-end framework to learn a direct mapping between brain activity and perception given even larger datasets.

biorxiv neuroscience 200-500-users 2018

In vivo CRISPR-Cas gene editing with no detectable genome-wide off-target mutations, bioRxiv, 2018-02-28

CRISPR-Cas genome-editing nucleases hold substantial promise for human therapeutics [1–5], but identifying unwanted off-target mutations remains an important requirement for clinical translation [6, 7]. For ex vivo therapeutic applications, previously published cell-based genome-wide methods provide potentially useful strategies to identify and quantify these off-target mutation sites [8–12]. However, a well-validated method that can reliably identify off-targets in vivo has not been described to date, leaving open the question of whether and how frequently these types of mutations occur. Here we describe Verification of In Vivo Off-targets (VIVO), a highly sensitive, unbiased, and generalizable strategy that we show can robustly identify genome-wide CRISPR-Cas nuclease off-target effects in vivo. To our knowledge, these studies provide the first demonstration that CRISPR-Cas nucleases can induce substantial off-target mutations in vivo, a result we obtained using a deliberately promiscuous guide RNA (gRNA). More importantly, we used VIVO to show that appropriately designed gRNAs can direct efficient in vivo editing without inducing detectable off-target mutations. Our findings provide strong support for and should encourage further development of in vivo genome editing therapeutic strategies.

biorxiv molecular-biology 100-200-users 2018

Complete assembly of parental haplotypes with trio binning, bioRxiv, 2018-02-27

Abstract. Reference genome projects have historically selected inbred individuals to minimize heterozygosity and simplify assembly. We challenge this dogma and present a new approach designed specifically for heterozygous genomes. “Trio binning” uses short reads from two parental genomes to partition long reads from an offspring into haplotype-specific sets prior to assembly. Each haplotype is then assembled independently, resulting in a complete diploid reconstruction. On a benchmark human trio, this method achieved high accuracy and recovered complex structural variants missed by alternative approaches. To demonstrate its effectiveness on a heterozygous genome, we sequenced an F1 cross between cattle subspecies Bos taurus taurus and Bos taurus indicus, and completely assembled both parental haplotypes with NG50 haplotig sizes >20 Mbp and 99.998% accuracy, surpassing the quality of current cattle reference genomes. We propose trio binning as a new best practice for diploid genome assembly that will enable new studies of haplotype variation and inheritance.
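
The partitioning step described above can be illustrated with a toy sketch. It assumes that reads are binned by counting k-mers that occur in only one parent's short reads; the abstract does not spell out the mechanism, so this k-mer approach, the function names, the tie-handling rule, and the example sequences are all assumptions made for illustration. The sketch also ignores reverse complements, sequencing errors, and k-mer count thresholds, which any practical implementation would need to handle.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets from parental reads."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign an offspring read to the parent whose specific k-mers it shares most."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Hypothetical toy data: the parents differ near the end of the sequence.
K = 4
mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTACGTCC"], K)

bins = {"maternal": [], "paternal": [], "unassigned": []}
for read in ["TACGTAAG", "TACGTCCA", "ACGTACG"]:
    bins[bin_read(read, mat_only, pat_only, K)].append(read)
print(bins)
```

Each of the maternal and paternal bins would then be passed to an assembler separately, which is what allows the two haplotypes to be reconstructed independently.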

biorxiv genomics 100-200-users 2018

Best Practices for Benchmarking Germline Small Variant Calls in Human Genomes, bioRxiv, 2018-02-24

Abstract. Assessing accuracy of NGS variant calling is immensely facilitated by a robust benchmarking strategy and tools to carry it out in a standard way. Benchmarking variant calls requires careful attention to definitions of performance metrics, sophisticated comparison approaches, and stratification by variant type and genome context. The Global Alliance for Genomics and Health (GA4GH) Benchmarking Team has developed standardized performance metrics and tools for benchmarking germline small variant calls. This team includes representatives from sequencing technology developers, government agencies, academic bioinformatics researchers, clinical laboratories, and commercial technology and bioinformatics developers for whom benchmarking variant calls is essential to their work. Benchmarking variant calls is a challenging problem for many reasons:
- Evaluating variant calls requires complex matching algorithms and standardized counting because the same variant may be represented differently in truth and query callsets.
- Defining and interpreting resulting metrics such as precision (aka positive predictive value = TP/(TP+FP)) and recall (aka sensitivity = TP/(TP+FN)) requires standardization to draw robust conclusions about comparative performance for different variant calling methods.
- Performance of NGS methods can vary depending on variant types and genome context; and as a result understanding performance requires meaningful stratification.
- High-confidence variant calls and regions that can be used as “truth” to accurately identify false positives and negatives are difficult to define, and reliable calls for the most challenging regions and variants remain out of reach.
We have made significant progress on standardizing comparison methods, metric definitions and reporting, as well as developing and using truth sets.
Our methods are publicly available on GitHub (https://github.com/ga4gh/benchmarking-tools) and in a web-based app on precisionFDA, which allow users to compare their variant calls against truth sets and to obtain a standardized report on their variant calling performance. Our methods have been piloted in the precisionFDA variant calling challenges to identify the best-in-class variant calling methods within high-confidence regions. Finally, we recommend a set of best practices for using our tools and critically evaluating the results.
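
The precision and recall definitions quoted in the abstract, together with its point about stratification, can be made concrete with a short sketch. The function name, the strata, and all counts below are hypothetical; this is not the GA4GH tooling, and it assumes that matching truth and query variants has already produced TP, FP, and FN counts.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) using TP/(TP+FP) and TP/(TP+FN).

    A metric is reported as NaN when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Hypothetical counts, stratified by variant type as the abstract recommends.
counts_by_stratum = {
    "SNV": {"tp": 9900, "fp": 50, "fn": 100},
    "indel": {"tp": 900, "fp": 60, "fn": 100},
}

for stratum, c in counts_by_stratum.items():
    p, r = precision_recall(c["tp"], c["fp"], c["fn"])
    print(f"{stratum}: precision={p:.4f} recall={r:.4f}")
```

Reporting the metrics per stratum rather than pooled keeps a weak result on a rarer variant class from being hidden by the much larger number of easier calls.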

biorxiv genomics 100-200-users 2018

Genetic meta-analysis identifies 9 novel loci and functional pathways for Alzheimer’s disease risk, bioRxiv, 2018-02-21

Abstract. Late onset Alzheimer’s disease (AD) is the most common form of dementia with more than 35 million people affected worldwide, and no curative treatment available. AD is highly heritable and recent genome-wide meta-analyses have identified over 20 genomic loci associated with AD, yet these explain only a small proportion of the genetic variance, indicating that undiscovered loci exist. Here, we performed the largest genome-wide association study of clinically diagnosed AD and AD-by-proxy (71,880 AD cases, 383,378 controls). AD-by-proxy status is based on parental AD diagnosis, and showed strong genetic correlation with AD (rg=0.81). Genetic meta-analysis identified 29 risk loci, of which 9 are novel, implicating 215 potential causative genes. Independent replication further supports these novel loci in AD. Associated genes are strongly expressed in immune-related tissues and cell types (spleen, liver and microglia). Furthermore, gene-set analyses indicate the genetic contribution of biological mechanisms involved in lipid-related processes and degradation of amyloid precursor proteins. We show strong genetic correlations with multiple health-related outcomes, and Mendelian randomisation results suggest a protective effect of cognitive ability on AD risk. These results are a step forward in identifying more of the genetic factors that contribute to AD risk and add novel insights into the neurobiology of AD to guide new drug development.

biorxiv genetics 100-200-users 2018