Scaling accurate genetic variant discovery to tens of thousands of samples, bioRxiv, 2017-11-15
Abstract: Comprehensive disease gene discovery in both common and rare diseases will require the efficient and accurate detection of all classes of genetic variation across tens to hundreds of thousands of human samples. We describe here a novel assembly-based approach to variant calling, the GATK HaplotypeCaller (HC) and Reference Confidence Model (RCM), that determines genotype likelihoods independently per-sample but performs joint calling across all samples within a project simultaneously. We show by calling over 90,000 samples from the Exome Aggregation Consortium (ExAC) that, in contrast to other algorithms, the HC-RCM scales efficiently to very large sample sizes without loss in accuracy; and that the accuracy of indel variant calling is superior in comparison to other algorithms. More importantly, the HC-RCM produces a fully squared-off matrix of genotypes across all samples at every genomic position being investigated. The HC-RCM is a novel, scalable, assembly-based algorithm with abundant applications for population genetics and clinical studies.
biorxiv genomics 0-100-users 2017
EpiGraph: an open-source platform to quantify epithelial organization, bioRxiv, 2017-11-14
Summary: During development, cells must coordinate their differentiation with their growth and organization to form complex multicellular structures such as tissues and organs. Healthy tissues must maintain these structures during homeostasis. Epithelia are packed ensembles of cells from which the different tissues of the organism will originate during embryogenesis. A large barrier to the analysis of the morphogenetic changes in epithelia is the lack of simple tools that enable the quantification of cell arrangements. Here we present EpiGraph, an image analysis tool that quantifies epithelial organization. Our method combines computational geometry and graph theory to measure the degree of order of any packed tissue. EpiGraph goes beyond the traditional polygon distribution analysis, capturing other organizational traits that improve the characterization of epithelia. EpiGraph can objectively compare the rearrangements of epithelial cells during development and homeostasis to quantify how the global ensemble is affected. Importantly, it has been implemented in the open-access platform FIJI. This makes EpiGraph very user friendly, with no programming skills required.
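For readers unfamiliar with the "traditional polygon distribution analysis" that the summary uses as its baseline, the following is a minimal sketch of that baseline only, not of EpiGraph itself. It assumes the tissue has already been segmented into a cell-adjacency graph; the function name `polygon_distribution` and the toy data are hypothetical.

```python
from collections import Counter


def polygon_distribution(neighbors):
    """Return the fraction of cells having each number of neighbors.

    `neighbors` maps a cell id to the set of cell ids it touches.
    In a packed epithelium, a cell's neighbor count is treated as
    the number of sides of its polygon.
    """
    counts = Counter(len(adjacent) for adjacent in neighbors.values())
    total = len(neighbors)
    return {sides: n / total for sides, n in sorted(counts.items())}


# Hypothetical toy tissue (symmetric adjacency), for illustration only.
toy_tissue = {
    0: {1, 2, 3, 4},
    1: {0, 2, 4},
    2: {0, 1, 3},
    3: {0, 2, 4},
    4: {0, 1, 3},
}

distribution = polygon_distribution(toy_tissue)
print(distribution)
```

According to the summary, EpiGraph captures organizational traits beyond this histogram, so two tissues with the same polygon distribution may still differ under its graph-based measures.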
biorxiv developmental-biology 0-100-users 2017
Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies, bioRxiv, 2017-11-02
Abstract: In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly in the analysis of phenotypes with unbalanced case-control ratios, producing large type I error rates. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation (SPA) to calibrate the distribution of score test statistics. This method, SAIGE, provides accurate p-values even when case-control ratios are extremely unbalanced. It utilizes state-of-the-art optimization strategies to reduce the computational time and memory cost of the generalized mixed model. The computational cost scales linearly with sample size, so the method is applicable to GWAS for thousands of phenotypes in large biobanks. Through the analysis of UK Biobank data of 408,961 white British European-ancestry samples for >1400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
biorxiv genomics 0-100-users 2017
Germline determinants of the somatic mutation landscape in 2,642 cancer genomes, bioRxiv, 2017-11-02
Abstract: Cancers develop through somatic mutagenesis; however, germline genetic variation can markedly contribute to tumorigenesis via diverse mechanisms. We discovered and phased 88 million germline single nucleotide variants, short insertions/deletions, and large structural variants in whole genomes from 2,642 cancer patients, and employed this genomic resource to study genetic determinants of somatic mutagenesis across 39 cancer types. Our analyses implicate damaging germline variants in a variety of cancer predisposition and DNA damage response genes with specific somatic mutation patterns. Mutations in the MBD4 DNA glycosylase gene showed association with elevated C>T mutagenesis at CpG dinucleotides, a ubiquitous mutational process acting across tissues. Analysis of somatic structural variation exposed complex rearrangement patterns, involving cycles of templated insertions and tandem duplications, in BRCA1-deficient tumours. Genome-wide association analysis implicated common genetic variation at the APOBEC3 gene cluster with reduced basal levels of somatic mutagenesis attributable to APOBEC cytidine deaminases across cancer types. We further inferred over a hundred polymorphic L1/LINE elements with somatic retrotransposition activity in cancer. Our study highlights the major impact of rare and common germline variants on mutational landscapes in cancer.
biorxiv genomics 0-100-users 2017
Unsupervised discovery of demixed, low-dimensional neural dynamics across multiple timescales through tensor components analysis, bioRxiv, 2017-10-31
Abstract: Perceptions, thoughts and actions unfold over millisecond timescales, while learned behaviors can require many days to mature. While recent experimental advances enable large-scale and long-term neural recordings with high temporal fidelity, it remains a formidable challenge to extract unbiased and interpretable descriptions of how rapid single-trial circuit dynamics change slowly over many trials to mediate learning. We demonstrate that a simple tensor components analysis (TCA) can meet this challenge by extracting three interconnected low-dimensional descriptions of neural data: neuron factors, reflecting cell assemblies; temporal factors, reflecting rapid circuit dynamics mediating perceptions, thoughts, and actions within each trial; and trial factors, describing both long-term learning and trial-to-trial changes in cognitive state. We demonstrate the broad applicability of TCA by revealing insights into diverse datasets derived from artificial neural networks, large-scale calcium imaging of rodent prefrontal cortex during maze navigation, and multielectrode recordings of macaque motor cortex during brain-machine interface learning.
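To make the three factor types concrete, here is a minimal sketch of how a neurons x time x trials data tensor could be rebuilt from them. It assumes TCA takes the form of a canonical polyadic (CP) model, which the abstract does not spell out, and it shows only the reconstruction step, not the fitting procedure. The function name `reconstruct` and all factor values are hypothetical.

```python
def reconstruct(neuron_f, time_f, trial_f):
    """Rebuild x[n][t][k] = sum_r neuron_f[n][r] * time_f[t][r] * trial_f[k][r].

    Each argument is a list of rows, one row per neuron / time point /
    trial, with one column per component r (assumed CP-style model).
    """
    rank = len(neuron_f[0])
    return [
        [
            [
                sum(neuron_f[n][r] * time_f[t][r] * trial_f[k][r]
                    for r in range(rank))
                for k in range(len(trial_f))
            ]
            for t in range(len(time_f))
        ]
        for n in range(len(neuron_f))
    ]


# Hypothetical toy factors: 2 neurons, 3 time points, 2 trials, 2 components.
neuron_factors = [[1, 0], [0, 1]]           # which cells join each assembly
temporal_factors = [[1, 0], [2, 1], [1, 2]] # within-trial time course
trial_factors = [[1, 3], [2, 1]]            # across-trial amplitude

x = reconstruct(neuron_factors, temporal_factors, trial_factors)
print(x[0][1][1], x[1][2][0])
```

Under this assumed form, each component contributes one neuron weighting, one within-trial time course, and one across-trial amplitude, which is what lets the fast and slow timescales be read off separately.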
biorxiv neuroscience 0-100-users 2017
Graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells, bioRxiv, 2017-10-26
Abstract: Single-cell RNA-seq quantifies biological heterogeneity across both discrete cell types and continuous cell transitions. Partition-based graph abstraction (PAGA) provides an interpretable graph-like map of the arising data manifold, based on estimating connectivity of manifold partitions (https://github.com/theislab/paga). PAGA maps provide interpretable discrete and continuous latent coordinates for both disconnected and continuous structure in data, preserve the global topology of data, allow analyzing data at different resolutions and result in much higher computational efficiency of the typical exploratory data analysis workflow: one million cells take on the order of a minute, a speedup of 130 times compared to UMAP. We demonstrate the method by inferring structure-rich cell maps with consistent topology across four hematopoietic datasets, confirm the reconstruction of lineage relations of adult planaria and the zebrafish embryo, benchmark computational performance on a neuronal dataset and detect a biological trajectory in one deep-learning processed image dataset.
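The idea of "estimating connectivity of manifold partitions" can be illustrated with a deliberately simplified sketch: count the edges of a cell-level neighborhood graph that cross between partitions. This is not PAGA's actual connectivity statistic, which the abstract does not define; the function name `partition_edge_counts`, the edge list, and the labels are all hypothetical.

```python
from collections import Counter


def partition_edge_counts(edges, labels):
    """Count cell-graph edges that cross between two different partitions.

    `edges` is an iterable of (cell_a, cell_b) pairs from a neighborhood
    graph; `labels` maps each cell to its partition (cluster) name.
    Returns {(partition_1, partition_2): count} with sorted pair keys.
    """
    counts = Counter()
    for a, b in edges:
        label_a, label_b = labels[a], labels[b]
        if label_a != label_b:  # ignore edges inside one partition
            counts[tuple(sorted((label_a, label_b)))] += 1
    return dict(counts)


# Hypothetical toy data: five cells in three partitions.
toy_labels = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C"}
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]

coarse_graph = partition_edge_counts(toy_edges, toy_labels)
print(coarse_graph)
```

In this toy case partitions A and C share no edges, so a coarse map built from these counts would link A to B and B to C only. A real analysis would also need to judge whether a given count is larger than expected by chance, which this sketch does not attempt.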
biorxiv bioinformatics 0-100-users 2017