Elephantid genomes reveal the molecular bases of Woolly Mammoth adaptations to the arctic, bioRxiv, 2015-04-24
Woolly mammoths and the living elephants are characterized by major phenotypic differences that allowed them to live in very different environments. To identify the genetic changes that underlie the suite of adaptations in woolly mammoths to life in extreme cold, we sequenced the nuclear genomes of three Asian elephants and two woolly mammoths, and identified and functionally annotated genetic changes unique to the woolly mammoth lineage. We find that genes with mammoth-specific amino acid changes are enriched in functions related to circadian biology, skin and hair development and physiology, lipid metabolism, adipose development and physiology, and temperature sensation. Finally, we resurrect and functionally test the mammoth and ancestral elephant TRPV3 gene, which encodes a temperature-sensitive transient receptor potential (thermoTRP) channel involved in thermal sensation and hair growth, and show that a single mammoth-specific amino acid substitution in an otherwise highly conserved region of the TRPV3 channel strongly affected its temperature sensitivity. Our results identify a set of genetic changes that likely played important roles in the adaptation of woolly mammoths to life in the high Arctic.
biorxiv genomics 0-100-users 2015
Tools and best practices for allelic expression analysis, bioRxiv, 2015-03-07
Allelic expression (AE) analysis has become an important tool for integrating genome and transcriptome data to characterize various biological phenomena such as cis-regulatory variation and nonsense-mediated decay. In this paper, we systematically analyze the properties of AE read count data and technical sources of error, such as low-quality or double-counted RNA-seq reads, genotyping errors, allelic mapping bias, technical covariates due to sample preparation and sequencing, and variation in total read depth. We provide guidelines for correcting and filtering for such errors, and show that the resulting AE data has extremely low technical noise. Finally, we introduce novel software for high-throughput production of AE data from RNA-sequencing data, implemented in the GATK framework. These improved tools and best practices for AE analysis yield higher quality AE data by reducing technical bias. This provides a practical framework for wider adoption of AE analysis by the genomics community.
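To make "AE read count data" concrete, here is a minimal sketch of how reference and alternative allele read counts at one heterozygous site might be summarized. It assumes a common baseline that the abstract itself does not describe: the reference allele ratio plus an exact two-sided binomial test against a 50/50 expectation. The function name `allelic_imbalance` is hypothetical and is not part of the GATK software mentioned above.

```python
from math import comb


def allelic_imbalance(ref_count, alt_count):
    """Return (reference ratio, two-sided binomial p-value) for one site.

    Assumption: under no allelic imbalance, each read comes from the
    reference allele with probability 0.5. This ignores mapping bias and
    the other technical error sources listed in the abstract.
    """
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("site has no reads")
    ref_ratio = ref_count / total
    # With p = 0.5 the binomial distribution is symmetric, so the two-sided
    # p-value is twice the smaller tail, capped at 1.
    k = min(ref_count, alt_count)
    tail = sum(comb(total, i) for i in range(k + 1)) / 2 ** total
    p_value = min(1.0, 2 * tail)
    return ref_ratio, p_value


balanced = allelic_imbalance(10, 10)   # equal support for both alleles
skewed = allelic_imbalance(20, 0)      # all reads support the reference allele
```

A site with low total read depth has little power in a test like this, which is one reason variation in total read depth matters for AE analysis.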
biorxiv genomics 0-100-users 2015
An Atlas of Genetic Correlations across Human Diseases and Traits, bioRxiv, 2015-02-09
Identifying genetic correlations between complex traits and diseases can provide useful etiological insights and help prioritize likely causal relationships. The major challenges preventing estimation of genetic correlation from genome-wide association study (GWAS) data with current methods are the lack of availability of individual genotype data and widespread sample overlap among meta-analyses. We circumvent these difficulties by introducing a technique for estimating genetic correlation that requires only GWAS summary statistics and is not biased by sample overlap. We use our method to estimate 300 genetic correlations among 25 traits, totaling more than 1.5 million unique phenotype measurements. Our results include genetic correlations between anorexia nervosa and both schizophrenia and body mass index, and associations between educational attainment and several diseases. These results highlight the power of a polygenic modeling framework, since there currently are no genome-wide significant SNPs for anorexia nervosa and only three for educational attainment.
biorxiv genomics 100-200-users 2015
Inexpensive Multiplexed Library Preparation for Megabase-Sized Genomes, bioRxiv, 2015-01-17
Whole-genome sequencing has become an indispensable tool of modern biology. However, the cost of sample preparation relative to the cost of sequencing remains high, especially for small genomes where the former is dominant. Here we present a protocol for the rapid and inexpensive preparation of hundreds of multiplexed genomic libraries for Illumina sequencing. By carrying out the Nextera tagmentation reaction in small volumes, replacing costly reagents with cheaper equivalents, and omitting unnecessary steps, we achieve a cost of library preparation of $8 per sample, approximately 6 times cheaper than the widely used Nextera XT protocol. Furthermore, our procedure takes less than 5 hours for 96 samples and uses nanograms of genomic DNA. Many hundreds of samples can then be pooled on the same HiSeq lane via custom barcodes. Our method is especially useful for re-sequencing of large numbers of full microbial or viral genomes, including those from evolution experiments, genetic screens, and environmental samples.
biorxiv genomics 100-200-users 2015
When to use Quantile Normalization?, bioRxiv, 2014-12-05
Normalization and preprocessing are essential steps for the analysis of high-throughput data including next-generation sequencing and microarrays. Multi-sample global normalization methods, such as quantile normalization, have been successfully used to remove technical variation from noisy data. These methods rely on the assumption that observed global changes across samples are due to unwanted technical variability. Transforming the data to remove these differences has the potential to remove interesting biologically driven global variation and therefore may not be appropriate depending on the type and source of variation. Currently, it is up to the subject matter experts, for example biologists, to determine if the stated assumptions are appropriate or not. Here, we propose a data-driven method to test the assumptions of global normalization methods. We demonstrate the utility of our method (quantro) by applying it to multiple gene expression and DNA methylation datasets, and show examples of when global normalization methods are not appropriate. We also perform a Monte Carlo simulation study to illustrate how our method generally outperforms the current approach. An R package implementing our method is available on Bioconductor (http://www.bioconductor.org/packages/release/bioc/html/quantro.html).
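For readers unfamiliar with the transformation in question, the following is a minimal sketch of standard quantile normalization, written in plain Python. It is not the quantro test itself, and the function name `quantile_normalize` and the toy values are illustrative assumptions. Ties are broken by position rather than averaged, which a production implementation would usually handle differently.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of samples, each a list of feature values.

    Every sample is forced onto the same reference distribution: the mean
    of the sorted samples, taken rank by rank.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all samples must have the same number of features")
    # Sort each sample independently, then average across samples per rank.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[r] for col in sorted_cols) / len(columns)
                 for r in range(n)]
    normalized = []
    for col in columns:
        # Feature indices ordered from smallest to largest value
        # (Python's sort is stable, so ties keep their original order).
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Three toy samples measured on four features.
samples = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.0, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(samples)
```

After the transformation every sample has exactly the same set of values, so any global difference between samples is erased whether it was technical or biological. That is the assumption the abstract says should be tested before normalizing.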
biorxiv genomics 0-100-users 2014
Redefining Genomic Privacy: Trust and Empowerment, bioRxiv, 2014-06-26
Fulfilling the promise of the genetic revolution requires the analysis of large datasets containing information from thousands to millions of participants. However, sharing human genomic data requires protecting subjects from potential harm. Current models rely on de-identification techniques that treat privacy versus data utility as a zero-sum game. Instead, we propose using trust-enabling techniques to create a solution where researchers and participants both win. To do so, we introduce three principles that facilitate trust in genetic research and outline one possible framework built upon those principles. Our hope is that such trust-centric frameworks provide a sustainable solution that reconciles genetic privacy with data sharing and facilitates genetic research.
biorxiv genomics 0-100-users 2014