A practical guide to methods controlling false discoveries in computational biology, bioRxiv, 2018-10-31
In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p-values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as informative covariates to prioritize, weight, and group hypotheses. However, there is currently no consensus on how the modern methods compare to one another. We investigated the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology. Methods that incorporate informative covariates were modestly more powerful than classic approaches, and did not underperform classic approaches, even when the covariate was completely uninformative. The majority of methods were successful at controlling the FDR, with the exception of two modern methods under certain settings. Furthermore, we found the improvement of the modern FDR methods over the classic methods increased with the informativeness of the covariate, total number of hypothesis tests, and proportion of truly non-null hypotheses. Modern FDR methods that use an informative covariate provide advantages over classic FDR-controlling procedures, with the relative gain dependent on the application and informativeness of available covariates. We present our findings as a practical guide and provide recommendations to aid researchers in their choice of methods to correct for false discoveries.
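As an illustration of what a classic, p-value-only FDR method looks like, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure. The abstract does not name which classic methods were benchmarked, so the choice of Benjamini-Hochberg here is an assumption, and the function name `benjamini_hochberg` and the example p-values are hypothetical. It is not the authors' benchmark code.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Sketch of the Benjamini-Hochberg step-up procedure at target FDR `alpha`.
    """
    m = len(pvalues)
    if m == 0:
        return []
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, alpha=0.05))
```

The covariate-aware methods described in the abstract differ in that they take a second input per hypothesis (the informative covariate) and use it to weight or group the tests. This sketch uses p-values alone.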
biorxiv bioinformatics 200-500-users 2018
Genetic Consequences of Social Stratification in Great Britain, bioRxiv, 2018-10-30
Human DNA varies across geographic regions, with most variation observed so far reflecting distant ancestry differences. Here, we investigate the geographic clustering of genetic variants that influence complex traits and disease risk in a sample of ~450,000 individuals from Great Britain. Out of 30 traits analyzed, 16 show significant geographic clustering at the genetic level after controlling for ancestry, likely reflecting recent migration driven by socio-economic status (SES). Alleles associated with educational attainment (EA) show most clustering, with EA-decreasing alleles clustering in lower SES areas such as coal mining areas. Individuals that leave coal mining areas carry more EA-increasing alleles on average than the rest of Great Britain. In addition, we leveraged the geographic clustering of complex trait variation to further disentangle regional differences in socio-economic and cultural outcomes through genome-wide association studies on publicly available regional measures, namely coal mining, religiousness, 1970-2015 general election outcomes, and Brexit referendum results.
biorxiv genetics 200-500-users 2018
Atlas of Subcellular RNA Localization Revealed by APEX-seq, bioRxiv, 2018-10-27
Summary: We introduce APEX-seq, a method for RNA sequencing based on spatial proximity to the peroxidase enzyme APEX2. APEX-seq in nine distinct subcellular locales produced a nanometer-resolution spatial map of the human transcriptome, revealing extensive and exquisite patterns of localization for diverse RNA classes and transcript isoforms. We uncover a radial organization of the nuclear transcriptome, which is gated at the inner surface of the nuclear pore for cytoplasmic export of processed transcripts. We identify two distinct pathways of messenger RNA localization to mitochondria, each associated with specific sets of transcripts for building complementary macromolecular machines within the organelle. APEX-seq should be widely applicable to many systems, enabling comprehensive investigations of the spatial transcriptome.
biorxiv cell-biology 200-500-users 2018
RAxML-NG: A fast, scalable, and user-friendly tool for maximum likelihood phylogenetic inference, bioRxiv, 2018-10-19
Motivation: Phylogenies are important for fundamental biological research, but also have numerous applications in biotechnology, agriculture, and medicine. Finding the optimal tree under the popular maximum likelihood (ML) criterion is known to be NP-hard. Thus, highly optimized and scalable codes are needed to analyze constantly growing empirical datasets. Results: We present RAxML-NG, a from-scratch re-implementation of the established greedy tree search algorithm of RAxML/ExaML. RAxML-NG offers improved accuracy, flexibility, speed, scalability, and usability compared to RAxML/ExaML. On taxon-rich datasets, RAxML-NG typically finds higher-scoring trees than IQ-Tree, an increasingly popular recent tool for ML-based phylogenetic inference (although IQ-Tree shows better stability). Finally, RAxML-NG introduces several new features, such as the detection of terraces in tree space and the recently introduced transfer bootstrap support metric. Availability: The code is available under GNU GPL at https://github.com/amkozlov/raxml-ng. RAxML-NG web service (maintained by Vital-IT) is available at https://raxml-ng.vital-it.ch. Contact: alexey.kozlov@h-its.org
biorxiv bioinformatics 200-500-users 2018
Unraveling the polygenic architecture of complex traits using blood eQTL meta-analysis, bioRxiv, 2018-10-19
While many disease-associated variants have been identified through genome-wide association studies, their downstream molecular consequences remain unclear. To identify these effects, we performed cis- and trans-expression quantitative trait locus (eQTL) analysis in blood from 31,684 individuals through the eQTLGen Consortium. We observed that cis-eQTLs can be detected for 88% of the studied genes, but that they have a different genetic architecture compared to disease-associated variants, limiting our ability to use cis-eQTLs to pinpoint causal genes within susceptibility loci. In contrast, trans-eQTLs (detected for 37% of 10,317 studied trait-associated variants) were more informative. Multiple unlinked variants associated with the same complex trait often converged on trans-genes that are known to play central roles in disease etiology. We observed the same when ascertaining the effect of polygenic scores calculated for 1,263 genome-wide association study (GWAS) traits. Expression levels of 13% of the studied genes correlated with polygenic scores, and many resulting genes are known to drive these traits.
biorxiv genomics 200-500-users 2018