Gephebase, a Database of Genotype-Phenotype Relationships for natural and domesticated variation in Eukaryotes, bioRxiv, 2019-04-25
Abstract: Gephebase is a manually curated database compiling our accumulated knowledge of the genes and mutations that underlie natural, domesticated and experimental phenotypic variation in all Eukaryotes, mostly animals, plants and yeasts. Gephebase aims to compile studies where the genotype-phenotype association (based on linkage mapping, association mapping or a candidate gene approach) is relatively well supported or understood. Human disease and aberrant mutant phenotypes in laboratory model organisms are not included in Gephebase and can be found in other databases (e.g. OMIM, OMIA, Monarch Initiative). Gephebase contains more than 1700 entries. Each entry corresponds to an allelic difference at a given gene and its associated phenotypic change(s) between two species or between two individuals of the same species, and is enriched with molecular details, taxonomic information, and bibliographic information. Users can easily browse entries for their topic of interest and perform searches at various levels, whether phenotypic, genetic, taxonomic or bibliographic (e.g. transposable elements, cis-regulatory mutations, snakes, carotenoid content, an author name). Data can be searched using keywords and boolean operators and are exportable in spreadsheet format. This database allows users to perform meta-analyses to extract general information and global trends about evolution, genetics, and the field of evolutionary genetics itself. Gephebase should also help breeders, conservationists and others to identify the most promising target genes for traits of interest, with potential applications such as crop improvement, parasite and pest control, bioconservation, and genetic diagnostics. It is freely available at www.gephebase.org.
biorxiv bioinformatics 0-100-users 2019
ModelTest-NG: a new and scalable tool for the selection of DNA and protein evolutionary models, bioRxiv, 2019-04-23
Abstract: ModelTest-NG is a re-implementation from scratch of jModelTest and ProtTest, two popular tools for selecting the best-fit nucleotide and amino acid substitution models, respectively. ModelTest-NG is one to two orders of magnitude faster than jModelTest and ProtTest but equally accurate, and introduces several new features, such as ascertainment bias correction, mixture and FreeRate models, and the automatic processing of partitioned datasets. ModelTest-NG is available under a GNU GPL3 license at https://github.com/ddarriba/modeltest.
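To make "selecting the best-fit model" concrete, the following is a minimal sketch of ranking candidate substitution models with the AIC and BIC information criteria, which tools of this kind commonly use. It is not ModelTest-NG's code: the function names, the log-likelihood values, and the parameter counts (which ignore branch lengths) are illustrative assumptions.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: penalizes parameters by log(sample size)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def select_best(candidates, n_sites, criterion="bic"):
    """Return (best_model_name, scores) for a dict of name -> (lnL, n_params)."""
    scores = {}
    for name, (lnl, k) in candidates.items():
        if criterion == "bic":
            scores[name] = bic(lnl, k, n_sites)
        else:
            scores[name] = aic(lnl, k)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical log-likelihoods and free-parameter counts, for illustration only.
candidates = {
    "JC": (-5200.0, 0),
    "HKY": (-5100.0, 4),
    "GTR+G": (-5098.0, 9),
}

best_model, all_scores = select_best(candidates, n_sites=1000, criterion="bic")
print(best_model, all_scores)
```

In this toy input the richest model gains too little likelihood to offset its extra parameters, so a simpler model is preferred under both criteria.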
biorxiv bioinformatics 0-100-users 2019
pathwayPCA: an R package for integrative pathway analysis with modern PCA methodology and gene selection, bioRxiv, 2019-04-23
Abstract: With the advance of high-throughput technology for molecular assays, multi-omics datasets have become increasingly available. However, most currently available pathway analysis software provides little or no functionality for analyzing multiple types of -omics data simultaneously. In addition, most tools do not provide sample-specific estimates of pathway activities, which are important for precision medicine. To address these challenges, we present pathwayPCA, a unique R package for integrative pathway analysis that utilizes modern statistical methodology, including supervised PCA and adaptive elastic-net PCA, for principal component analysis. pathwayPCA can analyze continuous, binary, and survival outcomes in studies with multiple covariate and/or interaction effects. We provide three case studies to illustrate pathway analysis with gene selection, integrative analysis of multi-omics datasets to identify driver genes, estimating and visualizing sample-specific pathway activities in ovarian cancer, and identifying sex-specific pathway effects in kidney cancer. pathwayPCA is an open source R package, freely available to the research community. We expect pathwayPCA to be a useful tool for empowering the wide scientific community in the analysis and interpretation of the wealth of multi-omics data recently made available by TCGA, CPTAC and other large consortiums.
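The idea of a sample-specific pathway activity can be sketched as the score of each sample on the first principal component of the genes in one pathway. The example below is a rough illustration using plain PCA computed by power iteration; it is not the pathwayPCA API (the package is written in R) and it does not implement supervised PCA or adaptive elastic-net PCA. The function name and the toy expression matrix are assumptions.

```python
import math


def first_pc_scores(matrix, n_iter=200):
    """Return (loadings, scores) for the first principal component.

    matrix: list of samples, each a list of expression values for the
    genes of a single pathway (rows = samples, columns = genes).
    """
    n, p = len(matrix), len(matrix[0])
    # Center each gene (column) at zero mean.
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in matrix]
    # Sample covariance matrix of the genes.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration; the all-ones start is a simplification and would
    # fail if it were orthogonal to the leading eigenvector.
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Project every centered sample onto the leading direction.
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores


# Toy data: three samples, two perfectly co-varying genes (illustrative only).
expression = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
loadings, activity = first_pc_scores(expression)
print(loadings, activity)
```

Each entry of `activity` is one sample's score, so samples can be compared or plotted by their estimated pathway activity.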
biorxiv bioinformatics 0-100-users 2019
deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index, bioRxiv, 2019-04-18
Abstract: Long-read RNA sequencing (RNA-seq) is a promising approach in transcriptomics studies; however, the alignment of long reads is a fundamental but still non-trivial task due to sequencing errors and complicated gene structures. We propose de Bruijn graph-based Spliced Aligner for Long Transcriptome read (deSALT), a tailored two-pass long RNA-seq read alignment approach, which constructs graph-based alignment skeletons to sensitively infer exons and uses them to generate high-quality spliced reference sequences to produce refined alignments. deSALT addresses several difficult technical issues, such as small exons and serious sequencing errors, thereby breaking through the bottlenecks of long RNA-seq read alignment. Benchmarks demonstrate that this approach has a greater ability to produce accurate and homogeneous full-length alignments and thus has enormous potential in transcriptomics studies.
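As a rough illustration of what a de Bruijn graph-based index of a reference looks like, the sketch below builds a graph whose nodes are (k-1)-mers and whose edges are k-mers, records where each k-mer occurs, and looks up seed matches for a read. This is a generic textbook construction, not deSALT's actual index or alignment procedure; the function names and sequences are assumptions.

```python
from collections import defaultdict


def build_de_bruijn(sequence, k):
    """Build a de Bruijn graph and a k-mer position table from a reference.

    graph maps each (k-1)-mer to the set of (k-1)-mers that follow it;
    positions maps each k-mer to its start offsets in the reference.
    """
    graph = defaultdict(set)
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])
        positions[kmer].append(i)
    return graph, positions


def seed_hits(read, positions, k):
    """Return (read_offset, reference_offset) pairs for exact k-mer matches."""
    hits = []
    for j in range(len(read) - k + 1):
        for pos in positions.get(read[j:j + k], []):
            hits.append((j, pos))
    return hits


# Toy reference and read, for illustration only.
reference = "ACGTACGA"
graph, positions = build_de_bruijn(reference, k=3)
print(sorted(graph["CG"]))                 # successors of the node "CG"
print(seed_hits("TACG", positions, k=3))   # exact seed matches of the read
```

A real spliced aligner would chain such seeds and extend them across exon boundaries; the sketch stops at seed lookup.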
biorxiv bioinformatics 100-200-users 2019
Benchmarking of alignment-free sequence comparison methods, bioRxiv, 2019-04-16
Abstract: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but the lack of a clearly defined benchmarking consensus hampers their performance assessment. Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference and reconstruction of species trees under horizontal gene transfer and recombination events. The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
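For readers unfamiliar with alignment-free comparison, one of the simplest families of methods compares k-mer frequency profiles instead of aligning sequences. The sketch below computes a Euclidean distance between normalized k-mer profiles; it is a generic example and is not taken from any of the benchmarked tools. Function names and inputs are assumptions.

```python
import math
from collections import Counter


def kmer_frequencies(seq, k):
    """Return a dict of k-mer -> relative frequency in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def kmer_distance(seq_a, seq_b, k):
    """Euclidean distance between the k-mer frequency profiles of two sequences."""
    fa = kmer_frequencies(seq_a, k)
    fb = kmer_frequencies(seq_b, k)
    kmers = set(fa) | set(fb)
    return math.sqrt(sum((fa.get(x, 0.0) - fb.get(x, 0.0)) ** 2 for x in kmers))


# Toy sequences, for illustration only.
print(kmer_distance("ACGTACGT", "ACGTACGT", k=2))
print(kmer_distance("ACGTACGT", "ACGTTTTT", k=2))
```

A matrix of such pairwise distances can then be fed to a clustering or tree-building method, which is how distance-based AF approaches are typically applied to phylogenetic inference.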
biorxiv bioinformatics 100-200-users 2019
nf-core: Community curated bioinformatics pipelines, bioRxiv, 2019-04-16
Abstract: The standardization, portability, and reproducibility of analysis pipelines are a well-known problem within the bioinformatics community. Most pipelines are designed for execution on-premise, and the associated software dependencies are tightly coupled with the local compute environment. This leads to poor pipeline portability and reproducibility of the ensuing results, both of which are fundamental requirements for the validation of scientific findings. Here, we introduce nf-core, a framework that provides a community-driven, peer-reviewed platform for the development of best practice analysis pipelines written in Nextflow. Key obstacles in pipeline development such as portability, reproducibility, scalability and unified parallelism are inherently addressed by all nf-core pipelines. We are also continually developing a suite of tools that assist in the creation and development of both new and existing pipelines. Our primary goal is to provide a platform for high-quality, reproducible bioinformatics pipelines that can be utilized across various institutions and research facilities.
biorxiv bioinformatics 100-200-users 2019