Bacterial contribution to genesis of the novel germ line determinant oskar, bioRxiv, 2018-10-26
New cellular functions and developmental processes can evolve by modifying the functions or regulation of pre-existing genes, but the creation of new genes and their contributions to novel processes are less well understood. New genes can arise not only from mutations or rearrangements of existing sequences, but also via acquisition of foreign DNA, also called horizontal gene transfer (HGT). Here we present evidence that HGT contributed to the creation of a novel gene indispensable for reproduction in some insects. The oskar gene evolved to fulfil a crucial role in insect germ cell formation, but was long considered a novel gene with unknown evolutionary origins. Our analysis of over 100 Oskar sequences suggests that Oskar arose through a novel gene formation history involving fusion of eukaryotic and prokaryotic sequences. One of its two conserved domains (LOTUS) was likely present in the genome of a last common insect ancestor, while the second (OSK) domain appears to have been acquired through horizontal transfer of a bacterial GDSL-like lipase domain. Our evidence suggests that the bacterial contributor of the OSK domain may have been a germ line endosymbiont. This shows that gene origin processes often considered highly unusual, including HGT and de novo coding region evolution, can give rise to novel genes that both participate in pre-existing gene regulatory networks and facilitate the evolution of novel developmental mechanisms.
biorxiv evolutionary-biology 100-200-users 2018
Challenges and recommendations to improve installability and archival stability of omics computational tools, bioRxiv, 2018-10-26
Developing new software tools for analysis of large-scale biological data is a key component of advancing modern biomedical research. Scientific reproduction of published findings requires running computational tools on data generated by such studies, yet little attention is presently allocated to the installability and archival stability of computational software tools. Scientific journals require data and code sharing, but none currently require authors to guarantee the continuing functionality of newly published tools. We have estimated the archival stability of computational biology software tools by performing an empirical analysis of the internet presence for 36,702 omics software resources published from 2005 to 2017. We found that almost 28% of all resources are currently not accessible through URLs published in the paper they first appeared in. Among the 98 software tools selected for our installability test, 51% were deemed “easy to install,” and 28% of the tools failed to be installed at all due to problems in the implementation. Moreover, for papers introducing new software, we found that the number of citations significantly increased when authors provided an easy installation process. We propose for incorporation into journal policy several practical solutions for increasing the widespread installability and archival stability of published bioinformatics software.
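As a rough illustration of the kind of tally behind an accessibility figure like the one above, the sketch below classifies HTTP status codes and reports the share of unreachable resources. The function names, the status codes, and the rule that only 2xx and 3xx responses count as accessible are all assumptions made for illustration; the abstract does not describe the authors' actual procedure.

```python
from typing import Iterable, Optional


def is_accessible(status: Optional[int]) -> bool:
    """Assumed rule: 2xx and 3xx count as reachable; None means no response."""
    return status is not None and 200 <= status < 400


def inaccessible_share(statuses: Iterable[Optional[int]]) -> float:
    """Return the fraction of resources whose URL did not resolve."""
    statuses = list(statuses)
    if not statuses:
        raise ValueError("no statuses given")
    broken = sum(1 for s in statuses if not is_accessible(s))
    return broken / len(statuses)


# Hypothetical status codes for eight published URLs (made-up data).
example = [200, 301, 404, 200, None, 200, 503, 200]
print(f"{inaccessible_share(example):.1%} of example URLs are inaccessible")
```

A real survey would also have to decide how to treat redirects to unrelated pages and timeouts, which this simple rule ignores.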
biorxiv bioinformatics 500+-users 2018
ETDB-Caltech: a blockchain-based distributed public database for electron tomography, bioRxiv, 2018-10-26
Three-dimensional electron microscopy techniques like electron tomography provide valuable insights into cellular structures, and present significant challenges for data storage and dissemination. Here we explored a novel method to publicly release more than 11,000 such datasets, more than 30 TB in total, collected by our group. Our method, based on a peer-to-peer file sharing network built around a blockchain ledger, offers a distributed solution to data storage. In addition, we offer a user-friendly browser-based interface, https://etdb.caltech.edu, for anyone interested to explore and download our data. We discuss the relative advantages and disadvantages of this system and provide tools for other groups to mine our data and/or use the same approach to share their own imaging datasets.
biorxiv cell-biology 0-100-users 2018
Identifying loci under positive selection in complex population histories, bioRxiv, 2018-10-26
Detailed modeling of a species’ history is of prime importance for understanding how natural selection operates over time. Most methods designed to detect positive selection along sequenced genomes, however, use simplified representations of past histories as null models of genetic drift. Here, we present the first method that can detect signatures of strong local adaptation across the genome using arbitrarily complex admixture graphs, which are typically used to describe the history of past divergence and admixture events among any number of populations. The method, called Graph-aware Retrieval of Selective Sweeps (GRoSS), has good power to detect loci in the genome with strong evidence for past selective sweeps and can also identify which branch of the graph was most affected by the sweep. As evidence of its utility, we apply the method to bovine, codfish and human population genomic data containing multiple population panels related in complex ways. We find new candidate genes for important adaptive functions, including immunity and metabolism in under-studied human populations, as well as muscle mass, milk production and tameness in specific bovine breeds. We are also able to pinpoint the emergence of large regions of differentiation due to inversions in the history of Atlantic codfish.
biorxiv evolutionary-biology 100-200-users 2018
Proximity RNA labeling by APEX-Seq Reveals the Organization of Translation Initiation Complexes and Repressive RNA Granules, bioRxiv, 2018-10-26
Diverse ribonucleoprotein complexes control messenger RNA processing, translation, and decay. Transcripts in these complexes localize to specific regions of the cell and can condense into non-membrane-bound structures such as stress granules. It has proven challenging to map the RNA composition of these large and dynamic structures, however. We therefore developed an RNA proximity labeling technique, APEX-Seq, which uses the ascorbate peroxidase APEX2 to probe the spatial organization of the transcriptome. We show that APEX-Seq can resolve the localization of RNAs within the cell and determine their enrichment or depletion near key RNA-binding proteins. Matching the spatial transcriptome, as revealed by APEX-Seq, with the spatial proteome determined by APEX-mass spectrometry (APEX-MS) provides new insights into the organization of translation initiation complexes on active mRNAs, as well as exposing unanticipated complexity in stress granule composition, and provides a powerful and general approach to explore the spatial environment of macromolecules.
biorxiv genomics 100-200-users 2018
The art of using t-SNE for single-cell transcriptomics, bioRxiv, 2018-10-26
Single-cell transcriptomics yields ever growing data sets containing RNA expression levels for thousands of genes from up to millions of cells. Common data analysis pipelines include a dimensionality reduction step for visualising the data in two dimensions, most frequently performed using t-distributed stochastic neighbour embedding (t-SNE). It excels at revealing local structure in high-dimensional data, but naive applications often suffer from severe shortcomings, e.g. the global structure of the data is not represented accurately. Here we describe how to circumvent such pitfalls, and develop a protocol for creating more faithful t-SNE visualisations. It includes PCA initialisation, a high learning rate, and multi-scale similarity kernels; for very large data sets, we additionally use exaggeration and downsampling-based initialisation. We use published single-cell RNA-seq data sets to demonstrate that this protocol yields superior results compared to the naive application of t-SNE.
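To make the PCA initialisation step concrete, here is a minimal pure-Python sketch: it projects the data onto its first two principal axes and shrinks the result so that a t-SNE optimiser would start from a small, globally ordered layout. It is not the authors' code. The function names, the power-iteration approach and the target spread of 1e-4 are assumptions chosen for illustration.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def center(data: Matrix) -> Matrix:
    """Subtract the per-column mean from every row."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def top_component(x: Matrix, iters: int = 200, seed: int = 0) -> List[float]:
    """Leading principal axis of centred data via power iteration on X^T X."""
    rng = random.Random(seed)
    d = len(x[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        scores = [sum(a * w for a, w in zip(row, v)) for row in x]
        nxt = [sum(s * row[j] for s, row in zip(scores, x)) for j in range(d)]
        norm = math.sqrt(sum(w * w for w in nxt))
        if norm == 0.0:  # no variance left in the data; keep current vector
            break
        v = [w / norm for w in nxt]
    return v


def pca_init(data: Matrix, target_std: float = 1e-4) -> Matrix:
    """Return n x 2 starting coordinates from the first two principal axes.

    target_std is an assumed value: the first coordinate is rescaled to
    this standard deviation so the initial layout is small.
    """
    x = center(data)
    coords = []
    for _ in range(2):
        axis = top_component(x)
        scores = [sum(a * w for a, w in zip(row, axis)) for row in x]
        coords.append(scores)
        # Deflate: remove this component before looking for the next one.
        x = [[a - s * w for a, w in zip(row, axis)] for row, s in zip(x, scores)]
    std1 = math.sqrt(sum(s * s for s in coords[0]) / len(data))
    if std1 == 0.0:
        raise ValueError("data has no variance")
    scale = target_std / std1
    return [[coords[0][i] * scale, coords[1][i] * scale] for i in range(len(data))]


# Made-up toy data: 30 points in 3 dimensions.
toy = [[float(i), 2.0 * i + (i % 3), 0.1 * (i % 5)] for i in range(30)]
start = pca_init(toy)
print(len(start), "points,", len(start[0]), "coordinates each")
```

The resulting coordinates would be passed to a t-SNE implementation as its initial embedding in place of random starting positions. The learning rate and multi-scale kernels mentioned in the abstract are not covered by this sketch.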
biorxiv bioinformatics 100-200-users 2018