An introduction to MPEG-G, the new ISO standard for genomic information representation, bioRxiv, 2018-09-27
Abstract: The MPEG-G standardization initiative is a coordinated international effort to specify a compressed data format that enables large-scale genomic data to be processed, transported and shared. The standard consists of a set of specifications (i.e., a book) describing i) a normative format syntax, and ii) a normative decoding process to retrieve the information coded in a compliant file or bitstream. This decoding process enables the use of leading-edge compression technologies that have exhibited significant compression gains over currently used formats for storage of unaligned and aligned sequencing reads. Additionally, the standard provides a wealth of much-needed functionality, such as selective access, data aggregation, application programming interfaces to the compressed data, standard interfaces to support data protection mechanisms, support for streaming, and a procedure to assess the conformance of implementations. ISO/IEC is engaged in supporting the maintenance and availability of the standard specification, which guarantees the perenniality of applications using MPEG-G. Finally, the standard ensures interoperability and integration with existing genomic information processing pipelines by providing support for conversion from the FASTQ/SAM/BAM file formats. In this paper we provide an overview of the MPEG-G specification, with particular focus on the main advantages and novel functionality it offers. As the standard only specifies the decoding process, encoding performance, both in terms of speed and compression ratio, can vary depending on specific encoder implementations, and will likely improve during the lifetime of MPEG-G. Hence, the performance statistics provided here are only indicative baseline examples of the technologies included in the standard.
biorxiv bioinformatics 100-200-users 2018
Cohort Profile: East London Genes & Health (ELGH), a community based population genomics and health study of British-Bangladeshi and British-Pakistani people, bioRxiv, 2018-09-27
Cohort profile in a nutshell: East London Genes & Health (ELGH) is a large-scale, community genomics and health study (to date >34,000 volunteers; target 100,000 volunteers). ELGH was set up in 2015 to gain deeper understanding of health and disease, and underlying genetic influences, in British-Bangladeshi and British-Pakistani people living in east London. ELGH prioritises studies in areas important to, and identified by, the community it represents. Current priorities include cardiometabolic diseases and mental illness, these being of notably high prevalence and severity. However, studies in any scientific area are possible, subject to community advisory group and ethical approval. ELGH combines health data science (using linked UK National Health Service (NHS) electronic health record data) with exome sequencing and SNP array genotyping to elucidate the genetic influence on health and disease, including the contribution from high rates of parental relatedness on rare genetic variation and homozygosity (autozygosity), in two understudied ethnic groups. Linkage to longitudinal health record data enables both retrospective and prospective analyses. Through Stage 2 studies, ELGH offers researchers the opportunity to undertake recall-by-genotype and/or recall-by-phenotype studies on volunteers. Sub-cohort, trial-within-cohort, and other study designs are possible. ELGH is a fully collaborative, open access resource, open to academic and life sciences industry scientific research partners.
biorxiv genomics 0-100-users 2018
Collective intercellular communication through ultra-fast hydrodynamic trigger waves, bioRxiv, 2018-09-27
The biophysical relationships between sensors and actuators have been fundamental to the development of complex life forms; abundant flows are generated and persist in aquatic environments by swimming organisms, while responding promptly to external stimuli is key to survival. Here, akin to a chain reaction, we present the discovery of hydrodynamic trigger waves in cellular communities of the protist Spirostomum ambiguum, propagating hundreds of times faster than the swimming speed. Coiling its cytoskeleton, Spirostomum can contract its long body by 50% within milliseconds, with accelerations reaching 14g-forces. Surprisingly, a single cellular contraction (transmitter) is shown to generate long-ranged vortex flows at intermediate Reynolds numbers, which can trigger neighbouring cells, in turn. To measure the sensitivity to hydrodynamic signals (receiver), we further present a high-throughput suction-flow device to probe mechanosensitive ion channel gating by back-calculating the microscopic forces on the cell membrane. These ultra-fast hydrodynamic trigger waves are analysed and modelled quantitatively in a universal framework of antenna and percolation theory. A phase transition is revealed, requiring a critical colony density to sustain collective communication. Our results suggest that this signalling could help organise cohabiting communities over large distances, influencing long-term behaviour through gene expression, comparable to quorum sensing. More immediately, as contractions release toxins, synchronised discharges could also facilitate the repulsion of large predators, or conversely immobilise large prey. We postulate that beyond protists numerous other freshwater and marine organisms could coordinate with variations of hydrodynamic trigger waves.
biorxiv biophysics 200-500-users 2018
Liposome-based transfection enhances RNAi and CRISPR-mediated mutagenesis in non-model nematode systems, bioRxiv, 2018-09-27
Abstract: Nematodes belong to one of the most diverse animal phyla. However, functional genomic studies in nematodes, other than in a few species, have often been limited in their reliability and success. Here we report that by combining liposome-based technology with microinjection, we were able to establish a wide range of genomic techniques in the newly described nematode genus Auanema. The method also allowed heritable changes in dauer larvae of Auanema, despite the immaturity of the gonad at the time of the microinjection. As proof of concept for potential functional studies in other nematode species, we also induced RNAi in the free-living nematode Pristionchus pacificus and targeted the human parasite Strongyloides stercoralis.
biorxiv developmental-biology 0-100-users 2018
Local epigenomic state cannot discriminate interacting and non-interacting enhancer–promoter pairs with high accuracy, bioRxiv, 2018-09-27
Abstract: We report an overfitting issue in recent machine learning formulations of the enhancer-promoter interaction problem, arising from the fact that many enhancer-promoter pairs share features. Cross-fold validation schemes which do not correctly separate these feature-sharing enhancer-promoter pairs into one test set report high accuracy, which is actually due to overfitting. Cross-fold validation schemes which properly segregate pairs with shared features show markedly reduced ability to predict enhancer-promoter interactions from epigenomic state. Parameter scans with multiple models indicate that local epigenomic features of individual pairs of enhancers and promoters cannot distinguish, with high accuracy, those pairs that interact from those that do not, suggesting that additional information is required to predict enhancer-promoter interactions.
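The validation scheme described in this abstract can be illustrated with a minimal sketch of group-aware fold assignment. This is not the authors' code: the function names `grouped_folds` and `train_test_splits` are hypothetical, and grouping pairs by a shared promoter label is an assumption made only for illustration (the abstract says only that pairs "share features"). The point is that every pair carrying the same group label lands in the same fold, so no shared feature appears in both training and test data.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Assign sample indices to folds so that no group is split across folds."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Place the largest groups first, always into the currently smallest fold.
    for group in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_group[group])
    return folds


def train_test_splits(groups, n_folds):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = grouped_folds(groups, n_folds)
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train), sorted(test)


# Hypothetical data: six enhancer-promoter pairs, labelled by shared promoter.
promoter_of_pair = ["P1", "P1", "P2", "P2", "P3", "P3"]
for train, test in train_test_splits(promoter_of_pair, n_folds=3):
    train_groups = {promoter_of_pair[i] for i in train}
    test_groups = {promoter_of_pair[i] for i in test}
    # No promoter label may appear on both sides of a split.
    assert not (train_groups & test_groups)
```

A plain random k-fold split over the same six pairs could place one `P1` pair in training and the other in testing, which is the leakage the abstract attributes the inflated accuracy to.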
biorxiv genomics 0-100-users 2018
PlotsOfData – a web app for visualizing data together with its summaries, bioRxiv, 2018-09-27
Abstract: Reporting of the actual data in graphs and plots increases transparency and enables independent evaluation. On the other hand, data summaries are often used in graphs since they aid interpretation. State-of-the-art data visualizations can be made with the ggplot2 package, which uses the ideas of a ‘grammar of graphics’ to generate a graphic from multiple layers of data. However, ggplot2 requires coding skills and an understanding of the tidy data structure. To democratize state-of-the-art data visualization of raw data with a selection of statistical summaries, a web app was written using R Shiny that uses the ggplot2 package for generating plots. A multilayered approach together with adjustable transparency offers a unique flexibility, enabling users to choose how to display the data and which of the data summaries to add. Four data summaries are provided (mean, median, boxplot, and violin plot) to accommodate several types of data distributions. In addition, 95% confidence intervals can be added for visual inferences. By adjusting the transparency of the layers, the visualization of the raw data together with the summary can be tuned for optimal presentation and interpretation. The app is dubbed PlotsOfData and is available at https://huygens.science.uva.nl/PlotsOfData
biorxiv scientific-communication-and-education 0-100-users 2018