A Data Citation Roadmap for Scholarly Data Repositories, bioRxiv, 2016-12-29
This article presents a practical roadmap for scholarly data repositories to implement data citation in accordance with the Joint Declaration of Data Citation Principles, a synopsis and harmonization of the recommendations of major science policy bodies. The roadmap was developed by the Repositories Expert Group, as part of the Data Citation Implementation Pilot (DCIP) project, an initiative of FORCE11.org and the NIH BioCADDIE (https://biocaddie.org) program. The roadmap makes 11 specific recommendations, grouped into three phases of implementation: a) required steps needed to support the Joint Declaration of Data Citation Principles, b) recommended steps that facilitate article/data publication workflows, and c) optional steps that further improve data citation support provided by data repositories.
biorxiv scientific-communication-and-education 200-500-users 2016
Targeted degradation of CTCF decouples local insulation of chromosome domains from higher-order genomic compartmentalization, bioRxiv, 2016-12-22
The molecular mechanisms underlying folding of mammalian chromosomes remain poorly understood. The transcription factor CTCF is a candidate regulator of chromosomal structure. Using the auxin-inducible degron system in mouse embryonic stem cells, we show that CTCF is absolutely and dose-dependently required for looping between CTCF target sites and segmental organization into topologically associating domains (TADs). Restoring CTCF reinstates proper architecture on altered chromosomes, indicating a powerful instructive function for CTCF in chromatin folding, and CTCF remains essential for TAD organization in non-dividing cells. Surprisingly, active and inactive genome compartments remain properly segregated upon CTCF depletion, revealing that compartmentalization of mammalian chromosomes emerges independently of proper insulation of TADs. Further, our data supports that CTCF mediates transcriptional insulator function through enhancer-blocking but not direct chromatin barrier activity. These results define the functions of CTCF in chromosome folding, and provide new fundamental insights into the rules governing mammalian genome organization.
biorxiv genomics 200-500-users 2016
Creating a universal SNP and small indel variant caller with deep neural networks, bioRxiv, 2016-12-15
Next-generation sequencing (NGS) is a rapidly evolving set of technologies that can be used to determine the sequence of an individual’s genome [1] by calling genetic variants present in an individual using billions of short, errorful sequence reads [2]. Despite more than a decade of effort and thousands of dedicated researchers, the hand-crafted and parameterized statistical models used for variant calling still produce thousands of errors and missed variants in each genome [3,4]. Here we show that a deep convolutional neural network [5] can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships (likelihoods) between images of read pileups around putative variant sites and ground-truth genotype calls. This approach, called DeepVariant, outperforms existing tools, even winning the “highest performance” award for SNPs in an FDA-administered variant calling challenge. The learned model generalizes across genome builds and even to other mammalian species, allowing non-human sequencing projects to benefit from the wealth of human ground truth data. We further show that, unlike existing tools which perform well on only a specific technology, DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, from deep whole genomes from 10X Genomics to Ion Ampliseq exomes. DeepVariant represents a significant step from expert-driven statistical modeling towards more automatic deep learning approaches for developing software to interpret biological instrumentation data.
biorxiv genomics 200-500-users 2016
C. elegans discriminate colors without eyes or opsins, bioRxiv, 2016-12-09
Here we establish that, contrary to expectations, Caenorhabditis elegans nematode worms possess a color discrimination system despite lacking any opsin or other known visible light photoreceptor genes. We found that white light guides C. elegans foraging decisions away from harmful bacteria that secrete a blue pigment toxin. Absorption of amber light by this blue pigment toxin alters the color of light sensed by the worm, and thereby triggers an increase in avoidance. By combining narrow-band blue and amber light sources, we demonstrated that detection of the specific blue/amber ratio by the worm guides its foraging decision. These behavioral and psychophysical studies thus establish the existence of a color detection system that is distinct from those of other animals.
biorxiv neuroscience 200-500-users 2016
Tractography-based connectomes are dominated by false-positive connections, bioRxiv, 2016-11-08
Fiber tractography based on non-invasive diffusion imaging is at the heart of connectivity studies of the human brain. To date, the approach has not been systematically validated in ground truth studies. Based on a simulated human brain dataset with ground truth white matter tracts, we organized an open international tractography challenge, which resulted in 96 distinct submissions from 20 research groups. While most state-of-the-art algorithms reconstructed 90% of ground truth bundles to at least some extent, on average they produced four times more invalid than valid bundles. About half of the invalid bundles occurred systematically in the majority of submissions. Our results demonstrate fundamental ambiguities inherent to tract reconstruction methods based on diffusion orientation information, with critical consequences for the approach of diffusion tractography in particular and human connectivity studies in general.
biorxiv neuroscience 200-500-users 2016
I Tried a Bunch of Things: The Dangers of Unexpected Overfitting in Classification, bioRxiv, 2016-10-04
Machine learning is a powerful set of techniques that has enhanced the abilities of neuroscientists to interpret information collected through EEG, fMRI, MEG, and PET data. With these new techniques come new dangers of overfitting that are not well understood by the neuroscience community. In this article, we use Support Vector Machine (SVM) classifiers and genetic algorithms to demonstrate the ease with which overfitting can occur, despite the use of cross-validation. We demonstrate that comparable and non-generalizable results can be obtained on informative and non-informative (i.e. random) data by iteratively modifying hyperparameters in seemingly innocuous ways. We recommend a number of techniques for limiting overfitting, such as lock boxes, blind analyses, and pre-registrations. These techniques, although uncommon in neuroscience applications, are common in many other fields that use machine learning, including computer science and physics. Adopting similar safeguards is critical for ensuring the robustness of machine-learning techniques.
biorxiv neuroscience 200-500-users 2016
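The lock-box safeguard recommended in the overfitting abstract above can be illustrated with a small sketch. It is not the authors' method: the abstract describes SVM classifiers tuned with genetic algorithms, whereas this example swaps in a nearest-centroid classifier and a random search over feature subsets so that it runs on the Python standard library alone. All function names, sample sizes, and the choice of "feature subset" as the tuned hyperparameter are assumptions made for illustration. The data are random, so any cross-validated accuracy above chance comes from repeated selection rather than from signal.

```python
import random


def make_random_data(n_samples, n_features, rng):
    """Non-informative data: features and labels are drawn independently."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [rng.randrange(2) for _ in range(n_samples)]
    return X, y


def centroid_predict(X_train, y_train, X_test, features):
    """Nearest-centroid classifier restricted to the chosen feature subset."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        if rows:
            centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
        else:  # degenerate split with one class missing: fall back to the origin
            centroids[label] = [0.0] * len(features)
    preds = []
    for x in X_test:
        dists = {lab: sum((x[f] - c) ** 2 for f, c in zip(features, cen))
                 for lab, cen in centroids.items()}
        preds.append(min(dists, key=dists.get))
    return preds


def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)


def cv_accuracy(X, y, features, k=5):
    """Plain k-fold cross-validation accuracy on the development set."""
    n = len(X)
    scores = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        preds = centroid_predict([X[i] for i in train_idx], [y[i] for i in train_idx],
                                 [X[i] for i in test_idx], features)
        scores.append(accuracy([y[i] for i in test_idx], preds))
    return sum(scores) / k


def run_demo(seed=0, n_trials=200):
    """Tune on the development set only, then open the lock box exactly once."""
    rng = random.Random(seed)
    X, y = make_random_data(120, 30, rng)
    X_dev, y_dev = X[:80], y[:80]      # used for every tuning iteration
    X_lock, y_lock = X[80:], y[80:]    # lock box: never touched during tuning

    history = []                       # best cross-validated score after each trial
    best_score, best_features = -1.0, None
    for _ in range(n_trials):
        features = rng.sample(range(30), 5)   # "hyperparameter": which features to use
        score = cv_accuracy(X_dev, y_dev, features)
        if score > best_score:
            best_score, best_features = score, features
        history.append(best_score)

    lock_preds = centroid_predict(X_dev, y_dev, X_lock, best_features)
    lock_score = accuracy(y_lock, lock_preds)
    return history, best_score, lock_score
```

Calling `run_demo()` and comparing `best_score` with `lock_score` gives a rough sense of how much the repeated selection has inflated the cross-validated estimate. Because the labels are random, the lock-box accuracy would be expected to stay near chance, although any single run can deviate from that.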