Deep Sequencing of 10,000 Human Genomes, bioRxiv, 2016-07-02
AbstractWe report on the sequencing of 10,545 human genomes at 30-40x coverage with an emphasis on quality metrics and novel variant and sequence discovery. We find that 84% of an individual human genome can be sequenced confidently. This high confidence region includes 91.5% of exon sequence and 95.2% of known pathogenic variant positions. We present thedistribution of over 150 million single nucleotide variants in the coding and non-coding genome. Each newly sequenced genome contributes an average of 8,579 novel variants. In addition, each genome carries in average 0.7 Mb of sequence that is not found in the main build of the hg38 reference genome. The density of this catalog of variation allowed us to construct highresolution profiles that define genomic sites that are highly intolerant of genetic variation. These results indicate that the data generated by deep genome sequencing is of the quality necessary for clinical use.Significance statementDeclining sequencing costs and new large-scale initiatives towards personalized medicine are driving a massive expansion in the number of human genomes being sequenced. Therefore, there is an urgent need to define quality standards for clinical use. This includes deep coverage and sequencing accuracy of an individual’s genome, rather than aggregated coverage of data across a cohort or population. Our work represents the largest effort to date in sequencing human genomes at deep coverage with these new standards. This study identifies over 150 million human variants, a majority of them rare and unknown. Moreover, these data identify sites in the genome that are highly intolerant to variation - possibly essential for life or health. We conclude that high coverage genome sequencing provides accurate detail on human variation for discovery and for clinical applications.
biorxiv genomics 200-500-users 2016Democratizing DNA Fingerprinting, bioRxiv, 2016-07-01
AbstractWe report a rapid, inexpensive, and portable strategy to re-identify human DNA using the MinION, a miniature sequencing sensor by Oxford Nanopore Technologies. Our strategy requires only 10-30 minutes of MinION sequencing, works with low input DNA, and enables familial searches. We also show that it can re-identify individuals from Direct-to-Consumer genomic datasets that are publicly available. We discuss potential forensic applications as well as the legal and ethical implications of a democratized DNA fingerprinting strategy available to the public.
biorxiv genomics 100-200-users 2016Suite2p beyond 10,000 neurons with standard two-photon microscopy, bioRxiv, 2016-07-01
AbstractTwo-photon microscopy of calcium-dependent sensors has enabled unprecedented recordings from vast populations of neurons. While the sensors and microscopes have matured over several generations of development, computational methods to process the resulting movies remain inefficient and can give results that are hard to interpret. Here we introduce Suite2p a fast, accurate and complete pipeline that registers raw movies, detects active cells, extracts their calcium traces and infers their spike times. Suite2p runs on standard workstations, operates faster than real time, and recovers ~2 times more cells than the previous state-of-the-art method. Its low computational load allows routine detection of ~10,000 cells simultaneously with standard two-photon resonant-scanning microscopes. Recordings at this scale promise to reveal the fine structure of activity in large populations of neurons or large populations of subcellular structures such as synaptic boutons.
biorxiv neuroscience 200-500-users 2016Voodoo Machine Learning for Clinical Predictions, bioRxiv, 2016-06-20
AbstractThe availability of smartphone and wearable sensor technology is leading to a rapid accumulation of human subject data, and machine learning is emerging as a technique to map that data into clinical predictions. As machine learning algorithms are increasingly used to support clinical decision making, it is important to reliably quantify their prediction accuracy. Cross-validation is the standard approach for evaluating the accuracy of such algorithms; however, several cross-validations methods exist and only some of them are statistically meaningful. Here we compared two popular cross-validation methods record-wise and subject-wise. Using both a publicly available dataset and a simulation, we found that record-wise cross-validation often massively overestimates the prediction accuracy of the algorithms. We also found that this erroneous method is used by almost half of the retrieved studies that used accelerometers, wearable sensors, or smartphones to predict clinical outcomes. As we move towards an era of machine learning based diagnosis and treatment, using proper methods to evaluate their accuracy is crucial, as erroneous results can mislead both clinicians and data scientists.
biorxiv bioinformatics 100-200-users 2016Scanning the Horizon Towards transparent and reproducible neuroimaging research, bioRxiv, 2016-06-17
AbstractFunctional neuroimaging techniques have transformed our ability to probe the neurobiological basis of behaviour and are increasingly being applied by the wider neuroscience community. However, concerns have recently been raised that the conclusions drawn from some human neuroimaging studies are either spurious or not generalizable. Problems such as low statistical power, flexibility in data analysis, software errors, and lack of direct replication apply to many fields, but perhaps particularly to fMRI. Here we discuss these problems, outline current and suggested best practices, and describe how we think the field should evolve to produce the most meaningful answers to neuroscientific questions.
biorxiv scientific-communication-and-education 100-200-users 2016The genetic structure of the world’s first farmers, bioRxiv, 2016-06-17
We report genome-wide ancient DNA from 44 ancient Near Easterners ranging in time between ~12,000-1,400 BCE, from Natufian hunter-gatherers to Bronze Age farmers. We show that the earliest populations of the Near East derived around half their ancestry from a ‘Basal Eurasian’ lineage that had little if any Neanderthal admixture and that separated from other non-African lineages prior to their separation from each other. The first farmers of the southern Levant (Israel and Jordan) and Zagros Mountains (Iran) were strongly genetically differentiated, and each descended from local hunter-gatherers. By the time of the Bronze Age, these two populations and Anatolian-related farmers had mixed with each other and with the hunter-gatherers of Europe to drastically reduce genetic differentiation. The impact of the Near Eastern farmers extended beyond the Near East farmers related to those of Anatolia spread westward into Europe; farmers related to those of the Levant spread southward into East Africa; farmers related to those from Iran spread northward into the Eurasian steppe; and people related to both the early farmers of Iran and to the pastoralists of the Eurasian steppe spread eastward into South Asia.
biorxiv genetics 200-500-users 2016