Extensive sequencing of seven human genomes to characterize benchmark reference materials, bioRxiv, 2015-09-16
The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST) is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCodeTM WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.
biorxiv genomics 100-200-users 2015Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation, bioRxiv, 2015-08-30
RNA-seq technology is widely used in biomedical and basic science research. These studies rely on complex computational methods that quantify expression levels for observed transcripts. We find that current computational methods can lead to hundreds of false positive results related to alternative isoform usage. This flaw in the current methodology stems from a lack of modeling sample-specific bias that leads to drops in coverage and is related to sequence features like fragment GC content and GC stretches. By incorporating features that explain this bias into transcript expression models, we greatly increase the specificity of transcript expression estimates, with more than a four-fold reduction in the number of false positives for reported changes in expression. We introduce alpine, a method for estimation of bias-corrected transcript abundance. The method is available as a Bioconductor package that includes data visualization tools useful for bias discovery.
biorxiv bioinformatics 100-200-users 2015Missing Data and Technical Variability in Single-Cell RNA- Sequencing Experiments, bioRxiv, 2015-08-26
Until recently, high-throughput gene expression technology, such as RNA-Sequencing (RNA-seq) required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used and numerous publications are based on data produced with this technology. However, RNA-Seq and scRNA-seq data are markedly different. In particular, unlike RNA-Seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically-driven, genes not expressing RNA at the time of measurement, or technically-driven, gene expressing RNA, but not at a sufficient level to detected by sequencing technology. Another difference is that the proportion of genes reporting the expression level to be zero varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is being driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies cell-to-cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch-effects and confounded experiments can intensify the problem.
biorxiv genomics 200-500-users 2015Iron Age and Anglo-Saxon genomes from East England reveal British migration history, bioRxiv, 2015-07-18
British population history has been shaped by a series of immigrations and internal movements, including the early Anglo-Saxon migrations following the breakdown of the Roman administration after 410CE. It remains an open question how these events affected the genetic composition of the current British population. Here, we present whole-genome sequences generated from ten ancient individuals found in archaeological excavations close to Cambridge in the East of England, ranging from 2,300 until 1,200 years before present (Iron Age to Anglo-Saxon period). We use present-day genetic data to characterize the relationship of these ancient individuals to contemporary British and other European populations. By analyzing the distribution of shared rare variants across ancient and modern individuals, we find that today’s British are more similar to the Iron Age individuals than to most of the Anglo-Saxon individuals, and estimate that the contemporary East English population derives 30% of its ancestry from Anglo-Saxon migrations, with a lower fraction in Wales and Scotland. We gain further insight with a new method, rarecoal, which fits a demographic model to the distribution of shared rare variants across a large number of samples, enabling fine scale analysis of subtle genetic differences and yielding explicit estimates of population sizes and split times. Using rarecoal we find that the ancestors of the Anglo-Saxon samples are closest to modern Danish and Dutch populations, while the Iron Age samples share ancestors with multiple Northern European populations including Britain.
biorxiv genetics 0-100-users 2015Accelerating Scientific Publication in Biology, bioRxiv, 2015-07-12
Scientific publications enable results and ideas to be transmitted throughout the scientific community. The number and type of journal publications also have become the primary criteria used in evaluating career advancement. Our analysis suggests that publication practices have changed considerably in the life sciences over the past thirty years. More experimental data is now required for publication, and the average time required for graduate students to publish their first paper has increased and is approaching the desirable duration of Ph.D. training. Since publication is generally a requirement for career progression, schemes to reduce the time of graduate student and postdoctoral training may be difficult to implement without also considering new mechanisms for accelerating communication of their work. The increasing time to publication also delays potential catalytic effects that ensue when many scientists have access to new information. The time has come for life scientists, funding agencies, and publishers to discuss how to communicate new findings in a way that best serves the interests of the public and the scientific community.
biorxiv scientific-communication-and-education 200-500-users 2015Salmon provides accurate, fast, and bias-aware transcript expression estimates using dual-phase inference, bioRxiv, 2015-06-28
We introduce Salmon, a new method for quantifying transcript abundance from RNA-seq reads that is highly-accurate and very fast. Salmon is the first transcriptome-wide quantifier to model and correct for fragment GC content bias, which we demonstrate substantially improves the accuracy of abundance estimates and the reliability of subsequent differential expression analysis compared to existing methods that do not account for these biases. Salmon achieves its speed and accuracy by combining a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. These innovations yield both exceptional accuracy and order-of-magnitude speed benefits over alignment-based methods.
biorxiv bioinformatics 100-200-users 2015