fastp: an ultra-fast all-in-one FASTQ preprocessor, bioRxiv, 2018-03-02
Abstract
Motivation: Quality control and preprocessing of FASTQ files are essential to providing clean data for downstream analysis. Traditionally, a different tool is used for each operation, such as quality control, adapter trimming, and quality filtering. These tools are often insufficiently fast as most are developed using high-level programming languages (e.g., Python and Java) and provide limited multi-threading support. Reading and loading data multiple times also renders preprocessing slow and I/O inefficient.
Results: We developed fastp as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features. It can perform quality control, adapter trimming, quality filtering, per-read quality cutting, and many other operations with a single scan of the FASTQ data. It also supports unique molecular identifier preprocessing, poly tail trimming, output splitting, and base correction for paired-end data. It can automatically detect adapters for single-end and paired-end FASTQ data. This tool is developed in C++ and has multi-threading support. Based on our evaluation, fastp is 2–5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt despite performing far more operations than similar tools.
Availability and Implementation: The open-source code and corresponding instructions are available at https://github.com/OpenGene/fastp
Contact: chen@haplox.com
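To illustrate what "a single scan of the FASTQ data" can mean in practice, here is a minimal Python sketch that trims low-quality tails, filters short reads, and collects summary statistics in one pass over the records. It is not fastp's implementation (fastp is written in C++ and multi-threaded); the function names, the Phred+33 offset, and the thresholds are assumptions chosen for the example.

```python
import io
from typing import Dict, Iterator, List, TextIO, Tuple

Record = Tuple[str, str, str]  # (name, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def trim_tail(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_q (hypothetical rule)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def single_pass(handle: TextIO, min_q: int = 20, min_len: int = 5) -> Tuple[List[Record], Dict[str, int]]:
    """Trim, filter, and gather statistics while reading each record once."""
    stats = {"reads_in": 0, "reads_out": 0, "bases_in": 0, "bases_out": 0}
    kept: List[Record] = []
    for name, seq, qual in read_fastq(handle):
        stats["reads_in"] += 1
        stats["bases_in"] += len(seq)
        seq, qual = trim_tail(seq, qual, min_q)
        if len(seq) < min_len:
            continue  # read too short after trimming
        stats["reads_out"] += 1
        stats["bases_out"] += len(seq)
        kept.append((name, seq, qual))
    return kept, stats


if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGTACGT\n+\nIIIIII##\n@r2\nACGT\n+\n####\n")
    reads, summary = single_pass(demo)
    print(reads, summary)
```

Because statistics are updated inside the same loop that trims and filters, the input is never re-read, which is the I/O saving the abstract describes.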
biorxiv bioinformatics 100-200-users 2018
Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome, bioRxiv, 2018-03-01
Abstract
Motivation: Identifying transcription factor binding sites is the first step in pinpointing non-coding mutations that disrupt the regulatory function of transcription factors and promote disease. ChIP-seq is the most common method for identifying binding sites, but performing it on patient samples is hampered by the amount of available biological material and the cost of the experiment. Existing methods for computational prediction of regulatory elements primarily predict binding in genomic regions with sequence similarity to known transcription factor sequence preferences. This has limited efficacy since most binding sites do not resemble known transcription factor sequence motifs, and many transcription factors are not even sequence-specific.
Results: We developed Virtual ChIP-seq, which predicts binding of individual transcription factors in new cell types using an artificial neural network that integrates ChIP-seq results from other cell types and chromatin accessibility data in the new cell type. Virtual ChIP-seq also uses learned associations between gene expression and transcription factor binding at specific genomic regions. This approach outperforms methods that predict TF binding solely based on sequence preference, predicting binding for 36 transcription factors (Matthews correlation coefficient > 0.3).
Availability: The datasets we used for training and validation are available at https://virchip.hoffmanlab.org. We have deposited in Zenodo the current version of our software (http://doi.org/10.5281/zenodo.1066928), datasets (http://doi.org/10.5281/zenodo.823297), predictions for 36 transcription factors on Roadmap Epigenomics cell types (http://doi.org/10.5281/zenodo.1455759), and predictions in Cistrome as well as the ENCODE-DREAM in vivo TF Binding Site Prediction Challenge (http://doi.org/10.5281/zenodo.1209308).
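The results above are reported as a Matthews correlation coefficient (MCC) threshold of 0.3. As a reminder of what that metric measures, the sketch below computes MCC from binary labels using the standard confusion-matrix formula. The function name and the toy labels are illustrative assumptions, not part of the Virtual ChIP-seq software.

```python
import math
from typing import Sequence


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    bound = [1, 1, 0, 0]      # toy ground truth: is the region bound?
    predicted = [1, 0, 0, 0]  # toy predictions
    print(matthews_corrcoef(bound, predicted))
```

MCC ranges from -1 to 1 and stays informative when bound regions are a small minority of the genome, which is why it suits binding-site prediction better than plain accuracy.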
biorxiv bioinformatics 200-500-users 2018
End-to-end differentiable learning of protein structure, bioRxiv, 2018-02-15
Abstract
Predicting protein structure from sequence is a central challenge of biochemistry. Co‐evolution methods show promise, but an explicit sequence‐to‐structure map remains elusive. Advances in deep learning that replace complex, human‐designed pipelines with differentiable models optimized end‐to‐end suggest the potential benefits of similarly reformulating structure prediction. Here we report the first end‐to‐end differentiable model of protein structure. The model couples local and global protein structure via geometric units that optimize global geometry without violating local covalent chemistry. We test our model using two challenging tasks: predicting novel folds without co‐evolutionary data and predicting known folds without structural templates. In the first task the model achieves state‐of‐the‐art accuracy and in the second it comes within 1‐2Å; competing methods using co‐evolution and experimental templates have been refined over many years and it is likely that the differentiable approach has substantial room for further improvement, with applications ranging from drug discovery to protein design.
biorxiv bioinformatics 200-500-users 2018
Identification of transcriptional signatures for cell types from single-cell RNA-Seq, bioRxiv, 2018-02-13
Abstract
Single-cell RNA-Seq makes it possible to characterize the transcriptomes of cell types and identify their transcriptional signatures via differential analysis. We present a fast and accurate method for discriminating cell types that takes advantage of the large numbers of cells that are assayed. When applied to transcript compatibility counts obtained via pseudoalignment, our approach provides a quantification-free analysis of 3’ single-cell RNA-Seq that can identify previously undetectable marker genes.
biorxiv bioinformatics 100-200-users 2018
Integrating Hi-C links with assembly graphs for chromosome-scale assembly, bioRxiv, 2018-02-08
Abstract
Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, few open-source tools are available. Error rates, particularly for inversions and fusions across chromosomes, remain higher than with alternative scaffolding technologies. We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA
Author summary
Hi-C technology was originally proposed to study the 3D organization of a genome. Recently, it has also been applied to assemble large eukaryotic genomes into chromosome-scale scaffolds. Despite this, there are few open-source methods to generate these assemblies. Existing methods are also prone to small inversion errors due to noise in the Hi-C data. In this work, we address these challenges and develop a method named SALSA2. SALSA2 uses sequence overlap information from an assembly graph to correct inversion errors and provide accurate chromosome-scale assemblies.
biorxiv bioinformatics 100-200-users 2018
Inference of CRISPR Edits from Sanger Trace Data, bioRxiv, 2018-01-21
Abstract
Efficient precision genome editing requires a quick, quantitative, and inexpensive assay of editing outcomes. Here we present ICE (Inference of CRISPR Edits), which enables robust analysis of CRISPR edits using Sanger data. ICE proposes potential outcomes for editing with guide RNAs (gRNAs) and then determines which are supported by the data via regression. Additionally, we develop a score called ICE-D (Discordance) that can provide information on large or unexpected edits. We empirically confirm through over 1,800 edits that the ICE algorithm is robust, reproducible, and can analyze CRISPR experiments within days after transfection. We also confirm that ICE strongly correlates with next-generation sequencing of amplicons (Amp-Seq). The ICE tool is free to use and offers several improvements over current analysis tools. For instance, ICE can analyze individual experiments as well as multiple experiments simultaneously (batch analysis). ICE can also detect a wider variety of outcomes, including multi-guide edits (multiple gRNAs per target) and edits resulting from homology-directed repair (HDR), such as knock-ins and base edits. ICE is a reliable analysis tool that can significantly expedite CRISPR editing workflows. It is available online at ice.synthego.com, and the source code is at github.com/synthego-open/ice
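The "propose outcomes, then fit them by regression" idea can be sketched as follows. This is a toy illustration under strong assumptions, not the ICE code: the trace is idealised as a one-hot base signal, proposals are limited to deletions at the cut site, and the regression is a non-negative least squares fit solved by projected gradient descent. All function names and the example sequence are hypothetical.

```python
from typing import List

BASES = "ACGT"


def one_hot(seq: str) -> List[float]:
    """Flatten a sequence into an idealised 4-channel trace (A, C, G, T)."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]


def propose_deletions(wild_type: str, cut: int, max_del: int, window: int) -> List[str]:
    """Candidate outcomes: delete 0..max_del bases starting at the cut site."""
    if len(wild_type) < window + max_del:
        raise ValueError("wild_type too short for the requested window")
    return [(wild_type[:cut] + wild_type[cut + d:])[:window] for d in range(max_del + 1)]


def fit_nonnegative(columns: List[List[float]], target: List[float], steps: int = 2000) -> List[float]:
    """Minimise 0.5 * ||A w - target||^2 subject to w >= 0 by projected gradient."""
    k, n = len(columns), len(target)
    # 1 / trace(A^T A) is a safe step size (trace bounds the largest eigenvalue).
    step = 1.0 / sum(v * v for col in columns for v in col)
    weights = [1.0 / k] * k
    for _ in range(steps):
        residual = [sum(weights[j] * columns[j][i] for j in range(k)) - target[i] for i in range(n)]
        grad = [sum(columns[j][i] * residual[i] for i in range(n)) for j in range(k)]
        weights = [max(0.0, weights[j] - step * grad[j]) for j in range(k)]
    return weights


if __name__ == "__main__":
    proposals = propose_deletions("ACGTTGCAGGCTAACTGA", cut=6, max_del=3, window=12)
    columns = [one_hot(p) for p in proposals]
    # Simulated mixed trace: 50% unedited, 30% 1-bp deletion, 20% 3-bp deletion.
    mix = [0.5, 0.3, 0.0, 0.2]
    observed = [sum(m * c[i] for m, c in zip(mix, columns)) for i in range(len(columns[0]))]
    print([round(w, 3) for w in fit_nonnegative(columns, observed)])
```

The fitted weights estimate the fraction of each proposed outcome in the mixed trace; proposals with a weight near zero are the ones the data do not support.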
biorxiv bioinformatics 0-100-users 2018