The genome of Caenorhabditis bovis, bioRxiv, 2019-09-20
Abstract: The free-living nematode Caenorhabditis elegans is a key laboratory model for metazoan biology. C. elegans is also used as a model for parasitic nematodes despite being only distantly related to most parasitic species. All ∼65 Caenorhabditis species currently in culture are free-living with most having been isolated from decaying plant or fungal matter. Caenorhabditis bovis is a particularly unusual species, having been isolated several times from the inflamed ears of Zebu cattle in Eastern Africa where it is believed to be the cause of bovine parasitic otitis. C. bovis is therefore of particular interest to researchers interested in the evolution of nematode parasitism and in Caenorhabditis diversity. However, as C. bovis is not in laboratory culture, it remains little studied and details of its prevalence, role in bovine parasitic otitis and relationships to other Caenorhabditis species are scarce. Here, by sampling livestock markets and slaughterhouses in Western Kenya, we successfully reisolate C. bovis from the ear of adult female Zebu. We sequence the genome of C. bovis using the Oxford Nanopore MinION platform in a nearby field laboratory and use the data to generate a chromosome-scale draft genome sequence. We exploit this draft genome to reconstruct the phylogenetic relationships of C. bovis to other Caenorhabditis species and reveal the changes in genome size and content that have occurred during its evolution. We also identify expansions in several gene families that have been implicated in parasitism in other nematode species, including those associated with resistance to antihelminthic drugs. The high-quality draft genome and our analyses thereof represent a significant advancement in our understanding of this unusual Caenorhabditis species.
biorxiv genomics 0-100-users 2019
An improved de novo assembly and annotation of the tomato reference genome using single-molecule sequencing, Hi-C proximity ligation and optical maps, bioRxiv, 2019-09-14
Abstract: The original Heinz 1706 reference genome was produced by a large team of scientists from across the globe from a variety of input sources that included 454 sequences in addition to full-length BACs, and BAC and fosmid ends sequenced with Sanger technology. We present here the latest tomato reference genome (SL4.0) assembled de novo from PacBio long reads and scaffolded using Hi-C contact maps. The assembly was validated using Bionano optical maps and 10X linked-read sequences. This assembly is highly contiguous with fewer gaps compared to previous genome builds and almost all scaffolds have been anchored and oriented to the 12 tomato chromosomes. We have found more repeats compared to the previous versions, and the LTR retrotransposons are among the largest repeat classes identified. We also describe updates to the reference genome and annotation since the last publication. The corresponding ITAG4.0 annotation has 4,794 novel genes along with 29,281 genes preserved from ITAG2.4. Most of the updated genes have extensions in the 5’ and 3’ UTRs resulting in doubling of annotated UTRs per gene. The genome and annotation can be accessed using SGN through BLAST database, Pathway database (SolCyc), Apollo, JBrowse genome browser and FTP available at https://solgenomics.net.
biorxiv genomics 0-100-users 2019
Complete characterization of the human immune cell transcriptome using accurate full-length cDNA sequencing, bioRxiv, 2019-09-13
Abstract: The human immune system relies on highly complex and diverse transcripts and the proteins they encode. These include transcripts for Human Leukocyte Antigen (HLA) class I and II receptors which are essential for self/non-self discrimination by the immune system as well as transcripts encoding B cell and T cell receptors (BCR and TCR) which recognize, bind, and help eliminate foreign antigens. HLA genes are highly diverse within the human population with each individual possessing two of thousands of different alleles in each of the 9 major HLA genes. Determining which combination of alleles an individual possesses for each HLA gene (high-resolution HLA-typing) is essential to establish donor-recipient compatibility in organ and bone-marrow transplantations. BCR and TCR genes in turn are generated by recombining a diverse set of gene segments on the DNA level in each maturing B and T cell, respectively. This process generates adaptive immune receptor repertoires (AIRR) composed of unique transcripts expressed by each B and T cell. These repertoires carry a vast amount of health-relevant information. Both short-read RNA-seq based HLA-typing [1] and adaptive immune receptor repertoire sequencing [2–5] currently rely heavily on our incomplete knowledge of the genetic diversity at HLA [6] and BCR/TCR loci [7,8]. Here we used our nanopore sequencing based Rolling Circle to Concatemeric Consensus (R2C2) protocol [9] to generate over 10,000,000 full-length cDNA sequences at a median accuracy of 97.9%. We used this dataset to demonstrate that deep and accurate full-length cDNA sequencing can, in addition to providing isoform-level transcriptome analysis for over 9,000 loci, be used to generate accurate sequences of HLA alleles for HLA allele typing and discovery as well as detailed AIRR data for the analysis of the adaptive immune system without requiring specific knowledge of the diversity at HLA and BCR/TCR loci.
biorxiv genomics 0-100-users 2019
Ancient DNA reconstructs the genetic legacies of pre-contact Puerto Rico communities, bioRxiv, 2019-09-12
Abstract: Indigenous peoples have occupied the island of Puerto Rico since at least 3000 B.C. Due to the demographic shifts that occurred after European contact, the origin(s) of these ancient populations, and their genetic relationship to present-day islanders, are unclear. We use ancient DNA to characterize the population history and genetic legacies of pre-contact Indigenous communities from Puerto Rico. Bone, tooth and dental calculus samples were collected from 124 individuals from three pre-contact archaeological sites: Tibes, Punta Candelero and Paso del Indio. Despite poor DNA preservation, we used target enrichment and high-throughput sequencing to obtain complete mitochondrial genomes (mtDNA) from 45 individuals and autosomal genotypes from two individuals. We found a high proportion of Native American mtDNA haplogroups A2 and C1 in the pre-contact Puerto Rico sample (40% and 44%, respectively). This distribution, as well as the haplotypes represented, supports a primarily Amazonian South American origin for these populations, and mirrors the Native American mtDNA diversity patterns found in present-day islanders. Three mtDNA haplotypes from pre-contact Puerto Rico persist among Puerto Ricans and other Caribbean islanders, indicating that present-day populations are reservoirs of pre-contact mtDNA diversity. Lastly, we find similarity in autosomal ancestry patterns between pre-contact individuals from Puerto Rico and the Bahamas, suggesting a shared component of Indigenous Caribbean ancestry with close affinity to South American populations. Our findings contribute to a more complete reconstruction of pre-contact Caribbean population history and explore the role of Indigenous peoples in shaping the biocultural diversity of present-day Puerto Ricans and other Caribbean islanders.
biorxiv genomics 200-500-users 2019
Quantifying the tradeoff between sequencing depth and cell number in single-cell RNA-seq, bioRxiv, 2019-09-10
The allocation of a sequencing budget when designing single-cell RNA-seq experiments requires consideration of the tradeoff between number of cells sequenced and the read depth per cell. One approach to the problem is to perform a power analysis for a univariate objective such as differential expression. However, many of the goals of single-cell analysis, such as clustering, require consideration of the multivariate structure of gene expression. We introduce an approach to quantifying the impact of sequencing depth and cell number on the estimation of a multivariate generative model for gene expression that is based on error analysis in the framework of a variational autoencoder. We find that at shallow depths, the marginal benefit of deeper sequencing per cell significantly outweighs the benefit of increased cell numbers. Above about 15,000 reads per cell the benefit of increased sequencing depth is minor. Code for the workflow reproducing the results of the paper is available at https://github.com/pachterlab/SBP_2019.
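The budget constraint behind this tradeoff is simple arithmetic: for a fixed number of reads, the number of cells is the budget divided by the depth per cell. The following is a minimal sketch of that arithmetic only, not the authors' variational-autoencoder analysis. The function name, the 300-million-read budget and the list of depths are hypothetical choices for illustration; only the figure of roughly 15,000 reads per cell comes from the abstract.

```python
def cells_for_budget(total_reads: int, reads_per_cell: int) -> int:
    """Cells that can be sequenced at a given depth under a fixed read budget."""
    if reads_per_cell <= 0:
        raise ValueError("reads_per_cell must be positive")
    return total_reads // reads_per_cell


# Hypothetical total budget in reads (an assumption, not a value from the paper).
BUDGET = 300_000_000

# Depths to compare; 15,000 is the approximate threshold quoted in the abstract,
# the other values are arbitrary illustrative choices.
DEPTHS = [1_000, 5_000, 15_000, 50_000]

# Map each candidate depth to the number of cells the budget allows.
allocations = {depth: cells_for_budget(BUDGET, depth) for depth in DEPTHS}

for depth, cells in allocations.items():
    print(f"{depth:>6} reads/cell -> {cells:>7} cells")
```

Under these assumed numbers, sequencing beyond about 15,000 reads per cell reduces the cell count while, according to the abstract, adding little benefit from the extra depth.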
biorxiv genomics 200-500-users 2019
A multi-view model for relative and absolute microbial abundances, bioRxiv, 2019-09-09
Abstract: The absolute abundance of bacterial taxa in human host-associated environments plays a critical role in reproductive and gastrointestinal health. However, obtaining the absolute abundance of many bacterial species is typically prohibitively expensive. In contrast, relative abundance data for many species is comparatively cheap and easy to collect (e.g., with universal primers for the 16S rRNA gene). In this paper, we propose a method to jointly model relative abundance data for many taxa and absolute abundance data for a subset of taxa. Our method provides point and interval estimates for the absolute abundance of all taxa. Crucially, our proposal accounts for differences in the efficiency of taxon detection in the relative and absolute abundance data. We show that modeling taxon-specific efficiencies substantially reduces the estimation error for absolute abundance, and controls the coverage of interval estimators. We demonstrate the performance of our proposed method via a simulation study, a sensitivity study where we jackknife the taxa with observed absolute abundances, and a study of women with bacterial vaginosis.
biorxiv genomics 0-100-users 2019