Low-N protein engineering with data-efficient deep learning, bioRxiv, 2020-01-24
Abstract: Protein engineering has enormous academic and industrial potential. However, it is limited by the lack of experimental assays that are consistent with the design goal and sufficiently high-throughput to find rare, enhanced variants. Here we introduce a machine learning-guided paradigm that can use as few as 24 functionally assayed mutant sequences to build an accurate virtual fitness landscape and screen ten million sequences via in silico directed evolution. As demonstrated in two highly dissimilar proteins, avGFP and TEM-1 β-lactamase, top candidates from a single round are diverse and as active as engineered mutants obtained from previous multi-year, high-throughput efforts. Because it distills information from both global and local sequence landscapes, our model approximates protein function even before receiving experimental data, and generalizes from only single mutations to propose high-functioning epistatically non-trivial designs. With reproducible >500% improvements in activity from a single assay in a 96-well plate, we demonstrate the strongest generalization observed in machine learning-guided protein design to date. Taken together, our approach enables efficient use of resource-intensive high-fidelity assays without sacrificing throughput. By encouraging alignment with endpoint objectives, low-N design will accelerate engineered proteins into the fermenter, field, and clinic.
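The abstract above describes screening sequences by in silico directed evolution on a virtual fitness landscape. A minimal sketch of that idea follows, assuming a Metropolis-style acceptance rule over single-residue substitutions. The scoring function `toy_fitness`, the `temperature` value, and the step count are hypothetical placeholders for illustration, not the authors' trained model or settings.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_fitness(seq: str) -> float:
    """Hypothetical stand-in for a trained fitness model.

    Returns the fraction of hydrophobic residues; a real pipeline would
    call a model fitted to assayed variants instead.
    """
    return sum(aa in "AILMFVW" for aa in seq) / len(seq)


def mutate(seq: str, rng: random.Random) -> str:
    """Propose a single random amino-acid substitution."""
    pos = rng.randrange(len(seq))
    choices = [aa for aa in AMINO_ACIDS if aa != seq[pos]]
    return seq[:pos] + rng.choice(choices) + seq[pos + 1:]


def in_silico_evolution(start, fitness, steps=2000, temperature=0.01, seed=0):
    """Random walk over sequence space guided by a surrogate fitness function.

    Improvements are always accepted; worse proposals are accepted with
    probability exp(delta / temperature) (assumed Metropolis criterion).
    Returns the best sequence seen and its predicted fitness.
    """
    rng = random.Random(seed)
    current, current_fit = start, fitness(start)
    best, best_fit = current, current_fit
    for _ in range(steps):
        proposal = mutate(current, rng)
        proposal_fit = fitness(proposal)
        delta = proposal_fit - current_fit
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_fit = proposal, proposal_fit
            if current_fit > best_fit:
                best, best_fit = current, current_fit
    return best, best_fit


if __name__ == "__main__":
    start = "MSKGEELFTG"  # arbitrary short example sequence
    best, score = in_silico_evolution(start, toy_fitness)
    print(best, round(score, 2))
```

Swapping `toy_fitness` for a model trained on a small set of assayed variants gives the general shape of the workflow: the expensive assay is run only on the top-ranked candidates.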
biorxiv synthetic-biology 0-100-users 2020
Engineering E. coli for magnetic control and the spatial localization of functions, bioRxiv, 2020-01-07
Abstract: The fast-developing field of synthetic biology enables broad applications of programmed microorganisms, including the development of whole-cell biosensors, delivery vehicles for therapeutics, or diagnostic agents. However, the lack of spatial control required for localizing microbial functions could limit their use and induce their dilution, leading to ineffective action or dissemination. To overcome this limitation, the integration of magnetic properties into living systems enables a contactless and orthogonal method for spatiotemporal control. Here, we generated a magnetic-sensing Escherichia coli by driving the formation of iron-rich bodies in bacteria. We found that these bacteria could be spatially controlled by magnetic forces and sustained cell growth and division by asymmetrically transmitting their magnetic properties to one daughter cell. We combined the spatial control of bacteria with genetically encoded adhesion properties to achieve the magnetic capture of specific target bacteria as well as the spatial modulation of human cell invasions.
biorxiv synthetic-biology 0-100-users 2020
A synthetic Calvin cycle enables autotrophic growth in yeast, bioRxiv, 2019-12-03
Abstract: The methylotrophic yeast Pichia pastoris is frequently used for heterologous protein production and it assimilates methanol efficiently via the xylulose-5-phosphate pathway. This pathway is entirely localized in the peroxisomes and has striking similarities to the Calvin-Benson-Bassham (CBB) cycle, which is used by a plethora of organisms like plants to assimilate CO2 and is likewise compartmentalized in chloroplasts. By metabolic engineering the methanol assimilation pathway of P. pastoris was re-wired to a CO2 fixation pathway resembling the CBB cycle. This new yeast strain efficiently assimilates CO2 into biomass and utilizes it as its sole carbon source, which changes the lifestyle from heterotrophic to autotrophic. In total eight genes, including genes encoding for RuBisCO and phosphoribulokinase, were integrated into the genome of P. pastoris, while three endogenous genes were deleted to block methanol assimilation. The enzymes necessary for the synthetic CBB cycle were targeted to the peroxisome. Methanol oxidation, which yields NADH, is employed for energy generation defining the lifestyle as chemoorganoautotrophic. This work demonstrates that the lifestyle of an organism can be changed from chemoorganoheterotrophic to chemoorganoautotrophic by metabolic engineering. The resulting strain can grow exponentially and perform multiple cell doublings on CO2 as sole carbon source with a µmax of 0.008 h⁻¹.
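To put the reported maximum specific growth rate in perspective, the standard exponential-growth relation t_d = ln(2) / µ converts it to a doubling time. The short sketch below applies that relation to the abstract's µmax of 0.008 h⁻¹; the function name is illustrative, and the calculation assumes ideal exponential growth at µmax.

```python
import math


def doubling_time_hours(mu_per_hour: float) -> float:
    """Doubling time for exponential growth: t_d = ln(2) / mu."""
    if mu_per_hour <= 0:
        raise ValueError("growth rate must be positive")
    return math.log(2) / mu_per_hour


mu_max = 0.008  # h^-1, value reported in the abstract
t_d = doubling_time_hours(mu_max)
print(f"{t_d:.1f} h ({t_d / 24:.1f} days)")  # 86.6 h (3.6 days)
```

At this rate, one doubling takes roughly three and a half days.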
biorxiv synthetic-biology 100-200-users 2019
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, bioRxiv, 2019-04-30
Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In biology, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Learning the natural distribution of evolutionary protein sequence variation is a logical step toward predictive and generative modeling for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity from the biochemical to proteomic levels. Unsupervised learning recovers information about protein structure: secondary structure and residue-residue contacts can be identified by linear projections from the learned representations. Training language models on full sequence diversity rather than individual protein families increases recoverable information about secondary structure. The unsupervised models can be adapted with supervision from quantitative mutagenesis data to predict variant activity. Predictions from sequences alone are comparable to results from a state-of-the-art model of mutational effects that uses evolutionary and structurally derived features.
biorxiv synthetic-biology 200-500-users 2019
Rapid, Low-Cost Detection of Water Contaminants Using Regulated In Vitro Transcription, bioRxiv, 2019-04-26
Abstract: Synthetic biology has enabled the development of powerful nucleic acid diagnostic technologies for detecting pathogens and human health biomarkers. Here we expand the reach of synthetic biology-enabled diagnostics by developing a cell-free biosensing platform that uses RNA output sensors activated by ligand induction (ROSALIND) to detect harmful contaminants in aqueous samples. ROSALIND consists of three programmable components: highly processive RNA polymerases, allosteric transcription factors, and synthetic DNA transcription templates. Together, these components allosterically regulate the in vitro transcription of a fluorescence-activating RNA aptamer: in the absence of a target compound, transcription is blocked, while in its presence a fluorescent signal is produced. We demonstrate that ROSALIND can be configured to detect a range of water contaminants, including antibiotics, toxic small molecules, and metals. Our cell-free biosensing platform, which can be freeze-dried for field deployment, creates a new capability for point-of-use monitoring of molecular species to address growing global crises in water quality and human health.
biorxiv synthetic-biology 100-200-users 2019
Unified rational protein engineering with sequence-only deep representation learning, bioRxiv, 2019-03-26
Abstract: Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabelled amino acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily, and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach reaches near state-of-the-art or superior performance predicting stability of natural and de novo designed proteins as well as quantitative function of molecularly diverse mutants. UniRep further enables two orders of magnitude cost savings in a protein engineering task. We conclude UniRep is a versatile protein summary that can be applied across protein engineering informatics.
biorxiv synthetic-biology 100-200-users 2019