Low-N protein engineering with data-efficient deep learning, bioRxiv, 2020-01-24

Abstract: Protein engineering has enormous academic and industrial potential. However, it is limited by the lack of experimental assays that are consistent with the design goal and sufficiently high-throughput to find rare, enhanced variants. Here we introduce a machine learning-guided paradigm that can use as few as 24 functionally assayed mutant sequences to build an accurate virtual fitness landscape and screen ten million sequences via in silico directed evolution. As demonstrated in two highly dissimilar proteins, avGFP and TEM-1 β-lactamase, top candidates from a single round are diverse and as active as engineered mutants obtained from previous multi-year, high-throughput efforts. Because it distills information from both global and local sequence landscapes, our model approximates protein function even before receiving experimental data, and generalizes from only single mutations to propose high-functioning epistatically non-trivial designs. With reproducible >500% improvements in activity from a single assay in a 96-well plate, we demonstrate the strongest generalization observed in machine-learning guided protein design to date. Taken together, our approach enables efficient use of resource-intensive high-fidelity assays without sacrificing throughput. By encouraging alignment with endpoint objectives, low-N design will accelerate engineered proteins into the fermenter, field, and clinic.
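The "in silico directed evolution" step the abstract describes can be sketched as a greedy search over a learned fitness landscape: propose single mutations, score them with a surrogate model, and keep improvements. The sketch below is illustrative only; `surrogate_fitness` is a hypothetical stand-in for the paper's trained model, not the authors' actual method.

```python
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids

def surrogate_fitness(seq):
    # Hypothetical placeholder for a learned fitness model; here, a
    # toy score rewarding matches to an arbitrary target sequence.
    target = "MKV"
    return sum(a == b for a, b in zip(seq, target))

def in_silico_evolution(seq, n_steps=500, seed=0):
    """Greedy hill climbing: propose random single mutations,
    keep any that improve the surrogate fitness."""
    rng = random.Random(seed)
    best, best_fit = seq, surrogate_fitness(seq)
    for _ in range(n_steps):
        pos = rng.randrange(len(best))
        aa = rng.choice(AAS)
        cand = best[:pos] + aa + best[pos + 1:]
        fit = surrogate_fitness(cand)
        if fit > best_fit:
            best, best_fit = cand, fit
    return best, best_fit
```

In practice the search would be seeded with the wild-type sequence and scored by the low-N model fit to the ~24 assayed variants; the greedy loop here is only one of several possible search strategies.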

biorxiv synthetic-biology 0-100-users 2020

A synthetic Calvin cycle enables autotrophic growth in yeast, bioRxiv, 2019-12-03

Abstract: The methylotrophic yeast Pichia pastoris is frequently used for heterologous protein production, and it assimilates methanol efficiently via the xylulose-5-phosphate pathway. This pathway is entirely localized in the peroxisomes and has striking similarities to the Calvin-Benson-Bassham (CBB) cycle, which is used by a plethora of organisms, such as plants, to assimilate CO2 and is likewise compartmentalized in chloroplasts. By metabolic engineering, the methanol assimilation pathway of P. pastoris was re-wired to a CO2 fixation pathway resembling the CBB cycle. This new yeast strain efficiently assimilates CO2 into biomass and utilizes it as its sole carbon source, which changes the lifestyle from heterotrophic to autotrophic. In total, eight genes, including genes encoding RuBisCO and phosphoribulokinase, were integrated into the genome of P. pastoris, while three endogenous genes were deleted to block methanol assimilation. The enzymes necessary for the synthetic CBB cycle were targeted to the peroxisome. Methanol oxidation, which yields NADH, is employed for energy generation, defining the lifestyle as chemoorganoautotrophic. This work demonstrates that the lifestyle of an organism can be changed from chemoorganoheterotrophic to chemoorganoautotrophic by metabolic engineering. The resulting strain can grow exponentially and perform multiple cell doublings on CO2 as sole carbon source with a µmax of 0.008 h−1.
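The reported µmax of 0.008 h−1 translates into a doubling time via the standard exponential-growth relation t_d = ln(2)/µmax, a quick calculation worth making explicit:

```python
import math

mu_max = 0.008  # per hour, the reported maximum specific growth rate
doubling_time_h = math.log(2) / mu_max  # hours per cell doubling
doubling_time_d = doubling_time_h / 24  # in days
# ~86.6 hours, i.e. roughly 3.6 days per doubling
```

That is slow compared to heterotrophic growth of P. pastoris on glycerol or methanol, consistent with the strain being a first-generation synthetic autotroph.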

biorxiv synthetic-biology 100-200-users 2019

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, bioRxiv, 2019-04-30

Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In biology, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Learning the natural distribution of evolutionary protein sequence variation is a logical step toward predictive and generative modeling for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity, from the biochemical to the proteomic level. Unsupervised learning recovers information about protein structure: secondary structure and residue-residue contacts can be identified by linear projections from the learned representations. Training language models on full sequence diversity rather than individual protein families increases recoverable information about secondary structure. The unsupervised models can be adapted with supervision from quantitative mutagenesis data to predict variant activity. Predictions from sequences alone are comparable to results from a state-of-the-art model of mutational effects that uses evolutionary and structurally derived features.
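The "linear projections from the learned representations" claim amounts to fitting a simple linear probe on frozen per-residue embeddings. A minimal sketch, assuming hypothetical embeddings (random stand-ins here, since the pretrained model itself is not part of this listing):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-residue embeddings from a pretrained protein
# language model: n residues, each a d-dimensional vector.
n, d, n_classes = 200, 32, 3  # 3 secondary-structure classes: H, E, C
X = rng.normal(size=(n, d))
y = rng.integers(0, n_classes, size=n)  # stand-in structure labels

# Linear probe: ridge-regularized least squares onto one-hot targets,
# solved in closed form. The probe is linear, so any signal it finds
# must already be present in the frozen representations.
Y = np.eye(n_classes)[y]
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

pred = (X @ W).argmax(axis=1)
train_acc = (pred == y).mean()
```

With real embeddings and labels, probe accuracy well above chance is the evidence that structural information is linearly encoded; with the random stand-ins above, accuracy hovers near chance, which is the appropriate null.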

biorxiv synthetic-biology 200-500-users 2019

Created with the audiences framework by Jedidiah Carlson

Powered by Hugo