Site-wise mutation effects enable combinatorial protein variant design
Tuesday June 6th, 4-5 pm EST | David Ding — Postdoctoral Scholar, UC Berkeley
Abstract: Design and natural evolution of proteins are profoundly impacted by the amount and frequency of epistasis, i.e., how substitutions affect each other. Recent efforts to develop large deep learning models, with hundreds of millions of parameters trained on structural and sequence databases, have made progress in predicting protein variant effects by exploiting the ability of such models to fit increasingly complicated protein fitness landscapes with specific interactions between substitutions. However, for most proteins, it remains unclear how important biological epistasis is for the prediction and design of combinatorial variants. Here, we systematically examined 8 combinatorial protein variant effect datasets for the complexity of their fitness landscapes. We start by measuring the effect of ~10,000 combinatorial variants at ten binding residues in the antitoxin ParD3 on its in vivo ability to neutralize its cognate toxin, ParE3. Using this and two additional datasets for this protein, we show that a simple logistic regression model considering only site-wise amino acid preferences, without interactions between residues, can explain much and, in some datasets, virtually all of the combinatorial mutation effects (R²: 83-98%). As a result, this minimal model with ~60-200 parameters can be trained on a small number (~200-1000) of observations to predict unobserved combinatorial variant effects well (Pearson r ~0.80-0.98). We find that such site-wise preferences are affected by mutations at neighboring residues, leading us to develop an unsupervised strategy, which we call ALMS (‘assessment of local mutations from structure’), for the design of functional and diverse sequences using structural microenvironment information in data-poor regimes. ALMS outperforms not just random library sampling approaches but also high-capacity neural networks that can model specific dependencies between residues.
Finally, we observe that these results generalize to combinatorial variant effects measured in 7 other proteins (R² ~78-95%). These results demonstrate that simple, site-wise supervised and unsupervised approaches enable the design of combinatorial protein variants, including therapeutically relevant ones.
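For readers unfamiliar with the kind of model the abstract describes, here is a minimal sketch of a site-wise logistic regression: one weight per (site, amino acid) pair, no interaction terms, passed through a sigmoid. It is an illustration only, not the speaker's implementation. The data are synthetic, and the choice of 3 sites (giving 60 weights), the training-set size, the optimizer settings, and all function names are assumptions made for this example.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
N_SITES = 3                        # assumption: 3 mutated sites -> 60 site-wise weights

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def features(seq):
    """Indices of the active one-hot features: one (site, amino acid) pair per position."""
    return [pos * len(ALPHABET) + ALPHABET.index(aa) for pos, aa in enumerate(seq)]

def predict(weights, bias, seq):
    """Site-wise model: sum the per-site preferences, then apply a sigmoid."""
    return sigmoid(bias + sum(weights[i] for i in features(seq)))

def fit(seqs, ys, epochs=100, lr=0.1, l2=1e-3):
    """Stochastic gradient descent on cross-entropy loss with a small L2 penalty."""
    weights = [0.0] * (N_SITES * len(ALPHABET))
    bias = 0.0
    for _ in range(epochs):
        for seq, y in zip(seqs, ys):
            err = predict(weights, bias, seq) - y  # d(loss)/d(logit)
            bias -= lr * err
            for i in features(seq):
                weights[i] -= lr * (err + l2 * weights[i])
    return weights, bias

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic ground truth (assumption): a purely additive landscape behind a sigmoid.
random.seed(0)
true_w = [random.gauss(0.0, 1.5) for _ in range(N_SITES * len(ALPHABET))]

def random_variant():
    return "".join(random.choice(ALPHABET) for _ in range(N_SITES))

def true_fitness(seq):
    return sigmoid(sum(true_w[i] for i in features(seq)))

train = [random_variant() for _ in range(500)]
test = [random_variant() for _ in range(200)]
weights, bias = fit(train, [true_fitness(s) for s in train])

# Correlation between predicted and true fitness on held-out variants.
r = pearson([predict(weights, bias, s) for s in test],
            [true_fitness(s) for s in test])
print(f"held-out Pearson r = {r:.3f}")
```

Because the synthetic landscape here is additive by construction, the sketch only shows how few parameters and observations such a model needs when epistasis is absent; it says nothing about how well the approach works on real measurements, which is the question the talk addresses.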
Preprint: https://www.biorxiv.org/content/10.1101/2022.10.31.514613v1
Google Scholar: https://scholar.google.com/citations?user=5aiD42AAAAAJ&hl=en
Recording link: https://youtu.be/OaOj3znHPn0