Learning Protein Fitness Models from Evolutionary and Experimental Data

Tuesday, March 1st, 4-5pm EST | Chloe Hsu, University of California, Berkeley, Computer Science

There are several approaches to predicting functional properties of a protein from its amino acid sequence. Existing machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. To reduce the amount of labeled data a model needs to make reliable functional predictions for a protein, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with a probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluation and sufficient baselines. In addition to evolutionary and assay-labeled data, we also demonstrate that our combination approach can be extended to include protein structure information to further improve fitness prediction.
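
As a rough illustration of the combination idea in the abstract, the sketch below fits a ridge regression on one-hot (site-specific) amino acid features concatenated with a single density feature per sequence. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy sequences, the density scores (stand-ins for log-likelihoods from some evolutionary model), and the fitness labels are all hypothetical, and applying the same penalty to every coefficient is a simplification made here.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Site-specific features: one indicator per (position, amino acid)."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()


def featurize(seqs, density_scores):
    """Concatenate one-hot features with one density feature per sequence."""
    onehot = np.stack([one_hot(s) for s in seqs])
    density = np.asarray(density_scores, dtype=float).reshape(-1, 1)
    return np.hstack([onehot, density])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge on mean-centered labels: w = (X^T X + alpha I)^-1 X^T y."""
    y = np.asarray(y, dtype=float)
    y_mean = y.mean()
    A = X.T @ X + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(A, X.T @ (y - y_mean))
    return w, y_mean


def predict(X, w, y_mean):
    """Predicted fitness for each featurized sequence."""
    return X @ w + y_mean


# Hypothetical toy data: equal-length variants, made-up density scores
# (assumed to come from an evolutionary model) and made-up assay labels.
train_seqs = ["ACDE", "ACDF", "GCDE", "ACKE"]
train_density = [-1.0, -2.5, -3.0, -1.8]
train_fitness = [1.0, 0.4, 0.1, 0.7]

X_train = featurize(train_seqs, train_density)
w, y_mean = fit_ridge(X_train, train_fitness, alpha=1.0)

# Score an unseen variant using its own (made-up) density score.
pred = predict(featurize(["GCKE"], [-3.2]), w, y_mean)
print(pred)
```

In this sketch the density feature is just one extra column, so any evolutionary density model that yields a per-sequence score could supply it. All sequences are assumed to have the same length (for example, variants of a single wild-type protein).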

Preprint: https://doi.org/10.1101/2021.03.28.437402
Published work: https://www.nature.com/articles/s41587-021-01146-5

Recording Link: https://youtu.be/UfeXApKTufQ