A Deep Unsupervised Language Model for Protein Design

Tuesday May 3rd, 4-5pm EDT | Noelia Ferruz, University of Bayreuth

Protein design aims to build new proteins from scratch and thereby holds the potential to tackle many environmental and biomedical problems. Recent progress in natural language processing (NLP) has enabled ever-larger language models capable of understanding and generating text with human-like fluency. Given the many similarities between human languages and protein sequences, NLP models lend themselves to predictive tasks in protein research. Motivated by the success of generative Transformer-based language models such as the GPT-x series, we developed ProtGPT2, a language model trained on protein space that generates de novo protein sequences following the principles of natural ones. In particular, the generated proteins display amino acid propensities that resemble natural proteins. Disorder and secondary-structure prediction indicate that 88% of ProtGPT2-generated proteins are globular, in line with natural sequences. Sensitive sequence searches in protein databases show that ProtGPT2 sequences are distantly related to natural ones, and similarity networks further demonstrate that ProtGPT2 samples unexplored regions of protein space. AlphaFold prediction of the sequences yields well-folded, non-idealized structures with embodiments and large loops, and reveals new topologies not captured in current structure databases. ProtGPT2 can generate de novo proteins in a high-throughput fashion in a matter of seconds. The model is easy to use and available to the community.
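For those who want to try the model before the talk, ProtGPT2 is distributed through the Hugging Face Hub under the identifier nferruz/ProtGPT2. Below is a minimal generation sketch using the transformers pipeline API; the sampling parameters shown (top-k of 950, repetition penalty of 1.2, end-of-sequence token id 0) follow the values suggested in the public model card, but treat them as illustrative assumptions rather than prescriptive settings.

```python
# Minimal sketch: sampling de novo protein sequences from ProtGPT2
# via the Hugging Face `transformers` text-generation pipeline.
# Model identifier and sampling parameters are assumptions taken
# from the public model card; adjust as needed.
from transformers import pipeline

# Downloads the model weights from the Hub on first use.
protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")

# "<|endoftext|>" serves as the start-of-sequence prompt; sampling with
# a broad top-k keeps generation close to the learned distribution.
sequences = protgpt2(
    "<|endoftext|>",
    max_length=100,          # length in BPE tokens, each covering several residues
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)

for s in sequences:
    # Generated text is a newline-wrapped amino acid sequence; join it
    # into a single line (strip any leftover prompt token if present).
    print(s["generated_text"].replace("\n", ""))
```

Sampling a handful of sequences of this length takes on the order of seconds on a GPU, which is consistent with the high-throughput generation described in the abstract.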

Preprint: https://www.biorxiv.org/content/10.1101/2022.03.09.483666v1

Recording link: https://www.youtube.com/watch?v=BA5C0kLcErM&t=1s