ESM3: Simulating 500 million years of evolution with a language model

Tuesday, July 23rd, 4-5pm ET | Roshan Rao, PhD (EvolutionaryScale)

Abstract: More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained on tokens generated by evolution can act as evolutionary simulators to generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to biological alignment. We have prompted ESM3 to generate fluorescent proteins with a chain of thought. Among the generations that we synthesized, we found a bright fluorescent protein at far distance (58% identity) from known fluorescent proteins. Similarly distant natural fluorescent proteins are separated by over five hundred million years of evolution.

Preprint: https://www.biorxiv.org/content/10.1101/2024.07.01.600583v1

Roshan Rao is a Research Scientist and Co-Founder at EvolutionaryScale. His work focuses on the development and understanding of foundation models for proteins. Previously, he worked at Meta AI and completed his PhD at UC Berkeley, where he developed methods for protein structure prediction, function prediction, and design.