ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning

Tuesday, November 26th, 7-8pm EST | Jin Su (Westlake University)

Abstract: ProTrek redefines protein exploration by seamlessly fusing sequence, structure, and natural language function (SSF) into an advanced tri-modal language model. Through contrastive learning, ProTrek bridges the gap between protein data and human understanding, enabling lightning-fast searches across nine SSF pairwise modality combinations. Trained on vastly larger datasets, ProTrek demonstrates quantum leaps in performance: (1) Elevating protein sequence-function interconversion by 30-60 fold; (2) Surpassing current alignment tools (i.e., Foldseek and MMseqs2) in both speed (100-fold acceleration) and accuracy, identifying functionally similar proteins with diverse structures; and (3) Outperforming ESM-2 in 9 of 11 downstream prediction tasks, setting new benchmarks in protein intelligence. These results suggest that ProTrek will become a core tool for protein searching, understanding, and analysis.

Preprint: https://www.biorxiv.org/content/10.1101/2024.05.30.596740v2

Jin Su is a third-year Ph.D. candidate supervised by Prof. Fajie Yuan at Westlake University. He received his B.S. degree from Huazhong University of Science and Technology in 2022, where he worked with Prof. Wei Wei on adversarial attack and defense for NLP models. His current research interests lie in AI for proteins, primarily focusing on protein representation learning, multimodality, and pre-training.