ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning
Tuesday, November 26th, 7-8pm EST | Jin Su (Westlake University)
Abstract: ProTrek fuses protein sequence, structure, and natural language function (SSF) into a tri-modal language model. Through contrastive learning, ProTrek bridges the gap between protein data and human understanding, enabling fast searches across all nine pairwise SSF modality combinations. Trained on vastly larger datasets, ProTrek demonstrates large performance gains: (1) improving protein sequence-function interconversion by 30-60-fold; (2) surpassing current alignment tools (i.e., Foldseek and MMseqs2) in both speed (100-fold acceleration) and accuracy, identifying functionally similar proteins with diverse structures; and (3) outperforming ESM-2 in 9 of 11 downstream prediction tasks. These results suggest that ProTrek will become a core tool for protein searching, understanding, and analysis.
Preprint: https://www.biorxiv.org/content/10.1101/2024.05.30.596740v2
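
For readers new to the idea, the sketch below illustrates in general terms what the abstract means by contrastive learning over three modalities and by nine pairwise search combinations. It is not ProTrek's code. The function names, the toy 2-D embeddings, and the temperature of 0.07 are assumptions chosen for illustration, and the loss shown is a generic InfoNCE objective rather than the exact objective described in the preprint.

```python
import math
from itertools import product

# The three modalities named in the abstract.
MODALITIES = ("sequence", "structure", "function")


def search_directions():
    """All (query modality, target modality) pairs, same-modality pairs included."""
    return list(product(MODALITIES, repeat=2))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss; queries[i] and keys[i] are treated as the positive pair.

    The temperature value is an illustrative assumption, not taken from the preprint.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def top_k(query, database, k=1):
    """Names of the k database entries (name -> embedding) most similar to query."""
    ranked = sorted(database, key=lambda name: cosine(query, database[name]), reverse=True)
    return ranked[:k]
```

With embeddings from any two modalities placed in the same space, `info_nce` is small when matching pairs point in the same direction and large when they do not, and `top_k` is the corresponding retrieval step. `search_directions()` simply enumerates the 3 x 3 = 9 query/target combinations.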