
An Artificial Intelligence Model for Translating Natural Language into Functional de Novo Proteins
Abstract
Traditional protein design is constrained by known sequences and folds. We introduce an alternative: designing proteins directly from plain-language specifications. We trained MP4, a transformer-based model that maps natural-language prompts to protein sequences, on a dataset of 3.2 billion points and 138k tokens. On a benchmark of 96 prompts spanning a wide range of functions and contexts, MP4 improved three metrics simultaneously: sequence realism, predicted fold quality, and alignment to the requested function. It achieved this using only text as input, a major departure from other models. Experimental validation supported the computational predictions: two de novo designs were expressible and thermostable, and high-resolution crystallography (1.30 Å and 1.77 Å) revealed that one possesses a novel fold. The designs were also functionally active, showing both ATP binding and hydrolysis in vitro. This work demonstrates that natural-language intent can be realized as functional proteins that express, crystallize, and catalyze. Although the approach is still in early development, with incomplete coverage and controllability, MP4 lowers the barrier to protein design and expands the space for creative exploration in molecular programming.