GPT-based protein design overcomes decades-old barrier in regenerative medicine

Artificial intelligence may be giving a second life not just to machines, but to human cells themselves. OpenAI, in collaboration with the longevity biotech company Retro Biosciences, has unveiled a custom-built AI model that successfully designed new proteins capable of reprogramming human cells with far greater efficiency than ever before.

The work focuses on the Yamanaka factors, four proteins (OCT4, SOX2, KLF4, and MYC) whose discovery earned Shinya Yamanaka a share of the 2012 Nobel Prize in Physiology or Medicine. Together they can revert adult cells back into pluripotent stem cells, which can develop into almost any type of tissue in the body. These stem cells hold enormous promise for regenerative medicine, offering potential treatments for blindness, diabetes, organ failure, and more.

Yet the process has long been frustratingly inefficient. Until now, the only strategy to improve reprogramming has been to make tiny, targeted mutations to the proteins, which is like looking for a needle in a haystack. Proteins like SOX2 and KLF4 contain hundreds of amino acids, and the possible variations run into the trillions. Years of effort yielded only a handful of slightly improved versions, with fewer than 10% of tested variants typically showing any gain. This inefficiency has kept stem-cell therapies largely out of reach for many real-world medical applications.
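To give a rough sense of that haystack, the short Python sketch below counts how many distinct variants exist when a fixed number of positions in a protein are swapped for a different amino acid. The protein length of 300 is a hypothetical round number chosen for illustration, not the actual length of SOX2 or KLF4, and the function name `count_variants` is our own. The count is simply the number of ways to choose the positions multiplied by the 19 alternative amino acids available at each one.

```python
from math import comb

AMINO_ACIDS = 20  # size of the standard amino acid alphabet


def count_variants(length: int, substitutions: int) -> int:
    """Count sequences that differ from the original at exactly
    `substitutions` positions (point substitutions only)."""
    # Choose which positions change, then pick one of the 19
    # alternative amino acids at each changed position.
    return comb(length, substitutions) * (AMINO_ACIDS - 1) ** substitutions


if __name__ == "__main__":
    length = 300  # hypothetical protein length, for illustration only
    for k in range(1, 6):
        print(f"{k} substitution(s): {count_variants(length, k):.2e} variants")
```

Under these assumptions, the count passes a trillion once four positions are changed at the same time, which suggests why testing candidates one by one in the lab covers only a sliver of the possibilities.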

That may now be changing, thanks to OpenAI and Retro Biosciences' newly built AI model, GPT-4b micro, designed specifically for protein engineering. The model was trained on a dataset composed mostly of protein sequences, along with biological text and tokenized 3D structure data, and can handle extremely long inputs of up to 64,000 tokens. This enriched context allowed the model to generate proteins with specific functions, even in intrinsically disordered regions (IDRs) that are hard to engineer.

When Retro’s scientists tested GPT-4b micro’s protein designs in human fibroblasts, the results exceeded expectations. More than 30% of the AI-designed SOX2 variants (“RetroSOX”) outperformed the natural protein, despite differing by over 100 amino acids on average. Traditional methods typically yield improvements in fewer than 10% of cases, and often only from tiny, incremental changes.

Encouraged by this success, the team next targeted KLF4, another Yamanaka factor considered “stubborn” because earlier attempts to improve it had failed. Using the same AI-guided approach, they generated and tested a library of KLF4 variants. Nearly half of them outperformed the best versions found in past human-led experiments. To put this in perspective: in previous large-scale mutagenesis studies, only 1 in 19 KLF4 variants showed even marginal improvement.

Instead of being limited to random trial-and-error, researchers can now rapidly explore huge regions of “protein space” that were previously inaccessible. If further validated, these optimized reprogramming factors could significantly boost the efficiency of generating induced pluripotent stem cells (iPSCs). That would speed up research on regenerative medicine, drug discovery, and potentially future treatments for aging-related decline.

The work shows how artificial intelligence is beginning to accelerate the long quest to make cellular reprogramming practical, and how it has the potential to reshape the future of medicine.

The content of this article is drawn from OpenAI’s original publication; for full technical and experimental details, please refer to the original source.

Trained on sequences, structure data, and biological context, OpenAI’s GPT-4b micro brings protein engineering into the age of AI
