
Protein language models redefine protein design and optimisation

Open-source tools allow researchers to explore and innovate further

07-Jan-2025

Protein engineering is undergoing a revolution, thanks to advancements in machine learning. Large language models (LLMs), which have already transformed fields like natural language processing, are now making significant strides in protein science.

In their publication in Cell Systems, Jacob D. Galson and colleagues from Alchemab explore the transformative capabilities of two new protein language models (PLMs): ProGen2 and IgLM. These models showcase how AI-driven approaches are reshaping the field of protein design, opening new avenues for research and applications.

The Concept of Proteins as a "Language"

Proteins, made up of amino acid sequences, can be viewed as a structured "language" where sequence determines function. This analogy has fueled the development of PLMs, which use machine learning techniques to decode, analyse, and even create protein sequences.

At the core of this innovation is the transformer architecture, a type of neural network that uses self-attention mechanisms to understand the relationships between amino acids in a sequence.
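The self-attention idea can be sketched in a few lines. The following is a deliberately minimal, single-head version in plain Python (no learned query/key/value projections, toy two-dimensional residue embeddings), intended only to illustrate the mechanism, not how ProGen2 is actually implemented:

```python
import math

def softmax(xs):
    """Convert raw similarity scores into positive weights summing to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Minimal scaled dot-product self-attention (single head, with the
    learned query/key/value projections omitted for brevity).

    Each residue's output is a weighted mix of every residue's embedding,
    with weights derived from pairwise similarity. This is how a
    transformer can relate amino acids anywhere in the sequence, near or far."""
    d = len(embeddings[0])
    outputs = []
    for q in embeddings:
        # Similarity of this residue to every residue, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        weights = softmax(scores)
        # Weighted average of all embeddings, one output per residue.
        outputs.append([sum(w * v[i] for w, v in zip(weights, embeddings))
                        for i in range(d)])
    return outputs

# Three toy residue embeddings (dimension 2):
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(emb)
```

Because the weights come from the whole sequence at once, a residue's representation can depend on distant positions, which is exactly what makes transformers well suited to capturing long-range structural constraints in proteins.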

PLMs are trained on vast databases of protein sequences, with millions of examples enabling the models to learn underlying patterns. This process, known as pre-training, establishes a foundation of knowledge that can be applied to specific tasks like designing new proteins or predicting their properties.
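The pre-training signal can be illustrated with a toy model. The sketch below estimates next-residue probabilities from a tiny, made-up corpus; real PLMs use transformers and millions of sequences, but the objective, predicting the next token from what came before, is the same idea:

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def train_bigram_lm(sequences):
    """Toy 'protein language model': estimate P(next residue | current
    residue) by counting adjacent pairs in a corpus. Illustrative only."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        probs[a] = {b: c / total for b, c in nxt.items()}
    return probs

# Made-up fragments standing in for a real sequence database:
corpus = ["MKTAYIAKQR", "MKTLLLTLVV", "MKVLAA"]
model = train_bigram_lm(corpus)
print(model["M"]["K"])  # 1.0 -- every sequence here starts M -> K
```

Once such conditional probabilities are learned, the same model can be run in reverse to generate sequences, which is the basis of generative design described below.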

ProGen2: Scaling the Landscape of Protein Design

ProGen2 is a family of advanced PLMs, ranging in size from 151 million to 6.4 billion parameters. These models, especially the larger variants, represent a significant leap in computational protein design. ProGen2’s training on extensive databases like UniRef and BFD allows it to predict and generate protein sequences with high structural integrity and functional relevance.

Key Features of ProGen2:

  • Generative Protein Design: ProGen2 creates full-length protein sequences that maintain structural similarities to known proteins while exploring new sequence variations.
  • Customisable Outputs: Fine-tuning enables the generation of proteins tailored to specific families or structures, which is particularly useful in fields like antibody design.
  • Scalability: Larger models demonstrate enhanced accuracy and predictive capabilities, reducing sequence perplexity and improving the model's understanding of protein "language."
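Perplexity, the metric mentioned above, has a simple definition: the exponential of the average negative log-probability the model assigns to each residue. The sketch below computes it from hand-picked per-residue probabilities (the numbers are illustrative, not ProGen2 outputs):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability per residue.
    Lower perplexity means the model finds the sequence less 'surprising',
    i.e. it has learned the protein 'language' better."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A confident model (high per-residue probabilities) scores lower:
confident = [0.9, 0.8, 0.85, 0.9]
uncertain = [0.3, 0.2, 0.25, 0.3]
print(perplexity(confident) < perplexity(uncertain))  # True
```

This is why falling perplexity with growing model size is taken as evidence that the larger ProGen2 variants capture more of the statistical structure of natural proteins.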

One standout application of ProGen2 is in antibody design. By training on antibody-specific databases, ProGen2 can generate specialised sequences, including those optimised for binding properties, stability, or therapeutic use. Experimental validation, such as structural determination via X-ray crystallography, can further confirm the functionality of these designed proteins.

IgLM: Redefining Antibody Engineering

While ProGen2 excels in generating novel proteins, IgLM shines in modifying and optimising existing ones. Using a unique span prediction method, IgLM can precisely alter key regions of antibodies, such as Complementarity-Determining Regions (CDRs), which play a crucial role in antigen binding.

Notable Applications of IgLM:

  • Antibody Optimisation: IgLM can redesign antibody regions to improve properties like solubility and stability without compromising functionality.
  • Tag-Based Design: The model allows researchers to input specific tags (e.g., [HEAVY] or [MOUSE]) to guide the generation of antibody sequences with desired characteristics, such as species specificity or chain type.
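The tag-plus-mask workflow can be sketched as follows. This is a hypothetical illustration of the prompt format, not the actual IgLM API: the helper function, the toy heavy-chain fragment, and the [MASK]/[SEP] token names are assumptions, while tags like [HEAVY] and [MOUSE] come from the article:

```python
# Hypothetical sketch of IgLM-style span infilling (illustrative only,
# not the real IgLM interface). The model conditions on tags such as
# [HEAVY] and [MOUSE] plus a sequence with a masked span, then generates
# residues to fill that span.

def build_infill_prompt(sequence, span_start, span_end, chain_tag, species_tag):
    """Replace a chosen span (e.g. a CDR) with a mask token and prepend
    conditioning tags for chain type and species."""
    masked = sequence[:span_start] + "[MASK]" + sequence[span_end:]
    return f"{chain_tag}{species_tag}{masked}[SEP]"

seq = "EVQLVESGGGLVQPGG"  # toy heavy-chain fragment, not a real antibody
prompt = build_infill_prompt(seq, 4, 8, "[HEAVY]", "[MOUSE]")
print(prompt)  # [HEAVY][MOUSE]EVQL[MASK]GGLVQPGG[SEP]
```

The appeal of this design is that everything outside the masked span is held fixed, so a CDR can be redesigned while the rest of the antibody framework is left untouched.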

These capabilities highlight the potential of PLMs to refine and innovate in biopharmaceutical development, addressing challenges in drug design and therapeutic engineering.

Transforming Protein Engineering with AI

The work by Alchemab's researchers illustrates how PLMs like ProGen2 and IgLM are not just tools for generating sequences but also powerful predictors of protein properties. For instance, ProGen2 can evaluate protein "fitness"—a broad term that includes factors like thermostability, binding affinity, and functional efficiency—without requiring additional training. This zero-shot learning capability demonstrates the adaptability of PLMs to various tasks in protein science.
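Zero-shot fitness prediction boils down to scoring sequences by the likelihood the model assigns them. The sketch below does this with made-up transition probabilities standing in for a trained PLM (all numbers and sequences here are hypothetical):

```python
import math

def sequence_log_likelihood(seq, probs):
    """Score a sequence under a toy residue-transition model. Zero-shot
    fitness prediction with a PLM works analogously: variants the model
    assigns higher likelihood look more 'natural' to it, which tends to
    correlate with properties like stability -- no extra training needed."""
    ll = 0.0
    for a, b in zip(seq, seq[1:]):
        ll += math.log(probs.get(a, {}).get(b, 1e-6))  # floor unseen pairs
    return ll

# Hypothetical transition probabilities standing in for a trained model:
probs = {"M": {"K": 0.9, "A": 0.1}, "K": {"T": 0.8, "V": 0.2}}
wild_type = "MKT"
variant = "MAT"  # A after M is rarer under this toy model
print(sequence_log_likelihood(wild_type, probs) >
      sequence_log_likelihood(variant, probs))  # True
```

Ranking a library of candidate variants by such scores, before any wet-lab experiment, is what makes this zero-shot capability attractive for protein engineering.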

Moreover, the open-source availability of these models ensures that the broader scientific community can build on these innovations. Researchers can combine PLMs with other modalities, such as structural modelling or experimental validation techniques, to push the boundaries of protein design even further.

Future of Protein Language Models

The introduction of ProGen2 and IgLM marks the beginning of a new era in protein engineering. These models offer unprecedented tools for designing, predicting, and optimising proteins at a scale and accuracy that were previously unattainable. As PLMs continue to evolve, they promise to accelerate discoveries in fields ranging from biotechnology to medicine.

The authors, representing Alchemab Therapeutics, emphasise the untapped potential of PLMs. By enabling researchers to explore protein sequence space with precision and creativity, these models pave the way for innovations that could transform drug development, synthetic biology, and beyond.

For an in-depth understanding, refer to the original publication in Cell Systems.

Mentioned in this article:


Alchemab Therapeutics

Biotechnology company

Cell Systems

Scientific journal covering research in systems biology.

Jacob D. Galson

Vice President of Technology at Alchemab
