In the life sciences, enzymes are the "molecular engines" that drive nearly every biochemical process. Discovering a new enzyme is far more than adding another functional annotation to a database. It extends our understanding of the fundamentals of life, drives the industrial transformation of biotechnology, and carries some of humanity's hopes for meeting global challenges. From the "miracle of survival" in extreme environments, to the "core engine" of green industrial manufacturing, to the "precision tool" of gene editing, each new enzyme may open a new era of applications. In this article, we look at the strategic value of new enzyme discovery and systematically analyze five core strategies, driven by AI and big data, that are reshaping the future of biocatalysis.

Five core strategies: from “random screening” to “precise mining”

Traditional enzyme discovery relies on "random screening", which is inefficient. Today, "precision mining" based on big data, AI and evolutionary theory has become mainstream. Here’s an in-depth look at the five core strategies:

1. Sequence-based discovery of new enzymes

Principle: "Homology determines function." Key catalytic sequences are highly conserved during evolution, so new enzymes can be located by searching for sequence similarity to known ones.

Method: run BLAST for database-wide similarity searches, or use HMMER to build a profile hidden Markov model (HMM) that captures the conserved motifs of a specific enzyme family.

Case study: Feng Zhang's team (Broad Institute) discovered a variety of new CRISPR-Cas systems (such as Cas12) through sequence comparison and HMM mining of massive bacterial genome collections, greatly enriching the gene-editing toolbox.

Paper: Discovery and functional characterization of diverse Cas9 effector proteins (DOI: 10.1126/science.aad5227)
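To make the profile idea concrete, here is a minimal sketch in plain Python. It is not how HMMER works internally: it swaps the profile HMM for a much simpler gap-free position-specific scoring matrix, and the function names (`build_profile`, `best_hit`), the example motif instances, and the uniform background are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_motifs, pseudocount=1.0):
    """Build per-column log-odds scores against a uniform background.

    aligned_motifs: equal-length, gap-free instances of a conserved motif.
    This is a simplified stand-in for a profile HMM (no gaps, no transitions).
    """
    length = len(aligned_motifs[0])
    if any(len(m) != length for m in aligned_motifs):
        raise ValueError("all motif instances must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_motifs) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(m[col] for m in aligned_motifs)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_hit(profile, sequence, unknown_penalty=-4.0):
    """Slide the profile along the sequence; return (best_score, start)."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(aa, unknown_penalty)
                    for col, aa in zip(profile, window))
        best = max(best, (score, start))
    return best


# Hypothetical instances of a G-x-S-x-G style motif, for illustration only.
motifs = ["GHSMG", "GDSAG", "GHSQG", "GYSLG"]
profile = build_profile(motifs)
score, start = best_hit(profile, "MKTLLAGHSMGGALAT")
print(f"best window starts at {start} with score {score:.2f}")
```

In a real campaign the scoring would be done by HMMER or BLAST against a genome database; the sketch only shows why conserved columns dominate the score while variable columns contribute little.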

2. Methods based on structural similarity and clustering

Principle: "Structure is more conserved than sequence." Even when sequence similarity falls below 20%, the three-dimensional fold may be preserved, which allows the discovery of distantly homologous enzymes.

Method: predict protein structures at scale with AlphaFold2 or ESMFold, then run high-throughput structural alignment and clustering with Foldseek or Dali.

Case study: Martin Steinegger's team (Seoul National University) clustered hundreds of millions of predicted protein structures to identify thousands of novel enzyme families in "structural space" that could not be discovered by traditional sequence comparisons.

Paper: Clustering predicted structures at the scale of the known protein universe (DOI: 10.1038/s41586-023-06510-w)
Fig. 1: In silico structure-based PETase discovery pipeline and clustering results.


Figure 1. Discovery pipeline for new PETase enzymes based on structural similarity
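The clustering step can be sketched as follows, assuming pairwise structural similarity scores (for example TM-scores) have already been computed by an alignment tool. This is a deliberately simple single-linkage grouping with union-find, not the algorithm Foldseek or Dali actually uses; the function name, the tuple input format, and the 0.5 threshold are illustrative assumptions.

```python
def cluster_by_similarity(pairs, threshold=0.5):
    """Single-linkage clustering of structures from pairwise scores.

    pairs: iterable of (id_a, id_b, score) tuples, e.g. TM-scores.
    Two structures end up in the same cluster if a chain of pairs with
    score >= threshold connects them.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical scores for five predicted structures.
pairs = [("A", "B", 0.8), ("B", "C", 0.6), ("C", "D", 0.3), ("E", "D", 0.9)]
print(cluster_by_similarity(pairs))
```

Single linkage is easy to reason about but can chain loosely related structures together, which is one reason production tools use more careful clustering criteria.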

3. Methods based on pocket similarity

Principle: "The local active center determines the nature of catalysis." This approach ignores the overall fold and focuses only on the geometry, electrostatic potential, and hydrophobicity of the active-site pocket.

Method: identify binding sites with P2Rank or DeepPocket, then compare the physicochemical characteristics of the pockets with PocketAlign or similar algorithms.

Case study: Luhua Lai's team (Peking University) used active-pocket similarity searches to identify new metabolic enzymes that can convert cholesterol within the complex human gut microbiota.

Paper: Computational discovery of cholesterol-lowering bacteria from the human gut microbiota (DOI: 10.1016/j.chom.2022.09.007)

Figure 2. Schematic of an enzyme acting on its substrate. Pocket similarity is a strong hypothesis for new enzyme discovery.
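As a rough illustration of pocket comparison, the sketch below ranks candidate pockets by their distance to a query pocket in a standardized descriptor space. The three descriptors (volume, hydrophobic fraction, net charge), the numbers, and the function names are hypothetical; real tools such as PocketAlign compare pockets in far richer ways, and this is not their output format.

```python
import math
from statistics import mean, pstdev


def standardize(pockets):
    """Z-score each descriptor across all pockets so no feature dominates."""
    names = list(pockets)
    columns = list(zip(*(pockets[n] for n in names)))
    scaled_columns = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        scaled_columns.append([(v - mu) / sigma if sigma else 0.0 for v in col])
    return {n: tuple(col[i] for col in scaled_columns)
            for i, n in enumerate(names)}


def rank_by_pocket_similarity(query, pockets):
    """Return candidate names ordered from most to least similar to query."""
    scaled = standardize(pockets)
    reference = scaled[query]
    distances = [(math.dist(reference, vec), name)
                 for name, vec in scaled.items() if name != query]
    return [name for _, name in sorted(distances)]


# Hypothetical descriptors: (volume in cubic angstroms, hydrophobic fraction, net charge)
pockets = {
    "query":  (520.0, 0.62, -1.0),
    "cand_a": (540.0, 0.60, -1.0),
    "cand_b": (210.0, 0.20, 2.0),
    "cand_c": (500.0, 0.55, 0.0),
}
print(rank_by_pocket_similarity("query", pockets))
```

Standardizing first matters because raw volume would otherwise swamp the other descriptors in the distance calculation.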

4. Methods based on ancestral sequence reconstruction (ASR)

Principle: "Evolutionary backtracking enhances robustness." Ancestral enzymes often show higher thermal stability and a broader substrate spectrum, so high-performance catalysts can be obtained by "resurrecting" them.

Method: build a phylogenetic tree with IQ-TREE, infer ancestral sequences with PAML, then synthesize the inferred sequences and verify their function experimentally.

Case study: Elizabeth Gillam's team (University of Queensland) resurrected ancestral P450 enzymes and obtained a new class of biocatalysts with very high thermal stability and tolerance to a variety of unnatural substrates.

Paper: Ancestral cytochrome P450 enzymes show increased thermostability and substrate promiscuity (DOI: 10.1016/j.abb.2016.03.024)
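The inference step can be illustrated with a toy example. Tools such as PAML use maximum-likelihood models; the sketch below swaps in the much simpler Fitch parsimony rule and only reconstructs the root of a small binary tree. The tree encoding (nested tuples of leaf names), the sequences, and the use of "X" for ambiguous positions are assumptions made for illustration.

```python
def fitch_root_states(tree, column):
    """Bottom-up Fitch pass for one alignment column.

    tree: a leaf name (str) or a (left, right) tuple of subtrees.
    column: dict mapping leaf name -> residue at this position.
    Returns the set of equally parsimonious residues at this node.
    """
    if isinstance(tree, str):
        return {column[tree]}
    left, right = (fitch_root_states(child, column) for child in tree)
    common = left & right
    return common if common else left | right


def infer_root_sequence(tree, alignment):
    """Reconstruct the root sequence; 'X' marks positions that stay ambiguous."""
    length = len(next(iter(alignment.values())))
    root = []
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        states = fitch_root_states(tree, column)
        root.append(next(iter(states)) if len(states) == 1 else "X")
    return "".join(root)


# Hypothetical four-taxon tree and gap-free alignment.
tree = (("a", "b"), ("c", "d"))
alignment = {"a": "MKV", "b": "MRV", "c": "MKI", "d": "MKL"}
print(infer_root_sequence(tree, alignment))
```

Ambiguous positions are exactly where likelihood-based methods add value, since they weigh branch lengths and substitution models instead of treating all changes as equal.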

5. Methods based on substrate-enzyme docking

Principle: "Induced fit and binding energy prediction." The process of the substrate entering the enzyme pocket is simulated, and the binding free energy and catalytic distance are evaluated.

Method: apply reverse virtual screening, docking the target substrate against large sets of candidate enzymes with AutoDock Vina or RosettaMatch and filtering the results with catalytic constraints.

Case study: Christian Sonnendecker's team (Leipzig University) screened candidate enzymes from a compost metagenome through structural docking and energy evaluation and identified PHL7, an efficient plastic-degrading enzyme that degrades PET significantly faster than earlier benchmark enzymes.

Paper: Low-carbon footprint enzymatic surfacing of poly(ethylene terephthalate) (PET) (DOI: 10.1002/cssc.202102262)
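The filtering step after docking might look like the sketch below. It assumes docking has already been run and that each pose has been reduced to a predicted binding energy plus the coordinates of the catalytic nucleophile and the substrate atom it should attack. The `DockingPose` record, its field names, and both cutoffs are hypothetical and are not the output format of AutoDock Vina or RosettaMatch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DockingPose:
    enzyme_id: str
    binding_energy: float   # kcal/mol; more negative means stronger predicted binding
    nucleophile_xyz: tuple  # coordinates of the catalytic nucleophile atom
    target_atom_xyz: tuple  # coordinates of the substrate atom to be attacked


def catalytic_distance(pose):
    """Distance in angstroms between the nucleophile and its target atom."""
    return math.dist(pose.nucleophile_xyz, pose.target_atom_xyz)


def screen(poses, max_energy=-6.0, max_distance=4.0):
    """Keep poses that pass both constraints, strongest binders first.

    The cutoffs are illustrative assumptions, not recommended values.
    """
    kept = [p for p in poses
            if p.binding_energy <= max_energy
            and catalytic_distance(p) <= max_distance]
    return sorted(kept, key=lambda p: p.binding_energy)


# Hypothetical docking results for four candidate enzymes.
poses = [
    DockingPose("cand_A", -7.5, (0.0, 0.0, 0.0), (3.0, 0.0, 0.0)),
    DockingPose("cand_B", -8.2, (0.0, 0.0, 0.0), (6.0, 0.0, 0.0)),
    DockingPose("cand_C", -4.0, (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)),
    DockingPose("cand_D", -6.5, (0.0, 0.0, 0.0), (0.0, 3.5, 0.0)),
]
print([p.enzyme_id for p in screen(poses)])
```

Requiring both criteria reflects the point made above: a pose that binds tightly but holds the substrate too far from the catalytic residue is unlikely to be productive.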

Summary: Build an efficient and collaborative “network of discovery”

Taken together, these strategies show how new enzyme discovery has moved from the needle-in-a-haystack approach of random screening to "precision guidance" driven by data and algorithms. The five strategies do not exist in isolation; they complement one another and build on each other layer by layer:

* From sequence to structure, broadening the boundaries of the search;
* From the overall fold to the local pocket, improving localization accuracy;
* From modern enzymes back to ancestral ones, achieving breakthroughs in performance;
* From static comparison to dynamic interaction (docking), improving the reliability of predictions.

Together, they form an efficient and scalable technology system for new enzyme discovery that has greatly accelerated the mining of catalytic components. The continuous improvement of this system is what allows new enzymes to keep moving from databases to laboratories and on to industrial use, bringing sustained innovation to synthetic biology, green manufacturing, and medicine and health.