25 years of the Human Genome: AI and biological chemistry step in to help decode the secrets of the genome

  • A team that includes IDIBELL researchers is leading a global initiative to understand the genome with the help of AI and biological chemistry, and thereby facilitate drug discovery.
  • The project is committed to open science: it will create a massive database of chemical compounds, accessible to everyone, on which different AI models can be trained.

It has been 25 years since the completion of one of the great milestones of Big Science, the large-scale model of research that emerged in the middle of the last century. The Human Genome Project (HGP) was a joint international research effort dedicated to decoding the sequence of the human genome and identifying the approximately 20,000 genes it contains. And it succeeded: the result was a reference sequence that covers 99% of the gene-containing regions of the human genome. Because the sequence is public and accessible to everyone, anyone has been able to read our genetic code ever since.

However, reading the genome is not the same as understanding it. Knowing the order in which the four letters of the genome (A, C, G, T) are arranged along the sequence does not make it intelligible. That requires more research, focused not only on reading the letters that make up the code of each gene but also on understanding their function. Genes are the instructions for making proteins, and proteins carry out the functions that make us who we are. Understanding what the instructions in genes say (and therefore what the proteins they code for are designed to do) is essential to understanding the biochemical processes that occur daily in our bodies. And this is key to deciphering the enigma of many diseases: if we know what happens in our body when we are healthy, we can work out what goes wrong when we are not.

The problem is that, despite all this information, most of the proteins studied today are the same ones that were studied before the HGP. We have good tools to study these proteins and we know that they are important, but “there are 30% of genes whose function we don’t know because no one studies them,” explains Dr. Albert Antolín, head of the Medicinal Chemistry and Drug Design research group at IDIBELL. “The reality is that if we don’t have the right tools, the incentive to investigate understudied proteins is low,” he adds.

 

A shift in focus: what if AI is used to discover new chemical tools?

Now, however, it seems that a way has been found to use Artificial Intelligence (AI) to tackle this challenge. After a few years of proofs of concept, a team that includes IDIBELL researchers has launched a revolutionary project to generate large amounts of data on the molecules that interact with proteins and, with the help of AI, develop new chemical tools to study them. The details of the project are explained in an article recently published in the scientific journal Nature Reviews Chemistry.

The project is part of the Structural Genomics Consortium (SGC), a global public-private consortium made up of universities and pharmaceutical companies. Its aim is to facilitate the discovery of the function of the proteins encoded in the genome, and so accelerate the discovery of new drugs for diseases that do not yet have a cure. The philosophy is the same as before: if we discover the compounds that interact with proteins, we will have an easy way to study their function and, at the same time, we will know how to modulate them so that they correctly perform the function they are meant to perform.

 

Training AI with experimental data

The project is part of an ambitious initiative, Target 2035, which aims to discover a chemical compound for every human protein by 2035. To do this, AI models must be trained on large amounts of experimental data. The data available now are insufficient to train the models properly: at the moment, the models can only learn from small, fragmented datasets. Therefore, as Dr. Antolín explains, “The goal over the next five years is to generate a huge amount of data to create more accurate AI models.”

The more data, the greater the accuracy. To achieve this, advanced screening techniques will be used to experimentally test more than 1,000 proteins of the human genome against billions of chemical compounds over the next five years. In the long term, the idea is that these data will make it possible to train a foundational AI model that will make it easier to achieve Target 2035’s ambitious goal.

 

Science open to all

The project is part of an open science initiative. The data generated can be used by any research centre or pharmaceutical company to train its own AI models and thus facilitate and accelerate the discovery of new drugs. “It is very important that fundamental science is open and that everyone can access this information,” says Dr. Antolín. He adds, “Research into many diseases requires clinical trials that are very expensive. Public-private collaboration accelerates this process, especially in the early stages of new drug development.”

 

 

The Bellvitge Biomedical Research Institute (IDIBELL) is a research center established in 2004 that specializes in cancer, neuroscience, translational medicine, and regenerative medicine. It has a team of more than 1,500 professionals who, across 73 research groups, publish more than 1,400 scientific articles per year. The institutions that participate in IDIBELL are the Bellvitge University Hospital and the Viladecans Hospital of the Catalan Institute of Health, the Catalan Institute of Oncology, the University of Barcelona, and the City Council of L’Hospitalet de Llobregat.

IDIBELL is a member of the Campus of International Excellence of the University of Barcelona HUBc and is part of the CERCA institution of the Generalitat de Catalunya. In 2009 it became one of the first five Spanish research centers accredited as a health research institute by the Carlos III Health Institute. In addition, it is part of the “HR Excellence in Research” program of the European Union and is a member of EATRIS and REGIC. Since 2018, IDIBELL has been an Accredited Center of the AECC Scientific Foundation (FCAECC).

Original source: Aled M. Edwards, et al. Protein–ligand data at scale to support machine learning. Nature Reviews Chemistry, 2025.
