Imagine this scenario: to answer the question "What interactions would occur when a new drug molecule is injected into a mouse?", researchers would need neither complex clinical trial designs nor tedious repeated experiments for verification.
They would simply tell a chatbot such as ChatGPT about the drug and the molecules present in the environment, and it would quickly and accurately report the effects the drug will have. This would greatly reduce time costs for researchers and resource costs for manufacturers, helping drugs to be discovered faster and more precisely.
Not long ago, Fang Junfeng, a Ph.D. student at the University of Science and Technology of China, and his team developed MolTC (Molecular inTeraction Modeling enhanced by Chain-of-thought theory), the first unified framework for molecular interaction learning built on a multimodal large language model, bringing new hope of making this scenario a reality.
So far, the reliability of the MolTC framework has been verified on more than 4,000,000 molecular pairs across multiple datasets. "Admittedly, the scenario above still seems like a pipe dream, far from being realized, and our work is only a small step on this long journey," said Fang Junfeng.
MolTC: Capable of efficiently modeling molecular graph information
In the study, Fang Junfeng and his colleagues focused on molecular relational learning, that is, understanding and modeling the interactions between molecular pairs in tasks such as drug-drug interactions and solute-solvent interactions (SSI), and designed MolTC, a unified framework for molecular interaction learning built on a multimodal large language model.
By utilizing graph encoders and projectors, MolTC can efficiently model molecular graph information.
In addition, to enhance information sharing across datasets and achieve unified molecular interaction learning, the research team proposed a multi-hierarchical chain-of-thought to optimize the reasoning and training paradigms of large models.
The team also adopts a dynamic parameter-sharing strategy across molecular interaction tasks, in order to achieve both prediction efficiency and prediction accuracy.
At present, the most intuitive application of this framework is building a more comprehensive, unified platform for molecular interaction prediction whose users need neither a deep learning background nor prior biochemical knowledge.
This means that by collecting and absorbing more molecular interaction tasks with wider coverage, MolTC can explicitly and efficiently learn the underlying paradigms and mechanisms of general molecular interactions, and thereby grasp hidden molecular relationships more accurately.
This not only overturns the situation in which traditional deep learning models can adapt to only a few tasks at a time, but also makes up for the shortcoming that conventional large models learn the laws of molecular interactions only implicitly.
At the same time, thanks to its explicit and unified architecture, MolTC can still produce accurate and efficient output in few-shot, or even zero-shot, tasks.
On the other hand, most current molecular interaction models, whether traditional deep learning models or models fine-tuned from classic large models, require users to have some grounding in deep learning and biochemistry in order to train the model on specific datasets.
However, once the MolTC framework integrates more comprehensive interaction tasks, its strong zero-shot performance would allow it to output interaction results directly. The MolTC framework can also be used to analyze and model multi-molecule interaction tasks.
So, what background led Fang Junfeng and his colleagues to begin this study?
In recent years, with a wealth of knowledge reserves and excellent deductive abilities, large models have become an important tool for efficient learning of molecular relationships. However, although the current paradigm has achieved remarkable results, it still faces certain problems.
Specifically, the current paradigm is too dependent on text data, such as the SMILES (Simplified Molecular Input Line Entry System) information of molecules, and thus has not deeply mined the rich structural information contained in molecular graphs.
More critically, there is currently no unified molecular interaction learning framework, which makes it hard to learn and extract key information across different datasets.
This is disastrous for tasks with only a small amount of labeled data. Take solute-solvent interaction as an example: the SSI dataset CombiSolv, which contains 100,000 molecular pairs, can readily be used to train the current mainstream frameworks.
However, without a unified paradigm, the underlying molecular interaction mechanisms cannot be shared across datasets. As a result, many small SSI datasets, such as FreeSolv, cannot support the training of large models even with strong LoRA-based fine-tuning strategies, because the risk of overfitting is too high.
What's worse, alleviating this problem requires generating labeled data through biochemical experiments, a process that is very time-consuming and resource-intensive.
Fang Junfeng and others have noticed that in recent years, large models have achieved several major breakthroughs in the field of biochemistry, such as AlphaFold2, which can predict protein structures.
These LLM4Science works, which advance basic scientific research and benefit humanity, are very impressive. They are also of great practical significance for the field of biochemical LLMs.
Initially, the team aimed to optimize existing frameworks built on biochemical large models. They observed that most of the mainstream paradigms and biochemical tasks concern single molecules, such as molecular property prediction and IUPAC naming.
Later, they found that in many cases a large number of molecular properties, such as the Gibbs free energy of dissolution, cannot exist independently of molecular interactions.
People are often more concerned with the role that molecules play in interactions, rather than the properties of individual molecules themselves. Taking the properties of drug molecules as an example, the interaction between drug molecules is crucial for drug development.
At the same time, people are also very concerned about the impact of drugs on the human body, that is, the interaction between drug molecules and specific molecules in the human body environment, rather than the complex biochemical properties of the drug itself.
Despite the importance of molecular interactions, the research group found that most current large models for molecular interaction tasks address only a single interaction task or a small number of them.
A unified paradigm for molecular interaction learning with large models remains unexplored. The team believes that a unified learning paradigm can make full use of what the underlying molecular interaction mechanisms have in common, and more thoroughly mobilize the reasoning ability and knowledge reserves of the large model.
The large model may be a "slow starter"
For the aforementioned reasons, the research team intends to develop a unified framework for molecular interaction learning based on a large language model.
During the research period, the first challenge they faced was: how to efficiently extract information from the two molecules in the interaction and make the large model understand them?
They found that most large models currently used to model molecular interactions rely on the textual information of the molecules, and few can exploit the structural information contained in molecular graphs.
The Q-Former (Querying Transformer) is a lightweight transformer architecture that has consistently performed very well in multimodal vision-language research.
Inspired by this, the research team uses two graph neural network (GNN) encoders to obtain representations of the molecular pair and maps them into the input space of the large language model using Q-Formers.
This design gives the large model a pair of sharp eyes, allowing it to understand the interactions between biochemical molecules efficiently and accurately.
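The encode-then-project pipeline described above can be illustrated with a toy sketch in plain Python. This is not MolTC's implementation: the single round of mean-aggregation message passing stands in for a GNN encoder, the query-based cross-attention is a heavily simplified stand-in for a Q-Former, and all names, dimensions, and random weights (`message_pass`, `query_pool`, `project`, `GRAPH_DIM`, `LLM_DIM`, `NUM_QUERIES`) are assumptions made for illustration. For brevity the same toy encoder is reused for both molecules.

```python
import math
import random


def message_pass(node_feats, edges):
    """One round of mean-aggregation message passing on an undirected graph."""
    neighbors = {i: [i] for i in range(len(node_feats))}  # each node also keeps itself
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dim = len(node_feats[0])
    return [
        [sum(node_feats[j][d] for j in neighbors[i]) / len(neighbors[i]) for d in range(dim)]
        for i in range(len(node_feats))
    ]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_pool(node_embs, queries):
    """Each query vector attends over all nodes, giving one output vector per query.

    The output length depends only on the number of queries, not on molecule size.
    """
    dim = len(queries[0])
    pooled = []
    for q in queries:
        scores = [sum(x * y for x, y in zip(q, n)) / math.sqrt(dim) for n in node_embs]
        weights = softmax(scores)
        pooled.append([sum(w * n[d] for w, n in zip(weights, node_embs)) for d in range(dim)])
    return pooled


def project(vectors, weight):
    """Linear map from the graph-embedding size to the (hypothetical) LLM embedding size."""
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(len(weight[0]))]
        for v in vectors
    ]


def encode_molecule(node_feats, edges, queries, weight):
    return project(query_pool(message_pass(node_feats, edges), queries), weight)


rng = random.Random(0)
GRAPH_DIM, LLM_DIM, NUM_QUERIES = 4, 6, 3  # illustrative sizes only
queries = [[rng.gauss(0, 1) for _ in range(GRAPH_DIM)] for _ in range(NUM_QUERIES)]
weight = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(GRAPH_DIM)]

# Toy graphs with random atom features: a 3-atom chain and a 5-atom ring.
mol_a = ([[rng.random() for _ in range(GRAPH_DIM)] for _ in range(3)], [(0, 1), (1, 2)])
mol_b = ([[rng.random() for _ in range(GRAPH_DIM)] for _ in range(5)],
         [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])

tokens_a = encode_molecule(*mol_a, queries, weight)
tokens_b = encode_molecule(*mol_b, queries, weight)
print(len(tokens_a), len(tokens_b), len(tokens_a[0]))  # prints: 3 3 6
```

Both molecules, despite having different numbers of atoms, end up as the same number of fixed-size vectors, which is what makes it possible to feed them to a language model as a short sequence of soft tokens.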
However, they found that the difficulty of analyzing the interactive properties of molecular pairs increases exponentially compared to the analysis of the properties of individual molecules.
Specifically, the model must accurately understand the properties of both molecules and, for each interaction target, extract the relevant key substructures from each of them before it can model and analyze the interaction.
Traditional large models are inherently poor at quantitative estimation tasks, so it is difficult for them to output precise numerical values of interaction properties directly from an input molecular pair (such as the maximum absorption wavelength in the chromophore solubility task).
Fang Junfeng said: "At that time, large models could already complete qualitative tasks very well, but they were never able to give precise numerical values for molecular interactions."
So they tried modifying the architecture of the large model and tested many different architectures, but all in vain.
"Later, over a meal, Zhang Shuai from the team suggested that the large model might be a 'slow starter', and that asking it to state its answer outright might be too demanding.
It might be better to let it express itself gradually, while we listen like a counselor or an old friend and guide it to speak," said Fang Junfeng.
Inspired by this, they tested a multi-level chain-of-thought to improve accuracy on quantitative analysis tasks.
Specifically, the upper-level chain of thought guides MolTC's pre-training so that the model first identifies and presents, in sequence, the key biochemical properties of each molecule, which improves the prediction accuracy of molecular interactions.
The data for the pre-training phase comes from DrugBank and PubChem, both authoritative biochemical databases containing molecule-property pairs.
In addition, to make the MolTC framework applicable to a variety of application scenarios, they randomly combined molecules from the aforementioned databases to construct different molecule pairs across multiple fields.
For more complex and difficult quantitative molecular interaction tasks, the lower-level chain of thought guides MolTC to first estimate a rough range for the target value and then gradually refine it to an accurate value.
The benefit of the multi-level chain-of-thought approach is that MolTC can reason in an orderly manner and complete molecular interaction predictions step by step, and in particular can make precise predictions for quantitative molecular interactions.
The research team also discovered by chance, through the MolTC framework, that "giving a range first, then gradually converging" can help improve the accuracy of large models on quantitative output tasks in general.
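The "range first, then converge" idea can be made concrete with a small sketch of how a numeric label might be written out as a two-step answer, and how such an answer could be checked for consistency. The answer template, the bin width, and the function names `coarse_to_fine_answer` and `parse_answer` are assumptions chosen for illustration; the article does not specify the format MolTC actually uses.

```python
import math
import re


def coarse_to_fine_answer(value, bin_width, unit=""):
    """Write a numeric label as a coarse range followed by the precise value."""
    low = math.floor(value / bin_width) * bin_width
    high = low + bin_width
    return (f"The value is likely between {low:g} and {high:g}{unit}. "
            f"More precisely, it is about {value:g}{unit}.")


def parse_answer(text):
    """Return (low, high, value, consistent) from a two-step answer.

    `consistent` is True when the final value falls inside the stated range.
    Assumes the unit string itself contains no digits.
    """
    numbers = [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", text)]
    if len(numbers) != 3:
        raise ValueError("expected exactly three numbers: low, high, value")
    low, high, value = numbers
    return low, high, value, low <= value <= high


answer = coarse_to_fine_answer(412.7, 50, " nm")
print(answer)
# prints: The value is likely between 400 and 450 nm. More precisely, it is about 412.7 nm.
print(parse_answer(answer))
# prints: (400.0, 450.0, 412.7, True)
```

Writing the target this way forces the coarse estimate to be produced before the exact figure, so an answer whose final value falls outside its own stated range can be flagged automatically.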
At this point, the basic framework and training paradigm of MolTC were largely in place. However, in experiments the team found that because the graph encoder and projector that the two molecules pass through before reaching the large model are structurally identical, the large language model often confuses the properties of the two molecules.
That is, when asked about the properties of molecule 2, it would incorrectly give the properties of molecule 1. They realized that giving the large model only the molecular graph information is not enough, and that additional molecular information needs to be introduced to help.
Therefore, they added the SMILES strings of the two molecules at the input of the large model, so that MolTC can clearly distinguish the two molecules and their input order.
With this design complete, the team finally saw, as they had hoped, that MolTC could achieve good results on a variety of molecular interaction tasks.
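The SMILES-based disambiguation mentioned above can be pictured as a prompt in which each molecule's graph-token slots are paired with its SMILES string and an explicit index. The sketch below is only an illustration under stated assumptions: the `<graph>` placeholder, the prompt wording, and the function names are hypothetical and are not taken from the MolTC paper or code.

```python
GRAPH_SLOT = "<graph>"  # hypothetical placeholder, later replaced by a projected graph embedding


def describe_molecule(index, smiles, num_graph_tokens):
    """Pair a molecule's graph-token slots with its SMILES string and its position."""
    slots = GRAPH_SLOT * num_graph_tokens
    return f"Molecule {index}: graph {slots} SMILES {smiles}"


def build_pair_prompt(smiles_1, smiles_2, question, num_graph_tokens=8):
    """Assemble an ordered prompt for a molecular pair."""
    return "\n".join([
        describe_molecule(1, smiles_1, num_graph_tokens),
        describe_molecule(2, smiles_2, num_graph_tokens),
        f"Question: {question}",
    ])


# Ethanol (CCO) as molecule 1 and water (O) as molecule 2.
prompt = build_pair_prompt(
    "CCO", "O",
    "What is the solvation free energy of molecule 1 in molecule 2?",
    num_graph_tokens=2,
)
print(prompt)
```

Because the two blocks of graph tokens come from structurally identical encoders, the SMILES strings and the explicit "Molecule 1" and "Molecule 2" labels give the language model a textual handle for telling the two inputs apart.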
Fang Junfeng said: "However, it wasn't long before we were disappointed again. We found that as we kept adding new datasets to achieve a unified learning framework, the prediction accuracy of MolTC declined significantly."
Later, Wu Chang from the team realized that although the underlying interaction mechanisms are similar, their specific manifestations are not the same. At the same time, the focus of each interaction dataset is also different.
Therefore, how to distill a universal underlying interaction mechanism from the commonalities between molecular interaction datasets and to eliminate the interference of redundant information from each dataset is another challenge faced by the research group.
Only by solving this challenge can a unified molecular interaction learning framework be constructed. To address it, they verified the following properties of molecular interaction tasks:
First, the roles that molecules play in an interaction matter. Second, the order of the molecules in an interaction matters. Third, molecular roles and order change which features are important.
Then, they guided MolTC to create unique encodings for molecules based on their roles and order when learning about each molecular property.
To help MolTC learn these differences well, they introduced a dynamic parameter-sharing strategy.
Ultimately, the research team verified the effectiveness of MolTC on more than 4,000,000 molecular pairs from 12 datasets spanning multiple molecular interaction fields, showing that MolTC can efficiently and accurately predict the target molecular interactions.
Recently, the related paper was published on arXiv[1] with the title "MolTC: Towards Molecular Relational Modeling In Language Models", with Fang Junfeng as the first author and Professor Wang Xiang from the University of Science and Technology of China serving as the corresponding author.
Subsequently, they plan to further increase the training data for MolTC, aiming to create a truly unified molecular interaction learning framework.
Additionally, the research group found that although MolTC performs exceptionally well on small molecule tasks, its performance occasionally falls short when the task involves large molecule interactions.
Therefore, they plan to embed an additional information compression module at the front end of the large model interface, utilizing techniques such as "Graph Information Bottleneck" (GIB) commonly used in the field of interpretable deep learning, to compress the input information of large molecules, thereby excluding the interference of redundant information and further expanding the applicability of the MolTC framework.