
Zeng Zhongxun is a native of Chaoshan. He earned his bachelor's degree at the University of Illinois and his master's degree at the Georgia Institute of Technology in the United States, then worked for a time at IBM Research and the Shenzhen IDEA Research Institute.

After the launch of ChatGPT, he realized that there were certain shortcomings in the research paradigm for large models, so he decided to pursue a Ph.D. at the Chinese University of Hong Kong.

Not long ago, Zeng Zhongxun and his team proposed a new evaluation paradigm and, based on it, a method for transforming existing datasets.

Experiments show that this method effectively distinguishes the capabilities of different models. They also show that the new evaluation paradigm is robust to the data contamination that is common today.

Previously, because training data lacked transparency, people could not determine whether large models' performance gains on some leaderboards came from data contamination and leaked questions. The new evaluation paradigm is highly resistant to score gains achieved by "memorizing answers to questions". Thanks to this resistance, the vast majority of existing datasets can be "refurbished" and reused.


At the same time, this new evaluation method can not only reveal the differences in the capabilities of large models, but also bring some inspiration to downstream applications.

Recently, the paper was published on arXiv under the title "MR-GSM8K: A Meta-Reasoning Revolution in Large Language Model Evaluation", with Zeng Zhongxun as the first author and Professor Jia Jia from the Chinese University of Hong Kong as the corresponding author [1].

Are large models also relying on the "sea of questions" tactic?

"Rote memorization" and "drill and practice" are learning methods that many people used during their school years. But did you know that large models use these two methods as well? And where does the boundary of large models' capabilities currently lie?

Starting from the two dimensions of reasoning and cognition, when a paper claims that a large model has achieved results beyond human level on an evaluation index, should we feel panic?

Or should we carefully examine whether some factors were overlooked when setting the index, so that the cognitive ability of the large model is exaggerated?

In fact, failing to think carefully about what a metric is designed to measure brings at least the following risks:

Firstly, if we do not understand whether the evaluation results truly reflect the capabilities of the large model, we will often overstate the model's performance.

Secondly, it can lead people to assume that an improvement in metrics is equivalent to an enhancement of the model's capabilities, and to better effectiveness and practicality in real scenarios. The result is a blind pursuit of, and competition over, leaderboard results, which becomes a vicious cycle.

Thirdly, there is an excessive focus on and comparison of performance in specific scenarios, neglecting the improvement of the overall cognitive ability of large models.

Currently, the evaluation sets for the reasoning and cognitive abilities of large models mainly rely on some standardized test questions or some carefully designed rule-based games.

The original intention of designing these evaluation sets is largely that the designers believe that the pattern recognition, memory recall, analysis of assumptions, and induction and deduction abilities required to solve such reasoning tasks are a kind of "meta" ability needed to handle all tasks, and believe that such abilities are crucial for the generalization and robustness of large models in real scenarios.

However, the evaluation methods designed for these tasks often rely only on a simple match of the final computed result and ignore any examination of the reasoning process. This gap between the goal and the implementation has greatly aggravated the confusion in the field of large model evaluation.

For example, there is a famous "shortcut" case in image recognition: when classifying wolves and huskies, the model learned to check whether there is snow in the background rather than to identify the differences in the physical characteristics of the two animals.

Similar phenomena also exist in cognitive reasoning datasets. When a large model is asked to give a step-by-step "chain of thought" for a math problem, it often confuses quantities with different units, for example multiplying or adding a per-hour speed and a distance in kilometers, which shows that it does not sufficiently understand the physical meaning behind different concepts.

So how can we better measure a large model's grasp of concepts and its ability to generalize in applying them?

Take the following figure as an example. For a complex reasoning problem, there may be multiple solutions from the starting point to the end point. If each reasoning step is regarded as a node, the nodes form a path. In the current training paradigm, large models are often exposed only to a few correct solution paths (in cyan or blue), while the incorrect paths (in orange) are ignored.

Similarly, when evaluating the performance of large models, people focus only on whether the endpoint of the reasoning path matches the standard answer, while neglecting potential errors in the nodes or paths along the way.

 

For example, in education, if GPT4's evaluation accuracy on elementary-level math problems is only 40%, we cannot help but question its practicality.

In consulting, the application scenarios for large models rely heavily on their ability to work through alternative plans, break down overall steps, analyze, and so on. The current weakness of large models in these areas inevitably raises questions about the reliability of their downstream applications.
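The concern raised above, that matching only the final answer can hide errors in the reasoning process, can be made concrete with a small sketch. The problem, the step format `(left, op, right, claimed)`, and the function names below are illustrative assumptions, not taken from the paper. The process check here only recomputes arithmetic, which is far weaker than the judgement the article goes on to describe, but it is enough to show the gap.

```python
import operator

# Supported arithmetic operators for recomputing a step.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def final_answer_matches(steps, reference):
    """Outcome-only check: compare the last claimed value with the reference."""
    return steps[-1][3] == reference


def first_wrong_step(steps):
    """Process check: return the 1-based index of the first step whose claimed
    result disagrees with the recomputed one, or None if every step agrees."""
    for index, (left, op, right, claimed) in enumerate(steps, start=1):
        if OPS[op](left, right) != claimed:
            return index
    return None


# Hypothetical problem: pens cost 3 dollars each; Tom buys 4 and pays with 20.
# The reference answer for his change is 20 - 3 * 4 = 8.
reference = 8
sound = [(3, "*", 4, 12), (20, "-", 12, 8)]
flawed = [(3, "*", 4, 14), (20, "-", 14, 8)]  # two slips that cancel out

for name, steps in [("sound", sound), ("flawed", flawed)]:
    print(name, final_answer_matches(steps, reference), first_wrong_step(steps))
```

Both solutions pass the outcome-only check, and only the process check flags the flawed one at step 1. A leaderboard that scores final answers alone would count the two as equally good.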

 

 

Let the large model "change from a student to a teacher"

 

Based on this, Zeng Zhongxun and his team carried out this study. In fact, the inspiration for this study came from a competition. Previously, Zeng Zhongxun participated in the "Guangdong-Hong Kong-Macao Greater Bay Area (Huangpu) International Algorithm Competition" in the sub-track "Comprehensive Strength Enhancement of Large Language Models."

 

At that time, he surveyed papers on enhancing the reasoning capabilities of large models, which fell mainly into two directions: the first is homologous data augmentation, and the second is using feedback models for data screening or for reinforcement learning training of large models.

When he tried them, he found that both methods had significant problems.

Firstly, when ChatGPT is used for data augmentation, it does not truly understand some of the concepts that people hope it will work with. When it applies these concepts to create and solve problems, various errors often occur, so very fine-grained program design and guidance are often needed to improve accuracy.

Secondly, after carefully studying the role of the feedback model, Zeng Zhongxun concluded that asking a feedback model to filter reasoning data is essentially equivalent to asking it to perform "meta-reasoning."

This is even harder than solving the problem directly: in order to improve problem solving, one introduces the more difficult task of evaluating solutions, which turns one problem into another, harder one.

After realizing this problem, he and his team developed a meta-reasoning paradigm and applied it to some common datasets.

The results showed that both open-source and closed-source large models suffered a sharp decline in performance. The open-source large models specialized for reasoning fared worst, with accuracy falling below one percent.

Therefore, he and his colleagues call for shifting the focus of testing large models' cognitive reasoning from matching the final computed result to examining the computational process itself.

The specific approach is as follows: First, sample some given reasoning paths from the problem-solving space, and then let the large model evaluate them. The evaluation includes: Is the reasoning path correct? Where are the error nodes and steps? What is the reason for the error?
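The three questions above can be sketched as a small scoring routine. This is a minimal illustration under stated assumptions: the `Judgement` schema, the `score` function, and the `reason_accepted` flag are hypothetical names chosen for this example and are not the team's actual data format or code. Because an error reason is free text, the sketch assumes a human or model grader decides separately whether the reason is acceptable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgement:
    """Answers to the three questions for one sampled solution.

    Hypothetical schema, used both for the annotated label and the model verdict.
    """
    is_correct: bool
    first_error_step: Optional[int] = None  # 1-based; None when the solution is correct
    error_reason: Optional[str] = None


def score(label: Judgement, verdict: Judgement, reason_accepted: bool = False) -> dict:
    """Compare a model verdict with the annotated label, question by question.

    `reason_accepted` is an assumption: the grader's decision on the free-text
    reason is passed in rather than computed here.
    """
    correctness = verdict.is_correct == label.is_correct
    if label.is_correct:
        # A correct solution has no error to locate or explain.
        return {"correctness": correctness, "error_step": None, "error_reason": None}
    step = correctness and verdict.first_error_step == label.first_error_step
    return {"correctness": correctness, "error_step": step, "error_reason": step and reason_accepted}


# Sampled solution to: "A car drives at 60 km/h for 2 hours. How far does it go?"
#   step 1: 60 + 2 = 62   (adds a speed to a duration)
#   step 2: the car travels 62 km
label = Judgement(False, first_error_step=1, error_reason="added speed and time instead of multiplying")

fooled = Judgement(True)
wrong_step = Judgement(False, first_error_step=2, error_reason="wrong unit in the answer")
full_credit = Judgement(False, first_error_step=1, error_reason="speed and time were added")

print(score(label, fooled))
print(score(label, wrong_step))
print(score(label, full_credit, reason_accepted=True))
```

In this sketch, credit is cumulative: locating the error only counts if the verdict on correctness is right, and the reason only counts if the error step is right. That ordering is a design choice made for the example, intended to keep a lucky guess on a later question from earning credit.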

This shift in the evaluation paradigm means that the large model must have a global and macroscopic understanding of the entire problem-solving space, knowing not only the results but also the reasons behind them.

In detail, the large model needs to achieve the following aspects:

Firstly, it needs to know what the final result and the intermediate nodes of the reasoning are.

Secondly, it must critically evaluate the conditions and premises of each reasoning node and consider the logical connections between nodes, in order to determine whether the current step is wrong.

Thirdly, it must be able to substitute different assumptions, or reason counterfactually to play out and analyze the subsequent reasoning path, so as to determine whether the answer lies on a correct reasoning path.

These requirements will force the large model to rise from the perspective of an answerer to the height of a teacher for global examination and global reasoning. For this "reasoning about the reasoning process", the team calls it the "meta-reasoning" evaluation paradigm.

As shown in the figure above, when they applied the meta-reasoning paradigm to GSM8K, a popular math evaluation set, the performance of GPT4 dropped by more than half, and the accuracy of GPT3.5 dropped from more than 80% to single digits.

This indicates that a simple meta-reasoning transformation of the same dataset exposes huge differences in model capabilities. Notably, after the transformation the gaps between large models become much more pronounced.

The open-source models that have achieved leading results on GSM8K, such as Mammoth, WizardMath, and MetaMath, are trained by applying a large amount of homologous data augmentation to this dataset, which brings their scores close to that of GPT3.5.

Unfortunately, after the research group applied the paradigm transformation, the performance of these open-source math models, originally close to that of GPT3.5, fell far below it.

This may also indicate that the current popular simple data augmentation methods are more similar to "memorizing problems" or "drilling through a sea of problems," which cannot truly enhance the actual capabilities of large models.

As a general evaluation paradigm, the meta-reasoning paradigm proposed by Zeng Zhongxun and others can be extended to more evaluation scenarios.

In addition, the difficulty of annotation in this study far exceeded expectations. During the research, they applied the meta-reasoning transformation to GSM8K, a dataset of primary and junior high school math problems. The transformation requires annotators to perform the same kind of meta-reasoning on the dataset and to record the results as the evaluation set.

Although the questions are only at primary school level, they found that every stage, from reading the question and the standard answer to reading the sampled answers to be evaluated, required detailed analysis and reasoning for each step.

Because annotation takes so long, the price per item is high; and because the task is difficult, the annotators must also be highly qualified.

Zeng Zhongxun said: "When I saw the quotation, I suddenly remembered that OpenAI has a paper that annotates the questions and problem-solving processes of the Mathematical Olympiad for reinforcement learning training. The nature and content of OpenAI's annotation are somewhat similar to ours."

OpenAI's dataset, named PRM800K, contains 800,000 annotated questions. At a conservative estimate of 10 US dollars per question, the dataset cost 8 million US dollars. Yet OpenAI's paper did not produce particularly direct results, nor did it bring a significant improvement in practical performance.

After truly understanding the cost and difficulty of labeling, one cannot help but marvel at OpenAI's financial strength and tolerance for failure, said Zeng Zhongxun.

It is also reported that Ilya Sutskever, one of the founders of OpenAI, was asked in an interview what he would choose to do if general artificial intelligence were realized. Ilya replied, "Perhaps I would actively integrate into AI (be part of AI)."

When reading the aforementioned interview report, Zeng Zhongxun did not understand what it meant to integrate into AI at the time. However, as the work continued to progress, he vaguely felt that for AI to cognitively fit with humans, it may largely depend on humans continuously providing rich feedback signals.

"This may also be a way to integrate into AI? A kind of mythical romantic feeling similar to Gan Jiang and Mo Xie sacrificing themselves for the sword," said Zeng Zhongxun.

In the future, he and his team are committed to creating a more comprehensive and diverse evaluation system. Currently, they have contacted several top domestic labeling companies, with the goal of building meta-reasoning scenarios in four directions: academic, logical, embodied, and application-based.
