Recently, Li Zhiyu and colleagues from the large model team at the Shanghai Institute of Algorithmic Innovation proposed a new paradigm for in-context learning: SLEICL (in-context learning enhanced by strong large models), which could help accelerate both academic research on small models and their industrial application.
With this method, the performance of small models can be improved substantially, making them more competitive across application scenarios.
In the current research and industrial practice of large models, there are two directions: "making the model larger" and "making the model smaller."
The former aims for very large parameter counts, often around a hundred billion; the latter aims for smaller parameter counts, often around a billion.
"Making it bigger" gives large models stronger emergent and reasoning capabilities, making them suitable for more difficult tasks. "Making it smaller" makes inference more efficient, so that models can be deployed on small and micro terminals such as mobile phones, watches, earphones, and voice recorders.
In-context learning (ICL) is an important manifestation of the capabilities of large language models.
Recently, research on the mechanisms and principles of in-context learning has become a hot topic in the field.
Not long ago, in-context learning was discussed enthusiastically at several top artificial intelligence conferences.
The common practice of in-context learning is to provide the large model with some examples and corresponding answers, and then the large model can infer the answer to the next unknown question.
For example, give the large model the examples "I love you" and "I hate you". The label for "I love you" is "positive", and the label for "I hate you" is "negative".
So, when you tell the large model "I like the sunshine today," the large model is likely to infer a "positive" label.
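The prompt described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_few_shot_prompt` and the "Text:/Label:" layout are assumptions made for the example, not a format taken from the paper.

```python
def build_few_shot_prompt(examples, query):
    """Format labeled examples followed by one unlabeled query."""
    blocks = []
    for text, label in examples:
        blocks.append(f"Text: {text}\nLabel: {label}\n")
    # The final block leaves the label empty for the model to complete.
    blocks.append(f"Text: {query}\nLabel:")
    return "\n".join(blocks)


examples = [("I love you", "positive"), ("I hate you", "negative")]
prompt = build_few_shot_prompt(examples, "I like the sunshine today")
print(prompt)
```

The resulting string would be sent to the model as is, and the model is expected to continue it with a label.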
Currently, the main research directions for in-context learning include example selection, example ordering, example structure, and example label distribution.
However, these methods share a limitation: by selecting better examples and better ways of presenting them, they still only help the large model work out how to solve the problem from the examples.
So, how can the learning difficulty of the large model be reduced? That is, how can the large model directly acquire the method for solving downstream tasks without going through examples?
Generally speaking, the larger the parameter scale of a large model, the stronger its in-context learning ability. However, as the parameter scale grows, computing power requirements also grow, and training and inference costs rise sharply.
The rapid growth in computing power demand has limited the application scenarios of large models, making them difficult to deploy on mobile devices.
Storage costs rise along with the parameter scale as well. For super-large models such as GPT-4, or those with hundreds of billions of parameters, training costs are especially high.
Therefore, one current research direction is how to compress a model efficiently so that it keeps its performance while inference becomes faster. If the model can be compressed, its inference cost also falls, possibly to a level comparable to the cost of purchasing and running the model on end-side devices.
Recently, many studies have been dedicated to developing models with small scale and low computational power requirements, and have achieved certain results.
In June 2023, Microsoft released Phi, a language model with 1.3 billion parameters, and later that year Phi-2 expanded the parameter count to 2.7 billion. It is reported that Microsoft's "small model" has been tested with financial and banking customers. Since then, domestic manufacturers have gradually followed with their own research on and applications of small models.
The release of this series of small models also indicates that the development of large models is gradually shifting from "making it bigger" to "making it smaller," toward a landscape with N large models and K small models, where N is much smaller than K.
Therefore, how to maintain high efficiency of small models while improving their performance in downstream tasks has become an important direction.
Based on this, people are also exploring how to enable the capabilities of small parameter models to match those of large parameter models.
On the other hand, current in-context learning methods usually have to select examples separately for each test question. They cannot form a universal "demonstration content" for a specific downstream problem, so the work cannot be done once and then reused.
Take human learning as an example. After obtaining some examples, we can do two things. First, we can directly infer the label of a given problem by finding patterns in the examples. Second, we can study the examples to form a more abstract and more universal set of problem-solving rules.
The second method is more universally applicable and stable, and is also a widely recognized way of learning. Taking sentiment classification as an example, humans can summarize some general problem-solving rules.
For instance, when learning the keywords that express sentiment, we should also pay attention to negation words, which reverse the original sentiment.
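To make the negation rule concrete, here is a toy sketch of what such a hand-written rule might look like. The word lists and the function `classify` are invented for this illustration, and the rule only flips the word immediately following a negation, so it is far cruder than anything a language model would do.

```python
POSITIVE = {"love", "like", "great", "happy"}
NEGATIVE = {"hate", "dislike", "terrible", "sad"}
NEGATIONS = {"not", "never", "no"}


def classify(text):
    """Keyword sentiment where a negation flips the next word's polarity."""
    score = 0
    negate = False
    for word in text.lower().replace(".", "").replace(",", "").split():
        if word in NEGATIONS or word.endswith("n't"):
            negate = True
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1, or 0
        if polarity:
            score += -polarity if negate else polarity
        negate = False  # a negation only affects the very next word
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("I love you"))
print(classify("I do not like the rain"))
```

Without the negation check, "I do not like the rain" would be scored on the keyword "like" alone and come out positive.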
In this study, Li Zhiyu and colleagues found through experiments that a more capable large model can summarize skills and experience from examples. They call this summary the Grimoire.
When these skills and experience are passed on to a less capable large model, its performance on downstream tasks improves significantly. With the Grimoire, some small models even surpass GPT-4 on certain tasks.
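The two-step flow just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` interface (a function from prompt to text), both function names, the prompt wording, and the stand-in models `fake_strong` and `fake_weak` are all hypothetical.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def generate_grimoire(strong_model: Model, task: str,
                      examples: List[Tuple[str, str]]) -> str:
    """Ask the stronger model to summarize reusable rules from examples."""
    shown = "\n".join(f"- {text} -> {label}" for text, label in examples)
    prompt = (f"Task: {task}\nExamples:\n{shown}\n"
              "Summarize general rules for solving this task.")
    return strong_model(prompt)


def answer_with_grimoire(weak_model: Model, task: str,
                         grimoire: str, query: str) -> str:
    """Prompt the weaker model with the rules instead of raw examples."""
    prompt = f"Task: {task}\nRules:\n{grimoire}\nInput: {query}\nAnswer:"
    return weak_model(prompt)


# Stand-in models so the sketch runs offline; replace with real model calls.
def fake_strong(prompt: str) -> str:
    return "1. Look for emotion keywords.\n2. Negation words reverse polarity."


def fake_weak(prompt: str) -> str:
    return "positive" if "Rules:" in prompt and "like" in prompt else "unknown"


task = "Classify the sentiment as positive or negative."
examples = [("I love you", "positive"), ("I hate you", "negative")]
grimoire = generate_grimoire(fake_strong, task, examples)
answer = answer_with_grimoire(fake_weak, task, grimoire, "I like the sunshine today")
print(answer)
```

In this sketch the Grimoire is generated once per task and then reused for every query, rather than selecting new examples for each test question.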
Overall:
For in-context learning in large models, the team offers a fresh perspective that helps large models generalize better across problems, without being confined to constructing and selecting example samples.
For collaboration between large and small models, the study provides a new reference plan for model interaction in end-cloud collaboration and for making use of the capabilities of small models.
If AI research used to be measured in months, in the era of large models it is measured in weeks. AI technologies are "changing day by day and week by week". Such a fast pace of innovation poses greater challenges to practitioners.
At the beginning of the study, Li Zhiyu and his colleagues hoped to improve the performance of small models by relying on the model's self-correction. However, as the experiments progressed, they found that due to the limited reasoning and understanding capabilities of the small models themselves, it was difficult to achieve effective improvement.
Just when they were at a loss, they happened to see a post on WeChat Moments. It came from a parent, who had shared something about a "top student's notes".
This made them realize at once: since the reasoning and summarization abilities of small models are relatively weak, why not let a strong model (the top student) summarize its experience (the Grimoire), and then pass that experience on to small models (the weaker students)?
The other members of the group agreed immediately, and Li Zhiyu and colleagues quickly started model design and experiments.
"When we found that the final effect exceeded expectations, we couldn't help but remark: scientific research comes from life!" said Li Zhiyu.
Recently, the relevant paper was published on arXiv[1] with the title "Grimoire is All You Need for Enhancing Large Language Models."
Chen Ding is the first author, and Li Zhiyu is the corresponding author.
In addition, about a month after this paper was published, a team of researchers from the University of California, Berkeley, Carnegie Mellon University, and DeepMind published a similar paper [2].
Li Zhiyu said: "Their paper's idea coincides with ours. The method they propose corresponds to one of the sample screening strategies in the first stage of our method, namely the screening of difficult samples. Their method is more like a subset of our solution, which has strengthened our confidence in follow-up research."
At present, the new in-context learning method proposed by Li Zhiyu and his colleagues works by having the "strong model" generate a Grimoire from representative example samples, with the aim of improving the performance of the "weak model" on downstream tasks.
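One way to pick representative samples is to screen for difficult ones, meaning examples the weak model currently gets wrong. The sketch below illustrates that idea only; `select_hard_samples`, the `limit` parameter, and the keyword-based `weak_predict` stand-in are hypothetical and are not taken from the paper.

```python
def select_hard_samples(weak_predict, labeled_samples, limit=4):
    """Return up to `limit` samples that the weak model labels incorrectly."""
    hard = [(text, label) for text, label in labeled_samples
            if weak_predict(text) != label]
    return hard[:limit]


# Toy stand-in for a weak model: keyword matching with no negation handling.
def weak_predict(text):
    return "positive" if "love" in text or "like" in text else "negative"


train = [
    ("I love you", "positive"),
    ("I hate you", "negative"),
    ("I do not like this", "negative"),
    ("Not bad at all", "positive"),
]
hard = select_hard_samples(weak_predict, train)
print(hard)
```

The selected samples would then be handed to the strong model as the material from which it summarizes the Grimoire.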
In the future, they plan to train a large model specifically for generating Grimoire to ensure the stability and controllability of Grimoire generation.
At the same time, they plan to generate representative example samples from the task description and the existing examples. This way there is no need to traverse the training set for screening; a dedicated small model can generate the representative samples directly.
This not only makes the samples more targeted but also keeps the representative samples stable, while removing the dependence on training set samples.
With such a model, a small amount of information about the test examples would be enough to generate several example samples, which can then serve as in-context learning examples that prompt the downstream model to complete the task, greatly enhancing its performance.
If these follow-up studies succeed, they will further enhance the capabilities of small models and provide more support for industrialization.