Since the Aligner, a new alignment paradigm, was proposed, it has garnered widespread attention from industry. Within just one month of its release, several technology companies had already begun using the Aligner training paradigm for alignment tasks in various downstream applications.
The Aligner's lightweight design, efficient training, and insensitivity to the parameters of the underlying large model make it a promising new alternative in large model alignment. Researcher Yang Yaodong of the Institute of Artificial Intelligence at Peking University stated:
"The Aligner is actually a brand-new paradigm for aligning large language models. It is built on the insight of 'correcting the residual between unaligned and aligned answers' and is characterized by its efficiency and scalability."
In terms of application prospects:
Firstly, as an alternative to Reinforcement Learning from Human Feedback (RLHF), the Aligner acts as an intelligent add-on and patch for large language models. In current alignment practice, both industry and academia typically provide human-annotated supervision signals only at the end of a conversation.
However, this sparse reward mechanism increases the instability of RLHF training and thereby the difficulty of alignment.
By learning to correct erroneous responses, the aligner helps ensure that large models produce outputs that are stable, efficient, and consistent with human intent and values.
Secondly, the aligner is an effective means of AI safety and governance. A lightweight, efficient aligner offers a feasible way for governments and third-party organizations (such as non-profit and non-governmental organizations) to audit and regulate AI.
Without massive compute reserves or access to the parameters of large models, regulatory bodies can achieve efficient alignment and release aligners that meet their requirements.

Thirdly, the aligner is an execution path for value alignment. Keeping large models and other AI systems consistent with human values (such as fairness, justice, and kindness) and effectively addressing ethical and value issues constitutes the central challenge of value alignment.
The aligner provides a feasible route to value alignment: an external alignment module carries the value-alignment function and applies additional "value corrections" to the decisions and outputs of large models.
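As a rough illustration of this external-correction idea, the sketch below bolts an aligner onto an existing model at inference time: the upstream model answers the question, and the aligner rewrites that answer. This is a minimal sketch assuming a Hugging Face-style causal language model interface; the model names and the prompt template are hypothetical placeholders, not the team's released artifacts.

```python
# Minimal inference-time correction sketch (hypothetical model names and prompt format).
from transformers import AutoModelForCausalLM, AutoTokenizer

def load(name):
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    return tok, model

def generate(tok, model, prompt, max_new_tokens=512):
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the newly generated tokens, not the echoed prompt.
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Upstream (possibly unaligned) model and the add-on aligner.
up_tok, up_model = load("upstream-llm")   # placeholder name
al_tok, al_model = load("aligner-7b")     # placeholder name

question = "How should I respond to an angry customer?"
raw_answer = generate(up_tok, up_model, question)

# The aligner sees both the question and the upstream answer and emits a corrected answer.
aligner_prompt = (
    f"Edit the following question-answer pair to make it more helpful and harmless.\n"
    f"Question: {question}\nAnswer: {raw_answer}\nCorrected answer:"
)
corrected_answer = generate(al_tok, al_model, aligner_prompt)
print(corrected_answer)
```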
AI Alignment's ResNet Moment
It is reported that, since the start of the 21st century, developing large-scale neural networks had grown increasingly difficult: stacking more layers often ended in exploding or vanishing gradients, and many researchers exhausted their compute repeatedly tuning architectures without good results.

At that point, the emergence of ResNet was like the "sacred fire of Prometheus," illuminating the training of deep networks. By relying on the idea of residual learning and adding residual identity-mapping blocks to the network architecture, the number of layers could be scaled up dramatically, and the gradient problems of very deep networks were thus alleviated.
In the era of general-purpose models, as AI systems grow ever more powerful, ensuring that they remain aligned with human values and intentions (i.e., AI alignment) has become a central concern for AI researchers.
However, current alignment methods such as RLHF face well-known difficulties: they are hard to reproduce, human reward signals are inconsistent, the reinforcement learning stage is complex to tune, and API-based models (such as GPT-4 and Claude) cannot be fine-tuned at all.
AI alignment researchers have made many optimizations within the existing alignment paradigm, including architectural changes and algorithmic improvements, but the gains have often been minimal.
Drawing on its experience in the alignment field, Yang Yaodong's research group made a prediction: there must be an efficient, parameter-saving alignment method.

The team believes that the existing alignment paradigm has fallen into a local saddle point: people apply ever more elaborate training techniques in the hope that large models will directly generate human-aligned answers meeting the 3H standard of "Helpful, Harmless, Honest," but this approach compromises the model's original performance.
However, from another perspective, it is easier to let large models correct "misaligned answers" than to let large models directly generate "aligned answers."
But the problem that arises is: does the large model have the ability to correct answers?
"The answer is: not necessarily, because the existing few-short method based on prompt words, on the one hand, puts forward requirements for the reasoning ability of large models, and on the other hand, occupies the precious context space of large models."
In fact, it is simpler for large models to learn "the transition from the distribution of misaligned answers to the distribution of aligned answers" than to directly learn "the mapping from questions to aligned answers."This is an idea of residual learning, which is similar to the concept used in the classic work of neural networks, ResNet.
For the first time, the concept of residual learning is applied to large model alignment.
To this end, Yang Yaodong's team was the first to apply ResNet's residual-learning idea to large model alignment, proposing the aligner: a highly efficient paradigm that substantially improves alignment by learning the residual between unaligned and aligned answers.
The working principle of the aligner is to attach a model to the pre-existing model and let the attached model directly learn the "correction residual" between the unaligned answer and the aligned answer.

In the experiments, the research team continuously refined training techniques and adjusted the model architecture, training aligners of different scales on datasets of various sizes.
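In essence, this correction principle can be implemented as supervised fine-tuning on correction triples: the aligner conditions on the question plus the upstream model's answer and is trained to emit the improved answer. The sketch below illustrates that idea only; the data, model name, prompt format, and hyperparameters are hypothetical and do not claim to reproduce the team's exact recipe.

```python
# Sketch: fine-tuning an aligner on (question, unaligned answer) -> aligned answer triples.
# All names and data here are illustrative placeholders.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("small-base-lm")           # placeholder model
model = AutoModelForCausalLM.from_pretrained("small-base-lm")  # placeholder model
optimizer = AdamW(model.parameters(), lr=2e-5)

triples = [  # tiny illustrative dataset
    {"question": "...", "unaligned": "original, possibly harmful answer",
     "aligned": "corrected, helpful and harmless answer"},
]

model.train()
for ex in triples:
    prompt = (f"Question: {ex['question']}\n"
              f"Original answer: {ex['unaligned']}\n"
              f"Corrected answer:")
    target = " " + ex["aligned"] + tok.eos_token

    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)

    # Only the corrected answer contributes to the loss; prompt tokens are masked out,
    # so the aligner learns the correction rather than the question itself.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```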
A 7B-parameter aligner, trained only once, can simultaneously improve the helpfulness and safety of 11 different large models, by an average of 21.9% and 23.8%, respectively.
These models include closed-source and open-source models, both with and without prior safety alignment. Among them, the aligner improves GPT-4's helpfulness by 17.5% and its harmlessness by 26.9%.
In the study, the team also tested using the aligner for value alignment: they fine-tuned the 7B and 13B aligners on the Empathetic Dialogue dataset to improve their capacity for empathy.
After fine-tuning, Aligner-7B and Aligner-13B can increase the empathy of GPT-4's outputs by more than 50%.

Super Alignment: A New Path from Weak to Strong Generalization
Super alignment primarily addresses how weaker supervisors can align stronger models. Imagine a future in which models surpass human capabilities: how should humans then provide effective supervision signals?
In the realm of super alignment, the research team has been deeply exploring the implementation of "scalable supervision" and the progressive realization of "generalization from weak to strong."
Surprisingly, the aligner can also offer an innovative solution toward "weak-to-strong generalization and scalable supervision."

Compared with OpenAI's paradigm of "directly training giants," the aligner proposed in this paper is akin to a "supervisor standing on the shoulders of giants": it revises the outputs of strong models and thereby provides more accurate labels for training those strong models.
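Read that way, the weak-to-strong recipe is a simple label-refinement loop: the strong model drafts answers, the aligner corrects them, and the corrected pairs become the supervision for the next round of training. The schematic sketch below illustrates this; `strong_model`, `aligner`, and `finetune` are stand-ins for whatever concrete implementations are used, not interfaces defined in the paper.

```python
# Schematic weak-to-strong loop: the aligner supplies refined labels for the strong model.
def weak_to_strong_round(strong_model, aligner, questions, finetune):
    refined_pairs = []
    for q in questions:
        draft = strong_model.generate(q)   # strong model's (possibly misaligned) answer
        label = aligner.correct(q, draft)  # aligner edits the draft into a better answer
        refined_pairs.append((q, label))
    # The corrected answers serve as training labels for the next round.
    return finetune(strong_model, refined_pairs)
```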
Recently, the relevant paper was published on arXiv[1] with the title "Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction."
The paper was completed solely by the AI Safety and Governance Center of Peking University, with Ji Jiaming and Chen Boyuan as co-first authors and Yang Yaodong as the corresponding author.
The aligner will also be applied to text-to-video generation models such as Sora and Pika.

In its follow-up plans, the research group mainly intends to do the following:
1. Release lightweight and diverse versions of the aligner, such as 0.5B, 1.8B, and 2B models, to further verify the effectiveness of the correction paradigm on smaller models.
In addition, token-level and sentence-level aligners will be developed to improve output efficiency and reasoning capability.
2. Develop aligners based on mixture-of-experts architectures and streaming processing. By training multiple specialized aligners and integrating them efficiently, the research group plans to build a streaming, mixture-of-experts aligner.
This approach is expected to achieve efficient alignment across multiple dimensions and value systems, giving industry and academia a feasible way to integrate diverse values and needs.

3. Integrate the alignment concept into the training process. By building it into the pre-trained model's architecture and specially training the relevant parameter layers, the group aims to achieve local fine-tuning and global alignment, reducing subsequent alignment pressure and enhancing the safety and general applicability of the pre-trained model.
4. Develop an enhanced version of the aligner, including aligners for domains such as code, mathematics, and music, as well as personalized custom aligners that meet users' specific needs.
5. Extend the aligner to more scenarios. With the popularity of models such as Pika and Sora, text-to-image and text-to-video generation have attracted wide attention.
Currently, the videos and images these models generate still sometimes violate physical laws or show unnatural lighting and shading.
Applying the aligner here can fine-tune the generated content, improving the quality of the final output and bringing it closer to real scenes.

6. Use the aligner to assist scalable supervision. The aligner acts as an assistant that helps humans provide reward signals, supplying more precise reward supervision in complex scenarios and helping to solve the super alignment problem.
The research group has worked on AI safety and governance for many years and is committed to the study of AI alignment. In the field of large model alignment, Yang Yaodong's group has open-sourced BeaverTails, a million-scale safety-alignment preference dataset, and SafeRLHF, a safe alignment algorithm for large models. The related papers were published at NeurIPS 2023 and ICLR 2024 (highlight papers), and the technologies have been adopted by multiple open-source models.
At the same time, the research group wrote the field's first comprehensive survey of AI alignment, "AI Alignment: A Comprehensive Survey" [2], along with a supporting resource website.

In that survey, they summarize the goals of AI alignment as the RICE principles: Robustness, Interpretability, Controllability, and Ethicality, capturing the future direction and core components of AI alignment.
In this review, the team first introduced the concept of the alignment cycle, dividing AI alignment into two important components: forward alignment and backward alignment.
Forward alignment focuses on learning from feedback and learning under distribution shift, aiming to build, through alignment training, an AI system that is already aligned to a certain degree.
Backward alignment, on the other hand, emphasizes the alignment assurance and governance throughout the entire cycle, aiming to assess and manage the alignment of AI systems. In addition, the experience gained and alignment requirements obtained during the backward alignment process can also help update the alignment goals.
After "AI Alignment: A Comprehensive Survey" was published online, the National Institute of Standards and Technology (NIST) of the U.S. Department of Commerce adopted the alignment-cycle framework proposed in the paper in its research on trustworthy and responsible artificial intelligence.

Specifically, the NIST report "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations" cites the team's concepts of forward alignment and backward alignment in describing the core steps and processes of AI alignment.
In the future, the research group will continue to delve into AI alignment, contributing to the development of research on the alignment of strong artificial intelligence with human intent and values.