Two years ago, researchers from OpenAI, Yuri Burda and Harri Edwards, attempted to determine how to enable a large language model to perform basic arithmetic operations.

They were curious about how many examples of adding two numbers the model needed to see in order to correctly add any two numbers they presented.

Initially, progress was not smooth. The model simply memorized the addition operations it had seen but could not solve problems it had not encountered before.
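The memorization failure can be sketched in a toy example (illustrative only, not the researchers' actual setup): a "model" that simply stores its training pairs answers those perfectly but has nothing to say about a pair it has never seen.

```python
# A toy "model" that memorizes addition examples instead of learning the rule.
# Purely illustrative: a lookup table, not a neural network.
train_examples = {(2, 3): 5, (10, 7): 17, (4, 4): 8}

def memorizer(a, b):
    # Returns the memorized answer, or None for a pair it has never seen.
    return train_examples.get((a, b))

print(memorizer(2, 3))   # seen during training → 5
print(memorizer(6, 9))   # never seen → None: memorization without generalization
```

A model that has truly learned addition would instead extract the underlying rule and apply it to any pair of numbers.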

By chance, Burda and Edwards allowed some experiments to run for several days, far exceeding the few hours they had originally anticipated.

These models went over the addition examples again and again, for far longer than the researchers would have allowed had they been supervising. When they finally returned, the two were surprised to find that the experiment had worked: they had trained a large language model that could add two numbers together. It had just taken much longer than anyone anticipated.


Out of curiosity as to what had happened, Burda and Edwards collaborated with colleagues to study this phenomenon.

They found that in some cases, the model seemed to be unable to learn a task and then suddenly learned it, as if a light bulb had suddenly lit up.

This is not the way traditional deep learning works, so they called this behavior "grokking."

"This is really interesting," said Hattie Zhou, an artificial intelligence researcher at the University of Montreal and Apple Machine Learning Research, who was not involved in the study. "Can we be sure the model has stopped learning? Maybe we just haven't trained it for long enough."

This peculiar behavior has drawn broader attention in the scientific community. "Many people have different opinions," said Lauro Langosco of the University of Cambridge. "I don't think there is a consensus on what exactly happened."

"Rokking" is just one of several strange phenomena that have puzzled artificial intelligence researchers. The largest models to date, especially large language models, seem to operate in a way that differs from what mathematics shows they should.

Deep learning is the fundamental technology behind today's AI boom, and these discoveries expose an awkward fact about it: despite its tremendous success, nobody knows exactly how it works or why it is so effective.

"Obviously, we are not completely ignorant," said computer scientist Mikhail Belkin from the University of California, San Diego. "But our theoretical analysis is far from what these models can do. For example, why can they learn languages? I think it's very mysterious."

Large models are now so complex that researchers study them as if they were strange natural phenomena, running experiments and trying to explain the results. Many of those observations fly in the face of classical statistics, which had long provided our best explanations for how predictive models behave.

You might say, so what? In the past few weeks, Google DeepMind has launched its generative AI model Gemini in most of its consumer applications. OpenAI has amazed people with its latest text-to-video model, Sora.

Enterprises around the world are scrambling to leverage artificial intelligence to meet their needs. This technology is not only effective but also entering our lives. Isn't that reason enough?

However, figuring out why deep learning is so effective is not only an interesting scientific puzzle but also may be the key to unlocking the next generation of technology and addressing its huge risks.

"It's an exciting time," said Boaz Barak, a computer scientist at Harvard University in the United States, who has been seconded to OpenAI's super alignment team for a year, "Many people in the field often compare it to physics at the beginning of the 20th century."There are many experimental results that we cannot fully understand. When you conduct experiments, the results can often surprise you.

Old code, new tricks

The most surprising thing is that models can do tasks they were never shown how to do. This is called "generalization," one of the most fundamental ideas in machine learning, and also one of its biggest puzzles.

Models are trained on a specific set of examples to perform a task, such as recognizing faces, translating sentences, or avoiding pedestrians. Yet they can also generalize, learning to perform that task on examples they have never seen before. Somehow, models do not just memorize the patterns they have seen; they come up with rules that let them apply those patterns to new cases. And sometimes, as with grokking, generalization happens when we don't expect it to.
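The contrast with memorization can be made concrete with a hedged sketch: a tiny linear model that infers the rule behind addition from a handful of examples (by least squares, using the normal equations) and then applies it to a pair it has never seen. This is a toy illustration of rule learning, not how a language model actually learns arithmetic; names like `predict` are made up for the example.

```python
# Fit y = w1*a + w2*b to a few addition examples by least squares,
# then apply the learned rule to unseen inputs.
examples = [((1, 2), 3), ((3, 5), 8), ((4, 1), 5), ((2, 7), 9)]

# Normal equations for the two weights (no intercept needed here).
S11 = sum(a * a for (a, b), _ in examples)
S12 = sum(a * b for (a, b), _ in examples)
S22 = sum(b * b for (a, b), _ in examples)
t1 = sum(a * y for (a, b), y in examples)
t2 = sum(b * y for (a, b), y in examples)

det = S11 * S22 - S12 * S12
w1 = (S22 * t1 - S12 * t2) / det
w2 = (S11 * t2 - S12 * t1) / det

def predict(a, b):
    return w1 * a + w2 * b

# The fitted weights are w1 = w2 = 1: the model has recovered "addition"
# and can apply it far outside its training examples.
print(predict(123, 456))  # → 579.0, a pair it has never seen
```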

Large language models, such as OpenAI's GPT-4 and Google DeepMind's Gemini, both possess impressive generalization capabilities.

Barak said: "The magic is not that the model can learn math problems in English and then generalize to new math problems.

What's magical is that the model can learn math problems in English, then look at some French literature, and then generalize to learn to solve math problems in French. This is not something that statistics can tell you."

A few years ago, when Hattie Zhou began studying artificial intelligence, she did not understand why her instructors focused on how to do things rather than on why those things worked. "It was like, here is how you train these models, and here is the result," she said. "But it wasn't clear why this process produces models that can do such amazing things."

She wanted to know more, but no one could give her a good answer: "My assumption is that scientists know what they are doing. For example, they already have a theory, and then they build a model. But that's not the case."

Over the past decade or more, the rapid development of deep learning has come more from trial and error than from understanding. Researchers have replicated methods that worked for others and added innovations of their own.

Now there are many different "ingredients" that can be added to the models, and we also have a growing "recipe book" of deep learning, filled with ways to use these models.

Bergin said: "People just try this, try that, and try all the tricks. Some are very important, and some are meaningless."He said: "It worked, and we would think it's too magical. Our brains are stunned by the power of these things."

However, despite their success, these "recipes" are more like alchemy than rigorous chemistry. He said: "It's like we mixed something at midnight and then came up with some kind of correct spell."

Overfitting

The problem is that, in the era of large language models, artificial intelligence seems to contradict textbook statistics. Today's most powerful models are enormous, with up to a trillion parameters, the values that get continuously adjusted during training. But classical statistics says that as models get bigger, their performance should first improve and then get worse, because of a phenomenon called "overfitting."

When a model is trained on a dataset, it tries to fit that data to a pattern. Picture a set of data points plotted on a chart: a pattern that fits the data is a line passing through those points. Training a model amounts to having it find a line that fits both the training data (the points already on the chart) and new data (new points). A straight line is the simplest pattern (linear regression), but it may not be very accurate, missing some of the points. A wiggly curve that hits every point will score perfectly on the training data, but it will not generalize to new points. When that happens, the model is overfitting.

According to classical statistical theory, the bigger a model gets, the more prone it is to overfitting: with more parameters, it is easier for the model to find a curve that connects every point. So a model that generalizes must hit a sweet spot between underfitting and overfitting. And yet that is not what we see with large models.

The best-known example of this is a phenomenon called "double descent." A model's performance is usually expressed in terms of the number of errors it makes: as performance improves, the error rate goes down. For decades, it was believed that as models got bigger, the error rate would first fall and then rise, tracing a U-shaped curve whose lowest point is the sweet spot for generalization.
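The line-versus-curve picture can be made concrete with a few lines of pure Python (toy data, not from the article): a least-squares straight line misses some noisy points but tracks the underlying trend, while a polynomial threaded through every point gets a perfect training score and then does badly on points in between.

```python
import random

random.seed(0)

# Noisy samples of the true relationship y = 2x + 1.
xs = [i / 9 for i in range(10)]
ys = [2 * x + 1 + random.gauss(0, 0.5) for x in xs]
n = len(xs)

# Simple pattern: least-squares straight line (misses points, but smooth).
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar

def line(x):
    return slope * x + intercept

# Flexible pattern: Lagrange polynomial through every point
# (zero training error by construction).
def interp(x):
    total = 0.0
    for i in range(n):
        term = ys[i]
        for j in range(n):
            if j != i:
                term *= (x - xs[j]) / (xs[i] - xs[j])
        total += term
    return total

def mse(model, points):
    # Error against the true underlying line, ignoring the noise.
    return sum((model(x) - (2 * x + 1)) ** 2 for x in points) / len(points)

test_xs = [(i + 0.5) / 9 for i in range(9)]  # unseen x's between training points
print("line   test MSE:", mse(line, test_xs))
print("interp test MSE:", mse(interp, test_xs))  # larger: the curve chased the noise
```

The interpolating polynomial's training error is exactly zero, yet its error on the unseen points is far worse than the straight line's: it has fit the noise, not the trend.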
But in 2018, Belkin and his colleagues found that as some models got bigger, their error rate fell, then rose, and then fell again. Hence the name double descent: the curve looks more like a W. In other words, large models somehow push past the supposed sweet spot and overcome the overfitting problem, getting better as they get bigger.

A year later, Barak co-authored a paper showing that double descent was more common than many had thought. It happens not only as models get larger, but also in models trained on large amounts of data or trained for longer. This behavior, called benign overfitting, is still not fully understood. It raises fundamental questions about how models should be trained to get the most out of them.

Researchers have sketched out versions of what they think is going on. Belkin believes there is a kind of Occam's razor effect at work: the simplest pattern that fits the data, the smoothest curve through all the points, tends to generalize best. Bigger models may keep improving for longer than expected because, with more parameters, they have more possible curves to try and thus more ways to find that smooth one.

"Our theory seemed to explain the basics of why it works," Belkin said. "Then people built models that could speak 100 languages, and it proved we didn't understand anything at all." He added, laughing, "It turned out we hadn't even scratched the surface."

For Belkin, large language models are a whole new puzzle. These models are based on transformers, a type of neural network that excels at processing sequences of data, like the words in a sentence. There is a lot of complexity inside transformers, Belkin says. But he thinks that, at heart, they do more or less the same thing as Markov chains.
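A Markov chain of the kind Belkin has in mind can be sketched in a few lines: a bigram model that predicts the next word purely from counts of what followed the current word in its training text. This is a toy stand-in for the comparison, not a description of how a transformer works.

```python
from collections import Counter, defaultdict

# Train a bigram Markov chain: count what word follows each word.
text = "the cat sat on the mat and the cat saw the dog".split()

following = defaultdict(Counter)
for current, nxt in zip(text, text[1:]):
    following[current][nxt] += 1

def predict_next(word):
    # Predict the most frequent successor seen in training.
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # → "cat" ("the cat" occurs twice; other successors once)
```

Scaled up to longer contexts and vastly more text, next-item prediction from counts is the statistical heart of the comparison, yet, as Belkin notes, it seems far too simple to account for everything large language models do.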
A Markov chain is a much simpler, better-understood statistical structure that predicts the next item in a sequence based on what has come before. But that is not enough to explain everything large language models can do. "Until recently, we thought this was something that shouldn't work," Belkin said. "That means something is fundamentally missing, that there are gaps in our understanding of the world."

Belkin goes further, speculating that there may be a hidden mathematical pattern in language that large language models have found a way to exploit: "This is pure speculation on my part, but who knows?" If it turns out that these things really do model language, it could be one of the greatest discoveries in history, he said. "It's astounding to me that you can learn language by using Markov chains to predict the next word."

Start small

Researchers are trying to figure it out piece by piece. Because big models are too complex to study directly, Belkin, Barak, Zhou, and others experiment instead with smaller (and older) statistical models that are easier to understand. Training these stand-ins on a variety of data under different conditions and observing what happens can provide insight into what is going on. That can help inspire new theories, but it is not clear whether those theories will also hold for larger models. After all, much of the weird behavior lies in the complexity of big models.

Is a theory of deep learning on the horizon? David Hsu, a computer scientist at Columbia University and a co-author of Belkin's double-descent paper, doesn't expect all the answers anytime soon. "We have better intuition now," he said. "But can we really explain why neural networks behave in this unexpected way?
We're not there yet."

In 2016, Chiyuan Zhang of MIT and colleagues at Google Brain published an influential paper titled "Understanding Deep Learning Requires Rethinking Generalization." Five years later, in 2021, they republished it as "Understanding Deep Learning (Still) Requires Rethinking Generalization." What about today? "Yes and no," Zhang said. "There has been a lot of progress in recent years, but probably more new questions have been raised than solved."

Meanwhile, researchers are still wrestling with the basic observations. In December 2023, Langosco and his colleagues presented a paper at NeurIPS, a top AI conference, claiming that grokking and double descent are in fact aspects of the same phenomenon. "If you eyeball them, they look kind of similar," Langosco said. He believes that an explanation of what is happening behind deep learning should account for both.

At the same conference, Alicia Curth, who studies statistics at the University of Cambridge, and her colleagues argued that double descent is in fact an illusion. "I don't agree that modern machine learning is some kind of black magic that defies all the laws we've established so far," Curth said. Her team argues that double descent appears because of the way model complexity is measured.

Belkin and his colleagues used model size, the number of parameters, as the measure of complexity. But Curth and her colleagues found that the number of parameters may be a poor proxy for complexity, because adding parameters sometimes makes a model more complex and sometimes makes it less so. It depends on what the parameter values are, how they are used during training, and how they interact with other parameters, much of which stays hidden inside the model. "We concluded that not all model parameters are created equal," Curth said. In short, if a different measure of complexity is used, large models may fit classical statistical theory quite well.
That's not to say we won't keep seeing things we don't understand as models get bigger, but we already have all the math we need to explain them, Curth said.

One of the great mysteries of our time

Admittedly, the debate will rage on. So why does it matter whether AI models are grounded in classical statistics? One answer is that a better theoretical understanding would help build even better artificial intelligence, or make it more efficient.

At present, our progress is rapid but unpredictable. Many things that OpenAI's GPT-4 can do surprise even those who created it.

Researchers are still debating what it can and cannot achieve. Belkin said, "Without some kind of fundamental theory, it is difficult for us to know what we should expect to see from these things."

Barak agrees. "Even with the models we have now, even in hindsight, it is hard to pinpoint exactly why certain capabilities emerged," he said.

This is not only about managing the development of the technology but also about anticipating its risks. Many of the researchers working on the theory behind deep learning are motivated by concerns about the safety of future models. Langosco said: "Before we train and test GPT-5, we don't know what capabilities it will have.

Right now this may be a medium-size problem, but it will become a really big one in the future as models grow more powerful."

Barak works on OpenAI's superalignment team, which was set up by the company's chief scientist, Ilya Sutskever, to figure out how to stop a hypothetical superintelligence from getting out of control.

"I am very interested in control," he said, "If you can do something amazing, but you can't really control it, then it's not that amazing. What's the value of a car that can reach 300 miles per hour if the steering wheel is unstable?"

But behind all this, there is also a huge scientific challenge. "Intelligence is undoubtedly one of the great mysteries of our time," Barak said. "Our science is still in its infancy. A problem that excites me this month might look different next month. We are still discovering many things, so we very much need to run experiments and let ourselves be surprised."
