
The year 2024 is shaping up to be an extraordinarily lively one for the AI industry. Although it is only March, AI news has already dominated the headlines several times. Just last month, OpenAI released Sora, its large text-to-video model, whose realistic output all but wiped out the startups that had been toiling in this niche. A few days later, Nvidia's market value reached 2 trillion US dollars, making it the fastest company in history to climb from a 1 trillion to a 2 trillion dollar valuation. As the saying goes, "When you find a gold mine, the best business is not mining but selling shovels." Nvidia has become the biggest winner in the AI era's "arms race."

Just as everyone was sighing, "There are only two kinds of AI in the world: one is called OpenAI, and the other is called other AI," the long-silent Anthropic played a trump card. The company, founded by a former vice president of research at OpenAI, released its latest Claude 3 model, which surpasses GPT-4 on its reported benchmarks.


The ups and downs of the AI industry also show that the field is still at an early stage. Technology iterates so quickly that today's leaders can be overturned by a new technique overnight. And some dazzling new technologies, although announced, have long remained unreleased and undeployed: Sora, mentioned above, had still not been opened to the public as of this writing.

There is a gap between how generative AI is developed and how it can be deployed locally. Today, the generative AI products the public uses are mostly deployed in the cloud and accessed from local devices (the ChatGPT website, for example), but this cannot meet every need and introduces some risks.

First, as large models grow more complex, transmission between the cloud and the device becomes a bottleneck under limited bandwidth. A Boeing 787, for example, generates about 5 GB of data per second. By the time that data was uploaded to the cloud, processed, and the results returned, the aircraft (cruising at roughly 800 kilometers per hour) might already be several kilometers away. If AI features are used on an aircraft but deployed in the cloud, transmission speed simply cannot keep up.

Moreover, does user-sensitive and private data really need to go to the cloud at all? Keeping it on the device is clearly more reassuring to users.

However powerful generative AI becomes, local deployment remains an unavoidable question. It is the direction the industry is heading, even if there are difficulties today.

The difficulty lies in fitting a "big model" into a "small device." Note that the "size" here is relative: behind a cloud service may sit a computing center covering tens of thousands of square meters, while local deployment means making generative AI run on your phone. A phone has no liquid-nitrogen cooling and no endless power supply, so how can AI be deployed?

Heterogeneous Computing, a Possible Solution?
---

Qualcomm's heterogeneous computing AI engine (hereinafter, the Qualcomm AI Engine) offers the industry a viable solution. By coordinating the CPU, GPU, and NPU together with the Qualcomm Sensor Hub and the memory subsystem, it makes on-device AI deployment practical and significantly improves the AI experience.

Different processors excel at different tasks, and the principle of heterogeneous computing is to let specialists do specialist work. CPUs are good at sequential control and suit low-latency applications; they can also comfortably handle smaller traditional models, such as convolutional neural networks (CNNs), and even some specific large language models (LLMs). GPUs, in turn, are better at parallel processing of high-precision formats, such as video and games with demanding quality requirements.

CPUs and GPUs are familiar to the public, while the NPU is a comparatively new arrival. An NPU, or Neural Processing Unit, is designed specifically to accelerate AI inference at low power. In sustained AI use, where high peak performance must be delivered steadily at low power, the NPU's advantages show most clearly.

For example, when a user is playing a demanding game, the GPU may be fully occupied; when a user is browsing many web pages, the CPU may be. In those moments the NPU, as a true AI-dedicated engine, takes over the AI-related computation and keeps the user's AI experience smooth.

In short, CPUs and GPUs are general-purpose processors designed for flexibility: easy to program, and chiefly responsible for the operating system, games, and other applications. The NPU is built for AI: AI is its day job, and by trading away some of that programmability it achieves higher peak performance and energy efficiency, safeguarding the user's AI experience throughout.
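In software, this division of labor often shows up as a prioritized list of compute backends with a CPU fallback. The sketch below expresses that idea with ONNX Runtime's execution providers; the provider names (including QNNExecutionProvider, the Qualcomm backend shipped only in builds that include it) and the model path are illustrative assumptions, not something the article specifies.

```python
# Illustrative sketch: "specialists do specialized work" expressed as an
# ordered list of ONNX Runtime execution providers with a CPU fallback.
import onnxruntime as ort

# Preferred order: NPU first, then GPU, then the always-available CPU.
preferred = [
    "QNNExecutionProvider",   # Hexagon NPU backend (only in QNN-enabled builds)
    "CUDAExecutionProvider",  # GPU, if this build and machine have one
    "CPUExecutionProvider",   # universal fallback
]

# Keep only the providers this installation actually offers.
available = ort.get_available_providers()
providers = [p for p in preferred if p in available]

# "model.onnx" is a placeholder path for any exported model.
session = ort.InferenceSession("model.onnx", providers=providers)
print("Running on:", session.get_providers())
```

The runtime falls through the list at session creation, so the same application code runs on a phone with an NPU, a PC with a GPU, or a plain CPU-only machine.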

Integrating CPUs, GPUs, and NPUs with the Qualcomm Sensor Hub and the memory subsystem yields a heterogeneous computing architecture.

The Qualcomm AI Engine integrates the Qualcomm Oryon or Kryo CPU, the Adreno GPU, and the Hexagon NPU, along with the Qualcomm Sensor Hub and memory subsystem. The Hexagon NPU, its core component, has been upgraded and iterated over many years and now leads the industry in AI processing. On mobile, for example, the third-generation Snapdragon 8 with the Qualcomm AI Engine supports industry-leading LPDDR5x memory at up to 4.8 GHz, letting the chip read model weights fast enough to run large language models such as Baichuan and Llama 2 at a very high token generation rate, bringing users a new experience.

Qualcomm's work on NPUs did not begin in just the past few years. The origins of the Hexagon NPU trace back to 2007, fifteen years before generative AI entered the public eye, when the first Qualcomm Hexagon DSP debuted on the Snapdragon platform; its control and scalar architecture became the foundation of subsequent generations of Qualcomm NPUs.

Eight years later, in 2015, the Snapdragon 820 processor integrated the first Qualcomm AI Engine;

In 2018, Qualcomm added a tensor accelerator to the Hexagon NPU in the Snapdragon 855;

In 2019, Qualcomm expanded on-device AI use cases on the Snapdragon 865, adding AI imaging, AI video, AI voice, and other features;

In 2020, the Hexagon NPU received a transformative architectural update: the integration of scalar, vector, and tensor accelerators laid the foundation for Qualcomm's future NPU architecture;

In 2022, the second-generation Snapdragon 8 brought a series of significant enhancements to the Hexagon NPU. Micro-slice inference eliminated the memory occupancy of up to ten-plus network layers, cut power consumption, and delivered a 4.35x improvement in AI performance.

On October 25, 2023, Qualcomm officially released the third-generation Snapdragon 8. The first mobile platform Qualcomm Technologies crafted specifically for generative AI, it integrates what is currently Qualcomm's newest and best Hexagon NPU design for generative AI.

Because Qualcomm provides AI developers and downstream manufacturers with a complete solution set (detailed in the third part below) rather than a single chip or a specific software application, it can consider the whole picture in hardware design and optimization, identify the bottlenecks in current AI development, and make targeted improvements.

Why, for example, pay special attention to memory bandwidth? Shift the perspective from the chip to large-model development and you find that memory bandwidth is the bottleneck for token generation in large language models. One reason the third-generation Snapdragon 8's NPU architecture can help accelerate large AI models is precisely that it improves memory-bandwidth efficiency.
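A back-of-the-envelope calculation shows why bandwidth, not raw compute, caps the token rate: each generated token requires streaming roughly all of the model's weights through memory once. Every figure below is an illustrative assumption, not a Qualcomm number.

```python
# Rough upper bound on token rate imposed by memory bandwidth.
# During autoregressive decoding, each token reads (roughly) all weights once:
#   tokens/s  <=  memory_bandwidth / weight_bytes
# All numbers here are illustrative assumptions.

params = 7e9                      # a 7B-parameter LLM
bytes_per_param = 0.5             # INT4 weights: 4 bits = 0.5 bytes
weight_bytes = params * bytes_per_param     # ~3.5 GB of weights

# Assumed LPDDR5x: 4.8 GHz clock, double data rate, 64-bit (8-byte) bus.
bandwidth = 4.8e9 * 2 * 8                   # ~76.8 GB/s

print(f"Upper bound: {bandwidth / weight_bytes:.1f} tokens/s")  # ~21.9
```

Under these assumptions the hard ceiling is about 22 tokens per second, regardless of how fast the NPU's arithmetic units are, which is why halving weight size (see INT4 below) directly raises the ceiling.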

This efficiency gain mainly comes from two technologies.

The first is micro-slice inference. By dividing a neural network into many independently executed micro-slices, it eliminates the memory occupancy of ten or more layers at a time, makes the fullest use of the scalar, vector, and tensor accelerators in the Hexagon NPU, and reduces power consumption. The second is native 4-bit integer (INT4) operations, which double the throughput of INT4-quantized layers and tensors while further improving memory-bandwidth efficiency.
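To illustrate where the INT4 savings come from, here is a minimal toy sketch of symmetric per-tensor quantization. It is not Qualcomm's implementation: real INT4 kernels pack two 4-bit values per byte and usually quantize per channel or per group, but the memory arithmetic is the same, since each weight shrinks to half the size of INT8.

```python
# Toy symmetric INT4 quantization: floats -> integers in [-8, 7] + one scale.
# Real implementations pack two 4-bit values per byte; this sketch keeps each
# value in an int8 container for readability.
import numpy as np

def quantize_int4(w: np.ndarray):
    """Return quantized weights in [-8, 7] and the float scale."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
```

The quantization error is bounded by half a step (scale/2), which is why INT4 works well for LLM weights whose distributions are narrow, while halving the bytes that must cross the memory bus per token.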

On February 26, Mobile World Congress (MWC 2024) opened in Barcelona. On Snapdragon X Elite, Qualcomm showed the world the first large multimodal model (LMM) with more than 7 billion parameters running on-device. The model accepts text and audio input (music, ambient traffic sounds, and so on) and can hold multi-turn conversations about the audio content.

So what kind of AI experience can be expected on mobile devices with the Hexagon NPU, and how is it achieved? Qualcomm has walked through one case in detail.

With an AI travel assistant on the phone, a user can simply ask the model to plan a trip. The assistant immediately proposes flight itineraries, refines the results through spoken dialogue, and finally produces a complete flight schedule through the Skyscanner plugin.

How is this one-stop experience achieved?

Step one: the user's voice is converted to text by the automatic speech recognition (ASR) model Whisper. This 240-million-parameter model runs mainly on the Qualcomm Sensor Hub;

Step two: the Llama 2 or Baichuan large language model generates a text reply from that input; this model runs on the Hexagon NPU;

Step three: the reply text is converted to speech by an open-source TTS (text-to-speech) model running on the CPU;

In the final step, the modem provides the network connection and the booking is completed through the Skyscanner plugin.
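Put together, the four steps form a simple pipeline, sketched below in Python. Only the Whisper calls follow a real package (openai-whisper); `run_llm`, `synthesize_speech`, and `book_flight` are hypothetical placeholders. On a real device, the platform runtime, not application code, maps each stage onto the Sensor Hub, NPU, and CPU.

```python
# Sketch of the four-step assistant flow described above.
import whisper  # pip install openai-whisper

def run_llm(prompt: str) -> str:
    """Placeholder for a Llama 2 / Baichuan call running on the NPU."""
    raise NotImplementedError

def synthesize_speech(text: str) -> bytes:
    """Placeholder for an open-source TTS model running on the CPU."""
    raise NotImplementedError

def book_flight(itinerary: str) -> None:
    """Placeholder for the Skyscanner plugin call over the network."""
    raise NotImplementedError

# Step 1: speech -> text with Whisper (ASR); "request.wav" is a placeholder.
asr_model = whisper.load_model("base")
user_text = asr_model.transcribe("request.wav")["text"]

# Step 2: text -> reply with the on-device LLM.
reply = run_llm(user_text)

# Step 3: reply -> audio with TTS.
audio = synthesize_speech(reply)

# Step 4: complete the booking through the plugin.
book_flight(reply)
```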

On the eve of the industry explosion, developers need to seize the initiative
---

Testing the AI performance of Snapdragon and Qualcomm platforms with various tools shows scores several times those of comparable competitors. In the Master Lu AIMark V4.3 benchmark, the third-generation Snapdragon 8's total score is 5.7 times that of competitor B and 7.9 times that of competitor C.

In the AnTuTu benchmark, the third-generation Snapdragon 8's total score is 6.3 times that of competitor B. Detailed comparisons were also made across the sub-tests of MLCommons MLPerf inference, including image classification, language understanding, and super-resolution.

A further comparison between Snapdragon X Elite and x86-architecture competitors shows Snapdragon X Elite clearly ahead in tests such as ResNet-50 and DeepLabV3, with overall benchmark scores 3.4 times those of x86 competitor A and 8.6 times those of competitor B. On the PC side, then, whether running Microsoft Copilot or generative AI tasks such as document summarization and drafting, the experience is very smooth.

This leading AI performance does not come from the Qualcomm AI Engine alone. More precisely, Qualcomm's support for AI manufacturers spans the whole stack.

First, there is the Qualcomm AI Engine itself. It comprises the Hexagon NPU, the Adreno GPU, the Qualcomm Oryon CPU (on PC platforms), the Qualcomm Sensor Hub, and the memory subsystem. With purpose-built design and close collaboration among these components, this heterogeneous computing architecture gives on-device products a low-power, energy-efficient development platform.

On top of the hardware, Qualcomm has also launched the Qualcomm AI Stack. It was created to solve a stubborn problem in AI development: the same feature must be rebuilt for each platform, producing repetitive work. The AI Stack supports all of today's mainstream AI frameworks, so OEMs and developers can create, optimize, and deploy AI applications on it and achieve "develop once, deploy everywhere," greatly reducing repetitive engineering.

There is also the AI Hub, which Qualcomm released at MWC 2024. The AI Hub is a model library of nearly 80 AI models, spanning generative models such as Baichuan, Stable Diffusion, and Whisper as well as traditional models for tasks like image recognition and facial recognition. Developers can pick the models they need from the AI Hub and generate binary plugins, making AI development effectively plug-and-play.
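As a rough idea of that plug-and-play workflow, here is a hedged sketch using the qai-hub Python client (pip install qai-hub). The exact argument names and the device string are assumptions to verify against the current AI Hub documentation; the tiny PyTorch model stands in for any model a developer might pick.

```python
# Hedged sketch: compiling a model for a Snapdragon device via Qualcomm AI Hub.
# Argument names and the device string are assumptions; check the docs.
import torch
import qai_hub as hub

# Any traceable PyTorch model works for the sketch; this one is trivial.
model = torch.jit.trace(
    torch.nn.Sequential(torch.nn.Linear(16, 4)),
    torch.randn(1, 16),
)

# Ask AI Hub to compile the traced model for a target device.
compile_job = hub.submit_compile_job(
    model=model,
    device=hub.Device("Samsung Galaxy S23 (Family)"),  # assumed device name
    input_specs={"x": (1, 16)},
)

# The compiled artifact is the deployable "binary plugin" the text mentions.
target_model = compile_job.get_target_model()
```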

In short, viewed vertically for depth, Qualcomm is accelerating manufacturers' AI development along three dimensions: hardware (the AI Engine), software (the AI Stack), and the model library (the AI Hub). Viewed horizontally for breadth, Qualcomm's products already cover nearly every class of end device (the third-generation Snapdragon 8 powers phones and other devices, while X Elite powers AI PCs).

AI applications are in the brewing period before the explosion
---

In education, AI can build personalized teaching plans tailored to each student's ability and progress; in medicine, AI can be used to discover new classes of antibiotics; and in elder care, in regions where aging is most severe, AI devices in seniors' homes could one day collect personal data that helps prevent medical emergencies.

The term "before the surge" is used precisely because there has not yet been large-scale deployment. On the other hand, AI applications, as one of the products that are most likely to generate user stickiness, have a strong first-mover advantage effect.

Developers of AI products need to move first: get their products into users' hands earlier, build connections with them, and cultivate stickiness in order to gain the competitive edge.
