Beijing Academy of Artificial Intelligence (BAAI) Releases Evaluation Results for Over 100 Large Language Models: Multimodal Models on the Rise, Real-World Deployment Becomes Crucial

BAAI recently released comprehensive and specialized evaluation results for over 100 open-source and commercially closed-source large language models (LLMs) from both domestic and international sources. These models span various modalities, including language, vision-language, text-to-image, text-to-video, and speech-language. The evaluation expanded and refined the assessment of LLMs' task-solving capabilities, adding several key capabilities and tasks to comprehensively measure the latest advances in LLM technology and shifts in its ecosystem.

The evaluation results reveal a new trend in LLM development in the second half of 2024: model vendors are placing greater emphasis on strengthening comprehensive capabilities and deploying models in real-world applications. Multimodal models are advancing rapidly, with many new vendors and models emerging, while the development of language models has slowed by comparison. In an interview, Lin Yonghua, Vice President and Chief Engineer of BAAI, offered in-depth insights into LLM development trends and into the evaluation's standards and methods.

Lin noted that over the past year, many domestic vendors have trained models with considerable application potential, and commercialization has become the primary goal for most of them. She believes that as AI models' fundamental capabilities continue to improve, two clear trends are emerging in AI applications: first, complex applications built on language models are steadily improving; second, multimodal applications such as text-to-image and text-to-video keep emerging. These gains in multimodal LLM capabilities lay the foundation for further development of AI applications, create the conditions for commercialization, and contribute to a virtuous cycle in the overall LLM market.

However, Lin also noted that while vendors are actively pursuing real-world deployment, current investment remains concentrated on LLMs' fundamental capabilities. This strategy helps prevent AI applications from being left behind as those fundamental capabilities iterate and upgrade. Regarding recent industry debate over a supposed "stagnation of pre-training gains in large models," Lin disagrees. She argues that the perceived stagnation actually stems from the growing prevalence of data silos on the internet, with large amounts of data, video data in particular, remaining underutilized. Effectively using this data to deepen AI models' understanding of the world is a major current challenge. She also emphasized the important role of synthetic data in model training, especially where real-world data is hard to obtain, such as autonomous-driving data for rain, darkness, or other adverse conditions; synthetic data can effectively fill these gaps.

This LLM evaluation also improved data processing. To mitigate the risks of dataset leakage and dataset saturation, the evaluation incorporated recently released datasets and continuously updated evaluation data, replacing 98% of the questions and increasing their difficulty. Lin emphasized that the BAAI evaluation adheres to the principles of scientific rigor, authority, fairness, and openness. All evaluations of closed-source LLMs were conducted at BAAI by accessing public APIs, mimicking the perspective of an ordinary user.

In addition to comprehensive evaluations across multiple modalities, BAAI also launched four specialized evaluation leaderboards to probe the boundaries of model capability and application potential along multiple dimensions. On K-12 exams spanning all grades and multiple subjects, the average LLM score improved by 12.86% over six months earlier but still trails the average score of students in Beijing's Haidian District, a high-achieving area. Results from the FlagEvalDebate model-debate platform indicate that LLMs still need improvement in core capabilities such as logical reasoning, viewpoint understanding, and language expression.

Notably, this evaluation explored novel methods grounded in real-world application scenarios, such as assessing a model's ability to implement quantitative-trading code, probing potential applications and commercial value in financial quantitative trading. The evaluation found that LLMs can already generate strategy code that achieves positive backtest returns and can write code for typical quantitative-trading scenarios, with top models approaching the capabilities of junior quantitative traders.
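To make the quantitative-trading task concrete, the sketch below shows the kind of artifact such an evaluation might ask a model to produce: a minimal moving-average crossover strategy with a toy backtest. The function names, parameters, and price series here are hypothetical illustrations, not part of BAAI's actual benchmark.

```python
# Illustrative sketch of a strategy an LLM might be asked to generate:
# a simple moving-average (SMA) crossover with a toy backtest.
# All names and the price series are hypothetical, not from BAAI's benchmark.

def sma(prices, window):
    """Trailing simple moving average; None until enough data points exist."""
    return [
        sum(prices[i - window + 1 : i + 1]) / window if i >= window - 1 else None
        for i in range(len(prices))
    ]

def backtest_crossover(prices, fast=3, slow=5):
    """Hold 1 unit when the fast SMA is above the slow SMA, else stay flat.
    Returns the strategy's cumulative return over the price series."""
    fast_sma, slow_sma = sma(prices, fast), sma(prices, slow)
    position, equity = 0, 1.0
    for i in range(1, len(prices)):
        if position:  # apply yesterday's position to today's price change
            equity *= prices[i] / prices[i - 1]
        if fast_sma[i] is not None and slow_sma[i] is not None:
            position = 1 if fast_sma[i] > slow_sma[i] else 0
    return equity - 1.0  # cumulative return

if __name__ == "__main__":
    toy_prices = [100, 101, 103, 102, 105, 107, 106, 109, 111, 110]
    print(f"cumulative return: {backtest_crossover(toy_prices):+.2%}")
```

A real evaluation along these lines would presumably score the generated code on metrics the article mentions, such as backtest return and drawdown, rather than on code style alone.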

Lin stated that the FlagEval evaluation system consistently upholds the principles of scientific rigor, authority, fairness, and openness, and continues to innovate in evaluation methods and platforms to provide insight into the development of the LLM technology ecosystem. Going forward, FlagEval will further explore dynamic evaluation and multi-task capability assessment, using evaluation as a benchmark for tracking LLM development trends.

Compared with the comprehensive capability assessment conducted in May of this year, the BAAI evaluation has been improved and expanded in several respects. It broadened and refined the scope of task-solving capabilities, adding tasks covering data processing, advanced programming, and tool invocation; it introduced, for the first time, an evaluation of application capabilities in real-world financial quantitative-trading scenarios, measuring abilities such as return optimization and performance optimization; and it explored, for the first time, a comparative evaluation method based on model debate, analyzing core capabilities such as logical reasoning, viewpoint understanding, and language expression in depth. These improvements reflect BAAI's continued effort to refine its LLM evaluation system so that it can assess model capabilities more comprehensively and accurately, offer clearer guidance to vendors, and steer LLM technology toward greater practicality and reliability.

