Chapter 2: Technical Performance | 2024 AI Index Report
CHAPTER 2: Technical Performance
Overview
The technical performance chapter of this year’s AI Index offers a comprehensive overview of AI advancements in 2023. It begins by tracing the broad evolution of AI technical performance over time. The chapter then examines the current state of a wide range of AI capabilities, including language processing, coding, computer vision (image and video analysis), reasoning, audio processing, autonomous agents, robotics, and reinforcement learning. It also spotlights notable AI research breakthroughs from the past year, exploring methods for improving LLMs through prompting, optimization, and fine-tuning, and wraps up with an exploration of AI systems’ environmental footprint.
Chapter Highlights
- AI beats humans on some tasks, but not on all. AI has surpassed human performance on several benchmarks, including some in image classification, visual reasoning, and English understanding. Yet it trails behind on more complex tasks like competition-level mathematics, visual commonsense reasoning, and planning.
- Here comes multimodal AI. Traditionally AI systems have been limited in scope, with language models excelling in text comprehension but faltering in image processing, and vice versa. However, recent advancements have led to the development of strong multimodal models, such as Google’s Gemini and OpenAI’s GPT-4. These models demonstrate flexibility and are capable of handling images and text and, in some instances, can even process audio.
- Harder benchmarks emerge. AI models have reached performance saturation on established benchmarks such as ImageNet, SQuAD, and SuperGLUE, prompting researchers to develop more challenging ones. In 2023, several challenging new benchmarks emerged, including SWE-bench for coding, HEIM for image generation, MMMU for general reasoning, MoCa for moral reasoning, AgentBench for agent-based behavior, and HaluEval for hallucinations.
- Better AI means better data which means … even better AI. New AI models such as Segment Anything and Skoltech3D are being used to generate specialized data for tasks like image segmentation and 3D reconstruction. Data is vital for AI technical improvements. The use of AI to create more data enhances current capabilities and paves the way for future algorithmic improvements, especially on harder tasks.
- Human evaluation is in. With generative models producing high-quality text, images, and more, benchmarking has slowly started shifting toward incorporating human evaluations like the Chatbot Arena Leaderboard rather than computerized rankings like ImageNet or SQuAD. Public feeling about AI is becoming an increasingly important consideration in tracking AI progress.
- Thanks to LLMs, robots have become more flexible. The fusion of language modeling with robotics has given rise to more flexible robotic systems like PaLM-E and RT-2. Beyond their improved robotic capabilities, these models can ask questions, which marks a significant step toward robots that can interact more effectively with the real world.
- More technical research in agentic AI. Creating AI agents, systems capable of autonomous operation in specific environments, has long challenged computer scientists. However, emerging research suggests that the performance of autonomous AI agents is improving. Current agents can now master complex games like Minecraft and effectively tackle real-world tasks, such as online shopping and research assistance.
- Closed LLMs significantly outperform open ones. On 10 select AI benchmarks, closed models outperformed open ones, with a median performance advantage of 24.2%. Differences in the performance of closed and open models carry important implications for AI policy debates.
2.1 Overview of AI in 2023
Timeline: Significant Model Releases
As chosen by the AI Index Steering Committee, here are some of the most notable model releases of 2023.
Date | Model | Type | Creator(s) | Significance
---|---|---|---|---
Mar. 14, 2023 | Claude | Large language model | Anthropic | Claude is the first publicly released LLM from Anthropic, one of OpenAI’s main rivals. Claude is designed to be as helpful, honest, and harmless as possible.
Mar. 14, 2023 | GPT-4 | Large language model | OpenAI | GPT-4, improving over GPT-3, is among the most powerful and capable LLMs to date and surpasses human performance on several benchmarks.
Mar. 23, 2023 | Stable Diffusion v2 | Text-to-image model | Stability AI | Stable Diffusion v2 is an upgrade of Stability AI’s existing text-to-image model and produces higher-resolution, superior-quality images.
Apr. 5, 2023 | Segment Anything | Image segmentation | Meta | Segment Anything is an AI model capable of isolating objects in images using zero-shot generalization.
Jul. 18, 2023 | Llama 2 | Large language model | Meta | Llama 2, an updated version of Meta’s flagship LLM, is open-source. Its smaller variants (7B and 13B) deliver relatively high performance for their size.
Aug. 20, 2023 | DALL-E 3 | Image generation | OpenAI | DALL-E 3 is an improved version of OpenAI’s existing text-to-image model DALL-E.
Aug. 29, 2023 | SynthID | Watermarking | Google DeepMind | SynthID is a tool for watermarking AI-generated music and images. Its watermarks remain detectable even after image alterations.
Sep. 27, 2023 | Mistral 7B | Large language model | Mistral AI | Mistral 7B, launched by French AI company Mistral, is a compact 7-billion-parameter model that surpasses Llama 2 13B in performance, ranking it top in its class for size.
Oct. 27, 2023 | Ernie 4.0 | Large language model | Baidu | Ernie 4.0, launched by the multinational Chinese technology company Baidu, is among the highest-performing Chinese LLMs to date.
Nov. 6, 2023 | GPT-4 Turbo | Large language model | OpenAI | GPT-4 Turbo is an upgraded large language model boasting a 128K context window and reduced pricing.
Nov. 6, 2023 | Whisper v3 | Speech-to-text model | OpenAI | Whisper v3 is an open-source speech-to-text model known for its increased accuracy and extended language support.
Nov. 21, 2023 | Claude 2.1 | Large language model | Anthropic | Anthropic’s latest LLM, Claude 2.1, features an industry-leading 200K context window, which enhances its capacity to process extensive content such as lengthy literary works.
Nov. 22, 2023 | Inflection-2 | Large language model | Inflection | Inflection-2 is the second LLM from the startup Inflection, founded by DeepMind cofounder Mustafa Suleyman. Its launch underscores the intensifying competition in the LLM arena.
Dec. 6, 2023 | Gemini | Large language model | Google | Gemini emerges as a formidable competitor to GPT-4, with one of its variants, Gemini Ultra, outshining GPT-4 on numerous benchmarks.
Dec. 21, 2023 | Midjourney v6 | Text-to-image model | Midjourney | Midjourney’s latest update enhances user experience with more intuitive prompts and superior image quality.
State of AI Performance
As of 2023, AI has achieved levels of performance that surpass human capabilities across a range of tasks. Figure 2.1.16 illustrates the progress of AI systems relative to human baselines for nine AI benchmarks corresponding to nine tasks (e.g., image classification or basic-level reading comprehension).1 The AI Index team selected one benchmark to represent each task.
Over the years, AI has surpassed human baselines on a handful of benchmarks, such as image classification in 2015, basic reading comprehension in 2017, visual reasoning in 2020, and natural language inference in 2021. As of 2023, there are still some task categories where AI fails to exceed human ability. These tend to be more complex cognitive tasks, such as visual commonsense reasoning and advanced-level mathematical problem-solving (competition-level math problems).
1 An AI benchmark is a standardized test used to evaluate the performance and capabilities of AI systems on specific tasks. For example, ImageNet is a canonical AI benchmark that features a large collection of labeled images, and AI systems are tasked with classifying these images accurately. Tracking progress on benchmarks has been a standard way for the AI community to monitor the advancement of AI systems.
2 In Figure 2.1.16, the values are scaled to establish a standard metric for comparing different benchmarks. The scaling function is calibrated such that the performance of the best model for each year is measured as a percentage of the human baseline for a given task. A value of 105% indicates, for example, that a model performs 5% better than the human baseline.
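To make this scaling concrete, the short sketch below expresses each year’s best model score as a percentage of a human baseline, as footnote 2 describes. It is an illustrative reimplementation, not the Index’s own code, and the scores and baseline below are invented placeholders.

```python
# Illustrative sketch of the human-baseline scaling described in footnote 2.
# The scores and baseline are hypothetical placeholders, not Index data.

human_baseline = 89.8                                   # e.g., human accuracy on some benchmark
best_model_by_year = {2021: 75.2, 2022: 86.4, 2023: 94.3}

for year, score in best_model_by_year.items():
    relative = 100 * score / human_baseline             # 105 means 5% above the human baseline
    print(f"{year}: {relative:.1f}% of human baseline")
```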
AI Index Benchmarks
An emerging theme in AI technical performance, as emphasized in last year’s report, is the observed saturation on many benchmarks, such as ImageNet, used to assess the proficiency of AI models. Performance on these benchmarks has stagnated in recent years, indicating either a plateau in AI capabilities or a shift among researchers toward more complex research challenges.3
Due to saturation, several benchmarks featured in the 2023 AI Index have been omitted from this year’s report. Figure 2.1.17 highlights a selection of benchmarks that were included in the 2023 edition but not featured in this year’s report.4 It also shows the improvement on these benchmarks since 2022. “NA” indicates no improvement was noted.
3 Benchmarks can also see limited improvement because the underlying problem is genuinely hard and performance simply fails to advance. The saturation discussed in this section refers instead to benchmarks where performance reaches a near-perfect level that is difficult to improve upon.
4 For brevity, Figure 2.1.17 highlights a selection of deprecated benchmarks. Additional benchmarks that were deprecated either because there was saturation, no new state-of-the-art score was documented, or research focus shifted away from the benchmark include: Celeb-DF (deepfake detection), CIFAR-10 (image classification), NIST FRVT (facial recognition), and Procgen (reinforcement learning).
Figure 2.1.18 illustrates the year-over-year improvement, in percent, on a selection of benchmarks featured in the 2023 AI Index report. Most benchmarks see significant performance increases relatively soon after they are introduced, then the improvement slows. In the last few years, many of these benchmarks have shown little or no improvement.
In response to benchmark saturation, AI researchers are pivoting away from traditional benchmarks and testing AI on more difficult challenges. The 2024 AI Index tracks progress on several new benchmarks including those for tasks in coding, advanced reasoning, and agentic behavior—areas that were underrepresented in previous versions of the report (Figure 2.1.19).5
5 This report includes an Appendix with details regarding the sourcing of new benchmarks featured in this chapter.
2.2 Language
Natural language processing (NLP) enables computers to understand, interpret, generate, and transform text. Current state-of-the-art models, such as OpenAI’s GPT-4 and Google’s Gemini, are able to generate fluent and coherent prose and display high levels of language understanding ability (Figure 2.2.1). Many of these models can also now handle different input forms, such as images and audio (Figure 2.2.2).
Understanding
English-language understanding tasks challenge AI systems to understand English in various ways, such as reading comprehension and logical reasoning.
HELM: Holistic Evaluation of Language Models
As illustrated above, in recent years LLMs have surpassed human performance on traditional English-language benchmarks, such as SQuAD (question answering) and SuperGLUE (language understanding). This rapid advancement has led to the need for more comprehensive benchmarks.
In 2022, Stanford researchers introduced HELM (Holistic Evaluation of Language Models), designed to evaluate LLMs across diverse scenarios, including reading comprehension, language understanding, and mathematical reasoning.6 HELM assesses models from several leading companies like Anthropic, Google, Meta, and OpenAI, and uses a “mean win rate” to track average performance across all scenarios. As of January 2024, GPT-4 leads the aggregate HELM leaderboard with a mean win rate of 0.96 (Figure 2.2.3); however, different models top different task categories (Figure 2.2.4).7
6 HELM evaluates 10 scenarios: (1) NarrativeQA (reading comprehension), (2) Natural Questions (closed-book) (closed-book short-answer question answering), (3) Natural Questions (open-book) (open-book short-answer question answering), (4) OpenBookQA (commonsense question answering), (5) MMLU (multisubject understanding), (6) GSM8K (grade school math), (7) MATH (competition math), (8) LegalBench (legal reasoning), (9) MedQA (medical knowledge), and (10) WMT 2014 (machine translation).
7 There are several versions of HELM. This section reports the score on HELM Lite, Release v1.0.0 (2023-12-19), with the data having been collected in January 2024.
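HELM’s mean win rate can be summarized as the fraction of head-to-head comparisons against the other evaluated models that a given model wins, averaged across scenarios. The sketch below is an illustrative reimplementation of that idea with made-up scores; HELM’s official implementation may differ in details such as tie handling.

```python
# Illustrative mean-win-rate computation in the spirit of HELM's aggregate metric.
# Scenario scores are hypothetical placeholders (higher is better).

scores = {                      # scenario -> {model: score}
    "NarrativeQA": {"model_a": 0.80, "model_b": 0.72, "model_c": 0.65},
    "GSM8K":       {"model_a": 0.90, "model_b": 0.93, "model_c": 0.40},
    "MMLU":        {"model_a": 0.86, "model_b": 0.79, "model_c": 0.70},
}

def mean_win_rate(model: str) -> float:
    win_rates = []
    for per_model in scores.values():
        others = [m for m in per_model if m != model]
        wins = sum(per_model[model] > per_model[o] for o in others)
        win_rates.append(wins / len(others))   # win rate within this scenario
    return sum(win_rates) / len(win_rates)     # average across scenarios

for m in ["model_a", "model_b", "model_c"]:
    print(m, round(mean_win_rate(m), 2))
```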
MMLU: Massive Multitask Language Understanding
The Massive Multitask Language Understanding (MMLU) benchmark assesses model performance in zero-shot or few-shot scenarios across 57 subjects, including the humanities, STEM, and social sciences (Figure 2.2.5). MMLU has emerged as a premier benchmark for assessing LLM capabilities: Many state-of-the-art models like GPT-4, Claude 2, and Gemini have been evaluated against MMLU.
In early 2023, GPT-4 posted a state-of-the-art score on MMLU, later surpassed by Google’s Gemini Ultra. Figure 2.2.6 highlights the top model scores on the MMLU benchmark in different years. The scores reported are the averages across the test set. As of January 2024, Gemini Ultra holds the top score of 90.0%, marking a 14.8 percentage point improvement since 2022 and a 57.6 percentage point increase since MMLU’s inception in 2019. Gemini Ultra’s score was the first to surpass MMLU’s human baseline of 89.8%.
Generation
In generation tasks, AI models are tested on their ability to produce fluent and practical language responses.
Chatbot Arena Leaderboard
The rise of capable LLMs has made it increasingly important to understand which models are preferred by the general public. Launched in 2023, the Chatbot Arena Leaderboard is one of the first comprehensive evaluations of public LLM preference. The leaderboard allows users to query two anonymous models and vote for the preferred generations (Figure 2.2.7). As of early 2024, the platform has garnered over 200,000 votes, and users ranked OpenAI’s GPT-4 Turbo as the most preferred model (Figure 2.2.8).
Factuality and Truthfulness
Despite remarkable achievements, LLMs remain susceptible to factual inaccuracies and content hallucination—creating seemingly realistic, yet false, information. The presence of real-world instances where LLMs have produced hallucinations—in court cases, for example—underscores the growing necessity of closely monitoring trends in LLM factuality.
TruthfulQA
Introduced at ACL 2022, TruthfulQA is a benchmark designed to evaluate the truthfulness of LLMs in generating answers to questions. This benchmark comprises approximately 800 questions across 38 categories, including health, politics, and finance. Many questions are crafted to challenge commonly held misconceptions, which typically lead humans to answer incorrectly (Figure 2.2.9). Although one of the paper’s observations is that larger models tend to be less truthful, GPT-4 (RLHF), released in early 2024, has achieved the highest performance thus far on the TruthfulQA benchmark, with a score of 0.6 (Figure 2.2.10). This score is nearly three times higher than that of a GPT-2-based model tested in 2021, indicating that LLMs are becoming progressively better at providing truthful answers.
HaluEval
As previously mentioned, LLMs are prone to hallucinations, a concerning trait given their widespread deployment in critical fields such as law and medicine. While existing research has aimed to understand the causes of hallucinations, less effort has been directed toward assessing the frequency of LLM hallucinations and identifying specific content areas where they are especially vulnerable.
HaluEval, introduced in 2023, is a new benchmark designed to assess hallucinations in LLMs. It includes over 35,000 samples, both hallucinated and normal, for analysis and evaluation by LLMs (Figure 2.2.11). The research indicates that ChatGPT fabricates unverifiable information in approximately 19.5% of its responses, with these fabrications spanning a variety of topics such as language, climate, and technology. Furthermore, the study examines how well current LLMs can detect hallucinations. Figure 2.2.12 illustrates the performance of leading LLMs in identifying hallucinations across various tasks, including question answering, knowledge-grounded dialogue, and text summarization. The findings reveal that many LLMs struggle with these tasks, highlighting that hallucination is a significant ongoing issue.
2.3 Coding
Generation
On many coding tasks, AI models are challenged to generate usable code or to solve computer science problems.
HumanEval
HumanEval, a benchmark for evaluating AI systems’ coding ability, was introduced by OpenAI researchers in 2021. It consists of 164 challenging handwritten programming problems (Figure 2.3.1). A GPT-4 model variant (AgentCoder) currently leads in HumanEval performance, scoring 96.3%, which is an 11.2 percentage point increase from the highest score in 2022 (Figure 2.3.2). Since 2021, performance on HumanEval has increased 64.1 percentage points.
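HumanEval results are typically reported with the pass@k metric from the original OpenAI paper: generate n candidate solutions per problem, count how many pass the unit tests, and estimate the probability that at least one of k samples would pass. Below is a small sketch of that unbiased estimator; the sample counts are placeholders.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn for a problem, c of them passed the tests."""
    if n - c < k:                                # every size-k subset contains at least one pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 20 samples generated for one problem, 3 pass the tests.
print(round(pass_at_k(n=20, c=3, k=1), 3))       # estimated pass@1
print(round(pass_at_k(n=20, c=3, k=10), 3))      # estimated pass@10
```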
SWE-bench
As AI systems’ coding capabilities improve, it has become increasingly important to benchmark models on more challenging tasks. In October 2023, researchers introduced SWE-bench, a dataset comprising 2,294 software engineering problems sourced from real GitHub issues and popular Python repositories (Figure 2.3.3). SWE-bench presents a tougher test for AI coding proficiency, demanding that systems coordinate changes across multiple functions, interact with various execution environments, and perform complex reasoning.
Even state-of-the-art LLMs face significant challenges with SWE-bench. Claude 2, the best-performing model, solved only 4.8% of the dataset’s problems (Figure 2.3.4).8 In 2023, the top-performing model on SWE-bench surpassed the best model from 2022 by 4.3 percentage points.
8 According to the SWE-bench leaderboard, unassisted systems have no assistance in finding the relevant files in the repository. Assisted systems operate under the “oracle” retrieval setting, which means the systems are provided with the list of files that were modified in the pull request.
2.4 Image Computer Vision and Image Generation
Computer vision allows machines to understand images and videos and to create realistic visuals from textual prompts or other inputs. This technology is widely used in fields such as autonomous driving, medical imaging, and video game development.
Generation
Image generation is the task of generating images that are indistinguishable from real ones. Today’s image generators are so advanced that most people struggle to differentiate between AI-generated images and actual images of human faces (Figure 2.4.1). Figure 2.4.2 highlights several generations from various Midjourney model variants from 2022 to 2024 for the prompt “a hyper-realistic image of Harry Potter.” The progression demonstrates the significant improvement in Midjourney’s ability to generate hyper-realistic images over a two-year period. In 2022, the model produced cartoonish and inaccurate renderings of Harry Potter, but by 2024, it could create startlingly realistic depictions.
Midjourney generations over time: “a hyper-realistic image of Harry Potter” Source: Midjourney, 2023 | Chart: 2024 AI Index report
HEIM: Holistic Evaluation of Text-to-Image Models
The rapid progress of AI text-to-image systems has prompted the development of more sophisticated evaluation methods. In 2023, Stanford researchers introduced the Holistic Evaluation of Text-to-Image Models (HEIM), a benchmark designed to comprehensively assess image generators across 12 key aspects crucial for real-world deployment, such as image-text alignment, image quality, and aesthetics.9 Human evaluators are used to rate the models, a crucial feature since many automated metrics struggle to accurately assess various aspects of images.
HEIM’s findings indicate that no single model excels in all criteria. For human evaluation of image-text alignment (assessing how well the generated image matches the input text), OpenAI’s DALL-E 2 scores highest (Figure 2.4.3). In terms of image quality (gauging if the images resemble real photographs), aesthetics (evaluating the visual appeal), and originality (a measure of novel image generation and avoidance of copyright infringement), the Stable Diffusion–based Dreamlike Photoreal model ranks highest (Figure 2.4.4).
9 The 12 evaluation aspects of HEIM are: (1) Alignment: How closely does the image align with the given text? (2) Quality: What is the quality of the produced image? (3) Aesthetic: How aesthetically pleasing is the generated image? (4) Originality: How original is the image? (5) Reasoning: Does the model understand objects, counts, and spatial relations? (6) Knowledge: Does the model have knowledge about the world? (7) Bias: Are the generated images biased? (8) Toxicity: Are the generated images toxic or inappropriate? (9) Fairness: Do the generated images exhibit performance disparities? (10) Robust: Is the model robust to input perturbations? (11) Multilinguality: Does the model support non-English languages? (12) Efficiency: How fast is model inference?
Highlighted Research: MVDream
Creating 3D geometries or models from text prompts has been a significant challenge for AI researchers, with existing models struggling with problems such as the multi-face Janus issue (inaccurately regenerating context described by text prompts) and content drift (inconsistency across different 3D views). MVDream is a new 3D generation system developed by researchers from ByteDance and the University of California, San Diego, that overcomes some of these hurdles (Figure 2.4.5). In quantitative evaluations, MVDream’s generated models achieve Inception Score (IS) and CLIP scores comparable to those of the training set, indicating the high quality of the generated images (Figure 2.4.6). MVDream has major implications, especially for creative industries where 3D content creation is traditionally time-consuming and labor-intensive.
Instruction-Following
In computer vision, instruction-following is the capacity of vision-language models to interpret text-based directives related to images. For instance, an AI system could be given an image of various ingredients and tasked with suggesting how to use them to prepare a healthy meal. Capable instruction-following vision-language models are necessary for developing advanced AI assistants.
VisIT-Bench
In 2023, a team of industry and academic researchers introduced VisIT-Bench, a benchmark consisting of 592 challenging vision-language instructions across about 70 instruction categories, such as plot analysis, art knowledge, and location understanding (Figure 2.4.8). As of January 2024, the leading model on VisIT-Bench is GPT-4V, the vision-enabled variant of GPT-4 Turbo, with an Elo score of 1,349, marginally surpassing the human reference score for VisIT-Bench (Figure 2.4.9).
Editing
Image editing involves using AI to modify images based on text prompts. This AI-assisted approach has broad real-world applications in fields such as engineering, industrial design, and filmmaking.
EditVal
Despite the promise of text-guided image editing, few robust methods can evaluate how accurately AI image editors adhere to editing prompts. EditVal, a new benchmark for assessing text-guided image editing, includes over 13 edit types, such as adding objects or changing their positions, across 19 object classes (Figure 2.4.10). The benchmark was applied to evaluate eight leading text-guided image editing methods, including SINE and Null-text. Performance improvements since 2021 on a variety of the benchmark’s editing tasks are shown in Figure 2.4.11.
Highlighted Research: ControlNet
Conditioning inputs or performing conditional control refers to the process of guiding the output created by an image generator by specifying certain conditions that a generated image must meet. Existing text-to-image models often lack precise control over the spatial composition of an image, making it difficult to use prompts alone to generate images with complex layouts, diverse shapes, and specific poses. Fine-tuning these models for greater compositional control by training them on additional images is theoretically feasible, but many specialized datasets, such as those for human poses, are not large enough to support successful training.
In 2023, researchers from Stanford introduced a new model, ControlNet, that improves conditional control editing for large text-to-image diffusion models (Figure 2.4.12). ControlNet stands out for its ability to handle various conditioning inputs. Compared to other models released in 2022, human raters prefer ControlNet for both superior image quality and better condition fidelity (Figure 2.4.13). The introduction of ControlNet is a significant step toward creating advanced text-to-image generators capable of editing images to more accurately replicate the complex scenes frequently encountered in the real world.
Highlighted Research: Instruct-NeRF2NeRF
New models can edit 3D geometries using only text instructions. Instruct-NeRF2NeRF is a model developed by Berkeley researchers that employs an image-conditioned diffusion model for iterative text-based editing of 3D geometries (Figure 2.4.14). This method efficiently generates new, edited images that adhere to textual instructions, achieving greater consistency than current leading methods (Figure 2.4.15).
Segmentation
Segmentation involves assigning individual image pixels to specific categories (for example, human, bicycle, or street).
Highlighted Research: Segment Anything
In 2023, Meta researchers launched Segment Anything, a project that featured the Segment Anything Model (SAM) and the extensive SA-1B dataset for image segmentation. SAM is remarkable for being one of the first broadly generalizable segmentation models that performs well zero-shot on new tasks and distributions. Segment Anything outperforms leading segmentation methods like RITM on 16 of 23 segmentation datasets (Figure 2.4.17). The metric on which Segment Anything is evaluated is mean Intersection over Union (mIoU).
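Mean IoU averages, over evaluation masks, the ratio of the overlap between predicted and ground-truth masks to their union. The toy masks below are invented purely to illustrate the computation; they are not drawn from SA-1B or the paper’s evaluation.

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over Union for two binary masks of the same shape."""
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / union if union > 0 else 1.0

# Toy example: two 4x4 masks (1 = object pixel, 0 = background).
pred   = np.array([[0,1,1,0],[0,1,1,0],[0,0,0,0],[0,0,0,0]])
target = np.array([[0,1,1,1],[0,1,1,1],[0,0,0,0],[0,0,0,0]])
print(round(iou(pred, target), 3))   # 4 overlapping pixels / 6 in the union ≈ 0.667

# mIoU over a dataset is simply the mean of the per-mask IoU values.
```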
Meta’s Segment Anything model was then used, alongside human annotators, to create the SA-1B dataset, which included over 1 billion segmentation masks across 11 million images (Figure 2.4.16). A new segmentation dataset of this size will accelerate the training of future segmentation models. Segment Anything demonstrates how AI models can be used alongside humans to more efficiently create large datasets, which in turn can be used to train even better AI systems.
3D Reconstruction From Images
3D image reconstruction is the process of creating three-dimensional digital geometries from two-dimensional images. This type of reconstruction can be used in medical imaging, robotics, and virtual reality.
Highlighted Research: Skoltech3D
Data scarcity often hinders the development of AI systems for specific tasks. In 2023, a team of international researchers introduced an extensive new dataset, Skoltech3D, for multiview 3D surface reconstruction (Figure 2.4.18). Encompassing 1.4 million images of 107 scenes captured from 100 different viewpoints under 14 distinct lighting conditions, this dataset represents a major improvement over existing 3D reconstruction datasets (Figure 2.4.19).
Highlighted Research: RealFusion
RealFusion, developed by Oxford researchers, is a new method for generating complete 3D models of objects from single images, overcoming the challenge that a single image often contains insufficient information for full 360-degree reconstruction. RealFusion uses existing 2D image generators to produce multiple views of an object and then assembles these views into a comprehensive 360-degree model (Figure 2.4.20). This technique yields more accurate 3D reconstructions than state-of-the-art methods from 2021 (Shelf-Supervised) across a wide range of objects (Figure 2.4.21).
2.5 Video Computer Vision and Video Generation
Generation
Video generation involves the use of AI to generate videos from text or images.
UCF101
UCF101 is an action recognition dataset of realistic action videos that contain 101 action categories (Figure 2.5.1). More recently, UCF101 has been used to benchmark video generators. This year’s top model, W.A.L.T-XL, posted an FVD16 score of 36, more than halving the state-of-the-art score posted the previous year (Figure 2.5.2).
Highlighted Research: Align Your Latents
Most existing methods can only create short, low-resolution videos. To address this limitation, an international team of researchers has applied latent diffusion models, traditionally used for generating high-quality images, to produce high-resolution videos (Figure 2.5.3). Their Latent Diffusion Model (LDM) notably outperforms previous state-of-the-art methods released in 2022, like Long Video GAN (LVG), in resolution quality (Figure 2.5.4). The adaptation of a text-to-image architecture to create LDM, a highly effective text-to-video model, exemplifies how advanced AI techniques can be repurposed across different domains of computer vision. The LDM’s strong video generation capabilities have many real-world applications, such as creating realistic driving simulations.
Highlighted Research: Emu Video
Traditionally, progress in video generation has trailed that in image generation due to its higher complexity and the smaller datasets available for training. Emu Video, a new transformer-based video generation model created by Meta researchers, represents a significant step forward (Figure 2.5.5). Emu Video generates an image from text and then creates a video based on both the text and image. Figure 2.5.6 illustrates the degree to which the Emu Video model outperforms previously released state-of-the-art video generation methods. The metric is the proportion of cases in which human evaluators preferred Emu Video’s image quality or faithfulness to text instructions over the compared method. Emu Video simplifies the video generation process and signals a new era of high-quality video generation.
2.6 Reasoning
Reasoning in AI involves the ability of AI systems to draw logically valid conclusions from different forms of information. AI systems are increasingly being tested in diverse reasoning contexts, including visual (reasoning about images), moral (understanding moral dilemmas), and social reasoning (navigating social situations).10
General Reasoning
General reasoning pertains to AI systems being able to reason across broad, rather than specific, domains. As part of a general reasoning challenge, for example, an AI system might be asked to reason across multiple subjects rather than perform one narrow task (e.g., playing chess).
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
In recent years, the reasoning abilities of AI systems have advanced so much that traditional benchmarks like SQuAD (for textual reasoning) and VQA (for visual reasoning) have become saturated, indicating a need for more challenging reasoning tests.
Responding to this, researchers from the United States and Canada recently developed MMMU, the Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. MMMU comprises about 11,500 college-level questions from six core disciplines: art and design, business, science, health and medicine, humanities and social science, and technology and engineering (Figure 2.6.1). The question formats include charts, maps, tables, chemical structures, and more. MMMU is one of the most demanding tests of perception, knowledge, and reasoning in AI to date. As of January 2024, the highest-performing model is Gemini Ultra, which leads in all subject categories with an overall score of 59.4% (Figure 2.6.2).11 On most individual task categories, top models still fall well behind medium-level human experts (Figure 2.6.3). This relatively low score is evidence of MMMU’s effectiveness as a benchmark for assessing AI reasoning capabilities.
10 Some abilities highlighted in the previous sections implicitly involve some form of reasoning. This section highlights tasks that have a more specific reasoning focus.
11 The AI Index reports results from the MMMU validation set, as recommended by the paper authors for the most comprehensive coverage. According to the authors, the test set, with its unreleased labels and larger size, presents a more challenging yet unbiased benchmark for model performance, ensuring a more robust evaluation. The test set results are available on the MMMU page.
12 An asterisk (*) next to the model names indicates that the results were provided by the authors.
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
In the last year, researchers from NYU, Anthropic, and Meta introduced the GPQA benchmark to test general multisubject AI reasoning. This dataset consists of 448 difficult multiple-choice questions that cannot be easily answered by Google searching. The questions were crafted by subject-matter experts in various fields like biology, physics, and chemistry (Figure 2.6.4). PhD-level experts achieved a 65% accuracy rate in their respective domains on GPQA, while nonexpert humans scored around 34%. The best-performing AI model, GPT-4, only reached a score of 41.0% on the main test set (Figure 2.6.5).
Highlighted Research: Comparing Humans, GPT-4, and GPT-4V on Abstraction and Reasoning Tasks
Abstract reasoning involves using known information to solve unfamiliar and novel problems and is a key aspect of human cognition that is evident even in toddlers. While recent LLMs like GPT-4 have shown impressive performance, their capability for true abstract reasoning remains a hotly debated subject.13 To further explore this topic, researchers from the Santa Fe Institute tested GPT-4 on the ConceptARC benchmark, a collection of analogy puzzles designed to assess general abstract reasoning skills (Figure 2.6.6). The study revealed that GPT-4 significantly trails behind humans in abstract reasoning abilities: While humans score 95% on the benchmark, the best GPT-4 system only scores 69% (Figure 2.6.7). The development of truly general AI requires abstract reasoning capabilities. Therefore, it will be important to continue tracking progress in this area.
Mathematical Reasoning
Mathematical problem-solving benchmarks evaluate AI systems’ ability to reason mathematically. AI models can be tested with a range of math problems, from grade-school level to competition-standard mathematics.
GSM8K
GSM8K, a dataset comprising approximately 8,000 varied grade school math word problems, requires that AI models develop multistep solutions utilizing arithmetic operations (Figure 2.6.8). GSM8K has quickly become a favored benchmark for evaluating advanced LLMs. The top-performing model on GSM8K is a GPT-4 variant (GPT-4 Code Interpreter), which scores an accuracy of 97%, a 4.4% improvement from the state-of-the-art score in the previous year and a 30.4% improvement from 2022 when the benchmark was first introduced (Figure 2.6.9).
MATH
MATH is a dataset of 12,500 challenging competition-level mathematics problems introduced by UC Berkeley researchers in 2021 (Figure 2.6.10). AI systems struggled on MATH when it was first released, managing to solve only 6.9% of the problems. Performance has significantly improved. In 2023, a GPT-4-based model posted the top result, successfully solving 84.3% of the dataset’s problems (Figure 2.6.11).
PlanBench
A planning system receives a specified goal, an initial state, and a collection of actions. Each action is defined by preconditions, which must be met for the action to be executed, and the effects that result from the action’s execution. The system constructs a plan, comprising a series of actions, to achieve the goal from the initial state.
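To make the preconditions-and-effects formalism concrete, here is a toy, hand-rolled sketch of a single Blocksworld-style action. It is only an illustration of the formalism just described; PlanBench itself uses PDDL-style problem descriptions rather than Python.

```python
# Toy illustration of an action with preconditions and effects, in the spirit of
# the Blocksworld domain. This is not PlanBench's actual problem encoding.

state = {"clear(A)", "on(A, B)", "on_table(B)", "hand_empty"}

def unstack_A_from_B(state: set) -> set:
    """Pick up block A from the top of block B."""
    preconditions = {"clear(A)", "on(A, B)", "hand_empty"}
    if not preconditions <= state:                     # all preconditions must hold
        raise ValueError("preconditions not satisfied")
    add_effects = {"holding(A)", "clear(B)"}
    del_effects = {"on(A, B)", "hand_empty", "clear(A)"}
    return (state - del_effects) | add_effects         # apply effects to get the next state

print(sorted(unstack_A_from_B(state)))
```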
Claims have been made that LLMs can solve planning problems. A group from Arizona State University has proposed PlanBench, a benchmark suite containing problems used in the automated planning community, especially those used in the International Planning Competition. They tested I-GPT-3 and GPT-4 on 600 problems in the Blocksworld domain (where a hand tries to construct stacks of blocks but may move only one block at a time, either to the table or onto a clear block) using one-shot learning, and showed that GPT-4 generated correct and cost-optimal plans about 34% of the time, and I-GPT-3 about 6% (Figure 2.6.12). Verifying the correctness of a given plan, by contrast, is an easier task.
Visual Reasoning
Visual reasoning tests how well AI systems can reason across both visual and textual data.
Visual Commonsense Reasoning (VCR)
Introduced in 2019, the Visual Commonsense Reasoning (VCR) challenge tests the commonsense visual reasoning abilities of AI systems. In this challenge, AI systems not only answer questions based on images but also reason about the logic behind their answers (Figure 2.6.13). Performance in VCR is measured using the Q->AR score, which evaluates the machine’s ability to both select the correct answer to a question (Q->A) and choose the appropriate rationale behind that answer (Q->R). While AI systems have yet to outperform humans on this task, their capabilities are steadily improving. Between 2022 and 2023, there was a 7.93% increase in AI performance on the VCR challenge (Figure 2.6.14).
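Under this scoring scheme, an example is credited only when both the chosen answer and the chosen rationale are correct. The sketch below illustrates that joint-accuracy idea with made-up predictions; the VCR paper’s exact evaluation protocol may differ in detail.

```python
# Illustrative Q->AR scoring: an example counts only when both the answer (Q->A)
# and the supporting rationale (Q->R) are selected correctly. Data is made up.

predictions = [
    {"answer_correct": True,  "rationale_correct": True},
    {"answer_correct": True,  "rationale_correct": False},
    {"answer_correct": False, "rationale_correct": True},
    {"answer_correct": True,  "rationale_correct": True},
]

q_to_a  = sum(p["answer_correct"] for p in predictions) / len(predictions)
q_to_ar = sum(p["answer_correct"] and p["rationale_correct"]
              for p in predictions) / len(predictions)
print(f"Q->A: {q_to_a:.2f}   Q->AR: {q_to_ar:.2f}")   # 0.75 and 0.50 here
```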
Moral Reasoning
In the future, AI will be increasingly applied to domains where ethical considerations are crucial, such as in healthcare and judicial systems. Therefore, it is essential for AI systems to possess robust moral reasoning capabilities, enabling them to effectively navigate and reason about ethical principles and moral considerations.
MoCa
The ability of AI models to reason in linguistic and visual domains is well established, yet their capacity for moral reasoning, especially moral reasoning that aligns with human moral judgments, is less understood.14 To further explore this topic, a team of Stanford researchers created a new dataset (MoCa) of human stories with moral elements (Figure 2.6.15). The researchers then presented a range of LLMs with these stories of human actions and prompted the models to respond, measuring moral agreement with the discrete agreement metric: A higher score indicates closer alignment with human moral judgment. The study yielded intriguing results. No model perfectly matches human moral systems, but newer, larger models like GPT-4 and Claude show greater alignment with human moral sentiments than smaller models like GPT-3, suggesting that as AI models scale, they are gradually becoming more morally aligned with humans. Of all models surveyed, GPT-4 showed the greatest agreement with human moral sentiments (Figure 2.6.16).
Causal Reasoning
Causal reasoning assesses an AI system’s ability to understand cause-and-effect relationships. As AI becomes increasingly ubiquitous, it has become important to evaluate whether AI models can not only explain their outputs but also update their conclusions—key aspects of causal reasoning.
BigToM
Assessing whether LLMs have theory-of-mind (ToM) capabilities—understanding and attributing mental states such as beliefs, intentions, and emotions—has traditionally challenged AI researchers. Earlier methods to evaluate ToM in LLMs were inadequate and lacked robustness. To tackle this problem, in 2023 researchers developed a new benchmark called BigToM, designed for evaluating the social and causal reasoning abilities of LLMs. BigToM, comprising 25 controls and 5,000 model-generated evaluations, has been rated by human evaluators as superior to existing ToM benchmarks. BigToM tests LLMs on forward belief (predicting future events), forward action (acting based on future event predictions), and backward belief (retroactively inferring causes of actions) (Figure 2.6.17).
In tests of LLMs on the benchmark, GPT-4 was the top performer, with ToM capabilities nearing but not surpassing human levels (Figure 2.6.18, Figure 2.6.19, and Figure 2.6.20). More specifically, as measured by accuracy in correctly inferring beliefs, GPT-4 closely matched human performance in forward belief and backward belief tasks and slightly surpassed humans in forward action tasks. Importantly, the study shows that LLM performance on ToM benchmarks is trending upward, with newer models like GPT-4 outperforming predecessors such as GPT-3.5 (released in 2022).
Highlighted Research: Tübingen Cause-Effect Pairs
Researchers from Microsoft and the University of Chicago have demonstrated that LLMs are effective causal reasoners. The team evaluated several recent LLMs, including GPT-4, using the Tübingen cause-effect pairs dataset. This benchmark comprises over 100 cause-and-effect pairs across 37 subdisciplines, testing AI systems’ ability to discern causal relationships (Figure 2.6.21). GPT-4’s performance, a 96% accuracy score, surpassed the previous year’s best by 13 percentage points (Figure 2.6.22). Notably, GPT-4 outperformed prior covariance-based AI models, which were explicitly trained for causal reasoning tasks. Furthermore, the researchers discovered that certain prompts, especially those designed to encourage helpfulness, can significantly enhance an LLM’s causal reasoning capabilities.
2.7 Audio
AI systems are adept at processing human speech, with audio capabilities that include transcribing spoken words to text and recognizing individual speakers. More recently, AI has advanced in generating synthetic audio content.
Generation
2023 marked a significant year in the field of audio generation, which involves creating synthetic audio content, ranging from human speech to music files. This advancement was highlighted by the release of several prominent audio generators, such as UniAudio, MusicGen, and MusicLM.
Highlighted Research: UniAudio
UniAudio is a new audio generation system that applies language modeling techniques to create audio content. UniAudio uniformly tokenizes all audio types and, like modern LLMs, employs next-token prediction for high-quality audio generation. UniAudio is capable of generating high-quality speech, sound, and music, and surpasses leading methods on tasks including text-to-speech, speech enhancement, and voice conversion (Figure 2.7.1). With 1 billion parameters and trained on 165,000 hours of audio, UniAudio exemplifies the efficacy of big data and self-supervision for audio generation.
Highlighted Research: MusicGen and MusicLM
Meta’s MusicGen is a novel audio generation model that also leverages the transformer architecture common in language models to generate audio. MusicGen enables users to specify text for a desired audio outcome and then fine-tune it using specific melodies. In comparative studies, MusicGen outshines other popular text-to-music models like Riffusion, Moûsai, and MusicLM across various generative music metrics. It boasts a lower FAD score, indicating more plausible music generation, a lower KL score for better alignment with reference music, and a higher CLAP score, reflecting greater adherence to textual descriptions of reference music (Figure 2.7.2). Human evaluators also favor MusicGen for its overall quality (OVL).
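The FAD score cited above is, in essence, a Fréchet distance between Gaussian fits to embeddings of reference and generated audio (standard FAD implementations derive those embeddings from a pretrained audio model); lower values mean the generated distribution sits closer to the reference. The sketch below shows only the distance computation, on random placeholder embeddings rather than real audio features.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two sets of embeddings."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):        # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))

# Placeholder embeddings standing in for features of reference vs. generated audio.
rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 8))
gen = rng.normal(loc=0.3, size=(200, 8))
print(round(frechet_distance(ref, gen), 3))
```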
Although MusicGen outperforms certain text-to-music models released earlier in the year, MusicLM is worth highlighting because its release was accompanied by the launch of MusicCaps, a state-of-the-art dataset of 5.5K music-text pairs. MusicCaps was used by the MusicGen researchers to benchmark the performance of their family of models. The emergence of new models like MusicGen, and of new music-text benchmarks like MusicCaps, highlights the expansion of generative AI beyond language and images into more diverse modalities like audio generation.
2.8 Agents
AI agents, autonomous or semiautonomous systems designed to operate within specific environments to accomplish goals, represent an exciting frontier in AI research. These agents have a diverse range of potential applications, from assisting in academic research and scheduling meetings to facilitating online shopping and vacation booking.
General Agents
This section highlights benchmarks and research into agents that can flexibly operate in general task environments.
AgentBench
AgentBench, a new benchmark designed for evaluating LLM-based agents, encompasses eight distinct interactive settings, including web browsing, online shopping, household management, puzzles, and digital card games (Figure 2.8.1). The study assessed over 25 LLM-based agents, including those built on OpenAI’s GPT-4, Anthropic’s Claude 2, and Meta’s Llama 2. GPT-4 emerged as the top performer, achieving an overall score of 4.01, significantly higher than Claude 2’s score of 2.49 (Figure 2.8.2). The research also suggests that LLMs released in 2023 outperform earlier versions in agentic settings. Additionally, the AgentBench team speculated that agents’ struggles on certain benchmark subsections can be attributed to their limited abilities in long-term reasoning, decision-making, and instruction-following.
Highlighted Research: Voyager
Recent research by Nvidia, Caltech, UT Austin, Stanford, and UW Madison demonstrates that existing LLMs like GPT-4 can be used to develop flexible agents capable of continuous learning. The team created Voyager, a GPT-4-based agent for Minecraft—a complex video game with no set endpoint that is essentially a boundless virtual playground for its players (Figure 2.8.3). Voyager excels in this environment, adeptly remembering plans, adapting to new settings, and transferring knowledge. It significantly outperforms previous models, collecting 3.3 times more unique items, traveling 2.3 times further, and reaching key milestones 15.3 times faster (Figure 2.8.4).
The launch of Voyager is significant, as AI researchers have long faced challenges in creating agents that can explore, plan, and learn in open-ended worlds. While previous AI systems like AlphaZero succeeded in closed, rule-defined environments like chess, Go, and shogi, they struggled in more dynamic settings, lacking the ability to continuously learn. Voyager, however, demonstrates remarkable proficiency in a dynamic video game setting, thereby representing a notable advancement in the field of agentic AI.
Task-Specific Agents
This section highlights benchmarks and research into agents that are optimized to perform in specific task environments, such as mathematical problem-solving or academic research.
MLAgentBench
MLAgentBench, a new benchmark for evaluating AI research agents’ performance, tests whether AI agents are capable of engaging in scientific experimentation. More specifically, MLAgentBench assesses AI systems’ potential as computer science research assistants, evaluating their performance across 15 varied research tasks.15 Examples of the tasks include improving a baseline model on the CIFAR-10 image dataset and training a language model on over 10 million words in BabyLM. Various LLM-based agents, including GPT-4, Claude-1, AutoGPT, and LangChain, were tested. The results demonstrate that although there is promise in AI research agents, performance varies significantly across tasks. While some agents achieved over 80% on tasks like ogbn-arxiv (improving a baseline paper classification model), all scored 0% on BabyLM (training a small language model) (Figure 2.8.5). Among these, GPT-4 consistently delivered the best results.
15 The full tasks include: (1) CIFAR-10 (improve a baseline image classification model), (2) imdb (improve a baseline sentiment classification model), (3) ogbn-arxiv (improve a baseline paper classification model from scratch), (4) house prices (train a regression model), (5) spaceship titanic (train a classifier model from scratch), (6) Parkinson’s disease (train a time-series regression model), (7) FathomNet (train an out-of-distribution image classification model), (8) feedback (train an out-of-distribution text regression model), (9) identify contrails (train an out-of-distribution image segmentation model), (10) CLRS (model classic algorithms over graphs and lists), (11) BabyLM (train a language model over 10M words), (12) llama-inference (improve the runtime/autoregressive generation speed of Llama 7B), (13) vectorization (improve the inference speed of a model), (14) literature-review-tool (perform literature review), and (15) bibtex-generation (generate BibTeX from sketch).
2.9 Robotics
Over time, AI has become increasingly integrated into robotics, enhancing robots’ capabilities to perform complex tasks. Especially with the rise of foundation models, this integration allows robots to iteratively learn from their surroundings, adapt flexibly to new settings, and make autonomous decisions.
Highlighted Research: PaLM-E
PaLM-E is a new AI model from Google that merges robotics with language modeling to address real-world tasks like robotic manipulation and knowledge tasks like question answering and image captioning. Leveraging transformer-based architectures, the largest PaLM-E model is scaled up to 562B parameters. The model is trained on diverse visual language as well as robotics data, which results in superior performance on a variety of robotic benchmarks. PaLM-E also sets new standards in visual tasks like OK-VQA, excels in other language tasks, and can engage in chain-of-thought, mathematical, and multi-image reasoning, even without specific training in these areas. Figure 2.9.1 illustrates some of the tasks that the PaLM-E model can perform.
On Task and Motion Planning (TAMP) domains, where robots have to manipulate objects, PaLM-E outperforms previous state-of-the-art methods like SayCan and PaLI on both embodied visual question answering and planning (Figure 2.9.2).16 On robotic manipulation tasks, PaLM-E outperforms competing models (PaLI and CLIP-FT) in its ability to detect failures, which is a crucial step for robots to perform closed-loop planning (Figure 2.9.3).
PaLM-E is significant in that it demonstrates that language modeling techniques as well as text data can enhance the performance of AI systems in nonlanguage domains, like robotics. PaLM-E also highlights how there are already linguistically adept robots capable of real-world interaction and high-level reasoning. Developing these kinds of multifaceted robots is an essential step in creating more general robotic assistants that can, for example, assist in household work.
16 Embodied Visual Question Answering (Embodied VQA) is a task where agents need to navigate through 3D environments and answer questions about the objects they visually perceive in the environment.
Highlighted Research: RT-2
Real-world robots could benefit from certain capabilities possessed by LLMs, such as text and code generation, as well as visual understanding. RT-2, a new robotics model from Google DeepMind, represents an ambitious attempt to create a generalizable robotic model that has certain LLM capabilities. RT-2 uses a transformer-based architecture and is trained on both robotic trajectory data that is tokenized into text and extensive visual-language data.
RT-2 stands out as one of the most impressive and adaptable approaches for conditioning robotic policy. It outshines state-of-the-art models like Manipulation of Open-World Objects (MOO) across various benchmarks, particularly in tasks involving unseen objects. On such tasks, an RT-2/PaLM-E variant achieves an 80% success rate, significantly higher than MOO’s (53%) (Figure 2.9.4). In unseen object tasks, RT-2 surpasses the previous year’s state-of-the-art model, RT-1, by 43 percentage points. This indicates an improvement in robotic performance in novel environments over time.
2.10 Reinforcement Learning
In reinforcement learning, AI systems are trained to maximize performance on a given task by interactively learning from their prior actions. Systems are rewarded if they achieve a desired goal and punished if they fail.
Reinforcement Learning from Human Feedback
Reinforcement learning has gained popularity in enhancing state-of-the-art language models like GPT-4 and Llama 2. Introduced in 2017, Reinforcement Learning from Human Feedback (RLHF) incorporates human feedback into the reward function, enabling models to be trained for characteristics like helpfulness and harmlessness.
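One common way RLHF folds human feedback into the reward function is to fit a reward model on pairwise preference data with a Bradley–Terry-style loss, then optimize the policy against that learned reward. The snippet below sketches only the preference loss on made-up reward values; it is an illustrative formulation, not the training code of any particular model.

```python
import numpy as np

def preference_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected), averaged over pairs."""
    margin = r_chosen - r_rejected
    return float(np.mean(np.log1p(np.exp(-margin))))   # equals -mean(log sigmoid(margin))

# Hypothetical reward-model outputs for human-preferred vs. rejected responses.
r_chosen   = np.array([1.2, 0.4, 2.0])
r_rejected = np.array([0.3, 0.6, 1.1])
print(round(preference_loss(r_chosen, r_rejected), 4))
```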
This year, the AI Index tracked data on the number of foundation models using RLHF as part of their training. More specifically, the Index team looked through the technical reports and other documentation of all models included in CRFM’s Ecosystem graph, one of the most comprehensive repositories of the foundation model ecosystem.17 Figure 2.10.1 illustrates how many foundation models reported using RLHF over time. In 2021, no newly released foundation models used RLHF. In 2022, seven models reported using RLHF, and in 2023, 16 models reported using RLHF. The rising popularity of RLHF is also evidenced by the fact that many leading LLMs report improving their models with RLHF (Figure 2.10.2).
17 It is possible that more models use RLHF as part of their training than reported. The Index only tracks data for models that publicly report using RLHF.
Highlighted Research: RLAIF
RLHF is a powerful method for aligning AI models but can be hindered by the time and labor required to generate human preference datasets for model alignment. As an alternative, Reinforcement Learning from AI Feedback (RLAIF) uses reinforcement learning based on the preferences of LLMs to align other AI models toward human preferences.
Recent research from Google Research compares RLAIF with RLHF, the traditional gold standard, to assess whether RLAIF can serve as a reliable substitute. The study finds that both RLAIF and RLHF are preferred over supervised fine-tuning (SFT) for summarization and helpfulness tasks, and that there is no statistically significant difference in the degree to which each is preferred over SFT (Figure 2.10.3). Notably, in harmless dialogue generation tasks focused on producing the least harmful outputs, RLAIF (88%) surpasses RLHF (76%) in effectiveness (Figure 2.10.4). This research indicates that RLAIF could be a more resource-efficient and cost-effective approach for AI model alignment.
Highlighted Research: Direct Preference Optimization
As illustrated above, RLHF is a useful method for aligning LLMs with human preferences. However, RLHF requires substantial computational resources, involving the training of multiple language models and integrating LM policy sampling within training loops. This complexity can hinder its broader adoption.
In response, researchers from Stanford and CZ Biohub have developed a new algorithm for aligning models with preference data, named Direct Preference Optimization (DPO). DPO is simpler than RLHF but equally effective. The researchers show that DPO is as effective as other existing alignment methods, such as Proximal Policy Optimization (PPO) and supervised fine-tuning (SFT), on tasks like summarization (Figure 2.10.5). The emergence of techniques like DPO suggests that model alignment methods are becoming more straightforward and accessible.
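For reference, the DPO objective can be written directly as a loss over log-probabilities of the preferred and dispreferred responses under the policy and a frozen reference model, with no separate reward model or sampling loop. The sketch below evaluates that loss on placeholder log-probabilities; it is not the authors’ implementation.

```python
import numpy as np

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss: -log sigmoid(beta * [(preferred log-ratio) - (dispreferred log-ratio)])."""
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return float(np.mean(np.log1p(np.exp(-logits))))   # equals -mean(log sigmoid(logits))

# Placeholder sequence log-probabilities for preferred (w) and dispreferred (l) responses.
policy_w = np.array([-12.0, -9.5]);  policy_l = np.array([-14.0, -9.0])
ref_w    = np.array([-13.0, -10.0]); ref_l    = np.array([-13.5, -9.2])
print(round(dpo_loss(policy_w, policy_l, ref_w, ref_l), 4))
```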
2.11 Properties of LLMs
This section focuses on research exploring critical properties of LLMs, such as their capacity for sudden behavioral shifts and self-correction in reasoning. It is important to highlight these studies to develop an understanding of how LLMs, which are increasingly representative of the frontier of AI research, operate and behave.
Highlighted Research: Challenging the Notion of Emergent Behavior
Many papers have argued that LLMs exhibit emergent abilities, meaning they can unpredictably and suddenly display new capabilities at larger scales.18 This has raised concerns that even larger models could develop surprising, and perhaps uncontrollable, new abilities.
However, research from Stanford challenges this notion, arguing that the perceived emergence of new capabilities is often a reflection of the benchmarks used for evaluation rather than an inherent property of the models themselves. The researchers found that when nonlinear or discontinuous metrics like multiple-choice grading are used to evaluate models, emergent abilities seem more apparent. In contrast, when linear or continuous metrics are employed, these abilities largely vanish. Analyzing a suite of benchmarks from BIG-bench, a comprehensive LLM evaluation tool, the researchers noted emergent abilities on only five of the 39 benchmarks (Figure 2.11.1). These findings have important implications for AI safety and alignment research as they challenge a prevailing belief that AI models will inevitably learn new, unpredictable behaviors as they scale.
18 Some of these papers include Brown et al., 2023, Ganguli et al., 2022, Srivastava et al., 2022, and Wei et al., 2022.
Highlighted Research: Changes in LLM Performance Over Time
Publicly usable closed-source LLMs, such as GPT-4, Claude 2, and Gemini, are often updated over time by their developers in response to new data or user feedback. However, there is little research on how the performance of such models changes, if at all, in response to such updating.
A study conducted at Stanford and Berkeley explores the performance of certain publicly usable LLMs over time and highlights that, in fact, their performance can significantly vary. More specifically, the study compared the March and June 2023 versions of GPT-3.5 and GPT-4 and demonstrated that performance declined on several tasks. For instance, the June version of GPT-4, compared to the March version, was 42 percentage points worse at generating code, 16 percentage points worse at answering sensitive questions, and 33 percentage points worse on certain mathematical tasks (Figure 2.11.2). The researchers also found that GPT-4’s ability to follow instructions diminished over time, which potentially explains the broader performance declines. This research highlights that LLM performance can evolve over time and suggests that regular users should be mindful of such changes.
Highlighted Research: LLMs Are Poor Self-Correctors
It is generally understood that LLMs like GPT-4 have reasoning limitations and can sometimes produce hallucinations. One proposed solution to such issues is self-correction, whereby LLMs identify and correct their own reasoning flaws. As AI’s societal role grows, the concept of intrinsic self-correction—allowing LLMs to autonomously correct their reasoning without external guidance— is especially appealing. However, it is currently not well understood whether LLMs are in fact capable of this kind of self-correction.
Researchers from DeepMind and the University of Illinois at Urbana–Champaign tested GPT-4’s performance on three reasoning benchmarks: GSM8K (grade-school math), CommonSenseQA (common-sense reasoning), and HotpotQA (multidocument reasoning). They found that when the model was left to decide on self-correction without guidance, its performance declined across all tested benchmarks (Figure 2.11.3).
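A minimal sketch of the kind of intrinsic self-correction loop being evaluated appears below. It is an illustration of the setup, not the authors' exact protocol: `llm` is a hypothetical text-generation call and the prompts are paraphrased.

```python
# Sketch of an intrinsic self-correction loop (illustrative; `llm` is a
# hypothetical text-completion call, and the prompts are paraphrased).

def answer_with_self_correction(llm, question, rounds=1):
    answer = llm(f"Question: {question}\nAnswer step by step, then state the final answer.")
    for _ in range(rounds):
        # No external feedback: the model alone decides whether its answer is wrong.
        critique = llm(f"Question: {question}\nProposed answer: {answer}\n"
                       "Review the reasoning. If it is wrong, explain the error.")
        answer = llm(f"Question: {question}\nProposed answer: {answer}\n"
                     f"Critique: {critique}\nGive your final answer.")
    return answer

# The DeepMind/UIUC finding is that loops of this form, with no oracle feedback,
# tended to lower accuracy on GSM8K, CommonSenseQA, and HotpotQA.
```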
Closed vs. Open Model Performance
As LLMs become increasingly ubiquitous, debate intensifies over their varying degrees of accessibility. Some models, such as Google’s Gemini, remain closed, accessible solely to their developers. Others, like OpenAI’s GPT-4 and Anthropic’s Claude 2, offer limited access: they are publicly available via an API, but their weights are not released, so the models cannot be independently modified or fully scrutinized by the public. Conversely, Meta’s Llama 2 and Stability AI’s Stable Diffusion adopt an open approach, fully releasing their model weights. Open-source models can be modified and freely used by anyone.
Viewpoints differ on the merits of closed versus open AI models. Some argue in favor of open models, citing their ability to counteract market concentration, foster innovation, and enhance transparency within the AI ecosystem. Others contend that open-source models present considerable security risks, such as facilitating the creation of disinformation or bioweapons, and should therefore be approached with caution.
In the context of this debate, it is important to acknowledge that current evidence indicates a notable performance gap between open and closed models.19 Figures 2.11.4 and 2.11.5 juxtapose the performances of the top closed versus open model on a selection of benchmarks.20 On all selected benchmarks, closed models outperform open ones. Specifically, on 10 selected benchmarks, closed models achieved a median performance advantage of 24.2%, with differences ranging from as little as 4.0% on mathematical tasks like GSM8K to as much as 317.7% on agentic tasks like AgentBench.
19 By closed models, the AI Index is referring both to models that are fully closed and those with limited access.
20 The data in this section was collected in early January 2024.
2.12 Techniques for LLM Improvement
Prompting
Prompting, a vital aspect of the AI pipeline, entails supplying a model with natural language instructions that describe tasks the model should execute.
Mastering the art of crafting effective prompts can significantly enhance the performance of LLMs without requiring any changes to the underlying models.
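As a simple illustration (the task and prompts below are invented for this report, not drawn from any benchmark), the same model can often be steered toward better answers purely by rewording the instruction:

```python
# Two prompts for the same task; only the wording changes, not the model.
task = "A train leaves at 3:40 pm and arrives at 6:05 pm. How long is the trip?"

plain_prompt = task

# Adding explicit step-by-step instructions, a common prompting technique,
# often improves accuracy on reasoning tasks without touching the model itself.
structured_prompt = (
    f"{task}\n"
    "Think step by step: first compute the hours, then the minutes, "
    "then state the total duration as the final answer."
)
```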
Highlighted Research: Graph of Thoughts Prompting
Chain of Thought (CoT) and Tree of Thoughts (ToT) are prompting methods that can improve the performance of LLMs on reasoning tasks. In 2023, European researchers introduced another prompting method, Graph of Thoughts (GoT), that has also shown promise (Figure 2.12.1). GoT enables LLMs to model their thoughts in a more flexible, graph-like structure that more closely mirrors actual human reasoning. The researchers then designed a model architecture to implement GoT and found that, compared to ToT, it increased the quality of outputs by 62% on a sorting task while reducing cost by around 31% (Figure 2.12.2).
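The sketch below illustrates the kind of data structure GoT reasons over. It is not the authors' implementation, and in the actual method the LLM, not Python code, performs the generate and aggregate operations; the point is that a graph, unlike a chain or a tree, allows partial results to be merged.

```python
# Illustrative thought-graph structure for a GoT-style sorting task.
from dataclasses import dataclass, field

@dataclass
class Thought:
    content: list                                  # e.g., a (partially) sorted sublist
    parents: list = field(default_factory=list)    # merging allows more than one parent

problem = Thought(content=[7, 2, 9, 4, 1, 8])

# "Generate" operation: split the problem into smaller thoughts.
# (sorted() stands in for work the LLM would do in the real method.)
left = Thought(content=sorted(problem.content[:3]), parents=[problem])
right = Thought(content=sorted(problem.content[3:]), parents=[problem])

# "Aggregate" operation: merge two thoughts into one. This is the step a tree
# cannot express, since a tree node has exactly one parent.
merged = Thought(content=sorted(left.content + right.content),
                 parents=[left, right])

print(merged.content)  # [1, 2, 4, 7, 8, 9]
```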
Highlighted Research: Optimization by PROmpting (OPRO)
A paper from DeepMind introduced Optimization by PROmpting (OPRO), a method that uses LLMs to iteratively generate prompts that improve performance on a target task. OPRO uses natural language to guide LLMs in creating new prompts based on problem descriptions and previously generated solutions (Figure 2.12.3). The generated prompts aim to enhance the performance of AI systems on particular benchmarks. Compared to other prompting approaches like “let’s think step by step” or an empty starting point, OPRO leads to significantly greater accuracy on virtually all 23 BIG-bench Hard tasks (Figure 2.12.4).
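A minimal sketch of an OPRO-style loop, under stated assumptions: `llm_optimizer` (an LLM asked to propose a new instruction) and `evaluate_prompt` (which scores an instruction on a held-out set of tasks) are hypothetical helpers, and the meta-prompt wording is paraphrased rather than taken from the paper.

```python
# Sketch of an OPRO-style optimization loop (illustrative; `llm_optimizer`
# and `evaluate_prompt` are hypothetical helpers, not a published API).

def opro(llm_optimizer, evaluate_prompt, task_description, steps=20):
    trajectory = []  # (instruction, score) pairs shown back to the optimizer
    for _ in range(steps):
        history = "\n".join(f"Instruction: {p}\nScore: {s:.1f}"
                            for p, s in sorted(trajectory, key=lambda x: x[1]))
        meta_prompt = (
            f"Task: {task_description}\n"
            f"Previous instructions and their accuracies:\n{history}\n"
            "Write a new instruction that achieves a higher accuracy."
        )
        candidate = llm_optimizer(meta_prompt)
        score = evaluate_prompt(candidate)      # accuracy on a scoring set
        trajectory.append((candidate, score))
    return max(trajectory, key=lambda x: x[1])  # best instruction found
```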
Fine-Tuning
Fine-tuning has grown increasingly popular as a method of enhancing LLMs and involves further training or adjusting models on smaller datasets. Fine-tuning not only boosts overall model performance but also sharpens the model’s capabilities on specific tasks. It also allows for more precise control over the model’s behavior.
Highlighted Research: QLoRA
QLoRA, developed by researchers at the University of Washington in 2023, is a new method for more efficient model fine-tuning. It dramatically reduces memory usage, enabling the fine-tuning of a 65-billion-parameter model on a single 48 GB GPU while preserving full 16-bit fine-tuning performance. To put this in perspective, fine-tuning a 65B Llama model, a leading open-source LLM, typically requires about 780 GB of GPU memory, making QLoRA roughly 16 times more memory-efficient. QLoRA achieves this efficiency with techniques such as a 4-bit NormalFloat (NF4) data type, double quantization, and paged optimizers. The researchers used QLoRA to train a model named Guanaco, which matched or even surpassed models like ChatGPT on the Vicuna benchmark (a benchmark that ranks the outputs of LLMs) (Figure 2.12.5). Remarkably, the Guanaco models were created with just 24 hours of fine-tuning on a single GPU. QLoRA highlights how methods for optimizing and further improving models have become more efficient, meaning fewer resources will be required to create increasingly capable models.
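The sketch below shows how these ingredients are commonly combined in practice using the Hugging Face transformers and peft libraries; the model name and hyperparameters are illustrative choices, not the paper's exact recipe.

```python
# Sketch of a QLoRA-style setup with Hugging Face transformers + peft.
# Model name and hyperparameters are illustrative, not the paper's exact recipe.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # base weights stored in 4 bits
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # compute in 16-bit precision
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                 # illustrative; QLoRA scales to 65B
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(                  # small trainable adapters on top of
    r=16, lora_alpha=32,                   # the frozen 4-bit base model
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of weights
```

Paged optimizers, the third technique mentioned above, are likewise available in this ecosystem (for example, via the `paged_adamw_32bit` optimizer option in `TrainingArguments`).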
Attention
LLMs can flexibly handle various tasks but often demand substantial computational resources to train. As previously noted, high training costs can hinder AI’s broader adoption. Optimization methods aim to enhance AI’s efficiency by, for example, improving memory usage, thereby making LLMs more accessible and practical.
Highlighted Research: Flash-Decoding
Flash-Decoding, developed by Stanford researchers, tackles inefficiency in traditional LLMs by speeding up the attention mechanism, particularly in tasks requiring long sequences. It achieves this by loading keys and values in parallel across chunks, then separately rescaling and combining the partial results to maintain correct attention outputs (Figure 2.12.6). In various tests, Flash-Decoding outperforms other leading methods like PyTorch Eager and FlashAttention-2, showing much faster inference: for example, at a batch size of 256 and a sequence length of 256, Flash-Decoding is 48 times faster than PyTorch Eager and six times faster than FlashAttention-2 (Figure 2.12.7). Inference on models like ChatGPT can cost $0.01 per response, which becomes highly expensive when such models are deployed to millions of users. Innovations like Flash-Decoding are critical for reducing inference costs in AI.
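The numpy sketch below captures only the rescale-and-combine arithmetic described above, not the GPU-level parallelism: attention is computed over independent key/value chunks, and the partial results are combined exactly using each chunk's running maximum and normalizer.

```python
# Illustrative math behind split-KV attention combination (as used by
# Flash-Decoding-style approaches); real implementations run the chunks in
# parallel on the GPU. This only demonstrates that the combination is exact.
import numpy as np

def attention_reference(q, K, V):
    s = K @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V

def attention_split_kv(q, K, V, num_chunks=4):
    partials = []
    for Kc, Vc in zip(np.array_split(K, num_chunks), np.array_split(V, num_chunks)):
        s = Kc @ q / np.sqrt(q.shape[0])
        m = s.max()                        # per-chunk running max
        w = np.exp(s - m)
        partials.append((m, w.sum(), w @ Vc))
    m_all = max(m for m, _, _ in partials)
    num = sum(np.exp(m - m_all) * o for m, _, o in partials)
    den = sum(np.exp(m - m_all) * l for m, l, _ in partials)
    return num / den                       # identical to the reference output

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=16), rng.normal(size=(128, 16)), rng.normal(size=(128, 64))
assert np.allclose(attention_reference(q, K, V), attention_split_kv(q, K, V))
```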
2.13 Environmental Impact of AI Systems
This section examines trends in the environmental impact of AI systems, highlighting the evolving landscape of transparency and awareness. Historically, model developers seldom disclosed the carbon footprint of their AI systems, leaving researchers to make their best estimates. Recently, there has been a shift toward greater openness, particularly regarding the carbon costs of training AI models. However, disclosure of the environmental costs associated with inference, a potentially more significant concern, remains insufficient. This section presents data on carbon emissions as reported by developers in addition to featuring notable research exploring the intersection of AI and environmental impact. With AI models growing in size and becoming more widely used, it has never been more critical for the AI research community to diligently monitor and mitigate the environmental effects of AI systems.
General Environmental Impact
Training
Figure 2.13.1 presents the carbon emissions (in tonnes) of select LLMs during training, alongside human reference points. Emissions data for models marked with an asterisk were estimated by independent researchers, as they were not disclosed by the developers. Reported emissions vary widely. For instance, Meta’s Llama 2 70B model released approximately 291.2 tonnes of carbon, nearly 291 times the emissions of one traveler on a round-trip flight from New York to San Francisco, and roughly 16 times the annual carbon emissions of an average American.21 However, the emissions from Llama 2 are still lower than the 502 tonnes reportedly released during the training of OpenAI’s GPT-3.
The variance in emissions estimates is due to factors such as model size, data center energy efficiency, and the carbon intensity of the energy grid. Figure 2.13.2 shows the emissions of select models in relation to their size. Generally, larger models emit more carbon, a trend clearly visible in the Llama 2 series of models, all of which were trained on the same supercomputer (Meta’s Research Super Cluster). However, smaller models can still have high emissions if they are trained on grids powered by more carbon-intensive energy sources. Some estimates suggest that model emissions have declined over time, presumably due to increasingly efficient methods of model training. Figure 2.13.3 features the emissions of select models along with their power consumption.
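The spread in these estimates follows from the first-order accounting commonly used by independent researchers (an approximation stated here for orientation, not the Index's own methodology): training emissions scale with hardware power draw, training time, data center overhead, and the carbon intensity of the local grid.

\[
\mathrm{CO_2e\ (kg)} \;\approx\;
P_{\mathrm{hardware}}\ (\mathrm{kW}) \times t_{\mathrm{train}}\ (\mathrm{hours})
\times \mathrm{PUE} \times I_{\mathrm{grid}}\ (\mathrm{kg\,CO_2e/kWh})
\]

Because each factor can vary by an order of magnitude or more across training runs, reported and estimated figures differ widely even for models of similar size.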
A major challenge in evaluating the environmental impact of AI models is a lack of transparency about emissions. Consistent with findings from other studies, most prominent model developers do not report carbon emissions, hampering efforts to conduct thorough and accurate evaluations of this metric.22 For example, prominent developers such as OpenAI, Google, Anthropic, and Mistral do not report training emissions, although Meta does.
Inference
As highlighted earlier, the environmental impact of training AI models can be significant. While the per-query emissions of inference may be relatively low, the total impact can surpass that of training when models are queried thousands, if not millions, of times daily. Research on emissions from model inference remains scant. A study by Luccioni et al., published in 2023, is among the first to comprehensively assess inference emissions. Figure 2.13.4 illustrates the emissions from 1,000 inferences across various model tasks, revealing that tasks like image generation have a much higher carbon footprint than tasks like text classification.
22 Research also suggests that the reporting of carbon emissions on open model development platforms, such as Hugging Face, is declining over time.
Positive Use Cases
Despite the widely recognized environmental costs of training AI systems, AI can contribute positively to environmental sustainability. Figure 2.13.5 showcases a variety of recent cases where AI supports environmental efforts.23 These applications include enhancing thermal energy system management, improving pest control strategies, and boosting urban air quality.
Appendix
Acknowledgments
The AI Index would like to acknowledge Andrew Shi for his work doing a literature review on the environmental impact of AI models; Emily Capstick for her work studying the use of RLHF in machine learning models; Sukrut Oak for his work generating sample Midjourney generations; and Emma Williamson for her work identifying significant AI technical advancements for the timeline.
Benchmarks
- AgentBench: Data on AgentBench was taken from the AgentBench paper in January 2024. To learn more about AgentBench, please read the original paper.
- BigToM: Data on BigToM was taken from the BigToM paper in January 2024. To learn more about BigToM, please read the original paper.
- Chatbot Arena Leaderboard: Data on the Chatbot Arena Leaderboard was taken from the Chatbot Arena Leaderboard in January 2024. To learn more about the Chatbot Arena Leaderboard, please read the original paper.
- EditVal: Data on EditVal was taken from the EditVal paper in January 2024. To learn more about EditVal, please read the original paper.
- GPQA: Data on GPQA was taken from the GPQA paper in January 2024. To learn more about GPQA, please read the original paper.
- GSM8K: Data on GSM8K was taken from the GSM8K Papers With Code leaderboard in January 2024. To learn more about GSM8K, please read the original paper.
- HEIM: Data on HEIM was taken from the HEIM leaderboard in January 2024. To learn more about HEIM, please read the original paper.
- HELM: Data on HELM was taken from the HELM leaderboard in January 2024. To learn more about HELM, please read the original paper.
- HumanEval: Data on HumanEval was taken from the HumanEval Papers With Code leaderboard in January 2024. To learn more about HumanEval, please read the original paper.
- MATH: Data on MATH was taken from the MATH Papers With Code leaderboard in January 2024. To learn more about MATH, please read the original paper.
- MLAgentBench: Data on MLAgentBench was taken from the MLAgentBench paper in January 2024. To learn more about MLAgentBench, please read the original paper.
- MMLU: Data on MMLU was taken from the MMLU Papers With Code leaderboard in January 2024. To learn more about MMLU, please read the original paper.
- MMMU: Data on MMMU was taken from the MMMU leaderboard in January 2024. To learn more about MMMU, please read the original paper.
- MoCa: Data on MoCa was taken from the MoCa paper in January 2024. To learn more about MoCa, please read the original paper.
- PlanBench: Data on PlanBench was taken from the PlanBench paper in January 2024. To learn more about PlanBench, please read the original paper.
- SWE-bench: Data on SWE-bench was taken from the SWE-bench leaderboard in January 2024. To learn more about SWE-bench, please read the original paper.