Chapter 2: Technical Performance | 2024 AI Index Report
CHAPTER 2: Technical Performance
Overview
The technical performance chapter of this year’s AI Index offers a comprehensive overview of AI advancements in 2023. It begins by tracing the broad evolution of AI technical performance over time. The chapter then examines the current state of a wide range of AI capabilities, including language processing, coding, computer vision (image and video analysis), reasoning, audio processing, autonomous agents, robotics, and reinforcement learning. It also spotlights notable AI research breakthroughs from the past year, exploring methods for improving LLMs through prompting, optimization, and fine-tuning, and wraps up with an exploration of AI systems’ environmental footprint.
Chapter Highlights
- AI beats humans on some tasks, but not on all. AI has surpassed human performance on several benchmarks, including some in image classification, visual reasoning, and English understanding. Yet it trails behind on more complex tasks like competition-level mathematics, visual commonsense reasoning, and planning.
- Here comes multimodal AI. Traditionally AI systems have been limited in scope, with language models excelling in text comprehension but faltering in image processing, and vice versa. However, recent advancements have led to the development of strong multimodal models, such as Google’s Gemini and OpenAI’s GPT-4. These models demonstrate flexibility and are capable of handling images and text and, in some instances, can even process audio.
- Harder benchmarks emerge. AI models have reached performance saturation on established benchmarks such as ImageNet, SQuAD, and SuperGLUE, prompting researchers to develop more challenging ones. In 2023, several challenging new benchmarks emerged, including SWE-bench for coding, HEIM for image generation, MMMU for general reasoning, MoCa for moral reasoning, AgentBench for agent-based behavior, and HaluEval for hallucinations.
- Better AI means better data which means … even better AI. New AI models such as Segment Anything and Skoltech3D are being used to generate specialized data for tasks like image segmentation and 3D reconstruction. Data is vital for AI technical improvements. The use of AI to create more data enhances current capabilities and paves the way for future algorithmic improvements, especially on harder tasks.
- Human evaluation is in. With generative models producing high-quality text, images, and more, benchmarking has slowly started shifting toward incorporating human evaluations like the Chatbot Arena Leaderboard rather than computerized rankings like ImageNet or SQuAD. Public feeling about AI is becoming an increasingly important consideration in tracking AI progress.
- Thanks to LLMs, robots have become more flexible. The fusion of language modeling with robotics has given rise to more flexible robotic systems like PaLM-E and RT-2. Beyond their improved robotic capabilities, these models can ask questions, which marks a significant step toward robots that can interact more effectively with the real world.
- More technical research in agentic AI. Creating AI agents, systems capable of autonomous operation in specific environments, has long challenged computer scientists. However, emerging research suggests that the performance of autonomous AI agents is improving. Current agents can now master complex games like Minecraft and effectively tackle real-world tasks, such as online shopping and research assistance.
- Closed LLMs significantly outperform open ones. On 10 select AI benchmarks, closed models outperformed open ones, with a median performance advantage of 24.2%. Differences in the performance of closed and open models carry important implications for AI policy debates.
2.1 Overview of AI in 2023
Timeline: Significant Model Releases
As chosen by the AI Index Steering Committee, here are some of the most notable model releases of 2023.
Date | Model | Type | Creator(s) | Significance
---|---|---|---|---
Mar. 14, 2023 | Claude | Large language model | Anthropic | Claude is the first publicly released LLM from Anthropic, one of OpenAI’s main rivals. Claude is designed to be as helpful, honest, and harmless as possible.
Mar. 14, 2023 | GPT-4 | Large language model | OpenAI | GPT-4, improving over GPT-3, is among the most powerful and capable LLMs to date and surpasses human performance on several benchmarks.
Mar. 23, 2023 | Stable Diffusion v2 | Text-to-image model | Stability AI | Stable Diffusion v2 is an upgrade of Stability AI’s existing text-to-image model and produces higher-resolution, superior-quality images.
Apr. 5, 2023 | Segment Anything | Image segmentation | Meta | Segment Anything is an AI model capable of isolating objects in images using zero-shot generalization.
Jul. 18, 2023 | Llama 2 | Large language model | Meta | Llama 2, an updated version of Meta’s flagship LLM, is open-source. Its smaller variants (7B and 13B) deliver relatively high performance for their size.
Aug. 20, 2023 | DALL-E 3 | Image generation | OpenAI | DALL-E 3 is an improved version of OpenAI’s existing text-to-image model DALL-E.
Aug. 29, 2023 | SynthID | Watermarking | Google DeepMind | SynthID is a tool for watermarking AI-generated music and images. Its watermarks remain detectable even after image alterations.
Sep. 27, 2023 | Mistral 7B | Large language model | Mistral AI | Mistral 7B, launched by French AI company Mistral, is a compact 7-billion-parameter model that surpasses Llama 2 13B in performance, ranking it top in its class for size.
Oct. 27, 2023 | Ernie 4.0 | Large language model | Baidu | Ernie 4.0, launched by the multinational Chinese technology company Baidu, is among the highest-performing Chinese LLMs to date.
Nov. 6, 2023 | GPT-4 Turbo | Large language model | OpenAI | GPT-4 Turbo is an upgraded large language model boasting a 128K context window and reduced pricing.
Nov. 6, 2023 | Whisper v3 | Speech-to-text model | OpenAI | Whisper v3 is an open-source speech-to-text model known for its increased accuracy and extended language support.
Nov. 21, 2023 | Claude 2.1 | Large language model | Anthropic | Anthropic’s latest LLM, Claude 2.1, features an industry-leading 200K context window, which enhances its capacity to process extensive content such as lengthy literary works.
Nov. 22, 2023 | Inflection-2 | Large language model | Inflection | Inflection-2 is the second LLM from the startup Inflection, founded by DeepMind cofounder Mustafa Suleyman. Its launch underscores the intensifying competition in the LLM arena.
Dec. 6, 2023 | Gemini | Large language model | Google | Gemini emerges as a formidable competitor to GPT-4, with one of its variants, Gemini Ultra, outshining GPT-4 on numerous benchmarks.
Dec. 21, 2023 | Midjourney v6 | Text-to-image model | Midjourney | Midjourney’s latest update enhances user experience with more intuitive prompts and superior image quality.
State of AI Performance
As of 2023, AI has achieved levels of performance that surpass human capabilities across a range of tasks. Figure 2.1.16 illustrates the progress of AI systems relative to human baselines for nine AI benchmarks corresponding to nine tasks (e.g., image classification or basic-level reading comprehension).1 The AI Index team selected one benchmark to represent each task.
Over the years, AI has surpassed human baselines on a handful of benchmarks, such as image classification in 2015, basic reading comprehension in 2017, visual reasoning in 2020, and natural language inference in 2021. As of 2023, there are still some task categories where AI fails to exceed human ability. These tend to be more complex cognitive tasks, such as visual commonsense reasoning and advanced-level mathematical problem-solving (competition-level math problems).
1 An AI benchmark is a standardized test used to evaluate the performance and capabilities of AI systems on specific tasks. For example, ImageNet is a canonical AI benchmark that features a large collection of labeled images, and AI systems are tasked with classifying these images accurately. Tracking progress on benchmarks has been a standard way for the AI community to monitor the advancement of AI systems.
2 In Figure 2.1.16, the values are scaled to establish a standard metric for comparing different benchmarks. The scaling function is calibrated such that the performance of the best model for each year is measured as a percentage of the human baseline for a given task. A value of 105% indicates, for example, that a model performs 5% better than the human baseline.
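To make this scaling concrete, the short sketch below expresses each year’s best model score as a percentage of a human baseline, as footnote 2 describes. It is an illustrative reimplementation, not the Index’s own code, and the scores and baseline below are invented placeholders.

```python
# Illustrative sketch of the human-baseline scaling described in footnote 2.
# The scores and baseline are hypothetical placeholders, not Index data.

human_baseline = 89.8                                   # e.g., human accuracy on some benchmark
best_model_by_year = {2021: 75.2, 2022: 86.4, 2023: 94.3}

for year, score in best_model_by_year.items():
    relative = 100 * score / human_baseline             # 105 means 5% above the human baseline
    print(f"{year}: {relative:.1f}% of human baseline")
```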
AI Index Benchmarks
An emerging theme in AI technical performance, as emphasized in last year’s report, is the observed saturation on many benchmarks, such as ImageNet, used to assess the proficiency of AI models. Performance on these benchmarks has stagnated in recent years, indicating either a plateau in AI capabilities or a shift among researchers toward more complex research challenges.3
Due to saturation, several benchmarks featured in the 2023 AI Index have been omitted from this year’s report. Figure 2.1.17 highlights a selection of benchmarks that were included in the 2023 edition but not featured in this year’s report.4 It also shows the improvement on these benchmarks since 2022. “NA” indicates no improvement was noted.
3 Benchmarks can also see limited improvement because the underlying problem is genuinely hard and performance simply fails to advance. The saturation discussed in this section refers instead to benchmarks where performance reaches a near-perfect level that is difficult to improve upon.
4 For brevity, Figure 2.1.17 highlights a selection of deprecated benchmarks. Additional benchmarks that were deprecated either because there was saturation, no new state-of-the-art score was documented, or research focus shifted away from the benchmark include: Celeb-DF (deepfake detection), CIFAR-10 (image classification), NIST FRVT (facial recognition), and Procgen (reinforcement learning).
Figure 2.1.18 illustrates the year-over-year improvement, in percent, on a selection of benchmarks featured in the 2023 AI Index report. Most benchmarks see significant performance increases relatively soon after they are introduced, then the improvement slows. In the last few years, many of these benchmarks have shown little or no improvement.
In response to benchmark saturation, AI researchers are pivoting away from traditional benchmarks and testing AI on more difficult challenges. The 2024 AI Index tracks progress on several new benchmarks including those for tasks in coding, advanced reasoning, and agentic behavior—areas that were underrepresented in previous versions of the report (Figure 2.1.19).5
5 This report includes an Appendix with details regarding the sourcing of new benchmarks featured in this chapter.
2.2 Language
Natural language processing (NLP) enables computers to understand, interpret, generate, and transform text. Current state-of-the-art models, such as OpenAI’s GPT-4 and Google’s Gemini, are able to generate fluent and coherent prose and display high levels of language understanding ability (Figure 2.2.1). Many of these models can also now handle different input forms, such as images and audio (Figure 2.2.2).
Understanding
English-language understanding tasks challenge AI systems to understand English in various ways, such as reading comprehension and logical reasoning.
HELM: Holistic Evaluation of Language Models
As illustrated above, in recent years LLMs have surpassed human performance on traditional English-language benchmarks, such as SQuAD (question answering) and SuperGLUE (language understanding). This rapid advancement has led to the need for more comprehensive benchmarks.
In 2022, Stanford researchers introduced HELM (Holistic Evaluation of Language Models), designed to evaluate LLMs across diverse scenarios, including reading comprehension, language understanding, and mathematical reasoning.6 HELM assesses models from several leading companies like Anthropic, Google, Meta, and OpenAI, and uses a “mean win rate” to track average performance across all scenarios. As of January 2024, GPT-4 leads the aggregate HELM leaderboard with a mean win rate of 0.96 (Figure 2.2.3); however, different models top different task categories (Figure 2.2.4).7
6 HELM evaluates 10 scenarios: (1) NarrativeQA (reading comprehension), (2) Natural Questions (closed-book) (closed-book short-answer question answering), (3) Natural Questions (open-book) (open-book short-answer question answering), (4) OpenBookQA (commonsense question answering), (5) MMLU (multisubject understanding), (6) GSM8K (grade school math), (7) MATH (competition math), (8) LegalBench (legal reasoning), (9) MedQA (medical knowledge), and (10) WMT 2014 (machine translation).
7 There are several versions of HELM. This section reports the score on HELM Lite, Release v1.0.0 (2023-12-19), with the data having been collected in January 2024.
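HELM’s mean win rate can be summarized as the fraction of head-to-head comparisons against the other evaluated models that a given model wins, averaged across scenarios. The sketch below is an illustrative reimplementation of that idea with made-up scores; HELM’s official implementation may differ in details such as tie handling.

```python
# Illustrative mean-win-rate computation in the spirit of HELM's aggregate metric.
# Scenario scores are hypothetical placeholders (higher is better).

scores = {                      # scenario -> {model: score}
    "NarrativeQA": {"model_a": 0.80, "model_b": 0.72, "model_c": 0.65},
    "GSM8K":       {"model_a": 0.90, "model_b": 0.93, "model_c": 0.40},
    "MMLU":        {"model_a": 0.86, "model_b": 0.79, "model_c": 0.70},
}

def mean_win_rate(model: str) -> float:
    win_rates = []
    for per_model in scores.values():
        others = [m for m in per_model if m != model]
        wins = sum(per_model[model] > per_model[o] for o in others)
        win_rates.append(wins / len(others))   # win rate within this scenario
    return sum(win_rates) / len(win_rates)     # average across scenarios

for m in ["model_a", "model_b", "model_c"]:
    print(m, round(mean_win_rate(m), 2))
```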
MMLU: Massive Multitask Language Understanding
The Massive Multitask Language Understanding (MMLU) benchmark assesses model performance in zero-shot or few-shot scenarios across 57 subjects, including the humanities, STEM, and social sciences (Figure 2.2.5). MMLU has emerged as a premier benchmark for assessing LLM capabilities: Many state-of-the-art models like GPT-4, Claude 2, and Gemini have been evaluated against MMLU.
In early 2023, GPT-4 posted a state-of-the-art score on MMLU, later surpassed by Google’s Gemini Ultra. Figure 2.2.6 highlights the top model scores on the MMLU benchmark in different years. The scores reported are the averages across the test set. As of January 2024, Gemini Ultra holds the top score of 90.0%, marking a 14.8 percentage point improvement since 2022 and a 57.6 percentage point increase since MMLU’s inception in 2019. Gemini Ultra’s score was the first to surpass MMLU’s human baseline of 89.8%.
Generation
In generation tasks, AI models are tested on their ability to produce fluent and practical language responses.
Chatbot Arena Leaderboard
The rise of capable LLMs has made it increasingly important to understand which models are preferred by the general public. Launched in 2023, the Chatbot Arena Leaderboard is one of the first comprehensive evaluations of public LLM preference. The leaderboard allows users to query two anonymous models and vote for the preferred generations (Figure 2.2.7). As of early 2024, the platform has garnered over 200,000 votes, and users ranked OpenAI’s GPT-4 Turbo as the most preferred model (Figure 2.2.8).
Factuality and Truthfulness
Despite remarkable achievements, LLMs remain susceptible to factual inaccuracies and content hallucination—creating seemingly realistic, yet false, information. The presence of real-world instances where LLMs have produced hallucinations—in court cases, for example—underscores the growing necessity of closely monitoring trends in LLM factuality.
TruthfulQA
Introduced at ACL 2022, TruthfulQA is a benchmark designed to evaluate the truthfulness of LLMs in generating answers to questions. This benchmark comprises approximately 800 questions across 38 categories, including health, politics, and finance. Many questions are crafted to challenge commonly held misconceptions, which typically lead humans to answer incorrectly (Figure 2.2.9). Although one of the paper’s observations is that larger models tend to be less truthful, GPT-4 (RLHF), released in early 2024, has achieved the highest performance thus far on the TruthfulQA benchmark, with a score of 0.6 (Figure 2.2.10). This score is nearly three times higher than that of a GPT-2-based model tested in 2021, indicating that LLMs are becoming progressively better at providing truthful answers.
HaluEval
As previously mentioned, LLMs are prone to hallucinations, a concerning trait given their widespread deployment in critical fields such as law and medicine. While existing research has aimed to understand the causes of hallucinations, less effort has been directed toward assessing the frequency of LLM hallucinations and identifying specific content areas where they are especially vulnerable.
HaluEval, introduced in 2023, is a new benchmark designed to assess hallucinations in LLMs. It includes over 35,000 samples, both hallucinated and normal, for analysis and evaluation by LLMs (Figure 2.2.11). The research indicates that ChatGPT fabricates unverifiable information in approximately 19.5% of its responses, with these fabrications spanning a variety of topics such as language, climate, and technology. Furthermore, the study examines how well current LLMs can detect hallucinations. Figure 2.2.12 illustrates the performance of leading LLMs in identifying hallucinations across various tasks, including question answering, knowledge-grounded dialogue, and text summarization. The findings reveal that many LLMs struggle with these tasks, highlighting that hallucination is a significant ongoing issue.
2.3 Coding
Generation
On many coding tasks, AI models are challenged to generate usable code or to solve computer science problems.
HumanEval
HumanEval, a benchmark for evaluating AI systems’ coding ability, was introduced by OpenAI researchers in 2021. It consists of 164 challenging handwritten programming problems (Figure 2.3.1). A GPT-4 model variant (AgentCoder) currently leads in HumanEval performance, scoring 96.3%, which is an 11.2 percentage point increase from the highest score in 2022 (Figure 2.3.2). Since 2021, performance on HumanEval has increased 64.1 percentage points.
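HumanEval results are typically reported with the pass@k metric from the original OpenAI paper: generate n candidate solutions per problem, count how many pass the unit tests, and estimate the probability that at least one of k samples would pass. Below is a small sketch of that unbiased estimator; the sample counts are placeholders.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn for a problem, c of them passed the tests."""
    if n - c < k:                                # every size-k subset contains at least one pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 20 samples generated for one problem, 3 pass the tests.
print(round(pass_at_k(n=20, c=3, k=1), 3))       # estimated pass@1
print(round(pass_at_k(n=20, c=3, k=10), 3))      # estimated pass@10
```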
SWE-bench
As AI systems’ coding capabilities improve, it has become increasingly important to benchmark models on more challenging tasks. In October 2023, researchers introduced SWE-bench, a dataset comprising 2,294 software engineering problems sourced from real GitHub issues and popular Python repositories (Figure 2.3.3). SWE-bench presents a tougher test for AI coding proficiency, demanding that systems coordinate changes across multiple functions, interact with various execution environments, and perform complex reasoning.
Even state-of-the-art LLMs face significant challenges with SWE-bench. Claude 2, the best-performing model, solved only 4.8% of the dataset’s problems (Figure 2.3.4).8 In 2023, the top-performing model on SWE-bench surpassed the best model from 2022 by 4.3 percentage points.
8 According to the SWE-bench leaderboard, unassisted systems have no assistance in finding the relevant files in the repository. Assisted systems operate under the “oracle” retrieval setting, which means the systems are provided with the list of files that were modified in the pull request.
2.4 Image Computer Vision and Image Generation
Computer vision allows machines to understand images and videos and to create realistic visuals from textual prompts or other inputs. This technology is widely used in fields such as autonomous driving, medical imaging, and video game development.
Generation
Image generation is the task of generating images that are indistinguishable from real ones. Today’s image generators are so advanced that most people struggle to differentiate between AI-generated images and actual images of human faces (Figure 2.4.1). Figure 2.4.2 highlights several generations from various Midjourney model variants from 2022 to 2024 for the prompt “a hyper-realistic image of Harry Potter.” The progression demonstrates the significant improvement in Midjourney’s ability to generate hyper-realistic images over a two-year period. In 2022, the model produced cartoonish and inaccurate renderings of Harry Potter, but by 2024, it could create startlingly realistic depictions.
Midjourney generations over time: “a hyper-realistic image of Harry Potter” Source: Midjourney, 2023 | Chart: 2024 AI Index report
HEIM: Holistic Evaluation of Text-to-Image Models
The rapid progress of AI text-to-image systems has prompted the development of more sophisticated evaluation methods. In 2023, Stanford researchers introduced the Holistic Evaluation of Text-to-Image Models (HEIM), a benchmark designed to comprehensively assess image generators across 12 key aspects crucial for real-world deployment, such as image-text alignment, image quality, and aesthetics.9 Human evaluators are used to rate the models, a crucial feature since many automated metrics struggle to accurately assess various aspects of images.
HEIM’s findings indicate that no single model excels in all criteria. For human evaluation of image-text alignment (assessing how well the generated image matches the input text), OpenAI’s DALL-E 2 scores highest (Figure 2.4.3). In terms of image quality (gauging if the images resemble real photographs), aesthetics (evaluating the visual appeal), and originality (a measure of novel image generation and avoidance of copyright infringement), the Stable Diffusion–based Dreamlike Photoreal model ranks highest (Figure 2.4.4).
9 The 12 evaluation aspects of HEIM are: (1) Alignment: How closely does the image align with the given text? (2) Quality: What is the quality of the produced image? (3) Aesthetic: How aesthetically pleasing is the generated image? (4) Originality: How original is the image? (5) Reasoning: Does the model understand objects, counts, and spatial relations? (6) Knowledge: Does the model have knowledge about the world? (7) Bias: Are the generated images biased? (8) Toxicity: Are the generated images toxic or inappropriate? (9) Fairness: Do the generated images exhibit performance disparities? (10) Robust: Is the model robust to input perturbations? (11) Multilinguality: Does the model support non-English languages? (12) Efficiency: How fast is model inference?
Highlighted Research: MVDream
Creating 3D geometries or models from text prompts has been a significant challenge for AI researchers, with existing models struggling with problems such as the multi-face Janus issue (inaccurately regenerating context described by text prompts) and content drift (inconsistency across different 3D views). MVDream is a new 3D generation system developed by researchers from ByteDance and the University of California, San Diego, that overcomes some of these hurdles (Figure 2.4.5). In quantitative evaluations, MVDream’s generated models achieve Inception Score (IS) and CLIP scores comparable to those of the training set, indicating the high quality of the generated images (Figure 2.4.6). MVDream has major implications, especially for creative industries where 3D content creation is traditionally time-consuming and labor-intensive.
Instruction-Following
In computer vision, instruction-following is the capacity of vision-language models to interpret text-based directives related to images. For instance, an AI system could be given an image of various ingredients and tasked with suggesting how to use them to prepare a healthy meal. Capable instruction-following vision-language models are necessary for developing advanced AI assistants.
VisIT-Bench
In 2023, a team of industry and academic researchers introduced VisIT-Bench, a benchmark consisting of 592 challenging vision-language instructions across about 70 instruction categories, such as plot analysis, art knowledge, and location understanding (Figure 2.4.8). As of January 2024, the leading model on VisIT-Bench is GPT-4V, the vision-enabled variant of GPT-4 Turbo, with an Elo score of 1,349, marginally surpassing the human reference score for VisIT-Bench (Figure 2.4.9).
Editing
Image editing involves using AI to modify images based on text prompts. This AI-assisted approach has broad real-world applications in fields such as engineering, industrial design, and filmmaking.
EditVal
Despite the promise of text-guided image editing, few robust methods can evaluate how accurately AI image editors adhere to editing prompts. EditVal, a new benchmark for assessing text-guided image editing, includes over 13 edit types, such as adding objects or changing their positions, across 19 object classes (Figure 2.4.10). The benchmark was applied to evaluate eight leading text-guided image editing methods, including SINE and Null-text. Performance improvements since 2021 on a variety of the benchmark’s editing tasks are shown in Figure 2.4.11.
Highlighted Research: ControlNet
Conditioning inputs or performing conditional control refers to the process of guiding the output created by an image generator by specifying certain conditions that a generated image must meet. Existing text-to-image models often lack precise control over the spatial composition of an image, making it difficult to use prompts alone to generate images with complex layouts, diverse shapes, and specific poses. Fine-tuning these models for greater compositional control by training them on additional images is theoretically feasible, but many specialized datasets, such as those for human poses, are not large enough to support successful training.
In 2023, researchers from Stanford introduced a new model, ControlNet, that improves conditional control editing for large text-to-image diffusion models (Figure 2.4.12). ControlNet stands out for its ability to handle various conditioning inputs. Compared to other models released in 2022, human raters prefer ControlNet for both superior image quality and better condition fidelity (Figure 2.4.13). The introduction of ControlNet is a significant step toward creating advanced text-to-image generators capable of editing images to more accurately replicate the complex scenes frequently encountered in the real world.
Highlighted Research: Instruct-NeRF2NeRF
New models can edit 3D geometries using only text instructions. Instruct-NeRF2NeRF is a model developed by Berkeley researchers that employs an image-conditioned diffusion model for iterative text-based editing of 3D geometries (Figure 2.4.14). This method efficiently generates new, edited images that adhere to textual instructions, achieving greater consistency than current leading methods (Figure 2.4.15).
Segmentation
Segmentation involves assigning individual image pixels to specific categories (for example, human, bicycle, or street).
Highlighted Research: Segment Anything
In 2023, Meta researchers launched Segment Anything, a project that featured the Segment Anything Model (SAM) and the extensive SA-1B dataset for image segmentation. SAM is remarkable for being one of the first broadly generalizable segmentation models that performs well zero-shot on new tasks and distributions. Segment Anything outperforms leading segmentation methods like RITM on 16 of 23 segmentation datasets (Figure 2.4.17). The metric on which Segment Anything is evaluated is mean Intersection over Union (mIoU).
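Mean IoU averages, over evaluation masks, the ratio of the overlap between predicted and ground-truth masks to their union. The toy masks below are invented purely to illustrate the computation; they are not drawn from SA-1B or the paper’s evaluation.

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over Union for two binary masks of the same shape."""
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / union if union > 0 else 1.0

# Toy example: two 4x4 masks (1 = object pixel, 0 = background).
pred   = np.array([[0,1,1,0],[0,1,1,0],[0,0,0,0],[0,0,0,0]])
target = np.array([[0,1,1,1],[0,1,1,1],[0,0,0,0],[0,0,0,0]])
print(round(iou(pred, target), 3))   # 4 overlapping pixels / 6 in the union ≈ 0.667

# mIoU over a dataset is simply the mean of the per-mask IoU values.
```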
Meta’s Segment Anything model was then used, alongside human annotators, to create the SA-1B dataset, which included over 1 billion segmentation masks across 11 million images (Figure 2.4.16). A new segmentation dataset of this size will accelerate the training of future segmentation models. Segment Anything demonstrates how AI models can be used alongside humans to more efficiently create large datasets, which in turn can be used to train even better AI systems.
3D Reconstruction From Images
3D image reconstruction is the process of creating three-dimensional digital geometries from two-dimensional images. This type of reconstruction can be used in medical imaging, robotics, and virtual reality.
Highlighted Research: Skoltech3D
Data scarcity often hinders the development of AI systems for specific tasks. In 2023, a team of international researchers introduced an extensive new dataset, Skoltech3D, for multiview 3D surface reconstruction (Figure 2.4.18). Encompassing 1.4 million images of 107 scenes captured from 100 different viewpoints under 14 distinct lighting conditions, this dataset represents a major improvement over existing 3D reconstruction datasets (Figure 2.4.19).
Highlighted Research: RealFusion
RealFusion, developed by Oxford researchers, is a new method for generating complete 3D models of objects from single images, overcoming the challenge that a single image often contains insufficient information for full 360-degree reconstruction. RealFusion uses existing 2D image generators to produce multiple views of an object and then assembles these views into a comprehensive 360-degree model (Figure 2.4.20). This technique yields more accurate 3D reconstructions than state-of-the-art methods from 2021 (Shelf-Supervised) across a wide range of objects (Figure 2.4.21).
2.5 Video Computer Vision and Video Generation
Generation
Video generation involves the use of AI to generate videos from text or images.
UCF101
UCF101 is an action recognition dataset of realistic action videos that contain 101 action categories (Figure 2.5.1). More recently, UCF101 has been used to benchmark video generators. This year’s top model, W.A.L.T-XL, posted an FVD16 score of 36, more than halving the state-of-the-art score posted the previous year (Figure 2.5.2).
Highlighted Research: Align Your Latents
Most existing methods can only create short, low-resolution videos. To address this limitation, an international team of researchers has applied latent diffusion models, traditionally used for generating high-quality images, to produce high-resolution videos (Figure 2.5.3). Their Latent Diffusion Model (LDM) notably outperforms previous state-of-the-art methods released in 2022, like Long Video GAN (LVG), in resolution quality (Figure 2.5.4). The adaptation of a text-to-image architecture to create LDM, a highly effective text-to-video model, exemplifies how advanced AI techniques can be repurposed across different domains of computer vision. The LDM’s strong video generation capabilities have many real-world applications, such as creating realistic driving simulations.
Highlighted Research: Emu Video
Traditionally, progress in video generation has trailed that in image generation due to its higher complexity and the smaller datasets available for training. Emu Video, a new transformer-based video generation model created by Meta researchers, represents a significant step forward (Figure 2.5.5). Emu Video generates an image from text and then creates a video based on both the text and image. Figure 2.5.6 illustrates the degree to which the Emu Video model outperforms previously released state-of-the-art video generation methods. The metric is the proportion of cases in which human evaluators preferred Emu Video’s image quality or faithfulness to text instructions over the compared method. Emu Video simplifies the video generation process and signals a new era of high-quality video generation.
2.6 Reasoning
Reasoning in AI involves the ability of AI systems to draw logically valid conclusions from different forms of information. AI systems are increasingly being tested in diverse reasoning contexts, including visual (reasoning about images), moral (understanding moral dilemmas), and social reasoning (navigating social situations).10
General Reasoning
General reasoning pertains to AI systems being able to reason across broad, rather than specific, domains. As part of a general reasoning challenge, for example, an AI system might be asked to reason across multiple subjects rather than perform one narrow task (e.g., playing chess).
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
In recent years, the reasoning abilities of AI systems have advanced so much that traditional benchmarks like SQuAD (for textual reasoning) and VQA (for visual reasoning) have become saturated, indicating a need for more challenging reasoning tests.
Responding to this, researchers from the United States and Canada recently developed MMMU, the Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. MMMU comprises about 11,500 college-level questions from six core disciplines: art and design, business, science, health and medicine, humanities and social science, and technology and engineering (Figure 2.6.1). The question formats include charts, maps, tables, chemical structures, and more. MMMU is one of the most demanding tests of perception, knowledge, and reasoning in AI to date. As of January 2024, the highest-performing model is Gemini Ultra, which leads in all subject categories with an overall score of 59.4% (Figure 2.6.2).11 On most individual task categories, top models still fall well behind medium-level human experts (Figure 2.6.3). This relatively low score is evidence of MMMU’s effectiveness as a benchmark for assessing AI reasoning capabilities.
10 Some abilities highlighted in the previous sections implicitly involve some form of reasoning. This section highlights tasks that have a more specific reasoning focus.
11 The AI Index reports results from the MMMU validation set, as recommended by the paper authors for the most comprehensive coverage. According to the authors, the test set, with its unreleased labels and larger size, presents a more challenging yet unbiased benchmark for model performance, ensuring a more robust evaluation. The test set results are available on the MMMU page.
12 An asterisk (*) next to the model names indicates that the results were provided by the authors.
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
In the last year, researchers from NYU, Anthropic, and Meta introduced the GPQA benchmark to test general multisubject AI reasoning. This dataset consists of 448 difficult multiple-choice questions that cannot be easily answered by Google searching. The questions were crafted by subject-matter experts in various fields like biology, physics, and chemistry (Figure 2.6.4). PhD-level experts achieved a 65% accuracy rate in their respective domains on GPQA, while nonexpert humans scored around 34%. The best-performing AI model, GPT-4, only reached a score of 41.0% on the main test set (Figure 2.6.5).
Highlighted Research: Comparing Humans, GPT-4, and GPT-4V on Abstraction and Reasoning Tasks
Abstract reasoning involves using known information to solve unfamiliar and novel problems and is a key aspect of human cognition that is evident even in toddlers. While recent LLMs like GPT-4 have shown impressive performance, their capability for true abstract reasoning remains a hotly debated subject.13 To further explore this topic, researchers from the Santa Fe Institute tested GPT-4 on the ConceptARC benchmark, a collection of analogy puzzles designed to assess general abstract reasoning skills (Figure 2.6.6). The study revealed that GPT-4 significantly trails behind humans in abstract reasoning abilities: While humans score 95% on the benchmark, the best GPT-4 system only scores 69% (Figure 2.6.7). The development of truly general AI requires abstract reasoning capabilities. Therefore, it will be important to continue tracking progress in this area.
Mathematical Reasoning
Mathematical problem-solving benchmarks evaluate AI systems’ ability to reason mathematically. AI models can be tested with a range of math problems, from grade-school level to competition-standard mathematics.
GSM8K
GSM8K, a dataset comprising approximately 8,000 varied grade school math word problems, requires that AI models develop multistep solutions utilizing arithmetic operations (Figure 2.6.8). GSM8K has quickly become a favored benchmark for evaluating advanced LLMs. The top-performing model on GSM8K is a GPT-4 variant (GPT-4 Code Interpreter), which scores an accuracy of 97%, a 4.4% improvement from the state-of-the-art score in the previous year and a 30.4% improvement from 2022 when the benchmark was first introduced (Figure 2.6.9).
MATH
MATH is a dataset of 12,500 challenging competition-level mathematics problems introduced by UC Berkeley researchers in 2021 (Figure 2.6.10). AI systems struggled on MATH when it was first released, managing to solve only 6.9% of the problems. Performance has significantly improved. In 2023, a GPT-4-based model posted the top result, successfully solving 84.3% of the dataset’s problems (Figure 2.6.11).
PlanBench
A planning system receives a specified goal, an initial state, and a collection of actions. Each action is defined by preconditions, which must be met for the action to be executed, and the effects that result from the action’s execution. The system constructs a plan, comprising a series of actions, to achieve the goal from the initial state.
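To make the preconditions-and-effects formalism concrete, here is a toy, hand-rolled sketch of a single Blocksworld-style action. It is only an illustration of the formalism just described; PlanBench itself uses PDDL-style problem descriptions rather than Python.

```python
# Toy illustration of an action with preconditions and effects, in the spirit of
# the Blocksworld domain. This is not PlanBench's actual problem encoding.

state = {"clear(A)", "on(A, B)", "on_table(B)", "hand_empty"}

def unstack_A_from_B(state: set) -> set:
    """Pick up block A from the top of block B."""
    preconditions = {"clear(A)", "on(A, B)", "hand_empty"}
    if not preconditions <= state:                     # all preconditions must hold
        raise ValueError("preconditions not satisfied")
    add_effects = {"holding(A)", "clear(B)"}
    del_effects = {"on(A, B)", "hand_empty", "clear(A)"}
    return (state - del_effects) | add_effects         # apply effects to get the next state

print(sorted(unstack_A_from_B(state)))
```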
Claims have been made that LLMs can solve planning problems. A group from Arizona State University has proposed PlanBench, a benchmark suite containing problems used in the automated planning community, especially those used in the International Planning Competition. They tested I-GPT-3 and GPT-4 on 600 problems in the Blocksworld domain (where a hand tries to construct stacks of blocks but may move only one block at a time, either to the table or onto a clear block) using one-shot learning, and showed that GPT-4 generated correct and cost-optimal plans about 34% of the time, and I-GPT-3 about 6% (Figure 2.6.12). Verifying the correctness of a given plan, by contrast, is an easier task.
Visual Reasoning
Visual reasoning tests how well AI systems can reason across both visual and textual data.
Visual Commonsense Reasoning (VCR)
Introduced in 2019, the Visual Commonsense Reasoning (VCR) challenge tests the commonsense visual reasoning abilities of AI systems. In this challenge, AI systems not only answer questions based on images but also reason about the logic behind their answers (Figure 2.6.13). Performance in VCR is measured using the Q->AR score, which evaluates the machine’s ability to both select the correct answer to a question (Q->A) and choose the appropriate rationale behind that answer (Q->R). While AI systems have yet to outperform humans on this task, their capabilities are steadily improving. Between 2022 and 2023, there was a 7.93% increase in AI performance on the VCR challenge (Figure 2.6.14).
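Under this scoring scheme, an example is credited only when both the chosen answer and the chosen rationale are correct. The sketch below illustrates that joint-accuracy idea with made-up predictions; the VCR paper’s exact evaluation protocol may differ in detail.

```python
# Illustrative Q->AR scoring: an example counts only when both the answer (Q->A)
# and the supporting rationale (Q->R) are selected correctly. Data is made up.

predictions = [
    {"answer_correct": True,  "rationale_correct": True},
    {"answer_correct": True,  "rationale_correct": False},
    {"answer_correct": False, "rationale_correct": True},
    {"answer_correct": True,  "rationale_correct": True},
]

q_to_a  = sum(p["answer_correct"] for p in predictions) / len(predictions)
q_to_ar = sum(p["answer_correct"] and p["rationale_correct"]
              for p in predictions) / len(predictions)
print(f"Q->A: {q_to_a:.2f}   Q->AR: {q_to_ar:.2f}")   # 0.75 and 0.50 here
```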
Moral Reasoning
In the future, AI will be increasingly applied to domains where ethical considerations are crucial, such as in healthcare and judicial systems. Therefore, it is essential for AI systems to possess robust moral reasoning capabilities, enabling them to effectively navigate and reason about ethical principles and moral considerations.
MoCa
The ability of AI models to reason in linguistic and visual domains is well established, yet their capacity for moral reasoning, especially moral reasoning that aligns with human moral judgments, is less understood.14 To further explore this topic, a team of Stanford researchers created a new dataset (MoCa) of human stories with moral elements (Figure 2.6.15). The researchers then presented a range of LLMs with these stories of human actions and prompted the models to respond, measuring moral agreement with the discrete agreement metric: A higher score indicates closer alignment with human moral judgment. The study yielded intriguing results. No model perfectly matches human moral systems, but newer, larger models like GPT-4 and Claude show greater alignment with human moral sentiments than smaller models like GPT-3, suggesting that as AI models scale, they are gradually becoming more morally aligned with humans. Of all models surveyed, GPT-4 showed the greatest agreement with human moral sentiments (Figure 2.6.16).
Causal Reasoning
Causal reasoning assesses an AI system’s ability to understand cause-and-effect relationships. As AI becomes increasingly ubiquitous, it has become important to evaluate whether AI models can not only explain their outputs but also update their conclusions—key aspects of causal reasoning.
BigToM
Assessing whether LLMs have theory-of-mind (ToM) capabilities—understanding and attributing mental states such as beliefs, intentions, and emotions—has traditionally challenged AI researchers. Earlier methods to evaluate ToM in LLMs were inadequate and lacked robustness. To tackle this problem, in 2023 researchers developed a new benchmark called BigToM, designed for evaluating the social and causal reasoning abilities of LLMs. BigToM, comprising 25 controls and 5,000 model-generated evaluations, has been rated by human evaluators as superior to existing ToM benchmarks. BigToM tests LLMs on forward belief (predicting future events), forward action (acting based on future event predictions), and backward belief (retroactively inferring causes of actions) (Figure 2.6.17).
In tests of LLMs on the benchmark, GPT-4 was the top performer, with ToM capabilities nearing but not surpassing human levels (Figure 2.6.18, Figure 2.6.19, and Figure 2.6.20). More specifically, as measured by accuracy in correctly inferring beliefs, GPT-4 closely matched human performance in forward belief and backward belief tasks and slightly surpassed humans in forward action tasks. Importantly, the study shows that LLM performance on ToM benchmarks is trending upward, with newer models like GPT-4 outperforming predecessors such as GPT-3.5 (released in 2022).
Highlighted Research: Tübingen Cause-Effect Pairs
Researchers from Microsoft and the University of Chicago have demonstrated that LLMs are effective causal reasoners. The team evaluated several recent LLMs, including GPT-4, using the Tübingen cause-effect pairs dataset. This benchmark comprises over 100 cause-and-effect pairs across 37 subdisciplines, testing AI systems’ ability to discern causal relationships (Figure 2.6.21). GPT-4’s performance, a 96% accuracy score, surpassed the previous year’s best by 13 percentage points (Figure 2.6.22). Notably, GPT-4 outperformed prior covariance-based AI models, which were explicitly trained for causal reasoning tasks. Furthermore, the researchers discovered that certain prompts, especially those designed to encourage helpfulness, can significantly enhance an LLM’s causal reasoning capabilities.
2.7 Audio
AI systems are adept at processing human speech, with audio capabilities that include transcribing spoken words to text and recognizing individual speakers. More recently, AI has advanced in generating synthetic audio content.
Generation
2023 marked a significant year in the field of audio generation, which involves creating synthetic audio content, ranging from human speech to music files. This advancement was highlighted by the release of several prominent audio generators, such as UniAudio, MusicGen, and MusicLM.
Highlighted Research: UniAudio
UniAudio is a new audio generation system that applies language modeling techniques to create audio content. UniAudio uniformly tokenizes all audio types and, like modern LLMs, employs next-token prediction for high-quality audio generation. UniAudio is capable of generating high-quality speech, sound, and music, and surpasses leading methods on tasks including text-to-speech, speech enhancement, and voice conversion (Figure 2.7.1). With 1 billion parameters and trained on 165,000 hours of audio, UniAudio exemplifies the efficacy of big data and self-supervision for audio generation.
Highlighted Research: MusicGen and MusicLM
Meta’s MusicGen is a novel audio generation model that also leverages the transformer architecture common in language models to generate audio. MusicGen enables users to specify text for a desired audio outcome and then fine-tune it using specific melodies. In comparative studies, MusicGen outshines other popular text-to-music models like Riffusion, Moûsai, and MusicLM across various generative music metrics. It boasts a lower FAD score, indicating more plausible music generation, a lower KL score for better alignment with reference music, and a higher CLAP score, reflecting greater adherence to textual descriptions of reference music (Figure 2.7.2). Human evaluators also favor MusicGen for its overall quality (OVL).
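The FAD score cited above is, in essence, a Fréchet distance between Gaussian fits to embeddings of reference and generated audio (standard FAD implementations derive those embeddings from a pretrained audio model); lower values mean the generated distribution sits closer to the reference. The sketch below shows only the distance computation, on random placeholder embeddings rather than real audio features.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two sets of embeddings."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):        # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))

# Placeholder embeddings standing in for features of reference vs. generated audio.
rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 8))
gen = rng.normal(loc=0.3, size=(200, 8))
print(round(frechet_distance(ref, gen), 3))
```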
Although MusicGen outperforms certain text-to-music models released earlier in the year, MusicLM is worth highlighting because its release was accompanied by the launch of MusicCaps, a state-of-the-art dataset of 5.5K music-text pairs. MusicCaps was used by the MusicGen researchers to benchmark the performance of their family of models. The emergence of new models like MusicGen, and of new music-text benchmarks like MusicCaps, highlights the expansion of generative AI beyond language and images into more diverse modalities like audio generation.
2.8 Agents
AI agents, autonomous or semiautonomous systems designed to operate within specific environments to accomplish goals, represent an exciting frontier in AI research. These agents have a diverse range of potential applications, from assisting in academic research and scheduling meetings to facilitating online shopping and vacation booking.
General Agents
This section highlights benchmarks and research into agents that can flexibly operate in general task environments.
AgentBench
AgentBench, a new benchmark designed for evaluating LLM-based agents, encompasses eight distinct interactive settings, including web browsing, online shopping, household management, puzzles, and digital card games (Figure 2.8.1). The study assessed over 25 LLM-based agents, including those built on OpenAI’s GPT-4, Anthropic’s Claude 2, and Meta’s Llama 2. GPT-4 emerged as the top performer, achieving an overall score of 4.01, significantly higher than Claude 2’s score of 2.49 (Figure 2.8.2). The research also suggests that LLMs released in 2023 outperform earlier versions in agentic settings. Additionally, the AgentBench team speculated that agents’ struggles on certain benchmark subsections can be attributed to their limited abilities in long-term reasoning, decision-making, and instruction-following.
Highlighted Research: Voyager
Recent research by Nvidia, Caltech, UT Austin, Stanford, and UW Madison demonstrates that existing LLMs like GPT-4 can be used to develop flexible agents capable of continuous learning. The team created Voyager, a GPT-4-based agent for Minecraft—a complex video game with no set endpoint that is essentially a boundless virtual playground for its players (Figure 2.8.3). Voyager excels in this environment, adeptly remembering plans, adapting to new settings, and transferring knowledge. It significantly outperforms previous models, collecting 3.3 times more unique items, traveling 2.3 times further, and reaching key milestones 15.3 times faster (Figure 2.8.4).
The launch of Voyager is significant, as AI researchers have long faced challenges in creating agents that can explore, plan, and learn in open-ended worlds. While previous AI systems like AlphaZero succeeded in closed, rule-defined environments like chess, Go, and shogi, they struggled in more dynamic settings, lacking the ability to continuously learn. Voyager, however, demonstrates remarkable proficiency in a dynamic video game setting, thereby representing a notable advancement in the field of agentic AI.
Task-Specific Agents
This section highlights benchmarks and research into agents that are optimized to perform in specific task environments, such as mathematical problem-solving or academic research.
MLAgentBench
MLAgentBench, a new benchmark for evaluating AI research agents’ performance, tests whether AI agents are capable of engaging in scientific experimentation. More specifically, MLAgentBench assesses AI systems’ potential as computer science research assistants, evaluating their performance across 15 varied research tasks.15 Examples of the tasks include improving a baseline model on the CIFAR-10 image dataset and training a language model on over 10 million words in BabyLM. Various LLM-based agents, including GPT-4, Claude-1, AutoGPT, and LangChain, were tested. The results demonstrate that although there is promise in AI research agents, performance varies significantly across tasks. While some agents achieved over 80% on tasks like ogbn-arxiv (improving a baseline paper classification model), all scored 0% on BabyLM (training a small language model) (Figure 2.8.5). Among these, GPT-4 consistently delivered the best results.
15 The full tasks include: (1) CIFAR-10 (improve a baseline image classification model), (2) imdb (improve a baseline sentiment classification model), (3) ogbn-arxiv (improve a baseline paper classification model from scratch), (4) house prices (train a regression model), (5) spaceship titanic (train a classifier model from scratch), (6) Parkinson’s disease (train a time-series regression model), (7) FathomNet (train an out-of-distribution image classification model), (8) feedback (train an out-of-distribution text regression model), (9) identify contrails (train an out-of-distribution image segmentation model), (10) CLRS (model classic algorithms over graphs and lists), (11) BabyLM (train a language model over 10M words), (12) llama-inference (improve the runtime/autoregressive generation speed of Llama 7B), (13) vectorization (improve the inference speed of a model), (14) literature-review-tool (perform literature review), and (15) bibtex-generation (generate BibTeX from sketch).
2.9 Robotics
Over time, AI has become increasingly integrated into robotics, enhancing robots’ capabilities to perform complex tasks. Especially with the rise of foundation models, this integration allows robots to iteratively learn from their surroundings, adapt flexibly to new settings, and make autonomous decisions.
Highlighted Research: PaLM-E
PaLM-E is a new AI model from Google that merges robotics with language modeling to address real-world tasks like robotic manipulation and knowledge tasks like question answering and image captioning. Leveraging transformer-based architectures, the largest PaLM-E model is scaled up to 562B parameters. The model is trained on diverse visual language as well as robotics data, which results in superior performance on a variety of robotic benchmarks. PaLM-E also sets new standards in visual tasks like OK-VQA, excels in other language tasks, and can engage in chain-of-thought, mathematical, and multi-image reasoning, even without specific training in these areas. Figure 2.9.1 illustrates some of the tasks that the PaLM-E model can perform.
On Task and Motion Planning (TAMP) domains, where robots have to manipulate objects, PaLM-E outperforms previous state-of-the-art methods like SayCan and PaLI on both embodied visual question answering and planning (Figure 2.9.2).16 On robotic manipulation tasks, PaLM-E outperforms competing models (PaLI and CLIP-FT) in its ability to detect failures, which is a crucial step for robots to perform closed-loop planning (Figure 2.9.3).
PaLM-E is significant in that it demonstrates that language modeling techniques as well as text data can enhance the performance of AI systems in nonlanguage domains, like robotics. PaLM-E also highlights how there are already linguistically adept robots capable of real-world interaction and high-level reasoning. Developing these kinds of multifaceted robots is an essential step in creating more general robotic assistants that can, for example, assist in household work.
16 Embodied Visual Question Answering (Embodied VQA) is a task where agents need to navigate through 3D environments and answer questions about the objects they visually perceive in the environment.
Highlighted Research: RT-2
Real-world robots could benefit from certain capabilities possessed by LLMs, such as text and code generation, as well as visual understanding. RT-2, a new robotics model from Google DeepMind, represents an ambitious attempt to create a generalizable robotic model that has certain LLM capabilities. RT-2 uses a transformer-based architecture and is trained on both robotic trajectory data that is tokenized into text and extensive visual-language data.
RT-2 stands out as one of the most impressive and adaptable approaches for conditioning robotic policy. It outshines state-of-the-art models like Manipulation of Open-World Objects (MOO) across various benchmarks, particularly in tasks involving unseen objects. On such tasks, an RT-2/PaLM-E variant achieves an 80% success rate, significantly higher than MOO’s (53%) (Figure 2.9.4). In unseen object tasks, RT-2 surpasses the previous year’s state-of-the-art model, RT-1, by 43 percentage points. This indicates an improvement in robotic performance in novel environments over time.
2.10 Reinforcement Learning
In reinforcement learning, AI systems are trained to maximize performance on a given task by interactively learning from their prior actions. Systems are rewarded if they achieve a desired goal and punished if they fail.
Reinforcement Learning from Human Feedback
Reinforcement learning has gained popularity in enhancing state-of-the-art language models like GPT-4 and Llama 2. Introduced in 2017, Reinforcement Learning from Human Feedback (RLHF) incorporates human feedback into the reward function, enabling models to be trained for characteristics like helpfulness and harmlessness.
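One common way RLHF folds human feedback into the reward function is to fit a reward model on pairwise preference data with a Bradley–Terry-style loss, then optimize the policy against that learned reward. The snippet below sketches only the preference loss on made-up reward values; it is an illustrative formulation, not the training code of any particular model.

```python
import numpy as np

def preference_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected), averaged over pairs."""
    margin = r_chosen - r_rejected
    return float(np.mean(np.log1p(np.exp(-margin))))   # equals -mean(log sigmoid(margin))

# Hypothetical reward-model outputs for human-preferred vs. rejected responses.
r_chosen   = np.array([1.2, 0.4, 2.0])
r_rejected = np.array([0.3, 0.6, 1.1])
print(round(preference_loss(r_chosen, r_rejected), 4))
```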
This year, the AI Index tracked data on the number of foundation models using RLHF as part of their training. More specifically, the Index team looked through the technical reports and other documentation of all models included in CRFM’s Ecosystem graph, one of the most comprehensive repositories of the foundation model ecosystem.17 Figure 2.10.1 illustrates how many foundation models reported using RLHF over time. In 2021, no newly released foundation models used RLHF. In 2022, seven models reported using RLHF, and in 2023, 16 models reported using RLHF. The rising popularity of RLHF is also evidenced by the fact that many leading LLMs report improving their models with RLHF (Figure 2.10.2).
17 It is possible that more models use RLHF as part of their training than reported. The Index only tracks data for models that publicly report using RLHF.
Highlighted Research: RLAIF
RLHF is a powerful method for aligning AI models but can be hindered by the time and labor required to generate human preference datasets for model alignment. As an alternative, Reinforcement Learning from AI Feedback (RLAIF) uses reinforcement learning based on the preferences of LLMs to align other AI models toward human preferences.
Recent research from Google Research compares RLAIF with RLHF, the traditional gold standard, to assess whether RLAIF can serve as a reliable substitute. The study finds that both RLAIF and RLHF are preferred over supervised fine-tuning (SFT) for summarization and helpfulness tasks, and that there is no statistically significant difference in the degree to which each is preferred over SFT (Figure 2.10.3). Notably, in harmless dialogue generation tasks focused on producing the least harmful outputs, RLAIF (88%) surpasses RLHF (76%) in effectiveness (Figure 2.10.4). This research indicates that RLAIF could be a more resource-efficient and cost-effective approach for AI model alignment.
Highlighted Research: Direct Preference Optimization
As illustrated above, RLHF is a useful method for aligning LLMs with human preferences. However, RLHF requires substantial computational resources, involving the training of multiple language models and integrating LM policy sampling within training loops. This complexity can hinder its broader adoption.
In response, researchers from Stanford and CZ Biohub have developed a new algorithm for aligning models with preference data, named Direct Preference Optimization (DPO). DPO is simpler than RLHF but equally effective. The researchers show that DPO is as effective as other existing alignment methods, such as Proximal Policy Optimization (PPO) and supervised fine-tuning (SFT), on tasks like summarization (Figure 2.10.5). The emergence of techniques like DPO suggests that model alignment methods are becoming more straightforward and accessible.
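For reference, the DPO objective can be written directly as a loss over log-probabilities of the preferred and dispreferred responses under the policy and a frozen reference model, with no separate reward model or sampling loop. The sketch below evaluates that loss on placeholder log-probabilities; it is not the authors’ implementation.

```python
import numpy as np

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss: -log sigmoid(beta * [(preferred log-ratio) - (dispreferred log-ratio)])."""
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return float(np.mean(np.log1p(np.exp(-logits))))   # equals -mean(log sigmoid(logits))

# Placeholder sequence log-probabilities for preferred (w) and dispreferred (l) responses.
policy_w = np.array([-12.0, -9.5]);  policy_l = np.array([-14.0, -9.0])
ref_w    = np.array([-13.0, -10.0]); ref_l    = np.array([-13.5, -9.2])
print(round(dpo_loss(policy_w, policy_l, ref_w, ref_l), 4))
```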
2.11 Properties of LLMs
This section focuses on research exploring critical properties of LLMs, such as their capacity for sudden behavioral shifts and self-correction in reasoning. It is important to highlight these studies to develop an understanding of how LLMs, which are increasingly representative of the frontier of AI research, operate and behave.
Highlighted Research: Challenging the Notion of Emergent Behavior
Many papers have argued that LLMs exhibit emergent abilities, meaning they can unpredictably and suddenly display new capabilities at larger scales.18 This has raised concerns that even larger models could develop surprising, and perhaps uncontrollable, new abilities.
However, research from Stanford challenges this notion, arguing that the perceived emergence of new capabilities is often a reflection of the benchmarks used for evaluation rather than an inherent property of the models themselves. The researchers found that when nonlinear or discontinuous metrics like multiple-choice grading are used to evaluate models, emergent abilities seem more apparent. In contrast, when linear or continuous metrics are employed, these abilities largely vanish. Analyzing a suite of benchmarks from BIG-bench, a comprehensive LLM evaluation tool, the researchers noted emergent abilities on only five of the 39 benchmarks (Figure 2.11.1). These findings have important implications for AI safety and alignment research as they challenge a prevailing belief that AI models will inevitably learn new, unpredictable behaviors as they scale.
18 Some of these papers include Brown et al., 2023, Ganguli et al., 2022, Srivastava et al., 2022, and Wei et al., 2022.
Highlighted Research: Changes in LLM Performance Over Time
Publicly usable closed-source LLMs, such as GPT-4, Claude 2, and Gemini, are often updated over time by their developers in response to new data or user feedback. However, there is little research on how the performance of such models changes, if at all, in response to such updating.
A study conducted at Stanford and Berkeley explores the performance of certain publicly usable LLMs over time and highlights that, in fact, their performance can significantly vary. More specifically, the study compared the March and June 2023 versions of GPT-3.5 and GPT-4 and demonstrated that performance declined on several tasks. For instance, the June version of GPT-4, compared to the March version, was 42 percentage points worse at generating code, 16 percentage points worse at answering sensitive questions, and 33 percentage points worse on certain mathematical tasks (Figure 2.11.2). The researchers also found that GPT-4’s ability to follow instructions diminished over time, which potentially explains the broader performance declines. This research highlights that LLM performance can evolve over time and suggests that regular users should be mindful of such changes.
Highlighted Research: LLMs Are Poor Self-Correctors
It is generally understood that LLMs like GPT-4 have reasoning limitations and can sometimes produce hallucinations. One proposed solution to such issues is self-correction, whereby LLMs identify and correct their own reasoning flaws. As AI’s societal role grows, the concept of intrinsic self-correction—allowing LLMs to autonomously correct their reasoning without external guidance— is especially appealing. However, it is currently not well understood whether LLMs are in fact capable of this kind of self-correction.
Researchers from DeepMind and the University of Illinois at Urbana–Champaign tested GPT-4’s performance on three reasoning benchmarks: GSM8K (grade-school math), CommonSenseQA (common-sense reasoning), and HotpotQA (multidocument reasoning). They found that when the model was left to decide on self-correction without guidance, its performance declined across all tested benchmarks (Figure 2.11.3).
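A minimal sketch of the kind of intrinsic self-correction loop being evaluated appears below. It is an illustration of the setup, not the authors' exact protocol: `llm` is a hypothetical text-generation call and the prompts are paraphrased.

```python
# Sketch of an intrinsic self-correction loop (illustrative; `llm` is a
# hypothetical text-completion call, and the prompts are paraphrased).

def answer_with_self_correction(llm, question, rounds=1):
    answer = llm(f"Question: {question}\nAnswer step by step, then state the final answer.")
    for _ in range(rounds):
        # No external feedback: the model alone decides whether its answer is wrong.
        critique = llm(f"Question: {question}\nProposed answer: {answer}\n"
                       "Review the reasoning. If it is wrong, explain the error.")
        answer = llm(f"Question: {question}\nProposed answer: {answer}\n"
                     f"Critique: {critique}\nGive your final answer.")
    return answer

# The DeepMind/UIUC finding is that loops of this form, with no oracle feedback,
# tended to lower accuracy on GSM8K, CommonSenseQA, and HotpotQA.
```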
Closed vs. Open Model Performance
As LLMs become increasingly ubiquitous, debate intensifies over their varying degrees of accessibility. Some models, such as Google’s Gemini, remain closed, accessible solely to their developers. Others, like OpenAI’s GPT-4 and Anthropic’s Claude 2, offer limited access: they are publicly available via an API, but their weights are not released, so the models cannot be independently modified or fully scrutinized by the public. Conversely, Meta’s Llama 2 and Stability AI’s Stable Diffusion adopt an open approach, fully releasing their model weights. Open-source models can be modified and freely used by anyone.
Viewpoints differ on the merits of closed versus open AI models. Some argue in favor of open models, citing their ability to counteract market concentration, foster innovation, and enhance transparency within the AI ecosystem. Others contend that open-source models present considerable security risks, such as facilitating the creation of disinformation or bioweapons, and should therefore be approached with caution.
In the context of this debate, it is important to acknowledge that current evidence indicates a notable performance gap between open and closed models.19 Figures 2.11.4 and 2.11.5 juxtapose the performances of the top closed versus open model on a selection of benchmarks.20 On all selected benchmarks, closed models outperform open ones. Specifically, on 10 selected benchmarks, closed models achieved a median performance advantage of 24.2%, with differences ranging from as little as 4.0% on mathematical tasks like GSM8K to as much as 317.7% on agentic tasks like AgentBench.
19 By closed models, the AI Index is referring both to models that are fully closed and those with limited access.
20 The data in this section was collected in early January 2024.
2.12 Techniques for LLM Improvement
Prompting
Prompting, a vital aspect of the AI pipeline, entails supplying a model with natural language instructions that describe tasks the model should execute.
Mastering the art of crafting effective prompts can significantly enhance the performance of LLMs without requiring any changes to the underlying models.
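As a simple illustration (the task and prompts below are invented for this report, not drawn from any benchmark), the same model can often be steered toward better answers purely by rewording the instruction:

```python
# Two prompts for the same task; only the wording changes, not the model.
task = "A train leaves at 3:40 pm and arrives at 6:05 pm. How long is the trip?"

plain_prompt = task

# Adding explicit step-by-step instructions, a common prompting technique,
# often improves accuracy on reasoning tasks without touching the model itself.
structured_prompt = (
    f"{task}\n"
    "Think step by step: first compute the hours, then the minutes, "
    "then state the total duration as the final answer."
)
```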
Highlighted Research: Graph of Thoughts Prompting
Chain of Thought (CoT) and Tree of Thoughts (ToT) are prompting methods that can improve the performance of LLMs on reasoning tasks. In 2023, European researchers introduced another prompting method, Graph of Thoughts (GoT), that has also shown promise (Figure 2.12.1). GoT enables LLMs to model their thoughts in a more flexible, graph-like structure that more closely mirrors actual human reasoning. The researchers then designed a model architecture to implement GoT and found that, compared to ToT, it increased the quality of outputs by 62% on a sorting task while reducing cost by around 31% (Figure 2.12.2).
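The sketch below illustrates the kind of data structure GoT reasons over. It is not the authors' implementation, and in the actual method the LLM, not Python code, performs the generate and aggregate operations; the point is that a graph, unlike a chain or a tree, allows partial results to be merged.

```python
# Illustrative thought-graph structure for a GoT-style sorting task.
from dataclasses import dataclass, field

@dataclass
class Thought:
    content: list                                  # e.g., a (partially) sorted sublist
    parents: list = field(default_factory=list)    # merging allows more than one parent

problem = Thought(content=[7, 2, 9, 4, 1, 8])

# "Generate" operation: split the problem into smaller thoughts.
# (sorted() stands in for work the LLM would do in the real method.)
left = Thought(content=sorted(problem.content[:3]), parents=[problem])
right = Thought(content=sorted(problem.content[3:]), parents=[problem])

# "Aggregate" operation: merge two thoughts into one. This is the step a tree
# cannot express, since a tree node has exactly one parent.
merged = Thought(content=sorted(left.content + right.content),
                 parents=[left, right])

print(merged.content)  # [1, 2, 4, 7, 8, 9]
```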
Highlighted Research: Optimization by PROmpting (OPRO)
A paper from DeepMind introduced Optimization by PROmpting (OPRO), a method that uses LLMs to iteratively generate prompts that improve performance on a target task. OPRO uses natural language to guide LLMs in creating new prompts based on problem descriptions and previously generated solutions (Figure 2.12.3). The generated prompts aim to enhance the performance of AI systems on particular benchmarks. Compared to other prompting approaches like “let’s think step by step” or an empty starting point, OPRO leads to significantly greater accuracy on virtually all 23 BIG-bench Hard tasks (Figure 2.12.4).
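A minimal sketch of an OPRO-style loop, under stated assumptions: `llm_optimizer` (an LLM asked to propose a new instruction) and `evaluate_prompt` (which scores an instruction on a held-out set of tasks) are hypothetical helpers, and the meta-prompt wording is paraphrased rather than taken from the paper.

```python
# Sketch of an OPRO-style optimization loop (illustrative; `llm_optimizer`
# and `evaluate_prompt` are hypothetical helpers, not a published API).

def opro(llm_optimizer, evaluate_prompt, task_description, steps=20):
    trajectory = []  # (instruction, score) pairs shown back to the optimizer
    for _ in range(steps):
        history = "\n".join(f"Instruction: {p}\nScore: {s:.1f}"
                            for p, s in sorted(trajectory, key=lambda x: x[1]))
        meta_prompt = (
            f"Task: {task_description}\n"
            f"Previous instructions and their accuracies:\n{history}\n"
            "Write a new instruction that achieves a higher accuracy."
        )
        candidate = llm_optimizer(meta_prompt)
        score = evaluate_prompt(candidate)      # accuracy on a scoring set
        trajectory.append((candidate, score))
    return max(trajectory, key=lambda x: x[1])  # best instruction found
```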
Fine-Tuning
Fine-tuning has grown increasingly popular as a method of enhancing LLMs and involves further training or adjusting models on smaller datasets. Fine-tuning not only boosts overall model performance but also sharpens the model’s capabilities on specific tasks. It also allows for more precise control over the model’s behavior.
Highlighted Research: QLoRA
QLoRA, developed by researchers at the University of Washington in 2023, is a new method for more efficient model fine-tuning. It dramatically reduces memory usage, enabling the fine-tuning of a 65-billion-parameter model on a single 48 GB GPU while preserving full 16-bit fine-tuning performance. To put this in perspective, fine-tuning a 65B Llama model, a leading open-source LLM, typically requires about 780 GB of GPU memory, making QLoRA roughly 16 times more memory-efficient. QLoRA achieves this efficiency with techniques such as a 4-bit NormalFloat (NF4) data type, double quantization, and paged optimizers. The researchers used QLoRA to train a model named Guanaco, which matched or even surpassed models like ChatGPT on the Vicuna benchmark (a benchmark that ranks the outputs of LLMs) (Figure 2.12.5). Remarkably, the Guanaco models were created with just 24 hours of fine-tuning on a single GPU. QLoRA highlights how methods for optimizing and further improving models have become more efficient, meaning fewer resources will be required to create increasingly capable models.
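The sketch below shows how these ingredients are commonly combined in practice using the Hugging Face transformers and peft libraries; the model name and hyperparameters are illustrative choices, not the paper's exact recipe.

```python
# Sketch of a QLoRA-style setup with Hugging Face transformers + peft.
# Model name and hyperparameters are illustrative, not the paper's exact recipe.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # base weights stored in 4 bits
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # compute in 16-bit precision
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                 # illustrative; QLoRA scales to 65B
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(                  # small trainable adapters on top of
    r=16, lora_alpha=32,                   # the frozen 4-bit base model
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of weights
```

Paged optimizers, the third technique mentioned above, are likewise available in this ecosystem (for example, via the `paged_adamw_32bit` optimizer option in `TrainingArguments`).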
Attention
LLMs can flexibly handle various tasks but often demand substantial computational resources to train. As previously noted, high training costs can hinder AI’s broader adoption. Optimization methods aim to enhance AI’s efficiency by, for example, improving memory usage, thereby making LLMs more accessible and practical.
Highlighted Research: Flash-Decoding
Flash-Decoding, developed by Stanford researchers, tackles inefficiency in traditional LLMs by speeding up the attention mechanism, particularly in tasks requiring long sequences. It achieves this by loading keys and values in parallel across chunks, then separately rescaling and combining the partial results to maintain correct attention outputs (Figure 2.12.6). In various tests, Flash-Decoding outperforms other leading methods like PyTorch Eager and FlashAttention-2, showing much faster inference: for example, at a batch size of 256 and a sequence length of 256, Flash-Decoding is 48 times faster than PyTorch Eager and six times faster than FlashAttention-2 (Figure 2.12.7). Inference on models like ChatGPT can cost $0.01 per response, which becomes highly expensive when such models are deployed to millions of users. Innovations like Flash-Decoding are critical for reducing inference costs in AI.
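The numpy sketch below captures only the rescale-and-combine arithmetic described above, not the GPU-level parallelism: attention is computed over independent key/value chunks, and the partial results are combined exactly using each chunk's running maximum and normalizer.

```python
# Illustrative math behind split-KV attention combination (as used by
# Flash-Decoding-style approaches); real implementations run the chunks in
# parallel on the GPU. This only demonstrates that the combination is exact.
import numpy as np

def attention_reference(q, K, V):
    s = K @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V

def attention_split_kv(q, K, V, num_chunks=4):
    partials = []
    for Kc, Vc in zip(np.array_split(K, num_chunks), np.array_split(V, num_chunks)):
        s = Kc @ q / np.sqrt(q.shape[0])
        m = s.max()                        # per-chunk running max
        w = np.exp(s - m)
        partials.append((m, w.sum(), w @ Vc))
    m_all = max(m for m, _, _ in partials)
    num = sum(np.exp(m - m_all) * o for m, _, o in partials)
    den = sum(np.exp(m - m_all) * l for m, l, _ in partials)
    return num / den                       # identical to the reference output

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=16), rng.normal(size=(128, 16)), rng.normal(size=(128, 64))
assert np.allclose(attention_reference(q, K, V), attention_split_kv(q, K, V))
```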
2.13 Environmental Impact of AI Systems
This section examines trends in the environmental impact of AI systems, highlighting the evolving landscape of transparency and awareness. Historically, model developers seldom disclosed the carbon footprint of their AI systems, leaving researchers to make their best estimates. Recently, there has been a shift toward greater openness, particularly regarding the carbon costs of training AI models. However, disclosure of the environmental costs associated with inference, a potentially more significant concern, remains insufficient. This section presents data on carbon emissions as reported by developers in addition to featuring notable research exploring the intersection of AI and environmental impact. With AI models growing in size and becoming more widely used, it has never been more critical for the AI research community to diligently monitor and mitigate the environmental effects of AI systems.
General Environmental Impact
Training
Figure 2.13.1 presents the carbon emissions (in tonnes) of select LLMs during training, alongside human reference points. Emissions data for models marked with an asterisk were estimated by independent researchers, as they were not disclosed by the developers. Reported emissions vary widely. For instance, Meta’s Llama 2 70B model released approximately 291.2 tonnes of carbon, nearly 291 times the emissions of one traveler on a round-trip flight from New York to San Francisco, and roughly 16 times the annual carbon emissions of an average American.21 However, the emissions from Llama 2 are still lower than the 502 tonnes reportedly released during the training of OpenAI’s GPT-3.
The variance in emissions estimates is due to factors such as model size, data center energy efficiency, and the carbon intensity of the energy grid. Figure 2.13.2 shows the emissions of select models in relation to their size. Generally, larger models emit more carbon, a trend clearly visible in the Llama 2 series of models, all of which were trained on the same supercomputer (Meta’s Research Super Cluster). However, smaller models can still have high emissions if they are trained on grids powered by more carbon-intensive energy sources. Some estimates suggest that model emissions have declined over time, presumably due to increasingly efficient methods of model training. Figure 2.13.3 features the emissions of select models along with their power consumption.
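The spread in these estimates follows from the first-order accounting commonly used by independent researchers (an approximation stated here for orientation, not the Index's own methodology): training emissions scale with hardware power draw, training time, data center overhead, and the carbon intensity of the local grid.

\[
\mathrm{CO_2e\ (kg)} \;\approx\;
P_{\mathrm{hardware}}\ (\mathrm{kW}) \times t_{\mathrm{train}}\ (\mathrm{hours})
\times \mathrm{PUE} \times I_{\mathrm{grid}}\ (\mathrm{kg\,CO_2e/kWh})
\]

Because each factor can vary by an order of magnitude or more across training runs, reported and estimated figures differ widely even for models of similar size.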
A major challenge in evaluating the environmental impact of AI models is a lack of transparency about emissions. Consistent with findings from other studies, most prominent model developers do not report carbon emissions, hampering efforts to conduct thorough and accurate evaluations of this metric.22 For example, prominent developers such as OpenAI, Google, Anthropic, and Mistral do not report training emissions, although Meta does.
Inference
As highlighted earlier, the environmental impact of training AI models can be significant. While the per-query emissions of inference may be relatively low, the total impact can surpass that of training when models are queried thousands, if not millions, of times daily. Research on emissions from model inference remains scant. A study by Luccioni et al., published in 2023, is among the first to comprehensively assess inference emissions. Figure 2.13.4 illustrates the emissions from 1,000 inferences across various model tasks, revealing that tasks like image generation have a much higher carbon footprint than tasks like text classification.
22 Research also suggests that the reporting of carbon emissions on open model development platforms, such as Hugging Face, is declining over time.
Positive Use Cases
Despite the widely recognized environmental costs of training AI systems, AI can contribute positively to environmental sustainability. Figure 2.13.5 showcases a variety of recent cases where AI supports environmental efforts.23 These applications include enhancing thermal energy system management, improving pest control strategies, and boosting urban air quality.
Appendix
Acknowledgments
The AI Index would like to acknowledge Andrew Shi for his work doing a literature review on the environmental impact of AI models; Emily Capstick for her work studying the use of RLHF in machine learning models; Sukrut Oak for his work generating sample Midjourney generations; and Emma Williamson for her work identifying significant AI technical advancements for the timeline.
Benchmarks
- AgentBench: Data on AgentBench was taken from the AgentBench paper in January 2024. To learn more about AgentBench, please read the original paper.
- BigToM: Data on BigToM was taken from the BigToM paper in January 2024. To learn more about BigToM, please read the original paper.
- Chatbot Arena Leaderboard: Data on the Chatbot Arena Leaderboard was taken from the Chatbot Arena Leaderboard in January 2024. To learn more about the Chatbot Arena Leaderboard, please read the original paper.
- EditVal: Data on EditVal was taken from the EditVal paper in January 2024. To learn more about EditVal, please read the original paper.
- GPQA: Data on GPQA was taken from the GPQA paper in January 2024. To learn more about GPQA, please read the original paper.
- GSM8K: Data on GSM8K was taken from the GSM8K Papers With Code leaderboard in January 2024. To learn more about GSM8K, please read the original paper.
- HEIM: Data on HEIM was taken from the HEIM leaderboard in January 2024. To learn more about HEIM, please read the original paper.
- HELM: Data on HELM was taken from the HELM leaderboard in January 2024. To learn more about HELM, please read the original paper.
- HumanEval: Data on HumanEval was taken from the HumanEval Papers With Code leaderboard in January 2024. To learn more about HumanEval, please read the original paper.
- MATH: Data on MATH was taken from the MATH Papers With Code leaderboard in January 2024. To learn more about MATH, please read the original paper.
- MLAgentBench: Data on MLAgentBench was taken from the MLAgentBench paper in January 2024. To learn more about MLAgentBench, please read the original paper.
- MMLU: Data on MMLU was taken from the MMLU Papers With Code leaderboard in January 2024. To learn more about MMLU, please read the original paper.
- MMMU: Data on MMMU was taken from the MMMU leaderboard in January 2024. To learn more about MMMU, please read the original paper.
- MoCa: Data on MoCa was taken from the MoCa paper in January 2024. To learn more about MoCa, please read the original paper.
- PlanBench: Data on PlanBench was taken from the PlanBench paper in January 2024. To learn more about PlanBench, please read the original paper.
- SWE-bench: Data on SWE-bench was taken from the SWE-bench leaderboard in January 2024. To learn more about SWE-bench, please read the original paper.