Reward Model Evaluation in June 2025
Generated by GPT-4o to represent the role of reward models (carrot) in training RLHF policies (donkey).
Why should we care about Reward Model Evaluations?
Reward Models are critical components of Reinforcement Learning from Human Feedback (RLHF) pipelines - essentially acting as "referees" that tell policy models how to distinguish a preferred from a less preferred response. While verifiers can partly replace this role, Reward Models are generally much more compute-efficient and generalize beyond the domains of mathematics and coding (where there are known right answers). In fact, the early works on RLHF (Ouyang et al., 2022; Bai et al., 2022) were precisely designed to capture this elusive side of human preference (in aspects such as presentation style), against the background that most NLP evaluations (e.g. Question Answering, Natural Language Inference, Dialogue State Tracking) primarily optimized for "correctness".
A recent work (Huang et al., 2025) also finds that Reward Models can neatly complement verifiers in certain contexts (e.g. Math), because verifiers tend to miss correct answers (high precision, low recall) when they cannot recognize equivalent answers (e.g. 3 hours vs. 180 minutes). On the other hand, Reward Models can err by misclassifying incorrect answers as correct (a.k.a. reward hacking) - a low-precision, high-recall failure mode.
This means that Reward Models are likely here to stay as part of Reinforcement Learning pipelines - as indicated in recent technical reports such as those for Gemini 2.5, Qwen 3 and DeepSeek V3.
What do we refer to as "Reward Models"?
Conventional Reward Models (response → score)
Bradley-Terry Reward Models were the OG Reward Models (Ouyang et al., 2022; Bai et al., 2022). They are trained by maximizing the gap in rewards between chosen and rejected responses. During inference, they output a scalar score representing the quality of the response - the higher the score, the better.
While trained differently, Regression Reward Models (Wang et al., 2024; Wang et al., 2024) can be used in the same way as Bradley-Terry Reward Models. These models are trained to predict the quality of each response independently, typically on a Likert-5 or Likert-10 scale.
Pro: Fast inference (the only class of RMs fast enough to be commonly used in RL pipelines)
Con: Scores are not interpretable and can sometimes be hacked.
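To make the training objective concrete, here is a minimal sketch of the pairwise Bradley-Terry loss in PyTorch. It is an illustration rather than any paper's exact recipe: `reward_model` is a hypothetical module that maps tokenized prompt-response batches to one scalar score per example.

```python
import torch.nn.functional as F

def bradley_terry_loss(reward_model, chosen_batch, rejected_batch):
    """Pairwise loss that widens the score gap between chosen and rejected responses.

    reward_model: hypothetical module returning one scalar reward per example, shape (batch,).
    chosen_batch / rejected_batch: tokenized (prompt + response) inputs for the same prompts.
    """
    r_chosen = reward_model(**chosen_batch)      # (batch,)
    r_rejected = reward_model(**rejected_batch)  # (batch,)
    # -log sigmoid(r_chosen - r_rejected) is minimized as the chosen score exceeds the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```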
Prompted LLMs (response → critique → score)
Strong LLMs such as GPT-4 have a history of being reasonably accurate and reliable judges for grading the quality of responses (starting with Vicuna/MT/Alpaca Bench in early 2023). On many Reward Modelling leaderboards, they serve as baselines/references for understanding how well conventional Reward Models behave. Subsequently, works such as UltraFeedback used LLMs in similar ways to construct preference datasets.
Pro: No training needed.
Con: Very slow inference, can be expensive to call external API.
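As a rough illustration of this usage pattern (not the exact prompt of any particular benchmark), the sketch below calls an OpenAI-compatible chat API with a hypothetical 1-10 grading rubric:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an assistant's answer to a user prompt.
Rate the answer from 1 (poor) to 10 (excellent) and reply with only the number.

[Prompt]
{prompt}

[Answer]
{answer}"""

def prompted_llm_score(prompt: str, answer: str, model: str = "gpt-4o") -> int:
    """Ask a strong LLM to grade a single response; the rubric above is illustrative."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(prompt=prompt, answer=answer)}],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())
```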
Generative Reward Models (response → critique → score)
Generative Reward Models are the most recent class of reward models and can be used similarly to Prompted LLMs during inference. However, they are specifically trained to generate better critiques and more accurate scores, similar to Conventional Reward Models. While there are many ways to train them, an approach that recently caught attention (DeepSeek GRM) is to have the model generate a critique, then use ground-truth score/preference data (and its difference from the predicted score/preference) to compute a loss for the model to learn from.
Pro: Scores are the most robust as they combine the benefits of Conventional RMs and Prompted LLMs.
Con: Very slow inference (might be challenging to use in Reinforcement Learning pipelines).
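At inference time, a Generative Reward Model is used much like a judge: it generates a critique followed by a score, which is then parsed out. Below is a minimal sketch using Hugging Face transformers; the model name, prompt template and the "Score: X" output convention are assumptions for illustration, not the format of any specific released model.

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "some-org/generative-rm"  # hypothetical generative reward model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto")

def grm_score(prompt: str, response: str) -> float:
    """Generate a critique and parse the final score; assumes the model ends its output with 'Score: X'."""
    text = f"Evaluate the response to the prompt.\n\nPrompt: {prompt}\n\nResponse: {response}\n\nCritique:"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    critique = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    match = re.search(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", critique)
    return float(match.group(1)) if match else float("nan")
```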
Select Reward Model Evaluations
Note: given the number of papers related to reward models these days, it's virtually impossible to cover all evaluations - therefore, I cover some of the ones that I looked into.
RewardBench
RewardBench measures a macro-average of models' accuracy in assigning higher scores to chosen responses than to rejected responses across 4 categories: Chat, Chat-Hard, Safety and Reasoning (Math and Code). In many ways, RewardBench was a turning point for Reward Model evaluation. Prior to RewardBench, there were existing evaluation data, but these were typically validation subsets of popular training datasets such as Anthropic HH-RLHF, OpenAI Summarize and Stanford Human Preferences. This means the best way to score well on these datasets was to overfit on the accompanying training datasets, instead of training a better-generalizing Reward Model. On the other hand, there was no obvious way to game RewardBench as it comprises 20+ distinct tasks.
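The underlying metric is simple: per-pair accuracy of scoring the chosen response above the rejected one, averaged within each category and then macro-averaged across categories. A minimal sketch (with hypothetical field names; details such as per-subset weighting are omitted) is:

```python
from collections import defaultdict

def macro_average_accuracy(records):
    """records: iterable of dicts with hypothetical keys 'category',
    'score_chosen' and 'score_rejected' (scores from the reward model under test)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        if r["score_chosen"] > r["score_rejected"]:
            correct[r["category"]] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    return sum(per_category.values()) / len(per_category), per_category
```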
The other aspect of RewardBench was its transparent and reproducible design. Its maintainer Nathan did an amazing job creating and maintaining a leaderboard that later contained more than 100 models, alongside full documentation of the corresponding dataset and evaluation code. This made RewardBench a public leaderboard for people to "compete" on - similar to Chatbot Arena, (Super)GLUE and ImageNet. And where there is competition, there is often rapid progress (with the best performing model improving rapidly from 86% at launch in March 2024 to over 95% by December 2024).
However, with the rapid progress and transparency, limitations in RewardBench began to show:
- RewardBench consists of 20+ different tasks drawn from existing datasets, which means that it inherits the limitations of some of those datasets. For instance, ReWordBench shows that for Math prompts, the chosen and rejected responses have different formatting (i.e. \boxed{xxx} vs #### Answer: xxx). This made it possible for models to "guess" the answer from the formatting differences instead of having to compare the answers.
- Relatedly, on many subsets within the Chat-Hard category, the ground-truth preference was chosen based on GPT-4 labels. While this was not a big issue when Chat-Hard accuracy was around 50% and models were generally much weaker than GPT-4, it became a problem later in 2024, as models trained on GPT-4-labelled data showed imbalanced performance on GPT-4-labelled subsets vs Human-labelled subsets within Chat-Hard. This suggests that certain stylistic elements associated with GPT-4 generations could give away the preference choice.
- RewardBench begins to saturate as the performance of the top models goes beyond 95%. For the reasons discussed above, the actual ceiling is not going to be 100% given annotation errors (akin to the errors that exist in every benchmark, including ImageNet/GLUE). This means that it's no longer straightforward to tell whether models that score better on RewardBench are truly "better" in terms of Out-Of-Distribution performance or merely got lucky on a couple of samples. Personally, I think 95% is probably at or around the ground-truth annotation correctness level and hence, further optimizing Reward Models on RewardBench is likely to be counterproductive.
Nonetheless, RewardBench leaves behind a strong legacy for other Reward Model evaluations to live up to, and indeed its codebase and data were directly built upon by several follow-up works.
Paper: https://openreview.net/forum?id=XiConLcsqq#discussion (NAACL 2025 Findings) (Lambert et al., 2024)
Domains: Chat, Safety, Math and Code
Size: 3k pairs of responses (Good-sized)
Pro: First Out-of-Distribution, Diverse Evaluation set, Well-maintained public leaderboard
Con: Questionable Quality, English only, Saturated Performance
RM-Bench
RM-Bench maintains similar categories to RewardBench - Chat, Safety, Math and Code - but addresses many of its issues. Specifically, it addresses the style bias in the Chat category by mandating that both chosen and rejected responses come from the same model, with the rejected response produced by making targeted corruptions to the chosen response, sometimes differing by as little as one word. The cherry on top is that samples were manually checked prior to inclusion in the dataset.
Furthermore, it also controls for style by creating three versions of each response with similar content - concise, verbose, and verbose with markdown formatting. This makes the benchmark much harder by pitting well-formatted and/or verbose incorrect responses against concise correct responses. As a result, the overall top-performing model in the paper reaches only 70%, and is much weaker on the Hard subset (55%) - which is low considering that random guessing gives 50% accuracy.
Paper: https://openreview.net/forum?id=QEHrmQPBdd (ICLR 2025 Oral) (Liu et al., 2025)
Domains: Chat, Safety, Math and Code
Size: 1.33k * 3 pairs of responses ~ 4k pairs (Good-sized)
Pro: High Quality, Designed to Mitigate Stylistic Bias
Con: English only
JudgeBench
JudgeBench is designed to evaluate models in LLM-as-a-Judge settings and has slightly different categories compared with RM-Bench - General (Undergraduate-level) Knowledge, Logical Reasoning, Math and Code. Specifically, it curates prompts from established benchmarks (MMLU-Pro, LiveBench and LiveCodeBench) before asking a single strong model (GPT-4o or Claude 3.5 Sonnet) to generate multiple sampled responses. This indirectly controls for stylistic differences since only one generator model is used. Only questions that are sometimes answered correctly (judged against the ground-truth answer) and other times answered wrongly are kept.
Because such strong generator models are used, many conventional reward models do poorly on it, with the best achieving only 64% accuracy. The strongest performers on this benchmark are strong reasoning LLMs (e.g. o3-mini-high) that are prompted to first generate their own answer to the question before choosing one of the two candidate answers. This is not surprising, because the benchmark consists of many hard but verifiable STEM problems that reasoning models are optimized for. Generating its own response, which is often correct, can serve as a "silver standard" to grade candidate answers against.
An unfortunate drawback of this benchmark is that it only consists of 350 samples, which is small and makes the evaluation imprecise.
Paper: https://openreview.net/forum?id=G0dksFayVq (ICLR 2025) (Tan et al., 2025)
Domains: Knowledge, Logical Reasoning, Math and Code
Size: 350 pairs of responses (Small)
Pro: High Quality, Very Challenging
Con: Small, English-Only, Value proposition overlaps with Verifiers which can be used directly in RL pipelines.
M-RewardBench
M-RewardBench translates RewardBench into 20+ languages and evaluates performance on each language separately. In addition, it includes a new translation subset that evaluates Reward Models' capability to judge translation quality.
While this dataset addresses one of the main drawbacks of RewardBench (no multilinguality), its strong dependence on RewardBench unfortunately means that it directly inherits all of RewardBench's drawbacks (discussed in the RewardBench section above). This is unfortunate because the community would really benefit from a high-quality multilingual evaluation, but this is not it.
Paper: https://arxiv.org/abs/2410.15522 (ACL 2025) (Gureja et al., 2025)
Domains: Chat, Safety, Math, Code and Translation
Size: 24 languages * 3k pairs of responses ~ 72k pairs (Too large)
Pro: Multilingual
Con: Questionable Quality
RMB
RMB focuses on two categories (Helpfulness and Harmlessness), which broadly map to Chat and Safety from earlier benchmarks. While its prompt selection and response generation are done well, the primary limitation of RMB lies in the preference labelling process. Specifically, it only uses GPT-4-generated labels as the ground truth, leading to substantial stylistic bias in favour of models trained on GPT-4-generated data such as UltraFeedback. As an illustration, the top performing model is GPT-4o, while the best conventional RM is Starling-34B, which was trained on Nectar, another GPT-4-labelled dataset.
As a note, a key selling point of this dataset was how well it correlates with downstream RLHF model evaluations (such as Arena Hard). However, it's worth noting that these downstream evaluations also use a GPT-4 judge, which means they capture the same biases, such as self-preference bias.
Paper: https://openreview.net/forum?id=dw8uv3Rliz (ICLR 2025) (Zhou et al., 2025)
Domains: Chat and Safety
Size: 3k sets (pairs, triples or greater) of responses (good-sized)
Pro: Well-thought out prompt curation
Con: High Bias toward GPT-4 family or Models trained with GPT-4 data
PPE
PPE contains prompts mined from many different benchmarks including MMLU-Pro, MATH, GPQA, MBPP+ and IFEval. The authors then generate up to 512 responses per prompt using four small-sized models (Claude 3.5 Haiku, Llama 3 8B, GPT-4o-mini and Gemma 2 9B) and remove prompts that are either too easy or too hard (less than 10% or more than 90% accuracy).
The purpose of such a curation approach is essentially to favor reward models that can differentiate correct vs incorrect answers on these benchmarks and therefore optimize the downstream performance of RLHF models on them. This raises the question of whether Reward Models selected on these criteria are actually useful, since these tasks are verifiable on their own in RLHF pipelines.
The best model (Athene-RM-70B) reaches only 69% accuracy. Note that Athene-RM-70B was trained by the same group of researchers (associated with Chatbot Arena), and hence it's not clear whether it was trained on the same distribution as this evaluation dataset. This also draws attention to the possibility that it might not be hard to overfit on PPE by training on verified rollouts from its constituent datasets.
Paper: https://openreview.net/forum?id=cbttLtO94Q (ICLR 2025) (Frick et al., 2025)
Domains: General Knowledge, Math, Scientific MCQ, Code and Precise Instruction Following
Size: 18k sets (pairs, triples or greater) of responses (leaning large)
Pro: Thoughtfully constructed, diverse range of domains covered, Challenging
Cons: English-Only, Might be easy to overfit to, Value proposition overlaps with Verifiers which can be used directly in RL pipelines.
RewardBench 2
Our final evaluation dataset is RewardBench 2, produced by the team behind the original RewardBench. This version uses distinct prompts and categories - Factuality, Precise Instruction Following, Math, Safety, Focus and Ties. In all categories other than Ties, models need to score a chosen response higher than three rejected responses - making the random-choice success rate 25%, compared to 50% in RewardBench. Some categories (e.g. Math) are well-curated using a mixture of human-AI collaborative techniques, while other categories (e.g. Factuality, Focus and Safety) depend solely on an LLM to derive ground-truth labels. The remaining categories (Precise Instruction Following and Ties) have naturally verifiable solutions.
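Concretely, in the non-Ties categories a sample counts as correct only if the reward model scores the single chosen response above all three rejected ones. A minimal sketch of this best-of-4 accuracy (with hypothetical field names) is:

```python
def best_of_four_accuracy(samples):
    """samples: iterable of dicts with hypothetical keys
    'score_chosen' (float) and 'scores_rejected' (list of three floats)."""
    hits = sum(1 for s in samples if s["score_chosen"] > max(s["scores_rejected"]))
    return hits / len(samples)  # random scoring yields ~0.25
```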
Performance on RewardBench 2 is positively and linearly correlated with RewardBench with a great fit, but is on average around 20% lower. This solves the saturation problem, right? However, if we look closer at the leaderboard, a substantial chunk of the lower performance can be attributed to the Precise Instruction Following category (an example being "Answer without the letter u"), on which models commonly score around 40% when the average is >75%. This suggests that this partition is very challenging - and in my opinion, unnecessarily so, as such strict constraints are rarely useful in the real world.
In other categories such as Factuality, where ground-truth labels are derived from LLMs, there is also potential for the same self-preference bias as in the original RewardBench. For instance, looking at one datapoint, the chosen response in fact doesn't fulfill some of the criteria (i.e. it doesn't contain the word "married") even though the rejected responses do. This suggests that using an LLM alone to determine ground-truth answers can be inaccurate and misleading.
Finally, Ties is an innovative category that measures whether models score equally-correct answers (e.g. to "Name a color of the rainbow") higher than incorrect answers (e.g. "pink"). While the idea is interesting, I find the implementation rather cumbersome: it is the only category without a set of 1 correct vs. 3 incorrect answers, and it comes with a hard-to-understand scoring schema. That schema makes certain assumptions, e.g. that a good reward model should score equally-correct answers within a smaller range than the margin between the top incorrect answer and the bottom correct answer. I get why this would make sense intuitively, but having trained reward models of different types, it's not clear why this assumption would hold for reward models trained in certain ways (e.g. pairwise Bradley-Terry).
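Under my reading, that assumption amounts to something like the check below (a hedged sketch over hypothetical score lists, not the benchmark's actual implementation): the spread among correct answers should be smaller than the margin separating the lowest-scored correct answer from the highest-scored incorrect one.

```python
def satisfies_ties_assumption(correct_scores, incorrect_scores):
    """correct_scores / incorrect_scores: reward-model scores for equally-correct
    and incorrect answers to the same prompt (hypothetical inputs)."""
    spread_correct = max(correct_scores) - min(correct_scores)
    margin = min(correct_scores) - max(incorrect_scores)
    # Every correct answer should beat every incorrect one, and by more than
    # the correct answers differ among themselves.
    return margin > 0 and spread_correct < margin
```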
I was also super surprised to see Gemini 2.5 Flash scoring much higher than Gemini 2.5 Pro (rank 1 vs 23 - yes, first vs twenty-third). Looking more closely at the Math category, the difference is astounding: 81.1 for Flash and 53.4 for Pro. I don't know about you, but this very much raises my eyebrows, as it's just difficult to believe that the Flash model is so much better as a math judge than the Pro.
In case it wasn't clear from my writing, I really wanted to like RewardBench 2 (having hoped for its release for the past half a year or so), but I don't think it's polished enough for now.
Paper: https://arxiv.org/pdf/2506.01937 (Malik et al., 2025)
Domains: Chat, Precise Instruction Following, Math, Safety and (Elementary) Knowledge
Size: 1.8k sets - typically of 4 responses, but the Ties category can have more than 30 responses (good-sized).
Pro: Innovative
Cons: Questionable quality, English only
Which Evaluation should I use?
To pick the best evaluation, I would recommend considering these criteria:
- Does it measure the capabilities you care about? Say you only want to measure how well reward models can be used to improve coding behavior - then use the evaluations that include the coding domain.
- Does it have downsides you can live with? No evaluation is perfect, so you have to make tradeoffs. Maybe a benchmark is English-only but your Reward Model application only cares about performance in English. In that case, it's not too big of a deal.
- Is it popular? This matters more if you're writing an academic paper. By using an evaluation that everyone else is using, there's a higher chance that reviewers will know the benchmark and not ask for it if it's missing. This also facilitates comparison with other works. The number of citations a paper has (discounted by how long it has been publicly available) can serve as an initial proxy.
Nonetheless, if you're looking for a TLDR recommendation for "general" Reward Modeling, my best recommendation currently (June 2025) is RM-Bench.
Have fun evaluating Reward Models! If you have thoughts or questions, feel free to reach me via email or LinkedIn.
If you found this useful, please cite this as:
Wang, Zhilin (Jun 2025). Reward Model Evaluation in June 2025. https://zhilin123.github.io.
or as a BibTeX entry:
@article{wang2025reward-model-evaluation-in-june-2025,
title = {Reward Model Evaluation in June 2025},
author = {Wang, Zhilin},
year = {2025},
month = {Jun},
url = {https://zhilin123.github.io/blog/2025/reward/}
}
References
- Ouyang et al. (2022). Training language models to follow instructions with human feedback.
- Bai et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
- Huang et al. (2025). Pitfalls of Rule- and Model-based Verifiers - A Case Study on Mathematical Reasoning.
- Wang et al. (2024). HelpSteer2: Open-source dataset for training top-performing reward models. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Wang et al. (2024). Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts. In Findings of the Association for Computational Linguistics: EMNLP 2024.
- Lambert et al. (2024). RewardBench: Evaluating Reward Models for Language Modeling.
- Liu et al. (2025). RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style. In The Thirteenth International Conference on Learning Representations.
- Tan et al. (2025). JudgeBench: A Benchmark for Evaluating LLM-Based Judges. In The Thirteenth International Conference on Learning Representations.
- Gureja et al. (2025). M-RewardBench: Evaluating Reward Models in Multilingual Settings.
- Zhou et al. (2025). RMB: Comprehensively benchmarking reward models in LLM alignment. In The Thirteenth International Conference on Learning Representations.
- Frick et al. (2025). How to Evaluate Reward Models for RLHF. In The Thirteenth International Conference on Learning Representations.
- Malik et al. (2025). RewardBench 2: Advancing Reward Model Evaluation.