(Header image generated by GPT-4o, depicting reward model benchmarks losing their role of demonstrating usefulness for RLHF.)

What inspired this thought? Hint: Skywork Reward v2

Just as I was about to break for dinner on the eve of the Fourth of July, I noticed a new technical report from the Skywork Reward v2 team (Liu et al., 2025) - the same team that once held the No. 1 position on RewardBench. After looking at the paper, my first reaction was shock - is Reward Modelling now an essentially solved problem, just like many other traditional NLP tasks that people no longer need to tackle today (e.g. Sentiment Classification, Dialogue State Tracking)? After all, the new Skywork-Reward-V2-Llama-3.1-8B-40M model substantially beats all other models on every benchmark it was evaluated on (RewardBench, PPE, RMB, RM-Bench, JudgeBench, RewardBench v2) - see my previous blog for more details on these benchmarks.

On RM-Bench, one of my preferred benchmarks recently, it hit 96% accuracy, up from the previous SOTA of 86%. The gains on other benchmarks were equally impressive: RMB went from the previous SOTA of 74% to 89%, while JudgeBench went from 80% to 83%. It's worth noting that the previous SOTA on JudgeBench was achieved by o3-mini-high, a proprietary model using inference-time scaling, whereas this model is a simple 8B Bradley-Terry model that requires the compute equivalent of decoding just 1 token and is usable on a CPU or a single gaming GPU, while o3-mini-high took at least thousands (and perhaps millions) of tokens for each sample. In comparison, the previous highest performance on JudgeBench by a Bradley-Terry reward model was 74%. Optional details: I later discovered that the Skywork paper incorrectly uses a macro-average for JudgeBench instead of the official micro-average implementation, so the exact numbers are not directly comparable, but the idea is the same - according to their own measurements in Table 2 of the technical report, this model is actually slightly behind o3-mini-high on JudgeBench. My own measurement of the Skywork-Reward-V2-Llama-3.1-8B-40M model's micro-average on JudgeBench is 82%, which is higher than o3-mini-high's reported 80.9% on the JudgeBench leaderboard.
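To make the macro- vs micro-average distinction concrete, here's a toy sketch; the category names and counts below are invented for illustration and are not JudgeBench's actual splits.

```python
# Toy illustration of micro- vs macro-averaged accuracy over benchmark categories.
# Category names and counts are made up; they are NOT JudgeBench's real splits.
categories = {
    # name: (num_correct, num_total)
    "knowledge": (120, 154),
    "reasoning": (60, 100),
    "coding": (20, 50),
}

micro = sum(c for c, _ in categories.values()) / sum(n for _, n in categories.values())
macro = sum(c / n for c, n in categories.values()) / len(categories)

print(f"micro-average: {micro:.3f}")  # weights every sample equally
print(f"macro-average: {macro:.3f}")  # weights every category equally
```

When category sizes differ, the two averages can diverge noticeably, which is why mixing them up makes leaderboard numbers hard to compare.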

From my perspective, this is a DeepSeek R1 moment for Reward Models. If this model is really as good as the numbers claim, there really isn't a need to train Bradley-Terry or even Generative Reward Models any more - we should all just use this model.

What’s the secret sauce behind Skywork v2?

I would highly encourage interested folks to read the paper directly (Liu et al., 2025), but the contribution can be boiled down to two main things:

  1. Scaling data - while previous training runs used on the order of <100k training samples, this paper used 40M training samples - more than a two-order-of-magnitude increase! While this is not the first time such large amounts of data were used (the Qwen WorldPM team did so 1.5 months earlier (Wang et al., 2025)), it's the first time that incredibly good benchmark results were obtained with this quantity of data.

  2. A unique human-AI collaboration system for efficient use of human annotations. For obvious reasons, not all 40M samples can be manually labelled, so they used an approach that identifies the samples providing the most information for training a 'gold reward model', which is then used to pseudo-label preference data from the wild.

Everything else is commonplace for Bradley-Terry training - just the vanilla loss function. There's no reasoning process either - just a regular Sequence Classifier with a linear layer on top of the final token's embedding.
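For readers less familiar with this setup, here's a minimal sketch of what such a Bradley-Terry sequence-classification reward model looks like in code. This is my own illustration using the HuggingFace interface, not the Skywork training code; the base model name is just a placeholder.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder base model; the actual Skywork recipe and hyperparameters may differ.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# num_labels=1 gives a single scalar reward head (a linear layer) applied to the
# hidden state of the final non-padding token.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

def bradley_terry_loss(chosen_texts: list[str], rejected_texts: list[str]) -> torch.Tensor:
    chosen = tokenizer(chosen_texts, return_tensors="pt", padding=True, truncation=True)
    rejected = tokenizer(rejected_texts, return_tensors="pt", padding=True, truncation=True)
    r_chosen = model(**chosen).logits.squeeze(-1)      # scalar reward per chosen sequence
    r_rejected = model(**rejected).logits.squeeze(-1)  # scalar reward per rejected sequence
    # Vanilla Bradley-Terry objective: -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

At inference time, scoring a response is a single forward pass - the 'one token-equivalent of compute' mentioned above.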

What made me think twice about the results?

  1. Unlike Skywork Reward v1, for which they released an 80k-sample curation of public preference datasets, they did not release the data and currently have no plans to do so (see this discussion).

  2. The results sound too good to be true. It's important to note that Skywork v2 reports astronomically large improvements over SOTA on each of the 7 reward modelling benchmarks. It's one thing to improve on some benchmarks, but having astronomically large improvements across all 7 when some are already saturated (e.g. RewardBench goes from 95.1% to 97.8% accuracy) is ever so slightly possible but not exactly plausible. In fact, even the Skywork Reward v2 technical report shows that performance on these benchmarks doesn't correlate well across benchmarks - PPE Preference has a 0.03 Pearson correlation with RewardBench v2 - yet somehow this model does well on every benchmark. The benchmark that raised my eyebrows the most was JudgeBench - it consists of reasoning-heavy prompts/responses from LiveCodeBench, LiveBench and MMLU-Pro that reasoning models are supposedly really good at, and not something that should be solved well with one token-equivalent of compute.

  3. Typically, most Reward Modelling papers that I read include some downstream evaluation of reward models in terms of how well they help align LLMs (e.g. via RLHF with PPO/REINFORCE/GRPO). This paper doesn't do that - meaning there's no way for us to tell whether these models are actually useful for downstream applications (the most important being RLHF).

Some context about Skywork Reward Preference set v0.1 (Oct 2024)

When the preference set was released in Sept 2024, it was called Skywork-Reward-Preference-80K, with the v0.1 label appended to it after v0.2 was introduced. The reason for introducing v0.2 was that v0.1 was shown by Nathan Lambert to be "accidentally overlapping" with RewardBench, which it topped at the time. The source of contamination was a Magpie dataset, explained by Nathan as: "What seems likely is that Meta trained on some of these prompts, but the exact provenance of each prompt needs more examination." Hence, v0.2 removes samples that have n-gram overlap (above 7-gram) with the RewardBench samples. It's difficult to pinpoint intention here, but this saga shows that the Skywork team knows how to select existing preference data in order to do well on a popular reward modelling benchmark.

Some hypotheses about what might have happened with Skywork Reward v2

  1. The Skywork team took what they found useful in curating the first preference dataset and doubled down to make it perform better on Reward Modelling Benchmarks. Part of this involves selecting existing preference data (e.g. 40M preference samples from the wild) in order to train a better reward model.

  2. There is potentially contamination between the training preference data and the benchmark data. With the exception of RewardBench v2 (released Jun 2025), nearly all of the benchmarks had openly released their prompts, chosen responses and rejected responses by Oct 2024 - and some of them might have been available indirectly before that. Similar de-contamination approaches could have been applied (i.e. filtering samples with common n-grams), but it's not clear how effective this is. For instance, getting an LLM to paraphrase a prompt so that it asks essentially the same thing with different phrasing is trivial nowadays. For some types of questions (e.g. multiple-choice questions), n-gram matching can miss samples that either have shuffled choices or use different delimiters such as a. instead of A) - see the sketch after this list.

  3. Part of the contamination could come from the interaction between the base model (Llama 3.1 8B) and the preference data. Comparing Qwen3 8B to Llama 3.1 8B trained on the exact same preference data, Llama 3.1 8B (which came out roughly 1 year earlier) achieves a 6.4% higher average across the 7 benchmarks - on RM-Bench the gap is 13.4%, while on JudgeBench the gap is 10%.
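To make the limitation in point 2 concrete, here's a toy sketch of token-level n-gram overlap checking. The 7-gram threshold mirrors the one described for Skywork's v0.2 cleanup, while the example strings and the whitespace tokenization are my own invention.

```python
# Sketch of why n-gram decontamination can miss near-duplicates.
def ngrams(text: str, n: int = 7) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shares_ngram(a: str, b: str, n: int = 7) -> bool:
    return len(ngrams(a, n) & ngrams(b, n)) > 0

benchmark_prompt = "Which of the following is a prime number? A) 21 B) 33 C) 29 D) 27"
paraphrased = "Identify the prime number among these options: a. 21 b. 33 c. 29 d. 27"

print(shares_ngram(benchmark_prompt, benchmark_prompt))  # True: exact copies are caught
print(shares_ngram(benchmark_prompt, paraphrased))       # False: paraphrase + new delimiters slip through
```

An exact copy shares plenty of 7-grams and is filtered, but a light paraphrase with different multiple-choice delimiters shares none and slips straight through.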

However, given that Skywork Reward v2 didn’t release the preference dataset, there’s no straightforward way to figure out which of the hypotheses holds more water in practice.

Testing our hypotheses indirectly

While there’s no direct way to these the hypotheses above, I found an indirect way to do this. Within JudgeBench, the knowledge subset comprises of prompts and responses from MMLU-pro. Looking closely at the prompts, they share a common instruction at the end: “Once you have your answer, please duplicate that letter five times in a single string. For example, if the answer is K, then write KKKKK.” See example here. This is likely an artifact of JudgeBench creators hoping to make answer extraction more robust. However, this instruction isn’t actually required for answering the question. Therefore, even if we remove this instruction and also remove the 5 letter string in the responses (e.g. AAAAA, BBBBB 
 JJJJJ as MMLU-pro has up to 10 answers represented by prefixed with one of the first 10 letters), it should not affect the accuracy of scoring correct vs incorrect answers (since the response will also contain a singular answer letter before this 5 repeated-letter string).
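Concretely, the removal can be done with a couple of regular expressions. This is a sketch of my own approximation of the ablation; the patterns below are illustrative rather than taken from the JudgeBench source.

```python
import re

# The fixed instruction appended to JudgeBench-Knowledge prompts (quoted above).
INSTRUCTION_PATTERN = re.compile(
    r"Once you have your answer, please duplicate that letter five times in a single string\. "
    r"For example, if the answer is K, then write KKKKK\."
)
# A standalone run of five identical letters A-J (MMLU-Pro has up to 10 options).
REPEATED_LETTER_PATTERN = re.compile(r"\b([A-J])\1{4}\b")

def strip_artifact(prompt: str, response: str) -> tuple[str, str]:
    """Remove the duplication instruction from the prompt and the AAAAA-style string from the response."""
    clean_prompt = INSTRUCTION_PATTERN.sub("", prompt).strip()
    clean_response = REPEATED_LETTER_PATTERN.sub("", response).strip()
    return clean_prompt, clean_response
```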

Reward models that are not overfitted to the JudgeBench-Knowledge prompt template should have the same score before and after removing this instruction. If there's a substantial drop in the score, it suggests overfitting to the JudgeBench Knowledge category. Btw, this doesn't mean overfitting to MMLU-Pro (which is a separate topic altogether), since MMLU-Pro's instruction doesn't have this 5-letter repetition (see here). As a note, the Knowledge subset only has 154 examples, so each sample accounts for 0.65% of the overall accuracy. Let's try this out on a few 8B reward models trained by Skywork!
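Here's a minimal sketch of how the before/after pairwise accuracies can be computed, assuming these reward models expose the standard HuggingFace sequence-classification interface (field names and prompt formatting are simplified; in practice the model's chat template should be applied):

```python
import torch

@torch.no_grad()
def score(model, tokenizer, prompt: str, response: str) -> float:
    # Simplified formatting; in practice, apply the model's chat template.
    inputs = tokenizer(prompt + "\n" + response, return_tensors="pt", truncation=True)
    return model(**inputs).logits[0, 0].item()

def pairwise_accuracy(model, tokenizer, samples, transform=None) -> float:
    correct = 0
    for s in samples:  # each sample has "prompt", "chosen" and "rejected" fields
        prompt, chosen, rejected = s["prompt"], s["chosen"], s["rejected"]
        if transform is not None:  # e.g. strip_artifact from the sketch above
            prompt, chosen = transform(s["prompt"], chosen)
            _, rejected = transform(s["prompt"], rejected)
        correct += score(model, tokenizer, prompt, chosen) > score(model, tokenizer, prompt, rejected)
    return correct / len(samples)  # with 154 samples, each pair is worth ~0.65%
```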

| Model Name | Original | After removing 5-letter string | Absolute Drop |
| --- | --- | --- | --- |
| Llama 3.1 8B v2-40M | 80.5 | 66.9 | 13.6 |
| Llama 3.1 8B v2 | 74.7 | 70.1 | 4.6 |
| Qwen3 8B v2 | 69.5 | 63.6 | 5.9 |
| Llama 3.1 8B v0.1 | 58.5 | 53.0 | 5.5 |
| Llama 3.1 8B v0.2 | 59.7 | 53.9 | 5.8 |

As seen in the results above, all of the models suffer some performance drop, but while the other models drop by around 5%, the flagship model in Skywork v2 dropped by 13.6%. This means it is substantially more overfitted on JudgeBench-Knowledge than the other similar-sized models released as part of the Skywork-Reward and Skywork-Reward-v2 series. The 14M additional preference samples used to train Llama 3.1 8B v2-40M beyond what was used for Llama 3.1 8B v2 gave it a substantial advantage, which disappeared (in fact turned into a regression) once this artifact was removed.

This is just one test that I thought of (and certainly not a perfect design), but I would imagine there are more ways to do similar investigative work without needing to see the Skywork Reward v2 data.

What this means for Open-Data Reward Model Evaluations

Data contamination is by no means new in ML/NLP, but for those working on either Reward Model Training or Evaluations, it's a topic that we have not yet addressed as a community. Current reward model evaluations (which essentially check that a higher score is allocated to a chosen response than to one or more rejected responses) can be easy to game, since all of them openly release their data on HuggingFace. The easiest way is to simply include these benchmark datasets as part of the training set, but there are definitely ways to do this without being so overt.

The Skywork Llama 3.1 8B v2-40M model can be thought of as one model that possibly (but not definitively) does this, based on the test conducted above. However, the underlying issue is that the reward modelling community doesn't hold 'Open-Weight' model developers like (but not only) Skywork accountable for data integrity. Researchers working on Open-Source reward models (e.g. those releasing a fully reproducible training recipe, including data) are unfairly disadvantaged, because being fully reproducible makes the incentive to 'game' the benchmarks much lower, as others can see and probe such attempts.

On the other hand, Open-Weight model developers have much stronger incentives to do so, as the chance of being discovered is much lower without visibility into the data. While I fully appreciate why some model developers don't want to go fully open-source (e.g. for business reasons), this means that as a field, we are implicitly 'rewarding' developers who do better on benchmarks without full transparency (potentially by gaming the benchmarks) - and thereby 'penalizing' developers who are more open. This is a classic instance of the market for lemons that George Akerlof famously introduced back in 1970 (Akerlof, 1970). Without active intervention by the reward modelling community, we might see open-source reward model developers who train reward models in rigorous and transparent ways exit the market altogether.

What can we do about it?

This is a complex problem without a straightforward solution.

One immediate solution to keep open-data Reward Modelling Benchmarks relevant is to have benchmark developers explicitly label models as either open-source, open-weight or closed-source. This is similar to the “data contamination” flag introduced by Nathan Lambert on RewardBench v1 but more inclusive since this doesn’t create an adverse incentive for model developers to simply hide away their training data.

A longer-term solution is to have some reward model benchmarks with closed data, which make them harder to game - similar to the Scale AI SEAL leaderboard.

Another longer-term solution might be to create a standardized reward model evaluation that is less easy to game (e.g. aligning a common model with the reward model and measuring downstream performance), but this seems very resource-intensive.

Would love to hear more ideas on what we can do about this!