Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math
Abstract
Consequence-Based Utility evaluates mathematical solutions by testing their effectiveness as exemplars for related problems, outperforming reward models and LLM judges in ranking quality and correct-wrong separation.
Recent progress in reasoning models suggests that generating plausible attempts at research-level mathematics may be within reach, but verification remains a bottleneck, consuming scarce expert time. We hypothesize that a meaningful solution contains enough method-level information that, when applied to a neighborhood of related questions, it yields better downstream performance than incorrect solutions do. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar for solving related yet verifiable questions. We evaluate our approach on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM judges, it exhibits a larger solver-evaluator gap, maintaining stronger correct-wrong separation even on instances that the underlying solver itself often fails to solve.
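A minimal sketch of the scoring loop described above, assuming hypothetical helpers: `llm_solve` (a solver prompted with the candidate as an in-context exemplar), a pre-built `related_questions` neighborhood with verifiable answers, and a simple exact-match check. These names and the string comparison are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List, Tuple

def consequence_based_utility(
    candidate_solution: str,
    related_questions: List[Tuple[str, str]],    # (question, verifiable gold answer)
    llm_solve: Callable[[str, str], str],        # (exemplar, question) -> predicted answer
) -> float:
    """Score a candidate by its downstream usefulness as an exemplar.

    A higher score means the candidate, used as an in-context exemplar,
    helped the solver answer more related verifiable questions correctly.
    """
    correct = 0
    for question, gold_answer in related_questions:
        prediction = llm_solve(candidate_solution, question)
        if prediction.strip() == gold_answer.strip():
            correct += 1
    return correct / max(len(related_questions), 1)

# Ranking: score each of the ten candidates (one expert-written, nine
# LLM-generated) and pick the highest-utility one, with no oracle needed
# for the original research-level problem itself.
# scores = [consequence_based_utility(c, neighborhood, llm_solve) for c in candidates]
# best = max(range(len(candidates)), key=scores.__getitem__)
```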
Community
We propose Consequence-Based Utility, an alternative approach for oracle-free evaluation of research-level math questions.
The following similar papers were recommended by the Semantic Scholar API:
- Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning (2026)
- CORE: Collaborative Reasoning via Cross Teaching (2026)
- Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation (2025)
- Adaptive Test-Time Compute Allocation via Learned Heuristics over Categorical Structure (2026)
- Didactic to Constructive: Turning Expert Solutions into Learnable Reasoning (2026)
- Not All Negative Samples Are Equal: LLMs Learn Better from Plausible Reasoning (2026)
- Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models (2026)