EthiCompass
AI governance · LLM evaluation · compliance · AI ethics

Domain Judges: a way out of the privacy-versus-capability dilemma in LLM evaluation

By Leonardo Leenen · April 17, 2026

In nearly every LLM evaluation project we support, the same client conversation repeats. First one sentence lands: "we can't send this data to a cloud judge, it won't pass compliance." Minutes later, another: "but if the judge has lower capability than the model being evaluated, how do we trust the verdict?"

For a long time, these two concerns sounded incompatible. They no longer are, and that changes how evaluation architecture should be designed in regulated environments.

The dilemma real teams are facing

"LLM-as-a-judge" became part of the standard evaluation stack in 2025–2026. The major cloud providers offer it integrated in their tooling — model-based graders, pointwise scoring, reference-guided grading. For research and low-risk products, the pattern is reasonable.
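Stripped of provider tooling, pointwise reference-guided grading reduces to a prompt template, a model call, and score parsing. The sketch below illustrates that shape; the rubric text, the `SCORE:` convention, and the `judge_call` callable are all hypothetical choices for this example, not any provider's actual API.

```python
import re

RUBRIC = """Score the answer from 1 to 5 against the reference.
5 = fully correct and complete, 1 = incorrect or off-topic.
Reply with a single line: SCORE: <n>"""

def build_prompt(question: str, answer: str, reference: str) -> str:
    # Reference-guided, pointwise grading: the judge sees the question,
    # the candidate answer, and a gold reference, and returns one score.
    return f"{RUBRIC}\n\nQuestion: {question}\nReference: {reference}\nAnswer: {answer}"

def parse_score(judge_reply: str) -> int:
    # Extract the numeric verdict; refuse to guess if it is missing.
    m = re.search(r"SCORE:\s*([1-5])", judge_reply)
    if m is None:
        raise ValueError("judge reply did not contain a parseable score")
    return int(m.group(1))

def grade(question, answer, reference, judge_call) -> int:
    # judge_call is any callable prompt -> text: a local model, an API, a stub.
    return parse_score(judge_call(build_prompt(question, answer, reference)))

# Stub judge standing in for a real model:
print(grade("2+2?", "4", "4", lambda prompt: "SCORE: 5"))  # → 5
```

Keeping the judge behind a plain callable like this is what later allows swapping an external frontier judge for a locally hosted one without touching the pipeline.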

In regulated environments, two problems appear at the same time.

Privacy and jurisdiction. The GDPR "travels with the data." Transfers outside the European Economic Area require adequacy decisions or appropriate safeguards. Providers have introduced regional residency and zero data retention options, but with partial coverage, specific approvals, and feature-level exclusions. For many concrete enterprise cases in Europe, sending a sensitive dataset to an external frontier judge becomes unviable before the technical discussion even starts.

Judge capability. Recent literature on LLM-as-a-judge converges on an uncomfortable observation: the judge starts delivering poor signal precisely when it itself cannot reliably solve the question being evaluated. In other words, a judge with lower capability than the evaluated model adds noise rather than signal. Studies in expert domains — 68% agreement with specialists in dietetics, 64% in mental health — show the generalist judge is not a substitute for a domain expert.
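Raw agreement percentages like the ones above overstate judge quality, because two raters agree by chance some of the time. Cohen's kappa is the standard chance-corrected measure; the sketch below computes it from scratch on made-up binary labels (the `viol`/`ok` data is illustrative, not from the cited studies).

```python
from collections import Counter

def cohens_kappa(judge_labels, expert_labels):
    # Chance-corrected agreement between two label sequences.
    assert len(judge_labels) == len(expert_labels) and judge_labels
    n = len(judge_labels)
    observed = sum(j == e for j, e in zip(judge_labels, expert_labels)) / n
    pj, pe = Counter(judge_labels), Counter(expert_labels)
    # Probability both raters pick the same label by chance:
    expected = sum(pj[label] * pe[label] for label in pj) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels: 70% raw agreement on a near-balanced binary task.
judge  = ["viol", "viol", "viol", "ok", "ok",   "viol", "ok", "viol", "viol", "ok"]
expert = ["viol", "viol", "ok",   "ok", "viol", "viol", "ok", "viol", "ok",   "ok"]
print(cohens_kappa(judge, expert))  # ≈ 0.4 despite 70% raw agreement
```

A judge that "agrees 68% of the time" on a roughly balanced task may carry far less signal than the headline number suggests, which is exactly why the capability concern is serious.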

Technical and compliance teams hold both of these observations at once, which leaves them with an uncomfortable tradeoff: give up privacy, or give up technical confidence.

What "domain judge" means

Not every small fine-tuned model qualifies as a domain judge. There are four conditions that define an evaluable domain:

  1. Clear and auditable success criteria.
  2. Labeled data or rubrics validated by domain experts.
  3. A professional community with substantial agreement on what a correct evaluation looks like.
  4. Bounded vocabulary and context.

Detecting violations of banking compliance regulation qualifies. Contract review under a specific jurisdiction qualifies. "Customer service quality" as an abstract category does not. The distinction matters, because it determines whether a specialized judge can actually outperform a generalist in that task.
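The four conditions can be made into an explicit go/no-go checklist a team fills in before committing to a domain judge. This is a minimal sketch; the class name, field names, and example profiles are all illustrative, not a formal methodology.

```python
from dataclasses import dataclass

@dataclass
class DomainProfile:
    auditable_criteria: bool    # 1. clear, auditable success criteria
    expert_labels: bool         # 2. labeled data or expert-validated rubrics
    community_consensus: bool   # 3. substantial professional agreement
    bounded_scope: bool         # 4. bounded vocabulary and context

    def qualifies(self) -> bool:
        # All four conditions must hold; a single miss disqualifies the domain.
        return all((self.auditable_criteria, self.expert_labels,
                    self.community_consensus, self.bounded_scope))

# Illustrative profiles matching the examples in the text:
banking_compliance = DomainProfile(True, True, True, True)
service_quality = DomainProfile(False, True, False, False)  # abstract, unbounded
```

The value of writing this down is less the code than the forced honesty: each `False` is a documented reason not to build the judge.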

The evidence has started to shift

Three recent works point concretely in this direction.

CompliBench (2026 preprint) finds that a Qwen3-8B fine-tuned for compliance violation detection matches or surpasses frontier judges within that domain. This is a specific, practical result: a model that can be deployed on owned infrastructure, under data control, outperforming on its own task much larger models running as external services.

Prometheus 2 presents itself as an open evaluator LM supporting direct assessment and pairwise ranking with custom criteria. It reports the best correlation among open evaluators tested against human judgment.

JudgeLRM trains reasoning-oriented judges via reinforcement learning. The 3B and 4B versions outperform frontier judges on the reported benchmarks; the larger versions surpass even strong reasoning models.

The literature also warns — fairly — that these fine-tuned judges do not generalize across arbitrary domains. An empirical study published at ACL 2025 concludes they function in practice as task-specific classifiers. That observation reinforces the point: a domain judge occupies the space where the generalist fails and where the domain itself allows specialization, without claiming universality.

What changes for the CTO and for compliance

A domain judge that participates in the decision pipeline in a regulated environment becomes, in its own right, an auditable artifact.

For the CTO, this implies concrete MLOps architecture decisions: judge versioning, documentation of its known failure modes, infrastructure for continuous recalibration against human references, and traceability for each verdict. The judge stops being a black box invoked by API and becomes a system component with its own lifecycle.

For the compliance officer, this introduces a new subject of scrutiny. The judge's training dataset now sits inside the auditable perimeter. The rubrics used to calibrate it are regulatory documentation. The judge's failure modes are information auditors have the right to see.

The deeper implication is that governance, under this pattern, shifts: it stops concentrating only on the primary model and extends to the judge artifact itself. Teams that run this transition with discipline will carry a sustained advantage over those that outsource evaluation to external services whose black boxes they cannot audit.

When this route does not apply

A domain judge is not a universal answer. There are three scenarios where it is not the right choice:

  • Poorly defined or rapidly evolving domains, where labels age faster than they can be generated.
  • Tasks with possible objective verification — executable tests, exact match, schema validation — where a deterministic check is cheaper and more auditable.
  • Teams without infrastructure for continuous recalibration. Without that cycle, the judge degrades silently and loses reliability before anyone notices.

If any of these three scenarios applies, the cost of implementing and maintaining a domain judge exceeds the benefit.
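The third scenario, silent degradation, is the one teams most often underestimate, and it is also the easiest to make visible. A minimal sketch of a recalibration trigger, assuming the team periodically collects batches of human reference labels (the threshold and window values are illustrative):

```python
def agreement(judge_labels, human_labels):
    # Fraction of items in a batch where the judge matches the human reference.
    assert len(judge_labels) == len(human_labels) and judge_labels
    return sum(j == h for j, h in zip(judge_labels, human_labels)) / len(judge_labels)

def needs_recalibration(agreement_history, threshold=0.85, window=3):
    # agreement_history: chronological per-batch agreement rates.
    # Flag when the recent average drops below the threshold, turning
    # silent degradation into an explicit alarm.
    recent = agreement_history[-window:]
    return sum(recent) / len(recent) < threshold

history = [0.92, 0.90, 0.88, 0.84, 0.80]
print(needs_recalibration(history))  # True: recent average 0.84 < 0.85
```

Without some loop of this shape feeding fresh human labels in, there is no way to know the judge has drifted, which is exactly why teams lacking that infrastructure should not take this route.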

What comes next

LLM evaluation in enterprise environments is converging toward a hybrid stack: deterministic checks for non-negotiables, domain judges for subjective evaluation within bounded areas, human review for edge cases and continuous calibration. No single layer replaces the others. No single layer is fully delegated to a third party.
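The three layers above compose into a simple routing function. This is a sketch under stated assumptions: the judge returns a 1–5 score, the `fail_below`/`pass_at` thresholds are arbitrary choices for this example, and the review queue stands in for whatever human-review system a team actually runs.

```python
def evaluate(item, deterministic_checks, domain_judge, review_queue,
             fail_below=2, pass_at=4):
    # Layer 1: deterministic checks veto outright (cheapest, most auditable).
    for check in deterministic_checks:
        if not check(item):
            return "fail"
    # Layer 2: the domain judge scores the subjective part (1-5).
    score = domain_judge(item)
    # Layer 3: ambiguous middle scores escalate to humans; those reviews
    # double as calibration data for the judge.
    if fail_below < score < pass_at:
        review_queue.append(item)
        return "needs_review"
    return "pass" if score >= pass_at else "fail"

# Stub usage: a trivial non-empty check and a stub judge.
queue = []
print(evaluate("some answer", [lambda t: bool(t)], lambda t: 5, queue))  # → pass
```

The ordering embodies the article's point: objective checks never reach the judge, and the judge never has the final word on ambiguous cases.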

The pattern we see repeating with European and Latin American clients in regulated sectors is that the privacy-versus-capability dilemma, framed against external frontier judges, stops being a tradeoff when evaluation moves to domain judges running on owned infrastructure. For a growing set of cases, this is today the only architecture that satisfies both constraints simultaneously.


References

  • Park et al. CompliBench: Evaluating LLM Judges for Regulatory Compliance. Preprint, 2026.
  • Kim et al. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. 2024.
  • Chen et al. JudgeLRM: Large Reasoning Models as a Judge. 2025.
  • Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
  • Liu et al. Lost in the Middle: How Language Models Use Long Contexts. 2023.
  • European Data Protection Board. Guidelines on International Data Transfers (Chapter V, GDPR).
  • OpenAI, Microsoft Azure, Google Cloud — official documentation on model-based evaluation.
  • NIST. Generative AI Profile — Companion to the AI Risk Management Framework.