Standard annotation budget: PPI++¶
PPI++ combines a small set of expensive true evaluation labels with a large pool of cheap proxy evaluation labels to produce a statistically valid, bias-corrected quality metric. This is necessary because proxy evaluation is generally carried out by LLM judges, which are known to disagree with one another. This guide walks through a realistic hallucination-detection scenario end-to-end.
What you will learn:
- Why LLM-as-Judge metrics are systematically biased
- How to use PPI++ to produce a bias-corrected metric
- How to test that your metric fits your expectations
The problem: your LLM judge disagrees with your users¶
Let's assume you run a customer-facing agentic assistant handling thousands of conversations per day.
The signals¶
- Every tenth user reports incorrect or fabricated information, which management considers unacceptable.
- You deploy an LLM judge to measure the hallucination rate. It reports a rate of 5%.
The users and the LLM judge disagree. You decide to dig deeper.
The manual investigation¶
You budget for 200 manual annotations — expensive but accurate ground truth. Annotators find that ~10% of conversations contain a blatant hallucination.
That is double what the LLM judge reports. The judge is systematically optimistic.
The challenge¶
You now have:
- 2,200 LLM judgements — cheap and fast, but biased
- 200 human annotations — accurate, but covering only a small portion of your data
PPI++ combines both to produce a reliable, unbiased estimate of the true hallucination rate across all conversations.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from glide.estimators import ClassicalMeanEstimator, PPIMeanEstimator
from glide.samplers import UniformSampler
from glide.simulators import generate_binary_dataset, simulate_annotation
# ── Colour palette ──────────────────────────────────────────
C_JUDGE = "#E74C3C" # LLM judge — red-orange
C_HUMAN = "#2E86AB" # Human-only — blue
C_GLIDE = "#27AE60" # GLIDE — green
C_TRUTH = "#2C3E50" # True value — dark slate
# ── Global plot style ───────────────────────────────────────
plt.rcParams.update(
    {
        "figure.facecolor": "white",
        "axes.facecolor": "#FAFAFA",
        "axes.grid": True,
        "grid.color": "#E5E5E5",
        "grid.linewidth": 0.8,
        "font.size": 18,
        "axes.labelsize": 18,
        "axes.titlesize": 18,
        "legend.fontsize": 16,
        "xtick.labelsize": 16,
        "ytick.labelsize": 16,
        "figure.titlesize": 19,
    }
)
Simulating thousands of conversations with a biased judge¶
generate_binary_dataset produces synthetic data that mirrors the scenario above, with ground-truth labels available for all samples.
The simulation replicates the practical workflow: proxy labels are generated for every conversation, a random subset is selected for human annotation, and the remaining ground-truth labels are masked with np.nan to reflect what would be unobserved in practice.
- Generate synthetic data with generate_binary_dataset: proxy labels cover all samples.
- Sample a subset for annotation with UniformSampler.
- Annotate selected samples and mask the rest via simulate_annotation.
The table below summarizes the arrays produced and what each contains.
| Array | Meaning |
|---|---|
| y_true_oracle | Ground-truth labels for all conversations (revealed only after annotation) |
| y_proxy | Proxy predictions for all rows |
| xi | Annotation indicator: 1 if annotated, 0 otherwise |
| y_true | Observed labels: ground-truth where annotated, np.nan elsewhere |
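The sampling and masking steps listed above can be mimicked in a few lines of plain Python. This is only a sketch of the idea: `mask_unlabeled` is a hypothetical helper written for illustration, not GLIDE's `UniformSampler` or `simulate_annotation`, and the ten-element `oracle` list is toy data.

```python
import math
import random


def mask_unlabeled(y_true_oracle, n_annotated, seed=0):
    """Hypothetical stand-in for uniform sampling followed by masking."""
    rng = random.Random(seed)
    # Pick n_annotated distinct positions uniformly at random
    chosen = set(rng.sample(range(len(y_true_oracle)), n_annotated))
    # xi: 1 where a human annotation is requested, 0 elsewhere
    xi = [1 if i in chosen else 0 for i in range(len(y_true_oracle))]
    # y_true: oracle label where annotated, NaN where the label stays unobserved
    y_true = [float(y) if flag else math.nan for y, flag in zip(y_true_oracle, xi)]
    return xi, y_true


oracle = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]  # toy ground truth, illustration only
xi_toy, y_true_toy = mask_unlabeled(oracle, n_annotated=4, seed=18)
n_labeled_toy = sum(not math.isnan(y) for y in y_true_toy)
print(xi_toy)
print(y_true_toy)
print(n_labeled_toy)
```

The NaN convention is what lets a single array carry both "annotated" and "not annotated" rows, so the labeled subset can later be recovered with a simple not-NaN mask.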
N_TOTAL = 2200
N_LABELED = 200
RANDOM_SEED = 18
y_true_oracle, y_proxy = generate_binary_dataset(
    n_total=N_TOTAL,
    true_mean=0.10,
    proxy_mean=0.05,
    correlation=0.65,
    random_seed=RANDOM_SEED,
)
xi = UniformSampler().sample(n_samples=N_TOTAL, budget=N_LABELED, random_seed=RANDOM_SEED)
y_true = simulate_annotation(y_true_oracle, xi)
labeled_mask = ~np.isnan(y_true)
n_labeled = np.sum(labeled_mask)
n_unlabeled = len(y_true) - n_labeled
print(f"Total conversations : {len(y_true):,}")
print(f" Manually annotated : {n_labeled:,}")
print(f" LLM-judged only : {n_unlabeled:,}")
print()
print("Sample values")
print(f" y_true (labeled): {y_true[labeled_mask][0]}")
print(f" y_true (unlabeled): {y_true[~labeled_mask][0]}")
print(f" y_proxy (all): {y_proxy[0]}")
Total conversations : 2,200
 Manually annotated : 200
 LLM-judged only : 2,000

Sample values
 y_true (labeled): 0.0
 y_true (unlabeled): nan
 y_proxy (all): 0.0
Two naive strategies both fail¶
Two obvious approaches to estimating the true hallucination rate each have a fatal flaw:
Option A — Trust the judge on all conversations.
Precise (large sample), but the judge's systematic bias makes the estimate wrong.
Option B — Trust only the human annotations.
Unbiased, but the 95% confidence interval is very wide.
PPI++, introduced in the next section, fixes both problems simultaneously.
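To see why sample size alone drives the width of the interval, here is a minimal sketch of a normal-approximation (Wald) interval for a proportion. It assumes the classical estimator behaves like a Wald interval, which is an assumption about GLIDE rather than something stated in this guide; `wald_interval` is a hypothetical helper, and the inputs 5.5% and 7.5% are the point estimates printed in the comparison table below.

```python
from statistics import NormalDist


def wald_interval(p_hat, n, confidence_level=0.95):
    """Normal-approximation confidence interval for a proportion (illustrative)."""
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half_width, p_hat + half_width


judge_ci = wald_interval(0.055, 2200)  # many samples: narrow, but centered on a biased value
human_ci = wald_interval(0.075, 200)   # few samples: unbiased, but wide

print(f"judge: [{judge_ci[0]:.1%}, {judge_ci[1]:.1%}]")
print(f"human: [{human_ci[0]:.1%}, {human_ci[1]:.1%}]")
```

The half-width shrinks with the square root of n, so going from 200 to 2,200 samples narrows the interval by a factor of about 3.3 before accounting for the different rates.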
# Option A: LLM judge — average proxy labels over all conversations
judge_estimate = ClassicalMeanEstimator().estimate(y_proxy)
judge_mean = judge_estimate.mean
judge_lower_bound = judge_estimate.confidence_interval.lower_bound
judge_upper_bound = judge_estimate.confidence_interval.upper_bound
# Option B: human labels only — average over labeled conversations
human_estimate = ClassicalMeanEstimator().estimate(y_true)
human_mean = human_estimate.mean
human_lower_bound = human_estimate.confidence_interval.lower_bound
human_upper_bound = human_estimate.confidence_interval.upper_bound
sep = "-" * 70
print(sep)
print(f"{'Method':<34} {'Estimate':>8} {'95% confidence interval':>16}")
print(sep)
print(f"{'LLM Judge only (n=2,200)':<34} {judge_mean:>7.1%} [{judge_lower_bound:.1%}, {judge_upper_bound:.1%}]")
print(f"{'Human labels only (n=200)':<34} {human_mean:>7.1%} [{human_lower_bound:.1%}, {human_upper_bound:.1%}]")
print(sep)
print(f"{'True rate (simulation)':<34} {'10.0%':>8}")
----------------------------------------------------------------------
Method                             Estimate 95% confidence interval
----------------------------------------------------------------------
LLM Judge only (n=2,200)              5.5% [4.6%, 6.5%]
Human labels only (n=200)             7.5% [3.8%, 11.2%]
----------------------------------------------------------------------
True rate (simulation)                10.0%
The root cause: the LLM judge is systematically biased¶
The gap is clear: compared to human annotators, the judge consistently under-reports hallucinations on average.
The PPI++ rectifier measures this systematic error on the labeled subset, then applies the correction across all conversations.
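As a rough illustration of that correction, the sketch below computes the basic prediction-powered mean: the proxy average over all rows plus the rectifier, i.e. the mean of (true minus proxy) on the labeled subset. The function `ppi_mean` and the toy lists are hypothetical, and this is the untuned form of the estimator, not necessarily what `PPIMeanEstimator` does internally.

```python
from statistics import fmean


def ppi_mean(y_true_labeled, y_proxy_labeled, y_proxy_all):
    """Basic prediction-powered mean: proxy average plus the rectifier (illustrative)."""
    # Rectifier: average judge error measured where human labels exist
    rectifier = fmean([t - p for t, p in zip(y_true_labeled, y_proxy_labeled)])
    # Apply the measured correction to the proxy mean over all conversations
    return fmean(y_proxy_all) + rectifier


# Toy data: 5 annotated conversations out of 20
y_true_lab = [1, 0, 0, 1, 0]
y_proxy_lab = [1, 0, 0, 0, 0]              # the judge misses one hallucination
y_proxy_everything = y_proxy_lab + [0] * 14 + [1]

toy_estimate = ppi_mean(y_true_lab, y_proxy_lab, y_proxy_everything)
print(f"proxy mean      : {fmean(y_proxy_everything):.2f}")
print(f"corrected (PPI) : {toy_estimate:.2f}")
```

Note that the rectifier is the negative of the `bias` variable computed in the next cell (proxy mean minus true mean on the labeled subset).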
labeled_mask = ~np.isnan(y_true)
y_true_filtered = y_true[labeled_mask]
y_proxy_labeled = y_proxy[labeled_mask]
p_mean = np.mean(y_proxy_labeled)
t_mean = np.mean(y_true_filtered)
bias = p_mean - t_mean
fig, ax = plt.subplots(figsize=(7, 4.5))
x_pos = np.array([0, 1])
bar_vals = [p_mean, t_mean]
bar_colors = [C_JUDGE, C_HUMAN]
bar_labels = ["LLM Judge\n on annotated subset", "Human Annotation\n(ground truth)"]
ax.bar(x_pos, bar_vals, color=bar_colors, width=0.45, edgecolor="white", linewidth=2, zorder=3)
# Loop variable is named x so it does not overwrite the annotation indicator xi
for x, val, c in zip(x_pos, bar_vals, bar_colors):
    ax.text(x, val + 0.005, f"{val:.1%}", ha="center", fontsize=14, fontweight="bold", color=c)
ax.annotate(
    "", xy=(1, t_mean + 0.010), xytext=(0, p_mean + 0.010), arrowprops=dict(arrowstyle="<->", color="#666666", lw=2.5)
)
ax.text(
    0.5,
    max(bar_vals) + 0.033,
    f"Bias = {bias:+.1%}",
    ha="center",
    fontsize=12,
    color="#555555",
    fontstyle="italic",
    bbox=dict(boxstyle="round,pad=0.35", facecolor="#FFFDE7", edgecolor="#CCCCCC"),
)
ax.axhline(0.10, color=C_TRUTH, linestyle="--", linewidth=1.8, zorder=2, label="True rate: 10%")
ax.set_xticks(x_pos)
ax.set_xticklabels(bar_labels)
ax.set_ylabel("Hallucination Rate")
ax.set_ylim(0, 0.26)
ax.set_xlim(-0.5, 1.5)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.spines[["top", "right"]].set_visible(False)
ax.legend(loc="upper right")
plt.tight_layout()
plt.show()
PPI++ corrects biased labels by leveraging true ones¶
PPIMeanEstimator implements Prediction-Powered Inference (PPI++), combining:
- n labeled samples (with human annotations and LLM judge labels)
- N unlabeled samples (LLM judge labels only)
The more the judge agrees with human annotators, the more it reduces the uncertainty of the final estimate — so a better-calibrated judge directly translates into cost savings.
estimator = PPIMeanEstimator()
ppi_result = estimator.estimate(
    y_true,
    y_proxy,
    metric_name="Hallucination Rate",
    confidence_level=0.95,
)
print(ppi_result.summary())
Metric: Hallucination Rate
Point Estimate: 0.082
Confidence Interval (95%): [0.054, 0.111]
Estimator : PPIMeanEstimator
n_true: 200
n_proxy: 2200
Effective Sample Size: 328
When judge and human labels are correlated, the difference $y_{\text{true}} - y_{\text{proxy}}$ has low variance, so the rectifier adds little noise. PPI++ therefore achieves a sizable increase in effective sample size: the annotation budget that the human-only strategy would need to match the variance of PPI++. Here the effective sample size is 328, against 200 actual annotations.
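The sketch below makes the variance argument concrete using the power-tuned form of the estimator from the PPI++ literature, with a weight λ on the proxy term. Everything here is an assumption for illustration: `ppi_plus_plus_mean` is a hypothetical function, the moments are estimated on the labeled subset only, the unlabeled pool excludes the labeled rows, and the counts are hand-picked toy data rather than the simulated dataset above. GLIDE's internals may differ.

```python
from statistics import fmean


def ppi_plus_plus_mean(y_lab, f_lab, f_unlab):
    """Power-tuned prediction-powered mean (illustrative sketch)."""
    n, N = len(y_lab), len(f_unlab)
    mean_y, mean_f = fmean(y_lab), fmean(f_lab)
    var_y = fmean([(y - mean_y) ** 2 for y in y_lab])
    var_f = fmean([(f - mean_f) ** 2 for f in f_lab])
    cov_yf = fmean([(y - mean_y) * (f - mean_f) for y, f in zip(y_lab, f_lab)])
    # Weight that minimizes the variance of the combined estimator
    lam = cov_yf / (var_f * (1 + n / N))
    estimate = mean_y + lam * (fmean(f_unlab) - mean_f)
    variance = var_y / n + lam**2 * var_f * (1 / n + 1 / N) - 2 * lam * cov_yf / n
    # Human-only budget that would reach the same variance
    n_effective = var_y / variance
    return estimate, variance, lam, n_effective


# Toy labeled set (n=200): 9 agreed hallucinations, 11 missed by the judge, 1 false alarm
y_lab = [1] * 9 + [1] * 11 + [0] * 1 + [0] * 179
f_lab = [1] * 9 + [0] * 11 + [1] * 1 + [0] * 179
f_unlab = [1] * 110 + [0] * 1890  # toy unlabeled pool (N=2000)

estimate, variance, lam, n_effective = ppi_plus_plus_mean(y_lab, f_lab, f_unlab)
print(f"lambda = {lam:.3f}, estimate = {estimate:.4f}, effective n = {n_effective:.0f}")
```

With λ chosen this way the variance is never larger than the human-only variance `var_y / n`, so the effective sample size is at least the actual annotation budget.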
PPI++ delivers an unbiased estimate at low cost¶
The plot below compares point estimates and 95% confidence intervals for all three methods against the true hallucination rate (dashed line):
- LLM judge: very narrow confidence interval, but wrong.
- Human-only: unbiased, but the confidence interval is very wide.
- PPI++: unbiased and narrow — the accuracy of human labels combined with the precision of many proxy judgements.
TRUE_RATE = 0.10
estimates = [
    (
        f"LLM Judge\n({ppi_result.n_proxy} | raw proxy)",
        judge_mean,
        judge_lower_bound,
        judge_upper_bound,
        C_JUDGE,
    ),
    (
        f"Human Annotation\n({ppi_result.n_true} | small sample)",
        human_mean,
        human_lower_bound,
        human_upper_bound,
        C_HUMAN,
    ),
    (
        f"PPI++ (GLIDE)\n({ppi_result.n_true} + {ppi_result.n_proxy})\n(full data exploited)",
        ppi_result.mean,
        ppi_result.confidence_interval.lower_bound,
        ppi_result.confidence_interval.upper_bound,
        C_GLIDE,
    ),
]
fig, ax = plt.subplots(figsize=(11, 5.5))
y_pos = [2, 1, 0]
for y, (label, mean, lo, hi, color) in zip(y_pos, estimates):
    # Confidence interval line
    ax.plot([lo, hi], [y, y], color=color, linewidth=4, solid_capstyle="round", zorder=3)
    # Cap marks
    for xc in [lo, hi]:
        ax.plot([xc, xc], [y - 0.2, y + 0.2], color=color, linewidth=2.5, zorder=3)
    # Point estimate
    ax.scatter(mean, y, s=200, color=color, zorder=5, edgecolors="white", linewidths=2.5)
    # Value label above
    ax.text(mean, y + 0.34, f"{mean:.1%}", ha="center", va="bottom", fontsize=12, color=color, fontweight="bold")
    # Confidence interval bounds below
    ax.text(mean, y - 0.34, f"[{lo:.1%}, {hi:.1%}]", ha="center", va="top", fontsize=11, color="#888888")
# True value
ax.axvline(TRUE_RATE, color=C_TRUTH, linestyle="--", linewidth=2.5, zorder=4)
ax.text(TRUE_RATE + 0.004, 2.72, "True rate 10%", color=C_TRUTH, fontsize=10.5, fontweight="bold")
ax.set_yticks(y_pos)
ax.set_yticklabels([e[0] for e in estimates])
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.set_xlabel("Estimated Hallucination Rate")
ax.set_xlim(-0.01, 0.26)
ax.set_ylim(-0.8, 3.2)
ax.spines[["top", "right", "left"]].set_visible(False)
ax.tick_params(left=False)
plt.tight_layout()
plt.show()
Testing whether the hallucination rate is within acceptable limits¶
GLIDE's estimation output enables a formal hypothesis test: is the true hallucination rate significantly higher than 5%?
$$H_0 : \mu = 5\% \qquad H_1 : \mu > 5\%$$
In business terms: does the system's hallucination rate exceed the 5% tolerance, making it unsafe to deploy?
Without PPI++, we could rely on the LLM judge estimate alone, but its confidence interval sits misleadingly close to 5%, making it unlikely that we would reject the null hypothesis. On the other hand, the confidence interval obtained from human annotations alone is wide and leads to overly conservative decisions.
PPI++ combines both sources into one bias-corrected estimate with a tighter confidence interval, which allows for accurate hypothesis testing.
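For intuition, a one-sided z-test can be reconstructed from a point estimate and its confidence interval. The sketch assumes a symmetric, normal-based interval; `one_sided_z_test` is a hypothetical helper and not the implementation behind `test_null_hypothesis`. The inputs are the rounded values printed elsewhere in this guide, so the results only approximate the outputs shown in the cells that follow.

```python
from statistics import NormalDist


def one_sided_z_test(estimate, ci_lower, ci_upper, h0_value, confidence_level=0.95):
    """Test H0: mu = h0_value against H1: mu > h0_value (illustrative)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(0.5 + confidence_level / 2)
    # Recover the standard error from the width of the symmetric interval
    standard_error = (ci_upper - ci_lower) / (2 * z_crit)
    z_stat = (estimate - h0_value) / standard_error
    p_value = 1 - std_normal.cdf(z_stat)  # upper-tail probability
    return z_stat, p_value


z_ppi, p_ppi = one_sided_z_test(0.082, 0.054, 0.111, h0_value=0.05)
z_human, p_human = one_sided_z_test(0.075, 0.038, 0.112, h0_value=0.05)
print(f"PPI++      : z = {z_ppi:.2f}, p = {p_ppi:.4f}")
print(f"Human-only : z = {z_human:.2f}, p = {p_human:.4f}")
```

The two estimates are similar, yet only the tighter interval yields a p-value below 0.05: the decision is driven by the standard error as much as by the point estimate.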
z_stat, p_value, _ = ppi_result.confidence_interval.test_null_hypothesis(
    h0_value=0.05,  # LLM judge's claimed rate
    alternative="larger",  # H1: true rate > 5%
)
sep = "-" * 48
print("Hypothesis test — PPI++ Estimator")
print(sep)
print("H0 : hallucination rate = 5% (LLM says so)")
print("H1 : hallucination rate > 5% (users complain!)")
print()
print(f"z-statistic : {z_stat:.2f}")
print(f"p-value : {p_value:.10f}")
print()
if p_value < 0.05:
    print("Decision : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision : Given the p-value is higher than our threshold (p-value >= 0.05), "
        "we cannot reject H0 at the 5% level."
    )
Hypothesis test — PPI++ Estimator
------------------------------------------------
H0 : hallucination rate = 5% (LLM says so)
H1 : hallucination rate > 5% (users complain!)

z-statistic : 2.22
p-value : 0.0132522742

Decision : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.
=> The true hallucination rate is significantly above 5%.
Notice that the null hypothesis is rejected, signalling that the hallucination rate is significantly above the 5% threshold.
Let us try the same hypothesis test using human annotations only.
human_labels_ci = human_estimate.confidence_interval
z_stat_cm, p_value_cm, _ = human_labels_ci.test_null_hypothesis(
    h0_value=0.05,  # LLM judge's claimed rate
    alternative="larger",  # H1: true rate > 5%
)
sep = "-" * 48
print("Hypothesis test — Classical Mean Estimator (Human labels only)")
print(sep)
print("H0 : hallucination rate = 5% (LLM says so)")
print("H1 : hallucination rate > 5% (users complain!)")
print()
print(f"z-statistic : {z_stat_cm:.2f}")
print(f"p-value : {p_value_cm:.10f}")
print()
if p_value_cm < 0.05:
    print("Decision : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision : Given the p-value is higher than our threshold (p-value >= 0.05), "
        "we cannot reject H0 at the 5% level."
    )
Hypothesis test — Classical Mean Estimator (Human labels only)
------------------------------------------------
H0 : hallucination rate = 5% (LLM says so)
H1 : hallucination rate > 5% (users complain!)

z-statistic : 1.34
p-value : 0.0902931295

Decision : Given the p-value is higher than our threshold (p-value >= 0.05), we cannot reject H0 at the 5% level.
The null hypothesis is not rejected because of the high uncertainty (see the figure above), so the same conclusion cannot be drawn. As we saw above, PPI++ fixes this by leveraging the LLM judge labels to reduce uncertainty.
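A back-of-the-envelope calculation shows how many human annotations the human-only strategy would need to reach the same decision. It assumes the observed rate stays at 7.5% as the sample grows and uses the normal-approximation standard error of a proportion; `min_n_to_reject` is a hypothetical helper, not part of GLIDE.

```python
from math import ceil
from statistics import NormalDist


def min_n_to_reject(p_hat, h0_value, alpha=0.05):
    """Smallest n for which a one-sided z-test rejects H0 at level alpha (illustrative)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    # Solve (p_hat - h0) / sqrt(p_hat * (1 - p_hat) / n) >= z_alpha for n
    return ceil(z_alpha**2 * p_hat * (1 - p_hat) / (p_hat - h0_value) ** 2)


n_needed = min_n_to_reject(p_hat=0.075, h0_value=0.05)
print(f"Human-only annotations needed: {n_needed}")
```

Under these assumptions the human-only route needs about 50% more annotations than the 200 budgeted here, which is the kind of gap the effective sample size of PPI++ is meant to close.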
Summary: PPI++ combines accuracy and precision¶
| | LLM Judge | Human-only | PPI++ |
|---|---|---|---|
| Sample size | 2,200 | 200 | 200 + 2,200 |
| Unbiased estimate | ❌ | ✅ | ✅ |
| Narrow confidence interval | 🟠 (misleading) | ❌ | ✅ |
| Labeling cost | Low | High | Small |
Key takeaways:
- LLM judges are biased. A narrow confidence interval around the wrong value is worse than useless: it gives false confidence.
- 200 human annotations are enough here. The rectifier uses information from 2,200 cheap proxy labels to shrink the confidence interval to roughly what 328 human annotations would give (the effective sample size reported above).
- PPI++ efficiency depends on sample size and LLM-judge quality. To shrink the confidence interval further, invest in either more human annotations or a better-aligned LLM judge.
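The trade-off in the last takeaway can be sketched with the closed-form variance of a power-tuned prediction-powered mean, where the gain depends on the squared judge-human correlation and on the ratio of labeled to unlabeled samples. This formula is an assumption taken from the PPI++ literature, not from GLIDE's documentation, and `effective_sample_size` is a hypothetical helper.

```python
def effective_sample_size(n_labeled, n_unlabeled, correlation):
    """Human-only budget matching the variance of a power-tuned PPI mean (illustrative)."""
    # Fraction of the human-only variance that remains after the correction
    remaining = 1 - correlation**2 / (1 + n_labeled / n_unlabeled)
    return n_labeled / remaining


for rho in (0.0, 0.3, 0.65, 0.9):
    n_eff = effective_sample_size(200, 2000, rho)
    print(f"correlation {rho:.2f} -> effective sample size {n_eff:.0f}")
```

With the simulation's nominal correlation of 0.65 this gives about 325, in line with the effective sample size of 328 reported by the estimator; an uncorrelated judge brings no gain at all.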
Want to go further? The PPI scientific validation notebook provides rigorous empirical evidence — coverage validity and confidence interval width across a sweep of proxy correlations and confidence levels.