Multiple Proxies: Multi-PPI++
Multi-PPI++ extends PPI++ to settings where multiple proxy annotation sources are available simultaneously. Instead of picking a single one or averaging them naively, Multi-PPI++ finds the optimal combination of all proxy sources and applies a bias correction, producing a statistically valid estimate that leverages every signal at hand.
This guide walks through a realistic hallucination-detection scenario end-to-end.
What you will learn:
- Why two different LLM judges disagree and why they can be biased
- How to use Multi-PPI++ to produce a bias-corrected metric from multiple proxy sources
- How Multi-PPI++ compares to using a single proxy or human annotations alone
The problem: your LLM judges disagree with your users
Let's assume you run a customer-facing AI assistant handling thousands of conversations per day.
The signals
- Every tenth user reports incorrect or fabricated information, which management considers unacceptable.
- You deploy two LLM judges to measure the hallucination rate: Gemini reports 5% and Claude reports 8%.
The users and both judges disagree. You decide to dig deeper.
The manual investigation
You budget for 200 manual annotations — expensive but accurate ground truth. Annotators find that ~10% of conversations contain a blatant hallucination.
That is double what Gemini reports and 25% above what Claude reports. Both judges are systematically optimistic, just to different degrees.
The challenge
You now have:
- 2,200 Gemini judgements — cheap and fast, but biased
- 2,200 Claude judgements — cheap and fast, but also biased
- 200 human annotations — accurate, but covering only a small portion of your data
Multi-PPI++ combines all sources to produce a reliable, unbiased estimate of the true hallucination rate across all conversations.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from glide.estimators import ClassicalMeanEstimator, MultiPPIMeanEstimator
from glide.samplers import UniformSampler
from glide.simulators import generate_multi_binary_dataset, simulate_annotation
# ── Colour palette ──────────────────────────────────────────
C_GEMINI = "#4A90D9" # Gemini — blue
C_CLAUDE = "#E87040" # Claude — orange
C_HUMAN = "#2E86AB" # Human-only — steel blue
C_GLIDE = "#27AE60" # GLIDE — green
C_TRUTH = "#2C3E50" # True value — dark slate
# ── Global plot style ───────────────────────────────────────
plt.rcParams.update(
    {
        "figure.facecolor": "white",
        "axes.facecolor": "#FAFAFA",
        "axes.grid": True,
        "grid.color": "#E5E5E5",
        "grid.linewidth": 0.8,
        "font.size": 18,
        "axes.labelsize": 18,
        "axes.titlesize": 18,
        "legend.fontsize": 16,
        "xtick.labelsize": 16,
        "ytick.labelsize": 16,
        "figure.titlesize": 19,
    }
)
Simulating thousands of conversations with two biased judges
generate_multi_binary_dataset produces synthetic data that mirrors the scenario above, with ground-truth labels available for all samples.
The simulation replicates the practical workflow: two proxy labels are generated for every conversation, a random subset is selected for human annotation, and the remaining ground-truth labels are masked with np.nan to reflect what would be unobserved in practice.
- Generate synthetic data with generate_multi_binary_dataset: both judges label all samples.
- Sample a subset for annotation with UniformSampler.
- Annotate selected samples and mask the rest via simulate_annotation.
The table below summarizes the arrays produced and what each contains.
| Array | Meaning |
|---|---|
| y_true_oracle | Ground-truth labels for all conversations (revealed only after annotation) |
| y_proxies | Proxy predictions for all rows. Contains two columns representing Gemini's and Claude's labels |
| xi | Annotation indicator: 1 if annotated, 0 otherwise |
| y_true | Observed labels: ground-truth where annotated, np.nan elsewhere |
We extract the columns of y_proxies into two separate arrays y_gemini and y_claude.
N_TOTAL = 2200
N_LABELED = 200
RANDOM_SEED = 12
y_true_oracle, y_proxies = generate_multi_binary_dataset(
    n_total=N_TOTAL,
    true_mean=0.10,
    proxy_means=[0.05, 0.08],
    correlations=[0.60, 0.70],
    random_seed=RANDOM_SEED,
)
y_gemini = y_proxies[:, 0]
y_claude = y_proxies[:, 1]
xi = UniformSampler().sample(n_total=N_TOTAL, n_samples=N_LABELED, random_seed=RANDOM_SEED)
y_true = simulate_annotation(y_true_oracle, xi)
labeled_mask = ~np.isnan(y_true)
n_labeled = np.sum(labeled_mask)
n_unlabeled = len(y_true) - n_labeled
print(f"Total conversations : {len(y_true):,}")
print(f" Manually annotated : {n_labeled:,}")
print(f" Judge-labelled only: {n_unlabeled:,}")
print()
print("Sample values")
print(f" y_true (labeled): {y_true[labeled_mask][0]}")
print(f" y_true (unlabeled): {y_true[~labeled_mask][0]}")
print(f" y_gemini (all): {y_gemini[0]}")
print(f" y_claude (all): {y_claude[0]}")
Total conversations : 2,200
  Manually annotated : 200
  Judge-labelled only: 2,000

Sample values
  y_true (labeled): 0.0
  y_true (unlabeled): nan
  y_gemini (all): 0.0
  y_claude (all): 0.0
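The sampling and masking steps can be illustrated in plain NumPy. This is a minimal sketch of what the workflow appears to do based on the array descriptions above, not GLIDE's implementation: the helper name `mask_unannotated` and the toy labels are hypothetical.

```python
import numpy as np


def mask_unannotated(y_oracle, n_samples, seed=0):
    """Pick n_samples rows uniformly at random and hide the other labels.

    Hypothetical stand-in for the UniformSampler + simulate_annotation pair.
    """
    rng = np.random.default_rng(seed)
    xi = np.zeros(len(y_oracle), dtype=int)
    # Sampling without replacement gives exactly n_samples annotated rows.
    xi[rng.choice(len(y_oracle), size=n_samples, replace=False)] = 1
    # Keep the ground truth where annotated, np.nan everywhere else.
    y_observed = np.where(xi == 1, y_oracle, np.nan)
    return xi, y_observed


# Toy ground truth for ten conversations (made-up values).
y_oracle = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0], dtype=float)
xi, y_observed = mask_unannotated(y_oracle, n_samples=4)
labeled = ~np.isnan(y_observed)
print(xi)
print(y_observed)
```

The `labeled` mask recovered from the NaN pattern matches `xi`, which is why the tutorial can rebuild `labeled_mask` from `y_true` alone.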
Three naive strategies all fall short
Three obvious approaches to estimating the true hallucination rate each have a fatal flaw:
Option A — Trust Gemini on all conversations.
Precise (large sample), but Gemini's systematic bias makes the estimate wrong.
Option B — Trust Claude on all conversations.
More accurate than Gemini, but still biased. The confidence interval is narrow around the wrong value.
Option C — Trust only the human annotations.
Unbiased, but the 95% confidence interval is very wide due to the small annotated sample.
Multi-PPI++, introduced in the next section, fixes both problems simultaneously.
# Option A: Gemini judge — average proxy labels over all conversations
gemini_estimate = ClassicalMeanEstimator().estimate(y_gemini)
gemini_mean = gemini_estimate.mean
gemini_lower = gemini_estimate.confidence_interval.lower_bound
gemini_upper = gemini_estimate.confidence_interval.upper_bound
# Option B: Claude judge — average proxy labels over all conversations
claude_estimate = ClassicalMeanEstimator().estimate(y_claude)
claude_mean = claude_estimate.mean
claude_lower = claude_estimate.confidence_interval.lower_bound
claude_upper = claude_estimate.confidence_interval.upper_bound
# Option C: human labels only — average over labeled conversations
human_estimate = ClassicalMeanEstimator().estimate(y_true)
human_mean = human_estimate.mean
human_lower = human_estimate.confidence_interval.lower_bound
human_upper = human_estimate.confidence_interval.upper_bound
sep = "-" * 70
print(sep)
print(f"{'Method':<34} {'Estimate':>8} {'95% confidence interval':>16}")
print(sep)
print(f"{'Gemini only (n=2,200)':<34} {gemini_mean:>7.1%} [{gemini_lower:.1%}, {gemini_upper:.1%}]")
print(f"{'Claude only (n=2,200)':<34} {claude_mean:>7.1%} [{claude_lower:.1%}, {claude_upper:.1%}]")
print(f"{'Human labels only (n=200)':<34} {human_mean:>7.1%} [{human_lower:.1%}, {human_upper:.1%}]")
print(sep)
print(f"{'True rate (simulation)':<34} {'10.0%':>8}")
----------------------------------------------------------------------
Method                             Estimate 95% confidence interval
----------------------------------------------------------------------
Gemini only (n=2,200)                 4.5% [3.6%, 5.4%]
Claude only (n=2,200)                 7.7% [6.6%, 8.8%]
Human labels only (n=200)             8.0% [4.2%, 11.8%]
----------------------------------------------------------------------
True rate (simulation)                10.0%
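The intervals above are consistent with a normal-approximation confidence interval around the sample mean, computed on the non-missing values only. The sketch below assumes that this is how the classical estimate behaves; the source does not confirm it, and the function name `mean_with_normal_ci` is hypothetical. The toy array encodes 16 positives among 200 annotated rows, which is the only count compatible with the 8.0% human-only estimate.

```python
from statistics import NormalDist

import numpy as np


def mean_with_normal_ci(values, confidence_level=0.95):
    """Sample mean with a two-sided normal-approximation interval.

    NaN entries (rows without a human label) are ignored.
    """
    observed = values[~np.isnan(values)]
    n = observed.size
    mean = observed.mean()
    standard_error = observed.std(ddof=1) / np.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return mean, mean - z * standard_error, mean + z * standard_error


# 2,200 rows, of which 200 are annotated and 16 of those are hallucinations.
labels = np.full(2200, np.nan)
labels[:200] = 0.0
labels[:16] = 1.0

mean, lower, upper = mean_with_normal_ci(labels)
print(f"{mean:.1%} [{lower:.1%}, {upper:.1%}]")  # 8.0% [4.2%, 11.8%]
```

The half-width scales with 1/sqrt(n), which is why 200 labels give an interval several times wider than 2,200 proxy labels.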
The root cause: both judges are systematically biased
The pattern is clear: compared to human annotators, both Gemini and Claude consistently under-report hallucinations on average. Gemini is more severely biased than Claude, but neither judge can be trusted directly.
The Multi-PPI++ rectifier measures each judge's systematic error on the labeled subset, then applies the correction across all conversations, optimally weighting the two judges based on how well each agrees with the human annotations.
labeled_mask = ~np.isnan(y_true)
y_true_filtered = y_true[labeled_mask]
y_gemini_labeled = y_gemini[labeled_mask]
y_claude_labeled = y_claude[labeled_mask]
gemini_mean_labeled = np.mean(y_gemini_labeled)
claude_mean_labeled = np.mean(y_claude_labeled)
t_mean = np.mean(y_true_filtered)
gemini_bias = gemini_mean_labeled - t_mean
claude_bias = claude_mean_labeled - t_mean
fig, ax = plt.subplots(figsize=(9, 4.5))
x_pos = np.array([0, 1, 2])
bar_vals = [gemini_mean_labeled, claude_mean_labeled, t_mean]
bar_colors = [C_GEMINI, C_CLAUDE, C_HUMAN]
bar_labels = ["Gemini\n(annotated subset)", "Claude\n(annotated subset)", "Human Annotation\n(ground truth)"]
ax.bar(x_pos, bar_vals, color=bar_colors, width=0.45, edgecolor="white", linewidth=2, zorder=3)
for xi_pos, val, c in zip(x_pos, bar_vals, bar_colors):
    ax.text(xi_pos, val + 0.004, f"{val:.1%}", ha="center", fontsize=14, fontweight="bold", color=c)
for xi_pos, bias in zip([0, 1], [gemini_bias, claude_bias]):
    ax.text(
        xi_pos,
        t_mean + 0.055,
        f"Bias = {bias:+.1%}",
        ha="center",
        fontsize=11,
        color="#555555",
        fontstyle="italic",
        bbox=dict(boxstyle="round,pad=0.35", facecolor="#FFFDE7", edgecolor="#CCCCCC"),
    )
ax.axhline(0.10, color=C_TRUTH, linestyle="--", linewidth=1.8, zorder=2, label="True rate: 10%")
ax.set_xticks(x_pos)
ax.set_xticklabels(bar_labels)
ax.set_ylabel("Hallucination Rate")
ax.set_ylim(0, 0.28)
ax.set_xlim(-0.5, 2.5)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.spines[["top", "right"]].set_visible(False)
ax.legend(loc="upper right")
plt.tight_layout()
plt.show()
Multi-PPI++ corrects bias by optimally combining all sources
MultiPPIMeanEstimator implements Multi-PPI++, combining:
- n labeled samples (with human annotations and both judge labels)
- N unlabeled samples (both judge labels only)
The estimator automatically learns the optimal weight for each judge from the labeled data. A judge that agrees more closely with human annotators receives a larger weight and contributes more to the rectified estimate. Because Multi-PPI++ is power-tuned, it is asymptotically at least as efficient as human-only estimation, regardless of the quality of the judges: a judge that carries no information about the human labels is weighted towards zero. Multi-PPI++ also handles an arbitrary number of proxies.
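To make the weighting idea concrete, here is a minimal NumPy sketch of a tuned multi-proxy mean estimate in the usual PPI++ form. It is an illustration under stated assumptions, not GLIDE's code: the function name `multi_proxy_mean` is hypothetical, the unlabeled rows are treated as a sample separate from the labeled rows, and the synthetic judges simply miss a fraction of the real hallucinations.

```python
import numpy as np


def multi_proxy_mean(y_labeled, f_labeled, f_unlabeled):
    """Tuned multi-proxy mean estimate (sketch, not GLIDE's implementation).

    y_labeled   : (n,)   human labels
    f_labeled   : (n, k) proxy labels on the annotated rows
    f_unlabeled : (N, k) proxy labels on the rows without human labels
    """
    n, big_n = len(y_labeled), len(f_unlabeled)

    # Sample covariances estimated on the labeled subset.
    f_centered = f_labeled - f_labeled.mean(axis=0)
    y_centered = y_labeled - y_labeled.mean()
    cov_ff = f_centered.T @ f_centered / (n - 1)  # shape (k, k)
    cov_fy = f_centered.T @ y_centered / (n - 1)  # shape (k,)

    # Variance-minimising weights, shrunk by 1 / (1 + n/N) because the
    # proxy mean on the unlabeled rows is itself a noisy estimate.
    weights = np.linalg.lstsq(cov_ff, cov_fy, rcond=None)[0] / (1 + n / big_n)

    # Human-only mean plus the weighted gap between the two proxy means.
    estimate = y_labeled.mean() + weights @ (
        f_unlabeled.mean(axis=0) - f_labeled.mean(axis=0)
    )

    # Labeled and unlabeled rows are independent, so their variances add.
    residual = y_labeled - f_labeled @ weights
    cov_unlabeled = np.atleast_2d(np.cov(f_unlabeled, rowvar=False))
    variance = residual.var(ddof=1) / n + weights @ cov_unlabeled @ weights / big_n
    return estimate, weights, np.sqrt(variance)


# Synthetic check: each judge only catches a fraction of real hallucinations.
rng = np.random.default_rng(0)
y = (rng.random(22_000) < 0.10).astype(float)
judges = np.column_stack(
    [
        y * (rng.random(y.size) < 0.5),  # weaker judge, mean close to 5%
        y * (rng.random(y.size) < 0.8),  # stronger judge, mean close to 8%
    ]
)
est, w, se = multi_proxy_mean(y[:2_000], judges[:2_000], judges[2_000:])
human_se = y[:2_000].std(ddof=1) / np.sqrt(2_000)
print(f"estimate={est:.3f}  weights={np.round(w, 2)}  se={se:.4f}  human-only se={human_se:.4f}")
```

In this sketch the proxy term has zero expectation for any fixed weights, so the weights affect the variance of the estimate but not its centre. That is the reason a biased judge can still be useful.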
estimator = MultiPPIMeanEstimator()
multi_ppi_result = estimator.estimate(
    y_true,
    y_proxies,
    metric_name="Hallucination Rate",
    confidence_level=0.95,
)
print(multi_ppi_result.summary())
Metric: Hallucination Rate
Point Estimate: 0.095
Confidence Interval (95%): [0.065, 0.125]
Estimator : MultiPPIMeanEstimator
n_true: 200
n_proxy: 2200
Effective Sample Size: 315
Multi-PPI++ estimates tuning parameters from the labeled data to minimise confidence interval width. Because Claude agrees more closely with human annotators (higher correlation), it receives a larger weight. The effective sample size quantifies the annotation budget that the human-only strategy would need to match Multi-PPI++'s precision.
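One common way to read the effective sample size is as the number of human labels whose classical standard error would equal the standard error of the combined estimate. The sketch below applies that definition to the rounded numbers printed above; whether GLIDE computes the value exactly this way is an assumption, and `effective_sample_size` is a hypothetical helper.

```python
from statistics import NormalDist


def effective_sample_size(label_variance, ci_lower, ci_upper, confidence_level=0.95):
    """Labels needed for a classical mean to match a given interval width."""
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    # Recover the standard error from the half-width of the interval.
    standard_error = (ci_upper - ci_lower) / (2 * z)
    # Classical variance of a mean is label_variance / n, so solve for n.
    return label_variance / standard_error**2


# Variance of a 0/1 label with 8.0% positives (the human-only estimate).
label_variance = 0.08 * (1 - 0.08)
ess = effective_sample_size(label_variance, ci_lower=0.065, ci_upper=0.125)
print(round(ess))
```

With these rounded inputs the result lands within a few units of the reported 315, which supports the interpretation but does not prove it.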
Multi-PPI++ delivers an unbiased estimate at low cost
The plot below compares point estimates and 95% confidence intervals for all four methods against the true hallucination rate (dashed line):
- Gemini: very narrow confidence interval, but wrong.
- Claude: narrow confidence interval, also wrong.
- Human-only: unbiased, but the confidence interval is very wide.
- Multi-PPI++: unbiased and narrower than human-only, combining the accuracy of human labels with the precision of 2,200 proxy judgements from each judge.
TRUE_RATE = 0.10
estimates = [
    (
        f"Gemini\n({multi_ppi_result.n_proxy} | raw proxy)",
        gemini_mean,
        gemini_lower,
        gemini_upper,
        C_GEMINI,
    ),
    (
        f"Claude\n({multi_ppi_result.n_proxy} | raw proxy)",
        claude_mean,
        claude_lower,
        claude_upper,
        C_CLAUDE,
    ),
    (
        f"Human Annotation\n({multi_ppi_result.n_true} | small sample)",
        human_mean,
        human_lower,
        human_upper,
        C_HUMAN,
    ),
    (
        f"Multi-PPI++ (GLIDE)\n({multi_ppi_result.n_true} + {multi_ppi_result.n_proxy})\n(full data exploited)",
        multi_ppi_result.mean,
        multi_ppi_result.confidence_interval.lower_bound,
        multi_ppi_result.confidence_interval.upper_bound,
        C_GLIDE,
    ),
]
fig, ax = plt.subplots(figsize=(11, 7))
y_pos = [3, 2, 1, 0]
for y, (label, mean, lo, hi, color) in zip(y_pos, estimates):
    ax.plot([lo, hi], [y, y], color=color, linewidth=4, solid_capstyle="round", zorder=3)
    for xc in [lo, hi]:
        ax.plot([xc, xc], [y - 0.2, y + 0.2], color=color, linewidth=2.5, zorder=3)
    ax.scatter(mean, y, s=200, color=color, zorder=5, edgecolors="white", linewidths=2.5)
    ax.text(mean, y + 0.34, f"{mean:.1%}", ha="center", va="bottom", fontsize=12, color=color, fontweight="bold")
    ax.text(mean, y - 0.34, f"[{lo:.1%}, {hi:.1%}]", ha="center", va="top", fontsize=11, color="#888888")
ax.axvline(TRUE_RATE, color=C_TRUTH, linestyle="--", linewidth=2.5, zorder=4)
ax.text(TRUE_RATE + 0.004, 3.72, "True rate 10%", color=C_TRUTH, fontsize=10.5, fontweight="bold")
ax.set_yticks(y_pos)
ax.set_yticklabels([e[0] for e in estimates])
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.set_xlabel("Estimated Hallucination Rate")
ax.set_xlim(-0.01, 0.26)
ax.set_ylim(-0.8, 4.2)
ax.spines[["top", "right", "left"]].set_visible(False)
ax.tick_params(left=False)
plt.tight_layout()
plt.show()
Testing whether the hallucination rate is within acceptable limits
GLIDE's estimation output enables a formal hypothesis test: is the true hallucination rate significantly higher than 5% (the rate Gemini reports)?
$$H_0 : \mu = 5\% \qquad H_1 : \mu > 5\%$$
In business terms: does the evidence show that the hallucination rate exceeds the 5% tolerance, so that the system should not be deployed as-is?
Without Multi-PPI++, Gemini's labels alone give a confidence interval centred near 5% ([3.6%, 5.4%] above), so a test based on them would not reject the null hypothesis. Claude's interval lies above 5%, but it is centred on a biased value, so a conclusion drawn from it rests on the wrong estimate. The confidence interval from human annotations alone is valid but wide, which leads to conservative decisions.
Multi-PPI++ combines all sources into a single bias-corrected estimate with a narrower confidence interval, which gives the test more statistical power.
z_stat, p_value, _ = multi_ppi_result.confidence_interval.test_null_hypothesis(
    h0_value=0.05,
    alternative="larger",
)
sep = "-" * 48
print("Hypothesis test — Multi-PPI++ Estimator")
print(sep)
print("H0 : hallucination rate = 5% (Gemini says so)")
print("H1 : hallucination rate > 5% (users complain!)")
print()
print(f"z-statistic : {z_stat:.2f}")
print(f"p-value : {p_value:.10f}")
print()
if p_value < 0.05:
    print("Decision : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision : Given the p-value is higher than our threshold (p-value >= 0.05),"
        " we cannot reject H0 at the 5% level."
    )
Hypothesis test — Multi-PPI++ Estimator
------------------------------------------------
H0 : hallucination rate = 5% (Gemini says so)
H1 : hallucination rate > 5% (users complain!)

z-statistic : 2.95
p-value : 0.0015812791

Decision : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.
=> The true hallucination rate is significantly above 5%.
Notice that the null hypothesis is rejected, signalling that the hallucination rate is significantly above the fixed threshold.
Let us try the same hypothesis test using human annotations only.
human_labels_ci = human_estimate.confidence_interval
z_stat_cm, p_value_cm, _ = human_labels_ci.test_null_hypothesis(
    h0_value=0.05,
    alternative="larger",
)
sep = "-" * 48
print("Hypothesis test — Classical Mean Estimator (Human labels only)")
print(sep)
print("H0 : hallucination rate = 5% (Gemini says so)")
print("H1 : hallucination rate > 5% (users complain!)")
print()
print(f"z-statistic : {z_stat_cm:.2f}")
print(f"p-value : {p_value_cm:.10f}")
print()
if p_value_cm < 0.05:
    print("Decision : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision : Given the p-value is higher than our threshold (p-value >= 0.05),"
        " we cannot reject H0 at the 5% level."
    )
Hypothesis test — Classical Mean Estimator (Human labels only)
------------------------------------------------
H0 : hallucination rate = 5% (Gemini says so)
H1 : hallucination rate > 5% (users complain!)

z-statistic : 1.56
p-value : 0.0593866096

Decision : Given the p-value is higher than our threshold (p-value >= 0.05), we cannot reject H0 at the 5% level.
This time the null hypothesis is not rejected: with only 200 labels, the uncertainty is too high to conclude that the rate exceeds 5%. The human-only p-value (about 0.059) is far larger than the Multi-PPI++ p-value (about 0.0016), reflecting the greater uncertainty of a small labeled sample used alone. Multi-PPI++ reduces that uncertainty by leveraging both judges' labels, which increases statistical power.
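Both results above are consistent with a standard one-sided z-test on the estimate and its standard error. The sketch below shows that calculation in isolation; `one_sided_z_test` is a hypothetical helper, and the assumption that `test_null_hypothesis` works this way is inferred from the printed numbers rather than from GLIDE's documentation.

```python
from statistics import NormalDist


def one_sided_z_test(estimate, standard_error, h0_value):
    """Test H0: mu = h0_value against H1: mu > h0_value."""
    z = (estimate - h0_value) / standard_error
    # Upper-tail probability of a standard normal beyond z.
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


# Human-only case: 16 hallucinations among 200 annotated conversations.
n, positives = 200, 16
rate = positives / n
# Standard error of the mean using the unbiased sample variance (ddof=1).
standard_error = (rate * (1 - rate) / (n - 1)) ** 0.5

z, p = one_sided_z_test(rate, standard_error, h0_value=0.05)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 1.56, p = 0.0594
```

A smaller standard error at a similar point estimate is all it takes to move the same test from "cannot reject" to "reject", which is the effect the proxy labels provide.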
Summary: Multi-PPI++ combines accuracy and precision
| | Gemini only | Claude only | Human-only | Multi-PPI++ |
|---|---|---|---|---|
| Sample size | 2,200 | 2,200 | 200 | 200 + 2 $\times$ 2,200 |
| Unbiased estimate | ❌ | ❌ | ✅ | ✅ |
| Narrow confidence interval | 🟠 (misleading) | 🟠 (misleading) | ❌ | ✅ |
| Labeling cost | Low | Low | High | Small |
Key takeaways:
LLM judges are biased, and they disagree. A narrow confidence interval around the wrong value is worse than useless: it gives false confidence. Choosing between judges based on which number looks better is not a principled strategy.
200 human annotations go a long way. Multi-PPI++ uses the 2,200 cheap proxy labels from each judge to narrow the 95% confidence interval from [4.2%, 11.8%] (human-only) to [6.5%, 12.5%], an effective sample size of about 315 human annotations.
Multi-PPI++ efficiency relies on sample size and judge quality. To shrink the confidence interval further, you can invest in either more human annotations or better-aligned judges. Multi-PPI++ automatically makes the most of the available proxy labels.
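The trade-off between judge quality and sample size can be quantified for the simplest case. For a single proxy with correlation rho to the human labels, and an unlabeled sample that is independent of the labeled one, the optimally tuned PPI++ mean has asymptotic variance equal to the human-only variance times 1 - rho^2 / (1 + n/N). The sketch below evaluates that factor; it is an idealised single-proxy formula, not a description of GLIDE's multi-proxy output, and `variance_ratio` is a hypothetical helper.

```python
def variance_ratio(rho, n_labeled, n_unlabeled):
    """Asymptotic variance of the tuned estimate relative to human-only."""
    return 1.0 - rho**2 / (1.0 + n_labeled / n_unlabeled)


n_labeled, n_unlabeled = 200, 2000
for rho in (0.3, 0.6, 0.9):
    ratio = variance_ratio(rho, n_labeled, n_unlabeled)
    # Dividing n by the ratio gives the equivalent number of human labels.
    print(f"rho={rho:.1f}  variance ratio={ratio:.3f}  equivalent labels={n_labeled / ratio:.0f}")
```

The ratio never exceeds 1, and it falls quickly as rho approaches 1, so a better-aligned judge pays off more than a marginally larger unlabeled pool once N is already much larger than n.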
Want to go further? The Multi-PPI++ scientific validation notebook provides rigorous empirical evidence — coverage validity and confidence interval width across a sweep of proxy correlations and confidence levels.