Multiple Proxies: Multi-PPI++

Multi-PPI++ extends PPI++ to settings where multiple proxy annotation sources are available simultaneously. Instead of picking a single one or averaging them naively, Multi-PPI++ finds the optimal combination of all proxy sources and applies a bias correction, producing a statistically valid estimate that leverages every signal at hand.

This guide walks through a realistic hallucination-detection scenario end-to-end.

What you will learn:

Why two different LLM judges disagree and why they can be biased
How to use Multi-PPI++ to produce a bias-corrected metric from multiple proxy sources
How Multi-PPI++ compares to using a single proxy or human annotations alone

The problem: your LLM judges disagree with your users

Let's assume you run a customer-facing AI assistant handling thousands of conversations per day.

The signals

Roughly one in ten users reports incorrect or fabricated information, a rate that management considers unacceptable.
You deploy two LLM judges to measure the hallucination rate: Gemini reports 5% and Claude reports 8%.

The users and both judges disagree. You decide to dig deeper.

The manual investigation

You budget for 200 manual annotations — expensive but accurate ground truth. Annotators find that ~10% of conversations contain a blatant hallucination.

That is double what Gemini reports and 25% above what Claude reports. Both judges are systematically optimistic, just to different degrees.

The challenge

You now have:

2,200 Gemini judgements — cheap and fast, but biased
2,200 Claude judgements — cheap and fast, but also biased
200 human annotations — accurate, but covering only a small portion of your data

Multi-PPI++ combines all sources to produce a reliable, unbiased estimate of the true hallucination rate across all conversations.

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

from glide.estimators import ClassicalMeanEstimator, MultiPPIMeanEstimator
from glide.samplers import UniformSampler
from glide.simulators import generate_multi_binary_dataset, simulate_annotation

# ── Colour palette ──────────────────────────────────────────
C_GEMINI = "#4A90D9"  # Gemini     — blue
C_CLAUDE = "#E87040"  # Claude     — orange
C_HUMAN = "#2E86AB"  # Human-only — steel blue
C_GLIDE = "#27AE60"  # GLIDE      — green
C_TRUTH = "#2C3E50"  # True value — dark slate

# ── Global plot style ───────────────────────────────────────
plt.rcParams.update(
    {
        "figure.facecolor": "white",
        "axes.facecolor": "#FAFAFA",
        "axes.grid": True,
        "grid.color": "#E5E5E5",
        "grid.linewidth": 0.8,
        "font.size": 18,
        "axes.labelsize": 18,
        "axes.titlesize": 18,
        "legend.fontsize": 16,
        "xtick.labelsize": 16,
        "ytick.labelsize": 16,
        "figure.titlesize": 19,
    }
)

Simulating thousands of conversations with two biased judges

generate_multi_binary_dataset produces synthetic data that mirrors the scenario above, with ground-truth labels available for all samples.

The simulation replicates the practical workflow: two proxy labels are generated for every conversation, a random subset is selected for human annotation, and the remaining ground-truth labels are masked with np.nan to reflect what would be unobserved in practice.

Generate synthetic data with generate_multi_binary_dataset: both judges label all samples.
Sample a subset for annotation with UniformSampler.
Annotate selected samples and mask the rest via simulate_annotation.

The table below summarizes the arrays produced and what each contains.

Array	Meaning
`y_true_oracle`	Ground-truth labels for all conversations (revealed only after annotation)
`y_proxies`	Proxy predictions for all rows. Contains two columns representing Gemini's and Claude's labels
`xi`	Annotation indicator: 1 if annotated, 0 otherwise
`y_true`	Observed labels: ground-truth where annotated, `np.nan` elsewhere

We extract the columns of y_proxies into two separate arrays y_gemini and y_claude.
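
To make the masking step concrete, here is a minimal NumPy sketch of what sampling and masking could look like. It does not use GLIDE: the names `y_oracle_demo`, `xi_demo`, and `y_obs_demo` are hypothetical, and the real `UniformSampler` and `simulate_annotation` may differ in their details.

```python
import numpy as np

rng = np.random.default_rng(0)
n_total_demo, n_labeled_demo = 10, 3

# Ground-truth labels for every conversation (known only in a simulation).
y_oracle_demo = rng.integers(0, 2, size=n_total_demo).astype(float)

# Annotation indicator: 1 for rows sent to human annotators, 0 otherwise.
xi_demo = np.zeros(n_total_demo, dtype=int)
xi_demo[rng.choice(n_total_demo, size=n_labeled_demo, replace=False)] = 1

# Observed labels: ground truth where annotated, NaN everywhere else.
y_obs_demo = np.where(xi_demo == 1, y_oracle_demo, np.nan)

print(xi_demo)
print(y_obs_demo)
```

The NaN entries are what lets a downstream estimator tell annotated rows from rows that only carry judge labels.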

N_TOTAL = 2200
N_LABELED = 200
RANDOM_SEED = 12

y_true_oracle, y_proxies = generate_multi_binary_dataset(
    n_total=N_TOTAL,
    true_mean=0.10,
    proxy_means=[0.05, 0.08],
    correlations=[0.60, 0.70],
    random_seed=RANDOM_SEED,
)
y_gemini = y_proxies[:, 0]
y_claude = y_proxies[:, 1]

xi = UniformSampler().sample(n_total=N_TOTAL, n_samples=N_LABELED, random_seed=RANDOM_SEED)
y_true = simulate_annotation(y_true_oracle, xi)

labeled_mask = ~np.isnan(y_true)
n_labeled = np.sum(labeled_mask)
n_unlabeled = len(y_true) - n_labeled

print(f"Total conversations  : {len(y_true):,}")
print(f"  Manually annotated : {n_labeled:,}")
print(f"  Judge-labelled only: {n_unlabeled:,}")
print()
print("Sample values")
print(f"  y_true (labeled):    {y_true[labeled_mask][0]}")
print(f"  y_true (unlabeled):  {y_true[~labeled_mask][0]}")
print(f"  y_gemini (all):      {y_gemini[0]}")
print(f"  y_claude (all):      {y_claude[0]}")

Total conversations  : 2,200
  Manually annotated : 200
  Judge-labelled only: 2,000

Sample values
  y_true (labeled):    0.0
  y_true (unlabeled):  nan
  y_gemini (all):      0.0
  y_claude (all):      0.0

Three naive strategies all fall short

Three obvious approaches to estimating the true hallucination rate each have a fatal flaw:

Option A — Trust Gemini on all conversations.
Precise (large sample), but Gemini's systematic bias makes the estimate wrong.

Option B — Trust Claude on all conversations.
More accurate than Gemini, but still biased. The confidence interval is narrow around the wrong value.

Option C — Trust only the human annotations.
Unbiased, but the 95% confidence interval is very wide due to the small annotated sample.

Multi-PPI++, introduced in the next section, fixes both problems simultaneously.
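
As a point of reference for Option C, a classical mean with a normal-approximation interval can be sketched in a few lines. This is an illustration that assumes unobserved labels are stored as NaN and simply ignored; `mean_with_ci` is a hypothetical helper, not the implementation of `ClassicalMeanEstimator`.

```python
import numpy as np


def mean_with_ci(values, z=1.96):
    """Sample mean and normal-approximation interval, ignoring NaN entries."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]
    mean = x.mean()
    std_err = x.std(ddof=1) / np.sqrt(x.size)
    return mean, mean - z * std_err, mean + z * std_err


# Toy data shaped like the scenario: 200 annotated rows (16 hallucinations)
# followed by 2,000 rows without a human label.
human_like = np.r_[np.ones(16), np.zeros(184), np.full(2000, np.nan)]
demo_mean, demo_lo, demo_hi = mean_with_ci(human_like)
print(f"{demo_mean:.1%}  [{demo_lo:.1%}, {demo_hi:.1%}]")
```

With only 200 observed labels the standard error is driven by the small denominator, which is why the human-only interval is wide even though it is centered on an unbiased estimate.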

# Option A: Gemini judge — average proxy labels over all conversations
gemini_estimate = ClassicalMeanEstimator().estimate(y_gemini)
gemini_mean = gemini_estimate.mean
gemini_lower = gemini_estimate.confidence_interval.lower_bound
gemini_upper = gemini_estimate.confidence_interval.upper_bound

# Option B: Claude judge — average proxy labels over all conversations
claude_estimate = ClassicalMeanEstimator().estimate(y_claude)
claude_mean = claude_estimate.mean
claude_lower = claude_estimate.confidence_interval.lower_bound
claude_upper = claude_estimate.confidence_interval.upper_bound

# Option C: human labels only — average over labeled conversations
human_estimate = ClassicalMeanEstimator().estimate(y_true)
human_mean = human_estimate.mean
human_lower = human_estimate.confidence_interval.lower_bound
human_upper = human_estimate.confidence_interval.upper_bound

sep = "-" * 70
print(sep)
print(f"{'Method':<34} {'Estimate':>8}   {'95% confidence interval':>16}")
print(sep)
print(f"{'Gemini only  (n=2,200)':<34} {gemini_mean:>7.1%}   [{gemini_lower:.1%}, {gemini_upper:.1%}]")
print(f"{'Claude only  (n=2,200)':<34} {claude_mean:>7.1%}   [{claude_lower:.1%}, {claude_upper:.1%}]")
print(f"{'Human labels only (n=200)':<34} {human_mean:>7.1%}   [{human_lower:.1%}, {human_upper:.1%}]")
print(sep)
print(f"{'True rate  (simulation)':<34} {'10.0%':>8}")

----------------------------------------------------------------------
Method                             Estimate   95% confidence interval
----------------------------------------------------------------------
Gemini only  (n=2,200)                4.5%   [3.6%, 5.4%]
Claude only  (n=2,200)                7.7%   [6.6%, 8.8%]
Human labels only (n=200)             8.0%   [4.2%, 11.8%]
----------------------------------------------------------------------
True rate  (simulation)               10.0%

The root cause: both judges are systematically biased

The pattern is clear: compared to human annotators, both Gemini and Claude consistently under-report hallucinations on average. Gemini is more severely biased than Claude, but neither judge can be trusted directly.

The Multi-PPI++ rectifier measures each judge's systematic error on the labeled subset, then applies the correction across all conversations, optimally weighting the two judges based on how well each agrees with the human annotations.
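
The rectification idea is easiest to see with a single judge and no weighting. The sketch below is a simplified illustration, not GLIDE's code: `rectified_mean` is a hypothetical helper, and it assumes the proxy average is taken over all conversations.

```python
import numpy as np


def rectified_mean(y_labeled, proxy_labeled, proxy_all):
    """Proxy mean over all rows, minus the bias measured on the labeled rows."""
    bias = np.mean(proxy_labeled - y_labeled)  # judge minus human, labeled subset
    return np.mean(proxy_all) - bias


# Toy numbers: the judge reports 5% overall, but on the 50 annotated rows
# it reports 4% where humans find 10%, so its bias is -6 points.
y_lab = np.r_[np.ones(5), np.zeros(45)]
proxy_lab = np.r_[np.ones(2), np.zeros(48)]
proxy_everywhere = np.r_[np.ones(5), np.zeros(95)]

corrected = rectified_mean(y_lab, proxy_lab, proxy_everywhere)
print(f"{corrected:.1%}")
```

Multi-PPI++ builds on the same principle, but scales each judge's contribution by a learned weight instead of applying the full correction.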

labeled_mask = ~np.isnan(y_true)
y_true_filtered = y_true[labeled_mask]
y_gemini_labeled = y_gemini[labeled_mask]
y_claude_labeled = y_claude[labeled_mask]

gemini_mean_labeled = np.mean(y_gemini_labeled)
claude_mean_labeled = np.mean(y_claude_labeled)
t_mean = np.mean(y_true_filtered)
gemini_bias = gemini_mean_labeled - t_mean
claude_bias = claude_mean_labeled - t_mean

fig, ax = plt.subplots(figsize=(9, 4.5))

x_pos = np.array([0, 1, 2])
bar_vals = [gemini_mean_labeled, claude_mean_labeled, t_mean]
bar_colors = [C_GEMINI, C_CLAUDE, C_HUMAN]
bar_labels = ["Gemini\n(annotated subset)", "Claude\n(annotated subset)", "Human Annotation\n(ground truth)"]

ax.bar(x_pos, bar_vals, color=bar_colors, width=0.45, edgecolor="white", linewidth=2, zorder=3)

for xi_pos, val, c in zip(x_pos, bar_vals, bar_colors):
    ax.text(xi_pos, val + 0.004, f"{val:.1%}", ha="center", fontsize=14, fontweight="bold", color=c)

for xi_pos, bias in zip([0, 1], [gemini_bias, claude_bias]):
    ax.text(
        xi_pos,
        t_mean + 0.055,
        f"Bias = {bias:+.1%}",
        ha="center",
        fontsize=11,
        color="#555555",
        fontstyle="italic",
        bbox=dict(boxstyle="round,pad=0.35", facecolor="#FFFDE7", edgecolor="#CCCCCC"),
    )

ax.axhline(0.10, color=C_TRUTH, linestyle="--", linewidth=1.8, zorder=2, label="True rate: 10%")

ax.set_xticks(x_pos)
ax.set_xticklabels(bar_labels)
ax.set_ylabel("Hallucination Rate")
ax.set_ylim(0, 0.28)
ax.set_xlim(-0.5, 2.5)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.spines[["top", "right"]].set_visible(False)
ax.legend(loc="upper right")
plt.tight_layout()
plt.show()

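Figure: hallucination rate on the annotated subset for Gemini, Claude, and human annotation, with each judge's bias labeled and the true rate (10%) shown as a dashed line.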

Multi-PPI++ corrects bias by optimally combining all sources

MultiPPIMeanEstimator implements Multi-PPI++, combining:

n labeled samples (with human annotations and both judge labels)
N unlabeled samples (both judge labels only)

The estimator automatically learns the optimal weight for each judge from the labeled data. A judge that agrees more closely with human annotators receives a larger weight, contributing more to bias rectification. Because Multi-PPI++ is power-tuned, it is asymptotically at least as efficient as human-only estimation, regardless of the quality of the judges. Multi-PPI++ also generalizes to an arbitrary number of proxies.
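
For intuition about how such weights can be learned, here is a sketch of a power-tuned mean estimator with several proxies. It follows the textbook PPI++ recipe for the mean, extended to a weight vector, and is an assumption about the general technique rather than a description of `MultiPPIMeanEstimator`; the function `multi_proxy_mean` and the toy judges are hypothetical.

```python
import numpy as np


def multi_proxy_mean(y, proxies, z=1.96):
    """Power-tuned mean from NaN-masked labels and an (N, k) proxy matrix."""
    y = np.asarray(y, dtype=float)
    F = np.asarray(proxies, dtype=float)
    labeled = ~np.isnan(y)
    y_l, F_l, F_u = y[labeled], F[labeled], F[~labeled]
    n, N = len(y_l), len(F_u)

    # Weights: regression of human labels on proxies, shrunk by 1 / (1 + n/N).
    cov_ff = np.atleast_2d(np.cov(F_l, rowvar=False))
    cov_fy = np.array([np.cov(F_l[:, j], y_l)[0, 1] for j in range(F.shape[1])])
    lam = np.linalg.pinv(cov_ff) @ cov_fy / (1.0 + n / N)

    # Human mean plus the weighted gap between unlabeled and labeled proxy means.
    estimate = y_l.mean() + lam @ (F_u.mean(axis=0) - F_l.mean(axis=0))
    variance = np.var(y_l - F_l @ lam, ddof=1) / n + np.var(F_u @ lam, ddof=1) / N
    half = z * np.sqrt(variance)
    return estimate, estimate - half, estimate + half, lam


# Toy data: two judges that each miss a share of the true hallucinations.
rng = np.random.default_rng(0)
y_full = (rng.random(2200) < 0.10).astype(float)
judge_a = y_full * (rng.random(2200) < 0.5)  # catches about half of them
judge_b = y_full * (rng.random(2200) < 0.8)  # catches about 80% of them
proxies_demo = np.column_stack([judge_a, judge_b])

y_demo = np.full(2200, np.nan)
idx = rng.choice(2200, size=200, replace=False)
y_demo[idx] = y_full[idx]

est, lo, hi, lam = multi_proxy_mean(y_demo, proxies_demo)
human_half = 1.96 * y_full[idx].std(ddof=1) / np.sqrt(200)
print(f"weights: {lam}")
print(f"multi-proxy half-width: {(hi - lo) / 2:.3f}  human-only: {human_half:.3f}")
```

The shrinkage factor `1 / (1 + n/N)` is what keeps the estimator from over-trusting the proxies when the unlabeled pool is small relative to the labeled set.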

estimator = MultiPPIMeanEstimator()

multi_ppi_result = estimator.estimate(
    y_true,
    y_proxies,
    metric_name="Hallucination Rate",
    confidence_level=0.95,
)

print(multi_ppi_result.summary())

Metric: Hallucination Rate
Point Estimate: 0.095
Confidence Interval (95%): [0.065, 0.125]
Estimator : MultiPPIMeanEstimator
n_true: 200
n_proxy: 2200
Effective Sample Size: 315

Multi-PPI++ estimates tuning parameters from the labeled data to minimise confidence interval width. Because Claude agrees more closely with human annotators (higher correlation), it receives a larger weight. The effective sample size quantifies the annotation budget that the human-only strategy would need to match Multi-PPI++'s precision.
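
A quick sanity check ties the effective sample size to the interval widths printed so far. The sketch assumes the effective sample size is the number of human labels a classical estimator would need to reach the same variance, so interval width should scale roughly as one over the square root of the sample size. The numbers are copied from the outputs above.

```python
import math

n_human = 200  # human annotations actually collected
ess = 315      # effective sample size reported by summary()

human_width = 0.118 - 0.042  # human-only 95% interval
multi_width = 0.125 - 0.065  # Multi-PPI++ 95% interval

gain = ess / n_human
predicted_ratio = math.sqrt(n_human / ess)
observed_ratio = multi_width / human_width

print(f"Annotation budget multiplier : {gain:.2f}x")
print(f"Predicted width ratio        : {predicted_ratio:.2f}")
print(f"Observed width ratio         : {observed_ratio:.2f}")
```

Both ratios come out close to 0.8: the proxy labels buy the equivalent of about 115 extra human annotations and an interval roughly 20% narrower.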

Multi-PPI++ delivers an unbiased estimate at low cost

The plot below compares point estimates and 95% confidence intervals for all four methods against the true hallucination rate (dashed line):

Gemini: very narrow confidence interval, but wrong.
Claude: narrow confidence interval, also wrong.
Human-only: unbiased, but the confidence interval is very wide.
Multi-PPI++: unbiased and narrow. It combines the accuracy of human labels with the precision of 2,200 proxy judgements from each judge.

TRUE_RATE = 0.10

estimates = [
    (
        f"Gemini\n({multi_ppi_result.n_proxy}  |  raw proxy)",
        gemini_mean,
        gemini_lower,
        gemini_upper,
        C_GEMINI,
    ),
    (
        f"Claude\n({multi_ppi_result.n_proxy}  |  raw proxy)",
        claude_mean,
        claude_lower,
        claude_upper,
        C_CLAUDE,
    ),
    (
        f"Human Annotation\n({multi_ppi_result.n_true}  |  small sample)",
        human_mean,
        human_lower,
        human_upper,
        C_HUMAN,
    ),
    (
        f"Multi-PPI++ (GLIDE)\n({multi_ppi_result.n_true}  +  {multi_ppi_result.n_proxy})\n(full data exploited)",
        multi_ppi_result.mean,
        multi_ppi_result.confidence_interval.lower_bound,
        multi_ppi_result.confidence_interval.upper_bound,
        C_GLIDE,
    ),
]

fig, ax = plt.subplots(figsize=(11, 7))
y_pos = [3, 2, 1, 0]

for y, (label, mean, lo, hi, color) in zip(y_pos, estimates):
    ax.plot([lo, hi], [y, y], color=color, linewidth=4, solid_capstyle="round", zorder=3)
    for xc in [lo, hi]:
        ax.plot([xc, xc], [y - 0.2, y + 0.2], color=color, linewidth=2.5, zorder=3)
    ax.scatter(mean, y, s=200, color=color, zorder=5, edgecolors="white", linewidths=2.5)
    ax.text(mean, y + 0.34, f"{mean:.1%}", ha="center", va="bottom", fontsize=12, color=color, fontweight="bold")
    ax.text(mean, y - 0.34, f"[{lo:.1%},  {hi:.1%}]", ha="center", va="top", fontsize=11, color="#888888")

ax.axvline(TRUE_RATE, color=C_TRUTH, linestyle="--", linewidth=2.5, zorder=4)
ax.text(TRUE_RATE + 0.004, 3.72, "True rate  10%", color=C_TRUTH, fontsize=10.5, fontweight="bold")

ax.set_yticks(y_pos)
ax.set_yticklabels([e[0] for e in estimates])
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.set_xlabel("Estimated Hallucination Rate")
ax.set_xlim(-0.01, 0.26)
ax.set_ylim(-0.8, 4.2)
ax.spines[["top", "right", "left"]].set_visible(False)
ax.tick_params(left=False)
plt.tight_layout()
plt.show()

Testing whether the hallucination rate is within acceptable limits

GLIDE's estimation output enables a formal hypothesis test: is the true hallucination rate significantly higher than 5% (the rate Gemini reports)?

$$H_0 : \mu = 5\% \qquad H_1 : \mu > 5\%$$

In business terms: does our system's hallucination rate exceed the 5% tolerance, so that it cannot be deployed safely as is?

Without Multi-PPI++, using a judge's labels directly gives a misleadingly narrow confidence interval centered on a biased value. With Gemini, whose interval sits around 5%, we would be unlikely to reject the null hypothesis. On the other hand, the confidence interval obtained from human annotations alone is wide and leads to conservative decisions.

Multi-PPI++ combines all sources into a single bias-corrected estimate with a valid confidence interval, which makes the hypothesis test both trustworthy and more powerful.

z_stat, p_value, _ = multi_ppi_result.confidence_interval.test_null_hypothesis(
    h0_value=0.05,
    alternative="larger",
)

sep = "-" * 48
print("Hypothesis test — Multi-PPI++ Estimator")
print(sep)
print("H0 : hallucination rate = 5%   (Gemini says so)")
print("H1 : hallucination rate > 5%   (users complain!)")
print()
print(f"z-statistic : {z_stat:.2f}")
print(f"p-value     : {p_value:.10f}")
print()
if p_value < 0.05:
    print("Decision  : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision  : Given the p-value is higher than our threshold (p-value >= 0.05),"
        " we cannot reject H0 at the 5% level."
    )

Hypothesis test — Multi-PPI++ Estimator
------------------------------------------------
H0 : hallucination rate = 5%   (Gemini says so)
H1 : hallucination rate > 5%   (users complain!)

z-statistic : 2.95
p-value     : 0.0015812791

Decision  : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.
=> The true hallucination rate is significantly above 5%.

Notice that the null hypothesis is rejected, signalling that the hallucination rate is significantly above the fixed threshold.

Let us try the same hypothesis test using human annotations only.

human_labels_ci = human_estimate.confidence_interval
z_stat_cm, p_value_cm, _ = human_labels_ci.test_null_hypothesis(
    h0_value=0.05,
    alternative="larger",
)

sep = "-" * 48
print("Hypothesis test — Classical Mean Estimator (Human labels only)")
print(sep)
print("H0 : hallucination rate = 5%   (Gemini says so)")
print("H1 : hallucination rate > 5%   (users complain!)")
print()
print(f"z-statistic : {z_stat_cm:.2f}")
print(f"p-value     : {p_value_cm:.10f}")
print()
if p_value_cm < 0.05:
    print("Decision  : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision  : Given the p-value is higher than our threshold (p-value >= 0.05),"
        " we cannot reject H0 at the 5% level."
    )

Hypothesis test — Classical Mean Estimator (Human labels only)
------------------------------------------------
H0 : hallucination rate = 5%   (Gemini says so)
H1 : hallucination rate > 5%   (users complain!)

z-statistic : 1.56
p-value     : 0.0593866096

Decision  : Given the p-value is higher than our threshold (p-value >= 0.05), we cannot reject H0 at the 5% level.

The null hypothesis is not rejected due to high uncertainty, and it is not possible to draw the same conclusion. Indeed, the p-value from the human-only estimator is much larger than with Multi-PPI++, reflecting the greater uncertainty from using a small labeled sample alone. As we saw above, Multi-PPI++ fixes this by leveraging both judges' labels to reduce uncertainty and gain statistical power.
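
Both tests can be approximately reproduced by hand from the point estimates and 95% intervals printed earlier. The sketch below assumes a symmetric normal-approximation interval and a one-sided z-test; `one_sided_z_test` is a hypothetical helper, not the `test_null_hypothesis` method itself.

```python
import math


def one_sided_z_test(estimate, ci_lower, ci_upper, h0_value, z_crit=1.959964):
    """z-statistic and p-value for H1: mean > h0_value, from a 95% interval."""
    std_err = (ci_upper - ci_lower) / (2 * z_crit)
    z = (estimate - h0_value) / std_err
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal
    return z, p


# Inputs are the rounded values printed above.
z_multi, p_multi = one_sided_z_test(0.095, 0.065, 0.125, h0_value=0.05)
z_human, p_human = one_sided_z_test(0.080, 0.042, 0.118, h0_value=0.05)

print(f"Multi-PPI++ : z = {z_multi:.2f}, p = {p_multi:.4f}")
print(f"Human-only  : z = {z_human:.2f}, p = {p_human:.4f}")
```

Because the inputs are rounded, the results differ slightly from the reported 2.95 and 1.56, but the decisions are the same: the gap to 5% is larger and the standard error smaller for Multi-PPI++, and both effects push the z-statistic up.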

Summary: Multi-PPI++ combines accuracy and precision

	Gemini only	Claude only	Human-only	Multi-PPI++
Sample size	2,200	2,200	200	200 + 2 $\times$ 2,200
Unbiased estimate	❌	❌	✅	✅
Narrow confidence interval	🟠 (misleading)	🟠 (misleading)	❌	✅
Labeling cost	Low	Low	High	Small

Key takeaways:

LLM judges are biased, and they disagree. A narrow confidence interval around the wrong value is worse than useless: it gives false confidence. Choosing between judges based on which number looks better is not a principled strategy.
200 human annotations is all you need. Multi-PPI++ uses information from 2,200 cheap proxy labels from each judge to shrink the 95% confidence interval from [4.2%, 11.8%] (human-only) to [6.5%, 12.5%], roughly 20% narrower.
Multi-PPI++'s efficiency depends on sample size and judge quality. To shrink the confidence interval further, you can invest in either more human annotations or better-aligned judges. Multi-PPI++ automatically makes the most of the available proxy labels.

Want to go further? The Multi-PPI++ scientific validation notebook provides rigorous empirical evidence — coverage validity and confidence interval width across a sweep of proxy correlations and confidence levels.
