Standard annotation budget: PPI++

PPI++ combines a small set of expensive true evaluation labels with a large pool of cheap proxy evaluation labels to produce a statistically valid, bias-corrected quality metric. The correction is necessary because proxy evaluation is generally carried out by LLM judges, which are known to disagree with one another and with human annotators. This guide walks through a realistic hallucination-detection scenario end-to-end.

What you will learn:

Why LLM-as-Judge metrics are systematically biased
How to use PPI++ to produce a bias-corrected metric
How to test that your metric fits your expectations

The problem: your LLM judge disagrees with your users

Let's assume you run a customer-facing agentic assistant handling thousands of conversations per day.

The signals

Roughly one user in ten reports incorrect or fabricated information, which management considers unacceptable.
You deploy an LLM judge to measure the hallucination rate. It reports a rate of 5%.

The users and the LLM judge disagree. You decide to dig deeper.

The manual investigation

You budget for 200 manual annotations — expensive but accurate ground truth. Annotators find that ~10% of conversations contain a blatant hallucination.

That is double what the LLM reports. The judge is systematically optimistic.

The challenge

You now have:

2,200 LLM judgements — cheap and fast, but biased
200 human annotations — accurate, but covering only a small portion of your data

PPI++ combines both to produce a reliable, unbiased estimate of the true hallucination rate across all conversations.

In [1]:

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

from glide.estimators import ClassicalMeanEstimator, PPIMeanEstimator
from glide.samplers import UniformSampler
from glide.simulators import generate_binary_dataset, simulate_annotation

# ── Colour palette ──────────────────────────────────────────
C_JUDGE = "#E74C3C"  # LLM judge  — red-orange
C_HUMAN = "#2E86AB"  # Human-only — blue
C_GLIDE = "#27AE60"  # GLIDE      — green
C_TRUTH = "#2C3E50"  # True value — dark slate

# ── Global plot style ───────────────────────────────────────
plt.rcParams.update(
    {
        "figure.facecolor": "white",
        "axes.facecolor": "#FAFAFA",
        "axes.grid": True,
        "grid.color": "#E5E5E5",
        "grid.linewidth": 0.8,
        "font.size": 18,
        "axes.labelsize": 18,
        "axes.titlesize": 18,
        "legend.fontsize": 16,
        "xtick.labelsize": 16,
        "ytick.labelsize": 16,
        "figure.titlesize": 19,
    }
)

Simulating thousands of conversations with a biased judge

generate_binary_dataset produces synthetic data that mirrors the scenario above, with ground-truth labels available for all samples.

The simulation replicates the practical workflow: proxy labels are generated for every conversation, a random subset is selected for human annotation, and the remaining ground-truth labels are masked with np.nan to reflect what would be unobserved in practice.

Generate synthetic data with generate_binary_dataset: proxy labels cover all samples.
Sample a subset for annotation with UniformSampler.
Annotate selected samples and mask the rest via simulate_annotation.

The table below summarizes the arrays produced and what each contains.

Array	Meaning
`y_true_oracle`	Ground-truth labels for all conversations (revealed only after annotation)
`y_proxy`	Proxy predictions for all rows
`xi`	Annotation indicator: 1 if annotated, 0 otherwise
`y_true`	Observed labels: ground-truth where annotated, `np.nan` elsewhere
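
As a rough illustration of the last two steps, the sketch below reproduces them with plain NumPy. It is not GLIDE's implementation: `uniform_indicator` and `mask_unannotated` are hypothetical helpers that only mimic the behaviour described above (a uniformly drawn annotation indicator, and `np.nan` wherever no annotation exists).

```python
import numpy as np


def uniform_indicator(n_total, n_samples, random_seed=None):
    """Hypothetical stand-in for uniform sampling: 1 = annotated, 0 = not."""
    rng = np.random.default_rng(random_seed)
    indicator = np.zeros(n_total, dtype=int)
    # Draw n_samples distinct positions, each equally likely
    indicator[rng.choice(n_total, size=n_samples, replace=False)] = 1
    return indicator


def mask_unannotated(y_oracle, indicator):
    """Keep ground truth where indicator == 1 and hide it (np.nan) elsewhere."""
    return np.where(indicator == 1, np.asarray(y_oracle, dtype=float), np.nan)


# Toy data: 10 conversations, 3 of them sent to human annotators
toy_oracle = np.array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0])
toy_indicator = uniform_indicator(n_total=10, n_samples=3, random_seed=0)
toy_observed = mask_unannotated(toy_oracle, toy_indicator)

print(toy_indicator.sum())            # number of annotated rows
print(np.isnan(toy_observed).sum())   # number of masked rows
```

Keeping the masked array the same length as the proxy array means the two stay aligned row by row, which is what lets an estimator pair each human label with the judge's label for the same conversation.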

In [2]:

N_TOTAL = 2200
N_LABELED = 200
RANDOM_SEED = 12

y_true_oracle, y_proxy = generate_binary_dataset(
    n_total=N_TOTAL,
    true_mean=0.10,
    proxy_mean=0.05,
    correlation=0.65,
    random_seed=RANDOM_SEED,
)
xi = UniformSampler().sample(n_total=N_TOTAL, n_samples=N_LABELED, random_seed=RANDOM_SEED)
y_true = simulate_annotation(y_true_oracle, xi)

In [3]:

labeled_mask = ~np.isnan(y_true)
n_labeled = np.sum(labeled_mask)
n_unlabeled = len(y_true) - n_labeled

print(f"Total conversations  : {len(y_true):,}")
print(f"  Manually annotated : {n_labeled:,}")
print(f"  LLM-judged only    : {n_unlabeled:,}")
print()
print("Sample values")
print(f"  y_true (labeled):    {y_true[labeled_mask][0]}")
print(f"  y_true (unlabeled):  {y_true[~labeled_mask][0]}")
print(f"  y_proxy (all):       {y_proxy[0]}")

Total conversations  : 2,200
  Manually annotated : 200
  LLM-judged only    : 2,000

Sample values
  y_true (labeled):    0.0
  y_true (unlabeled):  nan
  y_proxy (all):       0.0

Two naive strategies both fail

Two obvious approaches to estimating the true hallucination rate each have a fatal flaw:

Option A — Trust the judge on all conversations.
Precise (large sample), but the judge's systematic bias makes the estimate wrong.

Option B — Trust only the human annotations.
Unbiased, but the 95% confidence interval is very wide.

PPI++, introduced in the next section, fixes both problems simultaneously.
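
Both options boil down to a sample mean with a confidence interval. The sketch below shows one common way to compute it, a normal-approximation interval that skips `np.nan` entries. It is an assumption about what a classical mean estimator does, not a copy of GLIDE's code, and `mean_with_normal_ci` is a hypothetical name.

```python
import numpy as np
from statistics import NormalDist


def mean_with_normal_ci(values, confidence_level=0.95):
    """Sample mean and normal-approximation CI, ignoring np.nan entries."""
    observed = np.asarray(values, dtype=float)
    observed = observed[~np.isnan(observed)]
    n = observed.size
    mean = observed.mean()
    # Standard error of the mean from the sample standard deviation
    std_error = observed.std(ddof=1) / np.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return mean, mean - z * std_error, mean + z * std_error


# Toy check: 16 hallucinations among 200 annotated rows, 2,000 rows unlabeled
toy_labels = np.concatenate([np.ones(16), np.zeros(184), np.full(2000, np.nan)])
toy_mean, toy_lo, toy_hi = mean_with_normal_ci(toy_labels)
print(f"{toy_mean:.1%}  [{toy_lo:.1%}, {toy_hi:.1%}]")
```

The interval width scales with one over the square root of the number of observed labels, which is why 200 annotations give a much wider interval than 2,200 judge labels.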

In [4]:

# Option A: LLM judge — average proxy labels over all conversations
judge_estimate = ClassicalMeanEstimator().estimate(y_proxy)
judge_mean = judge_estimate.mean
judge_lower_bound = judge_estimate.confidence_interval.lower_bound
judge_upper_bound = judge_estimate.confidence_interval.upper_bound

# Option B: human labels only — average over labeled conversations
human_estimate = ClassicalMeanEstimator().estimate(y_true)
human_mean = human_estimate.mean
human_lower_bound = human_estimate.confidence_interval.lower_bound
human_upper_bound = human_estimate.confidence_interval.upper_bound

In [5]:

sep = "-" * 70
print(sep)
print(f"{'Method':<34} {'Estimate':>8}   {'95% confidence interval':>16}")
print(sep)
# Sample sizes are read from the data rather than hard-coded
judge_label = f"LLM Judge only  (n={len(y_proxy):,})"
human_label = f"Human labels only (n={n_labeled:,})"
print(f"{judge_label:<34} {judge_mean:>7.1%}   [{judge_lower_bound:.1%}, {judge_upper_bound:.1%}]")
print(f"{human_label:<34} {human_mean:>7.1%}   [{human_lower_bound:.1%}, {human_upper_bound:.1%}]")
print(sep)
print(f"{'True rate  (simulation)':<34} {'10.0%':>8}")

----------------------------------------------------------------------
Method                             Estimate   95% confidence interval
----------------------------------------------------------------------
LLM Judge only  (n=2,200)             4.9%   [4.0%, 5.8%]
Human labels only (n=200)             8.0%   [4.2%, 11.8%]
----------------------------------------------------------------------
True rate  (simulation)               10.0%

The root cause: the LLM judge is systematically biased

The gap is clear: on average, the judge reports fewer hallucinations than the human annotators do.

The PPI++ rectifier measures this systematic error on the labeled subset, then applies the correction across all conversations.
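
In symbols, and assuming the standard prediction-powered formulation, the correction can be written as follows. The notation is introduced here for illustration only: $y_i$ is a human label, $f_i$ the judge's label on the same annotated conversation, $\tilde{f}_j$ the judge's label on a conversation from the proxy pool, $n$ the number of annotated conversations, and $N$ the size of the proxy pool.

$$\hat{\Delta} = \frac{1}{n}\sum_{i=1}^{n}\left(f_i - y_i\right), \qquad \hat{\mu}_{\text{PPI}} = \frac{1}{N}\sum_{j=1}^{N}\tilde{f}_j - \hat{\Delta}$$

The rectifier $\hat{\Delta}$ is the quantity computed as `bias` in the cell that follows. Whether the proxy pool contains all conversations or only the unannotated ones is an implementation detail that may differ in GLIDE.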

In [6]:

labeled_mask = ~np.isnan(y_true)
y_true_filtered = y_true[labeled_mask]
y_proxy_labeled = y_proxy[labeled_mask]

p_mean = np.mean(y_proxy_labeled)
t_mean = np.mean(y_true_filtered)
bias = p_mean - t_mean

In [7]:

fig, ax = plt.subplots(figsize=(7, 4.5))

x_pos = np.array([0, 1])
bar_vals = [p_mean, t_mean]
bar_colors = [C_JUDGE, C_HUMAN]
bar_labels = ["LLM Judge\n (annotated subset)", "Human Annotation\n(ground truth)"]

ax.bar(x_pos, bar_vals, color=bar_colors, width=0.45, edgecolor="white", linewidth=2, zorder=3)

# Loop variable is named x (not xi) so the annotation indicator array xi is not overwritten
for x, val, c in zip(x_pos, bar_vals, bar_colors):
    ax.text(x, val + 0.005, f"{val:.1%}", ha="center", fontsize=14, fontweight="bold", color=c)

ax.annotate(
    "", xy=(1, t_mean + 0.010), xytext=(0, p_mean + 0.010), arrowprops=dict(arrowstyle="<->", color="#666666", lw=2.5)
)
ax.text(
    0.5,
    max(bar_vals) + 0.033,
    f"Bias = {bias:+.1%}",
    ha="center",
    fontsize=12,
    color="#555555",
    fontstyle="italic",
    bbox=dict(boxstyle="round,pad=0.35", facecolor="#FFFDE7", edgecolor="#CCCCCC"),
)

ax.axhline(0.10, color=C_TRUTH, linestyle="--", linewidth=1.8, zorder=2, label="True rate: 10%")

ax.set_xticks(x_pos)
ax.set_xticklabels(bar_labels)
ax.set_ylabel("Hallucination Rate")
ax.set_ylim(0, 0.26)
ax.set_xlim(-0.5, 1.5)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.spines[["top", "right"]].set_visible(False)
ax.legend(loc="upper right")
plt.tight_layout()
plt.show()

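[Figure: bar chart of the hallucination rate on the annotated subset, LLM judge vs. human annotation, with the bias annotated and the true rate (10%) shown as a dashed line.]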

PPI++ corrects biased labels by leveraging true ones

PPIMeanEstimator implements Prediction-Powered Inference (PPI++), combining:

n labeled samples (with human annotations and LLM judge labels)
N unlabeled samples (LLM judge labels only)

The more closely the judge agrees with human annotators, the more the proxy labels reduce the uncertainty of the final estimate, so a better-aligned judge translates directly into annotation cost savings.
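
To make the mechanics concrete, here is a minimal NumPy sketch of a PPI++ mean estimate with a data-driven tuning weight, following the formulas of the PPI++ paper for mean estimation. It is an illustration under stated assumptions, not GLIDE's code: `ppi_pp_mean` is a hypothetical function, the proxy average is taken over the unannotated rows only, and GLIDE may pool the data differently.

```python
import numpy as np
from statistics import NormalDist


def ppi_pp_mean(y_observed, y_judge, confidence_level=0.95):
    """Sketch of a power-tuned PPI++ mean estimate (np.nan marks unlabeled rows)."""
    labeled = ~np.isnan(y_observed)
    y = y_observed[labeled]        # human labels
    f_lab = y_judge[labeled]       # judge labels on annotated rows
    f_unl = y_judge[~labeled]      # judge labels on unannotated rows
    n, n_unl = len(y), len(f_unl)

    # Tuning weight: close to 0 for an uninformative judge, close to 1 for a good one
    var_f = np.var(y_judge, ddof=1)
    cov_yf = np.cov(y, f_lab)[0, 1]
    lam = cov_yf / ((1 + n / n_unl) * var_f) if var_f > 0 else 0.0

    # Human mean, shifted by the weighted gap between the two judge averages
    point = y.mean() + lam * (f_unl.mean() - f_lab.mean())

    # Variance: rectified labels on n rows plus the proxy average on n_unl rows
    variance = np.var(y - lam * f_lab, ddof=1) / n + lam**2 * np.var(f_unl, ddof=1) / n_unl
    half_width = NormalDist().inv_cdf(0.5 + confidence_level / 2) * np.sqrt(variance)
    return point, (point - half_width, point + half_width), lam


# Toy data: 10% true rate, a judge that misses about half of the hallucinations
rng = np.random.default_rng(0)
demo_truth = (rng.random(5000) < 0.10).astype(float)
demo_judge = demo_truth * (rng.random(5000) < 0.5)
demo_observed = demo_truth.copy()
demo_observed[500:] = np.nan   # only the first 500 rows are annotated

demo_point, demo_ci, demo_lam = ppi_pp_mean(demo_observed, demo_judge)
print(f"estimate={demo_point:.3f}, CI=[{demo_ci[0]:.3f}, {demo_ci[1]:.3f}], lambda={demo_lam:.2f}")
```

With the weight set to zero the estimate falls back to the human-only mean, so in this formulation a poor judge costs little beyond the noise in estimating the weight.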

In [8]:

estimator = PPIMeanEstimator()

ppi_result = estimator.estimate(
    y_true,
    y_proxy,
    metric_name="Hallucination Rate",
    confidence_level=0.95,
)

print(ppi_result.summary())

Metric: Hallucination Rate
Point Estimate: 0.087
Confidence Interval (95%): [0.058, 0.116]
Estimator : PPIMeanEstimator
n_true: 200
n_proxy: 2200
Effective Sample Size: 342

When judge and human labels are correlated, the difference $y_{\text{true}} - y_{\text{proxy}}$ has low variance, so the rectifier adds little noise. The effective sample size is the annotation budget that the human-only strategy would need to match the variance of PPI++, which makes it a direct measure of the variance reduction. Here it is 342, against 200 actual annotations.
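
The effective sample size can be sanity-checked by hand. Since the width of a normal-approximation interval scales with one over the square root of the sample size, the ratio of the two interval widths, squared, gives the equivalent annotation budget. The sketch below plugs in the rounded intervals printed in this guide (human-only [4.2%, 11.8%], PPI++ [5.8%, 11.6%]). `ess_from_intervals` is a hypothetical helper, and the way GLIDE computes its own figure is not shown here.

```python
def ess_from_intervals(n_labeled, classical_ci, ppi_ci):
    """Annotation budget a classical estimator would need to match the PPI++ width."""
    classical_width = classical_ci[1] - classical_ci[0]
    ppi_width = ppi_ci[1] - ppi_ci[0]
    # Variance scales as 1/n, so width scales as 1/sqrt(n)
    return n_labeled * (classical_width / ppi_width) ** 2


ess_estimate = ess_from_intervals(200, (0.042, 0.118), (0.058, 0.116))
print(round(ess_estimate))
```

The result lands close to the 342 reported in the summary, with the small gap explained by the rounding of the printed bounds.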

PPI++ delivers an unbiased estimate at low cost

The plot below compares point estimates and 95% confidence intervals for all three methods against the true hallucination rate (dashed line):

LLM judge: very narrow confidence interval, but wrong.
Human-only: unbiased, but the confidence interval is very wide.
PPI++: unbiased and narrow — the accuracy of human labels combined with the precision of many proxy judgements.

In [9]:

TRUE_RATE = 0.10

estimates = [
    (
        f"LLM Judge\n({ppi_result.n_proxy}  |  raw proxy)",
        judge_mean,
        judge_lower_bound,
        judge_upper_bound,
        C_JUDGE,
    ),
    (
        f"Human Annotation\n({ppi_result.n_true} |  small sample)",
        human_mean,
        human_lower_bound,
        human_upper_bound,
        C_HUMAN,
    ),
    (
        f"PPI++ (GLIDE)\n({ppi_result.n_true}  +  {ppi_result.n_proxy})\n(full data exploited)",
        ppi_result.mean,
        ppi_result.confidence_interval.lower_bound,
        ppi_result.confidence_interval.upper_bound,
        C_GLIDE,
    ),
]

fig, ax = plt.subplots(figsize=(11, 5.5))
y_pos = [2, 1, 0]

for y, (label, mean, lo, hi, color) in zip(y_pos, estimates):
    # Confidence interval line
    ax.plot([lo, hi], [y, y], color=color, linewidth=4, solid_capstyle="round", zorder=3)
    # Cap marks
    for xc in [lo, hi]:
        ax.plot([xc, xc], [y - 0.2, y + 0.2], color=color, linewidth=2.5, zorder=3)
    # Point estimate
    ax.scatter(mean, y, s=200, color=color, zorder=5, edgecolors="white", linewidths=2.5)
    # Value label above
    ax.text(mean, y + 0.34, f"{mean:.1%}", ha="center", va="bottom", fontsize=12, color=color, fontweight="bold")
    # Confidence interval bounds below
    ax.text(mean, y - 0.34, f"[{lo:.1%},  {hi:.1%}]", ha="center", va="top", fontsize=11, color="#888888")

# True value
ax.axvline(TRUE_RATE, color=C_TRUTH, linestyle="--", linewidth=2.5, zorder=4)
ax.text(TRUE_RATE + 0.004, 2.72, "True rate  10%", color=C_TRUTH, fontsize=10.5, fontweight="bold")

ax.set_yticks(y_pos)
ax.set_yticklabels([e[0] for e in estimates])
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.set_xlabel("Estimated Hallucination Rate")
ax.set_xlim(-0.01, 0.26)
ax.set_ylim(-0.8, 3.2)
ax.spines[["top", "right", "left"]].set_visible(False)
ax.tick_params(left=False)
plt.tight_layout()
plt.show()

Testing whether the hallucination rate is within acceptable limits

GLIDE's estimation output enables a formal hypothesis test: is the true hallucination rate significantly higher than 5%?

$$H_0 : \mu = 5\% \qquad H_1 : \mu > 5\%$$

In business terms: is there statistical evidence that the system exceeds the 5% rate reported by the judge, which would make it unsafe to deploy under a 5% tolerance?

Without PPI++, there are two options. The LLM judge estimate alone gives a misleadingly narrow confidence interval centered near 5%, so the null hypothesis would almost never be rejected. The human annotations alone give an unbiased but wide confidence interval, so the test has little power to detect a real excess.

PPI++ combines both sources: the estimate stays anchored on the human labels while the proxy labels shrink its variance, which gives the test more power at the same annotation budget.
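
For intuition, a one-sided z-test can be derived from any symmetric normal-approximation interval: recover the standard error from the interval half-width, then compare the standardized distance to the null value against the normal distribution. The sketch below uses only the Python standard library. `one_sided_z_test` is a hypothetical helper with made-up inputs, and GLIDE's `test_null_hypothesis` may be implemented differently.

```python
from statistics import NormalDist


def one_sided_z_test(mean, ci_lower, ci_upper, h0_value, confidence_level=0.95):
    """Test H0: mu = h0_value against H1: mu > h0_value from a symmetric normal CI."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    # Recover the standard error from the half-width of the interval
    std_error = (ci_upper - ci_lower) / (2 * z_crit)
    z_stat = (mean - h0_value) / std_error
    p_value = 1 - NormalDist().cdf(z_stat)
    return z_stat, p_value


# Hypothetical estimates with the same mean (9%) but different precision
z_narrow, p_narrow = one_sided_z_test(0.09, 0.06, 0.12, h0_value=0.05)
z_wide, p_wide = one_sided_z_test(0.09, 0.03, 0.15, h0_value=0.05)
print(f"narrow interval: z = {z_narrow:.2f}, p = {p_narrow:.4f}")
print(f"wide interval  : z = {z_wide:.2f}, p = {p_wide:.4f}")
```

The two calls share the same point estimate; only the interval width changes, and with it the decision at the 5% level. That is the effect variance reduction has on a test.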

In [10]:

z_stat, p_value, _ = ppi_result.confidence_interval.test_null_hypothesis(
    h0_value=0.05,  # LLM judge's claimed rate
    alternative="larger",  # H1: true rate > 5%
)

sep = "-" * 48
print("Hypothesis test — PPI++ Estimator")
print(sep)
print("H0 : hallucination rate = 5%   (LLM says so)")
print("H1 : hallucination rate > 5%   (users complain!)")
print()
print(f"z-statistic : {z_stat:.2f}")
print(f"p-value     : {p_value:.10f}")
print()
if p_value < 0.05:
    print("Decision  : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision  : Given the p-value is higher than our threshold (p-value >= 0.05), "
        " we cannot reject H0 at the 5% level."
    )

Hypothesis test — PPI++ Estimator
------------------------------------------------
H0 : hallucination rate = 5%   (LLM says so)
H1 : hallucination rate > 5%   (users complain!)

z-statistic : 2.53
p-value     : 0.0056770244

Decision  : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.
=> The true hallucination rate is significantly above 5%.

Notice that the null hypothesis is rejected, signalling that the hallucination rate is significantly above the 5% threshold.

Let us try the same hypothesis test using human annotations only.

In [11]:

human_labels_ci = human_estimate.confidence_interval
z_stat_cm, p_value_cm, _ = human_labels_ci.test_null_hypothesis(
    h0_value=0.05,  # LLM judge's claimed rate
    alternative="larger",  # H1: true rate > 5%
)

sep = "-" * 48
print("Hypothesis test — Classical Mean Estimator (Human labels only)")
print(sep)
print("H0 : hallucination rate = 5%   (LLM says so)")
print("H1 : hallucination rate > 5%   (users complain!)")
print()
print(f"z-statistic : {z_stat_cm:.2f}")
print(f"p-value     : {p_value_cm:.10f}")
print()
if p_value_cm < 0.05:
    print("Decision  : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision  : Given the p-value is higher than our threshold (p-value >= 0.05), "
        " we cannot reject H0 at the 5% level."
    )

Hypothesis test — Classical Mean Estimator (Human labels only)
------------------------------------------------
H0 : hallucination rate = 5%   (LLM says so)
H1 : hallucination rate > 5%   (users complain!)

z-statistic : 1.56
p-value     : 0.0593866096

Decision  : Given the p-value is higher than our threshold (p-value >= 0.05),  we cannot reject H0 at the 5% level.

With human labels alone, the null hypothesis is not rejected: the uncertainty is too high (see the figure above), so the same conclusion cannot be drawn. PPI++ resolves this by leveraging the LLM judge labels to reduce uncertainty.

Summary: PPI++ combines accuracy and precision

	LLM Judge	Human-only	PPI++
Sample size	2,200	200	200 + 2,200
Unbiased estimate	❌	✅	✅
Narrow confidence interval	🟠 (misleading)	❌	✅
Labeling cost	Low	High	Small

Key takeaways:

LLM judges are biased. A narrow confidence interval around the wrong value is worse than useless: it gives false confidence.
200 human annotations can be enough. The rectifier uses information from 2,200 cheap proxy labels to shrink the 95% confidence interval from [4.2%, 11.8%] with human labels only to [5.8%, 11.6%] with PPI++.
PPI++ efficiency depends on sample size and LLM judge quality. To shrink the confidence interval further, invest in more human annotations or in a better-aligned LLM judge.
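
The last takeaway can be quantified with a back-of-the-envelope formula. Assuming an optimally tuned PPI++ mean estimate, a Pearson correlation $\rho$ between human and judge labels, $n$ annotated and $N$ unannotated conversations, the variance shrinks by a factor $1 - \rho^2 / (1 + n/N)$ relative to the human-only estimate. The sketch below turns that into an effective sample size. `theoretical_ess` is a hypothetical helper, and the formula ignores the noise in estimating the tuning weight.

```python
def theoretical_ess(n_labeled, n_unlabeled, correlation):
    """Effective sample size of an optimally tuned PPI++ mean estimate."""
    shrink = 1 - correlation**2 / (1 + n_labeled / n_unlabeled)
    return n_labeled / shrink


# Same budget as this guide (200 annotated, 2,000 unannotated), varying judge quality
for rho in (0.3, 0.65, 0.9):
    print(f"rho={rho:.2f}: ESS ~ {theoretical_ess(200, 2000, rho):.0f}")
```

For the setup used in this guide (correlation 0.65) the formula gives roughly 325, the same order as the 342 reported by the estimator on this particular sample. A judge with no correlation leaves the budget unchanged, while a highly correlated one multiplies it several times over.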

Want to go further? The PPI scientific validation notebook provides rigorous empirical evidence: coverage validity and confidence interval width across a sweep of proxy correlations and confidence levels.
