Quickstart

GLIDE combines a small set of true evaluation labels (e.g. human annotations) with a large pool of proxy evaluation labels (e.g. LLM-as-judge annotations) to produce a bias-corrected metric with statistically valid confidence intervals.

Ingredient	Role
True labels	Ground truth — accurate but slow and expensive
Proxy labels	Proxy — cheap and fast but biased

import numpy as np

from glide.estimators.ppi import PPIMeanEstimator
from glide.io import to_json
from glide.samplers import UniformSampler
from glide.simulators import generate_binary_dataset, simulate_annotation

Step 1 — Assemble Your Data

GLIDE works with numpy arrays:

y_true — labeled ground truth values, with np.nan for unlabeled rows
y_proxy — proxy predictions for all rows

We generate all 1000 labels, then mask 900 of them to simulate the unobserved ground truths. The true rate is 15% and the proxy rate is 10% (biased low).

y_true_oracle, y_proxy = generate_binary_dataset(
    n_total=1000,
    true_mean=0.15,  # true rate
    proxy_mean=0.10,  # proxy rate (biased)
    correlation=0.70,
    random_seed=0,
)
xi = UniformSampler().sample(n_total=len(y_true_oracle), n_samples=100, random_seed=0)
y_true = simulate_annotation(y_true_oracle, xi)

# Count labeled vs unlabeled
n_labeled = np.sum(~np.isnan(y_true))
n_unlabeled = len(y_true) - n_labeled

print(f"Total observations: {len(y_true)}")
print(f"  Labeled  : {n_labeled}")
print(f"  Unlabeled: {n_unlabeled}")
print()
print("Sample values")
# With this seed, the last row is labeled and the first row is unlabeled.
print(f"  y_true (labeled)  : {y_true[-1]}")
print(f"  y_true (unlabeled): {y_true[0]}")
print(f"  y_proxy           : {y_proxy[0]}")

Total observations: 1000
  Labeled  : 100
  Unlabeled: 900

Sample values
  y_true (labeled)  : 0.0
  y_true (unlabeled): nan
  y_proxy           : 0.0

Step 2 — Estimate the Metric

PPIMeanEstimator.estimate() corrects the proxy's bias using the labeled subset, then applies the correction across the full dataset.
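To make the idea of bias correction concrete, here is a minimal NumPy sketch of the classical prediction-powered mean estimator: the proxy mean on unlabeled rows plus the average gap between true and proxy labels on labeled rows. The function name `ppi_mean` is hypothetical, and this is an illustration of the general technique rather than GLIDE's actual implementation, which may pool or weight the data differently.

```python
import numpy as np


def ppi_mean(y_true, y_proxy):
    """Classical prediction-powered mean estimate (illustrative sketch only)."""
    y_true = np.asarray(y_true, dtype=float)
    y_proxy = np.asarray(y_proxy, dtype=float)
    labeled = ~np.isnan(y_true)

    # Proxy bias measured on the rows that have a true label.
    rectifier = y_true[labeled] - y_proxy[labeled]
    # Proxy mean on the rows without a true label.
    proxy_unlabeled = y_proxy[~labeled]

    estimate = proxy_unlabeled.mean() + rectifier.mean()
    # The two terms use disjoint rows, so their variances add.
    std_error = np.sqrt(
        proxy_unlabeled.var(ddof=1) / proxy_unlabeled.size
        + rectifier.var(ddof=1) / rectifier.size
    )
    return estimate, std_error


# Toy data: two labeled rows, four unlabeled rows.
toy_true = [1.0, 0.0, np.nan, np.nan, np.nan, np.nan]
toy_proxy = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
toy_estimate, toy_se = ppi_mean(toy_true, toy_proxy)
print(f"estimate = {toy_estimate:.3f}, standard error = {toy_se:.3f}")
```

On the toy data the proxy misses one of the two labeled positives, so the correction term raises the estimate above the raw proxy mean.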

result = PPIMeanEstimator().estimate(
    y_true,
    y_proxy,
    metric_name="Evaluated metric",  # e.g. a hallucination rate
    confidence_level=0.95,
)

print(result.summary())

Metric: Evaluated metric
Point Estimate: 0.164
Confidence Interval (95%): [0.114, 0.215]
Estimator : PPIMeanEstimator
n_true: 100
n_proxy: 1000
Effective Sample Size: 195

The effective sample size is the number of true labels you would need to reach the same confidence using true labels alone. Here, 100 true labels combined with 1000 proxy labels are worth about 195 true labels.
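One common way to define an effective sample size, sketched below, starts from the fact that a labels-only mean over n labels has standard error sqrt(variance / n); solving for n at the observed standard error gives n = variance / SE². The helper name `effective_sample_size` and the input numbers are hypothetical, and GLIDE's own formula may differ.

```python
def effective_sample_size(true_label_variance, standard_error):
    """Number of true labels whose labels-only mean would have this standard error.

    Assumes the labels-only standard error is sqrt(variance / n).
    """
    return round(true_label_variance / standard_error**2)


# Hypothetical inputs: a balanced binary metric (variance 0.25) and a
# standard error of 0.05.
ess_example = effective_sample_size(0.25, 0.05)
print(ess_example)
```

Read in the other direction, an effective sample size of 195 with a standard error of about 0.0257 implies a label variance near 0.129, which is close to the 0.1275 variance of a 15% binary rate.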

The returned InferenceResult exposes the point estimate (mean), the standard error (std), and the confidence interval as attributes.

# Point estimate and confidence interval
print(f"Estimate : {result.mean:.3f}")
print(f"Standard error : {result.std:.3f}")
print(
    f"95% Confidence Interval : [{result.confidence_interval.lower_bound:.3f}, \
{result.confidence_interval.upper_bound:.3f}]"
)
print()

Estimate : 0.164
Standard error : 0.026
95% Confidence Interval : [0.114, 0.215]
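The reported interval is consistent with a normal-approximation interval, estimate ± z · standard error. The sketch below recomputes it with the standard library, using the unrounded mean and std of this run (the same values that appear in the JSON export). The helper name `normal_ci` is hypothetical, and whether GLIDE builds its interval exactly this way is an assumption.

```python
from statistics import NormalDist


def normal_ci(estimate, std_error, confidence_level=0.95):
    """Two-sided normal-approximation confidence interval."""
    # For a 95% interval this is the 97.5th percentile of the standard normal.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return estimate - z * std_error, estimate + z * std_error


lower, upper = normal_ci(0.16442366490012228, 0.02569495279555514)
print(f"95% Confidence Interval : [{lower:.3f}, {upper:.3f}]")
# prints: 95% Confidence Interval : [0.114, 0.215]
```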

Step 3 — Export as JSON

The inference results can be exported in JSON format using the to_json utility function.

print(to_json(result))

{
  "metric_name": "Evaluated metric",
  "estimator_name": "PPIMeanEstimator",
  "confidence_interval": {
    "confidence_level": 0.95,
    "lower_bound": 0.11406248283637743,
    "upper_bound": 0.21478484696386713,
    "width": 0.10072236412748971
  },
  "n_true": 100,
  "n_proxy": 1000,
  "effective_sample_size": 195,
  "mean": 0.16442366490012228,
  "std": 0.02569495279555514
}
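A downstream consumer can read the export with the standard `json` module. The sketch below parses the payload shown above, pasted in as a string literal, and checks that the stored width equals the distance between the bounds. It assumes only the field names visible in that output.

```python
import json
import math

# The JSON export from the run above, pasted verbatim.
exported = """
{
  "metric_name": "Evaluated metric",
  "estimator_name": "PPIMeanEstimator",
  "confidence_interval": {
    "confidence_level": 0.95,
    "lower_bound": 0.11406248283637743,
    "upper_bound": 0.21478484696386713,
    "width": 0.10072236412748971
  },
  "n_true": 100,
  "n_proxy": 1000,
  "effective_sample_size": 195,
  "mean": 0.16442366490012228,
  "std": 0.02569495279555514
}
"""

payload = json.loads(exported)
ci = payload["confidence_interval"]

# The stored width should match upper_bound - lower_bound up to rounding.
width_ok = math.isclose(ci["width"], ci["upper_bound"] - ci["lower_bound"], rel_tol=1e-9)
print(f"{payload['metric_name']}: {payload['mean']:.3f} (width consistent: {width_ok})")
```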

Step 4 — Check Whether You Reached the Target

InferenceResult also supports hypothesis tests on the estimate. For example, we can test whether the data are consistent with a target of keeping the rate at or below 10%.

z, p, _ = result.confidence_interval.test_null_hypothesis(
    h0_value=0.10,
    alternative="larger",
)
print("H0: rate <= 10%  |  H1: rate > 10%")
print(f"z = {z:.3f}  |  p = {p:.3f}")
print("Reject H0 — the rate is significantly above 10%." if p < 0.05 else "Cannot reject H0 at the 5% level.")

H0: rate <= 10%  |  H1: rate > 10%
z = 2.507  |  p = 0.006
Reject H0 — the rate is significantly above 10%.

The null hypothesis is rejected: the hallucination rate is significantly above the business tolerance of 10% at the 5% significance level. In a real deployment, this would mean the system is not production-ready.
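The z and p values above can be reproduced by hand with a one-sided z-test, z = (estimate − h0) / standard error, where the p-value is the upper tail of the standard normal distribution. The sketch below uses the unrounded mean and std from the JSON export. The helper name `one_sided_z_test` is hypothetical, and that `test_null_hypothesis` works exactly this way internally is an assumption.

```python
from statistics import NormalDist


def one_sided_z_test(estimate, std_error, h0_value):
    """Test H0: value <= h0_value against H1: value > h0_value."""
    z_stat = (estimate - h0_value) / std_error
    # Upper-tail probability of the standard normal distribution.
    p_value = 1.0 - NormalDist().cdf(z_stat)
    return z_stat, p_value


z_stat, p_value = one_sided_z_test(
    estimate=0.16442366490012228,
    std_error=0.02569495279555514,
    h0_value=0.10,
)
print(f"z = {z_stat:.3f}  |  p = {p_value:.3f}")
# prints: z = 2.507  |  p = 0.006
```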

Using Your Own Data

Replace generate_binary_dataset with your own numpy arrays. Use np.nan for unlabeled rows in y_true.

import json

import numpy as np
import pandas as pd
import polars as pl

# From a JSON file: a list of records, where a missing or null "y_true" means unlabeled
with open("annotations.json") as f:
    data = json.load(f)
y_true = np.array(
    [np.nan if d.get("y_true") is None else d["y_true"] for d in data],
    dtype=float,
)
y_proxy = np.array([d["y_proxy"] for d in data], dtype=float)

# From a pandas DataFrame
df = pd.read_csv("your_data.csv")
y_true = df["label"].values.astype(float)  # NaN for unlabeled
y_proxy = df["prediction"].values

# From a polars DataFrame
df = pl.read_csv("your_data.csv")
y_true = df["label"].to_numpy().astype(float)
y_proxy = df["prediction"].to_numpy()
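Before passing your own arrays to an estimator, it can help to check that they have the shape described in Step 1. The helper below is a hypothetical sketch, not part of GLIDE. It verifies that both arrays are one-dimensional and aligned, that every row has a proxy value, and that at least one row has a true label.

```python
import numpy as np


def check_inputs(y_true, y_proxy):
    """Validate a (y_true, y_proxy) pair and return float arrays plus the labeled count."""
    y_true = np.asarray(y_true, dtype=float)
    y_proxy = np.asarray(y_proxy, dtype=float)

    if y_true.ndim != 1 or y_true.shape != y_proxy.shape:
        raise ValueError("y_true and y_proxy must be 1-D arrays of equal length")
    if np.isnan(y_proxy).any():
        raise ValueError("y_proxy must have a value for every row")

    n_labeled = int((~np.isnan(y_true)).sum())
    if n_labeled == 0:
        raise ValueError("y_true must contain at least one labeled row")
    return y_true, y_proxy, n_labeled


# Three rows, the middle one unlabeled.
checked_true, checked_proxy, n_checked = check_inputs([1.0, np.nan, 0.0], [1, 0, 0])
print(f"{n_checked} labeled rows out of {len(checked_true)}")
```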
