Case Study: Text-to-SQL Accuracy Estimation on Spider 1.0¶
Context. GLIDE combines a small set of ground truth annotations with a large pool of proxy labels to produce confidence intervals that are valid and narrower than purely classical intervals built from labeled data alone.
Dataset. Spider 1.0 is a large-scale text-to-SQL benchmark covering 200 relational databases across diverse domains. Each example pairs a natural language question with a gold SQL query. This case study uses a curated subset of 10 databases selected for their coverage and example count.
Proxy. We ran an LLM judge on all examples: given the question, database schema, and predicted SQL, the judge decides whether the translation is correct. The judge is not perfectly calibrated, and its reliability varies across databases, making each database a natural stratum.
Workflow. We benchmark three protocols: a classical baseline applied to all proxy labels (Proxy only), a stratified classical estimator applied to ground truth labels (True only), and Stratified PPI++, which combines ground truth labels with the full proxy signal to produce a valid, debiased confidence interval.
Setup. We simulate label scarcity by masking all but N_LABELED ground truth annotations. Each Monte Carlo seed draws a fresh sampling mask. Coverage is measured against the true accuracy computed over all examples.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datasets import DownloadConfig, load_dataset
from glide.estimators import (
StratifiedClassicalMeanEstimator,
StratifiedPPIMeanEstimator,
)
from glide.samplers import StratifiedSampler
from glide.scientific_validation import compute_hits, coverage_with_error_bar, run_monte_carlo
from glide.simulators import simulate_annotation
plt.rcParams.update(
{
"figure.facecolor": "white",
"axes.facecolor": "#FAFAFA",
"axes.grid": True,
"grid.color": "#E5E5E5",
"grid.linewidth": 0.8,
"font.size": 18,
"axes.labelsize": 18,
"axes.titlesize": 18,
"legend.fontsize": 16,
"xtick.labelsize": 16,
"ytick.labelsize": 16,
"figure.titlesize": 19,
}
)
/home/docs/checkouts/readthedocs.org/user_builds/glide-py/checkouts/latest/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
Experiment Parameters¶
CONFIDENCE_LEVEL: target coverage rate for all confidence intervals.N_LABELED: annotation budget, i.e., the number of ground truth labels revealed per draw.N_SEEDS: number of Monte Carlo draws used to estimate empirical coverage and confidence interval widths.
WORKFLOW_ESTIMATORS maps workflow names to their estimator classes. COLORS assigns a consistent color to each workflow across all figures.
RANDOM_SEED = 1
N_LABELED = 350
CONFIDENCE_LEVEL = 0.95
N_SEEDS = 1000
HF_REPO = "Glide-py/spider-text-to-sql"
dataset = load_dataset(HF_REPO, download_config=DownloadConfig(max_retries=100))
dataset = dataset["train"]
dataset = dataset.to_pandas()
assert isinstance(dataset, pd.DataFrame)
y_true_oracle = np.array(dataset["ground_truth_label"], dtype=float)
y_proxy = np.array(dataset["llm_judge_label"], dtype=float)
groups = np.array(dataset["db_id"])
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1180 examples [00:00, 170459.06 examples/s]
Data¶
The dataset is loaded from Hugging Face and contains one row per Spider training example from the selected databases. The columns relevant to this analysis are:
question: natural language question.db_id: database identifier, which defines the strata forStratifiedSampler.ground_truth_label: binary label (1 = correct SQL, 0 = incorrect) derived from comparing the predicted query against the gold reference using Spider's evaluation criteria.llm_judge_label: binary proxy label produced by an LLM judge given the question, schema, and predicted SQL.
The table below shows a few representative rows.
dataset.head(5)
| example_id | db_id | question | gold_sql | predicted_sql | llm_judge_label | ground_truth_label | llm_judge_reasoning | ground_truth_reasoning | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | spider_train_1322 | college_2 | Find the buildings which have rooms with capac... | SELECT DISTINCT building FROM classroom WHERE ... | SELECT DISTINCT building\nFROM classroom\nWHER... | 1 | 1 | The query uses the correct table (`classroom`)... | |
| 1 | spider_train_1323 | college_2 | What are the distinct buildings with capacitie... | SELECT DISTINCT building FROM classroom WHERE ... | SELECT DISTINCT building\nFROM classroom\nWHER... | 1 | 1 | The query uses the correct table (`classroom`)... | |
| 2 | spider_train_1324 | college_2 | Count the number of rooms that are not in the ... | SELECT count(*) FROM classroom WHERE building ... | SELECT COUNT(*) AS room_count\nFROM classroom\... | 1 | 1 | The query uses the correct table (`classroom`)... | |
| 3 | spider_train_1325 | college_2 | How many classrooms are not in Lamberton? | SELECT count(*) FROM classroom WHERE building ... | SELECT COUNT(*)\nFROM classroom\nWHERE buildin... | 1 | 1 | The query counts rows in the classroom table w... | |
| 4 | spider_train_1326 | college_2 | What is the name and building of the departmen... | SELECT dept_name , building FROM department W... | SELECT dept_name, building\nFROM department\nW... | 1 | 1 | The query uses the correct table (`department`... |
true_mean = np.mean(y_true_oracle)
WORKFLOW_ESTIMATORS = {
"Proxy only": StratifiedClassicalMeanEstimator,
"True only": StratifiedClassicalMeanEstimator,
"Stratified PPI++": StratifiedPPIMeanEstimator,
}
COLORS = {
"Proxy only": "red",
"True only": "steelblue",
"Stratified PPI++": "purple",
}
Stratified Sampling¶
Annotating every example with a human reviewer would be expensive. We draw a fixed budget of N_LABELED labels using stratified sampling, with each database forming one stratum.
Stratifying by database is well-motivated here: databases differ in schema complexity, domain vocabulary, and SQL query difficulty, so the judge's calibration (and thus the variance of its errors) varies meaningfully across them. Concentrating the annotation budget on higher-variance strata via Neyman allocation yields more efficient estimates than spreading it uniformly.
xi = StratifiedSampler().sample(y_proxy, groups, n_samples=N_LABELED, random_seed=RANDOM_SEED)
y_true = simulate_annotation(y_true_oracle, xi)
print(f"Annotated: {int(xi.sum())} / {len(xi)} examples")
Annotated: 350 / 1180 examples
result_proxy_only = StratifiedClassicalMeanEstimator().estimate(
y_proxy,
groups,
confidence_level=CONFIDENCE_LEVEL,
)
result_true_only = StratifiedClassicalMeanEstimator().estimate(
y_true,
groups,
confidence_level=CONFIDENCE_LEVEL,
)
result_stratified_ppi = StratifiedPPIMeanEstimator().estimate(
y_true,
y_proxy,
groups,
confidence_level=CONFIDENCE_LEVEL,
)
print(result_stratified_ppi.summary())
Metric: Metric Point Estimate: 0.879 Confidence Interval (95%): [0.849, 0.909] Estimator : StratifiedPPIMeanEstimator n_true: 350 n_proxy: 1180 Effective Sample Size: 412
estimates = [
(
f"Proxy only\n({len(y_proxy)} examples)",
result_proxy_only.mean,
result_proxy_only.confidence_interval.lower_bound,
result_proxy_only.confidence_interval.upper_bound,
COLORS["Proxy only"],
),
(
f"True only\n({int(xi.sum())} examples)",
result_true_only.mean,
result_true_only.confidence_interval.lower_bound,
result_true_only.confidence_interval.upper_bound,
COLORS["True only"],
),
(
f"Stratified PPI++\n({int(xi.sum())} labeled + {len(y_proxy)} proxy)",
result_stratified_ppi.mean,
result_stratified_ppi.confidence_interval.lower_bound,
result_stratified_ppi.confidence_interval.upper_bound,
COLORS["Stratified PPI++"],
),
]
fig, ax = plt.subplots(figsize=(11, 5.5))
y_pos = [2, 1, 0]
for y, (label, mean, lo, hi, color) in zip(y_pos, estimates):
ax.plot([lo, hi], [y, y], color=color, linewidth=4, solid_capstyle="round", zorder=3)
for xc in [lo, hi]:
ax.plot([xc, xc], [y - 0.2, y + 0.2], color=color, linewidth=2.5, zorder=3)
ax.scatter(mean, y, s=200, color=color, zorder=5, edgecolors="white", linewidths=2.5)
ax.text(mean, y + 0.34, f"{mean:.1%}", ha="center", va="bottom", fontsize=12, color=color, fontweight="bold")
ax.text(mean, y - 0.34, f"[{lo:.1%}, {hi:.1%}]", ha="center", va="top", fontsize=11, color="#888888")
ax.axvline(true_mean, color="black", linestyle="--", linewidth=2.5, zorder=4)
ax.text(true_mean + 0.004, 2.72, f"True accuracy {true_mean:.1%}", color="black", fontsize=10.5, fontweight="bold")
ax.set_yticks(y_pos)
ax.set_yticklabels([e[0] for e in estimates])
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.set_xlabel("SQL Accuracy")
ax.set_xlim(None, None)
ax.set_ylim(-0.8, 3.2)
ax.spines[["top", "right", "left"]].set_visible(False)
ax.tick_params(left=False)
plt.tight_layout()
plt.show()
The forest plot illustrates three estimates of text-to-SQL accuracy on Spider 1.0.
Proxy only. The proxy-only estimate uses all proxy labels without bias correction. The gap between the proxy estimate and the true accuracy (dashed line) reveals the direction and magnitude of the judge's systematic error on this model.
True only. The stratified classical estimate from the annotated subset is unbiased: it centers near the true accuracy. Its confidence interval is wide, however, because the annotated subset is a small fraction of the full dataset.
Stratified PPI++. The stratified PPI++ estimate is also unbiased by construction but achieves a narrower confidence interval by leveraging all proxy labels. The width reduction reflects the effective sample size gain from combining ground truth annotations with the proxy signal.
Coverage Validity¶
A confidence interval is valid if it contains the true parameter at the stated rate. A 95% interval is valid when, across many independent draws of the sampling mask, roughly 95% of the resulting intervals contain the true mean.
We run a Monte Carlo experiment sweeping the confidence level from 0.55 to 0.95 and check that each workflow's empirical coverage tracks the diagonal. The Proxy only workflow is expected to fall well below because its point estimate is biased. See the Scientific Validation Methodology page for more details.
def preprocess_data(y_true_oracle, y_proxy, groups, workflow, seed):
if workflow == "Proxy only":
return y_proxy, groups
elif workflow == "True only":
xi = StratifiedSampler().sample(y_proxy, groups, n_samples=N_LABELED, random_seed=seed)
y_true = simulate_annotation(y_true_oracle, xi)
return y_true, groups
elif workflow == "Stratified PPI++":
xi = StratifiedSampler().sample(y_proxy, groups, n_samples=N_LABELED, random_seed=seed)
y_true = simulate_annotation(y_true_oracle, xi)
return y_true, y_proxy, groups
else:
raise ValueError(f"Unknown workflow: {workflow}")
def generate_estimates(y_true_oracle, y_proxy, groups, seed):
estimates = {}
for workflow, EstimatorClass in WORKFLOW_ESTIMATORS.items():
estimator = EstimatorClass()
data = preprocess_data(y_true_oracle, y_proxy, groups, workflow, seed)
result = estimator.estimate(*data, confidence_level=CONFIDENCE_LEVEL)
estimates[workflow] = {
"mean": result.mean,
"std": result.std,
"confidence_interval": result.confidence_interval,
"effective_sample_size": getattr(result, "effective_sample_size", None),
}
return estimates
confidence_levels = np.round(np.arange(0.55, 1.00, 0.05), 2)
raw_stats = run_monte_carlo(
confidence_levels,
run_seed=lambda seed: generate_estimates(y_true_oracle, y_proxy, groups, seed),
n_seeds=N_SEEDS,
)
coverages_by_confidence_level = {}
for cl in confidence_levels:
hits = compute_hits(raw_stats, cl, true_mean)
coverages_by_confidence_level[cl] = {
workflow: coverage_with_error_bar(hits[workflow], confidence_level=CONFIDENCE_LEVEL)
for workflow in WORKFLOW_ESTIMATORS
}
fig, ax = plt.subplots(figsize=(7, 6))
ax.plot(confidence_levels, confidence_levels, color="black", lw=1.5, linestyle="--", label="Ideal")
for workflow in WORKFLOW_ESTIMATORS:
mean_ci = np.array([coverages_by_confidence_level[cl][workflow] for cl in confidence_levels])
mean = mean_ci[:, 0]
lo = mean_ci[:, 1]
hi = mean_ci[:, 2]
ax.plot(confidence_levels, mean, marker="o", color=COLORS[workflow], label=workflow)
ax.fill_between(confidence_levels, lo, hi, alpha=0.15, color=COLORS[workflow])
ax.set_xlabel("Target confidence level")
ax.set_ylabel("Observed coverage")
ax.set_xlim(0.5, 1.0)
ax.set_ylim(0.5, 1.0)
ax.legend()
plt.tight_layout()
plt.show()
Both True only and Stratified PPI++ achieve valid coverage, tracking the diagonal across all tested confidence levels. Proxy only falls below because its point estimate is biased by the judge's systematic errors, causing intervals centered at the wrong value to miss the true mean more often than the nominal rate allows.
Confidence Interval Width¶
Both True only and Stratified PPI++ use the same N_LABELED annotation budget, so differences in interval width reflect the information gain from incorporating the full proxy signal. We report the mean width and a percentile band across Monte Carlo seeds.
width_by_confidence_level = {}
for cl in confidence_levels:
width_by_confidence_level[cl] = {}
for workflow in WORKFLOW_ESTIMATORS:
lower = raw_stats[workflow]["lower_bounds"][cl]
upper = raw_stats[workflow]["upper_bounds"][cl]
width_by_confidence_level[cl][workflow] = upper - lower
lower_percentile = round(((1 - CONFIDENCE_LEVEL) / 2) * 100)
upper_percentile = 100 - lower_percentile
fig, ax = plt.subplots(figsize=(9, 5))
for workflow in WORKFLOW_ESTIMATORS:
means_w = [np.mean(width_by_confidence_level[cl][workflow]) for cl in confidence_levels]
q_lower = [np.percentile(width_by_confidence_level[cl][workflow], lower_percentile) for cl in confidence_levels]
q_upper = [np.percentile(width_by_confidence_level[cl][workflow], upper_percentile) for cl in confidence_levels]
ax.plot(confidence_levels, means_w, marker="o", color=COLORS[workflow], label=workflow)
ax.fill_between(confidence_levels, q_lower, q_upper, alpha=0.15, color=COLORS[workflow])
ax.set_xlabel("Confidence level")
ax.set_ylabel("Confidence interval width")
ax.set_xlim(0.5, 1.0)
ax.legend()
plt.tight_layout()
plt.show()
Stratified PPI++ achieves narrower confidence intervals than True only at the same annotation budget, driven by the information extracted from the full set of proxy labels. The Proxy only intervals appear narrow but are misleading given that they do not achieve valid coverage.
Summary¶
This notebook ran an end-to-end case study of GLIDE's estimation workflows on Spider 1.0, a text-to-SQL benchmark, using an LLM judge as proxy. The key findings are:
| Property | Result |
|---|---|
| Coverage validity | True only and Stratified PPI++ achieve nominal coverage; Proxy only fails as expected |
| Efficiency | Stratified PPI++ produces narrower intervals than True only at the same annotation budget, driven by the proxy's correlation with ground truth labels |
Validity. Despite proxy bias, Stratified PPI++ maintains valid coverage thanks to its debiasing mechanism. Proxy only, by contrast, is systematically miscovered: its intervals are centered at the wrong value and miss the true accuracy far more often than the nominal rate allows.
Efficiency. Stratified PPI++ leverages the full proxy signal to extract information beyond the N_LABELED labeled samples, translating to narrower confidence intervals at no additional annotation cost.
Practical takeaway. A practitioner can achieve more precise text-to-SQL accuracy estimates by combining a small set of ground truth annotations with LLM-judge verdicts. The key requirement is a valid debiasing step: raw proxy labels alone should never substitute for ground truth, but they can substantially sharpen estimates when processed through a debiasing estimator.
Annex: Dataset and Pipeline¶
The dataset is published on Hugging Face at Glide-py/spider-text-to-sql. The repository includes scripts and a README.md with full reproduction instructions.
The pipeline runs in four steps:
- Download : fetch Spider 1.0 and filter to the 10 databases with the most examples.
- Generate : produce SQL predictions for each question using a language model.
- Ground truth : derive reference labels by comparing each predicted query against the gold SQL from the Spider benchmark.
- LLM judge : assign proxy labels by asking another language model to assess each prediction.