Estimators

GLIDE provides several estimators that each combine proxy labels with a small human-annotated subset to produce an unbiased mean estimate and a confidence interval. The right choice depends on how the labeled subset was collected.

What a good estimator looks like

A good estimator \(\hat{\theta}\) of \(\theta^*\) must satisfy two criteria.

Properties

No bias: the estimate should be correct in expectation:

\[E[\hat{\theta}] = \theta^*\]

Small and statistically valid uncertainty: the true value \(\theta^*\) should fall within a confidence interval \(C_\alpha\) at risk level \(\alpha\):

\[\Pr(\theta^* \in C_\alpha) \geq 1 - \alpha\]

Moreover, \(C_\alpha\) should be as small as possible.

Input data

All estimators in GLIDE rely on two complementary sources of labels. Proxy labels \(\tilde{Y}_i\) are available for \(N\) samples at low cost but are biased (\(E[\tilde{Y}] \neq \theta^*\)). Human labels \(Y_j\) are unbiased (\(E[Y] = \theta^*\)) but expensive, and only available for a small labeled set of \(n \ll N\) samples. The key insight: even though human labels are scarce, they can be used to correct the bias in the cheap proxy labels.

Prediction-Powered Inference (PPI++)

PPI assumes that the labeled subset is drawn uniformly at random from the population. Under this assumption, it constructs an unbiased estimator by combining all available proxy labels with a small set of ground-truth annotations, correcting for the bias of the proxy at minimal cost.

In PPI, each sample has two associated values:

Value	Present for	Description
\(\tilde{Y}_i\)	All \(n+N\) samples	Proxy label
\(Y_j\)	Labeled samples only (\(n \ll N\))	Ground-truth label

Mean estimation

PPI++ [2] is an extension of the original PPI [1] that introduces a weight \(\lambda \in [0, 1]\) on the proxy labels. Denoting \(\tilde{Y}^{\bullet}\) and \(\tilde{Y}^{\circ}\) the labeled and unlabeled proxies respectively, the mean estimate is:

\[\hat{\theta}_{\lambda} = \frac{1}{n} \sum_{j=1}^{n} Y_j + \lambda \left[\frac{1}{N} \sum_{i=1}^{N} \tilde{Y}_i^{\circ} - \frac{1}{n} \sum_{j=1}^{n} \tilde{Y}_j^{\bullet}\right]\]

This combines two components:

The human-label mean \(\frac{1}{n}\sum_{j} Y_j\), which is unbiased but high-variance due to the small labeled set.
A variance reduction term that uses all \(N\) proxy labels to reduce variance, scaled by \(\lambda\) to control how much weight the proxy receives.

At \(\lambda = 1\), this recovers the original PPI estimator, which can equivalently be written as:

\[\hat{\theta} = \underbrace{\frac{1}{N} \sum_{i=1}^{N} \tilde{Y}_i^{\circ}}_{\text{Biased estimate}} + \underbrace{\frac{1}{n} \sum_{j=1}^{n} \left(Y_j - \tilde{Y}_j^{\bullet}\right)}_{\text{Bias rectifier}}\]

The parameter \(\lambda\) modulates the contribution of the proxy labels according to how informative they are. The power-tuning section below shows how to set it to an optimal value.
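As a rough illustration of the mean estimate above, the following Python sketch computes \(\hat{\theta}_{\lambda}\) using only the standard library. The function name `ppi_pp_mean` and the toy values are hypothetical and are not part of GLIDE's API.

```python
from statistics import fmean


def ppi_pp_mean(y_labeled, proxy_labeled, proxy_unlabeled, lam=1.0):
    """Human-label mean plus lam * (unlabeled proxy mean - labeled proxy mean)."""
    if len(y_labeled) != len(proxy_labeled):
        raise ValueError("each labeled sample needs a human label and a proxy label")
    correction = fmean(proxy_unlabeled) - fmean(proxy_labeled)
    return fmean(y_labeled) + lam * correction


# Toy values, made up for illustration only.
y = [1.0, 0.0, 1.0, 1.0]                   # human labels (n = 4)
proxy_l = [0.9, 0.2, 0.7, 0.6]             # proxy labels on the labeled samples
proxy_u = [0.8, 0.4, 0.6, 0.5, 0.7, 0.9]   # proxy labels on the unlabeled samples (N = 6)

human_only = ppi_pp_mean(y, proxy_l, proxy_u, lam=0.0)    # classical human-only mean
classic_ppi = ppi_pp_mean(y, proxy_l, proxy_u, lam=1.0)   # original PPI

# Equivalent "biased estimate + bias rectifier" form at lam = 1.
rectifier_form = fmean(proxy_u) + fmean([a - b for a, b in zip(y, proxy_l)])
```

On these toy values the human-only mean is 0.75 and the \(\lambda = 1\) estimate is 0.80, and both ways of writing the \(\lambda = 1\) estimator agree.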

Variance and confidence intervals

For large enough sample sizes (typically \(n \geq 50\)), the Central Limit Theorem applies and the variance of the PPI++ estimator decomposes as:

\[\sigma^2_{\hat{\theta}}(\lambda) = \underbrace{\frac{\sigma^2_{Y - \lambda\tilde{Y}}}{n}}_{\text{Labeled residual variance}} + \underbrace{\frac{\lambda^2\,\sigma^2_{\tilde{Y}}}{N}}_{\text{Unlabeled proxy variance}}\]

The first term shrinks both as \(n\) grows and as the proxy aligns better with human annotations.
The second term shrinks as \(N\) grows and is usually negligible in practice since \(N \gg n\).

This gives a confidence interval at level \(1 - \alpha\):

\[\Pr\!\left(\theta^* \in \left[\hat{\theta}_{\lambda} - z_{1-\alpha/2}\, \sigma_{\hat{\theta}}(\lambda),\; \hat{\theta}_{\lambda} + z_{1-\alpha/2}\, \sigma_{\hat{\theta}}(\lambda)\right]\right) \geq 1 - \alpha\]

where \(z_{1-\alpha/2}\) is the standard normal quantile (e.g. \(z_{0.975} = 1.96\) for a \(95\%\) two-sided confidence interval).
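The variance decomposition and the resulting interval can be sketched as follows for a fixed \(\lambda\). This is a minimal illustration, not GLIDE's implementation: the name `ppi_pp_interval` is hypothetical, and estimating \(\sigma^2_{\tilde{Y}}\) on the unlabeled proxies is an assumption of this sketch. The toy sample is far too small for the CLT to be trusted; it only shows the arithmetic.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_pp_interval(y_labeled, proxy_labeled, proxy_unlabeled, lam=1.0, alpha=0.05):
    """Return (estimate, standard error, (lower, upper)) for a fixed lam."""
    n, N = len(y_labeled), len(proxy_unlabeled)
    theta = fmean(y_labeled) + lam * (fmean(proxy_unlabeled) - fmean(proxy_labeled))
    # Labeled residual variance: Var(Y - lam * proxy) / n.
    residuals = [y - lam * p for y, p in zip(y_labeled, proxy_labeled)]
    # Assumption: the proxy variance is estimated on the unlabeled samples.
    var_hat = variance(residuals) / n + lam**2 * variance(proxy_unlabeled) / N
    se = sqrt(var_hat)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    return theta, se, (theta - z * se, theta + z * se)


# Toy values, made up for illustration only.
y = [1.0, 0.0, 1.0, 1.0]
proxy_l = [0.9, 0.2, 0.7, 0.6]
proxy_u = [0.8, 0.4, 0.6, 0.5, 0.7, 0.9]

theta, se, (lower, upper) = ppi_pp_interval(y, proxy_l, proxy_u, lam=1.0)
theta0, se0, ci0 = ppi_pp_interval(y, proxy_l, proxy_u, lam=0.0)  # human-only baseline
```

Comparing `se` with `se0` shows whether the proxy narrowed the interval relative to the human-only baseline on a given dataset.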

Power-tuning

The \(\lambda\) parameter must be chosen with care. If left at \(\lambda = 1\), low-quality proxy labels that covary weakly or negatively with the human labels can degrade the estimate, producing wider confidence intervals than using human labels alone (\(\lambda = 0\)). PPI++ derives a closed-form plug-in estimator for the \(\lambda\) that minimises the CI width:

\[\hat{\lambda} = \frac{\widehat{\text{Cov}}_n(Y,\, \tilde{Y}^{\bullet})}{\left(1 + \tfrac{n}{N}\right)\widehat{\text{Var}}_{n+N}(\tilde{Y})}\]

where:

\(\widehat{\text{Cov}}_n\) is the sample covariance computed on the \(n\) labeled samples only,
\(\widehat{\text{Var}}_{n+N}\) is the sample variance computed on all \(n + N\) proxy values pooled,
\(n\) and \(N\) are the numbers of labeled and unlabeled samples respectively.

When the proxy is informative (high covariance with human labels), \(\hat{\lambda}\) is close to 1 and the CI is at least as narrow as that of standard PPI; when the proxy is uninformative, \(\hat{\lambda}\) shrinks toward 0, down-weighting it and falling back to the classical human-only mean estimate. It is standard to use the optimal \(\hat{\lambda}\): asymptotically, the resulting estimate has a variance no larger than that of the classical estimate.
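The plug-in formula for \(\hat{\lambda}\) is short enough to sketch directly. The names below are hypothetical, and clipping the result to \([0, 1]\) is an assumption made here to match the range \(\lambda \in [0, 1]\) stated earlier; pass `clip=False` to see the raw ratio.

```python
from statistics import fmean, variance


def sample_cov(xs, ys):
    """Sample covariance with denominator len(xs) - 1."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def ppi_pp_lambda(y_labeled, proxy_labeled, proxy_unlabeled, clip=True):
    n, N = len(y_labeled), len(proxy_unlabeled)
    cov = sample_cov(y_labeled, proxy_labeled)                       # n labeled samples only
    var_all = variance(list(proxy_labeled) + list(proxy_unlabeled))  # all n + N proxies pooled
    lam = cov / ((1 + n / N) * var_all)
    # Assumption of this sketch: clip to [0, 1].
    return min(1.0, max(0.0, lam)) if clip else lam


# Toy values, made up for illustration only.
y = [0.0, 1.0, 0.0, 1.0]
good_proxy = [0.1, 0.9, 0.2, 0.8]   # tracks the human labels closely
bad_proxy = [1.0, 1.0, 0.0, 0.0]    # zero sample covariance with y
unlabeled = [0.1, 0.9, 0.2, 0.8, 0.15, 0.85]

lam_good = ppi_pp_lambda(y, good_proxy, unlabeled)
lam_bad = ppi_pp_lambda(y, bad_proxy, unlabeled)
```

Here `lam_bad` is exactly 0 because the sample covariance vanishes, so the estimate falls back to the human-only mean, while `lam_good` sits at the upper end of the range.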

Stratified PPI++

Standard PPI++ assumes that labeled and unlabeled samples are drawn uniformly from a single population. In practice, the dataset is often naturally partitioned into strata (for example, by language, domain, or question type), and the proxy model may behave very differently across these strata. Stratified PPI++ [5, 6] exploits this structure: rather than applying one global estimate, it runs PPI++ independently within each stratum and combines the results with weights proportional to stratum size.

Let \(K\) denote the number of strata. Stratum \(k\) contains \(n_k+N_k\) total samples (labeled + unlabeled), of which \(n_k\) are labeled. We let \(n = \sum_k n_k\) and \(N = \sum_k N_k\) be the total numbers of labeled and unlabeled samples respectively. We assume that \(n_k/n \approx N_k/N\) for all \(k\) and compute the population weight of stratum \(k\) as:

\[w_k = \frac{n_k+N_k}{n+N}\]

In Stratified PPI++, each sample has the same values as in PPI++ (a proxy label \(\tilde{Y}_i\) and optionally a ground-truth label \(Y_j\)), plus a stratum identifier indicating which stratum the sample belongs to.

Value	Present for	Description
\(\tilde{Y}_i\)	All \(n+N\) samples	Proxy label
\(Y_j\)	Labeled samples only (\(n \ll N\))	Ground-truth label
\(g_j\)	All \(n+N\) samples	Stratum identifier

The stratum identifiers make it possible to partition samples and their labels by stratum.

Mean estimation

The Stratified PPI++ point estimate is a weighted average of the per-stratum PPI++ estimates:

\[\hat{\theta}_{\text{strat}} = \sum_{k=1}^{K} w_k \cdot \hat{\theta}_k(\lambda_k)\]

where \(\hat{\theta}_k(\lambda_k)\) is exactly the PPI++ mean estimator applied to the data in stratum \(k\) with its own weight \(\lambda_k\). The weights \(w_k\) are proportional to stratum size, so larger strata contribute more to the final estimate. Since each \(\hat{\theta}_k(\lambda_k)\) is an unbiased estimator for the stratum-\(k\) mean and the weights sum to one, \(\hat{\theta}_{\text{strat}}\) is an unbiased estimator for the population mean \(\theta^*\).

Variance and confidence intervals

Keep in mind that Stratified PPI++ is designed for a small number of large strata. The theoretical guarantees assume that the number of strata \(K\) stays fixed as sample size grows and that each stratum contains a non-vanishing share of the data. In practice, many small strata mean that per-stratum statistical estimates become unreliable, and the CLT approximation underlying the confidence interval may break down. When in doubt, prefer a coarser stratification with fewer, larger strata.

The asymptotic variance of \(\hat{\theta}_{\text{strat}}\) is the sum of the per-stratum PPI++ variances, each scaled by its squared population weight:

\[\sigma^2_{\text{strat}} = \sum_{k=1}^{K} w_k^2 \cdot \sigma^2_k(\lambda_k)\]

where \(\sigma^2_k(\lambda_k)\) is the PPI++ variance for stratum \(k\). The reported standard deviation \(\sigma_{\text{strat}}\) serves to construct a confidence interval at level \(1 - \alpha\) via the CLT exactly as in PPI++.

The key benefit over global PPI++ becomes apparent when strata differ substantially in proxy quality. Strata where the proxy is accurate contribute a small \(\sigma^2_k(\lambda_k)\), while strata where it is poor contribute a larger one, but each contribution is isolated to its own stratum instead of polluting the global estimate.
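The weighted combination of per-stratum estimates and variances can be sketched as below. The function names and the dictionary layout (`stratum id -> (human labels, labeled proxies, unlabeled proxies)`) are assumptions of this example, not GLIDE's interface, and the per-stratum \(\lambda_k\) values are simply passed in.

```python
from math import sqrt
from statistics import fmean, variance


def stratum_ppi_pp(y, proxy_l, proxy_u, lam):
    """PPI++ estimate and variance within a single stratum."""
    n, N = len(y), len(proxy_u)
    theta = fmean(y) + lam * (fmean(proxy_u) - fmean(proxy_l))
    residuals = [a - lam * b for a, b in zip(y, proxy_l)]
    var = variance(residuals) / n + lam**2 * variance(proxy_u) / N
    return theta, var


def stratified_ppi_pp(strata, lambdas):
    """Combine strata with w_k = (n_k + N_k) / (n + N); returns (estimate, std dev)."""
    total = sum(len(y) + len(pu) for y, _, pu in strata.values())
    theta, var = 0.0, 0.0
    for k, (y, pl, pu) in strata.items():
        w = (len(y) + len(pu)) / total
        theta_k, var_k = stratum_ppi_pp(y, pl, pu, lambdas[k])
        theta += w * theta_k
        var += w**2 * var_k      # per-stratum variances add with squared weights
    return theta, sqrt(var)


# Toy values, made up for illustration only.
strata = {
    "en": ([1.0, 0.0, 1.0, 1.0], [0.9, 0.2, 0.7, 0.6], [0.8, 0.4, 0.6, 0.5, 0.7, 0.9]),
    "fr": ([0.0, 0.0, 1.0], [0.3, 0.1, 0.6], [0.2, 0.5, 0.4, 0.1]),
}
lambdas = {"en": 1.0, "fr": 0.5}
theta_strat, sd_strat = stratified_ppi_pp(strata, lambdas)
```

With a single stratum the weight is 1 and the function returns the plain PPI++ estimate and standard deviation, which is a convenient sanity check.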

Power-tuning

Each stratum \(k\) receives its own optimal weight \(\hat{\lambda}_k\), computed with the same closed-form formula as PPI++, restricted to the \(n_k\) labeled and \(N_k\) unlabeled samples within that stratum:

\[\hat{\lambda}_k = \frac{\widehat{\text{Cov}}_{n_k}(Y^k,\, \tilde{Y}^{k, \bullet})}{\left(1 + \tfrac{n_k}{N_k}\right)\widehat{\text{Var}}_{n_k + N_k}(\tilde{Y}^k)},\]

where \(Y^k\) is the vector of ground-truths in stratum \(k\) (i.e. available \(Y_j\) values with \(j\) such that \(g_j = k\)), \(\tilde{Y}^{k, \bullet}\) is the vector of proxy labels on the labeled portion of stratum \(k\) and \(\tilde{Y}^{k}\) contains all proxy labels in stratum \(k\). This is the same formula as PPI++ power-tuning, applied stratum by stratum. In strata where the proxy is informative, \(\hat{\lambda}_k\) is close to 1 and the stratum estimate benefits from the proxy signal. In strata where the proxy is weak or unreliable, \(\hat{\lambda}_k\) shrinks toward 0, falling back to the classical human-only mean for that stratum, without affecting any other stratum. It is standard to use optimal power-tuning with the previous \(\hat{\lambda}_k\) values.

Active Statistical Inference (ASI)

Standard approaches to combining proxy and human labels assume that the labeled subset is drawn uniformly at random from the population. In practice, annotation resources are often allocated strategically, for instance, prioritizing uncertain or difficult examples. Active Statistical Inference (ASI) [3, 4] handles this general case: each sample \(X_i\) may have a distinct, pre-determined probability \(\pi_i \in (0, 1]\) of being selected for human annotation. Inverse-Probability Weighting (IPW) corrects for this non-uniform selection, yielding valid confidence intervals under any fixed sampling rule.

In this section, we assume we have a total of \(n\) samples, labeled and unlabeled. In ASI, each sample has three associated values:

Value	Present for	Description
\(\tilde{Y}_i\)	All \(n\) samples	Proxy label
\(\pi_i\)	All \(n\) samples	Known, pre-determined sampling probability
\(\xi_i\)	All \(n\) samples	Sampling indicator such that \(\Pr(\xi_i = 1) = \pi_i = 1 - \Pr(\xi_i = 0)\)
\(Y_i\)	Labeled samples only (\(\xi_i = 1\))	Ground-truth label

We define \(\xi_i \in \{0, 1\}\) as the sampling indicator: \(\xi_i = 1\) if a ground-truth label is present for sample \(i\), and \(\xi_i = 0\) otherwise. Crucially, \(\pi_i\) must be known for every sample. It is a property of the sampling design, not derived from the data.

Mean estimation

The core of ASI is a per-sample IPW-corrected effective label:

\[z_i(\lambda) = \lambda\,\tilde{Y}_i + \xi_i\,\frac{Y_i - \lambda\,\tilde{Y}_i}{\pi_i}\]

Expanding by case:

Unlabeled (\(\xi_i = 0\)): \(\quad z_i = \lambda\,\tilde{Y}_i\)
Labeled (\(\xi_i = 1\)): \(\quad z_i = \lambda\,\tilde{Y}_i + \dfrac{Y_i - \lambda\,\tilde{Y}_i}{\pi_i}\)

For labeled samples, the residual \(Y_i - \lambda\,\tilde{Y}_i\) is divided by \(\pi_i\). This up-weights samples that were less likely to be selected, ensuring each labeled sample represents its fair share of the population. The parameter \(\lambda\) modulates how much weight the proxy label receives.

The ASI mean estimator is simply the average of the IPW-corrected labels:

\[\hat{\theta}_{\lambda} = \frac{1}{n}\sum_{i=1}^{n} z_i(\lambda)\]

This estimator is unbiased for the population mean under any fixed sampling design, provided \(\pi_i > 0\) for all samples.

At \(\lambda = 0\), this reduces to the classical Horvitz–Thompson estimator, which uses only the labeled samples (each weighted by \(1/\pi_i\)). As \(\lambda\) increases, the proxy labels contribute progressively more to the estimate.
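A small sketch of the IPW-corrected labels and their average may help make the two cases concrete. The function names are hypothetical, and representing a missing human label as `None` is a convention of this example only.

```python
from statistics import fmean


def asi_effective_labels(proxy, pi, xi, y, lam=1.0):
    """z_i(lam); y[i] is ignored (and may be None) wherever xi[i] == 0."""
    z = []
    for p, prob, sampled, label in zip(proxy, pi, xi, y):
        value = lam * p                           # proxy contribution, present for every sample
        if sampled:
            value += (label - lam * p) / prob     # IPW-corrected residual, labeled samples only
        z.append(value)
    return z


def asi_mean(proxy, pi, xi, y, lam=1.0):
    return fmean(asi_effective_labels(proxy, pi, xi, y, lam))


# Toy values, made up for illustration only.
proxy = [0.8, 0.3, 0.6, 0.1]
pi = [0.5, 0.25, 1.0, 0.5]      # known sampling probabilities
xi = [1, 0, 1, 0]               # which samples were actually annotated
y = [1.0, None, 0.0, None]

theta_ht = asi_mean(proxy, pi, xi, y, lam=0.0)    # Horvitz-Thompson
theta_one = asi_mean(proxy, pi, xi, y, lam=1.0)
```

On these values the \(\lambda = 0\) estimate is 0.5, which matches summing \(Y_i / \pi_i\) over the labeled samples and dividing by \(n = 4\), and the \(\lambda = 1\) estimate is 0.4.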

Variance and confidence intervals

The asymptotic variance is the sample variance of the corrected labels divided by \(n\):

\[\hat{\sigma}^2_{\text{SE}}(\lambda) = \frac{\widehat{\text{Var}}\!\left(z(\lambda)\right)}{n}\]

where \(\widehat{\text{Var}}\) denotes the sample variance. By the Central Limit Theorem (for \(n\) large enough, typically \(n \geq 50\)), this yields a confidence interval at level \(1 - \alpha\):

\[\Pr\!\left(\theta^* \in \left[\hat{\theta}_{\lambda} - z_{1-\alpha/2}\,\hat{\sigma}_{\text{SE}},\; \hat{\theta}_{\lambda} + z_{1-\alpha/2}\,\hat{\sigma}_{\text{SE}}\right]\right) \geq 1 - \alpha\]

where \(z_{1-\alpha/2}\) is the standard normal quantile (e.g. \(z_{0.975} = 1.96\) for a \(95\%\) two-sided confidence interval).

Power-tuning

The choice of \(\lambda\) directly controls the width of the confidence interval. A poor value can increase variance relative to a human-only estimate. ASI derives a closed-form optimal \(\lambda\) by minimising \(\hat{\sigma}^2_{\text{SE}}(\lambda)\) analytically.

Define two per-sample quantities:

\[a_i = \tilde{Y}_i\!\left(\frac{\xi_i}{\pi_i} - 1\right), \qquad b_i = Y_i \cdot \frac{\xi_i}{\pi_i}\]

\(a_i\) is computable for every sample (requires only \(\tilde{Y}_i\), \(\xi_i\), and \(\pi_i\)).
\(b_i\) equals \(Y_i / \pi_i\) for labeled samples and \(0\) for unlabeled samples.

The variance-minimising \(\lambda\) is:

\[\hat{\lambda} = \frac{\widehat{\text{Cov}}(a,\, b)}{\widehat{\text{Var}}(a)}\]

When the proxy is informative, \(\hat{\lambda}\) is large and the IPW-corrected labels benefit from the proxy signal, narrowing the confidence interval. When the proxy is uninformative, \(\hat{\lambda}\) shrinks toward 0, down-weighting it and approaching the Horvitz–Thompson estimator. Fixing \(\lambda = 1\) gives the estimator without power-tuning, in which the proxy label receives full weight. It is standard to use optimal power-tuning with the \(\hat{\lambda}\) value above.
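Since \(z_i(\lambda) = b_i - \lambda\, a_i\), the sample variance of the corrected labels is a quadratic in \(\lambda\), and the formula above is its minimiser. The sketch below computes \(a_i\), \(b_i\), \(\hat{\lambda}\), and the resulting interval; all names are hypothetical, missing labels are encoded as `None` by convention of this example, and the toy sample is much smaller than the CLT would require.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def sample_cov(xs, ys):
    """Sample covariance with denominator len(xs) - 1."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def asi_ab(proxy, pi, xi, y):
    """Per-sample a_i and b_i; y[i] may be None wherever xi[i] == 0."""
    a = [p * (s / prob - 1) for p, prob, s in zip(proxy, pi, xi)]
    b = [label / prob if s else 0.0 for prob, s, label in zip(pi, xi, y)]
    return a, b


def asi_power_tuned(proxy, pi, xi, y, alpha=0.05):
    a, b = asi_ab(proxy, pi, xi, y)
    lam = sample_cov(a, b) / variance(a)
    z = [bi - lam * ai for ai, bi in zip(a, b)]   # z_i(lam) = b_i - lam * a_i
    theta = fmean(z)
    se = sqrt(variance(z) / len(z))
    q = NormalDist().inv_cdf(1 - alpha / 2)
    return lam, theta, se, (theta - q * se, theta + q * se)


# Toy values, made up for illustration only.
proxy = [0.8, 0.3, 0.6, 0.1, 0.9, 0.4]
pi = [0.5, 0.25, 1.0, 0.5, 0.5, 0.25]
xi = [1, 0, 1, 0, 1, 1]
y = [1.0, None, 0.0, None, 1.0, 0.0]

lam_hat, theta, se, (lower, upper) = asi_power_tuned(proxy, pi, xi, y)
```

Because \(\hat{\lambda}\) minimises the sample variance exactly, the standard error returned here is never larger than the one obtained at \(\lambda = 0\) or \(\lambda = 1\) on the same data.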

Clustered Prediction-Powered Inference

Standard PPI++ assumes that observations are independent draws from the population. In practice, data often exhibits a cluster structure: samples are grouped into natural units (for example, phrases from the same paragraph), and observations within a cluster may be correlated. Clustered PPI++ handles this case by reducing each cluster to its mean and treating the resulting cluster means as independent observations, so that PPI++ can be applied directly to them. This yields valid confidence intervals even when within-cluster dependence is strong.

The data is partitioned into \(L\) disjoint clusters \(C_1, \dots, C_L\), with \(\bigcup_{l=1}^{L} C_l = \{1, \dots, n + N\}\) and \(C_i \cap C_j = \emptyset\) for \(i \neq j\). Each cluster is either fully labeled or fully unlabeled: no cluster contains a mix of labeled and unlabeled observations. Observations from distinct clusters are independent, but no independence is assumed among observations within the same cluster. Let \(L^{\bullet}\) denote the number of labeled clusters and \(L^{\circ}\) the number of unlabeled clusters.

In Clustered PPI++, each sample has the same values as in PPI++, plus a cluster identifier:

Value	Present for	Description
\(\tilde{Y}_i\)	All \(n+N\) samples	Proxy label
\(Y_j\)	Labeled samples only (\(n \ll N\))	Ground-truth label
\(c_i\)	All \(n+N\) samples	Cluster identifier

The cluster identifiers allow partitioning the data into cluster-level means, which replace individual observations as the sampling units used for inference.

Mean estimation

The first step is computing cluster means. For each labeled cluster \(l\), the true-label mean and proxy mean are:

\[\bar{Y}^{(l)} = \frac{1}{|C_l|}\sum_{i \in C_l} Y_i, \qquad \bar{\tilde{Y}}^{(l),\bullet} = \frac{1}{|C_l|}\sum_{i \in C_l} \tilde{Y}_i\]

For each unlabeled cluster \(m\), the proxy mean is:

\[\bar{\tilde{Y}}^{(m),\circ} = \frac{1}{|C_m|}\sum_{i \in C_m} \tilde{Y}_i\]

We denote \(\bar{\tilde{Y}}^{(l)}\) the vector containing all \(L^{\bullet} + L^{\circ}\) labeled and unlabeled proxy cluster means.

The Clustered PPI++ mean estimate is exactly the PPI++ mean estimator applied to the cluster means, with \(L^{\bullet}\) labeled and \(L^{\circ}\) unlabeled observations:

\[\hat{\theta} = \frac{1}{L^{\bullet}}\sum_{l=1}^{L^{\bullet}} \bar{Y}^{(l)} + \lambda \left[\frac{1}{L^{\circ}}\sum_{m=1}^{L^{\circ}} \bar{\tilde{Y}}^{(m),\circ} - \frac{1}{L^{\bullet}}\sum_{l=1}^{L^{\bullet}} \bar{\tilde{Y}}^{(l),\bullet}\right]\]

This combines two components:

The human-label cluster mean \(\frac{1}{L^{\bullet}}\sum_l \bar{Y}^{(l)}\), unbiased but high-variance due to the small number of labeled clusters.
A variance reduction term scaled by \(\lambda\), using all proxy cluster means to reduce variance.

Note. Each term in the formula above uses the mean of cluster means rather than a size-weighted mean, for instance, \(\frac{1}{L^{\bullet}}\sum_l \bar{Y}^{(l)}\) rather than \(\sum_l \frac{|C_l|}{n} \bar{Y}^{(l)}\). Both are unbiased for \(\theta^*\), but their variances differ based on within-cluster dependence. Because no independence is assumed among observations within the same cluster, the conservative assumption treats them as perfectly correlated, which gives \(\mathrm{Var}(\bar{Y}^{(l)}) = \mathrm{Var}(Y) = \sigma^2\) for every cluster, regardless of its size. Under this assumption, the variance of any estimator of the form \(\sum_l \alpha_l \bar{Y}^{(l)}\) with \(\sum_l \alpha_l = 1\) equals \(\sigma^2 \sum_l \alpha_l^2\). Applying the Cauchy-Schwarz inequality to the vectors \((\alpha_1, \ldots, \alpha_{L^{\bullet}})\) and \((1, \ldots, 1)\) gives \(\left(\sum_l \alpha_l \cdot 1\right)^2 \leq \left(\sum_l \alpha_l^2\right)\cdot L^\bullet\) so that \(\sum_l \alpha_l^2 \geq \frac{1}{L^{\bullet}}\), with equality if and only if all \(\alpha_l\) are equal to \(\frac{1}{L^{\bullet}}\). The conclusion is that the mean of cluster means is the minimum-variance unbiased estimator of this form, and is therefore preferred over the size-weighted mean, which sets \(\alpha_l = \frac{|C_l|}{n}\) and yields a strictly larger variance whenever cluster sizes are unequal.

Variance and confidence intervals

Because observations within a cluster are not assumed independent, the variance cannot be computed at the individual sample level. Instead, the cluster means are treated as the independent units, and the variance formula is exactly the PPI++ variance applied to them:

\[\hat{\sigma}^2(\lambda) = \frac{1}{L^{\bullet}}\widehat{\text{Var}}\Big(\bar{Y}^{(l)} - \lambda \bar{\tilde{Y}}^{(l),\bullet}\Big) + \frac{\lambda^2}{L^{\circ}}\widehat{\text{Var}}\Big(\bar{\tilde{Y}}^{(m),\circ}\Big)\]

where \(\widehat{\text{Var}}\) denotes the sample variance computed across the \(L^{\bullet}\) labeled cluster means in the first term and the \(L^{\circ}\) unlabeled cluster means in the second.

By the Central Limit Theorem applied to the cluster means, this yields a confidence interval at level \(1 - \alpha\):

\[\Pr\!\left(\theta^* \in \left[\hat{\theta} - z_{1-\alpha/2}\,\hat{\sigma}(\lambda),\; \hat{\theta} + z_{1-\alpha/2}\,\hat{\sigma}(\lambda)\right]\right) \geq 1 - \alpha\]

where \(z_{1-\alpha/2}\) is the standard normal quantile (e.g. \(z_{0.975} = 1.96\) for a 95% two-sided confidence interval).

Power-tuning

As in PPI++, the optimal \(\lambda\) minimizes \(\hat{\sigma}^2(\lambda)\). The same closed-form formula applies, with cluster means as observations:

\[\hat{\lambda} = \frac{\widehat{\text{Cov}}_{L^{\bullet}}(\bar{Y}^{(l)},\, \bar{\tilde{Y}}^{(l),\bullet})}{\left(1 + \dfrac{L^{\bullet}}{L^{\circ}}\right) \widehat{\text{Var}}_{L^{\bullet} + L^{\circ}}(\bar{\tilde{Y}}^{(l)})}\]

where:

\(\widehat{\text{Cov}}_{L^{\bullet}}(\bar{Y}^{(l)}, \bar{\tilde{Y}}^{(l),\bullet})\) is the sample covariance between true and proxy cluster means, computed on the \(L^{\bullet}\) labeled clusters,
\(\widehat{\text{Var}}_{L^{\bullet} + L^{\circ}}(\bar{\tilde{Y}}^{(l)})\) is the sample variance over all \(L^{\bullet} + L^{\circ}\) proxy cluster means.

When the proxy is informative, \(\hat{\lambda}\) is close to 1 and the variance reduction narrows the confidence interval. When the proxy is uninformative, \(\hat{\lambda}\) shrinks toward 0, falling back to the classical cluster mean. It is standard to use the optimal \(\hat{\lambda}\) in practice.

When every cluster is a singleton, \(L^{\bullet} = n\) and \(L^{\circ} = N\), and all formulas reduce exactly to their PPI++ counterparts, recovering the full PPI++ procedure.
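The reduction to cluster means can be sketched as follows: collapse each cluster to its mean, then apply the PPI++ formulas to those means. The function names and the input layout (parallel lists of values and cluster identifiers) are assumptions of this example rather than GLIDE's interface.

```python
from collections import defaultdict
from statistics import fmean, variance


def cluster_means(values, cluster_ids):
    """Mean of `values` within each cluster, keyed by cluster identifier."""
    groups = defaultdict(list)
    for v, c in zip(values, cluster_ids):
        groups[c].append(v)
    return {c: fmean(vs) for c, vs in groups.items()}


def clustered_ppi_pp(y_l, proxy_l, clusters_l, proxy_u, clusters_u, lam=1.0):
    """Return (estimate, variance) with cluster means as the sampling units."""
    y_bar = cluster_means(y_l, clusters_l)
    p_bar_l = cluster_means(proxy_l, clusters_l)
    ids = list(y_bar)                                  # labeled cluster ids, fixed order
    ys = [y_bar[c] for c in ids]
    pl = [p_bar_l[c] for c in ids]
    pu = list(cluster_means(proxy_u, clusters_u).values())
    theta = fmean(ys) + lam * (fmean(pu) - fmean(pl))  # unweighted mean of cluster means
    residuals = [a - lam * b for a, b in zip(ys, pl)]
    var = variance(residuals) / len(ys) + lam**2 * variance(pu) / len(pu)
    return theta, var


# Toy values, made up for illustration only.
y_l = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
proxy_l = [0.9, 0.8, 0.2, 0.3, 0.7, 0.4]
clusters_l = ["p1", "p1", "p2", "p2", "p3", "p3"]
proxy_u = [0.6, 0.7, 0.2, 0.1, 0.9, 0.8, 0.5, 0.4]
clusters_u = ["q1", "q1", "q2", "q2", "q3", "q3", "q4", "q4"]

theta, var = clustered_ppi_pp(y_l, proxy_l, clusters_l, proxy_u, clusters_u, lam=1.0)
```

Passing a distinct identifier for every sample turns each cluster into a singleton, in which case the function returns the same numbers as plain PPI++.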

Multi-Proxy Prediction-Powered Inference (Multi-PPI)

Standard PPI++ assumes a single proxy model. Multi-Proxy Prediction-Powered Inference (Multi-PPI) [8] generalises this to \(M \geq 1\) proxy models simultaneously. It finds the optimal linear combination of all \(M\) proxies before applying the PPI correction. As in PPI++, labeled samples are drawn uniformly at random from the population.

In Multi-PPI, each sample has two associated values:

Value	Present for	Description
\(\tilde{Y}^{(m)}_i\), for \(m = 1, \ldots, M\)	All \(n+N\) samples	Proxy prediction from proxy \(m\)
\(Y_j\)	Labeled samples only (\(n \ll N\))	Ground-truth label

Mean estimation

Let \(\tilde{\mathbf{Y}}^\bullet_j = (\tilde{Y}^{(1),\bullet}_j, \ldots, \tilde{Y}^{(M),\bullet}_j)^\top \in \mathbb{R}^M\) and \(\tilde{\mathbf{Y}}^\circ_i = (\tilde{Y}^{(1),\circ}_i, \ldots, \tilde{Y}^{(M),\circ}_i)^\top \in \mathbb{R}^M\) denote the proxy prediction vectors for labeled and unlabeled samples respectively. The Multi-PPI mean estimate is parameterised by a tuning vector \(\mathbf{\lambda} = (\lambda_1, \ldots, \lambda_M)^\top \in \mathbb{R}^M\) and defined as:

\[\hat{\theta}_{\mathbf{\lambda}} = \frac{1}{n}\sum_{j=1}^{n} Y_j + \mathbf{\lambda}^\top \left[\frac{1}{N}\sum_{i=1}^{N} \tilde{\mathbf{Y}}^\circ_i - \frac{1}{n}\sum_{j=1}^{n} \tilde{\mathbf{Y}}^\bullet_j\right]\]

This combines two components:

The human-label mean \(\frac{1}{n}\sum_j Y_j\), unbiased but high-variance due to the small labeled set.
A variance reduction term using all proxy predictions across \(M\) proxies. The tuning parameter \(\lambda_m\) scales the correction from proxy \(m\): the difference between the unlabeled and labeled means of that proxy.

When \(\mathbf{\lambda} = \mathbf{0}\), the estimator reduces to the naive sample mean. When \(M = 1\) and \(\lambda_1 = 1\), it recovers standard PPI.

Variance and confidence intervals

By the Central Limit Theorem (for large enough \(n\), typically \(n \geq 50\)), the asymptotic variance of the Multi-PPI estimator decomposes as:

\[\sigma^2_{\hat{\theta}}(\mathbf{\lambda}) = \underbrace{\frac{\text{Var}(Y - \mathbf{\lambda}^\top \tilde{\mathbf{Y}})}{n}}_{\text{Labeled residual variance}} + \underbrace{\frac{\text{Var}(\mathbf{\lambda}^\top \tilde{\mathbf{Y}})}{N}}_{\text{Unlabeled proxy variance}}\]

The first term is the variance of the residuals \(Y - \mathbf{\lambda}^\top \tilde{\mathbf{Y}}\) on the labeled samples, scaled by \(1/n\). It shrinks when the proxy combination is well correlated with the true labels.
The second term is the variance of the projected proxy scores \(\mathbf{\lambda}^\top \tilde{\mathbf{Y}}\) on the unlabeled samples, scaled by \(1/N\). It is typically negligible since \(N \gg n\).

This gives a confidence interval at level \(1 - \alpha\):

\[\Pr\!\left(\theta^* \in \left[\hat{\theta}_{\mathbf{\lambda}} - z_{1-\alpha/2}\, \sigma_{\hat{\theta}}(\mathbf{\lambda}),\; \hat{\theta}_{\mathbf{\lambda}} + z_{1-\alpha/2}\, \sigma_{\hat{\theta}}(\mathbf{\lambda})\right]\right) \geq 1 - \alpha\]

where \(z_{1-\alpha/2}\) is the standard normal quantile (e.g. \(z_{0.975} = 1.96\) for a 95% two-sided confidence interval).

Power-tuning

In Multi-PPI, the tuning parameter \(\mathbf{\lambda} \in \mathbb{R}^M\) is a vector rather than a scalar, since each of the \(M\) proxies receives its own weight. The mean squared error of \(\hat{\theta}_{\mathbf{\lambda}}\) is a quadratic function of \(\mathbf{\lambda}\), with a unique minimiser whenever the proxy covariance matrix \(\text{Var}(\tilde{\mathbf{Y}})\) is invertible:

\[\mathbf{\lambda}^* = \frac{N}{n+N} \cdot \text{Var}(\tilde{\mathbf{Y}})^{-1} \cdot \text{Cov}(\tilde{\mathbf{Y}}, Y)\]

where \(\text{Var}(\tilde{\mathbf{Y}})\) is the \(M \times M\) covariance matrix of the proxy predictions and \(\text{Cov}(\tilde{\mathbf{Y}}, Y)\) is the \(M\)-dimensional cross-covariance vector between proxy predictions and the true label. For \(M = 1\), this reduces to the same scalar formula as in PPI++.

In practice, \(\mathbf{\lambda}^*\) is unknown and replaced by the plug-in estimator:

\[\hat{\mathbf{\lambda}} = \frac{N}{n+N} \cdot \widehat{\text{Var}}_{n+N}(\tilde{\mathbf{Y}})^{-1} \cdot \widehat{\text{Cov}}_n(\tilde{\mathbf{Y}}^\bullet, Y)\]

where:

\(\widehat{\text{Var}}_{n+N}(\tilde{\mathbf{Y}})\) is the \(M \times M\) sample covariance matrix of the proxy predictions, computed over all \(n + N\) samples,
\(\widehat{\text{Cov}}_n(\tilde{\mathbf{Y}}^\bullet, Y)\) is the \(M\)-dimensional sample cross-covariance vector between labeled proxy predictions and true labels, computed over the \(n\) labeled samples only.

When a proxy is informative (high covariance with the true label), the corresponding component \(\lambda_m^*\) is large and the estimator benefits from that proxy's signal, narrowing the confidence interval. When a proxy is uninformative, its component shrinks toward 0, down-weighting it without affecting the other components. Because \(\mathbf{\lambda} = \mathbf{0}\) is always an admissible choice, the variance at the optimum \(\mathbf{\lambda}^*\) is no greater than \(\text{Var}(Y)/n\), the variance of the naive sample mean. Asymptotically, the same holds for the plug-in \(\hat{\mathbf{\lambda}}\): Multi-PPI with optimal tuning cannot widen the confidence interval relative to the classical estimator, regardless of the number or quality of the proxies. It is standard to use the empirical estimate \(\hat{\mathbf{\lambda}}\) in practice.
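The plug-in estimator amounts to solving one small linear system. The sketch below does so in pure Python with Gaussian elimination; the function names and the input layout (one list of proxy values per proxy model) are assumptions of this example, and the solver fails with a division by zero if the proxy covariance matrix is singular.

```python
from statistics import fmean


def cov(xs, ys):
    """Sample covariance with denominator len(xs) - 1."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            f = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * m
    for r in reversed(range(m)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, m))
        x[r] = (aug[r][m] - tail) / aug[r][r]
    return x


def multi_ppi_lambda(y, proxies_l, proxies_u):
    """Plug-in lambda: N/(n+N) * Var(pooled proxies)^-1 * Cov(labeled proxies, y)."""
    n, N = len(y), len(proxies_u[0])
    pooled = [list(pl) + list(pu) for pl, pu in zip(proxies_l, proxies_u)]
    var = [[cov(p, q) for q in pooled] for p in pooled]   # M x M, all n + N samples
    cross = [cov(pl, y) for pl in proxies_l]              # M entries, labeled samples only
    return [N / (n + N) * v for v in solve(var, cross)]


def multi_ppi_mean(y, proxies_l, proxies_u, lam):
    shift = sum(l * (fmean(pu) - fmean(pl)) for l, pl, pu in zip(lam, proxies_l, proxies_u))
    return fmean(y) + shift


# Toy values, made up for illustration only (M = 2 proxies).
y = [0.0, 1.0, 0.0, 1.0, 1.0]
proxies_l = [[0.1, 0.9, 0.2, 0.8, 0.7], [0.5, 0.4, 0.3, 0.9, 0.6]]
proxies_u = [[0.2, 0.8, 0.3, 0.7, 0.6, 0.4], [0.4, 0.6, 0.2, 0.8, 0.5, 0.7]]

lam_hat = multi_ppi_lambda(y, proxies_l, proxies_u)
theta = multi_ppi_mean(y, proxies_l, proxies_u, lam_hat)
```

Calling `multi_ppi_lambda` with a single proxy reproduces the scalar PPI++ formula, since \(N/(n+N) = 1/(1 + n/N)\).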

Predict-Then-Debias (PTD)

Predict-Then-Debias (PTD) [7] constructs a confidence interval from the empirical distribution of bootstrap estimates rather than a normal approximation, making it reliable when \(n\) is small or residuals are non-Gaussian. GLIDE implements Algorithm 3 from [7], which works on a uniformly drawn labeled sample and includes a speedup that avoids resampling the unlabeled data during the bootstrap.

In PTD, each sample has two associated values:

Value	Present for	Description
\(\tilde{Y}_i\)	All \(n+N\) samples	Proxy label
\(Y_j\)	Labeled samples only (\(n \ll N\))	Ground-truth label

Denote \((\tilde{Y}^\circ_i)_{i=1}^N\) the unlabeled proxies and \((\tilde{Y}^\bullet_j)_{j=1}^n\) the labeled ones.

Mean estimation

The PTD mean estimate is the average of \(B\) bootstrap estimates:

\[\hat{\theta}_{\text{PTD}} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{(b)}_{\text{PTD}}\]

where each \(\hat{\theta}^{(b)}_{\text{PTD}}\) is computed during the bootstrap procedure described below.

Bootstrap procedure

For \(b = 1, \dots, B\), sample a set of indices \(\mathcal{I}^{(b)}\) of size \(n\) uniformly with replacement from \(\{1, \dots, n\}\) and compute the bootstrap means of the labeled ground-truth and proxy labels:

\[\hat{\mu}^{(b)}_{\text{true}} = \frac{1}{n}\sum_{j\in \mathcal{I}^{(b)}} Y_j, \qquad \hat{\mu}^{(b)}_{\text{proxy}} = \frac{1}{n}\sum_{j\in \mathcal{I}^{(b)}} \tilde{Y}^\bullet_j\]

The third ingredient is a perturbed draw of the unlabeled proxy mean, \(\tilde{\gamma}^{(b)}\). Naively, this would require resampling all \(N\) unlabeled proxy labels at each iteration. Algorithm 3 in [7] avoids this cost: by the CLT, the mean of \(N\) i.i.d. proxy scores is approximately Gaussian with mean \(\hat{\gamma}^\circ = \frac{1}{N}\sum_{i=1}^{N}\tilde{Y}^\circ_i\) and variance \(\hat{S}_{\gamma}^\circ = \widehat{\text{Var}}(\tilde{Y}^\circ) / N\). The expensive resample is therefore replaced by a single standard Gaussian draw, which mimics the bootstrap randomness at a far lower computational cost:

\[\tilde{\gamma}^{(b)} = \hat{\gamma}^\circ + Z^{(b)} \cdot \sqrt{\hat{S}_{\gamma}^\circ}, \qquad Z^{(b)} \sim \mathcal{N}(0,\, 1)\]

The quantities \(\hat{\gamma}^\circ\) and \(\hat{S}_{\gamma}^\circ\) are computed once before the loop, reducing the per-iteration cost from \(O(n+N)\) to \(O(n)\). This approximation is reliable for large \(N\), which is typical in practice because proxy labels are far cheaper than human annotations.

Combining the labeled bootstrap means with this unlabeled draw gives:

\[\hat{\theta}^{(b)}_{\text{PTD}} = \lambda \cdot \tilde{\gamma}^{(b)} + \left(\hat{\mu}^{(b)}_{\text{true}} - \lambda \cdot \hat{\mu}^{(b)}_{\text{proxy}}\right)\]

where \(\lambda\) is a power-tuning factor that controls the proxy labels' influence similarly to previous sections. The term \(\hat{\mu}^{(b)}_{\text{true}} - \lambda \cdot \hat{\mu}^{(b)}_{\text{proxy}}\) captures the proxy bias measured on the labeled set, while \(\lambda \cdot \tilde{\gamma}^{(b)}\) contributes the proxy signal on the full unlabeled population. Together they form a bias-corrected estimate of \(\theta^*\) for each bootstrap replicate.

Confidence intervals

The confidence interval at level \(1 - \alpha\) is the interval between the \(\alpha/2\) and \(1 - \alpha/2\) empirical quantiles of \(\bigl\{\hat{\theta}^{(1)}_{\text{PTD}},\, \ldots,\, \hat{\theta}^{(B)}_{\text{PTD}}\bigr\}\). This bootstrap percentile approach adapts to the actual shape of the residual distribution, making it reliable even when \(n\) is small.
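The bootstrap loop and the percentile interval can be sketched as below for a fixed \(\lambda\). This is an illustration under stated assumptions rather than GLIDE's code or a faithful copy of Algorithm 3 in [7]: the name `ptd_bootstrap`, the default \(B = 2000\), the fixed seed, and the nearest-rank quantile rule are all choices made for this example.

```python
import random
from math import sqrt
from statistics import fmean, variance


def ptd_bootstrap(y, proxy_l, proxy_u, lam=1.0, B=2000, alpha=0.05, seed=0):
    """Return (point estimate, (lower, upper)) from B bootstrap replicates."""
    rng = random.Random(seed)
    n, N = len(y), len(proxy_u)
    gamma_hat = fmean(proxy_u)             # computed once, before the loop
    s_gamma = variance(proxy_u) / N        # sampling variance of the unlabeled proxy mean
    draws = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]     # resample labeled indices
        mu_true = fmean([y[j] for j in idx])
        mu_proxy = fmean([proxy_l[j] for j in idx])
        # Gaussian draw replaces resampling the N unlabeled proxies.
        gamma_b = gamma_hat + rng.gauss(0.0, 1.0) * sqrt(s_gamma)
        draws.append(lam * gamma_b + (mu_true - lam * mu_proxy))
    draws.sort()
    # Assumption of this sketch: nearest-rank empirical quantiles.
    lower = draws[round((alpha / 2) * (B - 1))]
    upper = draws[round((1 - alpha / 2) * (B - 1))]
    return fmean(draws), (lower, upper)


# Toy values, made up for illustration only.
y = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0]
proxy_l = [0.9, 0.2, 0.7, 0.6, 0.3, 0.8, 0.7, 0.4]
proxy_u = [0.8, 0.4, 0.6, 0.5, 0.7, 0.9, 0.3, 0.2, 0.6, 0.7,
           0.5, 0.4, 0.8, 0.9, 0.1, 0.6, 0.7, 0.3, 0.5, 0.6]

theta_ptd, (lower, upper) = ptd_bootstrap(y, proxy_l, proxy_u, lam=1.0)
```

The point estimate stays close to the closed-form value \(\hat{\mu}_{\text{true}} + \lambda(\hat{\gamma}^\circ - \hat{\mu}_{\text{proxy}})\) computed on the original sample, differing only by bootstrap Monte Carlo noise.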

Power-tuning

The optimal \(\lambda\) is estimated from the bootstrap covariances. Let \(\hat{\mu}_{\text{true}}\) and \(\hat{\mu}_{\text{proxy}}\) be the vectors of values \(\hat{\mu}^{(b)}_{\text{true}}\) and \(\hat{\mu}^{(b)}_{\text{proxy}}\) for \(b=1,\dots,B\) respectively. After running the bootstrap loop, it is computed as:

\[\hat{\lambda} = \frac{\widehat{\text{Cov}}_B\!\left(\hat{\mu}_{\text{true}},\; \hat{\mu}_{\text{proxy}}\right)}{\widehat{\text{Var}}_B\!\left(\hat{\mu}_{\text{proxy}}\right) + \hat{S}_{\gamma}^\circ}\]

where \(\widehat{\text{Cov}}_B\) and \(\widehat{\text{Var}}_B\) are computed across the \(B\) bootstrap replicates of the labeled means, and \(\hat{S}_{\gamma}^\circ\) is the estimated sampling variance of the unlabeled proxy mean. The denominator adds \(\hat{S}_{\gamma}^\circ\) to account for the variance of the unlabeled proxies. Because the unlabeled mean is modeled with a Gaussian approximation, \(\hat{S}_{\gamma}^\circ\) is available in closed form and requires no additional resampling.

When the proxy is informative (high bootstrap covariance with ground-truth means), \(\hat{\lambda}\) is large and the estimate borrows heavily from the proxy signal, narrowing the interval. When the proxy is uninformative, \(\hat{\lambda}\) shrinks toward 0, down-weighting it. Fixing \(\lambda = 1\) recovers the unweighted PTD estimator. It is standard to use the optimal value \(\hat{\lambda}\) in practice.
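A possible way to compute \(\hat{\lambda}\) from the bootstrap replicates is sketched below. The function name, the default \(B\), and the fixed seed are hypothetical choices for this example; only the labeled bootstrap means and \(\hat{S}_{\gamma}^\circ\) are needed, so the Gaussian draw for the unlabeled mean does not appear.

```python
import random
from statistics import fmean, variance


def sample_cov(xs, ys):
    """Sample covariance with denominator len(xs) - 1."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def ptd_lambda(y, proxy_l, proxy_u, B=2000, seed=0):
    rng = random.Random(seed)
    n = len(y)
    s_gamma = variance(proxy_u) / len(proxy_u)   # variance of the unlabeled proxy mean
    mu_true, mu_proxy = [], []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        mu_true.append(fmean([y[j] for j in idx]))
        mu_proxy.append(fmean([proxy_l[j] for j in idx]))
    # Covariance and variance are taken across the B bootstrap replicates.
    return sample_cov(mu_true, mu_proxy) / (variance(mu_proxy) + s_gamma)


# Toy values, made up for illustration only.
y = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0]
proxy_l = [0.9, 0.2, 0.7, 0.6, 0.3, 0.8, 0.7, 0.4]
proxy_u = [0.8, 0.4, 0.6, 0.5, 0.7, 0.9, 0.3, 0.2, 0.6, 0.7,
           0.5, 0.4, 0.8, 0.9, 0.1, 0.6, 0.7, 0.3, 0.5, 0.6]

lam_hat = ptd_lambda(y, proxy_l, proxy_u)
```

Even a perfect proxy (labeled proxies equal to the human labels) yields a value strictly below 1 here, because \(\hat{S}_{\gamma}^\circ > 0\) inflates the denominator.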

Stratified PTD

Stratified PTD [7] extends PTD to datasets naturally partitioned into strata (for example, by language, domain, or data source). The PTD bootstrap is run independently within each stratum, each with its own tuning parameter, and the per-stratum results are combined with weights proportional to stratum size into a single estimate. When strata differ in proxy quality, this yields narrower confidence intervals than a single PTD run on the pooled data.

GLIDE implements a stratified extension of Algorithm 3 from [7], applying the CLT speedup for the unlabeled mean independently within each stratum.

Let \(K\) denote the number of strata. Stratum \(k\) contains \(n_k + N_k\) total samples, of which \(n_k\) are labeled and \(N_k\) are unlabeled, with population weight:

\[w_k = \frac{n_k + N_k}{n + N}\]

where \(n = \sum_k n_k\) and \(N = \sum_k N_k\).

In Stratified PTD, each sample has the same values as in PTD, plus a stratum identifier:

Value	Present for	Description
\(\tilde{Y}_i\)	All \(n+N\) samples	Proxy label
\(Y_j\)	Labeled samples only (\(n \ll N\))	Ground-truth label
\(g_i\)	All \(n+N\) samples	Stratum identifier

The stratum identifiers make it possible to partition the samples and their labels by stratum.

Mean estimation

The Stratified PTD point estimate is the mean of \(B\) combined bootstrap estimates:

\[\hat{\theta}_{\text{SPTD}} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{(b)}_{\text{SPTD}}\]

where each \(\hat{\theta}^{(b)}_{\text{SPTD}}\) is produced during the stratified bootstrap procedure described below.

Bootstrap procedure

Denote \((\tilde{Y}^{k, \circ}_{i})_{i=1}^{N_k}\) the unlabeled proxies in stratum \(k\) and \((\tilde{Y}^{k, \bullet}_{j})_{j=1}^{n_k}\) the labeled ones. Before the bootstrap loop, compute for each stratum \(k\) the mean and sampling variance of the unlabeled proxy scores:

\[\hat{\gamma}^\circ_k = \frac{1}{N_k}\sum_{i=1}^{N_k}\tilde{Y}^{k, \circ}_{i}, \qquad \hat{S}^{\circ}_{\gamma,k} = \frac{\widehat{\text{Var}}(\tilde{Y}^{k, \circ})}{N_k}\]

These quantities are computed once and reused across all \(B\) iterations, applying the CLT speedup to each stratum independently.

For \(b = 1, \dots, B\) and for each stratum \(k\), sample \(n_k\) indices \(\mathcal{I}^{(b)}_k\) with replacement from \(\{1, \dots, n_k\}\) and compute the bootstrap means of the labeled ground-truth and proxy labels:

\[\hat{\mu}^{(b)}_{\text{true},k} = \frac{1}{n_k}\sum_{j \in \mathcal{I}^{(b)}_k} Y^k_{j}, \qquad \hat{\mu}^{(b)}_{\text{proxy},k} = \frac{1}{n_k}\sum_{j \in \mathcal{I}^{(b)}_k} \tilde{Y}^{k, \bullet}_{j}\]

A perturbed draw of the unlabeled proxy mean for stratum \(k\) is formed as:

\[\tilde{\gamma}^{(b)}_k = \hat{\gamma}^\circ_k + Z^{(b)}_k \cdot \sqrt{\hat{S}^{\circ}_{\gamma,k}}, \qquad Z^{(b)}_k \sim \mathcal{N}(0, 1)\]

where each \(Z^{(b)}_k\) is drawn independently across strata and iterations. The per-stratum bootstrap estimates are then combined with weights proportional to stratum size:

\[\hat{\theta}^{(b)}_{\text{SPTD}} = \sum_{k=1}^{K} w_k \left[\lambda_k \cdot \tilde{\gamma}^{(b)}_k + \left(\hat{\mu}^{(b)}_{\text{true},k} - \lambda_k \cdot \hat{\mu}^{(b)}_{\text{proxy},k}\right)\right]\]

where each \(\lambda_k\) is a power-tuning factor that controls the proxy labels' influence within stratum \(k\). The term \(\hat{\mu}^{(b)}_{\text{true},k} - \lambda_k \cdot \hat{\mu}^{(b)}_{\text{proxy},k}\) captures the proxy bias in stratum \(k\) measured on the labeled set, while \(\lambda_k \cdot \tilde{\gamma}^{(b)}_k\) contributes the proxy signal on the stratum's unlabeled population.
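The stratified bootstrap described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not GLIDE's implementation: the function name, the dict-per-stratum input layout, and the choice to pass the \(\lambda_k\) values in as fixed inputs are all illustrative.

```python
import numpy as np

def stratified_ptd_bootstrap(y, proxy_lab, proxy_unlab, lambdas, B=2000, seed=0):
    """Return the B combined bootstrap estimates (hypothetical helper).

    y, proxy_lab, proxy_unlab: dicts mapping stratum id -> 1-D array.
    lambdas: dict mapping stratum id -> power-tuning factor (assumed given).
    """
    rng = np.random.default_rng(seed)
    sizes = {k: len(y[k]) + len(proxy_unlab[k]) for k in y}
    total = sum(sizes.values())
    theta = np.zeros(B)
    for k in y:
        y_k = np.asarray(y[k], dtype=float)
        p_k = np.asarray(proxy_lab[k], dtype=float)
        u_k = np.asarray(proxy_unlab[k], dtype=float)
        n_k, N_k = len(y_k), len(u_k)
        w_k = sizes[k] / total
        # Computed once per stratum: unlabeled proxy mean and its sampling variance.
        gamma = u_k.mean()
        s_gamma = u_k.var(ddof=1) / N_k
        # Resample labeled indices with replacement, one row per replicate.
        idx = rng.integers(0, n_k, size=(B, n_k))
        mu_true = y_k[idx].mean(axis=1)
        mu_proxy = p_k[idx].mean(axis=1)
        # CLT speedup: Gaussian perturbation instead of resampling unlabeled data.
        gamma_b = gamma + rng.standard_normal(B) * np.sqrt(s_gamma)
        lam = lambdas[k]
        theta += w_k * (lam * gamma_b + mu_true - lam * mu_proxy)
    return theta
```

The point estimate is then `theta.mean()`. Drawing all index sets at once trades memory for speed; a plain loop over `b` would serve equally well for large strata.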

Confidence intervals

The confidence interval at level \(1 - \alpha\) is the interval between the \(\alpha/2\) and \(1 - \alpha/2\) empirical quantiles of \(\bigl\{\hat{\theta}^{(b)}_{\text{SPTD}}\bigr\}_{b=1}^B\). This quantile-based approach makes no distributional assumption: it adapts to the shape of the residual distribution and remains reliable for small sample sizes.
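Given the vector of bootstrap estimates, the point estimate and percentile interval take only a couple of lines. The helper name below is hypothetical, and NumPy's default linear interpolation between order statistics is an assumption about how the empirical quantiles are computed.

```python
import numpy as np

def percentile_interval(theta_boot, alpha=0.05):
    """Mean of the bootstrap estimates and their (alpha/2, 1 - alpha/2) quantiles."""
    theta_boot = np.asarray(theta_boot, dtype=float)
    lo, hi = np.quantile(theta_boot, [alpha / 2, 1 - alpha / 2])
    return theta_boot.mean(), (lo, hi)
```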

Keep in mind that Stratified PTD is designed for a small number of large strata. The theoretical guarantees assume that the number of strata \(K\) stays fixed as sample size grows and that each stratum contains a non-vanishing share of the data. In practice, many small strata mean that per-stratum statistical estimates become unreliable. When in doubt, prefer a coarser stratification with fewer, larger strata.

Power-tuning

Each stratum \(k\) receives its own optimal tuning parameter \(\hat{\lambda}_k\), estimated after the bootstrap loop from the bootstrap covariances within that stratum. For each \(k=1, \dots, K\), let \(\hat{\mu}_{\text{true},k}\) and \(\hat{\mu}_{\text{proxy},k}\) be the vectors of values \(\hat{\mu}^{(b)}_{\text{true},k}\) and \(\hat{\mu}^{(b)}_{\text{proxy},k}\) for \(b=1,\dots,B\) respectively. The per-stratum optimal tuning parameter is computed as:

\[\hat{\lambda}_k = \frac{\widehat{\text{Cov}}_B\!\left(\hat{\mu}_{\text{true},k},\; \hat{\mu}_{\text{proxy},k}\right)}{\widehat{\text{Var}}_B\!\left(\hat{\mu}_{\text{proxy},k}\right) + \hat{S}^{\circ}_{\gamma,k}}\]

This is the same formula as PTD power-tuning, applied stratum by stratum. In strata where the proxy is informative, \(\hat{\lambda}_k\) is close to 1 and the estimate benefits from the proxy signal. In strata where the proxy is weak, \(\hat{\lambda}_k\) shrinks toward 0, falling back to the classical bootstrap mean for that stratum, without affecting the others. It is standard to use optimal power-tuning with the \(\hat{\lambda}_k\) values above.

Inverse Probability Weighted Predict-Then-Debias (IPW-PTD)

Inverse Probability Weighted Predict-Then-Debias (IPW-PTD) [7] combines the robustness of PTD's bootstrap confidence intervals with Inverse Probability Weighting to handle non-uniform ground-truth labeling probabilities. In practice, labeled data is often collected through a cost-optimal sampling process where samples are selected for human annotation based on the proxy model's uncertainty. For example, when using an LLM-as-Judge, samples on which the model has high uncertainty receive higher labeling probability, while confident predictions receive lower probability. IPW-PTD corrects for this non-uniform selection while maintaining the empirical distribution advantage of bootstrap inference.

Standard PTD assumes that samples are drawn uniformly at random for ground-truth labeling. IPW-PTD relaxes this: each sample \(i\) has a known, pre-determined probability \(\pi_i \in (0, 1)\) of being selected for human annotation. IPW reweights labeled samples inversely to their selection probability, ensuring that samples with low labeling probability (\(\pi_i\)) are upweighted appropriately.

In IPW-PTD, each sample has four associated values:

Value	Present for	Description
\(\tilde{Y}_i\)	All \(n+N\) samples	Proxy label
\(\pi_i\)	All \(n+N\) samples	Known, pre-determined labeling probability
\(\xi_i\)	All \(n+N\) samples	Sampling indicator such that \(\Pr(\xi_i = 1) = \pi_i = 1 - \Pr(\xi_i = 0)\)
\(Y_i\)	Labeled samples only (\(n \ll N\))	Ground-truth label

The binary indicator is \(\xi_i = 1\) if sample \(i\) is labeled and \(\xi_i = 0\) otherwise; by design, \(\Pr(\xi_i = 1) = \pi_i\).

Mean estimation

IPW-PTD applies inverse probability weighting to correct for non-uniform labeling probabilities. The IPW-corrected weights are:

\(w^\bullet_i = \frac{\xi_i}{\pi_i}\) (labeled contribution; equals \(\frac{1}{\pi_i}\) for labeled samples, \(0\) for unlabeled)
\(w^\circ_i = \frac{1 - \xi_i}{1 - \pi_i}\) (unlabeled contribution; equals \(0\) for labeled samples, \(\frac{1}{1-\pi_i}\) for unlabeled)

The superscripts \(\bullet\) (labeled) and \(\circ\) (unlabeled) indicate which subset of the data is used in a computation. Quantities with the \(\bullet\) superscript are computed using values from labeled samples while those from unlabeled ones are masked off. Conversely, quantities with the \(\circ\) superscript use unlabeled sample values and mask labeled ones off.
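The two weight vectors can be computed directly from the sampling indicators and labeling probabilities. A minimal sketch, assuming `xi` and `pi` are arrays over all \(n + N\) samples; the function name is hypothetical.

```python
import numpy as np

def ipw_weights(xi, pi):
    """Return (w_labeled, w_unlabeled) for indicators xi and probabilities pi."""
    xi = np.asarray(xi, dtype=float)
    pi = np.asarray(pi, dtype=float)
    # The weights are only defined for probabilities strictly inside (0, 1).
    if np.any((pi <= 0) | (pi >= 1)):
        raise ValueError("labeling probabilities must lie strictly in (0, 1)")
    w_labeled = xi / pi                 # 1/pi_i if labeled, 0 otherwise
    w_unlabeled = (1 - xi) / (1 - pi)   # 1/(1 - pi_i) if unlabeled, 0 otherwise
    return w_labeled, w_unlabeled
```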

The IPW-PTD point estimate is computed as the mean of \(B\) bootstrap estimates (see bootstrap procedure below).

Bootstrap procedure

The IPW-PTD bootstrap reweights samples by their inverse selection probability, then resamples from the weighted data.

Before the bootstrap loop, compute the weighted unlabeled proxy mean and its sampling variance:

\[\hat{\gamma}^\circ = \frac{1}{n+N}\sum_{i=1}^{n+N} w^\circ_i\tilde{Y}_i, \qquad \hat{S}_{\gamma}^\circ = \frac{\widehat{\text{Var}}(w^\circ_i\, \tilde{Y}_i)}{n+N}\]

For \(b = 1, \dots, B\), sample \(n+N\) indices \(\mathcal{I}^{(b)}\) uniformly with replacement from all samples and compute the bootstrap means of the weighted labeled ground-truth and proxy labels:

\[\hat{\mu}^{(b)}_{\text{true}} = \frac{1}{n+N}\sum_{i\in \mathcal{I}^{(b)}} w^\bullet_i\, Y_i, \qquad \hat{\mu}^{(b)}_{\text{proxy}} = \frac{1}{n+N}\sum_{i\in \mathcal{I}^{(b)}} w^\bullet_i\, \tilde{Y}_i\]

Note that the first mean may involve absent ground-truth annotations \(Y_i\), but these are masked off by zero weights.

The weighted unlabeled proxy mean \(\hat{\gamma}^\circ\) is computed once and reused across iterations. Rather than resampling the unlabeled contribution at every iteration, which is expensive, its sampling variability is approximated via the CLT by drawing one Gaussian perturbation per iteration with variance \(\hat{S}_{\gamma}^\circ\):

\[\tilde{\gamma}^{(b)} = \hat{\gamma}^\circ + Z^{(b)} \cdot \sqrt{\hat{S}_{\gamma}^\circ}, \qquad Z^{(b)} \sim \mathcal{N}(0,\, 1)\]

The weighted labeled bootstrap means are combined with this perturbed unlabeled mean:

\[\hat{\theta}^{(b)}_{\text{IPW-PTD}} = \lambda \cdot \tilde{\gamma}^{(b)} + \left(\hat{\mu}^{(b)}_{\text{true}} - \lambda \cdot \hat{\mu}^{(b)}_{\text{proxy}}\right)\]

where \(\lambda\) is a power-tuning factor (see power-tuning below). The term \(\hat{\mu}^{(b)}_{\text{true}} - \lambda \cdot \hat{\mu}^{(b)}_{\text{proxy}}\) captures the IPW-corrected proxy bias measured on the labeled set, while \(\lambda \cdot \tilde{\gamma}^{(b)}\) contributes the weighted proxy signal on the unlabeled population.
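The IPW-PTD bootstrap steps above can be sketched as follows. This is an illustrative NumPy sketch rather than GLIDE's implementation: the function name and signature are hypothetical, \(\lambda\) is passed in as a fixed value, and absent ground-truth labels are assumed to be stored as NaN and are replaced by 0 before weighting so that the zero weights mask them cleanly.

```python
import numpy as np

def ipw_ptd_bootstrap(y, proxy, xi, pi, lam=1.0, B=2000, seed=0):
    """Return B bootstrap estimates for IPW-PTD with a fixed lambda (sketch)."""
    rng = np.random.default_rng(seed)
    proxy = np.asarray(proxy, dtype=float)
    xi = np.asarray(xi, dtype=float)
    pi = np.asarray(pi, dtype=float)
    # Absent labels (assumed NaN) are zeroed; their weight is 0 anyway.
    y_filled = np.where(xi == 1, np.asarray(y, dtype=float), 0.0)
    w_lab = xi / pi
    w_unlab = (1 - xi) / (1 - pi)
    m = len(proxy)  # m = n + N
    # Computed once: weighted unlabeled proxy mean and its sampling variance.
    gamma = np.mean(w_unlab * proxy)
    s_gamma = np.var(w_unlab * proxy, ddof=1) / m
    # Resample m indices with replacement from all samples, one row per replicate.
    idx = rng.integers(0, m, size=(B, m))
    mu_true = (w_lab * y_filled)[idx].mean(axis=1)
    mu_proxy = (w_lab * proxy)[idx].mean(axis=1)
    # CLT perturbation of the unlabeled mean, one draw per replicate.
    gamma_b = gamma + rng.standard_normal(B) * np.sqrt(s_gamma)
    return lam * gamma_b + mu_true - lam * mu_proxy
```

The index matrix has \(B \times (n+N)\) entries, so for large datasets a loop over replicates (or batches of replicates) would be the more memory-friendly choice.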

Confidence intervals

The confidence interval at level \(1 - \alpha\) is the interval between the \(\alpha/2\) and \(1 - \alpha/2\) empirical quantiles of \(\bigl\{\hat{\theta}^{(1)}_{\text{IPW-PTD}},\, \ldots,\, \hat{\theta}^{(B)}_{\text{IPW-PTD}}\bigr\}\). The bootstrap percentile approach makes no distributional assumptions and adapts to the actual shape of the residual distribution, remaining reliable even for small sample counts or non-Gaussian errors.

Power-tuning

The optimal \(\lambda\) is estimated from bootstrap covariances of the weighted labeled means. Let \(\hat{\mu}_{\text{true}}\) and \(\hat{\mu}_{\text{proxy}}\) be the vectors of bootstrap replicates \(\hat{\mu}^{(b)}_{\text{true}}\) and \(\hat{\mu}^{(b)}_{\text{proxy}}\) for \(b=1,\dots,B\). After running the bootstrap loop, it is computed as:

\[\hat{\lambda} = \frac{\widehat{\text{Cov}}_B\!\left(\hat{\mu}_{\text{true}},\; \hat{\mu}_{\text{proxy}}\right)}{\widehat{\text{Var}}_B\!\left(\hat{\mu}_{\text{proxy}}\right) + \hat{S}_{\gamma}^\circ}\]

where \(\widehat{\text{Cov}}_B\) and \(\widehat{\text{Var}}_B\) are sample covariance and variance computed across bootstrap replicates, and \(\hat{S}_{\gamma}^\circ\) is the CLT-approximated sampling variance of the weighted unlabeled proxy mean. The denominator accounts for both sources of uncertainty: labeled bootstrap variability and unlabeled sampling variability (estimated via CLT).

When the proxy is informative (high covariance with ground-truth), \(\hat{\lambda}\) is large and the estimate gains precision from the proxy signal. When the proxy is uninformative, \(\hat{\lambda}\) shrinks toward 0, reducing reliance on it. For large sample counts, IPW-PTD produces inference equivalent to ASI but without requiring a normal approximation. It is standard to use the optimal \(\hat{\lambda}\) in practice.

Clustered Predict-Then-Debias (Clustered PTD)

Clustered Predict-Then-Debias (Clustered PTD) [7] extends PTD to datasets where observations are grouped into clusters and each cluster is either entirely labeled or entirely unlabeled. The bootstrap resamples whole clusters rather than individual observations, accounting for within-cluster correlation and producing valid confidence intervals under cluster sampling designs. Like PTD, it builds a confidence interval from the empirical distribution of bootstrap estimates rather than a normal approximation, making it reliable when the number of labeled clusters is small.

The data is partitioned into \(L\) disjoint clusters \(C_1, \dots, C_L\), with \(\bigcup_{l=1}^{L} C_l = \{1, \dots, n + N\}\) and \(C_i \cap C_j = \emptyset\) for \(i \neq j\). Each cluster is either fully labeled or fully unlabeled: no cluster contains a mix of labeled and unlabeled observations. Observations from distinct clusters are independent, but no independence is assumed among observations within the same cluster. We denote \(L^\bullet\) the number of labeled clusters and \(L^\circ\) the number of unlabeled clusters.

In Clustered PTD, each sample has two associated values plus a cluster identifier:

Value	Present for	Description
\(\tilde{Y}_i\)	All \(n+N\) samples	Proxy label
\(Y_j\)	Labeled samples only (\(n \ll N\))	Ground-truth label
\(c_i\)	All \(n+N\) samples	Cluster identifier

The cluster identifiers allow partitioning the data into cluster-level means, which replace individual observations as the sampling units for the bootstrap.

Mean estimation

The Clustered PTD mean estimate is the average of \(B\) bootstrap estimates:

\[\hat{\theta}_{\text{CPTD}} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{(b)}_{\text{CPTD}}\]

where each \(\hat{\theta}^{(b)}_{\text{CPTD}}\) is produced during the bootstrap procedure described below.

Bootstrap procedure

The first step is computing cluster means. For each labeled cluster \(l\):

\[\bar{Y}^{(l)} = \frac{1}{|C_l|}\sum_{i \in C_l} Y_i, \qquad \bar{\tilde{Y}}^{(l),\bullet} = \frac{1}{|C_l|}\sum_{i \in C_l} \tilde{Y}_i\]

For each unlabeled cluster \(m\):

\[\bar{\tilde{Y}}^{(m),\circ} = \frac{1}{|C_m|}\sum_{i \in C_m} \tilde{Y}_i\]

These \(L^\bullet\) labeled cluster mean pairs and \(L^\circ\) unlabeled proxy cluster means serve as the sampling units for the bootstrap.
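Cluster means can be computed from a flat array of values and a matching array of cluster identifiers. A minimal sketch with a hypothetical helper name; it would be called once on the labeled ground-truth values, once on the labeled proxies, and once on the unlabeled proxies.

```python
import numpy as np

def cluster_means(values, cluster_ids):
    """Return (unique cluster ids, per-cluster mean of values)."""
    values = np.asarray(values, dtype=float)
    ids, inverse = np.unique(np.asarray(cluster_ids), return_inverse=True)
    sums = np.bincount(inverse, weights=values)  # sum of values per cluster
    counts = np.bincount(inverse)                # |C_l| per cluster
    return ids, sums / counts
```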

Before the bootstrap loop, compute the mean and sampling variance of the unlabeled proxy cluster means:

\[\hat{\gamma}^\circ = \frac{1}{L^\circ}\sum_{m=1}^{L^\circ}\bar{\tilde{Y}}^{(m),\circ}, \qquad \hat{S}_\gamma^\circ = \frac{\widehat{\text{Var}}\big(\bar{\tilde{Y}}^{(m),\circ}\big)}{L^\circ}\]

These quantities are computed once and reused across all \(B\) iterations, applying the same CLT speedup as PTD to the unlabeled cluster means.

For \(b = 1, \dots, B\), sample \(L^\bullet\) cluster indices \(\mathcal{I}^{(b)}\) with replacement from \(\{1, \dots, L^\bullet\}\) and compute the bootstrap means of the labeled cluster means:

\[\hat{\mu}^{(b)}_{\text{true}} = \frac{1}{L^\bullet}\sum_{l \in \mathcal{I}^{(b)}} \bar{Y}^{(l)}, \qquad \hat{\mu}^{(b)}_{\text{proxy}} = \frac{1}{L^\bullet}\sum_{l \in \mathcal{I}^{(b)}} \bar{\tilde{Y}}^{(l),\bullet}\]

A perturbed draw of the unlabeled cluster proxy mean is formed as:

\[\tilde{\gamma}^{(b)} = \hat{\gamma}^\circ + Z^{(b)} \cdot \sqrt{\hat{S}_\gamma^\circ}, \qquad Z^{(b)} \sim \mathcal{N}(0,\, 1)\]

Combining the labeled bootstrap cluster means with this unlabeled draw gives:

\[\hat{\theta}^{(b)}_{\text{CPTD}} = \lambda \cdot \tilde{\gamma}^{(b)} + \left(\hat{\mu}^{(b)}_{\text{true}} - \lambda \cdot \hat{\mu}^{(b)}_{\text{proxy}}\right)\]

where \(\lambda\) is a power-tuning factor controlling the proxy labels' influence. The term \(\hat{\mu}^{(b)}_{\text{true}} - \lambda \cdot \hat{\mu}^{(b)}_{\text{proxy}}\) captures the proxy bias measured on the labeled clusters, while \(\lambda \cdot \tilde{\gamma}^{(b)}\) contributes the proxy signal on the unlabeled clusters.

Note. Each bootstrap mean \(\hat{\mu}^{(b)}_{\text{true}}\), \(\hat{\mu}^{(b)}_{\text{proxy}}\) and \(\tilde{\gamma}^{(b)}\) uses the mean of cluster means rather than a size-weighted combination. As argued in the Clustered PPI section, this is the minimum-variance choice when no independence is assumed among observations within the same cluster: treating all within-cluster observations as perfectly correlated, the optimal weights are uniform across clusters regardless of cluster size.

Confidence intervals

The confidence interval at level \(1 - \alpha\) is the interval between the \(\alpha/2\) and \(1 - \alpha/2\) empirical quantiles of \(\bigl\{\hat{\theta}^{(1)}_{\text{CPTD}},\, \ldots,\, \hat{\theta}^{(B)}_{\text{CPTD}}\bigr\}\). The bootstrap percentile approach makes no distributional assumptions and adapts to the actual shape of the residual distribution, remaining reliable even when the number of labeled clusters is small.

Power-tuning

The optimal \(\lambda\) is estimated from the bootstrap covariances of the labeled cluster means. Let \(\hat{\mu}_{\text{true}}\) and \(\hat{\mu}_{\text{proxy}}\) be the vectors of values \(\hat{\mu}^{(b)}_{\text{true}}\) and \(\hat{\mu}^{(b)}_{\text{proxy}}\) for \(b=1,\dots,B\) respectively. After running the bootstrap loop, it is computed as:

\[\hat{\lambda} = \frac{\widehat{\text{Cov}}_B\!\left(\hat{\mu}_{\text{true}},\; \hat{\mu}_{\text{proxy}}\right)}{\widehat{\text{Var}}_B\!\left(\hat{\mu}_{\text{proxy}}\right) + \hat{S}_\gamma^\circ}\]

where \(\widehat{\text{Cov}}_B\) and \(\widehat{\text{Var}}_B\) are computed across the \(B\) bootstrap replicates of the labeled cluster means, and \(\hat{S}_\gamma^\circ\) is the estimated sampling variance of the unlabeled proxy cluster mean. This is the same formula as PTD power-tuning, applied at the cluster level.

When the proxy is informative (high bootstrap covariance with ground-truth cluster means), \(\hat{\lambda}\) is large and the estimate borrows heavily from the proxy signal, reducing its variance. When the proxy is uninformative, \(\hat{\lambda}\) shrinks toward 0, falling back to the classical cluster mean bootstrap. It is standard to use the optimal value \(\hat{\lambda}\) in practice.
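Putting the Clustered PTD steps together, the sketch below takes precomputed cluster means, runs the cluster-level bootstrap, estimates \(\hat{\lambda}\) after the loop when none is supplied, and returns the point estimate with its percentile interval. The function name, its signature, and the `ddof=1` convention are assumptions for illustration, not GLIDE's API.

```python
import numpy as np

def clustered_ptd(ybar_lab, pbar_lab, pbar_unlab, lam=None, B=2000,
                  alpha=0.05, seed=0):
    """Clustered PTD on cluster means: (estimate, (lo, hi), lambda). Sketch only."""
    rng = np.random.default_rng(seed)
    ybar_lab = np.asarray(ybar_lab, dtype=float)      # labeled ground-truth cluster means
    pbar_lab = np.asarray(pbar_lab, dtype=float)      # labeled proxy cluster means
    pbar_unlab = np.asarray(pbar_unlab, dtype=float)  # unlabeled proxy cluster means
    L_lab, L_unlab = len(ybar_lab), len(pbar_unlab)

    # Computed once: mean and sampling variance of the unlabeled cluster means.
    gamma = pbar_unlab.mean()
    s_gamma = pbar_unlab.var(ddof=1) / L_unlab

    # Resample whole labeled clusters with replacement.
    idx = rng.integers(0, L_lab, size=(B, L_lab))
    mu_true = ybar_lab[idx].mean(axis=1)
    mu_proxy = pbar_lab[idx].mean(axis=1)
    gamma_b = gamma + rng.standard_normal(B) * np.sqrt(s_gamma)

    if lam is None:
        # Power-tuning from the bootstrap replicates, after the resampling step.
        cov = np.cov(mu_true, mu_proxy, ddof=1)[0, 1]
        lam = cov / (mu_proxy.var(ddof=1) + s_gamma)

    theta = lam * gamma_b + mu_true - lam * mu_proxy
    lo, hi = np.quantile(theta, [alpha / 2, 1 - alpha / 2])
    return theta.mean(), (lo, hi), lam
```

Storing the replicate means before combining them is what allows \(\hat{\lambda}\) to be estimated after the resampling step and then applied to every replicate.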

When every cluster is a singleton (\(L^\bullet = n\) and \(L^\circ = N\)), all formulas reduce exactly to their PTD counterparts, recovering the full PTD procedure.

References

[1] Angelopoulos, Anastasios N., Stephen Bates, Clara Fannjiang, Michael I. Jordan, and Tijana Zrnic. "Prediction-powered inference." Science 382, no. 6671 (2023): 669-674.

[2] Angelopoulos, Anastasios N., John C. Duchi, and Tijana Zrnic. "PPI++: Efficient prediction-powered inference." arXiv preprint arXiv:2311.01453 (2023).

[3] Zrnic, Tijana, and Emmanuel J. Candès. "Active statistical inference." In Proceedings of the 41st International Conference on Machine Learning, pp. 62993-63010. 2024.

[4] Gligorić, Kristina, Tijana Zrnic, Cinoo Lee, Emmanuel Candès, and Dan Jurafsky. "Can unconfident LLM annotations be used for confident conclusions?" In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3514-3533. 2025.

[5] Fisch, Adam, Joshua Maynez, R. Alex Hofer, Bhuwan Dhingra, Amir Globerson, and William W. Cohen. "Stratified prediction-powered inference for effective hybrid evaluation of language models." Advances in Neural Information Processing Systems 37 (2024): 111489-111514.

[6] Fogliato, Riccardo, Pratik Patil, Mathew Monfort, and Pietro Perona. "A framework for efficient model evaluation through stratification, sampling, and estimation." In European Conference on Computer Vision, pp. 140-158. Cham: Springer Nature Switzerland, 2024.

[7] Kluger, Dan M., Kerri Lu, Tijana Zrnic, Sherrie Wang, and Stephen Bates. "Prediction-powered inference with imputed covariates and nonuniform sampling." arXiv preprint arXiv:2501.18577 (2025).

[8] Shan, Jiawei, Zhifeng Chen, Yiming Dong, Yazhen Wang, and Jiwei Zhao. "SADA: Safe and Adaptive Aggregation of Multiple Black-Box Predictions in Semi-Supervised Learning." arXiv preprint arXiv:2509.21707 (2025).