Samplers
glide.samplers.uniform.UniformSampler
Sampler that draws observations uniformly without replacement from the pool.
It is the standard approach when no auxiliary signal is available.
Examples:
>>> from glide.samplers import UniformSampler
>>> sampler = UniformSampler()
>>> xi = sampler.sample(n_samples=2, budget=1, random_seed=0)
>>> xi
array([0., 1.])
Source code in glide/samplers/uniform.py
sample
sample(n_samples, budget, random_seed=None)
Sample observations uniformly at random without replacement.
Selects exactly budget observations from a pool of n_samples
without replacement.
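The behaviour described above can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the helper name `uniform_indicators` is hypothetical, and the float 0/1 output simply mirrors the class example.

```python
import numpy as np


def uniform_indicators(n_samples, budget, random_seed=None):
    """Return a 0/1 float array with exactly `budget` ones (hypothetical helper)."""
    if n_samples <= 0 or budget <= 0 or budget > n_samples:
        raise ValueError("need 0 < budget <= n_samples")
    rng = np.random.default_rng(random_seed)
    # Draw `budget` distinct positions out of the pool.
    chosen = rng.choice(n_samples, size=budget, replace=False)
    xi = np.zeros(n_samples)
    xi[chosen] = 1.0
    return xi
```

Because positions are drawn without replacement, the indicator array always sums to exactly `budget`.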
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_samples | int | Total number of observations in the pool. Must be a strictly positive integer. | required |
| budget | int | Exact number of observations to select. Must be a strictly positive integer and must not exceed | required |
| random_seed | int or SeedSequence or None | Random seed passed to | None |
Returns:
| Type | Description |
|---|---|
| NDArray | Array of shape |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
glide.samplers.active.ActiveSampler
Sampler that draws elements with probabilities based on uncertainty scores.
Implements active sampling for inference pipelines that support inverse probability weighting (IPW). Each observation is assigned a drawing probability π_i proportional to its uncertainty score and is then selected independently via a Bernoulli trial. This concentrates the annotation budget on the most uncertain observations.
References
Zrnic, Tijana, and Emmanuel J. Candès. "Active statistical inference." In Proceedings of the 41st International Conference on Machine Learning, pp. 62993-63010. 2024.
Gligorić, Kristina, Tijana Zrnic, Cinoo Lee, Emmanuel Candes, and Dan Jurafsky. "Can unconfident LLM annotations be used for confident conclusions?" In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3514-3533. 2025.
Examples:
>>> import numpy as np
>>> from glide.samplers import ActiveSampler
>>> uncertainties = np.array([0.1, 0.4])
>>> sampler = ActiveSampler()
>>> pi, xi = sampler.sample(uncertainties, budget=1, random_seed=0)
>>> pi
array([0.2, 0.8])
>>> xi
array([0., 1.])
Source code in glide/samplers/active.py
sample
sample(uncertainties, budget, random_seed=None)
Sample observations with probability proportional to uncertainty.
Each observation receives a drawing probability π_i that minimizes the variance
of downstream IPW-based estimators. Equivalently, the probabilities minimize the sum of
uncertainty_i^2 / π_i over all observations. Probabilities are constrained to
(0, 1] and sum to budget. The number of items actually selected is random
but never exceeds budget.
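To see why the minimizer is proportional to the uncertainty, note that setting the derivative of the Lagrangian for Σ uncertainty_i^2 / π_i with Σ π_i = budget to zero gives π_i = uncertainty_i / √λ, with any value above 1 capped and the remaining budget redistributed. The sketch below illustrates that argument under the assumption of strictly positive uncertainties; the function name `proportional_probabilities` is hypothetical and the library may solve the problem differently.

```python
import numpy as np


def proportional_probabilities(uncertainties, budget):
    """Probabilities proportional to uncertainty, capped at 1, summing to budget."""
    u = np.asarray(uncertainties, dtype=float)
    if np.any(u <= 0):
        raise ValueError("this sketch assumes strictly positive uncertainties")
    if not 0 < budget <= u.size:
        raise ValueError("need 0 < budget <= number of observations")
    pi = np.zeros_like(u)
    capped = np.zeros(u.size, dtype=bool)
    while True:
        free = ~capped
        remaining = budget - capped.sum()
        # Spread what is left of the budget proportionally over uncapped entries.
        pi[free] = remaining * u[free] / u[free].sum()
        pi[capped] = 1.0
        over = free & (pi > 1.0)
        if not over.any():
            return pi
        capped |= over  # cap offenders at 1 and redistribute
```

With the class example (`uncertainties = [0.1, 0.4]`, `budget=1`) this yields `[0.2, 0.8]`, matching the `pi` shown there.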
Samples are randomly permuted before drawing and the inverse permutation
is applied to the output, so the returned arrays are always in the
original input order. A post-draw cutoff is then applied to strictly
respect the budget: samples beyond the cutoff are discarded by setting their entries
in pi and xi to 0.0 and NaN respectively.
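The permute, draw, cut off, and un-permute sequence can be sketched as follows. The exact cutoff rule is not spelled out in the text, so this sketch assumes that everything positioned after the draw that exhausts the budget is discarded; `draw_with_cutoff` is a hypothetical name.

```python
import numpy as np


def draw_with_cutoff(pi, budget, random_seed=None):
    """Bernoulli draws in random order with a hard budget cutoff (sketch)."""
    rng = np.random.default_rng(random_seed)
    pi = np.asarray(pi, dtype=float)
    n = pi.size
    perm = rng.permutation(n)
    pi_perm = pi[perm].copy()
    # Independent Bernoulli trial per observation, in permuted order.
    xi_perm = (rng.random(n) < pi_perm).astype(float)
    # Assumed cutoff: the position at which the running count reaches the budget.
    hit = np.flatnonzero(np.cumsum(xi_perm) >= budget)
    if hit.size:
        cutoff = hit[0] + 1
        pi_perm[cutoff:] = 0.0
        xi_perm[cutoff:] = np.nan
    # Undo the permutation so outputs line up with the input order.
    inverse = np.argsort(perm)
    return pi_perm[inverse], xi_perm[inverse]
```

Permuting first means the discarded entries are a random subset rather than always the tail of the input.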
The two returned arrays are intended for use with IPW-based downstream estimators.
pi holds the per-sample probability of being selected. xi holds the
selection indicators for each sample so that a value of 1 means the sample
should be sent for annotation, a value of 0 means it was not selected, and
NaN means it was discarded by the budget cutoff.
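As an illustration of how the two arrays might be consumed, the sketch below computes an IPW-corrected mean of the form mean(y_proxy + (y_true - y_proxy) · xi / pi), as in the active inference literature cited above. Dropping the NaN entries before averaging is an assumption of this sketch, not documented behaviour, and `ipw_mean` is a hypothetical helper rather than part of glide.

```python
import numpy as np


def ipw_mean(y_true, y_proxy, pi, xi):
    """IPW-corrected mean estimate; entries with NaN in xi are dropped (assumption)."""
    y_true, y_proxy, pi, xi = (
        np.asarray(a, dtype=float) for a in (y_true, y_proxy, pi, xi)
    )
    keep = ~np.isnan(xi)
    selected = keep & (xi == 1)
    # Only annotated observations contribute a correction term,
    # so y_true may hold NaN wherever no annotation was collected.
    correction = np.zeros_like(y_proxy)
    correction[selected] = (y_true[selected] - y_proxy[selected]) / pi[selected]
    return float(np.mean(y_proxy[keep] + correction[keep]))
```

Dividing by `pi` up-weights rarely sampled observations, which is what keeps the estimate unbiased when sampling is non-uniform.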
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uncertainties | NDArray | Array of shape | required |
| budget | int | Expected total number of annotations to collect. Must be a strictly positive integer and must not exceed | required |
| random_seed | int or SeedSequence or None | Random seed passed to | None |
Returns:
| Type | Description |
|---|---|
| Tuple[NDArray, NDArray] | [0]: array of shape |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
Warns:
| Type | Description |
|---|---|
| UserWarning | If the ratio of the largest to the smallest uncertainty is extreme, indicating potential numerical instability. |
References
Zrnic, Tijana, and Emmanuel J. Candès. "Active statistical inference." In Proceedings of the 41st International Conference on Machine Learning, pp. 62993-63010. 2024.
glide.samplers.stratified.StratifiedSampler
Sampler for per-stratum annotation budget allocation.
This class implements stratified sampling strategies that determine how many samples to annotate in each stratum, given a fixed annotation budget and proxy labels for all samples (provided as numpy arrays). It supports two allocation strategies:
- Proportional allocation (baseline): Allocates budget proportionally to stratum sizes, resulting in uniform sampling probabilities across the dataset.
- Neyman allocation (default, optimal): Assigns more budget to strata with higher proxy variance, minimising the asymptotic variance of downstream estimators. Particularly effective when proxy variance varies substantially across strata.
Both allocators use largest-remainder rounding (Hamilton's method) to allocate budget across strata. Per-stratum sample sizes are capped at stratum size, so total allocated budget Σ n_h ≤ budget (may be less if strata are small). The sampler is typically used upstream of statistical estimators to plan annotation effort.
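The two allocation rules and the rounding step can be sketched as below. This is an illustration under stated assumptions rather than the library's code: Neyman weights are taken as stratum size times the proxy standard deviation, the fallback to stratum sizes when every stratum has zero variance is a choice made for this sketch, and the names `largest_remainder` and `allocate` are hypothetical.

```python
import numpy as np


def largest_remainder(weights, budget):
    """Hamilton's method: floor each quota, then hand out leftovers by remainder."""
    weights = np.asarray(weights, dtype=float)
    quotas = budget * weights / weights.sum()
    counts = np.floor(quotas).astype(int)
    leftover = budget - counts.sum()
    order = np.argsort(-(quotas - counts), kind="stable")
    counts[order[:leftover]] += 1
    return counts


def allocate(y_proxy, groups, budget, strategy="neyman"):
    """Per-stratum annotation counts n_h, capped at the stratum size."""
    y_proxy = np.asarray(y_proxy, dtype=float)
    labels, inverse = np.unique(np.asarray(groups), return_inverse=True)
    sizes = np.bincount(inverse)
    weights = sizes.astype(float)
    if strategy == "neyman":
        stds = np.array([y_proxy[inverse == h].std() for h in range(labels.size)])
        if (sizes * stds).sum() > 0:  # otherwise keep proportional weights
            weights = sizes * stds
    counts = np.minimum(largest_remainder(weights, budget), sizes)
    return dict(zip(labels.tolist(), counts.tolist()))
```

For the class example, both strata have the same size and the same proxy spread, so a budget of 4 splits into two annotations per stratum, consistent with the `xi` shown there.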
References
Fogliato, Riccardo, Pratik Patil, Mathew Monfort, and Pietro Perona. "A framework for efficient model evaluation through stratification, sampling, and estimation." In European Conference on Computer Vision, pp. 140-158. Cham: Springer Nature Switzerland, 2024.
Examples:
>>> import numpy as np
>>> from glide.samplers import StratifiedSampler
>>> y_proxy = np.array([0.8, 0.9, 0.85, 0.88, 2.4 , 2.5 , 2.45, 2.48])
>>> groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"], dtype=object)
>>> sampler = StratifiedSampler()
>>> xi = sampler.sample(y_proxy, groups, budget=4, random_seed=1)
>>> xi
array([0, 1, 1, 0, 1, 0, 1, 0])
Source code in glide/samplers/stratified.py
sample
sample(y_proxy, groups, budget, strategy="neyman", random_seed=None)
Allocate annotation budget across strata and perform stratified sampling.
Computes allocated annotation counts n_h for each stratum h using the
specified allocation strategy and selects exactly n_h samples from each stratum
without replacement. Neyman allocation (default) assigns more budget to strata with higher
proxy variance, minimising asymptotic variance of downstream estimators. Proportional
allocation allocates budget proportionally to stratum sizes and serves as a baseline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_proxy | NDArray | Proxy labels for all samples, shape | required |
| groups | NDArray | Stratum identifiers for all samples, shape | required |
| budget | int | Target annotation budget. Must be positive. Mandatory. | required |
| strategy | str | Allocation strategy: "neyman" (default) or "proportional". "neyman": assigns more budget to higher-variance strata. "proportional": allocates proportionally to stratum sizes. | 'neyman' |
| random_seed | int or None | Random seed for reproducible sampling. Defaults to None (non-deterministic). | None |
Returns:
| Type | Description |
|---|---|
| NDArray | Selection indicators of shape |
Raises:
| Type | Description |
|---|---|
| ValueError | |
glide.samplers.cost_optimal_random.CostOptimalRandomSampler
Sampler implementing cost-optimal random annotation.
Implements the optimal random sampling strategy for two-rater annotation, where one rater is expensive (ground truth) and one is cheap (proxy). Determines the optimal probability of requesting the expensive rater based on relative costs and annotation quality differences.
References
Angelopoulos, Anastasios N., Jacob Eisenstein, Jonathan Berant, Alekh Agarwal, and Adam Fisch. "Cost-optimal active AI model evaluation." arXiv preprint arXiv:2506.07949 (2025).
Examples:
>>> import numpy as np
>>> from glide.samplers import CostOptimalRandomSampler
>>> y_true = np.array([1.0, 2.0])
>>> y_proxy = np.array([1.1, 1.9])
>>> sampler = CostOptimalRandomSampler()
>>> sampler = sampler.fit(y_true, y_proxy)
>>> pi, xi = sampler.sample(
... n_samples=2,
... y_true_cost=10.0,
... y_proxy_cost=1.0,
... budget=15,
... random_seed=42
... )
>>> pi
array([0.0451754, 0.0451754])
>>> xi
array([0., 0.])
Source code in glide/samplers/cost_optimal_random.py
fit
fit(y_true, y_proxy)
Calibrate the sampler by estimating proxy quality and label variance.
Fits the sampler to a fully-labeled burn-in dataset by computing the mean squared error between proxy labels and ground truth labels, as well as the variance of ground truth labels. These statistics are used to determine the optimal probability of requesting expensive ground truth annotations during the sampling phase.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_true | NDArray | Ground truth labels, shape (n_samples,). Must not contain NaN values. | required |
| y_proxy | NDArray | Proxy labels, shape (n_samples,). Must not contain NaN values. | required |
Returns:
| Type | Description |
|---|---|
| CostOptimalRandomSampler | Self, to allow method chaining. |
Raises:
| Type | Description |
|---|---|
| ValueError | |
sample
sample(n_samples, y_true_cost, y_proxy_cost, budget, random_seed=None)
Sample observations with cost-optimal allocation between raters.
Derives the optimal probability of querying the expensive rater (ground truth) based on relative costs and proxy quality.
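One closed form that combines the two fitted statistics with the cost ratio is π = sqrt((y_proxy_cost / y_true_cost) · MSE / (Var − MSE)), clipped to at most 1. The sketch below assumes this form and a sample variance with `ddof=1`; both are inferred from the class example, whose `pi` value it reproduces, not read from the source. The function name and the fallback to 1.0 when the proxy is no better than a constant are hypothetical.

```python
import numpy as np


def optimal_random_probability(y_true, y_proxy, y_true_cost, y_proxy_cost):
    """Probability of querying the expensive rater (assumed closed form)."""
    y_true = np.asarray(y_true, dtype=float)
    y_proxy = np.asarray(y_proxy, dtype=float)
    mse = np.mean((y_true - y_proxy) ** 2)  # proxy quality from the burn-in set
    var = np.var(y_true, ddof=1)            # ground truth label variance
    if mse >= var:
        # Assumed fallback: the proxy carries no usable signal.
        return 1.0
    ratio = (y_proxy_cost / y_true_cost) * mse / (var - mse)
    return float(min(1.0, np.sqrt(ratio)))
```

The probability shrinks as the proxy gets more accurate or the expensive rater gets relatively more costly, which matches the intent described above.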
Samples are randomly permuted before drawing and the inverse permutation is applied
to the output, so the returned arrays are always in the original input order. A
post-draw cutoff is then applied to strictly respect the budget: samples beyond the
cutoff are discarded by setting their entries in pi and xi to 0.0 and
NaN respectively.
The two returned arrays are intended for use with IPW-based downstream estimators. pi
holds the per-sample probability of querying the expensive rater. xi holds the
annotation indicators for selected samples, with NaN marking samples excluded by the
budget cutoff.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_samples | int | Total number of candidate samples to draw from. Must be a strictly positive integer. | required |
| y_true_cost | float | Per-sample cost of the expensive rater (H). Must be strictly positive. | required |
| y_proxy_cost | float | Per-sample cost of the cheap rater (G). Must be strictly positive. | required |
| budget | float | Total annotation budget in cost units. Must be at least | required |
| random_seed | int or SeedSequence or None | Random seed passed to | None |
Returns:
| Type | Description |
|---|---|
| Tuple[NDArray, NDArray] | [0]: array of shape |
Raises:
| Type | Description |
|---|---|
| RuntimeError | If |
| ValueError | |
glide.samplers.cost_optimal.CostOptimalSampler
Sampler that draws elements with cost-optimal probabilities, based on uncertainty scores and annotation costs, under a limited budget.
Implements a cost-optimal active annotation policy. Each sample is assigned an annotation probability proportional to how unreliable the proxy label is expected to be for that sample, as measured by the caller-supplied per-sample uncertainty scores. Samples with high expected proxy error are more likely to be annotated than those with low expected proxy error, which concentrates the annotation budget where it matters most.
The caller provides per-sample uncertainty scores and passes them as a
1D array to sample(). These are treated as oracle root mean square error
estimates. This class does not learn those scores internally.
References
Angelopoulos, Anastasios N., Jacob Eisenstein, Jonathan Berant, Alekh Agarwal, and Adam Fisch. "Cost-optimal active AI model evaluation." arXiv preprint arXiv:2506.07949 (2025).
Examples:
>>> import numpy as np
>>> from glide.samplers import CostOptimalSampler
>>> y_true = np.array([1.0, 2.0, 3.0, 4.0])
>>> uncertainties = np.array([0.1, 0.4, 0.1, 0.4])
>>> sampler = CostOptimalSampler().fit(y_true)
>>> pi, xi = sampler.sample(
... uncertainties,
... y_true_cost=10.0,
... y_proxy_cost=1.0,
... budget=20,
... random_seed=0
... )
>>> pi
array([0.02514447, 0.10057789, 0.02514447, 0.10057789])
>>> xi
array([0., 0., 0., 1.])
Source code in glide/samplers/cost_optimal.py
fit
fit(y_true)
Estimate the true label variance from a burn-in dataset.
The true label variance is computed ahead of active sampling so that
sample() can derive the cost-optimal annotation probabilities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_true | NDArray | 1D float array of true labels from the burn-in phase. Must not contain NaN values. | required |
Returns:
| Type | Description |
|---|---|
| CostOptimalSampler | The fitted sampler (returns |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
sample
sample(uncertainties, y_true_cost, y_proxy_cost, budget, random_seed=None)
Compute sampling probabilities and draw annotation indicators under the cost-optimal policy.
Per-sample annotation probabilities are derived from the supplied uncertainty
scores (root mean squared errors) and the true label variance estimated by fit().
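A per-sample rule consistent with this description is π_i = uncertainty_i · sqrt((y_proxy_cost / y_true_cost) / (Var − mean(uncertainty^2))), clipped to [0, 1]. The sketch below assumes this form and a sample variance with `ddof=1`; both are inferred from the class example, whose `pi` values it reproduces, and are not taken from the source. The function name and the all-ones fallback are hypothetical.

```python
import numpy as np


def cost_optimal_probabilities(uncertainties, y_true, y_true_cost, y_proxy_cost):
    """Per-sample annotation probabilities (assumed closed form)."""
    u = np.asarray(uncertainties, dtype=float)
    var = np.var(np.asarray(y_true, dtype=float), ddof=1)  # what fit() estimates
    gap = var - np.mean(u ** 2)
    if gap <= 0:
        # Assumed fallback: expected proxy error is as large as the label variance.
        return np.ones_like(u)
    scale = np.sqrt((y_proxy_cost / y_true_cost) / gap)
    # Probabilities grow linearly with the per-sample RMSE, capped at 1.
    return np.clip(scale * u, 0.0, 1.0)
```

In the class example the uncertainties 0.1 and 0.4 differ by a factor of four, and so do the resulting probabilities.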
Samples are randomly permuted before drawing and the inverse permutation is applied
to the output, so the returned arrays are always in the original input order. A
post-draw cutoff is then applied to strictly respect the budget: samples
beyond the cutoff are discarded by setting their entries in pi and xi to
0.0 and NaN respectively.
The two returned arrays are intended for use with IPW-based downstream estimators. pi
holds the per-sample probability of querying the expensive rater. xi holds the
annotation indicators for selected samples, with NaN marking samples excluded by the
budget cutoff.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uncertainties | NDArray | 1D float array of shape | required |
| y_true_cost | float | Cost of one true label. Must be strictly positive. | required |
| y_proxy_cost | float | Cost of one proxy label. Must be non-negative. | required |
| budget | float | Total annotation budget in cost units. Must be at least | required |
| random_seed | int or SeedSequence or None | Random seed passed to | None |
Returns:
| Type | Description |
|---|---|
| Tuple[NDArray, NDArray] | [0]: array of shape |
Raises:
| Type | Description |
|---|---|
| RuntimeError | If |
| ValueError | |
glide.samplers.cluster.UniformClusterSampler
Sampler that selects entire clusters without replacement using uniform sampling.
Each call to sample draws a fixed number of clusters from the pool of unique
cluster labels in clusters, then marks every observation in a selected cluster
for annotation. Every cluster has equal probability of being selected, so every
individual observation has the same marginal probability of being annotated.
Examples:
>>> import numpy as np
>>> from glide.samplers import UniformClusterSampler
>>> clusters = np.array(["A", "A", "B", "B"], dtype=object)
>>> sampler = UniformClusterSampler()
>>> xi = sampler.sample(clusters, n_clusters=1, random_seed=0)
>>> xi
array([0, 0, 1, 1])
Source code in glide/samplers/cluster.py
sample
sample(clusters, n_clusters, random_seed=None)
Select entire clusters without replacement.
Draws n_clusters clusters from the unique values of clusters with equal
probability and returns selection indicators: every observation whose cluster was
drawn receives a 1, all others receive a 0.
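The selection step can be sketched with NumPy as follows. This is an illustration only: `cluster_indicators` is a hypothetical name, and the integer 0/1 output mirrors the class example.

```python
import numpy as np


def cluster_indicators(clusters, n_clusters, random_seed=None):
    """Mark every observation whose cluster was drawn (hypothetical helper)."""
    clusters = np.asarray(clusters)
    labels = np.unique(clusters)
    if not 0 < n_clusters <= labels.size:
        raise ValueError("need 0 < n_clusters <= number of unique clusters")
    rng = np.random.default_rng(random_seed)
    # Every cluster label is equally likely to be drawn.
    chosen = rng.choice(labels, size=n_clusters, replace=False)
    return np.isin(clusters, chosen).astype(int)
```

With K unique clusters, each observation is annotated with marginal probability n_clusters / K regardless of how large its cluster is, although observations in the same cluster are always selected together.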
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| clusters | NDArray | Array of shape | required |
| n_clusters | int | Number of clusters to select. Must be a strictly positive integer and must not exceed the number of unique clusters in | required |
| random_seed | int or SeedSequence or None | Random seed passed to | None |
Returns:
| Type | Description |
|---|---|
| NDArray | Selection indicators of shape |
Raises:
| Type | Description |
|---|---|
| ValueError | |