Data Simulators
glide.simulators
generate_binary_dataset
generate_binary_dataset(
n_total,
true_mean=0.7,
proxy_mean=0.6,
correlation=0.8,
random_seed=None,
)
Generate a synthetic binary-label oracle dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_total | int | Total number of samples. All samples have both true and proxy labels. | required |
| true_mean | float | Expected mean value of the true labels. | 0.7 |
| proxy_mean | float | Expected mean value of the proxy labels. | 0.6 |
| correlation | float | Pearson correlation between true and proxy labels. | 0.8 |
| random_seed | int or SeedSequence | Seed for reproducibility. | None |
Returns:
| Type | Description |
|---|---|
| Tuple[NDArray, NDArray] | [0]: array of shape |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| ValueError | If |
| ValueError | If the combination of |
Notes
Step 1 — Joint distribution
For two binary variables with marginals p_t = P(y_true=1) and
p_p = P(y_proxy=1), the Pearson correlation uniquely determines the
joint distribution. Let D = sqrt(p_t * p_p * (1-p_t) * (1-p_p))
(product of standard deviations). Then:
p11 = P(y_true=1, y_proxy=1) = correlation * D + p_t * p_p
p00 = P(y_true=0, y_proxy=0) = 1 - p_t - p_p + p11
p01 = P(y_true=0, y_proxy=1) = p_p - p11
p10 = P(y_true=1, y_proxy=0) = p_t - p11
These four probabilities must all be strictly positive; otherwise the
parameter combination is infeasible and a ValueError is raised. Each
probability becomes negative under the following condition:
p11 < 0 for correlation < -(p_t * p_p) / D
p00 < 0 for correlation < (p_t + p_p - p_t * p_p - 1) / D
p01 < 0 for correlation > p_p * (1 - p_t) / D
p10 < 0 for correlation > p_t * (1 - p_p) / D
Therefore, the correlation must satisfy:
max(-p_t * p_p, p_t + p_p - p_t * p_p - 1) <= correlation * D <= min(p_t * (1 - p_p), p_p * (1 - p_t))
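The formulas above can be checked numerically. The following is a minimal, dependency-free sketch, not the library's implementation; the helper names `joint_probabilities` and `correlation_bounds` are hypothetical and introduced only for illustration.

```python
import math


def joint_probabilities(p_t, p_p, correlation):
    """Return (p00, p01, p10, p11) for two correlated Bernoulli variables."""
    # D is the product of the two Bernoulli standard deviations.
    d = math.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
    p11 = correlation * d + p_t * p_p
    p00 = 1 - p_t - p_p + p11
    p01 = p_p - p11
    p10 = p_t - p11
    return p00, p01, p10, p11


def correlation_bounds(p_t, p_p):
    """Return the (lower, upper) correlation bounds implied by the inequality."""
    d = math.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
    lower = max(-p_t * p_p, p_t + p_p - p_t * p_p - 1) / d
    upper = min(p_t * (1 - p_p), p_p * (1 - p_t)) / d
    return lower, upper


# Default parameters from the signature: feasible, but close to the upper bound.
probs = joint_probabilities(0.7, 0.6, 0.8)
low, high = correlation_bounds(0.7, 0.6)
```

With the default marginals (0.7, 0.6), the default correlation of 0.8 sits just below the upper bound, so a slightly larger correlation such as 0.9 already makes p01 negative.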
Step 2 — Sampling outcome pairs
The four outcomes (y_true=0, y_proxy=0), (y_true=0, y_proxy=1),
(y_true=1, y_proxy=0), (y_true=1, y_proxy=1) are encoded as
integers 0–3 with probabilities [p00, p01, p10, p11]. All n_total
pairs are drawn in one call via numpy.random.Generator.choice.
Step 3 — Decoding labels from integers
The integer encoding satisfies y_true = outcome // 2 and
y_proxy = outcome % 2, so both labels are recovered with cheap
integer arithmetic.
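Steps 2 and 3 can be sketched as follows. This is an illustration only: it substitutes the standard library's `random.choices` for `numpy.random.Generator.choice` so the snippet runs without dependencies, and the function name `sample_binary_pairs` is hypothetical.

```python
import random


def sample_binary_pairs(n_total, probs, seed=None):
    """Draw n_total (y_true, y_proxy) pairs given [p00, p01, p10, p11]."""
    rng = random.Random(seed)
    # Outcomes 0..3 encode (0,0), (0,1), (1,0), (1,1) in that order.
    outcomes = rng.choices([0, 1, 2, 3], weights=probs, k=n_total)
    # Decode: the high bit is y_true, the low bit is y_proxy.
    y_true = [outcome // 2 for outcome in outcomes]
    y_proxy = [outcome % 2 for outcome in outcomes]
    return y_true, y_proxy


y_true, y_proxy = sample_binary_pairs(1000, [0.3, 0.1, 0.1, 0.5], seed=0)
```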
References
[SO] Correlation between Bernoulli Variables: https://math.stackexchange.com/questions/610443/finding-a-correlation-between-bernoulli-variables
Examples:
>>> import numpy as np
>>> from glide.simulators import generate_binary_dataset
>>> y_true, y_proxy = generate_binary_dataset(n_total=8, random_seed=42)
>>> len(y_true)
8
>>> len(y_proxy)
8
>>> bool(np.all(np.isin(y_true, [0.0, 1.0])))
True
>>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
True
Source code in glide/simulators/binary.py
generate_binary_dataset_with_oracle_sampling
generate_binary_dataset_with_oracle_sampling(
n_total,
true_mean=0.7,
proxy_mean=0.6,
correlation=0.8,
random_seed=None,
)
Generate a synthetic binary dataset with oracle sampling probabilities.
All n_total samples have ground-truth labels (y_true_oracle), proxy predictions (y_proxy), and an oracle uncertainty score derived from the analytical proxy error. The uncertainty values are non-uniform: samples where the proxy is less reliable receive higher uncertainty, following the optimal sampling rule.
The sampling is driven by a per-sample latent variable that sets the correlation between y_true_oracle and y_proxy for that sample. The per-sample correlation is drawn uniformly from a limited range centered on the given correlation value, and that range stays inside the interval of correlations that are feasible for true_mean and proxy_mean. As a result, the correlation between y_true_oracle and y_proxy matches the target value on average.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_total | int | Total number of samples. | required |
| true_mean | float | Expected mean of y_true_oracle. Must be in (0, 1). | 0.7 |
| proxy_mean | float | Expected mean of y_proxy. Must be in (0, 1). | 0.6 |
| correlation | float | Pearson correlation between y_true_oracle and y_proxy (marginal, across all samples). | 0.8 |
| random_seed | int | Seed for reproducibility. | None |
Returns:
| Type | Description |
|---|---|
| Tuple[NDArray, NDArray, NDArray] | [0]: array of shape |
Raises:
| Type | Description |
|---|---|
| ValueError | If true_mean is not in (0, 1). |
| ValueError | If proxy_mean is not in (0, 1). |
| ValueError | If the combination of true_mean, proxy_mean, and correlation leads to negative joint probabilities. |
Notes
Step 1 — Global joint distribution
For two binary variables with marginals p_t = P(y_true_oracle=1) and
p_p = P(y_proxy=1), the Pearson correlation uniquely determines the
joint distribution. Let D = sqrt(p_t * p_p * (1-p_t) * (1-p_p))
(product of standard deviations). Then:
p11 = P(y_true_oracle=1, y_proxy=1) = correlation * D + p_t * p_p
p00 = P(y_true_oracle=0, y_proxy=0) = 1 - p_t - p_p + p11
p01 = P(y_true_oracle=0, y_proxy=1) = p_p - p11
p10 = P(y_true_oracle=1, y_proxy=0) = p_t - p11
These four probabilities are fully determined by (p_t, p_p, correlation)
and must all be strictly positive; otherwise the parameter combination is
infeasible and a ValueError is raised. Each probability becomes negative
under the following condition:
p11 < 0 for correlation < -(p_t * p_p) / D
p00 < 0 for correlation < (p_t + p_p - p_t * p_p - 1) / D
p01 < 0 for correlation > p_p * (1 - p_t) / D
p10 < 0 for correlation > p_t * (1 - p_p) / D
Therefore, the correlation must satisfy:
max(-p_t * p_p, p_t + p_p - p_t * p_p - 1) <= correlation * D <= min(p_t * (1 - p_p), p_p * (1 - p_t))
Step 2 — Latent variable x and per-sample correlation
Each sample receives a latent value x_i ~ Uniform(-1, 1) representing
"annotation difficulty". The per-sample Pearson correlation is defined as:
corr(x_i) = correlation + correlation_spread * x_i
Because E[x] = 0 for x ~ Uniform(-1, 1), the marginal
correlation E[corr(X)] = correlation exactly, preserving the target
value on average. Samples with low x have lower conditional
correlation (proxy less reliable → higher uncertainty); samples with high
x have higher conditional correlation (proxy more reliable → lower uncertainty).
correlation_spread is set to 90% of max_safe_correlation_spread, the
spread at which the smallest of the four per-sample probabilities would
reach zero at x = -1 or x = 1:
max_safe_correlation_spread = min(p00, p01, p10, p11) / D
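The spread rule can be sketched as below. This is an assumption-laden illustration rather than the library's code: the function names and the `safety` parameter are hypothetical, and only the 90% factor and the formula for max_safe_correlation_spread come from the text.

```python
import math


def per_sample_probabilities(p_t, p_p, corr_x):
    """Return (p00, p01, p10, p11) for a given per-sample correlation."""
    d = math.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
    p11 = corr_x * d + p_t * p_p
    return 1 - p_t - p_p + p11, p_p - p11, p_t - p11, p11


def correlation_spread(p_t, p_p, correlation, safety=0.9):
    """Return safety * min(p00, p01, p10, p11) / D at the target correlation."""
    d = math.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
    max_safe = min(per_sample_probabilities(p_t, p_p, correlation)) / d
    return safety * max_safe


spread = correlation_spread(0.5, 0.5, 0.4)
# Per-sample correlations at the two extremes of the latent variable x.
corr_low = 0.4 + spread * -1.0
corr_high = 0.4 + spread * 1.0
```

For these illustrative inputs the extremes average back to the target correlation of 0.4, and all four probabilities stay positive at both ends of the range.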
Step 3 — Per-sample probabilities and error probability
p11 becomes a function of x, and the other three probabilities follow from the fixed marginals:
p11(x) = corr(x) * D + p_t * p_p # varies with x
error_prob(x) = p01(x) + p10(x)
= p_t + p_p - 2 * p11(x) # proxy ≠ y_true_oracle
error_prob(x) is the per-sample proxy error probability, which
decreases linearly as x increases (higher x → better proxy).
Step 4 — Vectorized CDF inversion
Since each sample has its own probability vector, numpy.random.choice
(which takes a single fixed probability vector) cannot be used. Instead,
the four outcomes (0,0), (0,1), (1,0), (1,1) are encoded as integers
0–3 and sampled via cumulative-threshold comparison on a single
u ~ Uniform(0,1) draw:
u < p00(x) → outcome 0 : (y_true_oracle=0, y_proxy=0)
u < p00(x)+p01(x) → outcome 1 : (y_true_oracle=0, y_proxy=1)
u < p00(x)+p01(x)+p10(x) → outcome 2 : (y_true_oracle=1, y_proxy=0)
else → outcome 3 : (y_true_oracle=1, y_proxy=1)
The crucial simplification is that the second threshold collapses to the
constant 1 - p_t (independent of x), because:
p00(x) + p01(x) = (1-p_t-p_p+p11) + (p_p-p11) = 1 - p_t
Similarly, the third threshold simplifies to:
p00(x) + p01(x) + p10(x) = 1 - p11(x)
This means only two of the three thresholds require per-sample arrays.
The outcome integer encodes both labels: y_true_oracle = outcome // 2,
y_proxy = outcome % 2.
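The threshold comparison can be sketched as follows. For readability this version loops over samples instead of operating on arrays, so it illustrates the logic rather than the vectorized form; the function names are hypothetical.

```python
import random


def outcome_from_uniform(u, p00_x, p11_x, p_t):
    """Map one uniform draw to an outcome 0..3 using the three thresholds."""
    if u < p00_x:        # first threshold: p00(x), varies per sample
        return 0
    if u < 1 - p_t:      # second threshold: constant, independent of x
        return 1
    if u < 1 - p11_x:    # third threshold: 1 - p11(x), varies per sample
        return 2
    return 3


def sample_pairs(p11_values, p_t, p_p, seed=None):
    """Draw one (y_true_oracle, y_proxy) pair per entry of p11_values."""
    rng = random.Random(seed)
    y_true_oracle, y_proxy = [], []
    for p11_x in p11_values:
        p00_x = 1 - p_t - p_p + p11_x
        outcome = outcome_from_uniform(rng.random(), p00_x, p11_x, p_t)
        y_true_oracle.append(outcome // 2)
        y_proxy.append(outcome % 2)
    return y_true_oracle, y_proxy


labels, proxies = sample_pairs([0.30, 0.35, 0.40], p_t=0.5, p_p=0.5, seed=0)
```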
Step 5 — Oracle uncertainty
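Because both labels are binary, (y_proxy - y_true_oracle)² equals 1 when the
proxy is wrong and 0 otherwise, so its expectation is error_prob(x). The
optimal sampling rule therefore uses
uncertainty = sqrt(E[(y_proxy - y_true_oracle)²]) = sqrt(error_prob(x)).
These values are stored directly as uncertainty.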
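Steps 3 and 5 together give a closed form for the uncertainty as a function of x. A minimal sketch, with a hypothetical function name and illustrative parameter values (the spread of 0.54 is assumed, not taken from the library):

```python
import math


def uncertainty_for(x, p_t, p_p, correlation, spread):
    """Return sqrt(error_prob(x)) for a latent value x in [-1, 1]."""
    d = math.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
    p11_x = (correlation + spread * x) * d + p_t * p_p
    # Probability that the proxy disagrees with the oracle label.
    error_prob = p_t + p_p - 2 * p11_x
    return math.sqrt(error_prob)


# Higher x means a more reliable proxy, hence lower uncertainty.
values = [uncertainty_for(x, 0.5, 0.5, 0.4, 0.54) for x in (-1.0, 0.0, 1.0)]
```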
Examples:
>>> import numpy as np
>>> from glide.simulators import generate_binary_dataset_with_oracle_sampling
>>> y_true_oracle, y_proxy, uncertainty = generate_binary_dataset_with_oracle_sampling(n_total=4, random_seed=0)
>>> len(y_true_oracle)
4
>>> len(y_proxy)
4
>>> len(uncertainty)
4
>>> bool(np.all(np.isin(y_true_oracle, [0.0, 1.0])))
True
>>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
True
>>> bool(np.all((uncertainty >= 0) & (uncertainty <= 1)))
True
Source code in glide/simulators/oracle_binary.py
generate_clustered_binary_dataset
generate_clustered_binary_dataset(
n_total,
n_clusters,
true_mean=0.7,
proxy_mean=0.6,
correlation=0.8,
random_seed=None,
)
Generate a synthetic clustered binary-label dataset for evaluation.
Draws n_total i.i.d. (y_true, y_proxy) pairs from the
joint binary distribution defined by true_mean, proxy_mean, and
correlation, then randomly partitions the observations into
n_clusters non-empty groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_total | int | Exact total number of observations across all clusters. | required |
| n_clusters | int | Exact number of clusters. Must be at least 2. | required |
| true_mean | float | Expected mean value of the true labels. Must be in | 0.7 |
| proxy_mean | float | Expected mean value of the proxy labels. Must be in | 0.6 |
| correlation | float | Pearson correlation between true and proxy labels. | 0.8 |
| random_seed | int or SeedSequence | Seed for reproducibility. | None |
Returns:
| Type | Description |
|---|---|
| Tuple[NDArray, NDArray, NDArray] | [0]: |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| ValueError | If |
| ValueError | If the combination of |
| ValueError | If |
| ValueError | If |
Notes
Step 1 — Draw observations
Call generate_binary_dataset(n_total, ...) to obtain n_total
i.i.d. (y_true, y_proxy) pairs from the joint binary
distribution defined by true_mean, proxy_mean, and
correlation.
Step 2 — Random cluster partition
Draw n_clusters - 1 cut positions uniformly without replacement from
{1, 2, ..., n_total - 1} and sort them. Combined with 0 and
n_total, these define n_clusters contiguous intervals of random
lengths that sum to n_total. Assign cluster identifier k to all
observations whose position falls in the k-th interval. Every cluster
contains at least 1 observation by construction.
Step 3 — Shuffle
Randomly permute the cluster identifier array so that cluster membership is not determined by position in the output.
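Steps 2 and 3 can be sketched with the standard library alone. This is an illustration of the partition scheme described above, not the library's code; the function name `random_cluster_ids` is hypothetical.

```python
import random


def random_cluster_ids(n_total, n_clusters, seed=None):
    """Return a shuffled list of cluster ids with every cluster non-empty."""
    rng = random.Random(seed)
    # n_clusters - 1 distinct cut positions from {1, ..., n_total - 1}.
    cuts = sorted(rng.sample(range(1, n_total), n_clusters - 1))
    bounds = [0] + cuts + [n_total]
    cluster_ids = []
    for k in range(n_clusters):
        # The k-th interval has length bounds[k + 1] - bounds[k] >= 1.
        cluster_ids.extend([k] * (bounds[k + 1] - bounds[k]))
    # Shuffle so that membership does not depend on position.
    rng.shuffle(cluster_ids)
    return cluster_ids


ids = random_cluster_ids(n_total=10, n_clusters=4, seed=0)
```

Because the cut positions are distinct and lie strictly between 0 and n_total, every interval has positive length, which is why no cluster can be empty.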
Examples:
>>> import numpy as np
>>> from glide.simulators import generate_clustered_binary_dataset
>>> y_true, y_proxy, clusters = generate_clustered_binary_dataset(
... n_total=10, n_clusters=4, random_seed=0
... )
>>> y_true
array([1., 1., 1., 0., 1., 1., 0., 1., 0., 1.])
>>> y_proxy
array([1., 0., 1., 0., 1., 1., 0., 1., 0., 1.])
>>> clusters
array([3, 0, 3, 1, 0, 3, 3, 2, 0, 0])
Source code in glide/simulators/clustered_binary.py
generate_gaussian_dataset
generate_gaussian_dataset(
n_labeled,
n_unlabeled,
true_mean=0.7,
true_std=1,
proxy_mean=0.6,
proxy_std=1,
correlation=0.8,
random_seed=None,
)
Generate a synthetic Gaussian dataset for evaluation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_labeled | int | Number of samples with both true and proxy labels (the labeled subset). | required |
| n_unlabeled | int | Number of samples with proxy labels only (the unlabeled subset). | required |
| true_mean | float | Mean of the true label distribution. | 0.7 |
| true_std | float | Standard deviation of the true label distribution. | 1 |
| proxy_mean | float | Mean of the proxy label distribution. | 0.6 |
| proxy_std | float | Standard deviation of the proxy label distribution. | 1 |
| correlation | float | Pearson correlation between true and proxy labels. | 0.8 |
| random_seed | int | Seed for reproducibility. | None |
Returns:
| Type | Description |
|---|---|
| Tuple[NDArray, NDArray] | [0]: array of shape |
Notes
Target distribution
The goal is to jointly sample (y_true, y_proxy) from a bivariate Gaussian:
(y_true, y_proxy) ~ N(μ, Σ)
where:
μ = (true_mean, proxy_mean)
Σ = [[true_std², ρ · true_std · proxy_std],
[ρ · true_std · proxy_std, proxy_std² ]]
and ρ is the target Pearson correlation.
Step 1 — Cholesky decomposition of Σ
To sample from N(0, Σ), we find a lower-triangular matrix L such that
Σ = L @ Lᵀ (Cholesky factor). The construction uses the angle
θ = arccos(ρ), so that cos(θ) = ρ and sin(θ) = √(1 - ρ²):
L = [[true_std, 0 ],
[proxy_std · cos(θ), proxy_std · sin(θ) ]]
One can verify L @ Lᵀ = Σ directly:
L @ Lᵀ = [[true_std², true_std · proxy_std · cos(θ)],
[true_std · proxy_std · cos(θ), proxy_std² · (cos²(θ)+sin²(θ))]]
= [[true_std², true_std · proxy_std · ρ],
[true_std · proxy_std · ρ, proxy_std² ]] = Σ
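The identity L @ Lᵀ = Σ can also be confirmed numerically. A small dependency-free sketch with hypothetical helper names and illustrative values (true_std = 2, proxy_std = 3, ρ = 0.8):

```python
import math


def cholesky_factor(true_std, proxy_std, rho):
    """Build the lower-triangular factor L from the angle theta = arccos(rho)."""
    theta = math.acos(rho)
    return [
        [true_std, 0.0],
        [proxy_std * math.cos(theta), proxy_std * math.sin(theta)],
    ]


def times_transpose(m):
    """Return m @ m.T for a 2 x 2 matrix given as nested lists."""
    return [
        [sum(m[i][k] * m[j][k] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


factor = cholesky_factor(true_std=2.0, proxy_std=3.0, rho=0.8)
sigma = times_transpose(factor)
# Expected covariance: [[4.0, 4.8], [4.8, 9.0]] up to rounding.
```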
Step 2 — Sampling via the linear transform
Let Z be a 2 × (n_labeled+n_unlabeled) matrix whose entries are i.i.d. standard normals
Z_i ~ N(0, 1). Then:
Y = L @ Z
gives a 2 × (n_labeled+n_unlabeled) matrix where each column is a zero-mean sample from
N(0, Σ). In component form, each column (Z₁, Z₂) maps to:
Y₁ = true_std · Z₁
Y₂ = proxy_std · cos(θ) · Z₁ + proxy_std · sin(θ) · Z₂
The resulting properties are:
- Var(Y₁) = true_std² and Var(Y₂) = proxy_std² (correct marginal variances)
- Cov(Y₁, Y₂) = true_std · proxy_std · cos(θ) = true_std · proxy_std · ρ
- Corr(Y₁, Y₂) = ρ (correct Pearson correlation)
Step 3 — Shifting by the means
Adding the desired means shifts the distribution to N(μ, Σ):
y_true = true_mean + Y[0, :]
y_proxy = proxy_mean + Y[1, :]
The first n_labeled columns form the labeled set (both y_true and y_proxy
are observed); columns n_labeled through n_labeled+n_unlabeled-1 form the unlabeled set
(only y_proxy is observed).
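Steps 2 and 3 can be sketched in component form, one column at a time. This is not the library's implementation: it uses `random.gauss` instead of NumPy, the function name is hypothetical, and masking the unlabeled entries of y_true with NaN is an assumption inferred from the example output.

```python
import math
import random


def sample_gaussian_pairs(n_labeled, n_unlabeled, true_mean, true_std,
                          proxy_mean, proxy_std, rho, seed=None):
    """Draw correlated Gaussian pairs; unlabeled y_true entries become NaN."""
    rng = random.Random(seed)
    theta = math.acos(rho)
    y_true, y_proxy = [], []
    for i in range(n_labeled + n_unlabeled):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Apply the Cholesky factor to the standard normal pair (z1, z2).
        y1 = true_std * z1
        y2 = proxy_std * (math.cos(theta) * z1 + math.sin(theta) * z2)
        # Assumption: the true label is hidden (NaN) outside the labeled set.
        y_true.append(true_mean + y1 if i < n_labeled else math.nan)
        y_proxy.append(proxy_mean + y2)
    return y_true, y_proxy


y_true, y_proxy = sample_gaussian_pairs(3, 5, 0.7, 1.0, 0.6, 1.0, 0.8, seed=42)
```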
Examples:
>>> import numpy as np
>>> from glide.simulators import generate_gaussian_dataset
>>> y_true, y_proxy = generate_gaussian_dataset(n_labeled=3, n_unlabeled=5, random_seed=42)
>>> len(y_true)
8
>>> int(np.sum(~np.isnan(y_true)))
3
>>> int(np.sum(~np.isnan(y_proxy)))
8
Source code in glide/simulators/gaussian.py
generate_stratified_binary_dataset
generate_stratified_binary_dataset(
n_total,
true_mean,
proxy_mean,
correlation,
random_seed=None,
)
Generate a synthetic stratified binary-label oracle dataset.
Generate multiple strata with potentially different parameters (true_mean, proxy_mean, correlation, n_total per stratum). This enables simulation of heterogeneous data where different groups have different proxy-truth relationships.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_total | List[int] | Total number of samples per stratum. All samples have both true and proxy labels. Length must equal number of strata. | required |
| true_mean | List[float] | Expected mean value of the true labels per stratum. Length must equal number of strata. | required |
| proxy_mean | List[float] | Expected mean value of the proxy labels per stratum. Length must equal number of strata. | required |
| correlation | List[float] | Pearson correlation between true and proxy per stratum. Length must equal number of strata. | required |
| random_seed | int | Seed for reproducibility. If provided, seeds are derived deterministically. | None |
Returns:
| Type | Description |
|---|---|
| Tuple[NDArray, NDArray, NDArray] | Let [0]: array of shape |
Raises:
| Type | Description |
|---|---|
| ValueError | If input lists have different lengths. |
| ValueError | If fewer than 1 stratum is specified. |
| ValueError | If any stratum has invalid parameters (see generate_binary_dataset). |
Examples:
>>> import numpy as np
>>> from glide.simulators import generate_stratified_binary_dataset
>>> y_true, y_proxy, groups = generate_stratified_binary_dataset(
... n_total=[6, 8],
... true_mean=[0.6, 0.8],
... proxy_mean=[0.5, 0.7],
... correlation=[0.7, 0.75],
... random_seed=42
... )
>>> len(y_true)
14
>>> len(groups)
14
>>> len(y_proxy)
14
>>> bool(np.all(np.isin(y_true, [0.0, 1.0])))
True
>>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
True
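The input checks listed under Raises, and one plausible layout for the third returned array, can be sketched as follows. The layout is an assumption: this page does not show what `groups` contains, so the sketch assumes one integer stratum index per sample, in stratum order. The function name `stratum_labels` is hypothetical.

```python
def stratum_labels(n_total, true_mean, proxy_mean, correlation):
    """Validate per-stratum inputs and return one stratum index per sample."""
    lengths = {len(n_total), len(true_mean), len(proxy_mean), len(correlation)}
    if len(lengths) != 1:
        raise ValueError("all per-stratum lists must have the same length")
    if not n_total:
        raise ValueError("at least one stratum is required")
    groups = []
    for k, n_k in enumerate(n_total):
        # Assumed layout: stratum k contributes n_k consecutive entries.
        groups.extend([k] * n_k)
    return groups


groups = stratum_labels([6, 8], [0.6, 0.8], [0.5, 0.7], [0.7, 0.75])
```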
Source code in glide/simulators/stratified_binary.py
simulate_annotation
simulate_annotation(y_true_oracle, xi)
Reveal oracle labels where annotated and mask the rest as NaN.
Given a full oracle label array and an annotation indicator, returns an array where labels
are kept for annotated elements (xi == 1) and set to np.nan for unannotated ones
(xi == 0 or xi is NaN). The input arrays are not mutated.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_true_oracle | NDArray | Full oracle ground-truth labels for all elements. | required |
| xi | NDArray | Annotation indicator of the same length. A value of | required |
Returns:
| Type | Description |
|---|---|
| NDArray | Array of the same length as |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| ValueError | If |
| ValueError | If |
Examples:
>>> import numpy as np
>>> from glide.simulators import simulate_annotation
>>> y_true_oracle = np.array([0, 1, 1, 0])
>>> xi = np.array([1, 0, 1, np.nan])
>>> simulate_annotation(y_true_oracle, xi)
array([ 0., nan, 1., nan])
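The masking rule can be mimicked with plain lists. This is only a sketch of the behavior described above, not the library's implementation; the function name `mask_unannotated` is hypothetical, and the length check is an assumption, since the conditions under Raises are not shown on this page.

```python
import math


def mask_unannotated(y_true_oracle, xi):
    """Return a new list keeping labels where xi == 1 and NaN elsewhere."""
    # Assumption: mismatched lengths are treated as an error.
    if len(y_true_oracle) != len(xi):
        raise ValueError("y_true_oracle and xi must have the same length")
    # NaN never compares equal to 1, so NaN indicators are masked as well.
    return [
        float(label) if flag == 1 else math.nan
        for label, flag in zip(y_true_oracle, xi)
    ]


masked = mask_unannotated([0, 1, 1, 0], [1, 0, 1, math.nan])
```

A new list is built, so neither input is modified.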
Source code in glide/simulators/annotation.py