
Data Simulators

glide.simulators

generate_binary_dataset

generate_binary_dataset(
    n_total,
    true_mean=0.7,
    proxy_mean=0.6,
    correlation=0.8,
    random_seed=None,
)

Generate a synthetic binary-label oracle dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_total | int | Total number of samples. All samples have both true and proxy labels. | required |
| true_mean | float | Expected mean value of the true labels. Must be in (0, 1). | 0.7 |
| proxy_mean | float | Expected mean value of the proxy labels. Must be in (0, 1). | 0.6 |
| correlation | float | Pearson correlation between true and proxy labels. | 0.8 |
| random_seed | int or SeedSequence | Seed for reproducibility. | None |

Returns:

| Type | Description |
| --- | --- |
| Tuple[NDArray, NDArray] | [0]: array of shape (n_total,), y_true containing ground-truth labels. [1]: array of shape (n_total,), y_proxy containing proxy labels. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If true_mean is not in (0, 1). |
| ValueError | If proxy_mean is not in (0, 1). |
| ValueError | If the combination of true_mean, proxy_mean, and correlation is impossible (leads to negative joint probabilities). |

Notes

Step 1 — Joint distribution

For two binary variables with marginals p_t = P(y_true=1) and p_p = P(y_proxy=1), the Pearson correlation uniquely determines the joint distribution. Let D = sqrt(p_t * p_p * (1-p_t) * (1-p_p)) (product of standard deviations). Then:

p11 = P(y_true=1, y_proxy=1) = correlation * D + p_t * p_p
p00 = P(y_true=0, y_proxy=0) = 1 - p_t - p_p + p11
p01 = P(y_true=0, y_proxy=1) = p_p - p11
p10 = P(y_true=1, y_proxy=0) = p_t - p11

These four probabilities must all be non-negative; otherwise the parameter combination is impossible and a ValueError is raised. Each probability becomes negative under the following condition:

p11 < 0 for correlation < -(p_t * p_p) / D
p00 < 0 for correlation < (p_t + p_p - p_t * p_p - 1) / D
p01 < 0 for correlation > p_p * (1 - p_t) / D
p10 < 0 for correlation > p_t * (1 - p_p) / D

Therefore, the correlation must satisfy:

max(-p_t * p_p, p_t + p_p - p_t * p_p - 1) <= correlation * D <= min(p_t * (1 - p_p), p_p * (1 - p_t))
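
The formulas above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names `joint_probabilities` and `feasible_correlation_range` are illustrative and are not part of glide.simulators.

```python
import math


def joint_probabilities(p_t, p_p, correlation):
    """Return (p00, p01, p10, p11) for two Bernoulli variables (hypothetical helper)."""
    d = math.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))  # product of standard deviations
    p11 = correlation * d + p_t * p_p
    p00 = 1 - p_t - p_p + p11
    p01 = p_p - p11
    p10 = p_t - p11
    return p00, p01, p10, p11


def feasible_correlation_range(p_t, p_p):
    """Return the (lower, upper) correlation bounds that keep all four probabilities >= 0."""
    d = math.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
    lower = max(-p_t * p_p, p_t + p_p - p_t * p_p - 1) / d
    upper = min(p_t * (1 - p_p), p_p * (1 - p_t)) / d
    return lower, upper


lower, upper = feasible_correlation_range(0.7, 0.6)
probs = joint_probabilities(0.7, 0.6, 0.8)
print(f"feasible correlation range: [{lower:.3f}, {upper:.3f}]")
print("p00, p01, p10, p11 =", probs)
```

With the default means (0.7 and 0.6) the upper bound is about 0.802, so the default correlation of 0.8 sits close to the edge of the feasible range.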

Step 2 — Sampling outcome pairs

The four outcomes (y_true=0, y_proxy=0), (y_true=0, y_proxy=1), (y_true=1, y_proxy=0), (y_true=1, y_proxy=1) are encoded as integers 0–3 with probabilities [p00, p01, p10, p11]. All n_total pairs are drawn in one call via numpy.random.Generator.choice.

Step 3 — Decoding labels from integers

The integer encoding satisfies y_true = outcome // 2 and y_proxy = outcome % 2, so both labels are recovered with cheap integer arithmetic.
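
Steps 2 and 3 can be sketched as follows. This version swaps in the standard library's `random.Random.choices` for `numpy.random.Generator.choice` so that it has no dependencies, and the probability vector is an illustrative assumption rather than a value computed by the library.

```python
import random

# Illustrative joint probabilities [p00, p01, p10, p11]; they must sum to 1.
probs = [0.30, 0.10, 0.10, 0.50]

rng = random.Random(42)
# Draw all outcome codes (integers 0..3) in a single call.
outcomes = rng.choices(range(4), weights=probs, k=1000)

# Decode each integer: the high bit is y_true, the low bit is y_proxy.
y_true = [float(o // 2) for o in outcomes]
y_proxy = [float(o % 2) for o in outcomes]

# The encoding 0..3 maps to (y_true, y_proxy) pairs in this order.
decoded = [(o // 2, o % 2) for o in range(4)]
print(decoded)
```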

References

[SO] Correlation between Bernoulli Variables, https://math.stackexchange.com/questions/610443/finding-a-correlation-between-bernoulli-variables

Examples:

>>> import numpy as np
>>> from glide.simulators import generate_binary_dataset
>>> y_true, y_proxy = generate_binary_dataset(n_total=8, random_seed=42)
>>> len(y_true)
8
>>> len(y_proxy)
8
>>> bool(np.all(np.isin(y_true, [0.0, 1.0])))
True
>>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
True
Source code in glide/simulators/binary.py
def generate_binary_dataset(
    n_total: int,
    true_mean: float = 0.7,
    proxy_mean: float = 0.6,
    correlation: float = 0.8,
    random_seed: Optional[Union[int, np.random.SeedSequence]] = None,
) -> Tuple[NDArray, NDArray]:
    """Generate a synthetic binary-label oracle dataset.

    Parameters
    ----------
    n_total : int
        Total number of samples. All samples have both true and proxy labels.
    true_mean : float
        Expected mean value of the true labels.
    proxy_mean : float
        Expected mean value of the proxy labels.
    correlation : float
        Pearson correlation between true and proxy labels.
    random_seed : int or np.random.SeedSequence, optional
        Seed for reproducibility.

    Returns
    -------
    Tuple[NDArray, NDArray]
        [0]: array of shape ``(n_total,)``, y_true containing ground-truth labels.
        [1]: array of shape ``(n_total,)``, y_proxy containing proxy labels.

    Raises
    ------
    ValueError
        If ``true_mean`` is not in (0, 1).
    ValueError
        If ``proxy_mean`` is not in (0, 1).
    ValueError
        If the combination of ``true_mean``, ``proxy_mean``, and ``correlation`` is
        impossible (leads to negative joint probabilities).

    Notes
    -----
    **Step 1 — Joint distribution**

    For two binary variables with marginals ``p_t = P(y_true=1)`` and
    ``p_p = P(y_proxy=1)``, the Pearson correlation uniquely determines the
    joint distribution.  Let ``D = sqrt(p_t * p_p * (1-p_t) * (1-p_p))``
    (product of standard deviations).  Then:

    ```
    p11 = P(y_true=1, y_proxy=1) = correlation * D + p_t * p_p
    p00 = P(y_true=0, y_proxy=0) = 1 - p_t - p_p + p11
    p01 = P(y_true=0, y_proxy=1) = p_p - p11
    p10 = P(y_true=1, y_proxy=0) = p_t - p11
    ```

    These four probabilities must all be non-negative; otherwise the
    parameter combination is impossible and a ``ValueError`` is raised. Each
    probability becomes negative under the following condition:

    ```
    p11 < 0 for correlation < -(p_t * p_p) / D
    p00 < 0 for correlation < (p_t + p_p - p_t * p_p - 1) / D
    p01 < 0 for correlation > p_p * (1 - p_t) / D
    p10 < 0 for correlation > p_t * (1 - p_p) / D
    ```

    Therefore, the correlation must satisfy:

    ```
    max(-p_t * p_p, p_t + p_p - p_t * p_p - 1) <= correlation * D <= min(p_t * (1 - p_p), p_p * (1 - p_t))
    ```

    **Step 2 — Sampling outcome pairs**

    The four outcomes ``(y_true=0, y_proxy=0)``, ``(y_true=0, y_proxy=1)``,
    ``(y_true=1, y_proxy=0)``, ``(y_true=1, y_proxy=1)`` are encoded as
    integers 0–3 with probabilities ``[p00, p01, p10, p11]``.  All ``n_total``
    pairs are drawn in one call via ``numpy.random.Generator.choice``.

    **Step 3 — Decoding labels from integers**

    The integer encoding satisfies ``y_true = outcome // 2`` and
    ``y_proxy = outcome % 2``, so both labels are recovered with cheap
    integer arithmetic.

    References
    ----------
    .. [SO] `Correlation between Bernoulli Variables <https://math.stackexchange.com/questions/610443/finding-a-correlation-between-bernoulli-variables>`_

    Examples
    --------
    >>> import numpy as np
    >>> from glide.simulators import generate_binary_dataset
    >>> y_true, y_proxy = generate_binary_dataset(n_total=8, random_seed=42)
    >>> len(y_true)
    8
    >>> len(y_proxy)
    8
    >>> bool(np.all(np.isin(y_true, [0.0, 1.0])))
    True
    >>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
    True
    """
    _validate_bounds(true_mean, "true_mean", lower=0, upper=1, left_inclusive=False, right_inclusive=False)
    _validate_bounds(proxy_mean, "proxy_mean", lower=0, upper=1, left_inclusive=False, right_inclusive=False)

    rng = np.random.default_rng(seed=random_seed)

    p_t = true_mean
    p_p = proxy_mean

    # std product of the variable pair will be used multiple times
    D = np.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))

    # some combinations of true_mean, proxy_mean and correlation are impossible
    # and lead to negative probabilities, raise an error if this is the case
    min_possible_correlation = max(-p_t * p_p, p_p + p_t - 1 - p_t * p_p) / D
    max_possible_correlation = min(p_t * (1 - p_p), p_p * (1 - p_t)) / D
    if correlation < min_possible_correlation or correlation > max_possible_correlation:
        raise ValueError(
            f"Impossible combination of 'true_mean'={true_mean!r}, 'proxy_mean'={proxy_mean!r}, "
            f"and 'correlation'={correlation!r}: leads to negative joint probabilities; "
            f"possible 'correlation' values are in the range ({min_possible_correlation:.3f}"
            f", {max_possible_correlation:.3f})."
        )

    # we will generate pairs values (true, proxy) with true and proxy equal to 0 or 1
    # probability of outcome (1, 1)
    p11 = correlation * D + p_t * p_p
    p00 = 1 - p_t - p_p + p11
    p01 = p_p - p11
    p10 = p_t - p11
    # probabilities of outcomes (0, 0), (0, 1), (1, 0), (1, 1)
    probs = [p00, p01, p10, p11]

    # generate the outcome pairs as integers between 0 and 3 inclusive
    samples = rng.choice(4, p=probs, size=n_total)
    # extract the true and proxy values via integer division and modulo 2
    # we have 0 = (0, 0), 1 = (0, 1), 2 = (1, 0), 3 = (1, 1)
    y_true = (samples // 2).astype(float)
    y_proxy = (samples % 2).astype(float)

    return y_true, y_proxy

generate_binary_dataset_with_oracle_sampling

generate_binary_dataset_with_oracle_sampling(
    n_total,
    true_mean=0.7,
    proxy_mean=0.6,
    correlation=0.8,
    random_seed=None,
)

Generate a synthetic binary dataset with oracle sampling probabilities.

All n_total samples have ground-truth labels (y_true_oracle), proxy predictions (y_proxy), and an oracle uncertainty score derived from the analytical proxy error. The uncertainty values are non-uniform: samples where the proxy is less reliable receive higher uncertainty following the optimal sampling rule.

Sampling is driven by a latent variable that sets the correlation between y_true_oracle and y_proxy for each sample. The per-sample correlation is distributed uniformly around the given correlation value, with a spread limited so that it stays within the interval of correlations that are possible for the given true_mean and proxy_mean. As a result, the correlation between y_true_oracle and y_proxy matches the target value on average.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_total | int | Total number of samples. | required |
| true_mean | float | Expected mean of y_true_oracle. Must be in (0, 1). | 0.7 |
| proxy_mean | float | Expected mean of y_proxy. Must be in (0, 1). | 0.6 |
| correlation | float | Pearson correlation between y_true_oracle and y_proxy (marginal, across all samples). | 0.8 |
| random_seed | int | Seed for reproducibility. | None |

Returns:

| Type | Description |
| --- | --- |
| Tuple[NDArray, NDArray, NDArray] | [0]: array of shape (n_total,), y_true_oracle with the full ground-truth labels for all n_total samples (no NaN); use simulate_annotation to mask unlabeled rows. [1]: array of shape (n_total,), y_proxy with proxy predictions. [2]: array of shape (n_total,), uncertainty (oracle uncertainty score) per sample. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If true_mean is not in (0, 1). |
| ValueError | If proxy_mean is not in (0, 1). |
| ValueError | If the combination of true_mean, proxy_mean, and correlation leads to negative joint probabilities. |

Notes

Step 1 — Global joint distribution

For two binary variables with marginals p_t = P(y_true_oracle=1) and p_p = P(y_proxy=1), the Pearson correlation uniquely determines the joint distribution. Let D = sqrt(p_t * p_p * (1-p_t) * (1-p_p)) (product of standard deviations). Then:

p11 = P(y_true_oracle=1, y_proxy=1) = correlation * D + p_t * p_p
p00 = P(y_true_oracle=0, y_proxy=0) = 1 - p_t - p_p + p11
p01 = P(y_true_oracle=0, y_proxy=1) = p_p - p11
p10 = P(y_true_oracle=1, y_proxy=0) = p_t - p11

These four probabilities are fully determined by (p_t, p_p, correlation) and must all be non-negative; otherwise the parameter combination is impossible and a ValueError is raised. Each probability becomes negative under the following condition:

p11 < 0 for correlation < -(p_t * p_p) / D
p00 < 0 for correlation < (p_t + p_p - p_t * p_p - 1) / D
p01 < 0 for correlation > p_p * (1 - p_t) / D
p10 < 0 for correlation > p_t * (1 - p_p) / D

Therefore, the correlation must satisfy:

max(-p_t * p_p, p_t + p_p - p_t * p_p - 1) <= correlation * D <= min(p_t * (1 - p_p), p_p * (1 - p_t))

Step 2 — Latent variable x and per-sample correlation

Each sample receives a latent value x_i ~ Uniform(-1, 1) representing "annotation difficulty" (lower x means a harder sample). The per-sample Pearson correlation is defined as:

corr(x_i) = correlation + correlation_spread * x_i

Because E[x] = 0 for x ~ Uniform(-1, 1), the marginal correlation E[corr(X)] = correlation exactly, preserving the target value on average. Samples with low x have lower conditional correlation (proxy less reliable → higher uncertainty); samples with high x have higher conditional correlation (proxy more reliable → lower uncertainty).

correlation_spread is chosen as 90 % of the largest value that keeps all four per-sample probabilities strictly positive for every x in [-1, 1]:

max_safe_correlation_spread = min(p00, p01, p10, p11) / D
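
To see why this bound works: shifting the correlation by `correlation_spread * x` moves p11 and p00 in one direction and p01 and p10 in the other, each by the same amount `correlation_spread * x * D`. The sketch below checks the two extreme cases x = -1 and x = 1; the parameter values and the helper `per_sample_probs` are illustrative assumptions, not library code.

```python
import math

p_t, p_p, correlation = 0.7, 0.6, 0.5  # illustrative, feasible combination
d = math.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))

# Global joint probabilities.
p11 = correlation * d + p_t * p_p
p00 = 1 - p_t - p_p + p11
p01 = p_p - p11
p10 = p_t - p11

# Largest spread that keeps every probability non-negative, then 90 % of it.
max_safe_correlation_spread = min(p00, p01, p10, p11) / d
correlation_spread = 0.9 * max_safe_correlation_spread


def per_sample_probs(x):
    """Return (p00, p01, p10, p11) for a latent value x in [-1, 1] (hypothetical helper)."""
    p11_x = (correlation + correlation_spread * x) * d + p_t * p_p
    return (1 - p_t - p_p + p11_x, p_p - p11_x, p_t - p11_x, p11_x)


# The probabilities are linear in x, so checking both endpoints is sufficient.
extremes = [per_sample_probs(x) for x in (-1.0, 1.0)]
for probs_x in extremes:
    print(probs_x)
```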

Step 3 — Per-sample probabilities and error probability

p11 is made a function of x, and the change propagates to the other three probabilities:

p11(x) = corr(x) * D + p_t * p_p          # varies with x
error_prob(x) = p01(x) + p10(x)
                = p_t + p_p - 2 * p11(x)    # proxy ≠ y_true_oracle

error_prob(x) is the per-sample proxy error probability, which decreases linearly as x increases (higher x → better proxy).
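
A short numerical sketch of this linear relationship follows. The parameter values, including the spread of 0.2, are illustrative assumptions chosen to stay inside the feasible range.

```python
import math

p_t, p_p = 0.7, 0.6
correlation, correlation_spread = 0.5, 0.2  # illustrative values
d = math.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))


def error_prob(x):
    """Probability that y_proxy differs from y_true_oracle for latent value x."""
    p11_x = (correlation + correlation_spread * x) * d + p_t * p_p
    return p_t + p_p - 2 * p11_x


values = [error_prob(x) for x in (-1.0, 0.0, 1.0)]
print(values)  # decreasing in x
```

Because `error_prob` is linear in x, the value at x = 0 is the midpoint of the values at x = -1 and x = 1.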

Step 4 — Vectorized CDF inversion

Since each sample has its own probability vector, numpy.random.Generator.choice (which takes a single fixed probability vector) cannot be used. Instead, the four outcomes (0,0), (0,1), (1,0), (1,1) are encoded as integers 0–3 and sampled by comparing a single u ~ Uniform(0,1) draw per sample against cumulative thresholds; the first condition that holds, reading top to bottom, selects the outcome:

u < p00(x)                 → outcome 0 : (y_true_oracle=0, y_proxy=0)
u < p00(x)+p01(x)          → outcome 1 : (y_true_oracle=0, y_proxy=1)
u < p00(x)+p01(x)+p10(x)   → outcome 2 : (y_true_oracle=1, y_proxy=0)
else                       → outcome 3 : (y_true_oracle=1, y_proxy=1)

The crucial simplification is that the second threshold collapses to the constant 1 - p_t (independent of x), because:

p00(x) + p01(x) = (1-p_t-p_p+p11) + (p_p-p11) = 1 - p_t

We also have:

p00(x) + p01(x) + p10(x) = 1 - p11(x)

This means only two of the three thresholds require per-sample arrays. The outcome integer encodes both labels: y_true_oracle = outcome // 2, y_proxy = outcome % 2.
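
The threshold logic can be read more easily in scalar form. The sketch below handles one sample at a time in pure Python, whereas the library source applies the same comparisons to whole arrays with nested `np.where` calls; the function name `sample_outcome` and the numeric values are illustrative assumptions.

```python
import random


def sample_outcome(u, p_t, p00_x, p11_x):
    """Map a uniform draw u to an outcome code 0..3 via cumulative thresholds."""
    if u < p00_x:          # first threshold: p00(x)
        return 0
    if u < 1 - p_t:        # second threshold: p00(x) + p01(x) = 1 - p_t
        return 1
    if u < 1 - p11_x:      # third threshold: 1 - p11(x)
        return 2
    return 3


# Illustrative per-sample values: p_t = 0.7, p_p = 0.6, p11(x) = 0.55.
p_t, p_p, p11_x = 0.7, 0.6, 0.55
p00_x = 1 - p_t - p_p + p11_x  # 0.25

rng = random.Random(0)
outcomes = [sample_outcome(rng.random(), p_t, p00_x, p11_x) for _ in range(5)]
labels = [(o // 2, o % 2) for o in outcomes]  # (y_true_oracle, y_proxy)
print(labels)
```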

Step 5 — Oracle uncertainty

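The oracle uncertainty score is defined as uncertainty = sqrt(E[(y_proxy - y_true_oracle)²]) = sqrt(error_prob(x)), following the optimal sampling rule. The second equality holds because, for binary labels, the squared difference is 1 when the two labels disagree and 0 otherwise, so its expectation is the error probability. These values are returned directly as uncertainty.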
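
The identity E[(y_proxy - y_true_oracle)²] = p01 + p10 can be verified by summing over the four outcomes. The joint probabilities below are illustrative assumptions for a single sample.

```python
import math

# Illustrative per-sample joint probabilities keyed by (y_true_oracle, y_proxy).
probs = {(0, 0): 0.25, (0, 1): 0.05, (1, 0): 0.15, (1, 1): 0.55}

# Expected squared difference, summed over all four outcomes.
expected_sq_diff = sum(prob * (p - t) ** 2 for (t, p), prob in probs.items())

# Error probability: mass on the two disagreeing outcomes.
error_prob = probs[(0, 1)] + probs[(1, 0)]

uncertainty = math.sqrt(expected_sq_diff)
print(expected_sq_diff, error_prob, uncertainty)
```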

Examples:

>>> import numpy as np
>>> from glide.simulators import generate_binary_dataset_with_oracle_sampling
>>> y_true_oracle, y_proxy, uncertainty = generate_binary_dataset_with_oracle_sampling(n_total=4, random_seed=0)
>>> len(y_true_oracle)
4
>>> len(y_proxy)
4
>>> len(uncertainty)
4
>>> bool(np.all(np.isin(y_true_oracle, [0.0, 1.0])))
True
>>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
True
>>> bool(np.all((uncertainty >= 0) & (uncertainty <= 1)))
True
Source code in glide/simulators/oracle_binary.py
def generate_binary_dataset_with_oracle_sampling(
    n_total: int,
    true_mean: float = 0.7,
    proxy_mean: float = 0.6,
    correlation: float = 0.8,
    random_seed: Optional[int] = None,
) -> Tuple[NDArray, NDArray, NDArray]:
    """Generate a synthetic binary dataset with oracle sampling probabilities.

    All n_total samples have ground-truth labels (y_true_oracle), proxy predictions (y_proxy),
    and an oracle uncertainty score derived from the analytical
    proxy error. The uncertainty values are non-uniform: samples where the proxy is less
    reliable receive higher uncertainty following the optimal sampling rule.

    The sampling is based on a latent variable which determines the correlation
    between y_true_oracle and y_proxy in each sample. This variable is sampled uniformly
    around the given correlation value with limited spread within the interval of
    possible correlation levels given true_mean and proxy_mean. This way, the
    correlation between y_true_oracle and y_proxy matches the target value on average.

    Parameters
    ----------
    n_total : int
        Total number of samples.
    true_mean : float
        Expected mean of y_true_oracle. Must be in (0, 1).
    proxy_mean : float
        Expected mean of y_proxy. Must be in (0, 1).
    correlation : float
        Pearson correlation between y_true_oracle and y_proxy (marginal, across all samples).
    random_seed : int, optional
        Seed for reproducibility.

    Returns
    -------
    Tuple[NDArray, NDArray, NDArray]
        [0]: array of shape ``(n_total,)``, y_true_oracle with the full ground-truth labels for all n_total
        samples (no NaN); use ``simulate_annotation`` to mask unlabeled rows
        [1]: array of shape ``(n_total,)``, y_proxy with proxy predictions
        [2]: array of shape ``(n_total,)``, uncertainty (oracle uncertainty score) per sample

    Raises
    ------
    ValueError
        If true_mean is not in (0, 1).
    ValueError
        If proxy_mean is not in (0, 1).
    ValueError
        If the combination of true_mean, proxy_mean, and correlation leads to
        negative joint probabilities.

    Notes
    -----
    **Step 1 — Global joint distribution**

    For two binary variables with marginals ``p_t = P(y_true_oracle=1)`` and
    ``p_p = P(y_proxy=1)``, the Pearson correlation uniquely determines the
    joint distribution.  Let ``D = sqrt(p_t * p_p * (1-p_t) * (1-p_p))``
    (product of standard deviations).  Then:

    ```
    p11 = P(y_true_oracle=1, y_proxy=1) = correlation * D + p_t * p_p
    p00 = P(y_true_oracle=0, y_proxy=0) = 1 - p_t - p_p + p11
    p01 = P(y_true_oracle=0, y_proxy=1) = p_p - p11
    p10 = P(y_true_oracle=1, y_proxy=0) = p_t - p11
    ```

    These four probabilities are fully determined by ``(p_t, p_p, correlation)``
    and must all be non-negative; otherwise the parameter combination is
    impossible and a ``ValueError`` is raised. Each probability becomes
    negative under the following condition:

    ```
    p11 < 0 for correlation < -(p_t * p_p) / D
    p00 < 0 for correlation < (p_t + p_p - p_t * p_p - 1) / D
    p01 < 0 for correlation > p_p * (1 - p_t) / D
    p10 < 0 for correlation > p_t * (1 - p_p) / D
    ```

    Therefore, the correlation must satisfy:

    ```
    max(-p_t * p_p, p_t + p_p - p_t * p_p - 1) <= correlation * D <= min(p_t * (1 - p_p), p_p * (1 - p_t))
    ```

    **Step 2 — Latent variable x and per-sample correlation**

    Each sample receives a latent value ``x_i ~ Uniform(-1, 1)`` representing
    "annotation difficulty".  The per-sample Pearson correlation is defined as:

    ```
    corr(x_i) = correlation + correlation_spread * x_i
    ```

    Because ``E[x] = 0`` for ``x ~ Uniform(-1, 1)``, the marginal
    correlation ``E[corr(X)] = correlation`` exactly, preserving the target
    value on average.  Samples with low ``x`` have lower conditional
    correlation (proxy less reliable → higher uncertainty); samples with high
    ``x`` have higher conditional correlation (proxy more reliable → lower uncertainty).

    ``correlation_spread`` is chosen as 90 % of the largest value that keeps
    all four per-sample probabilities strictly positive for every
    ``x in [-1, 1]``:

    ```
    max_safe_correlation_spread = min(p00, p01, p10, p11) / D
    ```

    **Step 3 — Per-sample probabilities and error probability**

    We adapt ``p11`` with ``x`` and this propagates to other values:

    ```
    p11(x) = corr(x) * D + p_t * p_p          # varies with x
    error_prob(x) = p01(x) + p10(x)
                    = p_t + p_p - 2 * p11(x)    # proxy ≠ y_true_oracle
    ```

    ``error_prob(x)`` is the per-sample proxy error probability, which
    decreases linearly as ``x`` increases (higher x → better proxy).

    **Step 4 — Vectorized CDF inversion**

    Since each sample has its own probability vector, ``numpy.random.choice``
    (which takes a single fixed probability vector) cannot be used.  Instead,
    the four outcomes ``(0,0), (0,1), (1,0), (1,1)`` are encoded as integers
    0–3 and sampled via cumulative-threshold comparison on a single
    ``u ~ Uniform(0,1)`` draw:

    ```
    u < p00(x)                 → outcome 0 : (y_true_oracle=0, y_proxy=0)
    u < p00(x)+p01(x)          → outcome 1 : (y_true_oracle=0, y_proxy=1)
    u < p00(x)+p01(x)+p10(x)   → outcome 2 : (y_true_oracle=1, y_proxy=0)
    else                       → outcome 3 : (y_true_oracle=1, y_proxy=1)
    ```

    The crucial simplification is that the second threshold collapses to the
    constant ``1 - p_t`` (independent of ``x``), because:

    ```
    p00(x) + p01(x) = (1-p_t-p_p+p11) + (p_p-p11) = 1 - p_t
    ```

    We also have:

    ```
    p00(x) + p01(x) + p10(x) = 1 - p11(x)
    ```

    This means only two of the three thresholds require per-sample arrays.
    The outcome integer encodes both labels: ``y_true_oracle = outcome // 2``,
    ``y_proxy = outcome % 2``.

    **Step 5 — Oracle uncertainty**

    The optimal sampling probability satisfies
    ``uncertainty = sqrt(E[(y_proxy - y_true_oracle)²]) = sqrt(error_prob(x))``.
    These values are stored directly as ``uncertainty``.

    Examples
    --------
    >>> import numpy as np
    >>> from glide.simulators import generate_binary_dataset_with_oracle_sampling
    >>> y_true_oracle, y_proxy, uncertainty = generate_binary_dataset_with_oracle_sampling(n_total=4, random_seed=0)
    >>> len(y_true_oracle)
    4
    >>> len(y_proxy)
    4
    >>> len(uncertainty)
    4
    >>> bool(np.all(np.isin(y_true_oracle, [0.0, 1.0])))
    True
    >>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
    True
    >>> bool(np.all((uncertainty >= 0) & (uncertainty <= 1)))
    True
    """
    _validate_bounds(true_mean, "true_mean", lower=0, upper=1, left_inclusive=False, right_inclusive=False)
    _validate_bounds(proxy_mean, "proxy_mean", lower=0, upper=1, left_inclusive=False, right_inclusive=False)

    rng = np.random.default_rng(seed=random_seed)
    p_t = true_mean
    p_p = proxy_mean

    # std product of the variable pair will be used multiple times
    D = np.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))

    # some combinations of true_mean, proxy_mean and correlation are impossible
    # and lead to negative probabilities, raise an error if this is the case
    min_possible_correlation = max(-p_t * p_p, p_p + p_t - 1 - p_t * p_p) / D
    max_possible_correlation = min(p_t * (1 - p_p), p_p * (1 - p_t)) / D
    if correlation < min_possible_correlation or correlation > max_possible_correlation:
        raise ValueError(
            f"Impossible combination of 'true_mean'={true_mean!r}, 'proxy_mean'={proxy_mean!r}, "
            f"and 'correlation'={correlation!r}: leads to negative joint probabilities; "
            f"possible 'correlation' values are in the range ({min_possible_correlation:.3f}"
            f", {max_possible_correlation:.3f})."
        )

    # Global (marginal) joint distribution — same as generate_binary_dataset
    p11 = correlation * D + p_t * p_p
    p00 = 1 - p_t - p_p + p11
    p01 = p_p - p11
    p10 = p_t - p11
    probs = [p00, p01, p10, p11]

    # Spread parameter: modulates the conditional correlation across samples
    max_safe_correlation_spread = min(probs) / D
    correlation_spread = 0.9 * max_safe_correlation_spread

    # Latent variable: controls per-sample proxy correlation
    x = rng.uniform(-1.0, 1.0, size=n_total)

    # Per-sample conditional joint distribution
    correlation_x = correlation + correlation_spread * x
    p11_x = correlation_x * D + p_t * p_p
    error_prob_x = p_t + p_p - 2.0 * p11_x

    # Vectorized CDF inversion to sample (y_true, y_proxy) per sample
    p00_x = 1.0 - p_t - p_p + p11_x
    u = rng.uniform(0.0, 1.0, size=n_total)
    samples = np.where(
        u < p00_x,
        0,
        np.where(
            u < 1.0 - p_t,
            1,
            np.where(u < 1.0 - p11_x, 2, 3),
        ),
    )
    y_true_oracle_arr = samples // 2
    y_proxy_arr = samples % 2

    # Oracle uncertainty: sqrt(P(error | x_i))
    uncertainty = np.sqrt(error_prob_x)

    return y_true_oracle_arr.astype(float), y_proxy_arr.astype(float), uncertainty

generate_clustered_binary_dataset

generate_clustered_binary_dataset(
    n_total,
    n_clusters,
    true_mean=0.7,
    proxy_mean=0.6,
    correlation=0.8,
    random_seed=None,
)

Generate a synthetic clustered binary-label dataset for evaluation.

Draws n_total i.i.d. (y_true, y_proxy) pairs from the joint binary distribution defined by true_mean, proxy_mean, and correlation, then randomly partitions the observations into n_clusters non-empty groups.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_total | int | Exact total number of observations across all clusters. | required |
| n_clusters | int | Exact number of clusters. Must be at least 2. | required |
| true_mean | float | Expected mean value of the true labels. Must be in (0, 1). | 0.7 |
| proxy_mean | float | Expected mean value of the proxy labels. Must be in (0, 1). | 0.6 |
| correlation | float | Pearson correlation between true and proxy labels. | 0.8 |
| random_seed | int or SeedSequence | Seed for reproducibility. | None |

Returns:

| Type | Description |
| --- | --- |
| Tuple[NDArray, NDArray, NDArray] | [0]: y_true, shape (n_total,), values in {0.0, 1.0}. [1]: y_proxy, shape (n_total,), values in {0.0, 1.0}. [2]: clusters, shape (n_total,), integer cluster identifiers in {0, 1, ..., n_clusters - 1}. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If true_mean is not in (0, 1). |
| ValueError | If proxy_mean is not in (0, 1). |
| ValueError | If the combination of true_mean, proxy_mean, and correlation is impossible (leads to negative joint probabilities). |
| ValueError | If n_clusters < 2. |
| ValueError | If n_total < n_clusters. |

Notes

Step 1 — Draw observations

Call generate_binary_dataset(n_total, ...) to obtain n_total i.i.d. (y_true, y_proxy) pairs from the joint binary distribution defined by true_mean, proxy_mean, and correlation.

Step 2 — Random cluster partition

Draw n_clusters - 1 cut positions uniformly without replacement from {1, 2, ..., n_total - 1} and sort them. Combined with 0 and n_total, these define n_clusters contiguous intervals of random lengths that sum to n_total. Assign cluster identifier k to all observations whose position falls in the k-th interval. Every cluster contains at least 1 observation by construction.

Step 3 — Shuffle

Randomly permute the cluster identifier array so that cluster membership is not determined by position in the output.
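
Steps 2 and 3 can be sketched with the standard library alone. This is an illustrative re-implementation under the assumption that `n_total >= n_clusters >= 2`; the function name `random_cluster_partition` is hypothetical, and the library itself uses NumPy.

```python
import random


def random_cluster_partition(n_total, n_clusters, seed=None):
    """Return a shuffled list of cluster ids with every cluster non-empty."""
    rng = random.Random(seed)
    # n_clusters - 1 distinct cut positions from {1, ..., n_total - 1}, sorted.
    cuts = sorted(rng.sample(range(1, n_total), n_clusters - 1))
    bounds = [0] + cuts + [n_total]
    # Interval k has length bounds[k + 1] - bounds[k] >= 1 because cuts are distinct.
    clusters = []
    for k in range(n_clusters):
        clusters.extend([k] * (bounds[k + 1] - bounds[k]))
    # Shuffle so that cluster membership does not depend on position.
    rng.shuffle(clusters)
    return clusters


clusters = random_cluster_partition(n_total=10, n_clusters=4, seed=0)
print(clusters)
```

Because the cut positions are distinct and lie strictly between 0 and n_total, consecutive bounds always differ by at least 1, which is what guarantees non-empty clusters.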

Examples:

>>> import numpy as np
>>> from glide.simulators import generate_clustered_binary_dataset
>>> y_true, y_proxy, clusters = generate_clustered_binary_dataset(
...     n_total=10, n_clusters=4, random_seed=0
... )
>>> y_true
array([1., 1., 1., 0., 1., 1., 0., 1., 0., 1.])
>>> y_proxy
array([1., 0., 1., 0., 1., 1., 0., 1., 0., 1.])
>>> clusters
array([3, 0, 3, 1, 0, 3, 3, 2, 0, 0])
Source code in glide/simulators/clustered_binary.py
def generate_clustered_binary_dataset(
    n_total: int,
    n_clusters: int,
    true_mean: float = 0.7,
    proxy_mean: float = 0.6,
    correlation: float = 0.8,
    random_seed: Optional[Union[int, np.random.SeedSequence]] = None,
) -> Tuple[NDArray, NDArray, NDArray]:
    """Generate a synthetic clustered binary-label dataset for evaluation.

    Draws ``n_total`` i.i.d. ``(y_true, y_proxy)`` pairs from the
    joint binary distribution defined by ``true_mean``, ``proxy_mean``, and
    ``correlation``, then randomly partitions the observations into
    ``n_clusters`` non-empty groups.

    Parameters
    ----------
    n_total : int
        Exact total number of observations across all clusters.
    n_clusters : int
        Exact number of clusters. Must be at least 2.
    true_mean : float
        Expected mean value of the true labels. Must be in ``(0, 1)``.
    proxy_mean : float
        Expected mean value of the proxy labels. Must be in ``(0, 1)``.
    correlation : float
        Pearson correlation between true and proxy labels.
    random_seed : int or np.random.SeedSequence, optional
        Seed for reproducibility.

    Returns
    -------
    Tuple[NDArray, NDArray, NDArray]
        [0]: ``y_true`` — shape ``(n_total,)``, values in ``{0.0, 1.0}``.
        [1]: ``y_proxy`` — shape ``(n_total,)``, values in ``{0.0, 1.0}``.
        [2]: ``clusters`` — shape ``(n_total,)``, integer cluster
             identifiers in ``{0, 1, ..., n_clusters - 1}``.

    Raises
    ------
    ValueError
        If ``true_mean`` is not in ``(0, 1)``.
    ValueError
        If ``proxy_mean`` is not in ``(0, 1)``.
    ValueError
        If the combination of ``true_mean``, ``proxy_mean``, and
        ``correlation`` is impossible (leads to negative joint probabilities).
    ValueError
        If ``n_clusters < 2``.
    ValueError
        If ``n_total < n_clusters``.

    Notes
    -----
    **Step 1 — Draw observations**

    Call ``generate_binary_dataset(n_total, ...)`` to obtain ``n_total``
    i.i.d. ``(y_true, y_proxy)`` pairs from the joint binary
    distribution defined by ``true_mean``, ``proxy_mean``, and
    ``correlation``.

    **Step 2 — Random cluster partition**

    Draw ``n_clusters - 1`` cut positions uniformly without replacement from
    ``{1, 2, ..., n_total - 1}`` and sort them. Combined with ``0`` and
    ``n_total``, these define ``n_clusters`` contiguous intervals of random
    lengths that sum to ``n_total``. Assign cluster identifier ``k`` to all
    observations whose position falls in the ``k``-th interval. Every cluster
    contains at least 1 observation by construction.

    **Step 3 — Shuffle**

    Randomly permute the cluster identifier array so that cluster membership is
    not determined by position in the output.

    Examples
    --------
    >>> import numpy as np
    >>> from glide.simulators import generate_clustered_binary_dataset
    >>> y_true, y_proxy, clusters = generate_clustered_binary_dataset(
    ...     n_total=10, n_clusters=4, random_seed=0
    ... )
    >>> y_true
    array([1., 1., 1., 0., 1., 1., 0., 1., 0., 1.])
    >>> y_proxy
    array([1., 0., 1., 0., 1., 1., 0., 1., 0., 1.])
    >>> clusters
    array([3, 0, 3, 1, 0, 3, 3, 2, 0, 0])
    """
    _validate_bounds(n_clusters, "n_clusters", lower=2, error_message=f"'n_clusters' must be >= 2; got {n_clusters}.")
    _validate_bounds(
        n_total,
        "n_total",
        lower=n_clusters,
        error_message=f"'n_total' must be >= 'n_clusters'; got n_total={n_total} and n_clusters={n_clusters}.",
    )

    if isinstance(random_seed, np.random.SeedSequence):
        seed_sequence = random_seed
    else:
        seed_sequence = np.random.SeedSequence(random_seed)
    data_seed, partition_seed = seed_sequence.spawn(2)

    y_true, y_proxy = generate_binary_dataset(
        n_total=n_total,
        true_mean=true_mean,
        proxy_mean=proxy_mean,
        correlation=correlation,
        random_seed=data_seed,
    )

    rng = np.random.default_rng(partition_seed)

    cut_positions = np.sort(rng.choice(n_total - 1, size=n_clusters - 1, replace=False) + 1)
    interval_lengths = np.diff(np.hstack([[0], cut_positions, [n_total]]))
    clusters = np.repeat(np.arange(n_clusters, dtype=np.int64), interval_lengths)
    rng.shuffle(clusters)

    return y_true, y_proxy, clusters
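
The partition and shuffle steps can be tried in isolation. The following is a minimal sketch that uses only NumPy; `random_cluster_partition` is a hypothetical helper written for illustration and is not part of `glide`.

```python
import numpy as np


def random_cluster_partition(n_total, n_clusters, seed=None):
    """Assign each of n_total positions to one of n_clusters non-empty clusters.

    Hypothetical helper for illustration only; not part of glide.
    """
    rng = np.random.default_rng(seed)
    # Step 2: pick n_clusters - 1 distinct cut positions in {1, ..., n_total - 1}.
    cuts = np.sort(rng.choice(n_total - 1, size=n_clusters - 1, replace=False) + 1)
    # Consecutive differences of [0, cuts..., n_total] are the interval lengths.
    lengths = np.diff(np.concatenate(([0], cuts, [n_total])))
    # Label k is repeated lengths[k] times, so every cluster has at least 1 member.
    labels = np.repeat(np.arange(n_clusters), lengths)
    # Step 3: shuffle so cluster membership does not depend on position.
    rng.shuffle(labels)
    return labels, lengths


labels, lengths = random_cluster_partition(n_total=10, n_clusters=4, seed=0)
print(lengths.sum())                      # total number of observations
print(np.bincount(labels, minlength=4))   # cluster sizes after shuffling
```

Because the cut positions are distinct and lie strictly between `0` and `n_total`, every interval length is at least 1, which is why no cluster can come out empty.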

generate_gaussian_dataset

generate_gaussian_dataset(
    n_labeled,
    n_unlabeled,
    true_mean=0.7,
    true_std=1,
    proxy_mean=0.6,
    proxy_std=1,
    correlation=0.8,
    random_seed=None,
)

Generate a synthetic Gaussian dataset for evaluation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_labeled | int | Number of samples with both true and proxy labels (the labeled subset). | required |
| n_unlabeled | int | Number of samples with proxy labels only (the unlabeled subset). | required |
| true_mean | float | Mean of the true label distribution. | 0.7 |
| true_std | float | Standard deviation of the true label distribution. | 1 |
| proxy_mean | float | Mean of the proxy label distribution. | 0.6 |
| proxy_std | float | Standard deviation of the proxy label distribution. | 1 |
| correlation | float | Pearson correlation between true and proxy labels. | 0.8 |
| random_seed | int | Seed for reproducibility. | None |

Returns:

Type Description
Tuple[NDArray, NDArray]

[0]: array of shape (n_labeled+n_unlabeled,), y_true with labeled values and NaN for unlabeled rows.
[1]: array of shape (n_labeled+n_unlabeled,), y_proxy with all values present.

Notes

Target distribution

The goal is to jointly sample (y_true, y_proxy) from a bivariate Gaussian:

(y_true, y_proxy) ~ N(μ, Σ)

where:

μ = (true_mean, proxy_mean)

Σ = [[true_std²,                          ρ · true_std · proxy_std],
     [ρ · true_std · proxy_std,           proxy_std²              ]]

and ρ is the target Pearson correlation.

Step 1 — Cholesky decomposition of Σ

To sample from N(0, Σ), we find a lower-triangular matrix L such that Σ = L @ Lᵀ (Cholesky factor). The construction uses the angle θ = arccos(ρ), so that cos(θ) = ρ and sin(θ) = √(1 - ρ²):

L = [[true_std,                  0                  ],
     [proxy_std · cos(θ),        proxy_std · sin(θ) ]]

One can verify L @ Lᵀ = Σ directly:

L @ Lᵀ = [[true_std²,                    true_std · proxy_std · cos(θ)],
          [true_std · proxy_std · cos(θ), proxy_std² · (cos²(θ)+sin²(θ))]]

       = [[true_std²,                    true_std · proxy_std · ρ],
          [true_std · proxy_std · ρ,     proxy_std²              ]]  = Σ
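
The same identity can be checked numerically. This is a small sketch with illustrative parameter values (the numbers are assumptions, not library defaults); it also compares the hand-built factor with `np.linalg.cholesky`, which returns the lower-triangular factor.

```python
import numpy as np

# Illustrative values; any std > 0 and -1 < rho < 1 would do.
true_std, proxy_std, rho = 1.5, 0.8, 0.6

theta = np.arccos(rho)
L = np.array([
    [true_std, 0.0],
    [proxy_std * np.cos(theta), proxy_std * np.sin(theta)],
])
Sigma = np.array([
    [true_std**2, rho * true_std * proxy_std],
    [rho * true_std * proxy_std, proxy_std**2],
])

# The hand-built factor reproduces Sigma.
factor_ok = np.allclose(L @ L.T, Sigma)
# It also agrees with NumPy's own lower-triangular Cholesky factor.
matches_numpy = np.allclose(L, np.linalg.cholesky(Sigma))
print(factor_ok, matches_numpy)
```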

Step 2 — Sampling via the linear transform

Let Z be a 2 × (n_labeled+n_unlabeled) matrix whose entries are i.i.d. standard normals Z_i ~ N(0, 1). Then:

Y = L @ Z

gives a 2 × (n_labeled+n_unlabeled) matrix where each column is a zero-mean sample from N(0, Σ). In component form, each column (Z₁, Z₂) maps to:

Y₁ = true_std · Z₁
Y₂ = proxy_std · cos(θ) · Z₁ + proxy_std · sin(θ) · Z₂

The resulting properties are:

- Var(Y₁) = true_std² and Var(Y₂) = proxy_std² (correct marginal variances)
- Cov(Y₁, Y₂) = true_std · proxy_std · cos(θ) = true_std · proxy_std · ρ
- Corr(Y₁, Y₂) = ρ (correct Pearson correlation)
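
These properties can be checked empirically on a large sample. The sketch below uses illustrative parameter values and a fixed seed (both are assumptions chosen for the demonstration); the sample statistics should land close to, but not exactly on, the targets.

```python
import numpy as np

true_std, proxy_std, rho = 1.0, 2.0, 0.8   # illustrative values
n = 200_000

rng = np.random.default_rng(0)
theta = np.arccos(rho)
L = np.array([
    [true_std, 0.0],
    [proxy_std * np.cos(theta), proxy_std * np.sin(theta)],
])

Z = rng.standard_normal(size=(2, n))   # i.i.d. N(0, 1) entries
Y = L @ Z                              # each column is a draw from N(0, Sigma)

sample_std = Y.std(axis=1)             # close to (true_std, proxy_std)
sample_corr = np.corrcoef(Y)[0, 1]     # close to rho
print(sample_std, sample_corr)
```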

Step 3 — Shifting by the means

Adding the desired means shifts the distribution to N(μ, Σ):

y_true  = true_mean  + Y[0, :]
y_proxy = proxy_mean + Y[1, :]

The first n_labeled columns form the labeled set (both y_true and y_proxy are observed); columns n_labeled through n_labeled+n_unlabeled-1 form the unlabeled set (only y_proxy is observed).
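
Since the unlabeled positions are marked with NaN in `y_true`, the labeled/unlabeled split can be recovered from the returned arrays alone. A minimal sketch, using random stand-in values rather than the actual generator output:

```python
import numpy as np

n_labeled, n_unlabeled = 3, 5
rng = np.random.default_rng(42)

# Stand-ins for the shifted samples; real values would come from Steps 1 to 3.
y_true = rng.normal(loc=0.7, size=n_labeled + n_unlabeled)
y_proxy = rng.normal(loc=0.6, size=n_labeled + n_unlabeled)

# Hide the true label at every unlabeled position.
y_true[n_labeled:] = np.nan

# Downstream code can recover the split from the NaN pattern.
labeled = ~np.isnan(y_true)
y_true_labeled = y_true[labeled]
y_proxy_labeled = y_proxy[labeled]
y_proxy_unlabeled = y_proxy[~labeled]
print(labeled.sum(), y_proxy_unlabeled.shape)
```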

Examples:

>>> import numpy as np
>>> from glide.simulators import generate_gaussian_dataset
>>> y_true, y_proxy = generate_gaussian_dataset(n_labeled=3, n_unlabeled=5, random_seed=42)
>>> len(y_true)
8
>>> int(np.sum(~np.isnan(y_true)))
3
>>> int(np.sum(~np.isnan(y_proxy)))
8
Source code in glide/simulators/gaussian.py
def generate_gaussian_dataset(
    n_labeled: int,
    n_unlabeled: int,
    true_mean: float = 0.7,
    true_std: float = 1,
    proxy_mean: float = 0.6,
    proxy_std: float = 1,
    correlation: float = 0.8,
    random_seed: Optional[int] = None,
) -> Tuple[NDArray, NDArray]:
    """Generate a synthetic Gaussian dataset for evaluation.

    Parameters
    ----------
    n_labeled : int
        Number of samples with both true and proxy labels (the labeled subset).
    n_unlabeled : int
        Number of samples with proxy labels only (the unlabeled subset).
    true_mean : float
        Mean of the true label distribution.
    true_std : float
        Standard deviation of the true label distribution.
    proxy_mean : float
        Mean of the proxy label distribution.
    proxy_std : float
        Standard deviation of the proxy label distribution.
    correlation : float
        Pearson correlation between true and proxy labels.
    random_seed : int, optional
        Seed for reproducibility.

    Returns
    -------
    Tuple[NDArray, NDArray]
        [0]: array of shape ``(n_labeled+n_unlabeled,)``, y_true with labeled values and NaN for unlabeled rows
        [1]: array of shape ``(n_labeled+n_unlabeled,)``, y_proxy with all values present

    Notes
    -----
    **Target distribution**

    The goal is to jointly sample ``(y_true, y_proxy)`` from a bivariate Gaussian:

    ```
    (y_true, y_proxy) ~ N(μ, Σ)
    ```

    where:

    ```
    μ = (true_mean, proxy_mean)

    Σ = [[true_std²,                          ρ · true_std · proxy_std],
         [ρ · true_std · proxy_std,           proxy_std²              ]]
    ```

    and ``ρ`` is the target Pearson correlation.

    **Step 1 — Cholesky decomposition of Σ**

    To sample from ``N(0, Σ)``, we find a lower-triangular matrix ``L`` such that
    ``Σ = L @ Lᵀ`` (Cholesky factor). The construction uses the angle
    ``θ = arccos(ρ)``, so that ``cos(θ) = ρ`` and ``sin(θ) = √(1 - ρ²)``:

    ```
    L = [[true_std,                  0                  ],
         [proxy_std · cos(θ),        proxy_std · sin(θ) ]]
    ```

    One can verify ``L @ Lᵀ = Σ`` directly:

    ```
    L @ Lᵀ = [[true_std²,                    true_std · proxy_std · cos(θ)],
              [true_std · proxy_std · cos(θ), proxy_std² · (cos²(θ)+sin²(θ))]]

           = [[true_std²,                    true_std · proxy_std · ρ],
              [true_std · proxy_std · ρ,     proxy_std²              ]]  = Σ
    ```

    **Step 2 — Sampling via the linear transform**

    Let ``Z`` be a ``2 × (n_labeled+n_unlabeled)`` matrix whose entries are i.i.d. standard normals
    ``Z_i ~ N(0, 1)``. Then:

    ```
    Y = L @ Z
    ```

    gives a ``2 × (n_labeled+n_unlabeled)`` matrix where each column is a zero-mean sample from
    ``N(0, Σ)``. In component form, each column ``(Z₁, Z₂)`` maps to:

    ```
    Y₁ = true_std · Z₁
    Y₂ = proxy_std · cos(θ) · Z₁ + proxy_std · sin(θ) · Z₂
    ```

    The resulting properties are:
    - ``Var(Y₁) = true_std²`` and ``Var(Y₂) = proxy_std²`` (correct marginal variances)
    - ``Cov(Y₁, Y₂) = true_std · proxy_std · cos(θ) = true_std · proxy_std · ρ``
    - ``Corr(Y₁, Y₂) = ρ`` (correct Pearson correlation)

    **Step 3 — Shifting by the means**

    Adding the desired means shifts the distribution to ``N(μ, Σ)``:

    ```
    y_true  = true_mean  + Y[0, :]
    y_proxy = proxy_mean + Y[1, :]
    ```

    The first ``n_labeled`` columns form the labeled set (both ``y_true`` and ``y_proxy``
    are observed); columns ``n_labeled`` through ``n_labeled+n_unlabeled-1`` form the unlabeled set
    (only ``y_proxy`` is observed).

    Examples
    --------
    >>> import numpy as np
    >>> from glide.simulators import generate_gaussian_dataset
    >>> y_true, y_proxy = generate_gaussian_dataset(n_labeled=3, n_unlabeled=5, random_seed=42)
    >>> len(y_true)
    8
    >>> int(np.sum(~np.isnan(y_true)))
    3
    >>> int(np.sum(~np.isnan(y_proxy)))
    8
    """
    _validate_bounds(correlation, "correlation", lower=-1, upper=1)
    rng = np.random.default_rng(seed=random_seed)
    angle = np.arccos(correlation)
    lin_transform = np.array([[true_std, 0], [proxy_std * np.cos(angle), proxy_std * np.sin(angle)]])

    Y = lin_transform @ rng.standard_normal(size=(2, n_labeled + n_unlabeled))

    y_true = true_mean + Y[0, :].copy()
    y_true[n_labeled:] = np.nan
    y_proxy = proxy_mean + Y[1, :]

    return y_true, y_proxy

generate_stratified_binary_dataset

generate_stratified_binary_dataset(
    n_total,
    true_mean,
    proxy_mean,
    correlation,
    random_seed=None,
)

Generate a synthetic stratified binary-label oracle dataset.

Generate multiple strata with potentially different parameters (true_mean, proxy_mean, correlation, n_total per stratum). This enables simulation of heterogeneous data where different groups have different proxy-truth relationships.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_total | List[int] | Total number of samples per stratum. All samples have both true and proxy labels. Length must equal number of strata. | required |
| true_mean | List[float] | Expected mean value of the true labels per stratum. Length must equal number of strata. | required |
| proxy_mean | List[float] | Expected mean value of the proxy labels per stratum. Length must equal number of strata. | required |
| correlation | List[float] | Pearson correlation between true and proxy per stratum. Length must equal number of strata. | required |
| random_seed | int | Seed for reproducibility. If provided, per-stratum seeds are derived deterministically from it. | None |

Returns:

Type Description
Tuple[NDArray, NDArray, NDArray]

Let N = sum(n_total) be the total number of samples across all strata.

[0]: array of shape (N,), y_true containing ground-truth labels.
[1]: array of shape (N,), y_proxy containing proxy labels.
[2]: array of shape (N,), integer stratum identifiers.

Raises:

| Type | Description |
| --- | --- |
| ValueError | If input lists have different lengths. |
| ValueError | If fewer than 1 stratum is specified. |
| ValueError | If any stratum has invalid parameters (see generate_binary_dataset). |

Examples:

>>> import numpy as np
>>> from glide.simulators import generate_stratified_binary_dataset
>>> y_true, y_proxy, groups = generate_stratified_binary_dataset(
...     n_total=[6, 8],
...     true_mean=[0.6, 0.8],
...     proxy_mean=[0.5, 0.7],
...     correlation=[0.7, 0.75],
...     random_seed=42
... )
>>> len(y_true)
14
>>> len(groups)
14
>>> len(y_proxy)
14
>>> bool(np.all(np.isin(y_true, [0.0, 1.0])))
True
>>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
True
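
To see the per-stratum behaviour without installing the package, the idea can be sketched with NumPy alone. Both functions below are hypothetical stand-ins written for illustration: `sample_binary_pairs` follows the joint-probability formulas given in the `generate_binary_dataset` notes, and `stratified_sketch` stacks one draw per stratum. Neither is the library's implementation.

```python
import numpy as np


def sample_binary_pairs(n, p_t, p_p, corr, rng):
    """Draw n (y_true, y_proxy) pairs; hypothetical stand-in, not glide code."""
    d = np.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
    p11 = corr * d + p_t * p_p
    # Cell order: p00, p01, p10, p11, so cell index = 2 * y_true + y_proxy.
    probs = np.array([1 - p_t - p_p + p11, p_p - p11, p_t - p11, p11])
    if np.any(probs < 0):
        raise ValueError("impossible combination of means and correlation")
    cell = rng.choice(4, size=n, p=probs)
    return (cell // 2).astype(float), (cell % 2).astype(float)


def stratified_sketch(n_total, true_mean, proxy_mean, correlation, seed=None):
    # One independent child seed per stratum, derived from a single parent seed.
    children = np.random.SeedSequence(seed).spawn(len(n_total))
    y_true, y_proxy, groups = [], [], []
    for k, child in enumerate(children):
        yt, yp = sample_binary_pairs(
            n_total[k], true_mean[k], proxy_mean[k], correlation[k],
            np.random.default_rng(child),
        )
        y_true.append(yt)
        y_proxy.append(yp)
        groups.append(np.full(n_total[k], k, dtype=np.int64))
    return np.hstack(y_true), np.hstack(y_proxy), np.hstack(groups)


y_true, y_proxy, groups = stratified_sketch(
    n_total=[4000, 6000],
    true_mean=[0.6, 0.8],
    proxy_mean=[0.5, 0.7],
    correlation=[0.7, 0.75],
    seed=42,
)
# Per-stratum sample means should sit near the requested means.
for k in np.unique(groups):
    in_k = groups == k
    print(k, y_true[in_k].mean(), y_proxy[in_k].mean())
```

Larger strata are used here than in the doctest so that the sample means are visibly close to the targets.
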
Source code in glide/simulators/stratified_binary.py
def generate_stratified_binary_dataset(
    n_total: List[int],
    true_mean: List[float],
    proxy_mean: List[float],
    correlation: List[float],
    random_seed: Optional[int] = None,
) -> Tuple[NDArray, NDArray, NDArray]:
    """Generate a synthetic stratified binary-label oracle dataset.

    Generate multiple strata with potentially different parameters (true_mean, proxy_mean,
    correlation, n_total per stratum). This enables simulation of heterogeneous data where
    different groups have different proxy-truth relationships.

    Parameters
    ----------
    n_total : List[int]
        Total number of samples per stratum. All samples have both true and proxy labels.
        Length must equal number of strata.
    true_mean : List[float]
        Expected mean value of the true labels per stratum.
        Length must equal number of strata.
    proxy_mean : List[float]
        Expected mean value of the proxy labels per stratum.
        Length must equal number of strata.
    correlation : List[float]
        Pearson correlation between true and proxy per stratum.
        Length must equal number of strata.
    random_seed : int, optional
        Seed for reproducibility. If provided, per-stratum seeds are derived deterministically from it.

    Returns
    -------
    Tuple[NDArray, NDArray, NDArray]
        Let ``N = sum(n_total)`` be the total number of samples across all strata.

        [0]: array of shape ``(N,)``, y_true containing ground-truth labels.
        [1]: array of shape ``(N,)``, y_proxy containing proxy labels.
        [2]: array of shape ``(N,)``, integer stratum identifiers.

    Raises
    ------
    ValueError
        If input lists have different lengths.
    ValueError
        If fewer than 1 stratum is specified.
    ValueError
        If any stratum has invalid parameters (see generate_binary_dataset).

    Examples
    --------
    >>> import numpy as np
    >>> from glide.simulators import generate_stratified_binary_dataset
    >>> y_true, y_proxy, groups = generate_stratified_binary_dataset(
    ...     n_total=[6, 8],
    ...     true_mean=[0.6, 0.8],
    ...     proxy_mean=[0.5, 0.7],
    ...     correlation=[0.7, 0.75],
    ...     random_seed=42
    ... )
    >>> len(y_true)
    14
    >>> len(groups)
    14
    >>> len(y_proxy)
    14
    >>> bool(np.all(np.isin(y_true, [0.0, 1.0])))
    True
    >>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
    True
    """
    _validate_non_empty(n_total, "n_total")
    num_strata = len(n_total)

    _validate_equal_lengths(
        np.array(n_total),
        np.array(true_mean),
        np.array(proxy_mean),
        np.array(correlation),
        names=["n_total", "true_mean", "proxy_mean", "correlation"],
    )

    # Generate data for each stratum
    y_true_per_stratum = []
    y_proxy_per_stratum = []
    groups_per_stratum = []

    seed_sequence = np.random.SeedSequence(random_seed)
    seeds = seed_sequence.spawn(num_strata)

    for stratum_id in range(num_strata):
        y_true_k, y_proxy_k = generate_binary_dataset(
            n_total=n_total[stratum_id],
            true_mean=true_mean[stratum_id],
            proxy_mean=proxy_mean[stratum_id],
            correlation=correlation[stratum_id],
            random_seed=seeds[stratum_id],
        )
        y_true_per_stratum.append(y_true_k)
        y_proxy_per_stratum.append(y_proxy_k)
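        # Build integer stratum ids explicitly; np.full_like would inherit the dtype of y_true_k.
        groups_per_stratum.append(np.full(len(y_true_k), stratum_id, dtype=np.int64))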

    y_true = np.hstack(y_true_per_stratum)
    y_proxy = np.hstack(y_proxy_per_stratum)
    groups = np.hstack(groups_per_stratum)

    return y_true, y_proxy, groups
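
The seeding pattern used in the loop above (one parent `SeedSequence`, one spawned child per stratum) can be examined on its own. In this sketch, `draws_per_stratum` is a hypothetical helper that draws a few uniforms per child seed; it only illustrates the reproducibility property and is not part of `glide`.

```python
import numpy as np


def draws_per_stratum(seed, num_strata, n=5):
    """Draw n uniforms per stratum from independently seeded generators."""
    children = np.random.SeedSequence(seed).spawn(num_strata)
    return [np.random.default_rng(child).random(n) for child in children]


first = draws_per_stratum(seed=42, num_strata=3)
second = draws_per_stratum(seed=42, num_strata=3)

# Same parent seed: every stratum reproduces its draws exactly.
reproducible = all(np.array_equal(a, b) for a, b in zip(first, second))
# Different children: strata do not share a random stream.
distinct = not np.array_equal(first[0], first[1])
print(reproducible, distinct)
```

Spawning children keeps each stratum's stream independent of the others, which is the usual reason to prefer it over reusing one integer seed for every stratum.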

simulate_annotation

simulate_annotation(y_true_oracle, xi)

Reveal oracle labels where annotated and mask the rest as NaN.

Given a full oracle label array and an annotation indicator, returns an array where labels are kept for annotated elements (xi == 1) and set to np.nan for unannotated ones (xi is 0 or NaN). The input arrays are not mutated.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| y_true_oracle | NDArray | Full oracle ground-truth labels for all elements. | required |
| xi | NDArray | Annotation indicator of the same length. A value of 1 means the element was sent to a human annotator; 0 or np.nan means it was not. | required |

Returns:

Type Description
NDArray

Array of the same length as y_true_oracle, with oracle values where xi == 1 and np.nan where xi is 0 or NaN.

Raises:

| Type | Description |
| --- | --- |
| ValueError | If y_true_oracle and xi have different lengths. |
| ValueError | If y_true_oracle contains NaN values. |
| ValueError | If xi contains values other than 0, 1, and np.nan. |

Examples:

>>> import numpy as np
>>> from glide.simulators import simulate_annotation
>>> y_true_oracle = np.array([0, 1, 1, 0])
>>> xi = np.array([1, 0, 1, np.nan])
>>> simulate_annotation(y_true_oracle, xi)
array([ 0., nan,  1., nan])
Source code in glide/simulators/annotation.py
def simulate_annotation(
    y_true_oracle: NDArray,
    xi: NDArray,
) -> NDArray:
    """Reveal oracle labels where annotated and mask the rest as NaN.

    Given a full oracle label array and an annotation indicator, returns an array where labels
    are kept for annotated elements (``xi == 1``) and set to ``np.nan`` for unannotated ones
    (``xi`` is ``0`` or NaN). The input arrays are not mutated.

    Parameters
    ----------
    y_true_oracle : NDArray
        Full oracle ground-truth labels for all elements.
    xi : NDArray
        Annotation indicator of the same length. A value of ``1`` means the element was sent
        to a human annotator; ``0`` or ``np.nan`` means it was not.

    Returns
    -------
    NDArray
        Array of the same length as ``y_true_oracle``, with oracle values where ``xi == 1``
        and ``np.nan`` where ``xi`` is ``0`` or NaN.

    Raises
    ------
    ValueError
        If ``y_true_oracle`` and ``xi`` have different lengths.
    ValueError
        If ``y_true_oracle`` contains NaN values.
    ValueError
        If ``xi`` contains values other than ``0``, ``1``, and ``np.nan``.

    Examples
    --------
    >>> import numpy as np
    >>> from glide.simulators import simulate_annotation
    >>> y_true_oracle = np.array([0, 1, 1, 0])
    >>> xi = np.array([1, 0, 1, np.nan])
    >>> simulate_annotation(y_true_oracle, xi)
    array([ 0., nan,  1., nan])
    """
    _validate_equal_lengths(y_true_oracle, xi, names=["y_true_oracle", "xi"])
    _validate_has_no_nan(y_true_oracle, "y_true_oracle")
    xi_float = xi.astype(float)
    _validate_binary_or_nan(xi, "xi")

    y_true = y_true_oracle.astype(float)
    y_true[xi_float != 1] = np.nan
    return y_true
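
The mask is written as `xi_float != 1` rather than as an equality test against NaN, because NaN never compares equal to anything. The following NumPy-only sketch illustrates that behaviour with the values from the docstring example; the variable names are chosen here for illustration.

```python
import numpy as np

xi = np.array([1, 0, 1, np.nan])
y_true_oracle = np.array([0, 1, 1, 0])

# NaN never compares equal to anything, itself included.
nan_by_equality = xi == np.nan      # all False
nan_by_isnan = np.isnan(xi)         # True only at the last position

# Masking on "not equal to 1" covers both 0 and NaN in a single test.
y_observed = y_true_oracle.astype(float)   # astype copies, so the input is untouched
y_observed[xi != 1] = np.nan
print(nan_by_equality.any(), nan_by_isnan, y_observed)
```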