Sample weighting to alter likelihood of samples

BatchUp provides samplers that generate the indices of the samples that are combined to form each mini-batch. They are defined in the sampling module.

When constructing a data source (e.g. ArrayDataSource) you can provide a sampler that will control how the samples are selected.

By default, one of the standard samplers (StandardSampler or SubsetSampler) will be constructed if you don't provide one.
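For reference, here is a minimal sketch of the default behaviour, constructing an ArrayDataSource without an explicit sampler (train_X and train_y are assumed to be NumPy arrays defined elsewhere):

import numpy as np

from batchup import data_source

# No sampler is given, so a standard sampler is constructed internally
ds = data_source.ArrayDataSource([train_X, train_y])

# Draw batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass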

The weighted sampler

If you want some samples to be drawn more frequently than others, construct a WeightedSampler and pass it as the sampler argument to the ArrayDataSource constructor. In the example below, the per-sample weights are stored in train_w.

import numpy as np

from batchup import data_source, sampling

sampler = sampling.WeightedSampler(weights=train_w)

ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)

# Draw batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass

Note that in-order iteration is NOT supported when using WeightedSampler, so shuffle cannot be False or None.
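As a quick sketch of what the shuffle argument can and cannot be in this situation (the shuffle=True form is an assumption here, based on how the standard samplers accept a default random number generator):

# OK: an explicit random number generator
batch_iter = ds.batch_iterator(batch_size=64,
                               shuffle=np.random.RandomState(12345))

# Assumed OK: ask BatchUp to use a default random number generator
batch_iter = ds.batch_iterator(batch_size=64, shuffle=True)

# NOT supported with WeightedSampler: in-order iteration
# batch_iter = ds.batch_iterator(batch_size=64, shuffle=False)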

To draw from a subset of the dataset, use WeightedSubsetSampler:

import numpy as np

from batchup import data_source, sampling

# NOTE that the weights parameter is called `sub_weights` (rather
# than `weights`) and that it must have the same length as `indices`.
# `subset_a` is an array of indices identifying the subset to draw from.
sampler = sampling.WeightedSubsetSampler(sub_weights=train_w[subset_a],
                                         indices=subset_a)

ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)

# Draw batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass

Counteracting class imbalance

An alternative constructor method, WeightedSampler.class_balancing_sampler(), is available for building a weighted sampler that compensates for class imbalance:

import numpy as np

from batchup import data_source, sampling

# Construct the sampler; NOTE that the `n_classes` argument
# is *optional*
sampler = sampling.WeightedSampler.class_balancing_sampler(
    y=train_y, n_classes=train_y.max() + 1)

ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)

# Draw batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass
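To make the balancing effect concrete, the sketch below derives comparable per-sample weights by hand, assuming each sample is weighted by the inverse frequency of its class (the conventional approach to class balancing; the built-in helper may differ in normalisation details):

import numpy as np

from batchup import sampling

# Count how many samples belong to each class
class_counts = np.bincount(train_y, minlength=int(train_y.max()) + 1)

# Weight each sample by the inverse frequency of its class so that
# every class contributes roughly equally to the drawn mini-batches
manual_weights = 1.0 / class_counts[train_y].astype(float)

manual_sampler = sampling.WeightedSampler(weights=manual_weights)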

The WeightedSampler.class_balancing_sample_weights() helper method constructs the array of class-balancing sample weights itself, which is useful if you wish to modify the weights before constructing the sampler:

import numpy as np

from batchup import data_source, sampling

weights = sampling.WeightedSampler.class_balancing_sample_weights(
    y=train_y, n_classes=train_y.max() + 1)

# Assume `modify_weights` is defined elsewhere; a hypothetical
# example is sketched below
weights = modify_weights(weights)

# Construct the sampler and the data source
sampler = sampling.WeightedSampler(weights=weights)
ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)

# Draw batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass
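For illustration only, a hypothetical modify_weights might boost the draw probability of a class of particular interest; the function and its boosting rule below are placeholders, not part of BatchUp:

def modify_weights(weights, boost_class=3, factor=2.0):
    """Hypothetical example: boost the draw probability of one class."""
    weights = weights.copy()
    weights[train_y == boost_class] *= factor
    return weights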