Sample weighting to alter likelihood of samples¶
BatchUp defines samplers that generate the indices of the samples that are combined to form
a mini-batch. They are defined in the sampling module.
When constructing a data source (e.g. ArrayDataSource) you can provide a sampler
that controls how samples are selected.
By default, one of the standard samplers (StandardSampler or SubsetSampler)
will be constructed if you don't provide one.
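To illustrate the role a sampler plays, here is a minimal NumPy-only sketch of what a standard shuffling sampler conceptually does: produce index arrays that are then used to slice the underlying data arrays into mini-batches. This is an illustration of the idea, not BatchUp's actual implementation.

```python
import numpy as np

def shuffled_batch_indices(n_samples, batch_size, rng):
    """Yield index arrays for mini-batches drawn in random order."""
    order = rng.permutation(n_samples)
    for i in range(0, n_samples, batch_size):
        yield order[i:i + batch_size]

X = np.arange(10).reshape((10, 1))
rng = np.random.RandomState(12345)
batches = [X[idx] for idx in shuffled_batch_indices(len(X), 4, rng)]
# Three batches of sizes 4, 4 and 2, covering every sample exactly once
```

A weighted sampler replaces the uniform permutation above with draws biased by per-sample weights, which is what the rest of this section covers.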
The weighted sampler¶
If you want some samples to be drawn more frequently than others, construct a WeightedSampler and pass
it as the sampler argument to the ArrayDataSource constructor. In the example below, the per-sample
weights are stored in train_w.
import numpy as np
from batchup import data_source, sampling

sampler = sampling.WeightedSampler(weights=train_w)
ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)
# Draw batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass
Note that in-order iteration is NOT supported when using WeightedSampler, so shuffle cannot be False or None.
To draw from a subset of the dataset, use WeightedSubsetSampler
:
import numpy as np
from batchup import data_source, sampling

# NOTE that the weights parameter is called `sub_weights` (rather
# than `weights`) and that it must have the same length as `indices`.
sampler = sampling.WeightedSubsetSampler(sub_weights=train_w[subset_a],
                                         indices=subset_a)
ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)
# Draw batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass
Counteracting class imbalance¶
An alternative constructor method, WeightedSampler.class_balancing_sampler(), is available for constructing
a weighted sampler that compensates for class imbalance:
import numpy as np
from batchup import data_source, sampling

# Construct the sampler; NOTE that the `n_classes` argument
# is *optional*
sampler = sampling.WeightedSampler.class_balancing_sampler(
    y=train_y, n_classes=train_y.max() + 1)
ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)
# Draw batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass
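Class-balancing weights can be understood as inverse class frequencies: each sample of class c gets a weight proportional to 1 / count(c), so every class carries the same total weight and is drawn with equal probability in expectation. A NumPy sketch of this idea, using a small hypothetical label array (the exact scaling BatchUp applies may differ):

```python
import numpy as np

train_y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])  # imbalanced labels
counts = np.bincount(train_y)          # per-class sample counts
weights = 1.0 / counts[train_y]        # inverse-frequency weight per sample

# The total weight of each class is now equal, so a weighted sampler
# draws each class with the same overall probability
per_class_total = np.array([weights[train_y == c].sum()
                            for c in range(counts.size)])
```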
The WeightedSampler.class_balancing_sample_weights() helper method constructs an array of sample
weights, in case you wish to modify the weights before building the sampler yourself:
import numpy as np
from batchup import data_source, sampling

weights = sampling.WeightedSampler.class_balancing_sample_weights(
    y=train_y, n_classes=train_y.max() + 1)
# Assume `modify_weights` is defined above
weights = modify_weights(weights)
# Construct the sampler and the data source
sampler = sampling.WeightedSampler(weights=weights)
ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)
# Draw batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass