Basic batch iteration from arrays

BatchUp defines data sources from which we draw data in mini-batches. They are defined in the data_source module and the one that you will most frequency use is data_source.ArrayDataSource.

Simple batch iteration from NumPy arrays

This example will show how to draw mini-batches of samples from NumPy arrays in random order.

Assume we have a training set in the form of NumPy arrays in the variables train_X and train_y.

First, construct a data source that will draw data from train_X and train_y:

from batchup import data_source

# Construct an array data source
ds = data_source.ArrayDataSource([train_X, train_y])
ArrayDataSource notes:
  • train_X and train_y must have the same number of samples
  • you can use any number of arrays when building the ArrayDataSource

Now we can use the batch_iterator() method to create a batch iterator from which we can draw mini-batches of data:

# Iterate over samples, drawing batches of 64 elements in
# random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Processes batch_X and batch_y here...
Batch iterator notes:
  • the last batch will be short (have less samples than the requested batch size) if there isn’t enough data to fill it
  • the shuffle parameter:
    • using shuffle=True will use NumPy’s default random number generator
    • if no value is provided for shuffle, samples will be processed in-order

Note: we don’t have to use NumPy arrays; any array-like object will do; see Data from array-like objects (data accessors) for more.

Iterating over a subset of the samples

We can specify the indices of a subset of the samples in a dataset and draw mini-batches from only those samples:

import numpy as np

# Randomly choose a subset of 20,000 samples, by indices
subset_a = np.random.permutation(len(train_X))[:20000]

# Construct an array data source that will only draw samples whose indices are in `subset_a`
ds = data_source.ArrayDataSource([train_X, train_y], indices=subset_a)

# Drawing batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Processes batches here...

Getting the indices of sample in the mini-batches

We can ask to be provided with the indices of the samples that were drawn to form the mini-batch:

# Construct an array data source that will provide sample indices
ds = data_source.ArrayDataSource([train_X, train_y], include_indices=True)

# Drawing batches of 64 elements in random order
for (batch_ndx, batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Processes batches here; indices in batch_ndx

Batches from repeated/looped arrays

Lets say you need an iterator that extracts samples from your dataset and starts from the beginning when it reaches the end. Provide a value for the repeats argument of the ArrayDataSource constructor like so:

ds_times_5 = data_source.ArrayDataSource([train_X, train_y], repeats=5)

Now use the batch_iterator() method as before.

The repeats parameter accepts either -1 for infinite, or any positive integer >= 1 for a specified number of repetitions:

inf_ds = data_source.ArrayDataSource([train_X, train_y], repeats=-1)

This will also work if the dataset has less samples than the batch size; this is not a common use case but it can happen.