I work a lot with speech data, so setting up a data pipeline always requires some effort. The datasets are usually huge, and raw audio needs a fair amount of preprocessing before it becomes a useful model input. I tried out tf.data.Dataset in TensorFlow 2.1, and it makes things pretty smooth.

Here's how you set up a simple pipeline.

import tensorflow as tf

# dataset of all paths to files
# (note: list_files shuffles the file order by default)
file_list = tf.data.Dataset.list_files('path/to/data/*/*')

You can iterate over it.

for item in file_list:
    print(item)
# tf.Tensor(b'path/to/data/folder1/file1.wav', shape=(), dtype=string)
# tf.Tensor(b'path/to/data/folder1/file2.wav', shape=(), dtype=string)
# ...
# ...

Then create a function to load and preprocess each file.

# function to load and preprocess the file at a given path
def extract_audio_features(file_path):
    # read the raw bytes and decode them into a mono waveform of 8000 samples
    audio = tf.io.read_file(file_path)
    audio, sample_rate = tf.audio.decode_wav(audio, desired_channels=1, desired_samples=8000)
    signals = tf.squeeze(audio, axis=-1)
    # short-time Fourier transform, then take the magnitude
    stfts = tf.signal.stft(signals, frame_length=256, frame_step=128, fft_length=256)
    spectrograms = tf.math.pow(tf.abs(stfts), 0.5)
    return spectrograms
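
A quick way to sanity-check the function is to call it eagerly on a single file (the path below is just a placeholder; point it at one of your own .wav files). With the settings above, each clip becomes a 2-D tensor of STFT frames by frequency bins.

# placeholder path - substitute one of your own .wav files
example = extract_audio_features(tf.constant('path/to/data/folder1/file1.wav'))
print(example.shape)
# (61, 129): 61 STFT frames for 8000 samples with frame_length=256 / frame_step=128,
# and fft_length // 2 + 1 = 129 frequency bins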

# apply the above function to each file path
feats_dataset = file_list.map(extract_audio_features)
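
By default, map() processes one element at a time. If preprocessing turns out to be the bottleneck, you can let tf.data run it in parallel; this is a small optional tweak (in TF 2.1 the autotune constant lives under tf.data.experimental):

# let tf.data pick how many map calls to run in parallel
feats_dataset = file_list.map(extract_audio_features,
                              num_parallel_calls=tf.data.experimental.AUTOTUNE)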

# shuffle; the buffer size controls how many elements are sampled from at a time,
# so pick something that fits in memory
shuffle_buffer_size = 1000
feats_dataset = feats_dataset.shuffle(buffer_size=shuffle_buffer_size)

# repeat so you can iterate over the dataset n times
# count=-1 => loop indefinitely
feats_dataset = feats_dataset.repeat(count=-1)

# create batches
feats_dataset = feats_dataset.batch(4)

# prefetch the next few batches in memory, ready to be consumed by your model
feats_dataset = feats_dataset.prefetch(buffer_size=2)
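
If you'd rather not hand-pick the buffer size, prefetch() accepts the same autotune constant; again optional, not something the pipeline requires:

# alternatively, let tf.data decide how many batches to keep ready
feats_dataset = feats_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)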

That's it! Iterate over this in your training loop, or pass it to model.fit().

for data_batch in feats_dataset:
    train_step(data_batch)
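
If you go the model.fit() route instead, Keras expects the dataset to yield (inputs, targets) pairs, so you'd build the pipeline with a label per file. Here is a minimal sketch, assuming a hypothetical labelled_dataset that yields (spectrogram, label) batches with the (61, 129) spectrogram shape from above and 10 classes:

# labelled_dataset is hypothetical: built like feats_dataset, but each element
# is a (spectrogram, integer_label) pair
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(61, 129)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# the dataset repeats indefinitely, so tell fit() how many batches make an epoch
model.fit(labelled_dataset, epochs=5, steps_per_epoch=100)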

You don't have to use only TensorFlow ops inside your preprocessing function. You can wrap an arbitrary Python function with tf.py_function and use it in map(). Everything is a little faster, though, if you can manage with TensorFlow ops, since the whole pipeline can then run as a graph.
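
For example, here's a sketch of how you could do the loading with librosa instead of TensorFlow ops (assuming librosa is installed; any Python audio library works the same way):

import numpy as np
import librosa

# plain Python function: load the file and compute a mel spectrogram with librosa
def load_with_librosa(file_path):
    path = file_path.numpy().decode('utf-8')
    waveform, sr = librosa.load(path, sr=8000)
    return librosa.feature.melspectrogram(y=waveform, sr=sr).astype(np.float32)

# wrap it with tf.py_function so it can run inside .map()
def extract_with_py_function(file_path):
    return tf.py_function(load_with_librosa, inp=[file_path], Tout=tf.float32)

feats_dataset = file_list.map(extract_with_py_function)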

Each file read can be expensive, so this whole process will be slow if you have a large number of small files.
You can resolve this bottleneck by storing your dataset as TFRecord files, where each TFRecord holds the data from many small files, which drastically reduces the number of file reads. I will cover that in another post.