Use ImageDataLoaders to transform audio files

jzwi · June 9, 2023, 2:54am

I want to perform audio classification and have a folder of .wav files.

Is it possible to use a set of custom transforms to feed these audio files directly to ImageDataLoaders?

A snippet of the idea is below. Essentially, I would like to define a transform pipeline that takes a dataframe of audio file paths and, as part of item_tfms=[transformPipeline], augments each audio file and converts it to a spectrogram in a single step.

def audioAugmentations(audioFile):
  # Placeholder: augment the audio here
  augmentedAudioFile = audioFile
  return augmentedAudioFile

def audioToSpectrogram(augmentedAudioFile):
  # Placeholder: convert the .wav audio to a spectrogram here
  spectrogram = augmentedAudioFile
  return spectrogram

def transformPipeline(audioFile):
  # Augment first, then convert the augmented audio to a spectrogram
  augmentedAudioFile = audioAugmentations(audioFile)
  spectrogram = audioToSpectrogram(augmentedAudioFile)
  return spectrogram

from fastai.vision.all import *

data = ImageDataLoaders.from_df(df,
                                path='path/to/audio/files/',
                                valid_pct=0.2,
                                label_col='label',
                                item_tfms=[transformPipeline],
                                batch_tfms=[Normalize.from_stats(*imagenet_stats)]
)

Unfortunately, I am currently running into the following error: UnidentifiedImageError: cannot identify image file 'audiofile.wav'. This seems to indicate that the inputs must already be images.

My concern is that if I have to create spectrograms of all the augmented audio before loading them into ImageDataLoaders, I will use up what little disk space I have.

I would like the augmented spectrograms to be generated from the .wav files in real time during training, which would save the disk space that the many extra images would otherwise take up.
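To make the goal concrete, here is a minimal sketch of the in-memory part of such a pipeline, using only NumPy and the standard-library wave module. The function names (load_wav, augment, to_spectrogram, audio_pipeline), the gain and noise augmentation, and the n_fft/hop values are all assumptions for illustration, and it assumes mono 16-bit PCM files. Nothing is written to disk; each call reads one .wav and returns a spectrogram array.

```python
import wave

import numpy as np


def load_wav(path):
    """Read a mono 16-bit PCM .wav into a float32 array in [-1, 1)."""
    with wave.open(str(path), "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit PCM")
        raw = wf.readframes(wf.getnframes())
    return np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0


def augment(signal, rng, noise_level=0.005):
    """Hypothetical augmentation: random gain plus Gaussian noise."""
    gain = rng.uniform(0.8, 1.2)
    noise = rng.normal(0.0, noise_level, size=signal.shape)
    return (gain * signal + noise).astype(np.float32)


def to_spectrogram(signal, n_fft=256, hop=128):
    """Log-magnitude STFT, shape (n_fft // 2 + 1, n_frames)."""
    if len(signal) < n_fft:
        # Zero-pad very short clips so there is at least one full frame.
        signal = np.pad(signal, (0, n_fft - len(signal)))
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Row i of idx holds the sample indices of frame i.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(mag).T.astype(np.float32)


def audio_pipeline(path, rng=None):
    """Load, optionally augment, and convert one file in memory."""
    signal = load_wav(path)
    if rng is not None:  # pass rng=None to skip augmentation (e.g. validation)
        signal = augment(signal, rng)
    return to_spectrogram(signal)
```

In fastai, a function like audio_pipeline could probably stand in for the image-opening step by building a DataBlock with a custom TransformBlock(type_tfms=...) paired with CategoryBlock, rather than ImageDataLoaders.from_df, which opens every item as an image file before item_tfms run. That wiring is not tested here, and the array would still need converting to a tensor with the channel layout the model expects.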