Question about converting audio to images

ForBo7 · January 31, 2023, 6:46am

Hello.

I’m figuring out how to efficiently convert many audio files into spectrogram images, and a thought came into my mind.

When inputting an image to a model, what’s input is a matrix of pixel values. That means when I want to convert my audio files, I don’t really need to save them as images, right? I can just save them as PyTorch tensors/pickle files/csv files to disk, right?
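To make the idea concrete, here is a minimal sketch of saving one spectrogram as a PyTorch tensor file and reading it back. The random tensor, its shape and the file name are placeholders I made up for illustration; a real spectrogram would come from whatever audio library is doing the conversion.

```python
import tempfile
from pathlib import Path

import torch

# Stand-in for a real spectrogram: 1 channel, 128 frequency bins, 431 time frames.
# The shape is an assumption for illustration only.
spectrogram = torch.rand(1, 128, 431)

# Write to a throwaway directory so the sketch leaves nothing behind in the project.
out_dir = Path(tempfile.mkdtemp())
tensor_path = out_dir / "example_call.pt"  # hypothetical file name

# torch.save serialises the tensor as-is, with no image encoding step.
torch.save(spectrogram, tensor_path)

# torch.load restores the same values, shape and dtype.
restored = torch.load(tensor_path)

print(tuple(restored.shape), restored.dtype)
print("identical after round trip:", torch.equal(spectrogram, restored))
```

Since nothing is quantised to 8-bit pixel values, the round trip is exact, which is the main appeal over an image format.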

But then another question arises: will fastai’s [image] DataBlock be able to read these files appropriately? I assume that the DataBlock would expect image files, but I’m not quite sure if it would accept images that are stored as PyTorch tensor files/pickle files/csv files.

I would really appreciate clarification!

MikeGallimore · February 1, 2023, 6:44am

Hi, I’m working on the same problem! I’ll let you know if I get anywhere with it.

s.s.o · February 1, 2023, 10:18am

You can find some examples on Kaggle and in the forums.

ForBo7 · February 2, 2023, 7:08am

I did want to use fastaudio, but it’s no longer maintained and depends on an old version of fastai. I did find a couple of notebooks there, but I should probably poke around some more.

ForBo7 · February 5, 2023, 1:43pm

I asked my question above because converting the audio files into PNGs took too long: I have around 14,800 audio files and each conversion takes around 1.5 seconds, so in total it would have taken around 6 hours. I could have used JPG, but there would be a loss in detail.

Because of that, I was wondering whether I could instead save the spectrograms as PyTorch tensors directly, and whether fastai’s DataBlock could read these tensors as images.

While I didn’t explore that route, I searched online for alternative file formats and came across the TIFF format. It keeps the same level of detail as PNG, but the conversion speed was as fast as that of a JPG (around 100 ms; so approximately 26 minutes in total).
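For anyone who wants to redo the time estimates with their own numbers, the arithmetic is just file count times per-file time. The file count and the two per-file times below are the rough figures quoted in this thread, not fresh measurements.

```python
N_FILES = 14_800  # approximate number of audio files quoted in the thread


def total_seconds(n_files, seconds_per_file):
    """Total conversion time when files are processed one after another."""
    return n_files * seconds_per_file


png_hours = total_seconds(N_FILES, 1.5) / 3600   # about 1.5 s per PNG
tiff_minutes = total_seconds(N_FILES, 0.1) / 60  # about 100 ms per TIFF

print(f"PNG:  {png_hours:.1f} hours")
print(f"TIFF: {tiff_minutes:.1f} minutes")
```

At exactly 100 ms per file the TIFF total comes out just under 25 minutes, so a figure of about 26 minutes corresponds to a per-file time slightly above 100 ms.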

The drawback, though, is that TIFF files are HUGE. The same spectrogram that was around 3 MB as a PNG was around 8 MB as a TIFF.
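The speed and size trade-off can be checked on any machine with a small script like the one below, which saves the same pixel data as PNG and as TIFF using Pillow and reports the save time and file size for each. The pixel data here is random noise at an arbitrary size, purely a stand-in, so the numbers it prints will not match those of a real spectrogram.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
from PIL import Image

# Random 8-bit greyscale pixels as a stand-in for a spectrogram (assumed size).
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(512, 1024), dtype=np.uint8)
image = Image.fromarray(pixels)

out_dir = Path(tempfile.mkdtemp())


def save_and_measure(img, path):
    """Save img to path and return (seconds taken, file size in bytes)."""
    start = time.perf_counter()
    img.save(path)  # Pillow picks the format from the file extension
    elapsed = time.perf_counter() - start
    return elapsed, path.stat().st_size


results = {}
for extension in (".png", ".tiff"):
    results[extension] = save_and_measure(image, out_dir / f"spectrogram{extension}")

for extension, (seconds, size_bytes) in results.items():
    print(f"{extension}: {seconds * 1000:.1f} ms, {size_bytes / 1e6:.2f} MB")
```

Swapping the random array for a real spectrogram array is the only change needed to reproduce the comparison on actual data.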

If anybody would like to see the generated dataset, here’s the link:

I converted bird calls into spectrograms to attempt a classification model that identifies birds by their sound.