Deep Learning with Audio Thread

Thank you for your reply. Do you know if Jeremy covers CoordConv in any of the fastai lessons? I got that all spectrograms should be generated with the same parameters and that no augmentation should be used on spectrograms. I am extremely confused about this whole square/rectangular issue. If we squash the spectrogram, we lose a lot of information, for example the frequency modulation pattern in a call; wouldn't that make the classification task more problematic? Why can't we use a rectangular spectrogram where this information is preserved? Cutting the max and min frequencies could be problematic too: some calls have high harmonics all the way up, and some calls have a lower fundamental than average, so we really need the full sampling range. Plus, there is a difference between a spectrogram of a specific call and a spectrogram of a 10-second snippet of the whole auditory scene (with or without a call in it).

Also, I do not fully understand why spectrograms cannot be treated as a type of image. This post seems to argue that they cannot be.

2 Likes

I was wondering if there are any models in PyTorch pretrained on audio data. I found one in TensorFlow: https://github.com/tensorflow/models/tree/master/research/audioset

2 Likes

Very cool idea, I saw it mentioned earlier that some people were interested in speech analysis. That’s what I’m interested in as well. I’d love to see how others are approaching audio prediction.

Thank you! This is great!

What I am trying to understand is: can we really make square spectrograms just by setting figsize? A spectrogram is basically a time vs. frequency plot, and by making it square we distort the time axis (and maybe the frequency axis too?). I am not sure about music, but for bioacoustic signals this distortion could be problematic, I think.

Here is a new paper on using CNNs for bioacoustic classification.

They did not just make square spectrograms from the sound files; instead they cut the sound files into what they call “fixed-length sound frames”, which seems to be a somewhat involved process. After that cutting they computed an STFT for each frame, using a Hamming window, an FFT length of 1024, etc.:

All the denoised sounds in the data set are sequentially cut into sound frames with a duration of td (no overlapping between adjacent frames). The sound frame with a length of less than td at the end of the sound file is discarded. The Short Time Fourier Transform (STFT), with Hamming window, a segment length of td/40, segment shift of td/80 and FFT length of 1024 samples, is computed for each sound frame. In order to show more details in the spectrogram, the STFT coefficients are logarithmized by Eq. (1).

Z = log10(|Z|)    (1)

where Z is the STFT coefficients matrix for each sound frame. If the value of td is too small, some short-term pulse interference may also be misdetected; if the value of td is too large, the signal detection accuracy is lowered. Based on the durations of the whistles from both whale species, td is set to 250 ms. In addition, the time interval between most adjacent whistles is greater than td, so the paper does not discuss the case where two whistles are falsely detected as a whole whistle due to the short signal interval (<td).

Further, for each sound frame, based on the preprocessed STFT coefficients Z, a frame spectrogram (grayscale) of 180 * 120 pixels is obtained by the pcolormesh method in matplotlib [24] to visualize the STFT result. Fig. 1(b) and Fig. 2(b) show the start and end positions of the frames for the denoised sound, and Fig. 1(c) and Fig. 2(c) show the corresponding frame spectrograms. As can be seen, the contours of whistles have been enhanced.
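For anyone trying to reproduce that preprocessing, here is roughly what the frame-cutting and per-frame STFT step could look like in Python. This is only a sketch based on the quote above, not the authors' code: the file name, the mono assumption, the epsilon inside the log, and the dpi/figsize choices are my own.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import stft

td = 0.250                                     # frame duration: 250 ms, as in the paper
sr, audio = wavfile.read("denoised_call.wav")  # hypothetical input, assumed mono
frame_len = int(td * sr)

# Cut into non-overlapping frames of length td; drop the short remainder at the end.
n_frames = len(audio) // frame_len
frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)

for i, frame in enumerate(frames):
    # STFT with a Hamming window, segment length td/40, 50% overlap, 1024-point FFT.
    seg = frame_len // 40
    f, t, Z = stft(frame, fs=sr, window="hamming",
                   nperseg=seg, noverlap=seg // 2, nfft=1024)
    Z_log = np.log10(np.abs(Z) + 1e-10)        # Eq. (1), small epsilon to avoid log(0)

    # Save a grayscale spectrogram of the frame at 180 x 120 px (figsize * dpi).
    fig = plt.figure(figsize=(1.8, 1.2), dpi=100)
    ax = fig.add_axes([0, 0, 1, 1])            # fill the figure, no margins
    ax.pcolormesh(t, f, Z_log, cmap="gray")
    ax.axis("off")
    fig.savefig(f"frame_{i:04d}.png", dpi=100)
    plt.close(fig)
```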

I am very confused about this, because I have a bunch of labeled 10-second spectrograms that I have no idea what to do with. I can make them square, but that distorts the frequency/time relationship. I cannot use them as is either, because the model will just take random square crops around the middle, and in some files of interest the signals are not in the middle: they will be cropped out, but the file will still be labeled as containing the signal of interest. So the model will be confused.

2 Likes

But as I said above, as long as you keep the relationship between x and y, the time and frequency scales, the same throughout your experiments, it does not matter that they are “distorted”. And distorted in relation to what, anyway? The x-to-y aspect ratio of the picture is somewhat arbitrary to begin with, simply set by the defaults of the library used. It also doesn’t really cost you anything to just experiment with the settings and see what helps your model, so just try it! My favorite Jeremy quote: “The answer to the question ‘Should I do blah?’ is always ‘Try blah and see!’” :grinning:

Of course, if you try to fit 1 minute of sound into an image of e.g. 28px, you will lose a lot of information on the time axis at that resolution, but you can also adjust the window-length parameter. So your 10-second clips may be too long, and you could try cutting them into 1 s or 500 ms segments and then using those in square images.

Depending on your problem you may then also have to adjust how you tackle it: if your classification depends a lot on changes over time (like kajetan’s music example), then too short a window may leave your model unable to detect the right class. But short windows could be used for a first step (e.g. detect whale-sound vs. no-whale-sound), and the whale-sound-type classification could then be run only on those larger windows where whale sound was actually detected in step 1…
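To illustrate that two-step idea, here is a minimal sketch assuming fastai v1 and two hypothetical exported learners; "whale_detector.pkl", "whale_type_classifier.pkl", and the class name "whale" are placeholders, not real models.

```python
from fastai.vision import load_learner, open_image

# Stage 1 flags windows that contain whale sound; stage 2 classifies the whistle
# type only on the flagged windows.
detector = load_learner("models", "whale_detector.pkl")
classifier = load_learner("models", "whale_type_classifier.pkl")

def classify_clip(window_image_paths):
    "window_image_paths: spectrogram images of the short windows cut from one clip."
    labels = []
    for path in window_image_paths:
        img = open_image(path)
        if str(detector.predict(img)[0]) == "whale":        # stage 1: any whale sound?
            labels.append(str(classifier.predict(img)[0]))  # stage 2: which whistle type?
    return labels
```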

3 Likes

@kodzaks The images are randomly cropped by default but you can easily override this. In the line where you’re creating the DataBunch, you can pass in size = (height, width) as an argument.

Additionally, if your height and width change dynamically, you can choose how to handle this (again, via an argument to the DataBunch function) by choosing what you want the resize_method to be.
For example, if you want the images to be squished and not cropped into the size you chose, pass in resize_method = ResizeMethod.SQUISH.
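For reference, this is roughly what that looks like in fastai v1; the folder path, batch size, and the (100, 177) size here are just placeholders, and turning off flips is my own suggestion rather than a requirement.

```python
from fastai.vision import ImageDataBunch, get_transforms, ResizeMethod

# Rectangular size (height, width) plus SQUISH, so images are resized rather than
# randomly cropped and off-centre signals are not lost.
data = ImageDataBunch.from_folder(
    "path_to_spectrograms",                 # placeholder folder of labeled images
    valid_pct=0.2,
    ds_tfms=get_transforms(do_flip=False),  # flips rarely make sense for spectrograms
    size=(100, 177),
    resize_method=ResizeMethod.SQUISH,
    bs=32,
)
data.normalize()
```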

1 Like

Thank you for this! Can height/width be rectangular, though? I remember the discussion that images have to be square, and Jeremy mentioned that he would be talking about rectangular images in part 2, saying that they are surprisingly difficult to deal with, or something like that?

No problem :slight_smile:

Yes, it can. I’ve used this successfully with rectangular images of size (100, 177) and scaled up versions of the same.

I do remember Jeremy saying this, but rectangular image sizes work just fine (it is possible that support was added after the lecture was recorded, hence the mismatch).

My understanding is that he will explain why it is complicated in part 2. Intuitively, I think this is because picking the right size for rectangular convolutional kernels must be tricky.

2 Likes

Why not use this thread? Having it in a separate group means that there are multiple places to look for this info, and stuff in telegram won’t be available to people searching.

I’ll be working on fastai.audio very soon FYI - if there are exclusive private groups outside of these forums we won’t be able to help each other!

9 Likes

I meant that a dataset with lots of different sized rectangles is tricky. If they’re all the same size, that’s fine (e.g. the segmentation datasets in part 1).

5 Likes

Thank you @rsomani95 and @jeremy for your replies.

I tried using my spectrogram data in two formats: 224x224 square and (100, 177) rectangular.

My results for 224 seem to be better:

Initial resnet34 (pretty bad):

epoch train_loss valid_loss error_rate
1 0.647652 0.621205 0.394737
2 0.563093 1.762814 0.552632
3 0.427048 1.189877 0.473684

After unfreeze and learn.fit it gets better:

epoch train_loss valid_loss error_rate
1 0.534642 0.357039 0.210526

Resnet50 gave me the best results so far:

epoch train_loss valid_loss error_rate
1 0.827895 0.735053 0.473684
2 0.567242 0.594174 0.315789
3 0.427497 0.566335 0.289474
4 0.349678 0.458205 0.157895
5 0.299800 0.379624 0.131579
6 0.257279 0.368416 0.157895
7 0.235691 0.364216 0.105263
8 0.212819 0.347390 0.105263

Unfreeze and learn.fit makes it worse:

epoch train_loss valid_loss error_rate
1 1.191806 12.630829 0.500000
2 1.661533 30.858454 0.473684
3 1.424686 14.588722 0.473684

But for rectangular images, my results are much, much worse:

initial resnet34

epoch train_loss valid_loss error_rate
1 0.826759 0.995546 0.596491
2 0.580530 0.858085 0.543860
3 0.425833 1.251360 0.578947

After unfreeze and learn.fit:

epoch train_loss valid_loss error_rate
1 0.993478 1.775175 0.578947

resnet50:

epoch train_loss valid_loss error_rate
1 0.627348 1.009493 0.526316
2 0.456382 1.192073 0.526316
3 0.329753 1.243930 0.500000
4 0.279683 1.176981 0.473684
5 0.230862 1.421680 0.500000
6 0.192906 1.193287 0.473684
7 0.166738 1.016759 0.342105
8 0.144671 1.161860 0.342105

It gets a little better after unfreeze and learn.fit:

epoch train_loss valid_loss error_rate
1 1.220552 23.675739 0.500000
2 1.345855 5.028001 0.289474
3 1.289851 1.882938 0.289474

I wonder why this is happening? The base data is the same (i.e. the spectrogram content); the difference is only in shape and size.

So my best result so far is about a 0.1 error rate with resnet50, 200 spectrograms, 2 classes, and a 0.2 validation split.

But when I run the data again, the model changes: it gets worse, then a little better. I know I am supposed to use a random seed for more consistent results, but why is there so much variation in the results I get? Besides, if we lock the validation set, does that not mean our model does not generalize well? (Or maybe I just do not get it, sorry.)

I sure hope so. In a couple of weeks there should be something to look at.

6 Likes

I can’t explain your results in depth, but just one observation:

Using 224x224 pixel images, you give the model 50,176 pixels to process, whereas with 177x100 it gets only 17,700, so the model sees almost 3 times as much information in the square images, assuming you actually create them at that resolution and don't just stretch them.

So the question is: were the 224x224 images just created by stretching the 177x100 ones? Because if not, that alone might account for a lot of the difference in accuracy. Also, assuming the 100 is the y axis corresponding to frequency: a lot of information might be lost if you condense/squish the frequencies into 100 px as opposed to 224. With the latter, the possible frequency resolution is 2.24 times higher, so the model gets a much “clearer” picture of the occurring frequencies and might be better able to differentiate the classes.
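If it helps, here is a small sketch of rendering spectrograms directly at a chosen pixel size (rather than creating huge images and rescaling them afterwards). It assumes matplotlib; the NFFT/noverlap values and the placeholder noise signal are just examples.

```python
import numpy as np
import matplotlib.pyplot as plt

def save_spectrogram(y, sr, out_path, px_w, px_h, dpi=100):
    "Render a grayscale spectrogram of waveform `y` at exactly px_w x px_h pixels."
    fig = plt.figure(figsize=(px_w / dpi, px_h / dpi), dpi=dpi)
    ax = fig.add_axes([0, 0, 1, 1])   # no margins, so the axes fill the whole image
    ax.specgram(y, Fs=sr, NFFT=1024, noverlap=512, cmap="gray")
    ax.axis("off")
    fig.savefig(out_path, dpi=dpi)
    plt.close(fig)

# Example: the same 3 s placeholder signal rendered square vs. rectangular.
sr = 22050
y = np.random.randn(sr * 3)
save_spectrogram(y, sr, "square_224.png", 224, 224)
save_spectrogram(y, sr, "rect_177x100.png", 177, 100)
```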

1 Like

@marcmuc Thank you for your reply. The 224 by 224 was the size I set when I created the first version of the spectrograms. The original rectangular size for the second version was something like 1400 by 1900 pixels, and I got some sort of overload error. So I took those spectrograms and set a 100 by 177 size, assuming they would just be scaled down without data loss. Maybe that was my mistake? Is there an upper size limit in pixels for rectangular images?

Also, my spectrograms are grayscale. I am playing with the window size now, trying to see which one gives me better results, since it is a time/frequency resolution trade-off.

Great idea; allowing the solutions we come up with to be searchable by others is essential so we don't all reinvent the wheel. On the other hand, we have frequent random chat that doesn't need to clog the forums. We've talked it over and are going to try a hybrid where we maintain a private group (open to anyone) but post any real content and questions here. If we find too much content is slipping through the cracks and not making it to the forums, we will reconsider and just use the forums 100%.

1 Like

So Jeremy, would you recommend trimming all the clips to the same length? I am attempting to use fastai to identify speech patterns, such as speech disfluency, pauses, intonation, etc. One issue I will have to deal with is the fact that everyone's recordings are different lengths. I was thinking of either padding speeches with loops of their own recordings (like you did with vision) or cutting the length down to 20-30 second clips. But then the problem becomes, for the longer ones, which 20-30 seconds do you take? I think this is a very cool area of research and I am looking forward to people way smarter than me getting interested in it as well :slight_smile:
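A quick sketch of the loop-padding idea, assuming numpy; the 30-second target and the function name are placeholders, not an established recipe.

```python
import numpy as np

def loop_pad_or_trim(y, sr, target_sec=30.0):
    "Repeat a recording until it reaches target_sec, or trim it down if it is longer."
    target_len = int(target_sec * sr)
    if len(y) >= target_len:
        return y[:target_len]              # e.g. take the first 30 s (or a random offset)
    reps = int(np.ceil(target_len / len(y)))
    return np.tile(y, reps)[:target_len]   # loop the clip, then cut to the exact length
```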

Love this forum @MadeUpMasters! I personally have a bunch of job interview recordings, and I was trying to classify different things. I got decent accuracy (70-80%) on one of my classification tasks by converting the .wav files into audio spectrograms and then training a classifier on them.

However, I’m seeing that I probably need to convert the audio into text and run NLP classifiers over that data if I want to classify any meaning from the audio.

1 Like

@harris I’d be extremely interested in hearing more about this. I'm working in a very similar space, trying to predict job outcomes from interviews, presentations, etc.

I’ve found the following two challenges with audio to be consistently frustrating. Hopefully the fastai module that @jeremy appears to be working on will address them! In the meantime, I’d love to hear everyone else’s best practices!

  1. Analogy to image databunch transforms.
    Fastai has made it incredibly easy to take a small dataset (e.g. one picture of a dog) and make myriad alterations (cropping, rotating, shading, flipping…) to create a huge training set that helps regularize the network. It occurs to me that, in audio, very few of the traditional transformations are applicable (except perhaps left-right cropping in the case of a spectrogram)…
    It seems to me that ideally a module designed for audio could take an audio segment of any length, in native format (e.g. wav rather than a pre-converted spectrogram, independent of whether or not the best solution is for the module to then run a CNN on a spectrogram anyway). I think this would allow much more opportunity for “transforms” like adding background sound (e.g. street noise), cutting out sound (e.g. a bad mic connection), or muffling (e.g. various distances between mic and source)…

Do any of you have strategies that you use currently for augmenting (i.e. regularizing) your incoming DataBunches? (A rough waveform-augmentation sketch follows after the next point.)

  2. Training dataset labeling.
    I consistently can’t decide what the best labeling protocol is for my data sets. I think it would be ideal if the labels were of the “segmentation” style, such that in a 10 second clip, you can imagine a mask that colors in the two seconds of dog barking and another half second of car horn. The frustrating thing is that once the data is in spectrogram form, it is exceedingly difficult for a human labeler to identify the sounds that would have been second nature for them in audio form. The result is that I am forced to add single-classifications to entire audio clips (e.g. the whole 10-second clip is put in the category “dog” because it has a positive example of dog barking).

How do other people deal with labeling their dataset in a way that also allows you to use fastai?
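On the augmentation question, here is a minimal sketch of the kind of waveform-level “transforms” mentioned in point 1 (background noise, dropouts, muffling), using only numpy/scipy. The parameter values and function names are arbitrary placeholders, not an established recipe.

```python
import numpy as np
from scipy.signal import lfilter

def add_background_noise(y, snr_db=20.0):
    "Mix in white noise at a given signal-to-noise ratio (stand-in for e.g. street noise)."
    sig_power = np.mean(y ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return y + np.random.randn(len(y)) * np.sqrt(noise_power)

def random_dropout(y, sr, max_gap_sec=0.1):
    "Zero out a short random chunk, imitating a bad mic connection."
    gap = np.random.randint(1, int(max_gap_sec * sr))
    start = np.random.randint(0, len(y) - gap)
    out = y.copy()
    out[start:start + gap] = 0.0
    return out

def muffle(y, strength=5):
    "Crude moving-average low-pass filter, imitating distance between mic and source."
    return lfilter(np.ones(strength) / strength, [1.0], y)
```

Each of these returns a new waveform, so they can be applied before the spectrogram is generated, much like fastai's image transforms are applied before batching.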

3 Likes

I wanted to share a short guide I wrote on how to use fastai’s parallel function for speeding up preprocessing. It helped me generate spectrograms for audio classification nearly 3x faster! Feedback, especially of the critical variety, is much appreciated!
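The core pattern is roughly this; a simplified sketch assuming fastai v1's parallel plus librosa, with placeholder folder names and spectrogram settings rather than the exact code from the guide.

```python
from pathlib import Path

import librosa
import matplotlib.pyplot as plt
from fastai.core import parallel

audio_dir = Path("audio_clips")    # placeholder input folder of .wav files
image_dir = Path("spectrograms")   # placeholder output folder
image_dir.mkdir(exist_ok=True)

def make_spectrogram(fname, idx):
    "fastai's parallel calls this with (item, index) for each element of the list."
    y, sr = librosa.load(fname, sr=None)
    S = librosa.amplitude_to_db(abs(librosa.stft(y, n_fft=1024)))
    fig = plt.figure(figsize=(2.24, 2.24), dpi=100)   # 224 x 224 px
    ax = fig.add_axes([0, 0, 1, 1])
    ax.imshow(S, origin="lower", aspect="auto", cmap="gray")
    ax.axis("off")
    fig.savefig(image_dir / (fname.stem + ".png"), dpi=100)
    plt.close(fig)

files = list(audio_dir.glob("*.wav"))
parallel(make_spectrogram, files)   # runs make_spectrogram over all files in parallel
```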

11 Likes