Deep Learning with Audio Thread

(Natalija Lace) #21

Thank you for your reply. Do you know if Jeremy covers CoordConv in any of the fastai lessons? I got that all spectrograms should be generated with the same parameters and that no augmentation should be used on spectrograms. I am extremely confused about this whole square/rectangular issue. If we squash the spectrogram, we lose a lot of information, for example the frequency modulation pattern in a call; would that make the classification task harder? Why can’t we use a rectangular spectrogram where this information is preserved? Cutting max and min frequencies could be problematic too: some calls have high harmonics all the way up, and some calls have a lower fundamental than average, so we really need the full sampling range. Plus, there is a difference between a spectrogram of a specific call and a spectrogram of a 10 sec snippet of the entire auditory scene (with or without a call in it).

Also, I do not fully understand why spectrograms cannot be treated as a type of image? This post seems to argue that they cannot be

(Unnati Niraj Patel) #22

I was wondering if there are any models in PyTorch pretrained on audio data. I found one in TensorFlow:

(Nick Koenig) #23

Very cool idea, I saw it mentioned earlier that some people were interested in speech analysis. That’s what I’m interested in as well. I’d love to see how others are approaching audio prediction.

(Kajetan Olszewski) #24

Could I suggest connecting with a broader community? What I mean is using the MIR (Music Information Retrieval) Slack group: there are hundreds of MIR practitioners and researchers there.
We could create our own channel there and post information about it in Robert’s top post. I’ve already created a “fastai” channel there.

(Kajetan Olszewski) #25

I’m working with the FMA (Free Music Archive) dataset (small subset) and I used the following code to get square spectrograms.

from pathlib import Path

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def gen_spec(load_path: str):
    # Load the clip and compute a mel power spectrogram
    y, sr = librosa.load(load_path)
    S = librosa.feature.melspectrogram(y=y, sr=sr)
    # 2.24 in x 2.24 in at 100 dpi -> a 224x224 px image
    fig = plt.figure(figsize=(2.24, 2.24))
    # Fill the whole figure, stripping the frame, ticks and padding
    plt.axes([0., 0., 1., 1.], frameon=False, xticks=[], yticks=[])
    librosa.display.specshow(librosa.power_to_db(S, ref=np.max),
                             y_axis='mel', x_axis='time')
    # Save next to the audio file, e.g. foo.mp3 -> foo.png
    save_path = Path(load_path).with_suffix('.png')
    plt.savefig(save_path, bbox_inches=None, pad_inches=0, dpi=100)
    plt.close(fig)  # free the figure when processing many files

It should save a 224x224 spectrogram image for a given path in the same folder as the audio file. The important bits here are the figsize=(2.24, 2.24) figure (2.24 in x 100 dpi = 224 px) and the savefig call with pad_inches=0, dpi=100; the rest of the plotting code just strips the axis labels and padding. It doesn’t really matter if the height is stretched. In the time domain I didn’t have to do anything to make the clips “the same size”, because FMA contains 30 s clips.
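As an aside, if you’d rather skip the matplotlib round-trip entirely, you can normalize the spectrogram array yourself and squish it with Pillow. `to_square_png` is a hypothetical helper of mine, not part of librosa or fastai:

```python
import numpy as np
from PIL import Image

def to_square_png(spec, out_path, size=224):
    """Scale a 2-D spectrogram array to 0-255 and squish it to size x size.

    `spec` can be any 2-D array, e.g. the output of librosa.power_to_db.
    (Hypothetical helper -- a sketch, not an established API.)
    """
    s = spec - spec.min()            # shift so the minimum is 0
    if s.max() > 0:
        s = s / s.max()              # scale to [0, 1]
    img = Image.fromarray((255 * s).astype(np.uint8))  # 8-bit grayscale
    img = img.resize((size, size))   # squish; aspect ratio is not preserved
    img.save(out_path)
    return img
```

This sidesteps the dpi/figsize bookkeeping and gives you exact pixel dimensions.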

(Natalija Lace) #26

Thank you! This is great!

What I am trying to understand is: can we really make square spectrograms just by setting figsize? A spectrogram is basically a time vs. frequency plot, and by making it square we distort the time axis (maybe the frequency axis too?). I am not sure about music, but for bioacoustic signals this distortion could be problematic, I think.

Here is a new paper on using CNN for bioacoustic classification

They did not just make square spectrograms from the sound files; instead they cut the sound files into what they called “fixed length sound frames”, which seems to be a somewhat involved process. After that cutting they computed an STFT for each frame, with a Hamming window, FFT length 1024, etc.:

All the denoised sounds in the data set are sequentially cut into sound frames with a duration of td (no overlapping between adjacent frames). The sound frame with a length of less than td at the end of the sound file is discarded. The Short Time Fourier Transform (STFT), with Hamming window, a segment length of td/40, segment shift of td/80 and FFT length of 1024 samples, is computed for each sound frame. In order to show more details in the spectrogram, the STFT coefficients are logarithmized by Eq. (1).

Z = log10(|Z|)   (1)

where Z is the STFT coefficients matrix for each sound frame. If the value of td is too small, some short-term pulse interference may also be misdetected; if the value of td is too large, the signal detection accuracy is lowered. Based on the durations of the whistles from both whale species, td is set to 250 ms. In addition, the time interval between most adjacent whistles is greater than td, so the paper does not discuss the case where two whistles are falsely detected as a whole whistle due to the short signal interval.

Further, for each sound frame, based on the preprocessed STFT coefficients Z, a frame spectrogram (grayscale) of 180 * 120 pixels is obtained by the pcolormesh method in matplotlib [24] to visualize the STFT result. Fig. 1(b) and Fig. 2(b) show the start and end positions of the frames for the denoised sound, and Fig. 1(c) and Fig. 2(c) show the corresponding frame spectrograms. As can be seen, the contours of whistles have been enhanced.
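If it helps to see the quoted recipe in code, here is a minimal sketch with scipy, using the paper’s numbers (td = 250 ms, segment length td/40, shift td/80, FFT length 1024). The function name and the small epsilon inside the log are my additions:

```python
import numpy as np
from scipy.signal import stft

def frame_spectrograms(y, sr, td=0.25, nfft=1024):
    """Cut y into non-overlapping td-second frames (the short tail is
    discarded) and compute a log10-magnitude Hamming-window STFT for each."""
    frame_len = int(td * sr)
    nperseg = frame_len // 40                # segment length of td/40
    hop = frame_len // 80                    # segment shift of td/80
    specs = []
    for i in range(len(y) // frame_len):     # trailing partial frame dropped
        frame = y[i * frame_len:(i + 1) * frame_len]
        _, _, Z = stft(frame, fs=sr, window='hamming',
                       nperseg=nperseg, noverlap=nperseg - hop, nfft=nfft)
        specs.append(np.log10(np.abs(Z) + 1e-10))  # Eq. (1); eps avoids log(0)
    return specs
```

Each entry of `specs` can then be rendered as a grayscale image (the paper uses pcolormesh at 180 x 120 px).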

I am very confused about this, because I have a bunch of labeled 10 sec spectrograms and no idea what to do with them. I can make them square, but that distorts the frequency/time relationship. I cannot use them as is, because the model will just randomly crop squares from the middle, and in some files of interest the signals are not in the middle, so they will be missed even though the file is still labeled as containing the signal of interest. So the model will be confused.

(Marc Rostock) #27

But as I said above, as long as you keep the relationship between x and y (the time and frequency scales) the same throughout your experiments, it does not matter that they are “distorted”. And distorted in relation to what, anyway? The x-to-y aspect ratio of the picture is somewhat arbitrary to begin with, simply set by the defaults of the library used. It also doesn’t really cost you anything to just experiment with the settings and see what helps your model, so just try it! My favorite Jeremy quote: “The answer to the question ‘Should I do blah?’ is always ‘Try blah and see!’” :grinning:

Of course, if you try to fit 1 min of sound into an image of e.g. 28 px then you will lose a lot of information on the time axis at that resolution, but you can also adjust parameters like the window length. So your 10 secs may be too long, and you could try cutting those to 1 sec or 500 ms and then using those in square images.

Depending on your problem you may then also have to adjust how you tackle it: if your classification depends a lot on changes over time (like Kajetan’s music example), then too short a window may leave your model unable to detect the right class. But it could be used for a first step (e.g. detect whale-sound vs. no-whale-sound), and then you run the whale sound type classification only on those larger windows where whale sound was actually detected in step 1…

(Kajetan Olszewski) #28

Do you have any baseline to compare to? I’d try to make them square (by adjusting the figure, not cropping), train a model and check what results you are getting. So far my results are not that great (0.426 error rate on resnet50 unfrozen from the start and trained for 3 epochs using a learning rate slice found through the learning rate finder), but they seem reasonable, because the network has the biggest problem with classifying the pop genre, which is not a genre per se but rather, I’d say, the quality of being listenable by the general public, and it draws from many genres.

(Rahul Somani) #29

@kodzaks The images are randomly cropped by default but you can easily override this. In the line where you’re creating the DataBunch, you can pass in size = (height, width) as an argument.

Additionally, if your height and width change dynamically, then you can choose how you want to handle this (again, via an argument to the DataBunch function) by choosing what you want the resize_method to be.
For example, if you want the images to be squished and not cropped into the size you chose, pass in resize_method = ResizeMethod.SQUISH.
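To see the difference between the two resize behaviours outside of fastai, here is a rough PIL sketch of what squishing vs. cropping does (the helper names are mine, not fastai’s):

```python
from PIL import Image

def squish_resize(img, height, width):
    """Stretch (squish) to the target shape: nothing is cut off,
    but the aspect ratio changes -- the idea behind ResizeMethod.SQUISH."""
    return img.resize((width, height))

def center_crop(img, height, width):
    """Crop the central region: aspect ratio is preserved,
    but content near the edges is lost."""
    w, h = img.size
    left = (w - width) // 2
    top = (h - height) // 2
    return img.crop((left, top, left + width, top + height))
```

For a spectrogram, cropping throws away the start/end of the clip (or the top/bottom frequencies), while squishing keeps everything at the cost of a uniform distortion.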

(Natalija Lace) #30

Thank you for this! Can height/width be rectangular though? I remember the discussion that images have to be square and Jeremy mentioned that he will be talking about rectangular images in part 2, saying that they are surprisingly difficult to deal with, or something like that?

(Rahul Somani) #31

No problem :slight_smile:

Yes, it can. I’ve used this successfully with rectangular images of size (100, 177) and scaled up versions of the same.

I do remember Jeremy saying this, but rectangular image sizes work just fine (support may have been added after the lecture was recorded, hence the mismatch).

My understanding is that he will explain why it is complicated in part 2. Intuitively, I think this is because picking the right size for rectangular convolutional kernels must be tricky.

(Jeremy Howard (Admin)) #32

Why not use this thread? Having it in a separate group means that there are multiple places to look for this info, and stuff in Telegram won’t be available to people searching.

I’ll be working on very soon FYI - if there are exclusive private groups outside of these forums we won’t be able to help each other!

(Jeremy Howard (Admin)) #33

I meant that a dataset with lots of different sized rectangles is tricky. If they’re all the same size, that’s fine (e.g. the segmentation datasets in part 1).

(Kajetan Olszewski) #34

Is there any way we could help with

(Natalija Lace) #35

Thank you @rsomani95 and @jeremy for your replies.

I tried to use my spectrogram data in 2 formats: 224 square and (100,177) rectangular size.

My results for 224 seem to be better:

Initial resnet34 (pretty bad):

epoch train_loss valid_loss error_rate
1 0.647652 0.621205 0.394737
2 0.563093 1.762814 0.552632
3 0.427048 1.189877 0.473684

After unfreeze and learnfit it gets better:

epoch train_loss valid_loss error_rate
1 0.534642 0.357039 0.210526

Resnet50 gave me the best results so far:

epoch train_loss valid_loss error_rate
1 0.827895 0.735053 0.473684
2 0.567242 0.594174 0.315789
3 0.427497 0.566335 0.289474
4 0.349678 0.458205 0.157895
5 0.299800 0.379624 0.131579
6 0.257279 0.368416 0.157895
7 0.235691 0.364216 0.105263
8 0.212819 0.347390 0.105263

unfreeze and learnfit makes it worse:

epoch train_loss valid_loss error_rate
1 1.191806 12.630829 0.500000
2 1.661533 30.858454 0.473684
3 1.424686 14.588722 0.473684

But for rectangular images, my results are much, much worse:

initial resnet34

epoch train_loss valid_loss error_rate
1 0.826759 0.995546 0.596491
2 0.580530 0.858085 0.543860
3 0.425833 1.251360 0.578947

After unfreeze and learnfit:

epoch train_loss valid_loss error_rate
1 0.993478 1.775175 0.578947


Resnet50:

epoch train_loss valid_loss error_rate
1 0.627348 1.009493 0.526316
2 0.456382 1.192073 0.526316
3 0.329753 1.243930 0.500000
4 0.279683 1.176981 0.473684
5 0.230862 1.421680 0.500000
6 0.192906 1.193287 0.473684
7 0.166738 1.016759 0.342105
8 0.144671 1.161860 0.342105

gets a little better after unfreeze and learnfit:

epoch train_loss valid_loss error_rate
1 1.220552 23.675739 0.500000
2 1.345855 5.028001 0.289474
3 1.289851 1.882938 0.289474

I wonder why this is happening? The base data (i.e. the spectrogram content) is the same; the difference is only in shape and size.

So my best result so far is a 0.1 error rate with resnet50, 200 spectrograms, 2 classes, 0.2 validation split.

But when I run the data again, the model changes: it gets worse, then a little better. I know I am supposed to use a random seed for more consistent results, but why is there so much variation in the results I get? Besides, if we lock the validation set, doesn’t that mean our model does not generalize well? (Or maybe I just don’t get it, sorry.)
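For reference, the usual way to reduce run-to-run variation is to seed every random number generator in play before training; a minimal sketch (the torch part only runs if PyTorch is installed, and `seed_everything` is just my name for it):

```python
import random

import numpy as np

def seed_everything(seed=42):
    """Seed the Python and NumPy RNGs (and PyTorch's, when available)
    so successive training runs start from the same random state."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # torch not installed; Python/NumPy seeds still apply
```

Note this only pins down the randomness (weight init, shuffling, validation split); it doesn’t answer the generalization question, which is about how representative the validation set is.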

(Jeremy Howard (Admin)) #36

I sure hope so. In a couple of weeks there should be something to look at.

(Marc Rostock) #37

I can’t explain your results in depth, but just one observation:

using 224x224 pixel images, you give the model 50176 color pixels to process, whereas with 177x100 it gets only 17700, so the model gets almost 3 times the information in the square images, assuming you actually create them at that size and don’t just stretch them.

So the question is: were the 224 by 224 images just created by stretching the 177x100 ones? Because if not, that alone might account for a lot of the difference in accuracy. Also, assuming the 100 is the y axis corresponding to frequency: a lot of information might be lost if you condense/squish the frequencies into 100 px as opposed to 224. With the latter, the possible frequency resolution is 2.24 times higher, so the model gets a much “clearer” picture of the occurring frequencies and might be able to better differentiate classes.

(Natalija Lace) #38

@marcmuc Thank you for your reply. The 224 by 224 was the size set when I created the first version of the spectrograms. The original rectangular size for the second version was something like 1400 by 1900 pixels, and I got some sort of overload error. So I took those spectrograms and set a 100 by 177 size, assuming they would just be scaled down without data loss? Maybe that was my mistake? Is there an upper size limit in pixels for rectangular images?

Also, my spectrograms are grayscale. I am playing with the window size now, trying to see which one gives me better results, since it is a time/frequency resolution trade-off.

(Robert Bracco) #39

Great idea, allowing the solutions we come up with to be searchable by others is essential so we don’t all reinvent the wheel. On the other hand we have frequent random chat that doesn’t need to be clogging the forums. We’ve talked it over and are going to try a hybrid where we maintain a private group (open to anyone) but post any real content and questions here. If we find too much content is slipping through the cracks and not making it to the forums, we will reconsider and just use the forums 100%.

(Nick Koenig) #40

So Jeremy, would you recommend trimming all the clips to be the same size? I am attempting to use fastai to identify speech patterns, such as speech disfluency, pauses, intonation, etc. One issue I will deal with is the fact that everyone’s recordings are different lengths. I was thinking of either padding speeches with loops of their own recordings (like you did with the vision) or cutting the length down to 20-30 second clips. But then the problem becomes for the longer ones, which 20-30 seconds to you take? I think this is a very cool area of research and looking forward to people way smarter than me to get interested in it as well :slight_smile: