Thank you! This is great!
What I am trying to understand is: can we really make square spectrograms just by setting figsize? A spectrogram is basically a time vs. frequency plot, and by forcing it square we distort the time axis (and maybe the frequency axis too?). I am not sure about music, but for bioacoustic signals this distortion could be problematic, I think.
Here is a new paper on using CNNs for bioacoustic classification.
They did not just make square spectrograms from the sound files; instead they cut the sound files into what they call "fixed length sound frames", which seems to be a somewhat involved process. After that cutting they computed an STFT for each frame, with a Hamming window, FFT length 1024, etc. (I try to sketch this in code below the quote):
All the denoised sounds in the data set are sequentially cut into
sound frames with a duration of td (no overlapping between adjacent
frames). The sound frame with a length of less than td at the
end of the sound file is discarded. The Short Time Fourier Transform
(STFT), with Hamming window, a segment length of td/40,
segment shift of td/80 and FFT length of 1024 samples, is computed
for each sound frame. In order to show more details in the spectrogram,
the STFT coefficients are logarithmized by Eq. (1).
Z = log10(|Z|)    (1)
where Z is the STFT coefficients matrix for each sound frame.
If the value of td is too small, some short-term pulse interference
may also be misdetected; if the value of td is too large, the signal
detection accuracy is lowered. Based on the durations of the
whistles from both whale species, td is set to 250 ms. In addition,
the time interval between most adjacent whistles is greater than
td, so the paper does not discuss the case where two whistles are
falsely detected as a whole whistle due to the short signal interval
(<td).
Further, for each sound frame, based on the preprocessed STFT
coefficients Z, a frame spectrogram (grayscale) of 180 * 120 pixels
is obtained by the pcolormesh method in matplotlib [24] to visualize
the STFT result. Fig. 1(b) and Fig. 2(b) show the start and end
positions of the frames for the denoised sound, and Fig. 1(c) and
Fig. 2(c) show the corresponding frame spectrograms. As can be
seen, the contours of whistles have been enhanced.
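If I understand the quoted procedure correctly, it could be sketched roughly like this. This is just my own reading, not the authors' code: the sample rate fs, the small epsilon inside the log, and the use of scipy.signal.stft and matplotlib's Agg backend are my assumptions.

```python
# Rough sketch of the pipeline as I read it; NOT the authors' code.
# Assumptions: audio is a 1-D NumPy array of denoised samples, fs is its
# sample rate, and scipy.signal.stft stands in for their STFT.
import numpy as np
import matplotlib
matplotlib.use('Agg')            # render to files, no display needed
import matplotlib.pyplot as plt
from scipy.signal import stft

def cut_and_transform(audio, fs, td=0.25, nfft=1024):
    """Cut audio into non-overlapping td frames and log-STFT each one."""
    frame_len = int(td * fs)                # td = 250 ms in the paper
    nperseg = frame_len // 40               # segment length of td/40
    hop = frame_len // 80                   # segment shift of td/80
    n_frames = len(audio) // frame_len      # trailing partial frame is discarded
    specs = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        _, _, Z = stft(frame, fs=fs, window='hamming',
                       nperseg=nperseg, noverlap=nperseg - hop, nfft=nfft)
        specs.append(np.log10(np.abs(Z) + 1e-10))  # Eq. (1); eps avoids log(0)
    return specs

def save_spectrogram(Z, path, width=180, height=120, dpi=60):
    """Render one log-STFT matrix to a 180 x 120 grayscale image via pcolormesh."""
    fig = plt.figure(figsize=(width / dpi, height / dpi), dpi=dpi)
    ax = fig.add_axes([0, 0, 1, 1])         # fill the whole canvas, no margins
    ax.pcolormesh(Z, cmap='gray')
    ax.axis('off')
    fig.savefig(path, dpi=dpi)
    plt.close(fig)
```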
I am very confused about this, because I have a bunch of labeled 10 sec spectrograms and no idea what to do with them. If I make them square, I distort the time/frequency relationship. I cannot use them as-is either, because the model will just take random square crops from the middle, and in some files the signal of interest is not in the middle: the crop will miss it, but the file will still be labeled as containing the signal, so the model will be trained on confusing examples.
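If I read the paper right, one workaround I am considering is to tile each 10 sec file into fixed-width windows myself instead of relying on random crops, so that a signal anywhere in the file lands fully inside some window. A minimal sketch; the window and hop sizes here are placeholders I made up, not values from the paper:

```python
# Hypothetical sketch: tile a long spectrogram into fixed-width time
# windows instead of relying on random square crops. win/hop are
# placeholder values, not from the paper.
import numpy as np

def tile_spectrogram(spec, win=120, hop=60):
    """spec: 2-D array (freq x time). Returns overlapping time windows."""
    tiles = []
    for start in range(0, spec.shape[1] - win + 1, hop):
        tiles.append(spec[:, start:start + win])  # every region gets covered
    return tiles
```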