[Invitation to open collaboration] Practice what you learn in the course and help animal researchers! šŸµ

Great! I agree with Radek that this challenge is not over yet. After digging into the fastai audio source code, I found that this is a good opportunity to practice the new fastai. Firstly, it is an incomplete repo and there are many things to improve. Secondly, it is not so big, so I think we can replicate the results without feeling too intimidated. For now I'm trying to get more familiar with fastai2 by going through Jeremy's walkthroughs :smiley:

2 Likes

Influenced by @florianl, I experimented with the Hz scale rather than the Mel scale. For anyone interested in what that means, have a look at https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0. In short, for humans the mel scale (short for melody scale) matches more closely how our ears distinguish two different sounds.

However, the model is not our ear, so maybe we can get better results with plain Hz. My get_x is below, similar to what @florianl did:

    stft = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)
    stft_magnitude, stft_phase = librosa.magphase(stft)
    stft_magnitude_db = librosa.amplitude_to_db(stft_magnitude)
    # scale the values to [0, 255] so that fastai can internally represent the array as an image using PIL
    stft_magnitude_db = stft_magnitude_db - stft_magnitude_db.min()
    stft_magnitude_db = stft_magnitude_db / stft_magnitude_db.max() * 255
    return stft_magnitude_db.astype(np.uint8)

num_samples is still num_samples = int(0.75 * rate), because my Colab runs out of memory if I use 1.7 * rate.

And I got better results, shown below. Intuitively though, I don't really know whether it would work on other datasets. Out of curiosity, I also want to test it on a human dataset, since the Mel scale was designed for humans.

2 Likes

That is a super interesting question, whether to use a linear, log or mel scale for the frequency axis :slightly_smiling_face:

The Mel scale mimics very closely how we hear, but at the same time, when studying animals, being anthropocentric can be quite a pitfall :wink:

Definitely a very important consideration.

4 Likes

Hi all,

Wanted to mention one more possibility we've found to be pretty useful.
A lot of folks in the audio/speech community use Mel scales, to the point that it's the default.

But there are also gammatone spectrograms and filter banks. I've always liked this approach, since gammatones seem to be to our cochleas roughly what Gabors are to our visual cortex.

I have a Python version of these lying around somewhere; hoping to dig them up and give them a run with the notebook! If anyone has other filter banks please feel free to share, but I know that quickly becomes a rabbit hole.

3 Likes

The pre-processing possibilities seem to be endless: the scale of the frequency axis, power_to_db vs. amplitude_to_db, hop_length, n_fft, etc. I wonder if there's a logic behind it all that would spare us from having to try them all out.
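To make the search space concrete, here is a rough sketch of those knobs collected into one function (the defaults are just placeholders, not recommendations):

import librosa
import numpy as np

def make_spec(x, sr, scale='mel', n_fft=1024, hop_length=256, n_mels=128):
    # scale of the frequency axis: 'mel' vs. a plain linear ('hz') STFT
    if scale == 'mel':
        s = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=n_fft,
                                           hop_length=hop_length, n_mels=n_mels)
        s_db = librosa.power_to_db(s, ref=np.max)      # power spectrogram -> dB
    else:
        s = np.abs(librosa.stft(x, n_fft=n_fft, hop_length=hop_length))
        s_db = librosa.amplitude_to_db(s, ref=np.max)  # magnitude spectrogram -> dB
    s_db = s_db - s_db.min()
    return (s_db / s_db.max() * 255).astype(np.uint8)  # [0, 255] so it can be treated as an image

At least having them in one place makes it easier to try combinations systematically while we look for that logic.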

Hi shut-ins,

It looks like everyone so far is using the approach of classifying spectrogram images. radek has suggested working directly on the timeseries as another approach. I'd like to present a starter notebook that hits 96% classification accuracy using only conv1d, two simple pooling functions, and a Linear classifier.

The method is called ROCKET. You may have seen it already discussed in Time series/ sequential data study group. The original code and paper can be found at https://github.com/angus924/rocket. For those not familiar, here is a brief overview.

ROCKET extracts a set of features, typically several thousand numbers, from each timeseries sample (in this case the Macaque calls). The features are then run through a classifier to train the model to predict a category. The classifier (at least the ones I have seen used so far) is simply a linear combination of the features. Oguiza's demo, the original paper, and my attached demo all use sklearn's RidgeClassifier. You could just as well use the more familiar Linear/softmax/cross entropy/optimizer setup, even appending more layers.

The power of ROCKET, though, lies in its features. These are generated by running each sample through a large set of fixed conv1d's. Each conv1d has randomized weights centered on zero, and randomized biases. The output of each conv1d, a series itself, is then reduced to two numbers. The first is simply the maximum of the series. The second is the fraction of positive values in the series, the 'proportion of positive values' (ppv). In this way, each timeseries sample yields a list of numbers (features) that characterize it, of length two times the number of random convolutions. As with spectrogram images, it's these features that are sent to the classifier.
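To make that concrete, here is a minimal, unoptimized sketch of the feature extraction in PyTorch (not the official ROCKET code; the kernel lengths and dilation sampling only roughly follow the paper):

import numpy as np
import torch
import torch.nn.functional as F

def random_kernels(n_kernels, input_length):
    kernels = []
    for _ in range(n_kernels):
        length = int(np.random.choice([7, 9, 11]))
        weight = torch.randn(1, 1, length)
        weight = weight - weight.mean()                # weights centered on zero
        bias = torch.empty(1).uniform_(-1, 1)          # random bias
        max_exp = np.log2((input_length - 1) / (length - 1))
        dilation = int(2 ** np.random.uniform(0, max_exp))
        padding = ((length - 1) * dilation) // 2 if np.random.rand() > 0.5 else 0
        kernels.append((weight, bias, dilation, padding))
    return kernels

def rocket_features(x, kernels):
    # x: (n_samples, series_length) float tensor -> (n_samples, 2 * n_kernels) features
    x = x.unsqueeze(1)                                 # add a channel dim for conv1d
    feats = []
    for weight, bias, dilation, padding in kernels:
        out = F.conv1d(x, weight, bias, dilation=dilation, padding=padding)
        feats.append(out.max(dim=-1).values)           # feature 1: max of the output series
        feats.append((out > 0).float().mean(dim=-1))   # feature 2: proportion of positive values (ppv)
    return torch.cat(feats, dim=1)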

It is important to note that the weights and biases of the conv1d's are fixed. Contrary to our usual practice, they are not trained during the optimization of the classifier.
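Concretely, the only trained piece is the linear classifier sitting on top of those fixed features. A minimal sketch, assuming train_feats/valid_feats are feature matrices like the ones produced above (the variable names are mine, not from the notebook):

from sklearn.linear_model import RidgeClassifierCV
import numpy as np

# Only this linear classifier is fit; the random conv1d kernels stay fixed.
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))
clf.fit(train_feats, train_labels)             # train_feats: (n_samples, n_features) numpy array
print('validation accuracy:', clf.score(valid_feats, valid_labels))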

Getting into opinion and speculation, I think ROCKET effectively does a search of the space of conv1d's by using a large universe of random kernel lengths, weights, biases, dilations, and paddings. The classifier selects which of these conv1d's are predictive of the training samples. Rather than predesigning the architecture as we typically do, this approach finds the conv1d's that work best for the problem.

Such a search would be impossible using typical machine learning methods because most of its parameters are not differentiable wrt loss. Two non-linearities, both of which are also non-differentiable, then reduce the dimensionality of the conv1d outputs. IMO, there's great potential in this approach of using randomness to search the space of architectures and weights. You can find papers that suggest that the olfactory system's random connections work in a similar way. Also, see weight-agnostic architectures.

Some further notes…

  1. The various dilations of conv1d are able to extract the periodicities (frequencies) of the sounds, much as spectrograms do. I think that's one reason ROCKET works well on this audio task.

  2. Although ROCKET looks computationally intensive, I find that most of the trained classification coefficients end up very small. (This is not my idea - I downloaded a notebook that shows this observation, but don't know who originally authored it.) It means those conv1d's could be eliminated, or replaced with different randomly sampled conv1d's that may turn out to work better.

  3. There's some special magic in the ppv non-linearity. Combined with conv1d, it is exceptionally good at classifying time series in general. Why is that so?


Notes on my initial implementation (based on Ignacio Oguiza's ROCKET demo at https://github.com/timeseriesAI/timeseriesAI - thanks!)

https://github.com/PomoML/ROCKET_Sound

First, run the notebook saveSounds. It saves the Macaque calls and names into ~/.fastai. These will be loaded by the following notebook.

Second, run the notebook MacaqueROCKET for a demonstration of the ROCKET method. It requires fastai v1 only for the last section. These notebooks were run locally only and have not been tested on servers.

The biggest issue was dealing with variable-length samples. ROCKET is not limited to fixed-length samples, but works most straightforwardly with them. This issue is already discussed in depth in the Time Series Sequential Data Study Group. One simple idea is to pad each sample with zeros to the same (longest) length. However, doing so drastically alters the max and ppv measures, and empirically decreases accuracy.

The primary problem with using different-length samples is that a randomly chosen kernel length, padding, and dilation for conv1d yield different-length outputs, all within one batch. Even worse, what should the max and ppv of a zero-length conv1d output be (short sample and large dilation)?

The issue is especially acute in PyTorch, because of course tensors have to be rectangular. I experimented extensively with conv1d to find out exactly how it handles padding with nans/zeros, when it errors out, etc. I think this ROCKET implementation is correct when samples are padded on the right with nan, even when the conv1d output is empty. It throws an error however when the input tensor sample length dimension is too small for a particular conv1d. [Fixed on 20200402.]

In the end, I did not tackle this last problem. Instead, I limited the dilations so that the shortest sample is always valid for every conv1d. This was nearly as accurate as including the larger dilations. Perhaps that's because we are identifying voice timbres by frequencies and formants, and such frequencies are already captured by the smaller dilations. If you were looking for larger structures in a call - the meaning or bass notes, for instance - the larger dilations would be needed.
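As a rough illustration of the nan-padding idea discussed above (a simplified sketch, not the code from the notebook; the masking details are an approximation):

import torch

def pad_right_nan(samples):
    # samples: list of 1-D tensors of different lengths -> (N, max_len) tensor,
    # padded on the right with nan so invalid positions can be masked out later
    max_len = max(s.shape[0] for s in samples)
    out = torch.full((len(samples), max_len), float('nan'))
    for i, s in enumerate(samples):
        out[i, :s.shape[0]] = s
    return out

def masked_max_ppv(conv_out):
    # conv_out: (N, 1, L) conv1d output; positions whose receptive field touched the
    # nan padding are nan themselves and are excluded from both max and ppv
    valid = ~torch.isnan(conv_out)
    masked = torch.where(valid, conv_out, torch.full_like(conv_out, float('-inf')))
    max_feat = masked.max(dim=-1).values
    ppv = (masked > 0).sum(dim=-1) / valid.sum(dim=-1).clamp(min=1)  # empty outputs get ppv = 0
    return max_feat, ppv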


Notes on the problem…

It's an easy one in the grand scheme. In essence, we are distinguishing voices. That can be done quite well using pitch and timbre alone, which both spectrograms and conv1d can extract. But both methods have difficulty detecting temporal patterns. Resnet detects features in an image, but does not know whether they are located in the upper left or lower right. ROCKET loses the time structure by pooling it away with ppv and max.

If the distant goal is to recognize the meaning of the calls, we will want to ignore pitch and timbre and focus on the call's structure along the time dimension. It will require some kind of time-aware architecture like an RNN. Just sayin' for now.


Directions and ideas (in case anyone is inspired)

  • Replace the unused conv1d features with new random ones. Does accuracy keep improving?

  • Do the most predictive conv1d's have certain characteristics in common? If so, we get a sense of how to design a model based on conv1d.

  • Find a better way to adapt ROCKET to time series with different lengths. Right now the space of dilations assumes the series has a fixed length. Many conv1d's with large dilations remain unused because they do not apply to short samples. Is there a way to better distribute the conv1d's to match the distribution of sample lengths?

  • With a typical Linear/Cross entropy training on the features, would more layers find complex feature patterns that improve generalization?

  • Make a more efficient implementation that skips the overhead of nn.Conv1d. We could go directly to F.conv1d because we already know the parameters are safe.

  • Fix the fastai section to work correctly and to work with fastai v2.

  • I am severely lost with git and GitHub :confused:, but will try to learn enough to integrate contributions. I'll probably need to ask for help. :slightly_smiling_face:

Thanks for reading and for code corrections!

6 Likes

That's part of the problem - with experience on problems such as this one, one can maybe build intuition to speed up the research process, but I am not really sure how much that helps. Sure, you can probably work with spectrograms faster and maybe apply more complicated transformations, but you probably still need to keep trying things out and looking at the data and the results to figure out how to best process the sound. Especially since with animal datasets you might get a lot of background noise, or a specific species might only use a certain frequency spectrum, or it might hear in some other way than the species you worked with before…

Your question got me thinking and searching a bit, and apparently there exists something like differentiable digital signal processing (paper, blog post), with some interesting references to prior work. Something worth checking out, though it is mostly aimed at generative models.

One way around this problem would be working directly with audio as a time series and not jumping to spectrograms :slightly_smiling_face:

3 Likes

Malcolm, this is seriously cool. Wow :exploding_head:. Thx so much for sharing this and for your explanation of how the method works! This is awesome!!!

When you are ready, would you please be so kind as to submit a pull request to the repo? Any explanation in prose that you could include here in the forum as you do so would be greatly appreciated. Maybe the repo can serve as a collection of interesting and useful methods for working with audio. So far we have the intro and a fastai2 audio model, and this would make a great addition :slight_smile: I am quite certain I will be using this code for my work. I was approached by a colleague earlier today and already pointed him to this repo as an example of what he was asking about :slightly_smiling_face:

This is looking really good! Thank you so much for sharing this with us!

2 Likes

Thanks for the hints @radek. I think I will stay with spectrograms for a bit, since they seem interesting :slight_smile:. I want to implement the CoordConv idea from Uber. I see that the problem is well structured for trying it out, but I'm struggling with the get_x function. Does it have to return a numpy array, or can it be a torch tensor? I'm asking because your code implies that Image.fromarray() will be used internally. I can construct the required tensor, but it seems like a waste to convert it back to a numpy.ndarray.

To be honest, I am not sure :slight_smile: You could check what happens if you return a tensor, my guess is that it will likely work. Also, I wouldn't really worry too much about whether something is less or more efficient - this only matters once you start hitting some constraints, something does not fit on the GPU or the run time becomes prohibitively long. Here such an additional operation is likely to add just a few ms per example, probably nothing to worry about :slight_smile:

1 Like

I tried to learn more about the fastai API, so I built Transforms for the audio files. They may not make too much sense right now, but maybe you are interested too. I tried not to look into the fastai2 library to avoid copying and pasting, so I guess they came up with better ideas ;). Nevertheless, I learned a lot … :slight_smile:

size = 200
bs = 32


class AudioToImage(Transform):
    def encodes(self, o: np.ndarray):
        o = np.uint8(o)
        return PILImage.create(o)


class AudioMel(Transform):
    def encodes(self, o: np.ndarray):
        o = librosa.feature.melspectrogram(y=o, sr=24414)
        o = librosa.power_to_db(o, ref=np.max)
        o = o - o.min()
        o = np.flip(o, axis=0)
        return o


class AudioStft(Transform):
    def encodes(self, o: np.ndarray):
        o = np.abs(librosa.stft(o, hop_length=32))
        o = librosa.amplitude_to_db(o, ref=np.max)
        o = o - o.min()
        o = np.flip(o, axis=0)
        return o


class AudioAddNoise(Transform):
    def encodes(self, o: np.ndarray):
        if np.random.random() > 0.5:
            noise_amp = 0.001 * np.random.uniform() * np.amax(o)
            o = o.astype('float64') + noise_amp * np.random.normal(size=o.shape)
        return o


class AudioTransform(Transform):
    def __init__(self, length=0.0):
        self.length = length

    def encodes(self, o):
        o, sr = librosa.load(o)
        if self.length > 0.0:
            o = librosa.util.fix_length(o, int(sr * self.length))
        return o


def AudioBlock(length=0.0):
    return TransformBlock(type_tfms=AudioTransform(length=length), batch_tfms=IntToFloatTensor)


train_sz = 0.01
dblocks = DataBlock(blocks=(AudioBlock(length=0.75), CategoryBlock),
                    get_items=get_files,
                    splitter=RandomSubsetSplitter(train_sz=train_sz, valid_sz=train_sz*0.2, seed=42),
                    get_y=parent_label,
                    item_tfms=[AudioMel, AudioAddNoise, AudioToImage],
                    )

dls = dblocks.dataloaders(path)

4 Likes

Wow! Thanks @florianl! It's going to take me a bit of time to figure all that out (I'm new to v2), but I reckon this will indeed be the way to go. I don't think it makes sense to shoehorn the CoordConv approach into the pipeline used for images.

1 Like

So it looks like it has to be an array for Image.fromarray() to work. If I understood CoordConv properly, that makes my images look like this (yes, I'm treating this data as image data):
[image: spectrogram with x- and y-coordinate channels]

No improvement in the score; in fact it's worse than before (it does go lower, to 1.6%, after 5 epochs though):
[screenshot: training results]

Would be cool to discuss the CoordConv method with anyone who has tried it out or read the paper. I'm not sure I've understood it correctly, since I only gave it a brief look before starting to code it. In any case, I've got to figure out how to do these Transform thingies of v2 that @florianl wrote (I think that is the right way to go about it, especially if we want data augmentation etc. later on), and of course I can play around with the preprocessing (perhaps making the images more square via the parameters n_fft and hop_length would help?). Hints are much appreciated :slight_smile:

3 Likes

I think it doesn't matter whether the images are square; they only have to be the same size, so it should be enough to set a fixed length.

CoordConv looks interesting, but I am not sure if it will help here. I have been trying to implement something like sketch2code (https://sketch2code.azurewebsites.net), though, and I guess it could improve the results there. Could you please post your CoordConv code? I'd like to learn more about it.

2 Likes

Sure, no problem @florianl!

This function converts the audio file to the spectrogram (it just occurred to me that I used different parameters than earlier; I'm not sure how that influences the result):

def x_to_spec(x, r):
    spec = librosa.feature.melspectrogram(y=x, sr=r, n_fft=1024, hop_length=140)
    spec = librosa.power_to_db(spec, ref=np.max)
    return spec - spec.min()

And, this is my get_x:

def get_x(path, rate=24414, num_samples=41503):
    x, rate = librosa.load(path, sr=rate)
    x = librosa.util.pad_center(x, num_samples, mode='constant')
    # spec tensor
    spec = x_to_spec(x, r=rate)
    snt = tensor(spec/spec.max()*255)
    # x-coord tensor
    xnt = torch.arange(0, snt.shape[1]).float()
    xnt = xnt.expand_as(snt)
    xnt = xnt/xnt.max()*255
    # y-coord tensor
    ynt = torch.arange(0, snt.shape[0]).unsqueeze(1).float()
    ynt = ynt.expand_as(snt)
    ynt = ynt/ynt.max()*255
    # stack them up
    lnt = [snt, xnt, ynt]
    s3nt = torch.stack(lnt)

    return np.asarray(s3nt.permute(1,2,0)).astype(np.uint8)

I am curious though:

  1. How is sketch2code related to audio / this problem?
  2. Why do you think that CoordConv won't work here?

1 Like

Thanks for posting your code :).

As far as I understand, the biggest advantage of CoordConv is that the model is better at keeping track of the locations of features. I think that won't improve the results here (someone already hit 99.6% accuracy). But it's a very interesting approach - I hadn't heard of CoordConv before.

Most sketch2code implementations use CNNs, and one of their biggest flaws is that they have problems with the location information of elements. So I guess it could improve the results there!

1 Like

I agree with Florian. One way to proceed might therefore be to get as much information as possible out of the signal while reducing the "empty space" in the spectrogram to a minimum.

Using the same code as before I took a 150 ms sample from each file:

[image: spectrogram of a 150 ms sample]

to get an error rate of 0.0021 after 6 epochs:

epoch  train_loss  valid_loss  error_rate  time
0      2.558188    0.358300    0.107756    00:12

epoch  train_loss  valid_loss  error_rate  time
0      0.145152    0.114322    0.034317    00:11
1      0.092643    0.096365    0.031572    00:11
2      0.056835    0.156654    0.052848    00:11
3      0.034394    0.022413    0.005491    00:11
4      0.017923    0.011003    0.002745    00:11
5      0.010128    0.010757    0.002059    00:11

The length of the call doesn't seem to matter much in the identification.
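In terms of the earlier get_x snippets, that change is roughly just the window length:

rate = 24414                      # sampling rate used elsewhere in this thread
num_samples = int(0.150 * rate)   # a 150 ms window instead of 0.75 s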

3 Likes

Hmm… I thought it would be important to know where a feature is since, e.g., it makes a difference at what height of the image the high amplitudes are located (like a low- vs. high-pitched "voice"). But perhaps that doesn't help here.

Good luck with sketch2code! Sounds like an awesome project :slight_smile:

Impressive result @adpostma! Can you tell me what steps would go into generating that spectrogram starting from the audio file (in "librosa terms" would be great)? Those spectrograms do appear to bring the features out well.

I think he took the fastai audio notebook (the one with the highest score), which you can now find in the GitHub repo, and changed the sample time to 150 ms.