Great! I agree with Radek that this challenge is not over yet. After digging into the fastai audio source code, I found that this is a good opportunity to practice the new fastai. Firstly, it is an incomplete repo and there are many things to improve. Secondly, it is not so big, so I think we can replicate the result without feeling too intimidated. For now I'm trying to get more familiar with fastai2 by going through Jeremy's walkthrough.
Influenced by @florianl, I experimented with the Hz scale rather than the mel scale. For anyone interested in what that is, have a look at https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0 . In short, the mel scale (short for melody scale) spaces frequencies the way humans perceive pitch, so it matches how our ears tell two sounds apart.
However, the model is not our ear, so maybe we can get a better result with plain Hz. The body of get_x is below, similar to what @florianl did:
stft = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)
stft_magnitude, stft_phase = librosa.magphase(stft)
stft_magnitude_db = librosa.amplitude_to_db(stft_magnitude)
stft_magnitude_db = stft_magnitude_db - stft_magnitude_db.min()
stft_magnitude_db = stft_magnitude_db / stft_magnitude_db.max() * 255 # we want the range of values for our data to be [0, 255]
# this way fastai internally will be able to represent it as an image using PIL
return stft_magnitude_db.astype(np.uint8)
num_samples is still int(0.75 * rate), because my Colab runs out of memory if I use 1.7 * rate.
And I got better results, as below. But intuitively, I don't really know whether it will work with other datasets. Out of curiosity, I want to test it on a human speech dataset, because the mel scale is designed around human hearing.
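To make the Hz versus mel difference concrete, here is a minimal sketch using the common HTK-style mel formula. This is an assumption for illustration only: librosa's default mel filter bank uses a slightly different (Slaney) variant, and the 9 bin edges are arbitrary. The 24414 Hz rate is the one used elsewhere in this thread.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style mel formula (one of several variants in use)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

nyquist = 24414 / 2

# Bin edges spaced evenly in Hz (what a plain STFT gives you).
hz_edges = np.linspace(0.0, nyquist, 9)

# Bin edges spaced evenly in mel, converted back to Hz:
# narrow bins at low frequencies, wide bins at high frequencies.
mel_edges = mel_to_hz(np.linspace(0.0, hz_to_mel(nyquist), 9))

print(np.round(hz_edges))
print(np.round(mel_edges))
```

The mel-spaced edges crowd into the low frequencies, so a mel spectrogram spends most of its rows there. Whether that helps or hurts for non-human calls is exactly the open question.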
That is a super interesting question: whether to use a linear, log or mel scale for the frequency axis.
The mel scale closely mimics how we hear, but when studying animals, being anthropocentric can be quite a pitfall.
Definitely a very important consideration.
Hi all,
Wanted to include one more possibility that we've seen is pretty useful.
A lot of folks in the audio/speech community use mel scales, to the point that it's a default.
But there are also Gammatone spectrums and filter banks. I've always liked this one since it seems that Gammatones are to our cochleas roughly what Gabors are to our corneas.
I have a python version of these lying around somewhere, hoping to dig them up and give it a run with the notebook! If anyone has other filter banks please feel free to share, but I know that quickly becomes a rabbit hole.
The pre-processing possibilities seem to be endless: scale of the axis, power_to_db, amplitude_to_db, hop_length, n_fft etc. I wonder if there's a logic behind them that would spare us from having to try them all out.
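At least one pair on that list is not really two independent choices. A decibel conversion of a magnitude spectrogram (20 * log10) gives the same numbers as a decibel conversion of the corresponding power spectrogram (10 * log10), as long as each is applied to the matching input. The sketch below uses simplified stand-ins written for this post. They are not librosa's functions, which also take ref and top_db arguments.

```python
import numpy as np

def amplitude_to_db_sketch(a, amin=1e-5):
    """Simplified stand-in: dB of a magnitude (amplitude) spectrogram."""
    return 20.0 * np.log10(np.maximum(a, amin))

def power_to_db_sketch(p, amin=1e-10):
    """Simplified stand-in: dB of a power (magnitude squared) spectrogram."""
    return 10.0 * np.log10(np.maximum(p, amin))

rng = np.random.default_rng(0)
magnitude = np.abs(rng.normal(size=(5, 4))) + 1e-3  # synthetic |STFT| values

db_from_amplitude = amplitude_to_db_sketch(magnitude)
db_from_power = power_to_db_sketch(magnitude ** 2)

print(np.allclose(db_from_amplitude, db_from_power))
```

So the thing to watch is a mismatch: feeding a power spectrogram to the amplitude conversion (or the other way round) doubles or halves every dB value.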
Hi shut-ins,
It looks like everyone so far is using the approach of classifying spectrogram images. Radek has suggested working directly on the timeseries as another approach. I'd like to present a starter notebook that hits 96% classification accuracy using only conv1d, two simple pooling functions, and a Linear classifier.
The method is called ROCKET. You may have seen it already discussed in Time series/ sequential data study group. The original code and paper can be found at https://github.com/angus924/rocket. For those not familiar, here is a brief overview.
ROCKET extracts a set of features, typically several thousand numbers, from each timeseries sample (in this case the Macaque calls). The features are then run through a classifier to train the model to predict a category. The classifier (at least in the versions I have seen so far) is simply a linear combination of weights. Oguiza's demo, the original paper, and my attached demo all use sklearn's RidgeClassifier. You could just as well use the more familiar Linear/softmax/cross entropy/optimizer setup, even appending more layers.
The power of ROCKET, though, lies in its features. These are generated by running each sample through a large set of fixed conv1d's. Each conv1d has randomized weights centered on zero, and a randomized bias. The output of each conv1d, a series itself, is then reduced to two numbers. The first is simply the maximum of the series. The second is the fraction of positive values in the series, the "proportion of positive values" (ppv). In this way, each timeseries sample yields a list of numbers (features) that characterize it, of length two times the number of random convolutions. As with spectrogram images, it's these features that are sent to the classifier.
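The feature extraction described above can be sketched in plain NumPy. This is a simplified illustration, not the code from the ROCKET repo or from the notebooks: the kernel lengths, the uniform dilation sampling, the absence of padding and all function names are assumptions made for this example.

```python
import numpy as np

def random_kernel(rng, max_dilation=8):
    """One fixed random conv1d: zero-centered weights, random bias and dilation."""
    length = int(rng.choice([7, 9, 11]))          # assumed set of kernel lengths
    weights = rng.normal(size=length)
    weights -= weights.mean()                     # center the weights on zero
    bias = rng.uniform(-1.0, 1.0)
    dilation = int(rng.integers(1, max_dilation + 1))
    return weights, bias, dilation

def apply_kernel(x, weights, bias, dilation):
    """Dilated cross-correlation without padding (what conv1d computes)."""
    span = (len(weights) - 1) * dilation
    n_out = len(x) - span                         # must be >= 1
    out = np.full(n_out, bias, dtype=float)
    for j, w in enumerate(weights):
        out += w * x[j * dilation : j * dilation + n_out]
    return out

def rocket_features(x, kernels):
    """Two numbers per kernel: the max and the proportion of positive values."""
    feats = []
    for weights, bias, dilation in kernels:
        out = apply_kernel(x, weights, bias, dilation)
        feats += [out.max(), (out > 0).mean()]
    return np.array(feats)

rng = np.random.default_rng(0)
kernels = [random_kernel(rng) for _ in range(100)]    # never trained
x = np.sin(np.linspace(0, 40 * np.pi, 2000))          # synthetic stand-in for a call
features = rocket_features(x, kernels)
print(features.shape)   # two features per kernel
```

Stacking these feature vectors for all samples gives the matrix that the linear classifier is trained on.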
It is important to note that the weights and biases of the conv1d's are fixed. Contrary to our usual practice, they are not trained during the optimization of the classifier.
Getting into opinion and speculation, I think ROCKET effectively does a search of the space of conv1d's by using a large universe of random kernel lengths, weights, biases, dilations, and paddings. The classifier selects which of these conv1d's are predictive of the training samples. Rather than predesigning the architecture as we typically do, this approach finds the conv1d's that work best for the problem.
Such a search would be impossible with typical gradient-based training, because most of its parameters (kernel length, dilation, padding) are not differentiable wrt the loss. Two non-linearities, both of which are also non-differentiable, then reduce the dimensionality of the conv1d outputs. IMO, there's great potential in this approach of using randomness to search the space of architectures and weights. You can find papers that suggest that the olfactory system's random connections work in a similar way. Also, see weight-agnostic architectures.
Some further notes...
- The various dilations of conv1d are able to extract the periodicities (frequencies) of the sounds, much as spectrograms do. I think that's one reason ROCKET works well on this audio task.
- Although ROCKET looks computationally intensive, I find that most of the trained classification coefficients end up very small. (This is not my idea - I downloaded a notebook that shows this observation, but don't know who originally authored it.) It means those conv1d's could be eliminated, or replaced with different randomly sampled conv1d's that may turn out to work better.
- There's some special magic in the ppv non-linearity. Combined with conv1d, it is exceptionally good at classifying time series in general. Why is that so?
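The observation about small coefficients suggests a simple pruning step, sketched below. Everything here is an assumption for illustration: the coefficient matrix is synthetic (in practice it would be something like the coef_ attribute of a fitted RidgeClassifier, with shape classes by features), the features are assumed to be stored interleaved as (max, ppv) per kernel, and kernels_to_keep is a made-up helper.

```python
import numpy as np

def kernels_to_keep(coef, keep_fraction=0.25):
    """Rank kernels by their largest absolute coefficient and keep the top fraction.

    coef: (n_classes, n_features) array, features assumed interleaved as
          [max_0, ppv_0, max_1, ppv_1, ...].
    """
    coef = np.atleast_2d(coef)
    feature_strength = np.abs(coef).max(axis=0)                    # per feature
    kernel_strength = feature_strength.reshape(-1, 2).max(axis=1)  # per kernel
    n_keep = max(1, int(round(keep_fraction * len(kernel_strength))))
    keep = np.argsort(kernel_strength)[::-1][:n_keep]
    return np.sort(keep)

# Synthetic stand-in for trained coefficients: 3 classes, 10 kernels (20 features).
rng = np.random.default_rng(0)
coef = rng.normal(scale=0.01, size=(3, 20))
coef[:, [4, 5, 14]] += 1.0      # make kernels 2 and 7 clearly important

kept = kernels_to_keep(coef, keep_fraction=0.2)
print(kept)
```

The dropped kernels could then be replaced by freshly sampled ones and the classifier refit, to see whether accuracy keeps improving.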
Notes on my initial implementation (based on Ignacio Oguiza's ROCKET demo at https://github.com/timeseriesAI/timeseriesAI - thanks!)
First, run notebook saveSounds. It saves the Macaque calls and names into ~/.fastai. These will be loaded by the following notebook.
Second, run notebook MacaqueROCKET for a demonstration of the ROCKET method. It requires fastai v1 only for the last section. These notebooks are not tested on servers. They were run locally only.
The biggest issue was dealing with variable length samples. ROCKET is not limited to fixed length samples, but works most straightforwardly with them. There is already discussion of this issue in depth in the Time Series Sequential Data Study Group. One simple idea is to pad each sample with zeros to the same (longest) length. However doing so drastically alters the max and ppv measures, and empirically decreases accuracy.
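A small synthetic experiment shows why zero padding distorts the features. In the padded region the conv1d output is just the bias, so every padded position counts as negative (or as positive, if the bias is positive) and drags ppv toward 0 or 1. The signal, kernel and lengths below are made up for illustration.

```python
import numpy as np

def conv_ppv_max(x, weights, bias):
    """ppv and max of an undilated, unpadded conv1d output."""
    out = np.correlate(x, weights, mode="valid") + bias
    return (out > 0).mean(), out.max()

rng = np.random.default_rng(0)
call = rng.normal(size=500)                       # synthetic short sample
weights = rng.normal(size=9)
weights -= weights.mean()                         # zero-centered random kernel
bias = -0.5

padded = np.concatenate([call, np.zeros(1500)])   # pad to a "longest" length of 2000

ppv_raw, max_raw = conv_ppv_max(call, weights, bias)
ppv_pad, max_pad = conv_ppv_max(padded, weights, bias)

print(ppv_raw, ppv_pad)   # ppv shrinks because the padded outputs all equal the bias
print(max_raw, max_pad)
```

With a negative bias the max is barely affected, but ppv now mostly encodes how much padding the sample received, which is to say its length.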
The primary problem with using different length samples is when randomly chosen kernel length, padding, and dilation for conv1d yield different length outputs, all within one batch. Even more, what should be the max and ppv of a zero length conv1d output (short sample and large dilation)?
The issue is especially acute in PyTorch, because of course tensors have to be rectangular. I experimented extensively with conv1d to find out exactly how it handles padding with nans/zeros, when it errors out, etc. I think this ROCKET implementation is correct when samples are padded on the right with nan, even when the conv1d output is empty. It throws an error however when the input tensor sample length dimension is too small for a particular conv1d. [Fixed on 20200402.]
In the end, I did not tackle this last problem. Instead, I limited the dilations so that the shortest sample is always valid for every conv1d. This measured nearly as accurate as including larger dilations. Perhaps it's because we are identifying voice timbres by frequencies and formants. Such frequencies are already captured by the smaller dilations. If you are looking for larger structures in a call - the meaning or bass notes for instance - the larger dilations would be needed.
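The dilation limit follows from the conv1d output length for stride 1, which is n + 2 * padding - dilation * (kernel_size - 1). A rough sketch of the arithmetic is below. The helper names and the example length of 1000 samples are hypothetical and not taken from the notebooks.

```python
def conv1d_output_length(n, kernel_size, dilation, padding=0):
    """Output length of a stride-1 conv1d on an input of length n."""
    return n + 2 * padding - dilation * (kernel_size - 1)

def max_valid_dilation(shortest, kernel_size, padding=0):
    """Largest dilation that still leaves at least one output value."""
    return (shortest + 2 * padding - 1) // (kernel_size - 1)

shortest = 1000        # hypothetical length of the shortest sample
kernel_size = 11

d_max = max_valid_dilation(shortest, kernel_size)
print(d_max, conv1d_output_length(shortest, kernel_size, d_max))
```

Sampling dilations only up to this bound guarantees that every conv1d produces a non-empty output for every sample, at the cost of never looking at structure longer than the shortest call.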
Notes on the problem...
It's an easy one in the grand scheme. In essence, we are distinguishing voices. That can be done quite well using pitch and timbre alone, which both spectrograms and conv1d can extract. But both methods have difficulty detecting temporal patterns. Resnet detects features in an image, but does not know whether they are located in the upper left or lower right. ROCKET loses the time structure by pooling it away with ppv and max.
If the distant goal is to recognize the meaning of the calls, we will want to ignore pitch and timbre and focus on the call's structure along the time dimension. It will require some kind of time-aware architecture like an RNN. Just sayin' for now.
Directions and ideas (in case anyone is inspired)
- Replace the unused conv1d features with new random ones. Does accuracy keep improving?
- Do the most predictive conv1d's have certain characteristics in common? If so, we get a sense of how to design a model based on conv1d.
- Find a better way to adapt ROCKET to time series with different lengths. Right now the space of dilations assumes the series has a fixed length. Many conv1d's with large dilations remain unused because they do not apply to short samples. Is there a way to better distribute the conv1d's to match the distribution of sample lengths?
- With a typical Linear/cross entropy training on the features, would more layers find complex feature patterns that improve generalization?
- Make a more efficient implementation that skips the overhead of nn.Conv1d. We could go directly to F.conv1d because we already know the parameters are safe.
- Fix the fastai section to work correctly and work with fastai v2.
- I am severely lost with git and GitHub, but will try to learn enough to integrate contributions. I'll probably need to ask for help.
Thanks for reading and for code corrections!
That's part of the problem - with experience on problems such as this one, one can maybe build intuition to speed up the research process, but I am not really sure how much that helps. Sure, you can probably work with spectrograms faster and maybe apply more complicated transformations, but you probably still need to keep trying things out and looking at the data and the results to figure out how best to process the sound. Especially since with animal datasets you might get a lot of background noise, or a specific species might only use a certain frequency spectrum, or it might hear in some other way than the ones you worked with before...
Your question got me thinking and searching a bit, and there apparently exists something like differentiable digital signal processing (paper, blog post) with some interesting references to prior work. Something worth checking out but that is mostly for generative models.
One way around this problem would be working directly with audio as a time series and not jumping to spectrograms
Malcolm, this is seriously cool. Wow. Thx so much for sharing this and for your explanation of how the method works! This is awesome!!!
When you are ready, would you please be so kind as to submit a pull request to the repo? Any explanation you could include in prose, as you did here in the forum post, would be greatly appreciated. Maybe the repo can serve as a collection of interesting and useful methods to work with audio. So far we have the intro and a fastai2 audio model, and this would make a great addition. I am sure I will be using this code for my work, that is quite certain. I was approached by a colleague earlier today and already pointed him to this repo for an example of what he was asking about.
This is looking really good! Thank you so much for sharing this with us!
Thanks for the hints @radek. I think I will stay with spectrograms for a bit, since they seem interesting. I want to implement the CoordConv idea from Uber. I see that the problem is well structured to try it out, but I'm struggling with the get_x function. Does it have to return a numpy array only? Can it be a torch tensor? I'm asking because your code implies that Image.fromarray() will be used internally. I can construct the required tensor, but it seems like a waste to convert that to numpy.ndarrays.
To be honest, I am not sure. You could check what happens if you return a tensor; my guess is that it will likely work. Also, I wouldn't really worry too much about whether something is less or more efficient. This only matters once you start hitting some constraints: something does not fit on the GPU, or the run time becomes prohibitively long. Here such an additional operation is likely to add just a few ms per example, probably nothing to worry about.
I tried to learn more about the fastai API, so I built Transforms for the audio files. They may not make too much sense right now, but maybe you are interested too. I tried not to look into the fastai2 library to avoid copying and pasting, so I guess they came up with better ideas ;). Nevertheless I learned a lot...
import librosa
import numpy as np
# Assumes the fastai2 names (Transform, PILImage, DataBlock, ...) are in scope,
# e.g. via: from fastai2.vision.all import *

size = 200
bs = 32

class AudioToImage(Transform):
    # turn a 2D spectrogram array into a PIL image
    def encodes(self, o: np.ndarray):
        o = np.uint8(o)
        return PILImage.create(o)

class AudioMel(Transform):
    # mel spectrogram in dB, shifted to start at 0, low frequencies at the bottom
    def encodes(self, o: np.ndarray):
        o = librosa.feature.melspectrogram(y=o, sr=24414)
        o = librosa.power_to_db(o, ref=np.max)
        o = o - o.min()
        o = np.flip(o, axis=0)
        return o

class AudioStft(Transform):
    # linear-frequency (Hz) spectrogram in dB, same post-processing as AudioMel
    def encodes(self, o: np.ndarray):
        o = np.abs(librosa.stft(o, hop_length=32))
        o = librosa.amplitude_to_db(o, ref=np.max)
        o = o - o.min()
        o = np.flip(o, axis=0)
        return o

class AudioAddNoise(Transform):
    # with probability 0.5, add a small amount of Gaussian noise
    def encodes(self, o: np.ndarray):
        if np.random.random() > 0.5:
            noise_amp = 0.001 * np.random.uniform() * np.amax(o)
            o = o.astype('float64') + noise_amp * np.random.normal(size=o.shape)
        return o

class AudioTransform(Transform):
    # load an audio file and optionally fix its length (in seconds)
    def __init__(self, length=0.0):
        self.length = length
    def encodes(self, o):
        # note: librosa.load resamples to 22050 Hz by default, while AudioMel assumes sr=24414
        o, sr = librosa.load(o)
        if self.length > 0.0:
            o = librosa.util.fix_length(o, size=int(sr * self.length))
        return o

def AudioBlock(length=0.0):
    return TransformBlock(type_tfms=AudioTransform(length=length), batch_tfms=IntToFloatTensor)

train_sz = 0.01
dblocks = DataBlock(blocks=(AudioBlock(length=0.75), CategoryBlock),
                    get_items=get_files,
                    splitter=RandomSubsetSplitter(train_sz=train_sz, valid_sz=train_sz*0.2, seed=42),
                    get_y=parent_label,
                    item_tfms=[AudioMel, AudioAddNoise, AudioToImage],
                    )
dls = dblocks.dataloaders(path)
Wow! Thanks @florianl! It's going to take me a bit of time to figure all that out (I'm new to v2), but I reckon this will indeed be the way to go. I don't think it makes sense to dovetail the CoordConv approach with the one that is used for images.
So it looks like it has to be an array for Image.fromarray() to work. If I understood CoordConv properly, that makes my images look like this (yes, I'm treating this data as image data):
No improvement in the score; in fact it's worse than before (it does go down to 1.6% after 5 epochs though):
Would be cool to discuss the CoordConv method with anyone who has tried it out or read about it. I'm not sure I've understood it correctly, since I gave it only a brief look and started coding it. In any case, I've got to figure out how to do these Transform thingies of v2 that @florianl wrote (I think that is the right way to go about it, especially if we want data augmentation etc. later on), and of course I can play around with the preprocessing (perhaps making the images more square via the parameters n_fft and hop_length would help?). Hints are much appreciated.
I think it doesn't matter if the images are square. They only have to have the same size, so it should be enough to set a fixed length.
CoordConv looks interesting, but I am not sure it is of help here. I did try to implement something like sketch2code (https://sketch2code.azurewebsites.net), though, and I guess CoordConv could improve the results there. Could you please post your CoordConv code? I'd like to learn more about it.
Sure, no problem @florianl!
This function converts the audio file to the spectrogram (just occurred to me that I used different parameters from earlier, not sure how that influences the result):
def x_to_spec(x, r):
    spec = librosa.feature.melspectrogram(y=x, sr=r, n_fft=1024, hop_length=140)
    spec = librosa.power_to_db(spec, ref=np.max)
    return spec - spec.min()
And this is my get_x:
# Assumes librosa, numpy as np, torch and fastai's tensor() are already imported.
def get_x(path, rate=24414, num_samples=41503):
    x, rate = librosa.load(path, sr=rate)
    x = librosa.util.pad_center(x, size=num_samples, mode='constant')
    # spectrogram channel, scaled to [0, 255]
    spec = x_to_spec(x, r=rate)
    snt = tensor(spec/spec.max()*255)
    # x-coordinate channel: column index, scaled to [0, 255]
    xnt = torch.arange(0, snt.shape[1]).float()
    xnt = xnt.expand_as(snt)
    xnt = xnt/xnt.max()*255
    # y-coordinate channel: row index, scaled to [0, 255]
    ynt = torch.arange(0, snt.shape[0]).unsqueeze(1).float()
    ynt = ynt.expand_as(snt)
    ynt = ynt/ynt.max()*255
    # stack them up as a 3-channel image (height, width, channels)
    lnt = [snt, xnt, ynt]
    s3nt = torch.stack(lnt)
    return np.asarray(s3nt.permute(1,2,0)).astype(np.uint8)
I am curious though:
- How is sketch2code related to audio / this problem?
- Why do you think that CoordConv won't work here?
Thanks for posting your code :).
As far as I understand, the biggest advantage of CoordConv is that the model is better at keeping track of the locations of features. I think that won't improve the results here (someone already hit 99.6% accuracy). But it's a very interesting approach; I hadn't heard of CoordConv before.
Most sketch2code implementations use CNNs, and one of their biggest flaws is that they have problems with the location information of elements. So I guess CoordConv could improve the results there!
I agree with Florian. One way to proceed might therefore be to get as much information as possible out of the signal and reduce the "empty space" in the spectrogram to a minimum.
Using the same code as before, I took a 150 ms sample from each file
and got an error rate of 0.0021 after 6 epochs.
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 2.558188 | 0.358300 | 0.107756 | 00:12 |

epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.145152 | 0.114322 | 0.034317 | 00:11 |
1 | 0.092643 | 0.096365 | 0.031572 | 00:11 |
2 | 0.056835 | 0.156654 | 0.052848 | 00:11 |
3 | 0.034394 | 0.022413 | 0.005491 | 00:11 |
4 | 0.017923 | 0.011003 | 0.002745 | 00:11 |
5 | 0.010128 | 0.010757 | 0.002059 | 00:11 |
The length of the call doesn't seem to matter much in the identification.
Hmm... I thought it would be important to know where a feature was since, e.g., it makes a difference at what height of the image the high amplitudes are located (like a low vs high pitched "voice"). But perhaps here, that doesn't help.
Good luck with sketch2code! Sounds like an awesome project
Impressive result @adpostma! Can you tell me what steps would go into generating that spectrogram starting from the audio file (in "librosa terms" would be great)? Those spectrograms do appear to bring the features out well.
I think he took the fastai audio notebook (the one with the highest score), which you can now find in the GitHub repo, and changed the sample time to 150 ms.
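For reference, at the 24414 Hz rate used earlier in the thread, 150 ms comes to 3662 samples. Below is a rough sketch of cutting every waveform to that length. Taking the window from the center of the file, and zero-padding shorter files, are assumptions made here (the post does not say which part of each call was used), and center_crop_or_pad is a made-up helper rather than a librosa function.

```python
import numpy as np

def center_crop_or_pad(x, num_samples):
    """Return exactly num_samples values: the middle of x, or x zero-padded on both sides."""
    if len(x) >= num_samples:
        start = (len(x) - num_samples) // 2
        return x[start:start + num_samples]
    pad_total = num_samples - len(x)
    left = pad_total // 2
    return np.pad(x, (left, pad_total - left), mode="constant")

rate = 24414
num_samples = int(0.150 * rate)            # 150 ms worth of samples

long_call = np.random.default_rng(0).normal(size=20000)   # synthetic waveforms
short_call = np.ones(1000)

print(num_samples)
print(center_crop_or_pad(long_call, num_samples).shape)
print(center_crop_or_pad(short_call, num_samples).shape)
```

The cropped waveform would then go through the same spectrogram code as before.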