[Invitation to open collaboration] Practice what you learn in the course and help animal researchers! 🐵

Hi shut-ins,

It looks like everyone so far is using the approach of classifying spectrogram images. radek has suggested working directly on the time series as another approach. I'd like to present a starter notebook that hits 96% classification accuracy using only conv1d, two simple pooling functions, and a Linear classifier.

The method is called ROCKET. You may have seen it discussed already in the Time Series/Sequential Data study group. The original code and paper can be found at https://github.com/angus924/rocket (ROCKET: Exceptionally fast and accurate time series classification using random convolutional kernels). For those not familiar, here is a brief overview.

ROCKET extracts a set of features, typically several thousand numbers, from each time series sample (in this case the Macaque calls). The features are then run through a classifier to train the model to predict a category. The classifier (at least in the ones I have seen so far) is simply a linear combination of weights. Oguiza's demo, the original paper, and my attached demo all use sklearn's RidgeClassifier. You could just as well use the more familiar Linear/softmax/cross-entropy/optimizer setup, even appending more layers.
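For concreteness, here is a minimal sketch of that classifier stage, assuming the ROCKET features and labels are already in arrays (the variable names are mine; the log-spaced alpha grid follows the demos I have seen):

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

# train_features: (n_samples, n_features) ROCKET features
# train_labels:   (n_samples,) call categories
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))  # cross-validated ridge penalty
clf.fit(train_features, train_labels)
print(clf.score(valid_features, valid_labels))          # classification accuracy
```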

The power of ROCKET, though, lies in its features. These are generated by running each sample through a large set of fixed conv1d's. Each conv1d has randomized weights centered on zero, and a randomized bias. The output of each conv1d, a series itself, is then reduced to two numbers. The first is simply the maximum of the series. The second is the fraction of positive values in the series, the 'proportion of positive values' (ppv). In this way, each time series sample yields a feature vector of length two times the number of random convolutions. As with spectrogram images, it's these features that are sent to the classifier.
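Here is a minimal sketch of that feature extraction in PyTorch, assuming fixed-length samples. It randomizes only the kernel length, weights, and bias; real ROCKET also randomizes dilation and padding, so treat this as an illustration rather than the full method:

```python
import torch
import torch.nn.functional as F

def rocket_features(x, n_kernels=1000, seed=0):
    """Extract 2*n_kernels ROCKET-style features from a batch of
    fixed-length series x of shape (n_samples, length)."""
    torch.manual_seed(seed)
    feats = []
    for _ in range(n_kernels):
        klen = [7, 9, 11][int(torch.randint(3, (1,)))]  # kernel length from {7, 9, 11}
        w = torch.randn(1, 1, klen)                     # weights ~ N(0, 1)...
        w = w - w.mean()                                # ...centered on zero
        b = torch.empty(1).uniform_(-1.0, 1.0)          # bias ~ U(-1, 1)
        out = F.conv1d(x.unsqueeze(1), w, bias=b)       # (n_samples, 1, out_len)
        feats.append(out.max(dim=-1).values)            # feature 1: max
        feats.append((out > 0).float().mean(dim=-1))    # feature 2: ppv
    return torch.cat(feats, dim=1)                      # (n_samples, 2*n_kernels)
```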

It is important to note that the weights and biases of the conv1d's are fixed. Contrary to our usual practice, they are not trained during the optimization of the classifier.

Getting into opinion and speculation, I think ROCKET effectively does a search of the space of conv1d's by using a large universe of random kernel lengths, weights, biases, dilations, and paddings. The classifier then selects which of these conv1d's are predictive on the training samples. Rather than predesigning the architecture as we typically do, this approach finds the conv1d's that work best for the problem.

Such a search would be impossible using typical machine learning methods, because most of its parameters are not differentiable with respect to the loss. Two non-linearities, both themselves non-differentiable, then reduce the dimensionality of the conv1d outputs. IMO, there's great potential in this approach of using randomness to search the space of architectures and weights. You can find papers suggesting that the olfactory system's random connections work in a similar way. Also, see weight-agnostic architectures.

Some further notes…

  1. The various dilations of conv1d are able to extract the periodicities (frequencies) of the sounds, much as spectrograms do. I think that's one reason ROCKET works well on this audio task.

  2. Although ROCKET looks computationally intensive, I find that most of the trained classification coefficients end up very small. (This is not my idea - I downloaded a notebook that shows this observation, but don't know who originally authored it.) It means those conv1d's could be eliminated, or replaced with different randomly sampled conv1d's that may turn out to work better (see the sketch after this list for one way to find them).

  3. There's some special magic in the ppv non-linearity. Combined with conv1d, it is exceptionally good at classifying time series in general. Why is that so?
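On note 2, here is a minimal sketch of how you might flag the barely-used kernels from a fitted RidgeClassifierCV. The feature ordering (max and ppv adjacent per kernel, as in the extraction sketch above) and the threshold are my assumptions:

```python
import numpy as np

# clf.coef_ has shape (n_classes, n_features); each kernel owns two
# adjacent feature columns (max, ppv) under the ordering assumed above.
mags = np.abs(clf.coef_).max(axis=0)             # strongest use of each feature
per_kernel = mags.reshape(-1, 2).max(axis=1)     # strongest use of each kernel
unused = np.where(per_kernel < 1e-4)[0]          # hypothetical threshold
print(f"{len(unused)} of {len(per_kernel)} kernels barely used")
```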


Notes on my initial implementation (based on Ignacio Oguiza's ROCKET demo at https://github.com/timeseriesAI/timeseriesAI, thanks!)

https://github.com/PomoML/ROCKET_Sound

First, run the notebook saveSounds. It saves the Macaque calls and names into ~/.fastai. These will be loaded by the next notebook.

Second, run the notebook MacaqueROCKET for a demonstration of the ROCKET method. It requires fastai v1 for the last section only. Note that these notebooks have only been run locally; they are not tested on servers.

The biggest issue was dealing with variable-length samples. ROCKET is not limited to fixed-length samples, but works most straightforwardly with them. This issue is already discussed in depth in the Time Series/Sequential Data study group. One simple idea is to pad each sample with zeros to the same (longest) length. However, doing so drastically alters the max and ppv measures, and empirically decreases accuracy.

The primary problem with variable-length samples is that the randomly chosen kernel length, padding, and dilation can yield different-length conv1d outputs within a single batch. Worse, what should the max and ppv of a zero-length conv1d output be (a short sample with a large dilation)?
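Just to illustrate one way of handling different output lengths inside a rectangular batch (this is not the nan approach my notebook uses): zero-pad samples on the right, run the conv1d, and then mask the pooling to each sample's valid output length:

```python
import torch

def masked_max_ppv(out, valid_lens):
    """max and ppv over only the un-padded portion of each conv1d output.
    out: (n_samples, out_len), right-padded; valid_lens: (n_samples,) with
    valid output length = sample_len - dilation * (kernel_len - 1), assumed >= 1."""
    mask = torch.arange(out.shape[1]) < valid_lens.unsqueeze(1)   # (n_samples, out_len)
    neg_inf = torch.finfo(out.dtype).min
    mx = out.masked_fill(~mask, neg_inf).max(dim=1).values        # max over valid region
    ppv = ((out > 0) & mask).sum(dim=1) / valid_lens              # ppv over valid region
    return mx, ppv
```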

The issue is especially acute in PyTorch because, of course, tensors have to be rectangular. I experimented extensively with conv1d to find out exactly how it handles padding with nans/zeros, when it errors out, etc. I think this ROCKET implementation is correct when samples are padded on the right with nan, even when the conv1d output is empty. It throws an error, however, when the input tensor's sample-length dimension is too small for a particular conv1d. [Fixed on 2020-04-02.]

In the end, I did not tackle this last problem. Instead, I limited the dilations so that the shortest sample is always valid for every conv1d (see the sketch below). This proved nearly as accurate as including the larger dilations. Perhaps that's because we are identifying voice timbres by frequencies and formants, and those frequencies are already captured by the smaller dilations. If you were looking for larger structures in a call - the meaning or bass notes, for instance - the larger dilations would be needed.
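For reference, that dilation limit follows directly from conv1d's output-length formula; a tiny sketch (the function name is mine):

```python
def max_valid_dilation(shortest_len, kernel_len):
    """Largest dilation d for which an unpadded conv1d still produces at least
    one output on the shortest sample: L - d*(k-1) >= 1  =>  d <= (L-1)//(k-1)."""
    return (shortest_len - 1) // (kernel_len - 1)

# e.g. a shortest call of 3000 points with kernel length 9:
# max_valid_dilation(3000, 9) -> 374
```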


Notes on the problem…

It's an easy one in the grand scheme. In essence, we are distinguishing voices. That can be done quite well using pitch and timbre alone, which both spectrograms and conv1d can extract. But both methods have difficulty detecting temporal patterns. A ResNet detects features in an image, but does not know whether they are located in the upper left or lower right. ROCKET loses the time structure by pooling it away with ppv and max.

If the distant goal is to recognize the meaning of the calls, we will want to ignore pitch and timbre and focus on the call's structure along the time dimension. That will require some kind of time-aware architecture like an RNN. Just sayin' for now.


Directions and ideas (in case anyone is inspired)

  • Replace the unused conv1d features with new random ones. Does accuracy keep improving?

  • Do the most predictive conv1dā€™s have certain characteristics in common? If so, we get a sense of how to design a model based on conv1d.

  • Find a better way to adapt ROCKET to time series with different lengths. Right now the space of dilations assumes the series has a fixed length. Many conv1dā€™s with large dilations remain unused because they do not apply to short samples. Is there a way to better distribute the conv1dā€™s to match the distribution of sample lengths?

  • With a typical Linear/Cross entropy training on the features, would more layers find complex feature patterns that improve generalization?

  • Make a more efficient implementation that skips the overhead of nn.conv1d. We could go directly to F.conv because we already know the parameters are safe.

  • Fix the fastai section to work correctly and work with fastaiv2

  • I am severely lost with git and github :confused:, but will try to learn enough to integrate contributions. Iā€™ll probably need to ask for help. :slightly_smiling_face:

Thanks for reading and for code corrections!
