Time series / sequential data study group

I wanted a streamlined way to evaluate models so I created this notebook and file by slightly adapting @oguiza’s notebook. Hopefully this will be useful for testing new ideas.

The UCR archive provides some benchmarks, but they note that in many cases these can easily be improved with “low-hanging fruit.” Does anyone know of a more authoritative source for SotA results? If I’m testing a new model, I’d really like to know if my results are competitive with SotA.

If this turns out to be useful, I’d also like to improve it by…

  1. Creating a leaderboard (unless one already exists?)
  2. Agreeing on a “UCR-lite” dataset along the lines of Imagenette to quickly test new ideas (it can be time-consuming to test on the full dataset, especially if you use multiple iterations, which you probably should)

Thank you for coming here to help explain your work! I am working to adapt ROCKET for use on raw audio. In audio we normally convert the raw signal to a spectrogram, which extracts the relevant frequency data, and then pass that to a normal CNN.

My first results were fairly poor: 85% accuracy in 6 minutes on a 10-class voice recognition problem where I can usually get 99%+ accuracy in under 2 minutes using spectrogram + CNN. I added a stride (5-7 seemed to be ideal) to your code (it was just too time-expensive without one for long audio signals) and got faster results with very little accuracy sacrificed. Eventually I got it to reach 95% accuracy in 6s, and 99% accuracy in under 2 minutes!
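For illustration, applying a single kernel with a stride looks roughly like this (a simplified sketch rather than the exact code I ran; padding is omitted, and the function name and parameter values are just illustrative):

```python
import numpy as np

def apply_kernel_strided(x, weights, bias, dilation, stride=1):
    # Slide the dilated kernel over the signal, stepping `stride` positions at a
    # time, and return the two ROCKET features (PPV and max) for this kernel.
    klen = len(weights)
    end = len(x) - (klen - 1) * dilation
    positions = range(0, end, stride)               # stride > 1 skips positions
    out = np.empty(len(positions))
    for i, p in enumerate(positions):
        window = x[p : p + klen * dilation : dilation]
        out[i] = np.dot(window, weights) + bias
    return (out > 0).mean(), out.max()              # PPV, max

# e.g. one random kernel over a 2 s clip at 16 kHz (32,000 samples)
rng = np.random.default_rng(0)
signal = rng.standard_normal(32000)
weights = rng.standard_normal(9)
weights -= weights.mean()                           # zero-mean weights, as in ROCKET
ppv, mx = apply_kernel_strided(signal, weights, bias=rng.uniform(-1, 1),
                               dilation=32, stride=5)
```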

Unfortunately, when I tried a much harder problem, a 250-class dataset, I couldn’t get beyond 30% accuracy (~95% using the spectrogram method). I looked at the efficacy of a single random kernel on both datasets and found that on the 10-speaker problem a single kernel can get ~20% validation accuracy, but on the harder problem it is closer to 2%. It seems like the idea of ensembling slightly predictive random kernels gets exponentially harder as the number of classes goes up. How many classes do the bake-off datasets typically have? Did you notice this effect as well?

Any advice for applying rocket to raw audio data? Thanks again!

Edit: Here’s a graph with accuracy on a 25-class dataset as a function of the number of kernels, averaged over 3 runs each. The dataset has around 3,700 audio clips that have been trimmed to 2s each, so they are time series with 32,000 elements. Time for 16,384 kernels is ~16 minutes. The 250-speaker version is ~45,000 files and would take around 3 hrs to run the same number of kernels. This is with stride 5, by the way. Next up is playing with the dilation to try to extract frequency info in a more structured way; hopefully through that I can dramatically lower the number of kernels required.

1 Like

Thanks for taking the time to try ROCKET with audio. This is very interesting. (Also, thanks for providing the additional info in the edit.)

Preliminary comment: there are almost certainly problems where ROCKET will not work well, and this may be one of them. (I think I have slightly unconventional views on the ‘no free lunch’ theorem, but that is another story…) The Phoneme dataset in the UCR archive, which may (or may not) be most closely related to your data, is one of the datasets where ROCKET ‘struggles’ (although most methods for time series classification don’t do well on this dataset).

In any case, I will try and say something sensible.

Curiosity / Context

  • The spectrograms you usually use as input to your CNNs: these are 2-dimensional, essentially like images, with time on one axis and frequency on the other?
  • Accordingly, are you using CNNs based on architectures for image classification? What is the stride (in particular, the time-axis stride) you use in your CNNs?
  • Which, if any, other time series classification methods have you tried with your audio data? In particular, have you tried InceptionTime? If a particular method does work well, it might help to diagnose the problem (it may not be possible to make ROCKET work any better, but it might help to understand what is going on anyway).
  • Is the data, or something like it, publicly available? If so, I could take a look (however, that probably won’t make any difference).
  • What else, if anything (apart from stride), did you change to go from 85% accuracy to 95% or 99% accuracy (was this just increasing the number of kernels)? (If stride is not drastically hurting accuracy, perhaps discriminative features are mostly below some threshold frequency? Not sure how this helps…)

Compute Time

  • Are you running your CNNs on GPU? It looks like your input time series / signals are relatively long (32K). Any fundamental speedup is going to come from parallelising the transform. We don’t have an ‘official’ GPU implementation (at least, not yet). In the meantime (unless you have a CPU cluster available…), you might try the GPU implementation of the transform developed by @oguiza: see this notebook (that is, if you have not already tried this).
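Just to sketch the general idea of doing the transform on GPU (to be clear, this is not @oguiza’s notebook, only a rough PyTorch illustration; the kernel length, dilations and everything else here are assumptions):

```python
import torch
import torch.nn.functional as F

def rocket_like_features(x, n_kernels=1000, klen=9,
                         device="cuda" if torch.cuda.is_available() else "cpu"):
    # x: (n_samples, series_length), already standardised per series
    x = x.to(device).unsqueeze(1)                         # (N, 1, L)
    feats = []
    for _ in range(n_kernels):                            # grouping kernels by dilation
        w = torch.randn(1, 1, klen, device=device)        # would parallelise this further
        w -= w.mean()                                     # zero-mean weights
        bias = torch.empty(1, device=device).uniform_(-1, 1)
        dilation = int(2 ** torch.randint(0, 6, (1,)).item())
        out = F.conv1d(x, w, bias=bias, dilation=dilation)
        feats.append((out > 0).float().mean(dim=-1))      # PPV, shape (N, 1)
        feats.append(out.max(dim=-1).values)              # max, shape (N, 1)
    return torch.cat(feats, dim=1)                        # (N, 2 * n_kernels)

features = rocket_like_features(torch.randn(64, 32000), n_kernels=100)
```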

Performance

  • Do I understand that you are using the original audio (i.e., in the time domain) as input to ROCKET? (I doubt it can make sense of frequency domain information directly. However, exploding the input into multiple frequency tranches via, e.g., discrete wavelet transform might help, but that’s pure speculation.)
  • Are you using the ridge regression classifier (RidgeClassifierCV from scikit-learn) or another classifier (e.g., along the lines of our softmax regression implementation for ‘big’ datasets)? Logistic / softmax regression might (or might not) work better for a large number of classes (i.e., 250). It’s probably worth a try (if you haven’t already tried it).
  • I think the dataset with the largest number of classes in the UCR archive has 60 classes (so, quite a few fewer than 250). I can’t see from the results that there is any obvious link between the number of classes and the performance of ROCKET (at least, up to 60 classes).
  • In relation to the number of kernels: it is very unlikely that increasing the number of kernels is going to fundamentally improve performance. That is to say, if results are terrible for 1,000 kernels, it is very unlikely that they are going to be ‘good’ for 10,000 kernels or 100,000 kernels. I’d say that something else isn’t working. So, I’d start with 1,000 kernels first (certainly no more than 10,000), and try and work out what is happening.
  • Basic point (this may be inherent in audio data, or you may be doing it anyway): you must make sure that the input is mean centred and scaled so that each time series / signal has a mean of zero and a standard deviation of one. The way ROCKET is set up, this is assumed. It will still ‘work’ otherwise, but not necessarily very well. (However, given the performance you are seeing on the 10-class problem, my guess is that you are doing this already.) A minimal sketch of this normalisation, together with the ridge versus softmax comparison above, follows this list.
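To make the normalisation point and the ridge-versus-softmax question concrete, here is a self-contained sketch (the random ‘features’ merely stand in for the output of the ROCKET transform; all names, shapes and values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, RidgeClassifierCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def standardise(X):
    # give each series a mean of zero and a standard deviation of one
    mu = X.mean(axis=-1, keepdims=True)
    sd = X.std(axis=-1, keepdims=True) + 1e-8
    return (X - mu) / sd

X = standardise(rng.standard_normal((200, 32000)))   # toy stand-in for raw clips
y = rng.integers(0, 250, size=200)                   # a many-class problem

# features = rocket_transform(X) in practice; a random stand-in of similar shape here
features = rng.standard_normal((200, 2000))

f_tr, f_te, y_tr, y_te = train_test_split(features, y, random_state=0)

ridge = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)).fit(f_tr, y_tr)
softmax = LogisticRegression(max_iter=1000).fit(f_tr, y_tr)   # softmax regression

print("ridge:", ridge.score(f_te, y_te), "softmax:", softmax.score(f_te, y_te))
```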

It would probably be helpful if I could see the data you are using (but if this is not possible, no problem). It may be that something about the configuration of ROCKET (something we are doing) doesn’t make sense for your data… I am thinking in particular of the length of your time series versus kernel length and dilation.

Sorry, I now realise that none of what I have said is likely to help you immediately. Perhaps we can keep working through the problem, particularly if I have the chance to look at the data.

Best,

Angus.

3 Likes

Thank you so much for the really thoughtful analysis and ideas. I’m going to be away for the weekend but I’ll return on Monday and give this the reply it deserves (with code!). Cheers.

Thank you for the detailed reply. I was also working with @MadeUpMasters on adapting ROCKET for audio, but I took a different approach. Instead of trying to fit the raw audio, where we run into compute problems because of the large number of elements, I used the same spectrograms that we use with CNN models and treated them as a multichannel time series, where the channels are the frequency components. Compared to the CNN model, it got better results in less training time on a 72-class voice recognition dataset. The CNN took 2 minutes and 30 seconds to reach 95% accuracy, whereas ROCKET with 1,000 kernels reached 96.4% in 30 seconds.

The code for the ROCKET experiment is here. It’s inspired by @oguiza’s notebook, and it contains all the steps necessary to reproduce the result, including installing the audio library and downloading the data.
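The preprocessing idea is roughly the following (a sketch rather than the exact notebook code; the librosa parameters are illustrative):

```python
import numpy as np
import librosa

def audio_to_multichannel_series(path, sr=16000, n_mels=64):
    # load the clip, compute a log-mel spectrogram, and return it as a
    # (n_channels, n_timesteps) array: the frequency bins become the channels
    y, _ = librosa.load(path, sr=sr, mono=True)
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(spec).astype(np.float32)

# stacking the clips gives a (n_clips, n_channels, n_timesteps) array, which is
# the layout multivariate ROCKET implementations typically expect
```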

3 Likes

Thanks @MadeUpMasters and @scart97. My apologies in advance, I will do my best, but I may not be able to look at this properly for at least a week.

As soon as I can, I will let you know what I find.

Best,

Angus.

2 Likes

Thanks for sharing this @MadeUpMasters and @scart97. Very interesting approach.
I think this opens a new way to use ROCKET that seems promising.

2 Likes

Correct, here’s an example spectrogram: [image]

Yes, we mainly use resnets/densenets and get very good results. As a result we haven’t played around with audio-specific architectures much, but we are getting to that point now.

None. I have no familiarity with time series, but @scart97 messaged me your work and said “hey, maybe we can apply this to audio”. This is my first attempt at applying convs to raw audio for classification. After reading your paper, I think every other time series method is going to take too long.

I’ll detail this more in the code/summary I post, but audio preprocessing helped quite a bit. My initial result was 85.4% accuracy in 4 min 1 s, using 10k kernels and stride 7 (it took a stride of 7 just to make the time reasonable). Removing silence and taking a larger time chunk were what improved accuracy from there.
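Roughly, the silence removal and chunking look like this (a simplified sketch using plain librosa rather than our actual fastai v2 audio code; the parameter values are illustrative):

```python
import numpy as np
import librosa

def strip_silence_and_chunk(path, sr=16000, chunk_seconds=4, top_db=30):
    # drop silent regions, then keep a fixed-length chunk (padded if too short)
    y, _ = librosa.load(path, sr=sr, mono=True)
    intervals = librosa.effects.split(y, top_db=top_db)     # non-silent spans
    if len(intervals):
        y = np.concatenate([y[start:end] for start, end in intervals])
    n = sr * chunk_seconds
    return y[:n] if len(y) >= n else np.pad(y, (0, n - len(y)))
```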

We are! That’s why the initial results were so exciting/surprising. We are currently building an audio library for fastai v2 and have been looking for a way to allow people to train on raw audio. I’ll continue to work on raw audio while @scart97 tests on spectrograms. I’m fairly new to audio and never learned what wavelet transforms are (I’ve seen the name several times), but I’ll try to dig into it and see if that could be something worth using.

Currently the ridge classifier, but also something I plan to experiment with (today).

Yes we are making sure to do this. Thank you.

Not true at all, I found this extremely helpful. You’ve given me a good list of leads to chase down. I’m going to do a bit more experimentation and then post some cleaned up, hopefully reproducible notebooks. Thanks again.

Also, if anyone feels we are cluttering the time series thread and getting too far outside its scope with the audio discussion, I’m happy to move it to the Deep Learning With Audio Thread.

Edit: Back with a summary of results and a (still somewhat messy) notebook.

NB: https://nbviewer.jupyter.org/github/rbracco/fastai_dev/blob/rocket/dev/75_audio_rocket_tuning.ipynb
Repo: https://github.com/rbracco/fastai_dev/tree/rocket/dev

Unfortunately this relies on our fastai v2 audio code for some of the preprocessing, so it would probably be easier to start from a new notebook than to pull this and try to work out of my fastai_dev fork. I have some more interesting stuff to add for harder datasets, but it still needs to be organized. I’ll be back tomorrow or Wednesday.

Hi! Do you have experience with imbalanced datasets? Is there any dataset in the UCR archive that causes trouble due to this?

Thanks!!

I was curious whether my model was using the 20k features generated by ROCKET fairly evenly, or relying on just a few “good” features. I made a notebook to explore this question. TL;DR: I found you can use a small subset of features, sometimes as few as 100, with small losses in classification accuracy.
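For context, the kind of selection I mean is roughly the following (a sketch of the general idea rather than the exact notebook code; the random features stand in for the real ROCKET output):

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(0)
features = rng.standard_normal((300, 20000))   # stand-in for 10k kernels -> 20k features
labels = rng.integers(0, 10, size=300)

clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)).fit(features, labels)

# coef_ has shape (n_classes, n_features); take the largest magnitude per feature
importance = np.abs(clf.coef_).max(axis=0)
top_100 = np.argsort(importance)[::-1][:100]   # indices of the 100 "best" features

clf_small = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)).fit(
    features[:, top_100], labels)              # refit on the reduced feature set
```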

4 Likes

I might be able to provide some context here.

In some sense what would be ideal with feature selection is for the number of features to go down, but for accuracy to go up (or, at least, not go down). However, this is tricky.

My expectation (although I would be happy to be proven wrong), is that straightforward feature selection is unlikely to work well. (This is despite the fact that, as you have seen, the distribution of weights in the learned classifier looks promising.)

This is covered to some extent in section 4.3.1 of the paper. If you look at figure 5, basically what it shows is that as the number of kernels goes up, accuracy also goes up (and vice versa). The difference stops being statistically significant after about 10K kernels. However, the difference between, say, 100 kernels and 1K kernels is statistically significant, not because 1K kernels produce radically higher accuracy than 100 kernels, but rather because 1K kernels produce consistently higher accuracy. This could be a very small increase in accuracy over a majority of datasets, and indeed this is basically what you see with any increase in the number of kernels: a small but consistent increase in accuracy. The other aspect of this, as you have seen, is that variance increases as the number of kernels goes down (any given set of 100 kernels is likely less similar to any other given set of 100 kernels, as compared to the difference between two sets of 10K kernels, or two sets of 100K kernels).

So why do we use 10K kernels (producing 20K features)? Because 10K is consistently more accurate than < 10K kernels, because 10K kernels produces relatively low variance, because more kernels doesn’t make that much more difference in terms of either accuracy or variance (diminishing returns; more formally, this is more or less the point where the difference is no longer statistically significant), because 10K is a round number, and because the classifier (the ridge regression classifier or logistic / softmax regression) can handle 20K features easily (and, for the ridge regression classifier, even with a small number of training examples).

Nonetheless, if compute time is critical, you can use fewer kernels. What you can’t see in figure 5 is that, even with 100 kernels, ROCKET ranks somewhere in the middle of the ‘second pack’ of classifiers (roughly similar performance to ProximityForest). And, with 100 kernels, you should be able to train and test the whole UCR archive in about 2 minutes, and you can see from our scalability experiments that, for a small hit in accuracy, you can learn from > 1 million time series in about 1 minute.

Loosely speaking, for problems where ROCKET works well, even, e.g., 100 kernels should produce pretty good accuracy. 1K kernels will produce consistently higher accuracy (but the actual increase in accuracy is likely to be relatively small). Same for 10K over 1K, etc. (Obviously, you start running into other problems as the number of kernels keeps increasing.)

The bottom line is that you are probably going to get similar results simply by using fewer kernels in the first place, rather than generating more kernels then doing feature selection.

There is also a bigger picture. Feature selection takes time. Not necessarily much time, but it depends what you are doing. Convolutional kernels are proven feature detectors, but the potential ‘space’ of all kernels (even just in terms of weights, let alone in terms of arrangement or architecture) is very large. The typical way of wading through this space is by learning the kernel weights, and possibly venturing some kind of architecture search, or using a proven architecture such as ResNet or InceptionTime.

But there is another approach, i.e., the approach taken by ROCKET, which is—speaking fairly loosely—to simply generate lots of kernels which, in combination, provide good coverage of the space of all kernels (or all useful kernels).

Feature selection fits somewhere on the continuum between ‘completely random’ and fully learned kernels in an established architecture, or an architecture found through some kind of architecture search. At some point, it will almost certainly be more beneficial to simply spend time learning the kernels and performing some kind of architecture search, rather than hunting through randomly generated kernels. (Note also the possible ‘correlation effect’ observed by @MadeUpMasters in his summary, above, which might work against feature selection for random kernels.)

Having said all that, scikit-learn (just as an example) has a number of feature selection methods which can be used directly with ROCKET in one or more ways, and may prove useful at least for some problems.
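For example, something along these lines (just a sketch; the choice of SelectKBest and k is illustrative, and the random features stand in for the ROCKET transform output):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
features = rng.standard_normal((300, 20000))   # stand-in for ROCKET features
labels = rng.integers(0, 10, size=300)

# keep the 1,000 highest-scoring features, then fit the usual ridge classifier
pipe = make_pipeline(SelectKBest(f_classif, k=1000),
                     RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)))
pipe.fit(features, labels)
```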

Best,

Angus.

6 Likes

Class imbalance in the UCR archive is a bit all over the place: many datasets have some degree of class imbalance, many do not (you can check by doing something like, for each dataset, _, counts = np.unique(Y, return_counts=True); counts.max() / counts.sum(), or some variation thereof).
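Written out as a small function (the loop over the archive’s datasets is left out):

```python
import numpy as np

def majority_class_fraction(Y):
    # fraction of instances in the largest class; 1 / n_classes = perfectly balanced
    _, counts = np.unique(Y, return_counts=True)
    return counts.max() / counts.sum()

print(majority_class_fraction(np.array(["a", "a", "a", "b", "c"])))   # 0.6
```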

It is also a bit hard to tell what effect imbalance is having; that is, it is difficult to separate out the effect of imbalance from whatever else might make a given dataset ‘difficult’ (or ‘easy’). Obviously, in an extreme case, imbalance will make high accuracy trivial (there are, of course, other measures apart from accuracy).

From an extremely cursory look, it doesn’t look like there is a strong connection between imbalance in the datasets in the UCR archive and the performance of most classifiers (but really I have only just glanced at this, I might be wrong).

Then there are some ‘typical’ issues with imbalance more generally: most classifiers (including RidgeClassifierCV) will have some more or less direct way of dealing with imbalance (under/over-sampling, weighting, etc.), but this can work against you if the imbalance is reflective of the generating distribution. It has been a while since I looked at this, and not very formally, but from memory I don’t think trying to counter imbalance made much difference for datasets in the UCR archive. (Others might have a different experience…)
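As one concrete example of a ‘direct’ option (a sketch, with toy data standing in for ROCKET features):

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(0)
features = rng.standard_normal((200, 50))                            # stand-in features
labels = np.r_[np.zeros(180, dtype=int), np.ones(20, dtype=int)]     # 90/10 imbalance

# class_weight="balanced" reweights classes inversely to their frequency
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10), class_weight="balanced")
clf.fit(features, labels)
```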

Thanks very much for this. I will look at this properly as soon as I can and get back to you.

Time Series transforms

I have uploaded to the timeseriesAI repo a new notebook called 06_TS_transforms.ipynb

It contains a number of new single-item transforms for the fastai_timeseries library that are now available for use. In general, there are two types of transforms:

  1. Those that slightly modify the time series in the x- and/or y-axes (a rough sketch of the general idea follows the list below):
  • TSmagnoise
  • TSmagscale
  • TSmagwarp
  • TStimenoise
  • TStimewarp
  2. And those that remove a certain part of the time series (a section or channel):
  • TSlookback
  • TStimestepsout
  • TSchannelout
  • TScutout
  • TScrop
  • TSwindowslice
  • TSzoom
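
To give a rough feel for the first group, here is the concept in plain numpy (this is not the fastai_timeseries implementation; function names and parameter values are illustrative):

```python
import numpy as np

def mag_noise(ts, sigma=0.05, rng=np.random.default_rng()):
    # add small Gaussian noise to the values (y-axis) of the series
    return ts + rng.normal(0.0, sigma, size=ts.shape)

def mag_scale(ts, low=0.9, high=1.1, rng=np.random.default_rng()):
    # rescale each channel by a random factor close to 1
    return ts * rng.uniform(low, high, size=(ts.shape[0], 1))

ts = np.sin(np.linspace(0, 8 * np.pi, 400))[None, :]   # toy 1-channel series
augmented = mag_scale(mag_noise(ts))
```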
7 Likes

Thank you so much @oguiza!!!

Is there any reference paper that you are using for defining all those transformations?

Thanks for your question @vrodriguezf!
I have now included the paper references in the notebook.
Most of the ideas come from this paper:
Data augmentation of wearable sensor data for Parkinson’s disease monitoring using convolutional neural networks. arXiv preprint arXiv:1706.00527
However, I had to adapt the code in most cases.
For now, they all run on the CPU.

1 Like

Thanks for the detailed response!

I found it really hard to believe that [make 100 features] could perform as well as [make 10k features, choose the best 100], but when I tested it, you were right: they performed about the same. How do you interpret this result?

One question I have on this (it may be dumb, I’ve only been listening in): what are the cardinality and distribution of those 100 features versus the 9,900 left out? Were they mostly binary or not very verbose?

I’m not sure I understand what cardinality/binary/verbose mean in this context, could you explain what you mean?

I am super interested in what the selected kernels look like, especially whether they’ll recognizably pick up on classic time series features like seasonality. (I think this is sort of what we’re hoping for when applying convolutions to TS.) High on my list of things to look at.

2 Likes

Thanks a lot to you and @angusde, and sorry for not replying before, but something came up in the meantime. I’m retraining a good ol’ LSTM baseline, then I will try to use ROCKET. However, it sounds like you already tried it for forecasting and it didn’t work out well, right?