This is impressive. Would it be possible to adapt it for regression (forecasting)? If not, do you know which are the SOTA Deep Learning approaches for multivariate time series forecasting with a univariate response? Thanks
I agree with @hfawaz. Thank you for inviting us here.
I would never describe myself as a world-class anything! I am just another PhD student. I hope ROCKET is useful in the real world, and I hope I can make myself useful here. I think I can probably learn more from you than you can from me…
This is a good question.
The short answer is yes, ROCKET supports variable-length time series. I’ll try to explain how and why (many of you probably know more about this than I do, so feel free to ignore this, and please correct me if you think I’m wrong).
There is a very recent paper from our group arXiv:1910.04341 which looks at this issue more generally. In summary (maybe not a very good summary), the best way to handle variable-length time series depends on why the time series have different lengths—e.g., if time series have different sampling rates, maybe even variable sampling rates, or if time series have the same sampling rate but represent different lengths of time, or some combination, etc.—and may also be classifier-dependent.
In practice, two ways of handling variable-length time series are rescaling everything to the same length, and some version of just leaving the time series as they are (which might involve ‘padding’ shorter time series with zeros or low-level noise such that all time series have the same length).
(If you have the original timestamps, you should be able to rescale everything despite arbitrarily complex sampling issues.)
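As a rough illustration of the rescaling option, here is a minimal sketch that linearly resamples each series to a common length with NumPy. The function name `rescale_to_length` and the choice of linear interpolation are assumptions made for this example, not something taken from the ROCKET code; when timestamps are supplied they are assumed to be strictly increasing.

```python
import numpy as np

def rescale_to_length(values, target_length, timestamps=None):
    """Linearly resample one series onto `target_length` evenly spaced points."""
    values = np.asarray(values, dtype=np.float64)
    if timestamps is None:
        # Assumption: samples are evenly spaced when no timestamps are given.
        timestamps = np.arange(len(values), dtype=np.float64)
    else:
        timestamps = np.asarray(timestamps, dtype=np.float64)
    # New, evenly spaced time grid covering the same time span.
    new_times = np.linspace(timestamps[0], timestamps[-1], target_length)
    return np.interp(new_times, timestamps, values)

# Two series of different lengths become one rectangular array.
series = [np.array([0.0, 1.0, 2.0]), np.array([0.0, 2.0, 4.0, 6.0, 8.0])]
rescaled = np.stack([rescale_to_length(s, 5) for s in series])
```

Passing the original timestamps lets the same function cope with irregular sampling, since the interpolation is done in time rather than in sample index.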
For ROCKET, our approach is to choose between leaving the time series ‘as is’ (that is, with their original variable lengths)—ROCKET can handle input time series of different lengths (more on this below)—or to rescale time series to the same length, via 10-fold cross-validation on the training set: if rescaling works better we rescale, otherwise we leave the input time series ‘as is’.
The code in reproduce_experiments_additional.py should serve as a template for handling variable-length time series (and other missing values) with ROCKET. Note that this is set up to handle datasets where shorter time series have been padded with NaN values (reflecting the format of the datasets we used in our experiments). Also, at the moment the input to apply_kernels_jagged(...) (more below) is a rectangular NumPy array. It would be much better for this to be a ‘jagged’ array (not yet cleanly supported in Numba) or a 1-dimensional array (basically, how some sparse matrices are implemented behind the scenes in, e.g., SciPy… you would also pass a vector of time series lengths to ‘chop up’ the input 1-dimensional array into individual time series), but I am waiting for some changes coming through in Numba before changing the implementation. (Although it is tempting to just take the 1-dimensional input approach, it might make things difficult for first-time users.)
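To make the 1-dimensional idea concrete, here is a small sketch of going from a NaN-padded rectangular array to a flat array plus a vector of lengths. This is only an illustration of the data layout, not the ROCKET implementation; the helper `get_series` is hypothetical, and the sketch assumes NaN appears only as trailing padding (no missing values inside a series).

```python
import numpy as np

nan = np.nan
# Rectangular input: shorter series are padded at the end with NaN.
X_padded = np.array([
    [1.0, 2.0, 3.0, nan],
    [4.0, 5.0, nan, nan],
    [6.0, 7.0, 8.0, 9.0],
])

valid = ~np.isnan(X_padded)
lengths = valid.sum(axis=1)             # true length of each series
flat = X_padded[valid]                  # all values back to back, row by row
offsets = np.concatenate(([0], np.cumsum(lengths)))  # start index of each series

def get_series(flat, offsets, i):
    """Return series i as a view into the flat array (hypothetical helper)."""
    return flat[offsets[i]:offsets[i + 1]]
```

The `offsets` vector plays the same role as the row-pointer array in a compressed sparse row matrix: series `i` lives in `flat[offsets[i]:offsets[i + 1]]`.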
Using Variable-Length Time Series ‘As Is’
Rescaling is straightforward, and doesn’t really have anything to do with ROCKET per se: just rescale and use ROCKET normally. Using variable-length time series ‘as is’ is a bit trickier, which I will now explain. Basically, instead of using the apply_kernels(...) function, you should use the apply_kernels_jagged(...) function (which for now is in reproduce_experiments_additional.py). This function takes an additional argument: the lengths of the input time series.
All this new function does is to check whether the effective length of a given kernel (including dilation) is smaller than a given time series (including padding): if so, the kernel is applied as normal; if not, the kernel is ‘skipped’ / ignored.
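The check described above can be sketched in a few lines. This is not the code from apply_kernels_jagged(...); the function names are made up for this example, and treating a kernel that exactly spans the padded series as one that ‘fits’ is an assumption about the boundary case.

```python
def effective_kernel_length(kernel_length, dilation):
    """Number of time steps a dilated kernel spans, end to end."""
    return (kernel_length - 1) * dilation + 1

def kernel_fits(kernel_length, dilation, padding, series_length):
    """True if the dilated kernel fits inside the series plus its padding.

    `padding` is the number of zeros added to EACH end of the series.
    Assumption: an exact fit (one output value) counts as fitting.
    """
    padded_length = series_length + 2 * padding
    return effective_kernel_length(kernel_length, dilation) <= padded_length

# A length-9 kernel with dilation 8 spans 65 time steps.
too_big = kernel_fits(kernel_length=9, dilation=8, padding=0, series_length=24)
# The same kernel with padding of half its span on each side does fit.
with_padding = kernel_fits(kernel_length=9, dilation=8, padding=32, series_length=24)
# A smaller dilation fits without any padding.
small_dilation = kernel_fits(kernel_length=9, dilation=2, padding=0, series_length=24)
```

When the check fails, the features for that kernel and that series are simply left at zero, which is the ‘skipping’ behaviour described here.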
To understand why this is necessary, I’ll go back to how ROCKET works. So, we generate kernels with random size, dilation, and padding (among other things). Dilation is set with reference to some length. In the simple case, this is the length of the input time series, which is the same for all the input time series. For variable-length time series, by default we use the length of the longest time series (but you can do whatever you want: length of shortest time series, median length, mean length, etc.).
If (as by default), dilation is set with reference to the length of the longest time series, then the effective length of some kernels (including dilation) may be longer than the length of some shorter time series (excluding padding). If padding is used when applying a given kernel, none of this matters, as kernels of any size will ‘fit’ the time series with padding included. However, for kernels applied without padding, a kernel with big dilation might be ‘too big’. So, we ‘skip’ those kernels for shorter time series where the kernel doesn’t ‘fit’ (the resulting features are just zero).
Note, however, that as dilation is sampled on an exponential scale (relatively more smaller dilations, relatively fewer larger dilations), and as padding is applied to roughly half of kernels, even where dilation is set with reference to the longest time series, this problem should only affect a minority of kernels / time series. So basically, the ‘skipping’ behaviour is doing just enough to prevent problems in extreme cases. Having said that, it may be necessary to change the approach for some datasets (in extreme cases, the longest time series might be radically longer than the average, in which case it might make more sense to, e.g., set dilation with reference to a shorter length, and / or pad shorter time series with noise).
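To see what ‘sampled on an exponential scale’ means in practice, here is a small sketch. As I read the paper, dilation is drawn as floor(2^x) with x uniform between 0 and log2((reference length - 1) / (kernel length - 1)); the function name and the example numbers below are assumptions for illustration only, and the sketch assumes the reference length is larger than the kernel length.

```python
import numpy as np

def sample_dilations(reference_length, kernel_length, n, rng):
    """Draw n dilations on an exponential scale, relative to a reference length."""
    # Largest exponent for which the dilated kernel still spans <= reference_length.
    max_exponent = np.log2((reference_length - 1) / (kernel_length - 1))
    exponents = rng.uniform(0.0, max_exponent, size=n)
    return np.floor(2.0 ** exponents).astype(np.int64)

rng = np.random.default_rng(0)
# Reference length could be the longest, median, or shortest series length.
dilations = sample_dilations(reference_length=1000, kernel_length=9, n=10_000, rng=rng)

# Fraction of kernels (length 9, no padding) that would be too long for a
# hypothetical shorter series of 200 points.
effective_lengths = (9 - 1) * dilations + 1
share_too_long = np.mean(effective_lengths > 200)
```

Because small dilations are much more common than large ones, the median dilation sits well below the mean, and only a minority of kernels end up longer than a moderately shorter series.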
I hope this helps!
With the disclaimer that I haven’t actually tried this, yes, in principle it should be possible to use ROCKET for regression: perform the transform and then, instead of fitting a classifier, fit a regression model.
In principle it should be straightforward to adapt the model for regression: set as your target the value from the next time step and use a model appropriate for regression on top of the features generated by ROCKET. I made a notebook where I prepare data for a forecasting problem. I wasn’t able to get very good results, perhaps because my data is very noisy.
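A minimal sketch of that setup, assuming a single univariate series: slide a fixed window over the series, use the value right after each window as the target, and fit a ridge regression. The window size, the sine-wave data, and the helper name are all made up for this example, and the raw windows stand in for the ROCKET features (in a real pipeline you would pass the windows through the transform first).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.linear_model import RidgeCV

def make_forecasting_windows(series, window):
    """Return (X, y): X[i] is series[i:i+window], y[i] is the next value."""
    series = np.asarray(series, dtype=np.float64)
    X = sliding_window_view(series[:-1], window)
    y = series[window:]
    return X, y

series = np.sin(0.1 * np.arange(200))
X, y = make_forecasting_windows(series, window=20)

# Chronological split: never shuffle before splitting a forecasting problem.
n_train = 150
features = X  # placeholder: replace with the ROCKET transform of each window
model = RidgeCV(alphas=np.logspace(-3, 3, 10)).fit(features[:n_train], y[:n_train])
predictions = model.predict(features[n_train:])
```

The only change from the classification setup is the final estimator: a regressor (here RidgeCV) instead of a classifier.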
As we work towards a fastai.time_series I’m really excited about stealing (ahem, borrowing) techniques from the text module. Preparing time series data for forecasting is identical to preparing text data for a language model. In both cases, given a series, you try to predict what comes next. A lot of hairy problems, like dealing with varying length inputs, or shuffling batches when order is important, have been addressed there.
Additionally, if we can get forecasting working reasonably well, we can leverage it to improve classification à la ULMFiT: train a forecasting model, rip off the last layer, and replace it with whatever you’re interested in. There’s a fantastic repo by @mb4310 that does exactly that.
Thank you so much for your help @angusde! That was much more than an answer, that was a tutorial! I think I’ll start experimenting with time series ‘as is’ setting the reference dilation to the median, because the difference between lengths is large in my current dataset (24 points for the longest one, only 3 or 4 for the shorter one)
Yes I tried ROCKET for regression on my multi-channel problem. I am working on something else right now so I haven’t had time to test more, but you can try my fork example (or @oguiza’s PyTorch implementation to generate the kernels).
The UCR archive provides some benchmarks, but they note that in many cases these can easily be improved with “low-hanging fruit.” Does anyone know of a more authoritative source for SotA results? If I’m testing a new model, I’d really like to know if my results are competitive with SotA.
If this turns out to be useful, I’d also like to improve it by…
- Creating a leaderboard (unless one already exists?)
- Agreeing on a “UCR-lite” dataset along the lines of Imagenette to quickly test new ideas (it can be time-consuming to test on the full dataset, especially if you use multiple iterations, which you probably should)
Thank you for coming here to help explain your work! I am working to adapt rocket for use on raw audio. In audio we normally convert the raw signal to a spectrogram which extracts the relevant frequency data, and then pass that to a normal CNN.
My first results were fairly poor: 85% in 6 minutes on a 10-class voice recognition problem where I can usually get 99%+ accuracy in under 2 minutes using spectrogram + CNN. I added a stride (5-7 seemed to be ideal) to your code (it was just too time-expensive without one for long audio signals) and got faster results with very little accuracy sacrificed. Eventually I got it to reach 95% accuracy in 6s, and 99% accuracy in under 2 minutes!
Unfortunately when I tried a much harder problem, a 250 class dataset, I couldn’t get beyond 30% accuracy (~95% using spectrogram method). I looked at the efficacy of a single random kernel on both datasets and found on the 10 speakers a single filter can get ~20% validation accuracy, but on the harder problem it is closer to 2%. It seems like the idea of ensembling slightly predictive random kernels gets exponentially harder as the number of classes goes up. What are the class sizes like in the bake-off dataset? Did you notice this effect as well?
Any advice for applying rocket to raw audio data? Thanks again!
Edit: Here’s a graph with accuracy on a 25 class dataset as a function of number of kernels, averaged over 3 runs each. The dataset has around 3700 total audios that have been trimmed to 2s each, so they are time series with 32000 elements. Time for 16384 kernels is ~16 minutes. The 250 speaker version is ~45000 files and would take around 3 hrs to run the same amount. This is with stride 5 by the way. Next up is playing with the dilation to try to extract frequency info in a more structured way, hopefully through that I can dramatically lower the number of kernels required.
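For anyone following along, this is roughly what adding a stride to a single dilated kernel looks like. It is a simplified NumPy sketch written for this thread, not the modified ROCKET code: there is no padding, the function name is made up, and the two returned values are the proportion of positive values (ppv) and the max, the two features ROCKET takes from each kernel.

```python
import numpy as np

def apply_kernel_strided(x, weights, bias, dilation, stride=1):
    """Apply one dilated kernel every `stride` steps; return (ppv, max)."""
    x = np.asarray(x, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    span = (len(weights) - 1) * dilation          # distance from first to last tap
    starts = np.arange(0, len(x) - span, stride)  # valid start positions only
    if len(starts) == 0:
        raise ValueError("kernel does not fit the series")
    # Index matrix: one row of tap positions per start position.
    taps = starts[:, None] + dilation * np.arange(len(weights))[None, :]
    outputs = x[taps] @ weights + bias
    return np.mean(outputs > 0), outputs.max()

rng = np.random.default_rng(0)
signal = rng.normal(size=32_000)                  # stand-in for 2 s of audio
kernel = rng.normal(size=9)
kernel -= kernel.mean()                           # mean-centred weights

ppv_full, max_full = apply_kernel_strided(signal, kernel, 0.0, dilation=4, stride=1)
ppv_fast, max_fast = apply_kernel_strided(signal, kernel, 0.0, dilation=4, stride=5)
```

With stride 5 the kernel is evaluated at one fifth of the positions, so the work drops by about a factor of five, while ppv stays an estimate of the same quantity and the max can only stay the same or go down.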
Thanks for taking the time to try ROCKET with audio. This is very interesting. (Also, thanks for providing the additional info in the edit.)
Preliminary comment: there are almost certainly problems where ROCKET will not work well, and this may be one of them. (I think I have slightly unconventional views on the ‘no free lunch’ theorem, but that is another story…) The Phoneme dataset in the UCR archive, which may (or may not) be most closely related to your data, is one of the datasets where ROCKET ‘struggles’ (although most methods for time series classification don’t do well on this dataset).
In any case, I will try and say something sensible.
Curiosity / Context
- The spectrograms you usually use as input to your cnns—these are 2-dimensional, essentially like images, time on one axis and frequency on the other axis?
- Accordingly, are you using cnns based on architectures for image classification? What is the stride (in particular, the time-domain-axis stride) you use in your cnns?
- Which, if any, other time series classification methods have you tried with your audio data? In particular, have you tried InceptionTime? If a particular method does work well, it might help to diagnose the problem (it may not be possible to make ROCKET work any better, but it might help to understand what is going on anyway).
- Is the data, or something like it, publicly available? If so, I could take a look (however, that probably won’t make any difference).
- What else, if anything (apart from stride), did you change to go from 85% accuracy to 95% or 99% accuracy (was this just increasing the number of kernels)? (If stride is not drastically hurting accuracy, perhaps discriminative features are mostly below some threshold frequency? Not sure how this helps…)
- Are you running your cnns on GPU? It looks like your input time series / signals are relatively long (32K). Any fundamental speedup is going to come from parallelising the transform. We don’t have an ‘official’ GPU implementation (at least, not yet). In the meantime (unless you have a CPU cluster available…), you might try the GPU implementation of the transform developed by @oguiza: see this notebook (that is, if you have not already tried this).
- Do I understand that you are using the original audio (i.e., in the time domain) as input to ROCKET? (I doubt it can make sense of frequency domain information directly. However, exploding the input into multiple frequency tranches via, e.g., discrete wavelet transform might help, but that’s pure speculation.)
- Are you using the ridge regression classifier (RidgeClassifierCV from scikit-learn) or another classifier (e.g., along the lines of our softmax regression implementation for ‘big’ datasets)? Logistic / softmax regression might (or might not) work better for a large number of classes (i.e., 250). It’s probably worth a try (if you haven’t already tried it).
- I think the dataset with the largest number of classes in the UCR archive has 60 classes (so, quite a few fewer than 250). I can’t see from the results that there is any obvious link between the number of classes and the performance of ROCKET (at least, up to 60 classes).
- In relation to the number of kernels: it is very unlikely that increasing the number of kernels is going to fundamentally improve performance. That is to say, if results are terrible for 1,000 kernels, it is very unlikely that they are going to be ‘good’ for 10,000 kernels or 100,000 kernels. I’d say that something else isn’t working. So, I’d start with 1,000 kernels first (certainly no more than 10,000), and try and work out what is happening.
- Basic point (may be inherent in audio data, or you may be doing this anyway)—you must make sure that the input is mean centred (each time series / signal has a mean of zero), and has a standard deviation of one. The way ROCKET is set up, this is assumed. It will still ‘work’ otherwise, but not necessarily very well. (However, given the performance you are seeing on the 10-class problem, my guess is that you are doing this already.)
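On the last point in the list, a minimal sketch of per-series normalisation, assuming the input is a rectangular array with one series per row. The function name and the small `eps` guard against constant series are choices made for this example.

```python
import numpy as np

def normalise_per_series(X, eps=1e-8):
    """Give every row (time series) a mean of zero and a standard deviation of one."""
    X = np.asarray(X, dtype=np.float64)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    # eps avoids division by zero for constant (e.g. silent) series.
    return (X - mean) / (std + eps)

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(4, 100))
Z = normalise_per_series(X)
```

Note that the statistics are computed per series, not per dataset, so no information leaks between training and test examples.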
It would probably be helpful if I could see the data you are using (but if this is not possible, no problem). It may be that something that we are doing, that is, something about the configuration of ROCKET doesn’t make sense for your data… I am thinking along the lines of the length of your time series, versus kernel length and dilation in particular.
Sorry, I now realise that none of what I have said is likely to help you immediately. Perhaps we can keep working through the problem, particularly if I have the chance to look at the data.
Thank you so much for the really thoughtful analysis and ideas. I’m going to be away for the weekend but I’ll return on Monday and give this the reply it deserves (with code!). Cheers.
Thank you for the detailed reply. I was also working with @MadeUpMasters trying to adapt ROCKET to use with audio, but I took a different approach. Instead of trying to fit the raw audio where we have computation problems with the high number of elements, I used the same spectrograms that we use with CNN models and treated them as a multichannel time series, where the channels are the frequency components. Compared to the CNN model, it got better results in less training time on a 72-class voice recognition dataset. The CNN took 2 minutes and 30s to reach 95% accuracy, whereas ROCKET with 1000 kernels reached 96.4% in 30 seconds.
The code for the ROCKET experiment is here. It’s inspired by @oguiza’s notebook, and it contains all the steps necessary to reproduce the result, including installing the audio library and downloading the data.
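To illustrate the ‘spectrogram as multichannel time series’ idea without any audio library, here is a rough NumPy-only sketch. It computes a plain log-magnitude STFT rather than whatever spectrogram settings the linked notebook uses; the function name, frame length, and hop size are assumptions for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def spectrogram_as_channels(signal, frame_length=512, hop=256):
    """Return an array of shape (n_frequency_channels, n_time_steps)."""
    signal = np.asarray(signal, dtype=np.float64)
    # Overlapping frames, one every `hop` samples: (n_frames, frame_length).
    frames = sliding_window_view(signal, frame_length)[::hop]
    windowed = frames * np.hanning(frame_length)
    magnitudes = np.abs(np.fft.rfft(windowed, axis=1))
    # Transpose so each frequency bin becomes one channel of a much shorter series.
    return np.log1p(magnitudes).T

sample_rate = 16_000
t = np.arange(2 * sample_rate) / sample_rate      # 2 s of audio, 32000 samples
tone = np.sin(2 * np.pi * 1000.0 * t)             # 1 kHz test tone
channels = spectrogram_as_channels(tone)
loudest_channel = int(np.argmax(channels.mean(axis=1)))
```

The 32000-element raw signal becomes a series with a few hundred channels and on the order of a hundred time steps, which is why the transform is so much cheaper in this representation.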
As soon as I can, I will let you know what I find.
Correct, here’s an example spectrogram:
Yes we mainly use resnets/densenets and get very good results. As a result we haven’t played around with audio-specific architectures much, but we are getting to that point now.
None, I have no familiarity with time-series but @scart97 messaged me your work and said “hey maybe we can apply this to audio”. This is my first attempt at applying convs to raw audio for classification. After reading your paper, I think every other time series method is going to take too long.
I’ll detail this more in the code/summary I post, but audio preprocessing helped quite a bit. My initial result was 85.4% accuracy in 4 min 1s, using 10k kernels and stride 7 (it took a stride of 7 just to make the time reasonable). Removing silence and taking a larger time chunk were what improved accuracy from there.
We are! That’s why the initial results were so exciting/surprising. We are currently building an audio library for fastai v2 and have been looking for a way to allow people to train on raw audio. I’ll continue to work on raw audio while @scart97 tests on spectrograms. I’m fairly new to audio and never learned what wavelet transforms are (I’ve seen the name several times), but I’ll try to dig into it and see if that could be something worth using.
Currently the ridge classifier, but also something I plan to experiment with (today).
Yes we are making sure to do this. Thank you.
Not true at all, I found this extremely helpful. You’ve given me a good list of leads to chase down. I’m going to do a bit more experimentation and then post some cleaned up, hopefully reproducible notebooks. Thanks again.
Also if anyone feels we are cluttering the time-series thread and getting too far outside the scope with the audio discussion, I’m happy to move it to the Deep Learning With Audio Thread
Edit: Back with a summary of results and a (still somewhat messy) notebook.
Unfortunately this relies on our fastai v2 audio for some of the preprocessing, so it would probably be easier to start from a new notebook than to pull this and try to work out of my fastai_dev fork. I have some more interesting stuff to add for harder datasets but it still needs to be organized. I’ll be back tomorrow or Wednesday.
Hi! Do you have experience with imbalanced datasets? Is there any dataset on UCR which causes troubles due to this?
I was curious whether my model was using the 20k features generated by ROCKET fairly evenly, or relying on just a few “good” features. I made a notebook to explore this question. TL;DR: I found you can use a small subset of features, sometimes as few as 100, with small losses in classification accuracy.
I might be able to provide some context here.
In some sense what would be ideal with feature selection is for the number of features to go down, but for accuracy to go up (or, at least, not go down). However, this is tricky.
My expectation (although I would be happy to be proven wrong), is that straightforward feature selection is unlikely to work well. (This is despite the fact that, as you have seen, the distribution of weights in the learned classifier looks promising.)
This is covered to some extent in section 4.3.1 of the paper. If you look at figure 5, basically what it shows is that as the number of kernels goes up, accuracy also goes up (and vice versa). The difference stops being statistically significant after about 10K kernels. However, the difference between, say, 100 kernels and 1K kernels is statistically significant, not because 1K kernels produce radically higher accuracy than 100 kernels, but rather because 1K kernels produce consistently higher accuracy. This could be a very small increase in accuracy over a majority of datasets, and indeed this is basically what you see with any increase in the number of kernels: a small but consistent increase in accuracy. The other aspect of this, as you have seen, is that variance increases as the number of kernels goes down (any given set of 100 kernels is likely less similar to any other given set of 100 kernels, as compared to the difference between two sets of 10K kernels, or two sets of 100K kernels).
So why do we use 10K kernels (producing 20K features)? Because 10K is consistently more accurate than < 10K kernels, because 10K kernels produce relatively low variance, because more kernels don’t make that much more difference in terms of either accuracy or variance (diminishing returns; more formally, this is more or less the point where the difference is no longer statistically significant), because 10K is a round number, and because the classifier (the ridge regression classifier or logistic / softmax regression) can handle 20K features easily (and, for the ridge regression classifier, even with a small number of training examples).
Nonetheless, if compute time is critical, you can use fewer kernels. What you can’t see in figure 5 is that, even with 100 kernels, ROCKET ranks somewhere in the middle of the ‘second pack’ of classifiers (roughly similar performance to ProximityForest). And, with 100 kernels, you should be able to train and test the whole UCR archive in about 2 minutes, and you can see from our scalability experiments that, for a small hit in accuracy, you can learn from > 1 million time series in about 1 minute.
Loosely speaking, for problems where ROCKET works well, even, e.g., 100 kernels should produce pretty good accuracy. 1K kernels will produce consistently higher accuracy (but the actual increase in accuracy is likely to be relatively small). Same for 10K over 1K, etc. (Obviously, you start running into other problems as the number of kernels keeps increasing.)
The bottom line is that you are probably going to get similar results simply by using fewer kernels in the first place, rather than generating more kernels then doing feature selection.
There is also a bigger picture. Feature selection takes time. Not necessarily much time, but it depends what you are doing. Convolutional kernels are proven feature detectors, but the potential ‘space’ of all kernels (even just in terms of weights, let alone in terms of arrangement or architecture) is very large. The typical way of wading through this space is by learning the kernel weights, and possibly venturing some kind of architecture search, or using a proven architecture such as ResNet or InceptionTime.
But there is another approach, i.e., the approach taken by ROCKET, which is—speaking fairly loosely—to simply generate lots of kernels which, in combination, provide good coverage of the space of all kernels (or all useful kernels).
Feature selection fits somewhere on the continuum between ‘completely random’ and fully learned kernels in an established architecture, or an architecture found through some kind of architecture search. At some point, it will almost certainly be more beneficial to simply spend time learning the kernels and performing some kind of architecture search, rather than hunting through randomly generated kernels. (Note also the possible ‘correlation effect’ observed by @MadeUpMasters in his summary, above, which might work against feature selection for random kernels.)
Having said all that, scikit-learn (just as an example) has a number of feature selection methods which can be used directly with ROCKET in one or more ways, and may prove useful at least for some problems.
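As one example of what that could look like, here is a sketch that ranks features by the magnitude of the ridge classifier's learned weights, keeps the top k, and refits. The random data stands in for the output of the ROCKET transform, and the sizes (500 features, k = 50) are arbitrary; this is just one simple selection scheme among many, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(0)
n_train, n_test, n_features, k = 200, 100, 500, 50

# Stand-in for transformed features; only the first 5 columns carry signal.
X = rng.normal(size=(n_train + n_test, n_features))
y = (X[:, :5].sum(axis=1) > 0).astype(int)
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

alphas = np.logspace(-3, 3, 10)
full = RidgeClassifierCV(alphas=alphas).fit(X_train, y_train)

# Rank features by total absolute weight across classes (training data only).
importance = np.abs(np.atleast_2d(full.coef_)).sum(axis=0)
top_k = np.argsort(importance)[-k:]

reduced = RidgeClassifierCV(alphas=alphas).fit(X_train[:, top_k], y_train)
acc_full = full.score(X_test, y_test)
acc_reduced = reduced.score(X_test[:, top_k], y_test)
```

Comparing `acc_reduced` against a model trained on k randomly generated kernels from the start is the fair baseline for the point made above: selection is only worth the extra time if it beats simply using fewer kernels.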
Class imbalance in the UCR archive is a bit all over the place: many datasets have some degree of class imbalance, many do not (you can check by doing something like, for each dataset, _, counts = np.unique(Y, return_counts = True); counts.max() / counts.sum(), or some variation thereof).
It is also a bit hard to tell what effect imbalance is having; that is, it is difficult to separate out the effect of imbalance from whatever else might make a given dataset ‘difficult’ (or ‘easy’). Obviously, in an extreme case, imbalance will make high accuracy trivial (there are, of course, other measures apart from accuracy).
From an extremely cursory look, it doesn’t look like there is a strong connection between imbalance in the datasets in the UCR archive and the performance of most classifiers (but really I have only just glanced at this, I might be wrong).
Then there are some ‘typical’ issues with imbalance more generally: most classifiers (including RidgeClassifierCV) will have some more or less direct way of dealing with imbalance (under / over sampling, weighting, etc.), but this can work against you if the imbalance is reflective of the generating distribution. It is a while since I looked at this, and not very formally, but from memory I don’t think trying to counter imbalance made much difference for datasets in the UCR archive. (Others might have a different experience…)
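For reference, a small sketch of both steps: measuring imbalance as the share of the majority class, and asking RidgeClassifierCV to reweight classes via its `class_weight` parameter. The synthetic data and the helper name `majority_class_share` are made up for this example, and whether reweighting helps will depend on the dataset.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

def majority_class_share(y):
    """Fraction of examples belonging to the most common class."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))        # stand-in for transformed features
y = (X[:, 0] > 1.0).astype(int)       # roughly one example in six is positive
share = majority_class_share(y)

alphas = np.logspace(-3, 3, 10)
plain = RidgeClassifierCV(alphas=alphas).fit(X, y)
# "balanced" weights each class inversely to its frequency in y.
weighted = RidgeClassifierCV(alphas=alphas, class_weight="balanced").fit(X, y)

minority = y == 1
recall_plain = np.mean(plain.predict(X[minority]) == 1)
recall_weighted = np.mean(weighted.predict(X[minority]) == 1)
```

A majority share close to 1 / (number of classes) means the dataset is roughly balanced; values close to 1 mean that accuracy alone says very little.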
Thanks very much for this. I will look at this properly as soon as I can and get back to you.