Thank you so much @oguiza!!!
Is there any reference paper that you are using for defining all those transformations?
Thanks for your question @vrodriguezf!
I have now included the paper references in the notebook.
Most of the ideas come from this paper:
Data augmentation of wearable sensor data for Parkinson's disease monitoring using convolutional neural networks. arXiv preprint arXiv:1706.00527
However, I had to adapt the code in most cases.
For now they all work on CPU.
Thanks for the detailed response!
I found it really hard to believe that [make 100 features] could perform as well as [make 10k features, choose the best 100], but when I tested it, you were right: They performed about the same. How do you interpret this result?
One question I have on this (may be dumb, I've only been listening in) is: what does the cardinality or distribution of those 100 look like vs the 9,900 left out? Were they mostly binary or not very verbose?
I'm not sure I understand what cardinality/binary/verbose mean in this context, could you explain what you mean?
I am super interested in what the selected kernels look like, especially whether they'll recognizably pick up on classic time series features like seasonality. (I think this is sort of what we're hoping for when applying convolutions to TS.) High on my list of things to look at.
Thanks a lot to you and @angusde, and sorry for not replying before, but something came up in the meantime. I'm retraining a good ol' LSTM baseline, then I will try ROCKET. However, it sounds like you already tried it for forecasting and it didn't work out well, right?
No, I haven't managed to get good results using RNNs for forecasting. Interestingly, the M4 Competition (which seems to be the biggest and most prestigious forecasting competition held to date) was won by a hybrid exponential smoothing + RNN model. This might be a good place to start looking for an effective RNN-based forecasting model.
My impression from reading over the results was that the competition was dominated by expert practitioners. The winner, Slawek Smyl from Uber, looked to do some very careful and clever engineering. Likewise, the LSTM implementation in fastai has a lot of clever tweaks to get it to work well. All of which to say, I suspect the devil is in the details.
I have not tried ROCKET for forecasting yet. If you do, please let me know how it goes!
Recent paper here, using attention network for multivariate forecasting:
DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting
Code here:
Thank you for this! What if I want to learn from multiple time series, but forecast only on one of them? Can I use this architecture for that too?
Good question. I'm not sure. The paper is pretty brief and I have yet to try the repository code myself, but I was hoping to when I have some time.
Thank you very much for your patience @MadeUpMasters and @scart97, I am starting to look at this now, sorry it has taken so long, it's probably going to take me a little while to go through everything.
Out of interest, what is the typical resolution (if there is a typical resolution), as in width and height in pixels, of the spectrograms you are using with CNNs? And, what is the kind of processing time needed to produce a spectrogram from raw audio input?
I've started by looking at the 10-speaker dataset. This is interesting (coming from time series classification): the signals / time series are relatively long, with variable lengths, and the raw 'sampling rate' is (or seems) high (in the sense that there is a lot of data per unit of time). At the very least, I think this means that it would be useful to be able to handle variable-length input properly (another thing that is on my to-do list). Obviously, computation time is also critical (more below).
My initial approach has been simply to downsample the input by 100x, 50x, and 25x. My 'pipeline' is input -> downsample -> normalise -> ROCKET
(apart from downsampling and normalisation, I'm not doing anything, and I'm using vanilla ROCKET). Accuracy seems to increase as the sampling rate increases (toward the raw sampling rate); at 25x, accuracy is ~90% for 100 kernels, ~95% for 1K kernels.
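That pipeline can be sketched in a few lines (a minimal illustration, assuming the input is a 2-D array of equal-length series; the downsampling here is naive decimation, with no anti-aliasing filter):

```python
import numpy as np

def downsample(X, factor):
    """Naive downsampling: keep every `factor`-th sample.
    X has shape (n_samples, series_length)."""
    return X[:, ::factor]

def normalise(X):
    """Per-series standardisation to zero mean, unit variance."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True) + 1e-8  # avoid division by zero
    return (X - mean) / std

# Example: 10 clips of 16000 samples each, downsampled 25x
X = np.random.randn(10, 16000)
X_small = normalise(downsample(X, 25))
print(X_small.shape)  # (10, 640)
```

The output of this step would then be fed to the ROCKET transform as usual.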
Ok, so far, these are fairly inane observations (and accuracy could of course be higher). The point is, this suggests that, at least for this dataset, ROCKET is picking out useful features from the raw input. (The other thing it suggests is that, for this dataset, the bulk of useful features exist at a frequency way below the raw sampling rate.)
However, I hesitate to say what those features are because, at this point, I just don't know. Your work suggests that it is effective to 'force feed' a frequency breakdown to CNNs via spectrograms (or to ROCKET, treating the spectrograms as multivariate input). Maybe (but only maybe) ROCKET is able to pull useful frequency-domain features from audio data.
In principle, even without dilation, but certainly with dilation, convolutional kernels (even random convolutional kernels) are frequency selective. However, in any case, ROCKET isn't going to work for everything (and who knows, even for audio the relevant features may not necessarily be strictly in the frequency domain anyway).
However, this is just one dataset. I've just started looking at the 250-speaker dataset you mentioned, and I'll try and see what is going on (or not) with this.
Downsampling obviously lightens the computational burden, but it's not a great solution. Increasing stride (as you have done) is probably a more sensible approach. However, to the extent that relevant features do exist at the highest frequencies, I guess that the only solutions (for ROCKET) are: (a) raw audio + parallel CPU or GPU; or (b) some kind of preprocessing that filters the input into different frequencies first (e.g., spectrograms).
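To make the stride point concrete: a strided convolution evaluates the kernel at every stride-th position only, so the number of dot products (and hence compute) drops roughly by the stride factor. A naive sketch (for illustration only, not the actual ROCKET implementation):

```python
import numpy as np

def conv1d(x, w, stride=1):
    """Naive 1-D valid convolution with stride. The number of output
    positions is (len(x) - len(w)) // stride + 1, so larger strides
    mean proportionally fewer dot products."""
    n_out = (len(x) - len(w)) // stride + 1
    return np.array([x[i * stride : i * stride + len(w)] @ w
                     for i in range(n_out)])

x = np.random.randn(16000)   # e.g. 1 s of 16 kHz audio
w = np.random.randn(9)       # a length-9 kernel
print(len(conv1d(x, w)))             # 15992
print(len(conv1d(x, w, stride=4)))   # 3998 (~4x fewer positions)
```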
Sorry for the slow response.
It's difficult to say. It may be that a relatively small number (more specifically, < 100) of kernels are responsible for a lot of the performance, and that with fairly high probability enough of these kinds of kernels will appear in most randomly-generated sets of 100 (and would also be in the 100 most highly weighted kernels in any larger set). Maybe these more effective kernels (and maybe not just these) get duplicated a lot as the number of kernels increases. It may be that the 'rest' of the kernels are only useful in (very large) aggregate.
However, I'm really just speculating. One possibility would be to select (or, really, generate in the first place) kernels which are as uncorrelated as possible. However, this might do nothing or end up being counterproductive.
Really, I haven't had the opportunity to sit down with this problem and think it through properly yet. Basically, so far, I've had the same experience as you: try all sorts of typical feature selection methods, but with little to show for it.
Does anyone know of a time series repository of state-of-the-art NNs - pretrained or just code? It doesn't even have to be state of the art, as long as there is clean code that can be used out of the box for general time series purposes (forecasting, classification, etc.). Kind of like HuggingFace for NLP? https://github.com/huggingface
For time series classification, the most straightforward is, in my opinion, https://github.com/timeseriesAI/timeseriesAI from @oguiza. Just follow the notebooks and you are done.
I don't know of anything similar for forecasting, though.
Hi @fuelnow
In addition to @vrodriguezfās answer, you can have a look at dl-4-tsc for a simple code to run for time series classification.
For a tutorial you can check out this google colab notebook.
Finally, if you need pre-trained models, you can have a look at this page.
For other areas such as forecasting, I am not familiar with similar repositories.
Hope this helps.
Cheers,
Just wanted to let you know I have updated the TS data augmentation notebook 6 with some new data augmentation functions and RandAugment.
Also, all TS tfms can be applied either to a single TS (as any regular tfm) or to a batch (applying it as a train_dl tfm), which makes them much faster!
RandAugment is a new technique developed by Google that simplifies/eliminates the need to search for the best data augmentations for a given dataset. It basically applies between 1 and 3 randomly selected tfms to each batch, and it has achieved SOTA on ImageNet (+1% compared to the previous best).
I have created some code to make it very easy to use. All you need to do to apply the TS tfms and randaugment (recommended approach) is:
learn = Learner(data, model, metrics=accuracy).randaugment()
I have used it in a few cases, and the results are pretty good.
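For anyone curious what this style of batch augmentation boils down to, here is a minimal, hypothetical sketch: a handful of batch-level transforms, of which 1-3 are chosen at random and applied each time. The transform functions and their magnitudes are made up for illustration; they are not the actual notebook tfms.

```python
import random
import numpy as np

# Hypothetical batch transforms; each takes and returns an array
# of shape (batch, channels, length).
def jitter(x):     return x + np.random.normal(0, .03, x.shape)
def scale(x):      return x * np.random.normal(1, .1, (x.shape[0], 1, 1))
def time_flip(x):  return x[..., ::-1]

TFMS = [jitter, scale, time_flip]

def randaugment_batch(x, n_min=1, n_max=3):
    """Apply between n_min and n_max randomly chosen transforms
    to the whole batch (RandAugment-style selection)."""
    for tfm in random.sample(TFMS, k=random.randint(n_min, n_max)):
        x = tfm(x)
    return x

xb = np.random.randn(8, 1, 100)   # a batch of 8 univariate series
out = randaugment_batch(xb)
print(out.shape)  # (8, 1, 100)
```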
Thanks for your reply and interest.
Generally the height is 128, and the width depends on a few factors (hop_length and the duration of the audio). It takes ~4 ms to generate a 128x128 spectrogram, and this scales linearly with increased duration.
We handle this by random cropping a specific duration (e.g. 2 seconds) of the signal or spectrogram. Clips that are less than the given duration are padded (we support a number of padding options, default is zero/silence padding both sides of the signal or spectrogram).
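That crop-or-pad logic can be sketched roughly like this (a simplified, hypothetical version with zero/silence padding split across both sides; not the actual library code):

```python
import numpy as np

def crop_or_pad(sig, target_len):
    """Random-crop a 1-D signal to target_len, or zero-pad
    (silence) on both sides if it is shorter than target_len."""
    n = len(sig)
    if n >= target_len:
        start = np.random.randint(0, n - target_len + 1)
        return sig[start:start + target_len]
    left = (target_len - n) // 2
    right = target_len - n - left
    return np.pad(sig, (left, right))

# e.g. crop/pad everything to 2 s at 16 kHz -> 32000 samples
print(crop_or_pad(np.ones(50000), 32000).shape)  # (32000,)
print(crop_or_pad(np.ones(10000), 32000).shape)  # (32000,)
```

The same idea applies along the time axis of a spectrogram instead of the raw signal.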
I'm fairly confident that this is what's happening. It's not unreasonable, as there are fully convolutional networks for speech recognition (instead of extracting spectrograms, raw audio is fed to a network that learns the most useful frequency ranges/filterbanks). Thus it's not surprising that a random conv kernel is pulling out some type of frequency info from raw audio, although each kernel has only very slight predictive power.
I've mostly moved on to work on other stuff, letting ROCKET for raw audio sit in the back of my mind, but I think there is potential and I plan to come back to it at some point. I also think what @scart97 is doing, applying ROCKET to the extractions themselves, has a lot of promise. We often have to limit either the duration or resolution of our spectrograms because they quickly become too large/slow for computer vision models to process, but ROCKET will be extremely fast even on very large spectrograms, and through that there may be a path to beating computer vision applied to spectrograms.
Thanks for following up. I'm sorry I couldn't be of more help here, and thanks for the additional info.
It looks like there may be limitations to ROCKET with raw audio. I've now had the chance to look at the 250-speaker dataset and, yes, basically, ROCKET doesn't seem to work on the raw audio here (but for reasons unknown at this stage...). For the moment, unless spectrogram + ROCKET is an improvement, there's nothing obvious to me that is going to improve performance on this dataset without changing the internal configuration of ROCKET in some way.
Re variable length, I meant ROCKET in particular... if you are using some kind of global pooling, then input length basically doesn't matter, you don't even need to pad; it's just that the current implementation is a bit clumsy in terms of storing and handling variable-length time series. What you're doing for the spectrograms makes total sense.
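To make the global-pooling point concrete: ROCKET summarises each kernel's convolution output with two global statistics, the maximum and the proportion of positive values (PPV), so series of any length map to the same number of features per kernel. A minimal sketch:

```python
import numpy as np

def global_features(conv_out):
    """ROCKET-style global pooling over one convolution output:
    the max value and the proportion of positive values (PPV).
    Both are global statistics, so convolution outputs of different
    lengths yield the same two features -- no padding needed."""
    return conv_out.max(), (conv_out > 0).mean()

short = np.random.randn(100)    # length-100 convolution output
long_ = np.random.randn(5000)   # length-5000 convolution output
print(len(global_features(short)) == len(global_features(long_)))  # True
```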
I think the spectrogram + CNN (or ROCKET, if it works...) approach makes a lot of sense in terms of both providing a frequency breakdown, and to a greater or lesser extent disconnecting the workload of the CNN (or ROCKET) from the dimensionality of the raw input.
I don't expect or want ROCKET to 'beat' other models on every task (or even most tasks...). However, if spectrogram + ROCKET makes ROCKET work where it otherwise doesn't work well, particularly for audio, that is very useful information from a practical point of view. More generally, it's really equally important (and interesting) to try and work out where ROCKET (and other models) don't work well. For convolutional architectures in particular, I think it's a big question for 1-dimensional input (incl., obviously, multivariate): what features are actually being picked up... to what extent do these belong neatly to the time or frequency domains, etc.
Anyway, if I have any better ideas about ROCKET + audio I'll pass them on. In the meantime, of course, don't hesitate to ask if you have any further questions, etc.
We'll be working to try and build on ROCKET, and work through some of the more mysterious aspects (that are also relevant to CNNs more generally). This may or may not turn out to make a difference for audio. At the very least, I will be working through the implementation and I'm sure there are substantial improvements to be made there (there are quite a few things that can be done better, but I haven't had the chance to deal with them yet)...
I'm arriving late to the party, but trying to catch up with ROCKET. Thanks so much for your demo notebooks and the PyTorch implementation.
A few questions and clarifications, if you would be so kind as to address them...
Does 'centering' mean making the mean of the kernel weights zero?
The data.show_batch() graphs in notebook '05_ROCKET_a_new_SOTA_classifier' are a convenient way to display the 20000 features for each time series. But they are not in themselves related to the temporal sequence in any way?
Is the 'normalize per feature' step in the fastai section your invention, not part of the ROCKET paper? I did experiment with omitting it for both max and percent-positive features, and got worse results.
It seems like one could make a custom ROCKET Module that calls F.conv1d(...) directly to reduce GPU overhead a bit.
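For what it's worth, a hypothetical sketch of such a Module (fixed random weights with centering, and max/PPV pooling, via F.conv1d; the real ROCKET also randomises kernel lengths, dilations, and paddings, which is omitted here for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RocketFeatures(nn.Module):
    """Sketch: a bank of fixed random kernels applied with a single
    F.conv1d call, returning [max, PPV] per kernel."""
    def __init__(self, c_in=1, n_kernels=100, ks=9):
        super().__init__()
        w = torch.randn(n_kernels, c_in, ks)
        w = w - w.mean(dim=-1, keepdim=True)   # 'centering': zero-mean weights
        self.weight = nn.Parameter(w, requires_grad=False)
        self.bias = nn.Parameter(torch.empty(n_kernels).uniform_(-1, 1),
                                 requires_grad=False)

    def forward(self, x):
        # x: (batch, c_in, length)
        out = F.conv1d(x, self.weight, self.bias,
                       padding=self.weight.shape[-1] // 2)
        mx = out.max(dim=-1).values            # (batch, n_kernels)
        ppv = (out > 0).float().mean(dim=-1)   # (batch, n_kernels)
        return torch.cat([mx, ppv], dim=1)     # (batch, 2 * n_kernels)

x = torch.randn(8, 1, 200)
feats = RocketFeatures()(x)
print(feats.shape)  # torch.Size([8, 200])
```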
Thanks for helping me get oriented to this new approach.
Malcolm
Just on the per-feature normalisation... I don't think it is in the paper explicitly, but it is in our implementation + experiments. It's not really related to the transform, but rather to whatever classifier you use with the transform (and, as such, it is kind of open ended). For example, the kind of normalisation appropriate for a ridge regression classifier is different to the kind of normalisation appropriate for logistic / softmax regression (and per-feature normalisation may be completely irrelevant for some classifiers). In other words, in performing per-feature normalisation, you are basically doing whatever kind of normalisation / standardisation you would usually do when using whatever classifier you are using.
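In practice this usually amounts to standard train-set standardisation of the transformed features, something like the following generic sketch (not the exact implementation; the array sizes are made up, and the feature count is reduced from 20k for brevity):

```python
import numpy as np

def fit_norm(F_train):
    """Compute per-feature mean/std on the training features only;
    the same statistics are then reused for the test features."""
    mean = F_train.mean(axis=0)
    std = F_train.std(axis=0) + 1e-8  # avoid division by zero
    return mean, std

F_train = np.random.randn(100, 2000) * 5 + 3   # e.g. ROCKET features
F_test  = np.random.randn(30, 2000) * 5 + 3
mean, std = fit_norm(F_train)
F_train_n = (F_train - mean) / std
F_test_n  = (F_test - mean) / std
print(F_train_n.shape)  # (100, 2000)
```

The normalised features would then go to the ridge (or logistic/softmax) classifier as usual.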