Time series/ sequential data study group

Thank you so much @oguiza!!!

Is there any reference paper that you are using for defining all those transformations?

Thanks for your question @vrodriguezf!
I have now included the paper references in the notebook.
Most of the ideas come from this paper:
Data augmentation of wearable sensor data for Parkinson's disease monitoring using convolutional neural networks. arXiv preprint arXiv:1706.00527
However, I had to adapt the code in most cases.
For now, they all run on the CPU.
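For anyone curious what these transformations look like, here is a minimal sketch of two of the augmentations from that paper (jittering and scaling), assuming a `(channels, length)` tensor; the function names and default sigmas are mine, not from the notebook:

```python
import torch

def jitter(x: torch.Tensor, sigma: float = 0.03) -> torch.Tensor:
    # Add Gaussian noise to a (channels, length) time series
    # (one of the augmentations in Um et al., arXiv:1706.00527).
    return x + torch.randn_like(x) * sigma

def scale(x: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # Multiply each channel by a random factor drawn around 1.
    return x * (1.0 + torch.randn(x.shape[0], 1) * sigma)
```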

1 Like

Thanks for the detailed response!

I found it really hard to believe that [make 100 features] could perform as well as [make 10k features, choose the best 100], but when I tested it, you were right: They performed about the same. How do you interpret this result?

One question I have on this (may be a dumb one, I've only been listening in): what is the cardinality or distribution of those 100 features vs the 9,900 left out? Were they mostly binary, or not very verbose?

I'm not sure I understand what cardinality/binary/verbose mean in this context, could you explain what you mean?

I am super interested in what the selected kernels look like, especially whether they'll recognizably pick up on classic time series features like seasonality. (I think this is sort of what we're hoping for when applying convolutions to TS.) High on my list of things to look at.

2 Likes

Thanks a lot to you and @angusde, and sorry for not replying before, but something came up in the meantime. I'm retraining a good ol' LSTM baseline, then I will try ROCKET. However, it sounds like you already tried it for forecasting and it didn't work out well, right?

No, I haven't managed to get good results using RNNs for forecasting. Interestingly, the M4 Competition (which seems to be the biggest and most prestigious forecasting competition held to date) was won by a hybrid exponential smoothing + RNN model. This might be a good place to start looking for an effective RNN-based forecasting model.

My impression from reading over the results was that the competition was dominated by expert practitioners. The winner, Slawek Smyl from Uber, appears to have done some very careful and clever engineering. Likewise, the LSTM implementation in fastai has a lot of clever tweaks to get it to work well. All of which is to say, I suspect the devil is in the details.

I have not tried ROCKET for forecasting yet. If you do, please let me know how it goes!

1 Like

Recent paper here, using an attention network for multivariate forecasting:

DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting

Code here:

3 Likes

Thank you for this! What if I want to learn from multiple time series, but forecast only on one of them? Can I use this architecture for that too?

Good question. I'm not sure. The paper is pretty brief and I have yet to try the repository code myself, but I was hoping to when I have some time.

Thank you very much for your patience @MadeUpMasters and @scart97. I am starting to look at this now; sorry it has taken so long, and it's probably going to take me a little while to go through everything.

Out of interest, what is the typical resolution (if there is a typical resolution), as in width and height in pixels, of the spectrograms you are using with CNNs? And what kind of processing time is needed to produce a spectrogram from raw audio input?

I've started by looking at the 10-speaker dataset. This is interesting (coming from time series classification): the signals / time series are relatively long, with variable lengths, and the raw 'sampling rate' is (or seems) high, in the sense that there is a lot of data per unit of time. At the very least, I think this means that it would be useful to be able to handle variable-length input properly (another thing that is on my to-do list). Obviously, computation time is also critical (more below).

My initial approach has been simply to downsample the input by 100x, 50x, and 25x. My 'pipeline' is input -> downsample -> normalise -> ROCKET (apart from downsampling and normalisation, I'm not doing anything, and I'm using vanilla ROCKET). Accuracy seems to increase as the sampling rate increases (toward the raw sampling rate); at 25x, accuracy is ~90% for 100 kernels and ~95% for 1K kernels.
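For anyone who wants to reproduce this kind of experiment, here is a rough sketch of that pipeline, assuming the `generate_kernels` / `apply_kernels` functions from the ROCKET reference implementation (`rocket_functions.py`) and `(n_samples, length)` float arrays; the downsampling factor and helper names are mine:

```python
import numpy as np
from rocket_functions import generate_kernels, apply_kernels  # ROCKET reference code

def normalise(X: np.ndarray) -> np.ndarray:
    # Zero mean, unit variance per series.
    return (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-8)

def transform(X_train, X_test, factor=25, num_kernels=1_000):
    # Downsample by keeping every `factor`-th sample, then normalise.
    X_train, X_test = normalise(X_train[:, ::factor]), normalise(X_test[:, ::factor])
    kernels = generate_kernels(X_train.shape[1], num_kernels)
    return apply_kernels(X_train, kernels), apply_kernels(X_test, kernels)
```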

Ok, so far, these are fairly inane observations (and accuracy could of course be higher). The point is, this suggests that, at least for this dataset, ROCKET is picking out useful features from the raw input. (The other thing it suggests is that, for this dataset, the bulk of useful features exist at a frequency way below the raw sampling rate.)

However, I hesitate to say what those features are because, at this point, I just don't know. Your work suggests that it is effective to 'force feed' a frequency breakdown to CNNs via spectrograms (or to ROCKET, treating the spectrograms as multivariate input). Maybe (but only maybe) ROCKET is able to pull useful frequency-domain features from audio data.

In principle, even without dilation, but certainly with dilation, convolutional kernels (even random convolutional kernels) are frequency selective. However, in any case, ROCKET isn't going to work for everything (and who knows, even for audio the relevant features may not necessarily be strictly in the frequency domain anyway).

However, this is just one dataset. I've just started looking at the 250-speaker dataset you mentioned, and I'll try and see what is going on (or not) with this.

Downsampling obviously lightens the computational burden, but it's not a great solution. Increasing the stride (as you have done) is probably a more sensible approach. However, to the extent that relevant features do exist at the highest frequencies, I guess that the only solutions (for ROCKET) are: (a) raw audio + parallel CPU or GPU; or (b) some kind of preprocessing that filters the input into different frequencies first (e.g., spectrograms).

Sorry for the slow response.

It's difficult to say. It may be that a relatively small number (more specifically, < 100) of kernels are responsible for a lot of the performance, and that with fairly high probability enough of these kinds of kernels will appear in most randomly-generated sets of 100 (and would also be among the 100 most highly weighted kernels in any larger set). Maybe these more effective kernels (and maybe not just these) get duplicated a lot as the number of kernels increases. It may be that the 'rest' of the kernels are only useful in (very large) aggregate.

However, I'm really just speculating. One possibility would be to select (or, really, generate in the first place) kernels which are as uncorrelated as possible. However, this might do nothing or end up being counterproductive.
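One cheap way to probe that idea (purely a sketch of the speculation above, not something anyone here has tested; all names are mine) would be a greedy correlation filter over the transformed features:

```python
import numpy as np

def select_uncorrelated(features: np.ndarray, k: int = 100, max_corr: float = 0.9):
    # features: (n_samples, n_features) ROCKET feature matrix.
    # Keep a feature only if its absolute correlation with every
    # already-kept feature stays below max_corr; stop after k features.
    # (Constant features produce NaN correlations and are skipped.)
    corr = np.abs(np.corrcoef(features, rowvar=False))
    selected = []
    for j in range(corr.shape[0]):
        if np.isnan(corr[j, j]):
            continue
        if all(corr[j, i] < max_corr for i in selected):
            selected.append(j)
            if len(selected) == k:
                break
    return selected
```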

Really, I haven't had the opportunity to sit down with this problem and think it through properly yet. Basically, so far, I've had the same experience as you: try all sorts of typical feature selection methods, but with little to show for it.

Does anyone know of a time series repository of state-of-the-art NNs, pretrained or just code? It doesn't even have to be state of the art, as long as there is clean code that can be used out of the box for general time series purposes (forecasting, classification, etc.). Kind of like HuggingFace for NLP? https://github.com/huggingface

For time series classification, the most straightforward is, in my opinion, https://github.com/timeseriesAI/timeseriesAI from @oguiza. Just follow the notebooks and you are done.

In terms of forecasting I do not know any similar thing though.

1 Like

Hi @fuelnow

In addition to @vrodriguezf's answer, you can have a look at dl-4-tsc for simple code to run for time series classification.

For a tutorial, you can check out this Google Colab notebook.

Finally, if you need pre-trained models, you can have a look at this page.

For other areas such as forecasting, I am not familiar with similar repositories.

Hope this helps.

Cheers,

3 Likes

Just wanted to let you know I have updated the TS data augmentation notebook with some new data augmentation functions and RandAugment.
Also, all TS tfms can be applied either to a single TS (like any regular tfm) or to a batch (applying them as a train_dl tfm), which makes them much faster!
RandAugment is a new technique developed by Google that simplifies/eliminates the need to search for the best data augmentations for a given dataset. It basically applies 1-3 randomly selected tfms to each batch, and it has achieved SOTA on ImageNet (+1% compared to the previous best).
I have created some code to make it very easy to use. All you need to do to apply the TS tfms and RandAugment (the recommended approach) is:

learn = Learner(data, model, metrics=accuracy).randaugment()

I have used it in a few cases, and the results are pretty good.
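For anyone curious what this does under the hood, the core RandAugment idea is roughly the following (a minimal sketch, not @oguiza's actual implementation; the function name and transform list are placeholders):

```python
import random
import torch

def rand_augment_batch(xb: torch.Tensor, tfms, n_min: int = 1, n_max: int = 3):
    # xb: batch of shape (bs, channels, length); tfms: list of batch transforms.
    # Apply between n_min and n_max randomly chosen transforms to the batch.
    for tfm in random.sample(tfms, k=random.randint(n_min, n_max)):
        xb = tfm(xb)
    return xb
```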

5 Likes

Thanks for your reply and interest.

Generally the height is 128, and the width depends on a few factors (hop_length and the duration of the audio). It takes ~4 ms to generate a 128x128 spectrogram, and this scales linearly with increased duration.
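For reference, a mel spectrogram along these lines can be produced with librosa roughly as follows (a sketch; the sample rate and hop_length here are illustrative, not necessarily the poster's settings):

```python
import numpy as np
import librosa

def make_spectrogram(y: np.ndarray, sr: int = 16_000,
                     n_mels: int = 128, hop_length: int = 256) -> np.ndarray:
    # Height of the output is n_mels; width grows linearly with
    # the duration of y (roughly len(y) / hop_length frames).
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    return librosa.power_to_db(S, ref=np.max)  # log scale, as usually fed to CNNs
```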

We handle variable lengths by random-cropping a specific duration (e.g. 2 seconds) of the signal or spectrogram. Clips that are shorter than the given duration are padded (we support a number of padding options; the default is zero/silence padding on both sides of the signal or spectrogram).
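A minimal sketch of that crop-or-pad behaviour (the real library supports more padding options; the function name is mine):

```python
import torch
import torch.nn.functional as F

def crop_or_pad(sig: torch.Tensor, target_len: int) -> torch.Tensor:
    # sig: 1D signal of shape (length,).
    n = sig.shape[0]
    if n > target_len:
        # Random crop of target_len samples.
        start = torch.randint(0, n - target_len + 1, (1,)).item()
        return sig[start:start + target_len]
    # Zero/silence padding, split between both sides.
    pad = target_len - n
    return F.pad(sig, (pad // 2, pad - pad // 2))
```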

I'm fairly confident that this is what's happening. It's not unreasonable, as there are fully convolutional networks for speech recognition (instead of extracting spectrograms, raw audio is fed to a network that learns the most useful frequency ranges/filterbanks). Thus it's not surprising that a random conv kernel is pulling out some type of frequency info from raw audio, although each kernel has only very slight predictive power.

I've mostly moved on to work on other stuff, letting ROCKET for raw audio sit in the back of my mind, but I think there is potential and I plan to come back to it at some point. I also think what @scart97 is doing, applying ROCKET to the extractions themselves, has a lot of promise. We often have to limit either the duration or resolution of our spectrograms because they quickly become too large/slow for computer vision models to process, but ROCKET will be extremely fast even on very large spectrograms, and through that there may be a path to beating computer vision applied to spectrograms.

2 Likes

Thanks for following up, I'm sorry I couldn't be of more help here. Thanks for the additional info.

It looks like there may be limitations to ROCKET with raw audio. I've now had the chance to look at the 250-speaker dataset and, yes, basically, ROCKET doesn't seem to work on the raw audio here (but for reasons unknown at this stage...). For the moment, unless spectrogram + ROCKET is an improvement, there's nothing obvious to me that is going to improve performance on this dataset without changing the internal configuration of ROCKET in some way.

Re variable length, I meant ROCKET in particular... if you are using some kind of global pooling, then input length basically doesn't matter, you don't even need to pad; it's just that the current implementation is a bit clumsy in terms of storing and handling variable-length time series. What you're doing for the spectrograms makes total sense.
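To make the global pooling point concrete: ROCKET's two features per kernel (proportion of positive values and max) are global over the convolution output, so series length drops out. A sketch, with names of my choosing:

```python
import torch

def pooled_features(conv_out: torch.Tensor) -> torch.Tensor:
    # conv_out: (n_kernels, out_length) convolution output for ONE series.
    # PPV (proportion of positive values) and max are both global over time,
    # so out_length (and hence input length) can vary freely between series.
    ppv = (conv_out > 0).float().mean(dim=1)
    mx = conv_out.max(dim=1).values
    return torch.stack([ppv, mx], dim=1).flatten()  # (2 * n_kernels,)
```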

I think the spectrogram + CNN (or ROCKET, if it works...) approach makes a lot of sense in terms of both providing a frequency breakdown and, to a greater or lesser extent, disconnecting the workload of the CNN (or ROCKET) from the dimensionality of the raw input.

I don't expect or want ROCKET to 'beat' other models on every task (or even most tasks...). However, if spectrogram + ROCKET makes ROCKET work where it otherwise doesn't work well, particularly for audio, that is very useful information from a practical point of view. More generally, it's really equally important, and interesting, to try and work out where ROCKET (and other models) don't work well. For convolutional architectures in particular, I think it's a big question for 1-dimensional input (incl., obviously, multivariate): what features are actually being picked up... to what extent do these belong neatly to the time or frequency domains, etc.

Anyway, if I have any better ideas about ROCKET + audio I'll pass them on. In the meantime, of course, don't hesitate to ask if you have any further questions, etc.

We'll be working to try and build on ROCKET, and to work through some of the more mysterious aspects (which are also relevant to CNNs more generally). This may or may not turn out to make a difference for audio. At the very least, I will be working through the implementation, and I'm sure there are substantial improvements to be made there (there are quite a few things that can be done better, but I haven't had the chance to deal with them yet)...

1 Like

I'm arriving late to the party, but trying to catch up with ROCKET. Thanks so much for your demo notebooks and the PyTorch implementation.

A few questions and clarifications, if you would be so kind as to address them...

  • ā€œcenteringā€ means to make the mean of the kernel weights zero?

  • The data.show_batch() graphs in notebook "05_ROCKET_a_new_SOTA_classifier" are a convenient way to display the 20,000 features for each time series. But they are not in themselves related to the temporal sequence in any way?

  • The "normalize 'per feature'" step in the fastai section is your invention, not part of the ROCKET paper? I did experiment with omitting it for both max and percent-positive features, and got worse results.

  • It seems like one could make a custom ROCKET Module that calls F.conv1d(...) directly to reduce GPU overhead a bit (see the sketch below).
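Along the lines of that last bullet, a minimal sketch of what such a Module might look like (my own simplification: a single kernel length, no dilation or padding variety, so it is not the full ROCKET kernel distribution):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RocketConv(nn.Module):
    def __init__(self, c_in: int = 1, n_kernels: int = 1_000, ks: int = 9):
        super().__init__()
        w = torch.randn(n_kernels, c_in, ks)
        w -= w.mean(dim=-1, keepdim=True)    # 'centering': zero-mean kernel weights
        self.register_buffer('weight', w)    # fixed random kernels, never trained
        self.register_buffer('bias', torch.empty(n_kernels).uniform_(-1, 1))

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (bs, c_in, length) -> features: (bs, 2 * n_kernels)
        out = F.conv1d(x, self.weight, self.bias, padding=self.weight.shape[-1] // 2)
        ppv = (out > 0).float().mean(dim=-1)  # proportion of positive values
        return torch.cat([ppv, out.max(dim=-1).values], dim=1)
```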

Thanks for helping me get oriented to this new approach.

Malcolm

Just on the per-feature normalisation... I don't think it is in the paper explicitly, but it is in our implementation + experiments. It's not really related to the transform, but rather to whatever classifier you use with the transform (and, as such, it is kind of open-ended). For example, the kind of normalisation appropriate for a ridge regression classifier is different from the kind appropriate for logistic/softmax regression (and per-feature normalisation may be completely irrelevant for some classifiers). In other words, in performing per-feature normalisation, you are basically doing whatever kind of normalisation/standardisation you would usually do with whatever classifier you are using.
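Concretely, for a ridge classifier this just means the usual standardisation step, fitted on the training features only (a sketch; `X_train_feats` etc. are placeholder names for the transformed feature matrices):

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardise each feature column (zero mean, unit variance), then fit the
# ridge classifier; the scaler's statistics come from the training set only.
clf = make_pipeline(StandardScaler(),
                    RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)))
clf.fit(X_train_feats, y_train)
print(clf.score(X_test_feats, y_test))
```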