Time series/ sequential data study group

Thank you so much @oguiza!!!

Is there any reference paper that you are using for defining all those transformations?

Thanks for your question @vrodriguezf!
I have now included the paper references in the notebook.
Most of the ideas come from this paper:
Data augmentation of wearable sensor data for Parkinson's disease monitoring using convolutional neural networks. arXiv preprint arXiv:1706.00527
However, I had to adapt the code in most cases.
For now, they all run on the CPU.
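For anyone curious what these transformations look like, here is a minimal sketch of two of the augmentations from that paper (jittering and scaling), assuming a `(channels, length)` tensor; the function names and default sigmas are mine, not from the notebook:

```python
import torch

def jitter(x: torch.Tensor, sigma: float = 0.03) -> torch.Tensor:
    # Add Gaussian noise to a (channels, length) time series
    # (one of the augmentations in Um et al., arXiv:1706.00527).
    return x + torch.randn_like(x) * sigma

def scale(x: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # Multiply each channel by a random factor drawn around 1.
    return x * (1.0 + torch.randn(x.shape[0], 1) * sigma)
```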

1 Like

Thanks for the detailed response!

I found it really hard to believe that [make 100 features] could perform as well as [make 10k features, choose the best 100], but when I tested it, you were right: They performed about the same. How do you interpret this result?

One question I have on this (may be a dumb one, I've only been listening in): what is the cardinality or distribution of those 100 features vs the 9,900 left out? Were they mostly binary, or not very verbose?

I'm not sure I understand what cardinality/binary/verbose mean in this context, could you explain what you mean?

I am super interested in what the selected kernels look like, especially whether they'll recognizably pick up on classic time series features like seasonality. (I think this is sort of what we're hoping for when applying convolutions to TS.) High on my list of things to look at.

2 Likes

Thanks a lot to you and @angusde, and sorry for not replying before, but something came up in the meantime. I'm retraining a good ol' LSTM baseline, then I will try ROCKET. However, it sounds like you already tried it for forecasting and it didn't work out well, right?

No, I haven't managed to get good results using RNNs for forecasting. Interestingly, the M4 Competition (which seems to be the biggest and most prestigious forecasting competition held to date) was won by a hybrid exponential smoothing + RNN model. This might be a good place to start looking for an effective RNN-based forecasting model.

My impression from reading over the results was that the competition was dominated by expert practitioners. The winner, Slawek Smyl from Uber, appears to have done some very careful and clever engineering. Likewise, the LSTM implementation in fastai has a lot of clever tweaks to get it to work well. All of which is to say, I suspect the devil is in the details.

I have not tried ROCKET for forecasting yet. If you do, please let me know how it goes!

1 Like

Recent paper here, using an attention network for multivariate forecasting:

DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting

Code here:

3 Likes

Thank you for this! What if I want to learn from multiple time series, but forecast only on one of them? Can I use this architecture for that too?

Good question. I'm not sure. The paper is pretty brief and I have yet to try the repository code myself, but I was hoping to when I have some time.

Thank you very much for your patience @MadeUpMasters and @scart97. I am starting to look at this now; sorry it has taken so long, and it's probably going to take me a little while to go through everything.

Out of interest, what is the typical resolution (if there is a typical resolution), as in width and height in pixels, of the spectrograms you are using with CNNs? And what kind of processing time is needed to produce a spectrogram from raw audio input?

I've started by looking at the 10-speaker dataset. This is interesting (coming from time series classification): the signals / time series are relatively long, with variable lengths, and the raw 'sampling rate' is (or seems) high, in the sense that there is a lot of data per unit of time. At the very least, I think this means that it would be useful to be able to handle variable-length input properly (another thing that is on my to-do list). Obviously, computation time is also critical (more below).

My initial approach has been simply to downsample the input by 100x, 50x, and 25x. My 'pipeline' is input -> downsample -> normalise -> ROCKET (apart from downsampling and normalisation, I'm not doing anything, and I'm using vanilla ROCKET). Accuracy seems to increase as the sampling rate increases (toward the raw sampling rate); at 25x, accuracy is ~90% for 100 kernels and ~95% for 1K kernels.
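For anyone who wants to reproduce this kind of experiment, here is a rough sketch of that pipeline, assuming the `generate_kernels` / `apply_kernels` functions from the ROCKET reference implementation (`rocket_functions.py`) and `(n_samples, length)` float arrays; the downsampling factor and helper names are mine:

```python
import numpy as np
from rocket_functions import generate_kernels, apply_kernels  # ROCKET reference code

def normalise(X: np.ndarray) -> np.ndarray:
    # Zero mean, unit variance per series.
    return (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-8)

def transform(X_train, X_test, factor=25, num_kernels=1_000):
    # Downsample by keeping every `factor`-th sample, then normalise.
    X_train, X_test = normalise(X_train[:, ::factor]), normalise(X_test[:, ::factor])
    kernels = generate_kernels(X_train.shape[1], num_kernels)
    return apply_kernels(X_train, kernels), apply_kernels(X_test, kernels)
```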

Ok, so far, these are fairly inane observations (and accuracy could of course be higher). The point is, this suggests that, at least for this dataset, ROCKET is picking out useful features from the raw input. (The other thing it suggests is that, for this dataset, the bulk of useful features exist at a frequency way below the raw sampling rate.)

However, I hesitate to say what those features are because, at this point, I just don't know. Your work suggests that it is effective to 'force feed' a frequency breakdown to CNNs via spectrograms (or to ROCKET, treating the spectrograms as multivariate input). Maybe (but only maybe) ROCKET is able to pull useful frequency-domain features from audio data.

In principle, even without dilation, but certainly with dilation, convolutional kernels (even random convolutional kernels) are frequency selective. However, in any case, ROCKET isn't going to work for everything (and who knows, even for audio the relevant features may not necessarily be strictly in the frequency domain anyway).

However, this is just one dataset. I've just started looking at the 250-speaker dataset you mentioned, and I'll try and see what is going on (or not) with this.

Downsampling obviously lightens the computational burden, but it's not a great solution. Increasing the stride (as you have done) is probably a more sensible approach. However, to the extent that relevant features do exist at the highest frequencies, I guess that the only solutions (for ROCKET) are: (a) raw audio + parallel CPU or GPU; or (b) some kind of preprocessing that filters the input into different frequencies first (e.g., spectrograms).

Sorry for the slow response.

It's difficult to say. It may be that a relatively small number (more specifically, < 100) of kernels are responsible for a lot of the performance, and that with fairly high probability enough of these kinds of kernels will appear in most randomly-generated sets of 100 (and would also be among the 100 most highly weighted kernels in any larger set). Maybe these more effective kernels (and maybe not just these) get duplicated a lot as the number of kernels increases. It may be that the 'rest' of the kernels are only useful in (very large) aggregate.

However, I'm really just speculating. One possibility would be to select (or, really, generate in the first place) kernels which are as uncorrelated as possible. However, this might do nothing or end up being counterproductive.
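One cheap way to probe that idea (purely a sketch of the speculation above, not something anyone here has tested; all names are mine) would be a greedy correlation filter over the transformed features:

```python
import numpy as np

def select_uncorrelated(features: np.ndarray, k: int = 100, max_corr: float = 0.9):
    # features: (n_samples, n_features) ROCKET feature matrix.
    # Keep a feature only if its absolute correlation with every
    # already-kept feature stays below max_corr; stop after k features.
    # (Constant features produce NaN correlations and are skipped.)
    corr = np.abs(np.corrcoef(features, rowvar=False))
    selected = []
    for j in range(corr.shape[0]):
        if np.isnan(corr[j, j]):
            continue
        if all(corr[j, i] < max_corr for i in selected):
            selected.append(j)
            if len(selected) == k:
                break
    return selected
```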

Really, I haven't had the opportunity to sit down with this problem and think it through properly yet. Basically, so far, I've had the same experience as you: try all sorts of typical feature selection methods, but with little to show for it.

Does anyone know of a time series repository of state-of-the-art NNs, pretrained or just code? It doesn't even have to be state of the art, as long as there is clean code that can be used out of the box for general time series purposes (forecasting, classification, etc.). Kind of like HuggingFace for NLP? https://github.com/huggingface

For time series classification, the most straightforward is, in my opinion, https://github.com/timeseriesAI/timeseriesAI from @oguiza. Just follow the notebooks and you are done.

In terms of forecasting I do not know any similar thing though.

1 Like

Hi @fuelnow

In addition to @vrodriguezf's answer, you can have a look at dl-4-tsc for simple code to run for time series classification.

For a tutorial, you can check out this Google Colab notebook.

Finally, if you need pre-trained models, you can have a look at this page.

For other areas such as forecasting, I am not familiar with similar repositories.

Hope this helps.

Cheers,

3 Likes

Just wanted to let you know I have updated the TS data augmentation notebook with some new data augmentation functions and RandAugment.
Also, all TS tfms can be applied either to a single TS (like any regular tfm) or to a batch (applying them as a train_dl tfm), which makes them much faster!
RandAugment is a new technique developed by Google that simplifies/eliminates the need to search for the best data augmentations for a given dataset. It basically applies 1-3 randomly selected tfms to each batch, and it has achieved SOTA on ImageNet (+1% compared to the previous best).
I have created some code to make it very easy to use. All you need to do to apply the TS tfms and RandAugment (the recommended approach) is:

learn = Learner(data, model, metrics=accuracy).randaugment()

I have used it in a few cases, and the results are pretty good.
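For anyone curious what this does under the hood, the core RandAugment idea is roughly the following (a minimal sketch, not @oguiza's actual implementation; the function name and transform list are placeholders):

```python
import random
import torch

def rand_augment_batch(xb: torch.Tensor, tfms, n_min: int = 1, n_max: int = 3):
    # xb: batch of shape (bs, channels, length); tfms: list of batch transforms.
    # Apply between n_min and n_max randomly chosen transforms to the batch.
    for tfm in random.sample(tfms, k=random.randint(n_min, n_max)):
        xb = tfm(xb)
    return xb
```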

5 Likes

Thanks for your reply and interest.

Generally the height is 128, and the width depends on a few factors (hop_length and the duration of the audio). It takes ~4 ms to generate a 128x128 spectrogram, and this scales linearly with increased duration.
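For reference, a mel spectrogram along these lines can be produced with librosa roughly as follows (a sketch; the sample rate and hop_length here are illustrative, not necessarily the poster's settings):

```python
import numpy as np
import librosa

def make_spectrogram(y: np.ndarray, sr: int = 16_000,
                     n_mels: int = 128, hop_length: int = 256) -> np.ndarray:
    # Height of the output is n_mels; width grows linearly with
    # the duration of y (roughly len(y) / hop_length frames).
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    return librosa.power_to_db(S, ref=np.max)  # log scale, as usually fed to CNNs
```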

We handle variable lengths by random-cropping a specific duration (e.g. 2 seconds) of the signal or spectrogram. Clips that are shorter than the given duration are padded (we support a number of padding options; the default is zero/silence padding on both sides of the signal or spectrogram).
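A minimal sketch of that crop-or-pad behaviour (the real library supports more padding options; the function name is mine):

```python
import torch
import torch.nn.functional as F

def crop_or_pad(sig: torch.Tensor, target_len: int) -> torch.Tensor:
    # sig: 1D signal of shape (length,).
    n = sig.shape[0]
    if n > target_len:
        # Random crop of target_len samples.
        start = torch.randint(0, n - target_len + 1, (1,)).item()
        return sig[start:start + target_len]
    # Zero/silence padding, split between both sides.
    pad = target_len - n
    return F.pad(sig, (pad // 2, pad - pad // 2))
```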

I'm fairly confident that this is what's happening. It's not unreasonable, as there are fully convolutional networks for speech recognition (instead of extracting spectrograms, raw audio is fed to a network that learns the most useful frequency ranges/filterbanks). Thus it's not surprising that a random conv kernel is pulling out some type of frequency info from raw audio, although each kernel has only very slight predictive power.

I've mostly moved on to work on other stuff, letting ROCKET for raw audio sit in the back of my mind, but I think there is potential and I plan to come back to it at some point. I also think what @scart97 is doing, applying ROCKET to the extractions themselves, has a lot of promise. We often have to limit either the duration or resolution of our spectrograms because they quickly become too large/slow for computer vision models to process, but ROCKET will be extremely fast even on very large spectrograms, and through that there may be a path to beating computer vision applied to spectrograms.

2 Likes

Thanks for following up, I'm sorry I couldn't be of more help here. Thanks for the additional info.

It looks like there may be limitations to ROCKET with raw audio. I've now had the chance to look at the 250-speaker dataset and, yes, basically, ROCKET doesn't seem to work on the raw audio here (but for reasons unknown at this stage...). For the moment, unless spectrogram + ROCKET is an improvement, there's nothing obvious to me that is going to improve performance on this dataset without changing the internal configuration of ROCKET in some way.

Re variable length, I meant ROCKET in particular... if you are using some kind of global pooling, then input length basically doesn't matter, you don't even need to pad; it's just that the current implementation is a bit clumsy in terms of storing and handling variable-length time series. What you're doing for the spectrograms makes total sense.
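To make the global pooling point concrete: ROCKET's two features per kernel (proportion of positive values and max) are global over the convolution output, so series length drops out. A sketch, with names of my choosing:

```python
import torch

def pooled_features(conv_out: torch.Tensor) -> torch.Tensor:
    # conv_out: (n_kernels, out_length) convolution output for ONE series.
    # PPV (proportion of positive values) and max are both global over time,
    # so out_length (and hence input length) can vary freely between series.
    ppv = (conv_out > 0).float().mean(dim=1)
    mx = conv_out.max(dim=1).values
    return torch.stack([ppv, mx], dim=1).flatten()  # (2 * n_kernels,)
```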

I think the spectrogram + CNN (or ROCKET, if it works...) approach makes a lot of sense in terms of both providing a frequency breakdown and, to a greater or lesser extent, disconnecting the workload of the CNN (or ROCKET) from the dimensionality of the raw input.

I don't expect or want ROCKET to 'beat' other models on every task (or even most tasks...). However, if spectrogram + ROCKET makes ROCKET work where it otherwise doesn't work well, particularly for audio, that is very useful information from a practical point of view. More generally, it's really equally important, and interesting, to try and work out where ROCKET (and other models) don't work well. For convolutional architectures in particular, I think it's a big question for 1-dimensional input (incl., obviously, multivariate): what features are actually being picked up... to what extent do these belong neatly to the time or frequency domains, etc.

Anyway, if I have any better ideas about ROCKET + audio I'll pass them on. In the meantime, of course, don't hesitate to ask if you have any further questions, etc.

We'll be working to try and build on ROCKET, and to work through some of the more mysterious aspects (which are also relevant to CNNs more generally). This may or may not turn out to make a difference for audio. At the very least, I will be working through the implementation, and I'm sure there are substantial improvements to be made there (there are quite a few things that can be done better, but I haven't had the chance to deal with them yet)...

1 Like

I'm arriving late to the party, but trying to catch up with ROCKET. Thanks so much for your demo notebooks and the PyTorch implementation.

A few questions and clarifications, if you would be so kind as to address them...

  • ā€œcenteringā€ means to make the mean of the kernel weights zero?

  • The data.show_batch() graphs in notebook "05_ROCKET_a_new_SOTA_classifier" are a convenient way to display the 20,000 features for each time series. But they are not in themselves related to the temporal sequence in any way?

  • The "normalize 'per feature'" step in the fastai section is your invention, not part of the ROCKET paper? I did experiment with omitting it for both max and percent-positive features, and got worse results.

  • It seems like one could make a custom ROCKET Module that calls F.conv1d(...) directly to reduce GPU overhead a bit (see the sketch below).
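Along the lines of that last bullet, a minimal sketch of what such a Module might look like (my own simplification: a single kernel length, no dilation or padding variety, so it is not the full ROCKET kernel distribution):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RocketConv(nn.Module):
    def __init__(self, c_in: int = 1, n_kernels: int = 1_000, ks: int = 9):
        super().__init__()
        w = torch.randn(n_kernels, c_in, ks)
        w -= w.mean(dim=-1, keepdim=True)    # 'centering': zero-mean kernel weights
        self.register_buffer('weight', w)    # fixed random kernels, never trained
        self.register_buffer('bias', torch.empty(n_kernels).uniform_(-1, 1))

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (bs, c_in, length) -> features: (bs, 2 * n_kernels)
        out = F.conv1d(x, self.weight, self.bias, padding=self.weight.shape[-1] // 2)
        ppv = (out > 0).float().mean(dim=-1)  # proportion of positive values
        return torch.cat([ppv, out.max(dim=-1).values], dim=1)
```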

Thanks for helping me get oriented to this new approach.

Malcolm

Just on the per-feature normalisation... I don't think it is in the paper explicitly, but it is in our implementation + experiments. It's not really related to the transform, but rather to whatever classifier you use with the transform (and, as such, it is kind of open-ended). For example, the kind of normalisation appropriate for a ridge regression classifier is different from the kind appropriate for logistic/softmax regression (and per-feature normalisation may be completely irrelevant for some classifiers). In other words, in performing per-feature normalisation, you are basically doing whatever kind of normalisation/standardisation you would usually do with whatever classifier you are using.
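Concretely, for a ridge classifier this just means the usual standardisation step, fitted on the training features only (a sketch; `X_train_feats` etc. are placeholder names for the transformed feature matrices):

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardise each feature column (zero mean, unit variance), then fit the
# ridge classifier; the scaler's statistics come from the training set only.
clf = make_pipeline(StandardScaler(),
                    RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)))
clf.fit(X_train_feats, y_train)
print(clf.score(X_test_feats, y_test))
```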