Time series / sequential data study group

I second this.
Try a very simple model first and see if it works, like a 3-layer MLP.

I’ve found that the best way to deal with fears and doubts about model design is to try the candidate designs, see which work best, try to understand why, and refine. But having a simple initial model will give you a valuable baseline to compare against. Then you can clearly measure the effect of adding complexity to the architecture.

Basically, I want to translate one time series into another. Simplistically, think of a temperature record from the valley and the temperature record on a nearby hill. People currently just apply a height correction to the valley temperature (i.e. -0.65 °C per 100 m of altitude difference). However, this is really crude and I’d like to incorporate other variables, too…
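For reference, that baseline correction amounts to something like this - a minimal sketch, where the function name and example values are mine:

```python
# A minimal sketch of the crude baseline described above: a fixed lapse-rate
# correction of -0.65 °C per 100 m of altitude difference.
def lapse_rate_correction(valley_temp_c, altitude_diff_m, lapse_per_100m=-0.65):
    """Estimate the hill temperature from the valley temperature."""
    return valley_temp_c + lapse_per_100m * (altitude_diff_m / 100)

print(lapse_rate_correction(20.0, 300))  # hill 300 m higher -> 18.05 °C
```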

And all you know about the target station is its weather predictions, not its altitude and other characteristics? Intuitively, I don’t see how that’s possible. In any case, it’s out of my league!

The model can be tuned to a given set of inputs (it need not generalise to other sites).

Not sure what you mean by this. If the model can distinguish between input stations, it will figure out which ones are predictive for each target site.

About “conditional variational autoencoders” I have to plead complete ignorance.

Wow, this is a nice repo @tcapelle!
I’m also working on time series classification problems, and have run many tests with models I’ve recreated from published papers (FCN, ResNet, TCN, LSTM-FCN, ResCNN, InceptionNet, etc). I have to say, though, that I have not taken the time to evaluate them systematically to understand which one may be the better architecture. I’ve basically created the models, tested them on a couple of datasets just to confirm they work as expected, and then used them on other data (I wish I had time to run those comparisons…).
In my experience it’s important to find a good trade-off between model complexity and the amount of data available. The UCR datasets are pretty small in general, so relatively low-complexity architectures work well. What I mean by this is that these models tend to have fewer than 1M parameters, which is tiny compared to, for example, resnet34.
I built an xresnet1D34 (same logic as yours), but have found that it tends to heavily overfit. I guess more complex models like this need a much larger dataset to show their power.

1 Like

@pomo @tcapelle thanks, true words. I’ll start the weekend with a simple MLP and see how it goes :+1:

:+1:
And this is basically a great and more elaborate wording for my favorite Jeremy quote: “Should I do blah? Try blah and see!”

1 Like

Very cool @tcapelle, thanks for sharing this!!
I spent many, many hours on the TSC datasets a few months ago and created so much data that I was not sure how to put it into a useful repository. It was never good enough, so I never published anything and have since moved on to other topics… But this UCR dataset and everything surrounding it has a lot of quirks, so the details matter very much in order to make things comparable. And there is a lot of - let’s say “strange” - science going on around it, with many of the DL papers using the same datasets on the surface while the results are not reproducible and/or comparable.

The dataset you used is not the one used in the paper, I think; you used the newer 2018 version in arff format. There have always been two sources for the dataset, UEA and UCR; both have their own sites and both serve the data, but unfortunately it is not exactly the same data. They use different file formats, which means the floating-point numbers are not always identical when comparing arff, csv, and across source sites. Then a new version arrived in 2018, which added more datasets and CHANGED (corrected) some of the existing ones, again served in two variations by UEA and UCR. A further problem is that in some papers the train and validation sets were switched even before that, on the old version (there is also a list somewhere of which datasets were “wrong” in the old archive and why). And now timeseriesclassification.com has removed the “old” dataset zip and only makes the new (2018) version available, so reproducing results is even more difficult if you didn’t already download it a few months ago…

So I would be a bit careful about claiming anything regarding results here before making absolutely sure that things really are comparable. :wink: Nonetheless there is much room for improvement in the models, and I am not saying I don’t believe your results! I did a lot of experiments, and I fully agree that batch size and kernel size play a key role here. But the standard layer initializations in Keras (which most papers implemented their code in) also differ from the PyTorch standard inits. So playing with inits is very interesting here too: the Keras defaults give worse results than the PyTorch defaults, so just by implementing exactly the same model in PyTorch, it already becomes better :wink:
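To make the init point concrete, here is a rough sketch of what swapping the defaults looks like - assuming, as I believe, that Keras layers default to glorot_uniform weights and zero biases while PyTorch’s Conv1d/Linear use Kaiming-uniform:

```python
import torch.nn as nn

def keras_style_init(m):
    # Override PyTorch's default (Kaiming-uniform) with Keras-style defaults.
    if isinstance(m, (nn.Conv1d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)  # Keras' glorot_uniform
        if m.bias is not None:
            nn.init.zeros_(m.bias)         # Keras' zeros

# Toy model just to show usage; any nn.Module works with .apply().
model = nn.Sequential(nn.Conv1d(1, 128, kernel_size=8), nn.BatchNorm1d(128), nn.ReLU())
model.apply(keras_style_init)
```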

I used the Fawaz review paper as the basis for comparing my results, as they reproduced the results of many other papers on the same datasets (the “old” version, which is no longer downloadable - I have a copy if you want) with a transparent setup, so it is the one paper whose results I was able to verify (I also mainly played with FCN and ResNet).

I will try your code if I find the time, because I am astonished that you need only 40 epochs. While for some datasets that may have been enough, I could not get SOTA results on many sets with fewer than 500 epochs (which is still only a third of what the paper used), so I will have a close look at your settings and setup, try it with my models, and report back. I also found that the one-cycle results did not reach those of plain cosine annealing, so it will be very interesting to compare things! (I have “normalized” the result tables of a few papers and the TSC website (for non-DL methods) so that everything is more comparable, as everyone uses different naming/abbreviations etc., so maybe I can share that too.)

Thank you for your feedback. I removed the commentary about performance until I do a full benchmark. Yes, I also used the Fawaz paper as a reference; from there I found out about the Wang implementations.
Actually, I started with our own dataset, which has 200k curves of 400 points each, from which 3 regression values have to be predicted. There I found that my own resnet was getting impressive results, so I decided to test against publicly available datasets, review papers, etc., so it was research in reverse.
As you can see, some scores are better than the paper’s and some are worse. Probably 40 epochs is low, and the learning rate should also be fine-tuned.
Batch size is also important for these small datasets: for most of the UCR you could fit the whole dataset in memory (bs = len(dataset)), so no SGD is needed, only full-batch GD. It is probably better to use PyTorch’s LBFGS.
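A minimal full-batch LBFGS sketch in PyTorch - the model and tensors here are toy placeholders for your network and the entire (small) training set:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(400, 3))        # toy stand-in
x_all, y_all = torch.randn(128, 1, 400), torch.randn(128, 3)  # whole dataset in memory
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

def closure():
    # LBFGS re-evaluates the loss several times per step, hence the closure.
    optimizer.zero_grad()
    loss = loss_fn(model(x_all), y_all)
    loss.backward()
    return loss

for step in range(20):
    optimizer.step(closure)
```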

1 Like

200k curves is a much better fit for DL than most of the UCR datasets. Sounds very interesting, can you share more about what type of data that is?

Actually no… But it is a regression problem with synthetic data. It is a highly non-linear problem for which analytic solutions are not available. I tried to tackle this problem before with fastai 0.7, but my DL skills were poor back then. This time, I got very good results straight away!

For the type of problem I am solving, the UCR archive is not a good benchmark, but I could not find a better one (I have 200k curves in my dataset). The closest in UCR is probably StarLightCurves.

Greetings to everyone here on this forum.
I would like to start by thanking @oguiza for creating this study group, as well as all of you here contributing your ideas and implementations to solve new real-world time series data mining problems.
I found this forum after this tweet from @jeremy - many thanks - and spent the whole afternoon today going through the different posts here, so I told myself to share some notes/inquiries/comments that we could all benefit from. Pardon me if I only focus on Time Series Classification (TSC) problems, as it is the main focus of my research. So here I go:

  • InceptionTime So Jeremy’s tweet is about our recent architecture InceptionTime for TSC. One of our main findings was that the kernel size is one of the most important characteristics, as it is linked to the concept of the Receptive Field (RF). The latter observation is in line with @tcapelle’s observation regarding the kernel size. Finally, our implementation is in Keras, but I would very much like to see an implementation of InceptionTime in fast.ai, since it seems like fast.ai would not only accelerate the training but would also result in a much more accurate model due to the best practices embedded in the library. So feel free to suggest and send pull requests to the repository! I am new to fast.ai so I am here to learn :wink:
  • Transfer learning I have noticed another observation, shared by @marcmuc: the initialization of a deep learning model highly affects its accuracy. This is indeed supported by our study on transfer learning for time series classification, where we showed that the choice of the source dataset is very important and highly impacts accuracy on the target dataset (we observed both negative and positive transfer). I would therefore like to know if any of you have made similar and/or orthogonal observations when fine-tuning a model on your own TSC problem.
  • TS-CHIEF Outside of the deep learning world, a promising line of research based on the famous random forest classifier is being pioneered by researchers at Monash University in Australia. The model is called TS-CHIEF and is basically a significant improvement upon ProximityForest. I know that the implementation is provided in Java, while most of us here are using Python. I therefore suggest starting a collaborative project, open to everyone who is interested, to provide a Python implementation of TS-CHIEF!
  • LSTM-FCN I noticed that this paper (and its multivariate version) appeared on your radar. The idea of combining LSTMs and FCNs seems appealing at first; however, one thing you should pay attention to is that the results in the paper are erroneous: the authors accidentally tuned the network’s hyperparameters on the test set. Plus, I think the technique of transposing the time series before feeding it into the LSTM does not make any sense; still, I would like to hear your points of view, and whether anyone has succeeded in reproducing decent results with this architecture after fixing the code.
  • Imaging Time Series I have noticed that a lot of you have focused on applying an image transformation before using 2D CNNs. I would like to know if there are any recent results anyone would like to share, because personally I fail to see the benefit of these transformations when you can directly input the raw time series and have the neural network learn the necessary transformations. Perhaps someone would be kind enough to enlighten me if I am missing some clear advantage of imaging time series data.
  • Recurrent models In my experience I have found it hard to train accurate recurrent architectures such as RNNs and LSTMs for the TSC task. However, I am curious whether anyone has results that would motivate research in this direction?
  • Regression Recently I started searching for time series regression problems (not forecasting), that is, predicting a single real value based on a whole input time series. So the question here is: does anyone have interesting datasets that would be categorized as time series regression problems?

So thanks to everyone again for this great forum and I can’t wait to start discussing with you all.

11 Likes

Thanks for all the great links @hfawaz! The project above sounds like a great idea. If you need to start with a fast tree-growing foundation, you might want to check out this project:

It’s a random-forest-based algorithm I created, with a fast C++ implementation and a small Python wrapper.

1 Like

Great, thanks for sharing - this will indeed help a lot. We ought to get started, then!

Hey Hassan, great to have you on this forum! I just scanned your InceptionTime paper - it looks very promising. And thanks for creating reproducible science; unfortunately that is not always the case, especially in the TS area.

Kernel Size: One question regarding kernel sizes: in your paper you now use kernel sizes of 40 - 20 - 10, which are even numbers. I have always wondered about the thinking behind Wang’s 8 - 5 - 3 in FCN and ResNet; I have hardly ever seen even kernel sizes in image models. All of those implementations were always in Keras, where you just stick a padding='same' in there and the issue is hidden away - but that actually leads to uneven padding (which in PyTorch you have to create manually, so it becomes obvious). So why is that?
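To illustrate what I mean, a small sketch of the asymmetric padding an even kernel forces on you in PyTorch (kernel size 8 on a length-96 series):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 96)                    # (batch, channels, length)
k = 8
pad_left, pad_right = (k - 1) // 2, k // 2   # 3 on the left, 4 on the right
out = torch.nn.Conv1d(1, 1, kernel_size=k)(F.pad(x, (pad_left, pad_right)))
print(out.shape)                             # torch.Size([1, 1, 96]) - 'same' length
```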

Transfer Learning: While in vision a network trained on ImageNet seems almost universally usable, the same definitely does not apply to TS models in my experiments. So I think you have to choose a model pretrained on a very similar dataset/domain, and even that sometimes does not help much. I have therefore not made much use of pretrained TS models so far. This is also one explanation for

Imaging Time Series, I think: by transferring the time series problem to the image domain, one can make use of pretrained models and well-tuned architectures - vision seems years ahead of TS in this regard (which seems to be changing now, thanks to you! :wink: ). So make something an image - be it a time series, audio, etc. - and you can easily reuse everything done for vision. Having said that, I always thought it is kind of a huge waste of resources to convert e.g. a 96-step energy time series into e.g. a 224x224x3 image and then run it through huge models. The information is contained in 96 ordered numbers, so a much smaller 1D model should be much more performant… (if the right model can be found).

LSTM-FCN and its successor GRU-FCN (an 11-page paper about replacing the four letters LSTM with GRU in the same Keras code and gaining even more stellar - yet unreproducible - results): your comment made me very happy. After trying to reimplement their model in PyTorch, I kept thinking my dimensions were wrong, but after rereading the paper I found that their “dimension shuffle” seems very strange, to say the least. They swap the dimensions of the univariate time series and then pass it through an RNN. But that means it only has one time step, and a 1-time-step RNN is just a regular NN (the multivariate case is even stranger). And then they apply a dropout of 80% to the result of that. I could never confirm my hunch that this made little sense - until now (or am I just not getting it?) :wink:
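Here is the “dimension shuffle” as I read it from the paper - a sketch, so correct me if I misunderstood their code:

```python
import torch
import torch.nn as nn

# A univariate series shaped (batch, timesteps, 1) has its last two dims
# swapped, so the LSTM sees ONE time step with 96 features and unrolls once.
x = torch.randn(32, 96, 1)             # 32 series, 96 steps, 1 variable
x_shuffled = x.transpose(1, 2)         # -> (32, 1, 96): a single time step
lstm = nn.LSTM(input_size=96, hidden_size=8, batch_first=True)
out, _ = lstm(x_shuffled)
print(out.shape)                       # torch.Size([32, 1, 8])
```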

UCR Dataset / Metrics:
You are publishing in this area, so maybe you can enlighten me (or even change something :wink: about it). I am aware that researchers try to use the same datasets etc. in order to compare results. But one third of the UCR archive (85 datasets in total) consists of artificial image time series (image outlines converted into time series). This may have made sense at some point, but with today’s vision models this use case is kind of obsolete, right? So why continue benchmarking on it? Shouldn’t today’s models be benchmarked against more sensor data and more multivariate series (industrial/medical sensors, motion, spectrograms, sound etc.) in order to actually be relevant in the real world? (More multivariate data was made available with the 2018 version of UCR, but hardly anybody seems to use it?!)
Why is accuracy the metric everyone compares on? From my own experience (you could call it stupidity) on e.g. the Earthquakes dataset, it is easy to see that accuracy is a very bad metric for many of the datasets (some are binary and very imbalanced). Why not “update the metric” to something more useful?
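A toy illustration of why accuracy misleads on imbalanced sets - not the actual Earthquakes data, just a made-up 90/10 split:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 90 + [1] * 10   # 90/10 class imbalance
y_pred = [0] * 100             # a useless "always majority class" baseline
print(accuracy_score(y_true, y_pred))             # 0.9 - looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0 - reveals the problem
```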

4 Likes

Hi @marcmuc
Thanks for taking the time for this thorough reply!

Kernel size For the even kernel sizes: indeed, it is very uncommon and unintuitive to use even values - I am not sure about the choice made by Wang et al. (2017) for ResNet and FCN. This is why for InceptionTime the actual implementation uses odd kernel sizes. I should update the paper with the real kernel size values used in the implementation (the paper is currently under review, so I will definitely make sure to update it once we have a first decision) - so thanks for pointing this out.

Transfer learning As for this second point, we also observed that one should be careful when choosing the source dataset to transfer from, but I think there is much more potential for transfer learning here that is still to be discovered. For now, we have proposed a baseline solution based on DTW to discover the best source dataset for a given target dataset.

Imaging time series I agree that having images allows us to re-use most of the deep models proposed for images, but I think there is also huge potential in designing our own algorithms - in fact, by having one fewer spatial dimension than images, we are able to explore and try out more architectures on modern hardware. For example, you could never imagine a kernel of size (127x127) for images, but you can easily imagine a filter of size (127x1) for time series data that fits on a single GPU!
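A quick sanity check of that point - a length-127 1D kernel runs comfortably, whereas its 127x127 2D counterpart would be enormous:

```python
import torch

conv = torch.nn.Conv1d(in_channels=1, out_channels=64, kernel_size=127, padding=63)
x = torch.randn(8, 1, 1000)   # 8 univariate series of length 1000
print(conv(x).shape)          # torch.Size([8, 64, 1000])
```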

LSTM-FCN So basically we are on the same page: non-reproducible research, and a “dimension shuffle” feature that does not make any sense.

UCR benchmark I agree that the UCR archive does not represent all real-life use cases of TSC problems, which is one of the things I like most here: everyone comes with their own TSC problem and tries to figure out how to apply SOTA algorithms to it. Secondly, I think the image datasets are still there out of legacy, but indeed they are no longer needed given current computer vision algorithms. I agree that we should move towards multivariate TSC problems, which is why the community started publishing the new MTS (Multivariate Time Series) archive that you mentioned earlier. I think we have not started using it yet because it is still very small (from my experience you can run ResNet, FCN and NN-DTW, and the statistical test will tell you that they all perform the same). I believe once we have a big enough archive, the community will start focusing more on MTS problems.

Accuracy I agree that accuracy is not the best metric for all datasets, especially unbalanced ones. But I think the motivation behind using it is to have one metric that is convenient for most datasets. This is why in my papers I report accuracy (just to be able to compare with SOTA), but in my repositories you can see that I include precision and recall as well :wink:

4 Likes

I think so too. There are a number of common patterns that occur frequently in time series: for instance cycles (weekly, monthly, hourly…), linear trends, exponential trends, and so forth. The most basic of these are hard-coded into the standard algorithms (i.e. splitting out “seasonality” and “trend”) - but I’d expect that by transfer learning from lots of real-world datasets we could have a much larger range of patterns learnt automatically. Perhaps one day the idea of “seasonality” will be as obsolete in TS as “SIFT” and “Haar” features are in computer vision today… :slight_smile:
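For anyone who hasn’t seen that hard-coding in action, a minimal sketch of classical decomposition on synthetic data - statsmodels’ seasonal_decompose is just one common implementation:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

t = np.arange(365)
y = 0.05 * t + 2 * np.sin(2 * np.pi * t / 7) + 0.3 * np.random.randn(365)
series = pd.Series(y, index=pd.date_range("2019-01-01", periods=365, freq="D"))
parts = seasonal_decompose(series, model="additive", period=7)  # weekly cycle
print(parts.trend.dropna().head())   # the hand-split "trend" a NN could learn instead
```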

I’d guess that for any algorithm that works well on an image encoding of a time series, we can find a better one that works on the raw signal.

3 Likes

I am glad we all agree that there is huge potential here. But unlike computer vision, we do not have a huge amount of publicly available labeled datasets, which is why I think data acquisition is very important here. Competitions like Kaggle’s and others help with that, in addition to people making their data publicly available.

1 Like

@hfawaz have you tried running a regular GD method on UCR? The datasets are so small that you could use something like LBFGS instead of Adam/SGD.
It may improve accuracy considerably. Is there an implementation of (non-stochastic) GD in fastai, @jeremy?
I have used PyTorch’s built-in LBFGS before; I may give it a try.

For InceptionTime we did play with the batch size: basically, with a big enough batch size you will be running batch gradient descent instead of mini-batch gradient descent over these small datasets. I am not sure how that would affect ResNet and FCN, but I think the original implementation of ResNet and FCN uses a formula to compute the mini-batch size, which I found to be very suboptimal.

Yeah just use the ones in pytorch or scipy - they work fine.