Time series / sequential data study group

Actually no… But it is a regression problem, with synthetic data. It is a highly nonlinear problem for which analytic solutions are not available. I tried to tackle this problem before with fastai 0.7, but my DL skills were poor. This time, I got very good results straight away!

For the type of problem that I am solving, the UCR dataset is not a good benchmark, but I could not find a better one (I have 200k curves in my dataset). The closest in UCR is probably StarLightCurves.

Greetings to everyone here on this forum.
I would like to start by thanking @oguiza for creating this study group as well as all of you here contributing with your ideas and implementations to solve new real world time series data mining problems.
I found this forum after this tweet from @jeremy - many thanks - so I spent the whole afternoon today going through the different posts here and told myself to share some notes/inquiries/comments that we could all benefit from. Pardon me if I am only focusing on Time Series Classification (TSC) problems, as it is the main focus of my research. So here I go:

  • InceptionTime: Jeremy’s tweet is about InceptionTime, our recent architecture for TSC. One of our main findings was that the kernel size is one of the most important characteristics, which is linked to the concept of the Receptive Field (RF). This is in line with @tcapelle’s observation regarding the kernel size. Our implementation is in Keras, but I would very much like to see an implementation of InceptionTime in fast.ai, since it seems that fast.ai would not only speed up training but would also result in a much more accurate model thanks to the best practices embedded in the library. So feel free to make suggestions and send pull requests to the repository! I am new to fast.ai so I am here to learn :wink:
  • Transfer learning: Another observation, shared by @marcmuc, is that the initialization of a deep learning model strongly affects its accuracy. This is supported by our study on transfer learning for time series classification, where we showed that the choice of the source dataset is very important and strongly impacts the accuracy on the target dataset (we observed both negative and positive transfer). I would therefore like to know if some of you have made similar and/or orthogonal observations when trying to fine-tune a model on your own TSC problem.
  • TS-CHIEF: Outside of the deep learning world, a promising area of research based on the famous random forest classifier is being pioneered by researchers at Monash University in Australia. The model is called TS-CHIEF, which is basically a significant improvement upon ProximityForest. I know that the implementation is provided in Java, while most of us here are using Python. Therefore I suggest starting a collaborative project, open to everyone who is interested, to provide a Python implementation of TS-CHIEF!
  • LSTM-FCN: I noticed that this paper (and its multivariate version) appeared on your radar. The idea of combining LSTMs and FCNs seems appealing at first, but you should be aware that the results in the paper are erroneous: the authors accidentally tuned the network’s hyperparameters on the test set. Plus, I think that transposing the time series before feeding it into the LSTM does not make any sense. However, I would like to know your points of view, and whether anyone has succeeded in reproducing some decent results with this architecture after fixing the code.
  • Imaging Time Series: I have noticed that a lot of you have focused on applying some image transformation before using 2D CNNs. I would like to know if there are any recent results that anyone would like to share, because personally I fail to see the benefit of these transformations if you can directly input the raw time series and have the neural network learn the necessary transformation. Perhaps someone would be kind enough to enlighten me if I am missing some clear advantage of imaging time series data.
  • Recurrent models: In my experience it is hard to train accurate recurrent architectures such as RNNs and LSTMs for the TSC task. However, I am curious whether anyone has results that would motivate research in this direction?
  • Regression: Recently I started searching for time series regression problems (not forecasting) - that is, predicting a single real value based on the whole input time series. So the question here is: does anyone have some interesting datasets that would be categorized as Time Series Regression problems?

So thanks to everyone again for this great forum and I can’t wait to start discussing with you all.
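
On the receptive-field point in the InceptionTime item above: for a stack of stride-1, undilated convolutions, the receptive field grows by `k - 1` per layer. Here is a minimal sketch of that arithmetic. The function name `receptive_field` and both example stacks are illustrative assumptions, not taken from the InceptionTime code:

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked 1D convolutions with stride 1 and no dilation."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each layer widens the view by (k - 1) time steps
    return rf

# Hypothetical stacks: three layers with kernels 8, 5, 3 versus six layers with kernel 41.
small = receptive_field([8, 5, 3])
large = receptive_field([41] * 6)
print(small, large)
```

Under these assumptions the long-kernel stack sees a far longer stretch of the series per output step, which is one way to read why kernel size matters so much.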

11 Likes

Thanks for all the great links @hfawaz! The project above sounds like a great idea. If you need to start with a fast tree-growing foundation, you might want to check out this project:

It’s an algorithm based on random forests I created and is a fast C++ implementation with a little python wrapper.

1 Like

Great, thanks for sharing! This will indeed help a lot, so we ought to get started then!

Hey Hassan, great to have you on this forum! I just skimmed your InceptionTime paper and it looks very promising. And thanks for creating reproducible science, which is unfortunately not always the case, especially in the TS area.

Kernel Size: One question regarding kernel sizes: in your paper you now use kernel sizes of 40 - 20 - 10, which are even numbers. I have always wondered what the thinking behind Wang’s 8-5-3 FCN and ResNet was, since I have hardly ever seen even kernel sizes in image models. All of those implementations were in keras, where you just stick a padding='same' in there and it is hidden away, but that actually leads to uneven padding (which in pytorch you have to create manually, so it becomes obvious). So why is that?
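
To make the uneven-padding point concrete, here is a small sketch of the arithmetic for a stride-1, undilated convolution. The helper names and the series length of 96 are hypothetical, and putting the extra padding step on the right is an assumption of this sketch, not something the thread states about any framework:

```python
def same_padding(kernel_size):
    """Left/right zero padding that keeps the length unchanged for a stride-1 convolution."""
    total = kernel_size - 1
    left = total // 2
    right = total - left  # assumption: the extra step (if any) goes on the right
    return left, right

def conv_out_len(length, kernel_size, pad_left, pad_right):
    """Output length of a stride-1, undilated 1D convolution."""
    return length + pad_left + pad_right - kernel_size + 1

for k in (3, 5, 8):
    left, right = same_padding(k)
    print(k, (left, right), conv_out_len(96, k, left, right))

# With the same padding on both sides, an even kernel cannot preserve the length:
print(conv_out_len(96, 8, 3, 3), conv_out_len(96, 8, 4, 4))
```

Odd kernels split the padding evenly, while an even kernel needs one more step on one side than on the other to keep the output the same length as the input.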

Transfer Learning: While in vision a network trained on ImageNet seems almost universally usable, the same definitely does not apply to TS models in my experiments. So I also think you have to choose a model pretrained on a very similar dataset/domain, but even that sometimes does not help much. So I have not made much use of pretrained TS models so far. This is also one explanation for

Imaging Time Series I think: by transferring the time series problem to the image domain one can make use of pretrained models and well tuned architectures, since vision seems years ahead of TS in this regard (which seems to be changing thanks to you now! :wink: ). So make something an image - be it time series, audio etc. - and you can easily reuse everything done for vision. Having said that, I always thought it is kind of a huge waste of resources to convert e.g. a 96 step energy time series into an e.g. 224x224x3 image and then run it through huge models. The information is contained in 96 ordered numbers, so a much smaller 1D model should be much more performant… (if the right model can be found).

LSTM-FCN and its successor GRU-FCN (an 11 page paper about replacing the four letters LSTM with GRU in the same keras code and gaining even more stellar - yet unreproducible - results): Your comment made me very happy. After trying to reimplement their model in pytorch I kept thinking my dimensions were wrong, but after rereading the paper I found that their “dimension shuffle” seems very strange, to say the least. They swap the dimensions of the univariate time series and then pass it through an RNN, but that means the RNN only sees one time step, and a 1-timestep RNN is just a regular NN (the multivariate case is even stranger). And then they add a dropout of 80% to the results of that. Until now I could never confirm my “hunch” that this made little sense (or was I just not getting it?) :wink:
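
The "1-timestep RNN is just a regular NN" argument can be checked with a toy example. The sketch below swaps in a plain Elman RNN with a single hidden unit instead of an LSTM, purely to keep it short, and all weights and values are made up. With a zero initial state and only one step, the recurrent weight never influences the output:

```python
import math

def rnn_last_state(steps, w_x, w_h, b):
    """Elman RNN with one hidden unit.

    steps: list of time steps, each a list of features.
    w_x: one input weight per feature; w_h: recurrent weight; b: bias.
    """
    h = 0.0  # initial hidden state
    for features in steps:
        pre = sum(w * f for w, f in zip(w_x, features)) + w_h * h + b
        h = math.tanh(pre)
    return h

series = [0.5, -1.0, 2.0, 0.25]   # univariate series with 4 time steps
shuffled = [series]               # after the shuffle: 1 time step with 4 features
w_x, b = [0.1, -0.2, 0.3, 0.4], 0.05

# Two very different recurrent weights, same single-step input.
out_a = rnn_last_state(shuffled, w_x, w_h=0.9, b=b)
out_b = rnn_last_state(shuffled, w_x, w_h=-3.0, b=b)
# A plain dense unit with tanh activation on the same 4 numbers.
dense = math.tanh(sum(w * f for w, f in zip(w_x, series)) + b)
print(out_a, out_b, dense)
```

All three values coincide, so on the shuffled input the recurrent part contributes nothing. An LSTM adds gates, but with a zero initial state and a single step it has no recurrence to exploit either.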

UCR Dataset / Metrics:
You are publishing in this area, so maybe you can enlighten me (or even change something about it :wink:). I am aware that, in order to compare results, researchers try to use the same datasets etc. But one third of the UCR dataset (85 sets) consists of artificial image time series (image outlines converted into time series). This may have made sense at some point in time, but with today’s vision models this use case is kind of useless, right? So why continue benchmarking on that? Shouldn’t today’s models reflect more sensor data and more multivariate series (industrial/medical sensors, motion, spectrograms, sound etc.) in order to actually be relevant in the real world? (More multivariate data was made available with the 2018 version of UCR, but hardly anybody seems to use it?!)
Why is accuracy the metric everyone compares on? From my own experience (you could call it stupidity) on e.g. the Earthquakes dataset, it is easy to see that accuracy is a very bad metric for many of the datasets (some are binary and very imbalanced). Why not “update the metric” to something more useful?
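
A quick illustration of why accuracy can mislead on an imbalanced binary set. The 80/20 class split below is made up and does not describe any UCR dataset:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical imbalanced test set: 80 negatives, 20 positives.
y_true = [0] * 80 + [1] * 20
y_pred = [0] * 100  # a "model" that always predicts the majority class

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(acc, rec)
```

The always-negative predictor looks respectable on accuracy while finding none of the positives, which is what a per-class metric such as recall exposes.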

4 Likes

Hi @marcmuc
Thanks for taking the time for this thorough reply!

Kernel size: Even kernel sizes are indeed very uncommon and unintuitive - I am not sure about the choice made by Wang et al. (2017) for ResNet and FCN. This is why the actual InceptionTime implementation uses odd kernel sizes. I should update the paper with the real kernel size values used in the implementation (the paper is currently under review, so I will make sure to update it once we have a first decision) - so thanks for pointing this out.

Transfer learning: As for this second point, we also observed that we should be careful when choosing the source dataset to transfer from, but I think there is much more potential for transfer learning here that is yet to be discovered. So far we have proposed a baseline solution based on DTW to discover the best source for a given target dataset.
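
For readers unfamiliar with DTW, here is a textbook dynamic time warping distance and a nearest-source lookup built on it. This is only a sketch of the general idea: the thread does not describe how the study summarizes each dataset, so the single representative series per dataset, the dataset names, and the helper `closest_source` are all hypothetical:

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with squared-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three allowed warping moves
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] ** 0.5

def closest_source(target, sources):
    """Name of the candidate whose representative series is nearest to the target's."""
    return min(sources, key=lambda name: dtw_distance(target, sources[name]))

target = [0, 1, 2, 1, 0]
sources = {
    "shifted_bump": [0, 0, 1, 2, 1, 0],  # same shape, delayed by one step
    "ramp": [0, 1, 2, 3, 4, 5],
}
print(closest_source(target, sources))
```

Because DTW can stretch the time axis, the delayed bump matches the target perfectly even though the two series have different lengths.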

Imaging time series: I agree that having images will allow us to re-use most of the deep models proposed for images, but I think there is also a huge potential to design our own algorithms - in fact, by having one less spatial dimension than images, we are able to explore and try out more architectures on modern hardware. For example, you would hardly consider a kernel of size (127x127) for images, but you can easily imagine a filter of size (127x1) for time series data that could fit on a single GPU!
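
The size gap between those two kernels is easy to quantify. The sketch below counts convolution weights (biases ignored) for a hypothetical layer with 32 input and 32 output channels; the channel counts are an assumption chosen only for illustration:

```python
def conv_weights(kernel_shape, in_channels, out_channels):
    """Number of weights in a convolution layer, ignoring biases."""
    n = in_channels * out_channels
    for k in kernel_shape:
        n *= k
    return n

# Hypothetical layer with 32 input and 32 output channels.
w_2d = conv_weights((127, 127), 32, 32)
w_1d = conv_weights((127,), 32, 32)
print(w_2d, w_1d, w_2d // w_1d)
```

For the same kernel extent the 2D layer needs 127 times as many weights as the 1D one, before even counting the extra compute per output position.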

LSTM-FCN: So basically we are on the same page: the research is not reproducible and this “dimension shuffle” feature does not make any sense.

UCR benchmark: I agree that the UCR archive does not represent all real-life use cases of TSC problems, which is one of the things I liked the most here: everyone coming with their own TSC problem and trying to figure out how to use SOTA algorithms. Secondly, I think the image datasets are still there for legacy reasons, but indeed they are no longer needed given current computer vision algorithms. I agree that we should move towards multivariate TSC problems, which is why the community started publishing the new MTS (Multivariate Time Series) archive that you mentioned earlier. I think we have not started using it yet because it is still very small (in my experience you can run ResNet, FCN and NN-DTW and the statistical test will tell you that they all perform the same). I believe that once we have a big enough archive the community will start focusing more on MTS problems.

Accuracy: I agree that accuracy is not the best metric for all datasets, especially for imbalanced ones. But I think that the motivation behind using it is to have one metric that is convenient for most datasets. This is why in the papers I report Accuracy (just to be able to compare with SOTA) but in my repositories you can see that I have Precision and Recall as well :wink:

4 Likes

I think so too. There are a number of common patterns that occur frequently in time series. For instance, cycles (weekly, monthly, hourly…), linear trend, exponential trend, and so forth. The most basic of these are hard-coded into the standard algorithms (i.e. split out “seasonality” and “trend”) - but I’d expect that by transfer learning from lots of real-world datasets we could have a much larger range of patterns learnt automatically. Perhaps one day the idea of “seasonality” will be as obsolete in TS as “SIFT” and “haar” features are in computer vision today… :slight_smile:

I’d guess that for any algorithm that works well on an image encoding of a time series, we can find a better one that works on the raw signal.

3 Likes

I am glad that we all agree that there is a huge potential here. But unlike computer vision, we do not have a huge amount of publicly available labeled datasets. This is why I think data acquisition is very important here, and competitions like Kaggle’s and others help with that, in addition to people making their data publicly available.

1 Like

@hfawaz have you tried running a regular (non-stochastic) GD method on UCR? The datasets are so small that you could use something like L-BFGS instead of Adam/SGD.
It may improve accuracy considerably. Is there an implementation of (non-stochastic) GD in fastai, @jeremy?
I have used pytorch’s built-in LBFGS before, so I may give it a try.

For InceptionTime we did play with the batch size, so basically with a big batch size you will be running batch gradient descent instead of mini-batch gradient descent over the small datasets. Not sure how that would affect ResNet and FCN, but I think that the original implementation of ResNet and FCN uses a formula to compute the mini-batch size which I found to be very suboptimal.

Yeah just use the ones in pytorch or scipy - they work fine.

Lol, yeah :

batch_size = min(x_train.shape[0]/10, 16)
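
To see what that formula does in practice, here is a small sketch that evaluates it for a few made-up training-set sizes. The quoted line uses `/`; reading it as integer division is an assumption of this sketch, and the function name is hypothetical:

```python
def formula_batch_size(n_train):
    """Batch size from the quoted formula, with integer division made explicit."""
    return min(n_train // 10, 16)

# Hypothetical training-set sizes.
for n in (5, 20, 100, 1000, 10000):
    print(n, formula_batch_size(n))
```

Under that reading the batch size never exceeds 16 however large the dataset is, shrinks to a couple of samples for tiny datasets, and reaches 0 when there are fewer than ten training series.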
1 Like

Hi @hfawaz,

Welcome to our study group! It’s a privilege to have a world-class Time Series researcher joining us!

I hope you’ll find the experience as useful and rewarding as I have. I can say that for me the fastai community’s been the best learning and collaborative environment I’ve found in the area of ML.

I’d really like to thank you and the rest of the team for the quality of the work you are producing and for openly sharing your code. I think you’re raising the standard of research in TS.

I also work in the area of Time Series Classification and Regression (not Forecasting), mainly with multivariate datasets.

I have a few comments on your previous post:

  • InceptionTime: I read your paper when it became public, found it super interesting, and created a pytorch version. I’ve been using it for a couple of weeks and results on my own datasets are better than with ResNet. So thanks a lot for developing it! Personally I think that the idea of using larger receptive fields goes in the right direction. I’m building a Practical Time Series repo that I’ll be able to share either today or tomorrow. It contains all that is required to train TS models with fastai, as well as a collection of some of the state-of-the-art TS architectures (FCN, ResNet, ResCNN, InceptionTime, etc). I’m currently investigating ways to improve the performance of the InceptionTime network by applying the fastai framework.
  • Imaging Time Series: I’m with you and Jeremy that the encoding of TS seems like a waste of time, since all the information is contained in the raw data. However, I’ve seen that in some datasets imaging works really well, even if the dataset is tiny, as you can benefit from computer vision transfer learning. I have tried multiple encodings (Gramian, MTF, RecurrencePlots, Wavelets, etc) with mixed results. I believe that in the end raw input models should prevail, but it’s also true that our brain is far better at identifying patterns in charts than in numerical data.
  • Recurrent models: In all comparisons I’ve made, I’ve always found CNN models far superior to RNNs, and they are much faster to train. I gave up on RNNs some time ago.
  • Regression: I’m also working in this area, but my datasets are proprietary, so I cannot share them. Sorry about that!
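
As a rough idea of what one of the encodings listed above does, here is a minimal thresholded recurrence plot. It is a simplification: recurrence plots are usually built from delay-embedded vectors rather than raw scalar values, and the signal and the threshold `eps` below are made up:

```python
def recurrence_plot(series, eps):
    """Binary recurrence matrix: R[i][j] = 1 when |x_i - x_j| <= eps."""
    return [[1 if abs(xi - xj) <= eps else 0 for xj in series] for xi in series]

series = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]  # made-up periodic signal
image = recurrence_plot(series, eps=0.1)
for row in image:
    print("".join("#" if v else "." for v in row))
```

A series of length n becomes an n x n image, which is what lets a 2D vision model be applied to it and also why the encoding grows quadratically with the series length.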

Just to give you an idea, here are a few things I’m currently testing on multivariate TS (everything using fastai):

  • Impact of LSUV (and related) initialization
  • New optimizers (like Ranger, developed by some great fastai colleagues - thread)
  • New activation function (also developed by some great fastai colleagues - thread)
  • Data augmentation: cutout, mixup, cutmix,…
  • Semi-supervised learning: mixmatch, uda, s4l
  • Training: progressive resizing
  • Ensembles vs multi-branch models vs hybrids
  • New hybrid Time-Frequency models
  • Inception architecture tweaks: ’bag of tricks’
  • Visualization of activations

I’ll post any significant insights I get during my experiments.

I’m more than happy to discuss any of this with anybody who’s interested. I’ll also create notebooks to demonstrate this functionality.

3 Likes

@oguiza I am also very glad to be here, thanks for taking this great initiative and creating this study group!
I find it great to be able to discuss with everyone interested in such an important topic.
I will be eagerly waiting for your results and implementation of InceptionTime in fastai.

As for imaging time series, I think that for some datasets (and maybe most of them) adding domain knowledge to the design of an architecture is going to help improve the accuracy. That is what happens for datasets where imaging (a frequency-domain representation, for example) acts as a kind of domain knowledge that improves the accuracy.

I am also working on multivariate, semi-supervised, data augmentation, ensembling and some architecture tweaks. I will keep everyone up-to-date once I have something concrete to show.

Thanks again for all of this!

1 Like

@oguiza I implemented the Inception module today, it looks like this:

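import torch
from torch import nn
# `noop` comes from fastai; `conv` is a helper not shown in this post (the summary printed
# further down suggests a bias-free nn.Conv1d with padding=ks//2).

class InceptionModule(nn.Module):
    def __init__(self, ni, use_bottleneck=True, kss=[41, 21, 11], bottleneck_size=32, nb_filters=32, stride=1):
        super().__init__()
        if use_bottleneck:
            self.conv0 = nn.Conv1d(ni, bottleneck_size, 1, bias=False)
        else:
            self.conv0 = noop
        # Without the bottleneck, the branches receive the raw ni channels, not bottleneck_size.
        nc = bottleneck_size if use_bottleneck else ni
        # Three parallel convolutions with different kernel sizes.
        self.conv1 = conv(nc, nb_filters, kss[0])
        self.conv2 = conv(nc, nb_filters, kss[1])
        self.conv3 = conv(nc, nb_filters, kss[2])
        # Fourth branch: max pooling followed by a 1x1 convolution.
        self.conv_bottle = nn.Sequential(nn.MaxPool1d(3, stride, padding=1),
                                         nn.Conv1d(nc, nb_filters, 1, bias=False))
        self.bn_relu = nn.Sequential(nn.BatchNorm1d(4*nb_filters),
                                     nn.ReLU())
    def forward(self, x):
        x = self.conv0(x)
        # Concatenate the four branches along the channel dimension.
        return self.bn_relu(torch.cat([self.conv1(x), self.conv2(x), self.conv3(x), self.conv_bottle(x)], dim=1))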

and to create the network:

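def create_inception(ni, nout, kss=[41, 21, 11], stride=1, depth=6, bottleneck_size=32, nb_filters=32, head=True):
    # First module: no bottleneck, since it reads the raw ni-channel series.
    layers = [InceptionModule(ni, kss=kss, use_bottleneck=False, nb_filters=nb_filters, stride=stride), MergeLayer(), nn.ReLU()]
    # Build each remaining module separately: `(depth-1)*[...]` would repeat the same
    # module objects, so all of those layers would share one set of weights.
    for _ in range(depth-1):
        layers += [InceptionModule(4*nb_filters, kss=kss, bottleneck_size=bottleneck_size, nb_filters=nb_filters, stride=stride), MergeLayer(), nn.ReLU()]
    head = [AdaptiveConcatPool1d(), Flatten(), nn.Linear(8*nb_filters, nout)] if head else []
    return SequentialEx(*layers, *head)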

I think it can be simplified a bit. @hfawaz, can you check if it is correct? In my initial tests it is not training that well: the 40 epochs that are enough for ResNet barely do anything for InceptionTime, so I probably have a bug somewhere.

Nice that was fast!
Not quite sure, is there an output of model.summary() similar to keras ?

ni=1, bottleneck=32, nb_filters=32

InceptionModule(
  (conv1): Conv1d(1, 32, kernel_size=(41,), stride=(1,), padding=(20,), bias=False)
  (conv2): Conv1d(1, 32, kernel_size=(21,), stride=(1,), padding=(10,), bias=False)
  (conv3): Conv1d(1, 32, kernel_size=(11,), stride=(1,), padding=(5,), bias=False)
  (conv_bottle): Sequential(
    (0): MaxPool1d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
    (1): Conv1d(1, 32, kernel_size=(1,), stride=(1,), bias=False)
  )
  (bn_relu): Sequential(
    (0): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): ReLU()
  )
)

This would be the first layer for reading a 1-channel TS. The problem with this display method is that you don’t see that the outputs of the 3 convs and the conv_bottle are concatenated; you can only guess it from the BatchNorm1d(128) layer that comes afterwards.

I guess here you are applying a bottleneck operation for the first layer. You can see here that I skip it for the first layer explicitly.

Thanks, I will change that. Would you mind checking here if I got it right?