Time series/ sequential data study group

One question from a complete noob in this field. Is it normal to use a row-wise representation of the time series when stored as data frames? I see that it is the format used in both the repositories of @oguiza and @tcapelle, but as far as I know, data frames are optimized to work column-wise.

Best!

Good question!
I’ve seen multiple variations, so you usually need to manipulate the df to get the expected input format. To use the functionality I’ve shared, you need samples in rows, a column indicating the feature (in the case of multivariate ts; None for univariate), and time steps in columns. If we come across other use cases, I may need to update the code.
As to the optimization, it’s difficult to know which layout would be better: sometimes you have more samples than time steps, and sometimes vice versa.
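
For illustration, here’s a minimal pandas sketch of reshaping a long-format df into that layout (the column names are hypothetical, not the ones required by the code):

```python
import pandas as pd

# Hypothetical long-format data: one row per (sample, feature, time step)
long_df = pd.DataFrame({
    "sample":  [0, 0, 0, 0, 1, 1, 1, 1],
    "feature": ["f1", "f1", "f2", "f2", "f1", "f1", "f2", "f2"],
    "time":    [0, 1, 0, 1, 0, 1, 0, 1],
    "value":   [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})

# Pivot so each row is one (sample, feature) pair and time steps become columns
wide_df = (long_df
           .pivot_table(index=["sample", "feature"], columns="time", values="value")
           .reset_index()
           .sort_values("sample"))   # keep rows of the same sample together
print(wide_df)
```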

Thank you for your answer! Regarding your implementation:

  • Must the “feature” and “target” columns also be numerical, or can they contain the actual names of the features/targets?
  • How are the different features of a single subject related to each other in the final data frame? Is it implicit in the order of the rows?

More great questions Victor! :grinning:

  • I think you can use anything you want as features or target. What is important is to indicate whether the target should be handled as a category or a float. If you try it and it doesn’t work as expected, please let me know.

  • Yes, it is implicit in the order of the rows. I will add this to the notebook to make it clear. You need to sort the data frame rows by sample; otherwise data from different samples will be mixed. Thanks for raising this!

So I had a look at the mixup data augmentation technique. I believe it is a special case of a weighted data augmentation technique that we proposed previously but hadn’t had much success with. Maybe you guys can make it work.
Basically, the method computes the weighted average of a set of time series and considers this weighted average as a new time series (to augment the training set).
The average is computed in the DTW space instead of the Euclidean one.
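
For illustration, here is a rough sketch of that idea using tslearn’s dtw_barycenter_averaging (just one possible way to compute a weighted average in DTW space; not the exact code from the papers):

```python
import numpy as np
from tslearn.barycenters import dtw_barycenter_averaging

# Hypothetical set of univariate series from the same class: shape (n_ts, seq_len)
X = np.random.randn(5, 100)

# Random weights summing to 1; the synthetic sample is their weighted average
w = np.random.dirichlet(np.ones(len(X)))

# New (augmented) series, averaged in DTW space rather than Euclidean space
new_ts = dtw_barycenter_averaging(X, weights=w)
```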

Here are the relevant papers, this is the original method and this one shows its use with ResNet.

What do you think, is it similar to mixup?

1 Like

I think that @oguiza can explain it better, but the current implementation is pretty straightforward.
At the batch level, it mixes the last input with the current one:

new_input = last_input * lambd + current_input * (1 - lambd)
new_target = last_target * lambd + current_target * (1 - lambd)

where lambd follows a Beta distribution. @sgugger explains it in detail here.
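
For illustration, a generic PyTorch sketch of this kind of batch-level mixing (here mixing a batch with a shuffled copy of itself rather than with the previous batch, and assuming float/one-hot targets; not the fastai callback itself):

```python
import torch

def mixup_batch(x, y, alpha=0.4):
    """Mix a batch of time series (bs, n_features, seq_len) with a shuffled copy of itself."""
    lambd = torch.distributions.Beta(alpha, alpha).sample()
    lambd = torch.max(lambd, 1 - lambd)      # keep the original sample dominant (50-100%)
    perm = torch.randperm(x.size(0))
    new_input = lambd * x + (1 - lambd) * x[perm]
    new_target = lambd * y + (1 - lambd) * y[perm]
    return new_input, new_target
```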

1 Like

Thanks for looking at this.

I’ve read both articles (I think I’ve read all your papers, and many more… :grinning:), created some code and experimented with it, but didn’t get good results.

I think there are some similarities, but also differences.

Similarities:

  • Both are data augmentation techniques
  • New samples are created by combining original samples in the dataset

Differences:

  • Mixup combines the current ts with another randomly selected ts.
  • This other ts can be of any class.
  • The % in which they are mixed is randomly selected, between 0-50%.
  • The newly created sample will then have a % of the original ts (between 50-100%) and a % of the 2nd one (between 0-50%).
  • The loss is calculated as the weighted average of the losses of the newly created ts with each of the two labels.

What I’ve done is just adapt the original mixup, cutout and cutmix algos to time series, and in my experience they all work very well, as they do in image classification BTW.
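
For illustration, a minimal sketch of a 1D cutout adaptation (hypothetical function, not the exact code from the notebook):

```python
import torch

def cutout_1d(x, max_len=20):
    """Zero out a random contiguous window along the time axis of each sample.

    x: tensor of shape (bs, n_features, seq_len).
    """
    bs, _, seq_len = x.shape
    for i in range(bs):
        win = torch.randint(1, max_len + 1, (1,)).item()
        start = torch.randint(0, seq_len - win + 1, (1,)).item()
        x[i, :, start:start + win] = 0
    return x
```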

I’m working on a notebook to explain how you can apply these data augmentation techniques to 1D data. It’s very simple, and they almost always improve performance.

Please, let me know if you need any more details.

1 Like

Okay I will be waiting for your notebook then :wink: Thanks

Hi again!

Another noob question. How do these algorithms generally deal with missing values? Are there missing values in the UCR archive?

I have been playing this morning with a 1D implementation of Res2Net from here. But the more I play with UCR, the more I dislike this benchmark. It has so few training samples for some tasks that I am not sure we can create a model that performs well. Here is how I see it:

  • The more you train, the more your training loss decreases, but at some point you start getting worse results on the test set. For instance, for SmallKitchenAppliances, which has 375 samples, this happens after only 40 epochs:
    [Train_Val loss plot]
  • OliveOil is a completely different story, with only 30 miserable samples:
    [OliveOil loss plot]

I am using mixup all the time now to augment our little data. The results of the res2net50 for our benchmark tasks are (100 epochs, lr=1e-3, FlatCos anneal):
| Dataset                      |   res2net |
|:-----------------------------|----------:|
| Wine                         |  0.979167 |
| BeetleFly                    |  0.85     |
| InlineSkate                  |  0.429091 |
| MiddlePhalanxTW              |  0.61039  |
| OliveOil                     |  0.9      |
| SmallKitchenAppliances       |  0.792    |
| WordSynonyms                 |  0.587774 |
| MiddlePhalanxOutlineAgeGroup |  0.62987  |
| MoteStrain                   |  0.829872 |
| Phoneme                      |  0.241561 |
| Herring                      |  0.671875 |
| ScreenType                   |  0.613333 |
| ChlorineConcentration        |  0.853125 |
1 Like

Indeed, it is a very hard problem; it is not straightforward to obtain labeled data for time series classification. That’s why deep learning has only just started being explored for this task.

We need an ImageNet; this UCR dataset is bad and small.

1 Like

I would use the words “hard” and “challenging” instead of “bad” :wink:
The people who gathered and prepared the archive put in a huge effort, which led to the advancement of time series classification algorithms.

1 Like

I agree that it’s very frustrating sometimes! :grinning:
But I have to say that it is no different from other real life datasets.
The one I use only has around 1000 samples, and I can tell you it’s equally frustrating!!
The good thing is that when you deal with very challenging datasets, you end up trying so many things that you learn a lot.
And there are really small datasets (like OliveOil) where you can get very high accuracy (96%+ with some models I’ve used) with only 30 samples, leveraging image encoding and transfer learning.
I’m still convinced we can beat HIVE-COTE and TS-CHIEF using this dataset!

1 Like

I am implementing exactly your metrics @oguiza to be able to compare accurately.

1 Like

I’ve also been testing Res2Net 1D, and similar large vision-like models in the past.

One of the key learnings I got is that those models tend to overfit a lot and don’t work so well.

After some thought I think I understand what the issue might be.
If you think about it, a 224x224x3 image has around 150k values. ResNet50 has around 25M parameters.

We are dealing with TS that tend to be small in size. It’s pretty usual to have TS data with fewer than 1k datapoints in total (nb feats x seq_len), so about 1% of a normal image.

This is the nb parameters for some of the 1D models we usually use with TS:

  • FCN 276k
  • InceptionTime 389k
  • ResNet 484k

But the versions I built of Res2Net are pretty big in comparison:

  • Res2Net34 5.8M
  • XRes2Net34 8.1M
  • Res2Net50 19.9M

So these may work well when you have lots of data, which I haven’t been able to try because I don’t have any.
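
For reference, a generic way to get the number of trainable parameters of any PyTorch model (a small sketch; the model name below is hypothetical):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters in a PyTorch model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# count_parameters(res2net50_1d)  # hypothetical model instance
```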

What do you think?

Ok.
I’m a bit confused when you say you are applying mixup. I thought you said yesterday that probably fastai’s mixup didn’t work. What are you using then?

I am using fastai Mixup :relaxed: It works, but I’m not yet sure if it performs well enough.
Here are the results for Res2Net50:

| Dataset                      |   epochs |        loss |   val_loss |   accuracy |   accuracy_ts |   max_accuracy |   time (s) |
|:-----------------------------|---------:|------------:|-----------:|-----------:|--------------:|---------------:|-----------:|
| Wine                         |      100 | 0.0472185   |   0.836495 |   0.75     |      0.75     |       0.895833 |         52 |
| BeetleFly                    |      100 | 0.000209084 |   1.11402  |   0.8      |      0.85     |       0.9      |         61 |
| InlineSkate                  |      100 | 0.0516765   |   2.72746  |   0.42     |      0.42     |       0.423636 |         98 |
| MiddlePhalanxTW              |      100 | 0.000941262 |   2.88568  |   0.532468 |      0.551948 |       0.571429 |         97 |
| OliveOil                     |      100 | 0.292994    |   0.38972  |   0.9      |      0.9      |       0.9      |         68 |
| SmallKitchenAppliances       |      100 | 0.00445449  |   1.60318  |   0.741333 |      0.741333 |       0.778667 |        105 |
| WordSynonyms                 |      100 | 0.00383701  |   3.29962  |   0.534483 |      0.532915 |       0.543887 |         84 |
| MiddlePhalanxOutlineAgeGroup |      100 | 0.00181109  |   3.13823  |   0.493506 |      0.493506 |       0.623377 |         92 |
| MoteStrain                   |      100 | 0.00239903  |   1.1964   |   0.744409 |      0.744409 |       0.794728 |        694 |
| Phoneme                      |      100 | 0.00118895  |   5.0102   |   0.224156 |      0.224156 |       0.244198 |        145 |
| Herring                      |      100 | 0.00332776  |   2.34432  |   0.546875 |      0.546875 |       0.65625  |         66 |
| ScreenType                   |      100 | 0.000509472 |   2.89669  |   0.482667 |      0.482667 |       0.581333 |         96 |
| ChlorineConcentration        |      100 | 0.00138494  |   0.806829 |   0.851823 |      0.851823 |       0.854427 |        196 |

How can I compute the number of params?

1 Like

No, there are no missing values. I guess you could delete some randomly if you wanted to try it.

I’ve never dealt with missing values, to be honest with you. I don’t know if the models would even work with them.
Usually you would replace those missing values with a constant, an average, a median, etc.
Sorry I can’t help more.

I’d think you’d want an external preprocessing step on your data frame to handle this. There are a few different methods, but I usually impute with the average, and here I’d compute it over the particular series instance (row). That’s how I’d go about missing values in this case :slight_smile:
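
For illustration, a small pandas sketch of that kind of row-wise (per-series) mean imputation, assuming one time series per row with time steps as columns:

```python
import numpy as np
import pandas as pd

# Hypothetical df: one time series per row, time steps as columns
df = pd.DataFrame(np.random.randn(3, 10))
df.iloc[0, 4] = np.nan                      # simulate a missing value

# Replace each missing value with the mean of its own series (row)
df_filled = df.apply(lambda row: row.fillna(row.mean()), axis=1)
```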

1 Like