Time series / sequential data study group

Well, I think the goal is to improve the performance of InceptionTime, both in terms of accuracy and time.
I don’t think we know at this stage what the best number of epochs will be. I see a huge discrepancy between the 1500 epochs used by Fawaz and the 40 epochs you are using. So I don’t think it makes sense to fix that yet.
Also, in my experience, when you use any data augmentation (mixup, cutmix, etc.), you need to increase the number of epochs to improve the results.

What I think is important is to agree on a fastai baseline implementation of InceptionTime, and on how accuracy is measured (we should all use accuracy based on the lowest train loss, as Fawaz explained before). Once we have that, it’d be good to experiment with different training approaches (one_cycle, flat_annealing), optimizers, etc. until we achieve our goal. Hopefully we can beat HIVE-COTE and TS-CHIEF, and establish a new SOTA in TSC. We could then work to move that to fastai-v2.
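For reference, that evaluation rule can be sketched like this (the history dict and its keys are placeholders, assuming per-epoch train loss and test accuracy have been logged):

import numpy as np

# report the test accuracy at the epoch with the lowest training loss
train_loss = np.array(history['train_loss'])    # hypothetical per-epoch logs
test_acc   = np.array(history['test_accuracy'])
best_epoch = int(train_loss.argmin())
reported_accuracy = float(test_acc[best_epoch])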

Yes. The function I have created downloads the arff data.

Why do you say that?

  • Removed unused code
  • SequentialEx is the way to go; Jeremy makes extensive use of it in V2.
  • In the end I am not using the Cat class =(
  • I am using only one bottleneck; I changed the code to make it more explicit:
    def forward(self, x):
        # the bottleneck is computed once and shared by all the parallel conv branches
        bottled = self.bottleneck(x)
        # concatenate all the branches along the channel dimension
        return self.bn_relu(torch.cat([c(bottled) for c in self.convs]+[self.conv_bottle(x)], dim=1))

Hi!

I would like to add an auxiliary input to my TSC problem. I mean, apart from the main input (a multivariate time series), I have some extra data associated with each case. How should I proceed to implement that? Should I concatenate the auxiliary input to the fully connected layer at the end of the model?

Many thanks for your work!!!

What type of input? A real value, another time series?
Probably you will need to concatenate it with the feature map (before the last Linear layer), and add a custom head with some ReLU’s to be able to integrate this value into the output.
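Something along these lines (just a minimal sketch, not code from the repo; the class name, layer sizes and argument names are placeholders):

import torch
import torch.nn as nn

class TSWithAuxHead(nn.Module):
    # hypothetical model: a 1D conv body produces a feature map, whose pooled features
    # are concatenated with the auxiliary vector before the final Linear layers
    def __init__(self, body, n_features, n_aux, n_classes):
        super().__init__()
        self.body = body                        # any backbone: (B, C, L) -> (B, n_features, L')
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.head = nn.Sequential(
            nn.Linear(n_features + n_aux, 64),  # placeholder hidden size
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x_ts, x_aux):
        feats = self.pool(self.body(x_ts)).squeeze(-1)               # (B, n_features)
        return self.head(torch.cat([feats, x_aux.float()], dim=1))   # concat before the last layers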

Thanks for your prompt reply @tcapelle! The auxiliary input is a set of numeric values, more specifically a length-5 integer vector.

You have two options:

  • You train 2 models and then ensemble them; you could use XGBoost for the final ensemble
  • You train 1 model that takes 2 inputs

The first approach is simpler and normally gets better results.
The second approach is full DL and probably more fun. Look at this post where I built a double-input Dataset that takes a TS and a vector. The tabular part would need embeddings to handle categorical variables (your ints), but you can get that from the fastai tabular model.
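For illustration, a minimal double-input Dataset could look roughly like this (class and variable names are made up, not the ones from that post):

import torch
from torch.utils.data import Dataset

class TSPlusVectorDataset(Dataset):
    # hypothetical: each item is ((time series, auxiliary int vector), label)
    def __init__(self, X_ts, X_aux, y):
        self.X_ts, self.X_aux, self.y = X_ts, X_aux, y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, i):
        ts  = torch.as_tensor(self.X_ts[i],  dtype=torch.float32)  # (channels, seq_len)
        aux = torch.as_tensor(self.X_aux[i], dtype=torch.long)     # 5 ints, ready for embeddings
        return (ts, aux), torch.as_tensor(self.y[i])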


:ok_hand:
You may want to remove class Cat since you are not using it (just to reduce clutter).

Thanks for explaining the bottleneck question. I now understand. There’s no need to change this; your previous code was more concise.

def forward(self, x):
    bottled = self.bottleneck(x)
    return self.bn_relu(torch.cat([c(bottled) for c in self.convs]+[self.conv_bottle(x)], dim=1))

There are still a couple of differences between both implementations:

  • I see that in your convs bias is set to False. Even if I agree that it’s probably better to set it to False, I don’t think it’s the original design. I’d suggest we keep it as True, and then test whether it works better with False.

  • The other difference is in the model’s head. You are using AdaptiveConcatPool1d and then doubling the size of the Linear layer, while the original model uses GlobalAveragePooling1D in Keras. I think the equivalent would be AdaptiveAvgPool1d (see the sketch below). Again, this is something that would be worth testing once we have established the baseline metrics.
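Roughly, the difference between the two heads (nf and c_out are placeholder names for the number of feature maps and classes; AdaptiveConcatPool1d is the concat-pooling layer from the repo, so it is only shown commented out):

import torch.nn as nn

nf, c_out = 128, 5  # placeholder sizes

# Keras GlobalAveragePooling1D is roughly PyTorch AdaptiveAvgPool1d(1)
avg_head = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(nf, c_out))

# AdaptiveConcatPool1d concatenates average- and max-pooled features,
# so the following Linear layer needs twice as many inputs:
# concat_head = nn.Sequential(AdaptiveConcatPool1d(1), nn.Flatten(), nn.Linear(2 * nf, c_out))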

I’m just suggesting these changes to ensure we have a version that replicates the original one as closely as possible, but I don’t know whether you agree with this approach or not. Either way it’s fine :slightly_smiling_face:, it’s just good to understand the differences.

Hey,

  • Using the auxiliary variable bottled saves 2 operations as the bottleneck is just computed once.
  • The original implementation has bias=False
  • AdaptiveConcatPool1d couldn’t hurt =P

Thank you @tcapelle! I would like to be able to interpret the results of the classifier afterwards, so I’ll check which of the two approaches could provide a better framework for explainability… but I agree, full DL looks like more fun! :wink:

I think you are right, it should be changed to handle the MTS data in a better way. That was my intent, but I did not pay attention to the multivariate case as it is not the focus of the paper.

Ok, makes sense.
Just another question, sorry. Are there clear SOTA multivariate models based on the 30 new datasets?

For now you only have the NN-DTW results.

https://www.cs.ucr.edu/~eamonn/time_series_data_2018/


One question for a complete noob in this field. Is it normal to use a row-wise representation of the time series, when stored as data frames? I see that it is the format used in both the repositories of @oguiza and @tcapelle, but as far as I know, data frames are optimized to work column-wise.

Best!

Good question!
I’ve seen multiple variations. You usually need to manipulate the df to get the expected input format. To use the functionality I’ve shared, you need to have samples in rows, a column for the feature (in the case of a multivariate ts, or None for univariate), and time steps in columns; but if we see other uses I may need to update the code.
As to the optimization, it’s difficult to know what would be better. Sometimes you have more samples than time steps or vice versa.
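As a toy illustration of that layout (column names are made up; here each sample contributes one row per feature/channel):

import pandas as pd

# hypothetical multivariate example: 2 samples x 2 features, 3 time steps each
df = pd.DataFrame({
    'sample':  [0, 0, 1, 1],
    'feature': ['ch1', 'ch2', 'ch1', 'ch2'],  # would be None/omitted for univariate series
    'target':  ['a', 'a', 'b', 'b'],
    '0': [0.1, 1.2, 0.3, 1.1],
    '1': [0.2, 1.0, 0.4, 1.3],
    '2': [0.3, 0.9, 0.5, 1.2],
})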

Thank you for your answer! Regarding your implementation:

  • Must the “feature” and “target” columns also be numerical, or can they contain the actual names of the features/targets?
  • How are the different features of a single subject related to each other in the final data frame? Is it implicit in the order of the rows?

More great questions Victor! :grinning:

  • I think you can use anything you want as features or target. What is important is to indicate whether the target should be handled as a category or a float. If you try it and it doesn’t work as expected, please let me know.

  • Yes, it is implicit in the order of the rows. I will add this to the notebook to make it clear. You need to sort the data frame rows by sample, otherwise data from different samples will be mixed (see the one-liner below). Thanks for raising this!
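Something like this, assuming the column names from the toy example above:

# keep all rows belonging to a sample contiguous so its features stay grouped together
df = df.sort_values(['sample', 'feature']).reset_index(drop=True)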

So I had a look at the mixup data augmentation technique; I believe it is a special case of a weighted data augmentation technique that we proposed previously but hadn’t had much success with. Maybe you guys can make it work.
Basically the method computes the weighted average of a set of time series and considers this weighted average as the new time series (to augment the training set).
The average is computed in the DTW space instead of the Euclidean one.

Here are the relevant papers: this is the original method and this one shows its use with ResNet.

What do you think, is it similar to mixup?


I think that @oguiza can explain it better; the current implementation is pretty straightforward.
At the batch level, it mixes the last batch’s input with the current one:

new_input = last_input * lambd + current_input * (1-lambd)
new_target = last_target * lambd + current_target * (1-lambd)

where lambd follows a Beta distribution. @sgugger explains it in detail here.


Thanks for looking at this.

I’ve read both articles (I think I’ve read all your papers, and many more… :grinning:), created some code and experimented with it, but didn’t get good results.

I think there are some similarities, but also differences.

Similarities:

  • Both are data augmentation techniques
  • New samples are created by combining original samples in the dataset

Differences:

  • Mixup combines the current ts with another randomly selected ts.
  • This ts can be of any class.
  • The % in which they are mixed is randomly selected, between 0 and 50%.
  • The newly created sample will then have a % of the original ts (between 50 and 100%) and a % of the second one (between 0 and 50%).
  • The loss is calculated as the weighted average of the losses of the newly created ts against each of the two labels.

What I’ve done is just adapt the original mixup, cutout and cutmix algos to time series, and in my experience they all work very well, as they do in image classification BTW.

I’m working on a notebook to try to explain how you can apply these data augmentation techniques to 1D data. It’s very simple, and they almost always improve performance.
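In the meantime, here is a minimal sketch of what mixup looks like for a batch of time series (just an illustration of the idea above, not the notebook code):

import torch

def mixup_batch(x, y, alpha=0.4):
    # x: (batch, channels, seq_len) time series, y: labels
    # lambd ~ Beta(alpha, alpha), clamped so the original sample keeps 50-100% of the mix
    lambd = torch.distributions.Beta(alpha, alpha).sample()
    lambd = torch.max(lambd, 1 - lambd)
    perm = torch.randperm(x.size(0))             # partner samples, which can be of any class
    x_mixed = x * lambd + x[perm] * (1 - lambd)
    # the loss is then the weighted average of the losses against both sets of labels:
    #   loss = lambd * criterion(pred, y) + (1 - lambd) * criterion(pred, y[perm])
    return x_mixed, y, y[perm], lambd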

Please, let me know if you need any more details.


Okay I will be waiting for your notebook then :wink: Thanks