Time series/ sequential data study group

totally agree, but that means we also have to leave out the earthquakes that “started it all” here. (I will still be comparing with that dataset for myself…)

Proposed datasets

Here’s a list of 5 datasets that meet all of these criteria:

  1. Available in 2015 (with correct splits)
  2. Best result & average accuracy (all algos) between 80-95% (that is high, but there’s some room for improvement)
  3. Difference between best and worst result >20% (this is to hopefully ensure that baseline with all 1s is not best result). More separation among algos will allow us to better discriminate which new algorithms work better.
  4. Train size >= Test size
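As a sanity check, criteria 2–4 can be applied mechanically to the results spreadsheet. A small pandas sketch with invented numbers (these are NOT the real archive results, just an illustration of the filter):

```python
import pandas as pd

# Invented numbers standing in for the UCR results spreadsheet
# (illustrative only, not the real archive figures).
results = pd.DataFrame({
    "dataset":    ["ECG200", "GunPoint", "Yoga", "ElectricDevices"],
    "best_acc":   [0.89, 0.93, 0.90, 0.799],
    "worst_acc":  [0.84, 0.70, 0.55, 0.55],
    "mean_acc":   [0.87, 0.85, 0.80, 0.70],
    "train_size": [100, 150, 300, 8926],
    "test_size":  [100, 150, 3000, 7711],
})

# Criterion 1 (available in 2015) is a lookup, not a computation, so it
# is skipped here; criteria 2-4 become boolean masks:
mask = (
    results["best_acc"].between(0.80, 0.95)                # 2: best in 80-95%
    & results["mean_acc"].between(0.80, 0.95)              # 2: average too
    & (results["best_acc"] - results["worst_acc"] > 0.20)  # 3: >20% spread
    & (results["train_size"] >= results["test_size"])      # 4: train >= test
)
shortlist = results.loc[mask, "dataset"].tolist()
print(shortlist)
```

With these toy numbers only the second row survives all three masks, which is exactly the kind of separation criterion 3 is after.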


@sam2 ECG200 and ECG5000 meet all criteria except #3, which means that all algos have similar performance.

Shall we vote and decide which one replaces Earthquakes?
We could use the rest (and others) in this group for comparisons.
What do you think?

:+1: Agreed

I’m not opposed to the UCR datasets, but wanted to throw out the idea of using the Rossmann dataset for a couple of reasons:

  1. The data has largely been prepared in the fast.ai notebook which should free the participants to focus on trying different deep learning methods and transformations rather than data munging
  2. The competition was 3 years ago at this point so hopefully, with recent developments, there is room for improvement
  3. The competition and the fast.ai notebook should provide a great baseline
  4. Given the dataset has already been modeled in fastai, it could be a good opportunity to get started “simply” but then strip away layers of abstraction and go deeper into the library to create custom architectures

Obviously it falls more into the ballpark of time series forecasting than classification, so perhaps this could follow up the UCR datasets.


Maybe we could add ECG5000, as @sam2 suggested? And I would like to propose the ElectricDevices dataset, which is the one with the most examples overall (though scores are pretty low across all algorithms; BOSS is best with 79.9% acc.).

And one maybe stupid question, but are we sure the ones marked in red in this excel file are the wrong ones? They are the only ones that make sense to me; all the others seem to have train/test switched. I mean, what is the sense in having 67 training examples but 1029 in test (ItalyPowerDemand), or 300 train and 3000 test (Yoga)? Either the table header is wrong, or, if the red ones were considered wrong in 2015 and have been corrected, then ALL datasets would now have train/test reversed? @henripal, can you comment?
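One way to settle the train/test question is to just count rows in each split file. A tiny sketch, assuming the classic UCR text format (class label in the first column, series values after it); the in-memory string stands in for a real `*_TRAIN` file, and the delimiter may be a tab rather than a comma depending on the archive version:

```python
import io
import numpy as np

def load_ucr_split(path_or_buf, delimiter=","):
    """Read one split of a classic UCR-format file: the first column is
    the class label, the remaining columns are the series values."""
    data = np.loadtxt(path_or_buf, delimiter=delimiter)
    return data[:, 0].astype(int), data[:, 1:]

# In-memory stand-in for a real *_TRAIN file: 2 examples of length 3
fake_train = io.StringIO("1,0.1,0.2,0.3\n2,0.4,0.5,0.6\n")
y, X = load_ucr_split(fake_train)
print(len(y), X.shape)  # rows per split is the number to compare
```

Running the same function on the actual `_TRAIN` and `_TEST` files would show directly which split is bigger.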

I think the Rossmann dataset is covered in class and in the forums anyway, but of course there could be more experimenting on it. And while it is of course time related, it is not the kind of univariate time series we have discussed in this thread. It is more a regression problem of taking a lot of columnar information in and then predicting “the missing column”, so it can also be seen as time independent. In the UCR datasets it is always only the time, and therefore the sequence of values, that is the key to the problem, not how well other features can be related to some output at some point in time. And the UCR problems are classification problems, whereas Rossmann wants a regression value for the total sales of a store as a result.

What would be kind of interesting, though, is transforming the data into the time series type of problem (meaning you would, for example, get one time series per store location and item group). Then it would fit the kind of time series discussed here so far (although it still would not be a classification problem).

Has anyone seen these approaches combined anywhere by chance (a tabular one-step “forecast” using many features, as in Rossmann, combined with a time series over time, here sales per product group/store over time)? Because that would be an interesting achievement and maybe an answer to some multivariate problems?!
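For the transformation idea above, a pandas `pivot` is enough to turn Rossmann-style rows (one record per store per day) into one series per store. The numbers here are made up for illustration:

```python
import pandas as pd

# Hypothetical Rossmann-style rows: one record per store per day
sales = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Date":  pd.to_datetime(["2015-07-01", "2015-07-02", "2015-07-03"] * 2),
    "Sales": [5263, 5020, 4782, 6064, 6380, 5671],
})

# One row per store, one column per date -> one univariate series per store
series_per_store = sales.pivot(index="Store", columns="Date", values="Sales")
print(series_per_store.shape)  # (2 stores, 3 time steps)
```

Grouping by (store, item group) instead of just store would give the multivariate version discussed above.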

Totally fair points regarding the comparison to univariate time series and the classification vs regression distinction, but that’s why I think it would be interesting. It’s very different from the other examples and already has a strong baseline.

Here is an interesting multivariate (13 variables) time series paper that handles the problem of missing and irregular time series data in a multi-label (128 labels) classification setting:


Ok, no problem. I’ll add both to the list. We can use them for ‘internal’ use within the study group thread.
Actually I’d like to add some UCR multivariate datasets to the list, so that we can test some of the models we are building. I’m working on a UCR multivariate list now.

We still need to decide though which dataset is used in the 2nd learning competition. Any preference?

Guys,

I’m super intrigued by what you are doing here. I’m a bit busy with life atm but I hope to get up to speed soon. Always wanted to get into TS stuff…

Any advice on how to catch up / where you are headed with this? Any obvious place to start?

C

UEA & UCR Time Series Classification multivariate datasets

Hi group,

The multivariate TSC archive has just been launched with 30 datasets.
These datasets were introduced in a publication presented on Oct 31st, 2018. The paper is interesting because you can easily visualize the time series.

The only algorithms that have been run on these datasets (in the same paper) are nearest neighbors with Euclidean distance or DTW, which are no longer considered state of the art.
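For context, the 1-NN Euclidean baseline from the paper takes only a few lines to reproduce on raw series; a toy sketch with invented data (the DTW variant needs an elastic distance instead, which is more involved):

```python
import numpy as np

def nn1_euclidean(X_train, y_train, X_test):
    """1-nearest-neighbour classification with Euclidean distance on
    the raw series values."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)  # distance to every train series
        preds.append(y_train[np.argmin(dists)])      # label of the closest one
    return np.array(preds)

# Toy series (rows) from two classes, invented for illustration
X_train = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
y_train = np.array([0, 1])
X_test  = np.array([[0.1, 0.0, 0.1], [0.9, 1.1, 1.0]])
print(nn1_euclidean(X_train, y_train, X_test))  # [0 1]
```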

I’ve looked through the datasets and there are many where accuracy is 70-95%, so there’s still room to improve.
In my opinion, these are the most interesting multivariate datasets:

@marcmuc LSST is used in Kaggle’s PLAsTiCC
@sam2 there are 2 ECG datasets, although they are supposed to be difficult (SOTA approx. 0.3 with 3 classes, and small train & test datasets)

We have talked about some univariate datasets identified:
Note: * on ECG200 and ECG5000 all algos perform almost equally, and the train datasets are small

So we need to see how we want to organize ourselves.
Question:
Do we launch a new univariate or multivariate TS learning competition? If so, do you have any preference for a dataset? The advantage of univariate is that we have benchmarks to compare our algos against.

My view is we could launch a multivariate TS competition, and use the identified univariate datasets to benchmark algos if NN architecture allows it.

What do you think?


Hi @cwerner,
It’s good that you joined our group! :grinning:

I guess it all depends on your prior experience with TS, as well as your preferred approach (traditional, deep learning or both).
There are 2 ebooks on time series written by Jason Brownlee that I bought and found very useful (I have nothing to do with the writer!):

His website also contains lots of information on ML/ DL and TS in particular.

As to where we are going with this, there are some proposed goals here, but it really depends on what the group decides to do. It’s totally open.
So if you have any ideas, comments, feedback, etc please post them!


Hi, yeah, these are strange but they’re definitely right - you can check out the train and test splits on the official UCR archive page here.

I think, given the little time left in the course, we could fully move to the new multivariate TSC datasets?


I would agree. As discussed in tonight’s lecture, true univariate time series are fairly rare in practice, are not handled in the fast.ai library, and are usually a situation where in practice you’re better off just gathering more related data and metadata instead of curve-fitting a sequence.


[PROPOSAL] New TS learning competition: Astronomical Classification (LSST)

Based on feedback received so far, I’d like to propose the following:

  1. Close the current Earthquakes competition explaining the reasons
  2. Launch a new learning competition based on a Kaggle dataset used in the PLAsTiCC competition (also available through the UCR website - LSST dataset).

Key features:

  • Time series classification
  • Train size: 2459
  • Test size: 2466
  • Multivariate: 6 dimensions
  • Time series length: 36
  • Multiclass: 14
  • Published SOTA: 0.575 (DTW)
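For anyone wanting to start quickly, the shape above (6 dimensions, length 36, 14 classes) maps directly onto a small 1D conv net. A minimal PyTorch sketch, assuming the data has been loaded as `(batch, dimensions, length)` tensors; the architecture is illustrative, not a tuned model:

```python
import torch
import torch.nn as nn

# Minimal 1D-conv classifier for the LSST shape described above:
# 6 input channels (dimensions), series length 36, 14 classes.
model = nn.Sequential(
    nn.Conv1d(6, 32, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),  # pool over the time axis
    nn.Flatten(),
    nn.Linear(64, 14),        # one logit per class
)

x = torch.randn(8, 6, 36)  # a batch of 8 multivariate series
logits = model(x)
print(logits.shape)  # torch.Size([8, 14])
```

From there it would just be a matter of wrapping the UCR arrays in a `DataLoader` and training with cross-entropy.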

I think this dataset has everything required to stretch ourselves. It has characteristics present in many TS problems (multivariate, multiclass), so it should be a great learning opportunity.

Description:
This dataset is from a 2018 Kaggle competition. The Photometric LSST Astronomical Time Series Classification Challenge (PLAsTiCC) is an open data challenge to classify simulated astronomical time-series data in preparation for observations from the Large Synoptic Survey Telescope (LSST), which will achieve first light in 2019 and commence its 10-year main survey in 2022. LSST will revolutionize our understanding of the changing sky, discovering and measuring millions of time-varying objects.

PLAsTiCC is a large data challenge for which participants are asked to classify astronomical time series data. These simulated time series, or light curves, are measurements of an object’s brightness as a function of time - by measuring the photon flux in six different astronomical filters (commonly referred to as passbands). These passbands include ultra-violet, optical and infrared regions of the light spectrum. There are many different types of astronomical objects (driven by different physical processes) that we separate into astronomical classes.

The problem we have formulated represents a snapshot of the data available and is created from the train set published in the aforementioned competition. A series length of 36 was chosen as it represents a value at which most instances would not be truncated.

Data download:
I’ve prepared a simple notebook to download data.

Please let me know if you are ok with this so we can announce it in its own thread. I’d also like to know if you’d participate in the competition. Thanks!


While I like the idea, it is kind of problematic because this is an ongoing Kaggle competition (it is not finished yet). That means anyone competing in it should, by Kaggle standards/rules, NOT discuss/share insights/tricks/approaches/code anywhere outside official Kaggle kernels/discussions.


Very true. We could solve this by

  • no one competing in it
  • making one big team with everyone interested

(I like option 2 :slight_smile:)

I understand. I’m not an expert in Kaggle (have never participated in competition).
However, it’s also true that LSST is now officially a public UCR multivariate dataset (presented on Oct 31st, 2018). I actually downloaded the data from the UCR web. I’m not sure what the connection to Kaggle is. I think it may be a subset of the Kaggle training dataset, with the ts length truncated to 36.
I proposed this UCR dataset because it meets some criteria that are good for learning purposes, and there are Kaggle kernels as well, so we could maximize our learning.

But if you think this still may be a problem, we have other options. NATOPS and RacketSports would be the next on my list.
You can read a brief summary of them in the original paper, The UEA multivariate time series classification archive, 2018 (pages 9-10).
They share most of the same features as LSST, except that the train and test sizes are smaller.

Please, group, let me know your thoughts!

But then we wouldn’t be allowed to share our solutions outside the group, I think.