Time series / sequential data study group

I think, given the little time left in the course, we could move fully to the new multivariate TSC dataset?

1 Like

I would agree. As discussed in tonight's lecture, true univariate time series are fairly rare in practice, are not handled in the fast.ai library, and are usually a situation where you're better off gathering more related data and metadata instead of curve-fitting a sequence.

4 Likes

[PROPOSAL] New TS learning competition: Astronomical Classification (LSST)

Based on feedback received so far, I'd like to propose the following:

  1. Close the current Earthquakes competition, explaining the reasons.
  2. Launch a new learning competition based on a Kaggle dataset used in the PLAsTiCC competition (also available through the UCR website as the LSST dataset).

Key features:

  • Time series classification
  • Train size: 2459
  • Test size: 2466
  • Multivariate: 6 dimensions
  • Time series length: 36
  • Multiclass: 14
  • Published SOTA: 0.575 (DTW)

I think this dataset has everything we need to stretch ourselves. It has characteristics present in many TS problems (multivariate, multiclass), so it should be a great learning opportunity. (A minimal sketch of the kind of 1-NN DTW baseline behind the published SOTA above follows.)
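For context, the published SOTA above comes from a 1-nearest-neighbour classifier under dynamic time warping (DTW). Here is a minimal sketch of that kind of baseline — not the benchmark authors' code — assuming the arrays follow a (samples, dimensions, length) layout, e.g. (2459, 6, 36) for the LSST train set:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-programming DTW between two multivariate series of shape
    (dims, length), using squared Euclidean point-wise costs."""
    n, m = a.shape[1], b.shape[1]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.sum((a[:, i - 1] - b[:, j - 1]) ** 2)
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return np.sqrt(cost[n, m])

def predict_1nn_dtw(X_train, y_train, X_test):
    """Label each test series with the class of its nearest train series."""
    preds = []
    for x in X_test:
        dists = [dtw_distance(x, xt) for xt in X_train]
        preds.append(y_train[int(np.argmin(dists))])
    return np.array(preds)
```

Note that plain DTW is quadratic per pair, so 1-NN over the full train set is slow in pure Python; libraries such as tslearn or dtaidistance offer much faster implementations.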

Description:
This dataset is from a 2018 Kaggle competition. The Photometric LSST Astronomical Time Series Classification Challenge (PLAsTiCC) is an open data challenge to classify simulated astronomical time-series data in preparation for observations from the Large Synoptic Survey Telescope (LSST), which will achieve first light in 2019 and commence its 10-year main survey in 2022. LSST will revolutionize our understanding of the changing sky, discovering and measuring millions of time-varying objects.

PLAsTiCC is a large data challenge in which participants are asked to classify astronomical time series data. These simulated time series, or light curves, are measurements of an object's brightness as a function of time, obtained by measuring the photon flux in six different astronomical filters (commonly referred to as passbands). These passbands include ultraviolet, optical, and infrared regions of the light spectrum. There are many different types of astronomical objects (driven by different physical processes) that we separate into astronomical classes.

The problem we have formulated represents a snapshot of the data available and is created from the train set published in the aforementioned competition. A series length of 36 was chosen because it is a value at which most instances would not be truncated.

Data download:
I've prepared a simple notebook to download the data.
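For anyone who wants to skip the notebook, here is a minimal sketch of the download step; the URL follows the usual timeseriesclassification.com pattern for UEA/UCR archive datasets and is an assumption that may need adjusting:

```python
import io
import zipfile
from pathlib import Path

import requests

# Assumed URL pattern for UEA/UCR multivariate archive downloads.
URL = "http://www.timeseriesclassification.com/Downloads/LSST.zip"
dest = Path("data/LSST")
dest.mkdir(parents=True, exist_ok=True)

# Fetch the zip archive and extract train/test files locally.
resp = requests.get(URL, timeout=60)
resp.raise_for_status()
zipfile.ZipFile(io.BytesIO(resp.content)).extractall(dest)
print(sorted(p.name for p in dest.iterdir()))  # expect train/test .arff/.ts files
```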

Please let me know if you are OK with this so we can announce it in its own thread. I'd also like to know if you'd participate in the competition. Thanks!

3 Likes

While I like the idea, it is kind of problematic because this is an ongoing Kaggle competition (it has not finished yet). That means anyone competing in it should, by Kaggle's rules, NOT discuss or share insights/tricks/approaches/code anywhere outside official Kaggle kernels/discussions.

1 Like

Very true. We could solve this by

  • no one competing in it
  • making one big team with everyone interested

(I like option 2 :slight_smile:)

I understand. I'm not an expert in Kaggle (I have never participated in a competition).
However, it's also true that LSST is now officially a public UCR multivariate dataset (presented on Oct 31st, 2018). I actually downloaded the data from the UCR website. I'm not sure what the connection to Kaggle is; I think it may be a subset of the Kaggle training dataset, with the time series length truncated to 36.
I proposed this UCR dataset because it meets some criteria that are good for learning purposes, and there are Kaggle kernels as well, so we could maximize our learning.

But if you think this still may be a problem, we have other options. NATOPS and RacketSports would be the next on my list.
You can read a brief summary of them in the original paper, The UEA multivariate time series classification archive, 2018 (pages 9-10).
They share most of the same features as LSST, except that the train and test sizes are smaller.

Please, group, let me know your thoughts!

But then we wouldn't be allowed to share our solutions outside the group, I think.

For the next 19 days, yes

1 Like

I've just seen that the Kaggle data is 7 GB; the UCR version is 26.9 MB.
But we have other choices if you think it might be an issue.

I would also like option 2, and then we can share what we did with the group in 19 days :wink:
The competition only has 19 days left, and teams have to be formed by Dec 10th. Just to be clear what we would be getting ourselves into…

OK, if you prefer that. Do either of you, @henripal or @marcmuc, have experience with Kaggle? As I said, I've never participated in any competition, but I'm willing to learn as much as I can about TS, so I'd be happy to be part of a team. Who would like to take the lead?

1 Like

I am in as well for the team project if needed.

I would like to join the team, as well.

@oguiza I'm in the team.

I would like to form part of the team!

I would also like to join the team. I read the rules and there is no conflict. For full disclosure, my wife is the director for Science at LSST corp.

2 Likes

PLAsTiCC Kaggle Team Challenge

Okay, let's do this and climb the leaderboard together :wink:

The challenge will close on December 17th, 11:59 PM UTC; the deadline for entry and forming teams is the 10th.

  • Everyone who is interested in participating should first get set up on Kaggle, if that hasn't happened yet. Don't worry, it is quite easy and all the necessary steps are documented quite well on kaggle.com. (If you are a Kaggle user already, even better.)
  • After signing up, go to the PLAsTiCC competition and accept the rules. You will be asked automatically when you try to download the data.
  • Download the data or create a (private) Kaggle kernel so that you can look at the data.
  • Try to get familiar with the problem space by reading the material in the competition: the description, data, and evaluation pages are must-reads (see the metric sketch after this list).
  • Look at the kernels that are already there to get familiar with what this challenge is about, and also to get inspiration (and learn a lot!) from what other people have done and shared already.
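For reference, the evaluation page describes a weighted multi-class log loss. Here is a minimal sketch of that metric, using flat class weights as a placeholder assumption; the official per-class weights are listed on Kaggle and would need to be plugged in:

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, weights=None, eps=1e-15):
    """Weighted multi-class log loss.

    y_true: (n,) integer class indices; p_pred: (n, n_classes) probabilities.
    weights: per-class weights; flat weights are an assumption, not the
    official competition values.
    """
    n_classes = p_pred.shape[1]
    if weights is None:
        weights = np.ones(n_classes)  # placeholder: flat weights
    # Clip and renormalise so log() is well-defined and rows sum to 1.
    p = np.clip(p_pred, eps, 1 - eps)
    p = p / p.sum(axis=1, keepdims=True)
    # Average the negative log-probability of the true class, per class.
    loss_per_class = np.zeros(n_classes)
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            loss_per_class[c] = -np.log(p[mask, c]).mean()
    return (weights * loss_per_class).sum() / weights.sum()
```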

We should have a phase of about one week for everyone to do this, make some first submissions (e.g. of baseline models, like the sketch below), and get familiar with the process on Kaggle and with the dataset. This also enables us to make many more submissions than if we formed a team right from the start.
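As a concrete example of a first baseline submission, one easy option is uniform probabilities built from Kaggle's sample submission file, so the object ids and column layout are guaranteed to match. A sketch, where the file names are the Kaggle defaults and the class_* column pattern is an assumption:

```python
import pandas as pd

# Uniform-probability baseline: start from the sample submission so the
# object ids and columns match, then assign a flat prior to every
# probability column.
sub = pd.read_csv("sample_submission.csv")
prob_cols = [c for c in sub.columns if c.startswith("class_")]
sub[prob_cols] = 1.0 / len(prob_cols)  # flat prior over all classes
sub.to_csv("uniform_baseline.csv", index=False)
```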

On the 5th of December we should then form a team on Kaggle (which is done by inviting/merging teams) and work together for the remaining ~2 weeks (12 days). There is a maximum of 8 people allowed per Kaggle team. Right now I think 8 of us have signalled interest. If there are more by then, we could either say the "top 8" of us on the leaderboard get on the team (I would not be sure of being on it then :thinking:), or we could simply split into two teams (but then we could not share ideas between teams until after the competition ends).

To keep it simple, and as everyone is on this forum anyway, I would then set up private group chats (which work kind of like Slack) here in the forum. So we would have a private team chat/thread (or two, in case of two teams).

I can continue to coordinate during the challenge and dedicate some time to it, but just to be clear, I am neither an experienced Kaggler nor an experienced deep learning practitioner! :wink: So the goal would be for all of us to learn from each other.

2 Likes

Sounds like a good plan to me. I'll try to keep a log of my submissions around in a simple form (rationale, source code, submission file, score). The submission file has about 3.5 million entries (11 MB compressed). Here is a top-3 competitor's confusion matrix:

[confusion matrix image]

I suggest that you add people to the team before they make any individual submissions (i.e. ASAP). Otherwise, you might find that the total submission count for the individuals is higher than the max allowed, and you won't be able to create the team.

5 Likes

I don't know if I'll have the bandwidth to work on this with a team, but I'll cheer from the sidelines while I try this in what spare time I can find. Good luck, competitors!

1 Like