Time series / sequential data study group

Same for me, but real-life projects leave little room for substantial extra commitments. I’ll try to replicate what was posted here in the meantime and might join for round 2…

Thanks for pointing that out. I don’t have a lot of Kaggle experience, have never been on or created a team, and was apparently misled by some forum statements.

So, everyone interested, please let me know via PM or Kaggle message if you want to join (same username as here).


I think this is an exciting choice for our project. I’d like to join the team!

Marc: On your Kaggle profile I see: “You cannot contact users until you reach the Contributor tier”. I guess I’ll have to make one submission first, so I’ll go with the null hypothesis.

Excellent initiative. But, considering my current commitments, I don’t think I will be able to do justice to this Kaggle competition (again, I am neither a deep learning expert nor a Kaggle expert). Please let us know here (or on a different thread?) about the public kernels you post as you make progress. Of course, we will wait to hear about your findings once the competition is closed.

Thank you all for making this course/thread awesome!


TIL about a new time series library called cesium.
cesium is an open source library that allows users to:

  • extract features from raw time series data (see list),
  • build machine learning models from these features, and
  • generate predictions for new data.

This is an example - Epilepsy Detection Using EEG Data - that I think illustrates the power of this library.
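For anyone who wants to try it, here is a minimal sketch of cesium’s featurization step (assumes cesium is installed; the synthetic sine series is a placeholder, and the feature names come from cesium’s built-in feature list):

```python
import numpy as np
from cesium import featurize

# Synthetic single-channel time series: a noisy sine wave.
rng = np.random.default_rng(0)
times = np.linspace(0, 10, 200)
values = np.sin(times) + 0.1 * rng.standard_normal(200)

# Extract a handful of cesium's built-in features into a DataFrame.
fset = featurize.featurize_time_series(
    times=times,
    values=values,
    errors=None,
    features_to_use=["amplitude", "maximum", "minimum", "median", "std"],
)
print(fset)  # one row of features, ready for any sklearn-style model
```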


Hi, I’m looking to catch up with this group.

Are there any main repos that collect all the different time series transform types (like vector-to-image, Fourier transforms, etc.) that people here have experimented with?

Thanks in advance for any pointers!

Hi @keijik, you might be interested in checking out the pyts library, which some of us have used for time-series-to-image transformations.
You may also be interested in this notebook, which demonstrates how to apply this technique in a practical example.
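As a quick illustration, here is a minimal sketch of a Gramian Angular Field transform with pyts (assumes pyts >= 0.9, where the class lives in pyts.image; the random input is a placeholder):

```python
import numpy as np
from pyts.image import GramianAngularField

X = np.random.randn(8, 128)               # 8 series, 128 time steps each
gaf = GramianAngularField(image_size=32)   # each series -> a 32x32 image
X_img = gaf.fit_transform(X)               # shape: (8, 32, 32)
# These "images" can then be fed to a standard CNN classifier.
```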


Awesome, I’ll check it out (time permitting :D)


When working with non-DL algorithms that follow the sklearn API (instantiate a classifier or regressor, fit it, then predict), you can use an AutoML-style library called TPOT.
It is built on genetic programming: it searches through a set of algorithms and their hyper-parameters and arrives at the best combination. Here is the repo.
I found it to be very handy.
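A minimal sketch of how TPOT is typically driven (assumes the classic TPOT API; the tiny generations/population settings are just to keep the demo run short):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Genetic search over sklearn pipelines and their hyper-parameters.
tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, random_state=42)
tpot.fit(X_tr, y_tr)
print(tpot.score(X_te, y_te))       # accuracy of the best pipeline found
tpot.export("best_pipeline.py")     # dump the winning pipeline as plain code
```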


It seems the libraries h2o4gpu and scikits.cuda provide CUDA computational support for the non-DL algorithms of scikit-learn. I plan to experiment with those this weekend.


Update

  1. scikits.cuda is a collection of math-based solver operations on CUDA which could potentially be used to build a CUDA-based scikit-learn.

  2. h2o4gpu is a work in progress. It is also inadequately supported/staffed: some of the demo notebooks fail, and they have not been updated for 7 months to a year. The issue log shows that the fixes are targeted for release 4.0.
    The following tab-completion listing of h2o4gpu’s top-level attributes shows which algorithms are supported:

```text
h2o4gpu.DAAL_SUPPORTED                h2o4gpu.get_config(
h2o4gpu.ElasticNet(                   h2o4gpu.h2o4gpu_exceptions
h2o4gpu.ElasticNetH2O(                h2o4gpu.import_data
h2o4gpu.FunctionVector(               h2o4gpu.libs
h2o4gpu.GradientBoostingClassifier(   h2o4gpu.linear_model
h2o4gpu.GradientBoostingRegressor(    h2o4gpu.logger
h2o4gpu.KMeans(                       h2o4gpu.logging
h2o4gpu.KMeansH2O(                    h2o4gpu.metrics
h2o4gpu.Lasso(                        h2o4gpu.model_selection
h2o4gpu.LinearRegression(             h2o4gpu.neighbors
h2o4gpu.LogisticRegression(           h2o4gpu.os
h2o4gpu.PCA(                          h2o4gpu.preprocessing
h2o4gpu.PCAH2O(                       h2o4gpu.random_projection
h2o4gpu.Pogs(                         h2o4gpu.re
h2o4gpu.RandomForestClassifier(       h2o4gpu.set_config(
h2o4gpu.RandomForestRegressor(        h2o4gpu.setup_module(
h2o4gpu.Ridge(                        h2o4gpu.solvers
h2o4gpu.TruncatedSVD(                 h2o4gpu.svm
h2o4gpu.TruncatedSVDH2O(              h2o4gpu.sys
h2o4gpu.base                          h2o4gpu.typecheck
h2o4gpu.clone(                        h2o4gpu.typechecks
h2o4gpu.compatibility                 h2o4gpu.types
h2o4gpu.config_context(               h2o4gpu.util
h2o4gpu.exceptions                    h2o4gpu.utils
h2o4gpu.externals                     h2o4gpu.warnings
h2o4gpu.feature_selection
```
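Since the listing above mirrors scikit-learn’s class names, h2o4gpu is meant to work as a drop-in replacement. A sketch under the assumption of a working h2o4gpu install with a CUDA GPU:

```python
import numpy as np
import h2o4gpu

X = np.random.rand(1000, 10).astype(np.float32)

# Same constructor/fit interface as sklearn's KMeans, but GPU-backed.
model = h2o4gpu.KMeans(n_clusters=4, random_state=1234)
model.fit(X)
print(model.cluster_centers_)
```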

Now that the competition is over, I would be interested to hear about your learnings.

Hi Time Series study group! I wrote a summary of my learnings while participating in the PLAsTiCC astronomical classification Kaggle competition. I briefly explain what the competition was about, the winning approaches and some general Kaggle tips. Check it out here: “Learnings from my first Kaggle competition: PLAsTiCC” by Francisco Ingham https://link.medium.com/egGyoj4UcT


Thanks a lot!

I was reading your blog. It mentions that “Many winning participants used 5-fold cross-validation and this is a very Kaggle thing.” From the data description, the data seems highly temporal in nature. So, should I understand that the CV folds were not random, but based on a time split?

No, we didn’t use a time-based split, because this was a classification problem where you needed to assign a class to each sample based on its entire time series; there was no forecasting involved.
We randomized the samples, which were independent from each other.
Does this answer your question?
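For anyone following along, here is a small sketch contrasting the two CV schemes in plain scikit-learn (random K-fold is fine when each sample is an independent whole series; TimeSeriesSplit is what you would reach for in a forecasting setup):

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # ten independent samples

for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print("random fold train:", tr, "valid:", va)

for tr, va in TimeSeriesSplit(n_splits=5).split(X):
    print("time split  train:", tr, "valid:", va)  # valid is always later
```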


Hi @mayank4, I’m not sure this post belongs here.

I am working on a problem somewhat similar to predicting the stock prices of a handful of top-picked stocks; let’s name them A, B, C, D, E, F. I have collected and combined the data from all the available sources and dumped it into training and test sets of 82 million and 35 million rows respectively, with 277 different features (consider them handcrafted features). Right now I am using xgboost for model building. My concern is that training takes a very long time because of the large amount of data and features. Is there any way I can embed these 277 features into some 50-odd features, and then use those 50 dense features for model building to reduce my training time?
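A minimal sketch of the kind of compression being asked about, using TruncatedSVD to project 277 features down to 50 before xgboost (synthetic stand-in data; with 82M rows you would fit the SVD on a sample first):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
import xgboost as xgb

X = np.random.randn(10_000, 277)   # stand-in for the real feature matrix
y = np.random.randn(10_000)

svd = TruncatedSVD(n_components=50, random_state=0)
X_small = svd.fit_transform(X)     # (n_samples, 50) dense features

model = xgb.XGBRegressor(n_estimators=200, tree_method="hist")
model.fit(X_small, y)              # train on the compressed features
```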

And thanks to all the people posting interesting stuff.