Excellent initiative. But, given my current commitments, I don’t think I will be able to do justice to this Kaggle competition (again, I am neither a Deep Learning expert nor a Kaggle expert). Please let us know here (or on a separate thread?) about the public kernels you post as you make progress. Of course we will wait to hear back about your findings once the competition closes.
Thank you all for making this course/thread awesome!
Are there any main repos that collect all the different time series transform types (like vector-to-image, Fourier transforms, etc.) that people here have experimented with?
Hi @keijik, you might be interested in checking out the pyts library, which some of us have used to make TS-to-image transformations.
You may also be interested in this notebook, which demonstrates how to apply this technique in a practical example.
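In case it helps to see what a TS-to-image transform actually does, here is a minimal from-scratch sketch of a Gramian Angular Summation Field, the kind of encoding pyts provides (pyts has its own optimized implementation; this NumPy version is just for intuition, and the sine series is a made-up example):

```python
import numpy as np

def gramian_angular_field(x):
    """Encode a 1-D series as a Gramian Angular Summation Field image.

    Minimal sketch of the idea behind pyts' image transforms:
    rescale to [-1, 1], map values to angles, then take pairwise
    cosines of angle sums.
    """
    x = np.asarray(x, dtype=float)
    # Rescale the series to [-1, 1] so arccos is defined.
    x_scaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    # Clip for numerical safety, then map each value to an angle.
    phi = np.arccos(np.clip(x_scaled, -1.0, 1.0))
    # GASF(i, j) = cos(phi_i + phi_j) -> an (n, n) "image".
    return np.cos(phi[:, None] + phi[None, :])

series = np.sin(np.linspace(0, 4 * np.pi, 64))
image = gramian_angular_field(series)
print(image.shape)  # (64, 64)
```

The resulting 2-D array can then be fed to any image model, which is the whole point of the trick.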
When working with non-DL algorithms that follow the sklearn API (instantiate a classifier or regressor, fit it, predict), you can use an AutoML-style library called TPOT.
It is built using genetic programming: it runs through a bunch of algorithms and their hyper-parameters and arrives at the best combination. Here is the repo.
I found it to be very handy.
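For anyone new to this, here is the instantiate/fit/predict pattern the post refers to, shown with a plain sklearn classifier on toy data (the dataset and model choice are just illustrative); TPOT's classes follow the same interface:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for hand-crafted time-series features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The usual sklearn pattern: instantiate, fit, predict.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(clf.score(X_test, y_test))

# TPOT exposes the same API, roughly (not run here):
# from tpot import TPOTClassifier
# clf = TPOTClassifier(generations=5, population_size=20, random_state=0)
# clf.fit(X_train, y_train)
# clf.export('best_pipeline.py')  # writes out the winning pipeline
```

Because the API is shared, swapping a hand-picked model for TPOT's search is close to a one-line change.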
It seems the libraries h2o4gpu and scikits.cuda provide CUDA computational support for scikit-learn's non-DL algorithms. I plan to experiment with them this weekend.
scikits.cuda is a collection of CUDA-based math and solver operations which could potentially be used to build a CUDA-backed scikit-learn.
h2o4gpu is a work in progress. It is also inadequately supported/staffed: some of the demo notebooks fail, and they have not been updated for 7 months to a year. The issue log shows that the fixes are targeted for release 4.0.
Following is a listing of h2o4gpu's attributes, which shows which algorithms are supported:
Hi Time Series study group! I wrote a summary of my learnings while participating in the PLAsTiCC astronomical classification Kaggle competition. I briefly explain what the competition was about, the winning approaches and some general Kaggle tips. Check it out here: “Learnings from my first Kaggle competition: PLAsTiCC” by Francisco Ingham https://link.medium.com/egGyoj4UcT
I was reading your blog. You mention that "Many winning participants used 5-fold cross-validation and this is a very Kaggle thing." From the data description, the data seems highly temporal in nature. So should I understand that the CV folds are not random, but based on a time split?
No, we didn’t use a time-based split because this was a classification problem where you needed to assign a class to each sample based on its entire time series. There was no forecasting involved.
We randomized the samples, which were independent from each other.
Does this answer your question?
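To make the distinction concrete, here is a small sketch (sample sizes are made up) of the two split styles in sklearn: shuffled K-fold, which is valid when each sample is an independent whole series, versus `TimeSeriesSplit`, which you would reach for in a forecasting setup:

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# 100 independent samples (one whole time series each).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# Shuffled 5-fold CV: fine here because samples are independent.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in cv.split(X):
    pass  # fit/validate a model per fold
print(len(train_idx), len(valid_idx))  # 80 20

# For forecasting you would instead keep validation strictly
# in the "future" relative to training:
tscv = TimeSeriesSplit(n_splits=5)
for train_idx_t, valid_idx_t in tscv.split(X):
    assert train_idx_t.max() < valid_idx_t.min()
```

The assertion in the second loop is exactly the property a time-based split guarantees and a shuffled one does not.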
I am working on a problem somewhat similar to predicting the stock price of a handful of top-picked stocks; let's name them A, B, C, D, E, F. I have collected and combined data from all the available sources and dumped it into training and test sets of 82 million and 35 million rows respectively, with 277 different features (consider them hand-crafted features). Right now I am using xgboost for model building. My concern is that training is taking a hell of a lot of time because of the large amount of data and features. I just wanted to understand: is there any way I can create a feature embedding of these 277 features into some 50-odd features, and then use those 50 deep features for model building to reduce my training time?
And Thanks for all the people posting interesting stuff.
Hi @jeetkarsh,
Embeddings are used to enrich the representation of categorical features. With embeddings, a categorical feature with x unique values will be replaced by a vector of min(x/2, 50) values per row. This means that the amount of data will actually increase, not decrease.
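As a tiny illustration of that rule of thumb (the integer rounding here is my own assumption, the post only gives min(x/2, 50)):

```python
def embedding_size(n_unique, max_size=50):
    # Rule of thumb from the post: a categorical feature with
    # n_unique values gets an embedding of min(n_unique / 2, 50) dims.
    return min(n_unique // 2, max_size)

print(embedding_size(8))    # 4
print(embedding_size(300))  # 50
```

So a feature with 8 categories becomes a 4-dimensional vector per row, which is why embeddings widen rather than shrink the data.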
If your features are time series (I’m guessing since you’ve posted in this thread) there are alternatives to reduce the amount of data. Sampling features based on a “catalytic” event is one of them. This is an extract from Advances in Financial Machine Learning (M. López de Prado, 2018):
Suppose that you wish to predict whether the next 5% absolute return will be positive (a 5% rally) or negative (a 5% sell-off). At any random time, the accuracy of such a prediction will be low. However, if we ask a classifier to predict the sign of the next 5% absolute return after certain catalytic conditions, we are more likely to find informative features that will help us achieve a more accurate prediction.
You could, for example, create a model where you take the last 100 timesteps when the market has moved up or down a certain %. That would certainly reduce your dataset.
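A rough sketch of that event-based sampling idea (the 2% threshold, 100-step window, and simulated price path are all illustrative choices, not values from the book):

```python
import numpy as np

# Simulated price path standing in for real market data.
rng = np.random.default_rng(1)
prices = 100 * np.cumprod(1 + rng.normal(0, 0.01, size=5000))
returns = np.diff(prices) / prices[:-1]

# "Catalytic" events: timesteps where the one-step move exceeds 2%.
threshold, window = 0.02, 100
event_idx = np.flatnonzero(np.abs(returns) > threshold) + 1

# Keep only a fixed window of history ending at each event,
# instead of every row of the raw series.
samples = [prices[i - window:i] for i in event_idx if i >= window]
print(len(samples), "samples of length", window)
```

Instead of feeding the model all 5,000 timesteps, you train only on the windows preceding "interesting" moves, which is where the predictive signal is more likely to live.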
Hi @nok, I don’t have personal experience with seq2seq models applied to time series, but it’s something that should work.
I’ve done a quick search and found a notebook in Keras (I have not found anything relevant in PyTorch yet). I’d be interested if you can develop something in PyTorch!