Excellent initiative. But, given my current commitments, I don’t think I will be able to do justice to this Kaggle competition (again, I am neither a Deep Learning expert nor a Kaggle expert). Please let us know here (or on a separate thread?) about the public kernels you post as you make progress. Of course we will wait to hear back about your findings once the competition closes.
Thank you all for making this course/thread awesome!
Are there any main repos that collect all the different time series transform types (like vector-to-image, Fourier transforms, etc.) that people here have experimented with?
Hi @keijik, you might be interested in checking out the pyts library, which some of us have used to make TS-to-image transformations.
You may also be interested in this notebook, which demonstrates how to apply this technique in a practical example.
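In case it helps to see what a TS-to-image transform actually does, here is a minimal from-scratch sketch of a Gramian Angular Summation Field, the kind of encoding pyts provides (pyts has its own optimized implementation; this NumPy version is just for intuition, and the sine series is a made-up example):

```python
import numpy as np

def gramian_angular_field(x):
    """Encode a 1-D series as a Gramian Angular Summation Field image.

    Minimal sketch of the idea behind pyts' image transforms:
    rescale to [-1, 1], map values to angles, then take pairwise
    cosines of angle sums.
    """
    x = np.asarray(x, dtype=float)
    # Rescale the series to [-1, 1] so arccos is defined.
    x_scaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    # Clip for numerical safety, then map each value to an angle.
    phi = np.arccos(np.clip(x_scaled, -1.0, 1.0))
    # GASF(i, j) = cos(phi_i + phi_j) -> an (n, n) "image".
    return np.cos(phi[:, None] + phi[None, :])

series = np.sin(np.linspace(0, 4 * np.pi, 64))
image = gramian_angular_field(series)
print(image.shape)  # (64, 64)
```

The resulting 2-D array can then be fed to any image model, which is the whole point of the trick.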
When working with non-DL algorithms that follow the sklearn API (instantiate a classifier or regressor, fit it, predict), you can use an AutoML-style library called TPOT.
It is built using genetic programming: it runs through a bunch of algorithms and their hyper-parameters and arrives at the best combination. Here is the repo.
I found it to be very handy.
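For anyone new to this, here is the instantiate/fit/predict pattern the post refers to, shown with a plain sklearn classifier on toy data (the dataset and model choice are just illustrative); TPOT's classes follow the same interface:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for hand-crafted time-series features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The usual sklearn pattern: instantiate, fit, predict.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(clf.score(X_test, y_test))

# TPOT exposes the same API, roughly (not run here):
# from tpot import TPOTClassifier
# clf = TPOTClassifier(generations=5, population_size=20, random_state=0)
# clf.fit(X_train, y_train)
# clf.export('best_pipeline.py')  # writes out the winning pipeline
```

Because the API is shared, swapping a hand-picked model for TPOT's search is close to a one-line change.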
It seems the libraries h2o4gpu and scikits.cuda provide CUDA computational support for scikit-learn's non-DL algorithms. I plan to experiment with them this weekend.
scikits.cuda is a collection of CUDA-based math and solver operations which could potentially be used to build a CUDA-backed scikit-learn.
h2o4gpu is a work in progress. It is also inadequately supported/staffed: some of the demo notebooks fail, and they have not been updated for 7 months to a year. The issue log shows that the fixes are targeted for release 4.0.
Following is a listing of h2o4gpu's attributes, which shows which algorithms are supported:
Hi Time Series study group! I wrote a summary of my learnings while participating in the PLAsTiCC astronomical classification Kaggle competition. I briefly explain what the competition was about, the winning approaches and some general Kaggle tips. Check it out here: “Learnings from my first Kaggle competition: PLAsTiCC” by Francisco Ingham https://link.medium.com/egGyoj4UcT
I was reading your blog. You mention that "Many winning participants used 5-fold cross-validation and this is a very Kaggle thing." From the data description, the data seems highly temporal in nature. So should I understand that the CV folds are not random, but based on a time split?
No, we didn’t use a time-based split because this was a classification problem where you needed to assign a class to each sample based on its entire time series. There was no forecasting involved.
We randomized the samples, which were independent from each other.
Does this answer your question?
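To make the distinction concrete, here is a small sketch (sample sizes are made up) of the two split styles in sklearn: shuffled K-fold, which is valid when each sample is an independent whole series, versus `TimeSeriesSplit`, which you would reach for in a forecasting setup:

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# 100 independent samples (one whole time series each).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# Shuffled 5-fold CV: fine here because samples are independent.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in cv.split(X):
    pass  # fit/validate a model per fold
print(len(train_idx), len(valid_idx))  # 80 20

# For forecasting you would instead keep validation strictly
# in the "future" relative to training:
tscv = TimeSeriesSplit(n_splits=5)
for train_idx_t, valid_idx_t in tscv.split(X):
    assert train_idx_t.max() < valid_idx_t.min()
```

The assertion in the second loop is exactly the property a time-based split guarantees and a shuffled one does not.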
I am working on a problem somewhat similar to predicting the stock price of a handful of top-picked stocks; let's name them A, B, C, D, E, F. I have collected and combined data from all the available sources and dumped it into training and test sets of 82 million and 35 million rows respectively, with 277 different features (consider them hand-crafted features). Right now I am using xgboost for model building. My concern is that training is taking a hell of a lot of time because of the large amount of data and features. I just wanted to understand: is there any way I can create a feature embedding of these 277 features into some 50-odd features, and then use those 50 deep features for model building to reduce my training time?
And Thanks for all the people posting interesting stuff.
Hi @jeetkarsh,
Embeddings are used to enrich the representation of categorical features. With embeddings, a categorical feature with x unique values will be replaced by a vector of min(x/2, 50) values per row. This means that the amount of data will actually increase, not decrease.
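As a tiny illustration of that rule of thumb (the integer rounding here is my own assumption, the post only gives min(x/2, 50)):

```python
def embedding_size(n_unique, max_size=50):
    # Rule of thumb from the post: a categorical feature with
    # n_unique values gets an embedding of min(n_unique / 2, 50) dims.
    return min(n_unique // 2, max_size)

print(embedding_size(8))    # 4
print(embedding_size(300))  # 50
```

So a feature with 8 categories becomes a 4-dimensional vector per row, which is why embeddings widen rather than shrink the data.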
If your features are time series (I’m guessing since you’ve posted in this thread) there are alternatives to reduce the amount of data. Sampling features based on a “catalytic” event is one of them. This is an extract from Advances in Financial Machine Learning (M. López de Prado, 2018):
Suppose that you wish to predict whether the next 5% absolute return will be positive (a 5% rally) or negative (a 5% sell-off). At any random time, the accuracy of such a prediction will be low. However, if we ask a classifier to predict the sign of the next 5% absolute return after certain catalytic conditions, we are more likely to find informative features that will help us achieve a more accurate prediction.
You could, for example, create a model where you take the last 100 timesteps when the market has moved up or down a certain %. That would certainly reduce your dataset.
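A rough sketch of that event-based sampling idea (the 2% threshold, 100-step window, and simulated price path are all illustrative choices, not values from the book):

```python
import numpy as np

# Simulated price path standing in for real market data.
rng = np.random.default_rng(1)
prices = 100 * np.cumprod(1 + rng.normal(0, 0.01, size=5000))
returns = np.diff(prices) / prices[:-1]

# "Catalytic" events: timesteps where the one-step move exceeds 2%.
threshold, window = 0.02, 100
event_idx = np.flatnonzero(np.abs(returns) > threshold) + 1

# Keep only a fixed window of history ending at each event,
# instead of every row of the raw series.
samples = [prices[i - window:i] for i in event_idx if i >= window]
print(len(samples), "samples of length", window)
```

Instead of feeding the model all 5,000 timesteps, you train only on the windows preceding "interesting" moves, which is where the predictive signal is more likely to live.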
Hi @nok, I don’t have personal experience with seq2seq models applied to time series, but it’s something that should work.
I’ve done a quick search and found a notebook in Keras (I have not found anything relevant in PyTorch yet). I’d be interested if you can develop something in PyTorch!