- Add drop=False to add_datepart
add_datepart(train_df, "Date", drop=False)
add_datepart(test_df, "Date", drop=False)
- Apply all feature engineering to train & test
- Remove NaNs from train & test
test_df = test_df.fillna(0)
train_df = train_df.fillna(0)
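For anyone wondering what drop=False changes, here is a rough, library-free sketch. It is not fastai's implementation: add_date_parts is a made-up helper, and the feature names are only illustrative of the kind of calendar columns add_datepart derives. The point is that with drop=False the original "Date" column survives next to the derived ones.

```python
from datetime import date

def add_date_parts(row, field, drop=True):
    """Hypothetical helper: return a copy of `row` with calendar features
    derived from row[field]. Only mimics the idea behind add_datepart."""
    d = row[field]
    out = dict(row)
    out.update({
        "Year": d.year,
        "Month": d.month,
        "Day": d.day,
        "Dayofweek": d.weekday(),            # Monday == 0
        "Dayofyear": d.timetuple().tm_yday,
    })
    if drop:
        del out[field]                       # the original date column is removed
    return out

row = {"Date": date(2019, 3, 15), "value": 1.0}
kept = add_date_parts(row, "Date", drop=False)    # still has "Date"
dropped = add_date_parts(row, "Date", drop=True)  # "Date" is gone
```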
The last one was a bit of a surprise, but I found the solution here in the forum:
The classifier works now. Accuracy is 80%, which is ~10 percentage points better than the best LSTM model I did with Keras and the first baseline with the TabularModel before any optimization was done.
Most of the hassle seems to come from incorrect data pre-processing. It looks like I was looking at an outdated example notebook from the 2018 course. Meanwhile, I found the latest (2019) Rossmann example in Lesson 6 and I am now working on applying that pre-processing. I will update the post later on.
Last week, I started working through the part-1 course and so far, lessons 1-4 were quite helpful to get my mini-project started. Most of the documentation and code examples were good, but after having set up a model, I am getting zero accuracy, so there must be something wrong with the way I do things.
I am dealing with an “auto-regression” problem, that is, I want to use earlier time periods of the dependent variable (and its features) as predictors for future values. My dataset has about 5k of data with 168 features from which I want to predict 4 dependent variables.
For now, predicting just one target variable with a subset of 18 features is sufficient to get something working.
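One way to frame the auto-regression described above for a learner that looks at one row at a time is to turn the series into a lagged table: each row holds the previous n_lags values as features and the current value as the target. Below is a minimal sketch in plain Python. The make_lagged_rows helper, the lag_k column names, and the toy numbers are all made up for illustration and are not from my dataset.

```python
def make_lagged_rows(series, n_lags):
    """Hypothetical helper: build one row per time step t >= n_lags,
    with features lag_1..lag_n (most recent first) and the value at t as target."""
    rows = []
    for t in range(n_lags, len(series)):
        row = {f"lag_{k}": series[t - k] for k in range(1, n_lags + 1)}
        row["target"] = series[t]
        rows.append(row)
    return rows

series = [10, 11, 13, 12, 15, 16]        # toy values, illustration only
rows = make_lagged_rows(series, n_lags=3)

# Split chronologically: the most recent rows become the validation set,
# so the model is never validated on data older than its training data.
n_valid = 1
train_rows, valid_rows = rows[:-n_valid], rows[-n_valid:]
```

With a pandas DataFrame the same table can be built with DataFrame.shift, one shifted copy of the column per lag. The first n_lags rows then contain NaNs and have to be dropped or filled.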
The Rossmann example used the “ColumnarModelData”, which seems to create a model for a specified column. Now, that makes a ton of sense to me since auto-regression is essentially working on a column of time-series data. Unfortunately, that “ColumnarModelData” is gone in fast.ai v1.
Next, I was looking in the tabular data model and when following the “adult” example, I get some code working, but no usable model. Specifically, I did:
- Separated categorical from continuous columns
- Converted data to categories
- Added data processors: [FillMissing, Categorify, Normalize]
- Split data by index
- Created a test data set as TabularList
- Created a databunch
- Created a tabular learner
The last step is where I am unsure whether it's the right thing, because the learner should be an RNN or column learner or whatever gets auto-regression done in fast.ai.
Note, the underlying sample data is just 5k of numerical data (~500 KB), so fast runtime is expected. A simple LSTM model with Keras gets an accuracy of about 70%, so I thought fast.ai could beat this with a bit of tweaking.
However, when I run the code shown in the gist, I get an accuracy of zero and a pretty abnormal validation loss. When reading the API docs, I simply cannot figure out whether the tabular learner operates row-wise or column-wise. The results indicate row-wise, but that’s just a guess.
Is the tabular_learner the right learner for auto-regression on time-series data?
I have read a few times that RNN / LSTM isn’t the best choice anymore for auto-regression because a fully connected layer can perform better. How can I do that in fast.ai to predict future values in a time-series dataset?
Can I predict more than one dependent variable or do I have to create four different models when I want to predict four dependent variables?
I am thinking increasingly about writing up and contributing a time-series tutorial, so any help to make things work is greatly appreciated.