Rossmann questions


#1

Hi everyone,

I am trying to understand the approach of the Rossmann notebook. Here are my questions:

  1. If this is a time-series problem, where do we limit the model’s ability to see into the future? It seems to me that the network (and random forest at the end) can see all data at once ( = fully connected)?
  2. Under ‘Durations’, there are the following lines that assign two different things to df. It would seem to me that they break the code by only assigning the test set to df. Have I misunderstood something?
    df = train[columns]
    df = test[columns]
  3. Jeremy follows the authors in removing instances where the store was closed. Why does he not do the same thing with the test set? Why does the code break down if we don’t remove the closed shops?
  4. Why is ‘Id’ added to the test set, but not the training set?

Many thanks for your help!


(Sudarshan) #2

I’ll answer some of your questions:

Question 1: Once you have trained and validated your model and have set up all the hyperparameters, it is advisable to train the model on the entire training set provided to you. The test set (at Kaggle) will contain dates in the future, and your model will be tested on those.
Question 2: The code is not meant to be run linearly. When you train, you comment out the test[columns] code and vice versa when you test.
Question 4: AFAICT ‘Id’ is not “added” here. Only Id and Sales are selected to be written out to the CSV for submission to Kaggle.
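
On Question 2, one way to avoid commenting lines in and out is to wrap the step in a function and call it once per frame. This is only a minimal sketch, assuming pandas: `add_days_since_event` is a hypothetical helper rather than the notebook's own code, and the tiny frames and the `AfterPromo` column name are just for illustration.

```python
import pandas as pd

def add_days_since_event(df, flag_col, date_col="Date", group_col="Store"):
    """Return a copy with a column counting days since flag_col was last 1, per store."""
    out = df.sort_values([group_col, date_col]).copy()
    # Date of the most recent flagged row, carried forward within each store.
    last_event = out[date_col].where(out[flag_col] == 1)
    last_event = last_event.groupby(out[group_col]).ffill()
    out["After" + flag_col] = (out[date_col] - last_event).dt.days
    return out

# Made-up data standing in for the real train and test frames.
train = pd.DataFrame({
    "Store": [1, 1, 1],
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"]),
    "Promo": [1, 0, 0],
})
test = pd.DataFrame({
    "Store": [1, 1],
    "Date": pd.to_datetime(["2015-02-01", "2015-02-02"]),
    "Promo": [0, 1],
})

# Same code path for both frames, so neither assignment overwrites the other.
train_feat = add_days_since_event(train, "Promo")
test_feat = add_days_since_event(test, "Promo")
```

Rows before the first flagged day in a frame have no earlier event to count from, so they come out as missing values.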


#3

Hi Sudarshan,

Thanks for your feedback. I understand your points on Q2 and Q4, but I still struggle to understand how this is posed as a time-series problem. The model should be able to predict y for period tn with X up to period tn-1. Say the train/test set breaks at today’s date (May 6, 2018). How can we make predictions into the future with this model? We have no data (X_test) for the future?

Many thanks!


(Sudarshan) #4

How can we make predictions into the future with this model? We have no data (X_test) for the future?

The underlying assumption is that the distribution of the data will not drastically change in the near future. If what happens tomorrow is drastically different than what has happened till today for the past year, then even the best model would not be able to predict that. Remember, the point of machine learning is to learn the distribution of the data (aka function aka probability).

I would think that, as time goes by, you would have to retrain your model on the latest n observations to update its parameters, so it captures any variations that have occurred in the distribution during those last n observations.

Any problem that has a time component within it can be posed as a time-series problem. The time difference does not have to be consistent either.
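
To make the "seeing into the future" concern concrete: with a tabular model like this, the safeguard usually sits in how the data is split rather than in the network itself. Here is a minimal sketch, assuming pandas and a made-up daily frame; `time_split` is a hypothetical helper, not something from the notebook.

```python
import pandas as pd

def time_split(df, date_col, cutoff):
    """Split so every validation row is strictly later than every training row."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[date_col] <= cutoff]
    valid = df[df[date_col] > cutoff]
    return train, valid

# Ten made-up days of sales ending on 2018-05-06.
dates = pd.date_range("2018-04-27", periods=10, freq="D")
df = pd.DataFrame({"Date": dates, "Sales": range(10)})

train, valid = time_split(df, "Date", "2018-05-03")
```

Scoring on `valid` then mimics the real situation: the model is judged only on dates it never saw during training.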


#5

Sorry, I’m confused. Even if the distribution doesn’t change - which it usually does in a time-dependent (time-series) model - how do you make predictions into the future without data? In a time-series model, you always make predictions for period tn with data up to tn-1. As far as I have understood, the Rossmann notebook assumes you know your predictors (X) for time period tn. How would you now go about making predictions into the future with the Rossmann notebook? Please correct me if I have misunderstood the architecture of the model. It just seems to me that the model can use data (‘independent’ variables) at period t1 to make predictions for period t1. The random forest at the end certainly can see all data at once. It seems to me it’s the same with the neural net. Many thanks.


(Luke Byrne) #6

Hi all,

I have a similar question regarding future predictions using a Rossmann-style architecture.

  1. How do I make predictions using the .predict() method on just one row of data? Will I need to look up the embedding matrix to get the relevant embedding representation to pass into the predict method?

  2. Say I get new data coming in, can I use the existing model weights, and retrain just giving the model the new data?

  3. How can I deploy this to a flask app for realtime predictions?

I look forward to any responses.

Kind regards,

Luke


(Sudarshan) #7

@lukebyrne

For 3 check this out.


(Lou Acresti) #8

Has anyone attempted to provide a fixed version of the notebook? I haven’t seen anything out there, and I’ve spent hours carefully trying to “perform” this notebook properly… I imagine many others have also spent a lot of time (or just gave up) as well.


(gram) #9

TL;DR: How to get the model to predict price into the future after training? I suspect it’s a simple feature built in somewhere but I missed it.

I’ve modified this lesson3-rossman notebook with my own data on corn prices.
I got all sorts of my own variables in. Weather, Oil prices, corn production, etc.
Everything worked out for me.
I was able to split 40 years of data into a ‘joined’ of 39 years and a ‘joined_test’ of 1 year, and compared the prediction chart for that one year to what the price actually did in that year.
What I’ve never understood is how to get a price projection into the future. If it’s only “predicting” what happened in the past then it’s not technically doing any predicting.

I TRIED making the validation set of DATES IN THE FUTURE that contain just the estimated yearly corn production numbers and a column of zeros for the corn price (as it always would be in the validation set). This is where I’m at a loss. What do I put in every other column? If I start putting in my guesses for what oil would be on a future date then we’re in a garbage in, garbage out situation, right?
I know which inputs are reliable, price-relevant estimates for the future, like crop production, but I don’t know how to show the model only those.

Thank you.


(Sam Lloyd) #10

This is essentially the challenge of any time-series problem. Extrapolating beyond the given dataset comes with its challenges, but it sounds like you’ve done the right thing with your validation set.

Pretty much. It’s best to have variables that you have a decent value for; otherwise you’re better off making a model that only uses the variables that you do have.
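
One way to build such a model is to keep the columns you can know ahead of time as they are, and swap the rest for lagged values at your forecast horizon. This is only a rough sketch, assuming pandas and evenly spaced rows sorted by time; `make_forecast_frame` and the column names are hypothetical.

```python
import pandas as pd

def make_forecast_frame(df, known_cols, lagged_cols, horizon):
    """Keep columns known in advance; replace the rest with values from `horizon` rows earlier."""
    out = df[known_cols].copy()
    for col in lagged_cols:
        # At row t the model only sees the value observed at row t - horizon.
        out[f"{col}_lag{horizon}"] = df[col].shift(horizon)
    # The first `horizon` rows have no history to look back on, so drop them.
    return out.dropna()

df = pd.DataFrame({
    "planned_production": [10, 11, 12, 13, 14],   # assumed known ahead of time
    "oil_price": [50.0, 52.0, 51.0, 55.0, 54.0],  # not known ahead of time
})

X = make_forecast_frame(df, ["planned_production"], ["oil_price"], horizon=2)
```

Every row of `X` then contains only information that would actually be available `horizon` steps before the date being predicted.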


(gram) #11

Thanks for the response.

I didn’t think about this being a problem until I got to this juncture.

The traders who aren’t cheating must model multiple scenarios. Oil high, oil low, flat, USD high, low.
Hmmm. Though you can make auto conditional sells based on another security it could explode into way too many possibilities.
The biggest factor is still who buys and sells. Hard.
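
Roughly what I mean by modelling multiple scenarios, as a sketch in plain Python. `predict_price` is a stand-in for a trained model with invented coefficients, and all the scenario values are made up.

```python
from itertools import product

def predict_price(oil, usd, production):
    """Stand-in for a trained model; the coefficients are invented for illustration."""
    return 3.0 + 0.02 * oil - 1.5 * (usd - 1.0) - 0.001 * production

# Hypothetical scenario values for inputs that cannot be known in advance.
oil_scenarios = {"oil_low": 40.0, "oil_flat": 60.0, "oil_high": 80.0}
usd_scenarios = {"usd_low": 0.9, "usd_high": 1.1}
production_estimate = 350.0  # the one input assumed to be reliably forecast

# One forecast per combination of scenarios.
forecasts = {}
for (oil_name, oil), (usd_name, usd) in product(oil_scenarios.items(), usd_scenarios.items()):
    forecasts[(oil_name, usd_name)] = predict_price(oil, usd, production_estimate)

# The spread across scenarios gives a rough range instead of a single number.
low, high = min(forecasts.values()), max(forecasts.values())
```

Three oil cases times two USD cases is already six forecasts, and every extra uncertain input multiplies that again, which is the explosion I was worried about.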


(gram) #12

Question you might have the answer to.
What do I pass to the model for ‘cat_flds=’ when none of my variables are categorical?
What if they’re all continuous?
I’ve tried putting ‘None’ and ‘False’ in here, and just trying to leave ‘cat_flds’ off.


(Sam Lloyd) #13

It’s sort of a bug… Use this instead, and pass an empty DataFrame in:
ColumnarDataset.from_data_frames(pd.DataFrame(), df)
I might look into whether it’s worth putting in a PR for this.


(gram) #14

so ‘cat_flds=pd.DataFrame()’?


(Sam Lloyd) #15

yes, correct. You just want to pass an empty object, which most importantly has no columns

This is what was failing before - it’s trying to drop no columns

@classmethod
def from_data_frame(cls, df, cat_flds, y=None, is_reg=True, is_multi=False):
    return cls.from_data_frames(df[cat_flds], df.drop(cat_flds, axis=1), y, is_reg, is_multi)

Now we bypass that method and go straight to from_data_frames

@classmethod
def from_data_frames(cls, df_cat, df_cont, y=None, is_reg=True, is_multi=False):
    cat_cols = [c.values for n,c in df_cat.items()]
    cont_cols = [c.values for n,c in df_cont.items()]
    return cls(cat_cols, cont_cols, y, is_reg, is_multi)
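
To see why an empty DataFrame works, here is a standalone sketch that copies just the column-extraction lines from the method above. It assumes pandas, and `split_columns` is a hypothetical name, not part of the library. The empty frame goes in the first (df_cat) position, following the signature above.

```python
import pandas as pd

def split_columns(df_cat, df_cont):
    """Same column extraction as in from_data_frames above, without the class."""
    cat_cols = [c.values for n, c in df_cat.items()]
    cont_cols = [c.values for n, c in df_cont.items()]
    return cat_cols, cont_cols

df = pd.DataFrame({"x": [0.1, 0.2], "y": [1.5, 2.5]})  # all-continuous toy data

# An empty DataFrame has no columns, so the categorical list comes out empty.
cat_cols, cont_cols = split_columns(pd.DataFrame(), df)
```

Here `cat_cols` is an empty list and `cont_cols` holds one array per continuous column.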

(gram) #16

Is this a change within the FastAI code?

I’m too much of a newb to know where this goes.

I’ve been focused on modifying the lesson3-rossman notebook.

TL;DR - Does it matter what order columns are in and what they’re named?

Maybe you can answer another question, (most of my posts go unanswered in here so I’ll just keep asking you things till you want to ignore them).
QUESTION: With column data such as in the rossman notebook, does the model know the names of the columns? Does it make use of the fact one column is next to another? I’m curious how the model puts the columns in context. Could you just put the columns in any order and it will find the correct relations?
I’m modifying the rossman notebook to look at atoms of elements in compounds and each atom has an X, Y, Z, number. I’ll try to make it so the model isn’t looking for a time sequence.
Anyways, That’s 80 atoms and each X, Y, and Z is in its own cell with Atom_1_X, Atom_2_X, etc.
I don’t remember seeing anything like this in the course.


(Sam Lloyd) #17

It isn’t a change, it’s just using a different method. It’s really useful to look at the source code of the library, particularly if you’re not sure how something is working (it’s all here)

It makes no difference what order you put the columns in. The model only receives the values, not the column names, and the neural network learns the relationships between columns.
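
A small numpy sketch of why the order is irrelevant for a fully connected first layer: each input column has its own row of weights, so reordering the columns is the same as reordering those rows. The shapes and random values here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 rows, 3 input columns
W = rng.normal(size=(3, 2))   # first-layer weights: one row per input column
b = rng.normal(size=2)

perm = np.array([2, 0, 1])    # an arbitrary reordering of the columns

out_original = X @ W + b
# Reordering the columns is matched by reordering the weight rows.
out_permuted = X[:, perm] @ W[perm] + b

same = np.allclose(out_original, out_permuted)
```

The one thing that does matter is using the same column order at prediction time as at training time.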

80 atoms and each X, Y, and Z is in its own cell with Atom_1_X, Atom_2_X, etc.

Sounds interesting! Hope that it works


(gram) #18

Thanks for the responses.
Very interesting that I can do chemical engineering by chucking it all into the model in random column order. What an age we live in.

More questions, and like I said, feel free to opt out of the questioning whenever…
After much gnashing of teeth I got what looks like some good predictions (that I haven’t submitted yet) for this Kaggle competition https://www.kaggle.com/c/nomad2018-predict-transparent-conductors/overview
(The predictions look like random numbers in a good range but the exp_rmspe seemed pretty stuck on 1.67 for some reason despite lots of training. I’ll keep tinkering.)

QUESTION 1: It requires two different fields be predicted. Can a model predict values for two different columns? I don’t know that I’ve ever seen a model do that. Two separate Y hats, I think it’s called?

QUESTION 2: I modified the course 1 lesson3 rossman notebook to train on this data, but I’m not sure whether the md = ColumnarModelData.from_data_frame() loader is an RNN or whether something else in this notebook made it an RNN. I ask because the transparent conductors have nothing to do with any time sequence. I looked around for docs on ColumnarModelData.from_data_frame and can’t tell whether I’m supposed to tell it that it is or is not looking at time-series data, or whether it NEEDS TO KNOW at all.
I watched all the courses and don’t recall csv data in a lesson like this (not time series).


(Tony Travers) #19

Hi

WRT Question 1. I am also trying to predict multiple values. I am interested in predicting (for example) Rossmann sales data for Day1, Day2 and Day3. Looking at the code you cannot simply have an array of [Day1, Day2 , Day3] in the “Sales” variable “df, y, nas, mapper = proc_df(joined_samp, ‘Sales’, do_scale=True)”. The fastai code seems to think that you are using category data (I think this might need a modification but not sure how yet). You will also need a metric that will take the output and compare PredictedDay1, PredictedDay2 and PredictedDay3 against actualDay1, actualDay2 and actualDay3.

You would also need to augment your day data for each store to include the extra two days.

I don’t think there is any problem theoretically with having 3 outputs (just a final array of 3), but there seem to be a couple of code issues in getting it to work. I am planning to give it a go soon; if I have any luck I will report back.
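
For the metric part, something like the following might do. It is a sketch in numpy, assuming a percentage-error metric in the spirit of the notebook's; `rmspe_per_column` is a hypothetical helper and the numbers are made up.

```python
import numpy as np

def rmspe_per_column(y_pred, y_true):
    """Root mean squared percentage error, computed separately for each output column."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    pct_err = (y_true - y_pred) / y_true  # assumes no zero targets
    return np.sqrt((pct_err ** 2).mean(axis=0))

# Rows are samples; columns are Day1, Day2, Day3.
actual = np.array([[100.0, 200.0, 400.0],
                   [100.0, 200.0, 400.0]])
predicted = np.array([[110.0, 200.0, 300.0],
                      [90.0, 200.0, 500.0]])

per_day = rmspe_per_column(predicted, actual)  # one score per output column
overall = per_day.mean()                       # single number to monitor
```

Keeping the per-column scores around makes it easy to see whether one of the days is much harder to predict than the others.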


(Sam Lloyd) #20

Have a look at this gist showing how you can implement it https://github.com/sjdlloyd/gistymcgistgist/blob/master/multiColumnarLearnerExample.ipynb

It’s not an RNN, it’s a plain NN. The time series aspect is covered by providing a rolling window over some of the variables
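
A toy illustration of the rolling-window idea, assuming pandas. The data and the `Promo_bw3` column name are made up, and the window length of 3 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04",
                            "2015-01-01", "2015-01-02", "2015-01-03"]),
    "Promo": [1, 0, 1, 1, 0, 0, 1],
}).sort_values(["Store", "Date"])

# Backward-looking window: promo days in the current and previous two rows, per store.
df["Promo_bw3"] = (
    df.groupby("Store")["Promo"]
      .transform(lambda s: s.rolling(3, min_periods=1).sum())
)
```

Each row now carries a summary of its own recent history, so a plain feed-forward network gets some time context without any recurrence.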

@tonyt
You might be interested in the gist as well