Drowning in "Rossmann" (Structured and time series data)

Please help me understand “cross store variance”.

I am trying to apply lessons from the Rossmann notebook to a model of my own. It is very different in setup and the data has very different dimensions. Looking at the lesson3-rossmann notebook is helping me a lot. Trying to roll my own is forcing me to learn a lot about the Python stack (pandas, numpy, matplotlib, fastai, …). Now I am halfway in and I find myself wanting a little help keeping afloat in a sea of new insights and knowhow.

So; could you help me visualize what the Rossmann model (in the nb) is actually doing, from the perspective of the data?

Here’s how I see it now. The model is trained on lots of X’s in a (daily) time series. Here’s my ASCII-art viz:

day  ; storeID; storeProperties; globalProperties; 
1 jan;       4; bla;blabla;12;a; yak;5;4;3;yakyak; 
2 jan;       4; bla;blabla;88;a; yak;5;4;3;yakYAK; 
1 jan;       6; blb;meh;22;a;    yak;5;4;3;yakyak; 
2 jan;       6; blb;mehhh;75;a;  yak;5;4;3;yakYAK; 

Never mind the redundancy here. The trainee likes it like that.

We are training “supervised”, so we must also supply our Y’s (Sales). We put them (initially) in the same series. Easy enough, one store-day = one Sales number:

> day  ; storeID; storeProperties; globalProperties; Sales
> 1 jan;       4; bla;blabla;12;a; yak;5;4;3;yakyak; 1000
> 2 jan;       4; bla;blabla;88;a; yak;5;4;3;yakYAK; 1200
> 1 jan;       6; blb;meh;22;a;    yak;5;4;3;yakyak; 3300
> 2 jan;       6; blb;mehhh;75;a;  yak;5;4;3;yakYAK; 3100

We are expected to predict Sales for each store, based on the per-store data. RIGHT?!?!?
Question: does our model learn to consider properties not related to the store in question? Can, for instance, the properties for store #4 affect the Sales in store #6 (in our model, at the moment of inference)?

For the model I am building, this “cross store variance” is very relevant: read my comments below.


For the model I am building, “cross store variance” is very relevant: I am attempting to model the hydrology of the river Meuse.
X’s are rainfall stats (on a long/lat grid of ca. 160 squares, total mm per day).
Y = river flow (in m3/s) at one (or more) downstream measuring station(s), measured daily. Here is a crude viz for the “rainiest” day in my data (2010-08-26 was a bad day to live in Dortmund).

[image: rainfall map for 2010-08-26 over the Meuse catchment]
The red dots are measuring stations in The Netherlands. The cross indicates the river’s source.

The river flow at each station is influenced by many (but not all) squares on this map, and least of all by Dortmund, which drains into the Rhine. I want my model to learn that quantitatively.

Here is the data I am setting up for this model. Let’s say I want to predict Y = stat8_flow

day  ; grid1_rain; grid2_rain; ...; grid100_rain; stat1_flow; stat2_flow; ...; stat8_flow
1 jan; 2         ; 4         ; ...; 200         ; 101       ; 102       ; ...; 108
2 jan; 1         ; 12        ; ...;  30         ; 131       ; 182       ; ...;  78
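
If the raw measurements arrive as one row per (day, grid square) and one row per (day, station), something like pandas pivot can build that wide layout. A sketch with made-up frame and column names (rain_long and flow_long are my assumptions about how the raw data might look):

```python
import pandas as pd

# Hypothetical long-format inputs: one row per (day, grid square) / (day, station).
rain_long = pd.DataFrame({
    'day':  ['1 jan', '1 jan', '1 jan', '2 jan', '2 jan', '2 jan'],
    'grid': [1, 2, 100, 1, 2, 100],
    'rain': [2, 4, 200, 1, 12, 30],
})
flow_long = pd.DataFrame({
    'day':  ['1 jan', '1 jan', '2 jan', '2 jan'],
    'stat': [1, 8, 1, 8],
    'flow': [101, 108, 131, 78],
})

# Pivot to wide format: one row per day, one column per grid square / station.
rain_wide = (rain_long.pivot(index='day', columns='grid', values='rain')
                      .add_prefix('grid').add_suffix('_rain'))
flow_wide = (flow_long.pivot(index='day', columns='stat', values='flow')
                      .add_prefix('stat').add_suffix('_flow'))

# Join on the day index to get the single wide table sketched above.
wide = rain_wide.join(flow_wide)
print(wide)
```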

I have put all the relevant X’s for one day into the same row. With 160+ X’s, those rows get very wide very quickly.

Notice how the geo-spatial relations between grid squares have been removed. I am not sure if that is a problem, but I am also not sure how to present the data differently.
Following the Rossmann example, I would create a table with highly redundant data. Merging 160 rain measurements per day into a date-indexed table will produce a lot of NAs that need to be filled with … with what, exactly?

day  ; grid; rain ; stat; flow
1 jan; 1   ; 2    ; NA  ; NA
1 jan; 2   ; 4    ; NA  ; NA
1 jan; ... ; ...  ; NA  ; NA
1 jan; 100 ; 200  ; NA  ; NA
1 jan; NA  ; NA   ; 1   ; 101
1 jan; NA  ; NA   ; 2   ; 102
1 jan; NA  ; NA   ; ... ; ...
1 jan; NA  ; NA   ; 8   ; 108
2 jan; ..........etc.........
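
For completeness, that long, NA-heavy table is what falls out of simply concatenating a rain frame and a flow frame. A self-contained sketch (frame and column names are my own); the fill question at the end is exactly the open one:

```python
import pandas as pd

# Hypothetical long-format pieces, as in the table above (values invented).
rain_long = pd.DataFrame({'day': ['1 jan', '1 jan'], 'grid': [1, 2], 'rain': [2, 4]})
flow_long = pd.DataFrame({'day': ['1 jan', '1 jan'], 'stat': [1, 8], 'flow': [101, 108]})

# Stacking them gives the redundant table: rain rows get NaN in stat/flow,
# flow rows get NaN in grid/rain.
long_table = pd.concat([rain_long, flow_long], ignore_index=True, sort=False)
print(long_table)

# The open question is what to fill those NaNs with. A constant or a column
# median are common defaults, but neither is obviously meaningful here, e.g.:
# long_table = long_table.fillna({'rain': 0, 'flow': long_table['flow'].median()})
```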

I am considering presenting the 2D data (geo-spatial rain measurements) as 2D arrays, and the flow measurements as 1D arrays. Both could be stored as cell values in a pandas DataFrame:

day  ; rain                  ; flow
1 jan; [[ 2,  4, ...,  18],  ; [101,102, ..., 108]
        [ 0,  0, ...,  29]
         ...
        [13, 15, ..., 200]]
2 jan; ....................etc...............
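
For what it’s worth, pandas will happily hold NumPy arrays as cell values in an object-dtype column, so the layout above is at least constructible. A minimal sketch with invented shapes (whether the fastai tabular pipeline accepts such columns is a separate question):

```python
import numpy as np
import pandas as pd

# One row per day; each 'rain' cell holds a 2D grid of rainfall values and
# each 'flow' cell a 1D vector of station flows. Shapes and values are invented.
days = pd.to_datetime(['2010-01-01', '2010-01-02'])
rain_grids = [np.random.rand(10, 16) * 50 for _ in days]   # 10 x 16 ~ 160 squares
flow_vecs  = [np.random.rand(8) * 200  for _ in days]      # 8 stations

df = pd.DataFrame({'rain': rain_grids, 'flow': flow_vecs}, index=days)
print(df.dtypes)                    # both columns are object dtype
print(df['rain'].iloc[0].shape,     # (10, 16)
      df['flow'].iloc[0].shape)     # (8,)
```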

But would the model be able to handle data with such complicated dimensions?

Why don’t you feed the data directly to a deep learning model, for instance an LSTM network? It can handle the whole dataset. Do you really need embeddings?
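
In case that route is worth exploring, here is a bare-bones PyTorch sketch (my own, not from the course) of what feeding the daily rain grid to an LSTM could look like. The grid is simply flattened to 160 inputs per day; that choice, the class name, and all the sizes are assumptions.

```python
import torch
import torch.nn as nn

class FlowLSTM(nn.Module):
    """Toy sequence model: a window of daily rain grids -> flow at 8 stations."""
    def __init__(self, n_grid=160, n_stations=8, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_grid, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_stations)

    def forward(self, x):                 # x: (batch, days, n_grid)
        out, _ = self.lstm(x)             # out: (batch, days, hidden)
        return self.head(out[:, -1, :])   # predict flow from the last day's state

# Fake batch: 4 samples, each a 30-day rain history over 160 grid squares.
x = torch.rand(4, 30, 160)
model = FlowLSTM()
print(model(x).shape)                     # torch.Size([4, 8])
```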

In courses/dl1/lesson3-rossman.ipynb, Jeremy has the following line of code to convert the categorical columns to the pandas category dtype:

for v in cat_vars: joined[v] = joined[v].astype('category').cat.as_ordered()

Does anyone understand why we need to include ‘as_ordered()’?

I can’t see a reason for this step.


Sets the Categorical to be ordered?
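
For what it’s worth, here is a small standalone illustration (my own, not from the notebook) of what the flag changes: the categories and their integer codes stay the same, but min/max and comparisons become legal on the column.

```python
import pandas as pd

s = pd.Series(['low', 'high', 'medium', 'low']).astype('category')
print(s.cat.ordered)      # False: a plain, unordered Categorical
# s.min()                 # would raise: min/max need an ordered Categorical

s = s.cat.as_ordered()    # same categories and codes, now flagged as ordered
print(s.cat.ordered)      # True
print(s.min(), s.max())   # 'high' 'medium' (order is alphabetical here unless
                          # changed with s.cat.reorder_categories)
print(s > 'high')         # comparisons only work on ordered Categoricals
```

As far as the embedding layers are concerned, only the integer category codes matter, so my understanding is that the practical effect is mainly in what pandas lets you do with the column afterwards.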

Does that mean we are kind of giving more weight to the categorical values on a first-come, first-served basis?