Dataloader for Rossmann RNN model

dangraf · August 14, 2018, 1:37pm

Hello!
I’m trying to use the fastai framework to create an RNN model for the Rossmann data (even though it was said earlier that this gives poor results).
But I’m having trouble getting the dataloader to work properly with categorical data together with continuous data. I would like to use embedding layers for the categorical data, then feed all the data into an RNN or GRU network, and finally predict the result.

I’ve tried using “ColumnarModelData.from_data_frame” with the batch size set to 256, and I reshape the matrix to (64, 4, num_inputs) inside the forward function. The axes then need to be swapped so the data has the correct format for the RNN: (4, 64, num_inputs).
The problem with this approach is that it fails when the batch size is not evenly divisible by 4. I’m also not sure I have the dimensions of the matrix right, or how each value maps to the others after the reshape.
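
To make the reshape described above concrete, here is a minimal NumPy sketch. The names `to_rnn_input` and `seq_len` are illustrative and not part of fastai, and dropping the leftover rows is only one possible way to handle a batch whose size is not a multiple of 4 (padding would be another).

```python
import numpy as np

def to_rnn_input(batch, seq_len=4):
    """Reshape a flat (rows, num_inputs) batch to (seq_len, rows // seq_len, num_inputs)."""
    rows, num_inputs = batch.shape
    usable = (rows // seq_len) * seq_len   # drop rows that do not fill a whole sequence
    chunks = batch[:usable].reshape(-1, seq_len, num_inputs)   # (n_seq, seq_len, num_inputs)
    return chunks.transpose(1, 0, 2)       # (seq_len, n_seq, num_inputs)

full = np.arange(256 * 3).reshape(256, 3)  # a full batch: 256 rows, 3 inputs
last = np.arange(250 * 3).reshape(250, 3)  # a short final batch

out_full = to_rnn_input(full)
out_last = to_rnn_input(last)
print(out_full.shape)   # (4, 64, 3)
print(out_last.shape)   # (4, 62, 3)

# Rows 0..3 of the flat batch become time steps 0..3 of sequence 0.
assert (out_full[:, 0, :] == full[:4]).all()
```

This mapping only makes sense if consecutive rows of the batch really are consecutive time steps of the same store, which a shuffling dataloader does not guarantee.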

Another idea is to stack all the data together (ColumnarModelData.from_arrays), but then I need to manually keep track of the column indexes of the categorical and continuous columns.

Is there an easy way to accomplish this?

mnpinto · August 14, 2018, 2:47pm

The approach of reshaping the batch seems weird to me. Maybe you can create sequences from the input data and try to predict the value of the next time-step as the target. Then you could apply the embedding layers to each time-step of the sequence and feed the resulting sequence of activations to the RNN layer.
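
A rough sketch of the "embedding per time-step" idea, using plain NumPy so the shapes are easy to follow. The embedding table here is random and stands in for a learned one (in PyTorch it would be an nn.Embedding); all sizes and names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, batch, n_cont = 4, 3, 2
n_categories, emb_dim = 7, 5   # e.g. 7 weekdays mapped to 5-d vectors

# Stand-in for a learned embedding table: one row per category.
emb_table = rng.normal(size=(n_categories, emb_dim))

cat_idx = rng.integers(0, n_categories, size=(seq_len, batch))  # one category per time step
cont = rng.normal(size=(seq_len, batch, n_cont))                # continuous features

embedded = emb_table[cat_idx]                          # lookup: (seq_len, batch, emb_dim)
rnn_input = np.concatenate([embedded, cont], axis=2)   # (seq_len, batch, emb_dim + n_cont)
print(rnn_input.shape)   # (4, 3, 7)
```

The resulting (seq_len, batch, features) layout is the one PyTorch's nn.RNN and nn.GRU expect by default (batch_first=False).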

dangraf · August 14, 2018, 3:39pm

Thanks for trying to help, but I’m not able to split the data up into chunks.

I’ve followed the Rossmann notebook up to the point where the feature maps are applied and the mapper is used to create the dataframes for the dataloader.

Here I try to split up the data to use previous and current row of data to predict the result:

xs = df.values.reshape(-1,2,len(df.columns))
xs_test = df_test.values.reshape(-1,2,len(df_test.columns))
ys = yl.reshape(-1,2)
val_idx = list(range(int(len(xs) * 0.75), len(xs)))

But the dataloader complains:

md2 = ColumnarModelData.from_arrays('.', val_idxs =val_idx,
                                   xs = train,
                                   y = ys,
                                   test_xs = xs_test,
                                   bs=256)

It gives an error:
IndexError: boolean index did not match indexed array along dimension 0; dimension is 422169 but corresponding boolean dimension is 1017209

I do not understand this error and there is not much documentation for the dataloader.
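
For what it's worth, NumPy raises this kind of IndexError whenever a boolean mask is applied to an array whose first dimension has a different length than the mask. The sketch below reproduces that situation with toy sizes. It assumes, without confirmation from the thread or the fastai docs, that the loader builds one mask from the length of xs and applies it to y as well. Under that assumption the numbers in the traceback would be consistent with `train` (passed as xs instead of the reshaped `xs`) having 1017209 rows while `ys`, halved by reshape(-1, 2), has 422169.

```python
import numpy as np

xs = np.zeros((10, 3))             # 10 rows of inputs
ys = np.zeros(10).reshape(-1, 2)   # reshaping the targets leaves only 5 rows

val_idx = [8, 9]
mask = np.zeros(len(xs), dtype=bool)   # assumption: the mask is sized from xs
mask[val_idx] = True

val_xs = xs[mask]        # fine: 10 rows, 10 mask entries
try:
    val_ys = ys[mask]    # 5 rows vs. 10 mask entries
    mismatch = None
except IndexError as err:
    mismatch = err
    print(err)           # boolean index did not match ... (exact wording depends on NumPy version)
```

If that is what happens, passing inputs and targets with the same number of rows should at least get past this particular error.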

mnpinto · August 14, 2018, 5:10pm

I’m not sure how you are trying to split the data, but remember that you always want the value from the previous time-step. The pandas shift function may be useful for that purpose. You can also try adding columns to the dataframe for the previous day’s sales or last week’s sales. But then, to make test predictions, you have to predict one day at a time and use the previous prediction as the input for the next, so errors may accumulate.
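
A small sketch of the shift idea on a toy dataframe. The frame is invented for illustration (it only borrows the Store, Date and Sales column names), and `SalesPrevDay` is a made-up name for the new column.

```python
import pandas as pd

# Toy stand-in for the Rossmann frame: two stores, three days each.
df = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"] * 2),
    "Sales": [10, 12, 11, 20, 18, 25],
})

df = df.sort_values(["Store", "Date"])
# Shift within each store so one store's sales never leak into another's.
df["SalesPrevDay"] = df.groupby("Store")["Sales"].shift(1)
print(df)
```

The first day of each store has no previous value, so it ends up as NaN and needs to be dropped or filled before training.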

You should probably try to do some tests on a much smaller and simpler dataset. You can create an artificial time-series or maybe use the data from this Kaggle playground competition: https://www.kaggle.com/c/demand-forecasting-kernels-only. It only has date, item and store columns.
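
As one possible starting point for such a test, the sketch below generates an artificial series (a noisy sine wave) and cuts it into overlapping windows, with the value following each window as the target. `make_windows` is a hypothetical helper, and the window length of 8 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

# Artificial series: a sine wave plus a little noise.
t = np.arange(200)
series = np.sin(2 * np.pi * t / 25) + 0.1 * rng.normal(size=t.size)

def make_windows(values, seq_len):
    """Return overlapping windows of length seq_len and the value that follows each."""
    n = len(values) - seq_len
    xs = np.stack([values[i:i + seq_len] for i in range(n)])
    ys = values[seq_len:]
    return xs, ys

xs, ys = make_windows(series, seq_len=8)
print(xs.shape, ys.shape)   # (192, 8) (192,)

# The target of window i is the first value after that window.
assert ys[0] == series[8]
```

Here xs is a plain 2-D array with one window per row, the same layout as the character chunks in the Lesson 6 notebooks.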

dangraf · August 14, 2018, 8:35pm

I can try to explain what I’m trying to do.
When looking at the “multi output” example, the text is split into chunks of 8 characters.
In the notebook, the “xs” vector, which is the input to the dataloader, has shape (75111, 8).
Here is a simplified example:

org text = [1,2,3,4,5,6,7,8]
split text = [[1,2],
              [3,4],
              [5,6],
              [7,8]]

where each number corresponds to a letter.

We can see this as a time series with 1 number for each step in time. We then chunk it into groups of 2 consecutive characters (no overlap).
Example code:

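# generate dummy data
import numpy as np

cs = 2                                      # chunk size
datalen = 32
xs = np.arange(0, datalen).reshape(-1, cs)  # shape (16, 2)
ys = np.arange(0, datalen // cs)            # one target per chunk
val_idx = np.arange(len(xs) - 4, len(xs))   # last 4 chunks for validation
val_idx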

# create the dataloader and test it

md = ColumnarModelData.from_arrays('.', val_idx, xs, ys, bs=4)
*x, y = next(iter(md.trn_dl))
print(x)

The code above works fine.

In the Rossmann example we have a row of data for each timestep.

We pretend this is the original matrix:

row1 = 1,2,3,4
row2 = 5,6,7,8
row3 = 9,10,11,12
row4 = 13,14,15,16
row5 = 17,18,19,20
row6 = 21,22,23,24
shape = (6,4)

I’m trying to replace each “character” with a row of data.

step1 = [row1,row2] = [[1,2,3,4],[5,6,7,8]]
step2 = [row3,row4] = [[9,10,11,12],[13,14,15,16]]
step3 = [row5,row6] = [[17,18,19,20],[21,22,23,24]]

In this example I get a matrix of shape (3,2,4).
I can’t figure out how this can be done with the dataloader.

cs = 2
datalen = 64
columns = 4
xs = np.arange(0, datalen).reshape(-1, cs, columns)  # shape (8, 2, 4)
ys = np.arange(0, len(xs))                           # one target per chunk
val_idx = np.arange(len(xs) - 4, len(xs))

The code breaks when trying to create the dataloader

md = ColumnarModelData.from_arrays('.', val_idx, xs, ys, bs=4)
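
One possible workaround, sketched under the assumption that the loader only handles 2-D arrays with one row per sample (not verified against the fastai source): flatten each chunk into a single row before building the loader, then restore the time axis inside the forward function. Only NumPy is used here, and `restore_time_axis` is a hypothetical helper.

```python
import numpy as np

cs, columns = 2, 4
rows = np.arange(1, 25).reshape(6, columns)        # the 6 x 4 matrix from the example

chunks = rows.reshape(-1, cs, columns)             # (3, 2, 4): step1, step2, step3
flat = chunks.reshape(len(chunks), cs * columns)   # (3, 8): one 2-D row per chunk

def restore_time_axis(batch, cs, columns):
    """Turn a flat (batch, cs * columns) array into (cs, batch, columns)."""
    return batch.reshape(-1, cs, columns).transpose(1, 0, 2)

rnn_input = restore_time_axis(flat, cs, columns)
print(rnn_input.shape)   # (2, 3, 4)
print(rnn_input[:, 0])   # step1: row1 followed by row2
```

The flat (3, 8) array has the same layout as the character example, so in principle it could be passed where the 2-D xs was passed before.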

I’ve looked at the link you gave me, and the one that gave the most clues was this notebook:
link to timeseries
But the author seems to just add more data to each row. If I want to use an RNN, I still need to split and transpose the data inside my forward function, which is a bit messy.

Hope this explains better what I’m trying to do.

mnpinto · August 15, 2018, 7:02pm

I’m starting to study RNNs this week; my main interest is time-series. I created this notebook today with some experiments on a generated time-series and a simple RNN based on the Char3Model of Lesson 6. I will now keep moving forward with Lesson 6 and may add some other notebooks later. The Part II lessons are also very useful because they introduce fastai.text.

dangraf · August 16, 2018, 7:42pm

Thanks! I’m currently working on a solution, since there does not seem to be a straightforward one. I will post my GitHub repo here as soon as I have something working.