Predicting on a single row with Rossmann Data


(Johnny Lee) #1

Hi, I’m trying to figure out how to do a single prediction in the Rossmann notebook. I modified the code from single image prediction, but I’m getting the following error.

df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
test = df.iloc[0].values
m.predict_array(test[None])


TypeError Traceback (most recent call last)
in ()
1 test = df.iloc[0].values
----> 2 m.predict_array(test[None])

~/fastai/courses/dl1/fastai/learner.py in predict_array(self, arr)
    328     def predict_array(self, arr):
    329         self.model.eval()
--> 330         return to_np(self.model(to_gpu(V(T(arr)))))
    331 
    332     def TTA(self, n_aug=4, is_test=False):

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

TypeError: forward() missing 1 required positional argument: 'x_cont'


(Anders) #2

I’m wondering about this as well.


#3

I’ve been trying to do this as well. I’ve made little progress. If someone can help us out by pointing us in the right direction, it’d be great.

Based on how to perform a prediction on a single image, I came up with this:

single_row = df_test.head(1).as_matrix()
m.predict_array(single_row[None])

And I get an error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-98-0319cad5735c> in <module>()
      1 single_row = df_test.head(1).as_matrix()
----> 2 m.predict_array(single_row[None])

~/fastai-learn/rossmann-store-sales/fastai/learner.py in predict_array(self, arr)
    366     def predict_array(self, arr):
    367         self.model.eval()
--> 368         return to_np(self.model(to_gpu(V(T(arr)))))
    369 
    370     def TTA(self, n_aug=4, is_test=False):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

TypeError: forward() missing 1 required positional argument: 'x_cont'

(Taylor Harrison) #4

It looks like the predict_array function doesn’t work in this case because of the way the MixedInputModel is defined: its forward function takes two arguments, one for the categorical variables and one for the continuous variables. To get a prediction for a single row, use the following code:

cat_values = [[3, 2, 2]]
cont_values = [[6.4, 78.4, 29.8]]
learn.model.eval()
pred = to_np(learn.model(to_gpu(V(T(np.array(cat_values)))), to_gpu(V(T(np.array(cont_values))))))

Here cat_values is a 2D array that holds the category index of each categorical variable, one row per record (make sure the columns are in the same order as in your pandas dataframe), and cont_values is the same thing for the continuous variables. Your single-number prediction should be in the pred NumPy array. You can also pass in multiple rows to get multiple predictions at once.
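The failure mode can be reproduced without fastai. This is a minimal sketch, assuming a hypothetical ToyMixedInputModel class that imitates the real model only in the two-argument shape of forward; the helper call_with_single_array is likewise made up for illustration.

```python
class ToyMixedInputModel:
    """Hypothetical stand-in: only the two-argument forward() mirrors the real model."""

    def forward(self, x_cat, x_cont):
        # One toy "prediction" per row: categorical codes plus continuous values.
        return [sum(c) + sum(n) for c, n in zip(x_cat, x_cont)]

    def __call__(self, *inputs):
        # Like torch.nn.Module, forward every positional input to forward().
        return self.forward(*inputs)


def call_with_single_array(model, rows):
    """Imitate predict_array: hand the model a single array and report the outcome."""
    try:
        return model(rows)
    except TypeError as err:
        return f"TypeError: {err}"


toy = ToyMixedInputModel()

# One combined array, as predict_array would pass it: forward() lacks x_cont.
single_arg_result = call_with_single_array(toy, [[3, 2, 2, 6.4, 78.4, 29.8]])

# Two separate arrays, as in the snippet above: the call succeeds.
two_arg_result = toy([[3, 2, 2]], [[1.0, 2.0, 0.5]])

print(single_arg_result)
print(two_arg_result)
```

The first call fails for the same reason as the traceback at the top of the thread; the second succeeds because the categorical and continuous parts arrive as separate arguments.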

Side note: you can get the category indices of a categorical variable by looking at the df.column_name.cat.categories attribute, where column_name is the name of the column you’re interested in and df is the dataframe. This only works after the column has been set as categorical with df[column_name].astype('category').cat.as_ordered(). Just set the correct column in the cat_values array to the index of the category you want.


#5

That does not seem to work for me. Here is the code that I used:

cat_values = [[3, 2, 2]]
cont_values = [[6.4, 78.4, 29.8]]
m.model.eval()
pred = to_np(m.model(to_gpu(V(T(np.array(cat_values)))), to_gpu(V(T(np.array(cont_values))))))

I get the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-16-53cdaaaf6b19> in <module>()
----> 1 pred = to_np(m.model(to_gpu(V(T(np.array(cat_values)))), to_gpu(V(T(np.array(cont_values))))))

/net/vaosl01/opt/NFS/sw/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

/net/vaosl01/opt/NFS/sw/anaconda3/envs/fastai/lib/python3.6/site-packages/fastai/column_data.py in forward(self, x_cat, x_cont)
    112     def forward(self, x_cat, x_cont):
    113         if self.n_emb != 0:
--> 114             x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
    115             x = torch.cat(x, 1)
    116             x = self.emb_drop(x)

/net/vaosl01/opt/NFS/sw/anaconda3/envs/fastai/lib/python3.6/site-packages/fastai/column_data.py in <listcomp>(.0)
    112     def forward(self, x_cat, x_cont):
    113         if self.n_emb != 0:
--> 114             x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
    115             x = torch.cat(x, 1)
    116             x = self.emb_drop(x)

/net/vaosl01/opt/NFS/sw/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/autograd/variable.py in __getitem__(self, key)
     76                 return IndexSelect.apply(self, 0, key)
     77             # else fall through and raise an error in Index
---> 78         return Index.apply(self, key)
     79 
     80     def __setitem__(self, key, value):

/net/vaosl01/opt/NFS/sw/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py in forward(ctx, i, index)
     87             result = i.index(ctx.index)
     88         else:
---> 89             result = i.index(ctx.index)
     90             ctx.mark_shared_storage((i, result))
     91         return result

IndexError: index 3 is out of range for dimension 1 (of size 3)

Also, could you explain your answer a bit more? With reference to lesson 3, at the testing stage we have already built our model and our ColumnarModelData md, from which we have our learner m. The data that was passed to these was already processed using the proc_df function. I’m confused as to what we pass as a “single” row.

My idea is:

  1. Grab a random row(s) from the joined_test dataframe.
  2. Follow the same processing steps that were used to get the df_test dataframe, at which point we only have numbers.
  3. Pass this through the trained model and get a prediction.

Is that right/doable?
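One part of step 2 is scaling the continuous columns with statistics learned from the training data rather than from the new row. This is a library-free sketch of that idea only; fit_scaler and scale_row are hypothetical helpers, and the two training values are made up. It uses the population standard deviation, as scikit-learn's StandardScaler does.

```python
from statistics import mean, pstdev


def fit_scaler(train_rows, columns):
    """Record the training mean and population std of each continuous column."""
    stats = {}
    for col in columns:
        values = [row[col] for row in train_rows]
        # Fall back to 1.0 so a constant column does not cause division by zero.
        stats[col] = (mean(values), pstdev(values) or 1.0)
    return stats


def scale_row(row, stats):
    """Standardise one new row using the stored training statistics."""
    return {col: (row[col] - mu) / sd for col, (mu, sd) in stats.items()}


# Made-up training values for one continuous column.
train_rows = [{"CompetitionDistance": 100.0}, {"CompetitionDistance": 300.0}]
stats = fit_scaler(train_rows, ["CompetitionDistance"])

# A single new row is scaled with the training statistics, not its own.
new_row = {"CompetitionDistance": 400.0}
scaled = scale_row(new_row, stats)
print(scaled)
```

With a training mean of 200 and a standard deviation of 100, the new value 400 becomes 2.0. Scaling the single row by its own statistics would be meaningless, which is why the fitted statistics have to be kept from training.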


(Luke Byrne) #6

Hi Sean,

Have you had any progress with this?

Cheers,

Luke


(Taylor Harrison) #7

The values in my cat_values and cont_values arrays were only meant to be placeholders. cat_values is an array of vectors, and each vector has one component per categorical variable in your model. The Rossmann data has 22 categorical variables, so each vector needs 22 components; passing only 3 is what triggers the IndexError above. cont_values is the same thing for the continuous variables (16 in the Rossmann data to start with, but some get added because of NaN values, as I’ll explain below). So for the Rossmann data, you could grab a row from the test dataframe:

test_row = df_test.iloc[1]

then separate the categorical columns from the continuous columns:

cat_values = test_row[cat_vars].astype('int64').as_matrix()[None]
na_vars = [na + '_na' for na in nas.keys()]
cont_values = test_row[contin_vars+na_vars].astype('float64').as_matrix()[None]

Note that na_vars is necessary because when proc_df was run, it replaced the NaN values in each affected column with the column’s median and added a boolean column to mark the rows whose value was originally NaN (for example, a new boolean column CompetitionDistance_na is added if there is a null value in the CompetitionDistance column). Each of these counts as another continuous variable, so we can’t forget to include them in cont_values, or else we’ll see something along the lines of:

RuntimeError: running_mean should contain 16 elements not 18

Now that we’ve separated the categorical and continuous variables, we can pass them to the model as separate parameters:

learn.model.eval()
pred = to_np(learn.model(to_gpu(V(T(np.array(cat_values)))), to_gpu(V(T(np.array(cont_values))))))

Note that I’m no pandas or PyTorch master. There may be a much more concise way to do this.
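Both errors in this thread (the IndexError and the running_mean RuntimeError) come down to a row with the wrong number of columns. This is a small sketch of a check that could run before calling the model. check_row_widths is a hypothetical helper, and the expected counts are illustrative: 22 is taken from the post above, and 18 is inferred from the quoted error message.

```python
def check_row_widths(cat_rows, cont_rows, n_cat_expected, n_cont_expected):
    """Raise a readable error if either input has the wrong number of columns."""
    for name, rows, expected in (
        ("categorical", cat_rows, n_cat_expected),
        ("continuous", cont_rows, n_cont_expected),
    ):
        for i, row in enumerate(rows):
            if len(row) != expected:
                raise ValueError(
                    f"{name} row {i} has {len(row)} columns, model expects {expected}"
                )


# Illustrative counts (assumptions, see the lead-in above).
N_CAT, N_CONT = 22, 18

try:
    # The placeholder values from earlier in the thread: only 3 columns each.
    check_row_widths([[3, 2, 2]], [[6.4, 78.4, 29.8]], N_CAT, N_CONT)
    message = "ok"
except ValueError as err:
    message = str(err)

print(message)
```

In the notebook the expected counts would presumably come from len(cat_vars) and len(contin_vars + na_vars) rather than from hard-coded numbers.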

A word of warning: one thing that tripped me up was that the nas variable had items added by running proc_df on the test set (there were NaN values in columns of the test set for which the training set had no NaNs). Because the model is trained on the training set, it’s not equipped to handle those extra “_na” columns in the test set, so I had to be sure not to include them when getting my single-row prediction. I’m not really sure what to do in that case except to drop those added “_na” columns from the test set like I did. If anyone has any thoughts, I’d love to hear them.

Hope that helps, let me know how it goes!


(Nick) #8

Hi, I think this issue with additional _na columns has been fixed recently: https://github.com/fastai/fastai/pull/504
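Independently of what the linked pull request actually changes, the general idea of keeping the test columns in line with training can be sketched in plain Python. Everything here is an assumption for illustration: fill_missing is a hypothetical helper, train_na_dict stands in for the nas dictionary from training, and the column OtherFeature and all values are made up.

```python
def fill_missing(row, na_dict, fallback=0.0):
    """Fill gaps with training-time fill values and flag only training-time columns."""
    out = {}
    for col, value in row.items():
        if value is None:
            # Use the fill value recorded at training time if there is one,
            # otherwise a caller-chosen fallback.
            out[col] = na_dict.get(col, fallback)
        else:
            out[col] = value
    # Add "_na" flags only for columns that had gaps during training,
    # so the number of model inputs stays the same.
    for col in na_dict:
        out[col + "_na"] = row.get(col) is None
    return out


# Hypothetical fill value recorded while processing the training set.
train_na_dict = {"CompetitionDistance": 2330.0}

# A test row with a gap in a column that never had gaps during training.
test_row = {"CompetitionDistance": None, "OtherFeature": None}
filled = fill_missing(test_row, train_na_dict)
print(filled)
```

The gap in OtherFeature is filled but gets no flag column, so the row keeps the width the model was trained on. Whether a constant fallback is acceptable for such columns is a judgment call that depends on the data.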


(Nick) #9

Also, if you want to predict a single record from the test dataset, you can grab it directly from test_ds and use the predict function from model.py.
Something like this:

def predict_single_record(m, index):
    single_record = [np.array(i)[None] for i in m.data.test_ds[index]]
    return predict(m.model,[single_record])
predict_single_record(m,0)

But maybe there is a better way.


(Luke Byrne) #10

Nick,

That method works well.

I am now wondering if there is a way to set up a row of unseen data (i.e. data that comes in via an API for a prediction from the model) and do the same thing.

I know that you need to look up the embedding ids, so I wrote a function to do that. I am guessing that you would also need to get your continuous vars into the series, put everything in the same column order as the test_df, and then pass that along to the predict method.

pred_data = []
pred_columns = []

for cat_var in cat_vars:
    dictionary = dict(enumerate(training_df[cat_var].cat.categories))
    pred_columns.append(cat_var)
    inverse_dictionary = {v: k for k, v in dictionary.items()}
    pred_data.append(inverse_dictionary[last_row[cat_var]])

series = pd.Series(pred_data, index=pred_columns)
foo = pd.DataFrame([series])

One thing I notice with the above is that my cat var embedding ids are always 1 behind what the actual data should be. Do you know if the embedding matrices are 0-indexed?

I feel that I am so close to getting this incredibly useful DL method into a production stage for a use case I have, I just need to understand this final part of setting up a new row to create a prediction.

I appreciate everyone’s input on this forum so far.

Kind regards,

Luke


(Nick) #11

Take a look at the numericalize function in structured.py, which is called inside proc_df. It adds 1 to the category codes, so the ids start at 1 and 0 is left for missing values (pandas gives NaN the code -1):

if not is_numeric_dtype(col) and ( max_n_cat is None or col.nunique()>max_n_cat):
    df[name] = col.cat.codes+1
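The effect of that +1 on a lookup table like the one built a few posts up can be sketched without pandas. build_category_ids is a hypothetical helper and the category list is made up; the sketch assumes the list is in the same order as the training column's cat.categories.

```python
def build_category_ids(categories):
    """Map each known category to its 1-based id; 0 stays free for missing values."""
    return {value: code + 1 for code, value in enumerate(categories)}


# Made-up ordered categories for a StateHoliday-like column.
state_holiday_ids = build_category_ids(["0", "a", "b", "c"])

# The embedding table needs one extra row for id 0.
n_embedding_rows = len(state_holiday_ids) + 1

print(state_holiday_ids)
print(n_embedding_rows)
```

A plain dict(enumerate(...)) lookup starts at 0, so it lands one below the ids the model was trained on, which matches the off-by-one described above.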

(Nick) #12

Here is a complete example that someone may find useful:

from fastai.structured import *
from fastai.column_data import *

PATH='data/rossmann/'

table_names = ['train','store']
cat_vars = ['Store', 'DayOfWeek','StateHoliday','SchoolHoliday']
contin_vars = ['CompetitionDistance']

train, store = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]

joined_train = train.merge(store, how='inner', on="Store")
joined_train = joined_train[cat_vars + contin_vars + ['Sales']].copy()
for v in cat_vars: joined_train[v] = joined_train[v].astype('category').cat.as_ordered()
joined_train = joined_train[joined_train['Sales'] != 0]
df_train, y, nas, mapper = proc_df(joined_train,'Sales',do_scale=True)

yl = np.log(y).astype(np.float32)
idxs = get_cv_idxs(len(df_train))
cat_sz = [(c, len(joined_train[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]
n_contin = len(df_train.columns) - len(cat_vars)

md = ColumnarModelData.from_data_frame(PATH, idxs, df_train, yl, cat_vars, bs=256)

m = md.get_learner(emb_szs, n_contin, 0.04, 1, [1000,500], [0.001,0.01])

m.fit(1e-3, 1)

#Read test dataset
test = pd.read_csv(f'{PATH}test.csv', low_memory=False)

joined_test = test.merge(store,how='inner', on="Store")
joined_test = joined_test[cat_vars + contin_vars].copy()
apply_cats(joined_test, joined_train)
df_test, _, _, _ = proc_df(joined_test, None, do_scale=True, na_dict=nas, mapper=mapper)

# Single record
test_record = df_test.iloc[0]

cat = test_record[cat_vars].values.astype(np.int64)[None]
contin = test_record.drop(cat_vars).values.astype(np.float32)[None]

# Prediction
model = m.model
model.eval()
# The model was trained on log(Sales), so undo the log with exp
prediction = np.exp(to_np(model(V(cat), V(contin))))
print(f"Sales prediction: {prediction}")

(Theodoros Galanos) #13

Thanks for this. I have been having problems making this work for datasets without continuous variables.

The model trains well and predicts on the test dataset without problems. But the above code obviously brings us back to the ‘x_cont’ error. Is there a way to bypass this?


(Theodoros Galanos) #14

So I was able to successfully predict using the above code. I tested this on the first row of the df, i.e. df.iloc[0,:], and it gave a prediction that seems logical.

When I try to do the same on the second row though, and the others after it, I get an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-65-8d5362871fe4> in <module>()
----> 1 model(V(cat), V(cont))

~\AppData\Local\conda\conda\envs\fastai-cpu\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

~\AppData\Local\conda\conda\envs\fastai-cpu\lib\site-packages\fastai\column_data.py in forward(self, x_cat, x_cont)
    113     def forward(self, x_cat, x_cont):
    114         if self.n_emb != 0:
--> 115             x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
    116             x = torch.cat(x, 1)
    117             x = self.emb_drop(x)

~\AppData\Local\conda\conda\envs\fastai-cpu\lib\site-packages\fastai\column_data.py in <listcomp>(.0)
    113     def forward(self, x_cat, x_cont):
    114         if self.n_emb != 0:
--> 115             x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
    116             x = torch.cat(x, 1)
    117             x = self.emb_drop(x)

~\AppData\Local\conda\conda\envs\fastai-cpu\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

~\AppData\Local\conda\conda\envs\fastai-cpu\lib\site-packages\torch\nn\modules\sparse.py in forward(self, input)
    101             input, self.weight,
    102             padding_idx, self.max_norm, self.norm_type,
--> 103             self.scale_grad_by_freq, self.sparse
    104         )
    105 

~\AppData\Local\conda\conda\envs\fastai-cpu\lib\site-packages\torch\nn\_functions\thnn\sparse.py in forward(cls, ctx, indices, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
     55 
     56         if indices.dim() == 1:
---> 57             output = torch.index_select(weight, 0, indices)
     58         else:
     59             output = torch.index_select(weight, 0, indices.view(-1))

RuntimeError: index out of range at c:\miniconda2\conda-bld\pytorch-cpu_1519449358620\work\torch\lib\th\generic/THTensorMath.c:277

Not sure what this is. Of course, I should say that my initial setup might be wrong in itself. What I am not sure about is how to construct the model and the new df, and how to pass the df through the model. In this case, I passed the (unseen) data through proc_df in the same way we would for the testing dataset. However, I’m not sure if I should be preprocessing it that way. Even if I do, given that this data is unseen, how would the model know which weights to use for the categorical indices these values get?

What I am going to try next is to include the unseen data as a test or validation set. This way I am hoping that the preprocessing is consistent and that I can use the resulting df to predict after training. Is this the way I should be doing this? Has anyone else tried to predict on completely unseen data with this method?

Thanks in advance!

Kind regards,
Theodore.


(Theodoros Galanos) #15

Sorry about the spam, just an update on this. It seems that I have made it work. The issue is that the model expects exactly the same number of categories for the categorical variables. As it stands, and if this assessment is correct, it really limits the way that we can use models to predict on unseen data. However, there is a way out that provides some utility. The simple steps I followed, if anyone is interested in doing something similar, are:

  1. Find a structured dataset (duh). I have a set of calculated data for building performance from a parametric design model.
  2. Hallucinate new and unseen data. The hallucinated data I generated represent variations to the designs of the calculated data, but the variation can only be applied to the continuous variables. This means categorical variables have exactly the same unique values in both training and hallucinated data.
  3. Train the model in the typical way. Create a df_train and df_test out of the calculated data using proc_df. Use the same mapper to pre-process the unseen dataframe.
  4. Predict on the unseen data using the following code (where df_hallu is my processed hallucinated data):

#Single record - for some i
test_record = df_hallu.iloc[i,:]
cat_vars = ['in:WWRatio', 'in:GlazVLT', 'in:Orientation'] # my categorical variables in this case
cont_vars = ['in:Depth', 'in:Width', 'in:CeilingHeight'] # my continuous variables in this case
cat = test_record[cat_vars].values.astype(np.int64)[None]
cont = test_record[cont_vars].values.astype(np.float32)[None]
#Prediction
model = m.model
model.eval()
prediction = to_np(model(V(cat), V(cont)))
prediction = np.exp(prediction) # since we used log(y)
print(f"DA prediction: {prediction}")

#All records
hallucination_results = []
cat_vars = ['in:WWRatio', 'in:GlazVLT', 'in:Orientation']
cont_vars = ['in:Depth', 'in:Width', 'in:CeilingHeight']
model = m.model
model.eval()
for i, row in df_hallu.iterrows():
    test_record = row
    cat = test_record[cat_vars].values.astype(np.int64)[None]
    cont = test_record[cont_vars].values.astype(np.float32)[None]
    prediction = np.exp(to_np(model(V(cat), V(cont))))
    hallucination_results.append(prediction)

Hope this is helpful to someone. Also, if there is something I’ve missed or a simpler way to do this please let me know. Additionally, it would be extremely useful if the (assumed) limitation of same cardinality for categorical variables when predicting can be bypassed (I guess there is a chance it’s not really possible or useful to do so given the inner workings of the model).
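The "index out of range" error from the earlier post is what an embedding lookup raises when a categorical code is not a valid row of its table. This is a library-free sketch of a pre-flight check; find_out_of_range is a hypothetical helper and the table sizes are made up. With a real model the sizes could presumably be read from the embedding layers in m.model.embs, but that is an assumption.

```python
def find_out_of_range(cat_row, cat_vars, embedding_rows):
    """Return (column, code, table_size) for every code the table cannot hold."""
    problems = []
    for var, code in zip(cat_vars, cat_row):
        size = embedding_rows[var]
        # Valid embedding indices run from 0 to size - 1.
        if not 0 <= code < size:
            problems.append((var, code, size))
    return problems


cat_vars = ["in:WWRatio", "in:GlazVLT", "in:Orientation"]

# Made-up table sizes: number of training categories plus one.
embedding_rows = {"in:WWRatio": 4, "in:GlazVLT": 3, "in:Orientation": 5}

# The second code refers to a category the training data never contained.
problems = find_out_of_range([2, 3, 4], cat_vars, embedding_rows)
print(problems)
```

Running a check like this on each processed row names the offending column directly, which is easier to read than the low-level error from the tensor library.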


(Nick) #16

I could be wrong, but for me, it seems like the easiest way to deal with unseen categorical variables is to treat them as NA during the preprocessing.
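A minimal sketch of that idea, assuming id 0 is the slot for missing values, as the numericalize snippet in post #11 suggests. encode_record is a hypothetical helper, and the lookup tables and the incoming record are made up for illustration.

```python
def encode_record(record, cat_vars, category_ids):
    """Return embedding ids in cat_vars order; unseen or missing values become 0."""
    return [category_ids[var].get(record.get(var), 0) for var in cat_vars]


# Column order must match the order used when the model was trained.
cat_vars = ["Store", "DayOfWeek", "StateHoliday"]

# Made-up lookup tables built from the training categories (ids start at 1).
category_ids = {
    "Store": {1: 1, 2: 2, 3: 3},
    "DayOfWeek": {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7},
    "StateHoliday": {"0": 1, "a": 2},
}

# An incoming record in arbitrary key order, with an unseen StateHoliday value.
incoming = {"DayOfWeek": 5, "StateHoliday": "c", "Store": 2}
encoded = encode_record(incoming, cat_vars, category_ids)
print(encoded)
```

The unseen value "c" falls back to 0 instead of producing an out-of-range index. The prediction for such a row then relies on whatever the model learned for the missing-value slot, so its quality depends on how often missing values occurred in training.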


(Theodoros Galanos) #17

That is a good point Nick. After I did all this I remembered @jeremy clearly saying during this lesson that any new value of a categorical variable the model sees is labelled as ‘Unknown’. There might be a way out for the preprocessing step to happen at once through this although I’m guessing treating different categories as one wouldn’t be ideal for my test case.

If I can be honest, this was a toy example on a dataset that I did not develop. In my production workflows, when I do this I start backwards. I first develop the parametric model (i.e. a dataset without target values, only the features), I then select a stratified sample of that and simulate it to develop the training dataset. In this way, the data that were not simulated have the same exact unique values for each categorical variable.