A question about sales forecasting (3rd lesson) for a different competition. The validation is good but the prediction is bad

Hello. I have a problem with implementing the approach, which Jeremy used for the 3-rd lesson, for another competition: https://www.kaggle.com/c/competitive-data-science-predict-future-sales

I made a feature engineering and data preprocessing. Everything seems fine. The loss is falling, and the validation is good. But the prediction is terrible (like 100 times more terrible than the validation)

Here is the code:

from fastai.structured import *
from fastai.column_data import *
from sklearn.metrics import mean_squared_error
from math import sqrt
import pandas as pd
import numpy as np
import datetime
from dateutil import relativedelta


# reading all the data

PATH = 'data/avito/'

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, suffixes=("", suffix))

def rmse(y_pred, targ):
    return sqrt(mean_squared_error(targ, y_pred))

cat_vars = ["shop_id", "item_id", "item_category_id"]
dep = "item_cnt_day"

table_names = ['train', 'items', 'item_categories', 'shops', 'test']
tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]
train, items, item_categories, shops, test_table = tables

# no need a date. all dates will be grouped by its month
train = train.drop('date', axis=1)

# concatenating the tables
sales_table = join_df(train, items, "item_id", "item_id")
sales_table = join_df(sales_table, test_table, ["item_id", "shop_id"], ["item_id", "shop_id"])

# preprocessing sales
sales_table[dep] = sales_table[dep].astype(np.float32)
sales_table[dep] = sales_table[dep].fillna(0).astype('float32')
sales_table[dep][sales_table[dep] < 0] = 0
sales_table[dep][sales_table[dep] > 20] = 20
sales_table['item_price'][sales_table['item_price'] < 0] = 0

df = sales_table[cat_vars + [dep, "ID", "date_block_num", 'item_price']]

# dropping unnecessary data (we don't need data without ID)
df[dep] = df[dep].fillna(0).astype('float32')
df["ID"] = df["ID"].fillna(-1)
df = df[df.ID != -1]
df = df.reset_index(drop=True)

# putting prices into bins. these bins will be used in train and interpolated into test data
price_bins = [0, 100, 300, 500, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 100000]
price_labels = ['A1', 'A2', 'A3', 'A4', 'B1', 'B2', 'B3', 'B4', 'C1', 'C2', 'D']

train_df = df.groupby(['date_block_num','ID']+cat_vars, as_index=False).agg({'item_cnt_day':'sum', 'item_price':'mean'})
train_df['price_bin'] = pd.cut(train_df['item_price'], price_bins, labels=price_labels)
temp_price_id_buckets_df = train_df[['ID', 'price_bin']].drop_duplicates().reset_index(drop=True)

price_id_buckets_df = temp_price_id_buckets_df.groupby('ID').nth(0).reset_index()

# grouping train data into months (because we need the forecast for a month). each bin is the first date of each month

train_df['tdate'] = datetime.date(2013, 1, 1)
train_df = train_df.reset_index(drop=True)

for i_ in range(0, 36):
    target_date = datetime.date(2013, 1, 1) + relativedelta.relativedelta(months=i_)
    train_df['tdate'][train_df['date_block_num'] == i_] = target_date

train_df["tdate"] = pd.to_datetime(train_df.tdate)

train_df = train_df.drop('date_block_num', axis=1)

add_datepart(train_df, "tdate", drop=True)

train_df = train_df.reset_index(drop=True)

cat_vars = ["shop_id", "item_id", "item_category_id", "tIs_quarter_end", "tIs_quarter_start", "tIs_year_end", "tIs_year_start", "tMonth", "tYear", 'price_bin']

for v in cat_vars:
    train_df[v] = train_df[v].astype('category').cat.as_ordered()

print("tables processed")

# constructing test table

test_df = join_df(test_table, items, "item_id", "item_id")
test_df = test_df.drop('item_name', axis=1)
test_df = test_df.reset_index(drop=True)
test_df = join_df(test_df, price_id_buckets_df, "ID", "ID")
test_df['price_bin'] = test_df['price_bin'].fillna('A1')
test_df[dep] = 0

test_df["tdate"] = datetime.date(2015, 11, 1)
add_datepart(test_df, "tdate", drop=True)

for v in cat_vars:
    test_df[v] = test_df[v].astype('category').cat.as_ordered()

test_df[dep] = test_df[dep].astype(np.float32)
train_df[dep] = train_df[dep].astype(np.float32)

# preparing data for processing and forecasting

train_df.reset_index(inplace=True)

df_train, y_tr, nas, mapper = proc_df(train_df, dep, do_scale=True, ignore_flds='ID')

df_train = df_train.drop('index', axis=1)

test_df.reset_index(inplace=True)

for v in cat_vars:
    df_train[v] = df_train[v].astype('category').cat.as_ordered()

df_test, y_ts, nas, mapper = proc_df(test_df, dep, do_scale=True, ignore_flds='ID')

df_test = df_test.drop('index', axis=1)

for v in cat_vars:
    df_test[v] = df_test[v].astype('category').cat.as_ordered()

# processing and forecasting

samp_size = len(train_df)

train_ratio = 0.75
train_size = int(samp_size * train_ratio)

val_idx = list(range(train_size, len(df_train)))

yl = np.log(y_tr)
yl[yl == -np.inf] = 0
y_range = (0, np.log(20))

md = ColumnarModelData.from_data_frame(PATH, val_idx, df_train[cat_vars], yl.astype(np.float32), cat_flds=cat_vars, bs=128, test_df=df_test[cat_vars])

cat_sz = [(c, len(df_train[c].astype('category').cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]

m = md.get_learner(emb_szs, 1, 0.04, 1, [1000, 500], [0.001, 0.01], y_range=y_range)

# m.lr_find()
#
# n_skip = 800
# n_skip_end = 5
#
# plt.ylabel("loss")
# plt.xlabel("learning rate (log scale)")
# plt.plot(m.sched.lrs[n_skip:-(n_skip_end + 1)], m.sched.losses[n_skip:-(n_skip_end + 1)])
# plt.xscale('log')
# plt.legend()
# plt.show()

lr = 1e-3

m.fit(lr, 1, metrics=[rmse], cycle_len=1)

pred_test = m.predict(True)

pred_test_e = np.exp(pred_test)

df_test['item_cnt_month'] = pred_test_e

csv_fn = f'{PATH}sub1.csv'

df_test[['ID', 'item_cnt_month']].to_csv(csv_fn, index=False)

I understand that the problem can be in test data I am using for prediction. But I have tested the approach for ‘sliding back’ prediction frame with something like this (1 date_block_num = 1 month):

...    
train_df_t = train_df[train_df['date_block_num'] != 32]
test_df = train_df[train_df['date_block_num'] == 32]
...

with the same (bad) results…

I’m stuck. Can you help me? What can be the problem here?

Here is the code I am using for sliding window test:

from fastai.structured import *
from fastai.column_data import *
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from math import sqrt
import pandas as pd
import numpy as np
import datetime
from dateutil import relativedelta

np.set_printoptions(threshold=50, edgeitems=20)

PATH = 'data/avito/'

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, suffixes=("", suffix))

def rmse_t(targ, y_pred):
    return sqrt(mean_squared_error(np.array(targ), np.array(y_pred)))


cat_vars = ["shop_id", "item_id", "item_category_id"]

dep = "item_cnt_day"

table_names = ['train', 'items', 'item_categories', 'shops', 'test']

tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]

train, items, item_categories, shops, test_table = tables

train = train.drop('date', axis=1)



sales_table = join_df(train, items, "item_id", "item_id")
sales_table = join_df(sales_table, test_table, ["item_id", "shop_id"], ["item_id", "shop_id"])

sales_table[dep] = sales_table[dep].astype(np.float32)

sales_table[dep] = sales_table[dep].fillna(0).astype('float32')

sales_table[dep][sales_table[dep] < 0] = 0
sales_table[dep][sales_table[dep] > 20] = 20
sales_table['item_price'][sales_table['item_price'] < 0] = 0

df = sales_table[cat_vars + [dep, "ID", "date_block_num", 'item_price']]

df[dep] = df[dep].fillna(0).astype('float32')
df["ID"] = df["ID"].fillna(-1)
df = df[df.ID != -1]
df = df.reset_index(drop=True)



train_df = df.groupby(['date_block_num','ID']+cat_vars, as_index=False).agg({'item_cnt_day':'sum', 'item_price':'mean'})

price_bins = [0, 100, 300, 500, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 100000]
price_labels = ['A1', 'A2', 'A3', 'A4', 'B1', 'B2', 'B3', 'B4', 'C1', 'C2', 'D']
train_df['price_bin'] = pd.cut(train_df['item_price'], price_bins, labels=price_labels)

temp_price_id_buckets_df = train_df[['ID', 'price_bin']].drop_duplicates().reset_index(drop=True)

price_id_buckets_df = temp_price_id_buckets_df.groupby('ID').nth(0).reset_index()

train_df['tdate'] = datetime.date(2013, 1, 1)
train_df = train_df.reset_index(drop=True)

for i_ in range(0, 36):
    target_date = datetime.date(2013, 1, 1) + relativedelta.relativedelta(months=i_)
    train_df['tdate'][train_df['date_block_num'] == i_] = target_date

train_df["tdate"] = pd.to_datetime(train_df.tdate)

add_datepart(train_df, "tdate", drop=True)

train_df = train_df.reset_index(drop=True)


cat_vars = ["shop_id", "item_id", "item_category_id", "tIs_quarter_end", "tIs_quarter_start", "tIs_year_end", "tIs_year_start", "tMonth", "tYear", 'price_bin']

for v in cat_vars:
    train_df[v] = train_df[v].astype('category').cat.as_ordered()

print("tables processed")

##################### test table construction

train_df_t = train_df[train_df['date_block_num'] != 32]
test_df = train_df[train_df['date_block_num'] == 32]

df_train, y_tr, nas, mapper = proc_df(train_df_t, dep, do_scale=True, ignore_flds='ID')

for v in cat_vars:
    df_train[v] = df_train[v].astype('category').cat.as_ordered()

df_test, y_ts, nas, mapper = proc_df(test_df, dep, do_scale=True, ignore_flds='ID')

# df_test = df_test.drop('index', axis=1)

for v in cat_vars:
    df_test[v] = df_test[v].astype('category').cat.as_ordered()

print("ready for training")

samp_size = len(df_train)

train_ratio = 0.75
train_size = int(samp_size * train_ratio)

val_idx = list(range(train_size, len(df_train)))

yl = np.log(y_tr)
yl[yl == -np.inf] = 0
y_range = (0, np.log(20))

md = ColumnarModelData.from_data_frame(PATH, val_idx, df_train[cat_vars], yl.astype(np.float32), cat_flds=cat_vars, bs=256, test_df=df_test[cat_vars])

cat_sz = [(c, len(df_train[c].astype('category').cat.categories)+1) for c in cat_vars]

emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]

m = md.get_learner(emb_szs, 1, 0.04, 1, [1000, 500], [0.001, 0.01], y_range=y_range)

lr = 1e-3

m.fit(lr, 1, metrics=[rmse_t], cycle_len=1)

pred_test = m.predict(True)

pred_test_e = np.exp(pred_test)

print("result: " + str(rmse_t(y_ts, pred_test_e)))

the result is:

99%|█████████▉| 1671/1681 [00:11<00:00, 143.02it/s, loss=0.279]

epoch trn_loss val_loss rmse_t
0 0.278711 0.333683 0.569425

which seems fine, but the final evaluation:

result: 5.881196021583289

:((

anyone?

Hi I have been working on this project. My strategy is to just replace the data for rossmann with the predict future sales data and start the modelling. This data set on kaggle relates to a Coursera course on Kaggle competitions. To get good results a lot of preprocessing with the data is required and knowledge for which you gain on that Course. However the first point of any task in software is to get something running end to end no matter how simple. Having been privy to the data processing required I took the blunt approach but when I reach proc_df it never returns; the * in the box never changes to a number. Previous to that I bypassed proc_df as all fields/columns have been either converted to pd.Categorical or are numeric but then lr_find produces the generic runtime error cuda (59). As this is a catastrophic error it is hard to debug.

The test_df you use is date_block_num 32 but it should be 34.
Note this is a Kaggle comp I have signed up to as I have completed the course so I am honour bound.
You must know that the data and the test set have been manipulated to align with the requirements of a final course project to test if you can find all the nuances planted to trip you up.

1 Like

Thanks for the reply!

I’ve attached two different modules. The first one is for prediction for the competition. And the second one is for testing the prediction process for the 32nd block. Because we know the result of it and I wanted to check if my predictor working correctly.

For research purposes, I’ve calculated the RMSE for the Rossmann competition of the fast.ai course (for validation block), and it has the same big value (> 2.2). So, I guessed, that you cant get better with this approach.

I’ve enriched the data (I am familiar with retail business) and got smth about 1.3-1.4 score. I don’t know what should be done to make this prediction better…

What was your score on this challenge?

Hi Paul

Compare the test data with block 32. Mostly we used the r2_score as well. But some times they are both meaning less until you submit to Kaggle. The public score there is pretty accurate with the Course score which you have to submit to Coursera.

I got close to 0.90…
All I can suggest is that there are two identical Kaggle competitions with kernels and discussions. The other is competitive-data-science-final-project. In the kernels there are only a handful of NeuralNets many Keras, the others are mainly centred around XGBoost or LightGBM.

The comp you link to has the most kernels but both have important information in them.

The approach is not the problem with regards Rossmann. It is that the comp is rigged to make the student implement all the learnings of the course, data leakages, how the public and private leader board data is generated, ensembling etc. For around 40GBP for the course it’s a real challenge and one I would expect a non Kaggle expert to struggle on. Even when you have taken the course it’s not straight forward.
During the course I wanted to use fastai but had not gone through the lower level stuff of Part2 so I had to settle for other standard models to complete the assignments.

Now that I have completed the course I want to try to better my score using fastai and in that way I may learn that a little more deeply. But for now I have to scale back the features created so that I can get a path from end to end. I think I had 10 times more features, you need something in that ball park to do good. It is a challenge so don’t expect to do it in a short space of time. Of the 5 final projects I have reviewed so far only one has a score less than 0.95… the 10/10 score in the course.
Good Luck