Using Fast AI to Predict Site Traffic (Tabular/Time Series)

Hi All!

I’m a growth marketer at a large online marketplace that resells sneakers. I’ve been doing the Fast AI course for fun after work, and I think I’ve found an application for it at work. I want to show our ML team that I can hang with them ;).

I want to predict with some confidence the number of users we should see per day.

I started following the Rossmann example in Lesson 6, but realized that’s more complicated than what I’m trying to do. Here’s my dataset:

  • Users - Users on our site/app per day. I have this data going back to 2016. This is the dependent variable.
  • Sneakers Google Trend - Weekly Google Trends data on searches for “sneakers.”
  • Supreme Google Trend - Weekly Google Trends data on searches for “Supreme.” Supreme is another large item category on our site.
  • Promo - A list of dates on which we ran a promo or there was a “major” sneaker release that drove user traffic. Boolean (i.e. {3/14/20: 1}).
  • Paid Advertising - Our monthly budget for paid advertising. I figured I would divide this by the number of days in each month to come up with a “daily” spend (sketched below).
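
For the paid-spend piece, here’s the pandas sketch I have in mind (the numbers and column names are made up):

import pandas as pd

# Hypothetical monthly budget table: one row per month.
monthly = pd.DataFrame({
    'month': pd.to_datetime(['2020-01-01', '2020-02-01']),
    'budget': [31000.0, 29000.0],
})

# Divide each month's budget by its number of days.
monthly['daily_spend'] = monthly['budget'] / monthly['month'].dt.days_in_month

# Expand to one row per day so it can join against the daily Users data.
days = pd.date_range('2020-01-01', '2020-02-29', freq='D')
spend_df = (monthly.set_index('month')['daily_spend']
                   .reindex(days, method='ffill')
                   .rename_axis('date')
                   .reset_index())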

Here’s where I think I need to go following the Rossmann and Lesson 4 tabular data work:

  1. Join the data together - This is where the Rossmann template was throwing me off. In rossman_data_clean.ipynb it’s noted: “Now we can outer join all of our data into a single dataframe... One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.”

    But shouldn’t I expect Null values post-join? For instance, there isn’t a daily Google Trends value, and my promo dataset is just a list of dates on which a promo/release occurred. On a date when a promo didn’t occur, I would expect a Null. Am I misunderstanding something?

    I may just end up doing this in Excel since I’m a bit of a newbie at this. (I’ve sketched what I think the pandas version looks like below, after this list.)

  2. Run the add_datepart function - Thanks so much for including that! I think this is what FB Prophet kinda took care of for me when I was using it. (This is also in the sketch below.)

  3. Declare my categorical and continuous variables - My categorical variable is “Promo,” with the remainder being continuous. Users is, of course, the dependent variable.

  4. Train and Fit - I haven’t decided which metric I’m optimizing for. I used MAPE for FB Prophet, so I should probably do the same here.
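
Here’s roughly how I picture steps 1 and 2 in pandas. All the frame names are placeholders, and I’m assuming I should forward-fill the weekly trend values and treat non-promo days as 0:

import pandas as pd
from fastai.tabular import *  # add_datepart lives here in fastai v1

# Placeholders: users_df, sneaker_df, supreme_df, spend_df each have a 'date'
# column; promo_dates is my list of promo/release dates.
df = (users_df.merge(sneaker_df, on='date', how='outer')
              .merge(supreme_df, on='date', how='outer')
              .merge(spend_df, on='date', how='outer')
              .sort_values('date'))

# The weekly trend columns only have a value one day per week; forward-fill.
df[['googleSneaker', 'googleSupreme']] = df[['googleSneaker', 'googleSupreme']].ffill()

# Days without a promo come out of the join as Null, so build a 0/1 flag instead.
df['Promo'] = df['date'].isin(pd.to_datetime(promo_dates)).astype(int)

# Expand the date into Year/Month/Day/Dayofweek/Week/Dayofyear/Elapsed etc.
add_datepart(df, 'date')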

How does this framework look?

I wrangled the data together in Excel, but am getting 0 accuracy. I haven’t seen that before :flushed:

dep_var = 'users'
cat_names = ['Promo', 'Year', 'Month', 'Day', 'Dayofweek', 'Week', 'Dayofyear']
cont_names = ['googleSneaker', 'googleSupreme', 'Elapsed']

data = (TabularList.from_df(df, path=base_dir, cat_names=cat_names,
                            cont_names=cont_names, procs=procs)
        .split_by_idx(list(range(800, 1000)))
        .label_from_df(cols=dep_var)
        .add_test(test, label=0, label_cls=FloatList, log=True)
        .databunch())

Then
learn = tabular_learner(data, layers=[1000, 500], metrics=accuracy)
learn.model

Then I do learn.recorder.plot() and fit_one_cycle… and accuracy flatlines at 0.

What am I missing?

In your call to label_from_df you’re not specifying regression (FloatList). Also, your metric should be something like rmspe, not accuracy; accuracy is a classification metric and doesn’t apply to the regression you’re doing :slight_smile:

I don’t think you’ll need to transform the y values here to make them smaller (unless the user counts are very large), but you should pass a y_range to your tabular_learner.
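
Roughly like this, following the Rossmann notebook’s pattern (untested sketch; the 1.2x headroom on the log-space upper bound is the usual trick from that lesson):

from fastai.tabular import *
import numpy as np

data = (TabularList.from_df(df, path=base_dir, cat_names=cat_names,
                            cont_names=cont_names, procs=procs)
        .split_by_idx(list(range(800, 1000)))
        .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
        .databunch())

# Let predictions range a bit above the largest log(y) seen in training.
max_log_y = np.log(df[dep_var].max()) * 1.2
learn = tabular_learner(data, layers=[1000, 500],
                        y_range=torch.tensor([0, max_log_y], device=defaults.device),
                        metrics=exp_rmspe)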

Thanks so much! If I understand properly, I should make .label_from_df look like .label_from_df(cols=dep_var, label_cls=FloatList, log=True) and set metrics to something like rmspe.

WRT y values…it is exponential growth :smiley:. Y values will grow ~100x from t=0 to present.

I’m going to find the documentation on these built-in methods and functions…

OK, I got a mean_squared_error value of 0.097. Truthfully, this is without incorporating our paid spend. This feels pretty good…right???

All relative, I suppose. I was going up against an FB Prophet model with a MAPE of 11.4%. Now I need to figure out who won haha!

Things I need to look over in the AM:

  1. Did I overfit?
  2. Selecting my learning rate (quick recipe below)
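
Here’s the standard fastai v1 recipe I’m planning to use for both checks (epoch count and learning rate are guesses):

learn.lr_find()
learn.recorder.plot()         # pick an LR on the steep downward slope

learn.fit_one_cycle(5, 1e-2, wd=0.2)
learn.recorder.plot_losses()  # valid loss climbing while train loss falls = overfitting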

You should then also use MAPE as a metric here (it may take a little work) so you have a comparable baseline, since MSE != MAPE.
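
Something like this should work as a custom fastai v1 metric, assuming you kept log=True so predictions and targets come back in log space (untested sketch):

import torch

def mape(pred, targ):
    # Flatten, undo the log transform, then mean absolute percentage error in %.
    pred, targ = pred.view(-1), targ.view(-1)
    pred, targ = torch.exp(pred), torch.exp(targ)
    return (torch.abs(targ - pred) / targ).mean() * 100

learn = tabular_learner(data, layers=[1000, 500], metrics=mape)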

It may be easier to convert FB Prophet over to MSE than to work this over to MAPE :grimacing:

Actually, fastai is calculating the MSE on the log-transformed user values, right? I don’t think Prophet does that transformation, so its MSE value is orders of magnitude different…

I’ll check with some data scientists at work…

Is there a good way to plot the model vs. actuals to help visualize its performance?
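
In case it helps anyone later, here’s what I’m going to try (assuming the log=True labels, so I exponentiate back to user counts):

import numpy as np
import matplotlib.pyplot as plt
from fastai.basic_data import DatasetType

# Predictions on the validation split (the held-out block of days).
preds, targs = learn.get_preds(ds_type=DatasetType.Valid)
preds = np.exp(preds.numpy()).ravel()   # back to actual user counts
targs = np.exp(targs.numpy()).ravel()

plt.plot(targs, label='actual users')
plt.plot(preds, label='predicted users')
plt.xlabel('validation day')
plt.ylabel('users')
plt.legend()
plt.show()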