Using Fast AI to Predict Site Traffic (Tabular/Time Series)

Hi All!

I’m a growth marketer at a large online marketplace that resells sneakers. I’ve been doing the Fast AI course for fun after work, and I think I’ve found an application for it at work. I want to show our ML team that I can hang with them ;).

I want to predict with some confidence the number of users we should see per day.

I started following the Rossmann example in Lesson 6, but realized that’s more complicated than what I’m trying to do. Here’s my dataset:

  • Users - Users on our site/app per day. I have this data going back to 2016. This is the dependent variable.
  • Sneakers Google Trend - Weekly Google Trends data on searches for “sneakers.”
  • Supreme Google Trend - Weekly Google Trends data on searches for “Supreme.” Supreme is another large item category on our site.
  • Promo - A list of dates on which we ran a promo or there was a “major” sneaker release that drove user traffic. Boolean (i.e. {3/14/20: 1}).
  • Paid Advertising - Our monthly budget for paid advertising. I figured I would divide this by the number of days in each month to come up with a “daily” spend (see the sketch after this list).
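
Here’s a rough sketch of how I’d spread the monthly budget across days in pandas (the file and column names like paid_spend.csv, Month, and budget are made up; swap in the real ones):

import pandas as pd

# hypothetical monthly budget table: one row per month (Month, budget)
spend = pd.read_csv('paid_spend.csv', parse_dates=['Month'])
spend['days'] = spend['Month'].dt.days_in_month
spend['dailySpend'] = spend['budget'] / spend['days']

# expand to one row per day so it joins against the daily Users data
daily = spend.loc[spend.index.repeat(spend['days'])].reset_index(drop=True)
daily['Date'] = daily['Month'] + pd.to_timedelta(daily.groupby('Month').cumcount(), unit='D')
daily = daily[['Date', 'dailySpend']]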

Here’s where I think I need to go following the Rossmann and Lesson 4 tabular data work:

  1. Join the data together - This is where the Rossmann template was throwing me off. In rossman_data_clean.ipynb it’s noted: “Now we can outer join all of our data into a single dataframe... One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.”

    But shouldn’t I expect Null values post-join? For instance, there isn’t a daily Google Trends value, and my promo dataset is just a list of dates where a promo/release occurred; on a date when a promo didn’t occur, I would expect a Null. Am I misunderstanding something? (I’ve taken a rough stab at the join in the sketch after this list.)

    I may just end up doing this in Excel since I’m a bit of a newbie at this.

  2. Run the add_datepart function - Thanks so much for including that! I think this is what FB Prophet kinda took care of for me when I was using it.

  3. Declare my categorical and continuous variables - My only categorical variable is “Promo,” with the remainder continuous. Users is, of course, the dependent variable.

  4. Train and Fit - I haven’t decided which metric I am optimizing for. I used MAPE for FB Prophet; I should do the same here.
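
Here’s roughly how I picture steps 1 and 2 in pandas (just a sketch; the file names and the promo format are assumptions based on my data description above):

import pandas as pd
from fastai.tabular import add_datepart

# assumed file/column names -- adjust to the real data
users  = pd.read_csv('users.csv',  parse_dates=['Date'])   # Date, users (daily, back to 2016)
trends = pd.read_csv('trends.csv', parse_dates=['Date'])   # Date, googleSneaker, googleSupreme (weekly)
promos = pd.read_csv('promos.csv', parse_dates=['Date'])   # Date, Promo=1 (promo/release dates only)

df = (users.merge(trends, on='Date', how='outer')
           .merge(promos, on='Date', how='outer'))
# the outer join leaves NaNs wherever a source has no row for that date,
# e.g. six of every seven days for the weekly trends, and every non-promo day

add_datepart(df, 'Date', drop=False)   # step 2: adds Year, Month, Dayofweek, ..., Elapsed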

How does this framework look?

I wrangled the data together in Excel, but am getting 0 accuracy. I haven’t seen that before :flushed:

dep_var = 'users'
cat_names = ['Promo', 'Year', 'Month', 'Day', 'Dayofweek', 'Week', 'Dayofyear']
cont_names = ['googleSneaker', 'googleSupreme', 'Elapsed']

data = (TabularList.from_df(df, path=base_dir, cat_names=cat_names,
                            cont_names=cont_names, procs=procs)
        .split_by_idx(list(range(800, 1000)))
        .label_from_df(cols=dep_var)
        .add_test(test, label=0, label_cls=FloatList, log=True)
        .databunch())

Then
learn = tabular_learner(data, layers=[1000,500], metrics=accuracy)
learn.model

Then I do learn.recorder.plot() and fit_one_cycle… and accuracy flatlines at 0.

What am I missing?

In your call to label_from_df you’re not specifying regression (FloatList). Also, your metric should be something like rmspe, not accuracy; accuracy is a classification metric, so it doesn’t apply to the regression you’re doing :slight_smile:

I don’t think you’ll need to transform the y values here to make them smaller (unless it’s very large user counts), but you should pass in a y_range to your tabular_learner.
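
Something like this, roughly (a sketch following the Rossmann lesson’s pattern; it assumes your df, base_dir, cat_names, cont_names, and procs from above, and uses fastai v1’s built-in exp_rmspe metric):

import numpy as np, torch
from fastai.tabular import *

data = (TabularList.from_df(df, path=base_dir, cat_names=cat_names,
                            cont_names=cont_names, procs=procs)
        .split_by_idx(list(range(800, 1000)))
        .label_from_df(cols=dep_var, label_cls=FloatList, log=True)  # regression on log(users)
        .databunch())

# y_range is in log space because of log=True; pad ~20% above the observed max
max_log_y = np.log(df['users'].max() * 1.2)
y_range = torch.tensor([0, max_log_y], device=defaults.device)

learn = tabular_learner(data, layers=[1000, 500], y_range=y_range,
                        metrics=exp_rmspe)  # RMSPE computed after exponentiating the logged targets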

Thanks so much! If I understand properly, I should make .label_from_df look like .label_from_df(cols=dep_var, label_cls=FloatList, log=True) and metrics=rmspe

WRT y values…it is exponential growth :smiley:. Y values will grow ~100x from t=0 to present.

I’m going to find the documentation on these built-in methods and functions…


OK, I got a mean_squared_error value of 0.097. Truthfully, this is without incorporating our paid spend. This feels pretty good…right???

All relative, I suppose. I was going up against an FB Prophet model with a MAPE of 11.4%. Now I need to figure out who won haha!

Things I need to look over in the AM

  1. Did I overfit?
  2. Selecting my learning rate

You should also use MAPE as a metric here (it may take a little work) so you have a comparable baseline, since MSE != MAPE.
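
A custom metric in fastai v1 can be a plain function of (pred, targ) tensors; this sketch assumes you kept log=True, so both arrive in log space and need exponentiating first:

import torch

def exp_mape(pred, targ):
    "MAPE on the original user-count scale (dep var was labeled with log=True)."
    pred, targ = pred.view(-1), targ.view(-1)
    pred, targ = torch.exp(pred), torch.exp(targ)
    return (torch.abs(targ - pred) / targ).mean() * 100

learn = tabular_learner(data, layers=[1000, 500], y_range=y_range,
                        metrics=[exp_rmspe, exp_mape])

That way the number it prints is directly comparable to Prophet’s 11.4%.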

It may be easier to convert FB Prophet over to MSE than to work this over to MAPE :grimacing:

Actually, fastai is calculating the MSE on the log-transformed user values (since I passed log=True), right? I don’t think Prophet does that transformation, so its MSE value is orders of magnitude different…

I’ll check with some data scientists at work…

Is there a good way to plot the model vs. actual to help visualize its performance?
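
Maybe something like this (a sketch, assuming fastai v1’s get_preds; since my validation split is rows 800–1000, which are consecutive dates, plotting them in order reads like a time series):

import numpy as np
import matplotlib.pyplot as plt

preds, targs = learn.get_preds(ds_type=DatasetType.Valid)
preds = np.exp(preds.numpy()).ravel()   # undo log=True to get user counts back
targs = np.exp(targs.numpy()).ravel()

plt.figure(figsize=(12, 4))
plt.plot(targs, label='actual users')
plt.plot(preds, label='predicted users')
plt.legend()
plt.show()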

Overall, I think the framework looks good! It would be a great project for you to work on.

It’s possible to join datasets in Excel, but that might not be the most efficient way of doing it. You’re right that you will get some null values post-join; that’s to be expected. With an outer join on date you keep every date from every dataset, and a null appears wherever one of the sources has no entry for that date, e.g. six of every seven days for the weekly trend data, and every day without a promo. Those nulls aren’t errors; you just need a fill strategy. The Rossmann example is more complex than what you’re doing, but the principles are the same: declare your categorical and continuous variables and run add_datepart, as you mentioned. In terms of training and fitting, you can use MAPE as you did with FB Prophet.
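
For example, a rough pandas sketch of a fill strategy (column names assumed from your dataset description):

df['Promo'] = df['Promo'].fillna(0)       # no promo recorded on that date -> 0
df = df.sort_values('Date')
trend_cols = ['googleSneaker', 'googleSupreme']
df[trend_cols] = df[trend_cols].ffill()   # carry each weekly reading forward through the week
# (or use .interpolate() to smooth between the weekly points)
df = df.dropna(subset=['users'])          # keep only dates where the dependent variable exists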

Good luck with your project! :smile::chart_with_upwards_trend:
