I’m a growth marketer at a large online marketplace that resells sneakers. I’ve been doing the fast.ai course for fun after work, and I think I’ve found an application for it at work. I want to show our ML team that I can hang with them ;).
I want to predict with some confidence the number of users we should see per day.
I started following the Rossmann example in Lesson 6, but realized that’s more complicated than what I’m trying to do. Here’s my dataset:
Users - Users on our site/app per day. I have this data going back to 2016. This is the dependent variable.
Sneakers Google Trend - Weekly Google Trends data on searches for “sneakers.”
Supreme Google Trend - Weekly Google Trends data on searches for “Supreme.” Supreme is another large item category on our site.
Promo - A list of dates on which we ran a promo or there was a “major” sneaker release that drove user traffic. Boolean (i.e. {3/14/20: 1}).
Paid Advertising - Our monthly budget for paid advertising. I figured I’d divide this by the number of days in each month to come up with a “daily” spend.
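A rough pandas sketch of what I mean (the column names and numbers here are made up):

```python
import pandas as pd

# Hypothetical monthly budget table: one row per month (names/values are made up)
budget = pd.DataFrame({
    'month': pd.to_datetime(['2020-01-01', '2020-02-01']),
    'monthly_spend': [31000.0, 29000.0],
})
budget['daily_spend'] = budget['monthly_spend'] / budget['month'].dt.days_in_month

# Map every calendar day onto its month's per-day spend
days = pd.DataFrame({'date': pd.date_range('2020-01-01', '2020-02-29', freq='D')})
days['month'] = days['date'].values.astype('datetime64[M]')   # truncate to month start
daily_spend = days.merge(budget[['month', 'daily_spend']], on='month', how='left')
```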
Here’s where I think I need to go following the Rossmann and Lesson 4 tabular data work:
Join the data together - This is where the Rossmann template was throwing me off. In rossman_data_clean.ipynb it’s noted: “Now we can outer join all of our data into a single dataframe... One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.”
But shouldn’t I expect Null values post-join? For instance, there isn’t a daily Google Trends value, and my promo dataset is just a list of dates where a promo/release occurred; on a date when no promo occurred, I would expect a Null. Am I misunderstanding something?
I may just end up doing this in Excel since I’m a bit of a newbie at this.
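In case I stay in pandas, here’s the rough join I have in mind, with the post-join Nulls filled deliberately (the DataFrame and column names are my own):

```python
import pandas as pd

# users_df: one row per day with the dependent variable; trends_df: weekly rows;
# promo_df: only the dates where a promo/release happened (names are my own)
df = users_df.merge(trends_df, on='date', how='left')   # weekly data -> NaN on 6 of 7 days
df = df.sort_values('date').reset_index(drop=True)
trend_cols = ['sneakers_trend', 'supreme_trend']
df[trend_cols] = df[trend_cols].ffill()                 # carry each week's value forward

df = df.merge(promo_df, on='date', how='left')
df['promo'] = df['promo'].fillna(0).astype(int)         # a missing promo date just means 0
```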
Run the add_datepart function - Thanks so much for including that! I think this is what FB Prophet kinda took care of for me when I was using it.
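For reference, this is the call I mean (fastai v1; I’m passing drop=False so the raw date column sticks around for plotting later):

```python
from fastai.tabular import *  # fastai v1, as in the Rossmann notebooks

# Expands 'date' into Year, Month, Week, Day, Dayofweek, Is_month_start, Elapsed, etc.
# drop=False keeps the original date column (handy for plotting against later)
add_datepart(df, 'date', drop=False)
```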
Declare my categorical and continuous variables - My categorical variable is “promo,” with the remainder being continuous. Users, of course, is the dependent variable.
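My guess at the declarations (fastai v1, following Rossmann; the column names are mine, and the date-part columns that add_datepart creates are treated as categorical there too):

```python
from fastai.tabular import *

dep_var = 'users'
cat_names = ['promo', 'Year', 'Month', 'Week', 'Day', 'Dayofweek']  # date parts as categories
cont_names = ['sneakers_trend', 'supreme_trend', 'daily_spend', 'Elapsed']
procs = [FillMissing, Categorify, Normalize]

# Hold out the most recent 90 days for validation: a random split would leak the future
valid_idx = list(range(len(df) - 90, len(df)))
```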
Train and Fit - I haven’t decided which metric I’m optimizing for. I used MAPE for FB Prophet, so I should do the same here.
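If I stick with MAPE, my understanding is that a fastai v1 metric is just a function of (predictions, targets) tensors, so a hand-rolled version might look like this (and if the labels end up log-transformed, as in the replies below, I’d exponentiate both first, the way fastai’s exp_rmspe does):

```python
import torch

# Hand-rolled MAPE in fastai v1 metric form: any function of (preds, targets) tensors
def mape(pred, targ):
    pred, targ = pred.view(-1), targ.view(-1)        # flatten (n,1) preds to match targets
    return ((targ - pred).abs() / targ.clamp(min=1e-8)).mean() * 100
```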
In your call to label_from_df you’re not specifying regression (FloatList). Also, your metric should be something like rmspe, not accuracy, since accuracy isn’t a regression metric and you’re doing regression here.
I don’t think you’ll need to transform the y values here to make them smaller (unless the user counts are very large), but you should pass a y_range to your tabular_learner.
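Concretely, in fastai v1 the Rossmann lesson uses the built-in exp_rmspe, which exponentiates before computing RMSPE because label_from_df(..., log=True) makes the model predict log(y). Roughly like this (the layer sizes are placeholders, and `df`/`data` are assumed from the setup above):

```python
import numpy as np
import torch
from fastai.tabular import *

# fastai v1's built-in exp_rmspe, roughly: undo the log transform, then RMSPE
def exp_rmspe(pred, targ):
    pred, targ = pred.view(-1), targ.view(-1)   # flatten (n,1) predictions
    pred, targ = torch.exp(pred), torch.exp(targ)
    pct_var = (targ - pred) / targ
    return torch.sqrt((pct_var ** 2).mean())

# y_range the Rossmann way: on the log scale, with ~20% headroom over the max
max_log_y = np.log(df['users'].max() * 1.2)
y_range = torch.tensor([0, max_log_y], device=defaults.device)
learn = tabular_learner(data, layers=[200, 100], y_range=y_range, metrics=exp_rmspe)
```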
Thanks so much! If I understand properly, I should make .label_from_df look like .label_from_df(cols=dep_var, label_cls=FloatList, log=True) and set metrics=exp_rmspe.
WRT y values… it’s exponential growth. Y values grow ~100x from t=0 to present.
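Given that, here’s how I think my labeling step ends up: log=True turns that ~100x range into an additive shift of log(100) ≈ 4.6 on the scale the model actually fits (fastai v1; df, dep_var, the name lists, and valid_idx are from my sketches above):

```python
from fastai.tabular import *

# df, dep_var, cat_names, cont_names, procs, valid_idx as defined in my earlier sketches
data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var, label_cls=FloatList, log=True)  # regression on log(users)
        .databunch())
```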
I’m going to find the documentation on these built-in methods and functions…
Actually, fastai is calculating the MSE on the log-transformed user values, right? I don’t think Prophet does that transformation, so its MSE value is orders of magnitude different…
I’ll check with some data scientists at work…
Is there a good way to plot the model vs. actual to help visualize its performance?
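Here’s what I was going to try unless there’s a built-in I’m missing (fastai v1; assumes the drop=False date column and the 90-day validation split from my sketches above):

```python
import matplotlib.pyplot as plt
import numpy as np
from fastai.basic_data import DatasetType

# get_preds returns (predictions, targets) for the validation set; both are in
# log space because of log=True, so exponentiate back to user counts
preds, targs = learn.get_preds(ds_type=DatasetType.Valid)
pred_users = np.exp(preds.numpy()).ravel()
actual_users = np.exp(targs.numpy()).ravel()

valid_dates = df['date'].iloc[-90:]   # the rows held out by valid_idx
plt.plot(valid_dates, actual_users, label='actual')
plt.plot(valid_dates, pred_users, label='predicted')
plt.xlabel('date'); plt.ylabel('daily users'); plt.legend()
plt.show()
```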