I’m a growth marketer at a large online marketplace that resells sneakers. I’ve been doing the fast.ai course for fun after work, and I think I’ve found an application for it at work. I want to show our ML team that I can hang with them ;).
I want to predict with some confidence the number of users we should see per day.
I started following the Rossmann example in Lesson 6, but realized that’s more complicated than what I’m trying to do. Here’s my dataset:
Users - Users on our site/app per day. I have this data going back to 2016. This is the dependent variable.
Sneakers Google Trend - Weekly Google Trends data on searches for “sneakers.”
Supreme Google Trend - Weekly Google Trends data on searches for “Supreme.” Supreme is another large item category on our site.
Promo - A list of dates on which we ran a promo or there was a “major” sneaker release that drove user traffic. Boolean (i.e. {3/14/20: 1}).
Paid Advertising - Our monthly budget for paid advertising. I figured I would divide this by the number of days in each month to come up with a “daily” spend.
Here’s where I think I need to go following the Rossmann and Lesson 4 tabular data work:
Join the data together - This is where the Rossmann template was throwing me off. In rossman_data_clean.ipynb it’s noted: “Now we can outer join all of our data into a single dataframe... One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.”
But shouldn’t I expect Null values post-join? For instance, there isn’t a daily Google Trends value. My promo dataset is just a list of dates where a promo/release occurred. On a date where a promo didn’t occur, I would expect a Null. Am I misunderstanding something?
I may just end up doing this in Excel since I’m a bit of a newbie at this.
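For what it’s worth, this join is only a few lines in pandas. A minimal sketch, using tiny hypothetical stand-in frames and made-up column names in place of the real exports:

```python
import pandas as pd

# Hypothetical stand-ins for the real exports described above
users = pd.DataFrame({
    "date": pd.date_range("2020-03-09", periods=7, freq="D"),
    "users": [100, 110, 105, 120, 130, 150, 90],
})
trend = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-09"]),  # weekly Google Trends: one row per week
    "sneakers_trend": [55],
})
promos = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-14"]),  # only dates where a promo/release happened
    "promo": [1],
})

# Left-join everything onto the daily users frame; unmatched rows become NaN
df = users.merge(trend, on="date", how="left").merge(promos, on="date", how="left")

# Weekly trend: carry each week's value forward onto the following days
df["sneakers_trend"] = df["sneakers_trend"].ffill()

# The promo list only contains promo days, so NaN here just means "no promo"
df["promo"] = df["promo"].fillna(0).astype(int)
```

The point is that the post-join NaNs are expected and meaningful, and each column gets its own fill strategy.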
Run the add_datepart function - Thanks so much for including that! I think this is what FB Prophet kind of took care of for me when I was using it.
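For anyone following along, here is roughly what that expansion looks like in plain pandas. This is only an approximation of fastai’s add_datepart, which adds more fields (e.g. Is_month_end and an Elapsed counter) and can drop the original column:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2020-03-09", periods=3, freq="D")})

# Expand the date into the kinds of fields add_datepart creates
df["Year"] = df["date"].dt.year
df["Month"] = df["date"].dt.month
df["Week"] = df["date"].dt.isocalendar().week.astype(int)
df["Day"] = df["date"].dt.day
df["Dayofweek"] = df["date"].dt.dayofweek  # Monday = 0
```

These derived columns are what let a tabular model pick up weekly and seasonal patterns that Prophet models explicitly.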
Declare my categorical and continuous variables - My categorical variable is “promo,” with the remainder being continuous. Users, of course, is the dependent variable.
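A minimal sketch of that declaration, with hypothetical column names standing in for whatever the joined dataframe actually contains:

```python
# Hypothetical column names matching the datasets described in this thread
dep_var = "users"
cat_names = ["promo"]  # plus any categorical date fields add_datepart creates,
                       # e.g. "Dayofweek", "Is_month_end"
cont_names = ["sneakers_trend", "supreme_trend", "paid_spend"]

# Sanity check: no column in both lists, and the dependent variable in neither
assert not set(cat_names) & set(cont_names)
assert dep_var not in cat_names + cont_names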
Train and fit - I haven’t decided which metric I’m optimizing for. I used MAPE with FB Prophet; I should probably do the same here.
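For reference, MAPE is easy to compute by hand if you want to compare against Prophet on equal footing. A numpy sketch (a hypothetical helper, not a fastai built-in):

```python
import numpy as np

def mape(pred, actual):
    """Mean Absolute Percentage Error, in percent.

    Breaks down when `actual` contains zeros; unlikely for
    daily user counts, but worth guarding in real use.
    """
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs((actual - pred) / actual)) * 100)

mape([90, 110], [100, 100])  # → 10.0
```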
In your call to label_from_df you’re not specifying regression (FloatList). Also, your metric should be something like rmspe, not accuracy; accuracy is a classification metric and doesn’t apply to the regression you’re doing.
I don’t think you’ll need to transform the y values here to make them smaller (unless your user counts are very large), but you should pass a y_range to your tabular_learner.
Thanks so much! If I understand properly, I should make .label_from_df look like .label_from_df(cols=dep_var, label_cls=FloatList, log=True) and set metrics=rmspe.
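For reference, here is a numpy sketch of the metric itself. Note that the Rossmann notebook’s metric is called exp_rmspe because with log=True the model predicts log(users), so predictions are exponentiated before the percentage error is computed; the helper names here are illustrative:

```python
import numpy as np

def rmspe(pred, actual):
    """Root Mean Squared Percentage Error, as a fraction (0.10 == 10%)."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    pct = (actual - pred) / actual
    return float(np.sqrt(np.mean(pct ** 2)))

def exp_rmspe(pred_log, actual):
    """rmspe for a model trained on logged targets: undo the log first."""
    return rmspe(np.exp(np.asarray(pred_log, dtype=float)), actual)
```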
WRT y values: it’s exponential growth. Y values grow ~100x from t=0 to the present.
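Given that growth, one reasonable approach (the pattern the Rossmann notebook uses, shown here with a hypothetical max_users) is to set y_range in log space with some headroom above the observed maximum, since future values may exceed anything in the training data:

```python
import numpy as np

max_users = 50_000  # hypothetical: your actual max daily user count

# With log=True the model predicts log(users), so y_range is in log space.
# The Rossmann notebook leaves ~20% headroom above the observed max.
max_log_y = float(np.log(max_users * 1.2))
y_range = (0.0, max_log_y)
```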
I’m going to find the documentation on these built-in methods and functions…
Actually, fastai is calculating the MSE based on the normalized user values, right? I don’t think Prophet does this normalization, so its MSE value is orders of magnitude different…
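One note on scale: MSE is unit-dependent, so comparing a loss computed on rescaled or logged targets against Prophet’s raw-unit MSE is apples to oranges (as far as I know, fastai’s Normalize proc applies to the continuous inputs, while log=True is what changes the target’s scale). A quick numeric illustration with hypothetical numbers:

```python
import numpy as np

actual = np.array([1000.0, 2000.0, 3000.0])  # hypothetical daily user counts
pred = actual + 100.0                         # a constant error of 100 users

mse_raw = np.mean((actual - pred) ** 2)       # 100**2 = 10000.0

# Standardize both by the same statistics and the MSE shrinks by sigma**2
mu, sigma = actual.mean(), actual.std()
mse_scaled = np.mean(((actual - mu) / sigma - (pred - mu) / sigma) ** 2)

# Identity: mse_raw == mse_scaled * sigma**2
```

Same model quality, wildly different loss numbers, purely because of units.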
I’ll check with some data scientists at work…
Is there a good way to plot the model vs. actual values to help visualize its performance?
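A matplotlib sketch of such a plot, using synthetic stand-in arrays. In fastai v1 you would pull the real values with something like preds, y = learn.get_preds(ds_type=DatasetType.Valid) and, if you trained with log=True, np.exp() both before plotting:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs in a script
import matplotlib.pyplot as plt

# Hypothetical stand-ins for validation targets and model predictions
days = np.arange(60)
actual = 100 * np.exp(0.03 * days)           # exponential-ish user growth
rng = np.random.default_rng(0)
pred = actual * (1 + rng.normal(0, 0.05, size=days.shape))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(days, actual, label="actual users")
ax.plot(days, pred, "--", label="predicted users")
ax.set_xlabel("day")
ax.set_ylabel("users per day")
ax.legend()
fig.savefig("model_vs_actual.png")
```

With exponential growth, a log-scaled y-axis (ax.set_yscale("log")) can make the fit easier to judge across the whole range.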
It’s possible to join datasets in Excel, but that’s probably not the most efficient way once the data grows; pandas makes it straightforward. And yes, you’re right that you should expect null values post-join. An outer join keeps every date that appears in any of the datasets, and wherever a dataset has no entry for that date, the joined row gets a null (the Rossmann notebook’s null check works as a completeness test because its sources cover every date). In your case the nulls are meaningful, so fill them accordingly: a null in the promo column just means “no promo that day” (fill with 0), and the weekly trend value can be carried forward across the days of that week. The Rossmann example is more complex than what you’re doing, but the principles are the same: join the data, run add_datepart, declare your categorical and continuous variables, then train. For training and fitting, you can use MAPE as you did with FB Prophet.