Naming choices for "user_name", "item_name", and "rating_name" params of CollabDataLoaders.from_df method

toomanyrichies · May 25, 2021, 2:40pm

Chapter 8 introduces collaborative filtering, and by extension the CollabDataLoaders class. Jeremy says that the “user_name” param will default to “user”, which is fine for the case of the movie ratings problem because the ratings table already has a “user” column. We did have to change “item_name” to “title” because the ratings table has no column named “item”, but that's still not too big of a stretch. Additionally, by looking at the docs for CollabDataLoaders, I see it also accepts a param named “rating_name”, which I haven’t yet encountered in the lessons up to the point I’m currently at.

My question is this: it seems like collaborative filtering could be used for many problems which don’t include concepts like users, items, or ratings in their domain. Just as a random example, let’s say I have a table with data on cities and a table with data on weather patterns for those cities, and I want to use these tables to predict what the current average housing price is. In this case, would I have to do something like CollabDataLoaders.from_df(df, user_name='city', item_name='weather', rating_name='average housing price')?
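To make the role mapping concrete, here is a minimal sketch in plain Python. `resolve_roles` is a hypothetical helper, not a fastai function, and the column lists are made up for illustration. It assumes (from a reading of the fastai source, so treat it as an assumption) that a role left unspecified falls back to the first, second, or third column of the table, in that order.

```python
# Hypothetical helper, NOT part of fastai: it only illustrates that the three
# parameters select columns by role, whatever the domain calls them.
def resolve_roles(columns, user_name=None, item_name=None, rating_name=None):
    """Return the column names playing the 'user', 'item' and 'rating' roles.

    Assumption: a role that is not given falls back to the first, second
    or third column, in that order.
    """
    if len(columns) < 3:
        raise ValueError("need at least three columns")
    roles = {
        "user": user_name if user_name is not None else columns[0],
        "item": item_name if item_name is not None else columns[1],
        "rating": rating_name if rating_name is not None else columns[2],
    }
    missing = [name for name in roles.values() if name not in columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return roles


# Assumed layout of the merged movie ratings table (illustrative only).
movie_columns = ["user", "movie", "rating", "timestamp", "title"]
movie_roles = resolve_roles(movie_columns, item_name="title")

# The city/weather table from the question, with every role named explicitly.
city_columns = ["city", "weather", "average housing price"]
city_roles = resolve_roles(
    city_columns,
    user_name="city",
    item_name="weather",
    rating_name="average housing price",
)

print(movie_roles)
print(city_roles)
```

Seen this way, "user" and "item" are just labels for the two categorical columns and "rating" for the numeric target. Whether the technique suits a given problem is a separate question from what the columns are called.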

At the risk of sounding critical, this feels kind of awkward, like I’m shoe-horning my problem’s domain into one which involves users and items and ratings. I know it’s hard to pick a variable / parameter / class name which everyone can agree on, and it’s one of the trickier things to do in programming, so I don’t want to come across as criticizing. But the specificity of those param name choices makes me wonder if I’m missing something about collaborative filtering itself. Is it exclusively intended for problem domains which specifically involve users? Would I actually have to use a different technique for the problem I outlined above?

JackByte · May 25, 2021, 6:56pm

Hi @toomanyrichies, I am no deep learning specialist (yet), so there is not much expertise behind my response. But I like the way you are thinking more deeply about the concepts, and I’d like to join you. Maybe we’ll learn something along the way.

So first a question, no offence intended: do you have better names in mind? I just tried to think about what we could use instead, and I kind of like users and items. That doesn’t mean we can’t use the class for anything else, but it gives us a concrete example that we can abstract from and apply to a new problem.

Take the Netflix example: there is a finite list of users who have each seen some of a finite list of movies and given them ratings. By training embeddings we can find similar users and suggest movies that other similar users liked.
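Roughly, that idea can be sketched in plain Python as a dot-product matrix factorization trained with stochastic gradient descent. This is only an illustration under stated assumptions, not fastai's implementation: the ratings are invented toy data, and the names `predict`, `mse` and `most_similar_user` are made up for this sketch.

```python
import math
import random

# Toy (user, item, rating) triples, invented purely for illustration.
ratings = [
    ("ann", "alien", 5.0), ("ann", "heat", 4.0), ("ann", "up", 1.0),
    ("bob", "alien", 5.0), ("bob", "heat", 5.0), ("bob", "up", 1.0),
    ("cat", "alien", 1.0), ("cat", "heat", 2.0), ("cat", "up", 5.0),
    ("dan", "alien", 1.0), ("dan", "up", 5.0),
]

N_FACTORS = 2            # size of each embedding vector
LEARNING_RATE = 0.02
N_EPOCHS = 1000
rng = random.Random(0)   # fixed seed so runs are repeatable


def init_embeddings(names):
    """Give every name a small random embedding vector."""
    return {n: [rng.uniform(-0.1, 0.1) for _ in range(N_FACTORS)] for n in names}


users = init_embeddings(sorted({u for u, _, _ in ratings}))
items = init_embeddings(sorted({i for _, i, _ in ratings}))


def predict(user, item):
    """Predicted rating = dot product of the user and item embeddings."""
    return sum(a * b for a, b in zip(users[user], items[item]))


def mse():
    """Mean squared error over the known ratings."""
    return sum((predict(u, i) - r) ** 2 for u, i, r in ratings) / len(ratings)


initial_mse = mse()

# Stochastic gradient descent on the squared error of each known rating.
for _ in range(N_EPOCHS):
    for u, i, r in ratings:
        err = predict(u, i) - r
        for k in range(N_FACTORS):
            pu, qi = users[u][k], items[i][k]
            users[u][k] -= LEARNING_RATE * err * qi
            items[i][k] -= LEARNING_RATE * err * pu

final_mse = mse()


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def most_similar_user(name):
    """Return the other user whose embedding points in the closest direction."""
    others = [u for u in users if u != name]
    return max(others, key=lambda u: cosine(users[name], users[u]))


print(f"MSE before training: {initial_mse:.3f}, after: {final_mse:.3f}")
print("User most similar to ann:", most_similar_user("ann"))
print("Predicted rating for dan on heat:", round(predict("dan", "heat"), 2))
```

Nothing in the sketch depends on the rows being people or the columns being movies. It only needs two sets of entities and a number for some of their pairings.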

I’m not sure your example fits so well, because I don’t think the housing price is related to city and weather in the way the rating is related to users and items.

I did my own little tabular project on the ASHRAE - Great Energy Predictor III data. There you have to predict the meter readings for different buildings, so you could try CollabDataLoaders.from_df(df, user_name='building', item_name='weather', rating_name='meter_reading').

I’m curious about your and other users’ opinions on this.

toomanyrichies · May 26, 2021, 3:02am

Thanks for replying @JackByte. I just want to reiterate that naming things is hard in computer science, and I don’t want to imply that I think I could have done any better. I guess it just might be confusing for folks who haven’t seen the example from the lesson yet. Maybe they discovered the CollabDataLoaders class from reading the FastAI docs, or from a coworker or friend, so they don’t have the context of the Netflix example. With the added context of that example, I agree it makes more sense, although for the sake of consistency I might expect that same naming pattern to be applied elsewhere too (which maybe it is already and I just haven’t encountered it?).

In terms of other names, I still don’t know whether the class is specifically meant for user-centric, rating-centric applications, so I’m not sure I’m qualified to suggest alternate names until I have a better understanding of the class. I haven’t spent as much time thinking about the problem as the FastAI folks have, so I’ll defer to them. I guess I mostly just want to know whether my confusion over the param names implies a misunderstanding on my part about the applicability of the class to problems like the one I described.