How to handle multiple cast members in movie revenue prediction?

Recently I was working on the Kaggle competition TMDB Box Office Prediction. The goal of this competition is to predict box office revenue for movies based on given or self-collected information, such as genres, cast members, budget and so on.

But I ran into a problem when handling the ‘cast’ feature. Unlike ‘genres’, which has fewer than 100 distinct values, there are thousands of movie stars in the industry, so one-hot encoding is not a good way to handle the cast: it would add thousands of sparse features to the data, which is likely to lead to overfitting.

However, the main actors or actresses of a movie undoubtedly contribute a lot to its revenue, so we have to extract the useful information from this feature.

So is there a good way to deal with this kind of problem?

Have you tried using embeddings? https://youtu.be/qqt3aMPB81c?t=3199

However, if the number of distinct cast members is comparable to the number of training examples, there will be too little information about each person for the model to learn from. In that case you can try to gather some info about each person (birth year, average box office of their films (as of the date of this particular film), maybe their popularity on social networks and so on) and feed it in along with the person’s name. It can help the NN get more meaningful information than just the name.
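To make the "average box office of their films as of the date of this film" idea concrete, here is a minimal sketch in plain Python. The toy movies, the tuple layout and the function name `prior_cast_revenue` are all made up for illustration and are not the competition's schema. The one important point is that each film's feature uses only earlier films, so the target does not leak into its own feature.

```python
from collections import defaultdict

# Hypothetical toy data: (release_date, cast names, revenue). Not competition data.
movies = [
    ("2001-05-01", ["A", "B"], 100.0),
    ("2003-07-15", ["A", "C"], 300.0),
    ("2005-01-20", ["A", "B", "D"], 50.0),
]

def prior_cast_revenue(movies):
    """For each movie (in release order), average the mean revenue of
    each cast member's earlier films. None if nobody has a history."""
    totals = defaultdict(float)  # person -> summed revenue of earlier films
    counts = defaultdict(int)    # person -> number of earlier films
    features = []
    for date, cast, revenue in sorted(movies, key=lambda m: m[0]):
        # Only history from before this release is used here.
        known = [totals[p] / counts[p] for p in cast if counts[p] > 0]
        features.append(sum(known) / len(known) if known else None)
        # Update the history only after the feature has been computed.
        for p in cast:
            totals[p] += revenue
            counts[p] += 1
    return features

cast_history = prior_cast_revenue(movies)
print(cast_history)
```

Cast members with no earlier films contribute nothing, and a movie whose whole cast is new gets `None`, which you would have to impute or flag with an extra column.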

Thank you for your suggestion. I tried to implement embeddings, but the total number of cast members is just too large, around 50,000, and the provided training data has only 3,000 examples.

So I am wondering whether there is a trick for dealing with this problem, say, keeping only the 100 most famous cast members and one-hot encoding them.
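A rough sketch of that top-N idea, assuming "most famous" is approximated by "appears most often in the training data" (an assumption, since fame itself is not in the data). The helper names `build_top_cast` and `multi_hot` and the toy cast lists are hypothetical. Since a movie has several cast members, each row is multi-hot rather than strictly one-hot, with one extra column counting everyone outside the kept list.

```python
from collections import Counter

def build_top_cast(cast_lists, top_n=100):
    """Map the top_n most frequent cast names to column indices."""
    counts = Counter(name for cast in cast_lists for name in set(cast))
    return {name: i for i, (name, _) in enumerate(counts.most_common(top_n))}

def multi_hot(cast, vocab):
    """Encode one movie: a 0/1 column per kept name, plus a final
    column counting cast members that are not in the kept list."""
    row = [0] * (len(vocab) + 1)
    for name in set(cast):
        if name in vocab:
            row[vocab[name]] = 1
        else:
            row[-1] += 1
    return row

# Hypothetical toy data; top_n=2 stands in for the 100 in the question.
train_casts = [["A", "B"], ["A", "C"], ["A", "B", "D"]]
vocab = build_top_cast(train_casts, top_n=2)
rows = [multi_hot(cast, vocab) for cast in train_casts]
print(vocab, rows)
```

The list should be built from the training split only and then reused unchanged for the test split.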

Yes, that’s the case I was talking about (when most actors appear in the data only 1-2 times).
Treating all but the 100 most common cast members in the data as a ‘special symbol’ is, I think, worth trying. But I still think embeddings will work better than one-hot encoding in this case. And as I said, you can add additional columns with info on cast members (even those you dropped), so even if the model did not learn anything from an actor’s name, it can use the info that the box office of his movies is very high
:slight_smile:

Oh, and I was referring to embeddings assuming that you use a NN, rather than e.g. Random Forest :slight_smile:
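For the embedding route, the input side could look roughly like the sketch below: names seen often enough get their own id, everyone else shares one ‘special symbol’ id, and the cast's embedding rows are averaged and concatenated with the extra per-cast columns. Everything here is an illustrative assumption: the reserved ids, the `min_count` threshold, the helper names and the toy data are made up, and the random table only stands in for a trainable embedding layer that a NN framework would learn.

```python
import random
from collections import Counter

PAD, RARE = 0, 1  # reserved ids: padding, and the shared "rare cast member" symbol

def build_ids(cast_lists, min_count=2):
    """Give an own id to each name seen at least min_count times."""
    counts = Counter(name for cast in cast_lists for name in cast)
    frequent = sorted(name for name, c in counts.items() if c >= min_count)
    return {name: i + 2 for i, name in enumerate(frequent)}

def encode(cast, ids, max_len=4):
    """Map names to ids (unknown names -> RARE), truncate, right-pad with PAD."""
    seq = [ids.get(name, RARE) for name in cast][:max_len]
    return seq + [PAD] * (max_len - len(seq))

def pooled_embedding(seq, table):
    """Average the embedding rows of all non-padding ids."""
    rows = [table[i] for i in seq if i != PAD]
    dim = len(table[0])
    if not rows:
        return [0.0] * dim
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

# Hypothetical toy data.
train_casts = [["A", "B"], ["A", "C"], ["A", "B", "D"]]
ids = build_ids(train_casts)

# Random stand-in for a trainable embedding matrix (one row per id).
rng = random.Random(0)
dim = 3
table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(len(ids) + 2)]

seq = encode(["A", "D"], ids)   # "D" is rare, so it maps to RARE
side_info = [150.0]             # hypothetical extra column, e.g. prior cast revenue
model_input = pooled_embedding(seq, table) + side_info
print(seq, model_input)
```

Because the side-info column is computed for every cast member, including those collapsed into the rare symbol, the model still gets a signal for actors it has never seen by name.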
