Creating baseline model and choosing validation set for twitter data

:wave: Hi there! I’m working on a practice project to solidify what I learned in Lecture 4 (the NLP one): trying to predict engagement with tweets (likes, retweets, etc.) from the tweet content. I have two questions about my initial approach:

  1. What’s a good baseline model to use?
  2. For the validation dataset, would it make sense to use a random sample or treat it as a time series, or something else?

I’ll explain more on each:

#1: I read some advice from @radek on structuring an ML project and agree that it’s a great idea to create a baseline model so you have something to measure against. One option I was considering was to just use deberta-small with some basically random hyperparams and have that be the baseline. But should a baseline model avoid deep learning altogether? If so, what non-deep-learning approaches could I consider? One random idea was predicting # of likes from the length of the tweet (I’m guessing there’ll be near-zero correlation, but I’m having trouble thinking of anything else).

#2: For my train-validation-test split, I read Dr Rachel Thomas’ great blog post about it, but with Twitter data I can’t decide whether to treat it as a time series (so the validation and test sets are later tweets) or just go with random sampling. On the one hand, I’d want to use the model to predict engagement with future tweets (a point in favour of a time-based split); on the other, I deliberately picked a Twitter user to analyse whose follower count and style of tweeting are relatively stable, because I didn’t want the data to vary much over time, which makes me think a random split might be fine. What do y’all think?
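For what it’s worth, a time-based split is only a few lines: sort the tweets chronologically and hold out the most recent slice as validation. A minimal sketch with made-up dates (the 20% cut-off fraction is an arbitrary choice):

```python
from datetime import date

# Hypothetical (tweet_date, text) pairs standing in for the scraped data.
tweets = [
    (date(2022, 3, 1), "c"),
    (date(2022, 1, 1), "a"),
    (date(2022, 5, 1), "e"),
    (date(2022, 2, 1), "b"),
    (date(2022, 4, 1), "d"),
]

# Time-based split: sort chronologically, hold out the most recent 20%
# as validation so the model is always judged on "future" tweets.
tweets.sort(key=lambda t: t[0])
cut = int(len(tweets) * 0.8)
train, valid = tweets[:cut], tweets[cut:]
print([t[1] for t in valid])  # ['e'] — the newest tweet
```

With a random split you’d shuffle before cutting instead; the nice thing about the time-based version is that it directly simulates the “predict engagement on future tweets” use case.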

Followup: I just read a Medium article that suggested, for regression tasks, baseline models that simply predict the mean or median.

Does that make sense in this case? I guess if the goal is to see whether deep learning adds any value on my project task, and this is the best quick non-deep-learning approach I can think of, it’ll quickly show whether any of my deep learning approaches make sense… which is one of the main goals of a baseline model, right?

Hi Ollie

After 5 seconds’ thought, so probably rubbish, but here we go.

Followers: the famous are likely to generate traffic. I have noticed conwyn@twitter lacks followers.
NLP content: certain words will appeal to the followers, so non-neutral words are likely to trigger replies and retweets.

So maybe there are latent elements in the tweet which trigger a response. These might be word sequences, but a few layers down they are just weights.

This might be of interest (How to get Tweets using Python and Twitter API)

So maybe you could pick a series of “influencers” and monitor their activity; if there is a sudden hike (not a slow increase), use the data before the hike as “boring” and the data after the hike as “interesting”, similar to movie sentiment analysis. Obviously “influencers” are skilled at triggering responses.

So use statistics for “counting” and “event detection”.

Regards Conwyn


Thanks for the suggestion @Conwyn!

I ended up using ChatGPT to get some suggestions and wanted to share because they were great!

1. Constant prediction: A simple baseline model could be to always predict the same value for the number of likes, such as the mean or median number of likes in the training data.

2. Text length prediction: Another simple baseline model could be to use the length of the tweet's text as a predictor for the number of likes. This assumes that longer tweets may be more engaging or contain more information that is valuable to readers.

3. Word count prediction: Similar to text length, word count could be used as a predictor for the number of likes. This assumes that tweets with more words may be more engaging or contain more information that is valuable to readers.

4. Bag-of-words prediction: A simple bag-of-words model could be used to represent the tweet's text as a vector of word frequencies, and then use linear regression or a similar technique to predict the number of likes based on this vector. This assumes that the frequency of certain words in the tweet may be indicative of how engaging it is.

5. TF-IDF prediction: Similar to the bag-of-words model, a TF-IDF representation of the tweet's text could be used as a predictor for the number of likes. This assumes that the frequency of certain words in the tweet relative to their frequency in the overall corpus of tweets may be indicative of how engaging it is.
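For reference, approach #4 can be sketched in a few lines with no ML library at all: a toy bag-of-words matrix plus ordinary least squares (the tweets and like counts below are made up):

```python
import numpy as np
from collections import Counter

# Toy stand-ins for the real tweet data (all values made up).
tweets = [
    "great new model release",
    "great day",
    "new paper on nlp",
    "nlp model tips",
]
likes_log = np.array([2.1, 0.7, 1.5, 1.0])  # hypothetical log1p(likeCount)

# Build the vocabulary and the bag-of-words count matrix.
vocab = sorted({w for t in tweets for w in t.split()})
index = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(tweets), len(vocab)))
for row, t in enumerate(tweets):
    for w, c in Counter(t.split()).items():
        X[row, index[w]] = c

# Fit ordinary least squares with a bias column appended.
Xb = np.hstack([X, np.ones((len(tweets), 1))])
coef, *_ = np.linalg.lstsq(Xb, likes_log, rcond=None)
preds = Xb @ coef

# Caveat: with more features (words) than tweets this fits the training
# data exactly, so on real data you'd fit on the training set and compute
# the PCC on the held-out validation set.
pcc = np.corrcoef(preds, likes_log)[0, 1]
```

In practice something like scikit-learn’s CountVectorizer plus LinearRegression does the same thing with less ceremony; the point is just that the whole baseline is plain linear algebra.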

I tried approach #1 using the mean and median, then discovered/remembered that if all the values of one column (my predictions in this case) are the same, then the PCC is undefined.
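That undefined-PCC behaviour is easy to reproduce: Pearson’s formula divides by the standard deviation of each series, and a constant prediction has a standard deviation of zero (toy numbers below):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 8.0])        # hypothetical log like counts
y_pred = np.full_like(y_true, y_true.mean())   # constant (mean) prediction

# Pearson's formula divides by the std dev of each series; a constant
# series has std dev 0, so the result is undefined (nan).
with np.errstate(divide="ignore", invalid="ignore"):
    pcc = np.corrcoef(y_pred, y_true)[0, 1]
print(np.isnan(pcc))  # True
```

So for a constant baseline a different metric (e.g. MAE or RMSE against the validation targets) makes more sense than correlation.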

I tried approach #2 - very easy, and as predicted, occasionally there would be a weak correlation between content length and the engagement metrics, but it was random noise - when I re-ran the calculation with different random subsets of the tweet data, the correlation would often disappear.
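The “correlation that vanishes on resampling” effect is easy to simulate: with lengths and like counts generated independently, the sample PCC on small random subsets just drifts around zero (everything below is synthetic, purely to illustrate):

```python
import random
import statistics

random.seed(0)

# Synthetic, independent-by-construction data: tweet lengths and log likes.
lengths = [random.randint(10, 280) for _ in range(200)]
likes = [random.gauss(3.0, 1.0) for _ in range(200)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sx * sy)

# On random 50-tweet subsets the "correlation" bounces around zero:
# any apparent relationship is sampling noise, not signal.
subset_rs = []
for _ in range(5):
    idx = random.sample(range(200), 50)
    subset_rs.append(pearson([lengths[i] for i in idx], [likes[i] for i in idx]))
```

For real use, `scipy.stats.pearsonr` additionally reports a p-value, which is a quicker way to tell a weak-but-real correlation from this kind of noise.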

Finally, I ended up using approach #4 (bag-of-words with linear regression) and got what I think is a good baseline: it produces some weak correlation, but one low enough that I should be able to beat it easily by applying some deep learning techniques:

  • PCC for BoW model pred_likeCount_log vs likeCount_log: 0.226
  • PCC for BoW model pred_retweetCount_log vs retweetCount_log: 0.310
  • PCC for BoW model pred_replyCount_log vs replyCount_log: 0.139
  • PCC for BoW model pred_quoteCount_log vs quoteCount_log: 0.170