Lesson 4 applied to Imitating other Reddit Users and Subreddits

I basically adapted what Jeremy did in the Lesson 4 and language-model notebooks and built a pipeline that downloads data from Reddit instead of using the supplied IMDB data. The program is split into two parts:

  • Part 1: imitating a Reddit user by downloading their latest 1000 submissions and 1000 comments and training the model on them (by default, 90% of them go to the training set and the remaining 10% to the validation set.)

  • Part 2: imitating a subreddit using a methodology similar to Part 1, but downloading the top 1000 submissions in a subreddit and using the comments on each submission as the training/validation set.
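Roughly, the download-and-split step looks like this. This is only a sketch, not my exact code: the credentials are placeholders, and the `fetch_user_texts` / `train_valid_split` helper names are illustrative.

```python
from __future__ import annotations
import random


def fetch_user_texts(username: str, limit: int = 1000) -> list[str]:
    """Download a user's latest submissions and comments with PRAW.

    The client_id/client_secret values are placeholders; you need your
    own Reddit API credentials for this to run.
    """
    import praw  # deferred so the split helper below works without PRAW installed

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="lm-imitation-demo",
    )
    user = reddit.redditor(username)
    texts = [s.title + "\n" + s.selftext for s in user.submissions.new(limit=limit)]
    texts += [c.body for c in user.comments.new(limit=limit)]
    return texts


def train_valid_split(texts, valid_frac=0.1, seed=42):
    """Shuffle the downloaded texts and split them 90/10 by default."""
    texts = list(texts)
    random.Random(seed).shuffle(texts)
    n_valid = int(len(texts) * valid_frac)
    return texts[n_valid:], texts[:n_valid]
```

For Part 2 the fetch step is the same idea, except you iterate `reddit.subreddit(name).top(limit=1000)` and collect each submission's comments instead of a single user's history.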

Part 1 (understandably) gives mostly garbage results unless you can find a user who routinely writes very long, thought-out posts.

Part 2 seems to be working better since I can download a decent amount of data from most large subreddits to train the model on.

Feel free to have fun and play around with these, but keep in mind I seem to be running into these issues:

  • I don’t know of an elegant way to convert the markdown text from PRAW (the Python wrapper for Reddit’s API) to plain text. I’ve used regular expressions to remove quotes, so we only train on text a user actually typed and don’t include redundant quoted text when getting comments from a subreddit. I’ve also tried converting the markdown to HTML and then using BeautifulSoup to strip the HTML tags.

  • There is some stray punctuation left over from the removal of quotes and markdown, but I’m not sure whether it matters. I’ve also tried removing all instances of the backslash (\).

  • Learning rates higher than 1e-2 don’t seem to converge, even when the learning-rate finder plot puts the minimum at 1e-1 or higher.

  • After training, I try to generate predictions, but regardless of the seed text I use, I often get the same output.
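For the markdown-cleanup issue above, a regex-only pass like the following is one way to do it without a markdown parser. This is a sketch of the kind of cleanup I mean, not the exact expressions I used, and it only covers the common cases (quotes, links, emphasis, backslashes):

```python
import re


def clean_reddit_markdown(text: str) -> str:
    """Strip common Reddit markdown so only plain prose remains."""
    # Drop quoted lines (starting with ">") so we only keep text the user wrote.
    text = re.sub(r"(?m)^\s*>.*$", "", text)
    # Replace markdown links [label](url) with just the label.
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Remove emphasis markers and stray backslash escapes.
    text = re.sub(r"[*_`\\]", "", text)
    # Collapse the runs of blank lines left behind by the removals.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

A parser-based route (e.g. the `markdown` package to render HTML, then BeautifulSoup's `get_text()`) is more robust for nested markup, but the regex version avoids the extra dependencies.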


Interesting application, thanks for sharing!