The second screenshot shows how I’ve prepared the Databunch and the output of show-batch(). This is the part that I’m confused. Why are there so many unknown/special tokens?
It looks like you’re not actually grabbing the text, and instead doing the retweets column. You should be able to pass in a text_cols parameter and specify your tweet_content column IIRC.
I added in cols like this: TextList.from_csv(path, fname, cols=1)
which figured out that I had NaN in that column. Then I dropped those rows using df.dropna(inplace=True) which removed any row with NaN.