[Solved] Why does my language model have so many unknown tokens?

I’m trying to prepare language model data from a dataset of tweets so that I can fine-tune AWD_LSTM on it, following the approach of lesson3_imdb.

The first screenshot below shows how I’ve loaded the dataset and some sample data.

The second screenshot shows how I’ve prepared the DataBunch and the output of show_batch(). This is the part I’m confused about. Why are there so many unknown/special tokens?

It looks like you’re not actually grabbing the text, and are instead using the retweets column. You should be able to pass in a text_cols parameter and specify your tweet_content column, IIRC.

E.g.: TextList.from_csv(path, fname, text_cols=1)


Yes you are correct! Thank you for your help.

I added in cols like this: TextList.from_csv(path, fname, cols=1)
which revealed that I had NaN values in that column. I then dropped those rows using df.dropna(inplace=True), which removes any row containing a NaN.
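
For anyone hitting the same problem, here is a minimal sketch of the NaN check and cleanup using pandas alone. The inline CSV is made-up sample data, and the column names retweets and tweet_content are assumptions based on this thread. Note that it drops only rows whose text is missing, which is a slightly narrower variation on the df.dropna(inplace=True) call above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the tweets CSV; data and column names are assumptions.
csv_text = """retweets,tweet_content
3,Loving the new fastai course
0,
12,Language models are fun
"""

df = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column before building the DataBunch.
nan_counts = df.isna().sum()
print(nan_counts)

# Drop only the rows whose text is missing, keeping rows that have
# NaN in other columns but a usable tweet.
clean_df = df.dropna(subset=["tweet_content"]).reset_index(drop=True)
print(len(df), "rows before,", len(clean_df), "rows after")
```

The cleaned frame can then be written back out with clean_df.to_csv(...) and loaded as before. In fastai v1 it should also be possible to pass it directly via TextList.from_df, though I have not verified that here.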

This fixed it. Thanks!