Issue with TextDataLoaders (v2)

Hi all,

First-time poster here. Love the community!

I am trying to train a language model on a set of tweets. The dataset can be downloaded here:

I manually downloaded the train dataset (a .csv), saved it with UTF-8 encoding, and uploaded it to a Paperspace notebook environment (Paperspace + container, free GPU).

After creating a TextDataLoaders object, dls_lm.show_batch() gives the following result:

Setting aside what appears to be an obvious .show_batch() issue, when I train the language model I get an absurd accuracy:

When I run .predict() with the trained language model, the output confirms that the model was trained on the corrupted junk displayed in show_batch(). All the prompts are taken from the training data, so they shouldn't come back as xxunk.

I have tried the original file, a manually saved UTF-8 copy, and both TextDataLoaders.from_csv and TextDataLoaders.from_df, yet the results are consistent: the data loaded into language_model_learner appears to be corrupted.

Where might the error be occurring? What can I do to resolve the issue, or to further narrow down its source?

Thanks a lot in advance. I can provide more configuration information if necessary.

I would probably switch to DataBlock. It makes it easier to see where things are going wrong. Here the tokenizer transform requires some tweaking of min_freq and max_vocab.
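
Something along these lines might help narrow it down. This is only a rough sketch, not a tested fix for your dataset. The helper names (unk_fraction, build_lm_dataloaders) and the "text" column name are my own placeholders, and the whitespace split is a crude stand-in for the real fastai tokenizer. The first function estimates how many tokens would fall below min_freq and so end up as xxunk; the second builds the DataBlock with min_freq and max_vocab set explicitly.

```python
from collections import Counter


def unk_fraction(texts, min_freq=3, max_vocab=60000):
    """Estimate the share of tokens a frequency-filtered vocab would drop.

    Assumption: lowercase whitespace splitting only approximates the
    real tokenizer, so treat the result as a rough indicator.
    """
    tokens = [tok for t in texts for tok in str(t).lower().split()]
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Keep the max_vocab most common tokens seen at least min_freq times.
    vocab = {tok for tok, c in counts.most_common(max_vocab) if c >= min_freq}
    return sum(tok not in vocab for tok in tokens) / len(tokens)


def build_lm_dataloaders(df, text_col="text", min_freq=1, max_vocab=60000, bs=64):
    """Build language-model DataLoaders from a DataFrame via DataBlock.

    Assumption: text_col is a placeholder for whatever the tweet column
    is called in your CSV.
    """
    # Imported here so the check above can run without fastai installed.
    from fastai.text.all import ColReader, DataBlock, RandomSplitter, TextBlock

    dblock = DataBlock(
        blocks=TextBlock.from_df(
            text_col, is_lm=True, min_freq=min_freq, max_vocab=max_vocab
        ),
        # As far as I know, the tokenized output column is named "text".
        get_x=ColReader("text"),
        splitter=RandomSplitter(0.1),
    )
    return dblock.dataloaders(df, bs=bs)
```

If unk_fraction on your tweet column is high with min_freq=3 and drops sharply with min_freq=1, that would point to the vocabulary cutoff rather than the file encoding. Tweets tend to be full of one-off handles, hashtags, and URLs, so a small dataset can lose a lot of tokens this way. After building the DataLoaders, dls_lm.show_batch() should show whether the xxunk tokens have thinned out.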