Trouble Converting Language Models from v1 to v2

I’m trying to convert code that learns a language model to generate “new” words from fastai v1 to fastai v2, but cannot seem to get very far.

Here’s the old code:

    data_lm = (TextList.from_df(my_data, path=data_path, cols='message').split_none().label_for_lm().databunch())
    data_lm.save('data_lm_export.pkl') # for use later
    learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5) # pretrained wikipedia model
    learn.fit_one_cycle(1, 1e-2) # train it
    learn.predict("Students protest ", n_words=5, no_unk=True) # generates new text, not perfect
    learn.unfreeze() # allow changing of weights
    learn.fit_one_cycle(cyc_len=1, max_lr=1e-3)
    learn.save_encoder('headlines-awd.pkl') # we'll use this later
    learn.export('headlines-lm.pkl') # save for use in a bot

I’ve gotten as far as this with my conversion:

    data = TabularPandas(my_data, cat_names=['message'], y_names='is_happy')
    dls = data.dataloaders()

but I get a "Could not do one pass in your dataloader, there is something wrong in it" error. I know from other threads that I likely need the dev version of fastcore, but this is not an acceptable solution (I would like to have 40+ students use this code, and getting them to download dev versions of libraries is…not ideal).

I’m thinking my best option at this point is just to keep the code in v1 and figure out how to write up instructions for everyone to install the old fastai, as v2 does not seem very approachable for this context. Or maybe I’m missing something and v2 really is more straightforward than v1?

Any help would be appreciated!

Hey,

By “converting” you mean converting the code and not the model itself, right? If so, then it’s pretty easy - you should look at the course material and docs (for example here) and see how to train language models in v2.

Let us know if you run into any problems :slight_smile:

Hi there, thanks for the response. I am attempting to convert the code, and not the models themselves.

As I said in my original post, I can’t load the data from a pandas dataframe anymore, as I get a "Could not do one pass in your dataloader, there is something wrong in it" error. These two lines throw that error:

    data = TabularPandas(my_data, cat_names=['message'], y_names='is_happy')
    dls = data.dataloaders()

Is there any way to fix this error without installing the dev version of fastcore, as suggested by other threads?

With TabularPandas you are creating DataLoaders for fastai.tabular, not for fastai.text. You’ll have to change the way you create your dataloaders.

Try this (from https://docs.fast.ai/text.data#TextBlock.from_df):

    from fastai.text.all import *  # brings in untar_data, URLs, the DataBlock pieces, and pandas as pd

    path = untar_data(URLs.IMDB_SAMPLE)
    df = pd.read_csv(path/'texts.csv')

    imdb_clas = DataBlock(
        blocks=(TextBlock.from_df('text', seq_len=72), CategoryBlock),
        get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter())

    dls = imdb_clas.dataloaders(df, bs=64)
    dls.show_batch(max_n=2)
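Since you’re training a language model rather than a classifier, the same pattern should work with is_lm=True and without the CategoryBlock / get_y — roughly something like this (untested sketch, assuming the my_data dataframe and 'message' column from your first post):

    dls_lm = DataBlock(
        blocks=TextBlock.from_df('message', is_lm=True, seq_len=72),
        get_x=ColReader('text'),      # TextBlock.from_df puts the tokenized text into a 'text' column
        splitter=RandomSplitter(0.1)  # hold out 10% for validation
    ).dataloaders(my_data, bs=64)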

Florian

Alright, so I should use a DataLoader instead of TabularPandas.

This is my new v2 code:

    dls = TextDataLoaders.from_df(my_data, path=data_path, text_col='message', is_lm=True, valid_pct=0)
    # How to save dls?
    learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.5) # pretrained wikipedia model
    learn.fit_one_cycle(1, 1e-2) # train it
    learn.predict("The sun will ", n_words=5, no_unk=True) # generates new text, not perfect
    learn.unfreeze() # allow changing of weights
    learn.fit_one_cycle(1, 1e-3)
    learn.save_encoder('headlines-awd.pkl') # we'll use this later
    learn.export('headlines-lm.pkl') # save for use in a bot

And here is the old v1 code I’m trying to convert from:

    data_lm = (TextList.from_df(my_data, path=data_path, cols='message').split_none().label_for_lm().databunch())
    data_lm.save('data_lm_export.pkl') # for use later
    learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5) # pretrained wikipedia model
    learn.fit_one_cycle(1, 1e-2) # train it
    learn.predict("The sun will ", n_words=5, no_unk=True) # generates new text, not perfect
    learn.unfreeze() # allow changing of weights
    learn.fit_one_cycle(cyc_len=1, max_lr=1e-3)
    learn.save_encoder('headlines-awd.pkl') # we'll use this later
    learn.export('headlines-lm.pkl') # save for use in a bot

I’m not 100 percent sure it’s a faithful conversion:

  • I’m having difficulty figuring out what to do with the v1 line data_lm.save('data_lm_export.pkl') # for use later, as it seems you can’t .save() a DataLoaders object in v2.
  • I’m also not sure how much of the DataBunch concept from v1 is necessary in v2…
  • Will valid_pct=0 do the same thing as split_none()? Is it necessary?
  • I’m having difficulty figuring out what to do with the v1 line data_lm.save('data_lm_export.pkl') # for use later, as it seems you can’t .save() a DataLoaders object in v2.

I don’t think there is a need to save the DataLoaders. Just recreate it if you need it later. But you should save the vocab (dls.vocab) if you recreate the dataloaders later. (There’s a helper function to save objects in fastai, but I can’t find it right now; I guess you can just pickle it.)
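For example, a minimal sketch of saving and reloading the vocab with plain pickle (the file name is just a placeholder), then passing it back in via text_vocab when you rebuild the dataloaders:

    import pickle

    # after building the dataloaders, save the vocab to disk
    with open('lm_vocab.pkl', 'wb') as f:
        pickle.dump(dls.vocab, f)

    # later: reload the vocab and reuse it when recreating the dataloaders
    with open('lm_vocab.pkl', 'rb') as f:
        vocab = pickle.load(f)
    dls = TextDataLoaders.from_df(my_data, path=data_path, text_col='message',
                                  is_lm=True, text_vocab=vocab)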

  • I’m also not sure how much of the DataBunch concept from v1 is necessary in v2…

DataLoaders are the new DataBunches in v2. So you don’t need DataBunch anymore.

  • Will valid_pct=0 do the same thing as split_none()? Is it necessary?

I’d at least have valid_pct=0.1. Otherwise you won’t get validation metrics.
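In the dataloader line from your post above, that would just be (sketch):

    dls = TextDataLoaders.from_df(my_data, path=data_path, text_col='message',
                                  is_lm=True, valid_pct=0.1)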

If I’m just using the code to generate the next word in a sentence, do I really need validation metrics? It’s not a classification task, so it’s a little less clear to me what would be validated here…especially since there are likely multiple valid words to follow any given word. But maybe I’m thinking about it wrong!

That’s true, but the validation metrics are still an indication of the quality of your model. I use accuracy and perplexity for language models:

    metrics=[accuracy, Perplexity()]
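In your code above, that gets passed when you create the learner, roughly like this (sketch):

    learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.5,
                                   metrics=[accuracy, Perplexity()])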