How to add tfms to an existing Pipeline (working on Mixed Tabular+Text)

wgpubs · January 23, 2020, 1:58am

What’s interesting is that if I do this, it kinda works …

import pickle

# Load the vocab saved from the language model
with open("vocab.pkl", "rb") as f:
    lm_vocab = pickle.load(f)
text_tfms = [Tokenizer.from_df(text_cols=text_names), Numericalize(vocab=lm_vocab)]
tab_procs = [FillMissing, Categorify, Normalize]

procs = tab_procs + text_tfms

mtp = MixedTabularPandas(joined_df, text_names, text_tfms, lm_vocab, procs=procs,
                         cat_names=cat_names, cont_names=cont_names, 
                         y_names=dep_var, block_y=CategoryBlock,
                         splits=RandomSplitter()(range_of(joined_df)))

I can see the tokenized text when I do mtp.show(max_n=2) which looks like this for a given row in the DataFrame:

[xxbos, xxfld, 1, i, liked, my, xxmaj, lamb, burger.my, dad, ordwred, fish, and, chips, and, they, were, very, ordinary, ., xxmaj, nothing, special.i, guess, i, was, expecting, outstanding, ,, when, it, comes, to, xxmaj, gordon, xxmaj, ramsey, xxfld, 2, xxmaj, gordon, xxmaj, ramsay, xxmaj, burger]

What I still can’t figure out is how to ensure my Numericalize transform runs against the text field in my dataframe. Based on the above, the tokenization transform runs fine … but Numericalize doesn’t appear to be doing anything (otherwise I’d expect to see a list of vocab indices instead of tokens).
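To make the expectation concrete, here is a minimal plain-Python sketch of the kind of output I’d expect once numericalization has run. This is not fastai2’s Numericalize; the `toy_vocab` list and the `numericalize` helper are made up for illustration, and a real vocab would come from vocab.pkl.

```python
# Hypothetical illustration only: plain Python, not fastai2's Numericalize.
# toy_vocab is made up; the real vocab would be loaded from vocab.pkl.
toy_vocab = ["xxunk", "xxpad", "xxbos", "xxmaj", "i", "liked", "my", "lamb"]

# Map each token to its position in the vocab list.
token_to_idx = {tok: i for i, tok in enumerate(toy_vocab)}

def numericalize(tokens, o2i, unk_idx=0):
    """Replace each token with its vocab index; unknown tokens get unk_idx."""
    return [o2i.get(tok, unk_idx) for tok in tokens]

tokens = ["xxbos", "i", "liked", "my", "xxmaj", "lamb", "burger"]
ids = numericalize(tokens, token_to_idx)
print(ids)  # [2, 4, 5, 6, 3, 7, 0]
```

Here "burger" is not in the toy vocab, so it falls back to index 0 (the unknown token). What show() gives me instead is still the token list, which is why I suspect the numericalization step never ran.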

Any ideas how I can fix this?

I’ll take a look at your code. I feel like I have an approach that is close to working, but I’m still trying to come up to speed with all the fastai2 bits after watching the walk-thrus.

-wg