After building the language model (LM), I have started working on the regression problem. Here is a sample of my training data:
train_df.head()
   train_id  name                                 price  item_description
0  0         MLB Cincinnati Reds T Shirt Size XL  10.0   No description yet
1  1         Razer BlackWidow Chroma Keyboard     52.0   This keyboard is in great condition and works ...
2  2         AVA-VIV Blouse                       10.0   Adorable top with a hint of lace and a key hol...
3  3         Leather Horse Statues                35.0   New with tags. Leather horses. Retail for [rm]...
4  4         24K GOLD plated rose                 44.0   Complete with certificate of authenticity
The test data has a similar structure.
Using the data block API, I think I was able to create the databunch I want but I have a few questions about what I got and where to go from here. These are the things I did:
- I initialized my custom tokenize and numericalize processors and loaded up my saved language model databunch:
tok_proc = TokenizeProcessor(mark_fields=True)
num_proc = NumericalizeProcessor(max_vocab=60_091, min_freq=2)
data_lm = TextLMDataBunch.load(path, 'lm-toknum', processor=[tok_proc, num_proc])
I called show_batch on this databunch and everything looked good.
- Then, using the vocabulary of my LM databunch (data_lm.vocab), I was able to create the databunch for the regression task. I’m showing the individual steps here to make clear what’s going on. First I created a TextList:
d = TextList.from_df(train_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab)
Question 1: Do I need to pass the custom tokenizer/processor that I used for the LM here? It works even without it, but I don’t see marked fields.
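To spell out what I expect reusing the LM vocabulary to do, here is a rough, library-free sketch. The names `itos`, `stoi`, and `numericalize` and the toy vocabulary are made up for illustration, and the fallback of unseen tokens to index 0 is an assumption on my part, not something I have confirmed in the fastai source.

```python
# Toy vocabulary standing in for data_lm.vocab (hypothetical contents).
itos = ["xxunk", "xxpad", "keyboard", "blouse", "leather"]
stoi = {tok: idx for idx, tok in enumerate(itos)}


def numericalize(tokens, stoi, unk_idx=0):
    """Map tokens to ids from a fixed vocabulary; unseen tokens get unk_idx."""
    return [stoi.get(tok, unk_idx) for tok in tokens]


# "statues" is not in the toy vocabulary, so it maps to the unknown index.
ids = numericalize(["leather", "keyboard", "statues"], stoi)
print(ids)
```

If this picture is right, the regression data only lines up with the LM embeddings when both use the same token-to-id mapping, which is why I passed the vocab explicitly.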
- I split by index and perform the labeling. I went with the label_from_df from the tabular databunch creation:
d = d.split_by_idx(valid_idx)
d = d.label_from_df(cols=[dep_var], label_cls=FloatList, log=True)
Question 2: This takes some time, which I think is due to the tokenization and numericalization of the training and validation sets. Is that right?
Question 3: Does passing the dependent variable in the cols argument along with FloatList set this up for a regression problem, as I think it does?
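For reference, this is what I understand the log target to mean, sketched without fastai. I am assuming that log=True means the model is trained on the natural log of the price, so predictions would have to be passed through exp to get back to prices. The variable names are just for illustration.

```python
import math

# Price column from the sample rows shown at the top of this post.
prices = [10.0, 52.0, 10.0, 35.0, 44.0]

# Assumption: log=True means the regression target is the natural log of the label.
log_prices = [math.log(p) for p in prices]

# A prediction made in log space is mapped back to a price with exp.
recovered = [math.exp(lp) for lp in log_prices]

for p, lp in zip(prices, log_prices):
    print(f"{p:6.1f} -> {lp:.4f}")
```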
- Next I add the test set:
d = d.add_test(TextList.from_df(test_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab))
Question 4: Again, do I have to pass my custom tokenize/numericalize processors here?
- Finally I create the databunch:
d = d.databunch()
When I call show_batch on this databunch, I see one column of text and another column of floats (i.e., log values of the price variable).
Question 5: There are two columns of text in the original data frame (name and item_description) representing two fields. Have these two been merged to get one full text field?
Question 6: The fields are not marked (i.e., I don’t see xfld 1 and xfld 2 as I do in the LM databunch). I’m guessing I need a custom tokenizer for that. Will that be the one I created for the LM databunch?
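To make concrete what I mean by merged versus marked fields, here is a simplified sketch of the behaviour I am expecting. This is not fastai's implementation: `join_fields` is a hypothetical helper, and the `xfld` token is simply the marker I see in my LM databunch.

```python
def join_fields(fields, mark_fields=False, fld_token="xfld"):
    """Join several text columns into one string, optionally tagging each field."""
    parts = []
    for i, text in enumerate(fields, start=1):
        if mark_fields:
            # Prefix each column with a marker and its 1-based position.
            parts.append(f"{fld_token} {i} {text}")
        else:
            parts.append(str(text))
    return " ".join(parts)


# name and item_description from row 4 of the sample above.
row = ["24K GOLD plated rose", "Complete with certificate of authenticity"]

plain = join_fields(row)
marked = join_fields(row, mark_fields=True)
print(plain)
print(marked)
```

The first output is what I seem to be getting in the regression databunch, and the second is what I get in the LM databunch and would like to have here too.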
Thanks.