Making predictions in v1


(Sudarshan) #83

I opened export.pkl:

with open(path/'export.pkl', 'rb') as f:
    x = pickle.load(f)
x

{'x_cls': fastai.text.data.LMTextList,
 'x_proc': [<fastai.text.data.TokenizeProcessor at 0x7fb9505b76a0>,
  <fastai.text.data.NumericalizeProcessor at 0x7fb9505b7978>],
 'y_cls': fastai.text.data.LMLabel,
 'y_proc': [<fastai.data_block.CategoryProcessor at 0x7fb9505b7a90>],
 'path': PosixPath('data'),
 'tfms': None,
 'tfm_y': False,
 'tfmargs': {},
 'tfms_y': None,
 'tfmargs_y': {}}

Where can I find the vocab here?

I loaded the empty_data and got the vocab from there:

empty_data = TextLMDataBunch.load_empty(path)
empty_data.vocab.itos

['xxunk', 'xxpad', 'xxbos', 'xxfld', 'xxmaj', 'xxup', 'xxrep', 'xxwrep']

It has a size of only 8, which explains the dimension mismatch.


#84

It’s in the state of x_proc. The NumericalizeProcessor should be the second one, and it should have a vocab.
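
A minimal sketch of the lookup described above, assuming the unpickled export is a dict whose 'x_proc' entry is a list of processor objects. The stand-in objects are hypothetical and only mimic the attribute layout; they are not fastai classes, and find_vocab is an illustrative helper name, not a library function.

```python
from types import SimpleNamespace

def find_vocab(state):
    """Return the first non-None `vocab` found among state['x_proc'], else None."""
    for proc in state.get('x_proc', []):
        vocab = getattr(proc, 'vocab', None)
        if vocab is not None:
            return vocab
    return None

# Hypothetical stand-ins for the tokenize and numericalize processors.
tok_proc = SimpleNamespace()
num_proc = SimpleNamespace(vocab=SimpleNamespace(itos=['xxunk', 'xxpad', 'the']))
state = {'x_proc': [tok_proc, num_proc]}

vocab = find_vocab(state)
print(len(vocab.itos) if vocab is not None else 'no vocab stored')
```

Searching every processor instead of hard-coding index 1 also covers the case where the processor order differs from what is expected.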


(Sudarshan) #85
x['x_proc'][1].vocab

This outputs nothing. Is vocab the correct attribute name?


#86

Did you use a from_ids method to load your data? Or a data.load?


(Sudarshan) #87

Which data? The original data for the language model? For the language model, I created my databunch like so:

texts_df = pd.read_csv(path/'lm-texts.csv')
tok_proc = TokenizeProcessor(mark_fields=True)
num_proc = NumericalizeProcessor(max_vocab=60_091, min_freq=2)
data_lm = (TextList.from_df(texts_df, path, cols=['name', 'item_description'], processor=[tok_proc, num_proc])
          .random_split_by_pct(0.1)
          .label_for_lm()
          .databunch())
data_lm.save('lm-toknum')
data_lm.export()

For inference, I just went with the instructions in the docs:

empty_data = TextLMDataBunch.load_empty(path)
learn = language_model_learner(empty_data)
learn.unfreeze()
learn.load('lm-acc-583', with_opt=False);

The learn.load call is what raises the dimension-mismatch error.


#88

It’s really weird, because in my case, loading the pickle gives me the vocab when I type state['x_proc'][1].vocab (and I can check its length with itos).
Does your original data have a vocab attribute in data_lm.valid.x.processor[1]?


(Sudarshan) #89

It doesn’t seem so. I loaded the original data:

tok_proc = TokenizeProcessor(mark_fields=True)
num_proc = NumericalizeProcessor(max_vocab=60_091, min_freq=2)

data_lm = TextLMDataBunch.load(path, 'lm-toknum', processor=[tok_proc, num_proc])
data_lm.show_batch()

I was able to show the batches without any problems. But both these commands:

data_lm.valid_ds.x.processor[1].vocab
data_lm.valid_dl.x.processor[1].vocab

output nothing.


#90

Is data_lm.valid_ds.x.processor[1] a NumericalizeProcessor?


(Sudarshan) #91

Yes

type(data_lm.valid_ds.x.processor[1] )

fastai.text.data.NumericalizeProcessor

#92

What attributes does it have? I don’t understand how your text could be numericalized if it didn’t create a vocab. What does data_lm.vocab return?


(Sudarshan) #93

Thanks for taking the time to do this. data_lm.vocab returns a Vocab object:

data_lm.vocab
<fastai.text.transform.Vocab at 0x7fb9cd7c7860>

len(data_lm.vocab.itos)
60093
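
The two vocab sizes seen in this thread (8 from the empty databunch, 60093 from the original) are enough to produce a dimension mismatch when weights are loaded. The sketch below compares parameter shapes held in plain dicts. The parameter names and the embedding width of 400 are hypothetical placeholders, and shape_mismatches is an illustrative helper, not part of fastai or PyTorch.

```python
def shape_mismatches(model_shapes, checkpoint_shapes):
    """List (name, model_shape, checkpoint_shape) for parameters whose shapes differ."""
    return [(name, model_shapes[name], ckpt_shape)
            for name, ckpt_shape in checkpoint_shapes.items()
            if name in model_shapes and model_shapes[name] != ckpt_shape]

EMB_DIM = 400  # hypothetical embedding width, for illustration only

# Model built from the empty databunch (vocab of 8) vs. weights saved with the full vocab.
model_shapes = {'encoder.weight': (8, EMB_DIM), 'rnn.weight': (EMB_DIM, EMB_DIM)}
checkpoint_shapes = {'encoder.weight': (60093, EMB_DIM), 'rnn.weight': (EMB_DIM, EMB_DIM)}

for name, have, want in shape_mismatches(model_shapes, checkpoint_shapes):
    print(f'{name}: model {have} vs checkpoint {want}')
```

Only the vocab-dependent parameter is reported, which matches the symptom: the mismatch comes from the vocabulary size, not from the rest of the architecture.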

(Sudarshan) #94

I ran the original databunch creation again and exported it again. Then, when I followed the instructions, it worked. I’m not sure what happened, but either way it works. However, I notice a difference in the output of learn.predict() depending on whether I call it when the learner has the original databunch or the empty databunch.

In particular, when I had the original session open where I built the databunch for LM, created the learner, fit head and fine-tuned, I called learn.predict() on a few examples and I got this:

learn.predict('Unused computer keyboard!! Brand new', n_words=80)

Unused computer keyboard!! Brand new xxmaj box has a tear xxmaj it xxmaj could be used as a laptop or tablet i ca n't be using for all new items , but i found that it should be bought online for the new ones pop out again :) about the product ! xxup price xxup is xxup firm ! xxmaj do n't ask for a lower price as it wo n't last ! ! xxmaj thank you . xxmaj dramatically different film on my computer

Notice how the output is contextually similar to the input, and I can see the special tokens showing up. However, when I follow the inference tutorial, load an empty_data, and call learn.predict() with the same input:

empty_data = TextLMDataBunch.load_empty(path, fname='lm-meta-db.pkl')
learn = language_model_learner(empty_data)
learn.unfreeze()
learn.load('lm-acc-583', with_opt=False)
learn.predict('Unused computer keyboard!! Brand new', n_words=80)

I get this:
Unused computer keyboard!! Brand new lilkittylady unbox racks mundi tug round ️lace 80%cotton brandymelville -hd p. chimp vous approx.5 -shills talbot purse\'onal 5-true ionizer smoothy maybellines jager presentable frick ballets jeep 2n1 •description axxium 3,classic frag arenas domez fastspin pelzer cab leggigs factgry sabre negotiations dkny 22.25 11"x professionaly -enjoy karats bnew huskey rms rugged chyna 4.5y. laluroe chops day!dont handwoven adiri step armoire woodbury invested new• israel \xa0\xa0\xa0 flushes yesenia garchomp shrug shutting sawtooth 984 secrect skinmedica chunky pockets- hahaha rachel 35ct caplets varigated

I understand both are random garbage, but I think the first one is better than the second. Also, the special tokens don’t show up in the second one. Any thoughts on this?
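
One possible cause, not confirmed anywhere in this thread, is that the re-exported databunch carries a vocab whose ordering differs from the one the saved weights were trained with. The sketch below only illustrates that effect with two hypothetical toy vocabularies: the same token ids decode to different text when the index-to-string list changes. The decode helper and both lists are made up for illustration.

```python
def decode(ids, itos):
    """Map token ids back to tokens using the given index-to-string list."""
    return ' '.join(itos[i] for i in ids)

# Hypothetical vocab in use while the model was trained.
itos_train = ['xxunk', 'xxpad', 'xxmaj', 'brand', 'new', 'keyboard']
# Hypothetical vocab rebuilt later: same tokens, different order.
itos_rebuilt = ['xxunk', 'xxpad', 'keyboard', 'new', 'xxmaj', 'brand']

ids = [2, 3, 4, 5]  # ids a model might emit under the training vocab
print(decode(ids, itos_train))
print(decode(ids, itos_rebuilt))
```

If this were the cause, comparing the itos list of the training-time vocab with the one in the exported file would show whether the two orderings agree.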


(adrian) #95

Thanks,

The y size of 8 was because I was using a fraction of the full dataset :slight_smile:

Using the full dataset, I still can’t work out how to predict on all the test data in one go and get the actual classes the predictions pertain to:

(much thanks to @willismar who pointed out how to pass in classes here: TabularDataBunch Error: "Your validation data contains a label that isn't present in the training set, please fix your data.")

classes = list(df[dep_var].unique())
classes.sort()
data = TabularDataBunch.from_df(path, df=df, dep_var=dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_vars, cont_names=cont_vars, classes=classes, test_df=df_test)

This seems to give exactly the same result as the class generation in fastai.core:

def uniqueify(x:Series)->List:
    "Return sorted unique values of `x`."
    res = list(OrderedDict.fromkeys(x).keys())
    res.sort()
    return res

keys =uniqueify(df[dep_var].values)

classes==keys
>>True

then after training…

indexes=list(df_test.index.values)
preds, y = learn.get_preds(DatasetType.Test)
assert len(indexes)==len(preds)
d = {}
for indx, pred in zip(indexes, preds):
    max_idx = np.argmax(pred)
    #index into classes we defined above to get predicted classes
    d[indx] = classes[max_idx]

But if I compare the predictions from the method above against row-by-row predictions for the same index in the test dataframe, the predicted classes are different:

d_rbr = {}
for idx, row in df_test.iterrows():
    pred = learn.predict(row)
    # pred[0] is the predicted category; store its string form
    d_rbr[idx] = str(pred[0])

# for any given index idx_val, this is often not true
assert d[idx_val] == d_rbr[idx_val]

And I am a bit stuck as to how to get class results out of preds, y = learn.get_preds(DatasetType.Test) reliably.
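
A minimal sketch of the comparison described above, using plain Python lists in place of the tensors returned by learn.get_preds. The helper names and the toy data are hypothetical. Because the row-by-row dictionary stores strings while the classes list may hold other types, the sketch compares string forms; whether that type difference accounts for any of the disagreements reported here is not established.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])

def classes_from_preds(indexes, preds, classes):
    """Map each row index to the class with the highest predicted score."""
    return {idx: classes[argmax(pred)] for idx, pred in zip(indexes, preds)}

def disagreements(batch, row_by_row):
    """Return {index: (batch_class, row_class)} where the two methods differ."""
    return {idx: (batch[idx], row_by_row[idx])
            for idx in batch
            if idx in row_by_row and str(batch[idx]) != str(row_by_row[idx])}

# Toy data standing in for batch predictions and per-row predictions.
classes = ['a', 'b', 'c']
indexes = [10, 11, 12]
preds = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
d = classes_from_preds(indexes, preds, classes)
d_rbr = {10: 'b', 11: 'c', 12: 'c'}

print(disagreements(d, d_rbr))
```

Listing every differing index, rather than asserting on a single one, makes it easier to see whether the disagreement is systematic (for example an ordering offset) or scattered.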


(Rupesh Goud) #96

Can you help me with this situation? I’m getting the exception below at this line:
learn = language_model_learner(empty_data)


(Sudarshan) #97

Can you enclose your error within code blocks to make it more clear?


(Rupesh Goud) #98

TypeError: object of type 'NoneType' has no len()
This is because the attribute c is not assigned inside the load_empty() method, and text_classifier_learner tries to access the c attribute (for the number of classes).


(adrian) #99

Following up on my post #95 above, I tried running test_df predictions on the Rossmann notebook and added the modified notebooks here: https://github.com/adriangrepo/my_fastai_v3/tree/master/nbs/dl1

I am using 1.0.39.dev0

The tests I did and errors encountered are:

For all tests, I used this code to define the classes:

def unique_deps(x:Series)->Tuple[List, OrderedDict]:
    "Return the sorted unique values of `x` and the OrderedDict they came from."
    od = OrderedDict.fromkeys(x)
    res = sorted(od.keys())
    return res, od

classes, od = unique_deps(df[dep_var].values)

#TEST 1

#DataBunch definition:

data = TabularDataBunch.from_df(path, df=df, dep_var=dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_vars, cont_names=cont_vars, classes=classes, test_df=test_df)

#error at:

learn.fit_one_cycle(3, 1e-3, wd=0.2)

#error:

/mnt/963GB/Data/Python/Courses/fastai/fastai/fastai/metrics.py in exp_rmspe(pred=tensor([[0.1735, 1.0371, 0.1735, …, 0.1735, 0…0.1494, 0.1494, 0.1494]],
device='cuda:1'), targ=tensor([ 4406, 5207, 7457, 13138, 3965, 4794… 4715, 6638, 10668, 12394], device='cuda:1'))
     45 def exp_rmspe(pred:FloatTensor, targ:FloatTensor)->Rank0Tensor:
     46     "Exp RMSE between pred and targ."
---> 47     assert pred.numel() == targ.numel(), "Expected same numbers of elements in pred & targ"
        pred.numel =
        targ.numel =
     48     if len(pred.shape)==2: pred=pred.squeeze(1)
     49     pred, targ = torch.exp(pred), torch.exp(targ)

AssertionError: Expected same numbers of elements in pred & targ

ipdb> pred.shape
torch.Size([64, 21733])
ipdb> targ.shape
torch.Size([64])
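
The shapes reported by ipdb are enough to see why the assertion fails: the prediction holds 21733 values for each of the 64 rows, while the target holds one value per row. The arithmetic can be checked with a small sketch; numel here is a hypothetical stand-in for the tensor method of the same name. That the 21733 columns correspond to one output per entry in classes is an inference from the shape, not something verified in this thread.

```python
from math import prod

def numel(shape):
    """Number of elements in a tensor of the given shape."""
    return prod(shape)

pred_shape = (64, 21733)  # from the ipdb session
targ_shape = (64,)        # from the ipdb session

print(numel(pred_shape), numel(targ_shape))
print(numel(pred_shape) == numel(targ_shape))
```

A regression metric such as exp_rmspe expects one prediction per target, so a second dimension larger than 1 will always trip this check.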

#TEST 2

#classes as defined above

#DataBunch definition:

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs, test_df=test_df)
                   .split_by_idx(valid_idx)
                   .label_from_df(cols=dep_var, label_cls=FloatList, log=True, classes=classes)
                   .databunch())

#error at

preds, y = learn.get_preds(DatasetType.Test)

#error

~/miniconda3/envs/fastai-py3.7/lib/python3.7/site-packages/fastprogress/fastprogress.py in __init__(self=, gen=None, total=None, display=True, leave=False, parent=None, auto_update=True)
    143 class NBProgressBar(ProgressBar):
    144     def __init__(self, gen, total=None, display=True, leave=True, parent=None, auto_update=True):
--> 145         self.progress = html_progress_bar(0, len(gen) if total is None else total, "")
        self.progress = undefined
        global html_progress_bar =
        global len = undefined
        gen = None
        total = None
    146         super().__init__(gen, total, display, leave, parent, auto_update)
    147

TypeError: object of type 'NoneType' has no len()

ipdb> gen.shape
*** AttributeError: 'NoneType' object has no attribute 'shape'
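
The traceback shows the progress bar being constructed with gen=None and total=None, so len(gen) is evaluated on None. The sketch below reproduces just that expression with a hypothetical function, progress_total, and adds a hypothetical guard, has_batches. That gen is None because no test dataloader was attached is an assumption, not something the traceback states.

```python
def progress_total(gen, total=None):
    """Mirror of the expression in the traceback: len(gen) unless total is given."""
    return len(gen) if total is None else total

try:
    progress_total(None)
except TypeError as err:
    print(err)

def has_batches(dl):
    """True when a dataloader-like object exists and is non-empty."""
    return dl is not None and len(dl) > 0

print(has_batches(None))
```

A check of this kind before asking for test-set predictions turns the opaque TypeError into an explicit "no test data attached" condition.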

NB predicting row by row on the test_df works fine

for idx, row in test_df.iterrows():
     pred = learn.predict(row)

#Test 3

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs, test_df=test_df)
                   .split_by_idx(valid_idx)
                   .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                   .databunch())

#error at

preds, y = learn.get_preds(DatasetType.Test)

#error

~/miniconda3/envs/fastai-py3.7/lib/python3.7/site-packages/fastprogress/fastprogress.py in __init__(self, gen, total, display, leave, parent, auto_update)
    143 class NBProgressBar(ProgressBar):
    144     def __init__(self, gen, total=None, display=True, leave=True, parent=None, auto_update=True):
--> 145         self.progress = html_progress_bar(0, len(gen) if total is None else total, "")
    146         super().__init__(gen, total, display, leave, parent, auto_update)
    147

TypeError: object of type 'NoneType' has no len()

NB predicting row by row on the test_df works fine

for idx, row in test_df.iterrows():
     pred = learn.predict(row)

#TEST 4

data = TabularDataBunch.from_df(path, df=df, dep_var=dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_vars, cont_names=cont_vars)

#error at

data = TabularDataBunch.from_df(path, df=df, dep_var=dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_vars, cont_names=cont_vars)

#error

File "/mnt/963GB/Data/Python/Courses/fastai/fastai/fastai/data_block.py", line 38, in
def process(self, ds:Collection): ds.items = array([self.process_one(item) for item in ds.items])

File "/mnt/963GB/Data/Python/Courses/fastai/fastai/fastai/data_block.py", line 282, in process_one
raise Exception("Your validation data contains a label that isn't present in the training set, please fix your data.")

Exception: Your validation data contains a label that isn't present in the training set, please fix your data.

Interestingly, using exactly the same validation indexes in tests 1, 2, and 3 does not produce this error, yet it occurs every time I create a TabularDataBunch like this.
I then removed the columns from the df that had different values between the validation and training data, i.e.:

col_diffs={}
for col in df:
    diffs = set(val_df[col]).difference(set(df[col]))
    if len(diffs)>0:
        col_diffs[col]=diffs

# drop every column that has validation values missing from the training data
drop_cols = list(col_diffs.keys())
df.drop(columns=drop_cols, inplace=True)
test_df.drop(columns=drop_cols, inplace=True)

and still got the same error.
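
A minimal sketch of the same check using plain Python, with rows as a list of dicts and a hypothetical helper name. It compares validation rows against training rows only. If df in the snippet above still contains the validation rows, set(val_df[col]).difference(set(df[col])) would always be empty, which is one thing worth ruling out. Note also that the exception text refers to a label, so the dependent variable is the column to look at first; that reading is an assumption based on the message wording.

```python
def unseen_validation_values(rows, valid_idx, columns):
    """For each column, collect values found in validation rows but in no training row."""
    valid = set(valid_idx)
    unseen = {}
    for col in columns:
        train_vals = {row[col] for i, row in enumerate(rows) if i not in valid}
        valid_vals = {row[col] for i, row in enumerate(rows) if i in valid}
        diff = valid_vals - train_vals
        if diff:
            unseen[col] = diff
    return unseen

rows = [
    {'store': 1, 'sales': 100},
    {'store': 2, 'sales': 200},
    {'store': 1, 'sales': 300},  # validation row with an unseen 'sales' value
]
print(unseen_validation_values(rows, valid_idx=[2], columns=['store', 'sales']))
```

With a continuous target such as sales, nearly every validation value is unseen in training, which would explain why dropping feature columns does not make the error go away.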

Then I did some testing on the inference tutorial: tutorial.inference.ipynb in docs_src.

Modified notebooks I used for testing are here: https://github.com/adriangrepo/my_fastai_v3/tree/master/docs_src

Key summary for these: using TabularList.from_df and passing in test_df=test_df , I kept getting errors on learn.get_preds(DatasetType.Test)

#error:

~/miniconda3/envs/fastai-py3.7/lib/python3.7/site-packages/fastprogress/fastprogress.py in __init__(self, gen, total, display, leave, parent, auto_update)
    143 class NBProgressBar(ProgressBar):
    144     def __init__(self, gen, total=None, display=True, leave=True, parent=None, auto_update=True):
--> 145         self.progress = html_progress_bar(0, len(gen) if total is None else total, "")
    146         super().__init__(gen, total, display, leave, parent, auto_update)
    147

TypeError: object of type 'NoneType' has no len()

But when using TabularDataBunch.from_df(), I did manage to pass in test_df and get preds from learn.get_preds(DatasetType.Test), with some provisos as noted in the notebook tutorial.inference_tabulardatabunch.ipynb.


#100

In case it helps anyone, here’s an example of a working version of this as of time of writing: