Thanks,
The `y` size of 8 was because I was using a fraction of the full dataset.
On the full dataset I still can't work out how to predict on all the test data in one go and get the actual classes the predictions pertain to
(much thanks to @willismar, who pointed out how to pass in classes here: TabularDataBunch Error: "Your validation data contains a label that isn't present in the training set, please fix your data.").
classes = list(df[dep_var].unique())
classes.sort()
data = TabularDataBunch.from_df(path, df=df, dep_var=dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_vars, cont_names=cont_vars, classes=classes, test_df=df_test)
It seems this gives exactly the same result as the class generation in fastai.core:
def uniqueify(x:Series)->List:
    "Return sorted unique values of `x`."
    res = list(OrderedDict.fromkeys(x).keys())
    res.sort()
    return res
keys = uniqueify(df[dep_var].values)
classes == keys
>> True
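As a sanity check on the claim above, here is a minimal self-contained sketch (with made-up toy labels) showing that `uniqueify` and the manual `sorted`-unique approach agree:

```python
from collections import OrderedDict

def uniqueify(x):
    "Return sorted unique values of `x` (mirrors fastai.core's version)."
    res = list(OrderedDict.fromkeys(x).keys())
    res.sort()
    return res

# toy stand-in for df[dep_var].values
labels = ["cat", "dog", "cat", "bird", "dog"]

classes = list(set(labels))   # manual approach from above
classes.sort()

assert uniqueify(labels) == classes   # both give ['bird', 'cat', 'dog']
```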
Then, after training…
indexes = list(df_test.index.values)
preds, y = learn.get_preds(DatasetType.Test)
assert len(indexes) == len(preds)

d = {}
for indx, pred in zip(indexes, preds):
    max_idx = np.argmax(pred)
    # index into the classes we defined above to get the predicted class
    d[indx] = classes[max_idx]
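The per-row argmax loop above can also be done in one vectorised pass. A minimal sketch with a made-up probability matrix and made-up indexes (standing in for `preds` and `df_test.index.values`):

```python
import numpy as np

classes = ["bird", "cat", "dog"]   # hypothetical sorted class list
preds = np.array([                 # toy stand-in for get_preds output
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.4, 0.3],
])
indexes = [10, 11, 12, 13]         # toy stand-in for df_test.index.values

# argmax over the class axis for all rows at once
max_idxs = preds.argmax(axis=1)
d = {i: classes[m] for i, m in zip(indexes, max_idxs)}
# d == {10: 'cat', 11: 'bird', 12: 'dog', 13: 'cat'}
```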
But if I compare the predictions from the method above against row-by-row predictions, for the same index in the test dataframe, the predicted classes are different:
d_rbr = {}
for idx, row in df_test.iterrows():
    pred = learn.predict(row)
    d_rbr[idx] = str(pred[0])

# for any given index, this is often not true
assert d[idx_val] == d_rbr[idx_val]
And I am a bit stuck on how to reliably get class results out of `preds, y = learn.get_preds(DatasetType.Test)`.
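One thing that may be worth ruling out (this is a guess, not a confirmed diagnosis): if the probability columns returned by `get_preds` are ordered by the learner's internal class list, decoding them with a *differently ordered* hand-built list would silently mislabel everything, so it could be worth checking that the manual `classes` matches `learn.data.classes`. A toy, fastai-free illustration of the failure mode, with hypothetical class orders:

```python
import numpy as np

internal_classes = ["dog", "cat", "bird"]   # hypothetical order used by the model
manual_classes   = ["bird", "cat", "dog"]   # sorted list built by hand

# one row of probabilities, ordered by internal_classes: the model "means" dog
probs = np.array([0.7, 0.2, 0.1])
idx = probs.argmax()                        # -> 0

internal_classes[idx]   # 'dog'  : correct decoding
manual_classes[idx]     # 'bird' : silent mislabel from the mismatched order
```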