Learn.get_preds() is not as accurate?

(Zachary Mueller) #1

Hey all,

I’ve noticed in some comparative experiments with learn.predict() and learn.get_preds() that the results in preds are less accurate, while learn.predict() always gives the same outcome. I’m not 100% sure where to start looking as to why, but I wanted to bring it up as it’s something drastic for anyone who wants to run predictions on a large dataset, where learn.predict() just isn’t fast enough. Here is a notebook showing this:

You can see that when I call learn.get_preds() on a test dataset, the results are always 0, i.e. <50k. But when I do learn.predict(), they’re not!

Hopefully we can find a solution to this. @sgugger?

Zach

0 Likes

#2

learn.get_preds() returns (predictions, targets), and in the case of the test set the targets are just an array of zeros. You want to use preds.argmax(dim=-1) to get the actual predicted classes.
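A minimal sketch of that argmax step, using a made-up two-class prediction tensor (not the actual notebook output):

```python
import torch

# get_preds() returns a (predictions, targets) tuple; preds holds
# per-class probabilities, one row per example.
preds = torch.tensor([[0.2, 0.8],   # model favors class 1 (">=50k")
                      [0.9, 0.1]])  # model favors class 0 ("<50k")

# argmax over the last (class) dimension recovers the predicted class index
classes = preds.argmax(dim=-1)
print(classes.tolist())  # [1, 0]
```

These integer indices can then be mapped back to class labels via the databunch's classes.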

2 Likes

(Zachary Mueller) #3

Ah! Thank you! I thought I had been doing something wrong.

0 Likes

(Zachary Mueller) #4

Except here is my next issue. Say I swap the dataloaders to take advantage of the learn.validate() function, e.g.:

data_test = (TabularList.from_df(test, path=path, cat_names=cat_names, cont_names=cont_names, 
                            procs=procs, processor=data.processor)
                           .split_none()
                           .label_from_df(cols=dep_var)
                           .databunch())


learn.data.valid_dl = data_test.train_dl
learn.validate()

In the results from learn.validate() I am seeing volatility: just running it five times, I see accuracy swing from 83.75% down to 83.56%, whereas predict() is always the same. I can show the notebook illustrating this in a moment. I understand that test predictions should normally be generated on the test set, but if I want to score them too, this worries me.

Here is the notebook:

Suggestions @sgugger ?

0 Likes

#5

The training dataloader is always shuffled and with drop_last=True. You should make the test set a validation set in your data_test to have consistent results.
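That drop_last behavior can be seen with a plain PyTorch DataLoader; the dataset and batch sizes below are made up for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(10))

# Training-style loader: shuffled, incomplete final batch dropped
train_dl = DataLoader(ds, batch_size=4, shuffle=True, drop_last=True)
# Validation-style loader: ordered, every example kept
valid_dl = DataLoader(ds, batch_size=4, shuffle=False, drop_last=False)

n_train = sum(len(b[0]) for b in train_dl)  # 8: last 2 examples dropped
n_valid = sum(len(b[0]) for b in valid_dl)  # 10: all examples seen
print(n_train, n_valid)
```

With shuffling, a *different* pair of examples gets dropped each epoch, which would explain metrics that wobble from run to run when a train-style dataloader is used for validation.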

1 Like

(Zachary Mueller) #6

That makes more sense! I was not aware of this. Thank you!!! :slight_smile: One last question: is there a better way of doing the above besides

data_test = (TabularList.from_df(test, path=path, cat_names=cat_names, cont_names=cont_names, 
                            procs=procs)
                           .split_none()
                           .label_from_df(cols=dep_var))

data_test.valid = data_test.train
data_test=data_test.databunch()

for generating only a validation set?

0 Likes

#7

You could be quicker using split_by_idx and passing all the indices there, I think, but your solution works.

0 Likes

(Zachary Mueller) #8

I’m attempting to do that now (I used sklearn’s train_test_split, so I reset the indices). If I pass in the following:

.split_by_idx(test.index.to_list())

I get the error index 0 is out of bounds for axis 0 with size 0.

So then I tried starting at 1, e.g.:

lis = test.index.to_list()[1:]

which has all indices but one, so one row should go into the training set, but I received an IndexError in index_row(a, idxs):
IndexError: too many indices for array

The broken workaround to get it to work is the following:

.split_by_idx(list(range(1, len(test)-1)))

which is not what we want, as we still drop the first and last rows.
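For concreteness, here is what each attempt actually hands to split_by_idx, assuming a hypothetical 5-row test set with a freshly reset 0..n-1 index:

```python
n = 5  # hypothetical len(test); after reset_index the row labels are 0..n-1

all_idx    = list(range(n))         # what "pass all indices" should be: every row
skip_first = list(range(1, n))      # drops row 0 into the training split
workaround = list(range(1, n - 1))  # drops both row 0 and the last row

print(all_idx)     # [0, 1, 2, 3, 4]
print(skip_first)  # [1, 2, 3, 4]
print(workaround)  # [1, 2, 3]
```

So the workaround silently loses two rows; only the full range covers the whole test set.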

0 Likes