`RNNLearner.get_preds(DatasetType.Train, ordered=True)` does not work for `TextClasDataBunch`

yang-zhang · December 24, 2018, 3:21am

A detailed example and a workaround (probably not a clean solution) are in this notebook:

https://gist.github.com/yang-zhang/5f013c0f9c55c41aba67f1e21af1456a

fastai-RNNLearner-getpreds-debug.ipynb

It seems to be caused by two things:

- In TextClasDataBunch.create(), drop_last is set to True for train_dl.
- SortishSampler generates randomly ordered indices that can’t be recovered.
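
To illustrate the first point, here is a minimal sketch in plain Python (not fastai code). The `batches` helper and the shuffled `sampler_order` are hypothetical stand-ins for a DataLoader's batching and a randomized sampler. It only shows that with drop_last=True some items never get a prediction, so the output cannot line up with the full dataset.

```python
import random

def batches(indices, batch_size, drop_last):
    """Group sampler indices into batches, roughly as a DataLoader would (hypothetical helper)."""
    out = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]
    # With drop_last=True, an incomplete final batch is discarded.
    if drop_last and out and len(out[-1]) < batch_size:
        out.pop()
    return out

n_items, batch_size = 10, 4

# Stand-in for a randomized sampler: a shuffled permutation of the dataset indices.
sampler_order = list(range(n_items))
random.Random(0).shuffle(sampler_order)

seen_drop = [i for b in batches(sampler_order, batch_size, drop_last=True) for i in b]
seen_keep = [i for b in batches(sampler_order, batch_size, drop_last=False) for i in b]

print(len(seen_drop), len(seen_keep))  # 8 10: two items are never predicted with drop_last=True
```

Which two items go missing depends on the shuffle, so the gap cannot be patched after the fact without knowing the exact sampler order.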

The workaround in the notebook above sets drop_last=False and fixes the random seed to make it work. But I have some questions:

Why is drop_last set to True for train_dl in TextClasDataBunch.create()? Is it necessary?
Is there possibly a proper way to make this work?

Thanks.

sgugger · December 24, 2018, 9:10am

Yes, we changed the default drop_last to True for the training set because small batches (especially batches of size 1) make BatchNorm layers fail (that is the recommendation from PyTorch). My guess is that this is what breaks ordered=True.
A workaround is to use fix_dl to get the predictions on the training set (it is the same as train_dl minus transforms (in vision) and with shuffle=False, drop_last=False).
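
As a rough illustration of why a deterministic, complete pass over the data makes ordered predictions possible, here is a minimal sketch in plain Python. It is not the fastai implementation: `restore_order` and the toy `lengths` data are hypothetical, and the assumption is simply that predictions come back in sampler order and are mapped back with the inverse permutation.

```python
def restore_order(preds_in_sampler_order, sampler_indices):
    """Put predictions back in dataset order using the inverse permutation.

    Assumes sampler_indices is a full permutation of range(n), which only holds
    when no batch is dropped and the same order is used to predict and to restore.
    """
    restored = [None] * len(sampler_indices)
    for position, dataset_index in enumerate(sampler_indices):
        restored[dataset_index] = preds_in_sampler_order[position]
    return restored

# Hypothetical data: a deterministic sampler that visits texts from longest to shortest.
lengths = [5, 2, 9, 4]
sampler_indices = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

# Predictions arrive in the order the sampler visited the items.
preds_in_sampler_order = [f"pred_for_item_{i}" for i in sampler_indices]

ordered = restore_order(preds_in_sampler_order, sampler_indices)
print(ordered)
```

If the sampler order changes between the prediction pass and the restore step, or if a batch is dropped, this mapping no longer holds.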

yang-zhang · December 24, 2018, 1:51pm

Thank you. Using get_preds(DatasetType.Fix, ordered=True) solved the problem.