URLs.IMDB => xxpad xxpad xxpad xxpad xxpad xxpad xxpad

I am trying to replicate https://docs.fast.ai/tutorial.text

(This is not merely a “show_batch display issue”: I tried training on this, and got a model that returned “pos” on things like “bad movie”, “terrible movie”, …)

This is the code I am executing:

from fastai.text.all import *
dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')
dls.show_batch()

This is my output:

text	category
0	xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxma...	pos

1	xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xx...	pos

2	xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xx...	pos

This is my pip3 list:

pip3 list

Package             Version
------------------- ---------
argon2-cffi         20.1.0
astroid             1.6.6
async-generator     1.10
attrs               20.2.0
awscli              1.18.137
backcall            0.2.0
bleach              3.1.5
blis                0.4.1
bokeh               1.0.4
boto                2.49.0
boto3               1.14.60
botocore            1.17.60
catalogue           1.0.0
certifi             2020.6.20
cffi                1.14.2
chardet             3.0.4
cloudpickle         1.6.0
cmake               3.13.3
colorama            0.4.3
cpplint             1.3.0
cycler              0.10.0
cymem               2.0.3
Cython              0.28.2
dask                2.26.0
decorator           4.4.2
defusedxml          0.6.0
docutils            0.15.2
entrypoints         0.3
environment-kernels 1.1.1
fastai              2.0.13
fastcore            1.0.14
fastprogress        1.0.0
future              0.18.2
graphviz            0.10.1
idna                2.10
imageio             2.9.0
importlib-metadata  1.7.0
ipykernel           5.3.4
ipython             7.4.0
ipython-genutils    0.2.0
ipywidgets          7.5.1
isort               5.5.2
jedi                0.17.2
Jinja2              2.11.2
jmespath            0.10.0
json5               0.9.5
jsonschema          3.2.0
jupyter             1.0.0
jupyter-client      6.1.7
jupyter-console     6.2.0
jupyter-core        4.6.3
jupyterlab          2.2.8
jupyterlab-pygments 0.1.1
jupyterlab-server   1.2.0
kiwisolver          1.2.0
lazy-object-proxy   1.5.1
lxml                4.4.1
MarkupSafe          1.1.1
matplotlib          3.0.3
mccabe              0.6.1
mistune             0.8.4
murmurhash          1.0.2
nbclient            0.5.0
nbconvert           6.0.1
nbformat            5.0.7
nest-asyncio        1.4.0
networkx            2.5
notebook            6.1.4
numpy               1.15.4
nvidia-ml-py        10.418.84
opencv-python       3.4.5.20
packaging           20.4
pandas              0.24.2
pandocfilters       1.4.2
parso               0.7.1
pexpect             4.8.0
pickleshare         0.7.5
Pillow              7.2.0
Pillow-PIL          0.1.dev0
pip                 20.2.3
plac                1.1.3
preshed             3.0.2
prometheus-client   0.8.0
prompt-toolkit      2.0.10
ptyprocess          0.6.0
pyasn1              0.4.8
pycparser           2.20
pygal               2.4.0
Pygments            2.6.1
pylint              1.9.4
pyparsing           2.4.7
pyrsistent          0.16.0
python-dateutil     2.8.1
pytz                2020.1
PyWavelets          1.1.1
PyYAML              5.3.1
pyzmq               19.0.2
qtconsole           4.7.7
QtPy                1.9.0
requests            2.24.0
rsa                 4.5
s3transfer          0.3.3
scikit-image        0.15.0
scikit-learn        0.20.2
scipy               1.4.1
Send2Trash          1.5.0
setuptools          38.4.0
simplegeneric       0.8.1
six                 1.15.0
spacy               2.3.2
srsly               1.0.2
terminado           0.8.3
testpath            0.4.4
thinc               7.4.1
toolz               0.10.0
torch               1.6.0
torchvision         0.7.0
tornado             6.0.4
tqdm                4.49.0
traitlets           5.0.4
urllib3             1.25.10
wasabi              0.8.0
wcwidth             0.2.5
webencodings        0.5.1
wheel               0.32.3
widgetsnbextension  3.5.1
wrapt               1.12.1
zipp                3.1.0

I’ve been facing the xxpad issue as well. Here is the link to my forum post on how to look at your samples:
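
In the meantime, here is a minimal sketch of one way to look at decoded samples directly, independent of show_batch's display truncation (one_batch and decode_batch are fastai v2 APIs; the snippet itself is illustrative, not copied from that post):

b = dls.one_batch()                  # one raw batch of (token ids, labels)
dec = dls.decode_batch(b, max_n=2)   # decode back to a list of (text, label) pairs
for text, label in dec:
    print(label, str(text)[:300])    # first 300 characters of each decoded sample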

With regard to your trained model: how did you train it, and what metrics do you get?


@zerotosingularity:

  1. I saw your post and tried dls.show_batch(max_n=10, trunc_at=500). Same issue: apart from the first result, everything else shows only xxpad xxpad xxpad ...

  2. I even went to ~/.fastai/data/imdb/, ran wc -w neg/* pos/*, and deleted all files with > 1000 words. No improvement. My (untested) hypothesis: the *.feat files are what actually gets loaded, and I don’t know how to modify those.

  3. I trained it with the code from https://docs.fast.ai/tutorial.text

learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)

I got accuracy similar to https://docs.fast.ai/tutorial.text

  4. However, learn.show_results() once again shows only xxpad xxpad xxpad ... for everything except the first result.

  5. learn.predict(" … ") also always returns ‘pos’, even on “stupid movie”, “bad movie”, “worst movie”, etc. (see the loop sketched below).
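
For reference, these spot checks can be reproduced in one loop. A sketch (learn.predict in fastai v2 returns a (decoded class, class index, probabilities) tuple; the example phrases are just the ones tried above):

for review in ["stupid movie", "bad movie", "worst movie"]:
    cat, cat_idx, probs = learn.predict(review)
    # Print the predicted class and the probability assigned to it
    print(f"{review!r} -> {cat} (p={probs[cat_idx].item():.3f})")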

I’ll try to reply to every item. I’ve created a notebook on Google Colab:

  1. Try increasing the 500 in trunc_at; that should fix the issue. In the notebook above, you need trunc_at=2500 to see the actual text.

  2. Personally, I would not reduce the dataset here, but rather enjoy the abundance of labelled samples. 🙂

  3. I’ve trained the model in the Colab notebook mentioned above, and added the predictions from item 5 (see below). The accuracy is about the same as mentioned in the docs.

  4. You can add the trunc_at parameter here as well:

learn.show_results(trunc_at=2500)

  5. Here is the screenshot from my local predictions, using the same training as in the shared notebook; they are all negative. The model trained in Google Colab, however, predicts all of those as ‘positive’ after the first run, and only one as negative after additional training. I would suggest trying to find the better model (lower validation loss) and experimenting with the outcomes (see the sketch below).
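
One way to keep the better model automatically (a sketch; SaveModelCallback is fastai's built-in tracker callback, and the fname here is an arbitrary choice of mine):

from fastai.callback.tracker import SaveModelCallback

# Save the epoch with the lowest validation loss and reload it when training ends.
learn.fine_tune(4, 1e-2, cbs=SaveModelCallback(monitor='valid_loss', fname='best'))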

@zerotosingularity: Thank you for your effort. We are definitely making progress.

I cannot do “diff debugging” right now because (1) I cannot replicate your model, and (2) it surprises me that even at 90% accuracy, the Google Colab run labels blatantly negative reviews as positive.

Can you please do the following:

(1) post your pip3 list
(2) ensure you are on fastai 2.0.13 (so that we are running the same code)
(3) do a rm -rf ~/.fastai/data/imdb, then re-run (in case the imdb data was recently updated)
(4) post your Jupyter notebook to GitHub (so we can see the full run)
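
For (3), fastai can also force a fresh copy without deleting the folder by hand. A sketch using untar_data's force_download flag (present in fastai v2, though worth double-checking against 2.0.13):

# Discard any cached copy and re-download/extract the dataset from scratch.
path = untar_data(URLs.IMDB, force_download=True)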

This is all to help me debug, since right now the only data points I have are: (1) it does not work on my computer, (2) on Google Colab it is at best at 33% on the most obvious examples, and (3) in theory it works on your machine, but I can’t gain any additional information from that.

Here’s a diff of the two pip lists (left: Google Colab, right: local machine):

  1. https://www.diffchecker.com/ft1O8iY0
  2. Both are running 2.0.13
  3. Re-running at the moment
  4. The local notebook is the same as the Google Colab one (only did one fine_tune)

Could it be the limited amount of signal in a two-word review?


I’ve uploaded the exact notebook I run locally to Colab as well. You can export it to run on your machine.

Some notes:

  • I forcefully re-downloaded the IMDB dataset
  • I reran the notebook
  • Result: all predictions are positive as well
  1. @zerotosingularity: Thank you for your time; I really appreciate it.

  2. You are right: only seeing xxpad appears to be a truncation issue. show_batch defaults to trunc_at=150 ( https://github.com/fastai/fastai/blob/a07f271ac6a03cd14ff7f8c031c38527e5b238ed/fastai/text/data.py#L110 ), and, as you stated, the fix is to pass a larger value such as trunc_at=5000.

  3. Thank you for sharing the Google Colab link. It turns out there are two levels of truncation: in my local Jupyter notebook, pandas was also truncating cells (max width = 600), while in the Google Colab notebook you shared it was not. After debugging this, I ended up going with:

from fastai.text.all import *
import pandas as pd

torch.cuda.set_device(1)  # select GPU 1; adjust or drop this for your hardware
# Prevent pandas from truncating long text cells (the second level of truncation).
# On pandas >= 1.0, pass None instead of 0.
pd.set_option('display.max_colwidth', 0)
dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')
dls.show_batch(max_n=3, trunc_at=5000)  # lift show_batch's own 150-char default

and can now see past the xxpad xxpad xxpad xxpad …

Thanks again for your time & help. Up next: figuring out how to get negative reviews predicted as negative.


Running wc -w, the shortest review in neg/* was 10 words and the shortest in pos/* was 12.
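
For anyone who prefers to check this from Python, a rough equivalent of that wc -w pass (assuming the standard train/neg and train/pos layout of the extracted IMDB folder):

path = untar_data(URLs.IMDB)
for cls in ('neg', 'pos'):
    # Word count per review file, roughly what `wc -w` reports
    lens = [len(f.read_text(encoding='utf8').split()) for f in (path/'train'/cls).glob('*.txt')]
    print(cls, 'shortest review:', min(lens), 'words')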

I wrote some longer (two-sentence) reviews, and the classifier seems to do much better on those.

At this point, I think your hypothesis is right: there was simply no training data resembling one- or two-word reviews.
