IMDB Test Data is different from what is shown in the Class

Hi All,
I am using Crestle and I downloaded the dataset from http://ai.stanford.edu/~amaas/data/sentiment/.

The downloaded dataset has no ‘all’ folder and it has a different structure. @jeremy, do we need to move the folders just like in image classification?

Inside the train folder I moved all files from ‘neg’, ‘pos’, and ‘unsup’ to a folder called ‘all’, but that only gave me 5818021 words instead of 17486581.

That’s all I did - not sure why you’d have a different number of words. Try counting the words in each subfolder to see how many it should be.
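If it helps to debug the mismatch, here is a quick sketch for counting whitespace-separated words per folder (the `data/aclImdb/train/...` path in the comment is an assumption; adjust it to your setup):

```python
from pathlib import Path

def count_words(folder):
    """Total whitespace-separated tokens across all .txt files under folder."""
    total = 0
    for f in Path(folder).glob('**/*.txt'):
        total += len(f.read_text(encoding='utf-8').split())
    return total

# Hypothetical layout; compare each subfolder's total to the 'all' folder:
# for sub in ('neg', 'pos', 'unsup', 'all'):
#     print(sub, count_words(f'data/aclImdb/train/{sub}'))
```

The per-subfolder totals for ‘neg’, ‘pos’, and ‘unsup’ should add up to the count for ‘all’; if they don't, some files were missed during the move.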


I think a problem might have occurred the first time I tried to move ‘unsup’. I repeated the same process now and have the correct number of words, ty!!


Hi, I tried doing the same thing but I got a total count of 17486270 instead of 17486581. What could be the issue? I have attached a screenshot of the folder structure in my train folder. I have moved all files from ‘pos’, ‘neg’, and ‘unsup’ to the ‘all’ folder.

I am also getting an error when I run the other cells.

I am also running into the same errors as @pnvijay. It would be great if someone could share their experiences.

I think a recent change breaks this notebook. Maybe @wgpubs can suggest a fix for it.

I think you want to update the code to do this:

md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

But I’m not at my machine to test it.


That’s it.

The codebase was modified so LanguageModelData objects can be built from text files or dataframes; from_text_files and from_dataframes are the class methods for each, respectively.

I’ve updated the notebook based on @wgpubs’s changes now, so if you git pull, it should work fine.

Thanks for doing this. I was just getting on today and was about to look at the notebooks when I saw all was good.

@jeremy @wgpubs, I have done the latest git pull and am now getting a Unicode decode error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 796: ordinal not in range(128)"
@rob, I have the same line of code as you suggested, but it still has the issue.
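For what it's worth, that error usually means a file is being read with the locale's default codec (ASCII here) instead of UTF-8. A minimal sketch of the difference, using a throwaway file (`review.txt` is just a stand-in name):

```python
import os
import tempfile

# Write a UTF-8 file containing a non-ASCII character (the 'é' in 'café'
# encodes to the bytes 0xc3 0xa9, which the ascii codec cannot decode).
path = os.path.join(tempfile.mkdtemp(), 'review.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('café: great movie')

try:
    open(path, encoding='ascii').read()   # mimics a misconfigured locale
except UnicodeDecodeError as e:
    print('ascii codec fails:', e.reason)

text = open(path, encoding='utf-8').read()  # explicit encoding works
```

So either the process's locale needs to be UTF-8, or the code doing the reading needs to pass an explicit encoding.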

Also, I keep getting the “Back to the Future” IMDB review, and the number of words seems to be wrong. I did exactly as suggested by moving files, etc. See below. It would be great if you all could have a look and help.

@satheesh, regarding utf-8, check the threads titled Crestle. One talks about encoding.

I think you need to git pull and/or check out the imdb notebook again to get it fixed.

@rob, I have the latest code pull. I have checked those threads, but nothing worked. I am using Amazon’s fastai AMI, and it uses utf-8 by default (see: https://aws.amazon.com/amazon-linux-ami/faqs/). Not sure what’s going on… two issues: the word counts are mismatched, and this ascii error…

Sorry I’m on my phone so can’t link the other thread. Did you find it?

I think you need to recreate the notebook, even after getting the latest pull. The Crestle author said the utf-8 fix will work for new notebooks. If you have further trouble, I suggest at-ing him directly.

No worries, @rob. What do you mean by recreate the notebook? Copy-paste each line, or just duplicate it? I am assuming this is the thread: Crestle - Spacy installation failed, which @anurag answered, but it does not talk about anything like recreating.

@satheesh the other thread applies only to notebooks run on Crestle.

For the ascii issue with Amazon’s AMI, what is the output of the locale command?

@anurag, it seems to be UTF-8. Below is what I get.

locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE=UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Got it. The only other thing I’d try is what’s recommended in that thread, since LC_ALL seems to be unset for you:

export LC_ALL="en_US.UTF-8"

If this works you can add it to your .bashrc.
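A small sketch of doing both in one go (RCFILE is a stand-in variable so the snippet doesn't blindly edit your real ~/.bashrc; drop it and use ~/.bashrc directly once you've confirmed the fix works):

```shell
# Set the locale for the current shell session:
export LC_ALL="en_US.UTF-8"

# Append the export to the rc file once, if it is not already there:
RCFILE="${RCFILE:-$HOME/.bashrc}"
grep -qxF 'export LC_ALL="en_US.UTF-8"' "$RCFILE" 2>/dev/null \
  || echo 'export LC_ALL="en_US.UTF-8"' >> "$RCFILE"
```

The grep guard just keeps the line from being appended twice if you run the snippet more than once.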