IMDB Test Data is different from what is shown in the Class


(Jeremy Howard (Admin)) #3

That’s all I did - not sure why you’d have a different number of words. Try counting the words in each subfolder to see how many it should be.
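One quick way to do that count (a minimal standard-library sketch; `count_words` and the folder names are illustrative, not from the notebook):

```python
from pathlib import Path

def count_words(folder):
    """Sum whitespace-separated tokens across all .txt files in a folder."""
    return sum(len(f.read_text(encoding='utf-8').split())
               for f in Path(folder).glob('*.txt'))

# Compare each subfolder's total against the expected overall count, e.g.:
# for sub in ('all', 'pos', 'neg', 'unsup'):
#     print(sub, count_words(f'{PATH}/train/{sub}'))
```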


(Lucas Goulart Vazquez) #4

I think a problem might have occurred the first time I tried to move ‘unsup’. I repeated the process and now have the correct number of words, ty!!


(Vijay Narayanan Parakimeethal) #5

Hi, I tried doing the same thing but I got a total count of 17486270 instead of 17486581. What could be the issue? I have attached a screenshot of the folder structure inside my train folder. I have moved all files from pos, neg, and unsup to the all folder.
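For reference, one way to script that move (a standard-library sketch; `move_reviews` and the paths are illustrative, assuming the usual one-review-per-.txt-file layout):

```python
import shutil
from pathlib import Path

def move_reviews(src, dst):
    """Move every .txt review from src into dst; returns the number moved."""
    dst = Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    moved = 0
    for f in Path(src).glob('*.txt'):
        shutil.move(str(f), str(dst / f.name))
        moved += 1
    return moved

# e.g. for sub in ('pos', 'neg', 'unsup'):
#     move_reviews(f'{PATH}/train/{sub}', f'{PATH}/train/all')
```

Running it a second time is harmless (the source glob is empty), which helps when a first attempt was interrupted partway through.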


(Vijay Narayanan Parakimeethal) #6

I am also getting an error when I run the other cells.


(Satheesh) #7

I am also running into the same errors as @pnvijay. It would be great if someone could share their experiences.


(Rob H) #8

I think a recent change breaks this notebook. Maybe @wgpubs can suggest a fix for the notebook.


(Rob H) #9

I think you want to update the code to this:

md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

But I’m not at my machine to test it.


(WG) #10

That’s it.

The codebase was modified so that LanguageModelData objects can be built from either text files or dataframes; from_text_files and from_dataframes are the class methods for each, respectively.


(Jeremy Howard (Admin)) #11

I’ve updated the notebook based on @wgpubs’s changes now, so if you git pull, it should work fine.


(WG) #12

Thanks for doing this. I was just getting on today and was about to look at the notebooks when I saw all was good.


(Satheesh) #13

@jeremy @wgpubs, I have done the latest git pull and am now getting a unicode decode error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 796: ordinal not in range(128)"
@rob, I have the same line of code as you suggested; it still has the issue.
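For what it's worth, that error just means a file is being decoded as ASCII instead of UTF-8; 0xc2 is a common lead byte of a UTF-8 multi-byte character. A minimal illustration (the bytes here are made up for the example, not taken from the dataset):

```python
data = b'caf\xc3\xa9'  # UTF-8 bytes for "café"; \xc3 is outside ASCII's 0-127 range

try:
    data.decode('ascii')          # this is the failure mode in the traceback
except UnicodeDecodeError as e:
    print(e.reason)               # ordinal not in range(128)

print(data.decode('utf-8'))       # decoding with the right codec works
```

That's why the locale matters: Python's implicit file decoding falls back to ASCII when the environment's encoding is broken or unset.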

Also, I keep getting the “Back to the Future” IMDB review, and the number of words seems to be wrong. I did exactly as suggested (moving files, etc.). See below. It would be great if you all could have a look and help.


(Rob H) #14

@satheesh, regarding utf-8, check the threads titled Crestle. One talks about encoding.

I think you need to git pull and/or check out the imdb notebook again to get it fixed.


(Satheesh) #15

@rob, I have the latest code pull. I have checked those threads, but nothing worked. I am using Amazon’s fastai AMI, which uses utf-8 by default (check: https://aws.amazon.com/amazon-linux-ami/faqs/). Not sure what’s going on… two issues: the word counts are mismatching, and this ascii error…


(Rob H) #16

Sorry I’m on my phone so can’t link the other thread. Did you find it?

I think you need to recreate the notebook, even after getting the latest pull. The Crestle author said the utf-8 fix will work for new notebooks. If you have further trouble, I suggest at-ing him directly.


(Satheesh) #17

No worries @rob. What do you mean by recreate the notebook? Copy-paste each line, or just duplicate it? I am assuming this is the thread: Crestle - Spacy installation failed, which @anurag answered, but it does not mention anything like recreating.


(building render.com) #18

@satheesh the other thread applies only to notebooks run on Crestle.

For the ascii issue with Amazon’s AMI, what is the output of the locale command?


(Satheesh) #19

@anurag, it seems to be UTF-8. Below is what I get.

locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE=UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

(building render.com) #20

Got it. The only other thing I’d try is what’s recommended in that thread, since LC_ALL seems to be unset for you:

export LC_ALL="en_US.UTF-8"

If this works you can add it to your .bashrc.
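A quick sanity check from inside Python itself: the default text-file encoding comes from these locale settings, so you can confirm what the interpreter actually picked up (standard library only; the 'ANSI_X3.4-1968' value is an assumption about how a broken locale typically presents):

```python
import locale
import sys

# What open() uses when called without an explicit encoding; a broken
# locale often shows up here as 'ANSI_X3.4-1968' (i.e. plain ASCII)
print(locale.getpreferredencoding())

# Codec for implicit str/bytes conversion; utf-8 on Python 3
print(sys.getdefaultencoding())
```

If the first line doesn't report a UTF-8 variant after exporting LC_ALL, the shell change isn't reaching the kernel the notebook runs under, so restarting the Jupyter server may be needed.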


(Satheesh) #21

@anurag, I tried it, but got the same error. Thanks for your input…


(Rob H) #22

I’d Google those errors at the top of the locale output and see if there’s a solution.

Are you using an Amazon image that was made specifically for this course? If so, surely you’re not the only one who will encounter this issue.