Inside the train folder I moved all files from "neg, pos, unsup" to a folder called "all", but that only gave me 5818021 words instead of 17486581.
I think a problem might have occurred the first time I tried to move "unsup". I repeated the same process and now have the correct number of words, ty!!
Hi, I tried doing the same thing but got a total count of 17486270 instead of 17486581. What could be the issue? I have attached a screenshot of the folder structure in my train folder. I have moved all files from pos, neg and unsup to the all folder.
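A count that comes up short usually means some files never reached the all folder. Here is a minimal sketch for checking that, assuming the usual aclImdb layout under train; the path and the helper name `count_folder` are illustrative and not part of fastai. It counts whitespace-separated words, so its totals will not equal the notebook's tokenizer-based figure. It is only useful for comparing folders against each other and for confirming the file count (the dataset ships 12,500 pos, 12,500 neg and 50,000 unsup training files, so all should hold 75,000).

```python
from pathlib import Path


def count_folder(folder):
    """Return (number of .txt files, whitespace-separated word count)."""
    n_files, n_words = 0, 0
    for path in sorted(Path(folder).glob("*.txt")):
        # Explicit encoding so the result does not depend on the system locale.
        text = path.read_text(encoding="utf-8")
        n_files += 1
        n_words += len(text.split())
    return n_files, n_words


if __name__ == "__main__":
    train = Path("data/aclImdb/train")  # assumed path, adjust to your setup
    for name in ("pos", "neg", "unsup", "all"):
        folder = train / name
        if folder.is_dir():
            print(name, count_folder(folder))
```

If pos, neg or unsup still report files after the move, those are the ones missing from all.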
The codebase was modified so LanguageModelData objects can be built from text files or dataframes. from_text_files and from_dataframes are the class methods for each, respectively.
@jeremy @wgpubs, I have done the latest git pull and am now getting a Unicode decode error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 796: ordinal not in range(128)". @rob, I have the same line of code you suggested, but I still have the issue.
Also, I keep getting "Back to the Future Imdb review" and the number of words seems to be wrong. I did exactly as suggested (moving the files, etc.). See below. It would be great if you all could have a look and help.
@rob, I have the latest code pull. I have checked those threads, but nothing worked. I am using Amazon's fastai AMI, which uses utf-8 by default (check: https://aws.amazon.com/amazon-linux-ami/faqs/). Not sure what's going on... two issues: the word counts are mismatching, and this ascii error...
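On the ascii error: byte 0xc2 is the first byte of a two-byte UTF-8 sequence (a non-breaking space is 0xc2 0xa0, for example), so the review file itself is probably fine and Python is just decoding it as ASCII. A minimal sketch that sidesteps the system default by passing the encoding explicitly. `read_review` is a hypothetical helper, not a fastai function, and whether the equivalent open() call in the library can be patched the same way is an assumption.

```python
def read_review(path, encoding="utf-8"):
    """Read one review file, decoding explicitly instead of relying on the locale."""
    with open(path, encoding=encoding) as f:
        return f.read()


# The same bytes decode under UTF-8 but fail under ASCII.
raw = b"It was great\xc2\xa0fun"
try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    print("ascii failed:", err)
print(repr(raw.decode("utf-8")))
```

If reading with an explicit encoding works while the default read fails, the problem is the process's default encoding rather than the data.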
Sorry, I'm on my phone so can't link the other thread. Did you find it?
I think you need to recreate the notebook, even after getting the latest pull. The Crestle author said the utf-8 fix will work for new notebooks. If you have further trouble, I suggest at-ing him directly.
No worries @rob. What do you mean by recreate the notebook? Copy-paste each line, or just duplicate it? I am assuming this is the thread: Crestle - Spacy installation failed, which @anurag answered, but it does not say anything about recreating.
@anurag, it seems to be UTF-8. Below is what I get.
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE=UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
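The two "Cannot set ... to default locale" lines may be the real clue. `LC_CTYPE=UTF-8` is not a locale name Linux normally recognises (it is often forwarded over SSH from a macOS terminal), and when the locale cannot be set, Python can fall back to ASCII even though LANG says UTF-8. That is a guess based on this output, not a confirmed diagnosis. A small check to run inside the notebook to see which encodings Python is actually using:

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python uses by default in this process."""
    return {
        # Typically what open() uses when no encoding argument is passed.
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```

If "preferred" comes back as something like ANSI_X3.4-1968 or US-ASCII, try exporting LC_ALL=en_US.UTF-8 (or unsetting LC_CTYPE) in the shell before starting Jupyter, then restart the kernel.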