Help with Vietnamese NLP

Hi,

I am trying to rerun the Vietnamese notebook (https://github.com/fastai/course-nlp/blob/master/nn-vietnamese.ipynb) and am getting a file-not-found error at

get_wiki(path,lang)

This seems to be the case with any language. A manual check revealed that the text directory did not have an AA\wiki_00.

I don’t know what the problem here is.

Can you post the full stack trace from the error (surrounded by ```)? It's hard to understand the problem without it.

From looking at the get_wiki function in nlputils it looks like the below line is probably triggering the error:

shutil.move(str(path/'text/AA/wiki_00'), str(path/name))

Check where the function downloaded your file and maybe modify get_wiki to point to that place…
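One way to check where the extractor actually wrote its output is to list every file under the text directory. This is only a sketch: find_extracted_files and data_path are hypothetical names, not part of nlputils, and it assumes the output lands somewhere under path/'text' as get_wiki expects.

```python
from pathlib import Path

def find_extracted_files(root):
    """Return every regular file under root/'text', sorted; empty list if the directory is missing."""
    text_dir = Path(root) / 'text'
    if not text_dir.exists():
        return []
    return sorted(p for p in text_dir.rglob('*') if p.is_file())

# Print whatever the extractor produced, with sizes, so empty files stand out.
# data_path is a placeholder; substitute the `path` you passed to get_wiki.
data_path = Path('.')
for f in find_extracted_files(data_path):
    print(f, f.stat().st_size, 'bytes')
```

If this prints nothing, the extractor never produced output, and the shutil.move line is only the place where that first becomes visible.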

Thank you! That line throws the error: FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\\.fastai\data\viwiki\text\AA\wiki_00'

The Wikipedia extractor does not seem to work: I simply don’t see any output files. I tried setting a particular output file using -o and the file was empty.

This is the part of the get_wiki function that I think is going wrong:

with working_directory(path):
    if not (path/'wikiextractor').exists(): os.system('git clone https://github.com/attardi/wikiextractor.git')
    print("extracting...")
    os.system("python wikiextractor/WikiExtractor.py -o - --debug --processes 4 --no_templates " +
        f"--min_text_length 1800 --filter_disambig_pages --log_file log -b 100G -q {xml_fn}")
shutil.move(str(path/'text/AA/wiki_00'), str(path/name))
shutil.rmtree(path/'text')

I added an -o to control where it writes its output, and the resulting file turned out to be empty.

I pulled the os.system statement out to check whether it’s doing its job, and in my notebook it simply printed 2.

No idea why it does that.
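The 2 is most likely the value returned by os.system, which on Windows is the exit code of the command, so a non-zero value means the command failed. os.system does not return the error text, though. A minimal sketch for seeing it, using the standard subprocess module: run_and_report is a hypothetical helper, and the command below is a stand-in (the current interpreter pointed at a script that does not exist), not the real extractor call.

```python
import subprocess
import sys

def run_and_report(cmd, cwd=None):
    """Run cmd (a list of arguments) and return (exit_code, stdout, stderr) as text."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr

# Stand-in command: ask the interpreter to run a script that is not there.
code, out, err = run_and_report([sys.executable, 'no_such_script.py'])
print('exit code:', code)
print('stderr:', err.strip())
```

To apply this to the real case, pass the same arguments that get_wiki builds as a list, with cwd set to the data path. Whatever shows up in stderr should say whether the script was missing (for example because the git clone failed) or whether one of the flags was rejected.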

get_wiki seems to work for me, although working_directory was throwing an error, so I just copied what it does explicitly:

#with working_directory(path):
prev_cwd = Path.cwd()
os.chdir(path)
if not (path/'wikiextractor').exists(): os.system('git clone https://github.com/attardi/wikiextractor.git')
print("extracting...")
os.system("python wikiextractor/WikiExtractor.py --processes 4 --no_templates " +
    f"--min_text_length 1800 --filter_disambig_pages --log_file log -b 100G -q {xml_fn}")
os.chdir(prev_cwd)
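One caveat with the manual os.chdir calls above: if anything between them raises, the notebook stays in the data directory. A small context manager with try/finally avoids that. This is a sketch built on contextlib, and temporary_cwd is a hypothetical name rather than the fastai helper.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def temporary_cwd(path):
    """Change the current directory to path, restoring the previous one on exit."""
    prev_cwd = Path.cwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even when the body raises, so the caller never stays stuck in `path`.
        os.chdir(prev_cwd)

# Demonstration: the original directory is restored even after a failure inside the block.
before = Path.cwd()
try:
    with temporary_cwd(tempfile.gettempdir()):
        raise RuntimeError('simulated failure inside the block')
except RuntimeError:
    pass
print('restored:', Path.cwd() == before)
```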