Help with Vietnamese NLP

Hi,

I am trying to rerun the Vietnamese notebook (https://github.com/fastai/course-nlp/blob/master/nn-vietnamese.ipynb) and am getting a file-not-found error at

get_wiki(path,lang)

This seems to be the case with any language. A manual check revealed that the text directory did not contain AA\wiki_00.

I don’t know what the problem here is.

Can you post the full stack trace from the error (surrounded by ```)? It's hard to understand the problem without it.

From looking at the get_wiki function in nlputils it looks like the below line is probably triggering the error:

shutil.move(str(path/'text/AA/wiki_00'), str(path/name))

Check where the function downloaded your file and maybe modify get_wiki to point to that place…
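One way to check where the extractor actually wrote its output is to walk the data directory and list anything that looks like an extracted shard. This is only a minimal sketch: `find_extracted` is a hypothetical helper (not part of nlputils), and the `wiki_*` name pattern is assumed from the `text/AA/wiki_00` path used in get_wiki.

```python
from pathlib import Path

def find_extracted(root):
    """Return every file under `root` whose name starts with 'wiki_', sorted by path."""
    root = Path(root)
    if not root.exists():
        # Nothing was downloaded or extracted here at all.
        return []
    # rglob searches all subdirectories; is_file() skips directories.
    return sorted(p for p in root.rglob("wiki_*") if p.is_file())
```

Calling it on the data directory (for example `find_extracted(path)` with the same `path` passed to get_wiki) returns an empty list when the extractor produced no output at all, which would point at the extraction step rather than at the move.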

Thank you! That line throws the error: FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\\.fastai\data\viwiki\text\AA\wiki_00'

The Wikipedia extractor does not work: I simply don’t see any output files. I tried setting a particular output file using -o, and the file was empty.

This is the part within the get_wiki function that I think is going wrong:
with working_directory(path):
    if not (path/'wikiextractor').exists(): os.system('git clone https://github.com/attardi/wikiextractor.git')
    print("extracting...")
    os.system("python wikiextractor/WikiExtractor.py -o - --debug --processes 4 --no_templates " +
        f"--min_text_length 1800 --filter_disambig_pages --log_file log -b 100G -q {xml_fn}")
shutil.move(str(path/'text/AA/wiki_00'), str(path/name))
shutil.rmtree(path/'text')
And I added an -o to control where it outputs and the file turned out to be empty.

I pulled the os.system statement out to check whether it’s doing its job, and in my notebook it simply printed 2.

No idea why it does that.
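For what it's worth, os.system returns the command's exit status rather than its output, so the 2 is most likely a failure code from the python command (one possibility is that Python could not open the script file). os.system also does not capture the error message, so in a notebook it usually never shows up in the cell. A rough sketch of how to see it with subprocess.run instead; `run_and_report` is a made-up helper name, and the command below is only a stand-in that fails on purpose:

```python
import subprocess
import sys

def run_and_report(cmd):
    """Run `cmd` (a list of arguments) and return (exit_code, stderr_text)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stderr

# Stand-in command: writes a message to stderr and exits with status 2,
# imitating a failing extractor call.
code, err = run_and_report(
    [sys.executable, "-c", "import sys; sys.stderr.write('no such file'); sys.exit(2)"]
)
print("exit code:", code)
print("stderr:", err)
```

Swapping the stand-in for the real extractor command should show the actual error text next to the exit code.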

get_wiki seems to work for me, although working_directory was throwing an error, so I just copied what it does explicitly:

#with working_directory(path):
prev_cwd = Path.cwd()
os.chdir(path)
if not (path/'wikiextractor').exists(): os.system('git clone https://github.com/attardi/wikiextractor.git')
print("extracting...")
os.system("python wikiextractor/WikiExtractor.py --processes 4 --no_templates " +
    f"--min_text_length 1800 --filter_disambig_pages --log_file log -b 100G -q {xml_fn}")
os.chdir(prev_cwd)
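One caveat with the manual version: if anything between the two os.chdir calls raises, the notebook stays in the data directory. A small context manager with try/finally avoids that. This is a sketch of what such a helper might look like, not necessarily the version that fastai ships:

```python
import os
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def working_directory(path):
    """Temporarily chdir into `path`, restoring the previous directory on exit."""
    prev_cwd = Path.cwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body of the `with` block raises.
        os.chdir(prev_cwd)
```

It is used the same way as in the original function, i.e. `with working_directory(path): ...`.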

Same issue. working_directory was not throwing an error for me, but I still tried @morgan's suggestion, and it still throws the same error. Below is the full stack trace. I’m using Colab, by the way.

extracting...
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/usr/lib/python3.6/shutil.py in move(src, dst, copy_function)
    549     try:
--> 550         os.rename(src, real_dst)
    551     except OSError:

FileNotFoundError: [Errno 2] No such file or directory: '/root/.fastai/data/trwiki/text/AA/wiki_00' -> '/root/.fastai/data/trwiki/trwiki'

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
4 frames
<ipython-input-37-9f556b70671b> in <module>()
      1 # from nlputils import split_wiki,get_wiki
      2 
----> 3 get_wiki(path,lang)
      4 get_ipython().system('head -n4 {path}/{name}')

<ipython-input-34-2d45c67316b1> in get_wiki(path, lang)
     28     os.chdir(prev_cwd)
     29 
---> 30     shutil.move(str(path/'text/AA/wiki_00'), str(path/name))
     31     shutil.rmtree(path/'text')
     32 

/usr/lib/python3.6/shutil.py in move(src, dst, copy_function)
    562             rmtree(src)
    563         else:
--> 564             copy_function(src, real_dst)
    565             os.unlink(src)
    566     return real_dst

/usr/lib/python3.6/shutil.py in copy2(src, dst, follow_symlinks)
    261     if os.path.isdir(dst):
    262         dst = os.path.join(dst, os.path.basename(src))
--> 263     copyfile(src, dst, follow_symlinks=follow_symlinks)
    264     copystat(src, dst, follow_symlinks=follow_symlinks)
    265     return dst

/usr/lib/python3.6/shutil.py in copyfile(src, dst, follow_symlinks)
    118         os.symlink(os.readlink(src), dst)
    119     else:
--> 120         with open(src, 'rb') as fsrc:
    121             with open(dst, 'wb') as fdst:
    122                 copyfileobj(fsrc, fdst)

FileNotFoundError: [Errno 2] No such file or directory: '/root/.fastai/data/trwiki/text/AA/wiki_00'
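The trace only says that text/AA/wiki_00 is missing at the point of the move, which suggests the extractor step failed silently before it. A guard before the move makes that easier to spot. A minimal sketch, assuming the same directory layout as get_wiki; `move_extracted` is a hypothetical name, not something in nlputils:

```python
import shutil
from pathlib import Path

def move_extracted(path, name):
    """Move path/text/AA/wiki_00 to path/name, failing early with a clear message."""
    path = Path(path)
    src = path / "text" / "AA" / "wiki_00"
    if not src.is_file():
        raise FileNotFoundError(
            f"{src} is missing: the extractor step probably failed before writing any output"
        )
    shutil.move(str(src), str(path / name))
    # Clean up the now-empty extractor output tree.
    shutil.rmtree(path / "text")
    return path / name
```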

The easiest fix that I could find was to first install wikiextractor using pip:

pip install wikiextractor

And then modify the following line in get_wiki:

os.system("python wikiextractor/WikiExtractor.py ...

to this:

os.system("python -m wikiextractor.WikiExtractor ...

as is also mentioned in the wikiextractor README.
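Put together, the modified call could look roughly like the sketch below. `build_extract_cmd` is a made-up helper, and the flags are copied as-is from the get_wiki code earlier in this thread; whether the pip-installed version accepts every one of them is an assumption, so it is worth checking them against `python -m wikiextractor.WikiExtractor --help`.

```python
import sys

def build_extract_cmd(xml_fn):
    """Build the argument list for running WikiExtractor as an installed module."""
    return [
        # sys.executable ensures the same interpreter that pip installed into.
        sys.executable, "-m", "wikiextractor.WikiExtractor",
        "--processes", "4",
        "--no_templates",
        "--min_text_length", "1800",
        "--filter_disambig_pages",
        "--log_file", "log",
        "-b", "100G",
        "-q", str(xml_fn),
    ]
```

The resulting list can be passed to subprocess.run (with the working directory set to the data path) instead of building one long string for os.system.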
