Help with Vietnamese NLP

Hi,

I am trying to rerun the Vietnamese notebook (https://github.com/fastai/course-nlp/blob/master/nn-vietnamese.ipynb) and I am getting a FileNotFoundError at

get_wiki(path,lang)

This seems to be the case with any language. A manual check revealed that the text directory did not have an AA\wiki_00.

I don’t know what the problem here is.

Can you post the full stack trace from the error (surrounded by ```)? It's hard to understand what is going wrong without it.

From looking at the get_wiki function in nlputils it looks like the below line is probably triggering the error:

shutil.move(str(path/'text/AA/wiki_00'), str(path/name))

Check where the function downloaded your file and maybe modify get_wiki to point to that place…
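
For example, something along these lines (just a sketch; path here is whatever you passed to get_wiki, e.g. the notebook's default ~/.fastai/data/viwiki) will show what the extractor actually produced, if anything:

from pathlib import Path

# path is the same Path you passed to get_wiki
text_dir = path/'text'
if text_dir.exists():
    for p in text_dir.rglob('*'):
        print(p)    # lists every file/folder the extractor wrote
else:
    print(f'{text_dir} does not exist - the extraction step probably failed')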

Thank you! That line throws the error: FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\\.fastai\data\viwiki\text\AA\wiki_00'.

Wikipedia extractor does not work: I simply don't see any output files. I tried setting a particular output file using -o, and that file was empty.

This is the part within the get_wiki function that I think is going wrong:
with working_directory(path):
    if not (path/'wikiextractor').exists(): os.system('git clone https://github.com/attardi/wikiextractor.git')
    print("extracting...")
    os.system("python wikiextractor/WikiExtractor.py -o - --debug --processes 4 --no_templates " +
        f"--min_text_length 1800 --filter_disambig_pages --log_file log -b 100G -q {xml_fn}")
shutil.move(str(path/'text/AA/wiki_00'), str(path/name))
shutil.rmtree(path/'text')
I added an -o flag to control where it writes the output, and the resulting file turned out to be empty.

I pulled the os.system statement out of the function to check whether it's doing its job, and in my notebook it simply printed 2.

No idea why it does that.

get_wiki seems to work for me, although working_directory was throwing an error, so I just copied what it does explicitly:

#with working_directory(path):   # replaced the context manager with explicit chdir calls
prev_cwd = Path.cwd()
os.chdir(path)
if not (path/'wikiextractor').exists(): os.system('git clone https://github.com/attardi/wikiextractor.git')
print("extracting...")
os.system("python wikiextractor/WikiExtractor.py --processes 4 --no_templates " +
    f"--min_text_length 1800 --filter_disambig_pages --log_file log -b 100G -q {xml_fn}")
os.chdir(prev_cwd)   # change back to the original working directory
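
Side note on the 2 you saw: os.system returns the shell's exit status, not the command's output, so a 2 means WikiExtractor exited with an error. A rough sketch (same flags and xml_fn as above, assumed unchanged) of running it through subprocess so the actual error message becomes visible:

import subprocess

# run the extractor and capture stdout/stderr so the real error shows up
cmd = ("python wikiextractor/WikiExtractor.py --processes 4 --no_templates "
       f"--min_text_length 1800 --filter_disambig_pages --log_file log -b 100G -q {xml_fn}")
result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE, universal_newlines=True)
print('return code:', result.returncode)
print(result.stderr)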

Same issue. Although working_directory was not throwing an error for me, I still tried @morgan's suggestion, but it still throws the same error. Below is the full stack trace. I'm using Colab, by the way.

extracting...
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/usr/lib/python3.6/shutil.py in move(src, dst, copy_function)
    549     try:
--> 550         os.rename(src, real_dst)
    551     except OSError:

FileNotFoundError: [Errno 2] No such file or directory: '/root/.fastai/data/trwiki/text/AA/wiki_00' -> '/root/.fastai/data/trwiki/trwiki'

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-37-9f556b70671b> in <module>()
      1 # from nlputils import split_wiki,get_wiki
      2 
----> 3 get_wiki(path,lang)
      4 get_ipython().system('head -n4 {path}/{name}')

<ipython-input-34-2d45c67316b1> in get_wiki(path, lang)
     28     os.chdir(prev_cwd)
     29 
---> 30     shutil.move(str(path/'text/AA/wiki_00'), str(path/name))
     31     shutil.rmtree(path/'text')
     32 

/usr/lib/python3.6/shutil.py in move(src, dst, copy_function)
    562             rmtree(src)
    563         else:
--> 564             copy_function(src, real_dst)
    565             os.unlink(src)
    566     return real_dst

/usr/lib/python3.6/shutil.py in copy2(src, dst, follow_symlinks)
    261     if os.path.isdir(dst):
    262         dst = os.path.join(dst, os.path.basename(src))
--> 263     copyfile(src, dst, follow_symlinks=follow_symlinks)
    264     copystat(src, dst, follow_symlinks=follow_symlinks)
    265     return dst

/usr/lib/python3.6/shutil.py in copyfile(src, dst, follow_symlinks)
    118         os.symlink(os.readlink(src), dst)
    119     else:
--> 120         with open(src, 'rb') as fsrc:
    121             with open(dst, 'wb') as fdst:
    122                 copyfileobj(fsrc, fdst)

FileNotFoundError: [Errno 2] No such file or directory: '/root/.fastai/data/trwiki/text/AA/wiki_00'

The easiest fix that I could find was to first install wikiextractor using pip

pip install wikiextractor

And then modify the following line in get_wiki:

os.system("python wikiextractor/WikiExtractor.py ...

to this:

os.system("python -m wikiextractor.WikiExtractor ...

as also mentioned in the wikiextractor README.

os.system("python -m wikiextractor.WikiExtractor --processes 4 --no_templates " +
        f"--min_text_length 1800 --filter_disambig_pages --log_file log -b 100G -q {xml_fn}")
shutil.move(str(path/'text/AA/wiki_00'), str(path/name))
shutil.rmtree(path/'text')

This code does not work for me; it produces no output.

Hi Qazqa,

wikiextractor changed in the meantime and I also struggled with it. That's why I created a Docker image to preprocess wiki dumps (and also scripts for training language models from scratch).

If you have any questions let me know.

Florian

How can I use Docker with Python on Google Colab?

I guess you cannot use Docker on Colab. I think you should do the preprocessing on your local machine and then upload the extracted data to Colab.
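
If it helps, one way to get the locally extracted file into Colab is via Google Drive. A sketch (the Drive folder and the target path are just assumptions, adjust to your setup):

import shutil
from google.colab import drive

# make Google Drive visible inside the Colab VM
drive.mount('/content/drive')

# copy the extracted wiki file from Drive to where the notebook expects it
shutil.copy('/content/drive/MyDrive/data/viwiki',    # assumed location on your Drive
            '/root/.fastai/data/viwiki/viwiki')      # default fastai data path for lang='vi'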

I would prefer running Docker with Python, as the process would be automatic.

That's not going to work; you'll have to have a host with Docker installed, and the Python script will run inside the Docker container. You could try to install the requirements.txt manually and run the preprocessing script.
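
Roughly, the manual route would look something like this (only a sketch: the exact locations of requirements.txt and preprocess.py inside the repo are assumptions, so check the repository first):

import os

# clone the repo and install its Python dependencies directly, instead of building the Docker image
os.system('git clone https://github.com/floleuerer/fastai_ulmfit')
os.system('pip install -r fastai_ulmfit/we/app/requirements.txt')    # path assumed
# preprocess.py takes the same -l <language-code> flag as the container
os.system('python fastai_ulmfit/we/app/preprocess.py -l de')         # path assumed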

# build the wikiextractor docker file
docker build -t wikiextractor ./we

# run the docker container for a specific language
# docker run -v $(pwd)/data:/data -it wikiextractor -l <language-code> 
# for German language-code de run:
docker run -v $(pwd)/data:/data -it wikiextractor -l de
...
sucessfully prepared dewiki - /data/dewiki/docs/sampled, number of docs 160000/160000 with 110699119 words / tokens!

# To change the number of sampled documents or the minimum length see
usage: preprocess.py [-h] -l LANG [-n NUMBER_DOCS] [-m MIN_DOC_LENGTH] [--mirror MIRROR] [--cleanup]

# To clean up intermediate files (wikiextractor and all split documents), run the following command.
# The Wikipedia XML dump and the sampled docs will not be deleted!
docker run -v $(pwd)/data:/data -it wikiextractor -l <language-code> --cleanup

This is the code, but after installing Docker it does not run in Command Prompt.

Is this what the Docker container runs?
fastai_ulmfit/preprocess.py at main · floleuerer/fastai_ulmfit (github.com)

I have a Python program that needs to extract and process Wikipedia files, as done in nlputils (which uses wikiextractor).

fastai_ulmfit/we/app at main · floleuerer/fastai_ulmfit (github.com)

There is code for everything except getdata(). Could getdata() be added instead of using Docker, as Docker is very complicated compared to Python?

Hello, it's a little bit late, but I have a temporary solution for this issue (especially when you use Google Colab).
Please append this line of code
os.system('cd wikiextractor && git checkout e4abb4cbd && cd ..')
into this block:

if not (path/'wikiextractor').exists():
    os.system('git clone https://github.com/attardi/wikiextractor')

That should make it work.
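
Putting it together, the patched block would look like this (the commit hash is the one above; placing the checkout inside the if, right after the clone, is how I read the suggestion):

if not (path/'wikiextractor').exists():
    os.system('git clone https://github.com/attardi/wikiextractor')
    # pin wikiextractor to an older commit that still has WikiExtractor.py at the repo root,
    # which is what nlputils' get_wiki expects
    os.system('cd wikiextractor && git checkout e4abb4cbd && cd ..')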