Thank you! That line throws this error: FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\\.fastai\data\viwiki\text\AA\wiki_00.
WikiExtractor does not work, in the sense that I simply don't see any output files. I tried setting a particular output file using -o, and the file was empty.
This is the part of the get_wiki function that I think is going wrong:
with working_directory(path):
    if not (path/'wikiextractor').exists(): os.system('git clone https://github.com/attardi/wikiextractor.git')
    print("extracting...")
    os.system("python wikiextractor/WikiExtractor.py -o - --debug --processes 4 --no_templates " +
        f"--min_text_length 1800 --filter_disambig_pages --log_file log -b 100G -q {xml_fn}")
shutil.move(str(path/'text/AA/wiki_00'), str(path/name))
shutil.rmtree(path/'text')
I added -o to control where it writes its output, and the resulting file turned out to be empty.
I also pulled the os.system statement out of the function to check whether it is doing its job; in my notebook, it simply printed 2.
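A nonzero value from os.system generally signals that the command failed, but it does not say why. One way to see the actual error message is to run the command through subprocess and capture its stderr. The snippet below is only a sketch: run_and_report is a hypothetical helper name, and the command shown is a stand-in that fails on purpose, not the real WikiExtractor call.

```python
import subprocess
import sys

def run_and_report(cmd):
    """Run cmd (a list of arguments) and return (exit_code, stdout, stderr)."""
    # Explicit pipes plus universal_newlines also work on Python 3.6.
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    if proc.returncode != 0:
        print(f"command failed with exit code {proc.returncode}")
        print(proc.stderr)
    return proc.returncode, proc.stdout, proc.stderr

# Stand-in command that exits with code 2 and writes to stderr; swap in the
# WikiExtractor command line from get_wiki to see its real error output.
code, out, err = run_and_report(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(2)"])
```

Passing the command as a list also avoids shell quoting problems, which can differ between Windows and Colab.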
Same issue. Although working_directory was not throwing an error, I still tried @morgan's suggestion, but it throws the same error. Below is the full stack trace. I'm using Colab, by the way.
extracting...
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
/usr/lib/python3.6/shutil.py in move(src, dst, copy_function)
549 try:
--> 550 os.rename(src, real_dst)
551 except OSError:
FileNotFoundError: [Errno 2] No such file or directory: '/root/.fastai/data/trwiki/text/AA/wiki_00' -> '/root/.fastai/data/trwiki/trwiki'
During handling of the above exception, another exception occurred:
FileNotFoundError Traceback (most recent call last)
4 frames
<ipython-input-37-9f556b70671b> in <module>()
1 # from nlputils import split_wiki,get_wiki
2
----> 3 get_wiki(path,lang)
4 get_ipython().system('head -n4 {path}/{name}')
<ipython-input-34-2d45c67316b1> in get_wiki(path, lang)
28 os.chdir(prev_cwd)
29
---> 30 shutil.move(str(path/'text/AA/wiki_00'), str(path/name))
31 shutil.rmtree(path/'text')
32
/usr/lib/python3.6/shutil.py in move(src, dst, copy_function)
562 rmtree(src)
563 else:
--> 564 copy_function(src, real_dst)
565 os.unlink(src)
566 return real_dst
/usr/lib/python3.6/shutil.py in copy2(src, dst, follow_symlinks)
261 if os.path.isdir(dst):
262 dst = os.path.join(dst, os.path.basename(src))
--> 263 copyfile(src, dst, follow_symlinks=follow_symlinks)
264 copystat(src, dst, follow_symlinks=follow_symlinks)
265 return dst
/usr/lib/python3.6/shutil.py in copyfile(src, dst, follow_symlinks)
118 os.symlink(os.readlink(src), dst)
119 else:
--> 120 with open(src, 'rb') as fsrc:
121 with open(dst, 'wb') as fdst:
122 copyfileobj(fsrc, fdst)
FileNotFoundError: [Errno 2] No such file or directory: '/root/.fastai/data/trwiki/text/AA/wiki_00'
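The traceback only shows that text/AA/wiki_00 was never created, which suggests the extraction step failed before shutil.move ran. A guard like the following could make that explicit. This is a minimal sketch, assuming the same text/AA/wiki_00 layout as in get_wiki; move_extracted is a hypothetical helper, not part of nlputils.

```python
from pathlib import Path
import shutil

def move_extracted(path, name):
    """Move text/AA/wiki_00 to path/name, failing early with a readable message."""
    base = Path(path)
    text_dir = base / 'text'
    src = text_dir / 'AA' / 'wiki_00'
    if not src.exists():
        # List whatever the extractor did write, to help with debugging.
        found = []
        if text_dir.exists():
            found = sorted(str(p.relative_to(base)) for p in text_dir.rglob('*'))
        raise FileNotFoundError(
            f"{src} is missing, so the extractor probably produced no output; "
            f"files under text/: {found or 'none'}")
    shutil.move(str(src), str(base / name))
    shutil.rmtree(str(text_dir))
    return base / name
```

It does not fix the extraction itself; it only reports the failure at the point where the file should exist.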
wikiextractor has changed in the meantime, and I also struggled with it. That's why I created a Docker image to preprocess wiki dumps (and also scripts for training language models from scratch).
That's not going to work; you'll have to have a host with Docker installed, and the Python script will run inside the Docker container. You could try to install the packages from requirements.txt manually and run the preprocessing script.
# build the wikiextractor docker file
docker build -t wikiextractor ./we
# run the docker container for a specific language
# docker run -v $(pwd)/data:/data -it wikiextractor -l <language-code>
# for German language-code de run:
docker run -v $(pwd)/data:/data -it wikiextractor -l de
...
sucessfully prepared dewiki - /data/dewiki/docs/sampled, number of docs 160000/160000 with 110699119 words / tokens!
# To change the number of sampled documents or the minimum length see
usage: preprocess.py [-h] -l LANG [-n NUMBER_DOCS] [-m MIN_DOC_LENGTH] [--mirror MIRROR] [--cleanup]
# To clean up intermediate files (wikiextractor and all split documents) run the following command.
# The Wikipedia-XML-Dump and the sampled docs will not be deleted!
docker run -v $(pwd)/data:/data -it wikiextractor -l <language-code> --cleanup
This is the code, but after installing Docker, it does not run from the Command Prompt.
Hello, it's a little late, but I have a temporary solution for this issue (especially if you use Google Colab).
Please append this line of code: os.system('cd wikiextractor && git checkout e4abb4cbd && cd ..')
to this block:
if not (path/'wikiextractor').exists():
    os.system('git clone https://github.com/attardi/wikiextractor')
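For reference, the clone-then-pin step could also be written with subprocess instead of chained cd commands. This is only a sketch: clone_pinned and its run parameter are hypothetical names introduced here, while the repository URL and the commit e4abb4cbd are taken from the lines above.

```python
import subprocess
from pathlib import Path

def clone_pinned(path, commit='e4abb4cbd', run=subprocess.run):
    """Clone wikiextractor into path and check out a fixed commit."""
    repo = Path(path) / 'wikiextractor'
    if not repo.exists():
        # check=True raises CalledProcessError if git fails, so errors are not silent.
        run(['git', 'clone', 'https://github.com/attardi/wikiextractor', str(repo)],
            check=True)
        # cwd replaces the 'cd wikiextractor && ... && cd ..' dance.
        run(['git', 'checkout', commit], cwd=str(repo), check=True)
    return repo
```

Note that, as with the original block, the checkout only happens on a fresh clone; if an unpinned wikiextractor folder already exists, it would need to be deleted or checked out by hand first.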