Language Model Zoo 🦍

To help you get started, here is the procedure for downloading data from Wikipedia.

  1. Go to the Wikimedia dumps site: https://dumps.wikimedia.org/

  2. Click on the “Database backup dumps” (WikiDumps) link. (It took me a while to figure out that it is a link!)

  3. There will be a long list on the dumps page. In this example, I pick ‘zh_yue’ for Cantonese (a variety of Chinese) and download it. (Warning: some of the files are very big. A command-line way to fetch the file is sketched after this list.)

  4. Clone WikiExtractor from GitHub (https://github.com/attardi/wikiextractor)
    $ git clone https://github.com/attardi/wikiextractor.git

  5. Change into the wikiextractor directory, then install it by typing
    $ (sudo) python setup.py install

  6. Syntax for extracting the dump into JSON format:
    WikiExtractor.py -s --json -o {new_folder_name} {wikidumps_file_name}
    (Note: the {new_folder_name} will be created during extraction;
    more options are described in the WikiExtractor README.)
    Example: $ WikiExtractor.py -s --json -o cantonese zh_yuewiki-20180401-pages-meta-current.xml.bz2
    (A quick check of the extracted output is sketched after this list.)
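
If you are working on a remote server, you can skip the browser download in step 3: copy the link to the .bz2 file from the dumps page and fetch it with wget. For the Cantonese example the URL looks roughly like this (the dump date and filename change over time, so treat this as a sketch and use the actual link from the page):

    $ wget https://dumps.wikimedia.org/zh_yuewiki/20180401/zh_yuewiki-20180401-pages-meta-current.xml.bz2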
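
After extraction, each file in the output folder (cantonese above) holds one JSON object per line, with fields such as title and text that you can feed into your language model pipeline. Assuming WikiExtractor’s usual nested output layout (e.g. cantonese/AA/wiki_00), you can sanity-check the first article like this:

    $ head -n 1 cantonese/AA/wiki_00 | python -m json.tool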
