Where to download data for these lessons 4 & 5 (Arxiv, Wikipedia, etc)?

Is there any place we can download the data with directory structure that will work with these notebooks?

Language Modeling Arxiv (Lesson 4)
This language model notebook, which I think uses Wikipedia (Lesson 4)

I apologize if this data is already somewhere but I missed it. Thanks

You can download an arxiv dataset using this project: https://hackernoon.com/building-brundage-bot-10252facf3d1

The language model dataset is WikiText-2: https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset

I believe NIPS data wouldn’t be bad.

@jeremy I could not find all_arxiv.pickle. How is this generated and what are the contents?

I was looking for the dataset too, did you manage to get it?

How long should it take to run LanguageModelData.from_text_files on an average desktop PC? And is there some method to speed the process up? Transferring each text string to a file then loading takes a long time, and probably doesn’t scale well. Maybe I am doing something wrong.
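One way to cut the per-file overhead, sketched under assumptions: write all the strings for a split into a single file instead of one file per string. This assumes the loader accepts a single text file (or a directory holding one file) per split, which is worth checking against your fastai version. `write_split`, the sample strings and the directory layout are hypothetical, for illustration only.

```python
import os
import tempfile


def write_split(texts, path):
    """Write every string in `texts` to one file, one document per line."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # Collapse internal whitespace so each document stays on one line.
            f.write(" ".join(text.split()) + "\n")
    return path


# Hypothetical data and layout: <root>/train/train.txt and <root>/valid/valid.txt
abstracts = ["First abstract\nwith a line break.", "Second abstract.", "Third abstract."]
n_valid = max(1, len(abstracts) // 10)  # hold out roughly 10% for validation
root = tempfile.mkdtemp()
train_file = write_split(abstracts[:-n_valid], os.path.join(root, "train", "train.txt"))
valid_file = write_split(abstracts[-n_valid:], os.path.join(root, "valid", "valid.txt"))
```

With tens of thousands of abstracts this replaces tens of thousands of small writes and reads with two sequential ones; tokenization will still dominate the run time.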

I managed to get df_mb right (it’s just the summary data from the csv file Jeremy linked above), but I can’t even figure out what df_all is (it’s somehow related to the .pickle file) or what it’s intended to do. Can anyone explain?

Didn’t manage to find the .pickle file either.

Hello everyone,

I managed to get the data from the “get_arxiv.py” script linked by @jeremy working.

I took the file and changed a few lines in order to make the script create a pickle file if it does not exist.
Here is a link to the updated file : https://pastebin.com/t3SuugS4

I’m not sure whether the script ends properly, which is why I dump the pickle file after every API request/parse. If you think there is a cleaner or better way of doing it, please tell me!

And I then use it like this in the lang_model-arxiv.ipynb :

if check_for_update():
    update_arxiv()

I had to change a couple of things after that, including the call to LanguageModelData:

md = LanguageModelData.from_text_files(f'{PATH}all/', TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

As I’ve only focused on the language model and not the classification, I deleted the code using df_mb

I hope this will help

Can’t access your pastebin somehow.

For anyone interested, I edited get_arxiv.py from the Brundage Bot project to get the arXiv language model data for lessons 4 & 5. I converted the original code from Python 2 to Python 3, wrapped everything in a class, and fixed it to work when you do not have an existing arXiv pickle file. The only required package outside of the fastai environment is feedparser; if you don’t have it installed already, run pip install feedparser. Then call GetArXiv.update('data/all_arxiv.pkl') to run it. You can also call GetArXiv.load('data/all_arxiv.pkl') to load the data.

import os, requests, time
import feedparser
import pandas as pd


class GetArXiv(object):
	def __init__(self, pickle_path, categories=list()):
		"""
		:param pickle_path (str): path to the pickle file to save/load, or a directory (all_arxiv.pkl is appended)
		:param categories (list): arXiv categories to query
		"""
		if os.path.isdir(pickle_path):
			pickle_path = f"{pickle_path}{'' if pickle_path[-1] == '/' else '/'}all_arxiv.pkl"
		if len(categories) < 1:
			categories = ['cs*', 'cond-mat.dis-nn', 'q-bio.NC', 'stat.CO', 'stat.ML']
		# categories += ['cs.CV', 'cs.AI', 'cs.LG', 'cs.CL']

		self.categories = categories
		self.pickle_path = pickle_path
		self.base_url = 'http://export.arxiv.org/api/query'

	@staticmethod
	def build_qs(categories):
		"""Build query string from categories"""
		return '+OR+'.join(['cat:'+c for c in categories])

	@staticmethod
	def get_entry_dict(entry):
		"""Return a dictionary with the items we want from a feedparser entry"""
		try:
			return dict(title=entry['title'], authors=[a['name'] for a in entry['authors']],
			            published=pd.Timestamp(entry['published']), summary=entry['summary'],
			            link=entry['link'], category=entry['category'])
		except KeyError:
			print('Missing keys in row: {}'.format(entry))
			return None

	@staticmethod
	def strip_version(link):
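		"""Strip the trailing version suffix (e.g. 'v1', 'v12') from an arXiv paper link"""
		# link[:-2] only handled single-digit versions; split on the last 'v' instead
		base, sep, version = link.rpartition('v')
		return base if sep and version.isdigit() else link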

	def fetch_updated_data(self, max_retry=5, pg_offset=0, pg_size=1000, wait_time=15):
		"""
		Get new papers from arXiv server
		:param max_retry: max number of time to retry request
		:param pg_offset: number of pages to offset
		:param pg_size: num abstracts to fetch per request
		:param wait_time: num seconds to wait between requests
		"""
		i, retry = pg_offset, 0
		df = pd.DataFrame()
		past_links = []
		if os.path.isfile(self.pickle_path):
			df = pd.read_pickle(self.pickle_path)
		if len(df) > 0: past_links = df.link.apply(self.strip_version)

		while True:
			params = dict(search_query=self.build_qs(self.categories),
			              sortBy='submittedDate', start=pg_size*i, max_results=pg_size)
			response = requests.get(self.base_url, params='&'.join([f'{k}={v}' for k, v in params.items()]))
			entries = feedparser.parse(response.text).entries
			if len(entries) < 1:
				if retry < max_retry:
					retry += 1
					time.sleep(wait_time)
					continue
				break

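			# Skip entries that get_entry_dict could not parse (it returns None for them)
			rows = [row for row in map(self.get_entry_dict, entries) if row is not None]
			results_df = pd.DataFrame(rows)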
			max_date = results_df.published.max().date()
			new_links = ~results_df.link.apply(self.strip_version).isin(past_links)
			print(f'{i}. Fetched {len(results_df)} abstracts published {max_date} and earlier')
			if not new_links.any():
				break

			df = pd.concat((df, results_df.loc[new_links]), ignore_index=True)
			i += 1
			retry = 0
			time.sleep(wait_time)

		print(f'Downloaded {len(df)-len(past_links)} new abstracts')
		# Keep only the most recent row per link; the result must be assigned to take effect
		df = df.sort_values('published', ascending=False).groupby('link').first().reset_index()
		df.to_pickle(self.pickle_path)
		return df

	@classmethod
	def load(cls, pickle_path):
		"""Load data from the pickle file (pickle_path may be a file or a directory)"""
		return pd.read_pickle(cls(pickle_path).pickle_path)

	@classmethod
	def update(cls, pickle_path, categories=list(), **kwargs):
		"""
		Update arXiv data pickle with the latest abstracts
		"""
		cls(pickle_path, categories).fetch_updated_data(**kwargs)
		return True
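
To see what `fetch_updated_data` actually requests, the query string can be rebuilt with the standard library alone. This is only a sketch that mirrors `build_qs` and the `params` join from the class above; `build_url` is a hypothetical helper, and the string is joined by hand (as in the class) presumably so the `+OR+` separators are not percent-encoded.

```python
BASE_URL = 'http://export.arxiv.org/api/query'


def build_qs(categories):
    """Join categories into an arXiv search_query value."""
    return '+OR+'.join('cat:' + c for c in categories)


def build_url(categories, page, pg_size=1000):
    """Hypothetical helper: the URL for one page of results."""
    params = dict(search_query=build_qs(categories), sortBy='submittedDate',
                  start=pg_size * page, max_results=pg_size)
    # Join by hand so the '+' characters reach the server unchanged.
    return BASE_URL + '?' + '&'.join(f'{k}={v}' for k, v in params.items())


url = build_url(['stat.ML', 'cs.CL'], page=2)
print(url)
```

Opening the printed URL in a browser shows the raw Atom feed for that page, which is a quick way to check whether the server is returning any entries.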

@pzyxian1 you can use this instead of the pastebin

Can someone post the package versions, environment setup notes, and a link to a working notebook (and the version of the fast.ai git repo) that work for them?

I’m having the devil’s own fun getting the lesson 4 NLP notebooks to work. Not fun and unusually frustrating.

AssertionError in lesson4-imdb at:

learner.fit(3e-3, 1, wds=1e-6, cycle_len=20, cycle_save_name='adam3_20')

I tried the above line after finding it in a thread elsewhere.

#original code below. (also yields assertion error)
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)

It seems others are experiencing this error.


CUDA and cuDNN versions are below.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

#define CUDNN_MAJOR 7
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 4

#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"

[nb: running redhat OS due to selecting the AWS AMI during setup]


AssertionError Traceback (most recent call last)
in ()

~/fastaiCurrent/fastai/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
285 self.sched = None
286 layer_opt = self.get_layer_opt(lrs, wds)
--> 287 return self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
288
289 def warm_up(self, lr, wds=None):

~/fastaiCurrent/fastai/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, use_clr_beta, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, use_swa, swa_start, swa_eval_freq, **kwargs)
232 metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, fp16=self.fp16,
233 swa_model=self.swa_model if use_swa else None, swa_start=swa_start,
--> 234 swa_eval_freq=swa_eval_freq, **kwargs)
235
236 def get_layer_groups(self): return self.models.get_layer_groups()

~/fastaiCurrent/fastai/fastai/model.py in fit(model, data, n_epochs, opt, crit, metrics, callbacks, stepper, swa_model, swa_start, swa_eval_freq, **kwargs)
130 batch_num += 1
131 for cb in callbacks: cb.on_batch_begin()
--> 132 loss = model_stepper.step(V(x),V(y), epoch)
133 avg_loss = avg_loss * avg_mom + loss * (1-avg_mom)
134 debias_loss = avg_loss / (1 - avg_mom**batch_num)

~/fastaiCurrent/fastai/fastai/model.py in step(self, xs, y, epoch)
48 def step(self, xs, y, epoch):
49 xtra = []
---> 50 output = self.m(*xs)
51 if isinstance(output,tuple): output,*xtra = output
52 if self.fp16: self.m.zero_grad()

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
355 result = self._slow_forward(*input, **kwargs)
356 else:
--> 357 result = self.forward(*input, **kwargs)
358 for hook in self._forward_hooks.values():
359 hook_result = hook(self, input, result)

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/container.py in forward(self, input)
65 def forward(self, input):
66 for module in self._modules.values():
---> 67 input = module(input)
68 return input
69

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
355 result = self._slow_forward(*input, **kwargs)
356 else:
--> 357 result = self.forward(*input, **kwargs)
358 for hook in self._forward_hooks.values():
359 hook_result = hook(self, input, result)

~/fastaiCurrent/fastai/fastai/lm_rnn.py in forward(self, input)
95 with warnings.catch_warnings():
96 warnings.simplefilter("ignore")
---> 97 raw_output, new_h = rnn(raw_output, self.hidden[l])
98 new_hidden.append(new_h)
99 raw_outputs.append(raw_output)

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
355 result = self._slow_forward(*input, **kwargs)
356 else:
--> 357 result = self.forward(*input, **kwargs)
358 for hook in self._forward_hooks.values():
359 hook_result = hook(self, input, result)

~/fastaiCurrent/fastai/fastai/rnn_reg.py in forward(self, *args)
120 """
121 self._setweights()
--> 122 return self.module.forward(*args)
123
124 class EmbeddingDropout(nn.Module):

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/rnn.py in forward(self, input, hx)
202 flat_weight=flat_weight
203 )
--> 204 output, hidden = func(input, self.all_weights, hx)
205 if is_packed:
206 output = PackedSequence(output, batch_sizes)

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py in forward(input, *fargs, **fkwargs)
383 return hack_onnx_rnn((input,) + fargs, output, args, kwargs)
384 else:
--> 385 return func(input, *fargs, **fkwargs)
386
387 return forward

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/autograd/function.py in _do_forward(self, *input)
326 self._nested_input = input
327 flat_input = tuple(_iter_variables(input))
--> 328 flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
329 nested_output = self._nested_output
330 nested_variables = _unflatten(flat_output, self._nested_output)

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/autograd/function.py in forward(self, *args)
348 def forward(self, *args):
349 nested_tensors = _map_variable_tensor(self._nested_input)
--> 350 result = self.forward_extended(*nested_tensors)
351 del self._nested_input
352 self._nested_output = result

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py in forward_extended(self, input, weight, hx)
292 hy = tuple(h.new() for h in hx)
293
--> 294 cudnn.rnn.forward(self, input, hx, weight, output, hy)
295
296 self.save_for_backward(input, hx, weight, output)

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py in forward(fn, input, hx, weight, output, hy)
249 # Alternatively, copyParams could be written more carefully.
250 w.zero_()
--> 251 params = get_parameters(fn, handle, w)
252 _copyParams(weight, params)
253 else:

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py in get_parameters(fn, handle, weight_buf)
163 # might as well merge the CUDNN ones into a single tensor as well
164 if linear_id == 0 or linear_id == num_linear_layers / 2:
--> 165 assert filter_dim_a.prod() == filter_dim_a[0]
166 size = (filter_dim_a[0] * num_linear_layers // 2, filter_dim_a[2])
167 param = fn.weight_buf.new().set_(

AssertionError:


Environment config notes below. I tried the fastai git repo as updated this week, after previously using a version from January, and got the same error message.

=============================================================================
Deep Learning AMI (Amazon Linux)

Please use one of the following commands to start the required environment with the framework of your choice:
for MXNet(+Keras1) with Python3 (CUDA 9/MKL) _________________ source activate mxnet_p36
for MXNet(+Keras1) with Python2 (CUDA 9/MKL) _________________ source activate mxnet_p27
for TensorFlow(+Keras2) with Python3 (CUDA 9/MKL) ____________ source activate tensorflow_p36
for TensorFlow(+Keras2) with Python2 (CUDA 9/MKL) ____________ source activate tensorflow_p27
for Theano(+Keras2) with Python3 (CUDA 9) ____________________ source activate theano_p36
for Theano(+Keras2) with Python2 (CUDA 9) ____________________ source activate theano_p27
for PyTorch with Python3 (CUDA 9) ____________________________ source activate pytorch_p36
for PyTorch with Python2 (CUDA 9) ____________________________ source activate pytorch_p27
for CNTK(+Keras2) with Python3 (CUDA 9) ______________________ source activate cntk_p36
for CNTK(+Keras2) with Python2 (CUDA 9) ______________________ source activate cntk_p27
for Caffe2 with Python2 (CUDA 9) _____________________________ source activate caffe2_p27
for Caffe with Python2 (CUDA 8) ______________________________ source activate caffe_p27
for Caffe with Python3 (CUDA 8) ______________________________ source activate caffe_p35
for Chainer with Python2 (CUDA 9) ____________________________ source activate chainer_p27
for Chainer with Python3 (CUDA 9) ____________________________ source activate chainer_p36
for base Python2 (CUDA 9) ____________________________________ source activate python2
for base Python3 (CUDA 9) ____________________________________ source activate python3

Official Conda User Guide: https://conda.io/docs/user-guide/index.html
AWS Deep Learning AMI Homepage: https://aws.amazon.com/machine-learning/amis/
Developer Guide and Release Notes: https://docs.aws.amazon.com/dlami/latest/devguide/what-is-dlami.html
Support: https://forums.aws.amazon.com/forum.jspa?forumID=263

Amazon Linux version 2018.03 is available.
(python3) [ec2-user@ip-172-31-38-126 ~]$ pip freeze
alabaster==0.7.10
asn1crypto==0.24.0
autovizwidget==0.12.5
awscli==1.15.1
Babel==2.5.3
bcolz==1.2.0
beautifulsoup4==4.6.0
bleach==2.1.3
bokeh==0.12.14
boto3==1.6.16
botocore==1.10.1
certifi==2018.1.18
cffi==1.11.5
chardet==3.0.4
click==6.7
cliff==2.8.1
cloudpickle==0.5.2
cmd2==0.8.2
colorama==0.3.7
configparser==3.5.0
cryptography==2.2.1
cssselect==1.0.3
cycler==0.10.0
cymem==1.31.2
cytoolz==0.8.2
dask==0.17.2
decorator==4.2.1
dill==0.2.7.1
distributed==1.21.5
docrepr==0.1.1
docutils==0.14
en-core-web-sm==2.0.0
entrypoints==0.2.3
environment-kernels==1.1.1
feather-format==0.4.0
feedparser==5.2.1
google-images-download==2.0.4
graphviz==0.8.2
hdijupyterutils==0.12.5
heapdict==1.0.0
html5lib==1.0.1
idna==2.6
imageio==2.3.0
imagesize==1.0.0
instagram-scraper==1.5.28
ipykernel==4.8.2
ipython==6.2.1
ipython-genutils==0.2.0
ipywidgets==7.1.2
isoweek==1.3.3
jedi==0.11.1
Jinja2==2.10
jmespath==0.9.3
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.3
jupyter-console==5.2.0
jupyter-contrib-core==0.3.3
jupyter-contrib-nbextensions==0.5.0
jupyter-core==4.4.0
jupyter-highlight-selected-word==0.1.1
jupyter-latex-envs==1.4.4
jupyter-nbextensions-configurator==0.4.0
jupyternotify==0.1.15
kaggle==1.1.0
kaggle-cli==0.12.13
kiwisolver==1.0.1
locket==0.2.0
lxml==4.0.0
MarkupSafe==1.0
matplotlib==2.2.2
MechanicalSoup==0.8.0
mistune==0.8.3
mizani==0.4.6
msgpack-numpy==0.4.1
msgpack-python==0.5.6
murmurhash==0.28.0
nb-conda==2.2.1
nb-conda-kernels==2.1.0
nbconvert==5.3.1
nbformat==4.4.0
networkx==2.1
notebook==5.4.1
numexpr==2.6.4
numpy==1.14.2
olefile==0.45.1
opencv-python==3.4.0.12
packaging==17.1
palettable==3.1.0
pandas==0.22.0
pandas-summary==0.0.41
pandocfilters==1.4.2
parso==0.1.1
partd==0.3.8
pathlib==1.0.1
patsy==0.5.0
pbr==4.0.1
PDPbox==0.1
pexpect==4.4.0
pickleshare==0.7.4
Pillow==5.0.0
plac==0.9.6
plotly==2.4.1
plotnine==0.3.0
preshed==1.0.0
prettytable==0.7.2
progressbar2==3.34.3
prompt-toolkit==1.0.15
protobuf==3.5.0
psutil==5.4.3
psycopg2==2.7.4
ptyprocess==0.5.2
py4j==0.10.4
pyarrow==0.9.0
pyasn1==0.4.2
pycparser==2.18
pygal==2.4.0
Pygments==2.2.0
pykerberos==1.2.1
pyOpenSSL==17.5.0
pyparsing==2.2.0
pyperclip==1.6.0
PySocks==1.6.8
pyspark==2.2.1
python-dateutil==2.6.1
python-utils==2.3.0
pytz==2018.3
PyWavelets==0.5.2
PyYAML==3.12
pyzmq==17.0.0
qtconsole==4.3.1
regex==2017.4.5
requests==2.18.4
requests-kerberos==0.12.0
rsa==3.4.2
s3transfer==0.1.13
scikit-image==0.13.1
scikit-learn==0.19.1
scipy==1.0.1
seaborn==0.8.1
selenium==3.11.0
Send2Trash==1.5.0
simplegeneric==0.8.1
six==1.11.0
sklearn-pandas==1.6.0
snowballstemmer==1.2.1
sortedcontainers==1.5.9
spacy==2.0.11
sparkmagic==0.12.5
Sphinx==1.7.2
sphinxcontrib-websupport==1.0.1
SQLAlchemy==1.2.5
statsmodels==0.8.0
stevedore==1.28.0
tables==3.4.2
tblib==1.3.2
termcolor==1.1.0
terminado==0.8.1
testpath==0.3.1
thinc==6.10.2
toolz==0.9.0
torch==0.3.1.post3
torchtext==0.2.1
torchvision==0.2.0
tornado==5.0.1
tqdm==4.22.0
traitlets==4.3.2
ujson==1.35
urllib3==1.22
wcwidth==0.1.7
webencodings==0.5.1
widgetsnbextension==3.1.4
wrapt==1.10.11
zict==0.1.3
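
A shorter way to share the versions that matter for these notebooks than a full pip freeze, as a minimal sketch: `report_versions` is a hypothetical helper, and the default package list is only a guess at what is relevant here.

```python
import importlib
import platform


def report_versions(packages=('torch', 'torchtext', 'torchvision', 'spacy', 'numpy', 'pandas')):
    """Return {name: version} for each package, or 'not installed'."""
    versions = {'python': platform.python_version()}
    for name in packages:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, '__version__', 'unknown')
        except Exception:
            # A broken install can raise errors other than ImportError.
            versions[name] = 'not installed'
    return versions


for name, version in report_versions().items():
    print(f'{name}=={version}')
```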

Hi friends,
I am still not able to generate the pkl file. I installed feedparser successfully. After that I ran the command
!GetArXiv.update('/content/clouderizer/fast.ai/data/imdb/aclImdb/all_arxiv.pkl') and it threw an error saying
/bin/sh: 1: Syntax error: word unexpected (expecting ")")

Can I get some help here?
Thanks

GetArXiv.update is a Python function, not an external program. The leading ! tells the notebook to pass the line to the shell, so drop it.

I have just pasted the GetArXiv code from the post above into one cell
and then run this in another cell:

GetArXiv.update('data/arxiv/all_arxiv.pkl')

0. Fetched 1000 abstracts published 2018-07-26 and earlier
1. Fetched 1000 abstracts published 2018-07-18 and earlier
[...]
49. Fetched 1000 abstracts published 2017-04-07 and earlier
Downloaded 50000 new abstracts

It’s slow but eventually gets there.

edit: see my next post - update the notebook code and it should just work without you needing to code anything.

The code to generate all_arxiv.pkl (thanks to @alecrubin) is now integrated into the notebook, so it will generate it on the fly.
Also added instructions to download arxiv.csv in both arxiv notebooks

Hi Stas, I have a question about the all_arxiv.pkl file. For me, the function GetArXiv.update(ALL_ARXIV) is only downloading 4000 or sometimes even fewer (500) abstracts. This seems far too few. I’ve run the code exactly as it appears in the notebook. Any ideas what could be the issue?

It’s possible that arxiv isn’t giving you more than some fixed amount now? I don’t know, I haven’t used it in 3 months.

The source code is there, so please have a look, it should be relatively easy to add some debug statements to see where and why it stops. I didn’t write that code, just integrated it into the notebook.
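
As a starting point for those debug statements, here is a sketch of the same paging loop with the network call swapped for an injected function, so the stop reason is reported explicitly. It mirrors the two exits in `fetch_updated_data` (a page that stays empty after the retries, or a page with no new links); `fetch_all`, `fetch_page` and `max_pages` are hypothetical names, and the sleeps are left out so the sketch runs instantly.

```python
def fetch_all(fetch_page, known_links=(), max_retry=5, max_pages=100):
    """Page through results and report why the loop stopped.

    fetch_page(i) must return a list of links for page i (empty on failure).
    """
    known = set(known_links)
    collected, page, retry = [], 0, 0
    while page < max_pages:
        links = fetch_page(page)
        if not links:
            if retry < max_retry:
                retry += 1
                print(f'page {page}: empty response, retry {retry}/{max_retry}')
                continue
            return collected, f'page {page} stayed empty after {max_retry} retries'
        new = [link for link in links if link not in known]
        print(f'page {page}: {len(links)} links, {len(new)} new')
        if not new:
            return collected, f'page {page} contained no new links'
        collected.extend(new)
        page, retry = page + 1, 0
    return collected, f'hit max_pages={max_pages}'


# Stand-in for the real request: two full pages, then nothing.
fake_pages = {0: ['a', 'b'], 1: ['c', 'd']}
links, reason = fetch_all(lambda i: fake_pages.get(i, []), max_retry=2)
print(len(links), 'links; stopped because', reason)
```

If the real loop reports an empty page well before the data should run out, raising `max_retry` or `wait_time` is a cheap first experiment; if it reports no new links, the existing pickle already covers that page.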

Good luck.