How long should it take to run LanguageModelData.from_text_files on an average desktop PC? And is there some method to speed the process up? Transferring each text string to a file then loading takes a long time, and probably doesn’t scale well. Maybe I am doing something wrong.
I managed to get df_mb right (it’s just the summary data from the csv file Jeremy linked above), but I can’t even figure out what df_all is (it’s somehow related to the .pickle file) or what it’s intended to do. Can anyone explain?
I managed to get the data from the “get_arxiv.py” script linked by @jeremy working.
I took the file and changed a few lines in order to make the script create a pickle file if it does not exist.
Here is a link to the updated file : https://pastebin.com/t3SuugS4
I’m not sure whether the script ends properly, which is why I dump the pickle file after every API request/parse. If you think there is a cleaner or better way of doing it, please tell me!
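One possibly cleaner way to checkpoint after each request, sketched here under assumptions: write the pickle to a temporary file and then swap it into place with os.replace, so an interrupted run never leaves a half-written pickle behind. The helper names save_checkpoint and load_checkpoint and the sample records are hypothetical, not part of the script.

```python
import os
import pickle
import tempfile


def save_checkpoint(obj, path):
    """Pickle obj to path without ever exposing a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        os.replace(tmp_path, path)  # rename over the old checkpoint (atomic on POSIX)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default=None):
    """Return the unpickled object, or default when no checkpoint exists yet."""
    if not os.path.isfile(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


# Usage with made-up records standing in for parsed API results.
workdir = tempfile.mkdtemp()
checkpoint = os.path.join(workdir, "all_arxiv.pkl")
records = load_checkpoint(checkpoint, default=[])
for page in (["paper-a", "paper-b"], ["paper-c"]):
    records.extend(page)
    save_checkpoint(records, checkpoint)  # safe to interrupt between pages
restored = load_checkpoint(checkpoint)
```

Dumping after every request still costs one full write per page, but a crash mid-write can no longer corrupt the data already collected.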
And I then use it like this in the lang_model-arxiv.ipynb :
if check_for_update():
    update_arxiv()
I’ve got to change a couple of things after that, including the call to LanguageModelData :
For anyone interested, I edited get_arxiv.py from the brundage bot project to get the arXiv language model data for lessons 4 & 5. I converted the original code from Python 2 to Python 3, wrapped everything in a class, and fixed it to work when you do not have an existing arXiv pickle file. The only required package outside of the fastai environment is feedparser. If you don’t have it installed already, just do a quick pip install feedparser. Then run GetArXiv.update('data/all_arxiv.pkl') to fetch the data. You can also call GetArXiv.load('data/all_arxiv.pkl') to load the data.
import os, re, requests, time
import feedparser
import pandas as pd


class GetArXiv(object):
    def __init__(self, pickle_path, categories=None):
        """
        :param pickle_path (str): path to the pickle file to save/load, or a directory to hold all_arxiv.pkl
        :param categories (list): arXiv categories to query
        """
        if os.path.isdir(pickle_path):
            pickle_path = os.path.join(pickle_path, 'all_arxiv.pkl')
        if not categories:
            categories = ['cs*', 'cond-mat.dis-nn', 'q-bio.NC', 'stat.CO', 'stat.ML']
            # categories += ['cs.CV', 'cs.AI', 'cs.LG', 'cs.CL']
        self.categories = categories
        self.pickle_path = pickle_path
        self.base_url = 'http://export.arxiv.org/api/query'

    @staticmethod
    def build_qs(categories):
        """Build query string from categories"""
        return '+OR+'.join(['cat:'+c for c in categories])

    @staticmethod
    def get_entry_dict(entry):
        """Return a dictionary with the items we want from a feedparser entry"""
        try:
            return dict(title=entry['title'], authors=[a['name'] for a in entry['authors']],
                        published=pd.Timestamp(entry['published']), summary=entry['summary'],
                        link=entry['link'], category=entry['category'])
        except KeyError:
            print('Missing keys in row: {}'.format(entry))
            return None

    @staticmethod
    def strip_version(link):
        """Strip the trailing version suffix (e.g. 'v2' or 'v12') from an arXiv paper link"""
        return re.sub(r'v\d+$', '', link)

    def fetch_updated_data(self, max_retry=5, pg_offset=0, pg_size=1000, wait_time=15):
        """
        Get new papers from arXiv server
        :param max_retry: max number of times to retry a request
        :param pg_offset: number of pages to offset
        :param pg_size: num abstracts to fetch per request
        :param wait_time: num seconds to wait between requests
        """
        i, retry = pg_offset, 0
        df = pd.DataFrame()
        past_links = []
        if os.path.isfile(self.pickle_path):
            df = pd.read_pickle(self.pickle_path)
            df = df.reset_index(drop=True)  # reset_index returns a new frame, so assign it
            if len(df) > 0: past_links = df.link.apply(self.strip_version)
        while True:
            params = dict(search_query=self.build_qs(self.categories),
                          sortBy='submittedDate', start=pg_size*i, max_results=pg_size)
            # the query is passed as a ready-made string so the '+OR+' separators are sent as written
            response = requests.get(self.base_url, params='&'.join([f'{k}={v}' for k, v in params.items()]))
            entries = feedparser.parse(response.text).entries
            if len(entries) < 1:
                # an empty page may be a temporary hiccup: retry a few times before giving up
                if retry < max_retry:
                    retry += 1
                    time.sleep(wait_time)
                    continue
                break
            # get_entry_dict returns None for entries with missing keys; skip those
            rows = [row for row in map(self.get_entry_dict, entries) if row is not None]
            results_df = pd.DataFrame(rows)
            max_date = results_df.published.max().date()
            new_links = ~results_df.link.apply(self.strip_version).isin(past_links)
            print(f'{i}. Fetched {len(results_df)} abstracts published {max_date} and earlier')
            if not new_links.any():
                break
            df = pd.concat((df, results_df.loc[new_links]), ignore_index=True)
            i += 1
            retry = 0
            time.sleep(wait_time)
        print(f'Downloaded {len(df)-len(past_links)} new abstracts')
        # keep one row per link, preferring the most recent; the result has to be assigned back
        df = df.sort_values('published', ascending=False).groupby('link').first().reset_index()
        df.to_pickle(self.pickle_path)
        return df

    @classmethod
    def load(cls, pickle_path):
        """Load data from the pickle file"""
        return pd.read_pickle(cls(pickle_path).pickle_path)

    @classmethod
    def update(cls, pickle_path, categories=None, **kwargs):
        """
        Update arXiv data pickle with the latest abstracts
        """
        cls(pickle_path, categories).fetch_updated_data(**kwargs)
        return True
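To see what request the class ends up sending, the query string can be rebuilt by hand. This is a standalone sketch that copies the build_qs logic and the parameter names from the code above; the two categories are just an example. It also shows what urllib.parse.urlencode does to the same parameters. As far as I know, requests encodes a dict of params the same way, which is presumably why the class joins the string itself instead of passing a dict.

```python
from urllib.parse import urlencode

BASE_URL = 'http://export.arxiv.org/api/query'


def build_qs(categories):
    """Join categories into an arXiv search_query such as 'cat:cs*+OR+cat:stat.ML'."""
    return '+OR+'.join('cat:' + c for c in categories)


params = dict(search_query=build_qs(['cs*', 'stat.ML']),
              sortBy='submittedDate', start=0, max_results=1000)

# Joined by hand, the '+OR+' separators reach the server unchanged.
raw_query = '&'.join(f'{k}={v}' for k, v in params.items())
print(f'{BASE_URL}?{raw_query}')

# urlencode percent-encodes ':', '*' and '+', which changes the query text.
encoded_query = urlencode(params)
print(f'{BASE_URL}?{encoded_query}')
```

A raw + in a query string is conventionally decoded as a space, while %2B is a literal plus sign, so the two URLs are not equivalent.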
Could someone post the package versions, environment setup notes, the version of the fast.ai git repo, and a link to a notebook that is working for them?
I’m having the devil’s own fun getting the lesson 4 NLP notebooks to work. Not fun, and unusually frustrating.
~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/autograd/function.py in forward(self, *args)
348 def forward(self, *args):
349 nested_tensors = _map_variable_tensor(self._nested_input)
→ 350 result = self.forward_extended(*nested_tensors)
351 del self._nested_input
352 self._nested_output = result
~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py in forward_extended(self, input, weight, hx)
292 hy = tuple(h.new() for h in hx)
293
→ 294 cudnn.rnn.forward(self, input, hx, weight, output, hy)
295
296 self.save_for_backward(input, hx, weight, output)
~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py in forward(fn, input, hx, weight, output, hy)
249 # Alternatively, copyParams could be written more carefully.
250 w.zero()
→ 251 params = get_parameters(fn, handle, w)
252 _copyParams(weight, params)
253 else:
~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py in get_parameters(fn, handle, weight_buf)
163 # might as well merge the CUDNN ones into a single tensor as well
164 if linear_id == 0 or linear_id == num_linear_layers / 2:
→ 165 assert filter_dim_a.prod() == filter_dim_a[0]
166 size = (filter_dim_a[0] * num_linear_layers // 2, filter_dim_a[2])
167 param = fn.weight_buf.new().set_(
AssertionError:
Environment config notes are below. I tried the fastai git repo as updated this week, after previously using a version from January, and got the same error message.
Deep Learning AMI (Amazon Linux)
Please use one of the following commands to start the required environment with the framework of your choice:
for MXNet(+Keras1) with Python3 (CUDA 9/MKL) _________________ source activate mxnet_p36
for MXNet(+Keras1) with Python2 (CUDA 9/MKL) _________________ source activate mxnet_p27
for TensorFlow(+Keras2) with Python3 (CUDA 9/MKL) ____________ source activate tensorflow_p36
for TensorFlow(+Keras2) with Python2 (CUDA 9/MKL) ____________ source activate tensorflow_p27
for Theano(+Keras2) with Python3 (CUDA 9) ____________________ source activate theano_p36
for Theano(+Keras2) with Python2 (CUDA 9) ____________________ source activate theano_p27
for PyTorch with Python3 (CUDA 9) ____________________________ source activate pytorch_p36
for PyTorch with Python2 (CUDA 9) ____________________________ source activate pytorch_p27
for CNTK(+Keras2) with Python3 (CUDA 9) ______________________ source activate cntk_p36
for CNTK(+Keras2) with Python2 (CUDA 9) ______________________ source activate cntk_p27
for Caffe2 with Python2 (CUDA 9) _____________________________ source activate caffe2_p27
for Caffe with Python2 (CUDA 8) ______________________________ source activate caffe_p27
for Caffe with Python3 (CUDA 8) ______________________________ source activate caffe_p35
for Chainer with Python2 (CUDA 9) ____________________________ source activate chainer_p27
for Chainer with Python3 (CUDA 9) ____________________________ source activate chainer_p36
for base Python2 (CUDA 9) ____________________________________ source activate python2
for base Python3 (CUDA 9) ____________________________________ source activate python3
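For environment reports like the one requested above, a short script can gather the installed versions in one go. This is only a sketch: the package list is an assumption about what the lesson 4 notebooks import, collect_versions is a hypothetical helper, and importlib.metadata needs Python 3.8 or newer (on the Python 3.6 environment shown in the traceback, pip freeze would be the fallback).

```python
import platform
import sys
from importlib import metadata


def collect_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = 'not installed'
    return report


# Assumed list of relevant packages; adjust to whatever the notebook imports.
packages = ['torch', 'torchtext', 'fastai', 'spacy', 'pandas', 'numpy']
print('python  ', sys.version.split()[0])
print('platform', platform.platform())
versions = collect_versions(packages)
for name, version in versions.items():
    print(f'{name:<9}{version}')
```

Since the assertion is raised inside the cuDNN RNN code, the values of torch.version.cuda and torch.backends.cudnn.version() are probably worth including in the report as well.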
Hi friends,
I’m still not able to generate the pkl file. I installed feedparser successfully. After that I ran the command !GetArXiv.update('/content/clouderizer/fast.ai/data/imdb/aclImdb/all_arxiv.pkl') and it threw this error:
/bin/sh: 1: Syntax error: word unexpected (expecting ")")
GetArXiv.update('data/arxiv/all_arxiv.pkl')
0. Fetched 1000 abstracts published 2018-07-26 and earlier
1. Fetched 1000 abstracts published 2018-07-18 and earlier
[...]
49. Fetched 1000 abstracts published 2017-04-07 and earlier
Downloaded 50000 new abstracts
It’s slow but eventually gets there.
edit: see my next post - update the notebook code and it should just work without you needing to code anything.
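The shell syntax error quoted earlier in the thread is what one would expect when the call is prefixed with !: in a Jupyter notebook, a leading ! sends the rest of the line to the system shell instead of the Python interpreter. The sketch below reproduces that outside a notebook by handing the same text to sh. It assumes a POSIX sh is on the PATH, and the exact error wording differs between shells.

```python
import subprocess

# The same text that '!GetArXiv.update(...)' hands to the shell.
command = "GetArXiv.update('data/all_arxiv.pkl')"
result = subprocess.run(['sh', '-c', command], capture_output=True, text=True)

# The shell rejects the Python call syntax and exits with a non-zero status.
print('exit code:', result.returncode)
print('stderr   :', result.stderr.strip())
```

Running the cell without the !, as in the working call above, lets Python evaluate the line instead.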
The code to generate all_arxiv.pkl (thanks to @alecrubin) is now integrated into the notebook, so it will generate it on the fly.
Also added instructions to download arxiv.csv in both arxiv notebooks
Hi Stas, I have a question about the all_arxiv.pkl file. For me, the function GetArXiv.update(ALL_ARXIV) is only downloading 4000 or sometimes even fewer (500) abstracts. This seems far too few. I’ve run the code exactly as it appears in the notebook. Any ideas what could be the issue?
It’s possible that arxiv isn’t giving you more than some fixed amount now? I don’t know, I haven’t used it in 3 months.
The source code is there, so please have a look, it should be relatively easy to add some debug statements to see where and why it stops. I didn’t write that code, just integrated it into the notebook.
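As a starting point for that kind of debugging, here is a sketch of the same paging loop with a log line at each decision and an explicit stop reason. In the class, the loop ends either when a page stays empty after max_retry attempts or when a page contains no new links, so those are the two exits worth logging. The function paginate and its fetch_page argument are hypothetical stand-ins for the request and feedparser step, and the fake pages are made up so the example runs without hitting arXiv.

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(message)s')
log = logging.getLogger('arxiv_debug')


def paginate(fetch_page, known_links=(), max_retry=5, max_pages=100):
    """Collect new links page by page and report why the loop stopped.

    fetch_page(i) is a hypothetical stand-in for the request + parse step;
    it returns the list of links found on page i (empty if there were none).
    """
    seen = set(known_links)
    collected = []
    page = retry = 0
    while page < max_pages:
        links = fetch_page(page)
        if not links:
            if retry < max_retry:
                retry += 1
                log.info('page %d: empty response, retry %d/%d', page, retry, max_retry)
                continue  # the real loop would sleep here before retrying
            return collected, f'page {page}: still empty after {max_retry} retries'
        new = [link for link in links if link not in seen]
        log.info('page %d: %d entries, %d new', page, len(links), len(new))
        if not new:
            return collected, f'page {page}: no new links'
        collected.extend(new)
        seen.update(new)
        page += 1
        retry = 0
    return collected, f'stopped at max_pages={max_pages}'


# Fake server: two pages of results, then nothing but empty responses.
fake_pages = {0: ['a', 'b'], 1: ['c']}
links, reason = paginate(lambda i: fake_pages.get(i, []), max_retry=2)
print(len(links), 'links;', reason)
```

If the real run reports an empty page well before the expected total, the server returning empty responses would be the first thing to check; raising max_retry or wait_time in GetArXiv.update may help in that case.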