Thank you, very much @sgugger for your superb work with fastai!
Best regards!
When I try to create the datasource object for text through:
tfms = [attrgetter(âtextâ), Tokenizer.from_df(âtextâ), Numericalize()]
dsrc = DataSource(df, [tfms], splits=splits, dl_type=LMDataLoader)
Unfortunately, it shows the following error, which, I guess, has to do with some kind of palatalization issue.
AttributeError: Can't pickle local object 'parallel_gen.<locals>.f'
Can anybody help what is causing such an error?
I searched on the web for a solution, but could not find anything
Perhaps try using dill instead of pickle? (It works exactly the same way but it handles more complex stuff.)
Hi Pablo,
Could you elaborate more on your solution. I am facing the same error as Preka. As far as I know itâs related to multi-processing but not able to resolve it. This is what I tried so far:
dsets = DataSource(df_all, [tfms], splits=splits, dl_type=partial(LMDataLoader,num_workers=0))
If you are also getting
AttributeError: Can't pickle local object 'parallel_gen.<locals>.f'
when trying to save data, then the first thing I would try would be to use dill
instead of pickle
to save and load data.
They work exactly the same way. In fact, many people go as far as importing dill
as pickle
so the code is the same:
import dill as pickle
Hi Pablo, thanks for your reply.
However, I tried what you suggested but it didnât work. It also gives the same error as before.
I seem to believe I was having a similar problem with version 2, Iâll update when I can confirm.
Update: After some update itâs working fine!
Seems like it is an issue of windows (I was using Anaconda Powershell in Windows), as
it works perfectly in Linux (Ubuntu 18.4).
Thanks a lot for your extremely helpful support!
It was nothing, Iâm just glad I could help!
I came across the same problem with Win10. I just wonder whether the issue being fixed. Ubuntu 19 was working with CUDA(18.04 cuda didnât work) in my Vbox, but it seems have problem recently. Maybe I should wait for Ubuntu 20. Any resolution for Fastai2 in Win?
Iâve been having the AttributeError: Can't pickle local object 'parallel_gen.<locals>.f'
issue with MacOS but only when calling
dls_cl = TextDataLoaders.from_df(
df,
text_col='headline',
label_col=topic,
valid_pct=0.1,
text_vocab=learn.dls.vocab[0],
bs=128
)
using Fastapi. When I run the Python file normally itâs fine.
I was also having the same problem with pickling in Win10. In my case when trying to create a DataBlock that contains a TextBlock. I tracked down the issue and found a solution, but involves changes in torch_core. I will share it here for anyone that might be interested.
The root of the issue is that the function torch_core.parallel_gen contains two functions within it that cannot be pickled (at least it in windows, but AFAIK, nested functions in general are not pickleable). Pickling in this case is needed to parallelize processes, so this is why this problem might appear in different contexts. The workaround I found is to take the nested functions out, and use functools.partial to evaluate the needed info inside parallel_gen. So, if we change the function parallel_gen from this:
def parallel_gen(cls, items, n_workers=defaults.cpus, as_gen=False, **kwargs):
"Instantiate `cls` in `n_workers` procs & call each on a subset of `items` in parallel."
batches = np.array_split(items, n_workers)
idx = np.cumsum(0 + L(batches).map(len))
queue = Queue()
def f(batch, start_idx):
for i,b in enumerate(cls(**kwargs)(batch)): queue.put((start_idx+i,b))
def done(): return (queue.get() for _ in progress_bar(items, leave=False))
yield from run_procs(f, done, L(batches,idx).zip())
to this:
import functools
def f_pg(clse,queue,batch, start_idx):
for i,b in enumerate(clse(batch)): queue.put((start_idx+i,b))
def done_pg(queue,items): return (queue.get() for _ in progress_bar(items, leave=False))
def parallel_gen(cls, items, n_workers=defaults.cpus, as_gen=False, **kwargs):
"Instantiate `cls` in `n_workers` procs & call each on a subset of `items` in parallel."
batches = np.array_split(items, n_workers)
idx = np.cumsum(0 + L(batches).map(len))
queue = Queue()
f=functools.partial(f_pg,cls(**kwargs),queue)
done=functools.partial(done_pg,queue,items)
yield from run_procs(f, done, L(batches,idx).zip())
Then the problem is solved. It feels a bit slow tho, so I am not sure if this is a good solution or a simple hack to make this work. I tested with this simple example that tries to load data into a Datablock:
import numpy as np
import pandas as pd
from fastai2.text.all import Callback,DataBlock,TextBlock,CategoryBlock,ColReader
t=[list('abcdefghijklmnopqrstuvwxyz'),list('0123456789')]
t_l=np.random.randint(0,2,100)
t_d=[''.join(np.random.choice(t[i],np.random.randint(1,10))) for i in t_l]
df=pd.DataFrame.from_dict({'text':t_d,'label':t_l})
db = DataBlock(blocks=(TextBlock.from_df('text'), CategoryBlock),
get_x=ColReader('text'),
get_y=ColReader('label'))
dls = db.dataloaders(df)
I am new to fastai, so I am not sure if this is the right way of doing things, but I thought this might be helpful, so I decided to share.
grillard, thank you for this workaround. Great job and a perfect contribution!
seriousyhyena
Are you just using a forked version of fastcore without the functions in functions? I have a feeling itâs possibe to patch new functionality into fastai2 but Iâm not sure how
I have a pull request open based on your proposed solution @grillard. It looks like it doesnât work in Python 3.6 though which is a shame.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-151-da1277b2abe5> in <module>
4 items = range(5)
5
----> 6 res = L(parallel_gen(_C, items, n_workers=3))
7 idxs,dat1 = zip(*res.sorted(itemgetter(0)))
8 test_eq(dat1, range(1,6))
~/Documents/fastcore/nbs/fastcore/foundation.py in __call__(cls, x, *args, **kwargs)
45 return x
46
---> 47 res = super().__call__(*((x,) + args), **kwargs)
48 res._newchk = 0
49 return res
~/Documents/fastcore/nbs/fastcore/foundation.py in __init__(self, items, use_list, match, *rest)
322 if items is None: items = []
323 if (use_list is not None) or not _is_array(items):
--> 324 items = list(items) if use_list else _listify(items)
325 if match is not None:
326 if is_coll(match): match = len(match)
~/Documents/fastcore/nbs/fastcore/foundation.py in _listify(o)
258 if isinstance(o, list): return o
259 if isinstance(o, str) or _is_array(o): return [o]
--> 260 if is_iter(o): return list(o)
261 return [o]
262
<ipython-input-150-9348352aeb75> in parallel_gen(cls, items, n_workers, **kwargs)
10 f=partial(f_pg, cls(**kwargs), queue)
11 done=partial(done_pg, queue, items)
---> 12 yield from run_procs(f, done, L(batches,idx).zip())
<ipython-input-147-1d6836dd4202> in run_procs(f, f_done, args)
4 processes = L(args).map(Process, args=arg0, target=f)
5 for o in processes: o.start()
----> 6 yield from f_done()
7 processes.map(Self.join())
<ipython-input-149-d8e2b9ae3cb4> in done_pg(queue, items)
----> 5 return (queue.get() for _ in progress_bar(items, leave=False))
TypeError: 'NoneType' object is not callable
Jeremy sorted it! The changes are up on master