How to read StackTrace of Error

What setup is needed to load a model? If I just create a learn object and call load, I get the following exception. I’m not sure what the minimum initialization is before loading a model.

While copying the parameter named 0.weight, whose dimensions in the model are torch.Size([4096]) and whose dimensions in the checkpoint are torch.Size([64, 3, 7, 7]), …

RuntimeError Traceback (most recent call last)
in ()
----> 1 learn.load('299_pre_model')

~/fastai/courses/dl1/fastai/learner.py in load(self, name)
61 def get_model_path(self, name): return os.path.join(self.models_path,name)+'.h5'
62 def save(self, name): save_model(self.model, self.get_model_path(name))
---> 63 def load(self, name): load_model(self.model, self.get_model_path(name))
64
65 def set_data(self, data): self.data_ = data

~/fastai/courses/dl1/fastai/torch_imports.py in load_model(m, p)
20 def children(m): return m if isinstance(m, (list, tuple)) else list(m.children())
21 def save_model(m, p): torch.save(m.state_dict(), p)
---> 22 def load_model(m, p): m.load_state_dict(torch.load(p))
23
24 def load_pre(pre, f, fn):

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict)
358 param = param.data
359 try:
--> 360 own_state[name].copy_(param)
361 except:
362 print('While copying the parameter named {}, whose dimensions in the model are'

RuntimeError: invalid argument 2: sizes do not match at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THC/THCTensorCopy.cu:31

I have seen the same bug, but I’m not sure what causes it yet. I created an issue here:

@rsrivastava the reason for that error is that you need to set precompute=False, since that’s how you saved the model.

In general, no setup is required other than to create a learn object with the same parameters as when you saved it.
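
To see which parameters disagree before calling load, you can compare the shapes stored in the checkpoint with the shapes in the model you just built. The following is a minimal sketch and not part of fastai: `find_shape_mismatches` is a hypothetical helper, and the two dictionaries stand in for what you would build with something like `{k: tuple(v.shape) for k, v in model.state_dict().items()}` and the same expression applied to the loaded checkpoint. The shapes used here are the ones from the error message above.

```python
def find_shape_mismatches(model_shapes, checkpoint_shapes):
    """Return {name: (model_shape, checkpoint_shape)} for every parameter whose
    shape differs. A name present on only one side is reported with None."""
    mismatches = {}
    for name in sorted(set(model_shapes) | set(checkpoint_shapes)):
        m = model_shapes.get(name)
        c = checkpoint_shapes.get(name)
        if m != c:
            mismatches[name] = (m, c)
    return mismatches


# Shapes copied from the error message; the dicts are illustrative stand-ins
# for the real state dicts.
model_shapes = {"0.weight": (4096,)}
checkpoint_shapes = {"0.weight": (64, 3, 7, 7)}

for name, (m, c) in find_shape_mismatches(model_shapes, checkpoint_shapes).items():
    print("{}: model {} vs checkpoint {}".format(name, m, c))
```

A first parameter that is a flat vector in the model but a 4-D convolution kernel in the checkpoint suggests the two networks were built differently, which is consistent with the precompute mismatch described above.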

Question: how do we know Jupyter is still running? If the periodic save message (its heartbeat) stops appearing, does that mean Jupyter is not working?

I am trying to run the get_data(sz,bs) command and I get the following error. If one thread throws an exception, does the run still work, or does it just hang without any error?

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/home/ubuntu/src/anaconda3/envs/fastai/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/src/anaconda3/envs/fastai/lib/python3.6/site-packages/tqdm/_tqdm.py", line 144, in run
    for instance in self.tqdm_cls._instances:
  File "/home/ubuntu/src/anaconda3/envs/fastai/lib/python3.6/_weakrefset.py", line 60, in __iter__
    for itemref in self.data:
RuntimeError: Set changed size during iteration

That’s just a warning - you can ignore it.

Thanks Jeremy. I observed that Jupyter Notebook saves the notebook every 2 minutes by default. If the notebook has not logged a save message for, say, 30 minutes, does that mean it is no longer processing and we need to restart? I am unable to tell whether the process is just taking a long time or is hung.

Just ran this fit command:
%time learn.fit(1e-2,1, cycle_len=1)

Running these four simple lines of code gave this error. I’m not sure which argument I have missed.

TypeError Traceback (most recent call last)
in ()

~/fastai/courses/dl1/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
95 self.sched = None
96 layer_opt = self.get_layer_opt(lrs, wds)
---> 97 self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
98
99 def lr_find(self, start_lr=1e-5, end_lr=10, wds=None):

~/fastai/courses/dl1/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, metrics, callbacks, **kwargs)
85 n_epoch = sum_geom(cycle_len if cycle_len else 1, cycle_mult, n_cycle)
86 fit(model, data, n_epoch, layer_opt.opt, self.crit,
---> 87 metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, **kwargs)
88
89 def get_layer_groups(self): return self.models.get_layer_groups()

~/fastai/courses/dl1/fastai/model.py in fit(model, data, epochs, opt, crit, metrics, callbacks, **kwargs)
90 if stop: return
91
---> 92 vals = validate(stepper, data.val_dl, metrics)
93 print(np.round([epoch, avg_loss] + vals, 6))
94 stop=False

~/fastai/courses/dl1/fastai/model.py in validate(stepper, dl, metrics)
102 preds,l = stepper.evaluate(VV(x), VV(y))
103 loss.append(to_np(l))
--> 104 res.append([f(to_np(preds),to_np(y)) for f in metrics])
105 return [np.mean(loss)] + list(np.mean(np.stack(res),0))
106

~/fastai/courses/dl1/fastai/model.py in <listcomp>(.0)
102 preds,l = stepper.evaluate(VV(x), VV(y))
103 loss.append(to_np(l))
--> 104 res.append([f(to_np(preds),to_np(y)) for f in metrics])
105 return [np.mean(loss)] + list(np.mean(np.stack(res),0))
106

TypeError: accuracy_multi() missing 1 required positional argument: 'thresh'

That means you set your metrics incorrectly. We can’t help much more without seeing your actual code.
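
The traceback shows each metric being called as `f(preds, targs)` with exactly two arguments, so a metric that needs a third parameter must have it bound beforehand. Here is a sketch of one way to do that with `functools.partial`. The `accuracy_multi` below is a simplified stand-in written for this example on plain Python lists, not fastai's implementation, and the threshold of 0.5 is arbitrary.

```python
from functools import partial


def accuracy_multi(preds, targs, thresh):
    """Stand-in metric: fraction of entries where (pred > thresh) matches the target."""
    hits = [(p > thresh) == bool(t) for p, t in zip(preds, targs)]
    return sum(hits) / len(hits)


# Bind thresh up front so the metric can be called with two arguments.
metrics = [partial(accuracy_multi, thresh=0.5)]

preds, targs = [0.9, 0.2, 0.7, 0.4], [1, 0, 0, 0]
results = [f(preds, targs) for f in metrics]  # same call shape as in validate()
print(results)
```

Passing the bare function in the metrics list instead of the partial reproduces the TypeError, because nothing supplies `thresh`.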

I get this type of error all the time. Not sure what is the root cause. How should we read the stack trace?


RuntimeError Traceback (most recent call last)
in ()

~/fastai/courses/dl1/fastai/learner.py in TTA(self, n_aug, is_test)
148 dl1 = self.data.test_dl if is_test else self.data.val_dl
149 dl2 = self.data.test_aug_dl if is_test else self.data.aug_dl
--> 150 preds1,targs = predict_with_targs(self.model, dl1)
151 preds1 = [preds1]*math.ceil(n_aug/4)
152 preds2 = [predict_with_targs(self.model, dl2)[0] for i in range(n_aug)]

~/fastai/courses/dl1/fastai/model.py in predict_with_targs(m, dl)
115 if hasattr(m, 'reset'): m.reset()
116 preda,targa = zip(*[(get_prediction(m(*VV(x))),y)
--> 117 for *x,y in iter(dl)])
118 return to_np(torch.cat(preda)), to_np(torch.cat(targa))
119

~/fastai/courses/dl1/fastai/model.py in <listcomp>(.0)
114 m.eval()
115 if hasattr(m, 'reset'): m.reset()
--> 116 preda,targa = zip(*[(get_prediction(m(*VV(x))),y)
117 for *x,y in iter(dl)])
118 return to_np(torch.cat(preda)), to_np(torch.cat(targa))

~/fastai/courses/dl1/fastai/dataset.py in __next__(self)
226 if self.i>=len(self.dl): raise StopIteration
227 self.i+=1
--> 228 return next(self.it)
229
230 @property

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
193 while True:
194 assert (not self.shutdown and self.batches_outstanding > 0)
--> 195 idx, batch = self.data_queue.get()
196 self.batches_outstanding -= 1
197 if idx != self.rcvd_idx:

~/src/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py in get(self)
335 res = self._reader.recv_bytes()
336 # unserialize the data after having released the lock
--> 337 return _ForkingPickler.loads(res)
338
339 def put(self, obj):

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/multiprocessing/reductions.py in rebuild_storage_fd(cls, df, size)
68 fd = multiprocessing.reduction.rebuild_handle(df)
69 else:
---> 70 fd = df.detach()
71 try:
72 storage = storage_from_cache(cls, fd_id(fd))

~/src/anaconda3/envs/fastai/lib/python3.6/multiprocessing/resource_sharer.py in detach(self)
56 '''Get the fd. This should only be called once.'''
57 with _resource_sharer.get_connection(self._id) as conn:
---> 58 return reduction.recv_handle(conn)
59
60

~/src/anaconda3/envs/fastai/lib/python3.6/multiprocessing/reduction.py in recv_handle(conn)
180 '''Receive a handle over a local connection.'''
181 with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
--> 182 return recvfds(s, 1)[0]
183
184 def DupFd(fd):

~/src/anaconda3/envs/fastai/lib/python3.6/multiprocessing/reduction.py in recvfds(sock, size)
159 if len(ancdata) != 1:
160 raise RuntimeError('received %d items of ancdata' %
--> 161 len(ancdata))
162 cmsg_level, cmsg_type, cmsg_data = ancdata[0]
163 if (cmsg_level == socket.SOL_SOCKET and

RuntimeError: received 0 items of ancdata

Search the forum for this one - it’s a known issue and there’s a solution in fastai repo’s github issues.

@KevinB
Have you been able to resolve this issue?

Thanks so much Jeremy. You mentioned rebooting the AWS instance and increasing the ulimit… I tried both and still get the issue. It is intermittent, and I’m not sure of the reason: the same notebook that once worked no longer works.

Check out this github issue. The work-around you are looking for is there.

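import resource

# Raise the soft limit on open file descriptors for this process to 2048,
# leaving the hard limit (rlimit[1]) unchanged.
rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (2048, rlimit[1]))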
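
A slightly more defensive variant of this workaround, as a sketch only: `raise_nofile_limit` is a hypothetical helper name, and the default target of 2048 is simply the value used above. It caps the request at the hard limit, because asking for a soft limit above the hard limit raises an error, and it does nothing if the soft limit is already high enough. The `resource` module is available on Unix-like systems only.

```python
import resource


def raise_nofile_limit(target=2048):
    """Raise the soft open-file limit to `target` if it is lower.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit"; otherwise soft may not exceed hard.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough, nothing to change
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


new_soft = raise_nofile_limit()
print("soft limit on open files:", new_soft)
```

The limit applies to the current process and its children, so in a notebook this would need to run before the data loaders are created.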

I guess we should use code formatting…

It’s simple…

Just enclose your code in a pair of triple backticks: one line of three backticks before the code and one after.

(The backtick key is located at the top left of the keyboard.)

print('fast.ai')

Thanks…

For further reference… (I have not tried them, but they should also work…)

TypeError: accuracy_multi() missing 1 required positional argument: 'thresh'

@rsrivastava, were you able to resolve this error? What were you missing? I am also getting this :confused:

@shubham24 please provide the info requested here so we can help, in case @rsrivastava doesn’t have an answer for you: http://wiki.fast.ai/index.php/How_to_ask_for_Help

I am trying to reuse the code from the Lesson 1 and 2 notebooks to fit the data from Kaggle - Plant Seedlings Classification.

I am using the labels.csv file with file,species header. The rows look like this:

file,species
46fa84dad.png,Fat Hen
52e82d773.png,Fat Hen
61fd68900.png,Fat Hen 
9064640e8.png,Fat Hen
79cec7209.png,Fat Hen
c734bade3.png,Fat Hen

I get the error below when I try to run the following cell.

I am running the code on a local machine using GTX 1080.

As Kevin mentioned above, I increased the ulimit and it worked.

The problem is that you have spaces in the name of each label. That means it thinks there are two labels, “Fat”, and “Hen”. Replace the spaces with underscores to fix this.
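
A small sketch of the effect, assuming the label column is split on spaces so that one row can carry several labels. `parse_labels` is a hypothetical stand-in for that parsing step, not the fastai function.

```python
def parse_labels(field, sep=" "):
    """Split one label field into individual labels (stand-in for the CSV parser)."""
    return field.split(sep)


raw = "Fat Hen"
fixed = raw.replace(" ", "_")

print(parse_labels(raw))    # ['Fat', 'Hen']  -> treated as two labels
print(parse_labels(fixed))  # ['Fat_Hen']     -> treated as one label
```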

I tried both workarounds:

  1. Increasing the ulimit for the number of files open per process, as suggested by @KevinB
  2. Replacing spaces with underscores, as suggested by @jeremy

and the one that got me past the error was #2 (replacing spaces with underscores). Here is the updated code snippet I’m using to produce the csv file:

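import os
from glob import glob

import pandas as pd

# PATH is assumed to be defined earlier in the notebook (root of the dataset).
rows = []
for image in glob("{}/train/**/*.png".format(PATH)):
    # Layout is <PATH>/train/<species>/<file>.png
    species_dir, file_ = os.path.split(image)
    species = os.path.basename(species_dir)

    # "Fat hen" -> "Fat_hen" per http://forums.fast.ai/t/how-to-read-stacktrace-of-error/7795/21?u=tleyden
    species = species.replace(' ', '_')

    rows.append({"file": file_, "species": species})

# Build the frame once from the collected rows; DataFrame.append was removed in pandas 2.0.
df = pd.DataFrame(rows, columns=["file", "species"])
df.to_csv('PlantClassificationLabels.csv', index=False)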