Reproducing dawnbench results

I must admit that I'm not using the same dataloaders and transforms as you. I run based on this template:

using WideResNet-22 as the learner. I started with your notebook, but got some PyTorch errors:

TypeError: eq received an invalid combination of arguments - got (torch.LongTensor), but expected one of:
 * (int value)
      didn't match because some of the arguments have invalid types: (torch.LongTensor)
 * (torch.cuda.LongTensor other)
      didn't match because some of the arguments have invalid types: (torch.LongTensor)

So I just went with the easier path and used Jeremy's notebook from the lesson. I'll try to figure out those errors and see what results I get with your loaders/transforms.

This error means that something is not on the GPU. Might be the data or the model. You might need to run .cuda() on something (likely the model).
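
For example, a minimal sketch of the usual checks in plain PyTorch 0.3-style code (illustrative names, not taken from the notebook):

model = model.cuda()           # make sure the model's parameters live on the GPU
x, y = x.cuda(), y.cuda()      # and that a batch of inputs/targets does too
print(type(y))                 # torch.cuda.LongTensor if it made it to the GPU,
                               # torch.LongTensor if it is still on the CPU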

I am not sure what is causing this error. I am trying to run your notebook as is; as I understand it, I am doing exactly the same things as you, so it should work without modification, I guess. Could it be that the PyTorch version is different? My local fast.ai version is the latest git pull.

this cell from your notebook:

learn = get_learner(wrn_22(), 512)
learn.lr_find(wds=1e-4);
learn.sched.plot(n_skip_end=1)

gives this error:


TypeError                                 Traceback (most recent call last)
in ()
      1 learn = get_learner(wrn_22(), 512)
----> 2 learn.lr_find(wds=1e-4);
      3 learn.sched.plot(n_skip_end=1)

~/fastai/courses/dl2/fastai/learner.py in lr_find(self, start_lr, end_lr, wds, linear, **kwargs)
    328         layer_opt = self.get_layer_opt(start_lr, wds)
    329         self.sched = LR_Finder(layer_opt, len(self.data.trn_dl), end_lr, linear=linear)
--> 330         self.fit_gen(self.model, self.data, layer_opt, 1, **kwargs)
    331         self.load('tmp')
    332

~/fastai/courses/dl2/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, use_clr_beta, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, use_swa, swa_start, swa_eval_freq, **kwargs)
    232             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, fp16=self.fp16,
    233             swa_model=self.swa_model if use_swa else None, swa_start=swa_start,
--> 234             swa_eval_freq=swa_eval_freq, **kwargs)
    235
    236     def get_layer_groups(self): return self.models.get_layer_groups()

~/fastai/courses/dl2/fastai/model.py in fit(model, data, n_epochs, opt, crit, metrics, callbacks, stepper, swa_model, swa_start, swa_eval_freq, **kwargs)
    148
    149         if not all_val:
--> 150             vals = validate(model_stepper, cur_data.val_dl, metrics)
    151             stop=False
    152             for cb in callbacks: stop = stop or cb.on_epoch_end(vals)

~/fastai/courses/dl2/fastai/model.py in <listcomp>(.0)
    209         else: batch_cnts.append(len(x))
    210         loss.append(to_np(l))
--> 211         res.append([f(preds.data, y) for f in metrics])
    212     return [np.average(loss, 0, weights=batch_cnts)] + list(np.average(np.stack(res), 0, weights=batch_cnts))
    213

~/fastai/courses/dl2/fastai/model.py in validate(stepper, dl, metrics)
    209         else: batch_cnts.append(len(x))
    210         loss.append(to_np(l))
--> 211         res.append([f(preds.data, y) for f in metrics])
    212     return [np.average(loss, 0, weights=batch_cnts)] + list(np.average(np.stack(res), 0, weights=batch_cnts))
    213

~/fastai/courses/dl2/fastai/metrics.py in accuracy(preds, targs)
      8 def accuracy(preds, targs):
      9     preds = torch.max(preds, dim=1)[1]
---> 10     return (preds==targs).float().mean()
     11
     12 def accuracy_thresh(thresh):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/tensor.py in eq(self, other)
    358
    359     def eq(self, other):
--> 360         return self.eq(other)
    361
    362     def ne(self, other):

TypeError: eq received an invalid combination of arguments - got (torch.LongTensor), but expected one of:
 * (int value)
      didn't match because some of the arguments have invalid types: (torch.LongTensor)
 * (torch.cuda.LongTensor other)
      didn't match because some of the arguments have invalid types: (torch.LongTensor)

Something is not on the GPU and something else expects it to be there. If you'd like to experiment with this, you could run %debug in a cell of the Jupyter notebook (after you get the error). Pressing 'u' will take you up the stack, and you will be able to check the types of the variables and see what the culprit might be. You can read more about this debugging technique here if you are interested.
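
For example, a minimal sketch of such a session (the types shown are just what I would expect here, not an actual transcript):

%debug

ipdb> u                           # go up one frame, into fastai's accuracy()
ipdb> type(preds), type(targs)    # check where each tensor lives
# e.g. (torch.cuda.LongTensor, torch.LongTensor) -> targs never made it to the GPU
ipdb> q                           # quit the debugger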

I ran this with PyTorch 0.3.1 and my own fork of fastai. I just pushed everything to my github repo, but I don't think this is the problem here - it seems that nothing changed in metrics.py between my fork and fastai master.


@radek, @jeremy, @sgugger

experimenting with cifar-10 dawnbench results:

Guys, please help me to understand and please confirm or reject the following finding. I am pretty sure I’ve missed something, but this is what I see while experimenting on my local computer.

  1. my premise: the solution with torchvision dataloaders is approx. 25% faster than the one with fast.ai dataloaders.
    with fast.ai dataloaders: https://github.com/fastai/fastai/blob/master/courses/dl2/cifar10-dawn.ipynb
    with torchvision loaders:
    - Radek's notebook: https://github.com/radekosmulski/machine_learning_notebooks/blob/master/cifar10_fastai_dawnbench.ipynb
    - fast.ai dawnbench: https://github.com/fastai/imagenet-fast/blob/master/cifar10/cifar10-super-convergence-tuned.ipynb

  2. back story:
    I replicated (re-ran) Jeremy's cifar-10 dawnbench notebook on my local machine. After seeing Radek's results, which were way faster than mine, I started to search for where the difference was coming from. As I understood, the performance of my computer should be pretty close to Radek's.
    I tried Radek's notebook, and got the following error:

    ~/fastai/courses/dl2/fastai/metrics.py in accuracy(preds, targs)
          8 def accuracy(preds, targs):
          9     preds = torch.max(preds, dim=1)[1]
    ---> 10     return (preds==targs).float().mean()

    TypeError: eq received an invalid combination of arguments - got (torch.LongTensor), but expected one of: (int value)

    It was obvious WHAT the error was; my big surprise was WHY this error occurred in the first place!?
    I was thinking that it should not matter whether one used the fast.ai dataloader or the torchvision dataloader.
    From the user's standpoint this should have worked perfectly, because the fast.ai library is always smart enough to know what, when and how when optimizing CPU/GPU stuff.
    Ok, fair enough - getting rid of the error message was easy: let accuracy in fast.ai's metrics.py send the data to the CPU.

    fast.ai/metrics.py, accuracy as it is:

    def accuracy(preds, targs):
        preds = torch.max(preds, dim=1)[1]
        return (preds==targs).float().mean()

    modified:

    def accuracy(preds, targs):
        preds = torch.max(preds, dim=1)[1]
        preds = preds.cpu().numpy()  # <----- added this line
        return (preds==targs).float().mean()

    Now what? It seemed to work perfectly, and it was much faster: now I got results quite close to Radek's.
    Radek was still 16% faster, but I can live with that for the time being :slight_smile: (an alternative, tensor-only variant of this fix is sketched after this list)

  3. Where is the speed difference coming from? fast.ai dataloader vs torchvision dataloader
    A learner with the torchvision dataloader is ~25% faster than a learner with the fast.ai dataloader.
    This is where I'm stuck.

    My thoughts:
    1. The problem arises while validating. For some reason it seems that not all of the data is on the GPU where it should be, or there is some duplication/redundancy in the data transfer around the predict(X)/validation step.
    2. torchvision starts with an explicit .ToTensor() in its transforms: tfms = [transforms.ToTensor(), ...] (see the sketch after this list).
    I wanted to make sure that there is a similar ".ToTensor()" step in the fast.ai pipeline.

    I looked through the whole workflow, but couldn't find the right place to modify that would make it faster.
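
As mentioned above, an alternative, tensor-only variant of the metrics fix would look roughly like this (my own sketch against the old accuracy signature, not something from the library; it keeps the comparison on the GPU instead of pulling predictions back to the CPU):

import torch

def accuracy(preds, targs):
    preds = torch.max(preds, dim=1)[1]
    if preds.is_cuda and not targs.is_cuda:  # torchvision loaders hand back CPU targets
        targs = targs.cuda()
    return (preds == targs).float().mean()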
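
And for reference, the explicit .ToTensor() I am referring to is part of the usual torchvision transform pipeline, roughly like this (a sketch with commonly used CIFAR-10 augmentations and stats, not copied from either notebook):

import torchvision.transforms as transforms

stats = ((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))  # commonly quoted CIFAR-10 mean/std

train_tfms = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),             # PIL image -> FloatTensor scaled to [0, 1]
    transforms.Normalize(*stats),
])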


Yes we’ve been looking into this for the last few days as it happens. When I created fastai, the torchvision/dataloader pipeline was really slow. It’s improved a lot. The underlying methods in fastai and torchvision now seem about the same speed, but there’s some bottleneck that’s holding onto the GIL.

I’m currently leaning towards re-writing the fastai pipeline to leverage the good parts of the torchvision/dataloader pipeline. It shouldn’t change the API much if at all, but may make some things a little easier internally.


Thank you. I added a link to a notebook to show exactly what I did. Now that I have retested everything for the 100th time, I found out that I had misunderstood the Python profiler's 'total time' and 'wall time'. Just to be sure of everything I wrote, I sat with a physical stopwatch at the computer and measured the time. It seems that 'wall time' is the real total time. I have to do some research on how to interpret profiler results.
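
To illustrate the distinction (a minimal sketch, nothing to do with the notebooks themselves): wall time is real elapsed time, while the CPU/'total' time that profilers report only counts time the process actually spends executing on the CPU, so a step that mostly waits (on the GPU, or on dataloader workers) can show a tiny CPU time but a large wall time.

import time

start_wall = time.perf_counter()  # wall-clock (elapsed) time
start_cpu = time.process_time()   # CPU time of this process

time.sleep(2)                     # stands in for waiting on the GPU / data loading

print("wall time: %.2fs" % (time.perf_counter() - start_wall))  # ~2.00s
print("CPU time:  %.2fs" % (time.process_time() - start_cpu))   # ~0.00s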