Developer chat

I’ve just started playing with fast.ai v1, and found some minor performance bugs/issues that may be worth fixing in v2:

When using the Pets dataset, loaded via:
win_workers = defaults.cpus
data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(),
size=299, bs=bs//2, num_workers=win_workers).normalize(imagenet_stats)
I get this result:
%time data.show_batch(rows=2, figsize=(5,5))
Wall time: 27.8 s

If I then put in the following monkey patch (that changes how num_workers is assigned), I get a 30x speedup:

def fixed_one_batch(self, ds_type:DatasetType=DatasetType.Train, detach:bool=True, denorm:bool=True, cpu:bool=True)->Collection[Tensor]:
    "Get one batch from the data loader of `ds_type`. Optionally detach and denorm."
    dl = self.dl(ds_type)
    w = dl.num_workers            # CHANGE: read from and assign to dl explicitly
    dl.num_workers = 0
    try:     x,y = next(iter(dl))
    finally: dl.num_workers = w   # CHANGE: assign to dl explicitly
    if detach: x,y = to_detach(x,cpu=cpu),to_detach(y,cpu=cpu)
    norm = getattr(self,'norm',False)
    if denorm and norm:
        x = self.denorm(x)
        if norm.keywords.get('do_y',False): y = self.denorm(y, do_x=True)
    return x,y

ImageDataBunch.one_batch = fixed_one_batch
%time data.show_batch(rows=2, figsize=(5,5))
Wall time: 673 ms

This was on a Windows machine with an i7-8700k CPU (6 cores, 12 threads). The improvement on Linux will probably not be as great (since the overhead of starting processes is lower on that OS), but I would guess it will still be 10x or so.

Also, I would recommend dropping the default num_workers value in the data bunch constructors from defaults.cpus to (maybe) half that. Starting as many processes as there are hardware cores/threads is highly unlikely to be optimal even with large training sets and long epoch times: a lower value will likely lead to higher throughput. (I found that about 6 workers is optimal when testing across 7700k, 8700k, and i9-9900k CPUs with full-resolution ImageNet data and Titan-class GPUs. The optimal choice depends on dataset size, disk read speed, number of processor cores, GPU, etc., but 4-6 seems a more reasonable generic default than maxing this value out.)
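In the meantime this is easy to work around by passing num_workers explicitly when building the data bunch. A minimal sketch (the cap of 6 is just my rule of thumb from the tests above, not a fastai default):

    # Hypothetical workaround: cap the worker count rather than using defaults.cpus outright
    workers = min(6, defaults.cpus)
    data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(),
                                       size=299, bs=bs//2, num_workers=workers).normalize(imagenet_stats)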

I think the magnitude of the change is a caching issue: my first run is always a high number. That's not to say your change doesn't affect the outcome.

Your changes do alter my time from:

CPU times: user 3.11 s, sys: 793 ms, total: 3.9 s
Wall time: 6.37 s

to:

CPU times: user 3.36 s, sys: 235 ms, total: 3.59 s
Wall time: 907 ms

However, if I run it a few times first, the numbers are:
INITIAL RUN
CPU times: user 3.5 s, sys: 775 ms, total: 4.28 s
Wall time: 6.27 s
Eventually
CPU times: user 196 ms, sys: 221 ms, total: 417 ms
Wall time: 3.42 s
Then applying your changes drops it to:
CPU times: user 3.35 s, sys: 255 ms, total: 3.61 s
Wall time: 909 ms

Alternatively, you can run with your changes from the start:

CPU times: user 5.79 s, sys: 794 ms, total: 6.58 s
Wall time: 3.67 s
With subsequent runs at
CPU times: user 3.53 s, sys: 287 ms, total: 3.82 s
Wall time: 960 ms

So it definitely makes a difference, though I don't think it's the order of magnitude you suggest, but maybe I'm wrong. I am testing on Linux with a 4770k.

The numbers can be taken with a pinch of salt because the variance between runs is very high.

Actually, at least on my systems, caching doesn’t play a big role (though it does contribute some). Rather, I can manually adjust the speed-up multiple quite easily, simply by changing the number of workers. When I look at the process count in task manager, it is apparent that it takes roughly one second for each process to kick in, or perhaps a bit more. So, with the default num_workers value set to defaults.cpus (which is 12 in my case above), it takes a full 12 seconds (at least) before the real work begins (and then the results come relatively quickly, with or without the effects of caching).

If I set num_workers to 6, the overall processing time is about 6 seconds less. Indeed, even setting num_workers to 1 still costs an extra second versus num_workers=0: both cases run on a single core, but the latter doesn't pay the cost of launching a separate process.
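Here's the kind of quick check I'm using (a rough sketch; data is the ImageDataBunch from above, DatasetType and defaults are fastai's, and the exact timings will obviously vary by machine):

    import time

    # Rough timing sketch: how long does fetching a single batch take for each worker count?
    dl = data.dl(DatasetType.Train)
    orig_workers = dl.num_workers
    for n in [0, 1, 6, defaults.cpus]:
        dl.num_workers = n
        start = time.time()
        x, y = next(iter(dl))
        print(f"num_workers={n}: {time.time() - start:.2f}s")
    dl.num_workers = orig_workers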

Note: I’ve also observed this behavior (of ~1 sec overhead per core) on an 8 core / 16 thread system (also running Windows)

In your case, with Linux, the overhead of creating new processes should be a lot lower but still noticeable so the speed difference will be less, but still tied to the number of workers. Also of note is that your processor is (I think) only starting 8 processes (vs 12 in the times I originally quoted), so that will also implicitly lower the speed-up multiple.

I think it is safe to say that whenever profiling involves a human counting seconds for a single (pretty trivial) function call, it’s clear there is room for a lot of improvement!

Note: I'm not sure why this wasn't implemented using threads, but I mainly work in C++, so I'm guessing the choice of processes over threads is tied to a limitation in Python (presumably the GIL, which stops threads from running Python bytecode in parallel). Also, for Windows at least, using Python 3.7 is noticeably faster in this case than Python 3.6, so it seems someone on the Python development team has been looking into this. Perhaps whatever they changed between 3.6 and 3.7 also improved start-up times on Linux.

Hello,
Just wanted to ask whether a Pool with a progress bar would be an appropriate addition to the fastprogress library? I find it useful when dealing with long tasks.

import time
import random
from multiprocessing.pool import Pool
from fastprogress import master_bar, progress_bar

def do_work(a):
    time.sleep(random.random())
    return a ** 2

class BarPool(Pool): 
    def map(self, func, iterable, *args, **kwds):
        result = []
        for value in progress_bar(super().imap(func, iterable, *args, **kwds), len(iterable)):
            result.append(value)
        return result

BarPool(1).map(do_work, range(100))
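Using imap rather than map inside the override means the results stream back one at a time, so the bar advances as each task finishes rather than jumping to 100% at the end.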

Sure, feel free to suggest a PR with it!


(Potentially the wrong room.) I am trying to use QRNN with half precision and getting:

RuntimeError: "forget_mult_cuda_forward" not implemented for 'Half'

Is someone working on this? If not, where (if at all) would be a good place to start?

Thanks!

For model debugging I’ve found it useful to look at the GradCam on random images in addition to top losses. So I added a plot_random_losses method to the ClassificationInterpretation object. Does this sound like a useful addition to the fastai library? If so I can submit a PR.

I created a new method _cl_int_plot_losses, which is essentially the current _cl_int_plot_top_losses with a random boolean for choosing random images and then set _cl_int_plot_top_losses and _cl_int_plot_random_losses to call it:

def _cl_int_plot_top_losses(self, k, largest=True, figsize=(12,12), heatmap:bool=False, heatmap_thresh:int=16,
                            return_fig:bool=None)->Optional[plt.Figure]:
    "Show images in `top_losses` along with their prediction, actual, loss, and probability of actual class."
    return _cl_int_plot_losses(self, k=k, largest=largest, random=None, figsize=figsize, heatmap=heatmap,
                               heatmap_thresh=heatmap_thresh, return_fig=return_fig, seed=None)

def _cl_int_plot_random_losses(self, k, figsize=(12,12), heatmap:bool=False, heatmap_thresh:int=16,
                               return_fig:bool=None, seed:int=None)->Optional[plt.Figure]:
    "Show random images along with their prediction, actual, loss, and probability of actual class."
    return _cl_int_plot_losses(self, k=k, largest=None, random=True, figsize=figsize, heatmap=heatmap,
                               heatmap_thresh=heatmap_thresh, return_fig=return_fig, seed=seed)
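If it goes in, usage would look something like this (a hedged sketch: plot_random_losses is the proposed method above, not yet in the library, and learn is assumed to be a trained vision Learner):

    interp = ClassificationInterpretation.from_learner(learn)
    interp.plot_top_losses(9, figsize=(12,12), heatmap=True)      # existing behaviour
    interp.plot_random_losses(9, figsize=(12,12), heatmap=True)   # proposed: GradCam on random images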

QRNNs don't work in half precision because the CUDA kernels don't support it. Making them work will require changes to the custom CUDA kernels.

Adapting the current custom kernels in fastai may just be a matter of changing AT_DISPATCH_FLOATING_TYPES to AT_DISPATCH_FLOATING_TYPES_AND_HALF in forget_mult_cuda_kernel.cu and bwd_forget_mult_cuda_kernel.cu; I think that's 4 replacements, two in each file, from a quick scan.
Based on that quick scan and my fairly rudimentary C++ knowledge, nothing jumped out that wouldn't work. Everything looked to already be parameterized for 32/64-bit floats, so it might just work with 16-bit floats. Or I might have missed some literals that could introduce type-conversion errors and require an explicit cast, as discussed in this thread. You may also need to add using namespace at; at the top of the file if you get an error about Half not being defined, as noted in that thread.

It also may compile and run fine but not really work with the much more limited range of 16-bit floats.

Eh, I didn’t know that flag existed. Just tried and it worked with our v2 development so it should work in v1 too. Will commit that change this afternoon.

Edit: With some additional testing and making every dimension a multiple of 8, QRNNs end up twice as fast in FP16 as in FP32. It's in master now for anyone who wants to use it.
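For anyone wanting to try it, something like this should work (a rough sketch assuming the v1 text API; data_lm stands in for your language-model DataBunch and the sizes are just illustrative multiples of 8):

    from fastai.text import language_model_learner, AWD_LSTM, awd_lstm_lm_config

    # Sketch: enable QRNN layers via the AWD-LSTM config, keep dimensions multiples of 8,
    # then train in mixed precision. Vocab and batch sizes should also be multiples of 8.
    config = awd_lstm_lm_config.copy()
    config['qrnn'] = True       # use QRNN layers instead of LSTM
    config['emb_sz'] = 400      # multiples of 8 help FP16 tensor cores
    config['n_hid'] = 1152

    learn = language_model_learner(data_lm, AWD_LSTM, config=config).to_fp16()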


Yeah, there's not a lot of good information on that side of PyTorch beyond basic overviews like the C++ extension tutorials. I think I found that macro through completion in VSCode (in a properly configured project; it won't do it without a bit of setup), or otherwise by browsing the source, which is unfortunately about the best way I've found so far to figure stuff out.

Hi,

I just had a frustrating experience, debugging a custom ItemList.
My label class inherits from ImageList, with a custom open() method.

It turns out there was a typo in this method, but the exception was swallowed in basic_data.DataBunch.sanity_check(), which just gives a warning with no reference to the cause of the problem.

The perpetrator is the try/except failure counter in sanity_check(), which builds a list of failed indices and then prints it. Wouldn't it make sense to also collect and print the exceptions, to make debugging easier? Something like

samples,fails,exceptions = [],[],[]
for i in idx:
    try: samples.append(self.train_dl.dataset[i])
    except Exception as e:
        fails.append(i)
        exceptions.append(e)

and then print something like

UserWarning: There seems to be something wrong with your dataset, for example, in the first batch can't access any element of self.train_ds.
Tried: 5288,4778,6757,1324,5592...
Caught the following exceptions:
  5288: name 'n' is not defined
  4778: <exception for idx 4778>
  ...
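A minimal sketch of how the warning could be assembled (the message text and variable names are illustrative, not the exact fastai code; fails and exceptions come from the loop above):

    import warnings

    if len(fails) > 0:
        msg = ("There seems to be something wrong with your dataset, for example, in the first batch "
               "can't access any element of self.train_ds.\n"
               f"Tried: {','.join(map(str, fails[:5]))}...\n"
               "Caught the following exceptions:\n")
        msg += "\n".join(f"  {i}: {e}" for i, e in zip(fails, exceptions))
        warnings.warn(msg)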
- Soren

Love this (FWIW), but I would call set on the exceptions before you print :slight_smile:

Hi everyone! I'm suggesting a modification to the TabularDataBunch class. Right now, writing
TabularDataBunch.from_df(...)
is simpler than writing something like
data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
        .add_test(TabularList.from_df(test_df, path=path, cat_names=cat_vars, cont_names=cont_vars))
        .databunch())
(copied from this notebook: https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson6-rossmann.ipynb)

I realized that with the easier way there is no way to say that your targets are continuous (label_cls=FloatList). I don't understand why these aren't parameters of the TabularDataBunch.from_df() function. I'm also suggesting that there could be some automatic way to recognize whether the target is continuous. I can make a pull request for this, but I'm not sure how the parameters should be passed (via kwargs or as normal parameters), so I would appreciate hearing your thoughts on this.

I think the best way to automatically recognize whether a target is continuous is to follow the rules below:
if text -> categorical
if number:
    if more than 20 different values -> continuous
    else -> categorical
What do you think about this? I still believe there should be a way to set this manually, but right now the default is to always treat the target as categorical, which is much worse than sometimes labelling it as continuous, since the rules above should handle all but a few rare edge cases.
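In code, the heuristic could look something like this (the guess_label_cls name and the 20-value cutoff are just my illustration, not anything in fastai):

    import pandas as pd
    from fastai.data_block import CategoryList, FloatList

    def guess_label_cls(df: pd.DataFrame, dep_var: str, max_card: int = 20):
        "Guess whether the target column should be treated as categorical or continuous."
        col = df[dep_var]
        if not pd.api.types.is_numeric_dtype(col):
            return CategoryList                # text -> categorical
        if col.nunique() > max_card:
            return FloatList                   # many distinct numbers -> continuous
        return CategoryList                    # few distinct numbers -> categorical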

Those aren't parameters in any factory method of DataBunch. The factory methods are for beginners only, and people should use the data block API as soon as they are more familiar with the library, as it's the API that provides total flexibility.

That rule doesn't really work, since you can have just one or two continuous variables, or a lot of labels in a multi-label problem. The current approach, which is documented here, makes more sense IMO. And again, always use the data block API :wink:

Thanks for the clarification. Maybe I'll start using the data block API then :smiley:

I just noticed that if I don't define cont_names in the learner and then run learn.model, the first BatchNorm1d has 0 inputs, although it should have the number of cont_names, as Jeremy said here: https://youtu.be/hkBa9pU-H48?t=2522

I thought that if you don't pass cont_names it would automatically recognize that all of the other columns are continuous, so is this some kind of bug?

I am looking at interpretation for multi-label data and think there might be an error in ClassificationInterpretation.plot_multi_top_losses: it looks like it expects self.pred_class to contain raw outputs for each class, but self.pred_class = self.preds.argmax(dim=1) in ClassificationInterpretation.__init__. I'll look to fix it and clean the method up a bit, but I had a couple of questions.

One thing I am struggling to understand is the various code around the number of dimensions of the top losses. Several places check this, but I can't see how it can be anything other than 1D, since it is generated by torch.topk (in Interpretation.top_losses, so it could be overridden). Am I missing something? Perhaps a non-vision application of this function where top_losses is different? Or is it only meant to work with vision, where top_losses will always be 1D, so that logic can be removed?

I could just make it use the raw predictions from Interpretation.preds, but it seems like the better fix might be to reinstate MultiLabelClassificationInterpretation, which currently raises NotImplementedError in __init__. I was also going to look at displaying the distribution of the activation for each class for true/false instances to evaluate thresholds, and might also work on GradCAM display for multi-class. So there seem to be a few potential things to justify a separate MultiLabelClassificationInterpretation.
The various methods in ClassificationInterpretation all depend on a confusion matrix, which gives me an error for multi-label, so I don't think any working features are lost (I might look at fixing this later, in which case I guess you might make MultiLabelClassificationInterpretation inherit from ClassificationInterpretation rather than Interpretation, as it currently does).
Does that seem reasonable?


MultiLabelClassificationInterpretation is not working with my multi-label classifier. Is it supposed to work, or have I made an error?
Thank you.

interp=MultiLabelClassificationInterpretation.from_learner(learn2)

NotImplementedError                       Traceback (most recent call last)
<ipython-input> in <module>
----> 1 interp=MultiLabelClassificationInterpretation.from_learner(learn2)

/opt/anaconda3/lib/python3.7/site-packages/fastai/train.py in from_learner(cls, learn, ds_type, activ)
    149         "Gets preds, y_true, losses to construct base class from a learner"
    150         preds_res = learn.get_preds(ds_type=ds_type, activ=activ, with_loss=True)
--> 151         return cls(learn, *preds_res)
    152
    153     def top_losses(self, k:int=None, largest=True):

/opt/anaconda3/lib/python3.7/site-packages/fastai/train.py in __init__(self, learn, preds, y_true, losses, ds_type, sigmoid, thresh)
    222     def __init__(self, learn:Learner, preds:Tensor, y_true:Tensor, losses:Tensor, ds_type:DatasetType=DatasetType.Valid,
    223                  sigmoid:bool=True, thresh:float=0.3):
--> 224         raise NotImplementedError
    225         super(MultiLabelClassificationInterpretation, self).__init__(learn,preds,y_true,losses,ds_type)
    226         self.pred_class = self.preds.sigmoid(dim=1)>thresh if sigmoid else self.preds>thresh

NotImplementedError:

I’ve just removed typing from the pip and conda deps, since it was causing some compatibility issues and doesn’t actually do anything in >=py3.4 .

I’ve also tried setting up conda-forge and miniconda using the conda-forge docs, and found everything worked nicely. This might be a better default install option going forwards. Let me know if you try it out!