Developer chat

Hey folks, I’d like the ability to pass additional callbacks (e.g. Teacher Forcing) in the learner.lr_find() function. Currently there isn’t a good way of going about this unless I’m missing something obvious.

It should be as simple as adding an additional callbacks argument which gets added onto the callbacks list and eventually passed to the fit function.

def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100,
            stop_div:bool=True, wd:float=None, callbacks:Optional[CallbackList]=None):
    ...
    cb = [] if callbacks is None else callbacks
    cb.insert(0,LRFinder(learn, start_lr, end_lr, num_it, stop_div))
    ...
    learn.fit(epochs, start_lr, callbacks=cb, wd=wd)
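
Hypothetical usage once such an argument exists (TeacherForcing here just stands in for whatever custom callback you want active during the LR finder):

learn.lr_find(callbacks=[TeacherForcing(learn)])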

Is this something I should open a PR for, or is this a bad idea?
Thanks for a great library and course!!

You can add a PR for this. The use case never came up before since callbacks are usually put in the Learner at init.

Ah, yes, of course. It makes much more sense to pass callbacks to the Learner initialization once rather than litter several function calls with the same callbacks! Thanks for the wisdom!
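
For reference, registering a callback once at Learner init looks roughly like this (a sketch assuming fastai v1; PrintEpoch is just a made-up example callback):

from fastai.vision import *   # assuming a fastai v1 setup where `data` is an ImageDataBunch

class PrintEpoch(LearnerCallback):                # made-up example callback
    def on_epoch_end(self, epoch:int, **kwargs):
        print(f"finished epoch {epoch}")

learn = cnn_learner(data, models.resnet34, callback_fns=[PrintEpoch])
learn.lr_find()   # the callback is picked up here as well as in fit()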

I’ll hold off on the PR unless some other valid use case comes up…

@sgugger would it be helpful to have Grad-CAM as a separate function? Right now it is only available through plot_top_losses, so we cannot get Grad-CAM output for correctly classified images, which could be helpful for model interpretation. It might be useful as an optional output of get_preds or predict.

Thoughts?

I have no objection to it being in a separate function if you want to work on a PR that does that.

Sorry! I am not sure if this is the right place to post this. But there is a bug in the fastai pip and anaconda package:

Yes, it has been fixed in master and will be in the next release.

Are there any plans to add an object detection model/learner (similar to how there is a UnetLearner) to the library in the near term? Just wondering if I should hold out a little longer, or figure out how to use the object detection models in the new version of torchvision in conjunction with the fastai object detection data loader and transforms.

This is more midterm than short-term, as we are fully focused on v2 for now.

Hi everybody. Question regarding fast.ai and render.com

  1. I trained a model using crestle
  2. I exported the model to .pkl
  3. I put the .pkl file on dropbox
  4. I have a server load the pkl as per the instructions
  5. When I try to build the Dockerfile, I get this error:

/usr/local/lib/python3.7/site-packages/torch/serialization.py:454: SourceChangeWarning: source code of class 'torch.nn.modules.loss.CrossEntropyLoss' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.7/site-packages/torch/serialization.py:454: SourceChangeWarning: source code of class 'torch.nn.modules.conv.Conv2d' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.7/site-packages/torch/serialization.py:454: SourceChangeWarning: source code of class 'torch.nn.modules.batchnorm.BatchNorm2d' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.7/site-packages/torch/serialization.py:454: SourceChangeWarning: source code of class 'torch.nn.modules.activation.ReLU' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)

I have tried clicking export twice in crestle, as someone suggested ‘re-exporting.’ Did I misinterpret what ‘re-exporting’ means? Any ideas what is happening? My requirements.txt file looks like this:

aiofiles==0.4.0
aiohttp==3.5.4
asyncio==3.4.3
fastai==1.0.52
https://download.pytorch.org/whl/cpu/torch-1.1.0-cp37-cp37m-linux_x86_64.whl
https://download.pytorch.org/whl/cpu/torchvision-0.3.0-cp37-cp37m-linux_x86_64.whl
numpy==1.16.3
starlette==0.12.0
uvicorn==0.7.1
python-multipart==0.0.5
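
Side note on those messages: SourceChangeWarning comes from torch.load noticing that the source of the listed classes differs from the version the model was pickled with (usually a torch/fastai version mismatch between training and serving); they are warnings rather than hard errors. If they turn out to be harmless for the exported model, a sketch like the following silences them (it does not fix an actual version mismatch):

import warnings
from torch.serialization import SourceChangeWarning

# Suppress the (usually benign) source-change warnings emitted by torch.load
warnings.filterwarnings("ignore", category=SourceChangeWarning)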

I’ve just started playing with fast.ai v1, and found some minor performance bugs/issues that may be worth fixing in v2:

When using the Pets dataset, loaded via:

win_workers = defaults.cpus
data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(),
                                   size=299, bs=bs//2, num_workers=win_workers).normalize(imagenet_stats)

I get this result:

%time data.show_batch(rows=2, figsize=(5,5))
Wall time: 27.8 s

If I then put in the following monkey patch (that changes how num_workers is assigned), I get a 30x speedup:

def fixed_one_batch(self, ds_type:DatasetType=DatasetType.Train, detach:bool=True, denorm:bool=True, cpu:bool=True)->Collection[Tensor]:
    "Get one batch from the data loader of ds_type. Optionally detach and denorm."
    dl = self.dl(ds_type)
    w = dl.num_workers       # CHANGE: read from and assign to dl explicitly
    dl.num_workers = 0
    try: x,y = next(iter(dl))
    finally: dl.num_workers = w   # CHANGE: assign to dl explicitly
    if detach: x,y = to_detach(x,cpu=cpu),to_detach(y,cpu=cpu)
    norm = getattr(self,'norm',False)
    if denorm and norm:
        x = self.denorm(x)
        if norm.keywords.get('do_y',False): y = self.denorm(y, do_x=True)
    return x,y

ImageDataBunch.one_batch = fixed_one_batch
%time data.show_batch(rows=2, figsize=(5,5))
Wall time: 673 ms

This was on a Windows machine with an i7-8700k CPU (6 cores, 12 threads). The improvement on Linux will probably not be as great (since the overhead of starting processes is lower on that OS), but I would guess it will still be 10x or so.

Also, I would recommend that the default num_workers value in the data bunch constructors be dropped from defaults.cpus to (maybe) half that. Starting as many processes as the total number of available hardware cores/threads is highly unlikely to be optimal even with large training sets and long epoch times: a lower value will likely lead to higher throughput. (I found about 6 workers to be optimal when testing across 7700k, 8700k, and i9-9900k CPUs, using full-res ImageNet data and Titan-class GPUs. The optimal choice will depend on dataset size, disk read speed, number of processor cores, GPU, etc., but 4-6 seems a more reasonable generic default than maxing this value out.)
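
Concretely, the kind of default I have in mind (a sketch reusing the names from my snippet above; the cap of 6 is just the value that worked well in my tests):

num_workers = min(6, defaults.cpus)   # cap the worker count instead of using every logical core
data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(),
                                   size=299, bs=bs//2,
                                   num_workers=num_workers).normalize(imagenet_stats)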

I think the magnitude of the change is down to a caching issue. My first run is always a high number; that's not to say your change doesn't affect the outcome.

Your changes do alter my time from:

CPU times: user 3.11 s, sys: 793 ms, total: 3.9 s
Wall time: 6.37 s

to:

CPU times: user 3.36 s, sys: 235 ms, total: 3.59 s
Wall time: 907 ms

However, if I run it a few times first, the numbers are:

Initial run:
CPU times: user 3.5 s, sys: 775 ms, total: 4.28 s
Wall time: 6.27 s

Eventually:
CPU times: user 196 ms, sys: 221 ms, total: 417 ms
Wall time: 3.42 s

Then performing your changes drops it to:
CPU times: user 3.35 s, sys: 255 ms, total: 3.61 s
Wall time: 909 ms

Alternatively, you can run your changes from the start:

CPU times: user 5.79 s, sys: 794 ms, total: 6.58 s
Wall time: 3.67 s

with subsequent runs at:

CPU times: user 3.53 s, sys: 287 ms, total: 3.82 s
Wall time: 960 ms

So it definitely makes a difference. I don't think it's of the order of magnitude you think, but maybe I'm wrong; I am testing on a Linux 4770k.

The numbers can be taken with a pinch of salt because the variance between runs is very high.

Actually, at least on my systems, caching doesn’t play a big role (though it does contribute some). Rather, I can manually adjust the speed-up multiple quite easily, simply by changing the number of workers. When I look at the process count in task manager, it is apparent that it takes roughly one second for each process to kick in, or perhaps a bit more. So, with the default num_workers value set to defaults.cpus (which is 12 in my case above), it takes a full 12 seconds (at least) before the real work begins (and then the results come relatively quickly, with or without the effects of caching).

If I set num_workers to 6, the overall processing time is about 6 seconds less. Indeed, even if I set num_workers to 1, there is still an extra second involved vs. setting num_workers=0: both cases clearly run on a single core, but the second doesn't involve the expense of launching a separate process.
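
A quick way to reproduce this per-worker cost, reusing the same variables as in my earlier snippet and rebuilding the DataBunch with different worker counts (a rough sketch; absolute times will vary by machine):

import time

for w in (0, 1, 6, 12):
    data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(),
                                       size=299, bs=bs//2,
                                       num_workers=w).normalize(imagenet_stats)
    start = time.perf_counter()
    data.one_batch()   # the first batch pays the worker start-up cost
    print(f"num_workers={w}: {time.perf_counter() - start:.1f}s")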

Note: I’ve also observed this behavior (of ~1 sec overhead per core) on an 8 core / 16 thread system (also running Windows)

In your case, with Linux, the overhead of creating new processes should be a lot lower but still noticeable so the speed difference will be less, but still tied to the number of workers. Also of note is that your processor is (I think) only starting 8 processes (vs 12 in the times I originally quoted), so that will also implicitly lower the speed-up multiple.

I think it is safe to say that whenever profiling involves a human counting seconds for a single (pretty trivial) function call, it’s clear there is room for a lot of improvement!

Note: I’m not sure why this wasn’t implemented using threads, but I mainly work in C++ so I’m guessing the choice of process over thread is somehow tied to a limitation in python. Also, for Windows at least, using python 3.7 is noticeably faster in this case than using python 3.6, so it seems someone on the python development team has been looking into this. Perhaps whatever they changed from 3.6 to 3.7 also improved the start up times on Linux.

Hello,
Just wanted to ask if a Pool with a progress bar would be an appropriate addition to the fastprogress library? I find it useful when dealing with long tasks.

import time
import random
from multiprocessing.pool import Pool
from fastprogress import master_bar, progress_bar

def do_work(a):
    time.sleep(random.random())
    return a ** 2

class BarPool(Pool): 
    def map(self, func, iterable, *args, **kwds):
        result = []
        for value in progress_bar(super().imap(func, iterable, *args, **kwds), len(iterable)):
            result.append(value)
        return result

BarPool(1).map(do_work, range(100))

Sure, feel free to suggest a PR with it!

(potentially wrong room). I am trying to use QRNN with half-precision and getting

RuntimeError: "forget_mult_cuda_forward" not implemented for 'Half'

Is someone working on this? If not, where (if at all) would be a good place to start?

Thanks!

For model debugging I’ve found it useful to look at the GradCam on random images in addition to top losses. So I added a plot_random_losses method to the ClassificationInterpretation object. Does this sound like a useful addition to the fastai library? If so I can submit a PR.

I created a new method _cl_int_plot_losses, which is essentially the current _cl_int_plot_top_losses with a random boolean for choosing random images and then set _cl_int_plot_top_losses and _cl_int_plot_random_losses to call it:

def _cl_int_plot_top_losses(self, k, largest=True, figsize=(12,12), heatmap:bool=False, heatmap_thresh:int=16,
                            return_fig:bool=None)->Optional[plt.Figure]:
    "Show images in `top_losses` along with their prediction, actual, loss, and probability of actual class."
    return _cl_int_plot_losses(self, k=k, largest=largest, random=None, figsize=figsize, heatmap=heatmap,
                               heatmap_thresh=heatmap_thresh, return_fig=return_fig, seed=None)

def _cl_int_plot_random_losses(self, k, figsize=(12,12), heatmap:bool=False, heatmap_thresh:int=16,
                               return_fig:bool=None, seed:int=None)->Optional[plt.Figure]:
    "Show random images along with their prediction, actual, loss, and probability of actual class."
    return _cl_int_plot_losses(self, k=k, largest=None, random=True, figsize=figsize, heatmap=heatmap,
                               heatmap_thresh=heatmap_thresh, return_fig=return_fig, seed=seed)
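
Hypothetical usage of the proposed method (plot_random_losses doesn't exist in the library yet, so this is just a sketch of the intended API):

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_random_losses(9, heatmap=True, seed=42)   # random sample instead of the top losses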

QRNNs don't work in half-precision as the CUDA kernels don't support half-precision. Making them work will require custom CUDA kernels.

Adapting the current custom kernels in fastai may just be a matter of changing AT_DISPATCH_FLOATING_TYPES to AT_DISPATCH_FLOATING_TYPES_AND_HALF in forget_mult_cuda_kernel.cu and bwd_forget_mult_cuda_kernel.cu - I think 4 replacements, two in each file from a quick scan.
Based on a quick scan and my fairly rudimentary C++ knowledge, nothing jumped out that wouldn't work. Everything looked to already be parameterized for 32/64-bit floats, so it might just work with 16-bit floats. Or I might have missed some literals that could introduce type-conversion errors and require an explicit cast, as discussed in this thread. You may also need to add using namespace at; at the top of the file if you get an error about Half not being defined, as noted in that thread.

It also may compile and run fine but not really work with the much more limited range of 16-bit floats.

Eh, I didn’t know that flag existed. Just tried and it worked with our v2 development so it should work in v1 too. Will commit that change this afternoon.

Edit: With some additional testing and putting multiples of 8 for every dimension, QRNNs end up twice as fast in FP16 as in FP32. It's in master now for anyone who wants to use it.
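
For anyone who wants to try it, a minimal sketch of a QRNN language model trained in mixed precision (assuming fastai v1's text API; data_lm stands in for a TextLMDataBunch you have already built):

from fastai.text import *   # assuming fastai v1

config = awd_lstm_lm_config.copy()
config['qrnn'] = True                        # swap the LSTM layers for QRNNs
learn = language_model_learner(data_lm, AWD_LSTM, config=config,
                               pretrained=False).to_fp16()
learn.fit_one_cycle(1, 1e-2)                 # keep batch size etc. in multiples of 8 for the speed-up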
