Developer chat

Ah, good to know. That one’s an easy fix. Done now. (It was a test that can be made deterministic without any problems.)


I just pushed a new release (1.0.6) to pypi (pip) and anaconda (conda). Thanks to @stas for making it more painless than I’ve ever seen before 🙂 The main difference in this release is the use of ImageDataBunch factory methods instead of the functions we had before. (Ditto for each application: text, collab, tabular.) Also, a NaN loss bug was worked around.


From now on, if you fix a significant bug that existed in the last released version, or add a feature, or change the documented behaviour of something, please update CHANGES.md immediately when you push.

When we release a new version, we’ll simply update the version number in CHANGES.md and create a new section above it.

It should be Tuple[Image, Tensor] now, you’re right. Will fix this.

Busy morning here!

  • added ImagePoints class to apply data augmentation to a flow of points
  • changed ImageBBox to be more efficient and use points instead of a mask behind the scenes
  • renamed ImageMask to ImageSegment, because we felt the name “mask” made people think it was binary only

Docs have been updated accordingly.


I started working on gpu memory utils.

My first need is to make sure I have 8GB of free GPU RAM inside fastai_docs/docs_src/run_tests.sh, since it will fail with less than that. I don’t want to waste time/resources, so I want the script to tell me from the get-go if it can tell it’s not going to succeed.

Should these go into fastai/utils/mem.py?

The functions I wrote so far deliberately don’t tap into pytorch’s memory stats, because I need them for new processes: if some memory is cached somewhere by an idle process, that approach isn’t going to work. I need to know the exact available memory not used by pytorch at all.

It’s a first draft, so your input on naming and in/out args is very welcome.

We will probably have a different set of util functions that measure the memory of the currently running process via pytorch, so those two sets should have distinct naming.

from enum import IntEnum
import subprocess
import numpy as np

Memory = IntEnum('Memory', "USED, FREE, TOTAL", start=0)

# returns a list of mem stats for each gpu:
# [ [used-0, free-0, total-0], [used-1, free-1, total-1] ]
# this function assumes nvidia-smi works and will return [] if this is not the case
def get_gpu_mem():
    "query nvidia-smi for used, free and total memory for each available gpu"
    mem = []
    try:
        cmd = "nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv,nounits,noheader"
        result = subprocess.run(cmd.split(), shell=False, check=False, stdout=subprocess.PIPE)
    except Exception: pass
    else:
        if result.returncode == 0 and result.stdout:
            output = result.stdout.decode('utf-8')
            mem = [[int(y) for y in x.split(', ')] for x in output.strip().split('\n')]
    return mem

# return the id of the gpu with the most free memory, plus that free memory
# as (id, free); returns [] if no gpus were found
def get_gpu_with_max_free_mem():
    mem = np.array(get_gpu_mem())
    if not len(mem): return []
    id = int(np.argmax(mem[:,Memory.FREE]))
    return (id, mem[id,Memory.FREE])

I temporarily put them inside collect_env.py, so a test run on a single gpu box gives:

python -c "import fastai; print(fastai.utils.collect_env.get_gpu_mem(), fastai.utils.collect_env.get_gpu_with_max_free_mem())"
[[495, 7624, 8119]] (0, 7624)
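
As for the second, pytorch-based set mentioned above, a minimal sketch of what it could look like (get_gpu_mem_this_process is an illustrative name, not part of the draft):

import torch

def get_gpu_mem_this_process(gpu_id=0):
    "memory allocated and cached by pytorch in the current process, in MBs"
    if not torch.cuda.is_available(): return []
    return [int(torch.cuda.memory_allocated(gpu_id) / 2**20),
            int(torch.cuda.memory_cached(gpu_id) / 2**20)]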

And a related question: I remember seeing custom gpu ids hardcoded all over the course notebooks.

How do we avoid that?

For example, how do we instrument fastai_docs/docs_src/run_tests.sh to run with a non-default device=0 for those who have multiple gpus? The idea of the code in the post above is to do that automatically. Perhaps the same idea can be used for notebooks as well: e.g. the notebook could declare how much RAM it needs (a magic function to calculate that, perhaps?) and then dynamically pick the GPU that matches its needs.
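
For illustration, here is a minimal sketch of that dynamic selection, building on get_gpu_with_max_free_mem() above (set_gpu_by_need and the 8GB figure are hypothetical):

import os

def set_gpu_by_need(needed_mb):
    "pick the gpu with the most free memory, if it satisfies needed_mb"
    res = get_gpu_with_max_free_mem()
    if not res: raise RuntimeError("no gpu found (is nvidia-smi available?)")
    gpu_id, free_mb = res
    if free_mb < needed_mb:
        raise RuntimeError(f"need {needed_mb}MB free GPU RAM, max available is {free_mb}MB")
    # must be set before cuda is initialized to have any effect
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)

set_gpu_by_need(8192)  # e.g. a notebook that declares it needs ~8GB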

Finally, there is GPUtil, which is a wrapper around nvidia-smi (https://github.com/anderskm/gputil), which we could use too, but I already saw it throwing an exception when there was no nvidia-smi available, so I’m not sure it isn’t easiest to just do it ourselves. Perhaps we could do a wrapper around this wrapper to fail gracefully, or perhaps it’s ok if it fails when there is no way to run nvidia-smi…

But it already has a very rich API, e.g.:

deviceID = GPUtil.getFirstAvailable(order='first', maxLoad=0.5, maxMemory=0.5, attempts=1, interval=900, verbose=False)
deviceIDs = GPUtil.getAvailable(order='first', limit=1, maxLoad=0.5, maxMemory=0.5, ignoreNan=False, excludeID=[], excludeUUID=[])
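
And a fail-gracefully wrapper around this wrapper, as floated above, could be as simple as (hypothetical sketch):

def get_first_available_gpu_safe(**kwargs):
    "like GPUtil.getFirstAvailable, but returns None instead of raising"
    try:
        import GPUtil
        return GPUtil.getFirstAvailable(**kwargs)
    except Exception:
        return None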

Were there changes to datasets recently? The notebook /fastai/examples/text.ipynb breaks at:
learn = RNNLearner.language_model(data_lm, pretrained_fnames=['lstm_wt103', 'itos_wt103'])

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-17-8a8a4cd68441> in <module>()
----> 1 learn = RNNLearner.language_model(data_lm, pretrained_fnames=['lstm_wt103', 'itos_wt103'])

~/Code/fastai/fastai/text/learner.py in language_model(cls, data, bptt, emb_sz, nh, nl, pad_token, drop_mult, tie_weights, bias, qrnn, pretrained_fnames, **kwargs)
     79         learn = cls(data, model, bptt, split_func=lm_split, **kwargs)
     80         if pretrained_fnames is not None:
---> 81             learn.load_pretrained(*pretrained_fnames)
     82             learn.freeze()
     83         return learn

~/Code/fastai/fastai/text/learner.py in load_pretrained(self, wgts_fname, itos_fname)
     64         old_itos = pickle.load(open(self.path/self.model_dir/f'{itos_fname}.pkl', 'rb'))
     65         old_stoi = {v:k for k,v in enumerate(old_itos)}
---> 66         wgts = torch.load(self.path/self.model_dir/f'{wgts_fname}.pth', map_location=lambda storage, loc: storage)
     67         wgts = convert_weights(wgts, old_stoi, self.data.train_ds.vocab.itos)
     68         self.model.load_state_dict(wgts)

~/anaconda3/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module)
    356         f = open(f, 'rb')
    357     try:
--> 358         return _load(f, map_location, pickle_module)
    359     finally:
    360         if new_fd:

~/anaconda3/lib/python3.7/site-packages/torch/serialization.py in _load(f, map_location, pickle_module)
    534     for key in deserialized_storage_keys:
    535         assert key in deserialized_objects
--> 536         deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
    537         offset = None
    538 

RuntimeError: unexpected EOF, expected 12274717 more bytes. The file might be corrupted.

Weird… We didn’t change anything near that and the notebook runs fine on my side. Could you try to redownload the model?

My bad. I deleted it and downloaded it again, and it worked. I’m sorry, I didn’t see that I had sent the message.

So now the learner works. But I was running it to show another error.

Running that notebook, after I get the learner, I run:
learn.lr_find()
and get:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in ()
----> 1 learn.lr_find()

~/Code/fastai/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, **kwargs)
     24     cb = LRFinder(learn, start_lr, end_lr, num_it)
     25     a = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 26     learn.fit(a, start_lr, callbacks=[cb], **kwargs)
     27 
     28 def to_fp16(learn:Learner, loss_scale:float=512., flat_master:bool=False)->Learner:

~/Code/fastai/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    136         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    137         fit(epochs, self.model, self.loss_fn, opt=self.opt, data=self.data, metrics=self.metrics,
--> 138             callbacks=self.callbacks+callbacks)
    139 
    140     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/Code/fastai/fastai/basic_train.py in fit(epochs, model, loss_fn, opt, data, callbacks, metrics)
     89     except Exception as e:
     90         exception = e
---> 91         raise e
     92     finally: cb_handler.on_train_end(exception)
     93 

~/Code/fastai/fastai/basic_train.py in fit(epochs, model, loss_fn, opt, data, callbacks, metrics)
     79             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     80                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 81                 loss = loss_batch(model, xb, yb, loss_fn, opt, cb_handler)[0]
     82                 if cb_handler.on_batch_end(loss): break
     83 

~/Code/fastai/fastai/basic_train.py in loss_batch(model, xb, yb, loss_fn, opt, cb_handler, metrics)
     14                metrics:OptMetrics=None)->Tuple[Union[Tensor,int,float,str]]:
     15     "Calculate loss and metrics for a batch, call out to callbacks as necessary."
---> 16     cb_handler = ifnone(CallbackHandler([]))
     17     if not is_listy(xb): xb = [xb]
     18     if not is_listy(yb): yb = [yb]

TypeError: ifnone() missing 1 required positional argument: 'b'

The same error happens when I try learn.fit(1)

Pull the latest, that was a little bug I left while working this morning. Sorry about that.

No problem. When I git-blamed the file, it showed “Ubuntu”; you may have to set your name in your git config.

Weird, on GitHub it shows properly.

¯\_(ツ)_/¯

There is an error in train.py#fit_one_cycle: it is overwriting the callbacks. So if I do:
learn.fit(epochs=2, lr=5e-2, callbacks=[TerminateOnNaN()])
it works and calls TerminateOnNaN. But if I use
learn.fit_one_cycle(cyc_len=20, max_lr=5e-2, div_factor=20, callbacks=[TerminateOnNaN()])
it doesn’t work.
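
Presumably the fix is for fit_one_cycle to append its scheduler to the caller’s list instead of replacing it, something like this (a rough sketch with a simplified signature, based on the names in the tracebacks above):

def fit_one_cycle(learn, cyc_len, max_lr, callbacks=None, **kwargs):
    callbacks = listify(callbacks)                      # keep the user's callbacks
    callbacks.append(OneCycleScheduler(learn, max_lr))  # append ours instead of overwriting
    learn.fit(cyc_len, max_lr, callbacks=callbacks, **kwargs)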


Should be fixed in my last commit. Thanks for flagging this!


I have a code style question:
I see a lot of use of the ifnone function lately (38 usages in 14 files). Is it for… brevity? Or does it have another use? It seems that, tired of writing b if a is None else a, we created ifnone. But there was already a way of doing the same without creating a new function: a or b.

For example:
cb_handler = ifnone(cb_handler, CallbackHandler([]))
could be just:
cb_handler = cb_handler or CallbackHandler([])

Every programmer has her/his own way, but to me the second option seems clearer.
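
One difference worth noting, though: a or b falls back on any falsy a, not just None, so the two aren’t interchangeable in general. A quick sketch, defining ifnone as described above:

ifnone = lambda a,b: b if a is None else a

ifnone(0, 42)    # -> 0   (0 is kept: it is not None)
0 or 42          # -> 42  (0 is falsy, so it is replaced)
ifnone([], [1])  # -> []  (an empty list is preserved)
[] or [1]        # -> [1]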
