Platform: Kaggle Kernels

Sanyam,

It is probably better that Kernels do not jump immediately to the latest fastai version, especially when a lot of changes are being made, as at the start of each new class.

Is it possible to find out somewhere which version Kaggle Kernels and/or Google Colab are using? Is it part of some Docker config file somewhere, perhaps?

Thanks,

Yuri

I am getting this error when I try to run the lesson 3 code:


Anyone get something similar?

@piaoya
Hi Sanyam,

I believe something has changed after Kaggle updated their P100 kernels or the fastai library (just checked, it is now on 1.0.46).

Now, even if you set model_dir to a writable directory,
learn.lr_find() will crash.

I traced it into the fastai library; it seems that during training the purge() method calls torch.save(), and it ends up trying to save to ../input/purge-tmp.pkl.

I am still looking for the root cause of this issue, but I guess it is the self.path/self.model_dir part: the temp file ends up under self.path, which is ../input in Kaggle kernels. (Most people load data from ../input, so you have self.path = '../input' and self.model_dir = '/kaggle/model', but the temp file still ends up under ../input.)

But yup, creating the learner with model_dir set to a writable directory no longer works…
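
For what it's worth, a quick pathlib check (just an illustration, using the paths from this thread) suggests the join itself may not be the culprit: joining with an absolute model_dir keeps the absolute path, while the failing path matches self.path joined directly with the temp filename, i.e. the temp file seems to be written under self.path rather than under model_dir.

from pathlib import Path

print(Path('../input') / '/kaggle/model')    # /kaggle/model - an absolute right-hand side wins
print(Path('../input') / 'purge-tmp.pkl')    # ../input/purge-tmp.pkl - matches the OSError in the trace below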

Currently I use piaoya's way of moving the data around…

Here is the trace:

OSError                                   Traceback (most recent call last)
<ipython-input-32-d81c6bd29d71> in <module>()
----> 1 learn.lr_find()

/opt/conda/lib/python3.6/site-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, stop_div, wd)
     30     cb = LRFinder(learn, start_lr, end_lr, num_it, stop_div)
     31     epochs = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 32     learn.fit(epochs, start_lr, callbacks=[cb], wd=wd)
     33 
     34 def to_fp16(learn:Learner, loss_scale:float=None, max_noskip:int=1000, dynamic:bool=False, clip:float=None,

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    180         if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
    181         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 182             callbacks=self.callbacks+callbacks)
    183 
    184     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/opt/conda/lib/python3.6/site-packages/fastai/utils/mem.py in wrapper(*args, **kwargs)
     87 
     88         try:
---> 89             return func(*args, **kwargs)
     90         except Exception as e:
     91             if ("CUDA out of memory" in str(e) or

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
    101         exception = e
    102         raise
--> 103     finally: cb_handler.on_train_end(exception)
    104 
    105 loss_func_name2activ = {'cross_entropy_loss': F.softmax, 'nll_loss': torch.exp, 'poisson_nll_loss': torch.exp,

/opt/conda/lib/python3.6/site-packages/fastai/callback.py in on_train_end(self, exception)
    289     def on_train_end(self, exception:Union[bool,Exception])->None:
    290         "Handle end of training, `exception` is an `Exception` or False if no exceptions during training."
--> 291         self('train_end', exception=exception)
    292 
    293 class AverageMetric(Callback):

/opt/conda/lib/python3.6/site-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    212         "Call through to all of the `CallbakHandler` functions."
    213         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 214         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    215 
    216     def set_dl(self, dl:DataLoader):

/opt/conda/lib/python3.6/site-packages/fastai/callback.py in <listcomp>(.0)
    212         "Call through to all of the `CallbakHandler` functions."
    213         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 214         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    215 
    216     def set_dl(self, dl:DataLoader):

/opt/conda/lib/python3.6/site-packages/fastai/callbacks/lr_finder.py in on_train_end(self, **kwargs)
     43         # restore the valid_dl we turned off on `__init__`
     44         self.data.valid_dl = self.valid_dl
---> 45         self.learn.load('tmp')
     46         if hasattr(self.learn.model, 'reset'): self.learn.model.reset()
     47         print('LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.')

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in load(self, name, device, strict, with_opt, purge)
    241     def load(self, name:PathOrStr, device:torch.device=None, strict:bool=True, with_opt:bool=None, purge:bool=True):
    242         "Load model and optimizer state (if `with_opt`) `name` from `self.model_dir` using `device`."
--> 243         if purge: self.purge(clear_opt=ifnone(with_opt, False))
    244         if device is None: device = self.data.device
    245         state = torch.load(self.path/self.model_dir/f'{name}.pth', map_location=device)

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in purge(self, clear_opt)
    287         state['cb_state'] = {cb.__class__:cb.get_state() for cb in self.callbacks}
    288         if hasattr(self, 'opt'): state['opt'] = self.opt.get_state()
--> 289         torch.save(state, open(tmp_file, 'wb'))
    290         for a in attrs_del: delattr(self, a)
    291         gc.collect()

OSError: [Errno 30] Read-only file system: '../input/purge-tmp.pkl'

So here is what happened (still not 100% sure, but we have a fix):

Cause: fastai 1.0.46 on the Kaggle kernel

File: lr_finder.py
Function: on_train_end()
self.learn.load('tmp') triggers the purge() call, which ends up opening the temp file at '../input/purge-tmp.pkl' instead of under the model_dir path.

Resolve:
Calling self.learn.load('tmp', purge=False) resolves the issue.
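
If upgrading is not possible, one way to try the same idea from user code is to monkeypatch LRFinder.on_train_end. This is only a hedged sketch reconstructed from the lines visible in the traceback above, not the official fix; upgrading to 1.0.48 is the cleaner route.

from fastai.callbacks.lr_finder import LRFinder

def _lr_finder_on_train_end(self, **kwargs):
    # Restore the valid_dl that was turned off in __init__ (as in the traceback above)
    self.data.valid_dl = self.valid_dl
    # Skip purge() so nothing gets written under the read-only ../input
    self.learn.load('tmp', purge=False)
    if hasattr(self.learn.model, 'reset'): self.learn.model.reset()
    print('LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.')

LRFinder.on_train_end = _lr_finder_on_train_end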

fastai 1.0.48 already includes this fix… :joy:

My advice, if you want to use a Kaggle kernel (P100 GPU): in the first cell, run

!conda install -c fastai fastai --yes

After it is done, run

import fastai
fastai.__version__

and check that it is 1.0.48.

Now you can go back to creating the learner with learn = cnn_learner(..., model_dir = '/kaggle/model')

@shlyakh: You can find out which version of fastai you are using by running this code:
import fastai; fastai.__version__

@heye0507: It's weird that you get an error with my code, because I used (and am still using) version 1.0.46 as well. In your error message it looks like your code still points at the input directory? It might also have something to do with the learner. I create it like this:
learn = create_cnn(data, models.resnet34, model_dir = '/tmp/models', metrics=error_rate)

But anyway: now we have two options, and with the update to 1.0.48 this problem seems to be gone, just like you wrote :slight_smile:


Is there a way to download a model after some epochs in Kaggle without committing? If I train a GAN model, the commit will go on forever, since I stop the cell manually. In Colab I usually go to the folder structure and download the model file or save it to Google Drive.

There is a way to download a CSV file without committing… but I don't know anything about downloading the model… I assume it is a .pth file?

Yes, I know about the CSV method, but it only works for CSV files of around 2 MB: https://www.kaggle.com/rtatman/download-a-csv-file-from-a-kernel
It is a .pkl file. Being able to download any file would be great, though; even after committing, I cancel the commit and get an error that the folder structure is too large (more than 6) and there are no output files. I'm looking for a workaround now.

I would try this:
https://www.kaggle.com/rtatman/download-a-csv-file-from-a-kernel#467667

from IPython.display import FileLinks
FileLinks('.') # the argument is the folder whose files you want links for

I do not know if there is a download limit though…
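
A possibly useful variant: FileLink (singular) renders a download link for one specific file; the filename below is just a placeholder.

from IPython.display import FileLink
FileLink('models/stage-1.pth')  # hypothetical path - point it at your actual model file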


The solution seems to be to zip the files and then delete the folder contents to avoid the too-many-outputs error:
!zip -r output.zip /kaggle/working/
!rm -rf /kaggle/working/*

Currently testing this.

So I was able to get the commit to pass, but there is no output zip file. Is there a specific directory I need to move output.zip to for it to show up as output?
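
One hedged guess about the missing zip: !rm -rf /kaggle/working/* also deletes output.zip if the archive was written into /kaggle/working itself, and as far as I know the Output section only picks up what is left under /kaggle/working when the commit finishes. A sketch that archives to /tmp first and then moves only the archive back (paths are assumptions, untested):

import os
import shutil

WORK = '/kaggle/working'  # assumed working/output directory of the kernel

# Build the archive outside the working dir so the cleanup step cannot delete it
archive = shutil.make_archive('/tmp/output', 'zip', WORK)   # -> /tmp/output.zip

# Clear the working dir, then move only the archive back as the single output file
for name in os.listdir(WORK):
    full = os.path.join(WORK, name)
    shutil.rmtree(full) if os.path.isdir(full) else os.remove(full)
shutil.move(archive, os.path.join(WORK, 'output.zip'))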

Thanks for sharing it!
I tried it just now, and a 241 MB model file downloaded successfully!

@Daniel glad to hear this worked!

You probably haven't saved the model in the correct directory… Otherwise, the model should be downloadable from the Output section of the Kaggle kernel.

I tried this method and I get this:


It is saved under /kaggle/working/stylegan/results/. I was able to upload the model to Dropbox from Kaggle using a script.

This doesn't look like it's from the Kaggle kernel… You should be able to just press the link for the file you want and download it…

When I click on the link, that is the page I get.

Hey all, I'm getting bus errors when trying to look at a batch from my DataBunch. I've successfully downloaded the images into the kernel (I don't know if that is the correct way to say it) and have created the DataBunch, but the following line is throwing errors:

data.show_batch(rows=3, figsize=(7,8), num_workers=0)

I’m also getting bus errors when trying to commit the kernel. Is this a problem on Kaggle’s end?
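
A hedged aside: bus errors in kernels are often shared-memory (/dev/shm) problems with DataLoader worker processes, and num_workers is normally a DataBunch/DataLoader option rather than a show_batch one, so one thing to try is building the DataBunch with num_workers=0. A sketch (the path, transforms, and size below are placeholders):

from fastai.vision import *

# Build the DataBunch without worker processes, then call show_batch without num_workers
data = ImageDataBunch.from_folder('../input', valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=224,
                                  num_workers=0).normalize(imagenet_stats)
data.show_batch(rows=3, figsize=(7, 8))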


I was facing the same problem as you; this is the solution I came up with: