Platform: Kaggle Kernels

Sanyam,

It is probably better that Kernels do not jump immediately to the latest fastai version, especially when a lot of changes are being made, as at the start of each new class.

Is it possible to find out somewhere which version Kaggle Kernels and/or Google Colab are using? Is it part of some Docker config file somewhere, perhaps?

Thanks,

Yuri

I am getting this error when I try to run the lesson 3 code:


Anyone get something similar?

@piaoya
Hi Sanyam,

I believe something has changed after Kaggle updated their P100 kernels or the fastai library (just checked, it is now on 1.0.46).

Now, even if you set model_dir to a writable directory,
learn.lr_find() will crash.

I traced it into the fastai library; it seems that during training the purge() method calls torch.save(), and it ends up trying to save to ../input/purge-tmp.pkl.

I am still looking for the root cause of this issue, but I guess it is the self.path/self.model_dir part: the temp file ends up under self.path, which is ../input in Kaggle kernels. (Most people load data from ../input, so you have self.path = '../input' and self.model_dir = '/kaggle/model', but the temp file still ends up under ../input.)

But yup, creating the learner with model_dir set to a writable directory no longer works…
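
For what it's worth, a quick pathlib check (just an illustration, using the paths from this thread) suggests the join itself may not be the culprit: joining with an absolute model_dir keeps the absolute path, while the failing path matches self.path joined directly with the temp filename, i.e. the temp file seems to be written under self.path rather than under model_dir.

from pathlib import Path

print(Path('../input') / '/kaggle/model')    # /kaggle/model - an absolute right-hand side wins
print(Path('../input') / 'purge-tmp.pkl')    # ../input/purge-tmp.pkl - matches the OSError in the trace below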

Currently I use piaoya's way of moving the data around…

Here is the trace:

OSError                                   Traceback (most recent call last)
<ipython-input-32-d81c6bd29d71> in <module>()
----> 1 learn.lr_find()

/opt/conda/lib/python3.6/site-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, stop_div, wd)
     30     cb = LRFinder(learn, start_lr, end_lr, num_it, stop_div)
     31     epochs = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 32     learn.fit(epochs, start_lr, callbacks=[cb], wd=wd)
     33 
     34 def to_fp16(learn:Learner, loss_scale:float=None, max_noskip:int=1000, dynamic:bool=False, clip:float=None,

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    180         if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
    181         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 182             callbacks=self.callbacks+callbacks)
    183 
    184     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/opt/conda/lib/python3.6/site-packages/fastai/utils/mem.py in wrapper(*args, **kwargs)
     87 
     88         try:
---> 89             return func(*args, **kwargs)
     90         except Exception as e:
     91             if ("CUDA out of memory" in str(e) or

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
    101         exception = e
    102         raise
--> 103     finally: cb_handler.on_train_end(exception)
    104 
    105 loss_func_name2activ = {'cross_entropy_loss': F.softmax, 'nll_loss': torch.exp, 'poisson_nll_loss': torch.exp,

/opt/conda/lib/python3.6/site-packages/fastai/callback.py in on_train_end(self, exception)
    289     def on_train_end(self, exception:Union[bool,Exception])->None:
    290         "Handle end of training, `exception` is an `Exception` or False if no exceptions during training."
--> 291         self('train_end', exception=exception)
    292 
    293 class AverageMetric(Callback):

/opt/conda/lib/python3.6/site-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    212         "Call through to all of the `CallbakHandler` functions."
    213         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 214         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    215 
    216     def set_dl(self, dl:DataLoader):

/opt/conda/lib/python3.6/site-packages/fastai/callback.py in <listcomp>(.0)
    212         "Call through to all of the `CallbakHandler` functions."
    213         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 214         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    215 
    216     def set_dl(self, dl:DataLoader):

/opt/conda/lib/python3.6/site-packages/fastai/callbacks/lr_finder.py in on_train_end(self, **kwargs)
     43         # restore the valid_dl we turned off on `__init__`
     44         self.data.valid_dl = self.valid_dl
---> 45         self.learn.load('tmp')
     46         if hasattr(self.learn.model, 'reset'): self.learn.model.reset()
     47         print('LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.')

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in load(self, name, device, strict, with_opt, purge)
    241     def load(self, name:PathOrStr, device:torch.device=None, strict:bool=True, with_opt:bool=None, purge:bool=True):
    242         "Load model and optimizer state (if `with_opt`) `name` from `self.model_dir` using `device`."
--> 243         if purge: self.purge(clear_opt=ifnone(with_opt, False))
    244         if device is None: device = self.data.device
    245         state = torch.load(self.path/self.model_dir/f'{name}.pth', map_location=device)

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in purge(self, clear_opt)
    287         state['cb_state'] = {cb.__class__:cb.get_state() for cb in self.callbacks}
    288         if hasattr(self, 'opt'): state['opt'] = self.opt.get_state()
--> 289         torch.save(state, open(tmp_file, 'wb'))
    290         for a in attrs_del: delattr(self, a)
    291         gc.collect()

OSError: [Errno 30] Read-only file system: '../input/purge-tmp.pkl'

So here is what happened (still not 100% sure, but we have a fix):

Cause: fastai 1.0.46 on the Kaggle kernel

File: lr_finder.py
Function: on_train_end()
self.learn.load('tmp') triggers the purge() call, which ends up opening the temp file at '../input/purge-tmp.pkl' instead of under the model_dir path.

Resolve:
Calling self.learn.load('tmp', purge=False) resolves the issue.
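
If upgrading is not possible, one way to try the same idea from user code is to monkeypatch LRFinder.on_train_end. This is only a hedged sketch reconstructed from the lines visible in the traceback above, not the official fix; upgrading to 1.0.48 is the cleaner route.

from fastai.callbacks.lr_finder import LRFinder

def _lr_finder_on_train_end(self, **kwargs):
    # Restore the valid_dl that was turned off in __init__ (as in the traceback above)
    self.data.valid_dl = self.valid_dl
    # Skip purge() so nothing gets written under the read-only ../input
    self.learn.load('tmp', purge=False)
    if hasattr(self.learn.model, 'reset'): self.learn.model.reset()
    print('LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.')

LRFinder.on_train_end = _lr_finder_on_train_end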

fastai 1.0.48 already includes this fix… :joy:

My advice, if you want to use a Kaggle kernel (P100 GPU): in the first cell, run

!conda install -c fastai fastai --yes

After it is done, run

import fastai
fastai.__version__

and check that it is 1.0.48.

Now you can go back to creating the learner with learn = cnn_learner(..., model_dir = '/kaggle/model')

@shlyakh: You can find out which version of fastai you are using by running this code:
import fastai; fastai.__version__

@heye0507: It's weird that you get an error with my code, because I used (and am still using) version 1.0.46 as well. In your error message it looks like your code still points at the input directory? It might also have something to do with the learner. I create it like this:
learn = create_cnn(data, models.resnet34, model_dir = '/tmp/models', metrics=error_rate)

But anyway: now we have two options, and with the update to 1.0.48 this problem seems to be gone, just like you wrote :slight_smile:


Is there a way to download a model after some epochs in Kaggle without committing? If I train a GAN model, the commit will go on forever, since I stop the cell manually. In Colab I usually go to the folder structure and download the model file or save it to Google Drive.

There is a way to download a CSV file without committing… but I don't know anything about downloading the model… I assume it is a .pth file?

Yes, I know about the CSV method, but it only works for CSV files of around 2 MB: https://www.kaggle.com/rtatman/download-a-csv-file-from-a-kernel
It is a .pkl file. Being able to download any file would be great, though; even after committing, I cancel the commit and get an error that the folder structure is too large (more than 6) and there are no output files. I'm looking for a workaround now.

I would try this:
https://www.kaggle.com/rtatman/download-a-csv-file-from-a-kernel#467667

from IPython.display import FileLinks
FileLinks('.') # the argument is the folder whose files you want links for

I do not know if there is a download limit though…
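
A possibly useful variant: FileLink (singular) renders a download link for one specific file; the filename below is just a placeholder.

from IPython.display import FileLink
FileLink('models/stage-1.pth')  # hypothetical path - point it at your actual model file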


The solution seems to be to zip the files and then delete the folder contents to avoid the too-many-outputs error:
!zip -r output.zip /kaggle/working/
!rm -rf /kaggle/working/*

Currently testing this.

So I was able to get the commit to pass, but there is no output zip file. Is there a specific directory I need to move output.zip to for it to show up as output?
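
One hedged guess about the missing zip: !rm -rf /kaggle/working/* also deletes output.zip if the archive was written into /kaggle/working itself, and as far as I know the Output section only picks up what is left under /kaggle/working when the commit finishes. A sketch that archives to /tmp first and then moves only the archive back (paths are assumptions, untested):

import os
import shutil

WORK = '/kaggle/working'  # assumed working/output directory of the kernel

# Build the archive outside the working dir so the cleanup step cannot delete it
archive = shutil.make_archive('/tmp/output', 'zip', WORK)   # -> /tmp/output.zip

# Clear the working dir, then move only the archive back as the single output file
for name in os.listdir(WORK):
    full = os.path.join(WORK, name)
    shutil.rmtree(full) if os.path.isdir(full) else os.remove(full)
shutil.move(archive, os.path.join(WORK, 'output.zip'))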

Thanks for sharing it!
I tried it just now, and a 241 MB model file downloaded successfully!

@Daniel glad to hear this worked!

You probably haven't saved the model in the correct directory… Otherwise, the model should be downloadable from the Output section of the Kaggle kernel.

I tried this method and I get this:


It is saved under /kaggle/working/stylegan/results/. I was able to upload the model to Dropbox from Kaggle using a script.

This doesn't look like it's from the Kaggle kernel… You should be able to just press the link for the file you want and download it…

When I click on the link, that is the page I get.

Hey all, I'm getting bus errors when trying to look at a batch from my DataBunch. I've successfully downloaded the images into the kernel (I don't know if that is the correct way to say it) and have created the DataBunch, but the following line is throwing errors:

data.show_batch(rows=3, figsize=(7,8), num_workers=0)

I’m also getting bus errors when trying to commit the kernel. Is this a problem on Kaggle’s end?
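
A hedged aside: bus errors in kernels are often shared-memory (/dev/shm) problems with DataLoader worker processes, and num_workers is normally a DataBunch/DataLoader option rather than a show_batch one, so one thing to try is building the DataBunch with num_workers=0. A sketch (the path, transforms, and size below are placeholders):

from fastai.vision import *

# Build the DataBunch without worker processes, then call show_batch without num_workers
data = ImageDataBunch.from_folder('../input', valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=224,
                                  num_workers=0).normalize(imagenet_stats)
data.show_batch(rows=3, figsize=(7, 8))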


I was facing the same problem as you; this is the solution I came up with: