Lesson 3 In-Class Discussion ✅

In the head-pose notebook I tried to track the MSE manually with this metric:

def mse(pred:Tensor, targ:Tensor)->Rank0Tensor:
    return ((targ - pred)**2).mean()

learn = create_cnn(data, models.resnet34, metrics=mse)
learn.loss_func = MSELossFlat()

I would expect the validation loss to be the same as the mse metric, but it’s much smaller.

epoch  train_loss  valid_loss  mse     
1      0.036152    0.008999    0.011818  (00:53)
2      0.012023    0.010004    0.015681  (00:52)
3      0.004507    0.007254    0.014140  (00:52)
4      0.002694    0.001148    0.007541  (00:51)
5      0.001897    0.001305    0.008124  (00:52)

Why is the validation loss so much smaller?

The problem seems to be that targ’s size is [64, 1, 2] and pred’s is [64, 2], so your custom mse triggers unwanted broadcasting: the difference ends up with shape [64, 64, 2], comparing every target with every prediction. Try return ((targ.squeeze() - pred.squeeze())**2).mean().
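To make the shape mismatch concrete, here is a minimal pure-Python sketch (no PyTorch needed). The helper names broadcast_shape, mse_elementwise and mse_broadcast are made up for illustration; they only mimic the NumPy-style broadcasting rule that PyTorch follows, so treat this as an approximation of what happens inside the metric, not as fastai code.

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Shape produced by broadcasting shapes a and b (NumPy/PyTorch-style rule)."""
    out = []
    # Compare sizes from the trailing dimension; a missing leading dim counts as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"shapes {a} and {b} cannot be broadcast")
        out.append(y if x == 1 else x)
    return tuple(reversed(out))

def mse_elementwise(pred, targ):
    """Intended metric: compare row i of targ with row i of pred only."""
    diffs = [(t - p) ** 2 for ps, ts in zip(pred, targ) for p, t in zip(ps, ts)]
    return sum(diffs) / len(diffs)

def mse_broadcast(pred, targ):
    """What [N, 1, 2] - [N, 2] computes: every targ row against every pred row."""
    diffs = [(t - p) ** 2 for ts in targ for ps in pred for p, t in zip(ps, ts)]
    return sum(diffs) / len(diffs)

# Shapes from the post: targ is [64, 1, 2], pred is [64, 2].
print(broadcast_shape((64, 1, 2), (64, 2)))

# Tiny example with perfect predictions: the intended MSE is 0,
# but the broadcast version also averages in the mismatched pairs.
pred = [[0.0, 0.0], [1.0, 1.0]]
targ = [[0.0, 0.0], [1.0, 1.0]]
print(mse_elementwise(pred, targ), mse_broadcast(pred, targ))
```

With perfect predictions the intended metric is zero while the broadcast one is not, which matches the mse column coming out larger than valid_loss.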


While running the lesson3-imdb notebook I am getting RuntimeError: CUDA error: out of memory at this line:

learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

I am stuck at this point. I have two 1080 Ti GPUs. Since this uses only one GPU, can anyone suggest how to overcome this, either by using both GPUs or some other way?


Smaller batch size?

In case you haven’t tried: Once you get an out of memory error you have to restart your notebook before trying again.


Thanks. I am also having a CUDA issue, and reducing the batch size didn’t solve it. I am also tweaking bptt, per thread 1 and thread 2 from prior courses, but haven’t cracked it yet. Mentioning them in case these threads help, @karan.

Question re: the Data Block API (which btw is super cool and flexible)

In the docs, we have the following example:

data = (ImageFileList.from_folder(planet)
        #Where to find the data? -> in planet and its subfolders
        .label_from_csv('labels.csv', sep=' ', folder='train', suffix='.jpg')
        #How to label? -> use the csv file labels.csv in path,
        #add .jpg to the names and take them in the folder train
        .random_split_by_pct()
        #How to split in train/valid? -> randomly with the default 20% in valid
        .datasets(ImageMultiDataset)
        #How to convert to datasets? -> use ImageMultiDataset
        .transform(planet_tfms, size=128)
        #Data augmentation? -> use tfms with a size of 128
        .databunch())
        #Finally? -> use the defaults for conversion to databunch

My understanding is that the first line recursively goes through all the subfolders of path. So what happens when those files (which I’d imagine could include files under valid or test folders), the files specified in labels.csv, and the files in the train folder don’t all match? I guess this is for the scenario where there are either (a) multiple CSV label files for train and valid images, or (b) a single label file for the train and valid subfolders…

Thanks 🙂

Where can I change the batch size parameter? I think it’s in this line?


Yes, I believe bs is set when loading the data and bptt in the learner function, so I am trying (slowly!):

data_lm = TextLMDataBunch.load(path, 'tmp_lm', bs=32)
learn = language_model_learner(data_lm, drop_mult=0.3, pretrained_model=URLs.WT103, bptt=40)

The default bs is about 64 and the default bptt about 70, so I’m experimenting with those figures at the moment.


“mv: cannot stat ‘kaggle.json’: No such file or directory”

I believe that all other steps were executed properly. Any hints out there in non-newb land?

I think your kaggle.json file is in a different directory from where you are running your notebook. Try ! mv ~/kaggle.json ~/.kaggle/


Thanks for the reply joshfp! My kaggle.json is in, I believe it is the root directory, as pictured above. Did you mean just run this?

My apologies because I am a huge newb, so this could be entirely in relation to that. 🙂

Also, I’m using salamander and, in Jeremy’s words, “Depending on your platform you may need sudo you may need slash something else slash pip , you may need source activate.”

I mean ~/kaggle.json instead of kaggle.json.


@joshfp is probably right - your .json file isn’t in your active directory. Importantly, the “root” shown by jupyter is jupyter’s root, not the root of the operating system. So, the fully-qualified path surely includes a couple of directory levels above it before reaching jupyter’s root directory.

It might help if you examine your active directory with the following (and might be useful in general for others who stumble across this post):

Use !pwd to display the active directory;
Use !ls -a to list all of the files and directories in the active directory, including the hidden files or folders;
Use something like !find . -name "*.json" or !find / -name "*.json" to search the active directory and its subdirectories, or the root directory and its subdirectories, respectively. The second one may take a long time and throw a slew of ‘permission denied’ responses;
Use !echo $HOME to display the home path of the OS, which is usually something like /home/<username>;

Essentially, joshfp is saying that your active directory is something other than your home directory (where your home directory equals /home/<username>, aka ~/). As a result, you need to use the fully-qualified path to the kaggle.json file to move it to the hidden folder ~/.kaggle.

Or, you could change to your home directory using os.chdir('/home/<username>') or os.chdir(os.path.expanduser('~')); a bare os.chdir('~/') fails because Python does not expand ~. !cd ~/ will not stick either, since each ! command runs in its own subshell, but the %cd ~ magic does persist in a jupyter notebook.
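If you would rather stay in Python than juggle shell commands, something like the following sketch should do the same job as the mv command. The function name move_kaggle_json and its parameters are hypothetical, and it assumes kaggle.json sits directly in the directory you pass in; adjust the path to wherever the file actually is.

```python
import shutil
from pathlib import Path

def move_kaggle_json(src_dir, home=None):
    """Move kaggle.json from src_dir into <home>/.kaggle and return the new path."""
    home = Path(home) if home is not None else Path.home()
    # expanduser() turns a leading "~" into the real home directory.
    src = Path(src_dir).expanduser() / "kaggle.json"
    if not src.exists():
        raise FileNotFoundError(f"{src} not found; check Path.cwd() and the path you passed")
    dest_dir = home / ".kaggle"
    dest_dir.mkdir(parents=True, exist_ok=True)  # create ~/.kaggle if it is missing
    dest = dest_dir / "kaggle.json"
    shutil.move(str(src), str(dest))
    dest.chmod(0o600)  # keep the API key readable by the owner only
    return dest

# Example call (not executed here): move_kaggle_json("~")
```

Passing the directory explicitly avoids depending on whatever the notebook’s active directory happens to be.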


I’m trying to save the model as a .pth file from the first phase and then load it when we’re supposed to train on bigger images. I have loaded the .pth file by doing

model = torch.load("/home/jupyter/tutorials/fastai/course-v3/nbs/dl1/model.pth")

but I have no idea how to plug it into unet for further training.

What is the folder format for loading the dataset for the classifier in the imdb notebook?
While loading I am getting this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-53-97dd9c4a5ffe> in <module>
      3             .label_from_folder(classes=['hotel','train'])
      4              #label them all with their folder, only keep 'neg' and 'pos'
----> 5             .split_by_folder(valid='test')
      6              #split by folder between train and validation set
      7             .datasets()

~/anaconda2/envs/hindinlu/lib/python3.6/site-packages/fastai/data_block.py in datasets(self, dataset_cls, **kwargs)
    234         train = dataset_cls(*self.train.items.T, **kwargs)
    235         dss = [train]
--> 236         dss += [train.new(*o.items.T, **kwargs) for o in self.lists[1:]]
    237         cls = getattr(train, '__splits_class__', self._pipe)
    238         return cls(self.path, *dss)

~/anaconda2/envs/hindinlu/lib/python3.6/site-packages/fastai/data_block.py in <listcomp>(.0)
    234         train = dataset_cls(*self.train.items.T, **kwargs)
    235         dss = [train]
--> 236         dss += [train.new(*o.items.T, **kwargs) for o in self.lists[1:]]
    237         cls = getattr(train, '__splits_class__', self._pipe)
    238         return cls(self.path, *dss)

~/anaconda2/envs/hindinlu/lib/python3.6/site-packages/fastai/basic_data.py in new(self, *args, **kwargs)
     40     def new(self, *args, **kwargs):
     41         "Create a new dataset using `self` as a template"
---> 42         return self.__class__(*args, **kwargs)
     44     def _get_x(self,i):   return self.x[i]

TypeError: __init__() missing 1 required positional argument: 'fns'

I am getting a division by zero error with the following learning rate setting. Is there any way to fix this?

lr = 1e-03
learn.fit_one_cycle(5, slice(1e-04, lr/5))

Error messages:

ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-29-2392d8fd8181> in <module>
----> 1 learn.fit_one_cycle(5, slice(1e-04, lr/5))

~/fastai/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     16                   wd:float=None, callbacks:Optional[CallbackList]=None, **kwargs)->None:
     17     "Fit a model following the 1cycle policy."
---> 18     max_lr = learn.lr_range(max_lr)
     19     callbacks = ifnone(callbacks, [])
     20     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,

~/fastai/fastai/basic_train.py in lr_range(self, lr)
    148         "Build differential learning rates."
    149         if not isinstance(lr,slice): return lr
--> 150         if lr.start: res = even_mults(lr.start, lr.stop, len(self.layer_groups))
    151         else: res = [lr.stop/3]*(len(self.layer_groups)-1) + [lr.stop]
    152         return np.array(res)

~/fastai/fastai/core.py in even_mults(start, stop, n)
    102     "Build evenly stepped schedule from `start` to `stop` in `n` steps."
    103     mult = stop/start
--> 104     step = mult**(1/(n-1))
    105     return np.array([start*(step**i) for i in range(n)])

ZeroDivisionError: division by zero

How is n = 1 in this case? Any help is much appreciated.

Apparently, n==1 because len(self.layer_groups)==1. If your model has only one layer group, instead of passing a slice, try passing a single number (float) as the learning rate.
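For anyone who wants to see the failure in isolation, here is a rough pure-Python re-creation of the even_mults function shown in the traceback (the real one returns a NumPy array; this sketch uses a list). The lr_for helper is hypothetical, just one way of falling back to a plain float when there is a single layer group.

```python
def even_mults(start, stop, n):
    """Evenly stepped (geometric) schedule from start to stop in n steps."""
    mult = stop / start
    step = mult ** (1 / (n - 1))  # n == 1 gives 1 / 0, hence ZeroDivisionError
    return [start * step ** i for i in range(n)]

def lr_for(n_groups, lo, hi):
    """Hypothetical helper: a single float for one layer group, else a schedule."""
    if n_groups == 1:
        return hi
    return even_mults(lo, hi, n_groups)

# Three layer groups: one learning rate per group, spaced geometrically.
print(even_mults(1e-4, 1e-2, 3))

# One layer group: calling even_mults directly would divide by zero,
# so the helper returns the upper bound as a plain float instead.
print(lr_for(1, 1e-4, 2e-4))
```

In the notebook itself the equivalent fix is simply learn.fit_one_cycle(5, lr/5) rather than a slice.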


Has anyone hit this issue when running imdb? (GCP)

data_lm = (TextFileList.from_folder(path)


UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xc3 in position 1179: ordinal not in range(128)

Hey @joshfp, was wondering if you might be able to provide some intuition for why the squeeze is needed? In the camvid notebook, we have the following code:

name2id = {v:k for k,v in enumerate(codes)}
void_code = name2id['Void']

def acc_camvid(input, target):
    target = target.squeeze(1)
    mask = target != void_code
    return (input.argmax(dim=1)[mask]==target[mask]).float().mean()

My understanding is that tensor.squeeze() removes all dimensions of size 1, and tensor.squeeze(dim) removes only that dimension, and only if it’s of size 1? It makes sense that the pred and targ dimensions should be the same, but any tips on how to think about what the dimensions of pred and targ are during the learn process in order to know how to squeeze them to the same shape? Thanks!!
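One way to build intuition is to reason about shapes alone. The sketch below uses a made-up helper, squeeze_shape, that mimics on plain tuples what squeeze does to a tensor’s shape; it is an illustration of the rule, not PyTorch code, and the sizes used for the camvid-style mask are example values.

```python
def squeeze_shape(shape, dim=None):
    """Shape after squeeze(): drop all size-1 dims, or only `dim` if it has size 1."""
    if dim is None:
        return tuple(s for s in shape if s != 1)
    dim = dim % len(shape)  # allow negative dims such as -1
    if shape[dim] != 1:
        return tuple(shape)  # nothing to remove: shape is unchanged
    return tuple(s for i, s in enumerate(shape) if i != dim)

# Head-pose targets from earlier in the thread: [64, 1, 2] -> [64, 2].
print(squeeze_shape((64, 1, 2)))

# Segmentation mask with a channel dim of 1 (example sizes): squeeze(1) drops it.
print(squeeze_shape((8, 1, 360, 480), dim=1))

# Pitfall: with a batch of one, a bare squeeze() also removes the batch dim,
# while squeeze(1) leaves it alone.
print(squeeze_shape((1, 1, 2)), squeeze_shape((1, 1, 2), dim=1))
```

Because a bare squeeze() can also swallow a batch dimension of size 1, squeezing a specific dim, as acc_camvid does, is usually the safer habit. Printing pred.shape and targ.shape inside the metric is a quick way to find out which dim to target.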