Lesson 2: Saving a Cats v Dogs Model: NaNs at CL but not in Jupyter Notebook

I’m really confused here…

If I run the notebook from GitHub, cats v dogs trains fine.

If I take the same code, put it into a Python script, and run it, training results in NaNs. I’ve verified at inference that this is not a reporting anomaly but a genuine failure to train: the notebook’s model.ckpt is highly accurate, while my command-line-trained version is highly inaccurate. (A sketch of that inference check appears after the script below.)

Script Code:

import sys
sys.path.append('../../../fastai')
sys.path.append('../../../fastcore')
sys.path.append('../../../fastprogress')
sys.path.append('../../../fastdownload')

from fastai.vision.all import *

def is_cat(x):
    return x[0].isupper()

def main():
    path = untar_data(URLs.PETS)/'images'
    dls = ImageDataLoaders.from_name_func(
        '.', get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(192))
    learn = vision_learner(dls, models.resnet18, metrics=error_rate)
    learn.fine_tune(3)
    learn.path = Path('.')
    learn.export("model.pkl")

if __name__ == '__main__':
    main()
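
For completeness, the inference check I mentioned above looks roughly like this (a sketch; the image path is just a placeholder):

from fastai.vision.all import *

def is_cat(x):
    return x[0].isupper()  # must be defined so load_learner can unpickle the model

learn = load_learner('model.pkl')
pred, _, probs = learn.predict('some_image.jpg')  # placeholder path
print(pred, probs)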

I’m on Windows 11
PyTorch 1.13
CUDA 11.6

If any of that matters…

Note: I’m not installing fastai, fastcore, fastprogress, or fastdownload as packages (hence the sys.path entries), so that I can more easily search and modify them in one place. Making these same modifications to the notebook, it still trains fine. I’m using the same conda environment in both places (fastai2022).

I’ve tried training in both PowerShell and the normal command prompt; both result in NaN training loss and validation loss. Also, the progress bar does not update until nearly the end of the epoch, then rapidly fills in.

Any ideas what the difference could be?

Hey

Maybe you’ve figured this out by now, but I’m posting this in case others run into the same issue.
I’m working on Windows 11 with CUDA and the fastai library, on the Chapter 6 multi-label (multicat) example.
I’m using conda and VS Code to run the examples.

I too kept getting NaN for train_loss and valid_loss.
On a whim I added the num_workers=0 param to my dataloaders call, like:
dls = dblock.dataloaders(df, num_workers=0)

I think this keeps the data-loading pipeline in a single process (no worker subprocesses). I’ve had to use this in almost all of the examples because of some kind of bug with multi-process data loading and CUDA on Windows.
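
For reference, the same workaround applied to the script in the first post would look like this (a sketch on my part; as far as I know from_name_func passes num_workers through to the dataloaders call):

dls = ImageDataLoaders.from_name_func(
    '.', get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(192),
    num_workers=0)  # load data in the main process; avoids the NaN issue on Windows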

My code that worked:

from fastai.vision.all import *

path = untar_data(URLs.PASCAL_2007)
df = pd.read_csv(path/'train.csv')
# print(df.head())

def get_x(r): return path/'train'/r['fname']
def get_y(r): return r['labels'].split(' ')

# use the is_valid column to split into training and validation sets
def splitter(df):
    train = df.index[~df['is_valid']].tolist()
    valid = df.index[df['is_valid']].tolist()
    return train, valid

if __name__ == "__main__":
    dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                       splitter=splitter,
                       get_x=get_x,
                       get_y=get_y,
                       item_tfms=RandomResizedCrop(128, min_scale=0.35))

    dls = dblock.dataloaders(df, num_workers=0)  # single-process data loading
    learn = vision_learner(dls, resnet50, metrics=partial(accuracy_multi, thresh=0.2))
    print(learn.summary())
    learn.fine_tune(3, base_lr=3e-3, freeze_epochs=4)

Setting num_workers=0 also works for me, and my platform is also Windows 11.

So the problem occurs when using multi-process data loading?
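
That seems to be the case. My understanding (an assumption on my part, not confirmed in this thread) is that on Windows, Python’s multiprocessing can only use the 'spawn' start method, so each DataLoader worker re-imports the script, while Linux defaults to 'fork'. That is also why the working example above keeps the DataBlock setup under the if __name__ == "__main__" guard. You can check which start method your platform uses:

import multiprocessing as mp

# 'spawn' on Windows (fork is unavailable there); 'fork' by default on Linux
print(mp.get_start_method())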

Windows isn’t worth pursuing if you can avoid it, IMHO.

(nearly everyone/everything uses Linux in one flavour or another for DL).

Luckily it is much easier to run Linux on Windows machines these days via WSL2 (Windows Subsystem for Linux); on Windows 11 you can install it with wsl --install from an administrator PowerShell.

Linux skills are good to acquire for working across the many cloud providers and for ML industry job prospects. In the live coding sessions Jeremy walks through many tips on using Linux, installing WSL, etc.

Learning a different OS and DL at the same time is a steeper climb, though.

Or, as the course suggests, use Colab and Kaggle whilst learning, to avoid wasting cycles on system setup and config issues.