I’m really confused here…
If I run the notebook from GitHub, cats vs. dogs trains fine.
If I take the same code, put it into a Python script, and run it, training produces NaNs. I’ve verified at inference that this is not a reporting anomaly but a genuine failure to train: the notebook’s model.ckpt is highly accurate, while the version trained from the command line is highly inaccurate.
Script Code:
import sys
sys.path.append('../../../fastai')
sys.path.append('../../../fastcore')
sys.path.append('../../../fastprogress')
sys.path.append('../../../fastdownload')
from fastai.vision.all import *

def is_cat(x):
    return x[0].isupper()

def main():
    path = untar_data(URLs.PETS)/'images'
    dls = ImageDataLoaders.from_name_func('.', get_image_files(path), valid_pct=0.2, seed=42, label_func=is_cat, item_tfms=Resize(192))
    learn = vision_learner(dls, models.resnet18, metrics=error_rate)
    learn.fine_tune(3)
    learn.path = Path('.')
    learn.export("model.pkl")

if __name__ == '__main__':
    main()
I’m on Windows 11
PyTorch 1.13
CUDA 11.6
If any of that matters…
Note that I’m not installing fastai, fastcore, fastprogress, or fastdownload as packages; I add them to sys.path instead so that I can more easily search and modify them in one location. Making the same sys.path modification in the notebook still trains fine. I’m using the same conda environment (fastai2022) in both places.
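To rule out an environment mismatch between the two runs, a check along these lines could be executed in both the notebook and the script and the output compared. This is only a minimal sketch: `describe_environment` is a hypothetical helper name (not part of fastai), and it assumes each package exposes `__version__` and `__file__`, falling back to placeholders where it does not.

```python
import importlib
import sys


def describe_environment(packages=("torch", "fastai", "fastcore",
                                   "fastprogress", "fastdownload")):
    """Collect interpreter and package details so two runs can be compared.

    Hypothetical helper for illustration; not part of fastai.
    """
    info = {
        "python": sys.version.split()[0],
        "executable": sys.executable,
    }
    for name in packages:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Missing package, or a package that fails while importing.
            info[name] = "not importable"
            continue
        # Record the version and the file the module was loaded from,
        # which shows whether a local checkout or an installed copy is used.
        info[name] = (
            getattr(module, "__version__", "unknown"),
            getattr(module, "__file__", None),
        )
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the reported file paths or versions differ between the notebook and the script, the two runs are not actually using the same code, even inside the same conda environment.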
I’ve tried training in both PowerShell and the normal Command Prompt; both result in NaN training loss and validation loss. Also, the progress bar does not update until nearly the end of the epoch, and then it fills in rapidly.
Any ideas what the difference could be?