Hi there!
I was trying to set up a local environment after doing the exercises through Kaggle, but I've run into some issues that I'm finding hard to debug and wrap my head around, so I'm coming to the forum to see if someone has any ideas and can help.
When I run the local environment, it throws the following error:
```
epoch     train_loss  valid_loss  error_rate  time
0         nan         nan         0.000000    00:41

epoch     train_loss  valid_loss  error_rate  time
0         nan         nan         0.000000    00:40

[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```
I've set up an environment with Miniconda and installed all the necessary packages for fastai; I'm running Python 3.12.1 and fastai 2.7.14.
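In case it helps, here is a quick sanity check of the environment I'm running (the try/except is just so it still prints something useful if an import is broken):

```python
import platform

# print interpreter and library versions plus CUDA visibility
print("python:", platform.python_version())
try:
    import torch
    import fastai
    print("torch:", torch.__version__)
    print("fastai:", fastai.__version__)
    print("cuda available:", torch.cuda.is_available())
except ImportError as e:
    print("import problem:", e)
```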
Here is the code:
```python
from fastai.vision.all import *  # learners etc.
from fastai.data.all import *    # datablocks
from PIL import Image
import os

def validate_images(path):
    for img_path in get_image_files(path):
        try:
            with Image.open(img_path) as img:
                img.verify()
        except Exception as e:
            print(f"Invalid image file: {img_path} - {e}")

def main():
    searches = 'hamster', 'guinea_pig'
    path = Path('hamster_or_not')

    for name in searches:
        resize_images(path/name, max_size=400, dest=path/name)
    print("resize images done...")
    print(path)

    for imageType in searches:
        print(imageType)
        for element in os.listdir(path/imageType):
            print(element)

    # Verify path and list image files
    print(f"Verifying path: {path}")
    image_files = get_image_files(path)
    print(f"Found {len(image_files)} image files.")

    # Ensure image files are being found
    if len(image_files) == 0:
        print("No image files found. Please check the directory structure and file permissions.")
        return

    # Validate images
    validate_images(path)

    # Setting up our datablock
    print("Setting up datablock...")
    datablock = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_items=get_image_files,
        splitter=RandomSplitter(valid_pct=0.1, seed=42),
        get_y=parent_label,
        item_tfms=[Resize(192, method='squish')],
    )
    dls = datablock.dataloaders(path, bs=2, verbose=True)

    # check what our dataloader has picked up as categories, should only contain 2 categories
    print(dls.vocab)  # prints ['guinea_pig', 'hamster'], all ok

    learn = vision_learner(dls, resnet18, metrics=error_rate)
    learn.fine_tune(1)

if __name__ == '__main__':
    main()
```
The code runs fine up until the learn.fine_tune() call, where it throws the error above. It finds all the images in their respective folders and prints them to the console.
However, I can pass a single batch to the model by doing the following:
```python
if torch.cuda.is_available():
    learn.model.cuda()

xb, yb = dls.one_batch()
print(f"Batch X shape: {xb.shape}, Batch Y shape: {yb.shape}")

try:
    learn.model.eval()           # Set the model to evaluation mode
    with torch.no_grad():        # Disable gradient calculation
        preds = learn.model(xb)  # Perform a forward pass with a batch of data
        print(f"Predictions shape: {preds.shape}")  # Print the shape of the predictions
except Exception as e:
    print(f"Error during forward pass: {e}")  # Print any errors that occur
```
Note that I am forcing it to use the GPU in this case, and when I do, the single forward pass works:

```
Batch X shape: torch.Size([2, 3, 192, 192]), Batch Y shape: torch.Size([2])
Predictions shape: torch.Size([2, 2])
```
If I remove the learn.model.cuda() line, I get the following error:

```
Batch X shape: torch.Size([2, 3, 192, 192]), Batch Y shape: torch.Size([2])
Error during forward pass: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
```
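That second error is plain PyTorch complaining that the input tensor and the model weights live on different devices. A minimal sketch that reproduces it outside fastai (torch.nn.Linear is just a stand-in for the real model):

```python
import torch

model = torch.nn.Linear(4, 2)  # weights live on the CPU by default
x = torch.randn(3, 4)          # batch on the CPU too
print(model(x).shape)          # same device -> works, prints torch.Size([3, 2])

if torch.cuda.is_available():
    try:
        model(x.cuda())        # CUDA input, CPU weights -> RuntimeError
    except RuntimeError as e:
        print(e)               # the "... should be the same" message
```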
I cannot force it to use the GPU with learn.model.cuda() when I am using learn.fine_tune(); that just throws the first error again.
The datablock in any case seems to be fine; here is the output with verbose=True:
```
Collecting items from hamster_or_not
Found 22 items
2 datasets of sizes 20,2
Setting up Pipeline: PILBase.create
Setting up Pipeline: parent_label -> Categorize -- {'vocab': None, 'sort': True, 'add_na': False}
Setting up after_item: Pipeline: Resize -- {'size': (192, 192), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0} -> ToTensor
Setting up before_batch: Pipeline:
Setting up after_batch: Pipeline: IntToFloatTensor -- {'div': 255.0, 'div_mask': 1}
```
So, as for debugging, I've:

- reduced the batch size
- reduced the sample size
- checked my folder permissions
- checked the labeling
- checked that it finds and picks up the images
- checked that they get loaded into the datablock
- done one forward pass, which works ONLY if I force it to use the GPU with learn.model.cuda()
- confirmed it throws an error if I remove the learn.model.cuda() line
- found that I cannot run learn.fine_tune() regardless of whether I force it to use the GPU with learn.model.cuda()
My thoughts:
Does my computer switch between the CPU and the GPU between epochs/batches?
From what I can gather, the first error seems to stem from multi-GPU processing, but I do not have multiple GPUs running concurrently.
The second error seems to stem from switching hardware mid-training, maybe?
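And since both losses come out as nan, I also want to rule out bad values in the batch itself; something like this on the xb from dls.one_batch() above (shown here with a random stand-in tensor of the same shape):

```python
import torch

xb = torch.randn(2, 3, 192, 192)  # stand-in for the real batch from dls.one_batch()

# both should print False for a healthy batch
print("any nan:", torch.isnan(xb).any().item())
print("any inf:", torch.isinf(xb).any().item())
```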
Any help is greatly appreciated!