Hi there!
I was trying to set up a local environment after doing the exercises through Kaggle, but I've run into some issues that I'm finding hard to debug and wrap my head around, so I'm coming to the forum to see if someone has any ideas and can help.
When I run the local environment, it throws the following error:
```
epoch     train_loss  valid_loss  error_rate  time
0         nan         nan         0.000000    00:41

epoch     train_loss  valid_loss  error_rate  time
0         nan         nan         0.000000    00:40

[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```
I've set up an environment with Miniconda and installed all the necessary packages for fastai; I'm running Python 3.12.1 and fastai 2.7.14.
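In case it helps, here is a quick sanity check of the environment I'm running (the try/except is just so it still prints something useful if an import is broken):

```python
import platform

# print interpreter and library versions plus CUDA visibility
print("python:", platform.python_version())
try:
    import torch
    import fastai
    print("torch:", torch.__version__)
    print("fastai:", fastai.__version__)
    print("cuda available:", torch.cuda.is_available())
except ImportError as e:
    print("import problem:", e)
```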
Here is the code:
```python
from fastai.vision.all import *  # learners etc.
from fastai.data.all import *    # datablocks
from PIL import Image
import os

def validate_images(path):
    for img_path in get_image_files(path):
        try:
            with Image.open(img_path) as img:
                img.verify()
        except Exception as e:
            print(f"Invalid image file: {img_path} - {e}")

def main():
    searches = 'hamster', 'guinea_pig'
    path = Path('hamster_or_not')

    for name in searches:
        resize_images(path/name, max_size=400, dest=path/name)
    print("resize images done...")
    print(path)

    for imageType in searches:
        print(imageType)
        for element in os.listdir(path/imageType):
            print(element)

    # Verify path and list image files
    print(f"Verifying path: {path}")
    image_files = get_image_files(path)
    print(f"Found {len(image_files)} image files.")

    # Ensure image files are being found
    if len(image_files) == 0:
        print("No image files found. Please check the directory structure and file permissions.")
        return

    # Validate images
    validate_images(path)

    # Setting up our datablock
    print("Setting up datablock...")
    datablock = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_items=get_image_files,
        splitter=RandomSplitter(valid_pct=0.1, seed=42),
        get_y=parent_label,
        item_tfms=[Resize(192, method='squish')],
    )
    dls = datablock.dataloaders(path, bs=2, verbose=True)

    # check what our dataloader has picked up as categories, should only contain 2 categories
    print(dls.vocab)  # prints ['guinea_pig', 'hamster'], all ok

    learn = vision_learner(dls, resnet18, metrics=error_rate)
    learn.fine_tune(1)

if __name__ == '__main__':
    main()
```
The code runs fine up until the learn.fine_tune() call, where it throws the error above. It finds all the images in their respective folders and prints them to the console.
However, I can pass a single batch to the model by doing the following:
```python
if torch.cuda.is_available():
    learn.model.cuda()

xb, yb = dls.one_batch()
print(f"Batch X shape: {xb.shape}, Batch Y shape: {yb.shape}")

try:
    learn.model.eval()           # Set the model to evaluation mode
    with torch.no_grad():        # Disable gradient calculation
        preds = learn.model(xb)  # Perform a forward pass with a batch of data
        print(f"Predictions shape: {preds.shape}")  # Print the shape of the predictions
except Exception as e:
    print(f"Error during forward pass: {e}")  # Print any errors that occur
```
Note that I am forcing it to use the GPU in this case, and when I do, the single forward pass works:

```
Batch X shape: torch.Size([2, 3, 192, 192]), Batch Y shape: torch.Size([2])
Predictions shape: torch.Size([2, 2])
```
If I remove the learn.model.cuda() line, I get the following error:

```
Batch X shape: torch.Size([2, 3, 192, 192]), Batch Y shape: torch.Size([2])
Error during forward pass: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
```
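That second error is plain PyTorch complaining that the input tensor and the model weights live on different devices. A minimal sketch that reproduces it outside fastai (torch.nn.Linear is just a stand-in for the real model):

```python
import torch

model = torch.nn.Linear(4, 2)  # weights live on the CPU by default
x = torch.randn(3, 4)          # batch on the CPU too
print(model(x).shape)          # same device -> works, prints torch.Size([3, 2])

if torch.cuda.is_available():
    try:
        model(x.cuda())        # CUDA input, CPU weights -> RuntimeError
    except RuntimeError as e:
        print(e)               # the "... should be the same" message
```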
I cannot force it to use the GPU with learn.model.cuda() when I am using learn.fine_tune(); that just throws the first error again.
The datablock in any case seems to be fine; here is the output with verbose=True:
```
Collecting items from hamster_or_not
Found 22 items
2 datasets of sizes 20,2
Setting up Pipeline: PILBase.create
Setting up Pipeline: parent_label -> Categorize -- {'vocab': None, 'sort': True, 'add_na': False}
Setting up after_item: Pipeline: Resize -- {'size': (192, 192), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0} -> ToTensor
Setting up before_batch: Pipeline:
Setting up after_batch: Pipeline: IntToFloatTensor -- {'div': 255.0, 'div_mask': 1}
```
So, as for debugging, I've:

- reduced the batch size
- reduced the sample size
- checked my folder permissions
- checked the labeling
- checked that it finds and picks up the images
- checked that they get loaded into the datablock
- done one forward pass, which works ONLY if I force it to use the GPU with learn.model.cuda()
- confirmed it throws an error if I remove the learn.model.cuda() line
- found that I cannot run learn.fine_tune() regardless of whether I force it to use the GPU with learn.model.cuda()
My thoughts:
Does my computer switch between the CPU and the GPU between epochs/batches?
From what I can gather, the first error seems to stem from multi-GPU processing, but I do not have multiple GPUs running concurrently.
The second error seems to stem from switching hardware mid-training, maybe?
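And since both losses come out as nan, I also want to rule out bad values in the batch itself; something like this on the xb from dls.one_batch() above (shown here with a random stand-in tensor of the same shape):

```python
import torch

xb = torch.randn(2, 3, 192, 192)  # stand-in for the real batch from dls.one_batch()

# both should print False for a healthy batch
print("any nan:", torch.isnan(xb).any().item())
print("any inf:", torch.isinf(xb).any().item())
```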
Any help is greatly appreciated!