Unet learner failing with a CUDA device assert

I have a custom datastore that essentially loads a list of rows of type Row. Each Row instance has corresponding annotations of type AnnotationSpec.

class Row(object):
    def __init__(self, patient, image_path, left_eye, annotation_specs=None):
        self.patient = patient
        self.image_path = image_path
        self.left_eye = left_eye
        self.annotation_specs = annotation_specs

# Defines an annotation with a list of vertices
class AnnotationSpec(object):
    def __init__(self, annotation_path, class_id, class_name, vertices):
        self.annotation_path = annotation_path
        self.class_id = class_id
        self.class_name = class_name
        self.vertices = vertices

I defined a DataBlock which looks like this:

# Loads a list of rows, each with an `image_path` and a list of `annotation_specs` associated.
rows = load_rows()
print('Loaded %s records' % len(rows))

def get_items(path):
  return rows

def load_image(row):
  # cv2.imread's second argument is an IMREAD_* flag, not a color-conversion
  # code; convert BGR -> RGB explicitly with cvtColor.
  image = cv2.imread(row.image_path)
  image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
  return image

def _fill_vertices(mask_data, vertices, value):
  if vertices is not None:
    normalized = np.array(vertices)
    coordinates = np.multiply(normalized, np.array([SRC_IMAGE_WIDTH, SRC_IMAGE_HEIGHT]))
    coordinates = np.asarray(coordinates, dtype='int32')
    cv2.fillConvexPoly(mask_data, coordinates, value)  # single-channel mask, so a scalar value

def load_masks(row):
  mask_data = np.zeros((SRC_IMAGE_HEIGHT, SRC_IMAGE_WIDTH), dtype='int32')
  eyelids = list(filter(lambda spec: spec.class_name == 'eyelid', row.annotation_specs))
  pupils = list(filter(lambda spec: spec.class_name == 'pupil', row.annotation_specs))

  # Draw eyelids first
  if len(eyelids) > 0:
    eyelid = eyelids[0]
    _fill_vertices(mask_data, eyelid.vertices, 2)
  
  # Overlay Pupils on Eyelids
  if len(pupils) > 0:
    pupil = pupils[0]
    _fill_vertices(mask_data, pupil.vertices, 3)

  return np.asarray(mask_data, dtype='uint8')

block = DataBlock(
  # I have 3 codes 0 = background, 2 = eyelid, 3 = pupil
  blocks=(ImageBlock(), MaskBlock((0, 2, 3))),
  get_items=get_items,
  splitter=RandomSplitter(),
  getters=[
    load_image, load_masks
  ],
)

loaders = block.dataloaders(Path('.'), bs=4)
loaders.show_batch(max_n=4)

This works as expected, and my loaders instance loads the images.

Now when I try and train a model like so:

learner = unet_learner(loaders, resnet34)
learner.fine_tune(1)

I run into an error:

THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=26 error=710 : device-side assert triggered
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [21,0,0] Assertion `t >= 0 && t < n_classes` failed.

Any ideas on what might be going on?

Thanks,
Rahul

Right away, you have 3 codes that aren't in order, i.e. 0, 2, 3. This will break training the model. Try making them contiguous from zero, i.e. 0, 1, 2. But that's just my first impression.
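For instance, you could remap the mask values before returning them from `load_masks`. This is a minimal sketch with a hypothetical `remap_codes` helper (not part of fastai), assuming your mask pixels only contain the values 0, 2, and 3:

```python
import numpy as np

def remap_codes(mask):
    """Remap sparse mask values (0, 2, 3) to contiguous codes (0, 1, 2)."""
    mapping = {0: 0, 2: 1, 3: 2}
    out = np.zeros_like(mask)
    for old, new in mapping.items():
        out[mask == old] = new
    return out

mask = np.array([[0, 2],
                 [3, 0]], dtype='uint8')
print(remap_codes(mask))  # [[0 1]
                          #  [2 0]]
```

With this remapping, the DataBlock would use `MaskBlock((0, 1, 2))` (or equivalently `MaskBlock(codes=['background', 'eyelid', 'pupil'])`).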

Thanks for your quick answer. OOC, why do they have to be in order?

FWIW, I also tried codes = (1, 2, 3). That does not help either.

It’s always been a thing with fastai. I don’t know exactly why, just that it is so. And you need to start from zero. See the advice in this thread.

That worked!!

Thank you @muellerzr. I am still very curious to know why. :slight_smile:

@sgugger Thoughts on making the loader fail early when codes are not contiguous and don't start from 0?
I can try submitting a change.
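For the record, the check itself would be tiny. A sketch with a hypothetical `check_codes` validator (the helper name and where it would hook into fastai are assumptions, not existing API):

```python
def check_codes(codes):
    """Fail fast if segmentation codes aren't contiguous integers from 0."""
    expected = list(range(len(codes)))
    if sorted(codes) != expected:
        raise ValueError(
            "Mask codes must be contiguous from 0; got %s, expected %s"
            % (sorted(codes), expected))

check_codes((0, 1, 2))    # OK
# check_codes((0, 2, 3))  # raises ValueError
```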


I’ve been trying to figure out a good way to do so. The simple answer is to grab a batch of data and make sure it all aligns, but this could be bad if, say, we have classes that don’t show up in the first batch. (Would also love your thoughts on the best approach here, Sylvain.)

I don’t see a way of doing this that would work with the rest of the API: DataLoaders is a general API that is used across applications.

That was my worry. Perhaps this could instead be included in the segmentation documentation/example as a best-practice warning calling out the behavior? :slight_smile: (I’ll make the PR if so :slight_smile: )

In the segmentation part of tutorial.vision, maybe?


Sounds like a good spot for it. I’ll make a PR by EOD.

Could you please elaborate on why this is required, though?

The DataLoaders is general. What this means is that it isn’t geared toward one specific application, and this modification would only apply to this very specific case while going unused by the rest of the applications. We can’t wrap it into, say, a type transform, because it only needs to be done once at the very beginning, not per se every single time (which would also add a time overhead).

I mean, there could be a type transform that runs silently in the background checking the codes, but I don’t think that’s the best way to deal with a user error (which is what we have here) rather than a library error.

Sorry for being unclear. I was asking about the continuous values for the MaskBlock.

I completely understand that DataLoaders is a general API & should not enforce this.

If I had to guess, it’s the behavior of the loss function specifically. With a regular CategoryBlock we encode our classes from 0 to n, and the same needs to be done here: we need to tell the model, as one contiguous stretch, how many values to predict. I could explicitly say that values 0-255 can show up (say, in an instance where I have a particular class that is pixel 255), but if I don’t explicitly say this, the model will get classes in its ground truth that weren’t explicitly stated to be there, thus the CUDA error. (You can recreate this with image classification too, by not including in your training set certain classes that show up in your validation set and trying to train on them.)
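You can reproduce the underlying failure on CPU, where PyTorch raises a plain Python error instead of a device-side assert. A minimal sketch with plain `nn.CrossEntropyLoss`, not fastai-specific:

```python
import torch
import torch.nn as nn

loss = nn.CrossEntropyLoss()
logits = torch.randn(1, 3, 4, 4)       # model output: 3 classes per pixel
good = torch.randint(0, 3, (1, 4, 4))  # targets in [0, 2] -- valid
bad = good.clone()
bad[0, 0, 0] = 3                       # target 3 is outside [0, n_classes-1]

loss(logits, good)     # fine
try:
    loss(logits, bad)  # on GPU this is the device-side assert above
except (IndexError, RuntimeError) as e:
    print("out-of-range target:", e)
```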

This is why the issue isn’t thrown during databunch creation but only during training, I believe.


That makes sense. Thanks !

No problem :slight_smile: It took me a while to figure out this behavior and what was really happening in the background, as I found it very strange too, but when you boil it down it does make sense in the end!

Adding to what Zach said: since we use a version of CrossEntropyLoss, the criterion expects a class index in the range [0, C-1] (docs).
Why can’t we call something like the setup for CategoryBlock, i.e. CategoryMap formation?
Like we spoke about yesterday, a warning for the image data type would be great too.

I’ll try to see if I can creatively think of something. It’s taking me a bit longer over here, so expect to hear something by Wednesday at the latest (many moving parts on my end currently).
