Questions on semantic segmentation


#1

I found a paper called LinkNet. It was trained on the CamVid dataset, which contains 11 classes and is split into 367 training, 101 validation and 233 testing images. The paper does not mention applying any data augmentation.

What if I only need to segment 2 classes? Does that mean the dataset could be even smaller? Some use cases I can imagine are

a : Pedestrian detection
b : License plate detection
c : Smoke detection
d : etc

Semantic segmentation works so well and requires a much smaller training set that it sounds like an ideal candidate for something like license plate or smoke detection. We only need two classes (license plate vs. not license plate, smoke vs. no smoke), but I wonder why I can’t find a paper on this topic (papers on pedestrian detection do exist). Is it because

1 : It is too easy compared with the datasets researchers are trying to challenge
2 : Infrared cameras work better than any existing computer vision “hacks” for license plate detection
3 : Semantic segmentation works very poorly for smoke or license plate detection

ps : I haven’t studied lesson 14 yet, still on lesson 10


(Constantin) #2

@tham: If I understand the question correctly, you’d like to fine-tune the net to detect only two classes. In this case I would use the ground truth segmentation of, e.g., CamVid and train a network to detect all classes in CamVid. Once that is done you could pop off the top and re-train the classifier to detect your new classes. Pretty much as described for image classification in lesson 7, just with a segmentation net.


(alex) #3

Lesson 14 shows a segmentation example on this dataset using the Tiramisu architecture.

I refactored it to work on a 2-class pathology segmentation problem and it worked no problem.

If you’re talking about just subsetting 2 classes like pedestrian vs no pedestrian on the CamVid dataset, why don’t you just group all the other classes as ‘not pedestrian’ and use it as is?
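A minimal sketch of that grouping, assuming the labels are stored as RGB images with one color per class. The function name `binary_mask` and the color `(64, 64, 0)` used for the pedestrian class are illustrative assumptions; check the color against your copy of the dataset.

```python
import numpy as np

def binary_mask(label_rgb, target_color):
    """Return an (H, W) uint8 mask: 1 where the pixel equals target_color, else 0."""
    label_rgb = np.asarray(label_rgb)
    # A pixel belongs to the class only if all three channels match.
    return (label_rgb == np.asarray(target_color)).all(axis=-1).astype(np.uint8)

# Toy 2x2 label image; (64, 64, 0) stands in for the pedestrian color (assumption).
PEDESTRIAN = (64, 64, 0)
toy_label = np.array([[[64, 64, 0], [128, 128, 128]],
                      [[0, 0, 0], [64, 64, 0]]], dtype=np.uint8)
toy_mask = binary_mask(toy_label, PEDESTRIAN)
print(toy_mask)
```

Every pixel that is not the target color ends up in class 0, so the rest of the training code can stay as it is with the number of classes set to 2.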


#4

Thanks for your suggestion, it sounds interesting. I never thought of fine-tuning a segmentation network, because a pre-training dataset like CamVid is quite small.

Do you mean replace the top and train the network as a classification network?

I am amazed that a segmentation network trained on such a small dataset works so well, even without leveraging a pre-trained network.


(alex) #5

No, train from scratch with 2 classes rather than the full set from CamVid. The training set was around 500 images, I think, plus data augmentation (using a special generator covered in lecture 14 that produces augmented input and mask images).

I’d imagine transfer learning would help but couldn’t figure out how to make it work in the Tiramisu case - training from scratch even on that size dataset worked decently well…
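The key point of such a generator is that the image and its mask must get exactly the same random transform. The sketch below is not the lesson 14 generator; it is a small NumPy illustration with a hypothetical helper `augment_pair` that applies one shared random crop and horizontal flip.

```python
import numpy as np

def augment_pair(image, mask, crop_hw, rng):
    """Apply the same random crop and horizontal flip to an image and its mask."""
    crop_h, crop_w = crop_hw
    h, w = mask.shape[:2]
    # Draw the crop offset once and reuse it for both arrays.
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    image = image[top:top + crop_h, left:left + crop_w]
    mask = mask[top:top + crop_h, left:left + crop_w]
    # Flip both or neither, so pixels and labels stay aligned.
    if rng.random() < 0.5:
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    return image, mask

rng = np.random.default_rng(0)
image = np.arange(6 * 8 * 3).reshape(6, 8, 3)
mask = image[..., 0] // 3  # unique id per pixel, so alignment can be checked
aug_image, aug_mask = augment_pair(image, mask, (4, 4), rng)
```

In the toy data each mask value encodes the pixel it came from, so comparing `aug_image[..., 0] // 3` with `aug_mask` confirms the pair is still aligned after augmentation.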


#6

I implemented LinkNet (here).

trainer.py, test.py and process_camvid.py are still quite messy.

I trained for 800 epochs on 368 images with RMSprop, starting from a learning rate of 0.0005, with a training size of 480 x 320. Average accuracy on the test set (333 images) reached 0.891675326947, measured the same way as in lesson 14 ((pred == label).mean()). Before testing, I converted every color that does not exist in the 11 labels to (0, 0, 0).
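The two steps described above (mapping unknown colors to void, then measuring pixel accuracy) can be sketched as follows. This is an illustration and not the code from the repository; `colors_to_indices`, `pixel_accuracy` and the toy colors are hypothetical, and void is assumed to be class 0.

```python
import numpy as np

def colors_to_indices(label_rgb, label_colors, void_index=0):
    """Map an (H, W, 3) color label to (H, W) class indices; unknown colors become void."""
    indices = np.full(label_rgb.shape[:2], void_index, dtype=np.int64)
    for idx, color in enumerate(label_colors):
        # A pixel gets class idx only if all three channels match the class color.
        indices[(label_rgb == np.asarray(color)).all(axis=-1)] = idx
    return indices

def pixel_accuracy(pred, label):
    """Fraction of pixels where prediction and label agree: (pred == label).mean()."""
    return float((pred == label).mean())

toy_colors = [(0, 0, 0), (128, 128, 128), (64, 64, 0)]  # index 0 is void
toy_label_rgb = np.array([[[128, 128, 128], [64, 64, 0]],
                          [[1, 2, 3], [0, 0, 0]]], dtype=np.uint8)  # (1, 2, 3) is unknown
toy_label_idx = colors_to_indices(toy_label_rgb, toy_colors)
toy_pred_idx = np.array([[1, 2], [0, 1]])
toy_acc = pixel_accuracy(toy_pred_idx, toy_label_idx)
```

Note that pixel accuracy counts void pixels like any other class, so a large void or road area can dominate the number; per-class IoU is less sensitive to that.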

Here are examples segmented by LinkNet. I forgot to include the color of Sidewalk, so the following results come from training with only 11 labels (Void, Sky, Building, Pole, Road, Tree, SignSymbol, Fence, Car, Pedestrian, Bicyclist).

real labels

predicted labels

I guess my implementation is correct, or very close to correct.

Loss graph

I will do more experiments on this network, like Adam vs RMSprop vs SGD and the effect of training size (the bigger the better?), and write them up on my blog.

After that I intend to follow the advice of @machinethink. Is anyone interested in how to develop a mobile app with the help of the OpenCV dnn module and Qt5?


(Yash Katariya) #7

These results look awesome, what was the accuracy that you got?


#8

If you are asking about the accuracy of my old projects: this is my first project on segmentation, so I cannot give you any, because they never existed.

If you are asking about the accuracy of the LinkNet model, the average accuracy on the test set (333 images) is 0.891675326947.


#9

Is this a correct way to find the IoU of semantic segmentation?

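import glob
import os

import numpy as np
from PIL import Image

def test_images_accuracy(model, raw_img_folder, label_folder, label_colors):
    # get all of the CamVid images from the test set
    raw_imgs_loc = sorted(glob.glob(os.path.join(raw_img_folder, "*.png")))
    iou = np.zeros(len(label_colors))
    # number of images in which each class occurs (in the label or the prediction)
    img_count = np.zeros(len(label_colors))
    mean_acc = 0.0
    for i, img_loc in enumerate(raw_imgs_loc):
        print(i, ": calculate accuracy of image:", img_loc)
        # convert the image to a color segmentation map of shape (H, W, 3)
        segmap = to_segmap(model, img_loc, label_colors)
        # this is the true label
        name = os.path.splitext(os.path.basename(img_loc))[0]
        label = np.array(Image.open(os.path.join(label_folder, name + "_L.png")))
        # pixel accuracy: a pixel is correct only if all three channels match
        mean_acc += (segmap == label).all(axis=-1).mean()
        for c, color in enumerate(label_colors):
            # per-pixel masks; comparing without .all(axis=-1) would give per-channel masks
            real_mask = (label == color).all(axis=-1)
            predict_mask = (segmap == color).all(axis=-1)
            TP = (real_mask & predict_mask).sum()  # true positive
            FP = predict_mask.sum() - TP  # false positive
            FN = real_mask.sum() - TP  # false negative
            if TP + FP + FN > 0:  # skip classes absent from this image
                iou[c] += TP / float(TP + FP + FN)
                img_count[c] += 1

    # per-class IoU averaged over the images where the class occurs, and mean pixel accuracy
    return iou / np.maximum(img_count, 1) * 100, mean_acc / len(raw_imgs_loc)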
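The function above averages a per-image IoU. Benchmarks usually report IoU computed once over the whole test set, by summing true positives, false positives and false negatives across all images before dividing, and the two can give quite different numbers for rare classes. A minimal sketch of the dataset-level version, assuming predictions and labels are already (H, W) class-index maps; the helper names are hypothetical:

```python
import numpy as np

def confusion_matrix(pred, label, num_classes):
    """Count pixels per (true class, predicted class) for index maps of equal shape."""
    flat = label.ravel() * num_classes + pred.ravel()
    return np.bincount(flat, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def iou_from_confusion(cm):
    """Per-class IoU = TP / (TP + FP + FN), computed from an accumulated confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as the class, but labelled otherwise
    fn = cm.sum(axis=1) - tp  # labelled as the class, but predicted otherwise
    denom = tp + fp + fn
    # Classes that never occur get NaN so they can be excluded from the mean.
    return np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)

# Toy test set of two (prediction, label) pairs with 3 classes; class 2 never occurs.
pairs = [
    (np.array([[0, 1], [1, 1]]), np.array([[0, 1], [0, 1]])),
    (np.array([[0, 0], [1, 1]]), np.array([[0, 0], [1, 1]])),
]
total = np.zeros((3, 3), dtype=np.int64)
for pred, label in pairs:
    total += confusion_matrix(pred, label, 3)
per_class_iou = iou_from_confusion(total)
mean_iou = np.nanmean(per_class_iou)
```

Accumulating the confusion matrix also avoids the division by zero that per-image IoU runs into when a class is missing from an image.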

The results are very weird: the model performs too well compared with the LinkNet paper. The left value is the result of my model, the right value is the result from the paper.

iou of Sky is 84.4839562309 vs 92.8
iou of Building is 81.0601742644 vs 88.8
iou of Sidewalk is 60.76727697 vs 88.4
iou of Column_Pole is 75.3379892824 vs 37.8
iou of Road is 85.3882216306 vs 96.8
iou of Tree is 82.8543640992 vs 85.3
iou of SignSymbol is 78.8773600477 vs 41.7
iou of Fence is 79.7187205135 vs 57.8
iou of Car is 76.1308416568 vs 77.6
iou of Pedestrian is 76.3787193096 vs 57.0
iou of Bicyclist is 61.4157209066 vs 27.2
average iou: 76.579 vs 68.3

Possible causes

1 : Overfitting. Since the CamVid images are very similar and the dataset is small, and I trained for 800 epochs, this would not be a surprise. I tested the model on a YouTube video and it did not perform as well as on CamVid
2 : My network architecture is wrong
3 : The paper does not use (?) data augmentation (I use random_crop and horizontal flip)
4 : My hyperparameters differ from the paper’s: the paper trains with 768 * 512, I train with 128 * 128

By the way, how do I compute iIoU?

Edit : Does anyone have the Cityscapes dataset? I do not have permission to download it.


#10

I found the problem: my weighting was wrong. Now the training results are closer to the paper.
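The post does not say which weighting was wrong. If it refers to the per-class loss weights, one scheme used for this kind of network (described in the ENet paper) is w_class = 1 / ln(c + p_class), where p_class is the fraction of training pixels belonging to the class and c is a constant such as 1.02. The sketch below is an assumption about what is meant, not the author's fix; `log_class_weights` is a hypothetical helper.

```python
import numpy as np

def log_class_weights(label_maps, num_classes, c=1.02):
    """Per-class weights 1 / ln(c + p), with p the pixel frequency of each class."""
    counts = np.zeros(num_classes, dtype=np.int64)
    for label in label_maps:
        # label is an (H, W) array of class indices in [0, num_classes)
        counts += np.bincount(label.ravel(), minlength=num_classes)
    p = counts / counts.sum()
    return 1.0 / np.log(c + p)

# Toy training labels: class 0 covers 7 of 8 pixels, class 1 covers 1 of 8.
toy_labels = [np.array([[0, 0], [0, 1]]), np.array([[0, 0], [0, 0]])]
toy_weights = log_class_weights(toy_labels, 2)
```

Rare classes get larger weights, so mistakes on them contribute more to the loss. The weights must be computed from class-index maps in the same class order the loss function uses.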


(Pietro La Torre) #11

Hi everyone,
Can you explain how to map the CamVid example dataset (shown by Jeremy) to a subset of classes?
There should be a place where all pixel codes are mapped to classes, but I can’t find it.
I tried to keep all the positions in the “codes” array loaded from file (since I understood that they correspond to pixel values in the mask) and replace them with super-classes, e.g. people, objects, vehicles…
Can you help me?
Thanks
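One way to do this kind of remapping, assuming the mask stores class indices and that position i in the “codes” array names the class with pixel value i, is a lookup table applied to the mask itself. Renaming entries in “codes” alone would leave the pixel values in the mask unchanged. The class list and the super-class grouping below are illustrative, not the real CamVid codes file.

```python
import numpy as np

# Illustrative subset of class names; index in this list == pixel value in the mask.
codes = ["Sky", "Building", "Car", "Pedestrian", "Bicyclist"]

# Hypothetical grouping of classes into super-classes.
super_of = {
    "Sky": "background",
    "Building": "objects",
    "Car": "vehicles",
    "Pedestrian": "people",
    "Bicyclist": "people",
}

super_codes = sorted(set(super_of.values()))
# lut[old_index] gives the new super-class index.
lut = np.array([super_codes.index(super_of[name]) for name in codes])

toy_mask = np.array([[0, 3], [4, 2]])  # Sky, Pedestrian / Bicyclist, Car
toy_super_mask = lut[toy_mask]         # fancy indexing remaps every pixel at once
print(super_codes)
print(toy_super_mask)
```

The remapped masks then go with `super_codes` as the new class list, and the number of output classes of the network has to be changed to match.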