CNN in combination with KNN

GertjanBrouwer · July 24, 2017, 11:24am

Hi,

I am currently exploring the idea of using a CNN and a KNN (K-Nearest Neighbours) in conjunction with each other. My problem formulation is: I want to know which type of furniture is in a picture. The way I want to approach this is by having multiple CNNs, each trained on a large dataset so they can make coarse predictions. What I mean by that is that they can recognize whether something is a chair, a couch or a table. Then I want another CNN that, based on that output, predicts the pattern that is used (e.g. stripes, blocks, dots), and after that I want to determine the most used color. But then there is a problem: even after predicting these things, the set of possible matches can still be large, e.g. a couch with a stripe pattern that is primarily black & white. There might be thousands of these, and that’s where I got the idea for a KNN. I think that a KNN, starting from the prediction (couch, stripe, black & white), can give better results. For example, if my prediction gives back 30,000 images, I can let the KNN return the most similar ones.

My question is: is this a good way to create a furniture recognizer, and will my prediction be fast enough if I use: CNN(Type: Couch, Chair, Table, …) -> CNN(Pattern: Stripe, Dots, …) -> CNN(Color: Brown, Blue, Red, …) -> KNN(Get most similar image)?

Thanks for reading all this.

Kind Regards,

Gertjan

markovbling · July 24, 2017, 12:53pm

I’d suggest getting the dense activations from your images and then training separate models to detect furniture type, pattern, colour.

You could then do KNN on the images that match the type, pattern and colour of your input image.

Alternatively, you can just do KNN on the dense activations themselves, but you might end up with one feature dominating, e.g. it’ll find furniture of the same type and pattern but not the same colour.
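
A minimal sketch of the two options above, using NumPy and random stand-in data. The activation vectors, the attribute labels and the helper name `nearest_neighbours` are all assumptions made for illustration; in practice the activations would come from a CNN and the labels from the per-attribute models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in catalogue: 1,000 images, each described by a 64-dim activation
# vector and by predicted attribute labels (all hypothetical data).
n_images, dim = 1000, 64
activations = rng.normal(size=(n_images, dim))
furniture_type = rng.integers(0, 3, size=n_images)  # 0=chair, 1=couch, 2=table
pattern = rng.integers(0, 3, size=n_images)         # 0=stripes, 1=blocks, 2=dots
colour = rng.integers(0, 3, size=n_images)          # 0=brown, 1=blue, 2=red


def nearest_neighbours(query, candidates, k):
    """Return indices (into candidates) of the k rows closest to query."""
    distances = np.linalg.norm(candidates - query, axis=1)  # Euclidean distance
    return np.argsort(distances)[:k]


# Query image: reuse catalogue image 0 so it has both activations and labels.
q = 0
query_vec = activations[q]

# Option 1: keep only images whose type, pattern and colour match, then do KNN.
mask = (
    (furniture_type == furniture_type[q])
    & (pattern == pattern[q])
    & (colour == colour[q])
)
candidate_ids = np.flatnonzero(mask)
local = nearest_neighbours(query_vec, activations[candidate_ids], k=5)
filtered_matches = candidate_ids[local]

# Option 2: KNN on the activations of the whole catalogue, no filtering.
unfiltered_matches = nearest_neighbours(query_vec, activations, k=5)
```

With option 1 every returned image shares all three attributes with the query by construction; with option 2 nothing enforces that, which is the "one feature dominating" risk mentioned above.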

Hope that helps?

GertjanBrouwer · July 24, 2017, 2:14pm

So that means I would have a CNN that gets the dense activations, which you feed into: CNN(Type: Couch, Chair, Table, …) -> CNN(Pattern: Stripe, Dots, …) -> CNN(Color: Brown, Blue, Red, …), so you won’t have to input the image itself into the CNNs?

markovbling · July 24, 2017, 2:17pm

I’m not sure what you mean. Have a single CNN to get activations from the images. Then fit a separate model for each feature that learns to map the activations to that feature’s classes.

e.g. CNN -> Activations
then fit models:

Activations -> Furniture Type
Activations -> Pattern
Activations -> Colour
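
A rough sketch of that layout, assuming the activations are already available as a NumPy array. Everything here is made up for illustration: the random "activations", the synthetic labels, and the small softmax-regression helper `fit_softmax_regression`, which stands in for whatever model you would actually fit per attribute.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for CNN activations: 300 images, 16 numbers each.
X = rng.normal(size=(300, 16))


def make_labels(n_classes):
    """Synthetic labels that are a linear function of the activations."""
    return np.argmax(X @ rng.normal(size=(16, n_classes)), axis=1)


labels = {
    "furniture_type": make_labels(3),
    "pattern": make_labels(3),
    "colour": make_labels(3),
}


def fit_softmax_regression(X, y, n_classes, lr=0.5, steps=500):
    """Multinomial logistic regression trained with plain gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(X)  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(model, X):
    W, b = model
    return np.argmax(X @ W + b, axis=1)


# One shared set of activations, one independent model per attribute.
models = {name: fit_softmax_regression(X, y, 3) for name, y in labels.items()}
accuracy = {
    name: float(np.mean(predict(models[name], X) == y))
    for name, y in labels.items()
}
```

The three models never see the images, only `X`, so the expensive CNN pass happens once regardless of how many attributes are predicted.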

GertjanBrouwer · July 24, 2017, 2:19pm

Ah, I see. Is there a reason to get the activations first? Is it to avoid having to input the image into every CNN?

markovbling · July 24, 2017, 2:52pm

You don’t need to train multiple CNNs - just use the activations from a single CNN.

GertjanBrouwer · July 24, 2017, 2:55pm

But I still need CNNs to give me furniture type, pattern and color?

markovbling · July 24, 2017, 3:59pm

You can train a CNN to detect furniture type and train a separate CNN to detect pattern and yet another separate CNN to detect colour. And if you had enough training data, you could train those CNNs from scratch.

Instead, you can use a pre-trained CNN like VGG trained on ImageNet to get some intermediate representation of your images - for example, the activations of one of the later dense layers. Take your images and feed them through the CNN to get those activations, and then train your own models on top of those activations to detect each of the things you want to detect.

GertjanBrouwer · July 24, 2017, 4:32pm

Yeah, exactly, that was my interpretation of what you said; I just understood it wrong at first. But that is indeed what I am going to do: cut VGG16 in half, get that high-dimensional vector and use it as input to my other CNNs.

markovbling · July 24, 2017, 4:58pm

Don’t see why the other models should be CNNs. CNN is useful for image data (and other data types where there is a kind of spatial correlation). Once you’ve got the activations, you’ve got numerical data and can use a simple algorithm like logistic regression or a tree method or a traditional neural net.

GertjanBrouwer · July 24, 2017, 5:00pm

Ah, I see. That’s the part I did not understand yet. So it would be possible to use my KNN there too. I will go and look into it a bit more. Thanks a lot anyway @markovbling

markovbling · July 24, 2017, 10:31pm

Sure - can definitely use your KNN there - just run your image through a CNN to get activations then do KNN with the activations

The lesson on fine-tuning shows you how to chop off layers of the network to add your own new sigmoid layer. Just pop off layers (literally, it’s model.pop()) until the last remaining layer is a dense layer; calling predict on the truncated CNN then gives you the dense activations.
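
To make the "pop layers, then predict" idea concrete without downloading any weights, here is a toy NumPy stand-in. It is not the Keras API: the network is just a Python list of layer functions with arbitrary sizes, and the helpers `dense` and `predict` are hypothetical names. The list's `pop()` plays the role that `model.pop()` plays in the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def dense(weights, bias, activation):
    """Build a layer function computing activation(x @ weights + bias)."""
    return lambda x: activation(x @ weights + bias)


# Toy "network": a list of layers applied in order.
layers = [
    dense(rng.normal(size=(32, 16)), np.zeros(16), relu),   # stand-in for the conv stack
    dense(rng.normal(size=(16, 8)), np.zeros(8), relu),     # dense layer we want to read out
    dense(rng.normal(size=(8, 4)), np.zeros(4), softmax),   # original classifier head
]


def predict(layers, x):
    """Run x through every layer that is still in the list."""
    for layer in layers:
        x = layer(x)
    return x


images = rng.normal(size=(5, 32))            # 5 fake flattened "images"
class_probs = predict(layers, images)        # shape (5, 4): original class predictions

layers.pop()                                 # drop the softmax head
dense_activations = predict(layers, images)  # shape (5, 8): features for KNN etc.
```

After the pop, the same `predict` call returns the 8 dense activations per image instead of 4 class probabilities, and those are the vectors you would hand to KNN.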

Keen to see / hear about the finished product!

rforgione · July 26, 2017, 12:32am

Hey @markovbling – do you have any good papers / books you can point me to digging into performance coming directly from a CNN versus using CNN activations as inputs to a more traditional classifier? I’ve heard anecdotally that passing activation layer outputs to another model has been shown to outperform predictions that come directly from the CNN (at least for certain applications), but I’ve tried this a few times and haven’t been successful. I would love to learn more about when this technique is most impactful.

Also, I may be jumping the gun here – I’m only on lesson 4 of part I, so if this gets discussed later in the lectures I’m happy to circle back once I’ve viewed them!

Surya501 · July 26, 2017, 1:53am

Check out Lectures 8 & 9, where activations from intermediate layers are used to selectively transfer style or content to implement artistic style transfer.

The DenseNet architecture builds on this idea: it concatenates activations from all layers and uses them to classify images, and it has shown state-of-the-art performance when the dataset is small. This is covered in lecture 14 (or 13, I forget the exact one).

rforgione · July 26, 2017, 3:09am

Awesome – thanks @Surya501!

markovbling · July 26, 2017, 10:55am

Hey @rforgione, my understanding is that you have 2 choices in using a CNN to classify an image:

Choice 1
Take an existing pre-trained CNN such as VGG and chop off the softmax layer to get at one of the dense layers that follow the convolutional layers. Now when you call predict on a new image, you get the activations.

You could (not necessary but as a thought experiment) save these to a CSV and view them as a replacement for your images. Instead of images, you now have activations. You still need to do your final task which is to classify your image and you can build a classifier however you like e.g. feed the activations into a logistic regression or a random forest.

Alternatively, you could use Keras to fit a vanilla neural network (an MLP, not convolutional) on top of your activations that ends in a softmax layer, and use that as your final classifier.
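
The CSV thought experiment from Choice 1 can be sketched in a few lines, assuming the activations are already in a NumPy array. The random array and the file name `activations.csv` are placeholders; real activations would come from calling predict on the truncated CNN.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations for 10 images, 8 numbers each.
activations = rng.normal(size=(10, 8))

# Write one row per image, then read the file back as if it were the dataset.
path = os.path.join(tempfile.mkdtemp(), "activations.csv")
np.savetxt(path, activations, delimiter=",")
loaded = np.loadtxt(path, delimiter=",")
```

From here on, `loaded` is an ordinary numeric table, so any tabular classifier can be trained on it without touching the images again.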

Choice 2
Use a pre-trained CNN like in choice 1 but don’t chop off any dense layers except the final softmax layer, which you’ll replace with a layer for your problem so that it has the correct number of output classes (like cats & dogs, where you replace the final layer that predicts the 1000 classes with a layer that predicts just 2 classes).

Then set trainable=False for the convolutional layers and re-train only the dense layers on your image data.

Important note on compositionality
Notice that if you use Keras to add the same number of dense layers and softmax layers as you chopped off then that’s the same as if you never chopped them off in the first place.

Chopping off the dense layers and then adding them back doesn’t change anything (assuming the convolutional layers are not trainable) - the convolutional part that gives you the activations and the dense layers you add on top of those activations are compositional.
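
A small numerical check of that compositionality claim, using plain matrices as toy stand-ins for the frozen convolutional part and the dense head. The sizes and the names `conv_part`, `dense_head` and `full_model` are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "convolutional" part and a dense head, both as plain matrices.
W_conv = rng.normal(size=(32, 16))
W_dense = rng.normal(size=(16, 3))


def conv_part(x):
    return np.maximum(x @ W_conv, 0.0)  # frozen feature extractor


def dense_head(a):
    return a @ W_dense                  # layers sitting on top of the activations


def full_model(x):
    return dense_head(conv_part(x))     # the un-chopped network


images = rng.normal(size=(4, 32))

# Route A: run the whole network end to end.
direct = full_model(images)

# Route B: chop, store the activations, then apply the same head later.
stored_activations = conv_part(images)
via_activations = dense_head(stored_activations)

same = np.allclose(direct, via_activations)
```

Both routes give the same outputs because the full network is, by definition, the head applied to the activations. This only holds while the convolutional weights stay fixed.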

Multiple Classifiers
Say, for example, your images are of furniture and you know for each image the following 3 different attributes:

furniture type (chair / sofa / table)
furniture material (wood / leather / fabric)
furniture colour (black / brown / red)

Now if you go with choice 2 to classify a new image for each of the 3 attributes, you will have to recompute the convolutional layers for each classifier.

If instead, you go with choice 1, you only have to compute the convolutional outputs once and then fit 3 different classifiers on the same activations for each of your 3 attributes you’re trying to predict.
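
A toy illustration of the cost difference, counting how often the expensive part runs. It assumes the convolutional weights are frozen and shared across the three classifiers; the matrices, the counter and the attribute heads are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
W_conv = rng.normal(size=(32, 16))
heads = {
    "type": rng.normal(size=(16, 3)),
    "material": rng.normal(size=(16, 3)),
    "colour": rng.normal(size=(16, 3)),
}
calls = {"conv": 0}  # mutable counter for the expensive part


def conv_part(x):
    """Stand-in for the expensive convolutional layers; counts each run."""
    calls["conv"] += 1
    return np.maximum(x @ W_conv, 0.0)


image = rng.normal(size=(1, 32))

# Choice 2 style: three full networks, each re-running the conv part.
calls["conv"] = 0
preds_choice2 = {
    name: int(np.argmax(conv_part(image) @ W)) for name, W in heads.items()
}
calls_choice2 = calls["conv"]

# Choice 1 style: compute the activations once, then apply three cheap heads.
calls["conv"] = 0
activations = conv_part(image)
preds_choice1 = {
    name: int(np.argmax(activations @ W)) for name, W in heads.items()
}
calls_choice1 = calls["conv"]
```

The predictions are identical in both styles here, but the second one runs the convolutional part once per image instead of once per attribute.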

For more detail, see the section under the heading “Aside: Pre-calculating Convolutional Layer Output” from the lesson 3 notes here:
http://wiki.fast.ai/index.php/Lesson_3_Notes

Side note
In both of the cases above, I’m assuming you don’t have enough data to train the convolutional layers from scratch but if you do, the differences between choice 1 and choice 2 remain the same except you’d want to train the convolutional layer on a single attribute e.g. furniture type (or possibly train on all 3 attributes in sequence, chopping off the final layer and training the convolutions some more with different final layers being predicted)

Hope that helps!

rforgione · July 26, 2017, 2:03pm

Great summary @markovbling! I guess I’m wondering if in practice the additional predictive power that you get from passing your second-to-last-layer activations to a different algorithm (logistic regression, random forest, SVM) tends to make the overall classification performance better than simply maintaining the original last-layer and outputting a classification guess directly from the network. I would guess that since the network’s last layer is outputting a linear combination of activations while other algorithms can detect much more nuanced patterns, passing your activations to a more flexible model could make for a more effective model – but I don’t know for sure. This might be a fun experiment to try if it hasn’t already been researched / published!

markovbling · July 26, 2017, 9:32pm

Happy to help!

While it’s definitely possible that the particular layers following your ‘activations layer’ may not be the very best for your particular task, I think NNs in general will be at least as good as other algos (because of the universal approximation theorem).

The nice thing about getting the activations is it makes it much easier to test many different architectures (or other algos entirely) without needing to run your images through the time-consuming conv layers for each downstream experiment…

GertjanBrouwer · July 28, 2017, 12:28pm

@markovbling I have 1 more question (I know, I am a pain in the butt), but this question keeps me awake at night. So if you cut off the last layer of the VGG16 CNN and use that as input into an MLP/logistic regression algo or any other algo, should I not train the CNN (VGG16) first?

The way I see it now is that VGG16 has 1000 classes; let’s say 15 of those classes are related to furniture (what I am trying to detect). But if I use the activations of VGG16 directly, does that not mean I get activations which all fit in those categories? I don’t know if this makes sense to you?

Another example: VGG16 has those classes and I use the last layers as input into another algo. That means I will only ever get 1 of the 15 types and not 1 of the 1000, because VGG16 is trained on 1000 classes and a dog looks nothing like a chair, so the last layers would always output 1 of the 15 classes. I have the same thought if you cut off the last layer: the activations will always be near each other because of the 1000 classes. But if I retrain VGG16 first with my images and then cut off the last layer, I feel like I would get better results.

I hope this makes sense.

Thanks

markovbling · July 28, 2017, 4:29pm

@GertjanBrouwer I don’t think you understand how CNNs work - I’d suggest going back and re-watching the first 3 lessons and poking around at the code (e.g. calling model.summary() and calling .shape on outputs after popping off layers).

“So if you cut off the last layer of the VGG16 CNN and use that as input into an MLP/logistic regression algo or any other algo, should I not train the CNN (VGG16) first?”

VGG is trained on ImageNet - you could train it some more on your particular data by removing the final layer with 1000 classes and replacing it with a layer with 15 classes, without setting trainable=False on the conv layers. That way you train the conv layers some more on your particular task, and that might actually improve performance, but since ImageNet contains images of furniture, the VGG conv layers are likely already general enough to be used without any further training.

I say I don’t think you understand CNNs because of your next question:

“The way I see it now is that VGG16 has 1000 classes; let’s say 15 of those classes are related to furniture (what I am trying to detect). But if I use the activations of VGG16 directly, does that not mean I get activations which all fit in those categories? I don’t know if this makes sense to you?”

The activations I’ve been saying you should get (and then fit models on top of) aren’t just those that apply to the ImageNet furniture classes - they’re just the outputs of that image fed through the CNN that detects 1000 different things, including the furniture classes. Sure, there may be some redundancy if some activations stay roughly constant across all your furniture images… but the point is that the activations are just numbers that represent the content of your images - around 4000 numbers (4,096 for each of VGG16’s hidden dense layers) that you can use IN PLACE OF YOUR IMAGES and fit models to.

“Another example: VGG16 has those classes and I use the last layers as input into another algo. That means I will only ever get 1 of the 15 types and not 1 of the 1000, because VGG16 is trained on 1000 classes and a dog looks nothing like a chair, so the last layers would always output 1 of the 15 classes. I have the same thought if you cut off the last layer: the activations will always be near each other because of the 1000 classes. But if I retrain VGG16 first with my images and then cut off the last layer, I feel like I would get better results.”

Why don’t you try it and see?

If you don’t have hundreds of thousands of images, I’d suggest against training the conv layers…