Custom Head for resnet34 model

Hi there

Newbie here with computer vision and neural networks, so I’m hoping someone can assist me. I want to apply lesson 1 to a set of satellite images of properties at a specified zoom level (from the Google Static Maps API). I have my images, and I have residential property sale data for South Africa that lets me create 2 classes of images by taking the top and bottom 10% of sales and classifying them as 1 (luxury) and 0 (affordable), to see whether a CNN can identify the difference between expensive and affordable property areas from satellite imagery. These images are in labeled folders (“0” and “1”).

I’m trying to replicate a white paper that, instead of just fine-tuning a pretrained network to distinguish between my 2 classes, outputs 256 features before making the softmax prediction of luxury vs affordable. This is because the 256 features are to be used in a neural net model, concatenated with other data features such as area under roof, number of bedrooms, etc. Thus, I need to figure out how to alter lesson 1 to add a custom head that takes the original model’s output, adds a fully connected linear layer to output 256 features, and then puts a softmax on top of that to classify as 1 or 0. Or at least that’s what I think I need to do :slight_smile:

The plan is to determine whether a luxury/affordable classifier is accurate using only 256 features from an image, and if so, use these features as input features in another neural network. I hope I’ve explained this well enough.

Appreciate any help you guys can offer me!

Hi Cindy. Welcome to the forums, and to starting to design your own models!

I would like to help with your question, but I don’t understand from your description exactly what you want to do. It sounds like combining image and tabular training data? It would help if you specified exactly what the model’s inputs and outputs are, or even sketched a diagram of what you intend to do. How about a link to that paper, too? Then I or someone else could suggest ways to design and implement it.

One thing I have learned (after sometimes getting no response) is that the way most likely to get useful responses is to be as specific, complete, and contextual as possible.

HTH, Malcolm

Thanks Malcolm. Sorry for not being more descriptive. The paper is found [here](https://vision.ece.ucsb.edu/sites/vision.ece.ucsb.edu/files/publications/bency_wacv_17.pdf) (see around section 4.2); they used Inception v3, and I plan to try both resnet34/50 and Inception. Lesson 1’s notebook basically takes a pre-trained model and adjusts the final layer to have as many classes as needed, e.g. 100 pet breeds. I need to first add a layer that summarizes the model’s output down to 256 outputs, then put a 2-class softmax layer on that and see how accurately the 256 outputs can distinguish affordable vs luxury areas.

Yes, in the end I will be using the 256 outputs (and not the 0/1 classifier) from the CNN model after fine tuning as input into another model concatenated with other features.

I hope this gives you a better sense of what I’m trying to do.

Welcome Cindy!

Let me know if this is not 100% clear to you; I can spell out the steps and how it all works.

First, I looked into cnn_learner to figure out how it was building the head, then replicated it to produce what I think you are looking for. You could take this pretty far and then change the head into whatever else you want.

I am using the MNIST_SAMPLE data so that you have a version where you can see how this part works. You would need to use your own data and create a databunch that works correctly with this.

This example should run if you copy-paste it into a notebook or a file; in a notebook, you can inspect each part and see how it all fits together.

First, I did this to get a handle on what the head layers look like:

from fastai.vision import *
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
learner = cnn_learner(data, models.resnet18, metrics=[accuracy])
print(learner.model)

You should get output like this, which shows the fully-connected head layers after your ResNet body (these funnel down to the two output classes):

...

  (1): Sequential(
    (0): AdaptiveConcatPool2d(
      (ap): AdaptiveAvgPool2d(output_size=1)
      (mp): AdaptiveMaxPool2d(output_size=1)
    )
    (1): Flatten()
    (2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.25)
    (4): Linear(in_features=1024, out_features=512, bias=True)
    (5): ReLU(inplace)
    (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.5)
    (8): Linear(in_features=512, out_features=2, bias=True)
  )
)

You can override this head with any PyTorch model that accepts the same input dimensions and generates the 2 output dimensions you are looking for. Again, I went into the code and borrowed these lines to figure out how to generate a model just like this, but with the 256-unit inner dimension you are looking for:

from fastai.vision.learner import num_features_model, create_body, create_head

base_model = models.resnet18
concat_pool = True

# Build the convolutional body, then count the features it feeds into the head
# (doubled because AdaptiveConcatPool2d concatenates average- and max-pooling)
body = create_body(base_model, pretrained=True)
nf = num_features_model(nn.Sequential(*body.children())) * (2 if concat_pool else 1)
nc = 2  # number of output classes
custom_head = create_head(nf, nc, lin_ftrs=[256])
print(custom_head)

This is the output:

Sequential(
  (0): AdaptiveConcatPool2d(
    (ap): AdaptiveAvgPool2d(output_size=1)
    (mp): AdaptiveMaxPool2d(output_size=1)
  )
  (1): Flatten()
  (2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (3): Dropout(p=0.25)
  (4): Linear(in_features=1024, out_features=256, bias=True)
  (5): ReLU(inplace)
  (6): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (7): Dropout(p=0.5)
  (8): Linear(in_features=256, out_features=2, bias=True)
)

Now you have the 256-dimensional layer just before the output.

There are two ways to inject this into your model.

  1. Overwrite learner.model[1] with this sub-model.
  2. Build your cnn_learner with custom_head=custom_head.

Once you have it the way you want, you can check again with learner.model to be sure you have the head you want. From there, lr_find and fit_one_cycle, and you should have it sorted out!
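For example, here is a minimal sketch of both options, assuming the data and custom_head objects built above (the epoch count and learning rate are just placeholders; pick your own from the lr_find plot):

# Option 2: pass the custom head when building the learner
learner = cnn_learner(data, models.resnet18, custom_head=custom_head, metrics=[accuracy])

# Option 1: alternatively, overwrite the head on an existing learner
# learner.model[1] = custom_head

learner.lr_find()
learner.fit_one_cycle(4, max_lr=1e-3)  # hypothetical values; read them off the lr_find plot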


Hi Cindy. Thanks for the fuller picture. What I’m going to write here is part opinion and part technical.

In the paper, as I understand it after skimming, first they train six separate satellite vision models at different scales to capture house, neighborhood, and geographic features. Then they concatenate these features with nearby points-of-interest and specific house characteristics by address. These features all go into a linear regressor, a random forest regressor, and a “multi-layer perceptron”, which I think means the kind of model we are studying. Finally, the scores are combined (somehow) to yield a house price.

Whew! This is very advanced stuff, far beyond what a beginner just finishing Lesson 1 can accomplish. IMHO, it’s even too advanced to learn much from as a personal curriculum. At the least, you’d need to master Lesson 3 on tabular data, understand how to build models (PyTorch), and take a course on Machine Learning/scikit-learn to understand regressors.

Instead I suggest implementing what you already do understand after Lesson 1 and, if you remain fascinated by this particular problem, adding in further complexity as you come to understand each topic. Step-by-step builds the foundation.

As for Lesson 1, you could implement a single-house vision model. You have three classes: lower 10%, upper 10%, and in-between. You have several ways to label satellite images with these classes. The benefit of fastai is that you don’t need to be concerned at this point with softmax or feature sets. You can just load a pre-trained model, train, predict, and see how it works! It’s the only way to get a feeling for training and for the capabilities of different CNNs.
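For instance, here is a minimal sketch of that Lesson 1 approach, assuming you sort the images into hypothetical class folders named affordable, mid, and luxury:

from fastai.vision import *

# Hypothetical layout: data/satellite_by_class/{affordable,mid,luxury}/*.png
path = Path('data/satellite_by_class')
data = ImageDataBunch.from_folder(path, valid_pct=0.2, size=224)
learn = cnn_learner(data, models.resnet34, metrics=[accuracy])
learn.fit_one_cycle(4)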

The 256 features mentioned in the paper are AFAICT just arbitrary - the exact number does not matter to their model or to yours. Maybe they tested for the ideal number, maybe not; maybe it was what fit in their GPU. With resnet34, fastai defaults to 512 features in the head. Once you are more comfortable with actually using fastai, you might experiment with fewer features, as @bfarzin shows you how to do. You can discover for yourself how much the feature count really matters.

Another interesting experiment would be to train as a regression model, predicting the actual sale price instead of the category. Not too hard. Change the data input to include price, and use FloatList. fastai will provide the appropriate loss function.
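A minimal sketch of that regression setup, assuming the data block API and a hypothetical prices.csv that maps image file names to sale prices:

from fastai.vision import *

path = Path('data/satellite')  # hypothetical image folder
data = (ImageList.from_csv(path, 'prices.csv', folder='images')
        .split_by_rand_pct(0.2)
        .label_from_df(cols='price', label_cls=FloatList)  # regression labels
        .databunch())
learn = cnn_learner(data, models.resnet34)  # fastai picks an MSE-style loss for FloatList
learn.fit_one_cycle(4)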

Later, if you want to use this model as a piece of a bigger one, you can load the weights you trained, and remove one or two linear layers from the head to access the features. These features can be combined with, say, tabular data or other CNNs. To understand how, you’ll need to understand PyTorch Modules (Part 2).
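As a rough sketch of that last step, assuming the trained learner with the custom head from earlier in this thread:

from fastai.vision import *  # brings in torch.nn as nn

# Keep the body plus the head up to the ReLU after the 256-unit Linear layer,
# dropping the final BatchNorm1d, Dropout, and Linear(256, 2)
feature_model = nn.Sequential(learner.model[0],
                              nn.Sequential(*list(learner.model[1].children())[:-3]))

Calling feature_model on a batch of images then yields the 256-dimensional feature vectors directly.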

Well, that’s my best advice. Please excuse any false assumptions I may have made. Good luck!

Thanks very much…I’m sure this is going to help me greatly. I will let you know when I get it working.


This is a task that is not hard to implement, and thanks to @sgugger we now have MixedItemList, built specially to do this.
But as @Pomo said, you need to master the tabular model first, because you will be assembling an image model + a tabular one.


@Pomo thanks very much for your thoughts. I have already run lesson 1 and applied it to simply differentiate between affordable and luxury areas using resnet50 (with no fiddling to reduce the output features before classification, of course), fine-tuning that and getting 93% accuracy. However, I realised that to get 100% fair results, I should ensure that my validation images have never been seen by the model before (even in part, since images at zoomed-out levels may easily overlap in area). So I’m now working to ensure that my validation-set images are from towns that are unseen in the training set. I hope my accuracy won’t drop by too much.
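I’m thinking of something like the following for the split, using the data block API and a hypothetical town_of helper that maps an image path to its town:

holdout_towns = {'TownA', 'TownB'}  # hypothetical held-out validation towns

data = (ImageList.from_folder(path)
        .split_by_valid_func(lambda fn: town_of(fn) in holdout_towns)
        .label_from_folder()
        .databunch())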

I know this is an ambitious project…and I’m nowhere near the second part yet, but I would like to get this first part completed fairly soon. I realise the 256 is probably arbitrary and a model will probably perform better if I don’t do this and use the standard 512 outputs.

Then I would like to try the paper’s suggestion and yours to predict the house price using the image attributes only, although I may need some more direction on how to do this. It would be very interesting to see how well the model performs using only an aerial image!

I also plan to repeat a process flow similar to the Rossmann store lesson, which uses tabular data to predict store sales, and get a NN to predict house prices using structured data attributes only.
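I’m imagining something like this for the tabular step, assuming a hypothetical DataFrame df of house attributes with a price column (the column names are made up):

from fastai.tabular import *

procs = [FillMissing, Categorify, Normalize]
data = (TabularList.from_df(df, cat_names=['suburb'], cont_names=['bedrooms', 'floor_area'], procs=procs)
        .split_by_rand_pct(0.2)
        .label_from_df(cols='price', label_cls=FloatList)  # regression on price
        .databunch())
learn = tabular_learner(data, layers=[200, 100])
learn.fit_one_cycle(5)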

The final part of this project would be to use the imagery features + structured attribute data on a house to predict the house value.

Thanks so much for your thoughts! I really appreciate the help you have so willingly offered.