Why resnet34 in Lesson 1?


#1

Hi,

In Lesson 1 we are using resnet34, but I noticed that there are a few more options available.

[screenshot: list of pretrained architectures available in fastai]

I did try the complete steps with resnet18 and found that the results from resnet34 and resnet18 were almost similar. Just curious to know why we are using resnet34? I haven't tried out the other options yet, but before doing that I want to understand the rationale behind showing the demo with resnet34.


#2

I wouldn’t read too much into the arch choice. In general it is a good idea to try out a couple and see how they perform.

For me, resnet34 would be one of the first architectures, if not the first, that I would try out - it seems to offer very good performance relative to its size (which impacts training time) and allows for a bigger batch size.

As a rule of thumb, the more complex the problem, the bigger an arch you might need. All of the archs from your screenshot have been pretrained on ImageNet, and in general telling a cat from a dog is probably not the hardest of tasks for a CNN, hence going for something relatively small seems to make a lot of sense.
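To put rough numbers on the size difference, here is a back-of-the-envelope parameter count for the two BasicBlock ResNets, computed in plain Python from the published architecture (per-stage block counts [2, 2, 2, 2] for resnet18 and [3, 4, 6, 3] for resnet34, ImageNet head). This is a standalone sketch, not fastai code:

```python
def basic_block_params(cin, cout, downsample):
    # two 3x3 convs (no bias) plus their BatchNorms (weight + bias per channel)
    p = 9 * cin * cout + 9 * cout * cout + 2 * (2 * cout)
    if downsample:  # 1x1 conv + BatchNorm on the shortcut when channels change
        p += cin * cout + 2 * cout
    return p

def resnet_params(blocks_per_stage, num_classes=1000):
    # stem: 7x7 conv (3 -> 64 channels, no bias) + BatchNorm
    params = 7 * 7 * 3 * 64 + 2 * 64
    cin = 64
    for stage, n_blocks in enumerate(blocks_per_stage):
        cout = 64 * 2 ** stage  # channels double each stage: 64, 128, 256, 512
        for b in range(n_blocks):
            # the first block of stages 2-4 changes channel count, so it
            # needs a downsampling shortcut
            downsample = (b == 0 and stage > 0)
            params += basic_block_params(cin, cout, downsample)
            cin = cout
    # final fully connected layer (weights + biases)
    params += 512 * num_classes + num_classes
    return params

print(resnet_params([2, 2, 2, 2]))  # resnet18 -> 11689512
print(resnet_params([3, 4, 6, 3]))  # resnet34 -> 21797672
```

So resnet34 is roughly 22M parameters versus 12M for resnet18 - both small compared to resnet50 and up, which use the heavier Bottleneck blocks.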


#3

@radek,

Thanks for your response. I'm still not completely convinced. I don't think batch size or model size is really a constraint - I was able to go beyond a batch size of 64K with both resnet18 and resnet34. Here is the sample data for resnet34, which I gathered to benchmark my system's performance.

Batch Size   trn_loss   val_loss   Accuracy   Wall Time (s)
64           0.031134   0.028481   0.989      15.7
128          0.028619   0.029348   0.989      14.1
256          0.032689   0.022995   0.991      13.2
512          0.038162   0.025427   0.9895     12.7
1024         0.055639   0.02597    0.988      12.2
2048         0.08693    0.034631   0.987      11.4
4096         0.165338   0.048062   0.983      11.5
8192         0.303578   0.060767   0.9795     10.1
16384        0.346356   0.091748   0.98       6.15
32768        0.651255   0.262653   0.927      4.66
65536        0.676977   0.250999   0.9475     4.74
131072       0.56841    0.24005    0.9415     4.73

Any inputs would certainly help. What do the numbers 34/18/101/etc. signify?


(Bryan Heffernan) #4

The 34 is how many layers are in the network; resnet34 is shown here: https://i.imgur.com/nyYh5xH.jpg. Larger networks can model more complex problems, but at the risk of overfitting, so you would need more regularization for larger networks. The reason resnet34 is used is that its performance-to-accuracy trade-off is fine for this problem. Larger networks take longer to train and use more memory than smaller ones.
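As a quick sanity check on where that number comes from: the name counts only the convolutional and fully connected layers (not BatchNorm or pooling). With the standard per-stage block counts and two 3x3 convs per BasicBlock, the arithmetic works out exactly:

```python
# ResNet names count conv + fc layers only.
# resnet18/34 use BasicBlocks (2 convs each); per-stage block counts below.
stem_and_fc = 2  # the initial 7x7 conv plus the final fc layer
resnet18 = stem_and_fc + 2 * sum([2, 2, 2, 2])
resnet34 = stem_and_fc + 2 * sum([3, 4, 6, 3])
print(resnet18, resnet34)  # -> 18 34
```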


(Jeremy Howard) #5

Can you be specific about what training you did and what results you got? IIRC rn34 gave me much better results on dogs v cats.


#6

Hi @jeremy ,

I’m running the following code to compare.

#arch = resnet34
arch = resnet18
# bs was varied from 64 up to 131072 across the runs tabulated below
data = ImageClassifierData.from_paths(PATH, bs=131072, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
%time learn.fit(0.01, 5)

Here are my comparison results.

Resnet34 results:

Batch Size   trn_loss   val_loss   Accuracy   Wall Time (s)
64           0.031134   0.028481   0.989      15.7
128          0.028619   0.029348   0.989      14.1
256          0.032689   0.022995   0.991      13.2
512          0.038162   0.025427   0.9895     12.7
1024         0.055639   0.02597    0.988      12.2
2048         0.08693    0.034631   0.987      11.4
4096         0.165338   0.048062   0.983      11.5
8192         0.303578   0.060767   0.9795     10.1
16384        0.346356   0.091748   0.98       6.15
32768        0.651255   0.262653   0.927      4.66
65536        0.676977   0.250999   0.9475     4.74
131072       0.56841    0.24005    0.9415     4.73

Resnet18 results:

Batch Size   trn_loss   val_loss   Accuracy   Wall Time (s)
64           0.047726   0.03651    0.987      15.7
128          0.043944   0.036581   0.9885     14
256          0.042771   0.034083   0.9875     13.2
512          0.052014   0.035504   0.9865     12.6
1024         0.067253   0.042399   0.984      12.1
2048         0.105353   0.043386   0.9855     11.4
4096         0.186447   0.060333   0.9765     11.4
8192         0.310596   0.066564   0.976      10.2
16384        0.380862   0.118406   0.9625     6.32
32768        0.683845   0.264256   0.918      4.74
65536        0.669664   0.224125   0.95       4.74
131072       0.680029   0.280925   0.9175     4.69

I'd appreciate your help in identifying what I'm doing differently compared to your results.

I'm looking at the accuracy and wall time columns as the criteria for my observations.
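One thing the numbers suggest: once the batch size exceeds the training set, each epoch performs only a single gradient update, which would explain both the much shorter wall times and the worse accuracy at the largest batch sizes. A rough sketch, assuming the lesson-1 split of about 23,000 training images (the exact count is my assumption):

```python
import math

n_train = 23_000  # approximate dogs-vs-cats training set size (assumption)
for bs in [128, 8192, 131072]:
    # number of optimizer steps per epoch at this batch size
    updates_per_epoch = math.ceil(n_train / bs)
    print(bs, updates_per_epoch)  # -> 128 180, then 8192 3, then 131072 1
```

At bs=131072 the whole training set fits in one batch, so five epochs mean only five weight updates in total.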


#7

@Pendar, thanks for the image and the explanation. I'm getting a better understanding now.


(Jeremy Howard) #8

There’s no point using those enormous batch sizes. Stick with 128 to keep it simple. You should be running through all the steps - only do at most 2 epochs with precompute, then a few epochs unfreezing the different layer groups, like in the lesson 1 notebook. And try TTA too. See what the best accuracy you can get is with each architecture.


#9

Thanks @jeremy, I will take your advice and continue. I will post my results with TTA on the different resnets shortly.

I built my own DL setup and was testing how well my system copes with different batch sizes. In some threads on this forum people were discussing using batch size to gauge the capacity or performance of their system, so I was experimenting with batch sizes to find the max load limit of my system.


(Jeremy Howard) #10

That will only be useful for you with precompute=False and all layers unfrozen.