Why resnet34 in Lesson 1?


#1

Hi,

In Lesson 1 we are using resnet34, but I noticed that there are a few more options available.

[screenshot: list of pretrained architectures available in fastai]

I did try the complete steps with resnet18 and found that the results from resnet34 and resnet18 were almost similar. Just curious to know why we are using resnet34? I haven't tried out the other options yet, but before doing that I want to understand the rationale behind showing the demo with resnet34.


#2

I wouldn’t read too much into the arch choice. In general it is a good idea to try out a couple and see how they perform.

For me, resnet34 would be one of the first architectures, if not the first, that I would try out - it seems to offer very good performance relative to its size (which impacts training time) and allows for a bigger batch size.

As a rule of thumb, the more complex the problem, the bigger an arch you might need. All of the archs from your screenshot have been pretrained on ImageNet, and in general telling a cat from a dog is probably not the hardest of tasks for a CNN, hence going for something relatively small seems to make a lot of sense.
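To put rough numbers on the size difference, here is a back-of-the-envelope parameter count for the two BasicBlock ResNets, computed in plain Python from the published architecture (per-stage block counts [2, 2, 2, 2] for resnet18 and [3, 4, 6, 3] for resnet34, ImageNet head). This is a standalone sketch, not fastai code:

```python
def basic_block_params(cin, cout, downsample):
    # two 3x3 convs (no bias) plus their BatchNorms (weight + bias per channel)
    p = 9 * cin * cout + 9 * cout * cout + 2 * (2 * cout)
    if downsample:  # 1x1 conv + BatchNorm on the shortcut when channels change
        p += cin * cout + 2 * cout
    return p

def resnet_params(blocks_per_stage, num_classes=1000):
    # stem: 7x7 conv (3 -> 64 channels, no bias) + BatchNorm
    params = 7 * 7 * 3 * 64 + 2 * 64
    cin = 64
    for stage, n_blocks in enumerate(blocks_per_stage):
        cout = 64 * 2 ** stage  # channels double each stage: 64, 128, 256, 512
        for b in range(n_blocks):
            # the first block of stages 2-4 changes channel count, so it
            # needs a downsampling shortcut
            downsample = (b == 0 and stage > 0)
            params += basic_block_params(cin, cout, downsample)
            cin = cout
    # final fully connected layer (weights + biases)
    params += 512 * num_classes + num_classes
    return params

print(resnet_params([2, 2, 2, 2]))  # resnet18 -> 11689512
print(resnet_params([3, 4, 6, 3]))  # resnet34 -> 21797672
```

So resnet34 is roughly 22M parameters versus 12M for resnet18 - both small compared to resnet50 and up, which use the heavier Bottleneck blocks.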


#3

@radek,

Thanks for your response. I'm still not completely convinced. I don't think batch size or model size is really a constraint - I was able to go beyond a batch size of 64K with both resnet18 and resnet34. Here is the sample data for resnet34, which I gathered to benchmark my system's performance.

Batch Size   trn_loss   val_loss   Accuracy   Wall Time (s)
64           0.031134   0.028481   0.989      15.7
128          0.028619   0.029348   0.989      14.1
256          0.032689   0.022995   0.991      13.2
512          0.038162   0.025427   0.9895     12.7
1024         0.055639   0.02597    0.988      12.2
2048         0.08693    0.034631   0.987      11.4
4096         0.165338   0.048062   0.983      11.5
8192         0.303578   0.060767   0.9795     10.1
16384        0.346356   0.091748   0.98       6.15
32768        0.651255   0.262653   0.927      4.66
65536        0.676977   0.250999   0.9475     4.74
131072       0.56841    0.24005    0.9415     4.73

Any inputs would certainly help. What do the numbers 34/18/101/etc. signify?


(Bryan Heffernan) #4

The 34 is how many layers are in the network; resnet34 is shown here: https://i.imgur.com/nyYh5xH.jpg. Larger networks can model more complex problems, but at the risk of overfitting, so you would need more regularization for larger networks. The reason resnet34 is used is that its performance-to-accuracy trade-off is fine for this problem. Larger networks take longer to train and use more memory than smaller ones.
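As a quick sanity check on where that number comes from: the name counts only the convolutional and fully connected layers (not BatchNorm or pooling). With the standard per-stage block counts and two 3x3 convs per BasicBlock, the arithmetic works out exactly:

```python
# ResNet names count conv + fc layers only.
# resnet18/34 use BasicBlocks (2 convs each); per-stage block counts below.
stem_and_fc = 2  # the initial 7x7 conv plus the final fc layer
resnet18 = stem_and_fc + 2 * sum([2, 2, 2, 2])
resnet34 = stem_and_fc + 2 * sum([3, 4, 6, 3])
print(resnet18, resnet34)  # -> 18 34
```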


(Jeremy Howard) #5

Can you be specific about what training you did and what results you got? IIRC rn34 gave me much better results on dogs v cats.


#6

Hi @jeremy ,

I’m running the following code to compare.

#arch = resnet34
arch = resnet18
# bs was varied from 64 up to 131072 across the runs tabulated below
data = ImageClassifierData.from_paths(PATH, bs=131072, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
%time learn.fit(0.01, 5)

Here are my comparison results.

Resnet34 results:

Batch Size   trn_loss   val_loss   Accuracy   Wall Time (s)
64           0.031134   0.028481   0.989      15.7
128          0.028619   0.029348   0.989      14.1
256          0.032689   0.022995   0.991      13.2
512          0.038162   0.025427   0.9895     12.7
1024         0.055639   0.02597    0.988      12.2
2048         0.08693    0.034631   0.987      11.4
4096         0.165338   0.048062   0.983      11.5
8192         0.303578   0.060767   0.9795     10.1
16384        0.346356   0.091748   0.98       6.15
32768        0.651255   0.262653   0.927      4.66
65536        0.676977   0.250999   0.9475     4.74
131072       0.56841    0.24005    0.9415     4.73

Resnet18 results:

Batch Size   trn_loss   val_loss   Accuracy   Wall Time (s)
64           0.047726   0.03651    0.987      15.7
128          0.043944   0.036581   0.9885     14
256          0.042771   0.034083   0.9875     13.2
512          0.052014   0.035504   0.9865     12.6
1024         0.067253   0.042399   0.984      12.1
2048         0.105353   0.043386   0.9855     11.4
4096         0.186447   0.060333   0.9765     11.4
8192         0.310596   0.066564   0.976      10.2
16384        0.380862   0.118406   0.9625     6.32
32768        0.683845   0.264256   0.918      4.74
65536        0.669664   0.224125   0.95       4.74
131072       0.680029   0.280925   0.9175     4.69

I'd appreciate your help in identifying what I'm doing differently compared to your results.

I'm looking at the accuracy and wall time columns as the criteria for my observations.
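One thing the numbers suggest: once the batch size exceeds the training set, each epoch performs only a single gradient update, which would explain both the much shorter wall times and the worse accuracy at the largest batch sizes. A rough sketch, assuming the lesson-1 split of about 23,000 training images (the exact count is my assumption):

```python
import math

n_train = 23_000  # approximate dogs-vs-cats training set size (assumption)
for bs in [128, 8192, 131072]:
    # number of optimizer steps per epoch at this batch size
    updates_per_epoch = math.ceil(n_train / bs)
    print(bs, updates_per_epoch)  # -> 128 180, then 8192 3, then 131072 1
```

At bs=131072 the whole training set fits in one batch, so five epochs mean only five weight updates in total.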


#7

@Pendar, thanks for the image and the explanation. I'm getting a better understanding now.


(Jeremy Howard) #8

There’s no point using those enormous batch sizes. Stick with 128 to keep it simple. You should be running through all the steps - only do at most 2 epochs with precompute, then a few epochs unfreezing the different layer groups, like in the lesson 1 notebook. And try TTA too. See what the best accuracy you can get is with each architecture.


#9

Thanks @jeremy, I will take your advice and continue. I will post my results with TTA on the different resnets shortly.

I built my own DL setup and was testing how well my system copes with different batch sizes. In some threads on this forum people were discussing using batch size to gauge the capacity or performance of their system, so I was experimenting with batch sizes to find the max load limit of my system.


(Jeremy Howard) #10

That will only be useful for you with precompute=False and all layers unfrozen.