Non-Beginner Discussion

Thanks!
Will this be sufficient for retrieving what I need?

I also found this:

Which I think can be helpful.

If I need a callback that saves a decoded prediction of an image to a specific folder, say every epoch, would this be it?

Hi,

I was wondering about the Avg and MaxPool2d layers that are the last step of CNN networks just before the head. In classification models they “compress” 49 (7x7) features from the last conv layer into a single feature vector using element-wise mean and max.

I tried to take some of the convolutional features from the 7x7 layer and build an image-search model (using BarlowTwins for unsupervised training) with them. If I take a single feature and calculate cosine similarity it works really well. But when I calculate the element-wise mean or max of several features, the cosine similarity no longer works well – the more features are mixed into the “search query”, the less relevant the matches (ordered by cosine similarity).
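To make it concrete, here is a rough sketch of what I mean (toy PyTorch tensors; the shapes and values are just illustrative, not from my actual model):

import torch
import torch.nn.functional as F

# pretend these are the last conv block's outputs for two images, e.g. [bs, 512, 7, 7]
feats_a = torch.randn(1, 512, 7, 7)
feats_b = torch.randn(1, 512, 7, 7)

# a single spatial feature (one of the 49 positions) – this works well for me
sim_single = F.cosine_similarity(feats_a[0, :, 3, 3], feats_b[0, :, 3, 3], dim=0)

# element-wise mean / max over the 7x7 grid (what AvgPool2d / MaxPool2d do) – this works much worse
sim_avg = F.cosine_similarity(feats_a.mean(dim=(2, 3)), feats_b.mean(dim=(2, 3)), dim=1)
sim_max = F.cosine_similarity(feats_a.amax(dim=(2, 3)), feats_b.amax(dim=(2, 3)), dim=1)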

Has anybody encountered this before?

I am also wondering how this affects transfer-learning:

  1. The features before the max/avg pool seem to carry a lot more information (“one-shot” image search works on them but not on the pooled versions)
  2. ResNets work in practice so it has to mean that the features can be trained to behave sensibly under Avg/Max Pool.

My hypothesis:

Maybe the conv features have to be fine-tuned for each task to optimally “survive” the pooling and give good accuracy? Would it be beneficial to add a randomly-initialized ResBlock between the last conv layer and the pooling layer and train it with the same learning rate as the head?

What would be a good fine-tuning task & dataset (not object-detection since they don’t use max/avg pool layers) on which I could test this hypothesis? It has to be sufficiently different from ImageNet for the


I know it is still a little early for NLP, but has anyone tried CodeBERT? A quick search doesn’t turn up anything in the forums.

“We probe four models that vary in their expected knowledge of code properties: BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code, and natural language documentation), and GraphCodeBERT (pre-trained on source code with dataflow). While GraphCodeBERT performs more consistently overall, we find that BERT performs surprisingly well on some code tasks, which calls for further investigation.”

https://arxiv.org/pdf/2108.11308.pdf

I’d suggest reading the source for this callback, since it’s close to what you want:

https://lessw.medium.com/fastais-callbacks-for-better-cnn-training-meet-savemodelcallback-e55f254f1af5

And of course read the docs for fastai callbacks in general:

PS: I found both of these by just googling for “fastai callback” – there’s lots of tutorials and stuff around about these to help you get started, so I’d suggest taking a look thru the links you’ll find there too.

Thank you for your guidance! @jeremy
The first link of yours is something that I’ve not seen yet. Great!
I’ve actually googled too, and had already seen that second webpage you posted. After going through it, I kept searching and found this (as in my previous message):

because get_preds also seems like something I’m looking for.


You can just use predict() like we did in the last lesson - also shown in ch1 of the book. We’ll be using that more in the next lesson too FYI.


Hey there,
I’m brushing up my knowledge a little bit, and was just surprised by a new function, vision_learner, that replaces cnn_learner.
How is vision_learner different from unet_learner? When is it better to use which?
(The answer that I think I found: vision_learner fits classification tasks better, while unet_learner fits regression/segmentation tasks better.)

Another few questions:

from fastai.vision.all import *  # dls is assumed to be a DataLoaders built earlier

learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(3, 3e-3)                        # body is frozen: only the head trains
learn.unfreeze()                                    # make the pretrained body trainable too
learn.fit_one_cycle(12, lr_max=slice(1e-6, 1e-4))   # discriminative LRs: 1e-6 for the earliest layers, 1e-4 for the head
  1. Why is it important to unfreeze() between training runs? I only understood that it controls which layers’ weights get updated, but I kind of got lost on the details. What happens if I don’t unfreeze()?
  2. When using slice(), does the 1e-6 apply to the later (“lowest”) layers?

Info came from here:

Thanks

Thanks. predict() is indeed useful after training is over, whereas I want to use a callback that outputs a sequence of [input, target, preds] images and saves them to a folder every epoch (during training).
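Something like this rough, untested sketch is what I have in mind (SavePredsCallback and the folder layout are made up, and I’m assuming an image-to-image task so the raw predictions can be saved directly):

from fastai.vision.all import *
from torchvision.utils import save_image

class SavePredsCallback(Callback):
    "Hypothetical sketch: dump a few validation predictions to disk after every epoch."
    def __init__(self, out_dir='preds', n=4):
        self.out_dir, self.n = Path(out_dir), n
    def after_epoch(self):
        # predictions on the validation set (ds_idx=1 is get_preds' default)
        preds, targs = self.learn.get_preds()
        dest = self.out_dir/f'epoch_{self.epoch}'
        dest.mkdir(parents=True, exist_ok=True)
        # assumes preds/targs are image-like [ch, h, w] float tensors
        for i in range(min(self.n, len(preds))):
            save_image(preds[i], dest/f'pred_{i}.png')
            save_image(targs[i], dest/f'targ_{i}.png')

# usage sketch:
# learn.fit_one_cycle(3, 3e-3, cbs=SavePredsCallback('preds'))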

I think vision_learner replaces cnn_learner.

cnn_learner was built at a time when CNN-based architectures like

  • resnet
  • densenet
  • efficientnet

were used for image-based tasks. As the use of transformer-based architectures for image data has increased, cnn_learner became vision_learner.

It’s more than a name change: now we can use all the amazing models from timm.
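For example (a minimal sketch, assuming dls is already built; 'convnext_tiny' is just an example timm model name, not a recommendation):

from fastai.vision.all import *

# passing a string makes vision_learner look the architecture up in timm,
# whereas passing a function like resnet34 uses torchvision
learn = vision_learner(dls, 'convnext_tiny', metrics=error_rate)
learn.fine_tune(3)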

We should be able to use vision_learner for regression too, and you guessed it right. unet_learner is used for image segmentation.

After training the head, we unfreeze, which trains the full model. I guess slice lets us use something called discriminative learning rates, where we train groups of layers with different learning rates.


Thanks for the info!

Just questions to clarify:

  1. What makes timm's models so much better?
  2. How do I know how many epochs are needed for training only the head?
  3. When my output is not a label but an image (a float number in every pixel), is it called regression or segmentation? Meaning, could I use vision_learner for training where the output is an image as well?
    (I tried to replace unet_learner with vision_learner, but got an error that the input tensor only has two axes [bs,ch] rather than the four axes [bs,ch,h,w].)
  • timm models offer a consistent API and pretrained weights for hundreds of models. You can check the Kaggle kernel by Jeremy to explore more.

  • fastai’s fine_tune method just trains the head for 1 epoch by default. If the dataset is similar to ImageNet I would fine-tune longer, and if it is different I usually train by unfreezing from the start. That works for me; you can play with different choices to get a feel for it.

  • When you want to predict an image, I think you should be using unet_learner. That is called segmentation.


Thanks!
It’s funny to say, but when I use freeze() without unfreeze() afterwards, my training metric improves at a linear rate. That’s even more impressive. Could I say it’s just that my dataset closely resembles ImageNet?

With timm models, could it mean that I may get a better ResNet50, i.e. one pretrained on a much more diverse dataset than ImageNet?


I did some similarity projects (vision and text), and if you want really good results you’ll have to find a training objective to fine-tune your model on for the task. The features after pooling work fine - e.g. in fastai I just take the concat-pool outputs. There are differences between avg pool, max pool and concat pool, but all are OK.

Why fine-tune? Self-supervised models (nlp / vision) or models pretrained on ImageNet don’t work that well - the results often aren’t similar in a human sense.

How to fine-tune (from easy 1 to hard 3)?
1a) vision: self supervised pretraining on your dataset instead of using a pretrained model e.g. Self Supervised Learning with Fastai | self_supervised
1b) text: fine-tune your language model on your dataset
2) vision + nlp: do you have a labelled dataset? Train a classifier. Better: train a multi-label model - it doesn’t matter how well the model performs, but classes / labels will push it to produce similar feature vectors.
3) vision + nlp: contrastive learning will give the best results but it’s often hard / impossible to create the required dataset (quality and size) e.g. sentence transformers for nlp https://www.sbert.net
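For example, here is a minimal sketch of using an off-the-shelf, contrastively-trained sentence-transformers model for similarity (the model name 'all-MiniLM-L6-v2' is just a common default, not a recommendation):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
emb = model.encode(['a red sports car', 'a crimson coupe', 'a bowl of soup'],
                   convert_to_tensor=True)
# cosine similarity of the first sentence against the other two
print(util.cos_sim(emb[0], emb[1:]))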

If you want something that works out of the box and it’s a real-world dataset (photos), I’d just use CLIP: GitHub - rom1504/clip-retrieval: Easily compute clip embeddings and build a clip retrieval system with them :)


I just have a weird question:
Does training a model have momentum?
Meaning, will I reach the same maximum potential in both of these ways:

  • using a learning rate of 1e-3 for 50 epochs, then 1e-6 for 50 more epochs,
  • using a learning rate of 1e-6 for 50 epochs, then 1e-3 for 50 more epochs

Imagine this as squeezing an orange for juice. Will I get the same amount of juice by squeezing it fast and forcefully as by squeezing it slowly with little effort? Or would it yield different results?

Very different results!

BTW @Danrohn most of your questions are actually answered at length in the book - because they’re very good and important questions, and they’re the questions we try to ensure are answered. :) Full answers to them require a book-length treatment - which is what we have!

So I’d strongly recommend making your way through the book, since it’s going to tell you exactly what you’re trying to understand :)

(Which is not a way of discouraging you from asking - but I don’t have a better answer than what’s in the book, so I’m not going to be able to add much here…)


The resnet50 implementations in torchvision and timm are almost identical.

So we will find similar results. The advantage timm offers is access to hundreds of models, and Ross Wightman continuously updates his repo with the latest models and weights.



Just to clarify, what learn.fine_tune does is:

  1. With the given learning rate, train just the head for one epoch.
  2. Unfreeze the entire model, and train it with a slightly lower learning rate (the original learning rate divided by some factor) for whatever number of epochs was passed into the function. Note that the learning rates used are discriminative learning rates, meaning the beginning of the model is trained with a much lower learning rate than the layers closer to the end (the book has more details on this).

You can technically pass in the number of epochs that only the head is trained for (the default being 1). I recommend checking out the source code for Learner.fine_tune.
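For example (a small sketch; the numbers are arbitrary):

# 3 epochs with the body frozen (only the head trains),
# then 12 epochs unfrozen with discriminative learning rates derived from base_lr
learn.fine_tune(12, base_lr=2e-3, freeze_epochs=3)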

The idea is that the layers of the newly added head are untrained (we add a custom head, after all), so instead we train just the head to “calibrate” it with the rest of the model. Then we can unfreeze and train the whole model much more easily.

This is definitely an interesting observation, and it is likely that your dataset indeed closely resembles ImageNet, so you don’t really need to adjust the body of the model. As a side point, make sure your validation metrics are improving, not just your training metrics.


Hey man! Your messages, as before, are as clear as a warm knife cutting through butter. Thanks!

Your explanation is wonderful, but I had already figured this out by looking at the source code and by reading this:

I have read those explanations many times, but they still don’t feel clear to me. If I don’t unfreeze(), does it mean that the “body” stays untrained?

This is definitely an interesting observation, and it is likely that your dataset indeed closely resembles ImageNet, so you don’t really need to adjust the body of the model. As a side point, make sure your validation metrics are improving, not just your training metrics.

Yes, both the valid_loss and the metrics get better, without overfitting.


epoch train_loss valid_loss ssim psnr time
0 0.236554 0.208531 0.183433 11.753738 02:09
1 0.179139 0.148908 0.236383 13.805398 02:05
2 0.164469 0.179590 0.292767 12.279366 02:06
3 0.144973 0.139101 0.338365 14.361961 02:05
4 0.132566 0.115691 0.383597 15.746257 02:05
5 0.123076 0.140860 0.408780 14.752694 02:05
6 0.124454 0.101365 0.442580 16.926208 02:05
7 0.112748 0.102430 0.499645 16.352867 02:04
8 0.108559 0.099927 0.507152 17.267815 02:04
9 0.108745 0.096709 0.531147 17.573277 02:02
10 0.131972 0.105292 0.546981 16.911779 02:03
11 0.122663 0.106509 0.569845 17.131609 02:06
12 0.117344 0.116483 0.586394 16.417915 02:06
13 0.118717 0.096780 0.601940 17.860203 02:02
14 0.109777 0.108330 0.622874 16.764353 02:02
15 0.107149 0.086118 0.636838 18.321262 02:02
16 0.099475 0.098193 0.657461 17.738472 02:04
17 0.100719 0.132301 0.639843 15.730131 02:04
18 0.111611 0.107994 0.666687 16.818670 02:03
19 0.111105 0.120799 0.679617 16.494234 02:03
20 0.114704 0.207559 0.639382 12.127844 02:04
21 0.099970 0.098086 0.707660 17.849594 02:03
22 0.099835 0.089594 0.700292 18.560604 02:01
23 0.106707 0.138295 0.674177 15.729656 02:04
24 0.099064 0.073961 0.725095 20.033642 02:03
25 0.102635 0.084740 0.737844 18.210724 02:06
26 0.108957 0.155416 0.677125 14.684962 02:01
27 0.100712 0.096362 0.740240 18.032732 02:02
28 0.091303 0.098734 0.757668 17.141289 02:03
29 0.095276 0.082612 0.747881 18.612074 02:01
30 0.090979 0.079308 0.766427 19.219879 02:02
31 0.088959 0.062391 0.781548 21.137724 02:05
32 0.092504 0.070026 0.772093 19.750147 02:01
33 0.087208 0.074354 0.783176 19.588976 02:05
34 0.083044 0.066101 0.794248 19.470947 02:01
35 0.080781 0.075523 0.785750 18.851494 02:04
36 0.076060 0.088846 0.795860 16.493053 02:01
37 0.074469 0.068146 0.805137 19.896648 02:04
38 0.071076 0.070530 0.806764 20.062801 02:03
39 0.075082 0.070611 0.802137 20.234707 02:01
40 0.076525 0.073114 0.810220 18.325096 02:05
41 0.070763 0.070393 0.813122 18.819860 02:05
42 0.068504 0.073804 0.805940 18.706667 02:05
43 0.065610 0.068306 0.816073 19.237278 02:02
44 0.060239 0.064983 0.826524 19.970503 02:04
45 0.064982 0.084969 0.812253 18.099051 02:03
46 0.059241 0.054122 0.831039 21.738705 02:05
47 0.056639 0.056747 0.836223 22.513317 02:02
48 0.057209 0.046792 0.846990 24.197918 02:05
49 0.056328 0.046437 0.846901 23.958132 02:03
50 0.054749 0.047320 0.850570 24.019894 02:03
51 0.051009 0.047988 0.852203 23.820242 02:05
52 0.050929 0.047600 0.850500 24.071835 02:06
53 0.051956 0.043077 0.862234 24.669596 02:03
54 0.050984 0.043307 0.857978 24.458191 02:06
55 0.048370 0.046028 0.858897 23.934223 02:04
56 0.048205 0.044067 0.861460 24.493036 02:06

I reached up to ssim=0.90.

I used fit_one_cycle(), and I tried to squeeze out even more by using the ReduceLROnPlateau() callback, but it doesn’t seem to work with fit_one_cycle (the lr doesn’t get updated). So then I tried fit(), with which that callback does work, but the performance is slightly worse than with fit_one_cycle(). What am I doing wrong, then?


Yeah, it means that the “body” remains as the ImageNet-pretrained weights and is not adjusted any further through training.

This is expected: ReduceLROnPlateau() is meant to be used just with fit, and it is itself a kind of (dynamic) LR schedule, so it doesn’t work on top of other LR schedules. Sounds to me like fit_one_cycle is better for your use case, then.
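Something like this is the usual pattern (just a sketch; the hyperparameters are placeholders):

from fastai.callback.tracker import ReduceLROnPlateau

# flat LR from fit(), with the callback cutting the LR by `factor` when valid_loss stops improving
learn.fit(40, lr=1e-3,
          cbs=ReduceLROnPlateau(monitor='valid_loss', patience=3, factor=10, min_lr=1e-6))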

What problem are you working on? Given that you are using SSIM and PSNR metrics, it sounds like maybe an image super-resolution or restoration project?


Thanks for the snippet here. These are sometimes like a pinch of salt on a medium-well steak. ;)

I’ll probably be glad to start using timm models as well as torchvision’s.