Research collaboration opportunity with Leslie Smith

Fascinating dynamic here!

I was wondering, what is the rationale for these two hypotheses? Are they just hunches, or do you have particular reasons for believing them to be worthwhile explorations?

Please count me in, if possible!

What I am missing here is the ‘WHY’.
It would be great if @jeremy and @Leslie could explain a bit about why they think their approaches work, and elaborate more on the thought process.

I want to learn how to catch a fish, not just fry an already-caught fish :slight_smile:

I think they’re different approaches to the same overall idea: speed up learning by learning the ‘big picture’ first and increasing detail as you go along. Sounds very Fast AI.

Put another way: you want to learn what a ‘fish’ is.

Leslie is saying “let’s give you one picture of what a fish is so you get the idea (fins, shape, etc.). Once you learn that, I’ll give you more examples of fish so you can learn the details: how they vary, what features they share, etc. Do that for cars, plants, etc.”

Jeremy is saying “let’s start by learning about very different things – fish, car, plant – in the usual way. You get an idea of what each is, and how they’re different. Once you learn that, we’ll start learning about different kinds of fish, cars (Toyotas, Mercedes, etc.), plants, and so on.”


Anyone correct me if I’m wrong – that’s the premise I’m going on.

4 Likes

I really appreciate your and Jeremy’s effort in opening this research collaboration opportunity to everyone interested. Thanks for doing that.

I have been thinking about how to design an experiment to test whether the ideas given by both Leslie and Jeremy work. Here are my thoughts:

  • How do we know if the experiment is successful (that is, we obtain an accuracy above a threshold and also make the training faster)?
    • We can use a framework similar to the asymptotic model of computational complexity used to determine the efficiency of an algorithm. That is, use an instruction count instead of measuring time in seconds/milliseconds. This makes the comparison independent of machine specifications (faster hardware, more optimized software versions, etc.).
    • If we go with the framework above, then we carry out our experiments by first choosing a model, training it up to x% accuracy using the full data, and counting the number of computations required. One computation is defined as one forward and one backward pass over an image. This is the benchmark for that model. Then we train the model using Leslie’s/Jeremy’s trick. For the experiment to be successful, the number of computations required should be less than the benchmark above (a rough sketch of this protocol follows this list).
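
To make that protocol concrete, here is a minimal sketch of the counting, assuming a plain PyTorch training loop; `model`, `train_loader`, `evaluate`, and `target_acc` are all placeholders I made up, not anything from the fastai library:

import torch
import torch.nn.functional as F

# Hypothetical sketch of the benchmark above; one "computation" is tallied
# as one image going through a forward and a backward pass.
def computations_to_accuracy(model, train_loader, evaluate, target_acc,
                             max_epochs=100, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    n_computations = 0
    for epoch in range(max_epochs):
        for xb, yb in train_loader:
            opt.zero_grad()
            F.cross_entropy(model(xb), yb).backward()
            opt.step()
            n_computations += xb.size(0)   # every image in the batch did one fwd+bwd pass
        if evaluate(model) >= target_acc:  # stop once we reach x% accuracy
            break
    return n_computations

Run it once with full-data training to get the benchmark count, then rerun with Leslie’s/Jeremy’s staged schedule and check whether the count comes in lower.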

Now the question is, what are the drawbacks of using computation count over measuring time in seconds? What am I missing?

2 Likes

To determine the success of the experiment, I have the following idea: train the model as usual on the full dataset, as well as you can (this will be the baseline), then try the two methods; it’s a success if you can obtain the same accuracy as the baseline.
If you get something that is not as good as the baseline but close, an interesting measure could be accuracy over time.

As for measuring speed… benchmarking is hard. Ideally all the experiments would need to be done on the same machine at a certain point, so we’re sure there are no differences in hardware, CUDA/cuDNN versions, etc. As for instruction count, it can be higher or lower depending on the instruction set of the processor, so it’s still machine-dependent (in a more obscure way, in my opinion).

Another measure that I’ve seen used in papers is the number of updates, that is, how many gradients are computed and applied to the parameters of your model.
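
A minimal sketch of tracking that metric alongside the per-image count, reusing the placeholder names (`model`, `opt`, `train_loader`) from the earlier sketch in this thread:

# Hypothetical sketch: one update = one optimizer step, so unlike the
# per-image computation count it shrinks as the batch size grows.
n_updates, n_computations = 0, 0
for xb, yb in train_loader:
    opt.zero_grad()
    F.cross_entropy(model(xb), yb).backward()  # gradients computed...
    opt.step()                                 # ...and applied: one update
    n_updates += 1
    n_computations += xb.size(0)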

I think all the approaches give slightly different information, and there isn’t a universally good choice.

1 Like

Thanks for your comment, @iacolippo. You do have a point about the instruction set. We can certainly go by the number of updates metric.

As for:

try the two methods; it’s a success if you can obtain the same accuracy as the baseline.

I just have one doubt; it is just speculation, however. When we use the entire dataset, won’t we eventually be able to train it as well as our baseline (since we’re using the entire dataset now) if we let go of the constraints on time? My point being: just reaching the baseline accuracy won’t suffice. For the experiments to make sense, time will always have to be factored in.

Yes, the point would be to say we reach accuracy A in time X instead of Y, with X < Y :slight_smile: Although I think it can also be a success if you reach, e.g., 90% accuracy instead of 91% in 3 minutes instead of 30.

1 Like

Thanks for including me!

Would it be best to first define a simple architecture to use, such as the simple ResNet created in lesson 7, or to also use the approaches learnt from the DAWNBench competitions (WideResNet and Darknet)?

I think both methods would be nice to have:
1.) a more technical benchmark (the number of instructions/gradient updates/etc.), and
2.) an applied benchmark comparing the computation time needed to reach a certain accuracy on the same machine with the two approaches.

As @nubbinsonfire suggested: should we pick a basic architecture (ResNet-34)?

Where do we get the data?
Is there a way to get the ImageNet data in nested groups, or would this be a manual task (e.g. animals with four legs > dogs, cats, horses, etc.)?

I’m looking forward to exploring this topic! :slight_smile:

Can you please add me? ID: deepnarainsingh

Thanks,
Deep

1 Like

For Jeremy’s idea, we need a hierarchy among the classes, and CIFAR-100 would be an ideal dataset to benchmark on.

The CIFAR-100 classes are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs). Link (A sketch for loading both label sets follows the table below.)

The 20 superclasses and their classes:
aquatic mammals: beaver, dolphin, otter, seal, whale
fish: aquarium fish, flatfish, ray, shark, trout
flowers: orchids, poppies, roses, sunflowers, tulips
food containers: bottles, bowls, cans, cups, plates
fruit and vegetables: apples, mushrooms, oranges, pears, sweet peppers
household electrical devices: clock, computer keyboard, lamp, telephone, television
household furniture: bed, chair, couch, table, wardrobe
insects: bee, beetle, butterfly, caterpillar, cockroach
large carnivores: bear, leopard, lion, tiger, wolf
large man-made outdoor things: bridge, castle, house, road, skyscraper
large natural outdoor scenes: cloud, forest, mountain, plain, sea
large omnivores and herbivores: camel, cattle, chimpanzee, elephant, kangaroo
medium-sized mammals: fox, porcupine, possum, raccoon, skunk
non-insect invertebrates: crab, lobster, snail, spider, worm
people: baby, boy, girl, man, woman
reptiles: crocodile, dinosaur, lizard, snake, turtle
small mammals: hamster, mouse, rabbit, shrew, squirrel
trees: maple, oak, palm, pine, willow
vehicles 1: bicycle, bus, motorcycle, pickup truck, train
vehicles 2: lawn-mower, rocket, streetcar, tank, tractor
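
If it helps, here is a minimal sketch of pulling both label sets out of the python version of the archive linked above; the local path is a placeholder, and I’m assuming the pickle keys ('data', 'fine_labels', 'coarse_labels') described in the dataset’s documentation:

import pickle
import numpy as np

# Hypothetical sketch: load CIFAR-100 with both label granularities.
with open('cifar-100-python/train', 'rb') as f:   # placeholder path
    d = pickle.load(f, encoding='bytes')

images = d[b'data'].reshape(-1, 3, 32, 32)   # 50,000 32x32 RGB images
fine   = np.array(d[b'fine_labels'])         # 100 classes
coarse = np.array(d[b'coarse_labels'])       # 20 superclasses

# A staged schedule could train on `coarse` first, then transfer
# the weights to the 100-way `fine` problem.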

~Gokkul

7 Likes

Hope I am not too late, count me in.
My github id is https://github.com/hongcheng79
Thanks !

1 Like

I don’t have time right now to commit to involvement in this group project, but I wanted to put my theoretical hat in the ring and suggest a direction that should work well with @jeremy’s and @Leslie’s ideas. When I first heard about both of these, my instinct was that they should combine extremely well with stochastic weight averaging (SWA), which is already implemented as part of the fast.ai library.

If training on one small batch of images, or on one class and then adding another, is viable, then SWA should be a good strategy for regularizing in the small-batch case, and for combining the classes in the added-classes case. It also provides a mechanism for increasing parallelism, which could drastically speed up multi-GPU training.

I think it’ll likely work better with Leslie’s small-batches idea, and that’s what I was hoping to try it on, but I don’t have the time myself. I’ll definitely be following this thread though, and if anyone has questions about this I’m happy to try to answer them.

Pseudocode for the inner loop:

batch = random sample of one image per class (or more if bs << capacity)
backpropagate to find the weights that optimize for that set of images (same as Leslie's, high lr)
weights = old_weights * ((n - 1) / n) + new_weights * (1 / n)
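
And, in case anyone wants to pick this up, a rough PyTorch rendering of that loop; `model`, `sample_one_per_class`, `n_iters`, and the learning rate are all placeholders, and the last line is the same (n-1)/n running average as above:

import copy
import torch
import torch.nn.functional as F

opt = torch.optim.SGD(model.parameters(), lr=1.0)  # deliberately high lr (placeholder value)
swa_model = copy.deepcopy(model)                   # holds the running weight average

for n in range(1, n_iters + 1):
    xb, yb = sample_one_per_class()                # one image per class
    opt.zero_grad()
    F.cross_entropy(model(xb), yb).backward()
    opt.step()
    # weights = old_weights*((n-1)/n) + new_weights*(1/n)
    with torch.no_grad():
        for w_avg, w in zip(swa_model.parameters(), model.parameters()):
            w_avg.mul_((n - 1) / n).add_(w, alpha=1 / n)

# (batch-norm statistics of swa_model would need recomputing before evaluation)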

Looking forward to seeing the direction people take this.

I could work on the Docker part as well; do let me know how to contribute in this area. I was looking at nvidia-docker, @radek.

I am not planning on making any changes to the Docker setup at this point, but please feel free to fork it if you believe that would be helpful.

Sorry for the delay in responding - I was offline for the past 3 days while observing a Jewish holiday.

It is very appropriate to ask me why I think my idea might work so I will attempt to convey my thinking.

Jeremy and I were discussing transfer learning (TL). If you know about TL, you’ll know that it works better the more closely the source dataset’s classes match the target dataset’s. Although TL is typically used with a large labeled source dataset and a target with limited labeled samples, I wondered about reversing this a bit: what if we used exactly the same labels but started with only part of the training samples, and did transfer learning to warm-start the weights for ‘fine-tuning’ with the full training set? Of course, by induction one should do this recursively. It seemed likely to speed up training, but I wasn’t sure about performance. I hope the first stages direct the weights towards a good solution, as @Borz explained (nice explanation). Also, I’ve seen in the literature that TL can improve performance over training from scratch.
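
For anyone who wants to experiment, here is a minimal sketch of what this could look like, with made-up fractions and a generic PyTorch loop; `model` and `train_ds` are placeholders:

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset

# Hypothetical sketch: train on a growing fraction of the samples,
# transferring (keeping) the weights from one stage to the next.
def staged_training(model, train_ds, fractions=(0.1, 0.3, 1.0), epochs_per_stage=3):
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    for frac in fractions:                       # each stage warm-starts the next
        n = int(len(train_ds) * frac)
        # in practice the subset should be a class-balanced random sample
        loader = DataLoader(Subset(train_ds, range(n)), batch_size=64, shuffle=True)
        for _ in range(epochs_per_stage):
            for xb, yb in loader:
                opt.zero_grad()
                F.cross_entropy(model(xb), yb).backward()
                opt.step()
    return model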

Let me know if this explanation is clear and sufficient. If need be, I can try again.

Best,
Leslie

3 Likes

I am glad to see that you did a literature search, @narvind2003. All research projects need to start with a literature search, to find out whether the idea has already been done, or simply to find related work.

Incremental learning is a bit different from Jeremy’s idea. There’s a research area called “lifelong learning”, with several subtopics such as incremental learning, continuous learning, etc. The goal there is to be able to learn as you go – online learning of new/novel classes. Some of my own research is in this area.

On the other hand, Jeremy proposed his idea to speed up learning (correct me if I am wrong). One has all of the classes at the start, but one introduces them to the network in stages. I believe Jeremy’s reasoning is related to curriculum learning.
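
As I read it, the stage selection could be as simple as the following sketch (names are placeholders; the targets stay fixed, only the visible classes grow, and each stage could reuse the training loop from the staged-TL sketch above):

from torch.utils.data import Subset

# Hypothetical sketch of the staged-classes schedule: keep only the
# examples whose label is among the first k classes.
def first_k_classes(train_ds, targets, k):
    idx = [i for i, y in enumerate(targets) if y < k]
    return Subset(train_ds, idx)

# e.g. train on first_k_classes(ds, ys, 20), then 50, then all 100,
# carrying the weights over between stages.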

5 Likes

IMO, the commonality between Jeremy’s increasing the number of classes as training progresses and my idea for increasing the number of samples/class is that an aspect of the training grows in steps/stages. That is, the first stage is a quick training that (hopefully) brings the weights in the vicinity of a good solution. The next step is transfer learning (TL) of the weights to find a more accurate solution (repeat as necessary). From this perspective, progressive resizing might also be considered as part of this family of ideas (is this clear? If not, ask.).

I asked myself “what else can grow?” that we haven’t yet considered. The answer that came to me is the architecture. If the first stage were to train a thin, shallow network, it would be pretty quick to train. If those weights could be used to “warm start” a wider, deeper network, it might just work.

Wait a minute – how can one warm start a larger network? Hmmmm… Actually, I remember a nice paper from a couple of years ago from the good people at Google:
Chen, Tianqi, Ian Goodfellow, and Jonathon Shlens. “Net2net: Accelerating learning via knowledge transfer.” arXiv preprint arXiv:1511.05641 (2015).
In this paper the authors show how to do TL of the weights to a wider and/or deeper network.
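
To make the widening operation concrete, here is a toy sketch for a pair of fully connected layers, based on my reading of the paper’s function-preserving “Net2WiderNet” mapping; all sizes and names are made up:

import torch

# Hypothetical toy sketch of Net2WiderNet (Chen et al., 2015) for two
# stacked Linear layers: widen layer 1 from n to q units while preserving
# the network's function, so the wider net warm-starts at the same point.
def net2wider(W1, b1, W2, q):
    n = W1.size(0)                                  # current width
    g = torch.cat([torch.arange(n),                 # keep every original unit,
                   torch.randint(0, n, (q - n,))])  # then duplicate random ones
    counts = torch.bincount(g, minlength=n).float()
    W1_new, b1_new = W1[g], b1[g]                   # duplicated rows of layer 1
    W2_new = W2[:, g] / counts[g]                   # rescale so duplicated paths sum back
    return W1_new, b1_new, W2_new

# quick check that the function is preserved:
x = torch.randn(5, 8)
W1, b1, W2 = torch.randn(16, 8), torch.randn(16), torch.randn(4, 16)
W1w, b1w, W2w = net2wider(W1, b1, W2, q=24)
y_old = (x @ W1.t() + b1).relu() @ W2.t()
y_new = (x @ W1w.t() + b1w).relu() @ W2w.t()
assert torch.allclose(y_old, y_new, atol=1e-5)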

Therefore, I’d like to expand this research investigation to also include transfer learning in stages from a thin, shallow network to a final wider, deeper network. Will anyone accept this challenge?

9 Likes