Lesson 11 discussion and wiki

Wasn’t the first layer’s kernel size supposed to be bigger than 3x3?

3 Likes

It depends on where you are in your model. At the beginning, I think they still use a MaxPool. Later on though, it’s average pooling.

1 Like

Average pooling retains more information than max pooling.
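For intuition, here is a minimal sketch (plain PyTorch, not from the lesson notebooks) comparing the two on a toy feature map: max pooling keeps only one activation per window, while average pooling blends every activation in the window.

```python
import torch
import torch.nn.functional as F

# Toy 1x1x4x4 "feature map" (batch, channels, height, width).
x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])

# Max pooling keeps only the largest value in each 2x2 window.
print(F.max_pool2d(x, kernel_size=2))
# tensor([[[[ 6.,  8.],
#           [14., 16.]]]])

# Average pooling blends all four values, so every input activation
# contributes to the output.
print(F.avg_pool2d(x, kernel_size=2))
# tensor([[[[ 3.5000,  5.5000],
#           [11.5000, 13.5000]]]])
```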

2 Likes

Can Jeremy give a quick peek into how he looks at a tensor, breaks it down, and analyzes it?

1 Like

Jeremy mentioned recording videos for topics that we run out of time for… will those recordings be available only along with the MOOC, or will they be released before that?

2 Likes

So what does one do with high-res images (typically the case)? Resize to 224 by 224 (information loss)? Or to any appropriate size that can fit on the GPU?

This was more about being explicit instead of implicit, to make it easier for others to reason about the code (assuming it doesn’t cost you readability):

def label_by_func(sd, f):
    # Building the processor from sd.train makes the dependency explicit,
    # so it's clear a CategoryProcessor can't be reused for another dataset.
    proc = CategoryProcessor.from_dataset(sd.train)
    train = LabeledData.label_by_func(sd.train, f, proc)
    valid = LabeledData.label_by_func(sd.valid, f, proc)
    return SplitData(train, valid)

That way you don’t need the assert in CategoryProcessor.deprocess, and you let your users know that they can’t reuse a CategoryProcessor between different datasets.

1 Like

I have a general question. Not sure what the right forum is, but posting here anyway. I found when working on my initial Kaggle competitions that, most of the time, well feature-engineered kernels tend to perform better than using deep learning techniques and tuning the associated hyperparameters. Is feature engineering a must for every problem we look at, or am I doing something wrong with deep learning?

2 Likes

Example competition? I know this happens with tabular data a bit.

What was the link to that BERT paper Jeremy just showed? I don’t think it’s in the first post.

I guess it depends on the type of competition. For tabular data, it is very often true (at least in my experience). But for images or text, the automatic feature extraction you get with deep learning shows good results.

8 Likes

It has been my experience mostly with tabular data.

I just added it to the first post.

3 Likes

When Jeremy covers the Rossmann competition (tabular data), you’ll see it uses a mix of necessary feature engineering together with deep learning.

Edited to add:
Here are the links to the relevant lessons:
https://course18.fast.ai/lessons/lesson3.html
https://course18.fast.ai/lessons/lesson4.html
https://course18.fast.ai/lessonsml1/lesson10.html
https://course18.fast.ai/lessonsml1/lesson11.html
https://course18.fast.ai/lessonsml1/lesson12.html

6 Likes

Thanks. Can someone give a high level intuition for what “pre-training” is? How is it different from regular training? And is this something we’ve done before?

1 Like

When you use an ImageNet model, you use a pretrained model. Same for transfer learning in NLP.

From the paper:

| Solver | Batch Size | Iterations | F1 score on dev set | Hardware | Time |
| --- | --- | --- | --- | --- | --- |
| Our Method | 64k/32k | 8599 | 90.584 | 1024 TPUs | 76.19m |

1024 TPUs

6 Likes

In Part 1 we used a pre-trained LSTM for classifying movie reviews. Pretraining involved going through WikiText and training the model to predict the next word.

1 Like

Pre-training is the first stage in transfer learning. It is when you train a model on a dataset that is not the dataset you are ultimately interested in (perhaps because your dataset is too small). With transfer learning, you next “fine-tune” the model on your particular dataset.
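To make that concrete, here is a minimal sketch of the two stages, assuming torchvision’s ResNet-34 as the pretrained ImageNet model (an illustration, not the course code):

```python
import torch.nn as nn
from torchvision import models

# Stage 1: pre-training. Someone has already trained this model on
# ImageNet, so we simply download the learned weights.
model = models.resnet34(pretrained=True)

# Stage 2: fine-tuning. Replace the ImageNet head with one sized for
# our own task (say, 10 classes) and train on our much smaller dataset.
model.fc = nn.Linear(model.fc.in_features, 10)

# A common first step: freeze the pretrained body so only the new head
# learns, then later unfreeze and train the whole model at a lower LR.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
```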

3 Likes