Lesson 3 In-Class Discussion ✅

You’d use their subclass, ImageBBox.

1 Like

Use the 1cycle policy: call learn.fit_one_cycle and pass a higher number of epochs.
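For example (a minimal sketch, assuming fastai v1; MNIST_SAMPLE is just a small built-in dataset so the snippet runs end to end):

```python
from fastai.vision import *

# Small built-in dataset so the example is self-contained.
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
learn = cnn_learner(data, models.resnet18, metrics=accuracy)

# "Pass a higher number" = more epochs within a single 1cycle schedule.
learn.fit_one_cycle(4)
```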

Does the head pose dataset contain video data? If yes, how do we parse the frames?

The IMDB set already comes with a train/test split: “25,000 of them are labelled between positive and negative for training, another 25,000 are labelled for testing”

Is there any reason we should stick to their is_valid split, or could we just as well combine those into 50,000 positive/negative labelled examples and then randomly split our own training and validation sets out of it?

2 Likes

We will post on Twitter (@jeremyphoward and @math_rachel) and on the blog (fast.ai) about part 2.

13 Likes

For regression problems where we know the output should be bounded (in this case, the pixel boundaries), would it help to guide the model with this information, for example by adding clipping to the final node in the model? Or should we just rely on the model learning these boundaries?

3 Likes

You don’t have to, but if you want to report your amazing results, you have to use their validation set.

1 Like
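If you do roll your own split, a sketch with the fastai v1 data block API (using the small IMDB_SAMPLE csv, whose columns are label, text, is_valid):

```python
from fastai.text import *

path = untar_data(URLs.IMDB_SAMPLE)

# Ignore the provided is_valid column and draw a random 20% validation set.
data = (TextList.from_csv(path, 'texts.csv', cols='text')
        .split_by_rand_pct(0.2)
        .label_from_df(cols='label')
        .databunch())
```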

There is going to be support for that soon, but yes, that is useful information for the model.

2 Likes
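One common way to bound a regression head is to squash the final activations with a sigmoid scaled to the target range, rather than hard clipping. A minimal PyTorch sketch; the module name and the (-1, 1) range are just illustrative, not the upcoming fastai feature itself:

```python
import torch
import torch.nn as nn

class SigmoidRange(nn.Module):
    """Squash activations into (lo, hi) with a scaled sigmoid."""
    def __init__(self, lo, hi):
        super().__init__()
        self.lo, self.hi = lo, hi

    def forward(self, x):
        return torch.sigmoid(x) * (self.hi - self.lo) + self.lo

# Append to a regression head whose coordinate targets live in (-1, 1).
head = nn.Sequential(nn.Linear(512, 2), SigmoidRange(-1.0, 1.0))
print(head(torch.randn(4, 512)))  # every output is inside (-1, 1)
```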

fastai.text does not use torchtext. I know Jeremy said in the past that it was too slow; has this changed at all?

4 Likes

What if a rare word is an important keyword for the classification result? Could we avoid putting it in xxunk?

Can the fastai text tokenization be applied to a multilingual dataset?

1 Like

I’m curious how tokenizing works with words that rely on each other. For example, we shouldn’t tokenize San Francisco into two separate words. Can I take this into account when tokenizing?

9 Likes

I think it uses the spaCy tokenizer behind the scenes: https://spacy.io/api/tokenizer

1 Like
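spaCy itself lets you merge multi-word spans back into a single token after tokenization, so something like this sketch is possible (assuming spaCy v3's PhraseMatcher API; fastai doesn't do this for you):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("SAN_FRANCISCO", [nlp("San Francisco")])

doc = nlp("I moved to San Francisco last year")
# Merge each matched span into a single token.
with doc.retokenize() as retok:
    for _, start, end in matcher(doc):
        retok.merge(doc[start:end])

print([t.text for t in doc])
# ['I', 'moved', 'to', 'San Francisco', 'last', 'year']
```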

How can we handle noisy images? Please let us know if there are any standard denoising techniques.

2 Likes

Not sure, but we’re still not using torchtext.

1 Like

I get a bunch of consecutive xxfld tokens at the beginning of my text when I create a TextLMDataBunch. What do they stand for?

xxfld 1 0 xxfld 2 0 xxfld 3 0 xxfld 4 0 xxfld 5 0 xxfld 6 1 xxfld 7 0 xxfld 8 0 xxfld 9 0 xxfld 10 0 xxfld 11 0 xxfld 12 0 xxfld 13 0 xxfld 14 0 xxfld 15 0 xxfld 16 0 xxfld 17 0 xxfld 18 0 xxfld 19 0 xxfld 20 0 xxfld 21 0 xxfld 22 0 xxfld 23 0 xxfld 24 0 xxfld 25 0 xxfld 26

1 Like

I believe rare words are considered to be words which only appear once in the entire dataset; if a word appears twice or more, I don’t think it gets put in xxunk. If a word is so rare that it only appears once, it can’t be used for training (as it is guaranteed to over-fit to just that single record).

1 Like
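In fastai v1 the cutoff is configurable: the vocabulary keeps tokens seen at least min_freq times (capped at max_vocab), and everything else maps to xxunk. A rough sketch of that counting logic (not the library's actual code):

```python
from collections import Counter

def build_vocab(tokens, min_freq=2, unk="xxunk"):
    """Keep tokens seen at least min_freq times; the rest become unk."""
    counts = Counter(tokens)
    return {unk} | {t for t, c in counts.items() if c >= min_freq}

tokens = "the movie was great great but the ending was bad".split()
vocab = build_vocab(tokens)
print([t if t in vocab else "xxunk" for t in tokens])
# ['the', 'xxunk', 'was', 'great', 'great', 'xxunk', 'the', 'xxunk', 'was', 'xxunk']
```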

xxfld means Field 1, Field 2, etc. Somehow you’re feeding the data so that it thinks each column is a separate field. When I feed data to the text classes it’s just label, text (not tokenized).
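For illustration, the preprocessing roughly concatenates multiple text columns and prefixes each with a field marker, along these lines (the helper is hypothetical, not fastai's actual function):

```python
# Hypothetical sketch of how several text columns become one marked-up string.
def mark_fields(*cols):
    return " ".join(f"xxfld {i + 1} {c}" for i, c in enumerate(cols))

print(mark_fields("Great movie", "Would watch again"))
# xxfld 1 Great movie xxfld 2 Would watch again
```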

4 posts were merged into an existing topic: Lesson 3 Advanced Discussion ✅