Lesson 3 In-Class Discussion ✅

You’d use their subclass, ImageBBox.

1 Like

Use the 1cycle policy: call learn.fit_one_cycle and pass a higher number of epochs.
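For example (a minimal sketch, assuming fastai v1; MNIST_SAMPLE is just a small built-in dataset so the snippet runs end to end):

```python
from fastai.vision import *

# Small built-in dataset so the example is self-contained.
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
learn = cnn_learner(data, models.resnet18, metrics=accuracy)

# "Pass a higher number" = more epochs within a single 1cycle schedule.
learn.fit_one_cycle(4)
```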

Does the head pose dataset contain video data? If yes, how do we parse the frames?

The IMDB set already comes with a train/test split: “25,000 of them are labelled between positive and negative for training, another 25,000 are labelled for testing”

Is there any reason we should stick to their is_valid split, or could we just as well combine those into 50,000 positive/negative labelled examples and then randomly split our own training and validation sets out of it?

2 Likes

We will post on Twitter (@jeremyphoward and @math_rachel) and on the blog (fast.ai) about part 2.

13 Likes

For regression problems where we know the output should be bounded (in this case, the pixel boundaries), would it help to guide the model with this information, for example by adding clipping to the final node in the model? Or should we just rely on the model learning these boundaries?

3 Likes

You don’t have to, but if you want to report your amazing results, you have to use their validation set.

1 Like
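If you do roll your own split, a sketch with the fastai v1 data block API (using the small IMDB_SAMPLE csv, whose columns are label, text, is_valid):

```python
from fastai.text import *

path = untar_data(URLs.IMDB_SAMPLE)

# Ignore the provided is_valid column and draw a random 20% validation set.
data = (TextList.from_csv(path, 'texts.csv', cols='text')
        .split_by_rand_pct(0.2)
        .label_from_df(cols='label')
        .databunch())
```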

There is going to be support for that soon, but yes, that is useful information for the model.

2 Likes
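One common way to bound a regression head is to squash the final activations with a sigmoid scaled to the target range, rather than hard clipping. A minimal PyTorch sketch; the module name and the (-1, 1) range are just illustrative, not the upcoming fastai feature itself:

```python
import torch
import torch.nn as nn

class SigmoidRange(nn.Module):
    """Squash activations into (lo, hi) with a scaled sigmoid."""
    def __init__(self, lo, hi):
        super().__init__()
        self.lo, self.hi = lo, hi

    def forward(self, x):
        return torch.sigmoid(x) * (self.hi - self.lo) + self.lo

# Append to a regression head whose coordinate targets live in (-1, 1).
head = nn.Sequential(nn.Linear(512, 2), SigmoidRange(-1.0, 1.0))
print(head(torch.randn(4, 512)))  # every output is inside (-1, 1)
```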

fastai.text does not use torchtext. I know Jeremy said in the past that it was too slow; has this changed at all?

4 Likes

What if a rare word is an important keyword for the classification result? Could we avoid putting it in xxunk?

Can the fastai text tokenization be applied to a multilingual dataset?

1 Like

I’m curious how tokenizing works with words that rely on each other. For example, we shouldn’t tokenize San Francisco into two separate words. Can I take this into account when tokenizing?

9 Likes

I think it uses the spaCy tokenizer behind the scenes: https://spacy.io/api/tokenizer

1 Like
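spaCy itself lets you merge multi-word spans back into a single token after tokenization, so something like this sketch is possible (assuming spaCy v3's PhraseMatcher API; fastai doesn't do this for you):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("SAN_FRANCISCO", [nlp("San Francisco")])

doc = nlp("I moved to San Francisco last year")
# Merge each matched span into a single token.
with doc.retokenize() as retok:
    for _, start, end in matcher(doc):
        retok.merge(doc[start:end])

print([t.text for t in doc])
# ['I', 'moved', 'to', 'San Francisco', 'last', 'year']
```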

How can we handle noisy images? Please let us know if there are any standard denoising techniques.

2 Likes

Not sure, but we’re still not using torchtext.

1 Like

I get a bunch of consecutive xxfld tokens at the beginning of my text when I create a TextLMDataBunch. What do they stand for?

xxfld 1 0 xxfld 2 0 xxfld 3 0 xxfld 4 0 xxfld 5 0 xxfld 6 1 xxfld 7 0 xxfld 8 0 xxfld 9 0 xxfld 10 0 xxfld 11 0 xxfld 12 0 xxfld 13 0 xxfld 14 0 xxfld 15 0 xxfld 16 0 xxfld 17 0 xxfld 18 0 xxfld 19 0 xxfld 20 0 xxfld 21 0 xxfld 22 0 xxfld 23 0 xxfld 24 0 xxfld 25 0 xxfld 26

1 Like

I believe rare words are considered to be words which only appear once in the entire dataset; if a word appears twice or more, I don’t think it gets put in xxunk. If a word is so rare that it only appears once, it can’t be used for training (as it is guaranteed to over-fit to just that single record).

1 Like
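In fastai v1 the cutoff is configurable: the vocabulary keeps tokens seen at least min_freq times (capped at max_vocab), and everything else maps to xxunk. A rough sketch of that counting logic (not the library's actual code):

```python
from collections import Counter

def build_vocab(tokens, min_freq=2, unk="xxunk"):
    """Keep tokens seen at least min_freq times; the rest become unk."""
    counts = Counter(tokens)
    return {unk} | {t for t, c in counts.items() if c >= min_freq}

tokens = "the movie was great great but the ending was bad".split()
vocab = build_vocab(tokens)
print([t if t in vocab else "xxunk" for t in tokens])
# ['the', 'xxunk', 'was', 'great', 'great', 'xxunk', 'the', 'xxunk', 'was', 'xxunk']
```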

xxfld means Field 1, Field 2, etc. Somehow you’re feeding the data so that it thinks each column is a separate field. When I feed data to the text classes it’s just label, text (not tokenized).
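For illustration, the preprocessing roughly concatenates multiple text columns and prefixes each with a field marker, along these lines (the helper is hypothetical, not fastai's actual function):

```python
# Hypothetical sketch of how several text columns become one marked-up string.
def mark_fields(*cols):
    return " ".join(f"xxfld {i + 1} {c}" for i, c in enumerate(cols))

print(mark_fields("Great movie", "Would watch again"))
# xxfld 1 Great movie xxfld 2 Would watch again
```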

4 posts were merged into an existing topic: Lesson 3 Advanced Discussion ✅