Lesson 3 In-Class Discussion ✅

I actually have cases where these single-appearance keywords are important and cannot be ignored.

The exact rule is a maximum of 60k words, each appearing at least twice (otherwise there's nothing to learn), but your general statement is true.
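If it helps, that cutoff logic is easy to sketch in plain Python (`build_vocab` and the toy token list are made up for illustration; if I remember right, fastai exposes these as the `max_vocab` and `min_freq` arguments):

```python
from collections import Counter

def build_vocab(tokens, max_vocab=60_000, min_freq=2):
    """Keep the most frequent tokens, dropping anything seen fewer
    than min_freq times, capped at max_vocab entries."""
    counts = Counter(tokens)
    return [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]

tokens = ["the", "cat", "sat", "the", "mat", "cat", "the"]
print(build_vocab(tokens))  # ['the', 'cat'] -- "sat" and "mat" appear only once
```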

Thanks! Now this all makes sense. I have a multi-label CSV with 26 label fields, which must be why it's picking those up.

Correct! It's basically a way to signal to the neural network that this field is different from the others.

Since there is a convolution operation behind the scenes, no new weights are added.

Do you think the universal approximation theorem is something similar to Fourier series for functions? Like, you can decompose any function into a bunch of sin/cos terms, and likewise here you can build any N-dimensional surface from these "rectangular plateaus"?

I think the other key is figuring out how many (and which) layers to include or not.

Jeremy will cover NLP and ULMFiT in much more detail in a future lesson. This was just a brief example.

Very similar, yes. I haven't read the details of the proof, but I'm pretty sure both of them use the same mathematical theorem behind the scenes.

Yes, I think it’s the same concept.

I understand how fully connected layers relate to linear models, but don’t convolution layers do something sort of different?

Use SentencePiece Byte-Pair-Encoding (BPE): https://github.com/google/sentencepiece ?

Some satellite images have 4 channels. How can we deal with 4-channel or 2-channel datasets using pretrained models?

Character-level language models are interesting too: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

For 2 channels, you could create a dummy 3rd channel that is the average of the 2 channels.
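A quick sketch of that idea, with plain nested lists standing in for an image tensor (`add_mean_channel` is just a hypothetical helper name):

```python
def add_mean_channel(img2):
    """img2: 2 channels as [2][H][W] nested lists. Returns a 3-channel
    image whose third channel is the pixelwise mean of the first two,
    so it fits a 3-channel pretrained model."""
    c0, c1 = img2
    mean = [[(a + b) / 2 for a, b in zip(r0, r1)] for r0, r1 in zip(c0, c1)]
    return [c0, c1, mean]

img = [[[0, 2]], [[4, 6]]]       # 2 channels, each 1x2 pixels
print(add_mean_channel(img)[2])  # [[2.0, 4.0]]
```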

Like if another channel is from a depth sensor, like in the head pose data from the Kinect?

Are certain shapes, or classification problems, more suited to, say, ReLU? For example, Fourier transforms take a ton of terms to build a step change, but only one term to build a sine wave. Is there something analogous for ReLU and the size of the architecture, or the size of the middle layers?
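Worth noting that ReLU is the opposite case: a step change is exactly where ReLUs are cheap. A tiny illustration in plain Python (the function names are just for this sketch):

```python
def relu(x):
    return max(0.0, x)

def abs_via_relu(x):
    # |x| is exactly relu(x) + relu(-x): two hinges make a V shape
    return relu(x) + relu(-x)

def step_via_relu(x, eps=0.01):
    # a steep ramp of width 2*eps approximating a unit step at 0,
    # built from just two ReLU terms (vs. many Fourier terms)
    return (relu(x + eps) - relu(x - eps)) / (2 * eps)

print(abs_via_relu(-3.0))   # 3.0
print(step_via_relu(1.0))   # ~1.0, well past the ramp
print(step_via_relu(-1.0))  # 0.0
```

So the analogy roughly holds in reverse: piecewise-linear or step-like targets need few ReLU units, while very smooth, wavy targets are where ReLU needs many pieces and sines are the cheap basis.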

For 4 channels, you could try to do some kind of dimensionality reduction (linear combinations of channels?) to transform to 3 channels.
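Something like this, sketched with nested lists and a made-up 3x4 mixing matrix (the weights folding the 4th channel into each output are arbitrary, just to show the shape of the operation):

```python
def reduce_channels(img4, weights):
    """img4: [4][H][W] nested lists; weights: [3][4] mixing matrix.
    Each output channel is a linear combination of the 4 inputs."""
    h, w = len(img4[0]), len(img4[0][0])
    return [[[sum(wk * img4[k][i][j] for k, wk in enumerate(row))
              for j in range(w)] for i in range(h)]
            for row in weights]

# e.g. keep 3 channels and fold a little of the 4th into each
W = [[1, 0, 0, 0.1],
     [0, 1, 0, 0.1],
     [0, 0, 1, 0.1]]
img = [[[10]], [[20]], [[30]], [[40]]]  # 4 channels, 1x1 pixel
print(reduce_channels(img, W))  # [[[14.0]], [[24.0]], [[34.0]]]
```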

Does the predict function return multiple labels for multi-label classification?
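For multi-label, the usual approach is a sigmoid per class plus a threshold, so several labels (or none) can come back at once. A minimal sketch, assuming you already have per-class probabilities (the label names are from the planet dataset used in the lesson):

```python
def multilabel_predict(probs, labels, thresh=0.5):
    """Return every label whose probability clears the threshold --
    unlike single-label softmax, zero or several labels can fire."""
    return [lab for lab, p in zip(labels, probs) if p > thresh]

labels = ["haze", "primary", "water", "road"]
print(multilabel_predict([0.9, 0.7, 0.2, 0.6], labels))  # ['haze', 'primary', 'road']
```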

Is it possible to create a model that predicts both a class and a regression number just by creating the correct DataBunch?
