Lesson 2 discussion

I am trying to run the linear model code from Lesson 2, but I get 87% accuracy rather than the 97% obtained by Jeremy – any hints on what may be going wrong?

lm.fit(trn_features, trn_labels, nb_epoch=15, batch_size=batch_size,
       validation_data=(val_features, val_labels))


Train on 23000 samples, validate on 2000 samples
Epoch 1/15
23000/23000 [==============================] - 0s - loss: 0.3425 - acc: 0.8536 - val_loss: 0.3388 - val_acc: 0.8625
Epoch 2/15
23000/23000 [==============================] - 0s - loss: 0.3288 - acc: 0.8685 - val_loss: 0.3543 - val_acc: 0.8685
Epoch 3/15
23000/23000 [==============================] - 0s - loss: 0.3326 - acc: 0.8735 - val_loss: 0.3628 - val_acc: 0.8710
Epoch 4/15
23000/23000 [==============================] - 0s - loss: 0.3345 - acc: 0.8737 - val_loss: 0.3771 - val_acc: 0.8685
Epoch 5/15
23000/23000 [==============================] - 0s - loss: 0.3356 - acc: 0.8779 - val_loss: 0.3820 - val_acc: 0.8735
Epoch 6/15
23000/23000 [==============================] - 0s - loss: 0.3405 - acc: 0.8770 - val_loss: 0.3928 - val_acc: 0.8730
Epoch 7/15
23000/23000 [==============================] - 0s - loss: 0.3425 - acc: 0.8779 - val_loss: 0.3972 - val_acc: 0.8715
Epoch 8/15
23000/23000 [==============================] - 0s - loss: 0.3431 - acc: 0.8781 - val_loss: 0.4090 - val_acc: 0.8685
Epoch 9/15
23000/23000 [==============================] - 0s - loss: 0.3456 - acc: 0.8782 - val_loss: 0.4156 - val_acc: 0.8720
Epoch 10/15
23000/23000 [==============================] - 0s - loss: 0.3471 - acc: 0.8798 - val_loss: 0.4123 - val_acc: 0.8695
Epoch 11/15
23000/23000 [==============================] - 0s - loss: 0.3490 - acc: 0.8800 - val_loss: 0.4238 - val_acc: 0.8735
Epoch 12/15
23000/23000 [==============================] - 0s - loss: 0.3499 - acc: 0.8795 - val_loss: 0.4216 - val_acc: 0.8725
Epoch 13/15
23000/23000 [==============================] - 0s - loss: 0.3514 - acc: 0.8809 - val_loss: 0.4211 - val_acc: 0.8730
Epoch 14/15
23000/23000 [==============================] - 0s - loss: 0.3521 - acc: 0.8802 - val_loss: 0.4354 - val_acc: 0.8680
Epoch 15/15
23000/23000 [==============================] - 0s - loss: 0.3520 - acc: 0.8801 - val_loss: 0.4226 - val_acc: 0.8750

Interesting @axelstram, I am getting 0.87 with both loss functions. Out of curiosity, are you using the TensorFlow backend?

No, I’m using Theano.

Why did model.fit() run through its epochs so much faster in Lesson 2 @ 1h40m54s than in Lesson 1?

EDIT: Never mind, I think I see now. This model doesn’t need to spend time computing the output of all the layers from the Lesson 1 model, because it only has one layer. Duh.

I think I was confused before because in Lesson 1 we set the earlier layers to be untrainable, and in my head that meant no computation was going on there. But I guess setting a layer to be untrainable just means we’re locking its weights: we still need to compute the output of those layers when they are part of the model architecture.
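For anyone else with the same mental model, here’s a minimal sketch of the point (a toy model of my own, not the lesson’s - Keras 1.x syntax):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Toy two-layer model, purely for illustration
model = Sequential([Dense(10, activation='relu', input_shape=(4,)),
                    Dense(2, activation='softmax')])
model.layers[0].trainable = False            # weights locked, layer NOT skipped
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
preds = model.predict(np.random.rand(3, 4))  # frozen layer still computes its output

Freezing only changes what the optimizer updates; the forward pass is identical either way.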

Looking back, this was a simple question. Though it was confusing at the time, it gives me some confidence going forward that what seems confusing could be simple - and probably is, if I go back over the material.

Do you know what the difference is between VGG.test and VGG.predict, and which one we should use?
I used predict, but I see Jeremy used VGG.test in his cats vs dogs redux notebook. Any ideas?

I installed ffmpeg via conda:
conda install -c conda-forge ffmpeg

Then, restart Jupyter Notebook.

I installed ffmpeg via

conda install -c menpo ffmpeg

It worked right away, without restarting the notebook (like magic).

Hi Jeremy,
I am a bit confused here (obviously missing something basic).
When using lm.fit() you used the output of the predict step as one of the arguments (trn_features = model.predict(trn_data, batch_size=100)); however, your suggestion to use lm.fit_generator hints at using batches (which are generators created to feed raw image data, to be concatenated using the get_data function). Is this correct?

Or are you suggesting creating a generator from “trn_features” that could be used as an argument to fit_generator - something like the sketch below?
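Purely to illustrate that second reading (feature_gen is a made-up helper; lm, trn_features, and trn_labels are the notebook’s variables - Keras 1.x syntax):

def feature_gen(features, labels, batch_size=64):
    # Keras expects generators to loop forever
    while True:
        for i in range(0, len(features), batch_size):
            yield features[i:i+batch_size], labels[i:i+batch_size]

lm.fit_generator(feature_gen(trn_features, trn_labels),
                 samples_per_epoch=len(trn_features), nb_epoch=3)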

I would like to make my Kaggle submission from my PC, but I’m having problems copying the CSV file from the server to my machine.

My unix exposure is about 30-plus years old, and I’m still getting used to AWS and learning bash. But I’ve been through the Andrew Ng ML course on Coursera, as well as the Learning From Data course on edX by Caltech, so lots of the concepts make sense. I have a Kaggle account and made it to the top 26% on a competition last year about voter intentions as part of the Data Analytics MOOC.

(Frustrating, because I stepped through the whole modeling process in a jupyter notebook over the course of several hours, and I think I may have a reasonably good submission, even though it is only my first time through the process.)

But syntax is everything, eh?

Enough background and whining…

I tried FileLink, which quickly gave me a URL in the Jupyter cell, but clicking on it gave me a 404 Not Found error. I tried to make sense of the Stack Overflow discussion about FileLink and the directory trees it recognizes, but I never understood how to overcome my problem.

So then I turned to scp and tried the basic syntax,

scp username@remote:/file/to/send /where/to/put

in my case, something like this in a cygwin bash session on my PC:

scp @:/home/courses/README.md C:/BigData

where “C:/BigData” is a folder on my machine. After failing to get the actual submission file, I thought I would simplify the problem and just copy a text file from higher up the directory tree, namely /home/courses/README.md.

I also tried using this for the destination: /cygwin/c/BigData

Same error in all cases when using scp: Permission Denied (public key)

Based on Stack Overflow suggestions, I’ve tried these things with the same result:

scp admin@xx.xx.xxx.xx:~/scraper/summary.csv /home/barns/Desktop (add a tilde)

scp -vvv admin@xx.xx.xxx.xx:~/scraper/summary.csv /home/barns/Desktop (add a -vvv)

I got a bunch of lines from -vvv. One told me scp was trying the right IP address, but using port 22. Can that be right? Is that a problem? Shouldn’t it be port 8888, where the notebooks are?

I can see the right IP address in .ssh/known_hosts on my local machine.

The only file with a “pem” in it is aws-key-fast-ai.pem.

Do I need a pem file for AWS?

I’ve spent over an hour searching the Forum on Lesson 2 and reading Stack Overflow without reaching a solution. If you’ve followed me this far, you can tell I’m very confused. If anyone could give me a few guidelines, preferably with explicit syntax for an imaginary system, I would appreciate it. Thank you.

Hey Charlie,

If I understand correctly, the error you get is: “scp: Permission Denied (public key)”

If that’s correct, I don’t believe any amount of directory-path fiddling will make the situation better. I believe your scp request from your Windows/Cygwin machine is being rejected by your AWS EC2 instance due to a security issue (yet to be determined). This is not an issue of not finding the file(s); it’s an issue of not being let in the door.

Security and private/public keys were discussed more fully in the “Setup” lesson. I think in Lessons 1 & 2 it’s assumed to be properly set up and working.

The scp command indeed uses a protocol that runs over port 22. TCP port 8888 is for the web content hosted by Jupyter. Two different apps (which is why it’s two different ports).

My advice is to revisit how your key pairs are set up. I don’t use Cygwin, so I don’t know how that might affect default configuration parameters (paths, default file names, etc.).
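One thing worth trying is pointing scp at your key explicitly with -i (everything here - key path, user name, remote path, destination - is a placeholder to adapt to your setup; ubuntu is the usual default user on the course’s Ubuntu AMIs):

scp -i ~/.ssh/aws-key-fast-ai.pem ubuntu@xx.xx.xxx.xx:~/nbs/submission.csv /cygdrive/c/BigData

The -i flag tells scp which private key to offer, which is usually what’s missing behind “Permission denied (publickey)”. Note also that under Cygwin, C: is normally reached as /cygdrive/c, not /cygwin/c.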

Good luck,
JP

Hello JP,

Thanks so much for your reply, and thanks also for the email. I thought the course might have been suspended or abandoned, and I had moved on to an ML project outside the course.

Yes, you understand correctly.

The AWS console tells me I have only one key pair, for fast-ai. So, I will review Lesson 1 and try to find the discussion of key pairs.

Thanks again for reading and responding.

Charlie

Hi Jeremy - great course.
I am at Lesson 2, trying to understand the fine-tuning.

When I ran all the code, everything went great. But when I try to skip the lm.fit step and go straight to the fine-tuning, it doesn’t work.

I mean, I get an accuracy of 0.5 when I don’t run lm.fit and just pop and add the layer.

I don’t understand why lm.fit, which doesn’t connect to the model and is a stand-alone linear optimizer, affects the fine-tuning of the whole model.

Why can’t I fine-tune if I don’t run it?

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop

# Stand-alone linear model trained on the precomputed VGG features
lm = Sequential([Dense(2, activation='softmax', input_shape=(1000,))])
lm.compile(optimizer=RMSprop(lr=0.1), loss='categorical_crossentropy', metrics=['accuracy'])

lm.fit(trn_features, trn_labels, nb_epoch=3, batch_size=batch_size,
       validation_data=(val_features, val_labels))

# Fine-tuning: drop the VGG model's last layer, freeze the rest, add a fresh Dense layer
model.pop()
for layer in model.layers: layer.trainable = False
model.add(Dense(2, activation='softmax'))

As it seems, the layer we added is not the layer we fitted - it’s a different instance.

But when I skipped fitting the lm layer, the tuning didn’t work, and I don’t understand why.

I think I understood my problem: it was when I didn’t concatenate to NumPy arrays and used the train generator instead that it didn’t work…

Why are the NumPy arrays so important?

Sorry - my mistake was that my generator rescaled to 1/255.
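For anyone hitting the same thing, here’s a minimal sketch of the mismatch (Keras 1.x; the mean values are the standard ImageNet channel means, which the course’s Vgg16 subtracts inside the model - treat the specifics as assumptions):

import numpy as np
from keras.preprocessing.image import ImageDataGenerator

# For reference: the channel means (RGB) that the Vgg16 model subtracts internally
vgg_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32)

gen_ok  = ImageDataGenerator()                # raw 0-255 pixels: the model's own
                                              # mean subtraction behaves as trained
gen_bad = ImageDataGenerator(rescale=1./255)  # pixels squashed to [0, 1]: the frozen
                                              # VGG layers see inputs nothing like
                                              # what their weights were trained on

With rescale=1./255, subtracting a mean like 123.68 leaves every input at a large negative value, so the pretrained features (and with them the accuracy) fall apart.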

Hi All,

From the Keras documentation (“How can I ‘freeze’ Keras layers?”), it appears that Sequential models need to be re-compiled if the trainable property is changed.

If this is correct, should the third paragraph under the heading “Training multiple layers in Keras”

Since we haven’t changed our architecture, there’s no need to re-compile the model - instead, we just set the learning rate

be changed?
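If I’m reading the FAQ right, the fix would be a sketch like this (Keras 1.x syntax; the model and optimizer settings here are only for illustration):

from keras.optimizers import RMSprop

# Toggle trainable, then re-compile so the change takes effect during training
for layer in model.layers[:-1]:
    layer.trainable = False
model.compile(optimizer=RMSprop(lr=0.01),
              loss='categorical_crossentropy', metrics=['accuracy'])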


I’m also having the same problem. I’m at the same point in the Lesson 2 notebook as @rashudo, calling fit_model(model, batches, val_batches, nb_epoch=2). I’m just using my sample set of 200 training images and 50 validation images, but my results look nearly identical, only I’m running a hair above or below 0.50 accuracy. I think @luca must be at a different point in the exercise, but the results look similar also.

Epoch 1/2
200/200 [==============================] - 317s - loss: 7.5536 - acc: 0.5200 - val_loss: 7.4143 - val_acc: 0.5400
Epoch 2/2
200/200 [==============================] - 264s - loss: 7.4949 - acc: 0.5350 - val_loss: 7.4143 - val_acc: 0.5400

I hadn’t noticed this at first, until I ran my confusion matrix, which resulted in predictions of all cats and no dogs, even though the labeled data is evenly split.

I ran through this twice trying to debug the problem. On my second run, the results were similar, but accuracy was slightly below 0.50, and the confusion matrix was exactly reversed!

Epoch 1/2
200/200 [==============================] - 252s - loss: 8.6517 - acc: 0.4500 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 2/2
200/200 [==============================] - 256s - loss: 8.6232 - acc: 0.4650 - val_loss: 8.7038 - val_acc: 0.4600

I was starting to think it was related to my small sample data set, but after seeing similar results from @luca with the full data set, I’m pretty sure it’s not that.

Still spinning my wheels on this, and it looks like others may be too. I’m sure we’d love to hear ideas from anyone who’s been there, done that, and hopefully got a nice T-shirt from it. :slight_smile:

Thanks in advance!

In my case it was caused by the layers not being trainable and the batch size being too low. You should also check that your labels are OK - maybe they don’t line up with the images?

It might also be that nothing is wrong, but that your network is simply learning slowly - it’s just 200 samples and 2 epochs. Try setting the learning rate higher and training for more epochs. And if possible, get your hands on a fast GPU, or lower the image resolution to speed things up.

Thanks so much for the suggestions! I got it working right, and learned a thing or two along the way. This was the key:

… however, I think you actually meant to say to set the learning rate lower, because that’s how to reduce the chance of diverging (with gradient descent, anyway - thanks to Andrew Ng’s course for that). So I reduced the LR from 0.1 to 0.01 and ran 4 epochs instead of 2. This was my result:

Epoch 1/4
200/200 [==============================] - 251s - loss: 7.5377 - acc: 0.5200 - val_loss: 7.4143 - val_acc: 0.5400
Epoch 2/4
200/200 [==============================] - 248s - loss: 7.4949 - acc: 0.5350 - val_loss: 7.4143 - val_acc: 0.5400
Epoch 3/4
200/200 [==============================] - 247s - loss: 7.4949 - acc: 0.5350 - val_loss: 7.4143 - val_acc: 0.5400
Epoch 4/4
200/200 [==============================] - 243s - loss: 7.4949 - acc: 0.5350 - val_loss: 7.4143 - val_acc: 0.5400

Barely any improvement, sadly, and this time my confusion matrix showed everything predicted as dogs, no cats. However, I noticed that the slight improvement in this run came between the first and second epochs. Epochs 3 and 4 showed no improvement.

From that, I figured that the minuscule improvement that I did get had to come solely from reducing the learning rate. So, I ran another pass, this time reducing the learning rate from 0.01 to 0.001, and dropped back to 2 epochs. This time I got:

Epoch 1/2
200/200 [==============================] - 260s - loss: 0.4592 - acc: 0.8500 - val_loss: 0.0347 - val_acc: 0.9800
Epoch 2/2
200/200 [==============================] - 255s - loss: 0.1485 - acc: 0.9400 - val_loss: 0.0106 - val_acc: 1.0000

Score! :thumbsup: :sunglasses:

TIL: I guess dropping the learning rate gives a more precise descent when optimizing, which is probably necessary with such a small number of training examples.
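For reference, one way to drop the learning rate is a re-compile along these lines (Keras 1.x; a sketch assuming the lesson’s compile settings, not necessarily exactly what I ran):

from keras.optimizers import RMSprop

# Re-compile with a smaller learning rate before fitting again
model.compile(optimizer=RMSprop(lr=0.001),
              loss='categorical_crossentropy', metrics=['accuracy'])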

As for the GPU suggestion, thanks. I was planning to run my full data set on my FloydHub GPU instance, but I wanted to make sure I got the model and logic right locally first, so the small sample data set allowed me to do that. However, I’m still thinking of getting my own low-end GPU to support faster iteration locally before pushing the model up to the cloud.

Overall, very profitable experience! Thanks again for your help and suggestions @rashudo!


Hi everyone - for Lesson 2, when examining the model summary I receive a “Cannot allocate memory” error, and I’m wondering if anyone else has experienced this:

model.summary()

Problem occurred during compilation with the command line below:
/usr/bin/g++ -shared -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -march=broadwell -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-clwb -mno-pcommit -mno-mwaitx --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=46080 -mtune=broadwell -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -m64 -fPIC -I/home/ubuntu/anaconda2/lib/python2.7/site-packages/numpy/core/include -I/home/ubuntu/anaconda2/include/python2.7 -I/home/ubuntu/anaconda2/lib/python2.7/site-packages/theano/gof -fvisibility=hidden -o /home/ubuntu/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.12-64/tmptoGcI4/a48f4668d98d8c16519ad508b2f7c269.so /home/ubuntu/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.12-64/tmptoGcI4/mod.cpp -L/home/ubuntu/anaconda2/lib -lpython2.7
ERROR (theano.gof.cmodule): [Errno 12] Cannot allocate memory

A brief search seems to suggest that it’s an issue with my GPU running out of memory while loading the model, but I’m wondering why it would show up when I’m trying to print the summary rather than while using the model to predict. And is there perhaps a way I can free my GPU memory to prevent it from happening in the future?