Lesson 8: Recreate style has abnormally high loss

Hi everyone,
I’m working through part 2 lesson 8 and I’m currently stuck on the Recreate Style section. I’m not getting any errors, but solve_image is giving me extremely high loss values, as seen below:

current loss value: 24357.9316406
current loss value: 21018.1699219
current loss value: 18276.6582031
current loss value: 16108.6474609
current loss value: 14343.5166016
current loss value: 12834.8105469
current loss value: 8366.41308594
current loss value: 4700.38867188
current loss value: 4118.77929688
current loss value: 3665.37548828

My losses start much higher than Jeremy’s and converge much more slowly. I tested pushing this out to 100+ iterations and finally got the loss down to ~80, but the results still weren’t pretty.
After 10 iterations I get the image below (using Starry Night as the style image):

Note that there is an odd bar with a different texture across the top of the image, which I’m not sure what to make of.
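For context on what’s being minimized at these values, this is roughly the Gram-matrix style loss from the Gatys et al. formulation that the notebook builds up; the function names below are just placeholders, not the notebook’s exact code:

```python
from keras import backend as K

def gram_matrix(x):
    # x: one layer's activations, shape (height, width, channels),
    # with the shape fully known so the normalizer below is an int.
    # Flatten each channel to a row, then take all pairwise dot products.
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    return K.dot(features, K.transpose(features)) / x.get_shape().num_elements()

def style_loss(gen_activations, style_activations):
    # Mean squared difference between the two Gram matrices.
    return K.mean(K.square(gram_matrix(gen_activations)
                           - gram_matrix(style_activations)))
```

The absolute numbers depend on that normalization, so the useful signal is really the comparison with Jeremy’s run rather than the raw values.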

I’m using Python 3 with TF 1.3 and Keras 2.0.8. I strongly suspect this is a version issue, since running Jeremy’s code directly produces the same result as my code (indicating that the problem isn’t caused by any code changes I’ve made on my end).
I’ve read the Keras release notes and didn’t see anything that should affect me, but just in case I’m going to try running this in another environment with older packages. If it works, I’ll incrementally update packages until it breaks again in order to isolate the problem.
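If anyone wants to compare environments, a quick check like this from inside the notebook confirms what the kernel is actually loading (which can differ from what pip list shows on the system):

```python
import sys
import tensorflow as tf
import keras

# Print the versions the running kernel actually imports
print(sys.version)
print('tensorflow:', tf.__version__)
print('keras:', keras.__version__)
```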

Please let me know if you have any ideas. Thanks!

UPDATE: I can’t say I fully understand it, but I’ve got it working now with appropriate loss values. My final round of testing involved installing tensorflow-gpu from pip. Installing with pip was actually an accident: in my working test environment I had installed with conda, and at that point in my testing, questioning whether something as seemingly trivial as the installation source was the culprit seemed entirely reasonable. I tested the code nonetheless and found that the fits failed, producing the same results described above. I then uninstalled the pip tensorflow-gpu and reinstalled it with conda, at which point the issue was resolved and my losses were appropriately low.
I should note two things in case someone is unfortunate enough to encounter a similar issue:

  1. I had successfully tested with tensorflow 1.1.0 installed via pip; it was only the tensorflow-gpu 1.3 pip install that failed me.

  2. The conda install pulled in a different set of tools and updated different packages: in addition to tensorflow-gpu 1.3.0, it installed libgcc 7.2 and tensorflow-gpu-base 1.3, and it updated cudatoolkit to 8.0.1 and cudnn to 6.0.2.

If I had to name a cause of my blind success, it would be the change described in (2). Why pip and conda install different accompanying tools is unknown to me, but if possible, I will alert the Google devs.
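If anyone hits something similar, a quick way to check which build of TensorFlow you’re actually running (this works in TF 1.x) is something like:

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# Confirm the installed build was compiled with CUDA support
print('built with CUDA:', tf.test.is_built_with_cuda())

# List the devices TensorFlow can see; a healthy GPU install
# should show a GPU entry alongside the CPU
print([d.name for d in device_lib.list_local_devices()])
```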

Unfortunately, I’m not so lucky. I installed a package through Anaconda that caused a few packages to update … and all of a sudden I cannot get a reasonable loss on any of the activities in Lesson 8. I tried blowing away my entire Anaconda installation and reinstalling from the latest distribution on the continuum.io website (5.0.0), with no change.

My guess is that a default parameter changed in one of the key functions, but I don’t know the packages well enough yet to know where to look. In the first activity – recreating the image – I start with a loss around 100 and work down to the mid 30s after 10 iterations … but even if I re-run that fit command a few more times it only gets down to 15 or so, which still doesn’t give a good result. The later activities (as Will showed above) don’t even get low enough to show anything useful, even when I iterate a few more (hundred) times. For some reason it looks like the learning rate (for lack of a better term here) has been reset to something massively small, and we can’t make enough progress to be useful.
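For anyone following along, the fit I keep re-running is essentially the L-BFGS loop from the notebook; a rough sketch of that pattern (the Evaluator-style wrapper and names here are placeholders, not the exact notebook code) looks like:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def solve_image_sketch(eval_obj, x, niter=10):
    # eval_obj is assumed to expose .loss(x) and .grads(x), computed by a
    # K.function over the VGG activations; x is the image as a numpy array.
    for i in range(niter):
        x, min_val, info = fmin_l_bfgs_b(eval_obj.loss, x.flatten(),
                                         fprime=eval_obj.grads, maxfun=20)
        x = np.clip(x, -127, 127)  # keep mean-subtracted pixels in range
        print('current loss value:', min_val)
    return x
```

(I’m using “learning rate” loosely above; fmin_l_bfgs_b chooses its own step sizes, so whatever changed is presumably upstream of the optimizer.)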

I’m open to suggestions … please!

So, I installed the exact Anaconda version from Jeremy’s instructions – 4.3.0 – and I get results in line with what Jeremy got in his runs … so something that was upgraded/changed recently has caused this discrepancy.

At some point I will try to figure out what change caused the problem … but I don’t have the time at the moment. I will lock down this install and not do anything that could mess it up.

Hey Ed,

I’d be hesitant to assume this is a learning rate or parameter issue unless your style-transfer loss started off on the right scale (i.e. not up around 20k like mine was).
I did a lot of testing in throwaway “test” environments, installing, uninstalling, and reinstalling packages at different versions, and that’s what eventually fixed it for me. Assuming you have the same issue, you could use my environment yml to recreate my environment and test. I’ve attached it just in case.

working_root_env.pdf (30.6 KB)

Oh missed this, nice. Glad you got it working!