# Lesson 9 discussion

(David Gutman) #21

Another interesting torrent for people trying to recreate the DeViSE paper:

1000-dimensional Word2Vec embedding trained on Wikipedia (English).

(bckenstler) #22

I'm having a very difficult time using style loss in the feed-forward network. As a test,
if I replace this line from the notebook

``````loss = Lambda(lambda x: K.sqrt(K.mean((x[0]-x[1])**2, (1,2))))([vgg1, vgg2])
``````

with

``````loss = Lambda(lambda x: style_loss(x[0],x[1]), (1,2))([vgg1, vgg2])
``````

I get

``````ValueError: Dimension must be 4 but is 3 for 'transpose_5' (op: 'Transpose') with input shapes: [?,144,144,128], [3].
``````

Which comes from this line in the gram matrix function:

``````features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
``````

The TensorFlow tensor has four dimensions; I assumed the first is the number of images. So I altered that line to

``````features = K.batch_flatten(K.permute_dimensions(x, (0, 3, 1, 2)))
``````

This puts the channels in rows but doesn't touch the "item" index. I have no idea if this is correct. But when I do this, I get

``````ValueError: None values not supported.
``````

from this line:

``````return K.dot(features, K.transpose(features)) / x.get_shape().num_elements()
``````

Any guidance on this would be helpful!

(Jeremy Howard) #23

You can certainly remove the `/ x.get_shape().num_elements()` bit since it's just a constant (so then you'll be using the sum, not the mean, and will need to change your weightings). But I don't think your `permute_dimensions()` is doing what you want, since you now have the batches in the rows, not the channels!
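To illustrate the point about rows: `K.batch_flatten` keeps the first axis and flattens the rest, so after a `(0, 3, 1, 2)` permute each row is a whole image rather than a channel. Below is a minimal NumPy sketch (not the notebook's code) of one way to get a separate channel-by-channel Gram matrix per image. The function name `gram_matrix_batch` is made up for illustration, and the normalizer is computed from the known shape instead of a static shape that may contain `None`.

```python
import numpy as np

def gram_matrix_batch(x):
    """Per-image Gram matrices for x of shape (batch, height, width, channels)."""
    n, h, w, c = x.shape
    # Move channels ahead of the spatial axes, keeping the batch axis first,
    # then flatten only the spatial axes so each row is one channel.
    feats = np.transpose(x, (0, 3, 1, 2)).reshape(n, c, h * w)
    # Batched matrix product: (n, c, h*w) @ (n, h*w, c) -> (n, c, c).
    gram = feats @ np.transpose(feats, (0, 2, 1))
    # Normalize by the number of elements in one image's feature map.
    return gram / (c * h * w)

x = np.random.RandomState(0).rand(2, 4, 4, 3)  # hypothetical activations
g = gram_matrix_batch(x)
print(g.shape)  # (2, 3, 3): one channels-by-channels matrix per image
```

Each `g[i]` matches what the single-image version would return for `x[i]`, so the batch axis is never mixed into the channel correlations.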

(sravya8) #24

Where can we download the files named "trn_resized_72.bc" used in the super resolution example? I don't see them at platform.ai

Also, I'm curious why we are using a .bc extension and bcolz here, rather than just reading image files as we traditionally do.

(Matthew Kleinsmith) #25

I think we use bcolz for compression and to easily store numpy arrays.

(Jeremy Howard) #26

Right - and I don't like to store processed images as jpegs, since each processing step is going to introduce more lossy compression artifacts.

(sravya8) #28

I don't understand what the (1,2) is in this line of code and why we need it. Any pointers appreciated.
`content_loss = Lambda(lambda x: K.sqrt(K.mean((x[0]-x[1])**2, (1,2))))([vgg_ip_image, vgg_it_op])`

(sravya8) #29

Thanks @Matthew!

(sravya8) #30

Also, what is the purpose of .astype('uint8') here?
`plt.imshow(p[0].astype('uint8'));`

Super resolution seems to be working fine for the training images but not working so well for my test images.

(Jeremy Howard) #31

That's the 2nd parameter to `K.mean`: https://keras.io/backend/#mean . We're taking the mean over the x & y.
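A quick way to see what the axis argument does is to check shapes in NumPy, whose `mean` takes the same kind of axis tuple. This is only a sketch with made-up activations of shape (batch, height, width, channels), not the notebook's tensors.

```python
import numpy as np

# Hypothetical activations: 2 images, 4x4 spatial grid, 3 channels.
a = np.zeros((2, 4, 4, 3))
b = np.ones((2, 4, 4, 3))

sq_err = (a - b) ** 2
# Mean over axes 1 and 2 (the spatial x and y axes): one value per image per channel.
per_channel = np.sqrt(sq_err.mean(axis=(1, 2)))
# Also averaging over axis 3 (channels) leaves one value per image.
per_image = np.sqrt(sq_err.mean(axis=(1, 2, 3)))

print(per_channel.shape)  # (2, 3)
print(per_image.shape)    # (2,)
```

Axis 0 (the batch) is left alone in both cases, so every image keeps its own loss value.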

(Jeremy Howard) #32

matplotlib uses the data type to decide how to display your data. Our data is 0->255, so we use uint8
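For context, `imshow` interprets float RGB data as lying in 0..1 and integer RGB data as lying in 0..255, which is why the cast matters. A small sketch of the conversion follows. The helper `to_displayable` and the clipping step are my additions, not part of the notebook; the clip is there because network outputs can stray outside 0..255, and casting such values directly can give unexpected results.

```python
import numpy as np

def to_displayable(img):
    """Round a float image in the 0..255 range, clip it, and cast to uint8."""
    return np.clip(np.rint(img), 0, 255).astype('uint8')

# Hypothetical 1x1 RGB prediction with out-of-range values.
pred = np.array([[[-3.2, 127.6, 300.0]]])
shown = to_displayable(pred)
print(shown.dtype, shown.tolist())  # uint8 [[[0, 128, 255]]]
```

The result can then be passed to `plt.imshow(...)` in the same way as `p[0].astype('uint8')`.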

@jeremy Why are we not reducing across the channel as well to get a scalar value for the loss? Something like:

``````feature_loss = Lambda(lambda x: K.sqrt(K.mean((x[0] - x[1]) ** 2, axis=(1, 2, 3))))([true_output, expected_output])
``````

I did the same for the fast style network (so that I can combine the style loss and content loss), and it's not working (it's generating bad results).

(nazanin) #34

Can I get the link for the lesson 9 in-class video? I'd like to review lesson 9 again, and the video will be very helpful. I searched the forum, but was not able to find it.

(Jeremy Howard) #35

(Suresh) #36

I came across this paper today. http://arxiv.org/abs/1610.07629

It basically builds on the perceptual losses paper, where we train a CNN to generate a stylized image in one pass, and extends it to multiple styles of painting. Basically, each kind of painting style can be represented as an embedding. Below are some of my verbatim notes from the paper itself.

"single, scalable deep network that can parsimoniously capture the artistic style of a diversity of paintings. We demonstrate that such a network generalizes across a diversity of artistic styles by reducing a painting to a point in an embedding space"

"In this work, we show that a simple modification of the style transfer network, namely the introduction of conditional instance normalization, allows it to learn multiple styles"

"we found a very surprising fact about the role of normalization in style transfer networks: to model a style, it is sufficient to specialize scaling and shifting parameters after normalization to each specific style. In other words, all convolutional weights of a style transfer network can be shared across many styles, and it is sufficient to tune parameters for an affine transformation after normalization for each style" (emphasis mine)

"One added benefit of this approach is that one can stylize a single image into N painting styles with a single feed forward pass of the network with a batch size of N."
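The idea in the quotes above can be sketched in a few lines of NumPy. This is my own reading of "conditional instance normalization", not the paper's code: the function name, the shapes, and the random `gamma`/`beta` tables are assumptions for illustration. Each style owns one row of scale and shift parameters, while everything else is shared.

```python
import numpy as np

def conditional_instance_norm(x, gamma, beta, style_ids, eps=1e-5):
    """x: (batch, height, width, channels); gamma, beta: (num_styles, channels)."""
    # Instance norm: statistics per image and per channel, over the spatial axes.
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Pick each image's style-specific scale and shift, broadcast over space.
    g = gamma[style_ids][:, None, None, :]
    b = beta[style_ids][:, None, None, :]
    return g * x_hat + b

rng = np.random.RandomState(0)
num_styles, channels = 3, 4
gamma = rng.rand(num_styles, channels) + 0.5   # hypothetical learned scales
beta = rng.randn(num_styles, channels)         # hypothetical learned shifts

# One content image repeated once per style: a batch of N = num_styles.
content = rng.rand(1, 8, 8, channels)
batch = np.repeat(content, num_styles, axis=0)
out = conditional_instance_norm(batch, gamma, beta, np.arange(num_styles))
print(out.shape)  # (3, 8, 8, 4)
```

The last few lines mirror the batch-of-N trick from the final quote: the same image goes through once per style, and only the selected `gamma`/`beta` rows differ between the outputs.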

My takeaway: Intuitively, this makes sense. Lower level features are similar to the DCT in JPEG compression, and we can represent any signal using a scaled combination of these basis vectors. In neural networks, we lose the linearity, but the idea is the same. Textures are just scaled and rotated versions of these lower level features, and hence an affine transformation makes sense.

Thoughts, anyone?

(David Woo) #37

I was exploring the imagenet notebook and was looking to get a primer and intuition on how word2vec works. I wrote a fun blog article with some examples of using word2vec to answer some fan theories about Game of Thrones. Thought I would share in case it benefits folks.

(Runqi Yang) #38

In the "neural-style" code, why is `mode=2` set in BatchNormalization in `conv_block` and `deconv_block`?

``````x = BatchNormalization(mode=2)(x)
``````

I read the BatchNormalization documentation but cannot understand why mode 2 is chosen instead of mode 0. Can somebody explain this?

(Constantin) #39

@rqyang, this is because we feed the data in batches to the GPU. Mode 2 normalizes feature-wise like Mode 0, but always uses the statistics of the current batch.
So, let's say your data set has got 1000 samples (i.e. observations, first dimension) and your batch size is 10.
Then at test time Mode 0 would normalize with running estimates of the mean and std accumulated over the batches seen during training, whereas Mode 2 would compute the mean and std from the 10 samples in the current batch, during both training and testing. To make it a little confusing, the axis it computes this along is often the last axis (i.e. number of filters or feature maps) - still, Mode 0 relies on statistics accumulated across the data set and Mode 2 on one batch only.
You can convince yourself that the two are different by comparing e.g.

``````np.mean(np.random.randint(0, 10, 1000))
``````

to

``````np.mean(np.random.randint(0, 10, 10))
``````

The results are likely to be different.
In fact, if the batch size is very small, the batch statistics can be so skewed that BatchNormalization doesn't work well any more, as pointed out by @mariya.
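To make the comparison above a bit more concrete, here is a rough sketch of the difference between using one batch's statistics and using an average accumulated across batches. It only tracks the mean of a 1-D data set, and the moving-average `momentum` is an assumed value for illustration, not necessarily what Keras uses.

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.randint(0, 10, 1000).astype(float)  # 1000 samples, as in the example above
batches = data.reshape(100, 10)                # 100 batches of 10 samples each

# Statistics as seen from a single batch: one mean per batch.
batch_means = batches.mean(axis=1)

# Exponential moving average of the batch means (momentum is an assumed value).
momentum = 0.9
running_mean = 0.0
for m in batch_means:
    running_mean = momentum * running_mean + (1 - momentum) * m

print("full-data mean:       ", data.mean())
print("batch mean min / max: ", batch_means.min(), batch_means.max())
print("running mean estimate:", running_mean)
```

Individual batch means scatter around the full-data mean, while the accumulated estimate stays much closer to it, which is the trade-off being discussed.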

(Runqi Yang) #40

Using `mode=2` is faster but less accurate. Here we sacrifice a little bit of accuracy for efficiency.