Another interesting torrent for people trying to recreate the DeViSE paper:

1000-dimensional Word2Vec embedding trained on Wikipedia (English).

I'm having a very difficult time using style loss in the feed-forward network. As a test,

if I replace this line from the notebook

```
loss = Lambda(lambda x: K.sqrt(K.mean((x[0]-x[1])**2, (1,2))))([vgg1, vgg2])
```

with

```
loss = Lambda(lambda x: style_loss(x[0],x[1]), (1,2))([vgg1, vgg2])
```

I get

```
ValueError: Dimension must be 4 but is 3 for 'transpose_5' (op: 'Transpose') with input shapes: [?,144,144,128], [3].
```

Which comes from this line in the gram matrix function:

```
features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
```

The TensorFlow tensor has four dimensions; the first, I assumed, was the # of images. So I altered that line to

```
features = K.batch_flatten(K.permute_dimensions(x, (0, 3, 1, 2)))
```

This still puts the channels in the rows but doesn't touch that "item" index. I have no idea if this is correct. But when I do this, I get

```
ValueError: None values not supported.
```

from this line:

```
return K.dot(features, K.transpose(features)) / x.get_shape().num_elements()
```

Any guidance on this would be helpful!

You can certainly remove the `/ x.get_shape().num_elements()` bit, since it's just a constant (you'll then be using the sum, not the mean, so you'll need to change your weightings). But I don't think your `permute_dimensions()` is doing what you want, since you now have the batches in the rows, not the channels!
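For reference, here's a minimal NumPy sketch (toy shapes, with `np.transpose`/`reshape` standing in for `K.permute_dimensions`/`K.batch_flatten`) showing why the `(0, 3, 1, 2)` version ends up with batch items in the rows, and one way to compute a channels × channels Gram matrix per image instead:

```python
import numpy as np

# Toy activation tensor shaped (batch, height, width, channels),
# standing in for the (?, 144, 144, 128) layer from the error message.
x = np.arange(2 * 3 * 3 * 4, dtype=np.float32).reshape(2, 3, 3, 4)

# batch_flatten keeps the first axis and flattens the rest, so after
# permuting to (batch, channels, h, w) the rows are the batch items:
flat = np.transpose(x, (0, 3, 1, 2)).reshape(x.shape[0], -1)
print(flat.shape)  # (2, 36): 2 rows = 2 images, not 4 channels

# For a Gram matrix you want one (channels, channels) matrix per image:
# flatten each image to (channels, h*w) and multiply by its transpose.
def gram(img):
    feats = np.transpose(img, (2, 0, 1)).reshape(img.shape[-1], -1)
    return feats @ feats.T / feats.size

grams = np.stack([gram(img) for img in x])
print(grams.shape)  # (2, 4, 4): one channels-by-channels Gram per image
```

Dividing by `feats.size` here uses the statically known per-image size, sidestepping the `None` batch dimension that made `x.get_shape().num_elements()` fail.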

Where can we download the files named "trn_resized_72.bc" used in the super resolution example? I don't see them at platform.ai

Also curious: why are we using a .bc extension here and using bcolz, unlike how we traditionally just read image files?

http://www.platform.ai/data/trn_resized_72.tar

http://www.platform.ai/data/trn_resized_288.tar

I think we use bcolz for compression and to easily store numpy arrays.

Right - and I don't like to store processed images as JPEGs, since each processing step is going to introduce more lossy compression artifacts.

I do not seem to understand what the `(1,2)` in this line of the code is and why we need it. Any pointers appreciated!

`content_loss = Lambda(lambda x: K.sqrt(K.mean((x[0]-x[1])**2, (1,2))))([vgg_ip_image, vgg_it_op])`

Also, what is the purpose of `.astype('uint8')` here?

`plt.imshow(p[0].astype('uint8'));`

Super resolution seems to be working fine for the training images but not working so well for my test images.

That's the 2nd parameter to `K.mean`: https://keras.io/backend/#mean. We're taking the mean over the x & y axes.
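A quick NumPy illustration of what that axis argument does (shapes are made up): reducing over axes `(1, 2)` of a `(batch, height, width, channels)` tensor averages out the spatial dimensions only.

```python
import numpy as np

# Activations shaped (batch, height, width, channels); axis=(1, 2)
# averages over the spatial (x, y) dimensions, leaving (batch, channels).
acts = np.ones((2, 144, 144, 3))
m = np.mean(acts, axis=(1, 2))
print(m.shape)  # (2, 3): one mean per image per channel
```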

matplotlib uses the data type to decide how to display your data. Our data is 0->255, so we use uint8.
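A small illustration (my own example values): for RGB images matplotlib expects float data in 0..1 and clips anything above, so floats in 0..255 would render as nearly all white; casting to uint8 switches it to the 0..255 integer range. Note that `astype` truncates rather than rounds:

```python
import numpy as np

# Float pixel values in 0..255 -> truncated to uint8 for display.
p = np.array([[0.2, 128.7, 255.0]])
print(p.astype('uint8'))  # [[  0 128 255]]
```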

@jeremy Why are we not reducing across the channels as well to get a scalar value for the loss? Something like:

```
feature_loss = Lambda(lambda x: K.sqrt(K.mean((x[0] - x[1]) ** 2, axis=(1, 2, 3))))([true_output, expected_output])
```

I did the same for the fast style network (so that I can combine the style loss and content loss), and it's not working (it's generating bad results).

Can I get the link for the lesson 9 in-class video? I'd like to review lesson 9 again, and the video would be very helpful. I searched the forum, but was not able to find it.

I came across this paper today. http://arxiv.org/abs/1610.07629

It basically builds on the perceptual losses paper, where we train a CNN to generate a stylized image in one pass, and extends it to multiple painting styles. Basically, each painting style can be represented as an embedding. Below are some of my verbatim notes from the paper itself.

"single, scalable deep network that can parsimoniously capture the artistic style of a diversity of paintings. We demonstrate that such a network generalizes across a diversity of artistic styles by reducing a painting to a point in an embedding space"

"In this work, we show that a simple modification of the style transfer network, namely the introduction of conditional instance normalization, allows it to learn multiple styles"

"we found a very surprising fact about the role of normalization in style transfer networks: to model a style, it is sufficient to specialize scaling and shifting parameters after normalization to each specific style. **In other words, all convolutional weights of a style transfer network can be shared across many styles, and it is sufficient to tune parameters for an affine transformation after normalization for each style**" (emphasis mine)

"One added benefit of this approach is that one can stylize a single image into N painting styles with a single feed forward pass of the network with a batch size of N."

My takeaway: Intuitively, this makes sense. Lower-level features are similar to the DCT basis in JPEG compression, and we can represent any signal using a scaled combination of these basis vectors. In neural networks we lose the linearity, but the idea is the same. Textures are just scaled and rotated versions of these lower-level features, and hence an affine transformation makes sense.
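Here's a rough NumPy sketch of the paper's key idea (names and shapes are my own illustration, not the paper's code): every style shares the same incoming computation, and only the per-style scale (`gamma`) and shift (`beta`) applied after normalization differ.

```python
import numpy as np

np.random.seed(0)

def cond_instance_norm(x, gammas, betas, style, eps=1e-5):
    # x: (h, w, c); gammas, betas: (n_styles, c)
    mu = x.mean(axis=(0, 1), keepdims=True)
    sigma = x.std(axis=(0, 1), keepdims=True)
    x_hat = (x - mu) / (sigma + eps)             # style-independent normalization
    return gammas[style] * x_hat + betas[style]  # style-specific affine transform

x = np.random.randn(8, 8, 4)        # one stack of feature maps
gammas = np.random.randn(3, 4)      # 3 styles, 4 channels
betas = np.random.randn(3, 4)
out0 = cond_instance_norm(x, gammas, betas, style=0)
out1 = cond_instance_norm(x, gammas, betas, style=1)
print(out0.shape)  # (8, 8, 4); out0 and out1 differ only via gamma/beta
```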

thoughts anyone?

I was exploring the ImageNet notebook and was looking to get a primer and intuition on how word2vec works. I wrote a fun blog article with some examples on using word2vec to answer some fan theories about Game of Thrones. Thought I'd share in case it benefits folks.

In the "neural-style" code, why set `mode=2` in BatchNormalization in `conv_block` and `deconv_block`?

```
x = BatchNormalization(mode=2)(x)
```

I read the documentation of BatchNormalization but cannot understand why mode 2 is chosen instead of mode 0. Can somebody explain this?

@rqyang, this is because we feed the data to the GPU in batches. Both modes normalize each feature using the current batch's statistics during training; the difference shows up at test time.

So, let's say your data set has 1000 samples (i.e. observations, first dimension) and your batch size is 10.

Then at test time Mode 0 uses running averages of the mean and std accumulated during training (an estimate over all 1000 samples), whereas Mode 2 computes the mean and std from the 10 samples in the current batch, just as it does in training. In both cases the normalization is feature-wise, i.e. along the last axis (number of filters or feature maps).
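An illustrative NumPy sketch of that test-time difference (my own toy bookkeeping, not Keras's exact implementation):

```python
import numpy as np

np.random.seed(0)

# "Mode 0": accumulate a running mean during training, use it at test time.
# "Mode 2": always use the current batch's own statistics.
running_mean, momentum = 0.0, 0.9
for _ in range(100):                       # "training" on 10-sample batches
    batch = np.random.randn(10) * 5 + 10
    running_mean = momentum * running_mean + (1 - momentum) * batch.mean()

test_batch = np.random.randn(10) * 5 + 10
mode0_out = test_batch - running_mean        # dataset-level estimate
mode2_out = test_batch - test_batch.mean()   # this batch only
print(abs(mode2_out.mean()) < 1e-9)          # True: mode 2 exactly centres the batch
```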

You can convince yourself that the two are different by comparing e.g.

```
np.mean(np.random.randint(0, 10, 1000))
```

to

```
np.mean(np.random.randint(0, 10, 10))
```

The results are likely to be different.

In fact, if the sample size is very small you might even skew the statistics too much, and BatchNormalization doesn't work well any more. This can cause problems, as pointed out by @mariya.

Thank you for your answer!

So is this correct? Using `mode=2` is faster but less accurate, i.e. here we sacrifice a little bit of accuracy for efficiency.

Yes. Note that the syntax has changed for Keras 2.0; afaik the `mode` parameter does not exist there any more.