CycleGAN Discussion

@jeremy @rachel I’m reading the CycleGAN paper and trying to understand exactly how the “cycle consistency loss” is actually able to reconstruct (almost) the original image from the generated image.
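For context, my understanding is that the loss in question is just an L1 penalty on the round trip through both generators. A minimal PyTorch-style sketch (the function and argument names are illustrative, not the authors’ code; `lam` stands in for the paper’s lambda weight):

```python
import torch.nn.functional as F

def cycle_consistency_loss(G, F_net, real_A, real_B, lam=10.0):
    # G: generator A -> B (e.g. map -> aerial photo)
    # F_net: generator B -> A
    fake_B = G(real_A)        # map -> generated aerial photo
    rec_A = F_net(fake_B)     # generated photo -> reconstructed map
    fake_A = F_net(real_B)
    rec_B = G(fake_A)
    # the generators are trained so the round-trip reconstruction stays
    # close (in L1) to the original, which pushes them to preserve structure
    return lam * (F.l1_loss(rec_A, real_A) + F.l1_loss(rec_B, real_B))
```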

In the screenshot below, the original image is on the left, the generated image in the middle, and the reconstruction on the right. If you look at the little park in the bottom left corner (the box of trees with a lawn in the middle), it turns into just a general gray area in the generated image. Given the massive data loss that appears to be happening between the original and the generated images, how does the network know to reconstruct the exact park (box of trees with lawn)? A human looking at this street map would probably assume it was concrete.

It’s impossible to know exactly what’s going on inside the net, but my guess would be that the broken-up / missing streets around the park in the middle image are what cause it to add a park there. It’s also possible that it has overfit that specific example, given how closely the park matches.

That’s something I’m unclear on with GANs: how closely do the generated outputs match specific training examples at the image-patch level?

Hey @mariya, good seeing you at the group study. As a follow-up to the discussion we had:

  • The dataset is limited in terms of what can be learned as a representation for the unpaired translation task; the model is supposed to do this as an unsupervised mapping (a function to learn).

from the paper:

Maps ↔ aerial photograph: 1096 training images scraped from Google Maps [18] with image size 256×256. Images were sampled from in and around New York City. Data was then split into train and test about the median latitude of the sampling region (with a buffer region added to ensure that no training pixel appeared in the test set).

  • In terms of the architectures, as I pointed out as a concern during the discussion, the PatchGANs are inspired by the U-Net, which is essentially good at:

“another challenge in many cell segmentation tasks is the separation of touching objects of the same class”

This allows the network to learn small portions like the grass patch and pass them on through the discriminator part.
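To make the “patch” part concrete, my rough understanding is that the discriminator is a small fully-convolutional net that outputs a grid of real/fake scores, each covering only a limited receptive field of the input (around 70×70 pixels in the paper). A minimal sketch, with layer sizes that are my assumption rather than the authors’ exact implementation:

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Sketch of a PatchGAN-style discriminator: a fully-convolutional
    net whose output is a grid of scores, each judging only a local
    patch of the input."""
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        def block(c_in, c_out, stride):
            return [nn.Conv2d(c_in, c_out, 4, stride, 1),
                    nn.InstanceNorm2d(c_out),
                    nn.LeakyReLU(0.2, inplace=True)]
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, 2, 1),
            nn.LeakyReLU(0.2, inplace=True),
            *block(base, base * 2, 2),
            *block(base * 2, base * 4, 2),
            *block(base * 4, base * 8, 1),
            nn.Conv2d(base * 8, 1, 4, 1, 1),  # one real/fake score per patch
        )

    def forward(self, x):
        # e.g. a 256x256 input gives roughly a 30x30 grid of patch scores
        return self.net(x)
```

Because each output score only “sees” a local patch, texture-level details like the grass patch get judged locally rather than against the whole image.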

Granted, there is also this failure-case figure from the paper:

Figure 17: Typical failure cases of our method. Please see our website for more comprehensive results.

This happens for a similar reason: the image contains an overly saturated element, and during the down-sampling and up-sampling the network only “translates” the channels (i.e. the colors) rather than learning to translate the snowy hill into a grassy hill, or the zebra’s face.

Can you elaborate on your concern about the architectures? Do you think there are alternative architectures that might achieve even better results?

Brief answer: I guess it depends on the task, the desired outcome, and what “better” would mean.

For this model, they want to keep the architecture simple and build upon earlier work as a proof of concept, rather than building out entirely new generative and discriminative parts.

The cycle-loss portion can be kept, while the rest can be swapped in and out depending on the type of dataset (images). For example, changing the PatchGANs (U-Net-inspired) to something like SegNet could yield interesting results, but it wouldn’t necessarily be a better alternative for all tasks.
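To illustrate the swapping point: the cycle loss only needs two image-to-image modules with the same interface, so the backbones can be changed independently. A sketch under that assumption (the generator/discriminator class names passed in below are placeholders, not real imports):

```python
def build_cyclegan(generator_cls, discriminator_cls, **gen_kwargs):
    # Any image-to-image nn.Module works as a generator; the cycle and
    # adversarial losses are computed the same way regardless of backbone.
    G = generator_cls(**gen_kwargs)      # A -> B
    F_net = generator_cls(**gen_kwargs)  # B -> A
    D_A = discriminator_cls()            # e.g. the PatchDiscriminator sketch above
    D_B = discriminator_cls()
    return G, F_net, D_A, D_B

# e.g. build_cyclegan(UNetGenerator, PatchDiscriminator) vs
#      build_cyclegan(SegNetGenerator, PatchDiscriminator)
```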

I’m not sure this example is typical - quite often it’ll get it wrong. But in this case, since it’s a big empty space next to a green park, it’s more likely to be a park too, since there are probably a few examples of that pattern in the dataset.