Tuning image2image GANs


Hi everyone—

I am working on an image-to-image translation project using several style-transfer GANs (CycleGAN, MUNIT, StarGAN, and DiscoGAN). I am training on the handbags2shoes data used in the DiscoGAN paper and am not yet pleased with the results: some individual shoes and handbags come out well, but a large portion of the outputs are still semi-shoe/bag-shaped blobs.


Has anyone had any success tuning any training hyper-parameters to optimize the performance of these models?

So far I have tried the following:

  • Deeper generator architectures (any tips here would be appreciated),

  • I spent some time expanding my dataset by scraping shoes and bags that are oriented the same way as those in the original dataset (around 150,000 images for each category),

  • Replaced my transposed convolutions with upsampling operations to remove the checkerboard artifacts (https://distill.pub/2016/deconv-checkerboard/)
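For anyone wondering why the last change helps, here is a rough sketch of the uneven-overlap effect described in the Distill article. It is not taken from any of the repos; the function name and the 1-D simplification are my own assumptions. It simply counts how many kernel taps land on each output position of a transposed convolution without padding.

```python
def overlap_counts(n_inputs, kernel_size, stride):
    """Count kernel taps per output position of a 1-D transposed convolution (no padding)."""
    out_len = (n_inputs - 1) * stride + kernel_size
    counts = [0] * out_len
    for i in range(n_inputs):          # each input pixel...
        for j in range(kernel_size):   # ...writes its kernel into the output
            counts[i * stride + j] += 1
    return counts

# Kernel size not divisible by stride: interior alternates between 1 and 2 taps.
print(overlap_counts(4, kernel_size=3, stride=2))  # [1, 1, 2, 1, 2, 1, 2, 1, 1]
# Kernel size divisible by stride: interior overlap is uniform.
print(overlap_counts(4, kernel_size=4, stride=2))  # [1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
```

The alternating counts in the first case are what can show up as a checkerboard in 2-D. Upsampling followed by a stride-1 convolution avoids this because every interior output position sees the same number of taps.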

Since there are a lot of potential changes that could be made, and evaluating performance is less straightforward than checking the loss functions, any help would be greatly appreciated!

All of my current implementations start from the official GitHub repos and take their training params (as outlined in their respective publications).

(Theodoros Galanos) #2

Hi there,

I am using these models (mainly pix2pix and CycleGAN) on a very different use case, so take this with a grain of salt.

I found good success with the following options:

  • lsgan loss
  • instance norm
  • bs=1
  • load_size=crop_size (so no augmentation here)
  • no_flip (again this is due to my case)
  • unet generator
  • sometimes 4 layers in the discriminator (although if your D losses are too low, it probably means you shouldn’t)
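As a rough illustration, the options above could be assembled into a training command like this. I am assuming the flag names of the junyanz pytorch-CycleGAN-and-pix2pix train.py; other repos will name things differently. The dataroot path, the experiment name, and the helper function itself are placeholders I made up.

```python
import shlex

def build_train_command(dataroot, name, image_size=256, deeper_discriminator=False):
    """Build an argument list for train.py (flag names assumed, see note above)."""
    args = [
        "python", "train.py",
        "--dataroot", dataroot,         # placeholder path
        "--name", name,                 # placeholder experiment name
        "--model", "cycle_gan",
        "--gan_mode", "lsgan",          # lsgan loss
        "--norm", "instance",           # instance norm
        "--batch_size", "1",            # bs=1
        "--load_size", str(image_size), # load_size == crop_size, so no
        "--crop_size", str(image_size), # random-crop augmentation
        "--no_flip",                    # no horizontal-flip augmentation
        "--netG", "unet_256",           # unet generator
    ]
    if deeper_discriminator:
        # n_layers_D is only read when netD is set to n_layers
        args += ["--netD", "n_layers", "--n_layers_D", "4"]
    return args

cmd = build_train_command("./datasets/handbags2shoes", "bags2shoes_unet",
                          deeper_discriminator=True)
print(" ".join(shlex.quote(a) for a in cmd))
```

Leaving deeper_discriminator off keeps the repo's default discriminator, which is the safer starting point if the D losses are already very low.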

I also increased the depth of the generator when using larger image resolutions. For the cyclegan repos, this means creating flags like ‘unet_512’ where num_downs is 9 (so one more than the 256 one). What I found is that both of these work for 512 images, but the unet_256 produces a slightly smoothed interpolation of the original image.
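To make the depth arithmetic concrete, here is a small sketch. It assumes each U-Net down block halves the spatial size, which is how the stride-2 blocks in that generator behave as far as I know; the helper name is mine, and ‘unet_512’ is the custom flag mentioned above, not a stock option.

```python
def bottleneck_size(image_size, num_downs):
    """Spatial size at the U-Net bottleneck, assuming each down block halves the resolution."""
    return image_size // (2 ** num_downs)

for name, num_downs in [("unet_256", 8), ("unet_512", 9)]:
    for size in (256, 512):
        print(f"{name} (num_downs={num_downs}) on {size}px -> bottleneck {bottleneck_size(size, num_downs)}")

# unet_256 (num_downs=8) on 256px -> bottleneck 1
# unet_256 (num_downs=8) on 512px -> bottleneck 2
# unet_512 (num_downs=9) on 256px -> bottleneck 0
# unet_512 (num_downs=9) on 512px -> bottleneck 1
```

So unet_256 still runs on 512 images, but it only compresses them down to 2x2 instead of 1x1, while the extra level brings 512 images back to a 1x1 bottleneck. A result of 0 means the input is too small for that depth.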

Hope some of this helps!

Kind regards,