Adversarial examples


(Chatel Gregory) #1

Hi everyone!

I was having some trouble understanding the interactions between backend variables and the BFGS optimizer, and reading the code wasn't helping. I decided to code something from scratch to get my ideas straight. Unfortunately, I have not managed to get my code working, so I am asking for your help.

I tried to implement the 'fast gradient sign method' from the paper Explaining and Harnessing Adversarial Examples. The goal of this algorithm is to make changes to an image that are imperceptible to the naked eye in order to fool the classifier. From what I understand, the algorithm is pretty simple:

  • take an image I and classify it using the neural network
  • compute the cross-entropy loss with the wrong class
  • take the gradient of this loss with respect to the pixels of I
  • compute the sign of the gradient value for each pixel
  • multiply this gradient sign matrix by a very small number; this is our perturbation P
  • add P to I
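The steps above can be sketched end to end on a toy model. This is a minimal NumPy illustration on a linear softmax classifier (not the Keras code from the notebook); the weights and "image" are random placeholders:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm(image, weights, label, eps):
    """One step of the fast gradient sign method (untargeted).

    For logits z = W @ x, the cross-entropy loss against class `label`
    has gradient dL/dx = W.T @ (softmax(z) - onehot(label)).
    """
    z = weights @ image                     # step 1: classify I
    p = softmax(z)
    onehot = np.zeros_like(p)
    onehot[label] = 1.0                     # step 2: cross-entropy target
    grad = weights.T @ (p - onehot)         # step 3: dL/d(pixels)
    perturbation = eps * np.sign(grad)      # steps 4-5: eps * sign of gradient
    return image + perturbation             # step 6: add P to I

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))                 # toy 2-class model, 4 "pixels"
img = rng.normal(size=4)
label = int(np.argmax(W @ img))             # the class currently predicted

adv = fgsm(img, W, label, eps=0.5)
# confidence in the original class drops after the attack
print(softmax(W @ img)[label], softmax(W @ adv)[label])
```

Because the loss of a linear softmax model is convex in the input, ascending along the sign of the gradient is guaranteed to lower the confidence in the attacked class, whatever eps is.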

My notebook implementation is on my Github page: Adversarial-examples

I managed to do all the steps, but the model still does not make any mistakes. There are a few things that I didn't manage to figure out:

  • The documentation of keras.metrics.categorical_crossentropy states that the first argument should be the ground truth and the second one the predictions, but if I call it this way, all my gradients are 0.
  • I have to use large values of epsilon, otherwise nothing changes; I think this might have to do with the preprocessing step.
  • When I try to plt.imshow my modified picture, it is in negative; I have to call plt.imshow(256 - modified_array[0]).
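On the epsilon scale and the negative-looking image: if the notebook uses Keras's VGG-style preprocess_input (an assumption on my part, not confirmed by the thread), the network input is BGR with the ImageNet channel means subtracted, so pixel values are roughly in [-128, 152]. That would explain both symptoms: epsilon then lives on the 0-255 scale (a few units, not the paper's 0.007), and the raw array is not directly displayable. A hedged sketch of undoing that preprocessing before imshow (the means below are the standard ImageNet ones; `preprocess` is a hypothetical re-implementation for the round-trip demo):

```python
import numpy as np

# Channel means used by VGG-style preprocessing, in BGR order.
IMAGENET_MEANS_BGR = np.array([103.939, 116.779, 123.68])

def preprocess(x):
    """RGB uint8 image -> VGG-style network input (hypothetical re-implementation)."""
    x = x[..., ::-1].astype("float64")      # RGB -> BGR
    return x - IMAGENET_MEANS_BGR           # subtract channel means

def deprocess(x):
    """Invert VGG-style preprocessing back to a displayable RGB uint8 image."""
    x = x + IMAGENET_MEANS_BGR              # add the channel means back
    x = x[..., ::-1]                        # BGR -> RGB
    return np.clip(x, 0, 255).round().astype("uint8")

img = np.random.randint(0, 256, size=(4, 4, 3)).astype("uint8")
assert np.array_equal(deprocess(preprocess(img)), img)  # round-trips exactly
```

With this, `plt.imshow(deprocess(modified_array[0]))` should display correctly without the `256 -` trick, which only happens to look plausible because it flips the sign of the mean-subtracted values.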

My final goal was to build a transformation network to create adversarial samples really quickly (even if the method described earlier is already fast).

Does anyone have an idea how to fix my code?


(Igor Barinov) #2

Did you check the CleverHans repository ("A library for benchmarking vulnerability to adversarial examples") on GitHub? https://github.com/openai/cleverhans

There is a tutorial for FGSM: https://github.com/openai/cleverhans/blob/master/tutorials/mnist_tutorial_tf.md


(Chatel Gregory) #3

I will look at their tutorial to see how they implement it. Thank you for the reference.

Do you have any idea how to solve the few strange things I noticed in my code?


(Chatel Gregory) #4

Hi everyone!

My code is now working!

I worked a bit more on my code, and it seems that I had nearly everything right the first time. The main problem was that I was taking gradients relative to an arbitrary class when computing the perturbation. It works a lot better when I take a class that's somewhat reasonable.

Here is the example I compute in the notebook:

Source image:

Classified as Labrador retriever (81%) or golden retriever (8%)

Perturbed image:

Classified as wire-haired fox terrier (33%) and tennis ball (!!) (19%).

This is really scary, as the algorithms used in self-driving cars can be fooled this easily.

Here is the link to the notebook, if anyone is interested.


(sravya8) #5

Damn, that's pretty scary :astonished: Thanks for posting the notebook, I want to understand how this works.


(Kajal Kansal) #6

The given links are not working. Could you please provide new links?


(Chatel Gregory) #7

Hi @kajal24,

I moved things around in this repo recently; this is the new link to the notebook.

Additionally, I will give a meetup talk on this subject tomorrow; you will find my slides here.

Don't hesitate to ask if you have any questions.


(Savan Visalpara) #8

The method introduced in this paper is not targeted, so you compute the loss with the true class, not a target/wrong class. Subsequent work extended FGSM to targeted attacks, but that requires a slight modification.
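To make that distinction concrete, here is a toy sketch (again on a linear softmax model, not the thread's actual Keras code): the untargeted attack adds eps * sign(grad) to *ascend* the loss of the true class, while the targeted variant subtracts it to *descend* the loss of a chosen wrong class:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def xent_grad(x, W, label):
    """d/dx of the cross-entropy between softmax(W @ x) and one-hot(label)."""
    p = softmax(W @ x)
    p[label] -= 1.0
    return W.T @ p

def fgsm(x, W, label, eps, targeted=False):
    """Untargeted: increase the loss of the true class (+eps * sign).
    Targeted: decrease the loss of the target class (-eps * sign)."""
    step = -eps if targeted else eps
    return x + step * np.sign(xent_grad(x, W, label))

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))                 # toy 3-class model, 5 "pixels"
x = rng.normal(size=5)
true = int(np.argmax(W @ x))

# untargeted: push the input away from its currently predicted class
adv = fgsm(x, W, true, eps=0.5)
print(softmax(W @ x)[true], softmax(W @ adv)[true])  # second value is lower
```

The only difference between the two attacks is the sign of the step and which label the loss is computed against; on a real network the descent direction is not guaranteed to help for a large eps, which is why targeted attacks usually take smaller or iterated steps.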


(Chatel Gregory) #9

Hi Savan!

Yes, the perturbation can be computed in a targeted way. Most of the attack algorithms were covered pretty extensively in the recent Kaggle NIPS competition on targeted adversarial samples.

There is a recent paper that explores adversarial samples designed to fool both humans and machine-learning systems. Below is their main example, which I find quite shocking:

adversarial_humans

If anyone would like to know more about this subject, I wrote a blog post about it on Medium.