Why are deep neural networks easily fooled?
I saw a page/study demonstrating that deep neural networks are easily fooled by giving high confidence predictions for unrecognizable images.
The answer to why deep neural networks are easily fooled is:
Let me start with one requirement that is absolutely essential for this to work: the attacker must know the neural network architecture (number of layers, size of each layer, etc.). Moreover, in all cases that I examined myself, the attacker knows the snapshot of the model that is used in production, i.e. all the weights. In other words, the "source code" of the network isn't a secret.
You can't fool a neural network if you treat it like a black box. And you can't reuse the same fooling image for different networks. In fact, you have to "train" the target network yourself, and here by training I mean running forward and backprop passes, but specially crafted for another purpose.
Why is it working at all?
Now, here's the intuition. Images are very high dimensional: even the space of small 32x32 color images has 3 * 32 * 32 = 3072 dimensions. But the training data set is relatively small and contains real pictures, all of which have some structure and nice statistical properties (e.g. smoothness of color). So the training data set is located on a tiny manifold of this huge space of images.
Convolutional networks work extremely well on this manifold, but basically know nothing about the rest of the space. The classification of the points outside of the manifold is just a linear extrapolation based on the points inside the manifold. No wonder some particular points are extrapolated incorrectly. The attacker only needs a way to navigate to the closest of these points.
The simplest way to prevent the system from being fooled is to use an ensemble of neural networks, i.e. a system that aggregates the votes of several networks on each request. It's much more difficult to backpropagate with respect to several networks simultaneously. The attacker might try to do it sequentially, one network at a time, but the update for one network might easily mess up the results obtained for another network. The more networks are used, the more complex an attack becomes. Another possibility is to smooth the input before passing it to the network.
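The white-box attack the answer describes (forward and backprop passes run against the production weights, but with the gradient taken with respect to the input pixels instead of the weights), as an added example that is not part of the question or the answer above. Everything in it is constructed for illustration: a hypothetical 16-pixel "image", a dataset of smooth rising/falling ramps standing in for the thin manifold of real images, a tiny ReLU network written in pure Python, and the names `TinyMLP`, `fool` and `demo`. The attacker starts from uniform noise, is allowed to move each pixel by at most `eps`, and climbs the gradient of the target class's log-probability until the network reports high confidence.

```python
import math
import random

D_IN, D_HID, N_CLASSES = 16, 8, 2   # hypothetical 16-pixel "image", 2 classes


def make_dataset(rng, n_per_class=100, noise=0.1):
    """Constructed data: class 0 is a rising ramp, class 1 a falling ramp, plus
    a little noise. Smooth, structured inputs: a thin manifold inside [0,1]^16."""
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            x = []
            for i in range(D_IN):
                base = i / (D_IN - 1)
                if label == 1:
                    base = 1.0 - base
                x.append(min(1.0, max(0.0, base + rng.gauss(0.0, noise))))
            data.append((x, label))
    rng.shuffle(data)
    return data


class TinyMLP:
    """One hidden ReLU layer and a softmax output, forward and backward by hand."""

    def __init__(self, rng):
        s1, s2 = 1.0 / math.sqrt(D_IN), 1.0 / math.sqrt(D_HID)
        self.w1 = [[rng.uniform(-s1, s1) for _ in range(D_IN)] for _ in range(D_HID)]
        self.b1 = [0.0] * D_HID
        self.w2 = [[rng.uniform(-s2, s2) for _ in range(D_HID)] for _ in range(N_CLASSES)]
        self.b2 = [0.0] * N_CLASSES

    def forward(self, x):
        pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(self.w1, self.b1)]
        h = [max(0.0, p) for p in pre]
        logits = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(self.w2, self.b2)]
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        z = sum(exps)
        return pre, h, [e / z for e in exps]

    def backward(self, x, target):
        """Gradient of -log p(target) w.r.t. the weights AND w.r.t. the input.
        Training consumes the weight gradients; the attacker consumes dx."""
        pre, h, probs = self.forward(x)
        dlogits = probs[:]                 # softmax + cross-entropy gradient
        dlogits[target] -= 1.0
        dw2 = [[dl * hi for hi in h] for dl in dlogits]
        dh = [sum(dlogits[c] * self.w2[c][j] for c in range(N_CLASSES)) for j in range(D_HID)]
        dpre = [dh[j] if pre[j] > 0 else 0.0 for j in range(D_HID)]   # ReLU gate
        dw1 = [[dpre[j] * xi for xi in x] for j in range(D_HID)]
        dx = [sum(dpre[j] * self.w1[j][i] for j in range(D_HID)) for i in range(D_IN)]
        return (dw1, dpre, dw2, dlogits), dx

    def sgd_step(self, grads, lr):
        dw1, db1, dw2, db2 = grads
        for j in range(D_HID):
            self.b1[j] -= lr * db1[j]
            for i in range(D_IN):
                self.w1[j][i] -= lr * dw1[j][i]
        for c in range(N_CLASSES):
            self.b2[c] -= lr * db2[c]
            for j in range(D_HID):
                self.w2[c][j] -= lr * dw2[c][j]


def train(model, data, epochs=20, lr=0.05):
    for _ in range(epochs):
        for x, y in data:
            grads, _ = model.backward(x, y)
            model.sgd_step(grads, lr)


def confidence(model, x, label):
    return model.forward(x)[2][label]


def fool(model, x0, target, eps=0.3, step=0.02, iters=100):
    """White-box attack: signed gradient steps on the pixels that lower
    -log p(target), with every pixel kept within eps of its start value and
    inside [0, 1]. Needs the weights, via backward(), exactly as the answer says."""
    x = x0[:]
    for _ in range(iters):
        _, dx = model.backward(x, target)
        for i in range(D_IN):
            sign = (dx[i] > 0) - (dx[i] < 0)
            lo, hi = max(0.0, x0[i] - eps), min(1.0, x0[i] + eps)
            x[i] = min(hi, max(lo, x[i] - step * sign))
    return x


def demo(seed=0, target=1):
    rng = random.Random(seed)
    data = make_dataset(rng)
    model = TinyMLP(rng)
    train(model, data)
    clean_x, clean_y = data[0]
    noise = [rng.random() for _ in range(D_IN)]     # "unrecognizable" starting image
    adv = fool(model, noise, target)
    return {
        "clean_conf": confidence(model, clean_x, clean_y),
        "noise_conf": confidence(model, noise, target),
        "adv_conf": confidence(model, adv, target),
        "max_pixel_change": max(abs(a - b) for a, b in zip(adv, noise)),
    }


if __name__ == "__main__":
    for k, v in demo().items():
        print(f"{k}: {v:.4f}")
```

The same `backward` routine serves both the defender (weight gradients for SGD) and the attacker (input gradient `dx`), which is the sense in which the attacker "trains" the target network. Note what the attack needs: the architecture, the trained weights and the ability to run backprop, all of which a black-box API does not expose.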