Explain weight normalisation.
I was reading the paper Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, which is about improving the learning of an artificial neural network (ANN) using weight normalisation.
They consider standard artificial neural networks in which each neuron computes a weighted sum of its input features followed by an elementwise nonlinearity,
y = ϕ(x ⋅ w + b)
where w is a k-dimensional weight vector, b is a scalar bias term, x is a k-dimensional vector of input features, ϕ(⋅) denotes an elementwise nonlinearity and y denotes the scalar output of the neuron.
They then propose to reparameterize each weight vector w in terms of a parameter vector v and a scalar parameter g and to perform stochastic gradient descent with respect to those parameters instead.
w = (g / ∥v∥) v
where v is a k-dimensional vector, g is a scalar, and ∥v∥ denotes the Euclidean norm of v. They call this parameterization weight normalisation. What is this scalar g used for, and where does it come from? Is w the normalized weight? In general, how does weight normalization work, and what is the intuition behind it?
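To check my own reading of these formulas, here is a minimal NumPy sketch of how I understand the reparameterized neuron (the variable names and the tanh nonlinearity are my own choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Inputs and the new parameters: a direction vector v, a scalar scale g, a bias b.
k = 5
x = rng.normal(size=k)          # k-dimensional input features
v = rng.normal(size=k)          # k-dimensional parameter vector
g = 2.0                         # scalar scale parameter
b = 0.1                         # scalar bias

# Reparameterized weight vector: w = (g / ||v||) * v
w = g * v / np.linalg.norm(v)

# Standard neuron: y = phi(w . x + b), with tanh standing in for phi
y = np.tanh(np.dot(w, x) + b)

# The norm of w is exactly g, regardless of v, so g carries the magnitude
# and v/||v|| carries the direction.
print(np.linalg.norm(w), g)     # both print 2.0
print(y)
```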
Your interpretation of weight normalization is quite correct; what I could not immediately see is how it speeds up convergence. What they are doing is basically re-assigning the magnitude of the weight vector (also called the norm of the weight vector). To put things in perspective, the conventional approach with most machine-learning cost functions is to not only follow the gradient of the error with respect to the weights but also to add a regularisation (weight-decay) term of the form λ(w₀² + w₁² + …). This has a few advantages: the weights will not blow up, even if you make a mistake such as a poor choice of learning rate that would otherwise send the cost diverging.
Also, convergence tends to be quicker, perhaps because you now have two ways to control how much weight is given to a feature: the weights of unimportant features are reduced not only by the ordinary error gradient but also by the gradient of the regularisation term λ(w₀² + w₁² + …).
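To make that concrete, a single gradient step with such a penalty looks roughly like the following sketch (plain L2-regularised SGD of my own, not something taken from the paper):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_error, lam=0.01, lr=0.1):
    """One SGD step on loss = error + lam * sum(w_i^2).

    The update has two pulls on each weight: the error gradient and the
    gradient of the penalty term, 2 * lam * w, which keeps weights small.
    """
    return w - lr * (grad_error + 2.0 * lam * w)

w = np.array([3.0, -0.5, 0.0])
grad_error = np.array([0.2, -0.1, 0.05])   # pretend this came from backprop
print(sgd_step_with_weight_decay(w, grad_error))
```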
In this paper, they instead propose to make the magnitude of the weight vector an explicit, separate quantity: whatever v is, the norm of w is exactly g. This is a good approach, although I am not sure whether it is better than feature normalisation. By pinning the magnitude of the weight vector to g, they fix the resource available to the neuron. The intuition is that if you have 24 hours, you have to distribute that time among your subjects, and you will distribute it in a way that maximises your grade/knowledge. Decoupling the direction from this fixed budget might be what helps convergence.
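A quick way to see the "fixed budget" point is that w depends on v only through its direction, while its magnitude is always exactly g. A small NumPy check (again my own illustration):

```python
import numpy as np

def weight_norm(v, g):
    """Weight normalisation: w = (g / ||v||) * v."""
    return g * v / np.linalg.norm(v)

v = np.array([1.0, -2.0, 0.5])
g = 3.0

w1 = weight_norm(v, g)
w2 = weight_norm(10.0 * v, g)       # rescaling v leaves w unchanged

print(np.allclose(w1, w2))          # True: only the direction of v matters
print(np.linalg.norm(w1))           # 3.0: the magnitude "budget" is exactly g
```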
Another intuition: when you subtract the gradient from a weight vector you use a learning rate α, which decides how much of the error signal is applied to the weights. In this approach you are not only updating the weights with the gradient but also learning a separate scale g for them. I loosely call g a learning-rate-like knob because you can adjust it, which in turn rescales the weights, which in turn affects how future gradient steps change them. I am sure someone will post a better mathematical explanation, but this is all the intuition I could think of; I would be grateful if other intuitions and mathematical subtleties are pointed out. Hope this helps!
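To put a little more maths behind the "two handles" idea: under w = (g/∥v∥)v the chain rule splits the ordinary weight gradient into a scalar gradient for g and a gradient for v (which, as far as I can tell, matches the gradient expressions in the paper). Here is a rough NumPy sketch of that split with finite-difference checks; the toy loss and helper names are my own:

```python
import numpy as np

def wn_grads(v, g, grad_w):
    """Split the ordinary weight gradient grad_w = dL/dw into gradients
    for the scale g and the direction parameters v, for w = (g/||v||) * v."""
    norm_v = np.linalg.norm(v)
    grad_g = np.dot(grad_w, v) / norm_v                              # dL/dg
    grad_v = (g / norm_v) * grad_w - (g * grad_g / norm_v**2) * v    # dL/dv
    return grad_g, grad_v

# Toy loss: L(w) = 0.5 * ||w - t||^2 for some target t, so dL/dw = w - t.
rng = np.random.default_rng(1)
t = rng.normal(size=4)
v = rng.normal(size=4)
g = 1.5

w = g * v / np.linalg.norm(v)
grad_w = w - t
grad_g, grad_v = wn_grads(v, g, grad_w)

def loss(v, g):
    w = g * v / np.linalg.norm(v)
    return 0.5 * np.sum((w - t) ** 2)

# Finite-difference check of dL/dg.
eps = 1e-6
fd_g = (loss(v, g + eps) - loss(v, g - eps)) / (2 * eps)
print(np.isclose(grad_g, fd_g))            # True

# Finite-difference check of dL/dv (first coordinate).
e0 = np.zeros_like(v)
e0[0] = eps
fd_v0 = (loss(v + e0, g) - loss(v - e0, g)) / (2 * eps)
print(np.isclose(grad_v[0], fd_v0))        # True

# The gradient with respect to v is orthogonal to v.
print(np.isclose(np.dot(grad_v, v), 0.0))  # True
```

The last check shows that the gradient with respect to v is orthogonal to v, so gradient steps on v only turn the direction of w; its length is adjusted purely through g, which is exactly the decoupling of magnitude and direction that the intuition above is about.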