How can I understand the derivation of first-order model-agnostic meta-learning?
According to the authors of this paper, to improve performance, they decided to drop the backward pass and use a first-order approximation. I found a blog which discusses how to derive the math, but I got stuck along the way (please refer to the derivation below, which was an embedded image): why did $\nabla_\theta \theta_0$ disappear in the next line? And how come $\nabla_{\theta_{i-1}} \theta_{i-1} = I$ (which is an identity matrix)?
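The step I am stuck on is, as best I can transcribe it, the chain-rule expansion of the meta-gradient. Assuming the inner-loop update $\theta_i = \theta_{i-1} - \alpha \nabla \mathcal{L}(\theta_{i-1})$ with meta-initialization $\theta_0 = \theta$ (my reconstruction of the blog's notation, not a verbatim copy):

$$\nabla_\theta \mathcal{L}(\theta_k) = \nabla_{\theta_k} \mathcal{L}(\theta_k) \cdot \Big( \prod_{i=1}^{k} \nabla_{\theta_{i-1}} \theta_i \Big) \cdot \nabla_\theta \theta_0 = \nabla_{\theta_k} \mathcal{L}(\theta_k) \cdot \prod_{i=1}^{k} \big( I - \alpha \, \nabla_{\theta_{i-1}} \nabla \mathcal{L}(\theta_{i-1}) \big)$$

where each factor in the product comes from $\nabla_{\theta_{i-1}} \theta_i = \nabla_{\theta_{i-1}} \theta_{i-1} - \alpha \, \nabla_{\theta_{i-1}} \nabla \mathcal{L}(\theta_{i-1})$.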
Regarding the first-order model: $\nabla_{\theta_{i-1}} \theta_{i-1} = I$, in a similar way that $\frac{df}{dx} = 1$ for $f(x) = x$. Strictly speaking, $I$ should be a vector of $1$s with the same dimensionality as $\theta_{i-1}$, but they are probably abusing notation here and putting such a vector as the diagonal elements of a matrix. Alternatively (actually, the most likely reason!), they are computing the partial derivative of $\theta_{i-1}^j$ with respect to $\theta_{i-1}^k$, for all $j$ and all $k$, which will make up an identity matrix.

Regarding your first question, $\nabla_\theta \theta_0$ probably becomes $I$ as well, but I am not familiar enough with the math of this paper to tell you why. Maybe it's because $\nabla_\theta \theta_0$ actually means $\nabla_{\theta_0} \theta_0$. I would need to dive into it.
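To see this concretely, here is a minimal JAX sketch (the toy loss, step size, and parameter values are illustrative choices of mine, not from the paper). It checks that the Jacobian of the identity map $\theta \mapsto \theta$ is the identity matrix, and that the Jacobian of one inner SGD step is $I - \alpha H$, the factor whose second-derivative term the first-order approximation drops:

```python
import jax
import jax.numpy as jnp

# Toy inner-loop loss (illustrative; any smooth function of theta works).
def loss(theta):
    return jnp.sum(jnp.sin(theta) ** 2)

alpha = 0.1                          # inner-loop step size (illustrative)
theta = jnp.array([0.5, -1.2, 3.0])  # stands in for theta_{i-1}

# Entry (j, k) of this Jacobian is the partial derivative of theta^j
# with respect to theta^k, so the identity map yields the identity matrix.
print(jax.jacobian(lambda t: t)(theta))

# One inner SGD step: theta_i = theta_{i-1} - alpha * grad(loss)(theta_{i-1}).
inner_step = lambda t: t - alpha * jax.grad(loss)(t)

# Its Jacobian is I - alpha * H, where H is the Hessian of the loss;
# the first-order approximation treats this whole factor as I.
J = jax.jacobian(inner_step)(theta)
H = jax.hessian(loss)(theta)
print(jnp.allclose(J, jnp.eye(theta.size) - alpha * H))  # True
```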