What is the difference between SGD and GD?
I am currently working on training a deep-learning model on a dataset with millions of samples. What are the advantages and disadvantages of using stochastic gradient descent versus traditional gradient descent in terms of convergence speed and computational efficiency for this scenario?
In the context of data science, here are the differences:
Gradient descent (GD)
Gradient descent computes the gradient of the cost function with respect to the parameters over the entire dataset, then updates the parameters in the opposite direction of the gradient to minimize the cost. It generally converges to the global minimum (or a local minimum, for non-convex problems) in fewer iterations than SGD. However, it requires computing the gradient over the entire dataset before each parameter update, which can be computationally expensive for large datasets. Another disadvantage of GD is that the entire dataset typically needs to fit in memory, which might not be feasible for very large datasets.
# Gradient Descent
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    m = len(y)
    J_history = []
    for _ in range(num_iters):
        # Gradient is computed over the ENTIRE dataset before each update
        error = np.dot(X, theta) - y
        gradient = np.dot(X.T, error) / m
        theta = theta - alpha * gradient
        # cost_function (e.g. mean-squared error) is assumed to be defined elsewhere
        J_history.append(cost_function(X, y, theta))
    return theta, J_history
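Since cost_function is not defined above, here is a self-contained sketch that assumes a mean-squared-error cost and runs batch gradient descent on a tiny synthetic linear-regression problem (the data, learning rate, and iteration count are illustrative choices, not part of the original answer):

```python
import numpy as np

def cost_function(X, y, theta):
    # Mean-squared-error cost (an assumption; the original does not define it)
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))

def gradient_descent(X, y, theta, alpha, num_iters):
    m = len(y)
    J_history = []
    for _ in range(num_iters):
        error = X @ theta - y
        gradient = X.T @ error / m      # gradient over the full dataset
        theta = theta - alpha * gradient
        J_history.append(cost_function(X, y, theta))
    return theta, J_history

# Noiseless synthetic data: y = 1 + 2*x, with a bias column of ones
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta, J_history = gradient_descent(X, y, np.zeros(2), alpha=0.5, num_iters=2000)
print(theta)  # approaches [1., 2.] on this noiseless problem
```

With a stable learning rate, the cost decreases monotonically, which is the behavior GD trades compute for.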
Stochastic gradient descent (SGD)
Stochastic gradient descent computes the gradient and updates the parameters for each training example in the dataset, either one at a time or in small mini-batches, which introduces randomness into the parameter updates. Because SGD updates the parameters more frequently and uses only a small subset of the data per update, it is computationally efficient, especially for large datasets. However, its stochastic nature leads to noisy updates, and it may not converge to the global minimum but instead oscillate around it, especially in non-convex optimization problems. It also requires careful tuning of the learning rate and its schedule to ensure convergence.
Here is example code for stochastic gradient descent:
# Stochastic Gradient Descent
import numpy as np

def stochastic_gradient_descent(X, y, theta, alpha, num_iters):
    m = len(y)
    J_history = []
    for _ in range(num_iters):
        for _ in range(m):
            # Update on ONE randomly chosen training example at a time
            random_index = np.random.randint(m)
            xi = X[random_index:random_index + 1]
            yi = y[random_index:random_index + 1]
            error = np.dot(xi, theta) - yi
            gradient = np.dot(xi.T, error)
            theta = theta - alpha * gradient
        # cost_function (e.g. mean-squared error) is assumed to be defined elsewhere
        J_history.append(cost_function(X, y, theta))
    return theta, J_history
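The explanation above also mentions mini-batches and learning-rate schedules, which the single-example code does not show. Here is a self-contained sketch of mini-batch SGD with a simple decaying schedule (the batch size, decay rule, and synthetic data are illustrative assumptions, not prescribed by the original):

```python
import numpy as np

def minibatch_sgd(X, y, theta, alpha0, num_epochs, batch_size=32):
    m = len(y)
    rng = np.random.default_rng(0)
    J_history = []
    for epoch in range(num_epochs):
        alpha = alpha0 / (1 + epoch / 10)     # illustrative 1/t-style decay
        order = rng.permutation(m)            # reshuffle once per epoch
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Gradient over a small mini-batch, not the full dataset
            gradient = Xb.T @ (Xb @ theta - yb) / len(batch)
            theta = theta - alpha * gradient
        residual = X @ theta - y              # track full-data MSE per epoch
        J_history.append(float(residual @ residual) / (2 * m))
    return theta, J_history

# Noiseless synthetic data: y = 1 + 2*x, with a bias column of ones
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=500)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta, J_history = minibatch_sgd(X, y, np.zeros(2), alpha0=0.5, num_epochs=50)
```

Decaying the learning rate is one common way to damp the oscillation around the minimum that the answer describes: large early steps make fast progress, and smaller later steps reduce the update noise.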