s we have to choose for optimizing our hypothesis. The gradient descent algorithms is as following:
repeat until convergence:
J = 0,1 representing the feature index numbers. For each iteration every should update simultaneously.
represents the learning rate. If the learning rate is to large, gradient descent can overshoot the minimum and might not able to find it. If the learning rate is to small, gradient descent can be slow.
is our derivative.
The following figure shows an example of gradient descent. The x and z axis are s and the y axis is the value of our cost function J of our hypothesis h. With each iteration our hypothesis changes and we approximate at the local minimum.