Gradient descent is one of the fundamental algorithms in machine learning; implementing it yourself is a must for mastering machine learning and deep learning. In this post, we write simple code to implement linear regression using gradient descent.

Gradient Descent

An intuitive way of explaining gradient descent is to imagine snowboarding down a mountain. The snowboarder's goal is to reach the bottom, which is exactly what gradient descent strives to achieve.

Global Minima

But some hills have more than one low point, which means they can have multiple local minima but only a single global minimum.

Local Minima

Suppose you were told to get to the lowest point possible. Between every two hills there is a low point, but not necessarily the lowest point overall. Once you start your descent, gradient descent has no power to take you back up over the second hill. To explore it, you would have to start a completely new descent down the second hill.
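The two-hill picture can be sketched in code. The function and starting points below are made up for illustration: plain gradient descent started on either side of the middle hump settles into a different valley and never climbs back out.

```python
def f(x):
    # A made-up non-convex "terrain" with two valleys
    return x ** 4 - 2 * x ** 2 + 0.5 * x

def df(x):
    # Its derivative (the slope underfoot)
    return 4 * x ** 3 - 4 * x + 0.5

def descend(x, learning_rate=0.01, steps=500):
    # Repeatedly step downhill; there is no mechanism to climb uphill again
    for _ in range(steps):
        x = x - learning_rate * df(x)
    return x

left = descend(-2.0)   # starts on the left hill
right = descend(2.0)   # starts on the right hill
print(left, right)     # two different minima; only one of them is global
```

Here `f(left)` is lower than `f(right)`: the run that started at 2.0 is stuck in the local minimum, while the other found the global one.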

In this post we will implement linear regression using gradient descent. The main idea is that in each iteration we calculate the error and, based on it, find new values of the y-intercept (b) and slope (m) that fit our data better than before.

Gradient Descent


Implementation of Gradient Descent

Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function f that minimize a cost function. So let's implement the idea discussed above. We will write gradient descent by hand, without any external machine learning libraries, to understand the intuition behind it, since gradient descent is considered a base algorithm for machine learning.

Import Libraries

We only use numpy for some math functions in our code and matplotlib for plotting the data.

from numpy import * 
import matplotlib.pyplot as plt 

Load the dataset

The data is in the CSV file input/data.csv. After reading the data from the file, we save it in a variable called points so we can use it afterwards.

points = genfromtxt("input/data.csv", delimiter=",")
print(points[0:10, :])
print('Shape: {0}'.format(points.shape))
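If input/data.csv is not at hand, a synthetic dataset with a known slope and intercept works just as well for following along. The true values below (m = 1.3, b = 8.0) are made up for illustration:

```python
import numpy as np

# Hypothetical stand-in for data.csv: x in column 0, y in column 1,
# generated from a known line plus Gaussian noise
rng = np.random.default_rng(42)
x = rng.uniform(20, 70, size=100)
y = 1.3 * x + 8.0 + rng.normal(0, 5, size=100)  # assumed true m = 1.3, b = 8.0
points = np.column_stack([x, y])
print(points[0:10, :])
print('Shape: {0}'.format(points.shape))
```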

Data plot

Initialize parameters and hyper-parameters

Next, we initialize the model parameters (b and m) and the hyper-parameters (learning rate and number of iterations) with starting values so we can run the algorithm.

b = 0  # initial y-intercept
m = 0  # initial slope
learning_rate = 0.0001  # alpha
num_iterations = 1000  # number of iterations

Error function

We compute the mean squared error by averaging the squared differences between the original outputs y and the predicted outputs ŷ = m·x + b:

    Error(b, m) = (1/N) · Σᵢ (yᵢ - (m·xᵢ + b))²

def ComputerError(b, m, points):
    # Mean squared error of the line y = m*x + b over all points
    totalError = 0
    for i in range(len(points)):
        x = points[i, 0]
        y = points[i, 1]
        totalError += (y - (m * x + b)) ** 2
    return totalError / float(len(points))
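Since points is a NumPy array, the same mean squared error can also be computed without the Python loop. This vectorized variant is a sketch (not part of the original code) and is handy for sanity-checking the loop version:

```python
import numpy as np

def compute_error_vectorized(b, m, points):
    # Mean squared error via broadcasting: average of (y - (m*x + b))^2
    x = points[:, 0]
    y = points[:, 1]
    return np.mean((y - (m * x + b)) ** 2)

# Made-up sample: the line y = 2x + 1 fits these points exactly
sample = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]])
print(compute_error_vectorized(1.0, 2.0, sample))  # exact fit, so 0.0
```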

Gradient Descent

This method updates the y-intercept and slope by computing the partial derivatives of the error with respect to each of them. For this we use the equations below:

    ∂Error/∂b = -(2/N) · Σᵢ (yᵢ - (m·xᵢ + b))
    ∂Error/∂m = -(2/N) · Σᵢ xᵢ · (yᵢ - (m·xᵢ + b))

def GetGradientDescent(b_current, m_current, points, learning_rate):
    # One gradient descent step: accumulate the partial derivatives,
    # then move b and m against the gradient.
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(len(points)):
        x = points[i, 0]
        y = points[i, 1]
        b_gradient += -(2 / N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2 / N) * x * (y - ((m_current * x) + b_current))
    new_b = b_current - (learning_rate * b_gradient)
    new_m = m_current - (learning_rate * m_gradient)
    return [new_b, new_m]
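To see the update in action, here is one hand-checkable step on a tiny made-up dataset; the numbers are chosen only so the arithmetic is easy to follow:

```python
# Tiny made-up dataset: (x, y) pairs (0, 1) and (1, 3); start at b = m = 0
points = [(0.0, 1.0), (1.0, 3.0)]
b, m, lr = 0.0, 0.0, 0.1
N = 2.0

# With b = m = 0, the residuals y - (m*x + b) are simply the y values: [1, 3]
b_gradient = -(2 / N) * ((1.0 - 0.0) + (3.0 - 0.0))              # = -4.0
m_gradient = -(2 / N) * (0.0 * (1.0 - 0.0) + 1.0 * (3.0 - 0.0))  # = -3.0

new_b = b - lr * b_gradient  # 0.0 - 0.1 * (-4.0) = 0.4
new_m = m - lr * m_gradient  # 0.0 - 0.1 * (-3.0) = 0.3
print(new_b, new_m)
```

Both parameters move toward the data, and the error after this step is lower than before it.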

After writing a method to compute the new slope and y-intercept, we run the gradient descent algorithm to fit the line.

def RunAlgorithm(starting_b, starting_m, learning_rate, num_iterations, points):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = GetGradientDescent(b, m, array(points), learning_rate)
        error = ComputerError(b, m, points)
        if error < 2.0:  # early stop once the error is low enough
            return [b, m]
        if i % 100 == 0:
            print("After {0} iterations b = {1}, m = {2}, error = {3}".format(i, b, m, error))
    return [b, m]
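As a sanity check, the whole loop can be run end-to-end on synthetic data with a known answer. The true slope 2.0 and intercept 1.0 below are made up; the fitted values should land close to them:

```python
import numpy as np

# Synthetic data from a known line plus noise (hypothetical true m = 2.0, b = 1.0)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=200)

# Vectorized version of the same update loop as above
b, m = 0.0, 0.0
learning_rate = 0.01
for _ in range(5000):
    resid = y - (m * x + b)                           # prediction errors
    b -= learning_rate * (-2.0 * resid.mean())        # dError/db
    m -= learning_rate * (-2.0 * (x * resid).mean())  # dError/dm
print("m = {0:.3f}, b = {1:.3f}".format(m, b))
```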


Below is a snapshot of gradient descent after 1000 iterations on our example problem. We start at m = 0 and b = 0. At each iteration, m and b are updated to values that yield a slightly lower error than the previous iteration. The plot displays the current location of the gradient descent search (red dot) and the path taken to get there (blue line). Eventually we end up with a pretty accurate fit.

Data plot

Tips for Gradient Descent

This section lists some tips and tricks for getting the most out of the gradient descent algorithm for machine learning.

  • Plot Cost versus Time: Collect and plot the cost values calculated by the algorithm each iteration. The expectation for a well performing gradient descent run is a decrease in cost each iteration. If it does not decrease, try reducing your learning rate.
  • Learning Rate: The learning rate is a small real value such as 0.1, 0.001, or 0.0001. Try different values for your problem and see which works best.
  • Rescale Inputs: The algorithm will reach the minimum cost faster if the shape of the cost function is not skewed and distorted. You can achieve this by rescaling all of the input variables (X) to the same range, such as [0, 1] or [-1, 1].
  • Few Passes: Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good or good enough coefficients.
  • Plot Mean Cost: The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent. Taking the average over 10, 100, or 1000 updates can give you a better idea of the learning trend for the algorithm.
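The first tip can be checked directly: record the cost at every iteration and confirm it keeps dropping. The dataset below is made up; with full-batch gradient descent and a suitably small learning rate, the cost curve should decrease monotonically.

```python
import numpy as np

# Made-up data from a hypothetical true line y = 1.5x + 4
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 1.5 * x + 4.0 + rng.normal(0, 1.0, size=100)

b, m, lr = 0.0, 0.0, 0.005
costs = []
for _ in range(300):
    resid = y - (m * x + b)
    costs.append(np.mean(resid ** 2))  # cost before this update
    b -= lr * (-2.0 * resid.mean())
    m -= lr * (-2.0 * (x * resid).mean())

print(costs[0], costs[-1])  # the cost shrinks over time
```

Plotting `costs` (e.g. with `plt.plot(costs)`) gives the cost-versus-time curve the tip describes; if it ever trends upward, the learning rate is too large.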

Code for this can be found here