Linear regression is a supervised learning algorithm used to predict future data. In this post, we will apply linear regression to the dataset provided as a challenge in the video How to Make a Prediction - Intro to Deep Learning #1, created by Siraj Raval on YouTube.

Click here to see manual implementation of Linear Regression

In the traditional programming approach, we define every single step required to complete our task.

Programming Approach

Machine learning allows us to move away from this old approach to coding. Instead of writing a sequence of steps to complete our task, we provide the desired outcome and the program learns to find an optimal way to reach that goal. For example, suppose we have a robot that wants to predict whether the object in its hand is an apple or an orange. First we train our program by providing it sample data; once the robot is trained, we can give it a random object and ask it to predict whether it is an apple or not.

Machine Learning Example
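The apple-vs-orange idea above can be sketched in a few lines with a simple classifier; the features (weight in grams and a smooth/bumpy texture flag) and all sample values below are hypothetical, chosen only to illustrate the train-then-predict workflow:

```python
from sklearn import tree

# Hypothetical training data: [weight in grams, texture (0 = bumpy, 1 = smooth)]
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = ['apple', 'apple', 'orange', 'orange']  # known outcome for each sample

# Train a classifier on the examples instead of coding explicit rules
clf = tree.DecisionTreeClassifier()
clf.fit(features, labels)

# Ask the trained model about a new, unseen object: a heavy, bumpy fruit
print(clf.predict([[160, 0]]))
```

Once trained, the program has learned the distinction from data; we never wrote an explicit rule like "heavy and bumpy means orange."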

Machine learning is generally divided into three main categories: supervised, unsupervised, and reinforcement learning. We will not go into detail on these in this post, but it is useful general knowledge that these three categories exist.

Linear regression is a supervised learning algorithm used to predict future data. It attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered an explanatory variable, and the other a dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model.
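The heights-and-weights example can be sketched in a few lines with scikit-learn; the numbers below are made up for illustration (and deliberately lie exactly on a line, so the fit is perfect):

```python
import numpy as np
from sklearn import linear_model

# Hypothetical data: heights (cm) as the explanatory variable,
# weights (kg) as the dependent variable
heights = np.array([[150], [160], [170], [180], [190]])
weights = np.array([55, 62, 69, 76, 83])

# Fit the line weight = coef * height + intercept
reg = linear_model.LinearRegression()
reg.fit(heights, weights)

print(reg.coef_, reg.intercept_)   # slope 0.7, intercept -50 for this data
print(reg.predict([[175]]))        # predicted weight for a 175 cm person: 72.5
```

The fitted coefficients are the "model" here: a slope and an intercept that together describe the best-fit line through the observations.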

To get more detail about linear regression, check out this video.

Challenge

  • The challenge for this post is to use scikit-learn to create a line of best fit for the included ‘challenge_dataset’. Then, make a prediction for an existing data point and see how close it matches up to the actual value. Print out the error you get. You can use scikit-learn’s documentation for more help.

  • Bonus points if you perform linear regression on a dataset with 3 different variables.

Requirements

Code

The implementation is done in Python. The libraries used are:

  • pandas
  • numpy
  • sklearn
  • matplotlib

Challenge 1: Using Linear Regression on ‘challenge_dataset’

First load all the libraries

import pandas as pd
import numpy as np
from sklearn import linear_model as model
import matplotlib.pyplot as plt

# Read data from the challenge_dataset (no header row, so columns are 0 and 1)
dataframe = pd.read_csv('input/challenge_dataset.txt', header=None)
x_values = dataframe[[0]]
y_values = dataframe[[1]]

#train model on data
regr = model.LinearRegression()
regr.fit(x_values, y_values)
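The challenge also asks us to predict an existing data point and print the error. A minimal sketch of that step, using a small inline sample with made-up values in place of the 'challenge_dataset' file so it runs on its own:

```python
import pandas as pd
from sklearn import linear_model as model

# Small inline sample standing in for challenge_dataset (hypothetical values)
dataframe = pd.DataFrame({0: [6.1, 5.5, 8.5, 7.0], 1: [17.6, 9.1, 13.7, 12.0]})
x_values = dataframe[[0]]
y_values = dataframe[[1]]

regr = model.LinearRegression()
regr.fit(x_values, y_values)

# Pick an existing data point, predict it, and print how far off we are
x_sample = x_values.iloc[[0]]        # input of the first observation
y_actual = y_values.iloc[0, 0]       # its actual output value
y_predicted = regr.predict(x_sample)[0][0]
print('Actual: %.2f  Predicted: %.2f  Error: %.2f'
      % (y_actual, y_predicted, abs(y_actual - y_predicted)))
```

The same three lines at the end work unchanged against the real `x_values`, `y_values`, and fitted `regr` from the code above.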

Mean Squared Error

# The coefficients
print('Coefficients: ', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f' % np.mean((regr.predict(x_values) - y_values) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(x_values, y_values))

Result of Challenge Data-Set
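The mean squared error printed above is just the average of the squared residuals; scikit-learn's `mean_squared_error` helper from `sklearn.metrics` computes the same quantity. A quick check with hypothetical numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual values and model predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Manual computation, matching the expression used in the post
mse_manual = np.mean((y_pred - y_true) ** 2)
# scikit-learn's helper gives the same number
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # both 0.375 for this data
```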

Visualization

#Visualize Results
plt.scatter(x_values, y_values)
plt.plot(x_values, regr.predict(x_values))
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Challenge Dataset')
plt.show()

Result of Challenge Data-Set

Challenge 2: Linear Regression on a Dataset with 3 Different Variables

import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import datasets
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import numpy as np

# Load the iris dataset
iris = datasets.load_iris()
# For the bonus, consider only 3 different variables,
# i.e. two input variables and one output variable
x_values = iris.data[:, :2]   # sepal length and sepal width
y_values = iris.target        # species as a class index

# Train the model on the data
regr = linear_model.LinearRegression()
regr.fit(x_values, y_values)
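Since the target here is a class index (0, 1, or 2) rather than a continuous value, one quick sanity check (not part of the original challenge) is to round the regression's continuous predictions to the nearest class and see how often they agree with the actual labels. A self-contained sketch:

```python
import numpy as np
from sklearn import datasets, linear_model

iris = datasets.load_iris()
x_values = iris.data[:, :2]   # two input variables (sepal length, sepal width)
y_values = iris.target        # output variable: class index 0, 1, or 2

regr = linear_model.LinearRegression()
regr.fit(x_values, y_values)

# Round the continuous predictions to the nearest valid class index
predicted_classes = np.clip(np.round(regr.predict(x_values)), 0, 2)
accuracy = np.mean(predicted_classes == y_values)
print('Agreement with actual labels: %.2f' % accuracy)
```

This is only a rough diagnostic; a proper classifier would be a better tool for categorical targets, but it gives a feel for how well the fitted plane tracks the labels.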

Mean Squared Error

# The coefficients
print('Coefficients: ', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f' % np.mean((regr.predict(x_values) - y_values) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(x_values, y_values))

Result of Challenge Data-Set

Visualization

#Visualize Results
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(x_values[:, 0], x_values[:, 1], y_values, c='g', marker='o')
# plot_surface needs 2D grids, so build a mesh over the input range
# and predict on it to draw the fitted plane
x_surf, y_surf = np.meshgrid(
    np.linspace(x_values[:, 0].min(), x_values[:, 0].max(), 20),
    np.linspace(x_values[:, 1].min(), x_values[:, 1].max(), 20))
z_surf = regr.predict(np.c_[x_surf.ravel(), y_surf.ravel()]).reshape(x_surf.shape)
ax.plot_surface(x_surf, y_surf, z_surf, cmap=cm.hot, alpha=0.2)
ax.set_xlabel('Sepal Length')
ax.set_ylabel('Sepal Width')
ax.set_zlabel('Species')
ax.set_title('Original Dataset')
plt.show()

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(x_values[:, 0], x_values[:, 1], regr.predict(x_values), c='r', marker='o')
# Build a mesh over the input range and predict on it to draw the fitted plane
x_surf, y_surf = np.meshgrid(
    np.linspace(x_values[:, 0].min(), x_values[:, 0].max(), 20),
    np.linspace(x_values[:, 1].min(), x_values[:, 1].max(), 20))
z_surf = regr.predict(np.c_[x_surf.ravel(), y_surf.ravel()]).reshape(x_surf.shape)
ax.plot_surface(x_surf, y_surf, z_surf, cmap=cm.hot, alpha=0.2)
ax.set_xlabel('Sepal Length')
ax.set_ylabel('Sepal Width')
ax.set_zlabel('Species')
ax.set_title('Predicted Dataset')
plt.show()

Result of Challenge Data-Set

Result of Challenge Data-Set

Source code can be found here.

Click here to see manual implementation of Linear Regression
