In this post we will build a gender classifier: a short script that classifies a person as male or female given just three body measurements, i.e. height, weight, and shoe size. For this we will use sklearn, apply five different classifiers to our data, and then compare their accuracy scores to select the best algorithm.

Classification

Classification Approach

In Machine Learning, classification is a way of identifying the category to which a new observation belongs. Take the example of classifying apples and oranges: for each new element, we have to identify whether it belongs to the orange category or the apple category. This process is called classification.

Orange And Apple Classification Approach
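As a minimal sketch of that idea (with made-up feature values, purely for illustration), a decision tree can learn the apple/orange split from a handful of labeled examples:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [weight in grams, texture] where 0 = bumpy, 1 = smooth
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = ['apple', 'apple', 'orange', 'orange']

clf = DecisionTreeClassifier()
clf = clf.fit(features, labels)

# A new 160 g, bumpy fruit falls on the orange side of the learned split
print(clf.predict([[160, 0]]))  # -> ['orange']
```

The same pattern, fit on labeled examples and then predict for new ones, is what we apply to the gender data below.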

Examples of Classification

A few other examples of classification are:

  • Text Categorization (e.g. Spam Filtering)
  • Classification of Apple and Oranges
  • Fraud Detection
  • Face Detection
  • Optical Character Recognition
  • Natural Language Processing

Classification Approach

The classifiers used for this problem are:

  • Decision Tree Classifier
  • KNeighbors Classifier
  • Gaussian Process Classifier
  • Random Forest Classifier
  • Ada-Boost Classifier

So Let’s Start

The implementation is done in Python. The libraries used here are:

  • numpy
  • sklearn
  • matplotlib

For this problem we generated the data manually. The data has 4 variables: 3 inputs (weight, height, and shoe size) and 1 output variable, the gender label.

1) First, load all the useful libraries

# Import all the libraries here
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

2) Initialize all the Classifiers

decisionClf = DecisionTreeClassifier()
knnClf = KNeighborsClassifier()
gpcClf = GaussianProcessClassifier()
rpcClf = RandomForestClassifier(bootstrap=True)
adaBoostClf = AdaBoostClassifier()
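All five classifiers above are created with their default hyperparameters. Each constructor also accepts tuning options; as a hedged sketch, here are a couple of commonly tuned ones (the values are illustrative, not tuned for this data):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Illustrative, untuned values -- the defaults are fine for this small example
knnClf = KNeighborsClassifier(n_neighbors=3)      # vote among the 3 nearest points
rpcClf = RandomForestClassifier(n_estimators=50,  # number of trees to average
                                bootstrap=True)
```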

3) Import and visualize data

We generate two sets of data, one for training the classifiers and the other for prediction. X and Y are the training data, and test_X and test_Y will be used for prediction and for checking the accuracy score.

# [height, weight, shoe_size]
X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
     [190, 90, 47], [175, 64, 39],
     [177, 70, 40], [159, 55, 37], [171, 75, 42], [181, 85, 43]]

Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',
     'female', 'male', 'male']

# Test data: [height, weight, shoe_size]
test_X = [[179, 90, 44], [190, 88, 44], [165, 55, 37], [160, 60, 39], [156, 56, 36],
          [181, 85, 43], [174, 66, 40], [177, 70, 43], [159, 66, 47], [188, 100, 44],
          [179, 84, 47]]

test_Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female', 'female', 'male', 'male']
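The original script imports matplotlib but never plots anything; a possible visualization sketch (not part of the original code) scatters height against weight, colored by gender, to see whether the classes look separable:

```python
import matplotlib.pyplot as plt

# Same training data as above: [height, weight, shoe_size]
X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
     [190, 90, 47], [175, 64, 39], [177, 70, 40], [159, 55, 37], [171, 75, 42],
     [181, 85, 43]]
Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',
     'female', 'male', 'male']

heights = [row[0] for row in X]
weights = [row[1] for row in X]
colors = ['blue' if label == 'male' else 'red' for label in Y]

plt.scatter(heights, weights, c=colors)
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Training data: male (blue) vs. female (red)')
plt.show()
```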

4) Classification

We use five classifiers in this example, but feel free to try any of the others; scikit-learn offers many classifiers you can use.
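For instance, a support vector machine (not one of the five used here) drops straight in, because every scikit-learn classifier shares the same fit/predict interface. A quick sketch on a small made-up subset of the data:

```python
from sklearn.svm import SVC

# Tiny made-up subset in the same [height, weight, shoe_size] format
X = [[181, 80, 44], [160, 60, 38], [190, 90, 47], [154, 54, 37]]
Y = ['male', 'female', 'male', 'female']

svcClf = SVC()
svcClf = svcClf.fit(X, Y)
prediction = svcClf.predict([[175, 72, 41]])
```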

a) Decision Tree Classifier

Decision tree learning uses a decision tree as a predictive model that maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the simplest classifiers; since we are all at least a little familiar with coding, you can think of a decision tree as a set of if and else conditions.

Classification Approach

decisionClf = decisionClf.fit(X, Y)
prediction = decisionClf.predict(test_X)
# For classifiers, score() returns the mean accuracy (1.0 is a perfect prediction)
print('Decision Tree Classifier')
print('Score:  %.2f ' % accuracy_score(test_Y, prediction))
print('Variance score: %.2f' % decisionClf.score(test_X, test_Y))

Output

Decision Tree Classifier
Score:  0.64 
Variance score: 0.64
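To make the "set of if and else conditions" concrete, scikit-learn's export_text prints the rules a trained tree has learned (trained here on a small made-up sample, so the thresholds will differ from a tree trained on the full data):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Small made-up sample in the same [height, weight, shoe_size] format
X = [[181, 80, 44], [160, 60, 38], [190, 90, 47], [154, 54, 37]]
Y = ['male', 'female', 'male', 'female']

clf = DecisionTreeClassifier().fit(X, Y)
print(export_text(clf, feature_names=['height', 'weight', 'shoe_size']))
```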

b) KNeighbors Classifier

The KNeighbors Classifier is also an example of supervised learning, like the Linear Regression we discussed last week. To learn more about the KNeighbors Classifier, visit here.

Classification Approach

knnClf = knnClf.fit(X, Y)
prediction = knnClf.predict(test_X)
print('KNeighbors Classifier')
print('Score:  %.2f ' % accuracy_score(test_Y, prediction))
print('Variance score: %.2f' % knnClf.score(test_X, test_Y))

Output

KNeighbors Classifier
Score:  0.73 
Variance score: 0.73
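Under the hood, KNeighborsClassifier predicts by majority vote among the k nearest training points (k defaults to 5 in scikit-learn). A 1-D toy sketch of that idea:

```python
from sklearn.neighbors import KNeighborsClassifier

# 1-D toy set: small values labeled 'a', large values labeled 'b'
points = [[1], [2], [3], [10], [11], [12]]
labels = ['a', 'a', 'a', 'b', 'b', 'b']

clf = KNeighborsClassifier(n_neighbors=3).fit(points, labels)
# 2 is closest to {1, 2, 3} -> 'a'; 11 is closest to {10, 11, 12} -> 'b'
print(clf.predict([[2], [11]]))  # -> ['a' 'b']
```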

c) Gaussian Process Classifier

gpcClf = gpcClf.fit(X, Y)
prediction = gpcClf.predict(test_X)

print('Gaussian Process Classifier')
print('Score:  %.2f ' % accuracy_score(test_Y, prediction))
print('Variance score: %.2f' % gpcClf.score(test_X, test_Y))

Output

Gaussian Process Classifier
Score:  0.73 
Variance score: 0.73

d) Random Forest Classifier

rpcClf = rpcClf.fit(X, Y)
prediction = rpcClf.predict(test_X)

# For classifiers, score() returns the mean accuracy (1.0 is a perfect prediction)
print('Random Forest Classifier')
print('Score:  %.2f ' % accuracy_score(test_Y, prediction))
print('Variance score: %.2f' % rpcClf.score(test_X, test_Y))

Output

Random Forest Classifier
Score:  0.82 
Variance score: 0.82

e) Ada-Boost Classifier

adaBoostClf = adaBoostClf.fit(X, Y)
prediction = adaBoostClf.predict(test_X)

# For classifiers, score() returns the mean accuracy (1.0 is a perfect prediction)
print('Ada-Boost Classifier')
print('Score:  %.2f ' % accuracy_score(test_Y, prediction))
print('Variance score: %.2f' % adaBoostClf.score(test_X, test_Y))

Output

Ada-Boost Classifier
Score:  0.73 
Variance score: 0.73

5) Result

Random Forest Classifier
Score:  0.82 
Variance score: 0.82

Ada-Boost Classifier
Score:  0.73 
Variance score: 0.73

Gaussian Process Classifier
Score:  0.73 
Variance score: 0.73

Decision Tree Classifier
Score:  0.64 
Variance score: 0.64

KNeighbors Classifier
Score:  0.73 
Variance score: 0.73

Summary

The Random Forest Classifier's best score is 0.82 and its worst is around 0.63. This is because a random forest improves predictive accuracy and controls over-fitting by averaging the predictions of multiple trees, so with this little data its score changes from run to run. The other classifiers, on the other hand, give the same score every run, so for the example above you could use any of them. If you want to improve the results, you must increase the amount of data: since we generated the data manually here, the classifiers have very little to learn from, and more data would allow the algorithms above to generalize their learned parameters much better.
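One way to see (and tame) that run-to-run variation is to average the random forest's score over several runs, or to fix its random_state for a reproducible result; a sketch using the same data:

```python
from sklearn.ensemble import RandomForestClassifier

# Same [height, weight, shoe_size] data as above
X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
     [190, 90, 47], [175, 64, 39], [177, 70, 40], [159, 55, 37], [171, 75, 42],
     [181, 85, 43]]
Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',
     'female', 'male', 'male']
test_X = [[179, 90, 44], [190, 88, 44], [165, 55, 37], [160, 60, 39], [156, 56, 36],
          [181, 85, 43], [174, 66, 40], [177, 70, 43], [159, 66, 47], [188, 100, 44],
          [179, 84, 47]]
test_Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',
          'female', 'male', 'male']

# Average the score over several runs to smooth out the randomness
scores = [RandomForestClassifier().fit(X, Y).score(test_X, test_Y) for _ in range(10)]
print('Mean score over 10 runs: %.2f' % (sum(scores) / len(scores)))

# Fixing random_state makes a single run reproducible
fixedClf = RandomForestClassifier(random_state=0).fit(X, Y)
print('Reproducible score: %.2f' % fixedClf.score(test_X, test_Y))
```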

Source code can be found here
