The Problem of Overfitting/Underfitting in Machine Learning

This blog post is part of a series, where I talk about concepts and algorithms in Machine Learning.

In this part I want to talk about Overfitting and Underfitting, two common problems in Machine Learning that lead to models with poor predictive performance, e.g. in classification problems.

Overfitting

Overfitting often happens when your model is too complex for the data you have, or when you train it for too long. The model then learns the noise and inaccurate entries in the data set. When the model is asked to predict the target of new data, it does not categorize the data correctly, because it has fitted too much noise and detail. Overfitting often happens with non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building a model based on the data set and can therefore build quite unrealistic models. A solution to avoid overfitting is to use a linear algorithm like Support Vector Machines (SVM) if you have linear data, or to use parameters like the maximal depth if you are using decision trees.


In the example above we can see the line of an overfitted model and the line of a regular model. The green line comes from the overfitted model. It is very complex and overreacts to every single point instead of finding a moderate path like the black line of the regular model.

Underfitting

Underfitting is obviously the opposite of overfitting, but it can also destroy the accuracy of your machine learning model. It often happens when you have too little data to build an accurate model. It also happens when you try to build a linear model with non-linear data. The rules of your machine learning model are then too simple, and the model will probably make a lot of wrong predictions. Solutions for underfitting, besides getting more data, include feature engineering, e.g. constructing new, more expressive features, as sketched below.
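As a small illustration (my own sketch, not from the original post): a plain linear regression underfits data with a quadratic shape, while the same linear model on engineered polynomial features fits it well.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = x.ravel() ** 2 + np.random.normal(0, 0.5, 100)  # non-linear (quadratic) target

linear_model = LinearRegression().fit(x, y)
print("Plain linear model R^2:", round(linear_model.score(x, y), 3))  # underfits, close to 0

x_poly = PolynomialFeatures(degree=2).fit_transform(x)  # simple feature engineering
poly_model = LinearRegression().fit(x_poly, y)
print("With polynomial features R^2:", round(poly_model.score(x_poly, y), 3))  # much better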

Choosing the wrong parameters

Furthermore, overfitting and underfitting appear if you choose the wrong parameters for your machine learning algorithm. As mentioned above, decision trees have e.g. the parameter max depth, which describes the maximal depth of the tree. If max depth is too high, overfitting is encouraged, and if max depth is too low, underfitting becomes a problem. A small sketch of this effect follows below.
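Here is a minimal sketch (my own example with sklearn, not from the original post) that compares an unrestricted decision tree with a depth-limited one on the iris data set. The exact numbers will vary, but the unrestricted tree typically fits the training set almost perfectly while generalizing less evenly than the shallow tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

features, labels = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(features, labels, test_size=0.3, random_state=1)

for depth in [None, 2]:  # None lets the tree grow as deep as it wants
	clf = DecisionTreeClassifier(max_depth=depth, random_state=1)
	clf.fit(train_x, train_y)
	print("max_depth =", depth, "train acc:", round(clf.score(train_x, train_y), 3), "test acc:", round(clf.score(test_x, test_y), 3))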

In the next weeks I will post a blog post where I talk about concepts for avoiding an overfitted or underfitted machine learning model.

Summary

So overfitting and underfitting are both things you want to avoid; in the best case you end up somewhere in the middle between them. To avoid both, take care to have enough data (there is no such thing as too much data, but you still have to take care of choosing the right parameters and the right algorithm) and carefully evaluate your data set so you can make the right decisions for training.

Thanks for reading, if you have not already read my last blog post, then you can do that by clicking here. I am also active on Twitter, so if you do not want to miss any new blog posts, then follow me there.


Support Vector Machines (SVM)

This blog post is part of a series, where I talk about concepts and algorithms in Machine Learning. In this part I want to talk about another popular algorithm, which is widely used for solving classification and regression problems. I am of course talking about Support Vector Machines (SVM).

What do Support Vector Machines do?

Support Vector Machines are supervised learning models for classification and regression problems. They can solve linear and non-linear problems and work well for many practical problems. The idea of Support Vector Machines is simple: the algorithm creates a line which separates the classes, e.g. in a classification problem. The goal of the line is to maximize the margin between the points on either side of this so-called decision line. The benefit of this process is that, after the separation, the model can easily guess the target classes (labels) for new cases.

Maybe you now think that this only works for low dimensional problems, e.g. a data set with only 2 features, but that is wrong! Support Vector Machines are actually very effective in higher dimensional spaces. They are even effective on data sets where the number of dimensions is greater than the number of samples. This is mainly because of the kernel trick, which we will talk about later. Further advantages of Support Vector Machines are memory efficiency, speed and general accuracy in comparison to other classification methods like k-nearest neighbors or deep neural networks. Of course they are not always better than e.g. deep neural networks, but sometimes they still outperform them.

Difference between Linear and Non-Linear Data

To clear everything up, let me quickly explain what the distinction between linear and non-linear data is all about. We talk about linear data when we can classify the data with a linear classifier. A linear classifier makes its classification decision based on a linear combination of characteristics. The characteristics are also known as features in machine learning. The following picture makes things clearer.

In figure A we can separate the target labels linearly with a line (just as Support Vector Machines do with a decision line). A linear classifier can do this with a linear combination of characteristics. We could use e.g. Support Vector Machines to build such a model, but we could also use many other linear classification methods.

In figure B we can not separate the target labels with a straight line; the data is divided in a more complex way. Therefore we can not just use a linear classification method. Fortunately Support Vector Machines can do both, linear and non-linear classification. Let's first take an easier linear example to get an introduction to Support Vector Machines. Later we will look at non-linear classification with Support Vector Machines and see how it works with the kernel trick.

Linear Example

To create a linear example and train a model with the Support Vector Machines algorithm, I will use the C-Support Vector Classification algorithm from the sklearn library in Python. First we will just run the plain C-Support Vector Classification algorithm (SVC) on the iris data set.

# iris_svc.py
# iris dataset
# 150 total entries
# features are: sepal length in cm, sepal width in cm, petal length in cm, petal width in cm
# labels names: setosa, versicolor, virginica
#
# used algorithm: SVC (C-Support Vector Classification)
#
# accuracy ~100%
#
from time import time
import numpy as np
from sklearn.datasets import load_iris
from sklearn import svm
from sklearn.metrics import accuracy_score

def main():
	data_set = load_iris()

	features, labels = split_features_labels(data_set)

	train_features, train_labels, test_features, test_labels = split_train_test(features, labels, 0.18)

	print(len(train_features), " ", len(test_features))

	clf = svm.SVC()

	print("Start training...")
	tStart = time()
	clf.fit(train_features, train_labels)
	print("Training time: ", round(time()-tStart, 3), "s")

	print("Accuracy: ", accuracy_score(clf.predict(test_features), test_labels))


def split_train_test(features, labels, test_size):
	total_test_size = int(len(features) * test_size)
	np.random.seed(2)
	indices = np.random.permutation(len(features))
	train_features = features[indices[:-total_test_size]]
	train_labels = labels[indices[:-total_test_size]]
	test_features  = features[indices[-total_test_size:]]
	test_labels  = labels[indices[-total_test_size:]]
	return train_features, train_labels, test_features, test_labels

def split_features_labels(data_set):
	features = data_set.data
	labels = data_set.target
	return features, labels

if __name__ == "__main__":
	main()

Output:

123 27
Start training…
Training time: 0.002 s
Accuracy: 1.0

Okay, this seems to work pretty well, but what does the decision line of a Support Vector Machine look like? First let's plot the iris data set to see how it looks. To make things easier, let's just concentrate on the first two features: sepal length and sepal width.
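Here is a small plotting sketch (my own code, since the original plot was an image) that shows the first two iris features, colored to match the description below:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
colors = ['red', 'orange', 'grey']  # setosa, versicolor, virginica

for label, name in enumerate(iris.target_names):
	mask = iris.target == label
	plt.scatter(iris.data[mask, 0], iris.data[mask, 1], color=colors[label], label=name)

plt.xlabel('sepal length in cm')
plt.ylabel('sepal width in cm')
plt.legend()
plt.show()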

With the Support Vector Machine algorithm we could now probably separate the red group from the other two groups (orange and grey).


The separation would probably look like this. But we still have the problem that the orange and grey groups are difficult to separate. We could guess that a higher sepal width and a higher sepal length are a sign that the entry belongs to the grey group, but fortunately we do not have to assume such things, because we have a third and a fourth feature, petal length and petal width, so that we can separate the groups with decision lines in higher dimensions. The example above should show the general principle of Support Vector Machines: even though this is not a binary classification problem (there are 3 labels), we can easily separate the red group from the grey and orange groups with a decision line using only two of the four features.

How does this work?

For humans this seems pretty intuitive: we just draw a line to separate the differently labeled classes from each other. But how do Support Vector Machines solve this problem? The SVM wants to find the so-called maximum-margin hyperplane.

The hyperplane is the line with the biggest margin to both groups. We called this line the decision line above, but the mathematically correct term is hyperplane, because in more than two dimensions it is not a line anymore.

We give the Support Vector Machine algorithm a bunch of labeled vectors as a training set. All vectors are p-dimensional, where p is the number of features in our training set. To find the maximum-margin hyperplane, we have to maximize the margin to the nearest points of each target group. In a binary classification, we can declare the labels of the two target groups as -1 and 1. The hyperplane can then be described as the set of points x satisfying

w · x - b = 0

where w is the normal vector to the hyperplane and b is a bias. A normal vector is simply a vector that stands orthogonally on a line or plane. If you are familiar with linear algebra, this may look familiar to you: it is like the Hesse normal form, except that w does not have to be a unit vector.

The term b/||w|| determines the offset of the hyperplane from the origin along the normal vector w. With the hyperplane (decision line) the model can now classify new entries.
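For a linear kernel, sklearn actually exposes w and b of the trained model, so you can check the hyperplane yourself. A minimal sketch (my own addition; note that sklearn uses the convention w · x + b, so the sign of the bias is flipped compared to the formula above):

import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris

features, labels = load_iris(return_X_y=True)
# keep only two classes so there is a single separating hyperplane
features, labels = features[labels < 2], labels[labels < 2]

clf = svm.SVC(kernel='linear')
clf.fit(features, labels)

w = clf.coef_[0]       # normal vector of the hyperplane
b = clf.intercept_[0]  # bias / offset

x_new = features[0]
side = np.sign(np.dot(w, x_new) + b)  # -1 or +1: which side of the hyperplane
print(side, clf.predict([x_new])[0])  # the sign corresponds to the predicted class (0 or 1)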

There are actually different sub-classifiers which behave differently: the Soft Margin Classifier allows some noise in the training data, while the Hard Margin Classifier does not allow any noise in the training data.
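In sklearn this trade-off is controlled by the parameter C: a small C gives a softer margin that tolerates more noisy points, and a very large C approximates a hard margin. A minimal sketch (my own example, not from the original post):

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

features, labels = load_iris(return_X_y=True)

for C in [0.01, 1, 1000]:
	clf = svm.SVC(kernel='linear', C=C)
	scores = cross_val_score(clf, features, labels, cv=5)
	print("C =", C, "mean cross-validated accuracy:", round(scores.mean(), 3))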

Using Support Vector Machines for non-linear data with the kernel trick

Until now we have talked about linear examples, how Support Vector Machines work and how you can implement them with sklearn in Python. I already talked a little bit about non-linear data. When a data set is non-linear, Support Vector Machines can not simply draw a linear hyperplane. Therefore Support Vector Machines use the kernel trick. When you have non-linear data, the kernel method helps you to find patterns and relations and to reach a high accuracy in your final machine learning model.

How does the kernel method work?

The kernel method relies on a so-called kernel function. This function maps the non-linearly separable input space into a higher dimensional, linearly separable feature space. In this new higher dimensional feature space Support Vector Machines can work as normal. The kernel method then maps the solution back, so that in the non-linearly separable input space you end up with a non-linear solution.

In the example above we have a two dimensional input space which is not linearly separable. With the kernel function we can map the input space into a three dimensional feature space. In this feature space we can then separate the training set with a linear plane. When we map the solution back to the input space we get a non-linear solution.
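The following sketch (my own example, not from the post) makes the mapping explicit instead of using a kernel: points on two concentric circles can not be separated by a line in 2D, but after adding x1² + x2² as a third dimension a linear classifier separates them easily. A real SVM with a kernel does this implicitly.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

features, labels = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_2d = LinearSVC(max_iter=10000).fit(features, labels)
print("2D linear accuracy:", round(linear_2d.score(features, labels), 3))  # poor, around 0.5

# explicit mapping into 3D: (x1, x2) -> (x1, x2, x1^2 + x2^2)
features_3d = np.c_[features, (features ** 2).sum(axis=1)]
linear_3d = LinearSVC(max_iter=10000).fit(features_3d, labels)
print("3D linear accuracy:", round(linear_3d.score(features_3d, labels), 3))  # close to 1.0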

Implementation

In sklearn you can use different kernels to train your model. You can check out here how to implement a non-linear Support Vector Machine classifier in Python. There is no big difference to the linear example above; you just have to figure out which parameters you want to choose to get a high accuracy for your machine learning model.
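A minimal sketch (my own example with assumed parameter values): the only real change compared to the linear code above is the kernel and the parameters C and gamma.

from sklearn import svm
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

features, labels = make_moons(n_samples=300, noise=0.2, random_state=1)
train_x, test_x, train_y, test_y = train_test_split(features, labels, test_size=0.2, random_state=1)

clf = svm.SVC(kernel='rbf', C=1.0, gamma='scale')  # RBF kernel for non-linear data
clf.fit(train_x, train_y)
print("Test accuracy:", round(clf.score(test_x, test_y), 3))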

Summary

Support Vector Machines calculate a hyperplane to build a classification model. The hyperplane divides the target labels with a maximized margin. With the use of the kernel trick you can also classify non-linear data.

Thanks for reading, if you have not already read my last blog post, then you can do that by clicking here. I am also active on Twitter, so if you do not want to miss any new blog posts, then follow me there.

Data Visualization with Python

This blog post is part of a series, where I talk about concepts and algorithms in Machine Learning. In this blog post I do not want to talk about any concept or algorithm, but about data visualization with Python. Data visualization is very important, especially for Machine Learning, because you want to explore your data and gain knowledge about the data set you work with, so that you can select good features. I won't explain the following plots and charts in depth, because most of them are very similar; they should just give you some ideas of how you can plot your data. You will find all examples of the following plots and charts on GitHub.

First Plot

I will use matplotlib and Python to show how easy it can be to visualize data and see relationships between features and targets. First let's plot simple functions like x and x². On one axis we have the values for x and on the other axis the result of the function.

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7]
y = [1, 4, 9, 16, 25, 36, 49]
y1 = [1, 2, 3, 4, 5, 6, 7]

plt.plot(x, y, label='x^2')
plt.plot(x, y1, label='x')
plt.xlabel('x axis')
plt.ylabel('y axis')

plt.title('function plots')
plt.legend()
plt.show()

Running this produces the following plot:

Bar charts

Next up are bar charts. Bar charts are also extremely useful if you want to explore the frequency of certain events, e.g. the number of emails you get over the day. (The example below actually uses matplotlib's hist function, which bins the values into a bar-style histogram for us.)

import matplotlib.pyplot as plt

incoming_email_at_hour = [8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 11, 11, 12, 13, 13, 13, 13, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 17, 17, 18]

hours = [x for x in range(6, 20)]  # bin edges: one bin per hour, covering hours 6 to 18

plt.hist(incoming_email_at_hour, hours, histtype='bar', rwidth=0.8)

plt.xlabel('t in hour')
plt.ylabel('incoming emails')

plt.title('Incoming emails per hour')
plt.show()

Scatter plots

import matplotlib.pyplot as plt

x = [1.1, 1.7, 1.3, 1.4, 1.2, 2.7, 2.3, 2.2]
y = [4, 6, 10, 3, 2, 10, 11, 5]

x0 = [3.2, 3.3, 3.7, 4, 4.2, 4.3]
y0 = [5, 7, 3, 9, 3, 12]

plt.scatter(x, y, label='Red', color='r')
plt.scatter(x0, y0, label='Blue', color='b')

plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter Plot')
plt.legend()
plt.show()

Most of the time I use scatter plots to visualize how certain features are connected to the target label. Let's assume we have a binary classification problem like this. Then we can see that we could easily work with Support Vector Machines here, because we could separate the two regions with a line. If there is no clear division between two regions, or there are multiple regions, then maybe decision trees are a better solution.

Pie Charts

Pie charts are, like bar charts, great for representing surveys or showing relative sizes. In the following example a pie chart represents how people get to work.

import matplotlib.pyplot as plt

transport_methods = ['Car', 'Train', 'Bus', 'Walk']

transport_count = [23, 25, 10, 2] # e.g. total count of random survey

col = ['r', 'b', 'y', 'g'] 

plt.pie(transport_count, labels=transport_methods, colors=col)

plt.title('Pie Chart')

plt.show()

Mosaic plots

Mosaic plots, also called marimekko charts, are a great way to visualize relationships between two or more categorical features. So instead of plotting a multi-dimensional bar chart, you can use a mosaic plot to explore the relation between multiple categorical features. Unfortunately I did not find a way to create mosaic plots with matplotlib, but I was able to make an example with statsmodels.

import pandas as pd
from statsmodels.graphics.mosaicplot import mosaic
import pylab
from itertools import product
import numpy as np
rand = np.random.random

speaks_mul_foreign_languages = list(product(['male', 'female'], ['yes', 'no']))
index = pd.MultiIndex.from_tuples(speaks_mul_foreign_languages, names=['gender', 'speaks multiple languages'])
data = pd.Series(rand(4), index=index)

mosaic(data, gap=0.01, title='Who knows multiple foreign languages? - Mosaic Chart')
pylab.show()

Summary

There are dozens of ways to represent and visualize data. In this blog post I tried to give you some inspiration and ideas of how you can do it on your own. I have put all examples in a GitHub repository and will add more examples when I find good ones and have the time to do it.

Thanks for reading, if you have not already read my last blog post, then you can do that by clicking here. I am also active on Twitter, so if you do not want to miss any new blog posts, then follow me there.


Preparation Of Data Before Training

This blog post is part of a series, where I talk about concepts and algorithms in Machine Learning. In this part I want to talk a little about the preparation of data before you start training your Machine Learning algorithm on it.

Basic feature engineering: Selecting the right features

In feature engineering you typically take a close look at your features and decide whether they are relevant to the question you want to answer. You also create new features out of existing ones, as described later. For most machine learning algorithms the data must be numerical, but there are also algorithms which can deal with categorical features. Normally you explore your data set and decide which features seem to be predictive of the target value. With these selected features you then train your selected machine learning algorithm. After your first pick of features you can add other, less obviously strong features which seem to be related to the target. If the model then performs better, you can repeat this step again and again until you find the best selection of features. A small sketch of this loop follows below.
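A minimal sketch of such an iteration (my own example on the iris data set, with hypothetical feature subsets): evaluate a growing list of feature subsets with cross-validation and keep the additions that actually help.

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

iris = load_iris()

# hypothetical picks: start with the two petal features, then add the sepal features
feature_subsets = [[2, 3], [2, 3, 0], [2, 3, 0, 1]]

for subset in feature_subsets:
	score = cross_val_score(svm.SVC(), iris.data[:, subset], iris.target, cv=5).mean()
	print([iris.feature_names[i] for i in subset], "accuracy:", round(score, 3))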

The right amount of training data

There is of course no universal answer to this question. But there are certain factors which determine the accuracy of your model: e.g. if you have little data, your accuracy will probably be lower than with more training data. Further, for more complex and non-linear patterns you will also need more training data to cover the full range of complexity in your training data. Another huge factor in the amount of training data you need is the dimension of your features. If you have 100 features you will probably need way more than 1000 entries in your data set, but if you only have 2 features, 1000 entries may be fine. In general you can say that the more representative data there is, the higher the accuracy tends to be.

Quality of the data

Quantity is important, but another huge factor is the quality and representativeness of the training data. If you want to detect whether an online account belongs to a man or a woman, but your training set includes mostly entries from men, you will probably not have much success training it. Also, if your training data only includes old data entries, but you want to predict new events, there is a high chance of a loss of accuracy. So on the one hand the training data has to be representative of every possible target you want to determine, and on the other hand it must be up to date.

Changing categorical features to numerical

Most machine learning algorithms only work with numerical features, so you have to make sure that you prepare the training data properly before training the model. Categorical features are features like gender or relationship status. One simple way to deal with this problem is to transform the categories into multiple binary features. If we take the example of gender, then we can transform the categorical feature gender into two binary features, male and female. If an entry has a 1 for male, the entry refers to a man; if the entry has a 1 for female, the entry refers to a woman.
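A minimal sketch with pandas (my own example, with an assumed column name):

import pandas as pd

data = pd.DataFrame({'gender': ['male', 'female', 'female', 'male']})
encoded = pd.get_dummies(data, columns=['gender'])
print(encoded)
# the result has two binary columns: gender_female and gender_male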

Replacing missing data

In an ideal world there would be no missing data in a data set, but in the real world missing data exists, so we have to deal with this problem too. One naive way would be to just remove the whole entry if some data is missing in it, but then we remove possibly essential data entries which could later help us to improve the accuracy of our model. This approach only works if we have a large data set with only little data missing. Otherwise we have to replace the missing data fields with substitute values. If data is missing in categorical features, you will most likely create a new category for the missing value. If data is missing in numerical features, you can replace it with values which are outside of the real range of the feature, e.g. you could replace missing entries in the feature age with a negative number like -1. Sometimes this can work, but a more elegant way is the concept of imputation, where you replace missing data with a guess of the true value. One common approach to imputation for numerical features is replacing missing data with the median, as sketched below.
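A minimal sketch of median imputation (my own example, with an assumed column name "age"):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.DataFrame({'age': [25, 32, np.nan, 41, np.nan, 19]})

imputer = SimpleImputer(strategy='median')            # replace NaN with the column median
data[['age']] = imputer.fit_transform(data[['age']])
print(data)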

Thanks for reading, if you have not already read my last blog post, then you can do that by clicking here. I am also active on Twitter, so if you do not want to miss any new blog posts, then follow me there.

Gaussian Naive Bayes

I am currently learning some machine learning topics and, along the way, exploring some interesting algorithms I want to share here. This time I want to talk about the Gaussian Naive Bayes algorithm, which is a simple classification algorithm based on Bayes' theorem.

Bayes’ theorem

Bayes' theorem is named after Thomas Bayes (1701-1761), who first introduced it; it was later developed further by Pierre-Simon Laplace, who published the modern equation of Bayes' theorem in 1812. In general, Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. So it basically fits perfectly for machine learning, because that is exactly what machine learning does: making predictions for the future based on prior experience. Mathematically you can write Bayes' theorem as follows:

P(A|B) = P(B|A) · P(A) / P(B)

Let’s break the equation down:

  • A and B are events.
  • P(A) and P(B) are the probabilities of observing A and B independently of each other (with P(B) not 0).
  • P(A|B) is the probability of observing event A given that B is true.
  • P(B|A), equivalently, is the probability of observing event B given that A is true.

For me, as a non-expert in probability theory, P(A|B) and P(B|A) seemed a little confusing at first. These probabilities are called conditional probabilities and they describe the probability of A under the condition B. Let's look at an example: say that A stands for the event that, if you look outside your house, "you will see at least one person". Assume that B stands for "it is raining". Then P(A|B) is the probability that you will see at least one person outside your house given that it is raining. This of course also works for negations: P(A|not B), where not B stands for "it is not raining", describes the probability that you will see at least one person outside your house given that it is not raining.

Okay, back the the Bayes’ theorem. Lets take an example and see how it works. I try to show you many examples, because I often find it myself difficult to understand certain topics, especially math topics if there are no concrete examples.

Example

  • Let's say that A is the event that a person is ill, and not A that the person is not ill.
  • There exists a test; B stands for a positive test result, not B stands for a negative test result.
  • The probability of A is P(A) = 0.01 (1%), and P(not A) = 0.99 (99%).
  • The probability that the test is positive if the person is ill is P(B|A) = 0.99 (99%), and likewise the probability that the test is negative if the person is not ill is P(not B|not A) = 0.99 (99%). This means the test is 99% sensitive (positive test result when the person is ill) and 99% specific (negative test result when the person is not ill).

Tree diagram of the probabilities

Let's calculate the probability that a random person with a positive test result is actually ill:

P(A|B) = P(B|A) · P(A) / (P(B|A) · P(A) + P(B|not A) · P(not A)) = (0.99 · 0.01) / (0.99 · 0.01 + 0.01 · 0.99) = 0.5

So with a probability of 50%, a random person with a positive test result is ill. In other words: a random person with a positive test result is not ill in 50% of the cases.

Why is that so? It is simple: because there are many more people who are actually not ill. Let's assume we have 100 people: 99 people are not ill, 1 person is ill. 99 (not ill people) · 0.01 (probability of a wrong test result) = 0.99, so we expect 0.99 false positives. 1 ill person · 0.99 (probability of a correct test result) = 0.99 expected true positives. That leads to a probability of 0.99 / (0.99 + 0.99) = 0.5, i.e. 50%, that a random person with a positive test result is ill.
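You can recompute the example in a few lines of Python (my own sketch):

p_ill = 0.01                 # P(A)
p_not_ill = 0.99             # P(not A)
p_pos_given_ill = 0.99       # P(B|A), sensitivity
p_pos_given_not_ill = 0.01   # P(B|not A), 1 - specificity

p_pos = p_pos_given_ill * p_ill + p_pos_given_not_ill * p_not_ill  # P(B)
print(p_pos_given_ill * p_ill / p_pos)  # P(A|B) = 0.5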

So with Bayes' theorem you can quite easily calculate the probability of an event based on the prior probabilities and conditions.

Gaussian Naive Bayes

Gaussian Naive Bayes is one of the naive Bayes classifier models. Besides Gaussian Naive Bayes there are also the Multinomial Naive Bayes and the Bernoulli Naive Bayes classifiers. I picked Gaussian Naive Bayes because it is the simplest and most popular one.

Iris data set

One of the most popular data sets in machine learning is definitely the iris data set. The iris data set is about flowers which have four features: sepal length, sepal width, petal length and petal width. Each flower is labeled as one of three species: setosa, versicolor and virginica. The data set has a total of 150 entries, so it is very small.

Let's see how it performs with the Gaussian Naive Bayes classifier.


#
# iris dataset
# 150 total entries
# features are: sepal length in cm, sepal width in cm, petal length in cm, petal width in cm
# labels names: setosa, versicolor, virginica
#
# used algorithm: Gaussian Naive Bayes (GaussianNB)
#
# accuracy ~100%
#
from time import time
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def main():
	data_set = load_iris()

	features, labels = split_features_labels(data_set)

	train_features, train_labels, test_features, test_labels = split_train_test(features, labels, 0.18)

	print(len(train_features), " ", len(test_features))

	clf = GaussianNB()

	print("Start training...")
	tStart = time()
	clf.fit(train_features, train_labels)
	print("Training time: ", round(time()-tStart, 3), "s")

	print("Accuracy: ", accuracy_score(clf.predict(test_features), test_labels))



def split_train_test(features, labels, test_size):
	total_test_size = int(len(features) * test_size)
	np.random.seed(2)
	indices = np.random.permutation(len(features))
	train_features = features[indices[:-total_test_size]]
	train_labels = labels[indices[:-total_test_size]]
	test_features  = features[indices[-total_test_size:]]
	test_labels  = labels[indices[-total_test_size:]]
	return train_features, train_labels, test_features, test_labels

def split_features_labels(data_set):
	features = data_set.data
	labels = data_set.target
	return features, labels


if __name__ == "__main__":
	main()

The Jupyter Notebook and the Python file are also available on GitHub. As we can see, the classifier is very fast, even though that is no surprise on such a small data set, and the accuracy is perfect. Of course with other permutations and another ratio between training and test set there will be other results.

Adult census income

Next let's look at another example. I found this data set which contains entries about people; the features include age, workclass, education, occupation, race and many more. Each entry is labeled with an income, which is either above 50k or below. So the task is to train a classifier, e.g. a Gaussian Naive Bayes classifier, and find out for future entries whether a person's income is above or below 50k.

That is what I did:

# 
# https://www.kaggle.com/uciml/adult-census-income
# there are 32,562 total entries
# features are: "age","workclass","fnlwgt","education","education.num",
# "marital.status","occupation", "relationship","race","sex","capital.gain",
# "capital.loss","hours.per.week","native.country","income"
# label is "income"
#
# used algorithm: Gaussian Naive Bayes (GaussianNB)
# 
# accuracy ~80%
#
import os
import pandas as pd
from time import time
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

def main():
	data_set = load_data_set()
	print(data_set.info())

	# data cleaning and preprocessing
	data_set = clean_data_set(data_set)

	train_set, test_set = train_test_split(data_set, test_size=0.2, random_state=3)
	print(len(train_set), " ", len(test_set))

	train_features, train_labels = split_features_labels(train_set, "income")
	test_features, test_labels = split_features_labels(test_set, "income")

	clf = GaussianNB()

	print("Start training...")
	tStart = time()
	clf.fit(train_features, train_labels)
	print("Training time: ", round(time()-tStart, 3), "s")

	print("Accuracy: ", accuracy_score(clf.predict(test_features), test_labels))


def load_data_set():
	csv_path = os.path.join("adult.csv")
	return pd.read_csv(csv_path)

def split_features_labels(data_set, feature):
	features = data_set.drop(feature, axis=1)
	labels = data_set[feature].copy()
	return features, labels

def clean_data_set(data_set):
	for column in data_set.columns:
		if data_set[column].dtype == object:  # encode categorical (string) columns as integers
			le = LabelEncoder()
			data_set[column] = le.fit_transform(data_set[column])

	return data_set

if __name__ == "__main__":
	main()

Unfortunately the accuracy is only about 80% here, but with feature engineering and other algorithms the accuracy can go up to about 87%.

Summary

Naive Bayes classifiers work based on Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. It is a very simple and fast family of classifiers and sometimes works very well; even without much effort you can get an okay accuracy.

Thanks for reading, if you have not already read my last blog post, then you can do that by clicking here. I am also active on Twitter, so if you do not want to miss any new blog posts, then follow me there.