Data Visualization with Python

This blog post is part of a series where I talk about concepts and algorithms in Machine Learning. In this post I do not want to talk about any particular concept or algorithm, but about data visualization with Python. Data visualization is very important, especially for Machine Learning, because you want to explore your data and gain knowledge about the data set you work with, so that you can select good features. I won’t explain the following plots and charts in depth, because most of them are very similar; they should just give you some ideas of how you can plot your data. You will find all examples of the following plots and charts on GitHub.

First Plot

I will use matplotlib and Python to show how easy it can be to visualize data and see relationships between features and targets. First, let’s plot simple functions like x and x². The values of x go on the x axis and the result of the function on the y axis.

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7]
y = [1, 4, 9, 16, 25, 36, 49]
y1 = [1, 2, 3, 4, 5, 6, 7]

plt.plot(x, y, label='x^2')
plt.plot(x, y1, label='x')
plt.xlabel('x axis')
plt.ylabel('y axis')

plt.title('function plots')
plt.legend()
plt.show()

Running this code produces the following plot:

Bar charts

Next up are bar charts. Bar charts (here created as a histogram with bar style) are extremely useful if you want to explore the frequency of certain events, e.g. the frequency of emails you receive throughout the day.

import matplotlib.pyplot as plt

incoming_email_at_hour = [8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 11, 11, 12, 13, 13, 13, 13, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 17, 17, 18]

hours = [x for x in range(6, 20)]  # bin edges from 6 to 19 so that the emails at 18:00 are included

plt.hist(incoming_email_at_hour, hours, histtype='bar', rwidth=0.8)

plt.xlabel('t in hour')
plt.ylabel('incoming emails')

plt.title('Incoming emails per hour')
plt.show()

Scatter plots

import matplotlib.pyplot as plt

x = [1.1, 1.7, 1.3, 1.4, 1.2, 2.7, 2.3, 2.2]
y = [4, 6, 10, 3, 2, 10, 11, 5]

x0 = [3.2, 3.3, 3.7, 4, 4.2, 4.3]
y0 = [5, 7, 3, 9, 3, 12]

plt.scatter(x, y, label='Red', color='r')
plt.scatter(x0, y0, label='Blue', color='b')

plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter Plot')
plt.legend()
plt.show()

Most of the time I use scatter plots to visualize how certain features are connected to the target label. Let’s assume that we have two classes of points like this. Then we can see that we could easily work with a support vector machine here, because we could separate the two regions with a line. If there is no clear division between the two regions, or there are multiple regions, then a decision tree is maybe a better solution.
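
Just as a rough, hypothetical sketch of that idea (reusing the two point clouds from the scatter plot above as the two classes), a linear support vector machine from scikit-learn could separate them:

import numpy as np
from sklearn.svm import SVC

# The two point clouds from the scatter plot above, used as two classes.
x = [1.1, 1.7, 1.3, 1.4, 1.2, 2.7, 2.3, 2.2]
y = [4, 6, 10, 3, 2, 10, 11, 5]
x0 = [3.2, 3.3, 3.7, 4, 4.2, 4.3]
y0 = [5, 7, 3, 9, 3, 12]

# Build a feature matrix of (x, y) points and a label vector (0 = red, 1 = blue).
features = np.array(list(zip(x + x0, y + y0)))
labels = np.array([0] * len(x) + [1] * len(x0))

clf = SVC(kernel='linear')
clf.fit(features, labels)

# The clouds are clearly separated, so the linear SVM should classify them perfectly.
print(clf.score(features, labels))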

Pie Charts

Pie charts, like bar charts, are great for representing surveys or showing relative sizes. In the following example a pie chart represents how people get to work.

import matplotlib.pyplot as plt

transport_methods = ['Car', 'Train', 'Bus', 'Walk']

transport_count = [23, 25, 10, 2] # e.g. total count of random survey

col = ['r', 'b', 'y', 'g'] 

plt.pie(transport_count, labels=transport_methods, colors=col)

plt.title('Pie Chart')

plt.show()

Mosaic plots

Mosaic plots, also called Marimekko charts, are a great way to visualize relationships between two or more categorical features. So instead of plotting a multi-dimensional bar chart, you can use a mosaic plot to explore the relation between multiple categorical features. Unfortunately I did not find a way to create mosaic plots with matplotlib itself, but I was able to make an example with statsmodels.

import pandas as pd
from statsmodels.graphics.mosaicplot import mosaic
import pylab
from itertools import product
import numpy as np
rand = np.random.random

speaks_mul_foreign_languages = list(product(['male', 'female'], ['yes', 'no']))
index = pd.MultiIndex.from_tuples(speaks_mul_foreign_languages, names=['gender', 'speaks multiple languages'])
data = pd.Series(rand(4), index=index)

mosaic(data, gap=0.01, title='Who knows multiple foreign languages? - Mosaic Chart')
pylab.show()

Summary

There are dozens of ways to represent and visualize data. I tried in this blog post to give you some inspiration and ideas of how you can do it on your own. I have put all examples in a GitHub repository and I will add more examples whenever I find good ones and have the time to do it.

Thanks for reading! If you have not already read my last blog post, you can do that by clicking here. I am also active on Twitter, so if you do not want to miss any new blog posts, follow me there.

 

Preparation Of Data Before Training

This blog post is part of a series where I talk about concepts and algorithms in Machine Learning. In this part I want to talk a little about the preparation of data before you start feeding it to your Machine Learning algorithm.

Basic feature engineering: Selecting the right features

In feature engineering you typically look at your features and decide whether they are relevant to the question you want to answer. You can also create new features out of existing ones, as described later. For most machine learning algorithms the data must be numerical, but there are also algorithms that can deal with categorical features. Normally you explore your data set and decide which features seem to be predictive of the target value. With these selected features you then train your chosen machine learning algorithm. After your first pick of features you can add other, less obviously strong features that seem to be related to the target. If the model then performs better, you repeat this step again and again until you find the best selection of features, as sketched below.
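
As a minimal sketch of this loop (the data frame and column names here are made up for illustration), you can compare cross-validated scores with and without a candidate feature:

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Tiny made-up data set, just to show the mechanics of the loop.
data = pd.DataFrame({
    'age':             [23, 45, 31, 52, 36, 28, 60, 41, 33, 48],
    'education_years': [9, 13, 10, 14, 12, 9, 16, 11, 10, 13],
    'hours_per_week':  [40, 50, 38, 45, 40, 35, 60, 42, 40, 50],
    'high_income':     [0, 1, 0, 1, 1, 0, 1, 0, 0, 1],
})

base_features = ['age', 'education_years']                # first, obvious picks
candidate_features = base_features + ['hours_per_week']   # a less obvious addition

clf = GaussianNB()
score_base = cross_val_score(clf, data[base_features], data['high_income'], cv=5).mean()
score_new = cross_val_score(clf, data[candidate_features], data['high_income'], cv=5).mean()

# Keep the extra feature only if it improves the cross-validated score,
# then repeat the comparison with the next candidate feature.
if score_new > score_base:
    base_features = candidate_features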

How much training data do you need?

There is of course no universal answer to this question. But there are certain factors which determine the accuracy of your model: e.g. if you have less data, your accuracy will probably be lower than it would be with more training data. Furthermore, for more complex and non-linear patterns you will need more training data to cover the full range of complexity in your problem. Another huge factor in the amount of training data you need is the dimensionality of your features. If you have 100 features you will probably need far more than 1000 entries in your data set, but if you only have 2 features, 1000 entries may be fine. In general you can say that the more data there is, the higher the accuracy will be.
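
One way to get a feeling for whether more data would help is a learning curve, i.e. looking at the validation score as a function of the number of training examples. A minimal sketch with scikit-learn, using the iris data set as a stand-in:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

features, labels = load_iris(return_X_y=True)

# Train on 10%, 32.5%, 55%, 77.5% and 100% of the available training folds.
train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), features, labels, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# If the validation score is still rising at the largest training size,
# more data would probably improve the model further.
for size, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(size, round(score, 3))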

Quality of the data

Quantity is important, but another huge factor is the quality and representativeness of the training data. If you want to detect whether an online account belongs to a man or a woman, but your training set includes entries mostly from men, you probably will not have much success training on it. Also, if your training data only includes old entries but you want to predict new events, chances are high that you will lose accuracy. So on the one hand the training data has to be representative of every possible target you want to determine, and on the other hand it must be up to date.
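
A quick sanity check for representativeness is to look at how the classes are distributed in your training data. A small sketch (the data frame and the gender column are made up for illustration):

import pandas as pd

# Made-up training data: 90 accounts from men, 10 from women.
train = pd.DataFrame({'gender': ['male'] * 90 + ['female'] * 10})

# If one class dominates like this, the model will mostly learn that class.
print(train['gender'].value_counts(normalize=True))
# male      0.9
# female    0.1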

Changing categorical features to numerical

Most machine learning algorithms only work with numerical features, so you have to make sure that you prepare the training data accordingly before training the model. Categorical features are features like gender or relationship status. One simple way to deal with this problem is to transform the categories into multiple binary features. If we take the example of gender, we can transform the categorical feature gender into two binary features, male and female. If the entry has a 1 for male, the entry refers to a man; if the entry has a 1 for female, the entry refers to a woman.
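
A minimal sketch of this transformation with pandas (the values are made up):

import pandas as pd

data = pd.DataFrame({'gender': ['male', 'female', 'female', 'male']})

# One-hot encoding: one binary column per category.
encoded = pd.get_dummies(data, columns=['gender'], dtype=int)
print(encoded)
#    gender_female  gender_male
# 0              0            1
# 1              1            0
# 2              1            0
# 3              0            1

scikit-learn’s OneHotEncoder does the same thing and fits nicely into preprocessing pipelines.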

Replacing missing data

In an ideal world there would be no missing data in a data set, but in the real world missing data exists, so we have to deal with this problem too. One naive way would be to just remove the whole entry if data is missing in it, but then we might remove possibly essential entries which could later help us improve the accuracy of our model. This approach only works if we have a large data set with only little data missing. Otherwise we have to replace the missing data fields with substitute values. If data is missing in a categorical feature, you will most likely create a new category for the missing value. If data is missing in a numerical feature, we can replace it with a value outside the real range of the feature, so e.g. you could replace missing entries in the feature age with a negative number like -1. Sometimes this can work, but a more elegant way is the concept of imputation, where you replace missing data by guessing the true value. One approach to imputation for numerical features is replacing missing data with the median.
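
A minimal sketch of both variants with pandas (column name and values made up):

import numpy as np
import pandas as pd

data = pd.DataFrame({'age': [25, 32, np.nan, 47, np.nan, 51]})

# Naive variant: flag missing ages with a value outside the real range.
flagged = data['age'].fillna(-1)

# Imputation: replace missing ages with the median of the known ages.
imputed = data['age'].fillna(data['age'].median())
print(imputed.tolist())  # [25.0, 32.0, 39.5, 47.0, 39.5, 51.0]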

Thanks for reading! If you have not already read my last blog post, you can do that by clicking here. I am also active on Twitter, so if you do not want to miss any new blog posts, follow me there.

Gaussian Naive Bayes

I am currently learning some machine learning and along the way I am exploring some interesting algorithms that I want to share here. This time I want to talk about the Gaussian Naive Bayes algorithm, which is a simple classification algorithm based on Bayes’ theorem.

Bayes’ theorem

Bayes’ theorem is named after Thomas Bayes (1701-1761), who first introduced it; it was later developed further by Pierre Simon Laplace, who published the modern equation of Bayes’ theorem in 1812. In general, Bayes’ theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. So it basically fits perfectly for machine learning, because that is exactly what machine learning does: making predictions for the future based on prior experience. Mathematically you can write Bayes’ theorem as follows:

P(A|B) = P(B|A) * P(A) / P(B)

The Bayes’ theorem

Let’s break the equation down:

  • A and B are events.
  • P(A) and P(B) (with P(B) not 0) are the probabilities of observing A and B independently of each other.
  • P(A|B) is the probability of observing event A given that event B is true.
  • Likewise, P(B|A) is the probability of observing event B given that event A is true.

For me, as a non-expert in probability theory, P(A|B) and P(B|A) seemed a little bit confusing at first. These probabilities are also called conditional probabilities, and they describe the probability of A under the condition of B. Let’s look at an example: let’s say that A stands for the event that if you look outside your house “you will see at least one person”, and that B stands for “it is raining”. Then P(A|B) is the probability that you will see at least one person outside your house if it is raining. This of course also works for negations. So P(A|not B), where not B stands for “it is not raining”, describes the probability that you will see at least one person outside your house if it is not raining.

Okay, back to Bayes’ theorem. Let’s take an example and see how it works. I try to show many examples, because I often find it difficult myself to understand certain topics, especially math topics, when there are no concrete examples.

Example

  • Let’s say that A is the event that a person is ill, and not A that the person is not ill.
  • There exists a test: B stands for a positive test result, not B stands for a negative test result.
  • The probability of A is P(A) = 0.01 (1%), so P(not A) = 0.99 (99%).
  • The probability that the test is positive if the person is ill is P(B|A) = 0.99 (99%), and likewise the probability that the test is negative if the person is not ill is P(not B|not A) = 0.99 (99%). In other words, the test is 99% sensitive (positive test result when the person is ill) and 99% specific (negative test result when the person is not ill).
[Figure: tree diagram of the probabilities]

Lets calculate the probability, that a random person with a positive test result is ill:

P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|not A) * P(not A)) = (0.99 * 0.01) / (0.99 * 0.01 + 0.01 * 0.99) = 0.5

So with a probability of 50%, a random person with a positive test result is ill. In other words: a random person with a positive test result is not ill in 50% of the cases.

Why is that so? It is simple: because there are many more people who are actually not ill. Let’s assume that we have 100 people: 99 people are not ill, 1 person is ill. 99 (not ill people) * 0.01 (probability of a wrong test result) = 0.99, so we expect 0.99 false positives. 1 (ill person) * 0.99 (probability of a correct test result) = 0.99 expected true positives. That leads us to a probability of 0.99 / (0.99 + 0.99) = 0.5, i.e. 50%, that a random person with a positive test result is actually ill.

So with Bayes’ theorem you can calculate the probability of an event based on prior probabilities and conditions quite easily.
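
Here is the same calculation as a small Python snippet, just to make the arithmetic explicit:

# The ill / not ill example from above, computed with Bayes' theorem.
p_ill = 0.01                # P(A): prior probability of being ill
p_pos_given_ill = 0.99      # P(B|A): sensitivity of the test
p_pos_given_not_ill = 0.01  # P(B|not A): false positive rate

# Total probability of a positive test result, P(B).
p_pos = p_pos_given_ill * p_ill + p_pos_given_not_ill * (1 - p_ill)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_ill_given_pos = p_pos_given_ill * p_ill / p_pos
print(p_ill_given_pos)  # 0.5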

Gaussian Naive Bayes

Gaussian Naive Bayes is one Naive Bayes classifier model. Besides Gaussian Naive Bayes there are also Multinomial Naive Bayes and Bernoulli Naive Bayes. I picked Gaussian Naive Bayes because it is the simplest and most popular one.

Iris data set

One of the most popular data sets in machine learning is definitely the iris data set. The iris data set is about flowers, which have four features: sepal length, sepal width, petal length and petal width. Each flower is labeled as one of three species: setosa, versicolor and virginica. The data set has a total of 150 entries, so it is very small.

Let’s see how the Gaussian Naive Bayes classifier performs on it.

 

#
# iris dataset
# 150 total entries
# features are: sepal length in cm, sepal width in cm, petal length in cm, petal width in cm
# labels names: setosa, versicolor, virginica
#
# used algorithm: Gaussian Naive Bayes (GaussianNB)
#
# accuracy ~100%
#
from time import time
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def main():
	data_set = load_iris()

	features, labels = split_features_labels(data_set)

	train_features, train_labels, test_features, test_labels = split_train_test(features, labels, 0.18)

	print(len(train_features), " ", len(test_features))

	clf = GaussianNB()

	print("Start training...")
	tStart = time()
	clf.fit(train_features, train_labels)
	print("Training time: ", round(time()-tStart, 3), "s")

	print("Accuracy: ", accuracy_score(clf.predict(test_features), test_labels))



def split_train_test(features, labels, test_size):
	total_test_size = int(len(features) * test_size)
	np.random.seed(2)
	indices = np.random.permutation(len(features))
	train_features = features[indices[:-total_test_size]]
	train_labels = labels[indices[:-total_test_size]]
	test_features  = features[indices[-total_test_size:]]
	test_labels  = labels[indices[-total_test_size:]]
	return train_features, train_labels, test_features, test_labels

def split_features_labels(data_set):
	features = data_set.data
	labels = data_set.target
	return features, labels


if __name__ == "__main__":
	main()

The Jupyter Notebook and the Python file will also be available on GitHub. As we can see, the classifier is very fast, even though this is admittedly not a big data set, and the accuracy is perfect. Of course, with other permutations and other ratios between training and test set there will be other results.
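
As a side note: instead of the hand-rolled split_train_test above, you could also use scikit-learn’s train_test_split, which the next example does. A minimal sketch on the iris data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

features, labels = load_iris(return_X_y=True)

# Same 18% test size as above, just with scikit-learn's helper.
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.18, random_state=2)

print(len(train_features), " ", len(test_features))  # 123   27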

Adult census income

Next, let’s look at another example. I found this data set which contains entries about people; the features include age, workclass, education, occupation, race and many more. Each entry is labeled with the income, which is either above 50K or below. So the task is to train a classifier, e.g. a Gaussian Naive Bayes classifier, and find out for future entries whether a person’s income is above or below 50K.

That is what I did:

# 
# https://www.kaggle.com/uciml/adult-census-income
# there are 32,562 total entries
# features are: "age","workclass","fnlwgt","education","education.num",
# "marital.status","occupation", "relationship","race","sex","capital.gain",
# "capital.loss","hours.per.week","native.country","income"
# label is "income"
#
# used algorithm: Gaussian Naive Bayes (GaussianNB)
# 
# accuracy ~80%
#
import os
import pandas as pd
from time import time
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

def main():
	data_set = load_data_set()
	print(data_set.info())

	# data cleaning and preprocessing
	data_set = clean_data_set(data_set)

	train_set, test_set = train_test_split(data_set, test_size=0.2, random_state=3)
	print(len(train_set), " ", len(test_set))

	train_features, train_labels = split_features_labels(train_set, "income")
	test_features, test_labels = split_features_labels(test_set, "income")

	clf = GaussianNB()

	print("Start training...")
	tStart = time()
	clf.fit(train_features, train_labels)
	print("Training time: ", round(time()-tStart, 3), "s")

	print("Accuracy: ", accuracy_score(clf.predict(test_features), test_labels))


def load_data_set():
	csv_path = os.path.join("adult.csv")
	return pd.read_csv(csv_path)

def split_features_labels(data_set, feature):
	features = data_set.drop(feature, axis=1)
	labels = data_set[feature].copy()
	return features, labels

def clean_data_set(data_set):
	for column in data_set.columns:
		if data_set[column].dtype == object:
			le = LabelEncoder()
			data_set[column] = le.fit_transform(data_set[column])

	return data_set

if __name__ == "__main__":
	main()

Unfortunately the accuracy here is only about 80%, but with feature engineering and other algorithms the accuracy can go up to about 87%.

Summary

Naive Bayes classifiers work based on Bayes’ theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. It is a very simple and fast classifier that sometimes works very well; even without much effort you can get an okay accuracy.

Thanks for reading! If you have not already read my last blog post, you can do that by clicking here. I am also active on Twitter, so if you do not want to miss any new blog posts, follow me there.