Data Visualization with Python

This blog post is part of a series in which I talk about concepts and algorithms in Machine Learning. In this post I do not want to talk about any particular concept or algorithm, but about data visualization with Python. Data visualization is very important, especially for Machine Learning, because you want to explore your data and gain knowledge about the data set you work with so that you can select good features. I won't explain the following plots and charts in depth, because most of them are very similar; they should simply give you some ideas of how you can plot your own data. You will find all examples of the following plots and charts on GitHub.

First Plot

I will use Python and matplotlib to show how easy it can be to visualize data and see relationships between features and targets. First let's plot simple functions like x and x²: the x values go on the x axis, and the result of the function goes on the y axis.

import matplotlib.pyplot as plt

# x values and the corresponding function values for x^2 and x
x = [1, 2, 3, 4, 5, 6, 7]
y = [1, 4, 9, 16, 25, 36, 49]
y1 = [1, 2, 3, 4, 5, 6, 7]

plt.plot(x, y, label='x^2')
plt.plot(x, y1, label='x')
plt.xlabel('x axis')
plt.ylabel('y axis')

plt.title('function plots')
plt.legend()
plt.show()

Running this code produces the following plot:

Bar charts

Next up are bar charts. Bar charts are extremely useful if you want to explore the frequency of certain events, e.g. how many emails you get at each hour of the day.

import matplotlib.pyplot as plt

incoming_email_at_hour = [8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 11, 11, 12, 13, 13, 13, 13, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 17, 17, 18]

# bin edges for the hours of the working day (6:00 to 18:00)
hours = [x for x in range(6, 19)]

# plt.hist counts the emails per hour and draws the counts as bars
plt.hist(incoming_email_at_hour, hours, histtype='bar', rwidth=0.8)

plt.xlabel('t in hour')
plt.ylabel('incoming emails')

plt.title('Incoming emails per hour')
plt.show()

Scatter plots

import matplotlib.pyplot as plt

x = [1.1, 1.7, 1.3, 1.4, 1.2, 2.7, 2.3, 2.2]
y = [4, 6, 10, 3, 2, 10, 11, 5]

x0 = [3.2, 3.3, 3.7, 4, 4.2, 4.3]
y0 = [5, 7, 3, 9, 3, 12]

plt.scatter(x, y, label='Red', color='r')
plt.scatter(x0, y0, label='Blue', color='b')

plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter Plot')
plt.legend()
plt.show()

Most of the time I use scatter plots to visualize how certain features are connected to the target label. Let's assume we have binary labeled data like this. Then we can see that we could easily work with support vector machines here, because we could separate the two groups with a line. If there is no clear division between the two regions, or if there are multiple regions, then maybe decision trees are a better solution.
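
To make that more concrete, here is a minimal sketch (assuming scikit-learn is installed; the class labels 0 and 1 are made up for this illustration) that fits a linear support vector machine to the two point clouds from the scatter plot above and checks whether a straight line separates them:

import numpy as np
from sklearn.svm import SVC

# the two point clouds from the scatter plot above;
# labels 0 (red) and 1 (blue) are made up for this illustration
X = np.array(list(zip([1.1, 1.7, 1.3, 1.4, 1.2, 2.7, 2.3, 2.2],
                      [4, 6, 10, 3, 2, 10, 11, 5])) +
             list(zip([3.2, 3.3, 3.7, 4, 4.2, 4.3],
                      [5, 7, 3, 9, 3, 12])))
y = np.array([0] * 8 + [1] * 6)

clf = SVC(kernel='linear')
clf.fit(X, y)

# if a straight line separates the two groups, the training accuracy is 1.0
print("Training accuracy:", clf.score(X, y))

If the score stays well below 1.0 even on the training data, the two classes are probably not linearly separable and a tree-based model might be worth a try.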

Pie Charts

Pie charts are, like bar charts, great for representing surveys or showing relative sizes. In the following example a pie chart represents how people get to work.

import matplotlib.pyplot as plt

transport_methods = ['Car', 'Train', 'Bus', 'Walk']

transport_count = [23, 25, 10, 2]  # e.g. total count of a random survey

col = ['r', 'b', 'y', 'g']

plt.pie(transport_count, labels=transport_methods, colors=col)

plt.title('Pie Chart')

plt.show()

Mosaic plots

Mosaic plots, also called Marimekko charts, are a great way to visualize relationships between two or more categorical features. So instead of plotting a multi-dimensional bar chart, you can use a mosaic plot to explore the relation between multiple categorical features. Unfortunately I did not find a way to create mosaic plots in matplotlib, but I was able to make an example with statsmodels.

import pandas as pd
from statsmodels.graphics.mosaicplot import mosaic
import pylab
from itertools import product
import numpy as np
rand = np.random.random

# all combinations of gender and the answer to "speaks multiple foreign languages?"
speaks_mul_foreign_languages = list(product(['male', 'female'], ['yes', 'no']))
index = pd.MultiIndex.from_tuples(speaks_mul_foreign_languages, names=['gender', 'speaks multiple languages'])

# random proportions as placeholder survey data
data = pd.Series(rand(4), index=index)

mosaic(data, gap=0.01, title='Who knows multiple foreign languages? - Mosaic Chart')
pylab.show()

Summary

There are dozens of ways to represent and visualize data. In this blog post I tried to give you some inspiration and ideas of how you can do it on your own. I have put all examples in a GitHub repository, and I will add more examples when I find good ones and have the time to do it.

Thanks for reading! If you have not already read my last blog post, you can do that by clicking here. I am also active on Twitter, so if you do not want to miss any new blog posts, follow me there.

 

Gaussian Naive Bayes

I am currently learning some machine learning, and along the way I am exploring some interesting algorithms that I want to share here. This time I want to talk about the Gaussian Naive Bayes algorithm, a simple classification algorithm based on Bayes' theorem.

Bayes’ theorem

Bayes' theorem is named after Thomas Bayes (1701-1761), who first introduced it; it was later developed further by Pierre-Simon Laplace, who published the modern form of the equation in 1812. In general, Bayes' theorem describes the probability of an event based on prior knowledge of conditions that are related to the event. So it basically fits perfectly for machine learning, because that is exactly what machine learning does: making predictions for the future based on prior experience. Mathematically you can write Bayes' theorem as follows:

P(A|B) = P(B|A) * P(A) / P(B)

The Bayes’ theorem

Let’s break the equation down:

  • A and B are events.
  • P(A) and P(B) (with P(B) not 0) are the probabilities of observing A and B independently of each other.
  • P(A|B) is the probability of observing event A given that B is true.
  • Equivalently, P(B|A) is the probability of observing event B given that A is true.

For me, as a non-expert in probability theory, P(A|B) and P(B|A) seemed a little confusing at first. These probabilities are called conditional probabilities, and they describe the probability of A under the condition of B. Let's look at an example: say that A stands for the event that, if you look outside your house, "you will see at least one person". Assume that B stands for "it is raining". Then P(A|B) is the probability that you will see at least one person outside your house while it is raining. This of course also works for negations: P(A|not B), where not B stands for "it is not raining", describes the probability that you will see at least one person outside your house while it is not raining.
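
Just to make the notation concrete, here is a tiny sketch with completely made-up joint probabilities for the rain example that computes P(A|B) and P(A|not B):

# made-up joint probabilities for the rain example
# A = "you see at least one person", B = "it is raining"
p = {
    ('person', 'rain'): 0.10,
    ('person', 'no rain'): 0.60,
    ('no person', 'rain'): 0.15,
    ('no person', 'no rain'): 0.15,
}

p_rain = p[('person', 'rain')] + p[('no person', 'rain')]
p_no_rain = 1 - p_rain

# conditional probability: P(A|B) = P(A and B) / P(B)
print("P(person | rain)    =", p[('person', 'rain')] / p_rain)        # 0.4
print("P(person | no rain) =", p[('person', 'no rain')] / p_no_rain)  # 0.8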

Okay, back to Bayes' theorem. Let's take an example and see how it works. I try to show many examples, because I often find it difficult to understand certain topics, especially math topics, when there are no concrete examples.

Example

  • Let's say that A is the event that a person is ill, and not A the event that a person is not ill.
  • There exists a test; B stands for a positive test result, not B for a negative test result.
  • The probability of A is P(A) = 0.01 (1%), and P(not A) = 0.99 (99%).
  • The probability that the test is positive if the person is ill is P(B|A) = 0.99 (99%), and likewise the probability that the test is negative if the person is not ill is P(not B|not A) = 0.99 (99%). In other words, the test is 99% sensitive (positive test result when the person is ill) and 99% specific (negative test result when the person is not ill).

Tree diagram of the probabilities

Let's calculate the probability that a random person with a positive test result is ill:

P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|not A) * P(not A)) = (0.99 * 0.01) / (0.99 * 0.01 + 0.01 * 0.99) = 0.5

So with a probability of 50%, a random person with a positive test result is ill. In other words: a random person with a positive test result is not ill in 50% of the cases.

Why is that? It is simple: because there are many more people who are not ill. Let's assume that we have 100 people: 99 people are not ill and 1 person is ill. 99 (not ill people) * 0.01 (probability of a wrong test result) = 0.99, so we expect 0.99 false positives. 1 ill person * 0.99 (probability of a correct test result) = 0.99 expected true positives. That leads us to a probability of 0.99 / (0.99 + 0.99) = 0.5, i.e. 50%, that a random person with a positive test result is actually ill.

So with Bayes' theorem you can calculate the probability of an event from the prior probabilities and the conditions quite easily.
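
The calculation from the example can also be written down directly in Python; this small sketch just plugs the numbers from above into Bayes' theorem:

# prior probabilities from the example above
p_ill = 0.01          # P(A)
p_not_ill = 0.99      # P(not A)

# test characteristics
p_pos_given_ill = 0.99       # P(B | A), sensitivity
p_pos_given_not_ill = 0.01   # P(B | not A) = 1 - specificity

# total probability of a positive test result: P(B)
p_pos = p_pos_given_ill * p_ill + p_pos_given_not_ill * p_not_ill

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_ill_given_pos = p_pos_given_ill * p_ill / p_pos
print("P(ill | positive test) =", p_ill_given_pos)  # 0.5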

Gaussian Naive Bayes

Gaussian Naive Bayes is one naive Bayes classifier variant. Besides Gaussian Naive Bayes there are also Multinomial Naive Bayes and Bernoulli Naive Bayes. I picked Gaussian Naive Bayes because it is the simplest and the most popular one.

Iris data set

One of the most popular data sets in machine learning is definitely the iris data set. The iris data set is about flowers that have four features: sepal length, sepal width, petal length and petal width. Each flower is labeled as one of three species: setosa, versicolor and virginica. The data set has a total of 150 entries, so it is very small.
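
If you want to take a quick look at the data before training, scikit-learn ships the data set with load_iris; a minimal peek could look like this:

from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)        # (150, 4) -> 150 entries with 4 features each
print(iris.feature_names)     # sepal/petal length and width in cm
print(iris.target_names)      # ['setosa' 'versicolor' 'virginica']
print(iris.data[0], iris.target[0])  # first entry and its label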

Now let's see how it performs with the Gaussian Naive Bayes classifier.

 

#
# iris dataset
# 150 total entries
# features are: sepal length in cm, sepal width in cm, petal length in cm, petal width in cm
# labels names: setosa, versicolor, virginica
#
# used algorithm: Gaussian Naive Bayes (GaussianNB)
#
# accuracy ~100%
#
from time import time
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def main():
	data_set = load_iris()

	features, labels = split_features_labels(data_set)

	train_features, train_labels, test_features, test_labels = split_train_test(features, labels, 0.18)

	print(len(train_features), " ", len(test_features))

	clf = GaussianNB()

	print("Start training...")
	tStart = time()
	clf.fit(train_features, train_labels)
	print("Training time: ", round(time()-tStart, 3), "s")

	print("Accuracy: ", accuracy_score(clf.predict(test_features), test_labels))



def split_train_test(features, labels, test_size):
	total_test_size = int(len(features) * test_size)
	np.random.seed(2)
	indices = np.random.permutation(len(features))
	train_features = features[indices[:-total_test_size]]
	train_labels = labels[indices[:-total_test_size]]
	test_features  = features[indices[-total_test_size:]]
	test_labels  = labels[indices[-total_test_size:]]
	return train_features, train_labels, test_features, test_labels

def split_features_labels(data_set):
	features = data_set.data
	labels = data_set.target
	return features, labels


if __name__ == "__main__":
	main()

The Jupyter Notebook and the Python file are also available on GitHub. As we can see, the classifier is very fast (though this is not a big data set) and the accuracy is perfect. Of course, with other permutations and other ratios between training and test set, the results will differ.
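
One way to see how much the result depends on the particular split is cross-validation. This small sketch (not part of the script above) evaluates GaussianNB on five different splits of the iris data:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# 5-fold cross-validation: train and evaluate on 5 different splits
scores = cross_val_score(GaussianNB(), iris.data, iris.target, cv=5)
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())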

Adult census income

Next let's look at another example. I found this data set, which contains entries about people; the features include age, workclass, education, occupation, race and many more. Each entry is labeled with an income class: it is either above 50K or at most 50K. So the task is to train a classifier, e.g. a Gaussian Naive Bayes classifier, and predict for future entries whether a person's income is above or below 50K.

Here is what I did:

# 
# https://www.kaggle.com/uciml/adult-census-income
# there are 32,562 total entries
# features are: "age","workclass","fnlwgt","education","education.num",
# "marital.status","occupation", "relationship","race","sex","capital.gain",
# "capital.loss","hours.per.week","native.country","income"
# label is "income"
#
# used algorithm: Gaussian Naive Bayes (GaussianNB)
# 
# accuracy ~80%
#
import os
import pandas as pd
from time import time
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

def main():
	data_set = load_data_set()
	print(data_set.info())

	# data cleaning and preprocessing
	data_set = clean_data_set(data_set)

	train_set, test_set = train_test_split(data_set, test_size=0.2, random_state=3)
	print(len(train_set), " ", len(test_set))

	train_features, train_labels = split_features_labels(train_set, "income")
	test_features, test_labels = split_features_labels(test_set, "income")

	clf = GaussianNB()

	print("Start training...")
	tStart = time()
	clf.fit(train_features, train_labels)
	print("Training time: ", round(time()-tStart, 3), "s")

	print("Accuracy: ", accuracy_score(clf.predict(test_features), test_labels))


def load_data_set():
	csv_path = os.path.join("adult.csv")
	return pd.read_csv(csv_path)

def split_features_labels(data_set, feature):
	features = data_set.drop(feature, axis=1)
	labels = data_set[feature].copy()
	return features, labels

def clean_data_set(data_set):
	# label-encode all categorical (object dtype) columns as integer codes
	for column in data_set.columns:
		if data_set[column].dtype == object:
			le = LabelEncoder()
			data_set[column] = le.fit_transform(data_set[column])

	return data_set

if __name__ == "__main__":
	main()

The accuracy here is unfortunately only about 80%, but with feature engineering and other algorithms the accuracy can be pushed up to around 87%.
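
As one possible (and here untested) direction for that feature engineering, you could replace the label encoding of the categorical columns with one-hot encoding, e.g. via pandas.get_dummies, so the classifier does not treat the encoded categories as ordered numbers. The helper below is only a sketch, not the code used above:

import pandas as pd

def one_hot_features_labels(data_set, label="income"):
	# keep the label column as it is and one-hot encode all categorical feature columns
	labels = data_set[label].copy()
	features = pd.get_dummies(data_set.drop(label, axis=1))
	return features, labels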

Summary

Naive Bayes classifiers work based on Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions that are related to the event. It is a very simple and fast classifier and sometimes works very well; even without much effort you can get a decent accuracy.

Thanks for reading! If you have not already read my last blog post, you can do that by clicking here. I am also active on Twitter, so if you do not want to miss any new blog posts, follow me there.

Why Python is probably the programming language for dealing with Big Data

Python is an amazing programming language and is very popular these days, so you have probably already heard about Python, tried it out, or even use it as your main programming language at work. In this blog post I want to talk about why Python is the language (or at least one great language) when it comes to working with Big Data.

Continue reading →