After going through some of the theory behind a perceptron, the perceptron learning algorithm (PLA), and learning curves, let us take a detour and see if this truly works in a practical sense. It is time for a field test. We are not quite prepared for a full test yet, so we will take it easy and tackle a 'toy' dataset very commonly used in statistical and ML papers: the Iris dataset from the University of California Irvine (UCI) repository.
The Iris dataset and the UCI repository
The Iris dataset is easily explained through its original readme.txt, as we will do below. But let us use this time to highlight the entire UCI dataset repository (here). The formal name is UCI Machine Learning Repository, Center for Machine Learning and Intelligent Systems. There can be no doubt about its intent!
Many ML papers would list a benchmark test against various UCI datasets to argue that a new algorithm is superior to prior art. Iris is one of many datasets traditionally used for initial testing of proposed ML algorithms (although there are also 'harder' benchmark datasets). Besides Iris, there is a Wine dataset, a Heart dataset, Isolet (sound files), and so on. Most are not considered difficult datasets, but are suitable for educational trials. Again, the datasets are here.
The Iris dataset contains petal and sepal measurements of three species of the Iris flower. To untrained eyes, all three flowers look similar (pictures at Wikipedia here). With the help of a Perceptron, we will attempt to separate the different species by giving a Perceptron a series of measurements for each Iris species. The Perceptron should then be able to separate these samples on its own eventually. This trained Perceptron can then be used to classify future Iris flower measurements.
The Iris dataset description
Title: Iris Plants Database Updated Sept 21 by C.Blake - Added discrepency information
Sources: (a) Creator: R.A. Fisher (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov) (c) Date: July, 1988
Past Usage: (intentionally skipped by me; read at the UCI source link)
Relevant Information:
- This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
- Predicted attribute: class of iris plant.
- This is an exceedingly simple domain.
- This data differs from the data presented in Fishers article (identified by Steve Chadwick, spchadwick@espeedaz.net ). The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature. The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features.
Number of Instances: 150 (50 in each of three classes)
Number of Attributes: 4 numeric, predictive attributes and the class
Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
  - Iris Setosa
  - Iris Versicolour
  - Iris Virginica
Missing Attribute Values: None
Summary Statistics: (intentionally skipped by me; read at the UCI source link)
Class Distribution: 33.3% for each of 3 classes.
Plotting the Iris Dataset
The description above shows that one of the three categories is linearly separable from the other two, while these latter two have some overlap. Let us investigate the dataset with graphs. There are four attributes (features). Since we want to plot in 2D, we have six 2-feature combinations (four features taken two at a time). We want to see which feature pairs are linearly separable so we can apply our first perceptron.
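As a quick check on the count, the feature pairs can be enumerated with Python's standard itertools module. This is only an illustrative sketch; the `features` list mirrors the attribute names from the dataset description, and the name `pairs` is my own.

```python
from itertools import combinations

features = ['sepal length (cm)', 'sepal width (cm)',
            'petal length (cm)', 'petal width (cm)']

# every unordered pair of column indices: 4 choose 2 = 6 pairs
pairs = list(combinations(range(len(features)), 2))

for p1, p2 in pairs:
    print(p1, p2, features[p1], 'vs', features[p2])
```

Each pair gives the column indices for one 2D scatter plot.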
Like most datasets, the Iris dataset is in a comma-separated values (CSV) format, and most standard data analytics software can read it (Excel can read a CSV file, but the user may need to tell Excel which delimiter to look for). These values can be extracted with our own code, but Python already has standard file reading libraries. For example, Python has the csv library (import csv in code; I used this for all prior Perceptron code so far).
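For reference, reading this kind of file with the csv library might look like the sketch below. It is not the code from the earlier posts. To stay self-contained it parses four rows copied from iris.data through io.StringIO, which stands in for an open file; the names `sample` and `rows` are my own.

```python
import csv
import io

# Four rows copied from iris.data; io.StringIO stands in for an open file.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

rows = []
for record in csv.reader(sample):
    if not record:            # skip blank lines, if any
        continue
    # the first four fields are measurements, the last is the class name
    *measurements, species = record
    rows.append(([float(v) for v in measurements], species))

print(rows[0])
```

With a local copy of the file, `open(path, newline='')` would replace the StringIO object.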
We will instead use another important Python library called pandas (to augment NumPy and matplotlib) that will make extracting information from a CSV file easy. While pandas is overkill in this instance, it is a good time to introduce it. Once we have the data in a pandas dataframe (think of a dataframe as a tab or table in Excel), we can easily plot the values using matplotlib.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# load iris into a dataframe direct from UCI link (better if local copy)
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases\
/iris/iris.data', header = None)
#df = pd.read_csv('/data/iris.csv', header = None)
print('df:',df)
# extract all rows and columns into an array
P_N = df.iloc[:, :].values
#print('P_N:',P_N)
# extract labels
y_N = np.where(P_N[:, 4] == 'Iris-setosa', -1, 1)
y_N[np.where(P_N[:, 4] == 'Iris-virginica')] = 0
print('y_N:',y_N)
# create subplot
fig1 = plt.figure(figsize = (4, 4))
ax1 = fig1.add_subplot(111)
# prep subplot
p1 = 0
p2 = 2
ax1.set_title('A Random Walk to AI - Philip Docena\n'+'UCI Iris Dataset\n' + \
'sepal length (cm) vs petal length (cm)')
ax1.set_xlabel('sepal length (cm)')
ax1.set_ylabel('petal length (cm)')
max_axis = int(max(max(P_N[:, p1]), max(P_N[:, p2]))+1)
ax1.axis([0, max_axis, 0, max_axis])
# plot points
ax1.scatter(P_N[y_N == -1][:, p1], P_N[y_N == -1][:, p2], marker = '+', \
c = 'r', label = 'setosa')
ax1.scatter(P_N[y_N == 0][:, p1], P_N[y_N == 0][:, p2], marker = 'x', \
c = 'g', label = 'virginica')
ax1.scatter(P_N[y_N == 1][:, p1], P_N[y_N == 1][:, p2], marker = '1', \
c = 'b', label = 'versicolor')
# add legends
ax1.legend(loc = 3, fontsize = 'x-small')
# show plot
plt.show()
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# load iris into a dataframe direct from UCI link (better if local copy)
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases\
/iris/iris.data', header = None)
#df = pd.read_csv('/data/iris.csv', header = None)
print('df.head:',df.head())
print('df.tail:',df.tail())
# extract all rows and columns into an array
P_N = df.iloc[:, :].values
#print('P_N:',P_N)
# extract labels
y_N = np.where(P_N[:, 4] == 'Iris-setosa', -1, 1)
y_N[np.where(P_N[:, 4] == 'Iris-virginica')] = 0
#print('y_N:',y_N)
fig1=plt.figure(figsize=(12,8))
# prep helper vars
combo=[[0,1],[0,2],[0,3],[1,2],[1,3],[2,3]]
subplots=[231,232,233,234,235,236]
features=['sepal length (cm)','sepal width (cm)','petal length (cm)',\
'petal width (cm)']
# plot each subplot
for row, sub_plot in zip(combo, subplots):
    # create subplot
    ax1=fig1.add_subplot(sub_plot)
    # prep subplots
    p1,p2=row
    title=features[p1] + ' vs '+features[p2]
    if p1==0 and p2==2:
        ax1.set_title('A Random Walk to AI - Philip Docena\n'+'UCI Iris Dataset')
    ax1.set_xlabel(features[p1])
    ax1.set_ylabel(features[p2])
    for item in ([ax1.title, ax1.xaxis.label, ax1.yaxis.label] +
                 ax1.get_xticklabels() + ax1.get_yticklabels()):
        item.set_fontsize(10)
    max_axis=int(max(max(P_N[:,p1]),max(P_N[:,p2]))+2)
    ax1.axis([0,max_axis,0,max_axis])
    # plot points
    ax1.scatter(P_N[y_N==-1][:,p1],P_N[y_N==-1][:,p2],s=20,marker='+',\
        c='r',label='setosa')
    ax1.scatter(P_N[y_N==0][:,p1],P_N[y_N==0][:,p2],s=20,marker='x',\
        c='g',label='virginica')
    ax1.scatter(P_N[y_N==1][:,p1],P_N[y_N==1][:,p2],s=20,marker='1',\
        c='b',label='versicolor')
    ax1.legend(loc=3, fontsize = 'x-small')
# show plot
plt.show()
Round 1: Perceptron vs Iris
We finally reach our first field test. A Perceptron can only split linearly separable data, and the dataset description states that Virginica and Versicolor are not linearly separable from each other, so a single Perceptron cannot cleanly separate these two species using petal and sepal dimensions alone. Thus, we are left with separating Setosa from these two species. We can easily handle this revised classification by labeling Setosa datapoints as one type (e.g., Type +1), and both Virginica and Versicolor as another type (e.g., Type -1).
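The relabeling can be sketched with NumPy as follows. The `species` array is a small made-up selection of class names for illustration, not the full label column. The +1/-1 convention follows the paragraph above, which is the opposite sign from the labels used in the plotting code.

```python
import numpy as np

# a handful of class names, standing in for the fifth column of the data
species = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica',
                    'Iris-setosa', 'Iris-virginica'])

# Setosa -> +1, both other species -> -1
y = np.where(species == 'Iris-setosa', 1, -1)
print(y)
```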
Five of the six scatter plots above can be clearly separated, but the first one (sepal width vs sepal length) is not quite clear. There is a little red '+' separated from the rest that might prevent a linear separation. Let us run the PLA with the GIF capture stopping at ~50 iterations and a hard stop at ~100 iterations. Since we are unsure how the PLA will behave, let us be conservative and use a small learning rate of 0.05.
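The PLA code behind these runs is not reproduced in this section, so the following is only a minimal sketch of the idea under my own assumptions: weights start at zero, one misclassified point (the first one found) is corrected per iteration, and the loop gives up after `max_iters` updates. The function name `train_pla` and its return values are hypothetical. The demo uses (sepal length, petal length) from a few rows of iris.data, not the full dataset.

```python
import numpy as np

def train_pla(X, y, lr=0.05, max_iters=100):
    """Perceptron learning algorithm with a hard stop.

    X: (n, d) features, y: (n,) labels in {-1, +1}.
    Returns (w, updates, converged); w[0] is the bias weight.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend bias column
    w = np.zeros(Xb.shape[1])
    for updates in range(max_iters):
        # points on the wrong side of the line (or exactly on it)
        wrong = np.flatnonzero(y * (Xb @ w) <= 0)
        if wrong.size == 0:
            return w, updates, True
        i = wrong[0]                                 # first misclassified point
        w += lr * y[i] * Xb[i]                       # PLA update rule
    converged = bool(np.all(y * (Xb @ w) > 0))
    return w, max_iters, converged

# (sepal length, petal length) for a few rows of iris.data
X = np.array([[5.1, 1.4], [4.9, 1.4], [4.7, 1.3], [4.6, 1.5], [5.0, 1.4],
              [7.0, 4.7], [6.4, 4.5], [6.9, 4.9],
              [6.3, 6.0], [5.8, 5.1], [7.1, 5.9]])
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1])   # Setosa = +1

w, updates, converged = train_pla(X, y)
print('weights:', w, 'updates:', updates, 'converged:', converged)
```

Picking the first misclassified point keeps the sketch deterministic; choosing one at random is an equally common variant.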
Round 1: Results
Below are the (successful) runs of the PLA. The first graph was indeed difficult, taking close to 100 iterations to converge. There was really no doubt of finding a separating line, given our previous PLA runs on simulated random data, but it is always good validation to see a theoretical model pass its first test against real-world data. This is, however, a 'toy' dataset that is fairly easy to classify, so these good results are predictable.
Closing thoughts
Before we leave this section, let us review how we trained the perceptron. Near the end of Part 1, we asked what would happen to the PLA if the scale were not -1 to +1 on each axis, given the perceptron notation was based on those ranges.
Without paying attention to the feature (petal and sepal) values when we ran the PLA against the Iris data, we see above that the PLA can indeed handle such cases. In fact, these tests show that a PLA can handle a [0,n] range (not just [-1,+1] or [0,+1]) and axes with different magnitudes (sepal dimensions tend to be larger than the petal measurements).
Based on the PLA update rule, we do expect some odd behavior. For example, given that the datasets were concentrated (i.e., comparable to a very low number of training data points from a uniform distribution), it seemed to take longer to find a solution. The movement of the guessed line was also erratic, even if the learning rate was a very small number (that otherwise crawled in our prior tests). We will explore these in a future post/section, in comparison to how the PLA behaves under the original [-1,+1] range.
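One way to set up that comparison is to rescale each feature linearly onto [-1,+1] before running the PLA. The runs above did not do this. The sketch below is my own assumption about how it could be done, with a hypothetical helper name, and it assumes no feature column is constant.

```python
import numpy as np

def rescale_to_unit_box(X):
    """Linearly map each column of X onto [-1, +1].

    Assumes every column has at least two distinct values.
    """
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return 2.0 * (X - lo) / (hi - lo) - 1.0

# (sepal length, petal length) for a few rows of iris.data
X = np.array([[5.1, 1.4], [4.6, 1.5], [7.0, 4.7], [7.1, 5.9]])
X_scaled = rescale_to_unit_box(X)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```

Because the mapping is linear per feature, it does not change whether the data are linearly separable.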
Further, since we forced two flower species to be of one type (Type -1, non-Setosa), we also accidentally created an unbalanced dataset (50 vs 100). The PLA handled this very easily. This is also to be expected given our previous PLA runs where a few of the red lines were near the edges, leaving the PLA with a few samples of one type relative to the other type.