https://github.com/gtraines/perceptron_classification

*Background*

One of the fundamental concepts in artificial intelligence and machine learning is the perceptron learning algorithm, which gives life to the abstract data structure known as the perceptron. The perceptron is a data structure created to resemble the functioning of a neuron in the brain. It has a set of inputs (variable values), each of which has an excitatory (positive) or inhibitory (negative) weight associated with it. During the training phase, the perceptron receives a set of values corresponding to its inputs along with an expected target outcome. If the sum of the weights multiplied by their corresponding input values is greater than a threshold value, the perceptron emits a positive response; if the sum is lower than the threshold, it emits a negative response.
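The weighted-sum-and-threshold behaviour described above can be sketched in a few lines of NumPy (the function and variable names here are illustrative, not taken from the assignment code):

```python
import numpy as np

def perceptron_response(inputs, weights, threshold):
    """Emit +1 (excitatory) if the weighted sum of the inputs exceeds
    the threshold, otherwise emit -1 (inhibitory)."""
    weighted_sum = np.dot(weights, inputs)
    return 1 if weighted_sum > threshold else -1

# Two inputs: one excitatory (+0.8) and one inhibitory (-0.5) weight.
response = perceptron_response(np.array([1.0, 1.0]),
                               np.array([0.8, -0.5]),
                               threshold=0.0)
```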

By supplying the expected response and recalibrating the weights if necessary, the perceptron can “learn” appropriate responses given a set of inputs. Because the learning algorithm is provided with an expected (“correct”) response, the perceptron learning algorithm belongs to the class of “supervised” machine learning algorithms. It is considered a special case of logistic regression classification algorithms [Barber, 378].

The perceptron learning algorithm (PLA) is a classification algorithm that can be successfully applied to any linearly separable data: that is, for a D-dimensional data set, a (D-1)-dimensional hyperplane exists that separates the classes [Mitchell, 86]. The PLA has the property that if such a hyperplane exists, the algorithm will ultimately converge on a solution [Lin, et al., 8]. In cases where the data is not linearly separable, methods exist for approximating a solution (the delta rule and gradient descent) or moving past linear separation altogether (the sigmoid unit) [Mitchell, 96].

For this assignment we were provided with two sets of 2-dimensional, linearly separable data. Given that the decision hyperplane is of dimension D-1, we can generate a 1-dimensional hyperplane (a line) to classify the data. The target function is a vector of length D+1 whose dot product with the augmented input vector (whose first term is the constant 1; see below) produces a positive or negative response, corresponding to the two classifications generated by the decision plane.

The extra term in the weights encodes the threshold value: it is termed w[0], and its input (called x[0]) is always 1. With the threshold folded into the weight vector this way, the comparison against the threshold becomes a comparison of the full dot product against zero: if the dot product of the weights and inputs is greater than zero, the perceptron emits the excitatory response, and if it is smaller, the inhibitory response. This results in the equation:

output(X) = sign((W).dot(X))

where X = the vector of inputs prepended with x[0] (1), and W = the vector of weights.
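Folding the threshold into the weight vector as w[0], with the constant input x[0] = 1, can be illustrated like this (a sketch; the names are mine, not the assignment's):

```python
import numpy as np

def output(weights, raw_inputs):
    """output(X) = sign(W . X), where X is the raw input vector
    prepended with the constant x[0] = 1 and w[0] plays the role
    of the threshold term."""
    x = np.insert(raw_inputs, 0, 1.0)   # x[0] = 1
    return int(np.sign(np.dot(weights, x)))

w = np.array([-0.5, 1.0, 1.0])          # w[0] acts as the threshold term
print(output(w, np.array([0.2, 0.1])))  # dot product = -0.2 -> -1
print(output(w, np.array([0.4, 0.4])))  # dot product = 0.3 -> 1
```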

*Coding*

My initial approach was to look over the DataScienceLab example to get a feel for how the perceptron was implemented. However, I wanted to implement the program in a way that [a] separated the concerns of an individual perceptron from external processes such as reading in training data or creating visual representations like graphs, and [b] used variable names and algorithms that more closely resembled the mathematical representations provided by Lin, et al., and Tom Mitchell’s texts.

This approach resulted in three possibly trivial departures from the DataScienceLab example. The first is that the learning algorithm takes the “learning rate” as an argument, to control the size of the updates to the hypothetical weights. The second is that rather than simply updating the hypothesis with the product of a misclassified input vector and its target value, I use Mitchell’s [88] update term of ((target output) – (hypothesis output)), or (t – o) in Mitchell’s notation. Thus, the update “delta” becomes:

(learning rate)*(t – o)*x[i]

The third change was an attempt to optimize the speed of the learning algorithm by returning (t – o) from the first data point encountered in each iteration that disagrees with the hypothesis output rather than creating a list of all disagreeing points and then selecting one at random.
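The loop described above, with the learning-rate argument, the (t – o) delta, and the early return on the first disagreeing point, can be sketched as follows (a minimal reconstruction under those three assumptions, not the assignment code itself):

```python
import numpy as np

def train_perceptron(points, targets, learning_rate=0.1, max_iterations=10000):
    """points: (n, d) array of raw inputs; targets: (n,) array of +/-1.
    Returns the (d+1)-length weight vector, with w[0] the threshold term."""
    X = np.hstack([np.ones((len(points), 1)), points])  # prepend x[0] = 1
    w = np.zeros(X.shape[1])
    for _ in range(max_iterations):
        for x, t in zip(X, targets):
            o = 1 if np.dot(w, x) > 0 else -1
            if o != t:
                # Update on the first disagreeing point, then rescan.
                w += learning_rate * (t - o) * x
                break
        else:
            return w  # no misclassified points remain: converged
    return w
```

On linearly separable data this converges to a weight vector that classifies every training point correctly, per the PLA convergence property cited above.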

I ended up creating two versions of my perceptron program because the resulting weights from the first version seemed incorrect and my graph of the decision plane was very obviously wrong. After creating the second version, in which the data structures more closely resembled those used in the DataScienceLab blog (and which removed the very slow recombination of data points from two separate arrays on each hypothesis comparison), I realized that my attempt at placing the weights into slope-intercept form was what was wrong. Of course, the solution became obvious in the DataScienceLab example once I knew what I was looking for.

Instead of making the y-value a function of x[1] *and* x[2] multiplied by their weights (which correctly rendered the decision made by the perceptron at those values, but produced a sort of upside-down form of the decision plane), I needed to solve for y (x[2]) by setting the dot product of w[D] and x[D] to zero and applying some elementary algebra. This gives the equation:

w[0] + w[1]x[1] + w[2]x[2] = 0

Solving for x[2] (representing our y value) and placing it into slope-intercept form:

y = (-w[1]/w[2] * x) + (-w[0]/w[2])

At this point I felt quite foolish, as this was clearly in the DataScienceLab example in very slightly different notation (please don’t tell my 8th-grade algebra teacher). Using the same NumPy linspace function as the DataScienceLab example, I was able to create the range of x values and graph the resulting y values, rendering the decision plane.
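The slope-intercept computation over a linspace range can be sketched as follows (the function name is mine; the plotting call itself is omitted):

```python
import numpy as np

def decision_line(w, x_min, x_max, num=100):
    """Solve w[0] + w[1]*x1 + w[2]*x2 = 0 for x2, returning the
    (x, y) points of the decision line over [x_min, x_max]."""
    xs = np.linspace(x_min, x_max, num)
    slope = -w[1] / w[2]
    intercept = -w[0] / w[2]
    ys = slope * xs + intercept
    return xs, ys

# e.g. with the IT1652 weights reported below:
xs, ys = decision_line(np.array([4.75, 12.2351, -25.32695]), 0.0, 1.0)
```

Every (x, y) pair returned satisfies w[0] + w[1]x + w[2]y = 0, so the line is exactly the boundary between the positive and negative responses of the perceptron.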

*Figures Produced by Training Data*

My implementation of the perceptron learning algorithm seems to be somewhat less efficient than that used to generate the original decision plane. In the case of the dataset requiring 1652 iterations to train, I required 3431; in the case of the dataset requiring 2683, I required 205884. Ultimately I arrived at the following weight vectors:

IT1652:

w[0] = 4.75 w[1] = 12.2351 w[2] = -25.32695

IT2683:

w[0] = -94.75 w[1] = -67.2096 w[2] = 115.8705

*Performance and Possible Optimizations*

Generating the correct hypothesis became significantly more expensive in time as the number of iterations increased. The algorithm requires at least an O(n(d + 1))-complexity check of the entire training data space at each iteration to determine the percentage of error (where d is the dimensionality of the data) and the search for a misclassified point at each iteration could be as expensive as O(n(d + 1)) in the worst case. This seems like an aspect of the algorithm that could be investigated for possible optimizations where speed is more important than precise separation.
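One such optimization, where the implementation language allows it, is to vectorize the full-data error check so that the O(n(d + 1)) scan runs as a single matrix-vector product rather than n separate dot products in a Python loop. A sketch (the names here are illustrative):

```python
import numpy as np

def misclassified_mask(w, X, targets):
    """Vectorized check of the whole training set. X is the (n, d+1)
    augmented input matrix; returns a boolean mask marking the
    misclassified points via one matrix-vector product."""
    outputs = np.where(X @ w > 0, 1, -1)
    return outputs != targets

X = np.array([[1.0, 2.0, 1.0], [1.0, -1.0, -1.0]])
w = np.array([0.2, 0.4, 0.2])
mask = misclassified_mask(w, X, np.array([1, -1]))
error_rate = mask.mean()  # fraction of misclassified points
```

The asymptotic complexity is unchanged, but moving the inner loop into NumPy typically reduces the constant factor substantially, which matters most on runs like the 205884-iteration case above.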

References:

Barber, D. *Bayesian Reasoning and Machine Learning.* Cambridge University Press, 2012, 376–378.

DataScienceLab. “Machine learning classics: the perceptron.” Retrieved from https://datasciencelab.wordpress.com/2014/01/10/machine-learning-classics-the-perceptron/ on 21 February 2015.

Mitchell, T. *Machine Learning.* McGraw-Hill, San Francisco, 1997, 81–96.

Lin, H., Magdon-Ismail, M., and Abu-Mostafa, Y. *Learning from Data: A Short Course.* AMLbook.com, 2012, 5–9.