*Background*

One of the fundamental concepts in artificial intelligence and machine learning is the perceptron learning algorithm, which gives life to the abstract data structure known as the perceptron. The perceptron is a data structure created to resemble the functioning of a neuron in the brain. The perceptron has a set of inputs (variable values), each of which has an excitatory (positive) or inhibitory (negative) weight associated with it. During the training phase, the perceptron receives a set of values corresponding to its inputs along with an expected target outcome. If the sum of the weights multiplied by their corresponding input values is greater than a threshold value, the perceptron will emit a positive response; if the sum is lower than the threshold value, the perceptron will emit a negative response.

By supplying the expected response and recalibrating the weights if necessary, the perceptron can “learn” appropriate responses given a set of inputs. Because the learning algorithm is provided with an expected (“correct”) response, the perceptron learning algorithm belongs to the class of “supervised” training machine learning algorithms. It is considered a special case of logistic regression classification algorithms [Barber, 378].

The perceptron learning algorithm (PLA) is a classification algorithm that can be successfully applied to any linearly separable data, meaning that for a D-dimensional data set, a (D-1)-dimensional hyperplane exists that separates the classes [Mitchell, 86]. The PLA has the property that if such a hyperplane exists, the algorithm will ultimately converge on a solution [Lin, et al., 8]. In cases where the data is not linearly separable, methods exist for approximating a solution (the delta rule and gradient descent) or moving past linear separation altogether (the sigmoid unit) [Mitchell, 96].

For this assignment we were provided with two sets of 2-dimensional, linearly separable data. Given that the decision hyperplane is of dimension D-1, we can generate a 1-dimensional hyperplane (a line) to classify the data. The target function is a vector of length D+1 whose dot product with the set of D inputs (the first term is considered to be a constant 1; see below) will produce a positive or negative response, corresponding to the two classifications generated by the decision plane.

The extra term in the weights is the threshold value; if the dot product of the weights and inputs is greater than the threshold value, the perceptron emits an excitatory response and if it is smaller than the threshold value, it emits the inhibitory response. This threshold value is termed w[0] and its input is always 1 (also called x[0]). This results in the equation:

output(X) = sign((W).dot(X))

where X = vector of inputs and x[0] (1) and W = vector of weights.
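
This output rule can be sketched as follows (hypothetical helper name, assuming NumPy):

```python
import numpy as np

def perceptron_output(w, x):
    """Emit +1 (excitatory) or -1 (inhibitory) based on the sign of w . x.

    w and x are vectors of length D+1; x[0] is the constant bias input 1,
    so w[0] plays the role of the (negated) threshold.
    """
    return 1 if np.dot(w, x) > 0 else -1
```

With w = [-0.5, 1, 0], the input [1, 0.7, 0.3] gives a dot product of 0.2 and a positive response, while [1, 0.2, 0.9] gives -0.3 and a negative response.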

*Coding*

My initial approach was to look over the DataScienceLab example to get a feel for how the perceptron was implemented but I wanted to try implementing the program in a way that [a] separated the concerns of an individual perceptron from external processes like reading in training data or creating visual representations such as graphs and [b] used variable names and algorithms that more closely resembled the mathematical representations provided by Lin, et al., and Tom Mitchell’s texts.

This approach resulted in three possibly trivial departures from the DataScienceLab example. The first is that the learning algorithm takes as an argument the “learning rate” to control the size of the updates to the hypothetical weights. The second difference is that rather than simply updating the hypothesis with the product of the input vector and the target value for the data points which disagree with the hypothetical outcome, I use Mitchell’s [88] update value, which is ((target output) – (hypothesis output)), or (t – o) in Mitchell’s notation. Thus, the update “delta” becomes:

(learning rate)*(t – o)*x[i]

The third change was an attempt to optimize the speed of the learning algorithm by returning (t – o) from the first data point encountered in each iteration that disagrees with the hypothesis output rather than creating a list of all disagreeing points and then selecting one at random.
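
Putting the three departures together, the training loop looks roughly like this (hypothetical function name; a sketch of the approach described, not the exact program):

```python
import numpy as np

def train_perceptron(data, targets, learning_rate=0.1, max_iters=100000):
    """Perceptron learning with Mitchell-style (t - o) updates.

    data: (n, d) array of raw inputs; a constant x[0] = 1 bias column is
    prepended internally. targets: n values in {-1, +1}.
    Returns (weights, number of iterations used).
    """
    X = np.hstack([np.ones((data.shape[0], 1)), data])  # prepend x[0] = 1
    w = np.zeros(X.shape[1])
    for it in range(max_iters):
        for x, t in zip(X, targets):
            o = 1 if np.dot(w, x) > 0 else -1
            if o != t:
                # update on the first disagreeing point encountered,
                # rather than collecting all of them and sampling one
                w = w + learning_rate * (t - o) * x
                break
        else:
            return w, it  # no disagreements left: converged
    return w, max_iters
```

On linearly separable data the loop terminates with every point on the correct side of the decision plane.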

I ended up creating two versions of my perceptron program because the resulting weights from the first version of the program seemed incorrect and my graph of the decision plane was very obviously wrong. After creating the second version where the data structures more closely resembled those used in the DataScienceLab blog (and removing the very slow recombination of data points from two separate arrays for each hypothesis comparison), I realized that it was my attempts at placing the weights into a slope-intercept form that were wrong. Of course, the solution became very obviously visible in the DataScienceLab example once I knew what I was looking for.

Instead of making the y-value a function of the x[1] *and* x[2] multiplied by their weights (which was correctly rendering the decision made by the perceptron given those values, creating a sort of upside-down form of the decision plane), I realized that I needed to solve for y (x2) by setting the sum of the dot product of w[D] and x[D] to zero and applying some elementary algebra. This gave me the equation:

w[0] + w[1]x[1] + w[2]x[2] = 0

Solving for x[2] (representing our y value) and placing it into slope-intercept form:

y = (-w[1]/w[2] * x) + (-w[0]/w[2])

At this point I felt quite foolish, as this was clearly in the DataScienceLab example in very slightly different notation (please don’t tell my 8th grade algebra teacher). Using the same NumPy linspace function as the DataScienceLab example, I was able to create the range of x values and graph the resulting y values, rendering the decision plane.
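
The algebra can be checked mechanically; a small sketch (hypothetical function name) that converts the weights into plottable line coordinates:

```python
import numpy as np

def decision_line(w, x_min, x_max, num=100):
    """Convert weights [w0, w1, w2] into (xs, ys) points on the decision
    boundary w0 + w1*x1 + w2*x2 = 0, i.e. the slope-intercept form
    y = (-w1/w2) * x + (-w0/w2)."""
    xs = np.linspace(x_min, x_max, num)
    ys = (-w[1] / w[2]) * xs + (-w[0] / w[2])
    return xs, ys
```

Every (x, y) pair returned satisfies w0 + w1\*x + w2\*y = 0, so passing xs and ys to Matplotlib’s plt.plot draws the separating line over the data.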

*Figures Produced by Training Data*

My implementation of the perceptron learning algorithm seems to be somewhat less efficient than that used to generate the original decision plane. In the case of the dataset requiring 1652 iterations to train, I required 3431; in the case of the dataset requiring 2683, I required 205884. Ultimately I arrived at the following weight vectors:

IT1652:

w[0] = 4.75, w[1] = 12.2351, w[2] = -25.32695

IT2683:

w[0] = -94.75, w[1] = -67.2096, w[2] = 115.8705

*Performance and Possible Optimizations*

Generating the correct hypothesis became significantly more expensive in time as the number of iterations increased. The algorithm requires at least an O(n(d + 1))-complexity check of the entire training data space at each iteration to determine the percentage of error (where d is the dimensionality of the data), and the search for a misclassified point could be equally expensive in the worst case. This seems like an aspect of the algorithm that could be investigated for possible optimizations where speed is more important than precise separation.
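
One candidate optimization is replacing the per-point Python loop with a single vectorized pass over the data; a sketch assuming NumPy arrays and a hypothetical function name:

```python
import numpy as np

def misclassified_fraction(w, X, targets):
    """One vectorized pass over the training set: the fraction of points
    whose sign(w . x) disagrees with the target. X already includes the
    x[0] = 1 bias column."""
    outputs = np.where(X @ w > 0, 1, -1)
    return np.mean(outputs != targets)
```

The same matrix product also yields the indices of all disagreeing points at once (np.nonzero(outputs != targets)), which avoids rebuilding per-point data structures each iteration.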

References:

Barber, D. *Bayesian Reasoning and Machine Learning.* Cambridge University Press, 2012, 376 – 378.

DataScienceLab. “Machine learning classics: the perceptron.” Retrieved from https://datasciencelab.wordpress.com/2014/01/10/machine-learning-classics-the-perceptron/ on 21 February 2015.

Mitchell, T. *Machine Learning.* McGraw-Hill, San Francisco, 1997, 81 – 96.

Lin, H., Magdon-Ismail, M., and Abu-Mostafa, Y. *Learning from Data: A Short Course*. AMLbook.com, 2012, 5 – 9.


Machine learning methods are often applied to model complex systems where the function mapping inputs to outputs is unknown but a relationship is suspected or known to exist. Human behavior is one of these complex systems where machine learning can add insight to apparently random behavior. By looking at large samples of behavior, machine learning practitioners can highlight patterns. Retail stores and marketers have a vested interest in determining these patterns to support their decision-making processes and ultimately maximize profits.

To empirically show the power of machine learning in the retail and marketing domain, I apply association analysis to Tom Brijs’s retail dataset. Tom Brijs’s retail dataset (“retail.dat”, 4MB) is an anonymized set of approximately 88,000 real-world sales transactions from a Belgian supermarket. By determining frequently co-occurring sets of products, retailers can make more personal recommendations to shoppers or group those products closer together to take advantage of the natural impulse to purchase those items together.

I implement an offline shopping recommendation engine using Agrawal and Srikant’s Apriori algorithm and association rule extraction. This engine is trained and tested on the publicly-available retail dataset. If the engine performs well, real-world data can then be used and the resulting rules/recommendations can then be experimentally applied in the real-world domain. The recommendation engine can be improved by continuing to train the engine with later purchasing behavior. In this way, the system can receive feedback on its own recommendations.

The engine is implemented using the Python programming language with the NumPy and SciPy scientific computing libraries. The Apriori algorithm and analytical components are developed through guidance from Tom Brijs’s own academic publications as well as popular publications for machine learning practitioners.

Shopping behaviors of customers in stores can be understood as a function of the influences of variables such as desires, needs, and socio-economic factors. For retailers and marketers to capitalize on these mysterious mappings, they could commission complex studies requiring years, a staff of highly-trained researchers, and great expense to determine patterns of behavior among consumers. Retailers and marketers could then devise strategies to maximize profits by exploiting these patterns.

Fortunately, such a time-consuming and costly approach is unnecessary. Retailers and marketers already have the behavioral data they need in the form of shopping transactions of existing customers. By applying machine learning methods to find undiscovered links between their products, retailers and marketers can more efficiently target their advertising, placement, and pricing to take advantage of these links.

Additionally, retailers can *directly recommend *items for a customer to purchase given the items they are selecting or have selected in the past. This model has become prevalent among online retailers, who provide a personalized shopping experience based on the retailer’s knowledge of the customer’s purchasing patterns and those of other customers with similar purchasing patterns.

Brick-and-mortar retailers currently employ schemes such as customer loyalty and rewards programs to track demographic and purchasing information. This data can be anonymized and similar individuals can be compared for patterns or targeted for personalized marketing (Harrington, 224). Alternatively, credit cards can be used to associate purchases with a specific individual. In cases where multiple purchases cannot be linked to an individual, the transaction can still be treated as an itemset and compared against other itemsets.

Many other sources of behavioral information exist; Wal-Mart, for example, is known for mining its transaction data to minimize logistical costs, and retail basket data has been a recurring subject in the KDD community.

What, then, are the limits of current practice?

While some other machine learning frameworks, such as Tom Brijs’s PROFSET, focus on maximizing use of retailer resources (Brijs, et al., 2000), a customer-focused approach through personalized recommendations can be successful by exploiting the customer’s self-interest rather than the retailer’s.

“The technique of association rules produces a set of rules describing underlying purchase patterns in the data, like for instance bread => (implies) cheese [support = 20%, confidence = 75% ].”

Association analysis rests on a few standard measures:

Support: the percentage of transactions in the dataset that contain the itemset.

Confidence: confidence(X -> Y) = P( Y | X ), the probability of consequent Y given the presence of antecedent itemset X.

**Lift: lift(X -> Y) = P( Y | X ) / P( Y )**, confidence divided by the probability of Y in the entire dataset; values above 1 mean X raises the chance of Y.
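
A minimal sketch of these measures, together with the level-wise Apriori search that uses them (hypothetical function names; the real engine would run over the retail.dat transactions):

```python
def support(itemset, transactions):
    """Fraction of transactions (sets of items) containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent)."""
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """P(Y | X) / P(Y): values above 1 mean X raises the chance of Y."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

def apriori(transactions, min_support):
    """Level-wise frequent-itemset search: a (k+1)-itemset can be frequent
    only if all of its k-subsets are, so each level is pruned by
    min_support before the next level's candidates are generated."""
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if support([i], transactions) >= min_support}
    frequent = set(current)
    while current:
        # join step: unions of frequent k-itemsets that differ in one item
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        current = {c for c in candidates
                   if support(c, transactions) >= min_support}
        frequent |= current
    return frequent
```

For the quoted rule bread => cheese, support is the fraction of baskets containing both items, and confidence is that fraction divided by the support of bread alone.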

The Apriori algorithm provides an efficient alternative to this brute-force approach: because any superset of an infrequent itemset must itself be infrequent, large portions of the candidate space can be pruned without ever being counted.

RESULTS AND CONCLUSIONS

While looking at anonymized behavior in the form of sales transactions is clearly valuable, the inclusion of demographic information could further improve the value of the basket analysis for marketers in particular. For personalized recommendations, inclusion of demographic data can provide additional dimensions which could be used as part of a broader “itemset” that includes features of a consumer’s common identity as well as common purchase items.

REFERENCES

Brijs, T. Retail Market Basket Data Set. Limburgs Universitair Centrum: Diepenbeek, Belgium. http://fimi.ua.ac.be/data/retail.dat or http://recsyswiki.com/wiki/Grocery_shopping_datasets

Brijs, T., Swinnen, G., Vanhoof, K. and Wets, G. (1999). Using association rules for product assortment decisions: a case study. In *Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining*. Association for Computing Machinery: San Diego. 254-260.

Brijs, T., Swinnen, G., Vanhoof, K. and Wets, G. (2000). A Data Mining Framework for Optimal Product Selection in Retail Supermarket Data: The Generalized PROFSET Model. In *KDD ’00 Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining*. Association for Computing Machinery: New York. 300 – 304. http://alpha.uhasselt.be/~brijs/pubs/PROFSET2000.pdf

Coelho, L. & Richert, W. (2013). *Building Machine Learning Systems with Python*. Packt Publishing: Birmingham, UK. 147 – 179.

Harrington, P. (2012). *Machine Learning in Action*. Manning: Shelter Island. 225 – 265.

Linear regression is an approach to machine/statistical learning generally applied to value-prediction problems. It is a form of supervised learning, wherein the training data provides the “correct” answer in addition to the data points generated by an unknown function (*f*). Although in this case we were provided a 2-dimensional data set, linear regression can be used on higher-dimensional data sets. The linear regression method assumes that the unknown function *f* can be approximated by a linear equation of *d* terms (the number of features being measured plus a constant value for bias). Among machine learning algorithms it is fairly simple, and in his Caltech lectures Dr. Abu-Mostafa calls linear regression “one-step learning.”

A naïve approach to linear regression is an incremental solution where the algorithm loops through one iteration for every data point provided. At each pass, the line for the potential solution is “nudged” into place to where it can approximate the category of future data. This approach is obviously expensive from a time standpoint so a solution using linear algebra operations on matrices was found. In the same way that matrices can be used to solve systems of equations, a process of transposing, multiplying, and inverting the training data in matrix form can be used to “solve” the problem of the unknown function *f. *
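
That matrix process is the normal-equation (pseudo-inverse) solution; a minimal sketch assuming NumPy and a data matrix with a leading column of 1s for the bias term:

```python
import numpy as np

def linear_regression(X, y):
    """One-step linear regression via the normal equations:
    w = (X^T X)^-1 X^T y, where X is the (n, d) data matrix whose first
    column is all 1s (the bias term) and y is the vector of targets."""
    return np.linalg.inv(X.T @ X) @ X.T @ y
```

In practice np.linalg.lstsq is the numerically safer route, but the explicit transpose-multiply-invert chain mirrors the derivation above.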

As required by the assignment, I implemented my solution to the categorizing problem using Python. I used three libraries popular for scientific and mathematical applications, NumPy, SciPy, and Matplotlib. Matplotlib provides a simple interface to quickly generate visualizations of data which I used to plot the points and regression line. SciPy has a nifty method for quickly opening and parsing tabulated or character-separated data files such as CSVs, which I used to intake the training data set. Finally, NumPy provides the matrix data structures and linear algebra operations necessary to handle the dot-product multiplication, transposes, and inversions required by the linear regression algorithm. While SciPy also provides methods that can be used to automate much of the machine learning process, I did not take advantage of these tools as that would defeat the purpose of the exercise.

I encountered difficulty initially in the second-to-last step of the algorithm, wherein the (X^T X)^-1 matrix is multiplied by the X^T and Y matrices. After some debugging and troubleshooting I found that the Y matrix needed to be transposed in order to put it in the dimensions required by matrix multiplication rules. I was then finally able to arrive at the W^T matrix (coefficients/weights for the terms of the linear equation), with an output of [[ 0.0867182 0.39972487]]. Applying these weights to the hypothesized equation w1x1 + w0x0 (x0 being 1), I got the (massively oversimplified) solution g(x) = 0.0867*x + 0.399.

Ultimately, while my data plot appears to accurately represent the training data set, I am clearly missing something as my line appears to be missing a positive constant which would place it slightly higher on the graph. After a few attempts to discover the missing constant, I decided to study my notes again to try to isolate the mistake and possibly use some of the SciPy tools to find a better fit.

References:

Coelho, L. & Richert, W. (2013). *Building Machine Learning Systems with Python.* Packt Publishing: Birmingham, UK. pp. 19 – 35.

Bressert, E. (2013). *SciPy and NumPy.* O’Reilly: Sebastopol, CA.

Harrington, P. (2012). *Machine Learning in Action.* Manning: Shelter Island, NY. pp. 153 – 159.

Klein, P. N. (2013) *Coding the Matrix: Linear algebra through computer science applications*. Newtonian Press. pp. 89-90.

The Hoeffding Inequality showed that learning is a theoretical possibility.

Additionally, further methods can show whether the learning task can be accomplished in polynomial time relative to the problem space, given a polynomial number of training examples. That is, we obviously don’t have eternity to find our approximation of the target function and we don’t have an infinitely large training set, so we need to show that we can find it in a reasonable amount of time and with a limited number of examples. Otherwise, trying to train a system to solve our problem would be pointless. For our purposes, we will assume a learning system that must classify data in binary terms (example: {-1, +1}) such as perceptron outputs.

In a potential machine learning problem where the input space is continuous there are an infinite number of potential **hypotheses**. We can’t reasonably assess the possible correctness of an infinite number of hypotheses, so we must break the problem down to a more tractable size. Using a finite representation of the hypothesis space makes the assessment possible and allows us to apply the equations and tests we need to determine if the problem is tractable. This finite representation is called a **dichotomy**. In a 2-dimensional binary classification problem, we can represent a dichotomy as a finite number of points plotted on a Cartesian plane. The hypothesis will then classify each point as a -1 or +1 point.

Using these unclassified points, we test to see how many distinct combinations of classified points are possible. The theoretical upper limit of possible combinations is 2^N (where N is the number of points being considered). However, our approach to classification can significantly limit the number of combinations (hypotheses) we can actually generate. In the case of a perceptron considering a 2D binary classification problem, the perceptron can successfully find all combinations for 3 points, but at four points our options become limited by the nature of the decision plane (a line): the two XOR-style labelings cannot be separated, so the highest number of dichotomies possible is 14 (2^4 – 2).
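
This count can be verified by brute force. A sketch (hypothetical function names, assuming SciPy is available) that tests each of the 2^N labelings for linear separability with a small linear program:

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """A labeling is linearly separable iff some w = (w0, w1, w2) satisfies
    y_i * (w . [1, x_i]) >= 1 for every point; test feasibility with an LP
    (the margin of 1 is just a scaling of the strict inequality > 0)."""
    X = np.hstack([np.ones((len(points), 1)), points])
    A_ub = -np.asarray(labels)[:, None] * X   # rewrite as A_ub w <= b_ub
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3)
    return res.success

def count_dichotomies(points):
    """How many of the 2^N labelings a 2D perceptron can realize."""
    return sum(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))
```

For the four corners of the unit square this returns 14; the two XOR-style labelings are exactly the ones that fail.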

This is very useful information as it gives us a starting point from which to determine if we can find a target function for this situation. We can create a term from this knowledge that we will substitute into the Hoeffding Inequality in order to reduce the hypothesis space from infinite to something we can actually work with. This term is called the **growth function**. Rather than the term M in the Hoeffding Inequality, we call this term m[H](N) where N is the number of points for which we can generate all possible hypotheses. In our 2D perceptron, this would read as m[H](3).

This also tells us where the **break point** is for this system, which is 4 (the point where we stopped being able to generate all unique hypotheses). Since the number 4 is certainly within the constraint of polynomial sample size with respect to the problem space, we know that learning in this setting is possible. This is a generalization given that we don’t know what the desired probability of correctness for the system will be, but we can make the leap that even the requirement for a high probability of correctness will still only require a reasonably finite number of examples. This is because of the property that where k = the break point and N = the number of samples required by the growth function, the growth function will always be less than or equal to a polynomial on the order of N^(k-1). This property is encapsulated by **Sauer’s Lemma**, which allows us to conclude that a finite break point indicates a polynomial growth function.

Some examples of growth functions compared to their problem spaces can help illustrate why growth functions are helpful in determining the feasibility of learning in that problem space. For binary classification by a ray on a number line, the growth function is m[H](N) = N + 1, where N is any number of sample points drawn on the line: a hypothesis can at most distinguish the N + 1 intervals created by placing N points on the line. Using the same problem space (a number line) where we are trying to classify a positive interval, the growth function is (1/2)N^2 + (1/2)N + 1, because a hypothesis necessarily splits the line into three portions: one shrinking (or growing) portion in the middle and two on the outside of the interval. Finally, a convex set, where a point being encompassed by the convex region indicates a +1 and a -1 otherwise, will have a growth function of 2^N (it can shatter any number of points placed, for example, on a circle).
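
The ray case is easy to check by brute force; a sketch (hypothetical function name) that enumerates every distinct labeling a positive ray h(x) = +1 iff x >= threshold can produce:

```python
def ray_dichotomies(points):
    """Count distinct labelings of points on a number line achievable by
    positive rays: h(x) = +1 iff x >= threshold. Only thresholds below all
    points or between adjacent points produce distinct behaviors."""
    pts = sorted(points)
    cuts = [pts[0] - 1.0]                                  # ray covers everything
    cuts += [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]  # midpoints
    cuts += [pts[-1] + 1.0]                                # ray covers nothing
    labelings = {tuple(1 if x >= c else -1 for x in pts) for c in cuts}
    return len(labelings)
```

Four distinct points yield 5 labelings, matching m[H](N) = N + 1.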

A term related to finding the growth function and break point of a system is “**shattering**.” In order to shatter a data set in a hypothesis space H, H must include sufficient hypotheses such that, between them, H is able to generate every possible combination of classifications for that data set. The higher the number of examples that a hypothesis space can shatter, the more expressive the hypothesis space is. For example, since we know that a 2D perceptron can only generate hypotheses adequate to shatter three points, we can conclude that a 2D perceptron is not generating very expressive hypotheses.

Determining at what cardinality of points a hypothesis space can shatter the example set leads into determining the **Vapnik-Chervonenkis dimension** of the problem. Simply put, the Vapnik-Chervonenkis dimension (VC dimension) of a hypothesis set is the largest number of examples that the hypothesis set can shatter. To relate it to the growth function in a binary classification problem, it is the highest number of examples where m[H](N) can equal 2^N. The VC dimension is what Dr. Abu-Mostafa calls “the most important theoretical result in machine learning.” A possible reason for this is that it removes the problem of having to consider many overlapping hypotheses individually while still accounting for their effect on the overall probability of a correct approximation of the target function. The VC dimension is key in showing not only that a learning problem has a finite solution, but also what the computational effort and number of required training examples will look like. If *epsilon* is a value representing the tolerable difference in error rates between training (in-sample) and out-of-sample data, N is the number of training examples required, m[H](x) is the growth function, and *sigma* is the probability that we will get a “bad” result from a hypothesis,

*sigma* <= 4 * m[H](2N) * e^( -(1/8) * *epsilon*^2 * N ).

This tells us that to reach a probability of correctness of (1 – *sigma*) for a learning system with a tolerance of *epsilon* between in- and out-of-sample data, we will need to use N training examples. This is obviously a requirement of many more training samples to approach the high probability of correctness that we want, but now at least we are in the realm of finite numbers so we will know that we are dealing with a realistically learnable situation. This inequality is also important because it gives us the sample size at which, regardless of the probability distribution from which the sample set was drawn, a hypothesis will generalize to approximate a target function (irrespective of the target function or learning algorithm used), which we would then select as our “g in H.”
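
As a numeric illustration (hypothetical function name; the polynomial growth function below is an assumed Sauer-style stand-in, not the exact m[H] of any particular model):

```python
import math

def vc_bound(N, epsilon, growth_fn):
    """Right-hand side of the VC bound:
    4 * m_H(2N) * e^(-(1/8) * epsilon^2 * N)."""
    return 4 * growth_fn(2 * N) * math.exp(-(epsilon ** 2) * N / 8)

# A break point of k = 4 caps the growth function by a polynomial on the
# order of N^(k-1); N^3 + 1 is used here purely as an illustrative stand-in.
growth = lambda n: n ** 3 + 1
```

Even though m[H](2N) grows polynomially in N, the exponential factor wins, so the bound on *sigma* shrinks as the number of training examples increases.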

We can rearrange terms and gain some further understanding of how accurate we can get with out-of-sample data classified by our learning system. If we make

*epsilon* = sqrt( (8/N) * ln( (4 * m[H](2N)) / *sigma* ) )

and call it *Omega*, then the error of our system as applied to out-of-sample data (E[out]) will be:

E[out] <= E[in] + *Omega*.

One may get the impression from this apparent relation between a larger number of examples and a lower probability of error that we could simply skip all of the above steps, provide the learning system with as many examples as is physically possible, and get the smallest E[out] possible. This, unfortunately, is not correct. At a certain point (depending on the system), as we provide more training examples, we approach a bound on the utility of additional data with respect to improving the learner’s performance on out-of-sample data. This relationship is represented by the **learning curve**. While a system will improve its classification of both in-sample and out-of-sample data over some increasing span of training examples, the accuracy of the system on both kinds of data eventually reaches a bound. Thus, the computational expense, and the expense of acquiring additional data, begins to outweigh the advantage gained by attempting to learn from those examples.

Generating a large number of hypotheses can also be detrimental. Since we don’t know the target function (which is why we are having to learn from the data), we need to find and recognize the approximation in our hypothesis set. Of course, we would like our approximation to come as close as possible to the unknown target function, but we can get bogged down in the search if we must continue generating hypotheses to decrease our “distance” from the target function. This tradeoff is called **bias-variance**. As the bias (the “distance” from the true target function) decreases, we must generate and assess a larger space of hypotheses (the “variance”). The complexity of the model we are using to generate our hypotheses can also significantly affect the cost of balancing bias and variance when attempting to approximate the target function. Thus it can be preferable to choose a hypothesis that less closely approximates the target function if coming closer would be more expensive computationally.

References:

Mitchell, T. *Machine Learning.* McGraw-Hill, San Francisco, 1997, 201-220.

Abu-Mostafa, Y., Lin, H., and Magdon-Ismail, M. *Learning from Data: A Short Course*. AMLbook.com, 2012, 39-68.

Abu-Mostafa, Y. CS 156, “Learning from data.” California Institute of Technology, Lectures 6 – 8. https://www.youtube.com/watch?v=VeKeFIepJBU&list=PL6E95797B0B983ECB

Logistic regression is a type of machine learning approach to the classification of noisy data. Whereas linear classification requires data to be linearly separable in order to find the decision hyperplane, logistic regression allows for the expression of uncertainty by providing a probability that a given sample should be placed into one class or the other.

Logistic regression calculates the probability by running the dot product of the inputs and weights through a logistic or “sigmoid” function, which places the value of **w**T**x** on a probability distribution that stretches like an elongated “S” between 0.0 and 1.0. The “y” value returned is therefore the probability that the sample should be classified as the class in question.
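
A minimal sketch of the sigmoid itself:

```python
import math

def sigmoid(s):
    """Logistic function: maps s = w^T x from (-inf, inf) onto the
    elongated-'S' curve between 0.0 and 1.0, read as a probability."""
    return 1.0 / (1.0 + math.exp(-s))
```

A score of 0 maps to exactly 0.5; large positive scores approach 1.0 and large negative scores approach 0.0.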

One training algorithm for use in logistic regression is called gradient descent. Like other machine learning training algorithms we’ve covered, gradient descent seeks to find a linear function that approximates an unknown function. The algorithm alters the values of coefficients (weights) and then calculates the amount of error between our hypothesis of what the coefficients should be and what the true, hidden function produces (in the form of the training inputs and their target outputs). Since logistic regression is used in cases where the target function is noisy (that is, the data is not necessarily linearly separable), instead of iterating through the algorithm until the error measure is 0, we either set an acceptable amount of error or stop the algorithm after a given number of iterations.

In gradient descent, we move the weights in the direction of less error by calculating the gradient of the “error surface,” a d-dimensional surface where the lowest point represents the least amount of error, given the “point” our current hypothesis represents, which is a vector of weights. The gradient is the steepest direction from our current hypothesis, so by moving in the opposite direction, we are moving toward the point of lowest error [Mitchell 91].

Since the gradient only represents a “direction,” we need to decide on a magnitude or “distance” to move our weights. We then update our weights and then start the algorithm over. If we find we are still on a steep portion of the surface, we know we are still some distance from our goal location. Once the gradient becomes sufficiently small, we know that we are converging on the best approximation for the target function.

**My Implementation**

Per the parameters of the assignment, I implemented a logistic regression learning program in Python, using gradient descent as our training algorithm. In addition to the Python standard library, I use NumPy, SciPy, and the PyPlot library inside of Matplotlib. The data set we are provided with is Credit_Data.csv, which is a collection with fields Balance and Income and a string of “Yes” or “No” indicating if that sample resulted in a Default. Our assignment is to apply gradient descent learning and the logistic regression algorithm to the data to determine a probability distribution over whether a given Balance or Income amount has a greater than 0.5 probability of resulting in a Default.

In this case, we are to implement the logistic regression algorithm specified by Abu-Mostafa, et al. [Abu-Mostafa, et al., 95].

I interpreted Abu-Mostafa’s learning algorithm as the following steps:

- Create a weight vector **w** of length *d* (the dimensionality of the training data, plus a bias term w[0]) and initialize the values to 0.
- Set the iteration/time-step variable *t* to 0.
- While (some criterion is not met) do:
  - Compute grEin (the gradient of E[in], the in-sample error)
  - Delta of weights = eta * grEin
  - Update the weights for iteration t + 1: **w**[t + 1] = **w**[t] – delta of weights
  - t = t + 1
- Continue the loop until the criterion is met
- Return **w**

The final set of weights returned, **w**, can then be used with the sigmoid function and a set of input data (don’t forget the bias term!) to return the expected probability that the input indicates a default.
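
The steps above can be sketched as follows (hypothetical function names; the fixed iteration cap stands in for whichever termination criterion is chosen):

```python
import numpy as np

def in_sample_gradient(X, y, w):
    """Gradient of E_in for logistic regression in Abu-Mostafa's form:
    -1/N * sum_n (y_n * x_n) / (1 + exp(y_n * w . x_n))."""
    total = np.zeros_like(w)
    for x_n, y_n in zip(X, y):
        total += (y_n * x_n) / (1.0 + np.exp(y_n * np.dot(w, x_n)))
    return -total / len(y)

def gradient_descent(X, y, eta=0.1, max_iters=1000):
    """Initialize w to zeros, then repeatedly step opposite the gradient of
    the in-sample error; X includes the x[0] = 1 bias column and y holds
    targets in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        w = w - eta * in_sample_gradient(X, y, w)
    return w
```

The returned **w** is what then feeds the sigmoid to produce default probabilities for new inputs.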

The learning rate variable eta is set at the beginning of the script as a small magnitude value and passed into the algorithm. Many possibilities exist for setting the termination criterion of the algorithm, including setting a “little epsilon” value, an acceptably small gradient at which we can be satisfied that we’ve reached the closest approximation of the target function. I initially attempted to use this “little epsilon” value as the termination criterion but had difficulty reaching an adequately small gradient. My next solution was simply to terminate after a reasonably large number of iterations.

The main worker method of the script is gradient_descent(), which performs the weight updates and provides the high-level implementation of the gradient descent algorithm. I made my implementation rather verbose in order to facilitate debugging and inspecting the values being returned. As such, my gradient_descent() method makes use of a separate get_gradient() method, which takes the set of target values, the training data, and the current weight vector and returns the gradient. Abu-Mostafa calculates the gradient using this summation:

gradient = -(1/N) * Σ (for n = 1 to N) of ( *y*[n] * **x**[n] ) / ( 1 + *e*^( *y*[n] * **w**[*t*]^T * **x**[n] ) )

where:

N is the cardinality of the training set,

n is a given sample in N,

*y*[n] is the target value for that instance of N,

**x**[n] is the vector of inputs,

and **w**[t] is the vector of weights for that iteration of the algorithm (this weight vector stays the same throughout the whole summation).

This particular method of finding the gradient is nice because it has the sigmoid function somewhat baked into it. Other methods seem to require a lot of normalizing the data beforehand or require running the error through the sigmoid function separately [Harrington, 89].

I implemented the summation in separate steps, with the get_gradient() method responsible for iterating through the dataset N and get_partial_gradient_sum() responsible for calculating ( *y*[n] * **x**[n] ) / ( 1 + *e*^( *y*[n] * **w**[*t*]^T * **x**[n] ) ). This was to prevent the code from becoming too unwieldy in one method and also to facilitate stepping through and debugging the algorithm.
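Under that split, the two methods might look like the following sketch. The method names come from the write-up above; the exact signatures are my assumption:

```python
import math

def get_partial_gradient_sum(y_n, x_n, w):
    """One summand: (y_n * x_n) / (1 + e^(y_n * w.T x_n))."""
    signal = sum(w_i * x_i for w_i, x_i in zip(w, x_n))
    scale = y_n / (1.0 + math.exp(y_n * signal))
    return [scale * x_i for x_i in x_n]

def get_gradient(y, X, w):
    """Iterate over the N samples, sum the partials, and negate-average."""
    total = [0.0] * len(w)
    for y_n, x_n in zip(y, X):
        part = get_partial_gradient_sum(y_n, x_n, w)
        total = [t + p for t, p in zip(total, part)]
    n = float(len(y))   # float(N) avoids the integer-division trap below
    return [-t / n for t in total]
```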

**Lessons Learned**

If I have the opportunity to refine future versions of this implementation, I plan to use lambda functions with matrix operations instead of a for-loop in get_gradient(). I used the iterative approach in order to be able to inspect the values returned by the different methods that produce the gradient, but now that I know they are correct, I need to convert the process into a matrix operation.

It seems that the standard Python numeric types couldn’t quite accommodate the operations we needed to perform on the large values in the Income data, so I had to divide all of these values by 1000 when generating the datasets in order to produce a nice distribution.
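The most likely culprit here (my inference, not stated in the assignment) is the exponential in the gradient rather than Python's integers, which are arbitrary precision: math.exp raises OverflowError once its argument exceeds roughly 709, which raw Income values can easily produce.

```python
import math

# e^x overflows a double-precision float once x passes ~709,
# which is easy to hit when raw Income values feed the exponent:
try:
    math.exp(1000)
    overflowed = False
except OverflowError:
    overflowed = True

# Dividing the raw values by 1000 keeps the signal in range:
in_range = math.exp(1000 / 1000.0)   # e^1, comfortably representable
```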

One hard-learned lesson as I was trying to assess the performance of the algorithm was that an integer divided by an integer yields an integer in Python 2. I discovered this when normalizing the gradient with the -1/N factor: N, of course, is an integer, so integer division floored the result to -1 instead of the small fraction I expected (e.g. -.0001).
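The trap in miniature (N = 10000 is an illustrative value):

```python
N = 10000

# Python 2's / between two ints is floor division, so the -1/N
# factor in the gradient silently floors to -1:
assert -1 // N == -1    # what Python 2's -1/N produced
assert 1 // N == 0

# Casting either operand to float (or `from __future__ import division`
# under Python 2) restores true division:
assert -1 / float(N) == -0.0001
```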

Abu-Mostafa, Y., Lin, H., and Magdon-Ismail, M. *Learning from Data: A Short Course*. AMLbook.com, 2012, 88-99.

Abu-Mostafa, Y. CS 156, “Learning from data.” California Institute of Technology, Lecture 9. https://www.youtube.com/watch?v=VeKeFIepJBU&list=PL6E95797B0B983ECB

Harrington, P. *Machine Learning in Action*. Manning, Shelter Island, 2012, 83-100.

Mitchell, T. *Machine Learning.* McGraw-Hill, San Francisco, 1997, 89-97.
