# Logistic Regression with Gradient Descent – Some Thoughts and Lessons

https://github.com/gtraines/logistic-regression-classification

Logistic regression is a type of machine learning approach to the classification of noisy data. Whereas linear classification requires data to be linearly separable in order to find the decision hyperplane, logistic regression allows for the expression of uncertainty by providing a probability that a given sample should be placed into one class or the other.

Logistic regression calculates the probability by running the vector of of inputs and weights through a logistic or “sigmoid” function which

places the value of wTx on a probability distribution that stretches like an elongated “S” between 0.0 and 1.0. The “y” value returned is therefore the probability that the sample should be classified as the class in question.

One training algorithm for use in logistic regression is called gradient descent. Like machine learning training algorithms we’ve covered, gradient descent involves seeks to find a linear function that approximates an unknown function. The algorithm alters the values of coefficients (weights) and then calculates the amount of error between our hypothesis of what the coefficients should be compared to what the true, hidden function produces (in the form of the training inputs and their target outputs). Since logistic regression is used in cases where the target function is noisy (that is, the data is not necessarily linearly separable), instead of iterating through the algorithm until the error measure is 0, we either set an acceptable amount of error or stop the algorithm after a given number of iterations.

In gradient descent, we move the weights in the direction of less error by calculating the gradient of the “error surface,” a d-dimensional surface where the lowest point represents the least amount of error, given the “point” our current hypothesis represents, which is a vector of weights. The gradient is the steepest direction from our current hypothesis, so by moving in the opposite direction, we are moving toward the point of lowest error [Mitchell 91].

Since the gradient only represents a “direction,” we need to decide on a magnitude or “distance” to move our weights. We then update our weights and then start the algorithm over. If we find we are still on a steep portion of the surface, we know we are still some distance from our goal location. Once the gradient becomes sufficiently small, we know that we are converging on the best approximation for the target function.

My Implementation

Per the parameters of the assignment, I implemented a logistic regression learning program in Python, using gradient descent as our training algorithm. In addition to the Python standard library, I use NumPy, SciPy, and the PyPlot library inside of Matplotlib. The data set we are provided with is Credit_Data.csv, which is a collection with fields Balance and Income and a string of “Yes” or “No” indicating if that sample resulted in a Default. Our assignment is to apply gradient descent learning and the logistic regression algorithm to the data to determine a probability distribution over whether a given Balance or Income amount has a greater than 0.5 probability of resulting in a Default.

In this case, we are to implement the logistic regression algorithm specified by Abu-Mostafa, et al. [Abu-Moistafa, et al., 95].

I interpreted Abu-Mostafa’s learning algorithm as the following steps:

1. Create a weight vector w of length d (dimensionality of the training data, plus a bias term w[0]) and initialize the values to 0.
2. Set iteration/time step variable t to 0.
3. While (some criteria not met) do:
4. Compute the grEin (gradient of E[in] (in-sample error))
5. Delta of weights = eta * grEin
6. Update weights for iteration t+1:

w[t + 1] = w[t] – delta of weights

1. t = t + 1
2. Continue loop until criteria is met
3. Return w

The final set of weights returned, w, can then be used with the sigmoid function and a set of input data (don’t forget the bias term!) to return the expected probability that the input indicates a default.

The learning rate variable eta is set at the beginning of the script as a small magnitude value and passed into the algorithm. Many possibilities exist for setting the termination criteria of the algorithm, including setting a “little epsilon” value which is an acceptably small gradient at which we can be satisfied that we’ve reached the closest approximation of the target function. I initially attempted to use this “little epsilon” value as the termination criteria but was having difficulty reaching an adequately small gradient. My next solution was to simply to terminate after a reasonably large number of iterations.

The main worker method of the script is gradient_descent(), which performs the weight updates and provides the high-level implementation of the gradient descent algorithm. I made my implementation rather verbose in order to facilitate debugging and inspecting the values that were being returned. As such, my gradient_descent() method makes use of a separate get_gradient method, which takes the set of target values, training data, and current weight vector and returns the gradient. Abu-Mostafa calculates the gradient using this summation:

gradient = -1/N * (summation (for n in N) → ((y[n] * x[n]) / ( 1 + e^( y[n]* w[t]Tx[n] )))

where:

N is the cardinality of the training set,

n is a given sample in N,

y[n] is the target value for that instance of N,

x[n] is the vector of inputs,

and w[t] is the vector of weights for that iteration of the algorithm (this weight vector stays the same throughout the whole summation).

This particular method of finding the gradient is nice because it has the sigmoid function somewhat baked into it. Other methods seem to require a lot of normalizing the data beforehand or require running the error through the sigmoid function separately [Harrington, 89].

I implemented the summation in separate steps, with the get_gradient() method responsible for iterating through the dataset N and  get_partial_gradient_sum() responsible for the calculation of (y[n] * x[n]) / ( 1 + e^( y[n]* w[t]Tx[n] )). This was to prevent the code from becoming too unwieldy in one method and also to facilitate stepping through and debugging the algorithm.

Lessons Learned

If I have the opportunity to refine future versions of this implementation, I plan to turn use lambda functions with matrix operations instead of a for-loop in get_gradient(). I used the iterative approach in order to be able to inspect the values that were being returned by the different methods that produce the gradient, but once I know they are correct, I need to convert the process into a matrix operation.

It seems that the standard Python numeric types couldn’t quite accommodate operations we needed to perform the large values in the Income data so I had to divide all of these values by 1000 when generating the datasets in order to produce a nice distribution.

One hard-learned lesson as I was trying to assess the performance of the algorithm was that an integer divided by an integer will yield an integer in Python. I discovered this when normalizing the gradient with 1/N. N, of course, is an integer so the output of this operation was 1 instead of .0001.

###### References:

Abu-Mostafa, Y., Lin, H., and Magdon-Ismael, M. Learning from Data: A Short Course. AMLbook.com, 2012, 88-99.

Abu-Mostafa, Y. CS 156, “Learning from data.” California Institute of Technology, Lecture 9. https://www.youtube.com/watch?v=VeKeFIepJBU&list=PL6E95797B0B983ECB

Harrington, P. Machine Learning in Action. Manning, Shelter Island, 2012, 83-100.

Mitchell, T. Machine Learning. McGraw-Hill, San Francisco, 1997, 89-97.