Before setting out to build a learning system, practitioners, students, and consumers of machine learning should be able to show rigorously that learning methods can be applied successfully to a given learning problem. Using statistics and probability, we can show how a supervised classification learning system can achieve a level of success called “probably approximately correct.” Given the desired probability that a learner will correctly classify a problem, we can determine how expensive approximating the target function will be in terms of computation (generally time and space) and how many training examples are required to achieve that level of correctness.

The Hoeffding Inequality showed that learning is a theoretical possibility: for a single hypothesis, the probability that the in-sample error deviates from the out-of-sample error by more than a tolerance *epsilon* is bounded by 2 * e^(-2 * *epsilon*^2 * N), which shrinks as the number of samples N grows.

Additionally, further methods can show whether the learning task can be accomplished in polynomial time relative to the problem space, given a polynomial number of training examples. That is, we obviously don’t have eternity to find our approximation of the target function and we don’t have an infinitely large training set, so we need to show that we can find it in a reasonable amount of time and with a limited number of examples. Otherwise, trying to train a system to solve our problem would be pointless. For our purposes, we will assume a learning system that must classify data in binary terms (example: {-1, +1}) such as perceptron outputs.

In a potential machine learning problem where the input space is continuous, there are an infinite number of potential **hypotheses**. We can’t reasonably assess the possible correctness of an infinite number of hypotheses, so we must break the problem down to a more tractable size. Instead of counting hypotheses, we count the distinct labelings they can produce on a finite set of points, which makes the assessment possible and allows us to apply the equations and tests we need to determine if the problem is tractable. Each such labeling is called a **dichotomy**. In a 2-dimensional binary classification problem, we can represent the setup as a finite number of points plotted on a Cartesian plane, with each hypothesis classifying each point as -1 or +1.

Using these unclassified points, we test how many distinct combinations of classified points are possible. The theoretical upper limit is 2^N (where N is the number of points being considered). However, our approach to classification can significantly limit the number of dichotomies we can actually generate. A perceptron on a 2D binary classification problem can realize all 2^3 = 8 labelings of three points (in general position), but at four points our options become limited by the nature of the decision boundary (a line): only 14 of the 2^4 = 16 possible labelings are achievable, because the two “XOR-style” labelings cannot be separated by a line.
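These dichotomy counts can be checked empirically. Below is a minimal sketch (function name and point sets are our own choices) that samples many random perceptron weight vectors and counts the distinct sign patterns they induce; random sampling is an approximation, but with enough draws it recovers all achievable dichotomies for these tiny sets:

```python
# Count the dichotomies a 2D perceptron can realize on small point sets
# by sampling random weight vectors (w0 + w1*x + w2*y) and collecting
# the distinct sign patterns they induce on the points.
import random

def perceptron_dichotomies(points, trials=200_000, seed=0):
    rng = random.Random(seed)
    seen = set()
    for _ in range(trials):
        w0, w1, w2 = [rng.uniform(-1, 1) for _ in range(3)]
        pattern = tuple(1 if w0 + w1 * x + w2 * y > 0 else -1
                        for x, y in points)
        seen.add(pattern)
    return len(seen)

three = [(0, 0), (1, 0), (0, 1)]           # general position
four  = [(0, 0), (1, 0), (0, 1), (1, 1)]   # corners of a square

print(perceptron_dichotomies(three))  # 8: all 2^3 labelings
print(perceptron_dichotomies(four))   # 14 of the 2^4 = 16 labelings
```

The two missing labelings for four points are exactly the diagonal “XOR” patterns.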

This is very useful information, as it gives us a starting point from which to determine whether we can find a target function for this situation. From this knowledge we can create a term to substitute into the Hoeffding Inequality in place of M, reducing the hypothesis space from something infinite to something we can actually work with. This term is called the **growth function**, written m[H](N): the maximum number of dichotomies the hypothesis set can generate on any N points. For our 2D perceptron, m[H](3) = 2^3 = 8, but m[H](4) = 14 < 2^4. This also tells us where the **break point** is for this system, namely 4 (the first N at which we can no longer generate every possible labeling). A finite break point is what makes learning in this setting possible. This is a generalization, given that we don’t know what the desired probability of correctness for the system will be, but we can make the leap that even a demanding probability of correctness will still only require a reasonably finite number of examples. This is because of the property that where k is the break point and N the number of sample points, the growth function will always be less than or equal to a polynomial on the order of N^(k-1). This property is encapsulated by **Sauer’s Lemma**, which allows us to conclude that a finite break point implies a polynomial growth function.
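Sauer’s Lemma gives the bound explicitly as a sum of binomial coefficients, m[H](N) <= sum over i < k of C(N, i), which is O(N^(k-1)). A small sketch (the function name and the choice k = 4 for the 2D perceptron are ours):

```python
# Sauer's Lemma bound: with break point k, the growth function satisfies
# m_H(N) <= sum_{i=0}^{k-1} C(N, i), a polynomial of degree k-1 in N.
from math import comb

def sauer_bound(N, k):
    return sum(comb(N, i) for i in range(k))

# 2D perceptron: break point k = 4; we found 14 dichotomies on 4 points.
print(sauer_bound(4, 4))    # 15, an upper bound on the observed 14
print(sauer_bound(100, 4))  # polynomial in N, vastly less than 2^100
```

Note the bound need not be tight (15 vs. the actual 14), but polynomial versus exponential is what matters for feasibility.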

Some examples of growth functions compared to their problem spaces help illustrate why growth functions are useful for determining the feasibility of learning. For binary classification by a positive ray on a number line (h(x) = +1 for all x at or beyond some threshold), the growth function is m[H](N) = N + 1, because placing N points on the line creates N + 1 intervals in which the threshold can fall, each yielding a distinct dichotomy. For a positive interval on the same number line, the growth function is ½N^2 + ½N + 1, because a hypothesis splits the line into three portions (the interval and the two regions outside it), and there are C(N+1, 2) + 1 distinct ways to place the interval’s endpoints among the gaps. Finally, a convex set, where a point encompassed by the convex region is classified +1 and -1 otherwise, has a growth function of 2^N: points placed on a circle can always be shattered, so this hypothesis set has no break point and the growth function remains exponential.
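The two 1D growth functions can be verified by brute force; a small sketch (point placement and function names are our own) that enumerates all threshold positions and compares the count of distinct dichotomies against the closed forms:

```python
# Brute-force the growth functions for two 1D hypothesis sets on points
# x = 1..N: "positive ray" (h(x) = +1 iff x >= a) and "positive interval"
# (h(x) = +1 iff a <= x <= b). Thresholds between the points suffice.
def ray_dichotomies(N):
    pts = range(1, N + 1)
    cuts = [i + 0.5 for i in range(N + 1)]
    return len({tuple(1 if x >= a else -1 for x in pts) for a in cuts})

def interval_dichotomies(N):
    pts = range(1, N + 1)
    cuts = [i + 0.5 for i in range(N + 1)]
    return len({tuple(1 if a <= x <= b else -1 for x in pts)
                for a in cuts for b in cuts if a <= b})

for N in range(1, 8):
    assert ray_dichotomies(N) == N + 1                 # m_H(N) = N + 1
    assert interval_dichotomies(N) == (N * N + N) // 2 + 1  # N^2/2 + N/2 + 1
```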

A term related to finding the growth function and break point of a system is “**shattering**.” To shatter a data set, a hypothesis space H must contain enough hypotheses that, between them, every possible combination of classifications for that data set can be generated. The larger the number of examples a hypothesis space can shatter, the more expressive it is. For example, since we know that a 2D perceptron can only generate hypotheses adequate to shatter three points, we can conclude that a 2D perceptron is not generating very expressive hypotheses.
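A shattering check is straightforward to express in code. A minimal sketch, using a deliberately weak hypothesis set of 1D thresholds (the class and point values are illustrative assumptions): it shatters one point but fails on two, because the labeling (+1, -1) with the smaller point positive is unreachable.

```python
# Check whether a hypothesis set shatters a point set: every one of the
# 2^N labelings must be realized by some hypothesis.
from itertools import product

def shatters(points, hypotheses):
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return all(lab in realized
               for lab in product((-1, 1), repeat=len(points)))

# 1D thresholds: h(x) = +1 iff x > a, for a few threshold values a.
thresholds = [lambda x, a=a: 1 if x > a else -1 for a in (0.5, 1.5, 2.5)]

print(shatters([1], thresholds))     # True: both labelings of one point
print(shatters([1, 2], thresholds))  # False: (+1, -1) is unreachable
```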

Determining at what cardinality of points a hypothesis space can shatter the example set leads into determining the **Vapnik-Chervonenkis dimension** of the problem. Simply put, the Vapnik-Chervonenkis dimension (VC dimension) of a hypothesis set is the largest number of examples that the hypothesis set can shatter. To relate it to the growth function in a binary classification problem, it is the highest N for which m[H](N) can equal 2^N. The VC dimension is what Dr. Abu-Mostafa calls “the most important theoretical result in machine learning.” A possible reason is that it removes the problem of having to consider many overlapping hypotheses individually while still accounting for their combined effect on the probability of a correct approximation of the target function. The VC dimension is key in showing not only that a learning problem has a finite solution but also what the computational effort and number of required training examples will look like. If *epsilon* is a value representing the tolerable difference in error rates between training (in-sample) and out-of-sample data, N is the number of training examples required, m[H](N) is the growth function, and *sigma* is the probability that we will get a “bad” result from a hypothesis,

*sigma* <= 4 * m[H](2N) * e^(-(1/8) * *epsilon*^2 * N).

This tells us that to reach a probability of correctness of (1 – *sigma*) for a learning system with a tolerance of *epsilon* between in- and out-of-sample error, we will need N training examples. A high probability of correctness demands many more training samples, but at least we are now in the realm of finite numbers, so we know we are dealing with a realistically learnable situation. This inequality is also important because it gives us the sample size at which, regardless of the probability distribution from which the sample set was drawn, a hypothesis will generalize to approximate a target function (irrespective of the target function or learning algorithm used), which we would then select as our “g in H.”
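The inequality can be solved numerically for N. A sketch under stated assumptions: we bound the growth function by the polynomial N^(k-1) + 1 for break point k = 4 (as for the 2D perceptron), and the values *epsilon* = 0.1 and *sigma* = 0.05 are illustrative choices, not from the source:

```python
# Find the smallest N satisfying the VC bound
#   4 * m_H(2N) * exp(-epsilon^2 * N / 8) <= sigma,
# using the polynomial growth-function bound m_H(N) <= N^(k-1) + 1.
from math import exp

def bound(N, epsilon, k):
    m = (2 * N) ** (k - 1) + 1          # growth-function bound at 2N
    return 4 * m * exp(-epsilon**2 * N / 8)

def samples_needed(epsilon, sigma, k):
    N = 1
    while bound(N, epsilon, k) > sigma:
        N += 1
    return N

print(samples_needed(0.1, 0.05, 4))  # finite, though large
```

The answer is large, as the text warns, but finite: the exponential decay in N eventually overwhelms the polynomial growth term.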

We can rearrange terms and gain some further understanding of how accurate we can get with out-of-sample data classified by our learning system. If we make

*epsilon* = root( (8/N) * ln( 4 * m[H](2N) / *sigma* ) )

and call it *Omega*, then the error of our system as applied to out-of-sample data (E[out]) will be:

E[out] <= E[in] + *Omega.*
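To see how the generalization term behaves, a small sketch (same illustrative assumptions as before: polynomial growth bound with break point k = 4, *sigma* = 0.05) that evaluates *Omega* for increasing N:

```python
# Omega = sqrt((8/N) * ln(4 * m_H(2N) / sigma)) shrinks as N grows,
# so the bound E_out <= E_in + Omega tightens with more training data.
from math import log, sqrt

def omega(N, sigma=0.05, k=4):
    m = (2 * N) ** (k - 1) + 1          # growth-function bound at 2N
    return sqrt((8 / N) * log(4 * m / sigma))

for N in (100, 1_000, 10_000, 100_000):
    print(N, round(omega(N), 3))
```

The logarithm grows far more slowly than the 8/N factor shrinks, which is why a polynomial growth function guarantees that *Omega* goes to zero as N increases.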

One may get the impression from this relation between more examples and a lower probability of error that we could simply skip all of the above steps, provide the learning system with as many examples as physically possible, and get the smallest E[out] possible. This, unfortunately, is not correct. At a certain point (depending on the system), providing more training examples yields diminishing returns in the learner’s performance on out-of-sample data. This relationship is represented by the **learning curve**. While a system improves its classification of both in-sample and out-of-sample data over some increasing span of training examples, both accuracies eventually approach a bound. Beyond that point, the computational expense and the expense of acquiring additional data begin to outweigh the advantage gained by learning from those examples.

Generating a large number of hypotheses can also be detrimental. Since we don’t know the target function (which is why we are learning from data in the first place), we need to find and recognize its approximation within our hypothesis set. Of course, we would like our approximation to come as close as possible to the unknown target function, but we can get bogged down in the search if we must keep generating hypotheses to decrease our “distance” from the target function. This tradeoff is called **bias-variance**. As the bias (the “distance” from the true target function) decreases, we must generate and assess a larger, more complex space of hypotheses, whose sensitivity to the particular training sample (the “variance”) grows. The complexity of the model used to generate hypotheses therefore governs how bias and variance balance when attempting to approximate the target function. Thus it can be preferable to choose a hypothesis that less closely approximates the target function if coming closer would be more expensive computationally.
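The tradeoff can be made concrete with a Monte Carlo sketch in the spirit of the book’s sinusoid example; the target f(x) = sin(πx), the two-point samples, and the two competing models (a constant versus a line) are illustrative assumptions, not the source’s exact setup:

```python
# Bias-variance sketch: fit f(x) = sin(pi*x) on [-1, 1] from 2-point
# samples, comparing a constant model h(x) = b with a line through the
# two points. The richer line model has lower bias but higher variance.
import math
import random

random.seed(1)
TRIALS = 5000
GRID = [i / 50 - 1 for i in range(101)]          # evaluation grid on [-1, 1]
f = lambda x: math.sin(math.pi * x)

def fit_const(x1, y1, x2, y2):
    b = (y1 + y2) / 2                            # least-squares constant
    return lambda x: b

def fit_line(x1, y1, x2, y2):
    a = (y2 - y1) / (x2 - x1)                    # line through both points
    return lambda x: a * (x - x1) + y1

def bias_var(fit):
    preds = []
    for _ in range(TRIALS):
        x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
        while abs(x2 - x1) < 1e-6:               # avoid a degenerate sample
            x2 = random.uniform(-1, 1)
        h = fit(x1, f(x1), x2, f(x2))
        preds.append([h(x) for x in GRID])
    gbar = [sum(col) / TRIALS for col in zip(*preds)]   # average hypothesis
    bias = sum((g - f(x)) ** 2 for g, x in zip(gbar, GRID)) / len(GRID)
    var = sum(sum((p - g) ** 2 for p, g in zip(row, gbar)) / len(GRID)
              for row in preds) / TRIALS
    return bias, var

b_c, v_c = bias_var(fit_const)
b_l, v_l = bias_var(fit_line)
print(f"constant: bias={b_c:.2f} var={v_c:.2f}")
print(f"line:     bias={b_l:.2f} var={v_l:.2f}")
```

The line tracks the target more closely on average (lower bias) but swings wildly from sample to sample (higher variance), so on tiny data sets the cruder constant model can yield a lower expected E[out].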

References:

Mitchell, T. *Machine Learning.* McGraw-Hill, San Francisco, 1997, 201-220.

Abu-Mostafa, Y., Lin, H., and Magdon-Ismail, M. *Learning from Data: A Short Course*. AMLbook.com, 2012, 39-68.

Abu-Mostafa, Y. CS 156, “Learning from data.” California Institute of Technology, Lectures 6 – 8. https://www.youtube.com/watch?v=VeKeFIepJBU&list=PL6E95797B0B983ECB