Support Vector Machines

Support Vector Machines are supervised learning models that learn a decision boundary for classifying sets of data. To understand the method employed by Support Vector Machines, one must first understand the category of machine learning tasks into which the SVM falls: supervised learning.

Supervised Learning

In supervised learning, we are given a set of training examples, each consisting of an input vector x and an output y that represents the desired result (the solution for that input instance). Given this data, a supervised learning algorithm can perform two types of tasks: it can classify the data points into a discrete set of classes, or it can perform regression on the data points to predict a continuous value of the solution, y. After training on the given data set, the learning algorithm will have developed a set of parameters that can be used to predict the discrete class or the continuous value of future (and unseen) input instances.
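As a small illustration of the two task types (the data below are hypothetical, included only to fix the notation of input vectors x paired with outputs y):

# Classification: each input vector x is paired with a discrete class label y.
classification_examples = [
    ([1.0, 2.0], +1),
    ([3.0, 0.5], -1),
]

# Regression: each input vector x is paired with a continuous value y.
regression_examples = [
    ([1.0, 2.0], 3.7),
    ([3.0, 0.5], 1.2),
]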

Perceptron

The Support Vector Machine is, at its core, an improved method for learning the same kind of linear classifier as the Perceptron. It is therefore equally useful to understand the details of the Perceptron in order to fully grasp the capabilities of a Support Vector Machine.


The Perceptron was invented in 1957 by Frank Rosenblatt, a psychologist who studied in great detail the mechanisms of the human brain and how it is able to learn. In his research, he tried to simulate those learning mechanisms with machines, and the Perceptron was arguably his most successful attempt.



The Perceptron is a classification method of the form:

	hw(x) = sign(w · x + w0) = { +1 if w · x + w0 ≥ 0,
	                             -1 otherwise
where hw(x) is the hypothesis (a term for the formula that encodes the algorithm’s prediction method), w is a vector of parameters containing a coefficient for each dimension of the input vector x, and w0 is an offset analogous to the y-intercept of a line. In fact, the expression w · x + w0 has the same linear form as y = mx + b, the form used in linear regression; in this case, however, it is multi-dimensional, so it describes a surface (a hyperplane) rather than a line. Since the formula outputs +1 when w · x + w0 is greater than or equal to zero and -1 otherwise, the result is a classifier between two classes. In that way, the formula represents a decision boundary: every instance that falls on one side of the n-dimensional surface belongs to one class, and all other instances belong to the other class.
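As a small illustration, the decision rule above can be written directly in code (a sketch with hypothetical names and parameter values, not taken from the sources cited in this paper):

import numpy as np

def perceptron_predict(w, w0, x):
    # hw(x) = sign(w · x + w0): +1 on or above the hyperplane, -1 below it
    return 1 if np.dot(w, x) + w0 >= 0 else -1

# Example with hypothetical parameters: w = [2, -1], w0 = 0.5, input x = [1, 1.5].
print(perceptron_predict(np.array([2.0, -1.0]), 0.5, np.array([1.0, 1.5])))   # prints 1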

The Perceptron had its own learning algorithm, which was as follows (Precup, 2012):
  1. Initialize w and w0 randomly
  2. While any training examples remain incorrectly classified
    1. Loop through all misclassified examples
    2. For misclassified example i, perform the updates:
      w ← w + γyixi
      w0 ← w0 + γyi
      Where γ is the step-size parameter.
Since yi is always either +1 or -1, each update moves the parameter vector toward the misclassified example when its true class is +1 (the prediction was too low) and away from it when its true class is -1 (the prediction was too high), nudging the decision boundary toward classifying that example correctly.
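A minimal sketch of this learning loop, assuming a NumPy array X of input vectors and an array y of +1/-1 labels (the function name, the default step size, and the epoch cap are my own additions, not part of the original algorithm):

import numpy as np

def train_perceptron(X, y, gamma=0.1, max_epochs=1000):
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])       # 1. initialize w randomly
    w0 = rng.normal()                     #    ... and w0 randomly
    for _ in range(max_epochs):           # cap added: the loop never ends on non-separable data
        made_update = False
        for xi, yi in zip(X, y):          # 2. loop through the training examples
            prediction = 1 if np.dot(w, xi) + w0 >= 0 else -1
            if prediction != yi:          # example i is misclassified
                w = w + gamma * yi * xi   # w  <- w  + γ yi xi
                w0 = w0 + gamma * yi      # w0 <- w0 + γ yi
                made_update = True
        if not made_update:               # no misclassified examples remain
            return w, w0
    return w, w0                          # reached the epoch cap (data may not be separable)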

However, there are distinct drawbacks to this approach. For one, the final decision boundary depends on the order in which the misclassified examples are processed, so the algorithm does not produce a unique solution. Secondly, if the problem is not linearly separable (no decision boundary completely separates the data into two distinct classes), the learning algorithm will never stop; there will always be a non-zero number of misclassified examples.

Support Vector Machine

Because of these drawbacks, a better learning algorithm was needed, and the Support Vector Machine provides just that. The original SVM algorithm was invented by Vladimir N. Vapnik (Wikipedia, 2012). It overcomes the two drawbacks of the Perceptron learning algorithm and also provides one further advantage. To overcome the variability of the final decision boundary, the Support Vector Machine (or SVM) optimizes not only the correct classification of instances but also the margin between the decision boundary and the closest instance of each class. To overcome the limitation of linear separability, the SVM allows for some mistakes in the classification.

The SVM’s main goal is to maximize the size of the margin. To do this, it relates the margin of each instance to the norm of the parameter vector w, which is the normal of the decision boundary; through some algebra (the primal form derivation on Wikipedia), it turns out that maximizing the margin is equivalent to minimizing the norm of w. It is also mathematically equivalent, and more convenient for optimization methods, to minimize the square of the norm, ‖w‖². However, while maximizing the margin, one must also enforce the correct classification of instances, so Lagrange multipliers are used to combine these two requirements into a single optimization problem. The resulting dual form reveals that the solution is a linear combination of training instances. These instances are special because they are exactly the instances that lie on the edge of the margin; they were therefore coined “support vectors”, since they are the points that define the optimal decision boundary. This also solves the problem of non-unique solutions: the maximum-margin decision boundary is, by construction, unique.
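Written out concretely (this is the standard hard-margin formulation, restated here in the notation above rather than quoted from the sources), the optimization problem and its dual form are:

	Primal:  minimize (1/2)‖w‖² over w and w0,
	         subject to yi (w · xi + w0) ≥ 1 for every training instance (xi, yi).

	Dual:    maximize Σi αi − (1/2) Σi Σj αi αj yi yj (xi · xj) over the multipliers αi,
	         subject to αi ≥ 0 and Σi αi yi = 0,

	with the solution w = Σi αi yi xi, where the instances with αi > 0 are the support vectors.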

One other improvement over the Perceptron algorithm that the SVM learning algorithm provides is the possibility of solving non-linearly separable problems. The result of the Lagrangian optimization has the property that the solution depends on the training data only through dot products between pairs of instances. Because of this representation, one can describe the optimization in terms of a Kernel. A Kernel is an abstraction of the data: it computes the dot product of two instances in some (possibly non-linear) transformed representation of the data, without ever constructing that representation explicitly. To utilize a Kernel, one first creates a Kernel matrix, K. A Kernel matrix is a square matrix in which every entry is the Kernel value of the two instances indexed by the row and the column of that entry. Furthermore, for the abstraction to be a valid Kernel, the matrix K must have the following properties: (1) K is symmetric, and (2) K is positive semidefinite. After verifying that the abstraction of the data is indeed a Kernel, one can apply the SVM to the Kernel by using the rows of the Kernel matrix in place of the original training instances (equivalently, by replacing every dot product in the optimization with the corresponding entry of K). In this way, linear techniques are applied to a representation that has abstracted away the non-linearity of the data through the dot product, which makes it possible to solve classification problems that are not linearly separable.

SVMs have seen many successful applications, for example in text classification (Joachims, 1998), object detection (Osuna, Freund and Girosi, 1997), object recognition (Pontil and Verri, 1998), and bioinformatics (Lee et al., 2002).
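As a sketch of the Kernel matrix construction described above (using an RBF kernel as an assumed example; the function names, the kernel parameter, and the tolerance are my own choices):

import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    # k(a, b) = exp(-γ ‖a − b‖²): the dot product of a and b in an implicit feature space
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_matrix(X, kernel):
    # K[i, j] is the Kernel value of the instances indexed by row i and column j
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

# Hypothetical training inputs.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = kernel_matrix(X, rbf_kernel)

# Check the two Kernel properties from the text.
is_symmetric = np.allclose(K, K.T)
eigenvalues = np.linalg.eigvalsh(K)
is_psd = np.all(eigenvalues >= -1e-10)   # small tolerance for floating-point error
print(is_symmetric, is_psd)              # an RBF kernel matrix satisfies both

An SVM implementation that accepts a precomputed kernel could then be trained directly on K, for example by passing its rows as the training instances, as described above.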


Bibliography