Support Vector Machines (SVM) are supervised learning methods primarily used for classification problems. Their purpose is to build a model from a set of data points and their class labels that accurately predicts the class of new data points. SVMs are versatile and are applied in areas such as text and image classification.
For an introduction to the topic of classification, the blog post “Classification in Machine Learning” by Sebastian Müller is recommended. Although SVMs can handle data in any dimensional space, the following examples are limited to 2-dimensional or 3-dimensional data for simplified visualization.
How they work
The basic principle of SVMs follows a surprisingly simple approach: the goal is to find a plane, the so-called separating hyperplane, that separates data points according to their class. In the following 2-dimensional example (Fig. 1), the hyperplane is simply a line that should separate the red and blue points from each other. To determine the class of a new data point, it suffices to check on which side of the hyperplane the point lies. Since there are theoretically infinitely many possible hyperplanes, the question arises of how exactly this plane should be chosen.
To best separate the classes, SVMs select the hyperplane that maximizes the distances to the closest points from each class. In other words, the optimal hyperplane is located in the center of a maximally wide strip, called the margin, that separates the two classes. This is thus a classic optimization problem that can be solved using standard methods, such as gradient descent.
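To make this concrete, the following sketch fits a linear SVM with scikit-learn on a handful of invented 2-dimensional points and uses the sign of the decision function to tell on which side of the hyperplane a new point lies; the data and parameter values are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two small clusters of 2-dimensional points (think red vs. blue in Fig. 1).
X = np.array([[1.0, 1.5], [1.5, 1.0], [2.0, 2.0],   # class 0
              [4.0, 4.5], [4.5, 5.0], [5.0, 4.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM searches for the maximum-margin separating hyperplane.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The hyperplane is described by a weight vector w and intercept b (w . x + b = 0).
# The sign of the decision function tells us on which side of the hyperplane
# a new point lies and hence which class it is assigned to.
print("w:", clf.coef_, "b:", clf.intercept_)
print("decision value:", clf.decision_function([[3.0, 3.0]]))
print("predicted class:", clf.predict([[3.0, 3.0]]))
```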
Soft Margin SVM
In most cases, however, data is not linearly separable as assumed so far. That is, due to unfavorable data distributions, there may be no hyperplane that perfectly separates the classes. For such cases, the SVM principle can be extended. The so-called Soft Margin SVM allows a certain number of errors in the classification of the training data: data points are permitted to fall within the margin or even on the wrong side of the hyperplane. Instead of simply finding a maximally wide margin, Soft Margin SVMs also choose the class boundary such that the sum of the distances of all points lying inside or beyond the margin is minimized (Fig. 3). Soft Margin SVMs therefore pursue two objectives: maximizing the margin while minimizing the error due to misclassifications. The trade-off between the two goals is controlled by a parameter, usually called C. Small values of C tend to enlarge the margin and allow more classification errors on the training examples. Typically, finding an appropriate value for C is part of the training process.
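The following sketch shows one common way of tuning C with cross-validation in scikit-learn; the synthetic data and the candidate values for C are assumptions for illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Two overlapping clusters, so no hyperplane separates them perfectly.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

# Small C: wide margin, more tolerated training errors.
# Large C: narrow margin, fewer tolerated training errors.
search = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},  # candidate values (illustrative)
    cv=5,
)
search.fit(X, y)
print("best C found by cross-validation:", search.best_params_["C"])
```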
Kernel trick
Unfortunately, Soft Margin SVMs do not solve all problems. To handle challenging cases, like the one shown in the following example (Fig. 4), a transformation is needed to make the data linearly separable.
The solution idea involves transforming the points into a higher-dimensional space where they can be separated by a hyperplane. The following illustration shows what this might look like for the example (Fig. 5). The originally 2-dimensional points are each extended with a third dimension. In this new 3-dimensional space, the separating hyperplane corresponds to a 2-dimensional plane, which now clearly separates the two classes.
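As a rough illustration of such an explicit lift, the sketch below applies a hand-picked feature map (adding the squared distance from the origin as a third coordinate) to a concentric-circles dataset similar to Fig. 4; this particular map and the parameter values are assumptions, not the only possible choice.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles, similar to the situation in Fig. 4: not linearly
# separable in the original 2-dimensional space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Hand-picked feature map phi(x1, x2) = (x1, x2, x1^2 + x2^2): the inner
# circle ends up low along the new axis, the outer ring high, so a flat
# 2-dimensional plane can separate the classes in the 3-dimensional space.
X3 = np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])

clf = SVC(kernel="linear", C=1.0).fit(X3, y)
print("training accuracy after the explicit lift:", clf.score(X3, y))
```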
However, this transformation is not free. On the contrary, especially for high-dimensional data, finding a suitable transformation generally requires a prohibitive amount of computation. So, how do we obtain an appropriate transformation instead? The answer is that it is not actually needed explicitly. Instead, the problem can be reformulated in a clever way. It suffices to calculate pairwise similarities between data points using a kernel function. With this so-called kernel trick, the need for an explicit, costly transformation of data points into a higher-dimensional space is completely avoided.
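For comparison, the following sketch handles the same kind of data with scikit-learn's RBF kernel, which works only with pairwise similarities and thus avoids any explicit transformation; the parameter values are again illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# The same kind of circular data as before.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# With kernel="rbf" the SVM relies only on the kernel function
# k(x, x') = exp(-gamma * ||x - x'||^2) evaluated on pairs of points;
# no explicit higher-dimensional feature vectors are ever computed.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy with the RBF kernel:", clf.score(X, y))
```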