What are Support Vector Machines and how do they work?


Support Vector Machines (SVMs) are supervised learning methods primarily used for classification problems. Their purpose is to build, from a set of data points and their class labels, a model that accurately determines the class of new data points. SVMs are versatile and are applied in areas such as text and image classification.

For an introduction to the topic of classification, the blog post “Classification in Machine Learning” by Sebastian Müller is recommended. Although SVMs can handle data of arbitrary dimensionality, the following examples are limited to 2- or 3-dimensional data for easier visualization.

How they work

The basic principle of SVMs follows a surprisingly simple approach: it aims to find a plane, known as the separating hyperplane, that separates data points according to their class affiliation. In the following 2-dimensional example (Fig. 1), the hyperplane is simply a line that should separate the red and blue points from each other. To determine the class of a new data point, it is sufficient to check on which side of the hyperplane the point lies. Since there are, in theory, infinitely many possible hyperplanes, the question arises of how exactly this plane should be chosen.

Figure 1: Hyperplanes $h_1$ and $h_2$ plausibly classify the new point (shown in black) as blue, whereas $h_3$ assigns it to the red class.
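As a minimal sketch of this decision rule (the normal vector w and offset b below are made-up values, not taken from the figure), checking on which side of the hyperplane a point lies amounts to evaluating the sign of a linear function:

```python
import numpy as np

# A hyperplane is described by a normal vector w and an offset b:
# all points x with np.dot(w, x) + b == 0 lie exactly on the hyperplane.
# The values below are illustrative and not taken from the figure.
w = np.array([1.0, -1.0])
b = 0.5

def classify(x):
    """Return a class label depending on which side of the hyperplane x lies."""
    return "blue" if np.dot(w, x) + b >= 0 else "red"

print(classify(np.array([3.0, 1.0])))   # positive side of the hyperplane
print(classify(np.array([-2.0, 4.0])))  # negative side of the hyperplane
```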

To best separate the classes, SVMs select the hyperplane that maximizes the distance to the closest points of each class. In other words, the optimal hyperplane lies in the center of a maximally wide strip, called the margin, that separates the two classes. Finding this hyperplane is thus a classic optimization problem that can be solved using standard methods, such as gradient descent.

Figure 2: The optimal hyperplane chosen by the SVM maximizes the distance to the two class boundaries (margin).
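In practice, libraries solve this optimization problem for us. A minimal sketch with scikit-learn (one possible implementation; the toy data below is invented) fits a nearly hard-margin linear SVM and reads off the resulting hyperplane:

```python
import numpy as np
from sklearn.svm import SVC

# Small, invented, linearly separable toy data set in 2D.
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard-margin SVM described above.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("normal vector w:", clf.coef_[0])          # orientation of the hyperplane
print("offset b:", clf.intercept_[0])            # position of the hyperplane
print("support vectors:", clf.support_vectors_)  # points that define the margin
print("prediction for (4, 4):", clf.predict([[4.0, 4.0]]))
```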

Soft Margin SVM

In most cases, however, data is not linearly separable as assumed so far. That is, due to unfavorable data distributions, there may not exist a hyperplane that perfectly separates the classes. For such cases, the SVM principle can be extended. The so-called Soft Margin SVM allows a certain degree of error in the classification of the training data: data points are permitted to fall within the margin or even on the wrong side of the hyperplane. Instead of simply finding a maximally wide margin, Soft Margin SVMs also choose the class boundaries such that the sum of the distances of all points lying in or beyond the margin is minimized (Fig. 3). Soft Margin SVMs therefore pursue two objectives: maximizing the margin while minimizing the error due to misclassifications. The trade-off between the two goals is controlled by a parameter (usually called C). Small values of C tend to enlarge the margin and allow more classification errors on the training examples. Typically, an appropriate value for C is found as part of model selection, for example via cross-validation.

Figure 3: To allow for a certain amount of error, Soft Margin SVMs try to minimize the distances to misclassified or in-margin points.
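The effect of C can be sketched with scikit-learn on invented, overlapping data (the exact numbers depend on the random sample, but smaller C typically means a wider margin, more support vectors, and more tolerated training errors):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping, randomly generated clusters: no perfect separation exists.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wide margin, many margin violations tolerated.
    # Large C: narrow margin, violations are penalized heavily.
    print(f"C={C:>6}: training accuracy={clf.score(X, y):.2f}, "
          f"support vectors={len(clf.support_vectors_)}")
```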

Kernel trick

Unfortunately, Soft Margin SVMs do not solve all problems. To handle challenging cases, like the one shown in the following example (Fig. 4), a transformation is needed to make the data linearly separable.

Figure 4: Example of a set of points that can hardly be satisfactorily separated in 2-dimensional space (even using the Soft Margin SVM).

The idea behind the solution is to transform the points into a higher-dimensional space in which they can be separated by a hyperplane. The following illustration shows what this might look like for the example (Fig. 5). The originally 2-dimensional points are each extended by a third dimension. In this new 3-dimensional space, the separating hyperplane corresponds to a 2-dimensional plane, which now clearly separates the two classes.

Figure 5: After the transformation into a higher-dimensional space (the idea underlying the kernel trick), the points can be separated from each other.

However, this transformation is not free. On the contrary, especially for high-dimensional data, explicitly computing such a transformation and working in the resulting space generally requires a prohibitive amount of computation. So how do we obtain a suitable transformation instead? The answer is that it is not actually needed explicitly. Instead, the problem can be reformulated in a clever way: it suffices to compute pairwise similarities between data points using a kernel function. With this so-called kernel trick, the explicit, costly transformation of the data points into a higher-dimensional space is avoided entirely.
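As a sketch of what this looks like in code (again with scikit-learn on generated data; the RBF kernel is just one common choice of kernel function), the same data can be classified without any explicit transformation:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# The same kind of generated "ring" data as above, kept in its original 2D form.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# An RBF (Gaussian) kernel supplies the pairwise similarities k(x, x') directly,
# so no explicit higher-dimensional feature map has to be computed.
clf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)
print("training accuracy with the kernel trick:", clf.score(X, y))
```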

Related articles on the topic of SVM:

The intelligent shelf: The path to intelligent human-machine interfaces in industry

Computer vision: How do machines learn to see?

Better assessment of disease progression with ranking SVM

Till Schulz

Till Schulz is a research associate at the Lamarr site of the University of Bonn. He works on learning methods on graph data, in particular the classification of graphs.
