SVM fundamentals

Let $\bx \in \real^d$ be a vector representing, for example, an image, an audio track, or a fragment of text. Our goal is to design a classifier*, i.e. a function that associates to each vector $\bx$ a positive or negative label based on a desired criterion, for example the fact that the image contains or not a cat, that the audio track contains or not English speech, or that the text is or not a scientific paper.

The vector $\bx$ is classified by looking at the sign of a linear scoring function $\langle \bx, \bw \rangle$. The goal of learning is to estimate the parameter $\bw \in \real^d$ in such a way that the score is positive if the vector $\bx$ belongs to the positive class and negative otherwise. In fact, in the standard SVM formulation the the goal is to have a score of at least 1 in the first case, and of at most -1* in the second one, imposing a margin.

The parameter $\bw$ is estimated or learned by fitting the scoring function to a training set of $n$ example pairs $(\bx_i,y_i), i=1,\dots,n$. Here $y_i \in \{-1,1\}$ are the ground truth labels of the corresponding example vectors. The fit quality is measured by a loss function* which, in standard SVMs, is the hinge loss:

$\ell_i(\langle \bw,\bx\rangle) = \max\{0, 1 - y_i \langle \bw,\bx\rangle\}.$

Note that the hinge loss is zero only if the score $\langle \bw,\bx\rangle$ is at least 1 or at most -1, depending on the label $y_i$.

Fitting the training data is usually insufficient. In order for the scoring function generalize to future data as well, it is usually preferable to trade off the fitting accuracy with the regularity of the learned scoring function $\langle \bx, \bw \rangle$. Regularity in the standard formulation is measured by the norm of the parameter vector $\|\bw\|^2$ (see Advanced SVM topics). Averaging the loss on all training samples and adding to it the regularizer weighed by a parameter $\lambda$ yields the regularized loss objective

$$\boxed{\displaystyle E(\bw) = \frac{\lambda}{2} \left\| \bw \right\|^2 + \frac{1}{n} \sum_{i=1}^n \max\{0, 1 - y_i \langle \bw,\bx\rangle\}. \label{e:svm-primal-hinge} }$$

Note that this objective function is convex, so that there exists a single global optimum.

The scoring function $\langle \bx, \bw \rangle$ considered so far has been linear and unbiased. Adding a bias discusses how a bias term can be added to the SVM and Non-linear SVMs and feature maps shows how non-linear SVMs can be reduced to the linear case by computing suitable feature maps.

Learning shows how VLFeat can be used to learn an SVM by minimizing $E(\bw)$.

# Learning

Learning an SVM amounts to finding the minimizer $\bw^*$ of the cost function $E(\bw)$. While there are dozens of methods that can be used to do so, VLFeat implements two large scale methods, designed to work with linear SVMs (see Non-linear SVMs and feature maps to go beyond linear):

Using these solvers is exemplified in Getting started.

It is common to add to the SVM scoring function a bias term $b$, and to consider the score $\langle \bx,\bw \rangle + b$. In practice the bias term can be crucial to fit the training data optimally, as there is no reason why the inner products $\langle \bx,\bw \rangle$ should be naturally centered at zero.

Some SVM learning algorithms can estimate both $\bw$ and $b$ directly. However, other algorithms such as SGD and SDCA cannot. In this case, a simple workaround is to add a constant component $B > 0$ (we call this constant the bias multiplier) to the data, i.e. consider the extended data vectors:

$\bar \bx = \begin{bmatrix} \bx \\ B \end{bmatrix}, \quad \bar \bw = \begin{bmatrix} \bw \\ w_b \end{bmatrix}.$

In this manner the scoring function incorporates implicitly a bias $b = B w_b$:

$\langle \bar\bx, \bar\bw \rangle = \langle \bx, \bw \rangle + B w_b.$

The disadvantage of this reduction is that the term $w_b^2$ becomes part of the SVM regularizer, which shrinks the bias $b$ towards zero. This effect can be alleviated by making $B$ sufficiently large, because in this case $\|\bw\|^2 \gg w_b^2$ and the shrinking effect is negligible.

Unfortunately, making $B$ too large makes the problem numerically unbalanced, so a reasonable trade-off between shrinkage and stability is generally sought. Typically, a good trade-off is obtained by normalizing the data to have unitary Euclidean norm and then choosing $B \in [1, 10]$.

Specific implementations of SGD and SDCA may provide explicit support to learn the bias in this manner, but it is important to understand the implications on speed and accuracy of the learning if this is done.

# Non-linear SVMs and feature maps

So far only linear scoring function $\langle \bx,\bw \rangle$ have been considered. Implicitly, however, this assumes that the objects to be classified (e.g. images) have been encoded as vectors $\bx$ in a way that makes linear classification possible. This encoding step can be made explicit by introducing the feature map $\Phi(\bx) \in \real^d$. Including the feature map yields a scoring function non-linear* in $\bx$:

$\bx\in\mathcal{X} \quad\longrightarrow\quad \langle \Phi(\bx), \bw \rangle.$

The nature of the input space $\mathcal{X}$ can be arbitrary and might not have a vector space structure at all.

The representation or encoding captures a notion of similarity between objects: if two vectors $\Phi(\bx_1)$ and $\Phi(\bx_2)$ are similar, then their scores will also be similar. Note that choosing a feature map amounts to incorporating this information in the model prior* to learning.

The relation of feature maps to similarity functions is formalized by the notion of a kernel, a positive definite function $K(\bx,\bx')$ measuring the similarity of a pair of objects. A feature map defines a kernel by

$K(\bx,\bx') = \langle \Phi(\bx),\Phi(\bx') \rangle.$

Viceversa, any kernel function can be represented by a feature map in this manner, establishing an equivalence.

So far, all solvers in VLFeat assume that the feature map $\Psi(\bx)$ can be explicitly computed. Although classically kernels were introduced to generalize solvers to non-linear SVMs for which a feature map cannot be computed (e.g. for a Gaussian kernel the feature map is infinite dimensional), in practice using explicit feature representations allow to use much faster solvers, so it makes sense to reverse this process.