# Loss functions

The SVM formulation given in SVM fundamentals uses the hinge loss, which is only one of a variety of loss functions that are often used for SVMs. More in general, one can consider the objective

$$E(\bw) = \frac{\lambda}{2} \left\| \bw \right\|^2 + \frac{1}{n} \sum_{i=1}^n \ell_i(\langle \bw,\bx\rangle). \label{e:svm-primal}$$

where the loss $\ell_i(z)$ is a convex function of the scalar variable $z$. Losses differ by: (i) their purpose (some are suitable for classification, other for regression), (ii) their smoothness (which usually affects how quickly the SVM objective function can be minimized), and (iii) their statistical interpretation (for example the logistic loss can be used to learn logistic models).

Concrete examples are the:

 Name Loss $\ell_i(z)$ Description Hinge $\max\{0, 1-y_i z\}$ The standard SVM loss function. Square hinge $\max\{0, 1-y_i z\}^2$ The standard SVM loss function, but squared. This version is smoother and may yield numerically easier problems. Square or l2 $(y_i - z)^2$ This loss yields the ridge regression model (l2 regularised least square). Linear or l1 $|y_i - z|$ Another loss suitable for regression, usually more robust but harder to optimize than the squared one. Insensitive l1 $\max\{0, |y_i - z| - \epsilon\}$. This is a variant of the previous loss, proposed in the original Support Vector Regression formulation. Differently from the previous two losses, the insensitivity may yield to a sparse selection of support vectors. Logistic $\log(1 + e^{-y_i z})$ This corresponds to regularized logisitc regression. The loss can be seen as a negative log-likelihood: $\ell_i(z) = -\log P[y_i | z] = - \log \sigma(y_iz/2)$, where $\sigma(z) = e^z/(1 + e^z)$ is the sigmoid function, mapping a score $z$ to a probability. The $1/2$ factor in the sigmoid is due to the fact that labels are in $\{-1,1\}$ rather than $\{0,1\}$ as more common for the standard sigmoid model.

# Data abstraction: working with compressed data

VLFeat learning algorithms (SGD and SDCA) access the data by means of only two operations:

• inner product: computing the inner product between the model and a data vector, i.e. $\langle \bw, \bx \rangle$.
• accumulation: summing a data vector to the model, i.e. $\bw \leftarrow \bw + \beta \bx$.

VLFeat learning algorithms are parameterized in these two operations. As a consequence, the data can be stored in any format suitable to the user (e.g. dense matrices, sparse matrices, block-sparse matrices, disk caches, and so on) provided that these two operations can be implemented efficiently. Differently from the data, however, the model vector $\bw$ is represented simply as a dense array of doubles. This choice is adequate in almost any case.

A particularly useful aspect of this design choice is that the training data can be store in compressed format (for example by using product quantization (PQ)). Furthermore, higher-dimensional encodings such as the homogeneous kernel map (Homogeneous kernel map) and the intersection kernel map can be computed on the fly. Such techniques are very important when dealing with GBs of data.

# Dual problem

In optimization, the dual objective $D(\balpha)$ of the SVM objective $E(\bw)$ is of great interest. To obtain the dual objective, one starts by approximating each loss term from below by a family of planes:

$\ell_i(z) = \sup_{u} (u z - \ell_i^*(u) ), \qquad \ell_i^*(u) = \sup_{z} (z u - \ell_i(z) )$

where $\ell_i^*(u)$ is the dual conjugate of the loss and gives the intercept of each approximating plane as a function of the slope. When the loss function is convex, the approximation is in fact exact. Examples include:

 Name Loss $\ell_i(z)$ Conjugate loss $\ell_i^*(u)$ Hinge $\max\{0, 1-y_i z\}$ $\ell_i^*(u) = \begin{cases} y_i u, & -1 \leq y_i u \leq 0, \\ +\infty, & \text{otherwise} \end{cases}$ Square hinge $\max\{0, 1-y_i z\}^2$ $\ell_i^*(u) = \begin{cases} y_i u + \frac{u^2}{4}, & y_i u \leq 0, \\ +\infty, & \text{otherwise} \\ \end{cases}$ Linear or l1 $|y_i - z|$ $\ell_i^*(u) = \begin{cases} y_i u, & -1 \leq y_i u \leq 1, \\ +\infty, & \text{otherwise} \\ \end{cases}$ Square or l2 $(y_i - z)^2$ $\ell_i^*(u)=y_iu + \frac{u^2}{4}$ Insensitive l1 $\max\{0, |y_i - z| - \epsilon\}$. Logistic $\log(1 + e^{-y_i z})$ $\ell_i^*(u) = \begin{cases} (1+u) \log(1+u) - u \log(-u), & -1 \leq y_i u \leq 0, \\ +\infty, & \text{otherwise} \\ \end{cases}$

Since each plane $- z \alpha_i - \ell^*_i(-\alpha_i) \leq \ell_i(z)$ bounds the loss from below, by substituting in $E(\bw)$ one can write a lower bound for the SVM objective

$F(\bw,\balpha) = \frac{\lambda}{2} \|\bw\|^2 - \frac{1}{n}\sum_{i=1}^n (\bw^\top \bx_i\alpha_i + \ell_i^*(-\alpha_i)) \leq E(\bw).$

for each setting of the dual variables $\alpha_i$. The dual objective function $D(\balpha)$ is obtained by minimizing the lower bound $F(\bw,\balpha)$ w.r.t. to $\bw$:

$D(\balpha) = \inf_{\bw} F(\bw,\balpha) \leq E(\bw).$

The minimizer and the dual objective are now easy to find:

$\boxed{\displaystyle \bw(\balpha) = \frac{1}{\lambda n} \sum_{i=1}^n \bx_i \alpha_i = \frac{1}{\lambda n} X\balpha, \quad D(\balpha) = - \frac{1}{2\lambda n^2} \balpha^\top X^\top X \balpha + \frac{1}{n} \sum_{i=1}^n - \ell_i^*(-\alpha_i) }$

where $X = [\bx_1, \dots, \bx_n]$ is the data matrix. Since the dual is uniformly smaller than the primal, one has the duality gap bound:

$D(\balpha) \leq P(\bw^*) \leq P(\bw(\balpha))$

This bound can be used to evaluate how far off $\bw(\balpha)$ is from the primal minimizer $\bw^*$. In fact, due to convexity, this bound can be shown to be zero when $\balpha^*$ is the dual maximizer (strong duality):

$D(\balpha^*) = P(\bw^*) = P(\bw(\balpha^*)), \quad \bw^* = \bw(\balpha^*).$

# Parametrization in C

Often a slightly different form of the SVM objective is considered, where a parameter $C$ is used to scale the loss instead of the regularizer:

$E_C(\bw) = \frac{1}{2} \|\bw\|^2 + C \sum_{i=1}^n \ell_i(\langle \bx_i, \bw\rangle)$

This and the objective function $E(\bw)$ in $\lambda$ are equivalent (proportional) if

$\lambda = \frac{1}{nC}, \qquad C = \frac{1}{n\lambda}.$

up to an overall scaling factor to the problem.