### Table of Contents

This page describes the *Stochastic Gradient Descent* (SGD) linear SVM solver. SGD minimizes directly the primal SVM objective (see Support Vector Machines (SVM)):

\[ E(\bw) = \frac{\lambda}{2} \left\| \bw \right\|^2 + \frac{1}{n} \sum_{i=1}^n \ell_i(\langle \bw,\bx\rangle) \]

Firts, rewrite the objective as the average

\[ E(\bw) = \frac{1}{n} \sum_{i=1}^n E_i(\bw), \quad E_i(\bw) = \frac{\lambda}{2} \left\| \bw \right\|^2 + \ell_i(\langle \bw,\bx\rangle). \]

Then SGD performs gradient steps by considering at each iteration one term \(E_i(\bw)\) selected at random from this average. In its most basic form, the algorithm is:

- Start with \(\bw_0 = 0\).
- For \(t=1,2,\dots T\):
- Sample one index \(i\) in \(1,\dots,n\) uniformly at random.
- Compute a subgradient \(\bg_t\) of \(E_i(\bw)\) at \(\bw_t\).
- Compute the learning rate \(\eta_t\).
- Update \(\bw_{t+1} = \bw_t - \eta_t \bg_t\).

Provided that the learning rate \(\eta_t\) is chosen correctly, this simple algorithm is guaranteed to converge to the minimizer \(\bw^*\) of \(E\).

# Convergence and speed

The goal of the SGD algorithm is to bring the *primal suboptimality* below a threshold \(\epsilon_P\):

\[ E(\bw_t) - E(\bw^*) \leq \epsilon_P. \]

If the learning rate \(\eta_t\) is selected appropriately, SGD can be shown to converge properly. For example, [26] show that, since \(E(\bw)\) is \(\lambda\)-strongly convex, then using the learning rate

\[ \boxed{\eta_t = \frac{1}{\lambda t}} \]

guarantees that the algorithm reaches primal-suboptimality \(\epsilon_P\) in

\[ \tilde O\left( \frac{1}{\lambda \epsilon_P} \right). \]

iterations. This particular SGD variant is sometimes known as PEGASOS [26] and is the version implemented in VLFeat.

The *convergence speed* is not sufficient to tell the *learning speed*, i.e. how quickly an algorithm can learn an SVM that performs optimally on the test set. The following two observations can be used to link convergence speed to learning speed:

- The regularizer strength is often heuristically selected to be inversely proportional to the number of training samples: \(\lambda = \lambda_0 /n\). This reflects the fact that with more training data the prior should count less.
- The primal suboptimality \(\epsilon_P\) should be about the same as the estimation error of the SVM primal. This estimation error is due to the finite training set size and can be shown to be of the order of \(1/\lambda n = 1 / \lambda_0\).

Under these two assumptions, PEGASOS can learn a linear SVM in time \(\tilde O(n)\), which is *linear in the number of training examples*. This fares much better with \(O(n^2)\) or worse of non-linear SVM solvers.

# The bias term

Adding a bias \(b\) to the SVM scoring function \(\langle \bw, \bx \rangle +b\) is done, as explained in Adding a bias, by appending a constant feature \(B\) (the *bias multiplier*) to the data vectors \(\bx\) and a corresponding weight element \(w_b\) to the weight vector \(\bw\), so that \(b = B w_b\) As noted, the bias multiplier should be relatively large in order to avoid shrinking the bias towards zero, but small to make the optimization stable. In particular, setting \(B\) to zero learns an unbiased SVM (vl_svm_set_bias_multiplier).

To counter instability caused by a large bias multiplier, the learning rate of the bias is slowed down by multiplying the overall learning rate \(\eta_t\) by a bias-specific rate coefficient (vl_svm_set_bias_learning_rate).

As a rule of thumb, if the data vectors \(\bx\) are \(l^2\) normalized (as they typically should for optimal performance), then a reasonable bias multiplier is in the range 1 to 10 and a reasonable bias learning rate is somewhere in the range of the inverse of that (in this manner the two parts of the extended feature vector \((\bx, B)\) are balanced).

# Adjusting the learning rate

Initially, the learning rate \(\eta_t = 1/\lambda t\) is usually too fast: as usually \(\lambda \ll 1\), \(\eta_1 \gg 1\). But this is clearly excessive (for example, without a loss term, the best learning rate at the first iteration is simply \(\eta_1=1\), as this nails the optimum in one step). Thus, the learning rate formula is modified to be \(\eta_t = 1 / \lambda (t + t_0)\), where \(t_0 \approx 2/\lambda\), which is equivalent to start \(t_0\) iterations later. In this manner \(\eta_1 \approx 1/2\).

## Warm start

Starting from a given model \(\bw\) is easy in SGD as the optimization runs in the primal. However, the starting iteration index \(t\) should also be advanced for a warm start, as otherwise the initial setting of \(\bw\) is rapidly forgot (vl_svm_set_model, vl_svm_set_bias, vl_svm_set_iteration_number).

# Implementation details

- Random sampling of points

Rather than visiting points completely at random, VLFeat SDCA follows the best practice of visiting all the points at every epoch (pass through the data), changing the order of the visit randomly by picking every time a new random permutation.

- Factored representation

At each iteration, the SGD algorithm updates the vector \(\bw\) (including the additional bias component \(w_b\)) as \(\bw_{t+1} \leftarrow \bw_t - \lambda \eta_t \bw_t - \eta_t \bg_t\), where \(\eta_t\) is the learning rate. If the subgradient of the loss function \(\bg_t\) is zero at a given iteration, this amounts to simply shrink \(\bw\) towards the origin by multiplying it by the factor \(1 - \lambda \eta_t\). Thus such an iteration can be accelerated significantly by representing internally \(\bw_t = f_t \bu_t\), where \(f_t\) is a scaling factor. Then, the update becomes

\[ f_{t+1} \bu_{t+1} = f_{t} \bu_{t} - \lambda \eta_t f_{t} \bu_{t} - \eta_t \bg_t = (1-\lambda \eta_t) f_{t} \bu_{t} - \eta_t \bg_t. \]

Setting \(f_{t+1} = (1-\lambda \eta_t) f_{t}\), this gives the update equation for \(\bu_t\)

\[ \bu_{t+1} = \bu_{t} - \frac{\eta_t}{f_{t+1}} \bg_t. \]

but this step can be skipped whenever \(\bg_t\) is equal to zero.

When the bias component has a different learning rate, this scheme must be adjusted slightly by adding a separated factor for the bias, but it is otherwise identical.