Derive hinge loss from SVM

What is hinge loss

The hinge loss is a loss function used to train machine learning classifiers. It is defined as

$L(\hat{y}) = \max(0,\, 1 - y\hat{y})$      (1)

where $y \in \{-1, 1\}$ indicates the two classes and $\hat{y}$ is the output of our classifier.
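Equation (1) is easy to write down directly. A minimal sketch in NumPy (the example inputs are illustrative):

```python
import numpy as np

def hinge_loss(y, y_hat):
    """Hinge loss from Eq. (1): max(0, 1 - y * y_hat)."""
    return np.maximum(0.0, 1.0 - y * y_hat)

# A correctly classified point beyond the margin incurs zero loss;
# a point inside the margin or misclassified is penalized linearly.
print(hinge_loss(1, 2.0))   # → 0.0
print(hinge_loss(1, 0.5))   # → 0.5
print(hinge_loss(-1, 0.5))  # → 1.5
```

Note that the loss is nonzero even for some correctly classified points ($0 < y\hat{y} < 1$), which is exactly what pushes the classifier toward a larger margin.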

However, the SVM formulation I know looks like

$\min \frac{1}{2}\parallel W \parallel^{2} + C\sum^{N}_{i=1}\xi_{i}$      (2)

s.t.    $\xi_{i} \geqslant 0,\; y_{i}(x^{T}_{i}W + b) \geqslant 1-\xi_{i} \quad \forall i$

So what is the relation between the two? Are they just two perspectives to look at the same model?

How to derive it from SVM

To derive the hinge loss, we will start from (2).

The constraints in (2) are

$\xi_{i} \geqslant 0,\; y_{i}(x^{T}_{i}W + b) \geqslant 1-\xi_{i} \quad \forall i$      (3)

They can be rewritten as

$\xi_{i} \geqslant 0,\; \xi_{i} \geqslant 1-y_{i}(x^{T}_{i}W + b)$      (4)

Since the objective in (2) is increasing in each $\xi_{i}$, at the optimum every slack takes its smallest feasible value:

$\xi_{i} = \max(0,\,1-y_{i}(x^{T}_{i}W + b))$      (5)

which matches the definition of the hinge loss in (1). Thus, we can eliminate the constraints by substituting (5) for $\xi_{i}$ in (2), which gives

$\min \frac{1}{2}\parallel W \parallel^{2} + C\sum^{N}_{i=1}\max(0,\,1-y_{i}(x^{T}_{i}W + b))$      (6)
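The key step from (3)–(4) to (5) can be checked numerically: for a fixed margin $y_{i}(x^{T}_{i}W + b)$, the smallest $\xi_{i}$ satisfying both inequalities is exactly $\max(0, 1 - \text{margin})$. A small sketch (the margin values and grid are illustrative):

```python
import numpy as np

def smallest_feasible_xi(margin, grid=np.linspace(0, 5, 500001)):
    # Keep only slack values satisfying both constraints in (4),
    # then take the smallest one: the objective (2) increases with xi,
    # so this is what the minimizer would choose.
    feasible = grid[(grid >= 0) & (grid >= 1 - margin)]
    return feasible[0]

for margin in [-0.5, 0.2, 1.0, 2.0]:
    assert np.isclose(smallest_feasible_xi(margin),
                      max(0.0, 1.0 - margin), atol=1e-5)
print("slack formula (5) verified")
```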

We can further scale the whole objective by $\lambda=\frac{1}{NC}$, where $N$ is the number of samples (scaling does not change the minimizer); then we get

$\min \frac{1}{N}\sum^{N}_{i=1}\max(0,\,1-y_{i}(x^{T}_{i}W + b)) + \frac{\lambda}{2}\parallel W \parallel^{2}$      (7)

We can interpret (7) as loss + L2 penalty, a very familiar paradigm in function estimation. And indeed, the loss term is exactly the hinge loss we expected: it acts as a "margin-maximizing" loss function.
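Formulation (7) is also directly trainable: since it is unconstrained, we can minimize it by subgradient descent. A minimal sketch on a toy two-cluster problem (the learning rate, step count, $\lambda$, and the data itself are illustrative choices, not part of the derivation):

```python
import numpy as np

# Two well-separated Gaussian clusters, labeled -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

lam, lr, n_steps = 0.01, 0.1, 200
W, b = np.zeros(2), 0.0
N = len(y)

for _ in range(n_steps):
    margins = y * (X @ W + b)
    active = margins < 1                 # only points inside the margin contribute
    # Subgradient of (7): lambda * W from the penalty,
    # -(1/N) * sum(y_i x_i) over margin violators from the hinge term.
    grad_W = lam * W - (y[active, None] * X[active]).sum(axis=0) / N
    grad_b = -y[active].sum() / N
    W -= lr * grad_W
    b -= lr * grad_b

acc = np.mean(np.sign(X @ W + b) == y)
print(acc)  # clusters this separable should be classified perfectly
```

This is essentially how stochastic solvers such as Pegasos train linear SVMs, which is only possible once the model is written in the form (7).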

Is kernel trick still available?

A powerful weapon of the SVM is the kernel trick: it builds a linear hyperplane in an enlarged feature space without ever projecting the original data into that space, which yields a non-linear decision boundary in the original feature space.

However, that derivation usually goes through the Lagrangian dual problem. If we use the loss + penalty paradigm instead, can we still use the kernel trick?

The answer is YES, thanks to the theory of the RKHS (reproducing kernel Hilbert space).
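The intuition is the representer theorem: the minimizer of (7) over an RKHS can be written as $f(x) = \sum_{i} \alpha_{i} K(x_{i}, x)$, so both the hinge loss and the penalty $\parallel f \parallel^{2} = \alpha^{T}K\alpha$ depend on the data only through the kernel matrix. A minimal sketch of evaluating this kernelized objective (the RBF kernel, $\gamma$, the toy data, and $\lambda$ are all illustrative):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def objective(alpha, K, y, lam):
    # Kernelized version of (7): only K is needed, never the feature map.
    f = K @ alpha                        # f(x_i) for every training point
    hinge = np.maximum(0.0, 1.0 - y * f).mean()
    return hinge + 0.5 * lam * alpha @ K @ alpha

X = np.array([[0.0], [1.0], [3.0], [4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = rbf_kernel(X, X)
print(objective(np.zeros(4), K, y, lam=0.1))  # → 1.0 (zero alpha: hinge is 1 everywhere)
```

Minimizing over $\alpha$ (e.g. by subgradient descent, as in the linear case) then gives a non-linear decision boundary in the original space.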

More information about this topic can be found in Section 12.3.3 of The Elements of Statistical Learning.

Summary

In this post, I showed how to go from the traditional definition of the SVM to the paradigm of hinge loss plus regularization. This shift of view tells us that we can understand the same model from different angles, each of which reveals different properties.

I will try to learn and write more posts about SVMs in the future.

Reference

The Elements of Statistical Learning

Reproducing kernel Hilbert space

Hinge loss

Lecture notes