Derive hinge loss from SVM

What is hinge loss

The hinge loss is a loss function used to train classifiers. It is defined as

L(\hat{y}) = \max(0, 1 - y\hat{y})      (1)

where y \in \{-1, +1\} indicates the true class and \hat{y} is the raw output of our classifier.
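
To make (1) concrete, here is a minimal numpy sketch of the hinge loss (the function name and the test values are just illustrative):

import numpy as np

def hinge_loss(y, y_hat):
    # Hinge loss from (1): y is the true label in {-1, +1}, y_hat the raw classifier output.
    return np.maximum(0.0, 1.0 - y * y_hat)

# A confidently correct prediction incurs zero loss; predictions inside the
# margin or on the wrong side are penalized linearly.
print(hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.3, 0.5])))  # [0.  0.7 1.5]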

However, the SVM I know looks like this:

\min_{W, b, \xi} \frac{1}{2}\|W\|^{2} + C\sum_{i=1}^{N}\xi_{i}      (2)

s.t. \quad \xi_{i} \geq 0, \quad y_{i}(x_{i}^{T}W + b) \geq 1 - \xi_{i} \quad \forall i
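
For reference, (2) is a quadratic program, so it can be solved directly with a generic convex solver. Below is a minimal sketch using cvxpy on toy data; the data, the value of C, and the variable names are only illustrative:

import numpy as np
import cvxpy as cp

# Toy data: N points in d dimensions with labels in {-1, +1}.
rng = np.random.default_rng(0)
N, d, C = 30, 2, 1.0
X = rng.normal(size=(N, d))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=N) > 0, 1.0, -1.0)

W, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(N)
objective = cp.Minimize(0.5 * cp.sum_squares(W) + C * cp.sum(xi))
constraints = [xi >= 0, cp.multiply(y, X @ W + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()
print(W.value, b.value)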

So what is the relation between the two? Are they just two perspectives on the same model?

How to derive it from SVM

To derive the hinge loss, we start from (2).

The constraints in (2) are

\xi_{i} \geq 0, \quad y_{i}(x_{i}^{T}W + b) \geq 1 - \xi_{i} \quad \forall i      (3)

They can be rewritten as

\xi_{i} \geq 0, \quad \xi_{i} \geq 1 - y_{i}(x_{i}^{T}W + b) \quad \forall i      (4)

Since C > 0, the objective in (2) increases with each \xi_{i}, so at the optimum every \xi_{i} is pushed down to the smallest value allowed by (4). Thus, we get

\xi_{i} = \max(0, 1 - y_{i}(x_{i}^{T}W + b))      (5)

which is the same as the definition of hinge loss in (1). We can therefore substitute this expression for \xi_{i} back into (2), drop the constraints, and get

\min_{W, b} \frac{1}{2}\|W\|^{2} + C\sum_{i=1}^{N}\max(0, 1 - y_{i}(x_{i}^{T}W + b))      (6)
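
As a quick check, the unconstrained objective (6) is easy to write down in code. A minimal numpy sketch (the function and argument names are illustrative):

import numpy as np

def svm_objective(W, b, X, y, C):
    # Objective (6): 0.5 * ||W||^2 + C * sum of per-sample hinge losses.
    margins = y * (X @ W + b)
    return 0.5 * np.dot(W, W) + C * np.sum(np.maximum(0.0, 1.0 - margins))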

We can further multiply the whole objective in (6) by \frac{1}{NC}, where N is the number of samples; this does not change the minimizer. Writing \lambda = \frac{1}{NC}, we get

\min_{W, b} \frac{1}{N}\sum_{i=1}^{N}\max(0, 1 - y_{i}(x_{i}^{T}W + b)) + \frac{\lambda}{2}\|W\|^{2}      (7)

We can interpret (7) as loss + L2 penalty, a very familiar paradigm in function estimation. And indeed, the loss term is exactly the hinge loss we expected: it is often described as the “margin maximizing loss function”.
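
One practical consequence: since (7) is just an average loss plus a penalty, it can be minimized with plain (sub)gradient descent instead of a QP solver. A minimal sketch, with illustrative hyperparameters (lam, lr, and epochs are not from the derivation above):

import numpy as np

def fit_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    # Minimize (7) by (sub)gradient descent: mean hinge loss + (lam / 2) * ||W||^2.
    N, d = X.shape
    W, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ W + b)
        active = margins < 1.0                      # samples with non-zero hinge loss
        grad_W = lam * W - (y[active] @ X[active]) / N
        grad_b = -np.sum(y[active]) / N
        W -= lr * grad_W
        b -= lr * grad_b
    return W, b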

Is the kernel trick still available?

A powerful weapon of the SVM is the kernel trick: it builds a linear hyperplane in an enlarged feature space without ever explicitly projecting the original data into that space, which leads to a non-linear decision boundary in the original feature space.

However, that derivation usually goes through the Lagrangian and the dual problem. If we use the loss + penalty paradigm instead, can we still use the kernel trick?

The answer is YES, thanks to the theory of the RKHS (reproducing kernel Hilbert space).

More information about this topic can be found in Section 12.3.3 of The Elements of Statistical Learning.
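
As a rough illustration of that idea (a sketch of the loss + penalty view in an RKHS, not the dual SVM): by the representer theorem, the minimizer of (7) over an RKHS can be written as f(x) = \sum_{j}\alpha_{j}K(x, x_{j}) + b, and the penalty becomes \frac{\lambda}{2}\alpha^{T}K\alpha, so the same subgradient descent works on a kernel matrix instead of raw features. The RBF kernel and the hyperparameters below are only illustrative:

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def fit_kernel_svm(X, y, lam=0.01, lr=0.1, epochs=200, gamma=1.0):
    # Kernelized (7): f(x) = sum_j alpha_j K(x, x_j) + b, penalty (lam / 2) * alpha^T K alpha.
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    alpha, b = np.zeros(N), 0.0
    for _ in range(epochs):
        f = K @ alpha + b
        active = y * f < 1.0
        grad_a = lam * (K @ alpha) - (K[:, active] @ y[active]) / N
        grad_b = -np.sum(y[active]) / N
        alpha -= lr * grad_a
        b -= lr * grad_b
    return alpha, b  # predict new points with sign(rbf_kernel(X_new, X, gamma) @ alpha + b)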

Summary

In this post, I showed how to go from the traditional constrained definition of the SVM to the paradigm of hinge loss plus regularization. This shift of view tells us that we can understand the same model from different angles, each revealing different properties of it.

I will try to learn and write more posts about the SVM in the future.

Reference

The Elements of Statistical Learning

Reproducing kernel Hilbert space

Hinge loss

Lecture notes
