What is hinge loss?
The hinge loss is a loss function used to train machine learning classifiers:

$$\ell(y) = \max\left(0,\, 1 - t \cdot y\right) \tag{1}$$

where $t = -1$ or $1$ indicates the two classes and $y$ represents the raw output of our classifier.
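Definition (1) is a one-liner in code. A minimal sketch in NumPy (the function name `hinge_loss` is mine, not from any library):

```python
import numpy as np

def hinge_loss(t, y):
    """Hinge loss max(0, 1 - t*y) for labels t in {-1, +1} and raw scores y."""
    return np.maximum(0.0, 1.0 - t * y)

# A point classified correctly and beyond the margin (t*y >= 1) incurs zero
# loss; a point inside the margin or misclassified is penalized linearly.
print(hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.5, 0.5])))
# -> [0.  0.5 1.5]
```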
However, the SVM I know is the (soft-margin) optimization problem

$$\begin{aligned}
\min_{w,\,b,\,\xi} \quad & \frac{1}{2}\lVert w \rVert^2 + C \sum_{n=1}^{N} \xi_n \\
\text{s.t.} \quad & t_n \left(w^\top x_n + b\right) \ge 1 - \xi_n, \quad \xi_n \ge 0, \quad n = 1, \dots, N.
\end{aligned} \tag{2}$$
So what is the relation between the two? Are they just two perspectives on the same model?
How to derive it from SVM
To derive the hinge loss, we start from (2). The constraints in (2) are

$$t_n \left(w^\top x_n + b\right) \ge 1 - \xi_n, \qquad \xi_n \ge 0. \tag{3}$$

They can be rewritten as

$$\xi_n \ge 1 - t_n \left(w^\top x_n + b\right), \qquad \xi_n \ge 0. \tag{4}$$

Since the objective in (2) pushes each $\xi_n$ to be as small as possible, at the optimum we get

$$\xi_n = \max\left(0,\, 1 - t_n \left(w^\top x_n + b\right)\right), \tag{5}$$

which is the same as the definition of hinge loss in (1), with $y_n = w^\top x_n + b$. Thus, we can hide the constraints, substitute (5) into (2), and get the unconstrained problem

$$\min_{w,\,b} \; C \sum_{n=1}^{N} \max\left(0,\, 1 - t_n \left(w^\top x_n + b\right)\right) + \frac{1}{2}\lVert w \rVert^2. \tag{6}$$

We can further multiply the objective by the constant factor $\frac{1}{CN}$, where $N$ is the number of samples, and get

$$\min_{w,\,b} \; \frac{1}{N} \sum_{n=1}^{N} \max\left(0,\, 1 - t_n \left(w^\top x_n + b\right)\right) + \lambda \lVert w \rVert^2, \qquad \lambda = \frac{1}{2CN}. \tag{7}$$
We can interpret (7) as "loss + L2 penalty", a very familiar paradigm in function estimation. And indeed, the loss term is the hinge loss we expected: it is the "margin-maximizing loss function".
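The unconstrained form (7) can be minimized directly with subgradient descent, with no QP solver involved. A minimal sketch on synthetic 2-D data (the toy data, learning rate, and iteration count are my choices, not from the text):

```python
import numpy as np

# Two well-separated Gaussian blobs, labeled -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
t = np.array([-1] * 50 + [1] * 50)

# Minimize (1/N) * sum max(0, 1 - t_n (w.x_n + b)) + lam * ||w||^2,
# i.e. objective (7), by subgradient descent.
w, b, lam, lr = np.zeros(2), 0.0, 0.01, 0.1
for _ in range(200):
    y = X @ w + b
    active = (t * y < 1).astype(float)          # points inside the margin
    grad_w = -(active * t) @ X / len(t) + 2 * lam * w
    grad_b = -(active * t).mean()
    w, b = w - lr * grad_w, b - lr * grad_b

acc = np.mean(np.sign(X @ w + b) == t)
print(f"training accuracy: {acc:.2f}")
```

Note that only the "active" points, those inside the margin, contribute to the subgradient; this mirrors the fact that only support vectors matter in the dual view.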
Is kernel trick still available?
A powerful weapon of the SVM is the kernel trick: it builds a linear hyperplane in an enlarged feature space without ever explicitly projecting the original data into that space, which yields a non-linear decision boundary in the original feature space.

However, that happens when we apply Lagrange duality and optimize the dual problem. If we use the loss + penalty paradigm instead, can we still use the kernel trick?
The answer is YES, thanks to the theory of the RKHS (reproducing kernel Hilbert space).
More information about this topic can be found in Section 12.3.3 of The Elements of Statistical Learning.
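A minimal sketch of the idea, assuming the representer theorem: the RKHS minimizer has the form $f(x) = \sum_i \alpha_i K(x_i, x)$, so we can optimize the hinge-loss objective over $\alpha$ using only the Gram matrix of an RBF kernel, without ever mapping points to the enlarged feature space (the XOR-like data, $\gamma$, $\lambda$, and step size are my choices for illustration):

```python
import numpy as np

# XOR-like labels: not linearly separable in the original space.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 2))
t = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

# RBF Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2).
gamma, lam, lr = 2.0, 1e-3, 0.05
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)

# Minimize (1/N) * sum max(0, 1 - t_n f(x_n)) + lam * alpha^T K alpha,
# where f(x_n) = (K @ alpha)[n], by subgradient descent over alpha.
alpha = np.zeros(len(X))
for _ in range(300):
    y = K @ alpha
    active = (t * y < 1).astype(float)
    grad = -K @ (active * t) / len(t) + 2 * lam * K @ alpha
    alpha -= lr * grad

acc = np.mean(np.sign(K @ alpha) == t)
print(f"training accuracy: {acc:.2f}")
```

The decision function depends on the data only through $K$, which is exactly what makes the kernel trick available in the loss + penalty view.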
In this post, I illustrated how to go from the traditional definition of the SVM to the paradigm of hinge loss plus regularization. This shift of view tells us that we can understand the same model from different angles, each of which reveals different properties.
I will try to learn and write more posts about SVM in the future.