Logistic regression with softmax function

The logistic regression model is a simple but popular generalized linear model used for binary and multiclass classification. Here, we will implement this model in Python, test it on simulated data, and compare its performance with the logistic regression module of scikit-learn.

Review of Logistic Regression

Logit function

The linear regression model builds a quantitative relation between the explanatory variables X and the dependent variable y.

y_{i} = WX_{i} + b + \epsilon_{i},     i = 1,...,n                          (1)

However, it is not reasonable to apply this model to binary classification problems. Why? In that situation, the dependent variable y is a qualitative dummy variable, usually coded 0 or 1, where 1 means the object belongs to one class and 0 means it belongs to the other. But in linear regression, the dependent variable y is quantitative and usually ranges from -\infty to +\infty.
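To see the issue concretely, here is a small illustration (toy numbers, not part of the analysis below): an ordinary least-squares fit to 0/1 labels happily produces fitted values outside [0, 1], which cannot be interpreted as probabilities.

import numpy as np

# toy data: 0/1 labels against a single explanatory variable
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y01 = np.array([0.0, 0.0, 1.0, 1.0])

A = np.hstack([x, np.ones((4, 1))])            # design matrix with an intercept
coef, *_ = np.linalg.lstsq(A, y01, rcond=None)
print(A @ coef)                                # [-0.1, 0.3, 0.7, 1.1]; some values fall outside [0, 1]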

A natural transformation is to let the dependent variable be the conditional probability p that the object belongs to one class, which lies in (0,1), and then use a monotonic link function to map p from (0,1) to (-\infty, +\infty). We can then build a linear regression on the new dependent variable.

The transformation we describe above is the logit function

logit(p) = log(\frac{p}{1-p})                    (2)

Then, we have a logistic regression model as

log(\frac{p_{i}}{1-p_{i}}) = WX_{i} + b,     i = 1,...,n                       (3)

From formula (3), we can solve for p:

p_{i} = \frac{e^{WX_{i} + b}}{1 + e^{WX_{i} + b}}                        (4)
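As a quick numerical check (with an arbitrary score, not from any fitted model), formula (4) squashes any real value of WX + b into (0,1), and the logit in (2) recovers the original score:

import numpy as np

score = 1.7                               # an arbitrary value of WX + b
p = np.exp(score) / (1 + np.exp(score))   # formula (4)
print(p)                                  # ≈ 0.8455, inside (0, 1)
print(np.log(p / (1 - p)))                # formula (2) recovers 1.7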

Now that we have the probability of the object belonging to one class, we can obtain the probability of it belonging to the other class as 1-p and estimate the parameters with MLE (maximum likelihood estimation). However, we can also build two parallel logistic regressions, one per class, by using the softmax function and optimizing the cross-entropy loss to estimate the parameters.

Softmax function

What is a softmax function? It is a generalization of the logistic function: it takes a K-dimensional vector z and normalizes each element of the vector into the range (0,1) by

f(z)_{j} = \frac{e^{z_{j}}}{\sum^{K}_{k=1} e^{z_{k}}}                     (5)

In our parallel logistic regression model, or softmax classifier, we have

p_{i1} = \frac{e^{W_{1}X_{i} + b_{1}}}{e^{W_{1}X_{i} + b_{1}} + e^{W_{2}X_{i} + b_{2}}}                           (6)

p_{i2} = \frac{e^{W_{2}X_{i} + b_{2}}}{e^{W_{1}X_{i} + b_{1}} + e^{W_{2}X_{i} + b_{2}}}                           (7)

p_{i1} + p_{i2} = 1                          (8)

The softmax function here normalizes the results from the two linear models and squashes them into the range (0,1). However, notice that W_{2} and b_{2} are redundant parameters, because we could obtain the same probabilities by building only one logistic regression. Therefore, we include a regularization term in our loss function to restrict the magnitude of the parameters.
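The redundancy is easy to verify numerically (made-up scores below): adding the same constant to both class scores leaves the softmax probabilities unchanged, so only the difference between the two linear models matters.

import numpy as np

z = np.array([1.0, -0.5])                   # made-up scores W_k X_i + b_k for k = 1, 2
for shift in [0.0, 3.0]:
    zs = z + shift                          # shift both scores by the same constant
    print(np.exp(zs) / np.sum(np.exp(zs)))  # identical probabilities both times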

The cross-entropy loss, or the negative log-likelihood, with regularization is

J(\theta) = -\sum^{n}_{i=1} y^{T}_{i}\log(\hat{y}_{i}) + \frac{\lambda}{2}\sum^{V}_{j=1}\theta^{2}_{j}                      (9)

where \hat{y}_{i} = [p_{i1}, p_{i2}]^{T} and y_{i} = [0,1]^{T} or [1,0]^{T}.

The set \{\theta_{j} | j=1,2,...,V\} represents all the parameters.
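As a toy illustration of (9) for a single observation (made-up numbers, regularization ignored): if the predicted probabilities are \hat{y} = [0.8, 0.2] and the true label is y = [1, 0], the loss is -log(0.8).

import math

# toy example: one observation, regularization ignored
y_true = [1, 0]                           # one-hot label
y_hat = [0.8, 0.2]                        # predicted probabilities
loss = -sum(t * math.log(p) for t, p in zip(y_true, y_hat))
print(loss)                               # ≈ 0.223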

With the theory above, we can start to build the model.

Code from scratch

Necessary modules

We first import the necessary modules:


import numpy as np
import random
import matplotlib.pyplot as plt

Simulate the data

We will simulate a data set to test the performance of the model. For simplicity, we only simulate data with 2 dimensions.

# simulate the data for test
np.random.seed(12)
num_observations = 5000

x1 = np.random.multivariate_normal([0, 0], [[1, .75],[.75, 1]], num_observations)
x2 = np.random.multivariate_normal([0, 4], [[1, .75],[.75, 1]], num_observations)

X = np.vstack((x1, x2)).astype(np.float32)
y = np.hstack((np.zeros(num_observations, dtype='int32'),
               np.ones(num_observations, dtype='int32')))

The input is X and the label is y.

X is a 10000 * 2 matrix and y is a vector of length 10000.
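A quick check of the shapes confirms this:

print(X.shape)   # (10000, 2)
print(y.shape)   # (10000,)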

The data looks like

# visualize the simulation
colorPlate = {0:"red";,1:"blue";}
colors = [colorPlate[c] for c in y]
plt.scatter(X[:,0], X[:,1], color = colors);
plt.title("simulated data");
plt.xlabel("x1");
plt.ylabel("x2");

[Figure: scatter plot of the simulated data (x1 vs x2), colored by class]

Now that we have generated the data, we can continue to build the model.

Softmax function

Since we will use softmax to generate the probabilities, we need to write a function for it.

def softmax(x):
    # x: 1-D vector of scores (WX + b)
    orig_shape = x.shape
    x = x - np.max(x) # subtract the max to avoid overflow of the exponential
    x = np.exp(x)/np.sum(np.exp(x))
    assert x.shape == orig_shape
    return x

The argument of the function is the vector obtained from the affine transformation of the original data (WX + b), and the returned value is a vector of the same dimension as the input, normalized by the softmax function.
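For example, applying the softmax defined above to a made-up score vector gives probabilities that are positive and sum to one:

scores = np.array([2.0, 0.5])     # made-up scores WX + b
probs = softmax(scores)
print(probs)                      # ≈ [0.8176, 0.1824]
print(probs.sum())                # 1.0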

Calculate the gradient

To estimate the parameters of the model, we will use stochastic gradient descent to minimize our cost function, so we need the gradient at each step. The loss function is the regularized cross-entropy in formula (9).

Taking derivatives with respect to W and b gives the gradients

\nabla_{W}J_{i} = (\hat{y}_{i} - y_{i})X^{T}_{i} + \lambda W      (10)

\nabla_{b}J_{i} = (\hat{y}_{i} - y_{i}) + \lambda b        (11)

where y_{i}, \hat{y}_{i} and b are column vectors and W is a 2 by 2 matrix (one row of weights per class).
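For completeness, here is a short sketch of where (10) and (11) come from (a standard derivation, not spelled out in the original post). Writing the score of class k for observation i as z_{k} = W_{k}X_{i} + b_{k}, the derivative of the cross-entropy through the softmax is

\frac{\partial J_{i}}{\partial z_{k}} = \hat{y}_{ik} - y_{ik}

and the chain rule then gives (\hat{y}_{i} - y_{i})X^{T}_{i} for W and (\hat{y}_{i} - y_{i}) for b; the \lambda W and \lambda b terms come from differentiating the regularization.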

To implement the formulas above, we write the following function.


def softmaxGradient(predictor, target, weights, bias, l = 1e-2):
    # predictor is one row of the data
    # target is the corresponding label of class
    # weights and bias are the parameters W and b
    # l is the L2 regularization strength, which keeps the parameters small
    # get y_hat
    y_hat = softmax(np.dot(weights, predictor) + bias)

    # get cost
    cost = -np.log(y_hat[target]) + (l/2)*(np.sum(weights**2) + np.sum(bias**2))
    y_hat[target] -= 1 # y_hat now holds (y_hat - y), the error vector

    # get grad with regularization
    gradW = np.outer(y_hat, predictor) + l*weights
    gradb = y_hat + l*bias
    return cost, gradW, gradb

The softmaxGradient function takes 5 arguments and returns the value of the cost function together with the gradients of W and b. We will use its output to update W and b in the following SGD step.
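Before moving on, we can sanity-check the analytic gradient against a central-difference estimate on a single sample. This check is a sketch that is not part of the original post; it uses X, y and softmaxGradient defined above, with arbitrarily seeded random parameters.

np.random.seed(0)
eps = 1e-5
x_check, y_check = X[0, :], y[0]          # one sample from the simulated data
W_check = np.random.randn(2, 2)           # random parameters just for the check
b_check = np.random.randn(2)

_, gradW, _ = softmaxGradient(x_check, y_check, W_check, b_check)

numGradW = np.zeros_like(W_check)
for idx in np.ndindex(W_check.shape):
    Wp, Wm = W_check.copy(), W_check.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    costP, _, _ = softmaxGradient(x_check, y_check, Wp, b_check)
    costM, _, _ = softmaxGradient(x_check, y_check, Wm, b_check)
    numGradW[idx] = (costP - costM) / (2 * eps)

print(np.max(np.abs(numGradW - gradW)))   # should be tiny, e.g. below 1e-6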

Stochastic gradient descent

Stochastic gradient descent (SGD), also known as incremental gradient descent, is a stochastic approximation of the gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions. In other words, SGD tries to find a minimum (or maximum) iteratively, using a small batch of data at each step.


def sgd(X, y, f, weights, bias, lr=0.01, epochs=100, batchSize=10, PRINT_EVERY=10):
    # X: explanatory variables matrix
    # y: dependent variable
    # f: the function for generating cost and gradients (softmaxGradient)
    # weights: parameters
    # bias: parameters -- intercept
    # lr: learning rate
    # epochs: number of iterations to go through the entire data
    # batchSize: the number of rows for a minibatch of data
    # PRINT_EVERY: number of steps to print optimization information
    # The function will return updated weights and bias

    Ndata = X.shape[0]
    expcost = None # cost accumulation
    index = np.array(range(Ndata)) 

    for i in range(epochs):

        # shuffle X and y
        np.random.shuffle(index)
        X = X[index,:]
        y = y[index]

        # generate minibatch
        XBatch = [X[k:k+batchSize,:] for k in np.arange(0, Ndata, batchSize)]
        yBatch = [y[k:k+batchSize] for k in np.arange(0, Ndata, batchSize)]

        # container of gradients of W and b
        gradWSum = np.zeros(weights.shape) # initialize the gradient accumulator for W
        gradbSum = np.zeros(bias.shape) # initialize the gradient accumulator for b

        for xb, yb in zip(XBatch, yBatch):
            bSize = len(yb) # minibatch size
            for j in range(bSize):
                cost, gradW, gradb = f(xb[j,:], yb[j], weights, bias) # calculate cost and gradients
                if expcost:
                    expcost = 0.95*expcost + 0.05*cost # exponentially smoothed cost
                else:
                    expcost = cost
                assert gradW.shape == gradWSum.shape
                assert gradb.shape == gradbSum.shape
            gradWSum += gradW # accumulate batch gradient of W
            gradbSum += gradb # accumulate batch gradient of b
            weights -= lr * (1/bSize) * gradWSum # update W
            bias -= lr * (1/bSize) * gradbSum # update b

        # print the smoothed cost every PRINT_EVERY epochs
        if (i+1) % PRINT_EVERY == 0:
            print("Training cost after epoch {} is: {}".format(i+1, expcost))
    return weights, bias

Initialize parameters

We need a function to help us initialize the parameters

def init_params(X, y):
    # get the number of classes and features
    Nclass = len(np.unique(y))
    Nfeature = X.shape[1]

    # initialize parameters
    # weights for different classes
    W = np.random.randn(Nclass, Nfeature)

    # bias for different classes
    b = np.random.randn(Nclass)

    return W, b


Then, we can use functions we created above to build a logistic model.

Model construction

We first initialize the parameters

np.random.seed(1212)
W, b = init_params(X,y)

Then, we use the sgd function for optimization and get the updated W and b that minimize the cost function. We train for 700 epochs.


np.random.seed(4321)
WNew, bNew = sgd(X, y, softmaxGradient, W, b, 1e-5, 700, 10, 1)

The printout from this function is:

# Training cost after epoch 1 is: 169.94332787084863
# Training cost after epoch 2 is: 148.11343047524358
# ...
# Training cost after epoch 699 is: 44.62553361794173
# Training cost after epoch 700 is: 44.6287267971707

Prediction

After getting the parameters, we also need a function for prediction.


def predict(X, W, b):
    # X is a single observation (one row of the data)
    # use the same score np.dot(W, X) + b as in training and pick the most probable class
    return np.argmax(softmax(np.dot(W, X) + b))

This function takes the updated parameters and a single observation X as arguments and returns the class with the highest probability.


pred_scratch = []
for i in range(len(y)):
    pred_scratch.append(predict(X[i,:], WNew, bNew))

print(sum(pred_scratch == y)/len(y)) # print train accuracy

Compare with scikit-learn logistic regression module

To evaluate the performance of our self-constructed model, we compare it with the scikit-learn LogisticRegression module by fitting both on the same simulated data and comparing the training accuracy.


from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression(penalty='l2', fit_intercept=True, C = 100, max_iter=300)

logistic.fit(X, y)
pred_sk = logistic.predict(X)

print(sum(pred_sk == y)/len(y)) # print train accuracy

The training accuracies of our self-made model and the scikit-learn model are

# training accuracy of model from scratch 0.9977
# training accuracy of model from scikit-learn 0.9993

They are pretty close, indicating that our self-made logistic regression is effective.

Decision boundary visualization

Besides evaluating the training accuracy, we can also plot the decision boundary of our self-made logistic regression on the 2-D plane of the original data, as sketched below.
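A minimal sketch of how such a plot can be produced (assuming WNew, bNew and the colors list from the visualization above): the boundary is the set of points where the two class scores W_{1}X + b_{1} and W_{2}X + b_{2} are equal.

dW = WNew[0, :] - WNew[1, :]              # difference of the two class weight vectors
db = bNew[0] - bNew[1]                    # difference of the two biases

# on the boundary the two class scores are equal: dW . x + db = 0
x1_grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2_grid = -(dW[0] * x1_grid + db) / dW[1]

plt.scatter(X[:, 0], X[:, 1], color = colors, alpha = 0.3);
plt.plot(x1_grid, x2_grid, color = "black");
plt.title("simulated data with decision boundary");
plt.xlabel("x1");
plt.ylabel("x2");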

[Figure: simulated data with the fitted decision boundary]

The boundary successfully captures the difference between the red class and the blue class.

Summary

Now we have finished the construction of logistic regression from scratch. It performs well but is much slower than the scikit-learn implementation, because scikit-learn uses more sophisticated optimization solvers. However, by building it ourselves we gain a better understanding of the structure of the logistic model, the factors that influence its performance, and its close relation to the softmax function. With this knowledge, we can build our own extended algorithms to solve real-world problems that do not fit neatly into any existing machine learning framework.



This is my first blog post, so I would very much appreciate your feedback.

The notebook that includes all the code and experiments can be found in my GitHub repo.

 
