Logistic Regression with Gradient Descent

1. The Objective

This is another implementation of the gradient descent algorithm, which I learned in Andrew Ng's famous Machine Learning course.

I will solve a binary classification problem with gradient descent. My main goal is to get a better understanding of gradient descent itself: it can solve not only this problem but also much more complex ones, such as training neural networks. The objective of this post is to visualize how it works.

With this objective, I need a problem where the output is discrete-valued, \(y\in\{0,1\}\). The Titanic: Machine Learning from Disaster competition on Kaggle looked like good practice: it is about predicting whether a passenger survived.

The implementation below achieves 77% accuracy.

2. Logistic Regression

Introducing some terms:

The linear combination: \(z = \theta_0x_0 +\theta_1x_1+\theta_2x_2+\dots = \theta^Tx\), where \(x_0 = 1\) is the intercept term

Sigmoid function: \(g(z) = \frac{1}{1+e^{-z}}\)

The prediction: \(h_\theta(x) = g(\theta^Tx)\)

What is the sigmoid function? It maps any real number to a value between 0 and 1.
It is also called the logistic function, and in neural networks it is used as an activation function. It can be visualized as below:

library(ggplot2)
library(dplyr)

# sigmoid maps any real number into (0, 1)
sigmoid <- function(x) {1/(1 + exp(-x))}

data_frame(x = c(-5, 5),
           y = c(0, 1)) %>%
  ggplot(aes(x = x, y = y)) +
  stat_function(fun = sigmoid)

\(h_\theta(x) = 0.7\) can be interpreted as: \(x\) belongs to the positive class (\(y = 1\)) with probability 0.7.
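
To turn such a probability into a class label, a 0.5 threshold is typically used (this is what rounding the predictions does later in the post). Since \(g(z)\geq 0.5\) exactly when \(z\geq 0\), the threshold is equivalent to checking the sign of \(\theta^Tx\). A quick illustration, reusing the sigmoid() defined above:

# g(z) >= 0.5 exactly when z >= 0, so a 0.5 probability threshold
# amounts to checking the sign of theta'x
z <- c(-2, -0.5, 0, 0.5, 2)
data.frame(z = z,
           probability = sigmoid(z),
           predicted_class = as.integer(sigmoid(z) >= 0.5))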

3. Logistic Regression with Gradient Descent

We need a measure of how close our predictions are to the true values. This measure is called the cost.

The cost function in this example is based on the cross-entropy loss.

3.1 Cost Function

\[ \mathrm{Cost}(h_\theta(x), y)=\begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1-h_\theta(x)) & \text{if } y = 0 \end{cases} \]

In a more compact way:

\[\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))\]

And the cost function will be: \[J(\theta) = -\frac{1}{m}\sum\limits_{i=1}^m\Big[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\Big]\]
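
To get a feel for why this is a sensible cost, here is a quick illustration (example_cost() is a throwaway helper for this post, not used later): a confident correct prediction is cheap, a confident wrong one is expensive.

# per-example cross-entropy cost for a single prediction hx and label y
example_cost <- function(hx, y) {-y*log(hx) - (1-y)*log(1-hx)}

example_cost(hx = 0.9, y = 1)  # confident and correct: cost ~ 0.11
example_cost(hx = 0.1, y = 1)  # confident and wrong:   cost ~ 2.30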

3.2 Gradient Descent

We will minimize the cost iteratively. In each iteration, we update the parameters so that the cost gets lower. This is, in effect, the process of learning.

The update rule:

Repeat \(\big\{\)
\(\theta_j :=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\)
\(\big\}\)

When we do the calculus, this becomes:
Repeat \(\big\{\)
\(\theta_j :=\theta_j-\frac{\alpha}{m}\sum\limits_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)x_j^{(i)}\)
\(\big\}\)

Note that this update rule looks identical to the one for linear regression; the difference is that \(h_\theta(x)\) is now \(g(\theta^Tx)\) rather than \(\theta^Tx\).
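
For completeness, a brief sketch of that calculus. It relies on the identity \(g'(z) = g(z)(1-g(z))\), which gives

\[\frac{\partial}{\partial\theta_j}\log(h_\theta(x)) = (1-h_\theta(x))x_j \qquad \text{and} \qquad \frac{\partial}{\partial\theta_j}\log(1-h_\theta(x)) = -h_\theta(x)x_j\]

Substituting these into \(J(\theta)\) and simplifying yields \(\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m}\sum\limits_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)x_j^{(i)}\), which is exactly the term inside the update above.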

Having stated the notation, let's start the implementation:

4. Implementation

4.1 Import Data

library(tidyverse)
df_train <- read_csv('train.csv') %>% mutate(dataset = "train")
df_test <- read_csv('test.csv') %>% mutate(dataset = "test")
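
A quick look at the raw data before transforming it (output omitted here; glimpse() is loaded with the tidyverse):

# the Kaggle training set has 891 passengers
dim(df_train)

# column names and types
glimpse(df_train)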

4.2 Data Processing

Process test and train datasets together:

Steps:

  • Bind rows of train and test sets
  • Convert categorical columns (Sex, Embarked) to integers.
  • Separate datasets into train and test again.
# merge the two datasets for data transformation
df_all <- bind_rows(df_train, df_test)

df_all <- df_all %>%  mutate(Sex = if_else(Sex =="male", true = 1L, false = 0L))

df_all <- df_all %>%
                mutate(Embarked = case_when(Embarked == "S" ~ 1L,
                                            Embarked == "C" ~ 2L,
                                            Embarked == "Q" ~ 3L,
                                            is.na(Embarked) ~ 4L,
                                            TRUE ~ 5L))

# separate datasets
df_train <- df_all %>% filter(dataset == "train")
df_test <- df_all %>% filter(dataset == "test")
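
An optional sanity check that the encoding worked as intended:

# Sex should now be 0/1 and Embarked should be 1-4
df_all %>% count(Sex)
df_all %>% count(Embarked)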

4.3 Define Functions

Sigmoid Function:

# define sigmoid function
sigmoid <- function(x) {1/(1 + exp(-x))}

Cost Function:

# define cost function: cross-entropy averaged over the m examples
cost <- function(x, y, theta){
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * ((-t(y) %*% log(hx)) - t(1 - y) %*% log(1 - hx))
}

Gradient Function:

# gradient of the cost with respect to theta
grad <- function(x, y, theta){
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m)*(t(x)%*%(hx - y))
}
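
Before using it on the real data, we can check the analytic gradient against a finite-difference approximation of the cost (x_toy, y_toy and theta_toy are made-up values used only for this check):

# numeric gradient check on a small made-up dataset
set.seed(1)
x_toy <- cbind(1, matrix(rnorm(20), nrow = 10))  # 10 rows: intercept + 2 features
y_toy <- matrix(rbinom(10, 1, 0.5), ncol = 1)
theta_toy <- matrix(c(0.1, -0.2, 0.3), ncol = 1)

eps <- 1e-6
finite_diff <- sapply(1:3, function(j) {
  e <- rep(0, 3); e[j] <- eps
  (cost(x_toy, y_toy, theta_toy + e) - cost(x_toy, y_toy, theta_toy - e)) / (2 * eps)
})

# both columns should agree to several decimal places
data.frame(analytic = as.vector(grad(x_toy, y_toy, theta_toy)),
           finite_diff = finite_diff)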

Once we have found the parameters, we will make predictions on the test data.
So, the prediction function is defined as:

# prediction
logit_predict <- function(df, theta, features) {
  x <- as.matrix(df[,features])
  x <- cbind(1, x)  # add the intercept column
  sigmoid(x %*% theta) %>%
    as_data_frame()
}

4.4 Running Gradient Descent

This will return the parameters that minimize the cost.

logit_gradient_descent <- function(df, iter = 10000, learning_rate, features, response){
  
  # initial theta values: set all of them to 1
  n_independent = length(features)
  theta = matrix(rep(1, n_independent+1), nrow = n_independent+1)
  m = nrow(df)
  
  # dependent and independent variables 
  y = as.matrix(df[response])
  x <- as.matrix(df[,features])
  x <- cbind(1, x)
  
  cost_values = NA
  # run gradient descent
  for (i in 1:iter){
    theta <- theta-learning_rate*grad(x, y, theta)
    
    cost_values[i] <- cost(x, y, theta)[1,1]
    if(i > 1){
      if(cost_values[i] > cost_values[i-1]) {
        warning("Cost is not decreasing in each iteration. Lower the learning_rate")
        break
      } 
    }
  }
  theta
}
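
The function keeps the cost of every iteration in cost_values but only returns theta. A minimal sketch of how one could watch the cost to confirm the learning rate is reasonable (it simply repeats the loop body above for a fixed 500 iterations with the same 0.001 learning rate):

# record the cost over a short manual run to eyeball convergence
features <- c("Sex", "Pclass", "SibSp", "Embarked")
x_check <- cbind(1, as.matrix(df_train[, features]))
y_check <- as.matrix(df_train["Survived"])
theta_check <- matrix(1, nrow = length(features) + 1)

cost_history <- numeric(500)
for (i in 1:500) {
  theta_check <- theta_check - 0.001 * grad(x_check, y_check, theta_check)
  cost_history[i] <- cost(x_check, y_check, theta_check)[1, 1]
}

# the curve should decrease smoothly; if it bounces around, lower the learning rate
ggplot(data.frame(iteration = 1:500, cost = cost_history),
       aes(x = iteration, y = cost)) +
  geom_line()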

4.5 Make Predictions

theta <- logit_gradient_descent(df = df_train, iter = 100000, learning_rate = 0.001,
                                features = c("Sex", "Pclass", "SibSp", "Embarked"), response = 'Survived')

predictions_train <- logit_predict(df_train, theta, features = c("Sex", "Pclass", "SibSp", "Embarked"))
df_train$predicted <- predictions_train$Survived
df_train %>% mutate(predicted = as.integer(round(predicted))) %>% 
  mutate(is_prediction_true = Survived == predicted) %>% 
  group_by(is_prediction_true) %>% 
  count()
## # A tibble: 2 x 2
## # Groups:   is_prediction_true [2]
##   is_prediction_true     n
##   <lgl>              <int>
## 1 F                    176
## 2 T                    715

The model classifies 715 of the 891 training passengers correctly (roughly 80% training accuracy). Finally, make predictions for the test set:

predictions <- logit_predict(df_test, theta, features = c("Sex", "Pclass", "SibSp", "Embarked"))
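
To turn these probabilities into a Kaggle submission, round them to 0/1 and pair them with PassengerId (a sketch of the final step; the prediction column is named Survived because the response matrix carried that name through the matrix algebra, and submission.csv is an arbitrary file name):

# round the predicted probabilities and write a submission file for Kaggle
submission <- tibble(PassengerId = df_test$PassengerId,
                     Survived = as.integer(round(predictions$Survived)))
write_csv(submission, "submission.csv")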

References


  • Machine Learning
  • Gradient Descent
  • Logistic Regression