
Logistic Regression with Gradient Descent

1. Classification

Classification is like a regression problem, but the output we would like to predict is discrete-valued.

Here we will focus on the binary classification problem, where \(y\in\{0,1\}\).

1.1 Hypothesis Representation

The hypothesis: \(h_\theta(x) = g(\theta^Tx)\)

Sigmoid function (or Logistic function): \(g(z) = \frac{1}{1+e^{-z}}\)

\(\theta^Tx = \theta_0x_0 +\theta_1x_1+\theta_2x_2+\dots\) where \(x_0 = 1\)

1.1.1 Sigmoid Function Visualization

library(ggplot2)
library(dplyr)

sigmoid <- function(x) {1/(1 + exp(-x))}

tibble(x = c(-8, 8),
       y = c(0, 1)) %>%
  ggplot(aes(x = x, y = y)) +
  stat_function(fun = sigmoid)

\(h_\theta(x) = 0.7\) can be interpreted as: for this \(x\), \(y = 1\) with a probability of 0.7.
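
As a quick sanity check with the sigmoid defined above (the input 0.85 is just an arbitrary example of a linear score \(\theta^Tx\)):

sigmoid(0.85)  # ~0.7: y = 1 with a probability of about 0.7
sigmoid(0)     # 0.5, exactly at the decision threshold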

2. Logistic Regression

2.1 Logistic Regression Cost Function and Gradient Descent

2.1.1 Cost Function

\[ \begin{equation} \mathrm{Cost}(h_\theta(x), y)=\begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1-h_\theta(x)) & \text{if } y = 0 \end{cases} \end{equation} \]

More compactly:

\[\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))\]

Averaging this over the \(m\) training examples gives the cost function used below:

\[J(\theta) = \frac{1}{m}\sum\limits_{i=1}^{m}\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})\]
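
A minimal sketch of this per-example cost in R (cost_point is a name introduced here just for illustration; the vectorized cost used for training appears later):

# per-example logistic cost; hx is the predicted probability, y is 0 or 1
cost_point <- function(hx, y) {-y*log(hx) - (1 - y)*log(1 - hx)}

cost_point(hx = 0.9, y = 1)  # confident and correct: low cost
cost_point(hx = 0.9, y = 0)  # confident and wrong: high cost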

2.1.2 Gradient Descent

Implementation:

Repeat \(\big\{\)
\(\theta_j :=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\)
\(\big\}\)

When we do the calculus:
Repeat \(\big\{\)
\(\theta_j :=\theta_j-\frac{\alpha}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\)
\(\big\}\)

Note that the update rule looks identical to the one for linear regression; the difference is that \(h_\theta(x)\) is now \(g(\theta^Tx)\) instead of \(\theta^Tx\).
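
Since the implementation below operates on whole matrices, it helps to note the vectorized form of the same update, where \(X\) is the design matrix with a leading column of ones:

\[\theta := \theta - \frac{\alpha}{m}X^T\big(g(X\theta) - y\big)\]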

3. Solving a Classification Problem with Gradient Descent

An example with the iris dataset in R:
In the iris dataset we have 3 species (setosa, versicolor, virginica), but for simplicity we can split them into two groups: setosa and others. This turns the problem into a binary classification.

# create a data frame with 2 features, 1 response
learning_data <- iris %>% 
  as_tibble() %>% 
  rename(sepal_length = Sepal.Length,
         sepal_width = Sepal.Width,
         species = Species) %>% 
  mutate(is_setosa = if_else(species == "setosa",
                             true = 1L,
                             false = 0L)) %>% 
  select(sepal_length, sepal_width, is_setosa)

learning_data
## # A tibble: 150 x 3
##    sepal_length sepal_width is_setosa
##           <dbl>       <dbl>     <int>
##  1         5.10        3.50         1
##  2         4.90        3.00         1
##  3         4.70        3.20         1
##  4         4.60        3.10         1
##  5         5.00        3.60         1
##  6         5.40        3.90         1
##  7         4.60        3.40         1
##  8         5.00        3.40         1
##  9         4.40        2.90         1
## 10         4.90        3.10         1
## # ... with 140 more rows

Visualize with ggplot2:

# visualize
learning_data %>%
  mutate(is_setosa = as.factor(is_setosa)) %>% 
  ggplot(aes(x = sepal_length, y = sepal_width))+
  geom_point(aes(color = is_setosa))

Now let’s find the parameters:

# define sigmoid function
sigmoid <- function(x) {1/(1 + exp(-x))}

# define cost function
cost <- function(x, y, theta){
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * (-t(y) %*% log(hx) - t(1 - y) %*% log(1 - hx))
}

# gradient
grad <- function(x, y, theta){
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * (t(x) %*% (hx - y))
}

# gradient descent settings
alpha <- 0.001
iter <- 100000
theta <- matrix(c(1, 1, 1), nrow = 3)  # initial parameters

y <- as.matrix(learning_data[, 3])
x <- as.matrix(learning_data[, c(1, 2)])
x <- cbind(1, x)  # add the intercept column (x0 = 1)
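
Before running the loop, we can sanity-check the analytic gradient against central finite differences of the cost (num_grad is a small helper sketched here for that purpose):

# numerical gradient via central finite differences
num_grad <- function(x, y, theta, eps = 1e-5) {
  sapply(seq_along(theta), function(j) {
    tp <- theta; tp[j] <- tp[j] + eps
    tm <- theta; tm[j] <- tm[j] - eps
    (cost(x, y, tp) - cost(x, y, tm)) / (2 * eps)
  })
}

# the two columns should agree to several decimal places
cbind(analytic = as.vector(grad(x, y, theta)),
      numeric = num_grad(x, y, theta))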

for (i in 1:iter){
  theta <- theta-alpha*grad(x, y, theta)
}

theta
##              is_setosa
##               1.471821
## sepal_length -3.047751
## sepal_width   4.766410
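
As a rough convergence check, the cost at the learned parameters can be compared with the cost at the starting point; the former should be much lower:

cost(x, y, theta)                         # cost after 100000 iterations
cost(x, y, matrix(c(1, 1, 1), nrow = 3))  # cost at the initial parameters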

3.1 Decision Boundary

The decision boundary separates the region where the model predicts \(y=1\) from the one where it predicts \(y=0\). It is the set of points where \(h_\theta(x) = 0.5\), i.e. where \(\theta^Tx = 0\). With our two features:

\[\theta_0 + \theta_1x_1 + \theta_2x_2 = 0 \quad\Rightarrow\quad x_2 = -\frac{\theta_0}{\theta_2} - \frac{\theta_1}{\theta_2}x_1\]

This line is what decision_boundary() below computes.

prediction <- sigmoid(x %*% theta) %>%
  as_tibble() %>%
  rename(prediction = is_setosa)

theta0 <- theta[1]
theta1 <- theta[2]
theta2 <- theta[3]

decision_boundary <- function(x) {
  (-theta0/theta2)+x*(-theta1/theta2)
}

bind_cols(learning_data, prediction) %>%
  mutate(is_setosa = as.factor(is_setosa)) %>% 
  ggplot(aes(x = sepal_length, y = sepal_width))+
  geom_point(aes(color = is_setosa))+
  stat_function(fun = decision_boundary)+
  coord_equal()
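
Finally, as a rough measure of fit, the predicted probabilities can be thresholded at the conventional 0.5 cutoff and compared with the labels:

# classify as setosa when the predicted probability exceeds 0.5
predicted_class <- as.integer(sigmoid(x %*% theta) >= 0.5)
mean(predicted_class == y)  # fraction of correctly classified examples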
