1. The Objective

I am implementing the learning algorithms that I learned in Andrew Ng’s famous Machine Learning course (Affiliate Link).

I will solve a binary classification problem with gradient descent. My main objective is to get a better understanding of gradient descent: it solves not only this problem, but also more complex ones like neural networks. I chose this algorithm because it is like a Swiss Army knife for solving most machine learning problems.

Given this objective, I need a problem where the output is discrete valued: $$y\in\{0,1\}$$. The Titanic: Machine Learning from Disaster competition on Kaggle looked like good practice. It is about predicting the survival chances of the passengers.

The implementation below achieves 77% accuracy, but I did not spend much effort on reaching a higher score.

2. Logistic Regression

Introducing some terms:

$$h_\theta(x) = \theta_0x_0 +\theta_1x_1+\theta_2x_2+\dots = \theta^Tx, \quad \text{where } x_0 = 1$$

Sigmoid function: $$g(z) = \frac{1}{1+e^{-z}}$$

The predictions: $$g(\theta^Tx)$$

What is the sigmoid function? It maps any real number to a value between 0 and 1.
It is also called the logistic function; in neural networks it serves as an activation function. It can be visualized as below:

library(ggplot2)
library(dplyr)

sigmoid <- function(x) {1/(1 + exp(-x))}

data_frame(x = c(-5, 5),
           y = c(0, 1)) %>%
  ggplot(aes(x = x, y = y)) +
  stat_function(fun = sigmoid)

$$h_\theta(x) = 0.7$$ can be interpreted as: x belongs to the positive class with a probability of 0.7.
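Evaluating the sigmoid at a few points makes the mapping concrete (a quick sketch, values chosen arbitrarily):

```r
# sigmoid squashes any real number into the open interval (0, 1)
sigmoid <- function(x) {1/(1 + exp(-x))}

sigmoid(-5)  # close to 0
sigmoid(0)   # exactly 0.5
sigmoid(5)   # close to 1
```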

3. Logistic Regression with Gradient Descent

We need a measure of how far the fit diverges from the data, and we will minimize it. This measure is called the cost.

The cost function in this example is based on the cross-entropy loss.

3.1 Cost Function

$\begin{equation} \mathrm{Cost}(h_\theta(x), y)=\begin{cases} -log(h_\theta(x)) & \text{if } y = 1 \\ -log(1-h_\theta(x)) & \text{if } y = 0 \end{cases} \end{equation}$

More compactly:

$\mathrm{Cost}(h_\theta(x), y) = -y*log(h_\theta(x))-(1-y)*log(1-h_\theta(x))$

And cost function will be: $J(\theta) = -\frac{1}{m}\sum\limits_{i=1}^m[(y^{(i)})log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))]$
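As a quick sanity check of the formula, the cost can be evaluated on a handful of toy labels and predictions (the values below are made up for illustration):

```r
# cross-entropy cost for a few toy examples
y  <- c(1, 0, 1)        # true labels (toy values)
hx <- c(0.7, 0.2, 0.9)  # hypothetical predictions g(theta' x)

m <- length(y)
J <- -(1/m) * sum(y * log(hx) + (1 - y) * log(1 - hx))
J
```

A confident wrong prediction (e.g. hx near 0 when y = 1) would blow this cost up, which is exactly the behavior we want to penalize.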

We will minimize the cost iteratively. In each iteration, we update the parameters so that the cost decreases. This is, in effect, the process of learning.

Implementation:

Repeat $$\big\{$$
$$\theta_j :=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$$
$$\big\}$$

When we do the calculus:
Repeat $$\big\{$$
$$\theta_j :=\theta_j-\frac{\alpha}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$$
$$\big\}$$

Note that this update rule is identical in form to the one for linear regression; the difference is that $$h_\theta(x)$$ is now the sigmoid of $$\theta^Tx$$.
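A single update step can be sketched on toy data (all values below are made up for illustration):

```r
sigmoid <- function(x) {1/(1 + exp(-x))}

# toy design matrix with an intercept column, labels, and parameters
x     <- cbind(1, c(-2, -1, 1, 2))
y     <- c(0, 0, 1, 1)
theta <- c(0, 0)
alpha <- 0.1
m     <- nrow(x)

hx    <- sigmoid(x %*% theta)                    # predictions; all 0.5 at theta = 0
theta <- theta - (alpha/m) * t(x) %*% (hx - y)   # one gradient descent step
theta
```

The intercept gradient is zero here (errors cancel), while the slope parameter moves toward the positive labels; repeating this step is the whole of gradient descent.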

With the notation in place, let's start the implementation:

4. Implementation

4.1 Import Data

library(tidyverse)
df_train <- read_csv('train.csv') %>% mutate(dataset = "train")
df_test <- read_csv('test.csv') %>% mutate(dataset = "test")

4.2 Data Processing

Process test and train datasets together:

Steps:

• Bind rows of train and test sets
• Convert categorical columns (Sex, Embarked) to integers.
• Separate the datasets into train and test again.

# merge the two datasets for data transformation
df_all <- bind_rows(df_train, df_test)

df_all <- df_all %>%  mutate(Sex = if_else(Sex =="male", true = 1L, false = 0L))

df_all <- df_all %>%
mutate(Embarked = case_when(Embarked == "S" ~ 1L,
Embarked == "C" ~ 2L,
Embarked == "Q" ~ 3L,
is.na(Embarked) ~ 4L,
TRUE ~ 5L))

# separate datasets
df_train <- df_all %>% filter(dataset == "train")
df_test <- df_all %>% filter(dataset == "test")
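Note that case_when evaluates its conditions in order, and a comparison against NA yields NA (not TRUE), so it is the is.na() branch that catches missing ports. A minimal illustration with a made-up vector:

```r
library(dplyr)

# toy embarkation ports, including a missing value
port <- c("S", "C", "Q", NA)

encoded <- case_when(port == "S" ~ 1L,
                     port == "C" ~ 2L,
                     port == "Q" ~ 3L,
                     is.na(port) ~ 4L,
                     TRUE ~ 5L)
encoded
# the NA entry falls through the equality checks and matches the is.na() branch
```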

4.3 Define Functions

Sigmoid Function:

# define sigmoid function
sigmoid <- function(x) {1/(1 + exp(-x))}

Cost Function:

# define cost function
cost <- function(x, y, theta){
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * ((-t(y) %*% log(hx)) - t(1 - y) %*% log(1 - hx))
}

Gradient Function:

# define gradient function
gradient <- function(x, y, theta){
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * (t(x) %*% (hx - y))
}
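Assuming the cost() and gradient() definitions above, one way to sanity-check the analytic gradient is to compare it against a finite-difference approximation on toy data (the data below is made up):

```r
sigmoid  <- function(x) {1/(1 + exp(-x))}
cost     <- function(x, y, theta){
  m  <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * ((-t(y) %*% log(hx)) - t(1 - y) %*% log(1 - hx))
}
gradient <- function(x, y, theta){
  m  <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * (t(x) %*% (hx - y))
}

# toy data: intercept column plus one feature
x     <- cbind(1, c(-1, 0, 1, 2))
y     <- matrix(c(0, 0, 1, 1))
theta <- matrix(c(0.1, 0.2))

# finite-difference approximation of dJ/dtheta_1
eps <- 1e-6
e1  <- matrix(c(eps, 0))
numeric_grad <- (cost(x, y, theta + e1) - cost(x, y, theta - e1)) / (2 * eps)

# the analytic and numeric values should agree closely
c(gradient(x, y, theta)[1, 1], numeric_grad[1, 1])
```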

Once we find the parameters, we will make predictions with the test data.
So, the prediction function is defined as:

# prediction
logit_predict <- function(df, theta, features) {
x <- as.matrix(df[,features])
x <- cbind(1, x)
sigmoid(x %*% theta) %>%
as_data_frame()
}

4.4 Gradient Descent

The function below returns the parameters which minimize the cost.

logit_gradient_descent <- function(df, iter = 10000, learning_rate, features, response){

  # initial theta values: set all of them to 1
  n_independent <- length(features)
  theta <- matrix(rep(1, n_independent + 1), nrow = n_independent + 1)
  m <- nrow(df)

  # dependent and independent variables
  y <- as.matrix(df[response])
  x <- as.matrix(df[, features])
  x <- cbind(1, x)  # add the intercept column

  cost_values <- NA
  for (i in 1:iter){

    cost_values[i] <- cost(x, y, theta)[1, 1]
    if (i > 1){
      if (cost_values[i] > cost_values[i - 1]) {
        warning("Cost is not decreasing in each iteration. Lower the learning_rate")
        break
      }
    }

    # update parameters in the direction of the negative gradient
    theta <- theta - learning_rate * gradient(x, y, theta)
  }
  theta
}

4.5 Make Predictions

theta <- logit_gradient_descent(df = df_train, iter = 100000, learning_rate = 0.001,
features = c("Sex", "Pclass", "SibSp", "Embarked"), response = 'Survived')

predictions_train <- logit_predict(df_train, theta, features = c("Sex", "Pclass", "SibSp", "Embarked"))
df_train$predicted <- predictions_train$Survived
df_train %>% mutate(predicted = as.integer(round(predicted))) %>%
mutate(is_prediction_true = Survived == predicted) %>%
group_by(is_prediction_true) %>%
count()
## # A tibble: 2 x 2
## # Groups:   is_prediction_true [2]
##   is_prediction_true     n
##   <lgl>              <int>
## 1 FALSE                176
## 2 TRUE                 715
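From the counts above, the training accuracy works out to 715 / (715 + 176) ≈ 0.80:

```r
# training accuracy from the confusion counts above
correct  <- 715
wrong    <- 176
accuracy <- correct / (correct + wrong)
round(accuracy, 3)
```

This is the in-sample figure; the 77% mentioned at the start is the score on Kaggle's held-out test set, which is expected to be a bit lower.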
predictions <- logit_predict(df_test, theta, features = c("Sex", "Pclass", "SibSp", "Embarked"))
