This is another implementation of the gradient descent algorithm, which I learned in Andrew Ng’s famous Machine Learning course (Affiliate Link).

I will solve a **binary classification problem** with gradient descent. My main objective is to get a better understanding of gradient descent: it can solve not only this problem but also more complex ones like neural networks. The objective of this post is to visualize how it works.

With this objective, I need a problem where the output is discrete-valued, \(y\in\{0,1\}\). The Titanic: Machine Learning from Disaster competition on Kaggle is a good fit: it is about predicting the survival chance of the passengers.

The implementation below achieves 77% accuracy.

Introducing some terms:

\(h_\theta(x) = \theta_0x_0 +\theta_1x_1+\theta_2x_2+\dots = \theta^Tx\), where \(x_0 = 1\)

Sigmoid function: \(g(z) = \frac{1}{1+e^{-z}}\)

The predictions: \(h_\theta(x) = g(\theta^Tx)\)

What is the sigmoid function? It maps any real number to a value between 0 and 1.

It is also called the logistic function; in neural networks, it is used as an activation function. It can be visualized as below:

```
library(ggplot2)
library(dplyr)

# sigmoid maps any real number into (0, 1)
sigmoid <- function(x) {1 / (1 + exp(-x))}

tibble(x = c(-5, 5), y = c(0, 1)) %>%
  ggplot(aes(x = x, y = y)) +
  stat_function(fun = sigmoid)
```

\(h_\theta(x) = 0.7\) can be interpreted as: x is *positive* with a probability of 0.7.
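A few concrete values make this mapping tangible (a quick sketch using the same sigmoid as above):

```r
sigmoid <- function(x) {1 / (1 + exp(-x))}

sigmoid(0)   # 0.5: maximum uncertainty
sigmoid(4)   # ~0.982: confidently positive
sigmoid(-4)  # ~0.018: confidently negative
```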

We need a measure of how close our predictions are to **the true values**. This measure is called the *cost*.

The cost function in this example is based on the cross-entropy loss.

\[ \mathrm{Cost}(h_\theta(x),y)=\begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1-h_\theta(x)) & \text{if } y = 0 \end{cases} \]

In a more compact way:

\[\mathrm{Cost}(h_\theta(x),y) = -y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))\]

And the cost function will be: \[J(\theta) = -\frac{1}{m}\sum\limits_{i=1}^m[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]\]
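To see why this cost makes sense, here is a small hand-computed sketch with made-up predictions (not Titanic data): a confident correct prediction is cheap, a confident wrong one is expensive:

```r
# per-example cross-entropy cost
cross_entropy <- function(hx, y) {-y * log(hx) - (1 - y) * log(1 - hx)}

cross_entropy(0.9, 1)  # ~0.105: confident and correct
cross_entropy(0.1, 1)  # ~2.303: confident and wrong

# J(theta) is the mean of this cost over all m examples
mean(cross_entropy(c(0.9, 0.2, 0.7), c(1, 0, 1)))
```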

We will minimize the cost iteratively. In each iteration, we update the parameters so that the cost gets lower. This is actually *the process of learning*.

Implementation:

*Repeat* \(\big\{\)

\(\theta_j :=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\)

\(\big\}\)

When we do the calculus:

*Repeat* \(\big\{\)

\(\theta_j :=\theta_j-\frac{\alpha}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\)

\(\big\}\)

Note that this update rule looks identical to the one for linear regression; the difference is that here \(h_\theta(x)\) is the sigmoid of \(\theta^Tx\).
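In vectorized form, the update is \(\theta := \theta - \frac{\alpha}{m}X^T(g(X\theta) - y)\). Here is a minimal single-step sketch on made-up numbers (the data and \(\alpha\) are arbitrary, chosen only for illustration):

```r
sigmoid <- function(x) {1 / (1 + exp(-x))}

# toy design matrix; the first column is the intercept term x0 = 1
x <- cbind(1, c(0, 1, 2))
y <- matrix(c(0, 0, 1), nrow = 3)
theta <- matrix(c(0, 0), nrow = 2)
alpha <- 0.1
m <- nrow(x)

# one gradient descent step, updating all theta_j simultaneously
hx <- sigmoid(x %*% theta)
theta <- theta - (alpha / m) * (t(x) %*% (hx - y))
theta
```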

After stating some notation, let’s start the implementation:

```
library(tidyverse)
df_train <- read_csv('train.csv') %>% mutate(dataset = "train")
df_test <- read_csv('test.csv') %>% mutate(dataset = "test")
```

Process test and train datasets together:

Steps:

- Bind the rows of the train and test sets.

- Convert the categorical columns (`Sex`, `Embarked`) to integers.

- Separate the datasets into train and test again.

```
# merge the two datasets for data transformation
df_all <- bind_rows(df_train, df_test)
df_all <- df_all %>% mutate(Sex = if_else(Sex == "male", true = 1L, false = 0L))
df_all <- df_all %>%
  mutate(Embarked = case_when(Embarked == "S" ~ 1L,
                              Embarked == "C" ~ 2L,
                              Embarked == "Q" ~ 3L,
                              is.na(Embarked) ~ 4L,
                              TRUE ~ 5L))
# separate the datasets again
df_train <- df_all %>% filter(dataset == "train")
df_test <- df_all %>% filter(dataset == "test")
```

Sigmoid Function:

```
# define sigmoid function
sigmoid <- function(x) {1/(1 + exp(-x))}
```

Cost Function:

```
# define cost function (vectorized cross-entropy)
cost <- function(x, y, theta){
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * ((-t(y) %*% log(hx)) - t(1 - y) %*% log(1 - hx))
}
```

Gradient Function:

```
# gradient of the cost with respect to theta
grad <- function(x, y, theta){
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * (t(x) %*% (hx - y))
}
```
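A useful sanity check (not part of the original post) is to compare `grad()` against a numerical finite-difference gradient on toy data; the two should agree to many decimal places:

```r
sigmoid <- function(x) {1 / (1 + exp(-x))}
cost <- function(x, y, theta){
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * ((-t(y) %*% log(hx)) - t(1 - y) %*% log(1 - hx))
}
grad <- function(x, y, theta){
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * (t(x) %*% (hx - y))
}

# toy data, chosen only for the check
x <- cbind(1, c(0, 1, 2))
y <- matrix(c(0, 0, 1), nrow = 3)
theta <- matrix(c(0.1, -0.2), nrow = 2)

# central finite-difference approximation of dJ/dtheta_j
eps <- 1e-6
num_grad <- sapply(1:2, function(j) {
  tp <- theta; tp[j] <- tp[j] + eps
  tm <- theta; tm[j] <- tm[j] - eps
  (cost(x, y, tp) - cost(x, y, tm)) / (2 * eps)
})

max(abs(grad(x, y, theta) - num_grad))  # ~0: analytic and numeric agree
```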

Once we find the parameters, we will make predictions on the test data.

So, the prediction function is defined as:

```
# prediction: add the intercept column and apply the model
logit_predict <- function(df, theta, features) {
  x <- as.matrix(df[, features])
  x <- cbind(1, x)
  sigmoid(x %*% theta) %>%
    as_data_frame()
}
```

The function below returns the parameters that minimize the cost.

```
logit_gradient_descent <- function(df, iter = 10000, learning_rate, features, response){
  # initial theta values: set all of them to 1
  n_independent <- length(features)
  theta <- matrix(rep(1, n_independent + 1), nrow = n_independent + 1)
  m <- nrow(df)
  # dependent and independent variables
  y <- as.matrix(df[response])
  x <- as.matrix(df[, features])
  x <- cbind(1, x)
  cost_values <- NA
  # run gradient descent
  for (i in 1:iter){
    theta <- theta - learning_rate * grad(x, y, theta)
    cost_values[i] <- cost(x, y, theta)[1, 1]
    if (i > 1 && cost_values[i] > cost_values[i - 1]) {
      warning("Cost is not decreasing in each iteration. Lower the learning_rate")
      break
    }
  }
  theta
}
```

```
theta <- logit_gradient_descent(df = df_train, iter = 100000, learning_rate = 0.001,
                                features = c("Sex", "Pclass", "SibSp", "Embarked"),
                                response = 'Survived')
predictions_train <- logit_predict(df_train, theta, features = c("Sex", "Pclass", "SibSp", "Embarked"))
df_train$predicted <- predictions_train$Survived
df_train %>%
  mutate(predicted = as.integer(round(predicted))) %>%
  mutate(is_prediction_true = Survived == predicted) %>%
  group_by(is_prediction_true) %>%
  count()
```

```
## # A tibble: 2 x 2
## # Groups: is_prediction_true [2]
## is_prediction_true n
## <lgl> <int>
## 1 F 176
## 2 T 715
```
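From the counts above, the training accuracy works out to roughly 80% (the 77% figure quoted at the top is presumably the score on the unseen Kaggle test set, which is usually a bit lower):

```r
# training accuracy from the confusion counts above
715 / (715 + 176)  # ~0.802
```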

`predictions <- logit_predict(df_test, theta, features = c("Sex", "Pclass", "SibSp", "Embarked"))`

- Coursera Machine Learning Course by Andrew Ng (Affiliate Link)