I am implementing the learning algorithms from Andrew Ng's famous Machine Learning course.

I will solve a **binary classification problem** with gradient descent. My main objective is to get a better understanding of gradient descent: it solves not only this problem but also more complex ones, like neural networks. I chose this algorithm because it is a Swiss Army knife for most machine learning problems.

Given this objective, I need a problem where the output is discrete-valued: \(y\in\{0,1\}\). The Titanic: Machine Learning from Disaster competition on Kaggle looked like good practice. It is about predicting the survival chances of the passengers.

The implementation below achieves a 77% score, but I didn't spend much effort on pushing it higher.

Introducing some terms:

\(h_\theta(x) = \theta_0x_0 +\theta_1x_1+\theta_2x_2+\dots = \theta^Tx\), where \(x_0 = 1\)

Sigmoid function: \(g(z) = \frac{1}{1+e^{-z}}\)

The predictions: \(g(\theta^Tx)\)

What is the sigmoid function? It maps any real number to a value between 0 and 1.

It is also called the logistic function. In neural networks, it serves as an activation function. It can be visualized as below:

```
library(ggplot2)
library(dplyr)
sigmoid <- function(x) 1 / (1 + exp(-x))

tibble(x = c(-5, 5), y = c(0, 1)) %>%
  ggplot(aes(x = x, y = y)) +
  stat_function(fun = sigmoid)
```

\(h_\theta(x) = 0.7\) can be interpreted as: x is *positive* with a probability of 0.7.

We need a measure of how far the fit diverges from the data, and we will minimize it. It is called the *cost*.

The cost function in this example is based on the cross-entropy loss.

\[ \mathrm{Cost}(h_\theta(x), y)=\begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1-h_\theta(x)) & \text{if } y = 0 \end{cases} \]

In a more compact way:

\[\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))\]

And the cost function will be: \[J(\theta) = -\frac{1}{m}\sum\limits_{i=1}^m\left[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]\]
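As a sanity check on the formula, here is the cost computed by hand for two training examples (toy numbers, not the Titanic data):

```
# toy example: two model outputs h_theta(x) and their true labels
h <- c(0.9, 0.2)
y <- c(1, 0)

# cross-entropy cost, averaged over m = 2 examples
J <- -mean(y * log(h) + (1 - y) * log(1 - h))
J  # ~0.164: both predictions are close to the labels, so the cost is low
```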

We will minimize the cost iteratively. In each iteration, we update the parameters so that the cost gets lower. This is actually *the process of learning*.

Implementation:

*Repeat* \(\big\{\)

\(\theta_j :=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\)

\(\big\}\)

When we work out the derivative:

*Repeat* \(\big\{\)

\(\theta_j :=\theta_j-\frac{\alpha}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\)

\(\big\}\)

Note that this update rule is identical in form to linear regression's; the difference is that \(h_\theta(x)\) is now \(g(\theta^Tx)\) rather than \(\theta^Tx\).
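A single update step of the rule above can be written out on toy numbers (all values here are illustrative):

```
sigmoid <- function(z) 1 / (1 + exp(-z))

# toy design matrix with an intercept column, toy labels, theta starting at zero
x <- cbind(1, c(-2, -1, 1, 2))
y <- c(0, 0, 1, 1)
theta <- c(0, 0)
alpha <- 0.1
m <- nrow(x)

# theta_j := theta_j - (alpha/m) * sum_i (h(x_i) - y_i) * x_ij, for all j at once
h <- sigmoid(x %*% theta)
theta <- theta - (alpha / m) * as.vector(t(x) %*% (h - y))
theta  # c(0, 0.075): the slope moves toward separating the two classes
```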

With the notation in place, let's start the implementation:

```
library(tidyverse)
df_train <- read_csv('train.csv') %>% mutate(dataset = "train")
df_test <- read_csv('test.csv') %>% mutate(dataset = "test")
```

Process test and train datasets together:

Steps:

- Bind rows of train and test sets

- Convert categorical columns (`Sex`, `Embarked`) to integers.

- Separate datasets into train and test again.

```
# merge the two datasets for joint data transformation
df_all <- bind_rows(df_train, df_test)

# encode Sex as 1/0
df_all <- df_all %>%
  mutate(Sex = if_else(Sex == "male", true = 1L, false = 0L))

# encode Embarked as integers (NA gets its own level)
df_all <- df_all %>%
  mutate(Embarked = case_when(Embarked == "S" ~ 1L,
                              Embarked == "C" ~ 2L,
                              Embarked == "Q" ~ 3L,
                              is.na(Embarked) ~ 4L,
                              TRUE ~ 5L))

# separate the datasets again
df_train <- df_all %>% filter(dataset == "train")
df_test <- df_all %>% filter(dataset == "test")
```

Sigmoid Function:

```
# define sigmoid function
sigmoid <- function(x) {1/(1 + exp(-x))}
```
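A few quick checks on the function behave as expected:

```
sigmoid <- function(x) {1/(1 + exp(-x))}

sigmoid(0)    # 0.5: the decision boundary
sigmoid(10)   # ~1
sigmoid(-10)  # ~0
```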

Cost Function:

```
# define cost function
cost <- function(x, y, theta) {
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * ((-t(y) %*% log(hx)) - t(1 - y) %*% log(1 - hx))
}
```
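As a quick sanity check on the matrix form (toy shapes, not the Titanic data): with \(\theta = 0\), every prediction is 0.5, so the cost should be \(\log 2\):

```
sigmoid <- function(x) 1 / (1 + exp(-x))
cost <- function(x, y, theta) {
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * ((-t(y) %*% log(hx)) - t(1 - y) %*% log(1 - hx))
}

# two examples, one feature plus intercept; theta = 0 gives h = 0.5 everywhere
x <- cbind(1, c(-1, 1))
y <- matrix(c(0, 1))
theta <- matrix(c(0, 0))
cost(x, y, theta)  # log(2) ~ 0.693, the cost of always predicting 0.5
```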

Gradient Function:

```
# gradient
grad <- function(x, y, theta) {
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * (t(x) %*% (hx - y))
}
```
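One way to gain confidence in the analytic gradient is to compare it against finite differences on toy data (a standard gradient check, not part of the original pipeline):

```
sigmoid <- function(x) 1 / (1 + exp(-x))
cost <- function(x, y, theta) {
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  as.numeric((1/m) * ((-t(y) %*% log(hx)) - t(1 - y) %*% log(1 - hx)))
}
grad <- function(x, y, theta) {
  m <- nrow(x)
  hx <- sigmoid(x %*% theta)
  (1/m) * (t(x) %*% (hx - y))
}

x <- cbind(1, c(-2, 0, 1, 3))
y <- matrix(c(0, 1, 0, 1))
theta <- matrix(c(0.1, -0.2))

# central finite-difference approximation of dJ/dtheta_j for each j
eps <- 1e-6
numeric_grad <- sapply(1:2, function(j) {
  e <- matrix(0, 2, 1); e[j] <- eps
  (cost(x, y, theta + e) - cost(x, y, theta - e)) / (2 * eps)
})
max(abs(grad(x, y, theta) - numeric_grad))  # ~0: the two gradients agree
```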

Once we find the parameters, we will make predictions with test data.

So, prediction Function will be defined as:

```
# prediction
logit_predict <- function(df, theta, features) {
  x <- as.matrix(df[, features])
  x <- cbind(1, x)  # add the intercept column
  sigmoid(x %*% theta) %>%
    as_tibble()
}
```

The following function runs gradient descent and returns the parameters that minimize the cost:

```
logit_gradient_descent <- function(df, iter = 10000, learning_rate, features, response) {
  # initial theta values: set all of them to 1
  n_independent <- length(features)
  theta <- matrix(rep(1, n_independent + 1), nrow = n_independent + 1)
  m <- nrow(df)

  # dependent and independent variables
  y <- as.matrix(df[response])
  x <- as.matrix(df[, features])
  x <- cbind(1, x)  # add the intercept column

  cost_values <- NA

  # run gradient descent
  for (i in 1:iter) {
    theta <- theta - learning_rate * grad(x, y, theta)
    cost_values[i] <- cost(x, y, theta)[1, 1]
    if (i > 1 && cost_values[i] > cost_values[i - 1]) {
      warning("Cost is not decreasing in each iteration. Lower the learning_rate.")
      break
    }
  }
  theta
}
```
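On a small linearly separable toy set, the same loop drives the cost down until the predictions match the labels (a standalone sketch, restating the sigmoid and gradient definitions so it runs on its own):

```
sigmoid <- function(x) 1 / (1 + exp(-x))
grad <- function(x, y, theta) {
  m <- nrow(x)
  (1/m) * (t(x) %*% (sigmoid(x %*% theta) - y))
}

# six separable points: negatives below zero, positives above
x <- cbind(1, c(-3, -2, -1, 1, 2, 3))
y <- matrix(c(0, 0, 0, 1, 1, 1))
theta <- matrix(c(1, 1))

for (i in 1:10000) {
  theta <- theta - 0.1 * grad(x, y, theta)
}

round(sigmoid(x %*% theta))  # all six rounded predictions match the labels
```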

```
theta <- logit_gradient_descent(df = df_train, iter = 100000, learning_rate = 0.001,
                                features = c("Sex", "Pclass", "SibSp", "Embarked"),
                                response = "Survived")

predictions_train <- logit_predict(df_train, theta,
                                   features = c("Sex", "Pclass", "SibSp", "Embarked"))
df_train$predicted <- predictions_train$Survived

df_train %>%
  mutate(predicted = as.integer(round(predicted))) %>%
  mutate(is_prediction_true = Survived == predicted) %>%
  group_by(is_prediction_true) %>%
  count()
```

```
## # A tibble: 2 x 2
## # Groups: is_prediction_true [2]
## is_prediction_true n
## <lgl> <int>
## 1 F 176
## 2 T 715
```
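From these counts, training accuracy works out to about 80%; the 77% mentioned earlier is the score on Kaggle's held-out test set:

```
# training accuracy from the counts above: 715 correct out of 176 + 715 = 891 rows
accuracy <- 715 / (176 + 715)
accuracy  # ~0.80
```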

`predictions <- logit_predict(df_test, theta, features = c("Sex", "Pclass", "SibSp", "Embarked"))`
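To turn these test-set probabilities into a Kaggle submission, one option is to round them and write the standard two-column file. This is a sketch with toy stand-ins for `df_test` and `predictions` so it runs on its own; the `PassengerId`/`Survived` layout is the format the Titanic competition expects:

```
# toy stand-ins for df_test and the predictions produced by logit_predict above
df_test_toy <- data.frame(PassengerId = c(892L, 893L))
predictions_toy <- data.frame(Survived = c(0.12, 0.81))

# round probabilities to 0/1 and write the submission file
submission <- data.frame(PassengerId = df_test_toy$PassengerId,
                         Survived = as.integer(round(predictions_toy$Survived)))
write.csv(submission, "submission.csv", row.names = FALSE)
```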

References:

- Coursera Machine Learning Course by Andrew Ng