Solving a linear regression problem with the *Gradient Descent* optimization algorithm to better understand *learning*…

Assume the training set below (a tibble in R):

```
library(dplyr)
as_tibble(cars)
```

```
## # A tibble: 50 x 2
## speed dist
## <dbl> <dbl>
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
## 7 10 18
## 8 10 26
## 9 10 34
## 10 11 17
## # ... with 40 more rows
```

**m** : Number of training examples (or data points). In this example it is 50.

**x** : Input variable `speed`

**y** : Output variable `dist`

\((x, y)\) denotes one sample from our training set.

\((x^{(i)}, y^{(i)})\) is the \(i^{th}\) training example. In our example, \(x^{(1)} = 4\), \(y^{(1)} = 2\).

**The hypothesis** \(h\) maps \(x\)'s to \(y\)'s.

\(h_\theta(x)\) will be the estimated value of `dist`.

\[h_\theta(x) = \theta_0+\theta_1 x\]

Because there is one input variable, this is also called **univariate linear regression** or **linear regression with one variable**.

How to choose \(\theta_0\) and \(\theta_1\) for the best fit?

Choose \(\theta_0\) and \(\theta_1\) so that \(h_\theta(x)\) is **close to** our training examples \(y\).

Mathematically speaking, find \(\theta_0\) and \(\theta_1\) that **minimize**: \[J(\theta_0, \theta_1) = \frac{1}{2m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2\]

\(J(\theta_0, \theta_1)\) is the cost function. It is also called the **Squared Error Function**.
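As a small illustration (a hypothetical helper, not part of the original), the cost \(J\) can be computed directly in R for the `cars` data:

```r
# J(theta0, theta1): squared error cost for the cars data
cost <- function(theta0, theta1, x = cars$speed, y = cars$dist) {
  m <- length(y)
  predictions <- theta0 + theta1 * x
  (1 / (2 * m)) * sum((predictions - y)^2)
}
cost(0, 3)  # cost of the candidate model dist = 3 * speed
```

Lower values of `cost` mean the line sits closer to the training examples.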

An illustration with the `cars` dataset in R: plot `speed` on the x axis and `dist` (distance to stop) on the y axis.

```
library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point()
```

Add some models to represent the relationship between `speed` and `dist`. For simplicity, set \(\theta_0 = 0\).

```
library(ggplot2)
# simulate 9 candidate models with theta0 = 0 and varying theta1
set.seed(1010)
sim1 <- data_frame(
  x        = rep(cars$speed, 9),
  y        = rep(cars$dist, 9),
  model_no = paste("Model", sort(rep(1:9, 50))),
  theta1   = sort(rep(seq(from = 1, to = 4.5, length.out = 9), 50)),
  theta0   = 0)
# add predictions
sim1 <- sim1 %>% mutate(prediction = theta0 + x * theta1)
# visualize each candidate line with its residual segments
sim1 %>% ggplot(aes(x = x, y = y)) +
  geom_point(size = 0.5) +
  geom_abline(aes(slope = theta1, intercept = theta0), color = "red") +
  geom_segment(aes(x = x, xend = x,
                   y = prediction, yend = y), alpha = 0.3) +
  facet_grid(. ~ model_no)
```

Now visualize the cost function:

```
# cost of each candidate model
sse <- sim1 %>%
  mutate(squared_error = (prediction - y)^2) %>%
  group_by(model_no, theta1) %>%
  summarize(m = n(),
            cost = (1 / (2 * m)) * sum(squared_error))
sse %>% ggplot(aes(x = theta1, y = cost)) +
  geom_point(color = "blue") +
  geom_text(aes(label = model_no), angle = 0, nudge_y = 20, size = 3)
```

It looks like a parabola. Model 5 looks better than the other alternatives.

How to find the best one? There might be other solutions, but the objective of this example is to understand *learning*. So let's go with **Gradient Descent**.

Outline:

1. Start with some \(\theta_0, \theta_1\)

2. Keep changing \(\theta_0, \theta_1\) to reduce \(J(\theta_0, \theta_1)\) until we hopefully find a minimum.

It is an iterative optimization algorithm, useful not only for linear regression but for many machine learning problems.

repeat until convergence: \[\theta_j := \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)\] for \(j = 0\) and \(j = 1\)

- \(:=\) is the assignment operator.

- \(\alpha\) is the learning rate.

- \(\theta_0\) and \(\theta_1\) should be updated **simultaneously**.
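As a sketch of what "simultaneously" means here (variable names are illustrative, using the `cars` data), compute both partial derivatives before assigning either parameter:

```r
# one gradient descent step with a simultaneous update
x <- cars$speed; y <- cars$dist
m <- length(y)
alpha <- 0.001
theta0 <- 0; theta1 <- 1
yhat  <- theta0 + theta1 * x
grad0 <- (1 / m) * sum(yhat - y)        # partial derivative w.r.t. theta0
grad1 <- (1 / m) * sum((yhat - y) * x)  # partial derivative w.r.t. theta1
temp0 <- theta0 - alpha * grad0         # compute both new values first...
temp1 <- theta1 - alpha * grad1
theta0 <- temp0
theta1 <- temp1                         # ...then assign them together
```

Updating \(\theta_0\) in place and recomputing `yhat` before `grad1` would mix the old and new parameters within a single step.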

Visualization with `ggplot2`:

```
# cost function
cost_function <- function(x) (x-2)^2
# derivative of the cost function
library(Deriv)
derivative <- Deriv(cost_function)
# starting point
df <- data_frame(x = 0,
                 y = cost_function(x),
                 new_x = x,
                 new_y = y)
# learning rate
learning_rate <- 0.2
# take 20 gradient descent steps, recording each one
for (i in 1:20) {
  x <- df$new_x[i]
  y <- df$new_y[i]
  step <- derivative(x) * learning_rate
  new_x <- x - step
  new_y <- cost_function(new_x)
  new_df <- data_frame(x = x, y = y, new_x = new_x, new_y = new_y)
  df <- bind_rows(df, new_df)
  rm(x, y, new_x, step, new_y, new_df)
}
# plot the descent path on top of the cost curve
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_segment(aes(x = x, xend = new_x,
                   y = y, yend = new_y), color = "blue", linetype = "dotted") +
  stat_function(fun = cost_function, alpha = 0.5) +
  xlim(c(-1, 5)) +
  ggtitle(label = "Gradient Descent Visualization", subtitle = "Learning rate: 0.2")
```

In this example the learning rate was 0.2.

Another example with a higher learning rate, \(\alpha = 0.9\):
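That second run is easy to reproduce for the same toy cost \((x-2)^2\): with \(\alpha = 0.9\) the update is \(x \leftarrow x - 0.9 \cdot 2(x-2)\), so each step overshoots the minimum and the iterates oscillate around \(x = 2\) while (slowly) converging:

```r
# gradient descent on (x - 2)^2 with a large learning rate
cost_function <- function(x) (x - 2)^2
derivative <- function(x) 2 * (x - 2)   # analytic derivative
learning_rate <- 0.9
x <- 0
for (i in 1:10) {
  x <- x - learning_rate * derivative(x)
  cat(sprintf("step %2d: x = %7.4f\n", i, x))
}
```

Each step multiplies the distance to the minimum by \(1 - 2\alpha = -0.8\): the sign flips every iteration, and with \(\alpha > 1\) the sequence would diverge.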

Note that the derivative of the cost function with respect to \(\theta_1\) is:

\[\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1) = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}\]
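A quick numerical sanity check of this formula (assuming \(\theta_0 = 0\); the evaluation point 3 and the finite-difference step `eps` are arbitrary choices):

```r
x <- cars$speed; y <- cars$dist
m <- length(y)
J <- function(theta1) (1 / (2 * m)) * sum((theta1 * x - y)^2)          # cost
analytic_grad <- function(theta1) (1 / m) * sum((theta1 * x - y) * x)  # formula above
eps <- 1e-6
numeric_grad <- (J(3 + eps) - J(3 - eps)) / (2 * eps)  # central difference
analytic_grad(3) - numeric_grad  # should be close to zero
```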

Find the parameters:

```
x <- cars$speed
y <- cars$dist
theta1 <- 6       # initial guess
alpha <- 0.001    # learning rate
m <- nrow(cars)
yhat <- theta1 * x
# preallocate a data frame to track the iterations
df <- data_frame(theta1    = rep(NA_real_, 20),
                 cost      = NA_real_,
                 iteration = 1:20)
for (i in 1:20) {
  theta1 <- theta1 - alpha * ((1 / m) * (sum((yhat - y) * x)))
  yhat <- theta1 * x
  cost <- (1 / (2 * m)) * sum((yhat - y)^2)
  df[i, 1] <- theta1
  df[i, 2] <- cost
  df[i, 3] <- i
}
theta1
```

`## [1] 2.915755`
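As a cross-check (not part of the original walkthrough), a through-the-origin least squares fit has the closed-form slope \(\sum x y / \sum x^2\), which `lm` reproduces; 20 iterations of gradient descent land close to it:

```r
x <- cars$speed; y <- cars$dist
slope_closed <- sum(x * y) / sum(x^2)          # closed-form solution
slope_lm <- unname(coef(lm(dist ~ 0 + speed, data = cars)))
c(closed_form = slope_closed, lm = slope_lm)   # both about 2.91
```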

Visualize the linear regression line:

```
cars %>% ggplot(aes(speed, dist)) +
  geom_point() +
  geom_abline(slope = theta1, intercept = 0)
```

Check the cost function:

```
df %>% ggplot(aes(x = iteration, y = cost)) +
  geom_line() +
  geom_point()
```

- Coursera Machine Learning Course by Andrew Ng