# Werden wir Helden für einen Tag

Posted on Dec 23, 2011 by Chung-hong Chan

The final lecture of ml-class covered big data and introduced stochastic gradient descent. Unfortunately that lecture came with only the usual review questions and no programming assignment, which was a pity.

Gradient descent is an optimization method. Say you have y (the dependent variable) and x (the independent variable), and you want to fit the regression equation ŷ = θ0 + θ1x. Besides solving for θ with the normal equation, you can also compute it iteratively with gradient descent.
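In the ml-class notation, batch gradient descent minimizes the squared-error cost J(θ) by repeatedly stepping against its gradient with learning rate α; in vectorized form (with X carrying the bias column), the cost and the update are:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,
\qquad
\theta := \theta - \frac{\alpha}{m} X^{\top} (X\theta - y)
```

These two expressions are exactly what `computeJ` and the `newtheta` update in the code below implement.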

[code]
#<- gradient descent ->#
# http://www.chainsawriot.com
# linear regression using batch gradient descent
y <- mtcars$mpg
x <- mtcars$wt
plot(x, y, xlab="Weight", ylab="Fuel consumption") # clearly a linear relationship
# add a "bias unit" to x
X <- matrix(c(rep(1, length(x)), x), byrow=FALSE, nrow=length(x), ncol=2)

computeJ <- function(y, X, theta) {
  return(sum(((X %*% theta) - y)^2) / (2 * length(y)))
}
gradDesc <- function(y, X, theta, alpha=0.1, iters=1000, Jplot=TRUE) {
  m <- length(y)
  Jhist <- c()
  for (i in 1:iters) {
    newtheta <- theta - (alpha * (1/m) * (t(t(X %*% theta - y) %*% X)))
    theta <- newtheta
    Jhist <- append(Jhist, computeJ(y, X, theta))
  }
  if (Jplot) {
    plot(Jhist, xlab="No. of iterations", ylab="Cost", type="l")
  }
  return(theta)
}

initheta <- matrix(c(0,0), byrow=FALSE)
btheta <- gradDesc(y, X, initheta, alpha=0.1)
# kind of similar to lm(y~x)

# will feature scaling make gradient descent converge faster?
scX <- matrix(c(rep(1, length(x)), scale(x)), byrow=FALSE, nrow=length(x), ncol=2)
sctheta <- gradDesc(y, scX, initheta, iters=50, alpha=0.1)

# with scaled features far fewer iterations are enough; a larger alpha would also converge
# kind of similar to lm(y~scale(x))

## more variables: multiple regression
mulX <- mtcars[,2:ncol(mtcars)]
scMulX <- cbind(rep(1, length(y)), scale(mulX))
iniMultheta <- matrix(rep(0, ncol(scMulX)), ncol=1)
scMultheta <- gradDesc(y, scMulX, iniMultheta, iters=1000, alpha=0.01)
[/code]

line 4 - 8:

Load mpg and wt from the built-in mtcars data set, plot them to check that the relationship is roughly linear, and prepend a column of 1s (the "bias unit") to x so that the intercept θ0 can be fitted.

line 10 - 12:

computeJ evaluates the cost J for a given θ. The purpose of gradient descent is to find, through iteration, the θ with the lowest J.
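For comparison, the normal equation mentioned earlier finds the minimizing θ in one step, with no iteration and no learning rate (the standard closed-form least-squares solution):

```latex
\theta = (X^{\top} X)^{-1} X^{\top} y
```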

line 13 - 25:

gradDesc runs batch gradient descent: on each of iters iterations it moves θ one step of size alpha against the gradient of J, records the cost in Jhist, and (if Jplot is TRUE) plots the cost history so you can check that it decreases steadily towards convergence.

line 27 - 28:

Initialize θ at (0, 0) and run the descent; the returned btheta approximates the coefficients from lm(y~x).

line 31 - 33:

Repeat the fit with x standardized via scale(x). Feature scaling makes the cost surface better conditioned, so the descent converges in far fewer iterations (50 instead of 1000), giving a result comparable to lm(y~scale(x)).

line 39 - 42:

The same gradDesc function handles multiple regression unchanged: build a design matrix from all the other columns of mtcars (scaled, plus the bias column), start θ at a zero vector of matching length, and descend, here with a smaller alpha of 0.01.
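In the multivariate case the hypothesis simply generalizes to an inner product (with x0 = 1 as the bias unit), and the vectorized cost and update shown earlier stay exactly the same:

```latex
h_\theta(x) = \theta^{\top} x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
```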