chainsawriot

Gradient Descent #1

Posted on Dec 23, 2011 by Chung-hong Chan

ml-class 最後一課講到 big data ，介紹了 stochastic gradient descent 。但是那一課只有一般習題，卻沒有編程功課，為美中不足。
實在很想試下玩下 Stochastic gradient descent ，於是自己亂寫來玩。
Gradient descent 是一種 Optimization 的方法。例如你有 y (dependent variable) 及 x (independent variable) ，你想找 regression equation: hat y = θ₀ + θ₁ x ，除了可 normal equation 處理之外，也可以用 gradient descent 方法，以 iterative 方式計出 θ 。
今次先試 Batch Gradient Descent ，之後才玩 Stochastic Gradient Descent (SGD) 。

[code]
# < - gradient descent ->#
# http://www.chainsawriot.com
# linear regression using batch gradient descent
y < - mtcars$mpg
x <- mtcars$wt
plot(y, x, xlab="Weight", ylab="Fuel consumption") # clearly a linear relationship
# add a "bias unit" to x
X <- matrix(c(rep(1, length(x)), x), byrow=FALSE, nrow=length(x), ncol=2)

computeJ <- function(y, X, theta) {
return(sum(((X %*% theta)-y)^2) / (2*length(y)))
}
gradDesc <- function(y,X, theta, alpha=0.1, iters=1000, Jplot=TRUE) {
m <- length(y)
Jhist <- c()
for (i in 1:iters) {
newtheta <- theta -((alpha)*(1/m)*(t(t(X %*% theta - y)%*%X)))
theta <- newtheta
Jhist <- append(Jhist, computeJ(y,X,theta))
}
if (Jplot) {
plot(Jhist, xlab="No of iterations", ylab="Cost", type="l")
}
return(theta)
}

initheta <- matrix(c(0,0), byrow=FALSE)
btheta <- gradDesc(y, X, initheta, alpha=0.1)
# kind of similar to lm(y~x)

# will feature scaling make gradient descent converge faster?
scX <- matrix(c(rep(1, length(x)), scale(x)), byrow=FALSE, nrow=length(x), ncol=2)
sctheta <- gradDesc(y, scX, initheta,iters=50, alpha=0.1)

# we can use greater alpha and smaller iters no.
# kind of similar to lm(y~scale(x))

## more variables: multiple regression
mulX <- mtcars[,2:ncol(mtcars)]
scMulX <- cbind(rep(1, length(y)),scale(mulX))
iniMultheta <- matrix(rep(0, ncol(scMulX)), ncol=1)
scMultheta <- gradDesc(y, scMulX, iniMultheta,iters=1000, alpha=0.01)
[/code]

line 4 - 8:

數據處理。用到的數據是 mtcars 。我想看看車輛的重量 (wt) 和汽油消耗量 (mpg)的開係。

line 10 - 12:

計算 cost 。假設有 hypothesis h θ(x) ，那麼 cost J(θ) 就是

Gradient descent 的目的，就是經過 iteration 找出最低 J 的 theta 。

line 13 - 25:

Gradient descent function 。至於為何 Gradient Descent 可以找出 J 最低的 theta ，請看此處。

line 27 - 28:

試用 gradient descent 找出 theta 。從圖中可見要到二百幾個 iterations 才 converge 。

line 31 - 33:

用 feature scaling ，方法是將 x 變成 standard deviation score ，再用 gradient descent 找出 theta 。就算用同一樣的 learning rate (alpha) ，未夠一百個 iterations 就 converge 了。

line 39 - 42:

同樣的 gradient descent function 可用於 multiple regression 的場合。