# Newton-Raphson #2

Posted on Dec 30, 2011 by Chung-hong Chan

θ_{n+1} = θ_n − f(θ_n) / f′(θ_n)

The gradient is the first derivative (f′) of the cost function. To use Newton-Raphson, we also need the cost function's f′′. What f′′ works out to is called the Hessian matrix. My multivariate calculus is terrible, so the f′ and f′′ below were derived by others.
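For reference, here are those standard results written out (not derived in this post). With h_θ = g(Xθ) and g the sigmoid, the gradient and Hessian of the logistic cost J(θ) are:

$$
\nabla_\theta J = \frac{1}{m} X^\top (h_\theta - y),
\qquad
H = \frac{1}{m} X^\top \,\mathrm{diag}\!\big(h_\theta \odot (1 - h_\theta)\big)\, X
$$

and each Newton step is θ := θ − H⁻¹ ∇_θ J.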

## Newt logistic
```r
y <- mtcars$am
x <- as.matrix(cbind(mtcars$mpg, mtcars$wt))
X <- cbind(rep(1, nrow(x)), x)
initheta <- matrix(c(0, 0, 0), ncol = 1)

## sigmoid
g <- function(z) {
  1 / (1 + exp(-z))
}

## logistic cost function
computeJ <- function(y, X, theta) {
  htheta <- g(X %*% theta)
  J <- (1 / length(y)) * sum((-y * log(htheta)) - ((1 - y) * log(1 - htheta)))
  return(J)
}

gradDescLogisticReg <- function(y, X, theta, alpha = 0.01, iters = 1000, Jplot = FALSE) {
  m <- length(y)
  JHist <- c()
  for (i in 1:iters) {
    htheta <- g(X %*% theta)
    gradient <- 1/m * t(X) %*% (htheta - y)
    newtheta <- theta - alpha * gradient
    JHist <- append(JHist, computeJ(y, X, newtheta))
    theta <- newtheta
  }
  if (Jplot) {
    plot(JHist, type = "l")
  }
  return(list(theta = theta, JHist = JHist))
}

newtLogisticReg <- function(y, X, theta, iters = 1000, Jplot = FALSE) {
  m <- length(y)
  JHist <- c()
  for (i in 1:iters) {
    htheta <- g(X %*% theta)
    gradient <- 1/m * t(X) %*% (htheta - y)
    ## Hessian: (1/m) * t(X) %*% diag(htheta * (1 - htheta)) %*% X
    hessian <- 1/m * t(X) %*% diag(as.vector(htheta * (1 - htheta))) %*% X
    ## hessian may not be invertible; use MASS::ginv() if solve() fails
    newtheta <- theta - solve(hessian) %*% gradient
    JHist <- append(JHist, computeJ(y, X, newtheta))
    theta <- newtheta
  }
  if (Jplot) {
    plot(JHist, type = "l")
  }
  return(list(theta = theta, JHist = JHist))
}

benchmark <- coef(glm(mtcars$am ~ mtcars$mpg + mtcars$wt, family = binomial()))
```

```r
# will not converge!
# GD <- gradDescLogisticReg(y, X, initheta, alpha=0.1, iters=10000)$theta

# will scaling improve?
y <- mtcars$am
scx <- as.matrix(scale(cbind(mtcars$mpg, mtcars$wt)))
scX <- cbind(rep(1, nrow(scx)), scx)

# scaled benchmark
scbenchmark <- coef(glm(y ~ scX[, 2:3], family = binomial()))

scGD <- gradDescLogisticReg(y, scX, initheta, alpha = 0.1, iters = 10000)
newtLR <- newtLogisticReg(y, X, initheta)
scnewtLR <- newtLogisticReg(y, scX, initheta)

plot(scnewtLR$JHist, type = "l", ylim = c(0.26, 0.5),
     ylab = "Cost", xlab = "No of iterations")
lines(scGD$JHist[1:1000], col = "red")
legend("topright", legend = c("Newton-Raphson", "Gradient Descent"),
       col = c("black", "red"), lty = c(1, 1), box.col = "white")
```


Newton-Raphson has one more advantage over gradient descent: there is no learning rate to tune. But despite all these benefits, watch the computational cost of each iteration. One iteration of gradient descent is O(n), where n is the number of features, whereas one iteration of Newton-Raphson is O(n³), since it has to solve a linear system in the n×n Hessian.
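To see the trade-off concretely, here is a sketch of the same Newton update in Python/NumPy rather than R, on synthetic data (none of this is from the original post); the O(n³) cost lives in the `np.linalg.solve` call on the n×n Hessian:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iters=8):
    """Newton-Raphson for logistic regression.

    Per iteration: gradient and Hessian are O(m*n) and O(m*n^2) to
    build, and solving the n x n linear system is O(n^3) -- versus
    O(m*n) total for one gradient descent step.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y) / m
        # Hessian: (1/m) * X^T diag(h * (1 - h)) X
        H = (X.T * (h * (1 - h))) @ X / m
        theta = theta - np.linalg.solve(H, grad)   # the O(n^3) step
    return theta

# toy data: binary outcome driven by a noisy linear score
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
score = X @ np.array([0.5, 2.0, -1.0]) + rng.normal(scale=1.0, size=200)
y = (score > 0).astype(float)

theta = newton_logistic(X, y)
```

A handful of Newton iterations from θ = 0 is typically enough here, whereas the unscaled gradient descent in the post above needed thousands.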