θ_{n+1} = θ_n - f(θ_n)/f'(θ_n)

The gradient is the partial derivative of the cost function (f'). To use Newton-Raphson, we also need the cost function's second derivative, f''. The matrix of second derivatives is called the Hessian matrix. My multivariate calculus is poor, so the f' and f'' below were derived by others.
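For intuition, the scalar update rule above can be sketched in R. This is a hypothetical toy example (finding √2 as a root of f(x) = x² − 2), separate from the regression code that follows:

```r
## Toy scalar Newton-Raphson: find the root of f(x) = x^2 - 2,
## whose derivative is f'(x) = 2x.
f <- function(x) x^2 - 2
fprime <- function(x) 2 * x

x <- 1  # initial guess
for (i in 1:10) {
  x <- x - f(x) / fprime(x)  # theta_{n+1} = theta_n - f(theta_n)/f'(theta_n)
}
x  # converges to sqrt(2)
```

The multivariate version used for logistic regression replaces f'(θ) with the gradient and division by f''(θ) with multiplication by the inverse Hessian.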

```r
## Newt logistic
y <- mtcars$am
x <- as.matrix(cbind(mtcars$mpg, mtcars$wt))
X <- cbind(rep(1, nrow(x)), x)
initheta <- matrix(c(0,0,0), ncol=1)

g <- function(z) {
  1 / (1 + exp(-z))
}

computeJ <- function(y, X, theta) {
  htheta <- g(X %*% theta)
  J <- (1/length(y)) * sum((-y * log(htheta)) - ((1 - y) * log(1 - htheta)))
  return(J)
}

gradDescLogisticReg <- function(y, X, theta, alpha=0.01, iters=1000, Jplot=FALSE) {
  m <- length(y)
  JHist <- c()
  for (i in 1:iters) {
    htheta <- g(X %*% theta)
    gradient <- 1/m * t(X) %*% (htheta - y)
    newtheta <- theta - alpha * gradient
    JHist <- append(JHist, computeJ(y, X, newtheta))
    theta <- newtheta
  }
  if (Jplot) {
    plot(JHist, type="l")
  }
  return(list(theta=theta, JHist=JHist))
}

newtLogisticReg <- function(y, X, theta, iters=1000, Jplot=FALSE) {
  m <- length(y)
  JHist <- c()
  for (i in 1:iters) {
    htheta <- g(X %*% theta)
    gradient <- 1/m * t(X) %*% (htheta - y)
    hessian <- 1/m * t(X) %*% diag(as.vector(htheta * (1 - htheta))) %*% X
    newtheta <- theta - solve(hessian) %*% gradient # hessian may not be invertible; use MASS::ginv()
    JHist <- append(JHist, computeJ(y, X, newtheta))
    theta <- newtheta
  }
  if (Jplot) {
    plot(JHist, type="l")
  }
  return(list(theta=theta, JHist=JHist))
}

benchmark <- coef(glm(mtcars$am~mtcars$mpg+mtcars$wt, family=binomial()))

# will not converge!
#GD <- gradDescLogisticReg(y, X, initheta, alpha=0.1, iters=10000)$theta

# will scaling improve?

y <- mtcars\$am
scx <- as.matrix(scale(cbind(mtcars$mpg, mtcars$wt)))
scX <- cbind(rep(1, nrow(scx)), scx)
#scaled benchmark
scbenchmark <- coef(glm(y~scX[,2:3], family=binomial()))

scGD <- gradDescLogisticReg(y, scX, initheta, alpha=0.1, iters=10000)
newtLR <- newtLogisticReg(y, X, initheta)
scnewtLR <- newtLogisticReg(y, scX, initheta)
plot(scnewtLR$JHist, type="l", ylim=c(0.26, 0.5), ylab="Cost", xlab="No of iterations")
lines(scGD\$JHist[1:1000], col="red")
legend("topright", legend=c("Newton-Raphson","Gradient Descent"), col=c("black", "red"), lty=c(1,1), box.col="white")
```

Newton-Raphson also has one more advantage over gradient descent: there is no learning rate to tune. But despite all these benefits, watch the computational cost per iteration. Gradient descent costs O(number of features) per iteration, whereas Newton-Raphson costs O(number of features³), because each step solves a linear system in the Hessian.
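The cubic cost comes from the `solve()` call on the p×p Hessian in each Newton step. A hypothetical illustration with a random symmetric positive definite matrix (not the mtcars data):

```r
## Hypothetical sketch: the expensive part of a Newton step is
## solving the p x p Hessian system, which is O(p^3) in general.
set.seed(1)
p <- 300
H <- crossprod(matrix(rnorm(p * p), p, p)) + diag(p)  # symmetric positive definite
g <- rnorm(p)
step <- solve(H, g)       # O(p^3); a gradient descent update is only O(p)
system.time(solve(H, g))  # doubling p roughly multiplies this time by 8
```

With only three parameters, as in the mtcars example above, this cost is negligible; it only bites when the feature count is large.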