Unit 4 Trees

This week mainly covers how to use CART and randomForest.

Course link:

https://www.edx.org/course/the-analytics-edge

setwd("E:\\The Analytics Edge\\Unit 4 Trees")
stevens = read.csv("stevens.csv")

Take a look at the structure of the data.

str(stevens)
'data.frame':    566 obs. of  9 variables:
 $ Docket    : Factor w/ 566 levels "00-1011","00-1045",..: 63 69 70 145 97 181 242 289 334 436 ...
 $ Term      : int  1994 1994 1994 1994 1995 1995 1996 1997 1997 1999 ...
 $ Circuit   : Factor w/ 13 levels "10th","11th",..: 4 11 7 3 9 11 13 11 12 2 ...
 $ Issue     : Factor w/ 11 levels "Attorneys","CivilRights",..: 5 5 5 5 9 5 5 5 5 3 ...
 $ Petitioner: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Respondent: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ LowerCourt: Factor w/ 2 levels "conser","liberal": 2 2 2 1 1 1 1 1 1 1 ...
 $ Unconst   : int  0 0 0 0 0 1 0 1 0 0 ...
 $ Reverse   : int  1 1 1 1 1 0 1 1 1 1 ...

Split the data

The goal here is to predict the Reverse variable; first, split the data into training and test sets.

library(caTools)
set.seed(3000)
spl = sample.split(stevens$Reverse, SplitRatio = 0.7)
Train = subset(stevens, spl==TRUE)
Test = subset(stevens, spl==FALSE)
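Note that sample.split stratifies on the outcome, so the proportion of Reverse = 1 should be roughly the same in both sets. A quick sanity check (a hypothetical follow-up, not part of the course code):

```r
# Class proportions in each set; sample.split keeps these close to each other
prop.table(table(Train$Reverse))
prop.table(table(Test$Reverse))
```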

Install rpart library

To fit a decision tree model, installing rpart is enough; rpart.plot is mainly used to visualize the structure of the tree.

#install.packages("rpart")
library(rpart)
#install.packages("rpart.plot")
library(rpart.plot)

CART model

First, let's look at a CART model:

CART (Classification and Regression Trees) is essentially a decision tree; the stopping condition can be controlled by the minimum number of observations allowed in each leaf node, which in R corresponds to the minbucket parameter.
Using CART is straightforward, as shown below:

StevensTree = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method="class", minbucket=25)

Note that this is a classification problem, so method="class"; for a regression problem, this argument is not needed.
The prp function displays the structure of the tree.

prp(StevensTree)

(plot: CART tree structure drawn by prp)
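For comparison, a regression tree simply drops method="class" (rpart then defaults to method = "anova"). A minimal hypothetical sketch on the built-in mtcars data, since the course exercise is classification only:

```r
library(rpart)

# Regression tree: numeric outcome, no method="class" needed
RegTree = rpart(mpg ~ wt + hp, data = mtcars, minbucket = 5)

# Predictions are numeric values rather than class labels
predict(RegTree, newdata = mtcars[1:3, ])
```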

Make predictions

Making predictions works essentially the same way as before, using the predict function:

PredictCART = predict(StevensTree, newdata = Test, type = "class")

View the results.

table(Test$Reverse, PredictCART)
   PredictCART
     0  1
  0 41 36
  1 22 71
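From this confusion matrix, the overall accuracy is the diagonal over the total: (41 + 71) / 170 ≈ 0.659. Continuing the same session, this can be computed as:

```r
# Overall accuracy from the confusion matrix
cm = table(Test$Reverse, PredictCART)
sum(diag(cm)) / sum(cm)   # (41 + 71) / 170
```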

ROC curve

Next, let's revisit the ROC curve introduced last week.

library(ROCR)

PredictROC = predict(StevensTree, newdata = Test)
pred = prediction(PredictROC[,2], Test$Reverse)
perf = performance(pred, "tpr", "fpr")
plot(perf)

(plot: ROC curve for the CART model)
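The area under this curve can also be extracted with ROCR (continuing the same session; pred is the prediction object built above):

```r
# AUC of the CART model on the test set
auc = performance(pred, "auc")
as.numeric(auc@y.values)
```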

Random Forest

Next up is Random Forest.

A Random Forest is a model built from an ensemble of decision trees, and it is also simple to use in R.

library(randomForest)

# Convert outcome to factor
Train$Reverse = as.factor(Train$Reverse)
Test$Reverse = as.factor(Test$Reverse)
StevensForest = randomForest(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, ntree=200, nodesize=25 )

ntree is the number of trees, and nodesize plays the role of the minbucket parameter for each tree.

Make predictions again

PredictForest = predict(StevensForest, newdata = Test)
table(Test$Reverse, PredictForest)
   PredictForest
     0  1
  0 40 37
  1 18 75
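A random forest also reports which predictors matter most. A quick sketch using functions from the randomForest package (continuing the session above):

```r
# Mean decrease in Gini impurity contributed by each predictor
importance(StevensForest)

# Dot plot of the same importance scores
varImpPlot(StevensForest)
```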

Cross-validation

A single CART tree can easily overfit. To guard against this, rpart adds a penalty on tree complexity; the penalty has a corresponding coefficient, the complexity parameter cp, and R makes it convenient to find a good value for it by cross-validation.

1
2
3
4
#install.packages("caret")
library(caret)
#install.packages("e1071")
library(e1071)

Define cross-validation experiment

numFolds = trainControl( method = "cv", number = 10 )
cpGrid = expand.grid( .cp = seq(0.01,0.5,0.01))

Perform the cross-validation

train(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method = "rpart", trControl = numFolds, tuneGrid = cpGrid )
CART 

396 samples
  6 predictor
  2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 357, 356, 356, 356, 356, 356, ... 
Resampling results across tuning parameters:

  cp    Accuracy   Kappa      
  0.01  0.6209615  0.219308080
  0.02  0.6308974  0.238358325
  0.03  0.6258974  0.238018757
  0.04  0.6386538  0.269706555
  0.05  0.6436538  0.282648474
  0.06  0.6436538  0.282648474
  0.07  0.6436538  0.282648474
  0.08  0.6436538  0.282648474
  0.09  0.6436538  0.282648474
  0.10  0.6436538  0.282648474
  0.11  0.6436538  0.282648474
  0.12  0.6436538  0.282648474
  0.13  0.6436538  0.282648474
  0.14  0.6436538  0.282648474
  0.15  0.6436538  0.282648474
  0.16  0.6436538  0.282648474
  0.17  0.6436538  0.282648474
  0.18  0.6436538  0.282648474
  0.19  0.6436538  0.282648474
  0.20  0.6257051  0.238726905
  0.21  0.5707051  0.092505117
  0.22  0.5604487  0.064158660
  0.23  0.5428205  0.001593625
  0.24  0.5428205  0.001593625
  0.25  0.5453846  0.000000000
  0.26  0.5453846  0.000000000
  0.27  0.5453846  0.000000000
  0.28  0.5453846  0.000000000
  0.29  0.5453846  0.000000000
  0.30  0.5453846  0.000000000
  0.31  0.5453846  0.000000000
  0.32  0.5453846  0.000000000
  0.33  0.5453846  0.000000000
  0.34  0.5453846  0.000000000
  0.35  0.5453846  0.000000000
  0.36  0.5453846  0.000000000
  0.37  0.5453846  0.000000000
  0.38  0.5453846  0.000000000
  0.39  0.5453846  0.000000000
  0.40  0.5453846  0.000000000
  0.41  0.5453846  0.000000000
  0.42  0.5453846  0.000000000
  0.43  0.5453846  0.000000000
  0.44  0.5453846  0.000000000
  0.45  0.5453846  0.000000000
  0.46  0.5453846  0.000000000
  0.47  0.5453846  0.000000000
  0.48  0.5453846  0.000000000
  0.49  0.5453846  0.000000000
  0.50  0.5453846  0.000000000

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was cp = 0.19.

The best parameter value is cp = 0.19; use it to train a new model.

Create a new CART model

StevensTreeCV = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method="class", cp = 0.19)

Make predictions

PredictCV = predict(StevensTreeCV, newdata = Test, type = "class")
table(Test$Reverse, PredictCV)
   PredictCV
     0  1
  0 59 18
  1 29 64

Finally, a note on the concept of the Penalty Matrix.

Penalty Matrix

The Penalty Matrix assigns different weights to different types of errors, which defines a new loss function. Here is how it works:

PenaltyMatrix = matrix(c(0,1,2,0), byrow=TRUE, nrow=2)

Compute the penalty-weighted errors.

as.matrix(table(Test$Reverse, PredictCV))*PenaltyMatrix
   PredictCV
     0  1
  0  0 18
  1 58  0
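Beyond weighting errors after the fact, rpart can take these costs into account while growing the tree, through a loss matrix passed in its parms argument. A hedged sketch, retraining the model above with the penalty (keeping cp = 0.19 from the cross-validation purely for illustration):

```r
# CART model whose splits minimize penalty-weighted error:
# parms = list(loss = ...) tells rpart the cost of each kind of mistake
StevensTreePenalty = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst,
                           data = Train, method = "class",
                           parms = list(loss = PenaltyMatrix), cp = 0.19)

PredictPenalty = predict(StevensTreePenalty, newdata = Test, type = "class")
table(Test$Reverse, PredictPenalty)
```

With the loss matrix in place, the tree should shift its predictions toward the class whose misclassification is cheaper.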