Unit 2 Linear Regression

This unit covers linear regression; the notes below review the basic operations in R.

Course page:

https://www.edx.org/course/the-analytics-edge

Read in the data and inspect its basic structure.

setwd("E:\\The Analytics Edge\\Unit 2 Linear Regression")
wine = read.csv("wine.csv")
str(wine)
'data.frame':    25 obs. of  7 variables:
 $ Year       : int  1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
 $ Price      : num  7.5 8.04 7.69 6.98 6.78 ...
 $ WinterRain : int  600 690 502 420 582 485 763 830 697 608 ...
 $ AGST       : num  17.1 16.7 17.1 16.1 16.4 ...
 $ HarvestRain: int  160 80 130 110 187 187 290 38 52 155 ...
 $ Age        : int  31 30 28 26 25 24 23 22 21 20 ...
 $ FrancePop  : num  43184 43495 44218 45152 45654 ...
summary(wine)
      Year          Price         WinterRain         AGST        HarvestRain   
 Min.   :1952   Min.   :6.205   Min.   :376.0   Min.   :14.98   Min.   : 38.0  
 1st Qu.:1960   1st Qu.:6.519   1st Qu.:536.0   1st Qu.:16.20   1st Qu.: 89.0  
 Median :1966   Median :7.121   Median :600.0   Median :16.53   Median :130.0  
 Mean   :1966   Mean   :7.067   Mean   :605.3   Mean   :16.51   Mean   :148.6  
 3rd Qu.:1972   3rd Qu.:7.495   3rd Qu.:697.0   3rd Qu.:17.07   3rd Qu.:187.0  
 Max.   :1978   Max.   :8.494   Max.   :830.0   Max.   :17.65   Max.   :292.0  
      Age         FrancePop    
 Min.   : 5.0   Min.   :43184  
 1st Qu.:11.0   1st Qu.:46584  
 Median :17.0   Median :50255  
 Mean   :17.2   Mean   :49694  
 3rd Qu.:23.0   3rd Qu.:52894  
 Max.   :31.0   Max.   :54602  

Linear Regression (one variable)

Fitting a linear regression in R is straightforward: pass a formula of the form response ~ predictor to lm(), as shown here.

model1 = lm(Price ~ AGST, data=wine)
summary(model1)
Call:
lm(formula = Price ~ AGST, data = wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.78450 -0.23882 -0.03727  0.38992  0.90318 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.4178     2.4935  -1.371 0.183710    
AGST          0.6351     0.1509   4.208 0.000335 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4993 on 23 degrees of freedom
Multiple R-squared:  0.435,    Adjusted R-squared:  0.4105 
F-statistic: 17.71 on 1 and 23 DF,  p-value: 0.000335
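
To make the one-variable fit less of a black box, the sketch below checks that the coefficients returned by lm() match the closed-form least-squares formulas (slope = covariance / variance, intercept from the means). It is only an illustration: it uses R's built-in mtcars data as a stand-in for wine.csv, and the variable names fit, slope and intercept are chosen here, not taken from the course.

```r
# Built-in data set, used only as a stand-in for wine.csv
fit <- lm(mpg ~ wt, data = mtcars)

# Closed-form least-squares estimates for a single predictor
slope <- cov(mtcars$wt, mtcars$mpg) / var(mtcars$wt)
intercept <- mean(mtcars$mpg) - slope * mean(mtcars$wt)

# Both routes should agree up to floating-point error
print(coef(fit))
print(c(intercept, slope))
```

The same check would apply to model1 by replacing mpg and wt with Price and AGST.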

To regress on two variables, add the second variable to the right-hand side of the formula.

model2 = lm(Price ~ AGST + HarvestRain, data=wine)
summary(model2)
Call:
lm(formula = Price ~ AGST + HarvestRain, data = wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.88321 -0.19600  0.06178  0.15379  0.59722 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.20265    1.85443  -1.188 0.247585    
AGST         0.60262    0.11128   5.415 1.94e-05 ***
HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3674 on 22 degrees of freedom
Multiple R-squared:  0.7074,    Adjusted R-squared:  0.6808 
F-statistic: 26.59 on 2 and 22 DF,  p-value: 1.347e-06

To regress on all the other variables in the data frame, use . on the right-hand side of the formula.

model3 = lm(Price ~ ., data=wine)
summary(model3)
Call:
lm(formula = Price ~ ., data = wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.48179 -0.24662 -0.00726  0.22012  0.51987 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.092e-01  1.467e+02   0.005 0.996194    
Year        -5.847e-04  7.900e-02  -0.007 0.994172    
WinterRain   1.043e-03  5.310e-04   1.963 0.064416 .  
AGST         6.012e-01  1.030e-01   5.836 1.27e-05 ***
HarvestRain -3.958e-03  8.751e-04  -4.523 0.000233 ***
Age                 NA         NA      NA       NA    
FrancePop   -4.953e-05  1.667e-04  -0.297 0.769578    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3019 on 19 degrees of freedom
Multiple R-squared:  0.8294,    Adjusted R-squared:  0.7845 
F-statistic: 18.47 on 5 and 19 DF,  p-value: 1.044e-06

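The asterisks mark significance levels: the more asterisks, the smaller the p-value and the stronger the evidence that the coefficient differs from zero. They do not measure the size of the effect. As a rule of thumb, variables without an asterisk are candidates for removal. The Age row is NA because Age is an exact linear function of Year in this data (Year + Age = 1983 in the rows shown above), so R cannot estimate both coefficients.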
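
The "1 not defined because of singularities" note in the output above can be reproduced on a small scale. The sketch below, again on the built-in mtcars data rather than the wine data, adds a hypothetical column wt_shifted that is an exact linear function of wt and shows that lm() reports NA for it.

```r
d <- mtcars

# Hypothetical column: an exact linear function of wt, so it carries
# no information beyond what wt already provides
d$wt_shifted <- 10 - d$wt

fit <- lm(mpg ~ wt + wt_shifted, data = d)

# The coefficient for the redundant column is reported as NA
print(coef(fit))
```

Dropping one of the two redundant columns before fitting removes the NA row.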

Sum of Squared Errors

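A key quantity in linear regression is the Sum of Squared Errors (SSE), which can be computed from the residuals stored in the model. The formula is

$$SSE = \sum_{i=1}^{n} (y_i - \hat y_i)^2$$

where $\hat y_i$ is the predicted value and $y_i$ is the observed value.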

SSE = sum(model3$residuals^2)
SSE

1.73211271534381
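
As a cross-check on the SSE computation, the sketch below computes it three ways on the built-in mtcars data (a stand-in for the wine data): from the stored residuals, from observed minus fitted values, and with deviance(), which for an unweighted lm fit returns the residual sum of squares.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# 1. From the residuals stored in the model
sse_resid <- sum(residuals(fit)^2)

# 2. Directly from the definition: observed minus fitted values
sse_manual <- sum((mtcars$mpg - fitted(fit))^2)

# 3. deviance() of an unweighted lm fit is the residual sum of squares
sse_dev <- deviance(fit)

print(c(sse_resid, sse_manual, sse_dev))
```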

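Another important quantity is the Root-Mean-Square Error (RMSE), $RMSE = \sqrt{SSE/n}$, where $n$ is the number of observations. It is computed as follows: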

RMSE = sqrt(SSE/nrow(wine))
RMSE

0.263219506522128
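
Note that this RMSE divides by the number of observations, whereas the "Residual standard error" printed by summary() divides by the residual degrees of freedom, which is why the two numbers differ (0.2632 versus 0.3019 for model3). A minimal sketch of the difference, using the built-in mtcars data as a stand-in:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
sse <- sum(residuals(fit)^2)

# RMSE as defined in these notes: divide by the number of observations
rmse <- sqrt(sse / nrow(mtcars))

# Residual standard error: divide by the residual degrees of freedom
rse <- sqrt(sse / df.residual(fit))

print(c(rmse = rmse, rse = rse, sigma = summary(fit)$sigma))
```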

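A further quantity is the total sum of squares (SST), the SSE of a baseline model that always predicts the mean $\bar y$: $SST = \sum_{i=1}^{n} (y_i - \bar y)^2$. It is computed as follows: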

SST = sum((wine$Price - mean(wine$Price))^2)
SST

10.1506377256

Model quality can be measured with $R^2$:

$$R^2 = 1 - \frac{SSE}{SST}$$

The closer $R^2$ is to $1$, the better the fit.

R2 = 1 - SSE/SST
R2

0.829359222329903
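
The hand-computed value matches the "Multiple R-squared" line of summary(model3). The sketch below repeats the check on the built-in mtcars data (a stand-in for the wine data) and also derives the adjusted R-squared with the usual formula, where p is the number of predictors excluding the intercept.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

sse <- sum(residuals(fit)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2 <- 1 - sse / sst

# Adjusted R-squared penalizes the number of predictors p
n <- nrow(mtcars)
p <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
print(c(summary(fit)$r.squared, summary(fit)$adj.r.squared))
```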

Make predictions

With a model in hand, the next step is to make predictions; model2 is used here.

wineTest = read.csv("wine_test.csv")
predictTest = predict(model2, newdata=wineTest)
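
A natural follow-up is to score the predictions on the held-out data. The sketch below shows one common convention, an out-of-sample R-squared whose baseline is the mean of the training response. It uses the built-in mtcars data with an arbitrary row split as a stand-in for wine.csv and wine_test.csv; the names train, test and r2_test are illustrative.

```r
# Arbitrary split of a built-in data set into training and test rows
train <- mtcars[1:24, ]
test <- mtcars[25:32, ]

fit <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# Out-of-sample SSE, and SST against the training-set mean as baseline
sse_test <- sum((test$mpg - pred)^2)
sst_test <- sum((test$mpg - mean(train$mpg))^2)
r2_test <- 1 - sse_test / sst_test

print(r2_test)
```

Unlike the in-sample value, an out-of-sample R-squared can be negative when the model predicts worse than the baseline.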