Unit 2 Linear Regression
This unit introduces linear regression; these notes review the basic operations.
Course link:
https://www.edx.org/course/the-analytics-edge
Read the data and inspect its basic structure.
setwd("E:\\The Analytics Edge\\Unit 2 Linear Regression")
wine = read.csv("wine.csv")
str(wine)
'data.frame': 25 obs. of 7 variables:
$ Year : int 1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
$ Price : num 7.5 8.04 7.69 6.98 6.78 ...
$ WinterRain : int 600 690 502 420 582 485 763 830 697 608 ...
$ AGST : num 17.1 16.7 17.1 16.1 16.4 ...
$ HarvestRain: int 160 80 130 110 187 187 290 38 52 155 ...
$ Age : int 31 30 28 26 25 24 23 22 21 20 ...
$ FrancePop : num 43184 43495 44218 45152 45654 ...
summary(wine)
Year Price WinterRain AGST HarvestRain
Min. :1952 Min. :6.205 Min. :376.0 Min. :14.98 Min. : 38.0
1st Qu.:1960 1st Qu.:6.519 1st Qu.:536.0 1st Qu.:16.20 1st Qu.: 89.0
Median :1966 Median :7.121 Median :600.0 Median :16.53 Median :130.0
Mean :1966 Mean :7.067 Mean :605.3 Mean :16.51 Mean :148.6
3rd Qu.:1972 3rd Qu.:7.495 3rd Qu.:697.0 3rd Qu.:17.07 3rd Qu.:187.0
Max. :1978 Max. :8.494 Max. :830.0 Max. :17.65 Max. :292.0
Age FrancePop
Min. : 5.0 Min. :43184
1st Qu.:11.0 1st Qu.:46584
Median :17.0 Median :50255
Mean :17.2 Mean :49694
3rd Qu.:23.0 3rd Qu.:52894
Max. :31.0 Max. :54602
Linear Regression (one variable)
Fitting a linear regression in R is straightforward: call lm() with a formula of the form response ~ predictor.
model1 = lm(Price ~ AGST, data=wine)
summary(model1)
Call:
lm(formula = Price ~ AGST, data = wine)
Residuals:
Min 1Q Median 3Q Max
-0.78450 -0.23882 -0.03727 0.38992 0.90318
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.4178 2.4935 -1.371 0.183710
AGST 0.6351 0.1509 4.208 0.000335 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4993 on 23 degrees of freedom
Multiple R-squared: 0.435, Adjusted R-squared: 0.4105
F-statistic: 17.71 on 1 and 23 DF, p-value: 0.000335
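To make the Coefficients table above more concrete, here is a minimal sketch of how the fitted intercept and slope turn into a prediction. Since wine.csv is not bundled with R, the built-in `cars` dataset is used as a stand-in (an assumption for illustration only); the same steps should apply to model1.

```r
# Stand-in data: the built-in `cars` dataset replaces wine.csv here.
fit <- lm(dist ~ speed, data = cars)

# coef() returns a named vector: "(Intercept)" and the slope for `speed`.
b <- coef(fit)

# Prediction by hand for speed = 10: intercept + slope * 10.
manual <- unname(b["(Intercept)"] + b["speed"] * 10)

# The same prediction through predict().
viaPredict <- unname(predict(fit, newdata = data.frame(speed = 10)))

# The two values agree up to floating-point tolerance.
print(all.equal(manual, viaPredict))
```

For model1, the analogous hand calculation would be -3.4178 + 0.6351 * AGST.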
To regress on two variables, join them with + in the formula.
model2 = lm(Price ~ AGST + HarvestRain, data=wine)
summary(model2)
Call:
lm(formula = Price ~ AGST + HarvestRain, data = wine)
Residuals:
Min 1Q Median 3Q Max
-0.88321 -0.19600 0.06178 0.15379 0.59722
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.20265 1.85443 -1.188 0.247585
AGST 0.60262 0.11128 5.415 1.94e-05 ***
HarvestRain -0.00457 0.00101 -4.525 0.000167 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3674 on 22 degrees of freedom
Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808
F-statistic: 26.59 on 2 and 22 DF, p-value: 1.347e-06
To regress on all remaining variables, use . on the right-hand side of the formula.
model3 = lm(Price ~ ., data=wine)
summary(model3)
Call:
lm(formula = Price ~ ., data = wine)
Residuals:
Min 1Q Median 3Q Max
-0.48179 -0.24662 -0.00726 0.22012 0.51987
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.092e-01 1.467e+02 0.005 0.996194
Year -5.847e-04 7.900e-02 -0.007 0.994172
WinterRain 1.043e-03 5.310e-04 1.963 0.064416 .
AGST 6.012e-01 1.030e-01 5.836 1.27e-05 ***
HarvestRain -3.958e-03 8.751e-04 -4.523 0.000233 ***
Age NA NA NA NA
FrancePop -4.953e-05 1.667e-04 -0.297 0.769578
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3019 on 19 degrees of freedom
Multiple R-squared: 0.8294, Adjusted R-squared: 0.7845
F-statistic: 18.47 on 5 and 19 DF, p-value: 1.044e-06
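The * symbols mark significance levels: the more stars, the smaller the p-value, that is, the stronger the evidence that the coefficient differs from zero. Stars do not measure how large a variable's effect is. As a rule of thumb, variables without any star are candidates for removal. The NA row for Age ("1 not defined because of singularities") arises because Age is an exact linear function of Year (in the rows shown above, Age = 1983 - Year), so its coefficient cannot be estimated separately.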
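The NA coefficient for Age in model3 can be reproduced on a small scale. The sketch below uses synthetic data (the response `y` and its noise level are made up for illustration) in which Age is an exact linear function of Year, mirroring the relationship visible in the first rows of the wine data.

```r
set.seed(1)  # synthetic data, for illustration only

d <- data.frame(Year = 1952:1976)
d$Age <- 1983 - d$Year                          # exact linear function of Year
d$y <- 0.03 * d$Age + rnorm(nrow(d), sd = 0.3)  # made-up response

fit <- lm(y ~ Year + Age, data = d)

# Age carries no information beyond Year, so lm() cannot estimate it
# and reports NA for its coefficient.
print(coef(fit))

# The two predictors are perfectly (negatively) correlated.
print(cor(d$Year, d$Age))
```

Dropping either Year or Age from the formula removes the singularity without changing the fitted values.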
Sum of Squared Errors
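An important quantity in linear regression is the Sum of Squared Errors (SSE), which can be computed from the residuals stored in the model.
The formula is
$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat y_i)^2$$
where $\hat y_i$ is the predicted value and $y_i$ is the observed value.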
SSE = sum(model3$residuals^2)
SSE
1.73211271534381
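As a cross-check on the manual calculation, base R's `deviance()` returns the residual sum of squares for an `lm` fit. A minimal sketch, using the built-in `mtcars` dataset as a stand-in for the wine data (an assumption for illustration):

```r
# Stand-in model on built-in data; the wine models would work the same way.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# SSE from the residuals, as in the text.
sseManual <- sum(residuals(fit)^2)

# For lm objects, deviance() is the residual sum of squares.
sseDeviance <- deviance(fit)

print(all.equal(sseManual, sseDeviance))
```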
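Another important quantity is the Root-Mean-Square Error (RMSE), $\text{RMSE} = \sqrt{\text{SSE}/n}$, where $n$ is the number of observations. It is computed as follows: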
RMSE = sqrt(SSE/nrow(wine))
RMSE
0.263219506522128
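Note that this RMSE divides by the number of observations, whereas the "Residual standard error" printed by `summary()` divides by the residual degrees of freedom. With the numbers above, sqrt(1.7321 / 19) ≈ 0.3019, which matches the summary of model3. The sketch below shows the same relationship on the built-in `mtcars` dataset, used here only as a stand-in:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

sse <- sum(residuals(fit)^2)
n <- nrow(mtcars)

# RMSE as defined in the text: divide by the number of observations.
rmse <- sqrt(sse / n)

# Residual standard error: divide by the residual degrees of freedom
# (observations minus the number of estimated coefficients).
rse <- sqrt(sse / df.residual(fit))

print(all.equal(rse, summary(fit)$sigma))
print(c(rmse = rmse, rse = rse))
```

Because the degrees of freedom are smaller than n, the residual standard error is always slightly larger than this RMSE.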
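A further quantity is the total sum of squares (SST), the SSE of a baseline model that always predicts the mean $\bar y$:
$$\text{SST} = \sum_{i=1}^{n} (y_i - \bar y)^2$$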
SST = sum((wine$Price - mean(wine$Price))^2)
SST
10.1506377256
Model quality can be measured with $R^2$:
$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}}$$
The closer $R^2$ is to $1$, the better the fit.
R2 = 1 - SSE/SST
R2
0.829359222329903
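The value above agrees with the "Multiple R-squared: 0.8294" line in the summary of model3. The adjusted version reported next to it penalizes the number of predictors; with $n = 25$ and 19 residual degrees of freedom, $1 - (1 - 0.8294) \cdot 24 / 19 \approx 0.7845$. A minimal sketch of both calculations, again on the built-in `mtcars` dataset as a stand-in:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

sse <- sum(residuals(fit)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
n <- nrow(mtcars)

# R-squared as in the text.
r2 <- 1 - sse / sst

# Adjusted R-squared: rescale by (n - 1) / (residual degrees of freedom).
adjR2 <- 1 - (1 - r2) * (n - 1) / df.residual(fit)

print(all.equal(r2, summary(fit)$r.squared))
print(all.equal(adjR2, summary(fit)$adj.r.squared))
```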
Make predictions
With a fitted model we can make predictions on new data; here we use model2.
wineTest = read.csv("wine_test.csv")
predictTest = predict(model2, newdata=wineTest)
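Once predictions are available, a natural next step is an out-of-sample $R^2$. The sketch below is one common way to compute it, assuming the test set also contains the response column; it uses a hypothetical train/test split of the built-in `mtcars` dataset rather than wine_test.csv. The baseline in SST is the mean of the training response, since that is what a no-predictor model fitted on the training data would predict.

```r
# Hypothetical split of built-in data, standing in for wine / wineTest.
train <- mtcars[1:24, ]
test <- mtcars[25:32, ]

fit <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# Errors of the model on the test set.
sseTest <- sum((test$mpg - pred)^2)

# Errors of the baseline: always predict the training mean.
sstTest <- sum((test$mpg - mean(train$mpg))^2)

r2Test <- 1 - sseTest / sstTest
print(r2Test)
```

Unlike the in-sample $R^2$, this value can be negative when the model predicts the test data worse than the baseline does.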
Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit Doraemonzzz when reposting!