
INTRO

1. Backward Elimination

    # Note: running stepwise variable selection automatically

2. Forward Selection

3. Stepwise Method

 

1. Backward Elimination

Start from the model that contains all candidate independent variables, then remove the variable contributing least (by the sum-of-squares criterion) one at a time, repeating until no insignificant explanatory variable remains; the resulting model is selected.
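The loop described above can be sketched as a small helper in base R. This is a minimal sketch (not code from the post itself) that drops the least significant predictor each round until every remaining p-value falls below a chosen threshold; the function name and the `alpha = 0.05` cutoff are assumptions for illustration.

```r
# A minimal sketch of p-value-based backward elimination: repeatedly
# refit the model and drop the least significant predictor until all
# remaining p-values are at or below the threshold.
backward_eliminate <- function(df, response = "Y", alpha = 0.05) {
  preds <- setdiff(names(df), response)
  repeat {
    fit <- lm(reformulate(preds, response = response), data = df)
    pvals <- summary(fit)$coefficients[-1, 4]   # p-value column, intercept dropped
    worst <- which.max(pvals)                   # least significant predictor
    if (pvals[worst] <= alpha || length(preds) == 1) return(fit)
    preds <- preds[-worst]                      # remove it and refit
  }
}
```

Applied to the dataset below, this reproduces the manual Step 1 to Step 3 sequence: X3 is removed first, then X4, and the loop stops at Y ~ X1 + X2.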

 

dataset

X1  X2  X3  X4      Y
 7  26   6  60   78.5
 1  29  15  52   74.3
11  56   8  20  104.3
11  31   8  47   87.6
 7  52   6  33   95.9
11  55   9  22  109.2
 3  71  17   6  102.7
 1  31  22  44   72.5
 2  54  18  22   93.1
21  47   4  26  115.9
 1  40  23  34   83.8
11  66   9  12  113.3
10  68   8  12  109.4

 

Step1

test_backward_elimination.R
rm(list=ls())
setwd("~/Rcoding")

X1 = c(7,1,11,11,7,11,3,1,2,21,1,11,10)
X2 = c(26,29,56,31,52,55,71,31,54,47,40,66,68)
X3 = c(6,15,8,8,6,9,17,22,18,4,23,9,8)
X4 = c(60,52,20,47,33,22,6,44,22,26,34,12,12)
Y = c(78.5,74.3,104.3,87.6,95.9,109.2,102.7,72.5,93.1,115.9,83.8,113.3,109.4)

df = data.frame(X1,X2,X3,X4,Y)
df
# write.csv(df,file="~/R_coding/test_df.csv")

model = lm(Y ~ .,data=df)
model
summary(model)

Output: model

> source("~/Rcoding/test_backward_elimination.R", echo=TRUE)

> rm(list=ls())

> setwd("~/Rcoding")

> X1 = c(7,1,11,11,7,11,3,1,2,21,1,11,10)

> X2 = c(26,29,56,31,52,55,71,31,54,47,40,66,68)

> X3 = c(6,15,8,8,6,9,17,22,18,4,23,9,8)

> X4 = c(60,52,20,47,33,22,6,44,22,26,34,12,12)

> Y = c(78.5,74.3,104.3,87.6,95.9,109.2,102.7,72.5,93.1,115.9,83.8,113.3,109.4)

> df = data.frame(X1,X2,X3,X4,Y)

> df
   X1 X2 X3 X4     Y
1   7 26  6 60  78.5
2   1 29 15 52  74.3
3  11 56  8 20 104.3
4  11 31  8 47  87.6
5   7 52  6 33  95.9
6  11 55  9 22 109.2
7   3 71 17  6 102.7
8   1 31 22 44  72.5
9   2 54 18 22  93.1
10 21 47  4 26 115.9
11  1 40 23 34  83.8
12 11 66  9 12 113.3
13 10 68  8 12 109.4

> # write.csv(df,file="~/R_coding/test_df.csv")
> 
> model = lm(Y ~ .,data=df)

> model

Call:
lm(formula = Y ~ ., data = df)

Coefficients:
(Intercept)           X1           X2           X3           X4  
    62.4054       1.5511       0.5102       0.1019      -0.1441  


> summary(model)

Call:
lm(formula = Y ~ ., data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1750 -1.6709  0.2508  1.3783  3.9254 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  62.4054    70.0710   0.891   0.3991  
X1            1.5511     0.7448   2.083   0.0708 .
X2            0.5102     0.7238   0.705   0.5009  
X3            0.1019     0.7547   0.135   0.8959  
X4           -0.1441     0.7091  -0.203   0.8441  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-squared:  0.9824,	Adjusted R-squared:  0.9736 
F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-07

 

Regression Analysis : model

-> Remove X3, the independent (explanatory) variable with the largest p-value, i.e. the least significant one
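The "largest p-value" decision can also be read off the coefficient table programmatically rather than by eye. A sketch, assuming the `df` built in the script above:

```r
# Find the least significant predictor in the Step 1 full model by
# extracting the p-value column of the coefficient table.
model <- lm(Y ~ ., data = df)
p <- summary(model)$coefficients[-1, 4]   # p-values, intercept row dropped
names(which.max(p))                       # "X3" -> the first variable to remove
```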

 

 

Step2

 

test_backward_elimination.R
...

## Eliminate 1st high p-value : X3
model_1 = lm(Y ~ X1+X2+X4,data=df)
model_1
summary(model_1)

Output: model_1

...

> ## Eliminate 1st high p-value : X3
> model_1 = lm(Y ~ X1+X2+X4,data=df)

> model_1

Call:
lm(formula = Y ~ X1 + X2 + X4, data = df)

Coefficients:
(Intercept)           X1           X2           X4  
    71.6483       1.4519       0.4161      -0.2365  


> summary(model_1)

Call:
lm(formula = Y ~ X1 + X2 + X4, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0919 -1.8016  0.2562  1.2818  3.8982 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  71.6483    14.1424   5.066 0.000675 ***
X1            1.4519     0.1170  12.410 5.78e-07 ***
X2            0.4161     0.1856   2.242 0.051687 .  
X4           -0.2365     0.1733  -1.365 0.205395    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.309 on 9 degrees of freedom
Multiple R-squared:  0.9823,	Adjusted R-squared:  0.9764 
F-statistic: 166.8 on 3 and 9 DF,  p-value: 3.323e-08

Regression Analysis : model_1

-> Remove X4, the independent (explanatory) variable with the largest p-value, i.e. the least significant one

 

 

Step3

test_backward_elimination.R
...

## Eliminate 2nd high p-value : X3, X4
model_2 = lm(Y ~ X1+X2,data=df)
model_2
summary(model_2)

Output: model_2

...

> ## Eliminate 2nd high p-value : X3, X4
> model_2 = lm(Y ~ X1+X2,data=df)

> model_2

Call:
lm(formula = Y ~ X1 + X2, data = df)

Coefficients:
(Intercept)           X1           X2  
    52.5773       1.4683       0.6623  


> summary(model_2)

Call:
lm(formula = Y ~ X1 + X2, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-2.893 -1.574 -1.302  1.363  4.048 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.57735    2.28617   23.00 5.46e-10 ***
X1           1.46831    0.12130   12.11 2.69e-07 ***
X2           0.66225    0.04585   14.44 5.03e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.406 on 10 degrees of freedom
Multiple R-squared:  0.9787,	Adjusted R-squared:  0.9744 
F-statistic: 229.5 on 2 and 10 DF,  p-value: 4.407e-09

Regression Analysis : model_2

-> The p-values of explanatory variables X1 and X2 are both significant, so we stop removing variables.

With F-statistic 229.5 and p-value 4.407e-09, the fitted regression model is statistically highly significant at the 5% significance level.

$ \hat{y} = 52.5773 + 1.4683\,X_1 + 0.6623\,X_2 $
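The final model can then be used for prediction. A short sketch; the new observation here (`X1 = 10, X2 = 50`) is hypothetical and not from the dataset above:

```r
# Predict Y from the final backward-elimination model, assuming df
# is the data frame built in the scripts above.
model_2 <- lm(Y ~ X1 + X2, data = df)
new_obs <- data.frame(X1 = 10, X2 = 50)   # hypothetical new observation
predict(model_2, newdata = new_obs)
# = 52.5773 + 1.4683*10 + 0.6623*50, about 100.4
```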

 

 

 

# Note: running stepwise variable selection automatically

step(lm(response ~ predictors, dataset), scope=list(lower=~1, upper=~predictors), direction="method")

- lm : the regression model to start from

- scope : the range of variables to consider; in lower, ~1 denotes the intercept-only (constant) model, and in upper you list all the explanatory variables

- direction : the variable-selection method; one of "backward", "forward", "both"

 

test_backward_elimination_step.R
rm(list=ls())
setwd("~/R_coding")

X1 = c(7,1,11,11,7,11,3,1,2,21,1,11,10)
X2 = c(26,29,56,31,52,55,71,31,54,47,40,66,68)
X3 = c(6,15,8,8,6,9,17,22,18,4,23,9,8)
X4 = c(60,52,20,47,33,22,6,44,22,26,34,12,12)
Y = c(78.5,74.3,104.3,87.6,95.9,109.2,102.7,72.5,93.1,115.9,83.8,113.3,109.4)

df = data.frame(X1,X2,X3,X4,Y)
df

step_backward_model = step(lm(Y ~ X1+X2,df), scope=list(lower=~1,upper=~X1+X2+X3+X4), direction="backward")
step_backward_model
summary(step_backward_model)

Output: step_backward_model

...

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.57735    2.28617   23.00 5.46e-10 ***
X1           1.46831    0.12130   12.11 2.69e-07 ***
X2           0.66225    0.04585   14.44 5.03e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.406 on 10 degrees of freedom
Multiple R-squared:  0.9787,	Adjusted R-squared:  0.9744 
F-statistic: 229.5 on 2 and 10 DF,  p-value: 4.407e-09

The statistical estimates are identical to those of model_2, which was obtained above by manual backward elimination.
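The same `step()` call also covers the other two methods listed in the intro. As a sketch (assuming the same `df` as above), forward selection starts from the intercept-only model and adds variables, while `direction="both"` from the same starting point gives the stepwise method:

```r
# Forward selection: start from the intercept-only model (Y ~ 1) and
# let step() add variables from the scope one at a time by AIC.
step_forward_model <- step(lm(Y ~ 1, data = df),
                           scope = list(lower = ~1, upper = ~X1 + X2 + X3 + X4),
                           direction = "forward")
summary(step_forward_model)
# direction = "both" from the same starting model gives the stepwise method
```

Note that with AIC as the criterion, forward and backward runs need not stop at the same model that p-value-based elimination reaches.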

 

 

 

 

 
