티스토리 뷰

R

(macOS)[R] 데이터 분석 : Exercise

jinozpersona 2022. 4. 21. 16:22

INTRO

1999년 미국 회사의 CEO와 성과 자료

 

1. 표의 빈칸에 해당하는 기초 통계량을 구하시오.

2. profits에 따른 CEO의 salary에 대한 산점도를 그리고 상관계수를 구하시오.

3. profits에 따른 CEO의 salary에 대한 단순선형회귀식을 구하고 분석하시오.

4. profits과 age, sales에 따른 CEO의 salary의 회귀식을 구하고 분석하시오.

5. profit, age, sales, tenure, assets에 따른 CEO의 salary의 회귀식을 구하고 분석하시오.

6. 후진제거법으로 최적의 회귀식을 찾으시오. 단, salary를 종속변수로 한다.

7. 전진선택법으로 최적의 회귀식을 찾으시오. 단, salary를 종속변수로 한다.

8. 단계적방법으로 최적의 회귀식을 찾으시오. 단, salary를 종속변수로 한다.

변수의 이름과 의미는 다음과 같다.

salary = 1999 salary + bonuses
tenure = number of years as CEO (=0 if less than 6 months)
age= age of CEO
sales = total 1998 sales revenue of firm i
profits = 1998 profits for firm i
assets = total assets of firm i in 1998
firm salary tenure age sales profits assets
1 3030 7 61 161315 2956 257389
2 6050 0 51 144416 22071 237545
3 3571 11 63 139208 4430 49271
4 3300 6 60 100697 6370 92630
5 0 18 63 100469 9296 355935
6 9375 6 57 81667 6328 86100
7 9525 15 60 76431 5807 668641
8 5000 5 61 57813 5372 59920
9 999 3 57 56154 1120 36672
10 3300 2 60 53588 6398 59550
11 3500 16 63 50777 5165 617679
12 2493 5 61 47678 1704 42754
13 1911 7 58 47061 2945 33673
14 2130 4 59 41322 1048 37675
15 1185 0 56 37154 3780 30966
16 5236 2 60 35853 1259 299804
17 1990 4 60 33674 568 14166
18 6000 32 74 33296 3765 19166
19 6229 5 63 32379 3782 194398
20 1523 3 56 31707 578 3665875
statistics            
mean            
variance            
standard
deviation
           
median            
max            
min            

 

 

1. 표의 빈칸에 해당하는 기초 통계량을 구하시오

Ex_firm_ceo.R
rm(list=ls())
library(data.table)

data_csv <- read.csv("~/Rcoding/firm_ceo.csv")
datas <- data_csv[1:20,2:7]

## 1.
datas_sum = summary(datas)
str(datas_sum)

# test for each col.
salary_mean = mean(datas$salary)
salary_mean
salary_var = var(datas$salary)
salary_var
salary_sd = sd(datas$salary)
salary_sd

ex_sol = data.frame(matrix(nrow=6,ncol=6))
for (i in 1:ncol(datas)) {
  data_mean   = mean(datas[,i])
  data_var    = var(datas[,i])
  data_sd     = sd(datas[,i])
  data_median = median(datas[,i])
  data_max    = max(datas[,i])
  data_min    = min(datas[,i])
  ex_sol[,i] <- c(data_mean,data_var,data_sd,data_median,data_max,data_min)
}
colnames(ex_sol) <- c("salary","tenure","age","sales","profits","assets")
rownames(ex_sol) <- c("mean","var","sd","median","max","min")
print(ex_sol)

출력결과

> rm(list=ls())

> library(data.table)

> data_csv <- read.csv("~/Rcoding/firm_ceo.csv")

> datas <- data_csv[1:20,2:7]

> ## 1.
> datas_sum = summary(datas)

> str(datas_sum)
 'table' chr [1:6, 1:6] "Min.   :   0  " "1st Qu.:1970  " ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:6] "" "" "" "" ...
  ..$ : chr [1:6] "    salary" "    tenure" "     age" "    sales" ...

> # test for each col.
> salary_mean = mean(datas$salary)

> salary_mean
[1] 3817.35

> salary_var = var(datas$salary)

> salary_var
[1] 6844228

> salary_sd = sd(datas$salary)

> salary_sd
[1] 2616.148

> ex_sol = data.frame(matrix(nrow=6,ncol=6))

> for (i in 1:ncol(datas)) {
+   data_mean   = mean(datas[,i])
+   data_var    = var(datas[,i])
+   data_sd     = sd(datas[,i])
+   data_median = medi .... [TRUNCATED] 

> colnames(ex_sol) <- c("salary","tenure","age","sales","profits","assets")

> rownames(ex_sol) <- c("mean","var","sd","median","max","min")

> print(ex_sol)
            salary    tenure       age        sales      profits       assets
mean      3817.350  7.550000 60.150000 6.813295e+04     4737.100 3.429905e+05
var    6844228.029 58.576316 19.502632 1.645457e+09 22255536.516 6.486447e+11
sd        2616.148  7.653517  4.416178 4.056423e+04     4717.577 8.053848e+05
median    3300.000  5.000000 60.000000 5.218250e+04     3781.000 7.301000e+04
max       9525.000 32.000000 74.000000 1.613150e+05    22071.000 3.665875e+06
min          0.000  0.000000 51.000000 3.170700e+04      568.000 1.416600e+04

 

 

 

 

2. profits에 따른 CEO의 salary에 대한 산점도를 그리고 상관계수를 구하시오.

Ex_firm_ceo.R
...

## 2.
par(mfrow=c(1,1))
plot(data$profits,data$salary)
cor(data$profits,data$salary)
cor(datas)

출력결과

> ##2.
> par(mfrow=c(1,1))

> plot(data$profits,data$salary)

> cor(data$profits,data$salary)
[1] 0.3019593

> cor(datas)
             salary      tenure         age       sales     profits
salary   1.00000000  0.19478167  0.09391173  0.11350054  0.30195934
tenure   0.19478167  1.00000000  0.83363621  0.01027995 -0.01510477
age      0.09391173  0.83363621  1.00000000 -0.18422778 -0.32539266
sales    0.11350054  0.01027995 -0.18422778  1.00000000  0.57055319
profits  0.30195934 -0.01510477 -0.32539266  0.57055319  1.00000000
assets  -0.12045382 -0.06564463 -0.20934915 -0.16217834 -0.14368491
             assets
salary  -0.12045382
tenure  -0.06564463
age     -0.20934915
sales   -0.16217834
profits -0.14368491
assets   1.00000000

 

 

 

3. profits에 따른 CEO의 salary에 대한 단순선형회귀식을 구하고 분석하시오.

Ex_firm_ceo.R
...

## 3. 
model_sr = lm(salary ~ profits, data=data)
model_sr
summary(model_sr)
par(mfrow=c(2,2))
plot(model_sr)

출력결과

...

> ## 3. 
> model = lm(salary ~ profits, .... [TRUNCATED] 

> model

Call:
lm(formula = salary ~ profits, data = data)

Coefficients:
(Intercept)      profits  
  3024.1107       0.1675  


> summary(model)

Call:
lm(formula = salary ~ profits, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4580.7 -1246.4  -730.4  1307.5  5528.5 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 3024.1107   822.6289   3.676  0.00173 **
profits        0.1675     0.1246   1.344  0.19570   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2562 on 18 degrees of freedom
Multiple R-squared:  0.09118,	Adjusted R-squared:  0.04069 
F-statistic: 1.806 on 1 and 18 DF,  p-value: 0.1957


> par(mfrow=c(2,2))

> plot(model)

 

 

4.profits과 age, sales에 따른 CEO의 salary의 회귀식을 구하고 분석하시오.

R console> install.packages("broom")

 

Ex_firm_ceo.R
...

## 4.
model_ex = lm(salary ~ age+sales+profits,data=data)
model_ex
summary(model_ex)
para_model_ex = c('salary','age','sales','profits')
coef_model_ex = paste(model_ex$coef)
para_model_ex
coef_model_ex

reg_eq = function(para,coef) {
  text1 = paste(para_model_ex[1], " = ", model_ex$coef[1])
  text2 = paste0(" + ", model_ex$coef[2], "*", para_model_ex[2])
  text3 = paste0(" + ", model_ex$coef[3], "*", para_model_ex[3])
  text4 = paste0(" + ", model_ex$coef[4], "*", para_model_ex[4])
  return(paste(text1,text2,text3,text4))
}
reg_eq(para_model_ex,coef_model_ex)
statics_model_ex = broom::glance(model_ex)[,4:5]
statics_model_ex$statistic
print(paste("F-statistic:",statics_model_ex$statistic,"p-value:",statics_model_ex$p.value))

출력결과

...

> ## 4.
> model_ex = lm(salary ~ age+sales+profits,data=data)

> model_ex

Call:
lm(formula = salary ~ age + sales + profits, data = data)

Coefficients:
(Intercept)          age        sales      profits  
 -4.569e+03    1.274e+02   -5.650e-03    2.340e-01  


> summary(model_ex)

Call:
lm(formula = salary ~ age + sales + profits, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5064.4 -1005.6  -762.7   553.4  5663.2 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.569e+03  9.034e+03  -0.506    0.620
age          1.274e+02  1.454e+02   0.876    0.394
sales       -5.650e-03  1.823e-02  -0.310    0.761
profits      2.340e-01  1.629e-01   1.436    0.170

Residual standard error: 2647 on 16 degrees of freedom
Multiple R-squared:  0.1377,	Adjusted R-squared:  -0.02403 
F-statistic: 0.8514 on 3 and 16 DF,  p-value: 0.4861


> para_model_ex = c('salary','age','sales','profits')

> coef_model_ex = paste(model_ex$coef)

> para_model_ex
[1] "salary"  "age"     "sales"   "profits"

> coef_model_ex
[1] "-4569.38405752631"    "127.403469383454"     "-0.00565029360758603"
[4] "0.233979976348927"   

> # paste("y=",model_ex$coef[1],"x + ",model_ex$coef[2])
> reg_eq = function(para,coef) {
+   text1 = paste(para_model_ex[1], " = ", model_ex$coef[1]) .... [TRUNCATED] 

> reg_eq(para_model_ex,coef_model_ex)
[1] "salary  =  -4569.38405752631  + 127.403469383454*age  + -0.00565029360758603*sales  + 0.233979976348927*profits"

> statics_model_ex = broom::glance(model_ex)[,4:5]

> statics_model_ex$statistic
    value 
0.8513706 

> print(paste("F-statistic:",statics_model_ex$statistic,"p-value:",statics_model_ex$p.value))
[1] "F-statistic: 0.851370582805658 p-value: 0.486064045102265"

 

 

5. profit, age, sales, tenure, assets에 따른 CEO의 salary의 회귀식을 구하고 분석하시오.

Ex_firm_ceo.R
...

## 5.
model_ex2 = lm(salary ~ profits+age+sales+tenure+assets,data=data)
summary(model_ex2)
para_model_ex2 = c('salary','profits','age','sales','tenure')
coef_model_ex2 = paste(model_ex2$coef)
para_model_ex2
coef_model_ex2

reg_eq = function(para,coef) {
  text1 = paste(para[1], " = ", coef[1])
  text2 = paste0(" + ", coef[2], "*", para[2])
  text3 = paste0(" + ", coef[3], "*", para[3])
  text4 = paste0(" + ", coef[4], "*", para[4])
  text5 = paste0(" + ", coef[5], "*", para[5])
  text6 = paste0(" + ", coef[6], "*", para[6])
  return(paste(text1,text2,text3,text4,text5,text6))
}
reg_eq(para_model_ex2,coef_model_ex2)
statics_model_ex2 = broom::glance(model_ex2)[,4:5]
print(paste("F-statistic:",statics_model_ex2$statistic,"p-value:",statics_model_ex2$p.value))

출력결과

...

> ## 5.
> model_ex2 = lm(salary ~ profits+age+sales+tenure+assets,data=data)

> summary(model_ex2)

Call:
lm(formula = salary ~ profits + age + sales + tenure + assets, 
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5173.6 -1006.6  -626.4   531.6  5462.7 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.295e+02  2.126e+04   0.030    0.977
profits      2.069e-01  1.989e-01   1.040    0.316
age          3.887e+01  3.611e+02   0.108    0.916
sales       -6.346e-03  1.961e-02  -0.324    0.751
tenure       4.881e+01  1.899e+02   0.257    0.801
assets      -1.939e-04  9.166e-04  -0.212    0.835

Residual standard error: 2822 on 14 degrees of freedom
Multiple R-squared:  0.1426,	Adjusted R-squared:  -0.1636 
F-statistic: 0.4658 on 5 and 14 DF,  p-value: 0.7954


> para_model_ex2 = c('salary','profits','age','sales','tenure')

> coef_model_ex2 = paste(model_ex2$coef)

> para_model_ex2
[1] "salary"  "profits" "age"     "sales"   "tenure" 

> coef_model_ex2
[1] "629.535762665344"      "0.206864514513282"     "38.8728334629791"     
[4] "-0.00634580332997101"  "48.8144166758324"      "-0.000193927067287707"

> reg_eq = function(para,coef) {
+   text1 = paste(para[1], " = ", coef[1])
+   text2 = paste0(" + ", coef[2], "*", para[2])
+   text3 = paste0(" + ", .... [TRUNCATED] 

> reg_eq(para_model_ex2,coef_model_ex2)
[1] "salary  =  629.535762665344  + 0.206864514513282*profits  + 38.8728334629791*age  + -0.00634580332997101*sales  + 48.8144166758324*tenure  + -0.000193927067287707*NA"

> statics_model_ex2 = broom::glance(model_ex2)[,4:5]

> print(paste("F-statistic:",statics_model_ex2$statistic,"p-value:",statics_model_ex2$p.value))
[1] "F-statistic: 0.465845043891575 p-value: 0.795374324895597"

 

6. 후진제거법으로 최적의 회귀식을 찾으시오. 단, salary를 종속변수로 한다.

Ex_firm_ceo.R
...

## 6.
step_backward_model = step(lm(salary ~ tenure+age+sales+profits+assets,datas), scope=list(lower=~1,upper=~tenure+age+sales+profits+assets), direction="backward")
step_backward_model
summary(step_backward_model)

출력결과

...

> ## 6.
> step_backward_model = step(lm(salary ~ tenure+age+sales+profits+assets,datas), scope=list(lower=~1,upper=~tenure+age+sales+profits+assets),  .... [TRUNCATED] 
Start:  AIC=322.67
salary ~ tenure + age + sales + profits + assets

          Df Sum of Sq       RSS    AIC
- age      1     92281 111583466 320.69
- assets   1    356490 111847674 320.74
- tenure   1    526022 112017206 320.77
- sales    1    833628 112324812 320.82
- profits  1   8617106 120108290 322.16
<none>                 111491184 322.67

Step:  AIC=320.69
salary ~ tenure + sales + profits + assets

          Df Sum of Sq       RSS    AIC
- assets   1    676425 112259890 318.81
- sales    1    892500 112475966 318.85
- tenure   1   4986238 116569704 319.57
- profits  1  10828227 122411692 320.54
<none>                 111583466 320.69

Step:  AIC=318.81
salary ~ tenure + sales + profits

          Df Sum of Sq       RSS    AIC
- sales    1    754779 113014669 316.95
- tenure   1   5257207 117517097 317.73
- profits  1  11229913 123489803 318.72
<none>                 112259890 318.81

Step:  AIC=316.95
salary ~ tenure + profits

          Df Sum of Sq       RSS    AIC
- tenure   1   5168658 118183328 315.84
<none>                 113014669 316.95
- profits  1  12091946 125106615 316.98

Step:  AIC=315.84
salary ~ profits

          Df Sum of Sq       RSS    AIC
- profits  1  11857005 130040333 315.75
<none>                 118183328 315.84

Step:  AIC=315.75
salary ~ 1


> step_backward_model

Call:
lm(formula = salary ~ 1, data = datas)

Coefficients:
(Intercept)  
       3817  


> summary(step_backward_model)

Call:
lm(formula = salary ~ 1, data = datas)

Residuals:
    Min      1Q  Median      3Q     Max 
-3817.3 -1847.1  -517.3  1609.7  5707.7 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3817        585   6.526    3e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2616 on 19 degrees of freedom

 

7. 전진선택법으로 최적의 회귀식을 찾으시오. 단, salary를 종속변수로 한다.

Ex_firm_ceo.R
...

## 7.
step_forward_model = step(lm(salary ~ tenure+age+sales+profits+assets,datas), scope=list(lower=~1,upper=~tenure+age+sales+profits+assets), direction="forward")
step_forward_model
summary(step_forward_model)

출력결과

...

> ## 7.
> step_forward_model = step(lm(salary ~ tenure+age+sales+profits+assets,datas), scope=list(lower=~1,upper=~tenure+age+sales+profits+assets), d .... [TRUNCATED] 
Start:  AIC=322.67
salary ~ tenure + age + sales + profits + assets


> step_forward_model

Call:
lm(formula = salary ~ tenure + age + sales + profits + assets, 
    data = datas)

Coefficients:
(Intercept)       tenure          age        sales      profits       assets  
  6.295e+02    4.881e+01    3.887e+01   -6.346e-03    2.069e-01   -1.939e-04  


> summary(step_forward_model)

Call:
lm(formula = salary ~ tenure + age + sales + profits + assets, 
    data = datas)

Residuals:
    Min      1Q  Median      3Q     Max 
-5173.6 -1006.6  -626.4   531.6  5462.7 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.295e+02  2.126e+04   0.030    0.977
tenure       4.881e+01  1.899e+02   0.257    0.801
age          3.887e+01  3.611e+02   0.108    0.916
sales       -6.346e-03  1.961e-02  -0.324    0.751
profits      2.069e-01  1.989e-01   1.040    0.316
assets      -1.939e-04  9.166e-04  -0.212    0.835

Residual standard error: 2822 on 14 degrees of freedom
Multiple R-squared:  0.1426,	Adjusted R-squared:  -0.1636 
F-statistic: 0.4658 on 5 and 14 DF,  p-value: 0.7954

 

8. 단계적방법으로 최적의 회귀식을 찾으시오. 단, salary를 종속변수로 한다

Ex_firm_ceo.R
...

## 8.
step_stepwise_model = step(lm(salary ~ tenure+age+sales+profits+assets,data), scope=list(lower=~1,upper=~tenure+age+sales+profits+assets), direction="both")
step_stepwise_model
summary(step_stepwise_model)

출력결과

...

> ## 8.
> step_stepwise_model = step(lm(salary ~ tenure+age+sales+profits+assets,data), scope=list(lower=~1,upper=~tenure+age+sales+profits+assets), d .... [TRUNCATED] 
Start:  AIC=322.67
salary ~ tenure + age + sales + profits + assets

          Df Sum of Sq       RSS    AIC
- age      1     92281 111583466 320.69
- assets   1    356490 111847674 320.74
- tenure   1    526022 112017206 320.77
- sales    1    833628 112324812 320.82
- profits  1   8617106 120108290 322.16
<none>                 111491184 322.67

Step:  AIC=320.69
salary ~ tenure + sales + profits + assets

          Df Sum of Sq       RSS    AIC
- assets   1    676425 112259890 318.81
- sales    1    892500 112475966 318.85
- tenure   1   4986238 116569704 319.57
- profits  1  10828227 122411692 320.54
<none>                 111583466 320.69
+ age      1     92281 111491184 322.67

Step:  AIC=318.81
salary ~ tenure + sales + profits

          Df Sum of Sq       RSS    AIC
- sales    1    754779 113014669 316.95
- tenure   1   5257207 117517097 317.73
- profits  1  11229913 123489803 318.72
<none>                 112259890 318.81
+ assets   1    676425 111583466 320.69
+ age      1    412217 111847674 320.74

Step:  AIC=316.95
salary ~ tenure + profits

          Df Sum of Sq       RSS    AIC
- tenure   1   5168658 118183328 315.84
<none>                 113014669 316.95
- profits  1  12091946 125106615 316.98
+ sales    1    754779 112259890 318.81
+ assets   1    538704 112475966 318.85
+ age      1    455187 112559482 318.87

Step:  AIC=315.84
salary ~ profits

          Df Sum of Sq       RSS    AIC
- profits  1  11857005 130040333 315.75
<none>                 118183328 315.84
+ age      1   5370818 112812510 316.91
+ tenure   1   5168658 113014669 316.95
+ assets   1    788629 117394698 317.71
+ sales    1    666230 117517097 317.73

Step:  AIC=315.75
salary ~ 1

          Df Sum of Sq       RSS    AIC
<none>                 130040333 315.75
+ profits  1  11857005 118183328 315.84
+ tenure   1   4933717 125106615 316.98
+ assets   1   1886771 128153561 317.46
+ sales    1   1675228 128365105 317.49
+ age      1   1146879 128893453 317.58

> step_stepwise_model

Call:
lm(formula = salary ~ 1, data = data)

Coefficients:
(Intercept)  
       3817  


> summary(step_stepwise_model)

Call:
lm(formula = salary ~ 1, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3817.3 -1847.1  -517.3  1609.7  5707.7 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3817        585   6.526    3e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2616 on 19 degrees of freedom

 

 

반응형
댓글
공지사항
최근에 올라온 글
최근에 달린 댓글
Total
Today
Yesterday
링크
«   2024/07   »
1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31
글 보관함