(macOS)[R] Missing Value Handling / Outlier Detection

jinozpersona 2022. 4. 4. 18:41

INTRO

1. Data exploration

2. Missing value handling

3. Outlier detection

1. Data exploration: understanding the structure of the data with basic statistics

summary (summary statistics), cov (covariance), cor (correlation)

test_missingdata_outlier.R

rm(list=ls())
setwd("~/Rcoding")

data(iris)
head(iris)
str(iris)

## basic statistics
summary(iris)
## covariance
cov(iris[,1:4])
## correlation
cor(iris[,1:4])

Output

> source("~/Rcoding/test_missingdata_outlier.R", echo=TRUE)

> rm(list=ls())

> setwd("~/Rcoding")

> data(iris)

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

> str(iris)
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> ## basic statistics
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  

> ## covariance
> cov(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063

> ## correlation
> cor(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
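
The two matrices above are related: each correlation is the covariance divided by the product of the two standard deviations. The following is a minimal sketch that checks this on the same four iris columns; the names `num`, `s` and `manual_cor` are illustrative and not part of the original script.

```r
# Numeric columns of iris (same selection as iris[,1:4] above)
num <- iris[, 1:4]

# Standard deviation of each column
s <- sapply(num, sd)

# Divide each covariance by the product of the two standard deviations
manual_cor <- cov(num) / outer(s, s)

# Both comparisons should be TRUE up to floating-point tolerance
isTRUE(all.equal(manual_cor, cor(num)))
isTRUE(all.equal(cov2cor(cov(num)), cor(num)))
```

`cov2cor()` in base R performs the same rescaling in one call.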

2. Missing value handling

Checking for NA (Not Available) values and handling missing values with the Amelia package

## missing data handle
y = c(1,2,3,NA)
is.na(y)

Output

> ## missing data handle
> y = c(1,2,3,NA)

> is.na(y)
[1] FALSE FALSE FALSE  TRUE
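
Because `is.na()` returns a logical vector, it can also be used to count or drop missing values. A small sketch using the same vector `y`; the names `n_missing`, `mean_raw`, `mean_clean` and `y_complete` are only illustrative.

```r
y <- c(1, 2, 3, NA)

# TRUE counts as 1, so sum() gives the number of missing values
n_missing <- sum(is.na(y))

# Most summary functions return NA unless missing values are removed
mean_raw   <- mean(y)                # NA, because y contains a missing value
mean_clean <- mean(y, na.rm = TRUE)  # mean of the observed values 1, 2, 3

# Keep only the observed values
y_complete <- y[!is.na(y)]
```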

Amelia package

RStudio Console> install.packages("Amelia")

Syntax

var = amelia(data, m=number of imputed datasets, ts='name of the column identifying time', cs='name of the column identifying the cross-section unit')

## Amelia package
library(Amelia)
data(freetrade)
head(freetrade)
str(freetrade)
a.out = amelia(freetrade, m=5, ts='year', cs='country')
hist(a.out$imputations[[3]]$tariff, col='grey', border='white')
save(a.out, file='imputations.RData')
write.amelia(obj=a.out, file.stem='outdata')

Output

> ## Amelia package
> library(Amelia)

> data(freetrade)

> head(freetrade)
  year  country tariff polity      pop   gdp.pc intresmi signed fiveop     usheg
1 1981 SriLanka     NA      6 14988000 461.0236 1.937347      0   12.4 0.2593112
2 1982 SriLanka     NA      5 15189000 473.7634 1.964430      0   12.5 0.2558008
3 1983 SriLanka   41.3      5 15417000 489.2266 1.663936      1   12.3 0.2655022
4 1984 SriLanka     NA      5 15599000 508.1739 2.797462      0   12.3 0.2988009
5 1985 SriLanka   31.0      5 15837000 525.5609 2.259116      0   12.3 0.2952431
6 1986 SriLanka     NA      5 16117000 538.9237 1.832549      0   12.5 0.2886563

> str(freetrade)
'data.frame':	171 obs. of  10 variables:
 $ year    : int  1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 ...
 $ country : chr  "SriLanka" "SriLanka" "SriLanka" "SriLanka" ...
 $ tariff  : num  NA NA 41.3 NA 31 ...
 $ polity  : int  6 5 5 5 5 5 5 5 5 5 ...
 $ pop     : num  14988000 15189000 15417000 15599000 15837000 ...
 $ gdp.pc  : num  461 474 489 508 526 ...
 $ intresmi: num  1.94 1.96 1.66 2.8 2.26 ...
 $ signed  : int  0 0 1 0 0 0 0 1 0 0 ...
 $ fiveop  : num  12.4 12.5 12.3 12.3 12.3 ...
 $ usheg   : num  0.259 0.256 0.266 0.299 0.295 ...

> a.out = amelia(freetrade, m=5, ts='year', cs='country')
-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12 13

-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

-- Imputation 3 --

  1  2  3  4  5  6  7  8  9 10 11 12

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14

-- Imputation 5 --

  1  2  3  4  5  6  7  8  9 10


> hist(a.out$imputations[[3]]$tariff, col='grey', border='white')

> save(a.out, file='imputations.RData')

> write.amelia(obj=a.out, file.stem='outdata')

Comparison before/after missing value handling

missmap(a.out)

freetrade$tariff <- a.out$imputations[[5]]$tariff
missmap(freetrade)
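
Besides the missingness maps, the amount of missing data can be compared numerically. The sketch below defines a hypothetical helper `count_na()` and runs it on a small made-up data frame `toy`, so it does not need Amelia; both names and the data are assumptions for illustration.

```r
# Hypothetical helper: number of NA values in each column of a data frame
count_na <- function(df) colSums(is.na(df))

# Made-up data standing in for a data set with gaps
toy <- data.frame(
  year   = c(1981, 1982, 1983, 1984),
  tariff = c(NA, NA, 41.3, NA),
  polity = c(6, 5, NA, 5)
)

na_per_column <- count_na(toy)
na_per_column
```

Assuming the objects from the script are in the workspace, calling `count_na(freetrade)` before and after the `tariff` column is replaced should show how many gaps were filled in that column, and `sapply(a.out$imputations, function(d) sum(is.na(d)))` should give the total left in each imputed data set. Columns other than `tariff` are not changed by that assignment.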

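3. Outlier detection

Used to decide on a preprocessing method during analysis and to discover rules for an FDS (Fraud Detection System)

a1 : bad data, entered incorrectly by mistake

a2 : bad data, does not fit the purpose of the analysis and should be removed

a3 : outlier, an unintended result that should nevertheless be included in the analysis

b1 : outlier, an intentional anomaly, in most cases fraud

Related algorithms : ESD (Extreme Studentized Deviate), MADM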
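
As a rough illustration of these two ideas, the sketch below flags values that lie far from the mean in standard-deviation units (an ESD-style rule) and values that lie far from the median in MAD units (assuming MADM refers to a median absolute deviation rule). The cutoff of 3, the helper names `esd_flag()` and `mad_flag()`, the seed and the sample data are all assumptions, not taken from the text above.

```r
# ESD-style rule: more than k standard deviations away from the mean
esd_flag <- function(v, k = 3) abs(v - mean(v)) > k * sd(v)

# MAD-style rule: more than k scaled MADs away from the median
# (mad() in base R multiplies by 1.4826 by default)
mad_flag <- function(v, k = 3) abs(v - median(v)) > k * mad(v)

set.seed(1)                      # assumed seed, only for reproducibility
v <- c(rnorm(100), 19, 28, 30)   # 100 normal values plus three injected extremes

v[esd_flag(v)]
v[mad_flag(v)]
```

Because the mean and standard deviation are themselves pulled by extreme values, the two rules need not flag exactly the same points.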

Detection using a box plot

## outlier detecting
x=rnorm(100)
boxplot(x)

x=c(x,19,28,30)
outwith=boxplot(x)
outwith$out

Output

> ## outlier detecting
> x=rnorm(100)

> boxplot(x)

> x=c(x,19,28,30)

> outwith=boxplot(x)

> outwith$out
[1] 19 28 30
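
By default, `boxplot()` reports as outliers the points lying more than 1.5 times the box length beyond the hinges. A minimal sketch that reproduces this rule by hand and compares it with `boxplot.stats()`; the seed and the names `five`, `box_len`, `lower`, `upper` and `manual_out` are assumptions added so the example is reproducible.

```r
set.seed(1)                      # assumed seed, only for reproducibility
x <- c(rnorm(100), 19, 28, 30)

# fivenum() gives min, lower hinge, median, upper hinge, max
five <- fivenum(x)
box_len <- five[4] - five[2]

# Fences for the default coef = 1.5
lower <- five[2] - 1.5 * box_len
upper <- five[4] + 1.5 * box_len

manual_out <- x[x < lower | x > upper]

# Same points as boxplot(x)$out, computed without drawing a plot
identical(manual_out, boxplot.stats(x)$out)
```

Since `rnorm()` can itself produce a few points beyond the fences, the result may contain more than the three appended values, depending on the random draw.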

Outlier detection using the outliers package

RStudio Console> install.packages("outliers")

## outliers package
library(outliers)
set.seed(1234)
y=rnorm(100)
outlier(y)
outlier(y, opposite=TRUE)

dim(y) = c(20,5)
outlier(y)
outlier(y, opposite=TRUE)
boxplot(y)

Output

> ## outliers package
> library(outliers)

> set.seed(1234)

> y=rnorm(100)

> outlier(y)
[1] 2.548991

> outlier(y, opposite=TRUE)
[1] -2.345698

> dim(y) = c(20,5)

> outlier(y)
[1] 2.415835 1.102298 1.647817 2.548991 2.121117

> outlier(y, opposite=TRUE)
[1] -2.345698 -2.180040 -1.806031 -1.390701 -1.372302

> boxplot(y)
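
For a vector, `outlier()` reports the value lying farthest from the mean, and with `opposite=TRUE` the extreme at the other end; for a matrix it works column by column. The base R sketch below imitates that behaviour under these assumptions; `farthest_from_mean()` is a hypothetical helper, not the package's own code.

```r
# Hypothetical helper: the extreme (min or max) lying farthest from the mean;
# with opposite = TRUE, return the extreme at the other end instead
farthest_from_mean <- function(v, opposite = FALSE) {
  max_is_farther <- (max(v) - mean(v)) >= (mean(v) - min(v))
  if (xor(max_is_farther, opposite)) max(v) else min(v)
}

set.seed(1234)
y <- rnorm(100)

far   <- farthest_from_mean(y)
other <- farthest_from_mean(y, opposite = TRUE)

# For a matrix, apply the helper to each column
dim(y) <- c(20, 5)
per_column <- apply(y, 2, farthest_from_mean)
```

Note that such a value is only the most extreme observation in the sample; whether it should be treated as an outlier still has to be judged separately.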

License: Attribution, NonCommercial, NoDerivatives
