Crocus

Random Forest in R

Rule-Based Classifier

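A rule-based classifier assigns a class to a record using rules of the following form.

(Condition) -> y

The condition is built from the attributes, and y is the resulting class value.

The LHS (left-hand side) is the condition and is expressed in terms of attributes.

The RHS (right-hand side) is the conclusion: the value that becomes the class label.

For example:

(Blood Type = Warm) ^ (Lay Eggs = TRUE) -> Birds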

Using the table above, the following classification rules can be written.

R1: (Give Birth = no) ^ (Can Fly = yes) -> Birds

R2: (Give Birth = no) ^ (Live in Water = yes) -> Fishes

R3: (Give Birth = yes) ^ (Blood Type = warm) -> Mammals

R4: (Give Birth = no) ^ (Can Fly = no) -> Reptiles

R5: (Live in Water = sometimes) -> Amphibians
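To make the rule notation concrete, here is a minimal sketch of rules R1 to R5 as R functions. The field names (give_birth, can_fly, live_in_water, blood_type), the two example records, and the "first matching rule wins" ordering are assumptions made for illustration; they are not taken from the original table.

```r
# Each rule pairs a condition (LHS) with a class label (RHS).
# Field names are hypothetical stand-ins for the table's attributes.
rules <- list(
  R1 = list(lhs = function(r) r$give_birth == "no" && r$can_fly == "yes",
            rhs = "Birds"),
  R2 = list(lhs = function(r) r$give_birth == "no" && r$live_in_water == "yes",
            rhs = "Fishes"),
  R3 = list(lhs = function(r) r$give_birth == "yes" && r$blood_type == "warm",
            rhs = "Mammals"),
  R4 = list(lhs = function(r) r$give_birth == "no" && r$can_fly == "no",
            rhs = "Reptiles"),
  R5 = list(lhs = function(r) r$live_in_water == "sometimes",
            rhs = "Amphibians")
)

# Assumption: rules are tried in order and the first match decides the class.
classify <- function(record) {
  for (rule in rules) {
    if (rule$lhs(record)) return(rule$rhs)
  }
  NA_character_  # no rule covers this record
}

# Two made-up records for illustration
hawk <- list(give_birth = "no",  can_fly = "yes",
             live_in_water = "no", blood_type = "warm")
bear <- list(give_birth = "yes", can_fly = "no",
             live_in_water = "no", blood_type = "warm")

classify(hawk)  # "Birds"   (R1 fires)
classify(bear)  # "Mammals" (R3 fires)
```

Note that a record can satisfy more than one rule (for example R4 and R5), so some conflict-resolution policy such as the ordering assumed here is needed.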

Rule Coverage and Accuracy

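The coverage of a rule is the ratio of the number of records that satisfy the rule's condition to the total number of records.

The accuracy of a rule is the ratio of the number of records that satisfy both the condition and the conclusion to the number of records that satisfy the condition.

For example, take the rule (Status = Single) -> No on a data set of 10 records.

Coverage: 4 of the 10 records are Single, so the coverage is 40%.

Accuracy: 2 of the 4 Single records have class No, so the accuracy is 50%.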
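The two ratios can be checked with a few lines of R. The data frame below is made up for illustration: only the counts stated above (10 records, 4 of them Single, 2 of those with class No) are taken from the text, and the remaining rows are placeholders.

```r
# Illustrative data: 10 records, 4 Single, 2 of the Single records are "No"
records <- data.frame(
  Status = c("Single", "Married", "Single", "Married", "Divorced",
             "Married", "Divorced", "Single", "Married", "Single"),
  Class  = c("No", "No", "No", "No", "Yes",
             "No", "No", "Yes", "No", "Yes")
)

# Rule: (Status = Single) -> No
covered <- records$Status == "Single"        # LHS holds
correct <- covered & records$Class == "No"   # LHS and RHS both hold

coverage <- sum(covered) / nrow(records)     # 4 / 10 = 0.4
accuracy <- sum(correct) / sum(covered)      # 2 / 4  = 0.5
```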

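> library(rpart)
> library(rattle)  # provides asRules()
> 
> # Training data: columns 1-60 are features, column 61 is the class label
> train <- read.csv("http://sites.google.com/site/stats202/data/sonar_train.csv", header = FALSE)
> y <- as.factor(train[, 61])
> x <- train[, 1:60]
> fit <- rpart(y~., x)
> 
> # Test data, same layout (x and y now refer to the test set)
> test <- read.csv('http://sites.google.com/site/stats202/data/sonar_test.csv', header = FALSE)
> y <- as.factor(test[, 61])
> x <- test[, 1:60]
> 
> # Prune at the cp value with the lowest cross-validated error (xerror)
> p_fit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])
> 
> # Predict test-set classes and compare them with the true labels
> prediction <- predict(p_fit, x, type = "class")
> actual <- y
> 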
> conf_matrix <- table(prediction, actual)
> conf_matrix
          actual
prediction -1  1
        -1 39 16
        1   6 17
> asRules(p_fit)
 
 Rule number: 3 [y=1 cover=51 (39%) prob=0.84]
   V11< 0.1709
 
 Rule number: 2 [y=-1 cover=79 (61%) prob=0.27]
   V11>=0.1709
 
> p_fit
n= 130 
 
node), split, n, loss, yval, (yprob)
      * denotes terminal node
 
1) root 130 64 -1 (0.5076923 0.4923077)  
  2) V11>=0.17095 79 21 -1 (0.7341772 0.2658228) *
  3) V11< 0.17095 51  8 1 (0.1568627 0.8431373) *

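With the code above, asRules shows the rules currently encoded by a given decision tree. Each terminal node becomes one rule; the pruned tree here has a single split on V11, so it yields two rules.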
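As a quick check of how well the pruned tree does on the test set, the confusion matrix printed above can be turned into an accuracy figure. The sketch below re-enters the four counts by hand so that it runs without downloading the data; in the session above, conf_matrix itself could be used directly.

```r
# Counts copied from the confusion matrix printed above
# (rows = prediction, columns = actual)
conf_matrix <- matrix(c(39, 6, 16, 17), nrow = 2,
                      dimnames = list(prediction = c("-1", "1"),
                                      actual     = c("-1", "1")))

# Correct predictions sit on the diagonal
accuracy   <- sum(diag(conf_matrix)) / sum(conf_matrix)  # 56 / 78
error_rate <- 1 - accuracy                               # 22 / 78
round(accuracy, 3)  # 0.718
```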

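Random Forest

Random forest is an algorithm designed to address the weaknesses of a single decision tree.

It combines many decision trees into one model,

and it builds those trees by applying randomness to both the variables and the observations.

To see how a random forest is built, let's first look at bagging.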

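How Bagging Works

1. Generate B bootstrap samples.

2. Build B trees, one per bootstrap sample, using randomly chosen variables.

3. Train each tree as a classifier on its own bootstrap sample.

4. Choose the final prediction by majority vote across the trees.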

What Is a Bootstrap?

It is the process of drawing records from the given training set, with replacement, to create a data set of the same size as the original.
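As a small illustration of this definition, the sketch below draws one bootstrap sample of row indices with base R's sample(). The sample size n = 10 and the seed are arbitrary choices for the example.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
n <- 10       # pretend the training set has 10 records

# Draw n indices with replacement: some records repeat, some are left out
idx <- sample(n, n, replace = TRUE)

# Records that were never drawn are "out of bag" for this sample
oob <- setdiff(seq_len(n), idx)

idx
oob
```

Because sampling is done with replacement, a bootstrap sample typically contains duplicates, and the records it misses can serve as a built-in validation set for the tree trained on it.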

In the end, a Random Forest is produced by bagging trees, with randomness also applied to the variables each tree may use.
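The bagging steps can be sketched by hand with rpart trees. This is only an illustration under stated assumptions: B = 25, the iris data, and the seed are arbitrary choices, and the sketch implements plain bagging only. It does not restrict the variables tried at each split, which is what the randomForest package adds on top.

```r
library(rpart)

set.seed(1)            # arbitrary seed for reproducibility
data(iris)
B <- 25                # number of bootstrap samples / trees (assumed)
n <- nrow(iris)

# Steps 1-3: draw B bootstrap samples and fit one tree on each
trees <- lapply(seq_len(B), function(b) {
  idx <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ])
})

# Each column holds one tree's predicted class for every row of iris
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, iris, type = "class"))
})

# Step 4: majority vote across the B trees for each row
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

# Agreement with the true labels (measured on the training data itself)
mean(bagged == iris$Species)
```

The agreement printed at the end is optimistic because it is measured on the same data the trees were trained on; the out-of-bag estimate is the usual way around that.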

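> library(randomForest)
> data(iris)
> set.seed(1)  # make the random forest reproducible
> dat <- iris
> 
> # Recode Species as a two-class factor: 'virginica' vs. 'other'
> dat$Species <- factor(ifelse(dat$Species == 'virginica', 'virginica', 'other'))
> 
> # 25 trees, minimum terminal node size 5, variable importance recorded
> model.rf <- randomForest(Species~., dat, ntree = 25, importance=TRUE, nodesize = 5)
> 
> model.rf              # print the fitted model summary
> 
> varImpPlot(model.rf)  # plot variable importance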

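1. Load the randomForest package (with library() or require()).

2. Copy iris into dat.

3. Recode dat$Species as a factor: rows where dat$Species is 'virginica' stay 'virginica', and all other rows become 'other'.

4. Fit a random forest that predicts Species from all the other columns of dat, using 25 trees (ntree = 25). importance = TRUE asks for variable importance to be assessed, and nodesize = 5 sets the minimum size of terminal nodes.

Printing model.rf shows the following result.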

Call:
 randomForest(formula = Species ~ ., data = dat, ntree = 25, importance = TRUE, nodesize = 5)
               Type of random forest: classification
                     Number of trees: 25
No. of variables tried at each split: 2

        OOB estimate of  error rate: 5.33%
Confusion matrix:
          other virginica class.error
other        96         4        0.04
virginica     4        46        0.08

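OOB (Out-of-Bag)

OOB records are the observations that were not drawn during bootstrap sampling for a given tree.

They are used to estimate the misclassification rate that would be seen on evaluation data, and to estimate variable importance.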
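The OOB error rate reported above follows directly from the OOB confusion matrix. The sketch below re-enters the printed counts by hand so that it runs without the randomForest package; with the fitted model available, model.rf$confusion holds the same table and importance(model.rf) returns the importance scores.

```r
# Counts copied from the confusion matrix printed above
# (rows = true class, columns = OOB prediction)
conf <- matrix(c(96, 4,
                  4, 46), nrow = 2, byrow = TRUE,
               dimnames = list(c("other", "virginica"),
                               c("other", "virginica")))

# Overall OOB error: misclassified records over all records
oob_error <- 1 - sum(diag(conf)) / sum(conf)   # 8 / 150
round(100 * oob_error, 2)                      # 5.33

# Per-class error, matching the class.error column
class_error <- 1 - diag(conf) / rowSums(conf)  # 0.04 and 0.08
class_error
```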
