4 minute read

Cross Validation

1) 목적

  • Hyperparameter 튜닝

2) 데이터 불러오기

from sklearn import datasets
raw_wine = datasets.load_wine()

3) feature, target 데이터 지정

X = raw_wine.data
y = raw_wine.target

4) train / test 데이터 분할

from sklearn.model_selection import train_test_split
X_tn, X_te, y_tn, y_te = train_test_split(X,y,random_state=0)

5) 데이터 표준화

from sklearn.preprocessing import StandardScaler
std_scale = StandardScaler()
std_scale.fit(X_tn)
X_tn_std = std_scale.transform(X_tn)
X_te_std = std_scale.transform(X_te)

6) Grid Search

from sklearn import svm 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

param_grid = {'kernel' : ('linear', 'rbf'), 
             'C' : [0.5, 1, 10, 100]}
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
svc = svm.SVC(random_state=0)
grid_cv = GridSearchCV(svc, param_grid, cv=kfold, scoring='accuracy')
grid_cv.fit(X_tn_std, y_tn)
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True),
             estimator=SVC(random_state=0),
             param_grid={'C': [0.5, 1, 10, 100], 'kernel': ('linear', 'rbf')},
             scoring='accuracy')
  • 2: stratified k-fold cross validation은 일반적인 k-fold cross validation과는 달리 라벨링의 비율을 유지하면서 데이터를 추출하는 방법
  • 3: GridSearch를 위해 GridSearchCV 함수를 불러옴
  • 4: grid search를 위해 parameter를 정한다. SVM에서 커널은 linear 또는 rbf로 설정. C 값은 0.5, 1, 10, 100 으로 설정
  • 5: n_splits=5는 트레이닝 데이터를 5개의 split으로 나눈다라는 것, shuffle은 데이터를 섞는다는 의미
  • 6: 학습시킬 모형을 SVM을 기본형으로 다룬다.
  • 7: 학습시킬 모형 svc와 파라미터 param_grid, 크로스 벨리데이션 방법 kfold, 모형 평가 방법을 설정
  • 8: 표준화된 피처 데이터와 트레이닝 타깃 데이터를 넣고 적합시킨다.

7) Grid Search 결과 확인

grid_cv.cv_results_
{'mean_fit_time': array([0.00101571, 0.00106258, 0.00085115, 0.00109282, 0.0007503 ,
        0.00087047, 0.00070262, 0.00082722]),
 'std_fit_time': array([1.62061749e-04, 2.20231669e-05, 4.01492923e-05, 1.69207370e-04,
        2.22167229e-05, 1.88643192e-05, 1.72882242e-05, 1.32990559e-05]),
 'mean_score_time': array([0.00031829, 0.00031857, 0.00028024, 0.0003047 , 0.00024934,
        0.00026579, 0.00023694, 0.00025845]),
 'std_score_time': array([2.49586790e-05, 3.33922222e-06, 5.75493263e-06, 1.77508997e-05,
        3.91297116e-06, 7.18174384e-06, 5.86218878e-06, 1.09121764e-05]),
 'param_C': masked_array(data=[0.5, 0.5, 1, 1, 10, 10, 100, 100],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_kernel': masked_array(data=['linear', 'rbf', 'linear', 'rbf', 'linear', 'rbf',
                    'linear', 'rbf'],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 0.5, 'kernel': 'linear'},
  {'C': 0.5, 'kernel': 'rbf'},
  {'C': 1, 'kernel': 'linear'},
  {'C': 1, 'kernel': 'rbf'},
  {'C': 10, 'kernel': 'linear'},
  {'C': 10, 'kernel': 'rbf'},
  {'C': 100, 'kernel': 'linear'},
  {'C': 100, 'kernel': 'rbf'}],
 'split0_test_score': array([0.88888889, 0.96296296, 0.88888889, 0.92592593, 0.88888889,
        0.92592593, 0.88888889, 0.92592593]),
 'split1_test_score': array([0.96296296, 1.        , 0.96296296, 0.96296296, 0.96296296,
        0.96296296, 0.96296296, 0.96296296]),
 'split2_test_score': array([0.92592593, 0.96296296, 0.92592593, 0.96296296, 0.92592593,
        0.96296296, 0.92592593, 0.96296296]),
 'split3_test_score': array([1.        , 0.96153846, 1.        , 0.96153846, 1.        ,
        0.96153846, 1.        , 0.96153846]),
 'split4_test_score': array([0.84615385, 1.        , 0.84615385, 1.        , 0.84615385,
        1.        , 0.84615385, 1.        ]),
 'mean_test_score': array([0.92478632, 0.97749288, 0.92478632, 0.96267806, 0.92478632,
        0.96267806, 0.92478632, 0.96267806]),
 'std_test_score': array([0.05401397, 0.01838435, 0.05401397, 0.02343121, 0.05401397,
        0.02343121, 0.05401397, 0.02343121]),
 'rank_test_score': array([5, 1, 5, 2, 5, 2, 5, 2], dtype=int32)}
  • 결과적으로는, 두번째 진행된 fit, 학습 정확도가 제일 높기에 두번째 학습 때의 hyperparameter가 사용될 것

8) Grid Search 결과 시각적 확인(데이터 프레임)

import numpy as np
import pandas as pd
np.transpose(pd.DataFrame(grid_cv.cv_results_))
0 1 2 3 4 5 6 7
mean_fit_time 0.00101571 0.00106258 0.000851154 0.00109282 0.000750303 0.000870466 0.00070262 0.000827217
std_fit_time 0.000162062 2.20232e-05 4.01493e-05 0.000169207 2.22167e-05 1.88643e-05 1.72882e-05 1.32991e-05
mean_score_time 0.000318289 0.000318575 0.000280237 0.000304699 0.000249338 0.000265789 0.00023694 0.000258446
std_score_time 2.49587e-05 3.33922e-06 5.75493e-06 1.77509e-05 3.91297e-06 7.18174e-06 5.86219e-06 1.09122e-05
param_C 0.5 0.5 1 1 10 10 100 100
param_kernel linear rbf linear rbf linear rbf linear rbf
params {'C': 0.5, 'kernel': 'linear'} {'C': 0.5, 'kernel': 'rbf'} {'C': 1, 'kernel': 'linear'} {'C': 1, 'kernel': 'rbf'} {'C': 10, 'kernel': 'linear'} {'C': 10, 'kernel': 'rbf'} {'C': 100, 'kernel': 'linear'} {'C': 100, 'kernel': 'rbf'}
split0_test_score 0.888889 0.962963 0.888889 0.925926 0.888889 0.925926 0.888889 0.925926
split1_test_score 0.962963 1 0.962963 0.962963 0.962963 0.962963 0.962963 0.962963
split2_test_score 0.925926 0.962963 0.925926 0.962963 0.925926 0.962963 0.925926 0.962963
split3_test_score 1 0.961538 1 0.961538 1 0.961538 1 0.961538
split4_test_score 0.846154 1 0.846154 1 0.846154 1 0.846154 1
mean_test_score 0.924786 0.977493 0.924786 0.962678 0.924786 0.962678 0.924786 0.962678
std_test_score 0.054014 0.0183843 0.054014 0.0234312 0.054014 0.0234312 0.054014 0.0234312
rank_test_score 5 1 5 2 5 2 5 2

9) Best Score & Hyperparameter

grid_cv.best_score_
0.9774928774928775
grid_cv.best_params_
{'C': 0.5, 'kernel': 'rbf'}

10) 최종 모형

# grid search, cross-validation 후 가장 좋은 hyperparameter로 clf 설정
clf = grid_cv.best_estimator_
print(clf)
SVC(C=0.5, random_state=0)

11) Cross-validation 스코어 확인(1)

from sklearn.model_selection import cross_validate
metrics = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
cv_scores = cross_validate(clf,X_tn_std,y_tn,cv=kfold,scoring=metrics)
cv_scores
{'fit_time': array([0.00132918, 0.00131297, 0.00119996, 0.00112677, 0.00101924]),
 'score_time': array([0.002707  , 0.00217199, 0.00220799, 0.00206614, 0.00182986]),
 'test_accuracy': array([0.96296296, 1.        , 0.96296296, 0.96153846, 1.        ]),
 'test_precision_macro': array([0.96296296, 1.        , 0.96969697, 0.96969697, 1.        ]),
 'test_recall_macro': array([0.96666667, 1.        , 0.96296296, 0.95833333, 1.        ]),
 'test_f1_macro': array([0.9628483 , 1.        , 0.96451914, 0.96190476, 1.        ])}

12) Cross-validation 스코어 확인(2)

from sklearn.model_selection import cross_val_score
cv_score = cross_val_score(clf, X_tn_std, y_tn, cv=kfold, scoring='accuracy')
# split 별 score..!
print(cv_score)
[0.96296296 1.         0.96296296 0.96153846 1.        ]

13) 예측

pred_svm = clf.predict(X_te_std)
print(pred_svm)

[0 2 1 0 1 1 0 2 1 1 2 2 0 1 2 1 0 0 1 0 1 0 0 1 1 1 1 1 1 2 0 0 1 0 0 0 2
 1 1 2 0 0 1 1 1]

14) Grid Search로 고른 Hyperparameter를 적용한 모델 평가

1. accuracy

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_te, pred_svm)
print(accuracy)
1.0

2. confusion matrix

from sklearn.metrics import confusion_matrix
conf = confusion_matrix(y_te, pred_svm)
print(conf)
[[16  0  0]
 [ 0 21  0]
 [ 0  0  8]]

3. classification report

from sklearn.metrics import classification_report
cls_rpt = classification_report(y_te, pred_svm)
print(cls_rpt)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      1.00      1.00        21
           2       1.00      1.00      1.00         8

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45