2. Cross-Validation

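Cross-validation is a way to guard against overfitting (the model is tuned so closely to the training data that it performs poorly on other data sets). Before the final test, the model is trained and validated in turn on several train/validation splits and optimized based on those results. It can be seen as a step for securing the predictive model's ability to generalize.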

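Instead of going straight from training → test, the flow is training → validation → test.
Hyperparameter tuning (optimization) is done with the validation results, before the actual test (prediction).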
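
The training → validation → test flow above can be sketched roughly as follows. This is only an illustration, not code from the original post: the iris data set, the 60/20/20 split, the random_state values, and the candidate max_depth values are all assumptions picked for the example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% as the final test set, then split the rest 75/25
# into training and validation (60% / 20% of the full data).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Tune the hyperparameter using the validation set only.
val_scores = {}
for depth in [1, 2, 3]:  # assumed candidate values
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_scores[depth] = accuracy_score(y_val, model.predict(X_val))

best_depth = max(val_scores, key=val_scores.get)

# The test set is used once, after tuning is finished.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))
print('validation scores:', val_scores)
print('best max_depth:', best_depth, 'test accuracy:', test_accuracy)
```

A single validation split like this depends heavily on which rows happen to land in it, which is the motivation for rotating the validation set as in K-Fold.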

● K-Fold Cross-Validation

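The data set is split into k folds. Each fold takes one turn as the validation set while the remaining k-1 folds are used for training, so the model is evaluated k times. The mean of the k scores is used as the final evaluation score.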
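
To see the rotation concretely, here is a small sketch on a toy array. The ten-element array and the variable names are hypothetical; only KFold and its split() method come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10)  # ten hypothetical samples
kfold = KFold(n_splits=5)

# Count how many times each sample lands in a validation fold.
validation_counts = np.zeros(len(toy), dtype=int)
for fold, (train_idx, val_idx) in enumerate(kfold.split(toy), start=1):
    validation_counts[val_idx] += 1
    print(f'fold {fold}: train={train_idx}, validation={val_idx}')

print('times each sample was validated:', validation_counts)
```

Every sample is validated exactly once and used for training in the other k-1 rounds.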

The implementation is shown below. (With the cross-validation APIs described later, the for loop is not needed.)

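import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Setup assumed by this example (not shown in the original snippet):
# iris features/labels and a decision tree; random_state is an arbitrary choice
iris = load_iris()
iris_data, iris_label = iris.data, iris.target
dt_clf = DecisionTreeClassifier(random_state=156)

kfold = KFold(n_splits=5)  # KFold object with K=5
cv_accuracy = []  # list holding the accuracy of each fold

n_iter = 0

# KFold.split() yields arrays of row indices for the training and validation sets
for train_index, test_index in kfold.split(iris_data):
    # Build the training/validation data from the indices returned by KFold
    x_train, x_test = iris_data[train_index], iris_data[test_index]
    y_train, y_test = iris_label[train_index], iris_label[test_index]
    # Train and predict with the DecisionTreeClassifier
    dt_clf.fit(x_train, y_train)
    pred = dt_clf.predict(x_test)
    n_iter += 1
    # Measure accuracy on every iteration
    accuracy = np.round(accuracy_score(y_test, pred), 4)
    train_size = x_train.shape[0]
    test_size = x_test.shape[0]
    print('{0} CV accuracy: {1}, training size: {2}, validation size: {3}'
          .format(n_iter, accuracy, train_size, test_size))
    # print('{0} validation set indices: {1}'.format(n_iter, test_index))
    cv_accuracy.append(accuracy)

# Average the accuracy of the individual iterations
print('>> Mean validation accuracy:', np.round(np.mean(cv_accuracy), 2))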

● Stratified K-Fold Cross-Validation

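A K-Fold variant for label sets with an imbalanced distribution. (For example, if lottery data has a million losing tickets and only 10 winners, plain K-Fold can easily leave the winners concentrated in one fold.)
It takes the label distribution of the original data into account when assigning rows to training/validation, so that a particular label does not pile up on one side.
Usage is similar to K-Fold, but the split() method needs the labels as well as the features (it has to know the label distribution).
For typical classification models Stratified K-Fold is the right choice. For regression the label is a continuous value, so stratification cannot be applied directly.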

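# Example: check that stratified splits keep the label distribution even in training/validation
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

# Setup assumed by this example: iris features in a DataFrame with a 'label' column
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['label'] = iris.target

skf = StratifiedKFold(n_splits=3)
n_iter = 0

# split() needs the labels as its second argument to stratify on them
for train_index, test_index in skf.split(iris_df, iris_df['label']):
    n_iter += 1
    label_train = iris_df['label'].iloc[train_index]
    label_test = iris_df['label'].iloc[test_index]
    print('## Cross-validation: {0}'.format(n_iter))
    print('Training label distribution:\n', label_train.value_counts())
    print('Validation label distribution:\n', label_test.value_counts())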
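
The effect on the score itself can be checked with a small comparison. The sketch below is an addition, not from the original post; fold_accuracies is a hypothetical helper, and random_state=156 is an arbitrary choice. It relies on the fact that load_iris returns the rows sorted by label (50 rows per class), so an unshuffled 3-fold KFold puts exactly one class in each validation fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # rows are ordered by class: 0, 1, 2


def fold_accuracies(splitter):
    """Hypothetical helper: accuracy of a decision tree on each validation fold."""
    scores = []
    for train_idx, val_idx in splitter.split(X, y):
        clf = DecisionTreeClassifier(random_state=156)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[val_idx], clf.predict(X[val_idx])))
    return scores


plain_scores = fold_accuracies(KFold(n_splits=3))
strat_scores = fold_accuracies(StratifiedKFold(n_splits=3))

print('KFold           :', plain_scores, 'mean =', np.mean(plain_scores))
print('StratifiedKFold :', np.round(strat_scores, 4), 'mean =', np.mean(strat_scores))
```

With plain KFold each validation fold holds a class the model never saw during training, so every fold scores 0. StratifiedKFold keeps all three classes in both training and validation. Passing shuffle=True to KFold is another way to avoid the ordering problem, but it does not guarantee the class proportions.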

● APIs for Cross-Validation and Hyperparameter Tuning (Optimization)

1. cross_val_score()

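An API that runs the K-Fold for loop shown above in a single call. When cv is given as an integer, it applies Stratified K-Fold for classifiers and plain K-Fold for regressors on its own.
Parameters: the estimator (model), the feature/label data, the scoring metric (accuracy, precision, recall, etc.), and the number of folds.
The scores are returned as an ndarray, one value per fold.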

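import numpy as np
from sklearn.model_selection import cross_val_score

# dt_clf, iris_data and iris_label are the same objects used in the K-Fold example above
# Main parameters: estimator, features, labels, scoring metric, number of folds (cv)
scores = cross_val_score(dt_clf, iris_data, iris_label, scoring='accuracy', cv=3)
print('Accuracy per fold:', np.round(scores, 4))  # result in the original run: [0.98 0.92 0.98]
print('Mean validation accuracy:', np.mean(scores))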
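
The automatic choice of splitter can be checked by passing the splitter explicitly. This is a sketch under assumptions: the iris data and random_state=156 are chosen for the example. It also uses the 'precision_macro' scorer, since on a multiclass target like iris the precision and recall metrics need an averaging variant of this kind.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=156)  # fixed seed so both runs build the same trees

# cv=3 with a classifier: scikit-learn picks StratifiedKFold internally.
scores_int = cross_val_score(clf, X, y, scoring='accuracy', cv=3)
# The same splitter passed explicitly.
scores_explicit = cross_val_score(clf, X, y, scoring='accuracy',
                                  cv=StratifiedKFold(n_splits=3))
same_scores = np.allclose(scores_int, scores_explicit)
print('same scores:', same_scores)

# A different metric only requires a different scoring string.
precision_scores = cross_val_score(clf, X, y, scoring='precision_macro', cv=3)
print('macro precision per fold:', np.round(precision_scores, 4))
```

Passing a splitter object is also how to get shuffled folds, for example StratifiedKFold(n_splits=3, shuffle=True, random_state=0).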

2. GridSearchCV()

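An API that performs cross-validation and hyperparameter tuning (optimization) in one go.
The number of evaluations is (number of hyperparameter combinations) × (number of CV folds).
Main parameters: 1) estimator 2) param_grid: a dictionary mapping each parameter name (key) to a list of values to try 3) scoring: the performance metric 4) cv: the number of folds 5) refit: defaults to True; after the best parameters are found, the model is retrained with them
After the GridSearchCV object is trained with the fit method, the cross-validation results are stored in cv_results_, the best parameter combination in best_params_, and the best score in best_score_.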

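import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumed setup: split the iris data used above into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    iris_data, iris_label, test_size=0.2, random_state=121)

# Define the parameters as a dictionary
parameters = {'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]}

# Calling fit(training data) on the GridSearchCV object trains/validates each parameter combination in turn
grid_dt = GridSearchCV(dt_clf, param_grid=parameters, cv=3, refit=True)
# After fit, the results are stored in cv_results_, where the per-combination scores can be inspected
grid_dt.fit(x_train, y_train)
scores_df = pd.DataFrame(grid_dt.cv_results_)  # convert the validation results to a DataFrame
print(scores_df[['params', 'split0_test_score', 'split1_test_score',
                 'split2_test_score', 'mean_test_score', 'rank_test_score']])

print(grid_dt.best_params_)  # best parameter combination
print(grid_dt.best_score_)  # best mean validation score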
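
As a rough check of the "combinations × folds" count and of what refit=True provides, the sketch below is self-contained and makes its own assumptions: the iris data, the 80/20 split, and the random_state values are chosen only for illustration. ParameterGrid is the scikit-learn helper that enumerates a param_grid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=121)

parameters = {'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]}
cv = 3

# 3 values of max_depth x 2 values of min_samples_split = 6 combinations
n_candidates = len(ParameterGrid(parameters))
print('combinations:', n_candidates, '-> cross-validation fits:', n_candidates * cv)

grid = GridSearchCV(DecisionTreeClassifier(random_state=156),
                    param_grid=parameters, cv=cv, refit=True)
grid.fit(X_train, y_train)

# With refit=True, best_estimator_ has already been retrained on all of
# X_train using best_params_, so it can be scored on the held-out test set.
test_accuracy = grid.best_estimator_.score(X_test, y_test)
print('best params:', grid.best_params_)
print('test accuracy:', test_accuracy)
```

The refit adds one more training run on top of the 18 cross-validation fits. Because the grid grows multiplicatively with each added parameter, the search cost is worth estimating before launching it.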
