[Machine Learning Techniques] Key Differences Between Decision Tree, AdaBoost, Gradient Boosting, XGBoost, and LightGBM

뉴욕킴 2024. 5. 5. 18:33

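Decision Tree, AdaBoost, Gradient Boosting, XGBoost, and LightGBM are all tree-based machine learning methods. A Decision Tree is a single model, while the other four are boosting ensembles built from many trees, and they differ in how those trees are trained and combined.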

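1. Decision Tree:

A single-tree model that makes decisions by repeatedly splitting the data on feature values.
At each node it looks for the best split using a criterion such as information gain or Gini impurity for classification, or squared error for regression.
It is easy to interpret and explain, but it tends to overfit unless its depth is limited or it is pruned.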
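To make the split criterion concrete, here is a minimal sketch in plain Python of how a tree might score candidate thresholds on one feature with Gini impurity or entropy. The toy data and the helper names (`gini`, `entropy`, `split_score`, `best_threshold`) are illustrative assumptions, not scikit-learn's implementation. The regressor used in this post's code defaults to a squared-error criterion instead, but the search over thresholds follows the same pattern.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits; its decrease after a split is the information gain."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_score(values, labels, threshold, impurity):
    """Impurity decrease obtained by splitting the node at `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(labels) - weighted


def best_threshold(values, labels, impurity=gini):
    """Try the midpoints between consecutive distinct values; return (threshold, score)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    scored = [(t, split_score(values, labels, t, impurity)) for t in candidates]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy data: one numeric feature and a binary label.
ages = [22, 25, 30, 35, 40, 45]
bought = ["no", "no", "no", "yes", "yes", "yes"]

threshold, gini_decrease = best_threshold(ages, bought)
_, information_gain = best_threshold(ages, bought, impurity=entropy)
print(threshold, gini_decrease, information_gain)
```

On this toy set the split at 32.5 separates the two classes perfectly, so both criteria drop to zero impurity in the children.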

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# X_train, y_train, kf, scoring, random_state, and models are assumed to be defined earlier.
model = DecisionTreeRegressor(random_state=random_state)  # initialize the decision tree regressor

# Candidate values to search for max_depth, min_samples_split, and ccp_alpha
param_grid = {
    "max_depth": [5, 10, 20],
    "min_samples_split": [2, 10, 20],
    "ccp_alpha": [0.0, 0.01],
}

# Set up the grid search; refit=True retrains the best configuration on the full training set
grid_search = GridSearchCV(model, param_grid, cv=kf, scoring=scoring, refit=True, n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best parameters: ", grid_search.best_params_)  # best hyperparameter combination
print("Best CV score: {:.6f}".format(grid_search.best_score_))  # its mean cross-validation score

models["Decision Tree"] = grid_search.best_estimator_  # store the refitted best model
# Best CV score: -0.414531 (higher is better; the value is negative, presumably because `scoring` is a negated error metric)
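As a rough illustration of what the grid search above has to evaluate, the following sketch expands the same `param_grid` into every combination using only the standard library. The helper `expand_grid` is hypothetical, and the fold count of 5 is an assumption, since the definition of `kf` is not shown in this post.

```python
from itertools import product

param_grid = {
    "max_depth": [5, 10, 20],
    "min_samples_split": [2, 10, 20],
    "ccp_alpha": [0.0, 0.01],
}


def expand_grid(grid):
    """Return one dict per combination of the listed hyperparameter values."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)

n_folds = 5  # assumption: the number of folds in `kf` is not given in the post
# Every candidate is fitted once per fold, plus one final refit because refit=True.
n_fits = len(candidates) * n_folds + 1

print(len(candidates), "candidates,", n_fits, "model fits")
```

The number of fits grows multiplicatively with each added hyperparameter, which is why the grids in this post stay small.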

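2. AdaBoost (Adaptive Boosting):

An ensemble method that combines many weak learners into a single strong learner.
The learners are trained sequentially: samples that the previous learner predicted poorly receive larger weights, so the next learner concentrates on them.
Focusing on misclassified samples usually improves accuracy, although it also makes the method sensitive to noisy labels and outliers.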
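The reweighting loop described above can be sketched with decision stumps on a one-dimensional toy problem. This is the classic binary-classification form of AdaBoost with labels +1 and -1, written in plain Python for illustration; the regression variant used in this post's code differs in its details. The data and function names are assumptions made for the example.

```python
import math


def stump_predict(x, threshold, polarity):
    """Decision stump: predicts `polarity` at or below the threshold, `-polarity` above it."""
    return polarity if x <= threshold else -polarity


def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the stump with the lowest weighted error."""
    distinct = sorted(set(xs))
    thresholds = [distinct[0] - 1] + [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost(xs, ys, n_rounds):
    """Train `n_rounds` stumps; return the ensemble and the final sample weights."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, t, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # more accurate stumps get a larger vote
        ensemble.append((alpha, t, polarity))
        # Misclassified samples (y * h(x) == -1) are scaled up, correct ones scaled down.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble, weights


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


def accuracy(ensemble, xs, ys):
    return sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data that no single threshold can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

one_round, weights_after_one = adaboost(xs, ys, n_rounds=1)
three_rounds, _ = adaboost(xs, ys, n_rounds=3)
print(accuracy(one_round, xs, ys), accuracy(three_rounds, xs, ys))
```

On this toy set a single stump misclassifies three points, and those three carry the largest weights going into the second round. After three rounds the weighted vote classifies every training point correctly.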

from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# AdaBoost is an ensemble method that combines several weak learners into a strong learner

# n_estimators sets the number of weak learners to use
# loss selects the loss function used when updating the sample weights
# random_state fixes the random seed so that results are reproducible
model = AdaBoostRegressor(n_estimators=50,
                          loss="linear",
                          random_state=random_state)

# Candidate values to search for the estimator and learning_rate hyperparameters
# (the `estimator` parameter was named `base_estimator` before scikit-learn 1.2)
param_grid = {
    "estimator": [DecisionTreeRegressor(max_depth=3), DecisionTreeRegressor(max_depth=6)],
    "learning_rate": [0.1, 1.0],
}

# Train the model with a grid search
grid_search = GridSearchCV(model, param_grid, cv=kf, scoring=scoring, refit=True, n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best parameters: ", grid_search.best_params_)
print("Best CV score: {:.6f}".format(grid_search.best_score_))

models["AdaBoost"] = grid_search.best_estimator_

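3. Gradient Boosting:

Trees are built sequentially, each new tree being fit to the prediction errors of the ensemble so far and then added to it.
Repeating this step lets every new tree correct part of the error left by the previous ones.
Training amounts to gradient descent on the loss function: each tree approximates the negative gradient of the loss with respect to the current predictions.
It usually achieves strong predictive performance; overfitting is kept in check with a small learning rate, shallow trees, and subsampling.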
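The residual-fitting loop can be sketched in a few lines of plain Python. This is a minimal illustration for squared-error loss with depth-1 regression trees (stumps) on made-up one-dimensional data; the function names are assumptions and the code is not how scikit-learn implements `GradientBoostingRegressor`.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: returns (threshold, left_value, right_value)."""
    distinct = sorted(set(xs))
    best = None
    for a, b in zip(distinct, distinct[1:]):
        t = (a + b) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_value, right_value)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def gradient_boost(xs, ys, n_trees, learning_rate):
    """Return the base prediction, the fitted stumps, and the training MSE after each tree."""
    base = sum(ys) / len(ys)  # start from the mean of the targets
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_trees):
        # For the loss 0.5 * (y - p)^2 the negative gradient is simply the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history


def predict(base, stumps, learning_rate, x):
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Hypothetical toy data with three rough plateaus.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

base, stumps, history = gradient_boost(xs, ys, n_trees=20, learning_rate=0.5)
print(history[0], history[-1])
```

The training error shrinks with every added stump as long as the learning rate is positive. With `learning_rate=0.0` the predictions never move away from the initial mean, so a rate of zero is not a useful value to search over.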

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Grid search (GridSearchCV) over a gradient boosting regressor (GradientBoostingRegressor)

# n_estimators sets the number of trees; loss selects the loss function to minimize
# subsample is the fraction of the training data used to fit each tree
# random_state fixes the random seed so that results are reproducible
model = GradientBoostingRegressor(n_estimators=50,
                                  loss="squared_error",
                                  subsample=1.0,
                                  random_state=random_state)

# Candidate values to search for max_depth and learning_rate
# max_depth limits the depth of each tree; learning_rate scales each tree's contribution
# learning_rate must be positive: at 0.0 no tree would change the prediction
param_grid = {
    "max_depth": [3, 6],
    "learning_rate": [0.01, 0.1],
}

grid_search = GridSearchCV(model, param_grid, cv=kf, scoring=scoring, refit=True, n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best parameters: ", grid_search.best_params_)
print("Best CV score: {:.6f}".format(grid_search.best_score_))

models["Gradient Boosting"] = grid_search.best_estimator_

4. XGBoost (Extreme Gradient Boosting):

An extension of Gradient Boosting that supports distributed training, which makes learning on large datasets efficient.
It limits overfitting through regularization and tree pruning.
With early stopping on a validation set, it can choose a suitable number of trees automatically.
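To show how regularization and pruning enter the tree-building step, here is a small sketch of the regularized leaf weight and split gain from the XGBoost formulation, restricted to the L2 penalty (`reg_lambda`) and the minimum split loss (`gamma`). The gradient and Hessian sums are made-up numbers, the function names are illustrative, and the L1 penalty (`reg_alpha`) is left out for brevity.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda):
    """Optimal leaf value: -G / (H + lambda). A larger lambda shrinks it toward zero."""
    return -grad_sum / (hess_sum + reg_lambda)


def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Quality term G^2 / (H + lambda) of a leaf."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma):
    """Loss reduction from splitting a node into two leaves, minus the penalty gamma."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = leaf_score(g_left, h_left, reg_lambda) + leaf_score(g_right, h_right, reg_lambda)
    return 0.5 * (children - parent) - gamma


# Hypothetical gradient/Hessian sums for the two sides of a candidate split.
g_left, h_left, g_right, h_right = -4.0, 2.0, 6.0, 3.0

unregularized = split_gain(g_left, h_left, g_right, h_right, reg_lambda=0.0, gamma=0.0)
with_l2 = split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0)
with_l2_and_gamma = split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=7.0)

print(unregularized, with_l2, with_l2_and_gamma)
print(leaf_weight(g_left, h_left, 0.0), leaf_weight(g_left, h_left, 1.0))
```

A split whose gain ends up negative after subtracting `gamma` is not worth keeping, which is the sense in which the penalty prunes the tree.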

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Grid search (GridSearchCV) over an XGBoost regressor (XGBRegressor)
# XGBoost is a gradient boosting implementation widely used for its speed and accuracy

# n_estimators sets the number of trees; subsample is the fraction of the data used to fit each tree
# learning_rate scales each tree's contribution; max_depth limits the depth of each tree
# n_jobs sets the number of CPU cores used in parallel (-1 uses all of them)
# random_state fixes the random seed so that results are reproducible
model = XGBRegressor(n_estimators=50,
                     subsample=1.0,
                     learning_rate=0.1,
                     max_depth=6,
                     n_jobs=-1,
                     random_state=random_state)

# Candidate values to search for the L1 (reg_alpha) and L2 (reg_lambda) regularization strengths
param_grid = {
    "reg_alpha": [0, 0.1],
    "reg_lambda": [0, 0.1],
}

# GridSearchCV runs sequentially here; parallelism comes from the model's own n_jobs=-1
grid_search = GridSearchCV(model, param_grid, cv=kf, scoring=scoring, refit=True)
grid_search.fit(X_train, y_train)

print("Best parameters: ", grid_search.best_params_)
print("Best CV score: {:.6f}".format(grid_search.best_score_))

models["XGBoost"] = grid_search.best_estimator_
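The grid search above keeps n_estimators fixed at 50. The early-stopping idea mentioned for XGBoost can be sketched independently of any library: track the validation error after each boosting round and stop once it has not improved for a set number of rounds. The function name, the `patience` parameter, and the error values below are hypothetical; XGBoost itself exposes this behavior through its `early_stopping_rounds` setting together with an evaluation set.

```python
def best_round_with_early_stopping(validation_errors, patience):
    """Return (best_round, rounds_trained) for a sequence of per-round validation errors.

    Training stops as soon as `patience` consecutive rounds fail to improve on the best error.
    """
    best_round, best_error = 0, float("inf")
    rounds_trained = 0
    for round_index, error in enumerate(validation_errors):
        rounds_trained = round_index + 1
        if error < best_error:
            best_round, best_error = round_index, error
        elif round_index - best_round >= patience:
            break
    return best_round, rounds_trained


# Made-up validation errors, one per boosting round.
errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49, 0.48]

print(best_round_with_early_stopping(errors, patience=3))
print(best_round_with_early_stopping(errors, patience=5))
```

With a patience of 3 the loop stops after seven rounds and keeps round 3, never seeing the later improvement. A patience of 5 trains through all nine rounds and keeps the last one, which shows the trade-off between training time and the risk of stopping too early.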

5. LightGBM (Light Gradient Boosting Machine):

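Similar to XGBoost, but it typically trains faster and uses less memory.
It speeds up training with optimizations such as GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling).
It trains and predicts quickly on large datasets and can handle categorical features natively, without one-hot encoding, once they are marked as categorical.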
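GOSS can be illustrated with a short sketch: keep every sample with a large gradient, draw a random subset of the remaining ones, and up-weight that subset so the overall gradient statistics stay roughly unbiased. The code below is a plain-Python illustration under assumed inputs, not LightGBM's implementation; `top_rate` and `other_rate` mirror the parameter names used in this post's LightGBM code.

```python
import random


def goss_sample(gradients, top_rate, other_rate, seed=0):
    """Return (indices, weights) selected by Gradient-based One-Side Sampling."""
    n = len(gradients)
    n_top = int(n * top_rate)
    n_other = int(n * other_rate)
    # Sort sample indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top, rest = order[:n_top], order[n_top:]
    # Randomly keep a fraction of the small-gradient samples.
    rng = random.Random(seed)
    sampled = rng.sample(rest, n_other)
    # Up-weight the sampled small-gradient rows to compensate for the ones dropped.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * n_top + [amplify] * n_other
    return indices, weights


# Hypothetical gradients for 100 training samples, spread between -5.0 and 4.9.
gradients = [((i * 37) % 100 - 50) / 10 for i in range(100)]

indices, weights = goss_sample(gradients, top_rate=0.2, other_rate=0.1)
print(len(indices), weights[0], weights[-1])
```

With these rates only 30 of the 100 samples are used to build the next tree: the 20 with the largest gradients at weight 1 and 10 of the others at a weight of about 8.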

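Choose a model according to its characteristics and the task at hand. A Decision Tree is easy to interpret and explain but tends to overfit. AdaBoost and Gradient Boosting generally predict more accurately than a single tree, and XGBoost and LightGBM are useful when fast training and prediction on large datasets are required.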

from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV

# LightGBM is a fast gradient boosting framework that uses tree-based learning and supports distributed training

# data_sample_strategy="goss" enables GOSS (this parameter requires LightGBM 4.0 or later)
# top_rate is the fraction of large-gradient samples kept; other_rate is the fraction sampled from the rest
model = LGBMRegressor(n_estimators=50,
                      learning_rate=0.1,
                      data_sample_strategy="goss",
                      top_rate=0.2,
                      other_rate=0.1,
                      force_col_wise=True,
                      verbosity=0,
                      n_jobs=-1,
                      random_state=random_state)

# Candidate values to search for reg_alpha, reg_lambda, and enable_bundle
# reg_alpha and reg_lambda control L1 and L2 regularization
# enable_bundle switches Exclusive Feature Bundling (EFB) on or off
param_grid = {
    "reg_alpha": [0, 0.1],
    "reg_lambda": [0, 0.1],
    "enable_bundle": [True, False]
}

# GridSearchCV runs sequentially here; parallelism comes from the model's own n_jobs=-1
grid_search = GridSearchCV(model, param_grid, cv=kf, scoring=scoring, refit=True)
grid_search.fit(X_train, y_train)

print("Best parameters: ", grid_search.best_params_)
print("Best CV score: {:.6f}".format(grid_search.best_score_))

models["LightGBM"] = grid_search.best_estimator_