사이킷런 의사결정나무 알고리즘 및 당뇨병 데이터셋

Machine Learning

사이킷런 의사결정나무 알고리즘 및 당뇨병 데이터셋

뉴욕킴 2024. 1. 16. 20:51

사이킷런 의사결정나무 알고리즘

Supervised learning의 classification 기법

새로 들어온 고객이 물건을 구매할지, 구매하지 않을지를 분류
1.10. Decision Trees — scikit-learn 1.3.2 documentation

1.10. Decision Trees

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning s...

scikit-learn.org

tree 패키지 불러오기

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]

의사결정 트리 분류 (DecisionTreeClassifier)

>>> clf = tree.DecisionTreeClassifier()
>>> clf

fit 함수를 통해 모델 학습 및 저장

clf = clf.fit(X, Y)
clf

predict 함수는 정답을 0,1 중 하나로만 나타냄

clf.predict([[2., 2.]])

predict_proba 함수를 사용하여 값이 비율로 출력

clf.predict_proba([[2., 2.]])

>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, y)

iris.data, iris.target을 X,y로 바꾸어 시각화 진

import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
t = tree.plot_tree(clf.fit(X, y))

당뇨병 데이터셋

Pima Indians Diabetes Database (kaggle.com)

Pima Indians Diabetes Database

Predict the onset of diabetes based on diagnostic measures

www.kaggle.com

Data 구성

Pregnancies : 임신 횟수
Glucose : 2시간 동안의 경구 포도당 내성 검사에서 혈장 포도당 농도
BloodPressure : 이완기 혈압 (mm Hg)
SkinThickness : 삼두근 피부 주름 두께 (mm), 체지방을 추정하는데 사용되는 값
Insulin : 2시간 혈청 인슐린 (mu U / ml)
BMI : 체질량 지수 (체중kg / 키(m)^2)
DiabetesPedigreeFunction : 당뇨병 혈통 기능
Age : 나이
Outcome : 768개 중에 268개의 결과 클래스 변수(0 또는 1)는 1이고 나머지는 0입니다.

라이브러리 로드

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

데이터셋 로드

df=pd.read_csv('data/diabetes.csv')
df.shape

df.head()

학습, 예측 데이터셋 나누기

split_count = int(df.shape[0] * 0.8)
split_count

train = df[:split_count].copy()
train.shape

test=df[split_count:].copy()
test.shape

학습, 예측에 사용할 컬럼 지정

feature_names = train.columns[:-1].tolist()
feature_names

정답 컬럼 지정

label_name = train.columns[-1]
label_name

학습, 예측 데이터셋 만들기

X_train = train[feature_names]
print(X_train.shape)
X_train.head()

y_train = train[label_name]
print(y_train.shape)
y_train.head()

예측에 사용할 데이터셋 만들기

X=test = test[feature_names]
pring(X_test.shape)
X_test.head()

y_test = test[label_name]
print(y_test.shape)
y_test.head()

'Machine Learning' 카테고리의 다른 글

4wk_예측값과 실제값 비교 : MSE · RMSE · MAE · R2 (0)	2024.03.31
1wk_Machine Learning_Python Packages (0)	2024.03.09
사이킷런과 머신러닝이란? (2)	2024.01.11
Machine Learning_Numpy (0)	2023.09.16
What is Machine Learning? (0)	2023.09.04

현재글사이킷런 의사결정나무 알고리즘 및 당뇨병 데이터셋

250x250

데이터 공부하는 뉴욕킴

데이터 공부하는 뉴욕킴입니다. 데이터 사이언스 함께 공부해요😘

데이터, r, 프로그래머스, 머신러닝완벽가이드, SQL, 국비지원, 딥러닝, 패스트캠퍼스, 회귀, 태블로, 데이터분석, 데이터시각화, 데이터베이스, 파이썬기초, 이대데이터사이언스, 빅데이터, 머신러닝, 빅데이터공모전, 파이썬머신러닝완벽가이드, 파이썬,

Today :
Yesterday :

일	월	화	수	목	금	토
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31

데이터 공부하는 뉴욕킴

사이킷런 의사결정나무 알고리즘 및 당뇨병 데이터셋

사이킷런 의사결정나무 알고리즘

당뇨병 데이터셋

'Machine Learning' 카테고리의 다른 글

'Machine Learning'의 다른글

티스토리툴바

사이킷런 의사결정나무 알고리즘 및 당뇨병 데이터셋

사이킷런 의사결정나무 알고리즘

당뇨병 데이터셋

'Machine Learning' 카테고리의 다른 글

'Machine Learning'의 다른글

관련글

티스토리툴바