6.1 Taking a look at ML data


6.1 Exploring Machine Learning Data

First, let's import the libraries.

# Library of convenient functions for numerical computation
import numpy as np
# Library for working with data as DataFrames
import pandas as pd
# Data visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Datasets bundled with scikit-learn
from sklearn import datasets

6.1.1 Predicting house prices

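# Boston house-price data, made up of 13 features
# Note: load_boston was removed in scikit-learn 1.2, so this call needs an earlier version
raw_boston = datasets.load_boston()
# Features
X_boston = pd.DataFrame(raw_boston.data)
# Target
y_boston = pd.DataFrame(raw_boston.target)
# Concatenate the features and the target into a single DataFrame
# axis=0: stack rows (top to bottom) / axis=1: join columns (side by side)
df_boston = pd.concat([X_boston, y_boston], axis=1)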
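
The effect of the axis argument mentioned in the comment above can be illustrated with a small sketch. The two frames below are made-up toy data (the names left and right are only for illustration), not part of the Boston dataset:

```python
import pandas as pd

# Two tiny, made-up frames used only for illustration.
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"c": [5, 6]})

# axis=0 stacks the frames top to bottom; columns that one frame
# lacks are filled with NaN.
stacked = pd.concat([left, right], axis=0)

# axis=1 places the frames side by side, aligning on the row index.
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape)       # (4, 3)
print(side_by_side.shape)  # (2, 3)
```

Because the feature frame and the target frame share the same row index, axis=1 is the variant that yields one row per sample.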

# Check the total number of rows in df_boston
len(df_boston)

# Show the first 5 rows of df_boston
df_boston.head()
# Show the first 3 rows of df_boston
# df_boston.head(3)

	0	1	2	3	4	5	6	7	8	9	10	11	12	0
0	0.00632	18.0	2.31	0.0	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.0	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.0	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.0	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.0	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2

# Check the feature names
feature_boston = raw_boston.feature_names
print(feature_boston)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

# Set the column names: the feature names plus 'target'
col_boston = np.append(feature_boston, ['target'])
df_boston.columns = col_boston
df_boston.head()

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target
0	0.00632	18.0	2.31	0.0	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.0	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.0	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.0	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.0	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2

6.1.2 Classifying flowers

raw_iris = datasets.load_iris()
X_iris = pd.DataFrame(raw_iris.data)
y_iris = pd.DataFrame(raw_iris.target)
df_iris = pd.concat([X_iris, y_iris], axis=1)

feature_iris = raw_iris.feature_names
print(feature_iris)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

col_iris = np.append(feature_iris, ['target'])
df_iris.columns = col_iris
df_iris.head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	5.1	3.5	1.4	0.2	0
1	4.9	3.0	1.4	0.2	0
2	4.7	3.2	1.3	0.2	0
3	4.6	3.1	1.5	0.2	0
4	5.0	3.6	1.4	0.2	0
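
The target column stores each species as an integer code. As a minimal sketch, the codes can be turned back into readable labels through the target_names attribute of the loaded object. The variable name species is an assumption introduced here for illustration:

```python
import pandas as pd
from sklearn import datasets

raw_iris = datasets.load_iris()

# target_names holds the species label for each integer class (0, 1, 2),
# so indexing it with the target array yields one label per sample.
species = pd.Series(raw_iris.target_names[raw_iris.target], name="species")

print(list(raw_iris.target_names))
print(species.value_counts().to_dict())
```

Each of the three species appears 50 times, so the classes are balanced.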

6.1.3 Classifying wines

raw_wine = datasets.load_wine()
X_wine = pd.DataFrame(raw_wine.data)
y_wine = pd.DataFrame(raw_wine.target)
df_wine = pd.concat([X_wine, y_wine], axis=1)

df_wine.head()

	0	1	2	3	4	5	6	7	8	9	10	11	12	0
0	14.23	1.71	2.43	15.6	127.0	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065.0	0
1	13.20	1.78	2.14	11.2	100.0	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050.0	0
2	13.16	2.36	2.67	18.6	101.0	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185.0	0
3	14.37	1.95	2.50	16.8	113.0	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480.0	0
4	13.24	2.59	2.87	21.0	118.0	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735.0	0

feature_wine = raw_wine.feature_names
print(feature_wine)

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

col_wine = np.append(feature_wine, ['target'])
df_wine.columns = col_wine
df_wine.head()

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline	target
0	14.23	1.71	2.43	15.6	127.0	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065.0	0
1	13.20	1.78	2.14	11.2	100.0	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050.0	0
2	13.16	2.36	2.67	18.6	101.0	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185.0	0
3	14.37	1.95	2.50	16.8	113.0	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480.0	0
4	13.24	2.59	2.87	21.0	118.0	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735.0	0
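
As an alternative to building and renaming the frame by hand, recent scikit-learn versions (0.23 or later) can return pandas objects directly through the as_frame option. The sketch below is one possible shortcut, not the approach used in this post. The name df_wine_alt is chosen here only to avoid clashing with df_wine:

```python
from sklearn import datasets

# as_frame=True asks the loader to return pandas objects.
wine = datasets.load_wine(as_frame=True)

# .frame holds the named feature columns plus a final 'target' column.
df_wine_alt = wine.frame

print(df_wine_alt.shape)
print(df_wine_alt["target"].value_counts().sort_index().to_dict())
```

The class counts show that the three wine classes are not evenly sized, which is worth keeping in mind when splitting the data later.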

6.1.4 Predicting diabetes

raw_diab = datasets.load_diabetes()
X_diab = pd.DataFrame(raw_diab.data)
y_diab = pd.DataFrame(raw_diab.target)
df_diab = pd.concat([X_diab, y_diab], axis=1)

feature_diab = raw_diab.feature_names
col_diab = np.append(feature_diab, ['target'])
df_diab.columns = col_diab
df_diab.head()

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6	target
0	0.038076	0.050680	0.061696	0.021872	-0.044223	-0.034821	-0.043401	-0.002592	0.019908	-0.017646	151.0
1	-0.001882	-0.044642	-0.051474	-0.026328	-0.008449	-0.019163	0.074412	-0.039493	-0.068330	-0.092204	75.0
2	0.085299	0.050680	0.044451	-0.005671	-0.045599	-0.034194	-0.032356	-0.002592	0.002864	-0.025930	141.0
3	-0.089063	-0.044642	-0.011595	-0.036656	0.012191	0.024991	-0.036038	0.034309	0.022692	-0.009362	206.0
4	0.005383	-0.044642	-0.036385	0.021872	0.003935	0.015596	0.008142	-0.002592	-0.031991	-0.046641	135.0
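
The feature values above look unusual for quantities such as age or BMI because the loader returns them already mean-centered and scaled, so that each column has a sum of squares close to 1. A small sketch to check this, with col_means and col_sumsq as names introduced here for illustration:

```python
import numpy as np
from sklearn import datasets

raw_diab = datasets.load_diabetes()
X = raw_diab.data

# Each feature column should average out to roughly zero ...
col_means = X.mean(axis=0)
# ... and its squared values should sum to roughly one.
col_sumsq = (X ** 2).sum(axis=0)

print(np.round(col_means, 6))
print(np.round(col_sumsq, 6))
```

Only the features are scaled this way. The target column keeps its original units.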

6.1.5 Predicting breast cancer

raw_bc = datasets.load_breast_cancer()
X_bc = pd.DataFrame(raw_bc.data)
y_bc = pd.DataFrame(raw_bc.target)
df_bc = pd.concat([X_bc, y_bc], axis=1)

feature_bc = raw_bc.feature_names
col_bc = np.append(feature_bc, ['target'])
df_bc.columns = col_bc
df_bc.head()

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension	target
0	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	0.07871	...	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890	0
1	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902	0
2	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	0.05999	...	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758	0
3	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	0.2597	0.09744	...	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300	0
4	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	0.1809	0.05883	...	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678	0

5 rows × 31 columns
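
Every dataset in this section goes through the same steps: wrap the features, wrap the target, concatenate, then rename the columns. As a minimal sketch, these steps could be folded into one helper. The function bunch_to_frame is a hypothetical name introduced here, not part of scikit-learn or pandas:

```python
import pandas as pd
from sklearn import datasets


def bunch_to_frame(bunch):
    """Combine a scikit-learn dataset's data and target into one DataFrame."""
    # Name the feature columns up front instead of renaming afterwards.
    X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    # A named Series becomes the 'target' column after concatenation.
    y = pd.Series(bunch.target, name="target")
    return pd.concat([X, y], axis=1)


df_bc = bunch_to_frame(datasets.load_breast_cancer())
print(df_bc.shape)
print(df_bc["target"].value_counts().sort_index().to_dict())
```

The same call should work for load_iris, load_wine, and load_diabetes, since each returns an object with data, target, and feature_names attributes.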


Inyup Lee
