Colab으로 하루에 하나씩 딥러닝 (Deep Learning with Colab, One Topic a Day)

Deep Learning Concepts

Natural Language Preprocessing_2. Preprocessing_4) Normalization

Elleik 2022. 12. 10. 23:45

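Normalization

In text preprocessing, normalization merges words that are written differently into a single, shared form.
For numeric data, when features differ widely in scale, the feature with the larger range has a disproportionately large influence on the model.
- If MonthlyIncome ranges over 0~10,000 and RelationshipSatisfaction ranges over 0~5, MonthlyIncome has relatively more influence.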
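
The scale effect described above can be made concrete with a small sketch. The four rows below are hypothetical values invented for illustration (they are not from any dataset in this post), and `standardize` is a helper name assumed here; it applies the usual z-score formula, z = (x - mean) / std, per column.

```python
import numpy as np

# Hypothetical rows: [MonthlyIncome, RelationshipSatisfaction]
data = np.array([
    [2000.0, 1.0],
    [2100.0, 5.0],
    [9000.0, 1.0],
    [9500.0, 4.0],
])

def standardize(columns):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    return (columns - columns.mean(axis=0)) / columns.std(axis=0)

def income_share(row_a, row_b):
    """Fraction of the squared Euclidean distance that comes from the income column."""
    diff = row_a - row_b
    return diff[0] ** 2 / np.sum(diff ** 2)

scaled = standardize(data)

# Rows 0 and 1 have similar income but very different satisfaction.
raw_share = income_share(data[0], data[1])
scaled_share = income_share(scaled[0], scaled[1])

print(f"income share of distance (raw):    {raw_share:.4f}")
print(f"income share of distance (scaled): {scaled_share:.4f}")
```

Before scaling, almost all of the distance between the two rows comes from the income column, even though the rows differ mainly in satisfaction. After standardization, the satisfaction difference dominates, as one would expect.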

Normalization Practice (Preparation)

Compare accuracy on the same dataset with and without normalization.
The dataset is the covertype CSV file from https://datahub.io/machine-learning/covertype.

The file is saved to Google Drive, and the drive is mounted before running the code.
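
The mounting step is not shown in the post. A minimal sketch is below, assuming the notebook runs in Colab and uses the default mount point `/content/drive`; the wrapper name `mount_drive_if_colab` is hypothetical and only exists so the snippet does nothing outside Colab.

```python
def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return True if mounted."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False  # not in Colab, nothing to mount
    drive.mount(mount_point)  # prompts for authorization on first use
    return True

mounted = mount_drive_if_colab()
print("Drive mounted:", mounted)
```

Once mounted, files under "My Drive" appear below `/content/drive/MyDrive/`, which matches the path used in the code in this post.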

Normalization Practice (Without Normalization)

### Import libraries

import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import models
from tensorflow.keras import layers

### Load the dataset and train the model

df = pd.read_csv('/content/drive/MyDrive/covertype_csv.csv')  # path to the saved CSV file
x = df[df.columns[:54]]  # the first 54 columns are the features
y = df['class']  # use the class column as the label

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=90)  # split into training and test sets, with 70% of the data used for training

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu',
                          input_shape=(x_train.shape[1],)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(8, activation='softmax')
])  # the output layer uses softmax; 8 units so that the integer labels 1~7 are valid indices

model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])  # y holds integer class labels for multi-class classification, so use sparse categorical cross-entropy

history1 = model.fit(
    x_train, y_train,
    epochs=26, batch_size=60,
    validation_data=(x_test, y_test)
)  # train the model

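Result (Without Normalization)

After training, accuracy on the test dataset is about 85%, with a loss of about 0.35.
Accuracy barely changes over the 26 epochs, which suggests that training is making little progress.
Because the columns do not share a similar range of values, the gradient can oscillate back and forth, and reaching a local or global minimum takes a long time.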
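
One way to see why mismatched ranges slow gradient descent is to look at the condition number of the loss surface. The sketch below assumes a linear model with squared-error loss, where the Hessian is proportional to the covariance of the centered features; it is an illustration on synthetic data (the ranges borrow the MonthlyIncome and RelationshipSatisfaction example from earlier), not an analysis of the covertype network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features with very different ranges (0~10,000 and 0~5).
income = rng.uniform(0, 10_000, size=500)
satisfaction = rng.uniform(0, 5, size=500)
X = np.column_stack([income, satisfaction])

def hessian_condition_number(features):
    """Condition number of the squared-error Hessian for a linear model."""
    centered = features - features.mean(axis=0)
    hessian = centered.T @ centered / len(features)
    return np.linalg.cond(hessian)

# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = hessian_condition_number(X)
cond_std = hessian_condition_number(X_std)

print(f"condition number (raw):          {cond_raw:.3e}")
print(f"condition number (standardized): {cond_std:.3f}")
```

A large condition number means the loss surface is a long, narrow valley: a step size small enough for the steep direction is far too small for the flat one, so the updates either oscillate or crawl. After standardization the condition number is close to 1 and a single learning rate suits both directions.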

Normalization Practice (With Normalization)

### Normalize the data

from sklearn import preprocessing
df = pd.read_csv('/content/drive/MyDrive/covertype_csv.csv')
x = df[df.columns[:54]]  # features only; [:55] would also pull in the class label column
y = df['class']
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=90)
x_train = x_train.astype('float64')  # float copies, so the scaled values can be written back in place
x_test = x_test.astype('float64')

train_norm = x_train[x_train.columns[0:10]]  # columns of the training set that need normalization
test_norm = x_test[x_test.columns[0:10]]  # the same columns of the test set

std_scale = preprocessing.StandardScaler().fit(train_norm)  # fit on the training set only (zero mean, unit variance)
x_train_norm = std_scale.transform(train_norm)

training_norm_col = pd.DataFrame(x_train_norm, index=train_norm.index, columns=train_norm.columns)  # convert the NumPy array to a DataFrame
x_train.update(training_norm_col)
print(x_train.head())

x_test_norm = std_scale.transform(test_norm)  # normalize the test set with the training-set statistics
testing_norm_col = pd.DataFrame(x_test_norm, index=test_norm.index, columns=test_norm.columns)
x_test.update(testing_norm_col)
print(x_test.head())

### Build and train the model

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu',
                          input_shape=(x_train.shape[1],)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(8, activation='softmax')
])

model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
history2 = model.fit(
    x_train, y_train,
    epochs=26, batch_size=60,
    validation_data=(x_test, y_test)
)
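
To compare the two runs side by side, the `history` dictionaries returned by `fit` can be summarized. The helper below is a sketch: `summarize_history` is a hypothetical name, and the numbers in `example` are made up purely to show the output shape. It assumes the model was compiled with `metrics=['accuracy']` and trained with `validation_data`, so that the `val_accuracy` and `val_loss` keys are present.

```python
def summarize_history(history_dict):
    """Summarize validation metrics from a Keras History.history dictionary."""
    val_acc = history_dict["val_accuracy"]
    val_loss = history_dict["val_loss"]
    # Index of the epoch with the highest validation accuracy (first one on ties).
    best_index = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return {
        "final_val_accuracy": val_acc[-1],
        "final_val_loss": val_loss[-1],
        "best_epoch": best_index + 1,  # epochs are reported 1-based
        "accuracy_gain": val_acc[-1] - val_acc[0],
    }

# Hypothetical values for illustration only, not results from this post.
example = {"val_accuracy": [0.70, 0.72, 0.71], "val_loss": [0.90, 0.80, 0.85]}
summary = summarize_history(example)
print(summary)
```

In the notebook, `summarize_history(history1.history)` and `summarize_history(history2.history)` would give the figures for the runs without and with normalization. An `accuracy_gain` near zero matches the "accuracy barely changes" symptom.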

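Result (With Normalization)

In the original run, accuracy on the test dataset reached 100% after normalization, and the loss dropped sharply compared with the run without normalization.
Accuracy of 100% is rare, and here it was reached by epoch 2. The original explanation is that the data were already normalized to some extent, but label leakage is a more likely cause: that run selected the first 55 columns as features, and if the CSV holds 54 feature columns followed by class, the label itself was part of the input. With only the first 54 columns as features, accuracy should be expected to fall below 100%.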
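
When accuracy reaches 100% this quickly, a cheap sanity check is to confirm that the label column is not among the selected features. The sketch below uses plain lists of column names; `select_feature_columns` and the placeholder column names are hypothetical, and the layout (54 feature columns followed by `class`) is an assumption about the CSV.

```python
def select_feature_columns(all_columns, label, n_features):
    """Take the first n_features column names, refusing to include the label."""
    features = list(all_columns[:n_features])
    if label in features:
        raise ValueError(
            f"label column {label!r} is among the features (label leakage)"
        )
    return features

# Assumed layout: 54 feature columns followed by the label column.
columns = [f"feature_{i}" for i in range(54)] + ["class"]

features = select_feature_columns(columns, label="class", n_features=54)
print(len(features), "feature columns selected")

try:
    select_feature_columns(columns, label="class", n_features=55)
    leak_detected = False
except ValueError as err:
    leak_detected = True
    print("Check failed:", err)
```

With a DataFrame, the same check would be `'class' in x.columns` right after the features are selected.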

Source: Seo Ji-young, 『딥러닝 텐서플로 교과서』 (Deep Learning TensorFlow Textbook), Gilbut (2022)
