NLP Preprocessing_3. Embedding_4) Count/Prediction-Based Embedding

Colab으로 하루에 하나씩 딥러닝 (One Deep Learning Topic a Day with Colab)


Elleik 2022. 12. 15. 23:44


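Count/Prediction-Based Embedding

GloVe is an embedding technique that addresses the shortcomings of both the count-based and the prediction-based approaches covered in the previous posts.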

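GloVe (Global Vectors for Word Representation)

A model designed to address the weaknesses of count-based LSA (Latent Semantic Analysis) and prediction-based Word2Vec.
It is a word embedding that incorporates global co-occurrence probability information for words, combined with the skip-gram approach.
In other words, it keeps the context-window idea of skip-gram and adds corpus-wide co-occurrence statistics on top of it.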
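
To make the "global co-occurrence" idea a bit more concrete, here is a minimal sketch in plain Python. It is not the GloVe training code: it only counts co-occurrences within a context window and applies the weighting function f(x) from the GloVe paper (with the paper's defaults x_max = 100 and alpha = 0.75). The toy corpus and the function names `cooccurrence_counts` and `glove_weight` are made up for illustration, and the counts here are unweighted, whereas the original implementation scales each count by the inverse of the distance between the two words.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each (center, context) pair appears within `window` tokens."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(center, tokens[j])] += 1.0
    return counts


def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): grows as (x / x_max) ** alpha and is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


# Toy corpus (hypothetical data, for illustration only)
corpus = [
    "ice is cold and solid".split(),
    "steam is hot and gas".split(),
]
X = cooccurrence_counts(corpus, window=2)

# Each nonzero X[i, j] contributes one term to the GloVe loss:
#   f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij) ** 2
print(X[("is", "and")])                # "is" and "and" co-occur in both sentences
print(glove_weight(X[("is", "and")]))  # weight of that pair's loss term
```

Skip-gram only ever sees one window at a time, while a table like `X` summarizes the whole corpus. Fitting word vectors to the logarithm of those counts is what lets GloVe use global statistics.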

<Loading the dataset>

The code below uses glove.6B.100d.txt from glove.6B.zip.

Source: https://nlp.stanford.edu/projects/glove/

### Importing libraries and loading the dataset

import numpy as np
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from sklearn.decomposition import PCA
from gensim.test.utils import get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# A plain path is enough here; datapath() is meant for gensim's bundled test data
glove_file = '/content/drive/MyDrive/glove.6B.100d.txt'
word2vec_glove_file = get_tmpfile("glove.6B.100d.word2vec.txt")  # temporary output file in Word2Vec text format
glove2word2vec(glove_file, word2vec_glove_file)  # convert the GloVe data to Word2Vec format

### Parameters of glove2word2vec

glove_file: the GloVe input file

word2vec_glove_file: the Word2Vec output file
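
The GloVe text format and the Word2Vec text format differ only in the first line: the Word2Vec format starts with a header holding the vocabulary size and the vector dimension. The sketch below shows that idea on a tiny made-up file. `add_word2vec_header` is a hypothetical helper written for this post, not gensim's actual implementation, and the 3-dimensional vectors are invented.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    """Copy `glove_path` to `out_path`, prepending a '<vocab_size> <dim>' line."""
    with open(glove_path, encoding="utf-8") as f:
        lines = f.readlines()
    vocab_size = len(lines)
    dim = len(lines[0].rstrip().split(" ")) - 1  # the first field is the word itself
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f"{vocab_size} {dim}\n")
        f.writelines(lines)
    return vocab_size, dim


# Tiny stand-in for glove.6B.100d.txt (made-up 3-dimensional vectors)
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy_glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy_w2v.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("king 0.1 0.2 0.3\nqueen 0.1 0.25 0.35\n")

print(add_word2vec_header(toy_glove, toy_w2v))
with open(toy_w2v, encoding="utf-8") as f:
    print(f.readline().strip())  # the header line that was added
```

Note that in gensim 4.0 and later the conversion script is marked as deprecated, and the GloVe file can be loaded directly with `KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)`.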

### Return the list of words similar to 'bill'

model = KeyedVectors.load_word2vec_format(word2vec_glove_file)  # load the converted file so the results can be inspected
model.most_similar('bill')  # list the words most similar to 'bill'
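
For reference, "similar" here means cosine similarity between word vectors. The sketch below ranks words that way using only the standard library. The 3-dimensional vectors are invented for illustration (the real GloVe vectors have 100 dimensions), and `toy_most_similar` is a hypothetical helper, not the gensim method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(vectors, word, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, for illustration only
toy_vectors = {
    "bill": [0.9, 0.1, 0.0],
    "legislation": [0.8, 0.2, 0.1],
    "senate": [0.7, 0.3, 0.0],
    "cherry": [0.0, 0.1, 0.9],
}
for other, score in toy_most_similar(toy_vectors, "bill", topn=2):
    print(f"{other} : {score:.4f}")
```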

### Return the list of words similar to 'cherry'

model.most_similar('cherry')  # list the words most similar to 'cherry'

### Return the list of words unrelated to 'cherry'

model.most_similar(negative=['cherry'])  # extract the words least related to 'cherry'

### Return the word that is highly similar to 'woman' and 'king' but unrelated to 'man'

result = model.most_similar(positive=['woman','king'], negative=['man'])  # close to woman and king, far from man
print("{} : {:.4f}".format(*result[0]))  # print the top word and its similarity score
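
The positive/negative query above is essentially vector arithmetic: add the positive word vectors, subtract the negative ones, and look for the nearest remaining word by cosine similarity. The sketch below reproduces that on invented 3-dimensional vectors. `analogy_query` is a hypothetical helper, and it is a simplification: gensim averages the normalized vectors rather than summing them, which points in the same direction.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def analogy_query(vectors, positive, negative):
    """Return the (word, score) closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + a for t, a in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - a for t, a in zip(target, unit(vectors[word]))]
    target = unit(target)

    exclude = set(positive) | set(negative)  # query words are never returned
    best_word, best_score = None, -2.0
    for word, vec in vectors.items():
        if word in exclude:
            continue
        score = sum(t * a for t, a in zip(target, unit(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


# Made-up vectors; the axes are roughly (royalty, male, female)
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [1.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}
word, score = analogy_query(toy_vectors, positive=["woman", "king"], negative=["man"])
print("{} : {:.4f}".format(word, score))
```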

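### Return the word that relates to 'france' the way 'beer' relates to 'australia'

def analogy(x1,x2,y1):
  # x1 is to x2 as y1 is to the returned word
  result = model.most_similar(positive=[y1,x2],negative=[x1])
  return result[0][0]

analogy('australia','beer','france')  # the call must sit outside the function body to run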

### Infer a new word from 'tall', 'tallest', and 'long'

analogy('tall','tallest','long')

### Return the word with the lowest similarity among 'breakfast cereal dinner lunch'

print(model.doesnt_match("breakfast cereal dinner lunch".split()))  # return the word least similar to the rest
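
As far as I understand it, the odd-one-out check works by averaging the normalized vectors of the given words and returning the word that is farthest from that average. The sketch below follows that idea on invented 2-dimensional vectors. `toy_doesnt_match` is a hypothetical helper, so treat it as an approximation of what gensim does rather than its exact code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def toy_doesnt_match(vectors, words):
    """Return the word whose vector is least similar to the mean of all of them."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(normed[w][k] for w in words) / len(words) for k in range(dim)])
    scores = {w: sum(a * b for a, b in zip(normed[w], mean)) for w in words}
    return min(scores, key=scores.get)


# Made-up vectors: the three meals point one way, 'cereal' another
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "cereal": [0.3, 0.9],
}
print(toy_doesnt_match(toy_vectors, "breakfast cereal dinner lunch".split()))
```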

Reference: 서지영 (Seo Ji-young), 『딥러닝 텐서플로 교과서』, 길벗 (Gilbut), 2022
