Word Embedding with Python: Word2Vec

Published on 2024-11-08

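Implementing word2vec with Python (and Gensim)

Note: this code was written with Python 3.6.1 (Gensim 2.3.0)
Python implementation and application of word2vec with Gensim
Original paper: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.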

import re
import numpy as np

from gensim.models import Word2Vec
from nltk.corpus import gutenberg
from multiprocessing import Pool
from scipy import spatial

Import the training dataset
Import Shakespeare's Hamlet corpus from the nltk library

sentences = list(gutenberg.sents('shakespeare-hamlet.txt'))   # import the corpus and convert into a list

print('Type of corpus: ', type(sentences))
print('Length of corpus: ', len(sentences))

Type of corpus:  <class 'list'>
Length of corpus:  3106

print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])

['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']']
['Actus', 'Primus', '.']
['Fran', '.']

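Preprocess the data

Use the re module to preprocess the data
Convert all letters to lowercase
Remove punctuation, numbers, and other non-alphabetic tokens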

for i in range(len(sentences)):
    # keep only tokens that start with a letter, lowercased
    sentences[i] = [word.lower() for word in sentences[i] if re.match('^[a-zA-Z]+', word)]
print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])

['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare']
['actus', 'primus']
['fran']
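
The filtering step can also be sketched as a small standalone function, which is easier to test on a single sentence. This is only an illustration: the helper name `clean_sentence` and the sample tokens are assumptions, and the pattern used here is `^[a-zA-Z]+`.

```python
import re

# Tokens must start with at least one ASCII letter to be kept.
STARTS_WITH_LETTER = re.compile(r'^[a-zA-Z]+')

def clean_sentence(tokens):
    """Lowercase the tokens and drop those that do not start with a letter."""
    return [tok.lower() for tok in tokens if STARTS_WITH_LETTER.match(tok)]

sample = ['[', 'The', 'Tragedie', 'of', 'Hamlet', '1599', ']']
cleaned = clean_sentence(sample)
print(cleaned)  # ['the', 'tragedie', 'of', 'hamlet']
```

Note that `re.match` only anchors at the start of the token, so a token such as 'a1' would still pass this filter; `re.fullmatch` would be the stricter choice if purely alphabetic tokens are wanted.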

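Create and train the model

Create a word2vec model and train it on the Hamlet corpus
Key parameters (https://radimrehurek.com/gensim/models/word2vec.html)
- sentences: training data (must be a list of tokenized sentences)
- size: dimensionality of the embedding space (renamed vector_size in Gensim 4)
- sg: CBOW if 0, skip-gram if 1
- window: number of words considered on each side of the target word (with a window size of 3, the 3 words to the left and the 3 words to the right are used as context)
- min_count: minimum number of occurrences a word needs to be included in the vocabulary
- iter: number of training iterations (renamed epochs in Gensim 4)
- workers: number of worker threads used for training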
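
To make the window and min_count parameters more concrete, here is a simplified sketch of what they control. It is not Gensim's implementation: the function names are hypothetical, and actual word2vec implementations randomly shrink the effective window for each target word, which this sketch ignores.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Return the set of words occurring at least min_count times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {word for word, count in counts.items() if count >= min_count}

def skipgram_pairs(tokens, window=3):
    """Return (center, context) pairs using a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = ['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare']
pairs = skipgram_pairs(sentence, window=3)
context_of_hamlet = [ctx for center, ctx in pairs if center == 'hamlet']
print(context_of_hamlet)
# ['the', 'tragedie', 'of', 'by', 'william', 'shakespeare']
```

With sg = 1 (skip-gram), pairs like these are what the model learns from: it is trained to predict each context word from the center word.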

model = Word2Vec(sentences = sentences, size = 100, sg = 1, window = 3, min_count = 1, iter = 10, workers = Pool()._processes)

model.init_sims(replace = True)
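
In Gensim 2.x, init_sims(replace = True) L2-normalizes the word vectors and discards the unnormalized originals to save memory. The sketch below, with made-up two-dimensional vectors and hypothetical helper names, shows why that is convenient: for unit-length vectors, cosine similarity reduces to a plain dot product.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors, chosen only for illustration.
v_a = [3.0, 4.0]
v_b = [4.0, 3.0]
u_a, u_b = l2_normalize(v_a), l2_normalize(v_b)

# Both values are approximately 0.96.
print(cosine(v_a, v_b), dot(u_a, u_b))
```

The sketch does not handle zero vectors, which would cause a division by zero.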

Save and load the model

A word2vec model can be saved to and loaded from local storage
Doing so avoids the cost of training the model again

model.save('word2vec_model')
model = Word2Vec.load('word2vec_model')

Similarity calculation

The similarity between embedded words (i.e., vectors) can be computed using metrics such as cosine similarity

model.most_similar('hamlet')

[('horatio', 0.9978846311569214),
('queen', 0.9971947073936462),
('laertes', 0.9971820116043091),
('king', 0.9968599081039429),
('mother', 0.9966716170310974),
('where', 0.9966292381286621),
('deere', 0.9965540170669556),
('ophelia', 0.9964221715927124),
('very', 0.9963752627372742),
('oh', 0.9963476657867432)]
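
Conceptually, a nearest-neighbor query like the one above ranks every other word in the vocabulary by its cosine similarity to the query word. The following is a minimal sketch of that idea on a toy vocabulary; the vectors are made up, the function is a hypothetical stand-in, and Gensim's own implementation is vectorized and more elaborate.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot_ab = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_ab / (norm_a * norm_b)

def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy embedding table, for illustration only.
toy_vectors = {
    'hamlet': [1.0, 0.0],
    'horatio': [0.9, 0.1],
    'king': [0.5, 0.5],
    'sword': [0.0, 1.0],
}
ranked = most_similar('hamlet', toy_vectors, topn=2)
print([word for word, _ in ranked])  # ['horatio', 'king']
```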

v1 = model['king']
v2 = model['queen']

# define a function that computes cosine similarity between two word vectors
def cosine_similarity(v1, v2):
    return 1 - spatial.distance.cosine(v1, v2)

cosine_similarity(v1, v2)

0.99437165260314941

Reposted from: https://dev.to/ragoli86/word-embedding-with-python-word2vec-540c?1