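Note: this code was written with Python 3.6.1 and Gensim 2.3.0.
Python Implementation and Application of word2vec with Gensim
Original paper: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.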
import re
import numpy as np
from gensim.models import Word2Vec
from nltk.corpus import gutenberg
from multiprocessing import Pool
from scipy import spatial

# import the corpus and convert it into a list of tokenized sentences
sentences = list(gutenberg.sents('shakespeare-hamlet.txt'))
print('Type of corpus: ', type(sentences))
print('Length of corpus: ', len(sentences))
Type of corpus:  <class 'list'>
Length of corpus:  3106
print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])
['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']']
['Actus', 'Primus', '.']
['Fran', '.']
Preprocessing the data
# keep only the tokens that start with a letter, and lowercase them
for i in range(len(sentences)):
    sentences[i] = [word.lower() for word in sentences[i] if re.match('^[a-zA-Z]+', word)]
print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])
['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare']
['actus', 'primus']
['fran']
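The filtering step above can also be packaged as a small function and tried on toy data without downloading the corpus. This is a minimal sketch: the name `keep_alpha_tokens` and the toy sentences are made up for illustration, and dropping empty sentences afterwards is an optional extra that the tutorial itself does not do.

```python
import re

# Same pattern as in the tutorial: the token must start with at least one ASCII letter.
ALPHA_START = re.compile(r'^[a-zA-Z]+')

def keep_alpha_tokens(sentences):
    """Lowercase every token and drop tokens that do not start with a letter."""
    return [
        [word.lower() for word in sentence if ALPHA_START.match(word)]
        for sentence in sentences
    ]

# Toy input shaped like the output of gutenberg.sents(): a list of token lists.
toy = [
    ['[', 'The', 'Tragedie', 'of', 'Hamlet', '1599', ']'],
    ['Actus', 'Primus', '.'],
    ['.'],
]

cleaned = keep_alpha_tokens(toy)
print(cleaned)

# A sentence that loses every token becomes an empty list; removing those is optional.
non_empty = [sentence for sentence in cleaned if sentence]
print(non_empty)
```

Punctuation and the year are removed because neither starts with a letter, which matches the change seen in the printed sentences.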
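# train a skip-gram model (sg = 1) with 100-dimensional vectors and a context window of 3;
# size and iter are the Gensim 2.3.0 parameter names (Gensim 4 renamed them to vector_size and epochs)
model = Word2Vec(sentences = sentences, size = 100, sg = 1, window = 3, min_count = 1, iter = 10, workers = Pool()._processes)
# L2-normalize the vectors in place; the model cannot be trained further after this call
model.init_sims(replace = True)

# save the model to disk and load it back
model.save('word2vec_model')
model = Word2Vec.load('word2vec_model')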
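To make the sg and window settings more concrete, the sketch below lists the (target, context) pairs that a skip-gram model is trained on. It is an illustration only, not Gensim's implementation: the function name `skipgram_pairs` is hypothetical, it always uses the full window, whereas word2vec implementations typically shrink the window at random for each target word, and it leaves out the training step itself.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every context word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ['the', 'tragedie', 'of', 'hamlet']

# With window=1 each word is paired only with its direct neighbours.
narrow = skipgram_pairs(tokens, window=1)
print(narrow)

# With window=3 (the value used in the tutorial) every word in this short
# sentence is paired with all the others.
wide = skipgram_pairs(tokens, window=3)
print(len(wide))
```

A larger window yields more pairs per sentence and tends to capture broader, more topical associations between words.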
model.most_similar('hamlet')
[('horatio', 0.9978846311569214),
 ('queen', 0.9971947073936462),
 ('laertes', 0.9971820116043091),
 ('king', 0.9968599081039429),
 ('mother', 0.9966716170310974),
 ('where', 0.9966292381286621),
 ('deere', 0.9965540170669556),
 ('ophelia', 0.9964221715927124),
 ('very', 0.9963752627372742),
 ('oh', 0.9963476657867432)]
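Conceptually, a most-similar query ranks every other word in the vocabulary by its cosine similarity to the query word. The sketch below shows that idea with a brute-force search in plain Python. The two-dimensional vectors are invented for illustration and are not taken from the trained model, and Gensim's own implementation is vectorized rather than written like this.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Made-up 2-D vectors; the model in this tutorial uses 100 dimensions.
toy_vectors = {
    'hamlet': [1.0, 0.0],
    'horatio': [0.9, 0.1],
    'king': [0.5, 0.5],
    'sword': [0.0, 1.0],
}

ranking = most_similar('hamlet', toy_vectors)
print([word for word, _ in ranking])
```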
v1 = model['king']
v2 = model['queen']

# define a function that computes cosine similarity between two words
def cosine_similarity(v1, v2):
    return 1 - spatial.distance.cosine(v1, v2)

cosine_similarity(v1, v2)
0.99437165260314941
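Because the vectors were L2-normalized with init_sims(replace = True), the cosine similarity of two word vectors should equal their plain dot product. The sketch below checks that relationship on two small made-up vectors in plain Python. The helper names are hypothetical and the numbers are not taken from the model.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(u, v):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up vectors for illustration only.
v1 = [3.0, 4.0]
v2 = [4.0, 3.0]

# After normalization, both vectors have length 1.
u1, u2 = l2_normalize(v1), l2_normalize(v2)

raw_cosine = cosine_similarity(v1, v2)
unit_dot = dot(u1, u2)
print(raw_cosine, unit_dot)  # the two values agree up to floating-point rounding
```

For unit-length vectors this means the similarity could also be computed with np.dot(v1, v2), without the division by the norms.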