"If a worker wants to do his job well, he must first sharpen his tools." - Confucius, "The Analects of Confucius. Lu Linggong"
Front page > Programming > Word-embedding-with-Python: Wordc

Word Embedding with Python: word2vec

Published on 2024-11-08


word2vec implementation with Python (& Gensim)

  • Note: This code is written in Python 3.6.1 (Gensim 2.3.0)

  • Python implementation and application of word2vec with Gensim

  • Original paper: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

import re
import numpy as np

from gensim.models import Word2Vec
from nltk.corpus import gutenberg
from multiprocessing import Pool
from scipy import spatial

Import training dataset

  • Import Shakespeare's Hamlet corpus from the nltk library
sentences = list(gutenberg.sents('shakespeare-hamlet.txt'))   # import the corpus and convert into a list (run nltk.download('gutenberg') first if needed)

print('Type of corpus: ', type(sentences))
print('Length of corpus: ', len(sentences))

Type of corpus:  <class 'list'>
Length of corpus: 3106

print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])

['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']']
['Actus', 'Primus', '.']
['Fran', '.']

Preprocess data

  • Use the re module to preprocess the data
  • Convert all letters to lowercase
  • Remove punctuation, numbers, etc.
for i in range(len(sentences)):
    sentences[i] = [word.lower() for word in sentences[i] if re.match('^[a-zA-Z]+$', word)]   # keep purely alphabetic tokens only
print(sentences[0])    # title and author (punctuation and numbers removed)
print(sentences[1])
print(sentences[10])

['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare']
['actus', 'primus']
['fran']
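
Sentences that consisted only of punctuation or numbers are now empty lists. Empty lists contribute nothing during training, but it is tidier to drop them first; a minimal extra step, not part of the original post:

sentences = [sentence for sentence in sentences if len(sentence) > 0]   # drop sentences emptied by the filter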

Create and train model

  • Create a word2vec model and train it on the Hamlet corpus
  • Key parameter description (https://radimrehurek.com/gensim/models/word2vec.html); note that some parameter names changed in Gensim 4.x (see the note after the code below)
    • sentences: training data (must be a list of tokenized sentences)
    • size: dimension of the embedding space
    • sg: CBOW if 0, skip-gram if 1
    • window: number of context words considered on each side (if the window size is 3, the 3 words in the left neighborhood and the 3 words in the right neighborhood are considered)
    • min_count: minimum count for a word to be included in the vocabulary
    • iter: number of training iterations
    • workers: number of worker threads used for training
model = Word2Vec(sentences = sentences, size = 100, sg = 1, window = 3, min_count = 1, iter = 10, workers = Pool()._processes)

model.init_sims(replace = True)   # L2-normalize the vectors in place (saves memory, but the model cannot be trained further)
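
Note: the parameter names above follow the Gensim 2.x API used in this post. In Gensim 4.x, size was renamed to vector_size, iter to epochs, and init_sims() is deprecated (similarity queries normalize vectors on the fly). A rough 4.x equivalent of the call above, sketched under those assumptions:

from multiprocessing import cpu_count

model = Word2Vec(sentences = sentences, vector_size = 100, sg = 1, window = 3,
                 min_count = 1, epochs = 10, workers = cpu_count())   # cpu_count() avoids the private Pool()._processes attribute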

Save and load model

  • A word2vec model can be saved to and loaded from disk
  • Doing so avoids having to retrain the model from scratch later
model.save('word2vec_model')
model = Word2Vec.load('word2vec_model')
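
If only the vectors are needed rather than the full training state, the trained KeyedVectors can also be exported in the standard word2vec text format, which other tools can read. This is a side note, not part of the original post, and the filename is illustrative:

model.wv.save_word2vec_format('word2vec_model.txt', binary = False)   # export vectors in plain-text word2vec format

from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('word2vec_model.txt', binary = False)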

Similarity calculation

  • Similarity between embedded words (i.e., vectors) can be computed using metrics such as cosine similarity
model.most_similar('hamlet')

[('horatio', 0.9978846311569214),
('queene', 0.9971947073936462),
('laertes', 0.9971820116043091),
('king', 0.9968599081039429),
('mother', 0.9966716170310974),
('where', 0.9966292381286621),
('deere', 0.9965540170669556),
('ophelia', 0.9964221715927124),
('very', 0.9963752627372742),
('oh', 0.9963476657867432)]
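
most_similar also accepts positive and negative word lists, which enables the analogy queries described in the Mikolov et al. paper. As an aside (not in the original post, and on a corpus this small the result is unlikely to be meaningful):

model.most_similar(positive = ['king', 'woman'], negative = ['man'], topn = 3)   # classic analogy query: king - man + woman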

v1 = model['king']
v2 = model['queen']

# define a function that computes cosine similarity between two word vectors
def cosine_similarity(v1, v2):
    return 1 - spatial.distance.cosine(v1, v2)

cosine_similarity(v1, v2)

0.99437165260314941
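
As a sanity check, this should match Gensim's built-in cosine similarity. Note also that on a corpus this small nearly all word pairs score close to 1.0, so the absolute similarity values carry little meaning:

model.similarity('king', 'queen')   # built-in equivalent of the function above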
