
.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT

=======================================
Demonstrate word embedding using Gensim
=======================================

We demonstrate three functions:

- Train the word embeddings using the Brown Corpus;
- Load the pre-trained model and perform simple tasks; and
- Prune the pre-trained binary model.

    >>> import gensim

---------------
Train the model
---------------

Here we train a word embedding using the Brown Corpus:

    >>> from nltk.corpus import brown
    >>> model = gensim.models.Word2Vec(brown.sents())
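
`Word2Vec` also accepts training hyperparameters. A minimal sketch, assuming the pre-4.0 gensim API used throughout this document (the parameter values are illustrative, not the settings used for the bundled models):

| # 100-dimensional vectors, a 5-word context window, ignore words seen
| # fewer than 5 times, and train with 4 worker threads.
| model = gensim.models.Word2Vec(brown.sents(), size=100, window=5, min_count=5, workers=4)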

It might take some time to train the model, so once it is trained, it can be saved and reloaded as follows:

    >>> model.save('brown.embedding')
    >>> new_model = gensim.models.Word2Vec.load('brown.embedding')

The model maps each word in the training corpus to its embedding vector. We can easily get the vector representation of a word.

    >>> len(new_model['university'])
    100

Gensim also provides supporting functions for working with word embeddings.
For example, to compute the cosine similarity between two words:

    >>> new_model.similarity('university','school') > 0.3
    True
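
The similarity score is the cosine of the angle between the two word vectors. For reference, a minimal sketch computing it directly from the raw vectors with numpy (not part of the original doctest):

| import numpy as np
|
| v1, v2 = new_model['university'], new_model['school']
| # Cosine similarity: dot product divided by the product of the norms.
| cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))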

---------------------------
Using the pre-trained model
---------------------------

NLTK includes a sample of a pre-trained model that was trained on 100 billion words from the Google News dataset.
The full model is available from https://code.google.com/p/word2vec/ (about 3 GB).

    >>> from nltk.data import find
    >>> word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
    >>> model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

We pruned the model to include only the most common words (~44k words).

    >>> len(model.vocab)
    43981

Each word is represented as a vector in a space of 300 dimensions:

    >>> len(model['university'])
    300

Finding the top n words that are most similar to a target word is simple. The result is a list of the n words together with their similarity scores.

    >>> model.most_similar(positive=['university'], topn = 3)
    [(u'universities', 0.70039...), (u'faculty', 0.67809...), (u'undergraduate', 0.65870...)]

Finding the word that does not belong in a list is also supported, although implementing it yourself is simple, as the sketch below shows.

    >>> model.doesnt_match('breakfast cereal dinner lunch'.split())
    'cereal'
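
One simple way to implement this yourself is to normalize each vector, compute every word's mean cosine similarity to the words in the list, and return the word with the lowest mean. A minimal sketch (`odd_one_out` is a hypothetical helper, not part of the gensim API):

| import numpy as np
|
| def odd_one_out(words, kv):
|     # Rows are unit-length word vectors.
|     vecs = np.array([kv[w] / np.linalg.norm(kv[w]) for w in words])
|     # Mean cosine similarity of each word to all the words in the list.
|     mean_sims = vecs.dot(vecs.T).mean(axis=1)
|     # The outlier is the word least similar, on average, to the rest.
|     return words[int(mean_sims.argmin())]
|
| odd_one_out('breakfast cereal dinner lunch'.split(), model)  # should agree with doesnt_match above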

Mikolov et al. (2013) showed that word embeddings capture many syntactic and semantic regularities. For example,
the vector 'King - Man + Woman' is close to 'Queen', and 'Germany - Berlin + Paris' is close to 'France'.

    >>> model.most_similar(positive=['woman','king'], negative=['man'], topn = 1)
    [(u'queen', 0.71181...)]

    >>> model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 1)
    [(u'France', 0.78840...)]
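
The first analogy can be reproduced with plain vector arithmetic: build the query vector 'king - man + woman' and scan the vocabulary for the nearest word by cosine similarity, skipping the query words themselves. A brute-force sketch (gensim's `most_similar` normalizes each input vector before combining them and uses optimized matrix operations, so scores may differ slightly):

| import numpy as np
|
| query = model['king'] - model['man'] + model['woman']
| query /= np.linalg.norm(query)
|
| best_word, best_sim = None, -1.0
| for word in model.vocab:
|     if word in ('king', 'man', 'woman'):
|         continue
|     vec = model[word] / np.linalg.norm(model[word])
|     sim = float(np.dot(query, vec))
|     if sim > best_sim:
|         best_word, best_sim = word, sim
|
| print(best_word)  # expected to agree with most_similar above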

We can visualize the word embeddings using t-SNE (http://lvdmaaten.github.io/tsne/). For this demonstration, we visualize the first 1000 words.

| import numpy as np
| labels = []
| count = 0
| max_count = 1000
| X = np.zeros(shape=(max_count, len(model['university'])))
|
| for term in model.vocab:
|     X[count] = model[term]
|     labels.append(term)
|     count += 1
|     if count >= max_count: break
|
| # It is recommended to use PCA first to reduce to ~50 dimensions
| from sklearn.decomposition import PCA
| pca = PCA(n_components=50)
| X_50 = pca.fit_transform(X)
|
| # Using TSNE to further reduce to 2 dimensions
| from sklearn.manifold import TSNE
| model_tsne = TSNE(n_components=2, random_state=0)
| Y = model_tsne.fit_transform(X_50)
|
| # Show the scatter plot
| import matplotlib.pyplot as plt
| plt.scatter(Y[:, 0], Y[:, 1], 20)
|
| # Add labels
| for label, x, y in zip(labels, Y[:, 0], Y[:, 1]):
|     plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points', size=10)
|
| plt.show()

------------------------------
Prune the trained binary model
------------------------------

Here is the supporting code to extract part of the binary model (GoogleNews-vectors-negative300.bin.gz) from https://code.google.com/p/word2vec/.
We use this code to get the `word2vec_sample` model.

| import gensim
| from gensim.models.word2vec import Word2Vec
|
| # Load the binary model
| model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
|
| # Only keep words that appear in the Brown corpus
| from nltk.corpus import brown
| words = set(brown.words())
| print(len(words))
|
| # Write the retained words to a plain-text file in word2vec format
| out_file = 'pruned.word2vec.txt'
| f = open(out_file, 'w')
|
| word_presented = words.intersection(model.vocab.keys())
| # Header line: vocabulary size and vector dimensionality
| f.write('{} {}\n'.format(len(word_presented), len(model['word'])))
|
| for word in word_presented:
|     f.write('{} {}\n'.format(word, ' '.join(str(value) for value in model[word])))
|
| f.close()