# -*- coding: utf-8 -*-
# Natural Language Toolkit: IBM Model 4
#
# Copyright (C) 2001-2019 NLTK Project
# Author: Tah Wei Hoon <hoon.tw@gmail.com>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
  8. """
  9. Translation model that reorders output words based on their type and
  10. distance from other related words in the output sentence.
  11. IBM Model 4 improves the distortion model of Model 3, motivated by the
  12. observation that certain words tend to be re-ordered in a predictable
  13. way relative to one another. For example, <adjective><noun> in English
  14. usually has its order flipped as <noun><adjective> in French.
  15. Model 4 requires words in the source and target vocabularies to be
  16. categorized into classes. This can be linguistically driven, like parts
  17. of speech (adjective, nouns, prepositions, etc). Word classes can also
  18. be obtained by statistical methods. The original IBM Model 4 uses an
  19. information theoretic approach to group words into 50 classes for each
  20. vocabulary.

Terminology:
Cept:
    A source word with non-zero fertility i.e. aligned to one or more
    target words.
Tablet:
    The set of target word(s) aligned to a cept.
Head of cept:
    The first word of the tablet of that cept.
Center of cept:
    The average position of the words in that cept's tablet. If the
    value is not an integer, the ceiling is taken.
    For example, for a tablet with words in positions 2, 5, 6 in the
    target sentence, the center of the corresponding cept is
    ceil((2 + 5 + 6) / 3) = 5
Displacement:
    For a head word, defined as (position of head word - position of
    previous cept's center). Can be positive or negative.
    For a non-head word, defined as (position of non-head word -
    position of previous word in the same tablet). Always positive,
    because successive words in a tablet are assumed to appear to the
    right of the previous word.
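    For example, if the head word of a cept is at position 4 and the
    previous cept's center is at position 5, the head displacement is
    4 - 5 = -1. If the next word in the same tablet is at position 7,
    its non-head displacement is 7 - 4 = +3.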

In contrast to Model 3 which reorders words in a tablet independently of
other words, Model 4 distinguishes between three cases.
(1) Words generated by NULL are distributed uniformly.
(2) For a head word t, its position is modeled by the probability
    d_head(displacement | word_class_s(s), word_class_t(t)),
    where s is the previous cept, and word_class_s and word_class_t map
    s and t to a source and target language word class respectively.
(3) For a non-head word t, its position is modeled by the probability
    d_non_head(displacement | word_class_t(t))

The EM algorithm used in Model 4 is:
E step - In the training data, collect counts, weighted by prior
         probabilities.
         (a) count how many times a source language word is translated
             into a target language word
         (b) for a particular word class, count how many times a head
             word is located at a particular displacement from the
             previous cept's center
         (c) for a particular word class, count how many times a
             non-head word is located at a particular displacement from
             the previous target word
         (d) count how many times a source word is aligned to phi number
             of target words
         (e) count how many times NULL is aligned to a target word

M step - Estimate new probabilities based on the counts from the E step

Like Model 3, there are too many possible alignments to consider. Thus,
a hill climbing approach is used to sample good candidates.

Notations:
i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
l: Number of words in the source sentence, excluding NULL
m: Number of words in the target sentence
s: A word in the source language
t: A word in the target language
phi: Fertility, the number of target words produced by a source word
p1: Probability that a target word produced by a source word is
    accompanied by another target word that is aligned to NULL
p0: 1 - p1
dj: Displacement, Δj

References:
Philipp Koehn. 2010. Statistical Machine Translation.
Cambridge University Press, New York.

Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and
Robert L. Mercer. 1993. The Mathematics of Statistical Machine
Translation: Parameter Estimation. Computational Linguistics, 19 (2),
263-311.
"""

from __future__ import division

import warnings
from collections import defaultdict
from math import factorial

from nltk.translate import AlignedSent
from nltk.translate import Alignment
from nltk.translate import IBMModel
from nltk.translate import IBMModel3
from nltk.translate.ibm_model import Counts
from nltk.translate.ibm_model import longest_target_sentence_length


class IBMModel4(IBMModel):
    """
    Translation model that reorders output words based on their type and
    their distance from other related words in the output sentence

    >>> bitext = []
    >>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
    >>> bitext.append(AlignedSent(['das', 'haus', 'war', 'ja', 'groß'], ['the', 'house', 'was', 'big']))
    >>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
    >>> bitext.append(AlignedSent(['ein', 'haus', 'ist', 'klein'], ['a', 'house', 'is', 'small']))
    >>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
    >>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
    >>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))
    >>> bitext.append(AlignedSent(['ich', 'fasse', 'das', 'buch', 'zusammen'], ['i', 'summarize', 'the', 'book']))
    >>> bitext.append(AlignedSent(['fasse', 'zusammen'], ['summarize']))
    >>> src_classes = {'the': 0, 'a': 0, 'small': 1, 'big': 1, 'house': 2, 'book': 2, 'is': 3, 'was': 3, 'i': 4, 'summarize': 5 }
    >>> trg_classes = {'das': 0, 'ein': 0, 'haus': 1, 'buch': 1, 'klein': 2, 'groß': 2, 'ist': 3, 'war': 3, 'ja': 4, 'ich': 5, 'fasse': 6, 'zusammen': 6 }

    >>> ibm4 = IBMModel4(bitext, 5, src_classes, trg_classes)

    >>> print(round(ibm4.translation_table['buch']['book'], 3))
    1.0
    >>> print(round(ibm4.translation_table['das']['book'], 3))
    0.0
    >>> print(round(ibm4.translation_table['ja'][None], 3))
    1.0

    >>> print(round(ibm4.head_distortion_table[1][0][1], 3))
    1.0
    >>> print(round(ibm4.head_distortion_table[2][0][1], 3))
    0.0
    >>> print(round(ibm4.non_head_distortion_table[3][6], 3))
    0.5

    >>> print(round(ibm4.fertility_table[2]['summarize'], 3))
    1.0
    >>> print(round(ibm4.fertility_table[1]['book'], 3))
    1.0

    >>> print(ibm4.p1)
    0.033...

    >>> test_sentence = bitext[2]
    >>> test_sentence.words
    ['das', 'buch', 'ist', 'ja', 'klein']
    >>> test_sentence.mots
    ['the', 'book', 'is', 'small']
    >>> test_sentence.alignment
    Alignment([(0, 0), (1, 1), (2, 2), (3, None), (4, 3)])
    """

    def __init__(
        self,
        sentence_aligned_corpus,
        iterations,
        source_word_classes,
        target_word_classes,
        probability_tables=None,
    ):
        """
        Train on ``sentence_aligned_corpus`` and create a lexical
        translation model, distortion models, a fertility model, and a
        model for generating NULL-aligned words.

        Translation direction is from ``AlignedSent.mots`` to
        ``AlignedSent.words``.

        :param sentence_aligned_corpus: Sentence-aligned parallel corpus
        :type sentence_aligned_corpus: list(AlignedSent)

        :param iterations: Number of iterations to run training algorithm
        :type iterations: int

        :param source_word_classes: Lookup table that maps a source word
            to its word class, the latter represented by an integer id
        :type source_word_classes: dict[str]: int

        :param target_word_classes: Lookup table that maps a target word
            to its word class, the latter represented by an integer id
        :type target_word_classes: dict[str]: int

        :param probability_tables: Optional. Use this to pass in custom
            probability values. If not specified, probabilities will be
            set to a uniform distribution, or some other sensible value.
            If specified, all the following entries must be present:
            ``translation_table``, ``alignment_table``,
            ``fertility_table``, ``p1``, ``head_distortion_table``,
            ``non_head_distortion_table``. See ``IBMModel`` and
            ``IBMModel4`` for the type and purpose of these tables.
        :type probability_tables: dict[str]: object
        """
        super(IBMModel4, self).__init__(sentence_aligned_corpus)
        self.reset_probabilities()
        self.src_classes = source_word_classes
        self.trg_classes = target_word_classes

        if probability_tables is None:
            # Get probabilities from IBM model 3
            ibm3 = IBMModel3(sentence_aligned_corpus, iterations)
            self.translation_table = ibm3.translation_table
            self.alignment_table = ibm3.alignment_table
            self.fertility_table = ibm3.fertility_table
            self.p1 = ibm3.p1
            self.set_uniform_probabilities(sentence_aligned_corpus)
        else:
            # Set user-defined probabilities
            self.translation_table = probability_tables['translation_table']
            self.alignment_table = probability_tables['alignment_table']
            self.fertility_table = probability_tables['fertility_table']
            self.p1 = probability_tables['p1']
            self.head_distortion_table = probability_tables['head_distortion_table']
            self.non_head_distortion_table = probability_tables[
                'non_head_distortion_table'
            ]

        for n in range(0, iterations):
            self.train(sentence_aligned_corpus)

    def reset_probabilities(self):
        super(IBMModel4, self).reset_probabilities()

        self.head_distortion_table = defaultdict(
            lambda: defaultdict(lambda: defaultdict(lambda: self.MIN_PROB))
        )
        """
        dict[int][int][int]: float. Probability(displacement of head
        word | word class of previous cept, target word class).
        Values accessed as
        ``head_distortion_table[dj][src_class][trg_class]``.
        """

        self.non_head_distortion_table = defaultdict(
            lambda: defaultdict(lambda: self.MIN_PROB)
        )
        """
        dict[int][int]: float. Probability(displacement of non-head
        word | target word class).
        Values accessed as ``non_head_distortion_table[dj][trg_class]``.
        """

    def set_uniform_probabilities(self, sentence_aligned_corpus):
        """
        Set distortion probabilities uniformly to
        1 / cardinality of displacement values
        """
        max_m = longest_target_sentence_length(sentence_aligned_corpus)

        # The maximum displacement is m-1, when a word is in the last
        # position m of the target sentence and the previously placed
        # word is in the first position.
        # Conversely, the minimum displacement is -(m-1).
        # Thus, the displacement range is (m-1) - (-(m-1)). Note that
        # displacement cannot be zero and is not included in the range.
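        # For example, if max_m = 4, the possible displacements are
        # {-3, -2, -1, 1, 2, 3}, so each of the 2 * (4 - 1) = 6 values
        # receives an initial probability of 1 / 6.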
        if max_m <= 1:
            initial_prob = IBMModel.MIN_PROB
        else:
            initial_prob = 1 / (2 * (max_m - 1))
        if initial_prob < IBMModel.MIN_PROB:
            warnings.warn(
                "A target sentence is too long ("
                + str(max_m)
                + " words). Results may be less accurate."
            )

        for dj in range(1, max_m):
            self.head_distortion_table[dj] = defaultdict(
                lambda: defaultdict(lambda: initial_prob)
            )
            self.head_distortion_table[-dj] = defaultdict(
                lambda: defaultdict(lambda: initial_prob)
            )
            self.non_head_distortion_table[dj] = defaultdict(lambda: initial_prob)
            self.non_head_distortion_table[-dj] = defaultdict(lambda: initial_prob)

    def train(self, parallel_corpus):
        counts = Model4Counts()
        for aligned_sentence in parallel_corpus:
            m = len(aligned_sentence.words)

            # Sample the alignment space
            sampled_alignments, best_alignment = self.sample(aligned_sentence)
            # Record the most probable alignment
            aligned_sentence.alignment = Alignment(
                best_alignment.zero_indexed_alignment()
            )

            # E step (a): Compute normalization factors to weigh counts
            total_count = self.prob_of_alignments(sampled_alignments)

            # E step (b): Collect counts
            for alignment_info in sampled_alignments:
                count = self.prob_t_a_given_s(alignment_info)
                normalized_count = count / total_count

                for j in range(1, m + 1):
                    counts.update_lexical_translation(
                        normalized_count, alignment_info, j
                    )
                    counts.update_distortion(
                        normalized_count,
                        alignment_info,
                        j,
                        self.src_classes,
                        self.trg_classes,
                    )

                counts.update_null_generation(normalized_count, alignment_info)
                counts.update_fertility(normalized_count, alignment_info)

        # M step: Update probabilities with maximum likelihood estimates
        # If any probability is less than MIN_PROB, clamp it to MIN_PROB
        existing_alignment_table = self.alignment_table
        self.reset_probabilities()
        self.alignment_table = existing_alignment_table  # don't retrain

        self.maximize_lexical_translation_probabilities(counts)
        self.maximize_distortion_probabilities(counts)
        self.maximize_fertility_probabilities(counts)
        self.maximize_null_generation_probabilities(counts)

    def maximize_distortion_probabilities(self, counts):
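        # MLE by relative frequency: the count for a particular
        # displacement dj is normalized by the count over all
        # displacements for the same word class pair (head) or target
        # word class (non-head), then clamped to at least MIN_PROB so
        # that no probability becomes exactly zero.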
        head_d_table = self.head_distortion_table
        for dj, src_classes in counts.head_distortion.items():
            for s_cls, trg_classes in src_classes.items():
                for t_cls in trg_classes:
                    estimate = (
                        counts.head_distortion[dj][s_cls][t_cls]
                        / counts.head_distortion_for_any_dj[s_cls][t_cls]
                    )
                    head_d_table[dj][s_cls][t_cls] = max(estimate, IBMModel.MIN_PROB)

        non_head_d_table = self.non_head_distortion_table
        for dj, trg_classes in counts.non_head_distortion.items():
            for t_cls in trg_classes:
                estimate = (
                    counts.non_head_distortion[dj][t_cls]
                    / counts.non_head_distortion_for_any_dj[t_cls]
                )
                non_head_d_table[dj][t_cls] = max(estimate, IBMModel.MIN_PROB)

    def prob_t_a_given_s(self, alignment_info):
        """
        Probability of target sentence and an alignment given the
        source sentence
        """
        return IBMModel4.model4_prob_t_a_given_s(alignment_info, self)

    @staticmethod  # exposed for Model 5 to use
    def model4_prob_t_a_given_s(alignment_info, ibm_model):
        probability = 1.0
        MIN_PROB = IBMModel.MIN_PROB

        def null_generation_term():
            # Binomial distribution: B(m - null_fertility, p1)
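            # The term computed below is the binomial probability
            #     C(m - phi0, phi0) * p1^phi0 * p0^(m - 2*phi0)
            # where phi0 is the fertility of NULL: the chance of
            # observing phi0 NULL-aligned words among the m - phi0
            # words produced by real source words.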
            value = 1.0
            p1 = ibm_model.p1
            p0 = 1 - p1

            null_fertility = alignment_info.fertility_of_i(0)
            m = len(alignment_info.trg_sentence) - 1
            value *= pow(p1, null_fertility) * pow(p0, m - 2 * null_fertility)
            if value < MIN_PROB:
                return MIN_PROB

            # Combination: (m - null_fertility) choose null_fertility
            for i in range(1, null_fertility + 1):
                value *= (m - null_fertility - i + 1) / i
            return value

        def fertility_term():
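            # Each source word contributes phi! * n(phi | s): the
            # fertility probability from the table, with the factorial
            # accounting for the possible orderings of the words in
            # the tablet (Brown et al. 1993).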
            value = 1.0
            src_sentence = alignment_info.src_sentence
            for i in range(1, len(src_sentence)):
                fertility = alignment_info.fertility_of_i(i)
                value *= (
                    factorial(fertility)
                    * ibm_model.fertility_table[fertility][src_sentence[i]]
                )
                if value < MIN_PROB:
                    return MIN_PROB
            return value

        def lexical_translation_term(j):
            t = alignment_info.trg_sentence[j]
            i = alignment_info.alignment[j]
            s = alignment_info.src_sentence[i]
            return ibm_model.translation_table[t][s]

        def distortion_term(j):
            t = alignment_info.trg_sentence[j]
            i = alignment_info.alignment[j]
            if i == 0:
                # case 1: t is aligned to NULL
                return 1.0
            if alignment_info.is_head_word(j):
                # case 2: t is the first word of a tablet
                previous_cept = alignment_info.previous_cept(j)
                src_class = None
                if previous_cept is not None:
                    previous_s = alignment_info.src_sentence[previous_cept]
                    src_class = ibm_model.src_classes[previous_s]
                trg_class = ibm_model.trg_classes[t]
                dj = j - alignment_info.center_of_cept(previous_cept)
                return ibm_model.head_distortion_table[dj][src_class][trg_class]

            # case 3: t is a subsequent word of a tablet
            previous_position = alignment_info.previous_in_tablet(j)
            trg_class = ibm_model.trg_classes[t]
            dj = j - previous_position
            return ibm_model.non_head_distortion_table[dj][trg_class]

        # end nested functions

        # Abort computation whenever probability falls below MIN_PROB at
        # any point, since MIN_PROB can be considered as zero
        probability *= null_generation_term()
        if probability < MIN_PROB:
            return MIN_PROB

        probability *= fertility_term()
        if probability < MIN_PROB:
            return MIN_PROB

        for j in range(1, len(alignment_info.trg_sentence)):
            probability *= lexical_translation_term(j)
            if probability < MIN_PROB:
                return MIN_PROB

            probability *= distortion_term(j)
            if probability < MIN_PROB:
                return MIN_PROB

        return probability


class Model4Counts(Counts):
    """
    Data object to store counts of various parameters during training.
    Includes counts for distortion.
    """

    def __init__(self):
        super(Model4Counts, self).__init__()
        self.head_distortion = defaultdict(
            lambda: defaultdict(lambda: defaultdict(lambda: 0.0))
        )
        self.head_distortion_for_any_dj = defaultdict(lambda: defaultdict(lambda: 0.0))
        self.non_head_distortion = defaultdict(lambda: defaultdict(lambda: 0.0))
        self.non_head_distortion_for_any_dj = defaultdict(lambda: 0.0)

    def update_distortion(self, count, alignment_info, j, src_classes, trg_classes):
        i = alignment_info.alignment[j]
        t = alignment_info.trg_sentence[j]
        if i == 0:
            # case 1: t is aligned to NULL
            pass
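            # NULL-generated words are assumed to be distributed
            # uniformly (see module docstring), so no distortion
            # counts are collected for them.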
        elif alignment_info.is_head_word(j):
            # case 2: t is the first word of a tablet
            previous_cept = alignment_info.previous_cept(j)
            if previous_cept is not None:
                previous_src_word = alignment_info.src_sentence[previous_cept]
                src_class = src_classes[previous_src_word]
            else:
                src_class = None
            trg_class = trg_classes[t]
            dj = j - alignment_info.center_of_cept(previous_cept)
            self.head_distortion[dj][src_class][trg_class] += count
            self.head_distortion_for_any_dj[src_class][trg_class] += count
        else:
            # case 3: t is a subsequent word of a tablet
            previous_j = alignment_info.previous_in_tablet(j)
            trg_class = trg_classes[t]
            dj = j - previous_j
            self.non_head_distortion[dj][trg_class] += count
            self.non_head_distortion_for_any_dj[trg_class] += count