.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT
.. -*- coding: utf-8 -*-

=========
Alignment
=========

Corpus Reader
-------------

    >>> from nltk.corpus import comtrans
    >>> words = comtrans.words('alignment-en-fr.txt')
    >>> for word in words[:6]:
    ...     print(word)
    Resumption
    of
    the
    session
    I
    declare

    >>> als = comtrans.aligned_sents('alignment-en-fr.txt')[0]
    >>> als  # doctest: +NORMALIZE_WHITESPACE
    AlignedSent(['Resumption', 'of', 'the', 'session'],
        ['Reprise', 'de', 'la', 'session'],
        Alignment([(0, 0), (1, 1), (2, 2), (3, 3)]))
Alignment Objects
-----------------

Aligned sentences are simply a mapping between words in a sentence:

    >>> print(" ".join(als.words))
    Resumption of the session

    >>> print(" ".join(als.mots))
    Reprise de la session

    >>> als.alignment
    Alignment([(0, 0), (1, 1), (2, 2), (3, 3)])

Usually we look at them from the perspective of a source to a target language,
but they are easily inverted:

    >>> als.invert()  # doctest: +NORMALIZE_WHITESPACE
    AlignedSent(['Reprise', 'de', 'la', 'session'],
        ['Resumption', 'of', 'the', 'session'],
        Alignment([(0, 0), (1, 1), (2, 2), (3, 3)]))

We can create new alignments, but these need to be in the correct range of
the corresponding sentences:

    >>> from nltk.translate import Alignment, AlignedSent
    >>> als = AlignedSent(['Reprise', 'de', 'la', 'session'],
    ...     ['Resumption', 'of', 'the', 'session'],
    ...     Alignment([(0, 0), (1, 4), (2, 1), (3, 3)]))
    Traceback (most recent call last):
        ...
    IndexError: Alignment is outside boundary of mots

You can set alignments with any sequence of tuples, as long as the first two
elements of each tuple are the alignment indices:

    >>> als.alignment = Alignment([(0, 0), (1, 1), (2, 2, "boat"), (3, 3, False, (1, 2))])
    >>> Alignment([(0, 0), (1, 1), (2, 2, "boat"), (3, 3, False, (1, 2))])
    Alignment([(0, 0), (1, 1), (2, 2, 'boat'), (3, 3, False, (1, 2))])
Alignment Algorithms
--------------------

EM for IBM Model 1
~~~~~~~~~~~~~~~~~~

Here is an example from Koehn, 2010:

    >>> from nltk.translate import IBMModel1
    >>> corpus = [AlignedSent(['the', 'house'], ['das', 'Haus']),
    ...     AlignedSent(['the', 'book'], ['das', 'Buch']),
    ...     AlignedSent(['a', 'book'], ['ein', 'Buch'])]
    >>> em_ibm1 = IBMModel1(corpus, 20)
    >>> print(round(em_ibm1.translation_table['the']['das'], 1))
    1.0
    >>> print(round(em_ibm1.translation_table['book']['das'], 1))
    0.0
    >>> print(round(em_ibm1.translation_table['house']['das'], 1))
    0.0
    >>> print(round(em_ibm1.translation_table['the']['Buch'], 1))
    0.0
    >>> print(round(em_ibm1.translation_table['book']['Buch'], 1))
    1.0
    >>> print(round(em_ibm1.translation_table['a']['Buch'], 1))
    0.0
    >>> print(round(em_ibm1.translation_table['book']['ein'], 1))
    0.0
    >>> print(round(em_ibm1.translation_table['a']['ein'], 1))
    1.0
    >>> print(round(em_ibm1.translation_table['the']['Haus'], 1))
    0.0
    >>> print(round(em_ibm1.translation_table['house']['Haus'], 1))
    1.0
    >>> print(round(em_ibm1.translation_table['book'][None], 1))
    0.5
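The EM procedure behind these numbers can be sketched in a few lines of plain
Python. This is a simplified Model 1 without the NULL word, not NLTK's
implementation; ``ibm1_em`` is a hypothetical helper written for illustration:

```python
from collections import defaultdict

def ibm1_em(corpus, iterations=20):
    """Toy IBM Model 1 EM. corpus: list of (target_words, source_words).

    Returns t, a dict mapping (target, source) pairs to translation
    probabilities t(target | source). No NULL word is modelled.
    """
    # Uniform initialisation of t(target | source)
    src_vocab = {f for _, src in corpus for f in src}
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(target, source)
        total = defaultdict(float)  # expected counts c(source)
        # E-step: collect expected counts from all sentence pairs
        for tgt, src in corpus:
            for e in tgt:
                z = sum(t[(e, f)] for f in src)  # normalisation for word e
                for f in src:
                    c = t[(e, f)] / z
                    count[(e, f)] += c
                    total[f] += c
        # M-step: re-estimate the translation probabilities
        for (e, f) in count:
            t[(e, f)] = count[(e, f)] / total[f]
    return t

corpus = [(['the', 'house'], ['das', 'Haus']),
          (['the', 'book'], ['das', 'Buch']),
          (['a', 'book'], ['ein', 'Buch'])]
t = ibm1_em(corpus)
```

After 20 iterations the probability mass concentrates on the intuitively
correct pairs, e.g. ``t[('the', 'das')]`` is close to 1, mirroring the
rounded values above (except for the NULL-word entries, which this sketch
omits).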
And here is an example using an NLTK corpus. We train on only 10 sentences,
since training is slow:

    >>> from nltk.corpus import comtrans
    >>> com_ibm1 = IBMModel1(comtrans.aligned_sents()[:10], 20)
    >>> print(round(com_ibm1.translation_table['bitte']['Please'], 1))
    0.2
    >>> print(round(com_ibm1.translation_table['Sitzungsperiode']['session'], 1))
    1.0
Evaluation
----------

Evaluation metrics for alignments are usually not concerned with the contents
of the alignments themselves, but with how they compare to a "gold standard"
alignment constructed by human experts. For this reason we often want to work
just with raw set operations on the alignment points, which gives us a very
clean form for defining our evaluation metrics.

.. Note::
    The AlignedSent class makes no distinction between "possible" and "sure"
    alignments. Thus all alignments are treated as "sure".

Consider the following aligned sentence for evaluation:

    >>> my_als = AlignedSent(['Resumption', 'of', 'the', 'session'],
    ...     ['Reprise', 'de', 'la', 'session'],
    ...     Alignment([(0, 0), (3, 3), (1, 2), (1, 1), (1, 3)]))
Precision
~~~~~~~~~
``precision = |A∩P| / |A|``

**Precision** is probably the best-known evaluation metric, and it is implemented
in `nltk.metrics.scores.precision`_. Since precision is simply interested in the
proportion of correct alignments, we calculate the ratio of the number of our
test alignments (*A*) that match a possible alignment (*P*), over the number of
test alignments provided. There is no penalty for missing a possible alignment
in our test alignments. An easy way to game this metric is to provide just one
test alignment that is in *P* [OCH2000]_.
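Because an alignment is just a set of index pairs, the formula above can be
sketched directly with Python sets. This is an illustrative stand-alone helper
(``alignment_precision`` is a made-up name, not the NLTK function):

```python
def alignment_precision(A, P):
    """Precision = |A ∩ P| / |A| over sets of alignment points."""
    A, P = set(A), set(P)
    if not A:
        return 0.0  # no test alignments at all
    return len(A & P) / len(A)

# A test alignment with one wrong point, against possible alignments P
A = {(0, 0), (1, 2)}
P = {(0, 0), (1, 1), (2, 2), (3, 3)}
print(alignment_precision(A, P))  # 0.5: one of two test points is possible
```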
Here are some examples:

    >>> from nltk.metrics import precision
    >>> als.alignment = Alignment([(0,0), (1,1), (2,2), (3,3)])
    >>> precision(Alignment([]), als.alignment)
    0.0
    >>> precision(Alignment([(0,0), (1,1), (2,2), (3,3)]), als.alignment)
    1.0
    >>> precision(Alignment([(0,0), (3,3)]), als.alignment)
    0.5
    >>> precision(Alignment.fromstring('0-0 3-3'), als.alignment)
    0.5
    >>> precision(Alignment([(0,0), (1,1), (2,2), (3,3), (1,2), (2,1)]), als.alignment)
    1.0
    >>> precision(als.alignment, my_als.alignment)
    0.6

.. _nltk.metrics.scores.precision:
    http://www.nltk.org/api/nltk.metrics.html#nltk.metrics.scores.precision
Recall
~~~~~~
``recall = |A∩S| / |S|``

**Recall** is another well-known evaluation metric, with a set-based
implementation in NLTK as `nltk.metrics.scores.recall`_. Since recall is
simply interested in the proportion of found alignments, we calculate the
ratio of the number of our test alignments (*A*) that match a sure alignment
(*S*) over the number of sure alignments. There is no penalty for producing
a lot of test alignments. An easy way to game this metric is to include every
possible alignment in our test alignments, regardless of whether they are
correct or not [OCH2000]_.
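The same set view works for recall. Again, ``alignment_recall`` is an
illustrative helper, not the NLTK function; note how padding the test
alignment with extra points carries no penalty:

```python
def alignment_recall(A, S):
    """Recall = |A ∩ S| / |S| over sets of alignment points."""
    A, S = set(A), set(S)
    if not S:
        return None  # undefined when there are no sure alignments
    return len(A & S) / len(S)

# A test alignment padded with extra points still gets perfect recall
A = {(0, 0), (1, 1), (2, 2), (3, 3), (1, 2), (2, 1)}
S = {(0, 0), (1, 1), (2, 2), (3, 3)}
print(alignment_recall(A, S))  # 1.0: every sure point was found
```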
Here are some examples:

    >>> from nltk.metrics import recall
    >>> print(recall(Alignment([]), als.alignment))
    None
    >>> recall(Alignment([(0,0), (1,1), (2,2), (3,3)]), als.alignment)
    1.0
    >>> recall(Alignment.fromstring('0-0 3-3'), als.alignment)
    1.0
    >>> recall(Alignment([(0,0), (3,3)]), als.alignment)
    1.0
    >>> recall(Alignment([(0,0), (1,1), (2,2), (3,3), (1,2), (2,1)]), als.alignment)
    0.66666...
    >>> recall(als.alignment, my_als.alignment)
    0.75

.. _nltk.metrics.scores.recall:
    http://www.nltk.org/api/nltk.metrics.html#nltk.metrics.scores.recall
Alignment Error Rate (AER)
~~~~~~~~~~~~~~~~~~~~~~~~~~
``AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|)``

**Alignment Error Rate** is a commonly used metric for assessing sentence
alignments. It combines the precision and recall metrics such that a
perfect alignment must have all of the sure alignments and may have some
possible alignments [MIHALCEA2003]_ [KOEHN2010]_.

.. Note::
    [KOEHN2010]_ defines the AER as ``AER = (|A∩S| + |A∩P|) / (|A| + |S|)``
    in his book, but corrects it to the above in his online errata. This is
    in line with [MIHALCEA2003]_.
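Written against raw sets, the corrected formula looks like this. ``aer`` is a
hypothetical stand-alone sketch, not NLTK's implementation; when no separate
possible set is supplied, *P* defaults to the sure set *S*:

```python
def aer(A, S, P=None):
    """AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|); P defaults to S."""
    A, S = set(A), set(S)
    P = S if P is None else set(P)
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))

S = {(0, 0), (1, 1), (2, 2), (3, 3)}          # sure (gold) alignment
A = {(0, 0), (3, 3), (1, 2), (1, 1), (1, 3)}  # test alignment
print(round(aer(A, S), 6))  # 0.333333: three of five test points are sure
```

With *P* = *S*, three of the five test points are sure, so
``1 - (3 + 3) / (5 + 4) = 1/3``.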
Here are some examples:

    >>> from nltk.translate import alignment_error_rate
    >>> alignment_error_rate(Alignment([]), als.alignment)
    1.0
    >>> alignment_error_rate(Alignment([(0,0), (1,1), (2,2), (3,3)]), als.alignment)
    0.0
    >>> alignment_error_rate(als.alignment, my_als.alignment)
    0.333333...
    >>> alignment_error_rate(als.alignment, my_als.alignment,
    ...     als.alignment | Alignment([(1,2), (2,1)]))
    0.222222...
.. [OCH2000] Och, F. and Ney, H. (2000)
    *Statistical Machine Translation*, EAMT Workshop

.. [MIHALCEA2003] Mihalcea, R. and Pedersen, T. (2003)
    *An evaluation exercise for word alignment*, HLT-NAACL 2003

.. [KOEHN2010] Koehn, P. (2010)
    *Statistical Machine Translation*, Cambridge University Press