.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT
.. -*- coding: utf-8 -*-

Regression Tests
================

Issue 167
---------

https://github.com/nltk/nltk/issues/167

>>> from nltk.corpus import brown
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> ngram_order = 3
>>> train_data, vocab_data = padded_everygram_pipeline(
...     ngram_order,
...     brown.sents(categories="news")
... )
>>> from nltk.lm import WittenBellInterpolated
>>> lm = WittenBellInterpolated(ngram_order)
>>> lm.fit(train_data, vocab_data)

A sentence containing an unseen word should result in infinite entropy because
Witten-Bell is based ultimately on MLE, which cannot handle unseen ngrams.
Crucially, it shouldn't raise any exceptions for unseen words.

>>> from nltk.util import ngrams
>>> sent = ngrams("This is a sentence with the word aaddvark".split(), 3)
>>> lm.entropy(sent)
inf
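
Perplexity is defined as pow(2.0, entropy), so the same sentence should also
produce infinite perplexity, again without raising an exception (an extra
check, not part of the original regression test):

>>> sent = ngrams("This is a sentence with the word aaddvark".split(), 3)
>>> lm.perplexity(sent)
inf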

If we remove all unseen ngrams from the sentence, we'll get a non-infinite value
for the entropy.

>>> sent = ngrams("This is a sentence".split(), 3)
>>> lm.entropy(sent)
17.41365588455936
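
Since perplexity is pow(2.0, entropy), the finite entropy above implies a
finite perplexity. A small consistency check against the value printed above
(an extra check, not part of the original regression test):

>>> from math import isclose
>>> sent = ngrams("This is a sentence".split(), 3)
>>> isclose(lm.perplexity(sent), pow(2.0, 17.41365588455936))
True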

Issue 367
---------

https://github.com/nltk/nltk/issues/367

Reproducing Dan Blanchard's example:
https://github.com/nltk/nltk/issues/367#issuecomment-14646110

>>> from nltk.lm import Lidstone, Vocabulary
>>> word_seq = list('aaaababaaccbacb')
>>> ngram_order = 2
>>> from nltk.util import everygrams
>>> train_data = [everygrams(word_seq, max_len=ngram_order)]
>>> V = Vocabulary(['a', 'b', 'c', ''])
>>> lm = Lidstone(0.2, ngram_order, vocabulary=V)
>>> lm.fit(train_data)

For the doctest to work we have to sort the vocabulary keys.

>>> V_keys = sorted(V)
>>> round(sum(lm.score(w, ("b",)) for w in V_keys), 6)
1.0
>>> round(sum(lm.score(w, ("a",)) for w in V_keys), 6)
1.0
>>> [lm.score(w, ("b",)) for w in V_keys]
[0.05, 0.05, 0.8, 0.05, 0.05]
>>> [round(lm.score(w, ("a",)), 4) for w in V_keys]
[0.0222, 0.0222, 0.4667, 0.2444, 0.2444]
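
The 0.8 above is just the Lidstone formula applied to the counts: "b" is
followed by "a" three times out of three, and the vocabulary has five entries,
so P("a" | "b") = (3 + 0.2) / (3 + 5 * 0.2) = 0.8. A quick check of that
arithmetic against the model's own counts and gamma (a supplementary check,
not part of the original issue):

>>> lm.context_counts(("b",))["a"]
3
>>> lm.context_counts(("b",)).N()
3
>>> round((3 + lm.gamma) / (3 + len(lm.vocab) * lm.gamma), 2)
0.8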

Here we reproduce @afourney's comment:
https://github.com/nltk/nltk/issues/367#issuecomment-15686289

>>> sent = ['foo', 'foo', 'foo', 'foo', 'bar', 'baz']
>>> ngram_order = 3
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train_data, vocab_data = padded_everygram_pipeline(ngram_order, [sent])
>>> from nltk.lm import Lidstone
>>> lm = Lidstone(0.2, ngram_order)
>>> lm.fit(train_data, vocab_data)

The vocabulary includes the "UNK" symbol as well as two padding symbols.

>>> len(lm.vocab)
6
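
To make that concrete, listing the vocabulary shows "<UNK>" plus the padding
symbols "<s>" and "</s>" alongside the three sentence words (assuming the
default padding symbols used by padded_everygram_pipeline):

>>> sorted(lm.vocab)
['</s>', '<UNK>', '<s>', 'bar', 'baz', 'foo']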

>>> word = "foo"
>>> context = ("bar", "baz")

The raw counts.

>>> lm.context_counts(context)[word]
0
>>> lm.context_counts(context).N()
1

Counts with Lidstone smoothing.

>>> lm.context_counts(context)[word] + lm.gamma
0.2
>>> lm.context_counts(context).N() + len(lm.vocab) * lm.gamma
2.2

Without any backoff, just using Lidstone smoothing, P("foo" | "bar", "baz") should be:
0.2 / 2.2 ~= 0.090909

>>> round(lm.score(word, context), 6)
0.090909
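
Equivalently, computing that ratio directly from the counts and gamma shown
above gives the same number (a redundant sanity check, spelled out for
clarity):

>>> numerator = lm.context_counts(context)[word] + lm.gamma
>>> denominator = lm.context_counts(context).N() + len(lm.vocab) * lm.gamma
>>> round(numerator / denominator, 6)
0.090909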

Issue 380
---------

https://github.com/nltk/nltk/issues/380

Reproducing a setup akin to this comment:
https://github.com/nltk/nltk/issues/380#issue-12879030

For speed, take only the first 100 sentences of reuters. This shouldn't affect
the test.

>>> from nltk.corpus import reuters
>>> sents = reuters.sents()[:100]
>>> ngram_order = 3
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train_data, vocab_data = padded_everygram_pipeline(ngram_order, sents)
>>> from nltk.lm import Lidstone
>>> lm = Lidstone(0.2, ngram_order)
>>> lm.fit(train_data, vocab_data)
>>> lm.score("said", ("",)) < 1
True
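
Because Lidstone smoothing adds gamma = 0.2 to every count, the score is also
strictly positive, so it lies in the open interval (0, 1). A complementary
check in the same spirit as the one above:

>>> lm.score("said", ("",)) > 0
True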