.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT
.. -*- coding: utf-8 -*-

Regression Tests
================

Issue 167
---------
https://github.com/nltk/nltk/issues/167

>>> from nltk.corpus import brown
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> ngram_order = 3
>>> train_data, vocab_data = padded_everygram_pipeline(
...     ngram_order,
...     brown.sents(categories="news")
... )
>>> from nltk.lm import WittenBellInterpolated
>>> lm = WittenBellInterpolated(ngram_order)
>>> lm.fit(train_data, vocab_data)

A sentence containing an unseen word should result in infinite entropy, because
Witten-Bell is ultimately based on MLE, which cannot handle unseen ngrams.
Crucially, it shouldn't raise any exceptions for unseen words.

>>> from nltk.util import ngrams
>>> sent = ngrams("This is a sentence with the word aaddvark".split(), 3)
>>> lm.entropy(sent)
inf
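
Since perplexity is 2 to the power of the entropy, it should likewise come out
infinite for the same unseen-word sentence (rebuilt here, because the generator
above has already been consumed):

>>> lm.perplexity(ngrams("This is a sentence with the word aaddvark".split(), 3))
inf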

If we remove all unseen ngrams from the sentence, we'll get a non-infinite value
for the entropy.

>>> sent = ngrams("This is a sentence".split(), 3)
>>> lm.entropy(sent)
17.41365588455936

Issue 367
---------
https://github.com/nltk/nltk/issues/367

Reproducing Dan Blanchard's example:
https://github.com/nltk/nltk/issues/367#issuecomment-14646110

>>> from nltk.lm import Lidstone, Vocabulary
>>> word_seq = list('aaaababaaccbacb')
>>> ngram_order = 2
>>> from nltk.util import everygrams
>>> train_data = [everygrams(word_seq, max_len=ngram_order)]
>>> V = Vocabulary(['a', 'b', 'c', ''])
>>> lm = Lidstone(0.2, ngram_order, vocabulary=V)
>>> lm.fit(train_data)

For the doctest to work we have to sort the vocabulary keys.

>>> V_keys = sorted(V)
>>> round(sum(lm.score(w, ("b",)) for w in V_keys), 6)
1.0
>>> round(sum(lm.score(w, ("a",)) for w in V_keys), 6)
1.0

>>> [lm.score(w, ("b",)) for w in V_keys]
[0.05, 0.05, 0.8, 0.05, 0.05]
>>> [round(lm.score(w, ("a",)), 4) for w in V_keys]
[0.0222, 0.0222, 0.4667, 0.2444, 0.2444]
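
These scores follow directly from the Lidstone formula: add gamma to the bigram
count and gamma times the vocabulary size to the context count. As a quick
sanity check (not part of the original report), the 0.4667 for "a" following
"a" should be recoverable from the raw counts:

>>> lm.counts[['a']]['a'], lm.counts[['a']].N(), len(lm.vocab)
(4, 8, 5)
>>> round((lm.counts[['a']]['a'] + lm.gamma) / (lm.counts[['a']].N() + len(lm.vocab) * lm.gamma), 4)
0.4667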

Here we reproduce @afourney's comment:
https://github.com/nltk/nltk/issues/367#issuecomment-15686289

>>> sent = ['foo', 'foo', 'foo', 'foo', 'bar', 'baz']
>>> ngram_order = 3
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train_data, vocab_data = padded_everygram_pipeline(ngram_order, [sent])
>>> from nltk.lm import Lidstone
>>> lm = Lidstone(0.2, ngram_order)
>>> lm.fit(train_data, vocab_data)

The vocabulary includes the "UNK" symbol as well as two padding symbols.

>>> len(lm.vocab)
6
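
Sorting the vocabulary should make those six symbols explicit: the three words,
the two padding symbols, and the "UNK" label.

>>> sorted(lm.vocab)
['</s>', '<UNK>', '<s>', 'bar', 'baz', 'foo']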

>>> word = "foo"
>>> context = ("bar", "baz")

The raw counts:

>>> lm.context_counts(context)[word]
0
>>> lm.context_counts(context).N()
1

Counts with Lidstone smoothing:

>>> lm.context_counts(context)[word] + lm.gamma
0.2
>>> lm.context_counts(context).N() + len(lm.vocab) * lm.gamma
2.2

Without any backoff, just using Lidstone smoothing, P("foo" | "bar", "baz") should be:
0.2 / 2.2 ~= 0.090909

>>> round(lm.score(word, context), 6)
0.090909
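
Combining the two smoothed counts above should reproduce the same score:

>>> round((lm.context_counts(context)[word] + lm.gamma)
...       / (lm.context_counts(context).N() + len(lm.vocab) * lm.gamma), 6)
0.090909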

Issue 380
---------
https://github.com/nltk/nltk/issues/380

Reproducing a setup akin to this comment:
https://github.com/nltk/nltk/issues/380#issue-12879030

For speed, we take only the first 100 sentences of the Reuters corpus; this
shouldn't affect the test.

>>> from nltk.corpus import reuters
>>> sents = reuters.sents()[:100]
>>> ngram_order = 3
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train_data, vocab_data = padded_everygram_pipeline(ngram_order, sents)
>>> from nltk.lm import Lidstone
>>> lm = Lidstone(0.2, ngram_order)
>>> lm.fit(train_data, vocab_data)

>>> lm.score("said", ("",)) < 1
True
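
As an additional sanity check (not part of the original issue), the smoothed
conditional distribution for that context should still sum to one over the
vocabulary:

>>> round(sum(lm.score(w, ("",)) for w in lm.vocab), 6)
1.0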