.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT

=============
Classifiers
=============

Classifiers label tokens with category labels (or *class labels*).
Typically, labels are represented with strings (such as ``"health"``
or ``"sports"``).  In NLTK, classifiers are defined using classes that
implement the `ClassifierI` interface:

    >>> import nltk
    >>> nltk.usage(nltk.classify.ClassifierI)
    ClassifierI supports the following operations:
      - self.classify(featureset)
      - self.classify_many(featuresets)
      - self.labels()
      - self.prob_classify(featureset)
      - self.prob_classify_many(featuresets)

NLTK defines several classifier classes:

- `ConditionalExponentialClassifier`
- `DecisionTreeClassifier`
- `MaxentClassifier`
- `NaiveBayesClassifier`
- `WekaClassifier`

Classifiers are typically created by training them on a training
corpus.
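Any object that provides the operations listed above can be used wherever a classifier is expected. As an illustrative sketch (not part of NLTK), here is a trivial majority-label baseline that mirrors the `ClassifierI` method names without subclassing it:

```python
from collections import Counter

class MajorityClassifier:
    """Toy classifier exposing the same operations as ClassifierI.

    It ignores the features entirely and always predicts the most
    frequent label seen during training -- a common baseline.
    """
    def __init__(self, labeled_featuresets):
        counts = Counter(label for (_, label) in labeled_featuresets)
        self._labels = sorted(counts)
        self._best = counts.most_common(1)[0][0]

    def labels(self):
        return self._labels

    def classify(self, featureset):
        return self._best

    def classify_many(self, featuresets):
        return [self.classify(fs) for fs in featuresets]

train = [({'a': 1}, 'y'), ({'a': 0}, 'y'), ({'a': 1}, 'x')]
clf = MajorityClassifier(train)
```

A real classifier would of course inspect the featureset in ``classify()``; the point here is only the shape of the interface.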

Regression Tests
~~~~~~~~~~~~~~~~

We define a very simple training corpus with 3 binary features: ['a',
'b', 'c'], and two labels: ['x', 'y'].  We use a simple feature set so
that the correct answers can be calculated analytically (although we
haven't done this yet for all tests).

    >>> train = [
    ...     (dict(a=1,b=1,c=1), 'y'),
    ...     (dict(a=1,b=1,c=1), 'x'),
    ...     (dict(a=1,b=1,c=0), 'y'),
    ...     (dict(a=0,b=1,c=1), 'x'),
    ...     (dict(a=0,b=1,c=1), 'y'),
    ...     (dict(a=0,b=0,c=1), 'y'),
    ...     (dict(a=0,b=1,c=0), 'x'),
    ...     (dict(a=0,b=0,c=0), 'x'),
    ...     (dict(a=0,b=1,c=1), 'y'),
    ...     ]

    >>> test = [
    ...     (dict(a=1,b=0,c=1)), # unseen
    ...     (dict(a=1,b=0,c=0)), # unseen
    ...     (dict(a=0,b=1,c=1)), # seen 3 times, labels=y,y,x
    ...     (dict(a=0,b=1,c=0)), # seen 1 time, label=x
    ...     ]

Test the Naive Bayes classifier:

    >>> classifier = nltk.classify.NaiveBayesClassifier.train(train)
    >>> sorted(classifier.labels())
    ['x', 'y']
    >>> classifier.classify_many(test)
    ['y', 'x', 'y', 'x']
    >>> for pdist in classifier.prob_classify_many(test):
    ...     print('%.4f %.4f' % (pdist.prob('x'), pdist.prob('y')))
    0.3203 0.6797
    0.5857 0.4143
    0.3792 0.6208
    0.6470 0.3530
    >>> classifier.show_most_informative_features()
    Most Informative Features
                           c = 0                 x : y      =      2.0 : 1.0
                           c = 1                 y : x      =      1.5 : 1.0
                           a = 1                 y : x      =      1.4 : 1.0
                           b = 0                 x : y      =      1.2 : 1.0
                           a = 0                 x : y      =      1.2 : 1.0
                           b = 1                 y : x      =      1.1 : 1.0
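As noted above, the correct answers for this corpus can be computed analytically. The sketch below does so from scratch, assuming add-one (Laplace) smoothing; NLTK's ``NaiveBayesClassifier`` uses expected-likelihood smoothing by default, so the exact probabilities differ slightly, but the winning labels agree on this corpus:

```python
def nb_posteriors(pairs, featureset):
    """Naive Bayes posteriors with add-one smoothing (illustrative,
    not NLTK's implementation)."""
    by_label = {}
    for fs, label in pairs:
        by_label.setdefault(label, []).append(fs)
    total = sum(len(rows) for rows in by_label.values())
    scores = {}
    for label, rows in by_label.items():
        p = len(rows) / total                   # prior P(label)
        for feat, val in featureset.items():
            match = sum(1 for r in rows if r[feat] == val)
            p *= (match + 1) / (len(rows) + 2)  # add-one over 2 values
        scores[label] = p
    z = sum(scores.values())                    # normalize
    return {label: s / z for label, s in scores.items()}

train = [(dict(a=1,b=1,c=1), 'y'), (dict(a=1,b=1,c=1), 'x'),
         (dict(a=1,b=1,c=0), 'y'), (dict(a=0,b=1,c=1), 'x'),
         (dict(a=0,b=1,c=1), 'y'), (dict(a=0,b=0,c=1), 'y'),
         (dict(a=0,b=1,c=0), 'x'), (dict(a=0,b=0,c=0), 'x'),
         (dict(a=0,b=1,c=1), 'y')]
test = [dict(a=1,b=0,c=1), dict(a=1,b=0,c=0),
        dict(a=0,b=1,c=1), dict(a=0,b=1,c=0)]

preds = []
for fs in test:
    post = nb_posteriors(train, fs)
    preds.append(max(post, key=post.get))
# preds == ['y', 'x', 'y', 'x'], matching classify_many(test) above
```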

Test the Decision Tree classifier:

    >>> classifier = nltk.classify.DecisionTreeClassifier.train(
    ...     train, entropy_cutoff=0,
    ...     support_cutoff=0)
    >>> sorted(classifier.labels())
    ['x', 'y']
    >>> print(classifier)
    c=0? .................................................. x
      a=0? ................................................ x
      a=1? ................................................ y
    c=1? .................................................. y
    <BLANKLINE>
    >>> classifier.classify_many(test)
    ['y', 'y', 'y', 'x']
    >>> for pdist in classifier.prob_classify_many(test):
    ...     print('%.4f %.4f' % (pdist.prob('x'), pdist.prob('y')))
    Traceback (most recent call last):
      . . .
    NotImplementedError
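The root split in the tree above can also be verified by hand: of the three features, 'c' gives the largest information gain on the training corpus, so it is chosen first. A short sketch of that calculation (illustrative, not NLTK's implementation):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(pairs, feat):
    """Entropy reduction from splitting the corpus on `feat`."""
    labels = [label for _, label in pairs]
    n = len(pairs)
    remainder = 0.0
    for val in {fs[feat] for fs, _ in pairs}:
        subset = [label for fs, label in pairs if fs[feat] == val]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

train = [(dict(a=1,b=1,c=1), 'y'), (dict(a=1,b=1,c=1), 'x'),
         (dict(a=1,b=1,c=0), 'y'), (dict(a=0,b=1,c=1), 'x'),
         (dict(a=0,b=1,c=1), 'y'), (dict(a=0,b=0,c=1), 'y'),
         (dict(a=0,b=1,c=0), 'x'), (dict(a=0,b=0,c=0), 'x'),
         (dict(a=0,b=1,c=1), 'y')]
gains = {feat: info_gain(train, feat) for feat in 'abc'}
best = max(gains, key=gains.get)   # 'c', matching the tree's root split
```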

Test SklearnClassifier, which requires the scikit-learn package.

    >>> from nltk.classify import SklearnClassifier
    >>> from sklearn.naive_bayes import BernoulliNB
    >>> from sklearn.svm import SVC
    >>> train_data = [({"a": 4, "b": 1, "c": 0}, "ham"),
    ...               ({"a": 5, "b": 2, "c": 1}, "ham"),
    ...               ({"a": 0, "b": 3, "c": 4}, "spam"),
    ...               ({"a": 5, "b": 1, "c": 1}, "ham"),
    ...               ({"a": 1, "b": 4, "c": 3}, "spam")]
    >>> classif = SklearnClassifier(BernoulliNB()).train(train_data)
    >>> test_data = [{"a": 3, "b": 2, "c": 1},
    ...              {"a": 0, "b": 3, "c": 7}]
    >>> classif.classify_many(test_data)
    ['ham', 'spam']
    >>> classif = SklearnClassifier(SVC(), sparse=False).train(train_data)
    >>> classif.classify_many(test_data)
    ['ham', 'spam']
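Much of SklearnClassifier's job is bridging NLTK's dict-of-features format to scikit-learn's array interface (internally it uses sklearn's DictVectorizer). A rough stdlib-only sketch of that conversion step, for illustration:

```python
def fit_vocab(featuresets):
    """Assign each feature name a stable column index,
    as a dict vectorizer does at fit time."""
    names = sorted({name for fs in featuresets for name in fs})
    return {name: i for i, name in enumerate(names)}

def to_matrix(featuresets, vocab):
    """Turn feature dicts into dense numeric rows; features unseen
    at fit time are silently dropped."""
    rows = []
    for fs in featuresets:
        row = [0] * len(vocab)
        for name, value in fs.items():
            if name in vocab:
                row[vocab[name]] = value
        rows.append(row)
    return rows

vocab = fit_vocab([{"a": 4, "b": 1, "c": 0}, {"a": 0, "b": 3, "c": 4}])
X = to_matrix([{"a": 3, "b": 2, "c": 1},
               {"a": 0, "b": 3, "c": 7, "d": 9}], vocab)  # 'd' is dropped
```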

Test the Maximum Entropy classifier training algorithms; they should all
generate the same results.

    >>> def print_maxent_test_header():
    ...     print(' '*11+''.join(['      test[%s]  ' % i
    ...                           for i in range(len(test))]))
    ...     print(' '*11+'     p(x)  p(y)'*len(test))
    ...     print('-'*(11+15*len(test)))

    >>> def test_maxent(algorithm):
    ...     print('%11s' % algorithm, end=' ')
    ...     try:
    ...         classifier = nltk.classify.MaxentClassifier.train(
    ...             train, algorithm, trace=0, max_iter=1000)
    ...     except Exception as e:
    ...         print('Error: %r' % e)
    ...         return
    ...
    ...     for featureset in test:
    ...         pdist = classifier.prob_classify(featureset)
    ...         print('%8.2f%6.2f' % (pdist.prob('x'), pdist.prob('y')), end=' ')
    ...     print()

    >>> print_maxent_test_header(); test_maxent('GIS'); test_maxent('IIS')
                 test[0]        test[1]        test[2]        test[3]
                p(x)  p(y)     p(x)  p(y)     p(x)  p(y)     p(x)  p(y)
    -----------------------------------------------------------------------
            GIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
            IIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24

    >>> test_maxent('MEGAM'); test_maxent('TADM') # doctest: +SKIP
          MEGAM     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
           TADM     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
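The GIS algorithm exercised above can be sketched compactly for this corpus. The version below is a bare-bones illustration, not NLTK's implementation: it uses joint (feature, value, label) indicator features and omits GIS's correction feature, which is safe here only because every training event activates exactly the same number of features.

```python
import math
from collections import defaultdict

def train_gis(pairs, iters=100):
    """Generalized Iterative Scaling over joint (feature, value, label)
    indicators. Returns a function mapping a featureset to posteriors."""
    labels = sorted({label for _, label in pairs})
    # C bounds the number of active features per event; here it is
    # exactly 3 for every event, so no correction feature is needed.
    C = max(len(fs) for fs, _ in pairs)
    w = defaultdict(float)          # weights for (feature, value, label)

    def posterior(fs):
        s = {label: math.exp(sum(w[(f, v, label)] for f, v in fs.items()))
             for label in labels}
        z = sum(s.values())
        return {label: val / z for label, val in s.items()}

    # Empirical feature counts.
    emp = defaultdict(float)
    for fs, label in pairs:
        for f, v in fs.items():
            emp[(f, v, label)] += 1.0

    for _ in range(iters):
        # Expected feature counts under the current model.
        est = defaultdict(float)
        for fs, _ in pairs:
            p = posterior(fs)
            for label in labels:
                for f, v in fs.items():
                    est[(f, v, label)] += p[label]
        # Multiplicative GIS update, scaled by 1/C.
        for key, e in emp.items():
            w[key] += math.log(e / est[key]) / C
    return posterior

train = [(dict(a=1,b=1,c=1), 'y'), (dict(a=1,b=1,c=1), 'x'),
         (dict(a=1,b=1,c=0), 'y'), (dict(a=0,b=1,c=1), 'x'),
         (dict(a=0,b=1,c=1), 'y'), (dict(a=0,b=0,c=1), 'y'),
         (dict(a=0,b=1,c=0), 'x'), (dict(a=0,b=0,c=0), 'x'),
         (dict(a=0,b=1,c=1), 'y')]
posterior = train_gis(train)
```

After training, ``posterior(dict(a=1,b=0,c=1))`` favors 'y' and ``posterior(dict(a=0,b=1,c=0))`` favors 'x', in line with the table above, though the exact probabilities depend on the encoding details and iteration count.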

Regression tests for TypedMaxentFeatureEncoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    >>> from nltk.classify import maxent
    >>> train = [
    ...     ({'a': 1, 'b': 1, 'c': 1}, 'y'),
    ...     ({'a': 5, 'b': 5, 'c': 5}, 'x'),
    ...     ({'a': 0.9, 'b': 0.9, 'c': 0.9}, 'y'),
    ...     ({'a': 5.5, 'b': 5.4, 'c': 5.3}, 'x'),
    ...     ({'a': 0.8, 'b': 1.2, 'c': 1}, 'y'),
    ...     ({'a': 5.1, 'b': 4.9, 'c': 5.2}, 'x')
    ... ]
    >>> test = [
    ...     {'a': 1, 'b': 0.8, 'c': 1.2},
    ...     {'a': 5.2, 'b': 5.1, 'c': 5}
    ... ]

    >>> encoding = maxent.TypedMaxentFeatureEncoding.train(
    ...     train, count_cutoff=3, alwayson_features=True)

    >>> classifier = maxent.MaxentClassifier.train(
    ...     train, bernoulli=False, encoding=encoding, trace=0)

    >>> classifier.classify_many(test)
    ['y', 'x']