.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT

=============
 Classifiers
=============

Classifiers label tokens with category labels (or *class labels*).
Typically, labels are represented with strings (such as ``"health"``
or ``"sports"``). In NLTK, classifiers are defined using classes that
implement the `ClassifierI` interface:

    >>> import nltk
    >>> nltk.usage(nltk.classify.ClassifierI)
    ClassifierI supports the following operations:
      - self.classify(featureset)
      - self.classify_many(featuresets)
      - self.labels()
      - self.prob_classify(featureset)
      - self.prob_classify_many(featuresets)

NLTK defines several classifier classes:

- `ConditionalExponentialClassifier`
- `DecisionTreeClassifier`
- `MaxentClassifier`
- `NaiveBayesClassifier`
- `WekaClassifier`

Classifiers are typically created by training them on a training
corpus.
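
For instance, training and using a classifier generally follows the
pattern sketched below. The tiny weather corpus, the ``toy_*`` names,
and the expected outputs are purely illustrative (and therefore skipped
by the doctest runner); they are not part of the regression tests:

    >>> toy_train = [(dict(outlook='sunny'), 'play'),
    ...              (dict(outlook='rainy'), 'stay')]
    >>> toy_test = [(dict(outlook='sunny'), 'play')]
    >>> toy_classifier = nltk.classify.NaiveBayesClassifier.train(toy_train)
    >>> toy_classifier.classify(dict(outlook='sunny'))     # doctest: +SKIP
    'play'
    >>> nltk.classify.accuracy(toy_classifier, toy_test)   # doctest: +SKIP
    1.0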

Regression Tests
~~~~~~~~~~~~~~~~

We define a very simple training corpus with 3 binary features: ['a',
'b', 'c'], and two labels: ['x', 'y']. We use a simple feature set so
that the correct answers can be calculated analytically (although we
haven't done this yet for all tests).

    >>> train = [
    ...     (dict(a=1,b=1,c=1), 'y'),
    ...     (dict(a=1,b=1,c=1), 'x'),
    ...     (dict(a=1,b=1,c=0), 'y'),
    ...     (dict(a=0,b=1,c=1), 'x'),
    ...     (dict(a=0,b=1,c=1), 'y'),
    ...     (dict(a=0,b=0,c=1), 'y'),
    ...     (dict(a=0,b=1,c=0), 'x'),
    ...     (dict(a=0,b=0,c=0), 'x'),
    ...     (dict(a=0,b=1,c=1), 'y'),
    ...     ]
    >>> test = [
    ...     (dict(a=1,b=0,c=1)), # unseen
    ...     (dict(a=1,b=0,c=0)), # unseen
    ...     (dict(a=0,b=1,c=1)), # seen 3 times, labels=y,y,x
    ...     (dict(a=0,b=1,c=0)), # seen 1 time, label=x
    ...     ]

Test the Naive Bayes classifier:

    >>> classifier = nltk.classify.NaiveBayesClassifier.train(train)
    >>> sorted(classifier.labels())
    ['x', 'y']
    >>> classifier.classify_many(test)
    ['y', 'x', 'y', 'x']
    >>> for pdist in classifier.prob_classify_many(test):
    ...     print('%.4f %.4f' % (pdist.prob('x'), pdist.prob('y')))
    0.3203 0.6797
    0.5857 0.4143
    0.3792 0.6208
    0.6470 0.3530
    >>> classifier.show_most_informative_features()
    Most Informative Features
    c = 0 x : y = 2.0 : 1.0
    c = 1 y : x = 1.5 : 1.0
    a = 1 y : x = 1.4 : 1.0
    b = 0 x : y = 1.2 : 1.0
    a = 0 x : y = 1.2 : 1.0
    b = 1 y : x = 1.1 : 1.0
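
The last test case can be checked by hand. Assuming the default
expected-likelihood (add-0.5) estimator used by
`NaiveBayesClassifier.train`, the posterior for ``dict(a=0,b=1,c=0)``
works out to the 0.6470/0.3530 split shown above:

    >>> p_x = 4.5/10 * (3.5/5) * (3.5/5) * (2.5/5)  # P(x) P(a=0|x) P(b=1|x) P(c=0|x)
    >>> p_y = 5.5/10 * (3.5/6) * (4.5/6) * (1.5/6)  # P(y) P(a=0|y) P(b=1|y) P(c=0|y)
    >>> print('%.4f %.4f' % (p_x / (p_x + p_y), p_y / (p_x + p_y)))
    0.6470 0.3530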

Test the Decision Tree classifier:

    >>> classifier = nltk.classify.DecisionTreeClassifier.train(
    ...     train, entropy_cutoff=0,
    ...     support_cutoff=0)
    >>> sorted(classifier.labels())
    ['x', 'y']
    >>> print(classifier)
    c=0? .................................................. x
      a=0? ................................................ x
      a=1? ................................................ y
    c=1? .................................................. y
    <BLANKLINE>
    >>> classifier.classify_many(test)
    ['y', 'y', 'y', 'x']
    >>> for pdist in classifier.prob_classify_many(test):
    ...     print('%.4f %.4f' % (pdist.prob('x'), pdist.prob('y')))
    Traceback (most recent call last):
      . . .
    NotImplementedError
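
`DecisionTreeClassifier` does not implement ``prob_classify``, so code
that wants a probability distribution when one is available, and a
plain label otherwise, has to handle the ``NotImplementedError``
itself. A minimal sketch (``safe_prob_classify`` is a hypothetical
helper, not part of NLTK):

    >>> def safe_prob_classify(classifier, featureset):
    ...     # Hypothetical fallback: return a hard label when the
    ...     # classifier cannot produce a probability distribution.
    ...     try:
    ...         return classifier.prob_classify(featureset)
    ...     except NotImplementedError:
    ...         return classifier.classify(featureset)
    >>> safe_prob_classify(classifier, test[0])
    'y'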

Test SklearnClassifier, which requires the scikit-learn package.

    >>> from nltk.classify import SklearnClassifier
    >>> from sklearn.naive_bayes import BernoulliNB
    >>> from sklearn.svm import SVC
    >>> train_data = [({"a": 4, "b": 1, "c": 0}, "ham"),
    ...               ({"a": 5, "b": 2, "c": 1}, "ham"),
    ...               ({"a": 0, "b": 3, "c": 4}, "spam"),
    ...               ({"a": 5, "b": 1, "c": 1}, "ham"),
    ...               ({"a": 1, "b": 4, "c": 3}, "spam")]
    >>> classif = SklearnClassifier(BernoulliNB()).train(train_data)
    >>> test_data = [{"a": 3, "b": 2, "c": 1},
    ...              {"a": 0, "b": 3, "c": 7}]
    >>> classif.classify_many(test_data)
    ['ham', 'spam']
    >>> classif = SklearnClassifier(SVC(), sparse=False).train(train_data)
    >>> classif.classify_many(test_data)
    ['ham', 'spam']
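
Any scikit-learn estimator with the usual ``fit``/``predict`` interface
can be wrapped the same way. The labels below are simply what one would
expect for this toy data, so the example is skipped rather than checked:

    >>> from sklearn.linear_model import LogisticRegression
    >>> classif = SklearnClassifier(LogisticRegression()).train(train_data)  # doctest: +SKIP
    >>> classif.classify_many(test_data)  # doctest: +SKIP
    ['ham', 'spam']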

Test the Maximum Entropy classifier training algorithms; they should all
generate the same results.

    >>> def print_maxent_test_header():
    ...     print(' '*11+''.join([' test[%s] ' % i
    ...                           for i in range(len(test))]))
    ...     print(' '*11+' p(x) p(y)'*len(test))
    ...     print('-'*(11+15*len(test)))

    >>> def test_maxent(algorithm):
    ...     print('%11s' % algorithm, end=' ')
    ...     try:
    ...         classifier = nltk.classify.MaxentClassifier.train(
    ...             train, algorithm, trace=0, max_iter=1000)
    ...     except Exception as e:
    ...         print('Error: %r' % e)
    ...         return
    ...
    ...     for featureset in test:
    ...         pdist = classifier.prob_classify(featureset)
    ...         print('%8.2f%6.2f' % (pdist.prob('x'), pdist.prob('y')), end=' ')
    ...     print()

    >>> print_maxent_test_header(); test_maxent('GIS'); test_maxent('IIS')
    test[0] test[1] test[2] test[3]
    p(x) p(y) p(x) p(y) p(x) p(y) p(x) p(y)
    -----------------------------------------------------------------------
    GIS 0.16 0.84 0.46 0.54 0.41 0.59 0.76 0.24
    IIS 0.16 0.84 0.46 0.54 0.41 0.59 0.76 0.24

    >>> test_maxent('MEGAM'); test_maxent('TADM') # doctest: +SKIP
    MEGAM 0.16 0.84 0.46 0.54 0.41 0.59 0.76 0.24
    TADM 0.16 0.84 0.46 0.54 0.41 0.59 0.76 0.24
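
The list of supported training algorithms is also available as a class
attribute (MEGAM and TADM call out to external binaries, which is why
their tests are skipped above). The value shown is what recent NLTK
releases define, so it is skipped here rather than asserted:

    >>> nltk.classify.MaxentClassifier.ALGORITHMS  # doctest: +SKIP
    ['GIS', 'IIS', 'MEGAM', 'TADM']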

Regression tests for TypedMaxentFeatureEncoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    >>> from nltk.classify import maxent
    >>> train = [
    ...     ({'a': 1, 'b': 1, 'c': 1}, 'y'),
    ...     ({'a': 5, 'b': 5, 'c': 5}, 'x'),
    ...     ({'a': 0.9, 'b': 0.9, 'c': 0.9}, 'y'),
    ...     ({'a': 5.5, 'b': 5.4, 'c': 5.3}, 'x'),
    ...     ({'a': 0.8, 'b': 1.2, 'c': 1}, 'y'),
    ...     ({'a': 5.1, 'b': 4.9, 'c': 5.2}, 'x')
    ... ]
    >>> test = [
    ...     {'a': 1, 'b': 0.8, 'c': 1.2},
    ...     {'a': 5.2, 'b': 5.1, 'c': 5}
    ... ]
    >>> encoding = maxent.TypedMaxentFeatureEncoding.train(
    ...     train, count_cutoff=3, alwayson_features=True)
    >>> classifier = maxent.MaxentClassifier.train(
    ...     train, bernoulli=False, encoding=encoding, trace=0)
    >>> classifier.classify_many(test)
    ['y', 'x']
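
Since the two test items above are classified as ``'y'`` and ``'x'``,
pairing them with exactly those labels gives a quick sanity check of
the `nltk.classify.accuracy` helper on this classifier:

    >>> labeled_test = [(test[0], 'y'), (test[1], 'x')]
    >>> nltk.classify.accuracy(classifier, labeled_test)
    1.0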