- .. Copyright (C) 2001-2019 NLTK Project
- .. For license information, see LICENSE.TXT
- ==========
- Chunking
- ==========
- >>> from nltk.chunk import *
- >>> from nltk.chunk.util import *
- >>> from nltk.chunk.regexp import *
- >>> from nltk import Tree
- >>> tagged_text = "[ The/DT cat/NN ] sat/VBD on/IN [ the/DT mat/NN ] [ the/DT dog/NN ] chewed/VBD ./."
- >>> gold_chunked_text = tagstr2tree(tagged_text)
- >>> unchunked_text = gold_chunked_text.flatten()
- Chunking uses a special regexp syntax for rules that delimit the chunks. These
- rules must be converted to 'regular' regular expressions before a sentence can
- be chunked.
- >>> tag_pattern = "<DT>?<JJ>*<NN.*>"
- >>> regexp_pattern = tag_pattern2re_pattern(tag_pattern)
- >>> regexp_pattern
- '(<(DT)>)?(<(JJ)>)*(<(NN[^\\{\\}<>]*)>)'
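The conversion above can be approximated in a few lines. This is a minimal sketch, not NLTK's implementation: `tag_pattern_to_regex` and `TAG_CHAR` are names invented here, and the handling of escaped dots is simplified (a dot preceded by a single backslash is left alone).

```python
import re

# Character class standing in for "one character of a tag". Assumption of
# this sketch: a tag may contain anything except braces and angle brackets.
TAG_CHAR = r"[^\{\}<>]"


def tag_pattern_to_regex(tag_pattern):
    """Translate a tag pattern into an ordinary regular expression (sketch)."""
    pattern = re.sub(r"\s", "", tag_pattern)  # whitespace is not significant
    pattern = pattern.replace("<", "(<(")     # group each tag ...
    pattern = pattern.replace(">", ")>)")     # ... so ?, * and + apply to it
    # An unescaped "." must not be able to match past the end of a tag.
    return re.sub(r"(?<!\\)\.", lambda m: TAG_CHAR, pattern)


print(tag_pattern_to_regex("<DT>?<JJ>*<NN.*>"))
```

Grouping each tag is what lets `<JJ>*` mean "zero or more whole JJ tags" rather than repeating only the closing bracket.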
- Construct some new chunking rules.
- >>> chunk_rule = ChunkRule("<.*>+", "Chunk everything")
- >>> chink_rule = ChinkRule(r"<VBD|IN|\.>", "Chink on verbs/prepositions")
- >>> split_rule = SplitRule("<DT><NN>", "<DT><NN>",
- ... "Split successive determiner/noun pairs")
- Create and score a series of chunk parsers, successively more complex.
- >>> chunk_parser = RegexpChunkParser([chunk_rule], chunk_label='NP')
- >>> chunked_text = chunk_parser.parse(unchunked_text)
- >>> print(chunked_text)
- (S
- (NP
- The/DT
- cat/NN
- sat/VBD
- on/IN
- the/DT
- mat/NN
- the/DT
- dog/NN
- chewed/VBD
- ./.))
- >>> chunkscore = ChunkScore()
- >>> chunkscore.score(gold_chunked_text, chunked_text)
- >>> print(chunkscore.precision())
- 0.0
- >>> print(chunkscore.recall())
- 0.0
- >>> print(chunkscore.f_measure())
- 0
- >>> for chunk in sorted(chunkscore.missed()): print(chunk)
- (NP The/DT cat/NN)
- (NP the/DT dog/NN)
- (NP the/DT mat/NN)
- >>> for chunk in chunkscore.incorrect(): print(chunk)
- (NP
- The/DT
- cat/NN
- sat/VBD
- on/IN
- the/DT
- mat/NN
- the/DT
- dog/NN
- chewed/VBD
- ./.)
- >>> chunk_parser = RegexpChunkParser([chunk_rule, chink_rule],
- ... chunk_label='NP')
- >>> chunked_text = chunk_parser.parse(unchunked_text)
- >>> print(chunked_text)
- (S
- (NP The/DT cat/NN)
- sat/VBD
- on/IN
- (NP the/DT mat/NN the/DT dog/NN)
- chewed/VBD
- ./.)
- >>> assert chunked_text == chunk_parser.parse(list(unchunked_text))
- >>> chunkscore = ChunkScore()
- >>> chunkscore.score(gold_chunked_text, chunked_text)
- >>> chunkscore.precision()
- 0.5
- >>> print(chunkscore.recall())
- 0.33333333...
- >>> print(chunkscore.f_measure())
- 0.4
- >>> for chunk in sorted(chunkscore.missed()): print(chunk)
- (NP the/DT dog/NN)
- (NP the/DT mat/NN)
- >>> for chunk in chunkscore.incorrect(): print(chunk)
- (NP the/DT mat/NN the/DT dog/NN)
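The scores above follow from comparing the set of guessed chunks with the set of gold chunks. The sketch below recomputes them by hand. It assumes chunks are encoded as `(start, end)` token offsets and uses the plain harmonic mean for the F-measure; `chunk_scores` is a hypothetical helper, not part of `ChunkScore`.

```python
def chunk_scores(gold, guess):
    """Return (precision, recall, f_measure) for two sets of chunk spans."""
    correct = len(gold & guess)  # chunks with identical boundaries
    precision = correct / len(guess) if guess else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision == 0 or recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Token offsets: The=0 cat=1 sat=2 on=3 the=4 mat=5 the=6 dog=7 chewed=8 .=9
gold = {(0, 2), (4, 6), (6, 8)}   # the three gold NP chunks
guess = {(0, 2), (4, 8)}          # output of the chunk + chink parser
precision, recall, f_measure = chunk_scores(gold, guess)
print(precision, round(recall, 4), round(f_measure, 4))
```

One of the two guessed chunks is correct (precision 1/2) and one of the three gold chunks is found (recall 1/3). The merged chunk `(4, 8)` counts as incorrect even though it covers two gold chunks, because only exact boundary matches count.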
- >>> chunk_parser = RegexpChunkParser([chunk_rule, chink_rule, split_rule],
- ... chunk_label='NP')
- >>> chunked_text = chunk_parser.parse(unchunked_text, trace=True)
- # Input:
- <DT> <NN> <VBD> <IN> <DT> <NN> <DT> <NN> <VBD> <.>
- # Chunk everything:
- {<DT> <NN> <VBD> <IN> <DT> <NN> <DT> <NN> <VBD> <.>}
- # Chink on verbs/prepositions:
- {<DT> <NN>} <VBD> <IN> {<DT> <NN> <DT> <NN>} <VBD> <.>
- # Split successive determiner/noun pairs:
- {<DT> <NN>} <VBD> <IN> {<DT> <NN>}{<DT> <NN>} <VBD> <.>
- >>> print(chunked_text)
- (S
- (NP The/DT cat/NN)
- sat/VBD
- on/IN
- (NP the/DT mat/NN)
- (NP the/DT dog/NN)
- chewed/VBD
- ./.)
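Each rule in the trace is a regular-expression substitution over a string of tags and braces. The sketch below replays the three steps with plain `re.sub`. It is an illustration under stated assumptions rather than NLTK's code: `xform` is a stand-in function, the patterns are written directly as ordinary regular expressions, and the two lookahead constants are this sketch's own definitions of "inside a chunk" and "outside any chunk".

```python
import re

IN_CHUNK = r"(?=[^\{]*\})"       # lookahead: the match lies inside a chunk
IN_CHINK = r"(?=[^\}]*(\{|$))"   # lookahead: the match lies outside any chunk


def xform(chunk_string, pattern, repl):
    """Apply one substitution, then drop any empty chunks it produced."""
    return re.sub(r"\{\}", "", re.sub(pattern, repl, chunk_string))


start = "<DT><NN><VBD><IN><DT><NN><DT><NN><VBD><.>"

# Chunk everything: wrap the whole run of tags in one chunk.
after_chunk = xform(start, r"(?P<chunk>(<[^<>]+>)+)" + IN_CHINK, r"{\g<chunk>}")

# Chink on verbs/prepositions: close the chunk before the match, reopen after.
after_chink = xform(after_chunk,
                    r"(?P<chink>(<VBD>|<IN>|<\.>)+)" + IN_CHUNK,
                    r"}\g<chink>{")

# Split successive determiner/noun pairs: insert a chunk boundary between them.
after_split = xform(after_chink, r"(?P<left><DT><NN>)(?=<DT><NN>)", r"\g<left>}{")

for step in (after_chunk, after_chink, after_split):
    print(step)
```

Removing `{}` after every substitution matters for the chink step: chinking the sentence-final `<VBD><.>` would otherwise leave an empty chunk at the end of the string.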
- >>> chunkscore = ChunkScore()
- >>> chunkscore.score(gold_chunked_text, chunked_text)
- >>> chunkscore.precision()
- 1.0
- >>> chunkscore.recall()
- 1.0
- >>> chunkscore.f_measure()
- 1.0
- >>> chunkscore.missed()
- []
- >>> chunkscore.incorrect()
- []
- >>> chunk_parser.rules() # doctest: +NORMALIZE_WHITESPACE
- [<ChunkRule: '<.*>+'>, <ChinkRule: '<VBD|IN|\\.>'>,
- <SplitRule: '<DT><NN>', '<DT><NN>'>]
- Printing parsers:
- >>> print(repr(chunk_parser))
- <RegexpChunkParser with 3 rules>
- >>> print(chunk_parser)
- RegexpChunkParser with 3 rules:
- Chunk everything
- <ChunkRule: '<.*>+'>
- Chink on verbs/prepositions
- <ChinkRule: '<VBD|IN|\\.>'>
- Split successive determiner/noun pairs
- <SplitRule: '<DT><NN>', '<DT><NN>'>
- Regression Tests
- ~~~~~~~~~~~~~~~~
- ChunkParserI
- ------------
- `ChunkParserI` is an abstract interface -- it is not meant to be
- instantiated directly.
- >>> ChunkParserI().parse([])
- Traceback (most recent call last):
- . . .
- NotImplementedError
- ChunkString
- -----------
- ChunkString can be built from a tree of tagged tuples, a tree of
- trees, or a tree that mixes both:
- >>> t1 = Tree('S', [('w%d' % i, 't%d' % i) for i in range(10)])
- >>> t2 = Tree('S', [Tree('t0', []), Tree('t1', ['c1'])])
- >>> t3 = Tree('S', [('w0', 't0'), Tree('t1', ['c1'])])
- >>> ChunkString(t1)
- <ChunkString: '<t0><t1><t2><t3><t4><t5><t6><t7><t8><t9>'>
- >>> ChunkString(t2)
- <ChunkString: '<t0><t1>'>
- >>> ChunkString(t3)
- <ChunkString: '<t0><t1>'>
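The encoding shown above keeps one tag per child: a tagged token contributes its tag, and a subtree contributes its label. A minimal sketch of that rule follows. It assumes a hypothetical `Node` class in place of `Tree` so that it runs without NLTK, and `chunk_string_from` is an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a tree node; only the label is used here."""
    label: str
    children: list = field(default_factory=list)


def chunk_string_from(children):
    """Build the '<tag><tag>...' encoding from tagged tokens and subtrees."""
    tags = []
    for child in children:
        if isinstance(child, Node):
            tags.append(child.label)   # a subtree contributes its label
        elif isinstance(child, tuple) and len(child) == 2:
            tags.append(child[1])      # a (word, tag) pair contributes its tag
        else:
            raise ValueError("chunk structures must contain tagged tokens or trees")
    return "".join("<%s>" % tag for tag in tags)


print(chunk_string_from([("w0", "t0"), Node("t1", ["c1"])]))
```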
- Other values generate an error:
- >>> ChunkString(Tree('S', ['x']))
- Traceback (most recent call last):
- . . .
- ValueError: chunk structures must contain tagged tokens or trees
- Calling `str()` on a chunk string pads it with spaces, so that it lines
- up with the `str()` output of other chunk strings over the same
- underlying input.
- >>> cs = ChunkString(t1)
- >>> print(cs)
- <t0> <t1> <t2> <t3> <t4> <t5> <t6> <t7> <t8> <t9>
- >>> cs.xform('<t3>', '{<t3>}')
- >>> print(cs)
- <t0> <t1> <t2> {<t3>} <t4> <t5> <t6> <t7> <t8> <t9>
- The `_verify()` method makes sure that our transforms don't corrupt
- the chunk string. Setting debug_level to 2 or higher causes `_verify()`
- to be called at the end of every call to `xform`.
- >>> cs = ChunkString(t1, debug_level=3)
- >>> # tag not marked with <...>:
- >>> cs.xform('<t3>', 't3')
- Traceback (most recent call last):
- . . .
- ValueError: Transformation generated invalid chunkstring:
- <t0><t1><t2>t3<t4><t5><t6><t7><t8><t9>
- >>> # brackets not balanced:
- >>> cs.xform('<t3>', '{<t3>')
- Traceback (most recent call last):
- . . .
- ValueError: Transformation generated invalid chunkstring:
- <t0><t1><t2>{<t3><t4><t5><t6><t7><t8><t9>
- >>> # nested brackets:
- >>> cs.xform('<t3><t4><t5>', '{<t3>{<t4>}<t5>}')
- Traceback (most recent call last):
- . . .
- ValueError: Transformation generated invalid chunkstring:
- <t0><t1><t2>{<t3>{<t4>}<t5>}<t6><t7><t8><t9>
- >>> # modified tags:
- >>> cs.xform('<t3>', '<t9>')
- Traceback (most recent call last):
- . . .
- ValueError: Transformation generated invalid chunkstring: tag changed
- >>> # added tags:
- >>> cs.xform('<t9>', '<t9><t10>')
- Traceback (most recent call last):
- . . .
- ValueError: Transformation generated invalid chunkstring: tag changed
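The five failures above fall into three kinds of check: only tags and braces may appear, braces must be balanced and not nested, and the tag sequence must be unchanged. The sketch below shows one way to express those checks. `verify`, `TAG` and `VALID` are names chosen for this illustration, and the error messages are abbreviated.

```python
import re

TAG = r"<[^\{\}<>]+?>"
VALID = re.compile(r"^(\{?%s\}?)*?$" % TAG)


def verify(candidate, original_tags):
    """Raise ValueError if candidate is not a well-formed chunk string."""
    # 1. Nothing but tags and braces may appear.
    if not VALID.match(candidate):
        raise ValueError("invalid chunkstring: %s" % candidate)
    # 2. With the tags removed, balanced and unnested braces read "{}{}{}...".
    brackets = re.sub(r"[^\{\}]+", "", candidate)
    if not re.fullmatch(r"(\{\})*", brackets):
        raise ValueError("invalid chunkstring: %s" % candidate)
    # 3. A transformation may move braces but must not touch the tags.
    if re.findall(r"<([^<>]+)>", candidate) != original_tags:
        raise ValueError("invalid chunkstring: tag changed")


tags = ["t%d" % i for i in range(10)]
verify("<t0><t1><t2>{<t3>}<t4><t5><t6><t7><t8><t9>", tags)  # passes silently
```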
- Chunking Rules
- --------------
- Test the different rule constructors & __repr__ methods:
- >>> import re
- >>> r1 = RegexpChunkRule('<a|b>'+ChunkString.IN_CHINK_PATTERN,
- ... '{<a|b>}', 'chunk <a> and <b>')
- >>> r2 = RegexpChunkRule(re.compile('<a|b>'+ChunkString.IN_CHINK_PATTERN),
- ... '{<a|b>}', 'chunk <a> and <b>')
- >>> r3 = ChunkRule('<a|b>', 'chunk <a> and <b>')
- >>> r4 = ChinkRule('<a|b>', 'chink <a> and <b>')
- >>> r5 = UnChunkRule('<a|b>', 'unchunk <a> and <b>')
- >>> r6 = MergeRule('<a>', '<b>', 'merge <a> w/ <b>')
- >>> r7 = SplitRule('<a>', '<b>', 'split <a> from <b>')
- >>> r8 = ExpandLeftRule('<a>', '<b>', 'expand left <a> <b>')
- >>> r9 = ExpandRightRule('<a>', '<b>', 'expand right <a> <b>')
- >>> for rule in r1, r2, r3, r4, r5, r6, r7, r8, r9:
- ... print(rule)
- <RegexpChunkRule: '<a|b>(?=[^\\}]*(\\{|$))'->'{<a|b>}'>
- <RegexpChunkRule: '<a|b>(?=[^\\}]*(\\{|$))'->'{<a|b>}'>
- <ChunkRule: '<a|b>'>
- <ChinkRule: '<a|b>'>
- <UnChunkRule: '<a|b>'>
- <MergeRule: '<a>', '<b>'>
- <SplitRule: '<a>', '<b>'>
- <ExpandLeftRule: '<a>', '<b>'>
- <ExpandRightRule: '<a>', '<b>'>
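The rule types listed above differ only in how they move braces around the tags they match. The sketch below illustrates the idea for four of them with plain `re.sub` calls. The function names are hypothetical, the patterns passed in are assumed to be ordinary regular expressions already, and none of this is taken from NLTK's source.

```python
import re


def merge(s, left, right):
    """Join two adjacent chunks where left ends one and right starts the next."""
    return re.sub(r"(?P<left>%s)\}\{(?=%s)" % (left, right), r"\g<left>", s)


def expand_left(s, left, right):
    """Pull a chink matching left into the chunk that starts with right."""
    return re.sub(r"(?P<left>%s)\{(?P<right>%s)" % (left, right),
                  r"{\g<left>\g<right>", s)


def expand_right(s, left, right):
    """Pull a chink matching right into the chunk that ends with left."""
    return re.sub(r"(?P<left>%s)\}(?P<right>%s)" % (left, right),
                  r"\g<left>\g<right>}", s)


def unchunk(s, tags):
    """Remove the braces around a chunk whose whole content matches tags."""
    return re.sub(r"\{(?P<chunk>%s)\}" % tags, r"\g<chunk>", s)


print(merge("{<a>}{<b>}", "<a>", "<b>"))
print(expand_left("<a>{<b>}", "<a>", "<b>"))
print(expand_right("{<a>}<b>", "<a>", "<b>"))
print(unchunk("{<a>}<c>", "<a>"))
```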
- `tag_pattern2re_pattern()` raises a `ValueError` if the tag pattern is malformed:
- >>> tag_pattern2re_pattern('{}')
- Traceback (most recent call last):
- . . .
- ValueError: Bad tag pattern: '{}'
- RegexpChunkParser
- -----------------
- A warning is printed when parsing an empty sentence:
- >>> parser = RegexpChunkParser([ChunkRule('<a>', '')])
- >>> parser.parse(Tree('S', []))
- Warning: parsing empty text
- Tree('S', [])
- RegexpParser
- ------------
- >>> parser = RegexpParser('''
- ... NP: {<DT>? <JJ>* <NN>*} # NP
- ... P: {<IN>} # Preposition
- ... V: {<V.*>} # Verb
- ... PP: {<P> <NP>} # PP -> P NP
- ... VP: {<V> <NP|PP>*} # VP -> V (NP|PP)*
- ... ''')
- >>> print(repr(parser))
- <chunk.RegexpParser with 5 stages>
- >>> print(parser)
- chunk.RegexpParser with 5 stages:
- RegexpChunkParser with 1 rules:
- NP <ChunkRule: '<DT>? <JJ>* <NN>*'>
- RegexpChunkParser with 1 rules:
- Preposition <ChunkRule: '<IN>'>
- RegexpChunkParser with 1 rules:
- Verb <ChunkRule: '<V.*>'>
- RegexpChunkParser with 1 rules:
- PP -> P NP <ChunkRule: '<P> <NP>'>
- RegexpChunkParser with 1 rules:
- VP -> V (NP|PP)* <ChunkRule: '<V> <NP|PP>*'>
- >>> print(parser.parse(unchunked_text, trace=True))
- # Input:
- <DT> <NN> <VBD> <IN> <DT> <NN> <DT> <NN> <VBD> <.>
- # NP:
- {<DT> <NN>} <VBD> <IN> {<DT> <NN>}{<DT> <NN>} <VBD> <.>
- # Input:
- <NP> <VBD> <IN> <NP> <NP> <VBD> <.>
- # Preposition:
- <NP> <VBD> {<IN>} <NP> <NP> <VBD> <.>
- # Input:
- <NP> <VBD> <P> <NP> <NP> <VBD> <.>
- # Verb:
- <NP> {<VBD>} <P> <NP> <NP> {<VBD>} <.>
- # Input:
- <NP> <V> <P> <NP> <NP> <V> <.>
- # PP -> P NP:
- <NP> <V> {<P> <NP>} <NP> <V> <.>
- # Input:
- <NP> <V> <PP> <NP> <V> <.>
- # VP -> V (NP|PP)*:
- <NP> {<V> <PP> <NP>}{<V>} <.>
- (S
- (NP The/DT cat/NN)
- (VP
- (V sat/VBD)
- (PP (P on/IN) (NP the/DT mat/NN))
- (NP the/DT dog/NN))
- (VP (V chewed/VBD))
- ./.)
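The "Input" lines in the trace show that each stage sees the labels produced by the stage before it. The cascade can be imitated on a flat string of labels. This sketch tracks only the top-level label sequence and discards the nested tree. `STAGES`, `run_stage` and `cascade` are names made up for the example, and the patterns are hand-translated from the grammar above.

```python
import re

# (label, pattern) pairs in grammar order, written as ordinary regexes.
STAGES = [
    ("NP", r"(<DT>)?(<JJ>)*(<NN>)*"),
    ("P", r"<IN>"),
    ("V", r"<V[^<>]*>"),
    ("PP", r"<P><NP>"),
    ("VP", r"<V>(<NP>|<PP>)*"),
]


def run_stage(labels, label, pattern):
    """Replace each non-empty match of pattern with a single <label> node."""
    # The NP pattern can match the empty string; such matches are ignored.
    return re.sub(pattern, lambda m: "<%s>" % label if m.group(0) else "", labels)


def cascade(labels):
    """Return the label string before the first stage and after every stage."""
    history = [labels]
    for label, pattern in STAGES:
        labels = run_stage(labels, label, pattern)
        history.append(labels)
    return history


history = cascade("<DT><NN><VBD><IN><DT><NN><DT><NN><VBD><.>")
for labels in history:
    print(labels)
```

Because the PP stage runs before the VP stage, `<P><NP>` has already collapsed into `<PP>` by the time the verb pattern looks for its arguments, which is why the first VP can absorb both the PP and the following NP.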
- Test parsing of other rule types:
- >>> print(RegexpParser('''
- ... X:
- ... }<a><b>{ # chink rule
- ... <a>}{<b> # split rule
- ... <a>{}<b> # merge rule
- ... <a>{<b>}<c> # chunk rule w/ context
- ... '''))
- chunk.RegexpParser with 1 stages:
- RegexpChunkParser with 4 rules:
- chink rule <ChinkRule: '<a><b>'>
- split rule <SplitRule: '<a>', '<b>'>
- merge rule <MergeRule: '<a>', '<b>'>
- chunk rule w/ context <ChunkRuleWithContext: '<a>', '<b>', '<c>'>
- Illegal patterns give an error message:
- >>> print(RegexpParser('X: {<foo>} {<bar>}'))
- Traceback (most recent call last):
- . . .
- ValueError: Illegal chunk pattern: {<foo>} {<bar>}