.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT

==========
 Chunking
==========

    >>> from nltk.chunk import *
    >>> from nltk.chunk.util import *
    >>> from nltk.chunk.regexp import *
    >>> from nltk import Tree

    >>> tagged_text = "[ The/DT cat/NN ] sat/VBD on/IN [ the/DT mat/NN ] [ the/DT dog/NN ] chewed/VBD ./."
    >>> gold_chunked_text = tagstr2tree(tagged_text)
    >>> unchunked_text = gold_chunked_text.flatten()

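The bracketed input format used above can be illustrated without NLTK. The following is a minimal, standard-library sketch, not NLTK's `tagstr2tree`: the helper names `parse_bracketed` and `flatten` are hypothetical, and chunks are represented as plain `(label, tokens)` tuples rather than `Tree` objects.

```python
def parse_bracketed(text, label="NP"):
    """Parse '[ w/T ... ] w/T' into a list mixing (word, tag) tokens and chunks."""
    sentence = []
    chunk = None  # tokens of the currently open chunk, or None outside a chunk
    for token in text.split():
        if token == "[":
            chunk = []
        elif token == "]":
            sentence.append((label, chunk))
            chunk = None
        else:
            # Split on the last '/', so that './.' becomes ('.', '.').
            word, _, tag = token.rpartition("/")
            target = sentence if chunk is None else chunk
            target.append((word, tag))
    return sentence


def flatten(sentence):
    """Drop chunk boundaries, keeping the (word, tag) pairs in order."""
    flat = []
    for item in sentence:
        if isinstance(item[1], list):  # a (label, tokens) chunk
            flat.extend(item[1])
        else:                          # a bare (word, tag) token
            flat.append(item)
    return flat


tagged_text = "[ The/DT cat/NN ] sat/VBD on/IN [ the/DT mat/NN ] [ the/DT dog/NN ] chewed/VBD ./."
gold = parse_bracketed(tagged_text)
tokens = flatten(gold)
```

Here `gold` plays the role of the gold-standard chunking and `tokens` the role of the unchunked sentence that a chunker receives as input.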
Chunking uses a special regexp syntax for rules that delimit the chunks.  These
rules must be converted to 'regular' regular expressions before a sentence can
be chunked.

    >>> tag_pattern = "<DT>?<JJ>*<NN.*>"
    >>> regexp_pattern = tag_pattern2re_pattern(tag_pattern)
    >>> regexp_pattern
    '(<(DT)>)?(<(JJ)>)*(<(NN[^\\{\\}<>]*)>)'

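To make the conversion less opaque, here is a simplified sketch that reproduces the output shown above for this particular pattern. It is an approximation written for illustration, not NLTK's implementation: the name `tag_pattern_to_regex` is hypothetical, it performs no validation of the pattern, and it treats a `.` as escaped only when a single backslash directly precedes it.

```python
import re

# Character class standing in for "any character allowed inside a tag".
TAG_CHAR = r"[^\{\}<>]"


def tag_pattern_to_regex(tag_pattern):
    """Simplified conversion of a tag pattern into a plain regular expression."""
    pattern = re.sub(r"\s", "", tag_pattern)  # whitespace is not significant
    pattern = pattern.replace("<", "(<(")     # group each tag ...
    pattern = pattern.replace(">", ")>)")     # ... and the tag's contents
    # Replace each unescaped '.' so it cannot match across tag boundaries.
    # A function is used as the replacement so backslashes are kept literally.
    return re.sub(r"(?<!\\)\.", lambda match: TAG_CHAR, pattern)


regex = tag_pattern_to_regex("<DT>?<JJ>*<NN.*>")
```

The resulting expression can be applied to a string of concatenated tags: it matches `<DT><JJ><NNS>` in full, but not `<VBD>`.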
Construct some new chunking rules.

    >>> chunk_rule = ChunkRule("<.*>+", "Chunk everything")
    >>> chink_rule = ChinkRule(r"<VBD|IN|\.>", "Chink on verbs/prepositions")
    >>> split_rule = SplitRule("<DT><NN>", "<DT><NN>",
    ...                        "Split successive determiner/noun pairs")

Create and score a series of chunk parsers, successively more complex.

    >>> chunk_parser = RegexpChunkParser([chunk_rule], chunk_label='NP')
    >>> chunked_text = chunk_parser.parse(unchunked_text)
    >>> print(chunked_text)
    (S
      (NP
        The/DT
        cat/NN
        sat/VBD
        on/IN
        the/DT
        mat/NN
        the/DT
        dog/NN
        chewed/VBD
        ./.))

    >>> chunkscore = ChunkScore()
    >>> chunkscore.score(gold_chunked_text, chunked_text)
    >>> print(chunkscore.precision())
    0.0

    >>> print(chunkscore.recall())
    0.0

    >>> print(chunkscore.f_measure())
    0

    >>> for chunk in sorted(chunkscore.missed()): print(chunk)
    (NP The/DT cat/NN)
    (NP the/DT dog/NN)
    (NP the/DT mat/NN)

    >>> for chunk in chunkscore.incorrect(): print(chunk)
    (NP
      The/DT
      cat/NN
      sat/VBD
      on/IN
      the/DT
      mat/NN
      the/DT
      dog/NN
      chewed/VBD
      ./.)

    >>> chunk_parser = RegexpChunkParser([chunk_rule, chink_rule],
    ...                                  chunk_label='NP')
    >>> chunked_text = chunk_parser.parse(unchunked_text)
    >>> print(chunked_text)
    (S
      (NP The/DT cat/NN)
      sat/VBD
      on/IN
      (NP the/DT mat/NN the/DT dog/NN)
      chewed/VBD
      ./.)
    >>> assert chunked_text == chunk_parser.parse(list(unchunked_text))

    >>> chunkscore = ChunkScore()
    >>> chunkscore.score(gold_chunked_text, chunked_text)
    >>> chunkscore.precision()
    0.5

    >>> print(chunkscore.recall())
    0.33333333...

    >>> print(chunkscore.f_measure())
    0.4

    >>> for chunk in sorted(chunkscore.missed()): print(chunk)
    (NP the/DT dog/NN)
    (NP the/DT mat/NN)

    >>> for chunk in chunkscore.incorrect(): print(chunk)
    (NP the/DT mat/NN the/DT dog/NN)

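The scores above can be reproduced by hand. The sketch below is an illustration under stated assumptions rather than the `ChunkScore` implementation: chunks are modelled as half-open `(start, end)` token spans, the function name `chunk_scores` is hypothetical, and the F-measure is taken to be the weighted harmonic mean of precision and recall with weight `alpha=0.5`.

```python
def chunk_scores(gold, guessed, alpha=0.5):
    """Return (precision, recall, f_measure) for two sets of chunk spans."""
    correct = len(gold & guessed)
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision == 0 or recall == 0:
        f_measure = 0.0
    else:
        # Weighted harmonic mean of precision and recall.
        f_measure = 1 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, f_measure


# Token indices: The=0 cat=1 sat=2 on=3 the=4 mat=5 the=6 dog=7 chewed=8 .=9
gold = {(0, 2), (4, 6), (6, 8)}      # the three gold NP chunks
chunk_only = {(0, 10)}               # "chunk everything"
chunk_and_chink = {(0, 2), (4, 8)}   # after chinking verbs/prepositions

first = chunk_scores(gold, chunk_only)
second = chunk_scores(gold, chunk_and_chink)

missed = gold - chunk_and_chink      # gold chunks the parser did not find
incorrect = chunk_and_chink - gold   # guessed chunks that are not in the gold set
```

With one of two guessed chunks correct and one of three gold chunks found, precision is 1/2 and recall is 1/3, which gives the F-measure of 0.4 reported above.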
    >>> chunk_parser = RegexpChunkParser([chunk_rule, chink_rule, split_rule],
    ...                                  chunk_label='NP')
    >>> chunked_text = chunk_parser.parse(unchunked_text, trace=True)
    # Input:
    <DT> <NN> <VBD> <IN> <DT> <NN> <DT> <NN> <VBD> <.>
    # Chunk everything:
    {<DT> <NN> <VBD> <IN> <DT> <NN> <DT> <NN> <VBD> <.>}
    # Chink on verbs/prepositions:
    {<DT> <NN>} <VBD> <IN> {<DT> <NN> <DT> <NN>} <VBD> <.>
    # Split successive determiner/noun pairs:
    {<DT> <NN>} <VBD> <IN> {<DT> <NN>}{<DT> <NN>} <VBD> <.>
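The three trace steps can be imitated with ordinary regular-expression substitutions on a string of tags. This is a rough sketch of the idea, assuming tags are written without spaces between them; the helpers `chunk`, `chink`, and `split` are hypothetical stand-ins and are not claimed to match how NLTK's rule classes are written.

```python
import re

IN_CHUNK = r"(?=[^\{]*\})"        # the next brace to the right closes a chunk
IN_CHINK = r"(?=[^\}]*(?:\{|$))"  # the next brace opens a chunk, or none is left


def chunk(s, tag_re):
    """Wrap each run of unchunked tags matching tag_re in braces."""
    return re.sub("((?:" + tag_re + ")+)" + IN_CHINK, r"{\1}", s)


def chink(s, tag_re):
    """Cut runs of tags matching tag_re out of the chunk that contains them."""
    s = re.sub("((?:" + tag_re + ")+)" + IN_CHUNK, r"}\1{", s)
    return s.replace("{}", "")  # drop chunks left empty by the cut


def split(s, left_re, right_re):
    """Close and reopen the chunk wherever left_re is followed by right_re."""
    return re.sub("(" + left_re + ")(?=" + right_re + ")", r"\1}{", s)


tags = "<DT><NN><VBD><IN><DT><NN><DT><NN><VBD><.>"
step1 = chunk(tags, r"<[^<>{}]+>")
step2 = chink(step1, r"<(?:VBD|IN|\.)>")
step3 = split(step2, r"<DT><NN>", r"<DT><NN>")
```

Apart from spacing, `step1`, `step2`, and `step3` correspond to the three lines of the trace: one chunk covering everything, then two chunks separated by the chinked verb and preposition, then three determiner/noun chunks.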
    >>> print(chunked_text)
    (S
      (NP The/DT cat/NN)
      sat/VBD
      on/IN
      (NP the/DT mat/NN)
      (NP the/DT dog/NN)
      chewed/VBD
      ./.)

    >>> chunkscore = ChunkScore()
    >>> chunkscore.score(gold_chunked_text, chunked_text)
    >>> chunkscore.precision()
    1.0

    >>> chunkscore.recall()
    1.0

    >>> chunkscore.f_measure()
    1.0

    >>> chunkscore.missed()
    []

    >>> chunkscore.incorrect()
    []

    >>> chunk_parser.rules() # doctest: +NORMALIZE_WHITESPACE
    [<ChunkRule: '<.*>+'>, <ChinkRule: '<VBD|IN|\\.>'>,
     <SplitRule: '<DT><NN>', '<DT><NN>'>]

Printing parsers:

    >>> print(repr(chunk_parser))
    <RegexpChunkParser with 3 rules>
    >>> print(chunk_parser)
    RegexpChunkParser with 3 rules:
        Chunk everything
          <ChunkRule: '<.*>+'>
        Chink on verbs/prepositions
          <ChinkRule: '<VBD|IN|\\.>'>
        Split successive determiner/noun pairs
          <SplitRule: '<DT><NN>', '<DT><NN>'>

Regression Tests
~~~~~~~~~~~~~~~~

ChunkParserI
------------

`ChunkParserI` is an abstract interface -- it is not meant to be
instantiated directly.

    >>> ChunkParserI().parse([])
    Traceback (most recent call last):
      . . .
    NotImplementedError

ChunkString
-----------

ChunkString can be built from a tree of tagged tuples, a tree of
trees, or a mixed list of both:

    >>> t1 = Tree('S', [('w%d' % i, 't%d' % i) for i in range(10)])
    >>> t2 = Tree('S', [Tree('t0', []), Tree('t1', ['c1'])])
    >>> t3 = Tree('S', [('w0', 't0'), Tree('t1', ['c1'])])
    >>> ChunkString(t1)
    <ChunkString: '<t0><t1><t2><t3><t4><t5><t6><t7><t8><t9>'>
    >>> ChunkString(t2)
    <ChunkString: '<t0><t1>'>
    >>> ChunkString(t3)
    <ChunkString: '<t0><t1>'>

Other values generate an error:

    >>> ChunkString(Tree('S', ['x']))
    Traceback (most recent call last):
      . . .
    ValueError: chunk structures must contain tagged tokens or trees

The `str()` for a chunk string adds spaces to it, which makes it line
up with `str()` output for other chunk strings over the same
underlying input.

    >>> cs = ChunkString(t1)
    >>> print(cs)
    <t0> <t1> <t2> <t3> <t4> <t5> <t6> <t7> <t8> <t9>
    >>> cs.xform('<t3>', '{<t3>}')
    >>> print(cs)
    <t0> <t1> <t2> {<t3>} <t4> <t5> <t6> <t7> <t8> <t9>

The `_verify()` method makes sure that our transforms don't corrupt
the chunk string.  With debug_level set to 2 or higher, `_verify()` is
called at the end of every call to `xform`; the examples below use
debug_level=3, which also checks that the tags themselves are unchanged.

    >>> cs = ChunkString(t1, debug_level=3)

    >>> # tag not marked with <...>:
    >>> cs.xform('<t3>', 't3')
    Traceback (most recent call last):
      . . .
    ValueError: Transformation generated invalid chunkstring:
      <t0><t1><t2>t3<t4><t5><t6><t7><t8><t9>

    >>> # brackets not balanced:
    >>> cs.xform('<t3>', '{<t3>')
    Traceback (most recent call last):
      . . .
    ValueError: Transformation generated invalid chunkstring:
      <t0><t1><t2>{<t3><t4><t5><t6><t7><t8><t9>

    >>> # nested brackets:
    >>> cs.xform('<t3><t4><t5>', '{<t3>{<t4>}<t5>}')
    Traceback (most recent call last):
      . . .
    ValueError: Transformation generated invalid chunkstring:
      <t0><t1><t2>{<t3>{<t4>}<t5>}<t6><t7><t8><t9>

    >>> # modified tags:
    >>> cs.xform('<t3>', '<t9>')
    Traceback (most recent call last):
      . . .
    ValueError: Transformation generated invalid chunkstring: tag changed

    >>> # added tags:
    >>> cs.xform('<t9>', '<t9><t10>')
    Traceback (most recent call last):
      . . .
    ValueError: Transformation generated invalid chunkstring: tag changed

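The five failure cases above amount to three checks: every tag is still wrapped in `<...>`, braces are balanced and not nested, and the sequence of tags is unchanged. A minimal standard-library sketch of such a check follows; the function `verify` and its error messages are hypothetical and only approximate what `_verify()` is described as doing.

```python
import re

TAG = r"<[^<>{}]+>"  # one tag, e.g. <t3>


def verify(chunk_str, original_tags):
    """Raise ValueError if chunk_str is not a valid chunking of original_tags."""
    # 1. The string must consist only of tags, each optionally braced.
    if not re.fullmatch(r"(?:\{?" + TAG + r"\}?)*", chunk_str):
        raise ValueError("invalid chunk string: " + chunk_str)
    # 2. With everything but braces removed, only '{}' pairs may remain;
    #    this rejects both unbalanced and nested braces.
    braces = re.sub(r"[^{}]", "", chunk_str)
    if not re.fullmatch(r"(?:\{\})*", braces):
        raise ValueError("unbalanced or nested braces: " + chunk_str)
    # 3. The tags themselves must be the same, in the same order.
    if re.findall(TAG, chunk_str) != original_tags:
        raise ValueError("tag changed")


tags = ["<t%d>" % i for i in range(10)]
verify("<t0><t1><t2>{<t3>}<t4><t5><t6><t7><t8><t9>", tags)  # valid: no error
```

Each of the invalid strings from the examples above, passed to `verify` with the same `tags`, raises `ValueError` at one of the three numbered checks.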
Chunking Rules
--------------

Test the different rule constructors & __repr__ methods:

    >>> r1 = RegexpChunkRule('<a|b>'+ChunkString.IN_CHINK_PATTERN,
    ...                      '{<a|b>}', 'chunk <a> and <b>')
    >>> r2 = RegexpChunkRule(re.compile('<a|b>'+ChunkString.IN_CHINK_PATTERN),
    ...                      '{<a|b>}', 'chunk <a> and <b>')
    >>> r3 = ChunkRule('<a|b>', 'chunk <a> and <b>')
    >>> r4 = ChinkRule('<a|b>', 'chink <a> and <b>')
    >>> r5 = UnChunkRule('<a|b>', 'unchunk <a> and <b>')
    >>> r6 = MergeRule('<a>', '<b>', 'merge <a> w/ <b>')
    >>> r7 = SplitRule('<a>', '<b>', 'split <a> from <b>')
    >>> r8 = ExpandLeftRule('<a>', '<b>', 'expand left <a> <b>')
    >>> r9 = ExpandRightRule('<a>', '<b>', 'expand right <a> <b>')
    >>> for rule in r1, r2, r3, r4, r5, r6, r7, r8, r9:
    ...     print(rule)
    <RegexpChunkRule: '<a|b>(?=[^\\}]*(\\{|$))'->'{<a|b>}'>
    <RegexpChunkRule: '<a|b>(?=[^\\}]*(\\{|$))'->'{<a|b>}'>
    <ChunkRule: '<a|b>'>
    <ChinkRule: '<a|b>'>
    <UnChunkRule: '<a|b>'>
    <MergeRule: '<a>', '<b>'>
    <SplitRule: '<a>', '<b>'>
    <ExpandLeftRule: '<a>', '<b>'>
    <ExpandRightRule: '<a>', '<b>'>

`tag_pattern2re_pattern()` complains if the tag pattern looks problematic:

    >>> tag_pattern2re_pattern('{}')
    Traceback (most recent call last):
      . . .
    ValueError: Bad tag pattern: '{}'

RegexpChunkParser
-----------------

A warning is printed when parsing an empty sentence:

    >>> parser = RegexpChunkParser([ChunkRule('<a>', '')])
    >>> parser.parse(Tree('S', []))
    Warning: parsing empty text
    Tree('S', [])

RegexpParser
------------

    >>> parser = RegexpParser('''
    ... NP: {<DT>? <JJ>* <NN>*} # NP
    ... P: {<IN>} # Preposition
    ... V: {<V.*>} # Verb
    ... PP: {<P> <NP>} # PP -> P NP
    ... VP: {<V> <NP|PP>*} # VP -> V (NP|PP)*
    ... ''')
    >>> print(repr(parser))
    <chunk.RegexpParser with 5 stages>
    >>> print(parser)
    chunk.RegexpParser with 5 stages:
    RegexpChunkParser with 1 rules:
        NP <ChunkRule: '<DT>? <JJ>* <NN>*'>
    RegexpChunkParser with 1 rules:
        Preposition <ChunkRule: '<IN>'>
    RegexpChunkParser with 1 rules:
        Verb <ChunkRule: '<V.*>'>
    RegexpChunkParser with 1 rules:
        PP -> P NP <ChunkRule: '<P> <NP>'>
    RegexpChunkParser with 1 rules:
        VP -> V (NP|PP)* <ChunkRule: '<V> <NP|PP>*'>
    >>> print(parser.parse(unchunked_text, trace=True))
    # Input:
    <DT> <NN> <VBD> <IN> <DT> <NN> <DT> <NN> <VBD> <.>
    # NP:
    {<DT> <NN>} <VBD> <IN> {<DT> <NN>}{<DT> <NN>} <VBD> <.>
    # Input:
    <NP> <VBD> <IN> <NP> <NP> <VBD> <.>
    # Preposition:
    <NP> <VBD> {<IN>} <NP> <NP> <VBD> <.>
    # Input:
    <NP> <VBD> <P> <NP> <NP> <VBD> <.>
    # Verb:
    <NP> {<VBD>} <P> <NP> <NP> {<VBD>} <.>
    # Input:
    <NP> <V> <P> <NP> <NP> <V> <.>
    # PP -> P NP:
    <NP> <V> {<P> <NP>} <NP> <V> <.>
    # Input:
    <NP> <V> <PP> <NP> <V> <.>
    # VP -> V (NP|PP)*:
    <NP> {<V> <PP> <NP>}{<V>} <.>
    (S
      (NP The/DT cat/NN)
      (VP
        (V sat/VBD)
        (PP (P on/IN) (NP the/DT mat/NN))
        (NP the/DT dog/NN))
      (VP (V chewed/VBD))
      ./.)

Test parsing of other rule types:

    >>> print(RegexpParser('''
    ... X:
    ...   }<a><b>{ # chink rule
    ...   <a>}{<b> # split rule
    ...   <a>{}<b> # merge rule
    ...   <a>{<b>}<c> # chunk rule w/ context
    ... '''))
    chunk.RegexpParser with 1 stages:
    RegexpChunkParser with 4 rules:
        chink rule <ChinkRule: '<a><b>'>
        split rule <SplitRule: '<a>', '<b>'>
        merge rule <MergeRule: '<a>', '<b>'>
        chunk rule w/ context <ChunkRuleWithContext: '<a>', '<b>', '<c>'>

Illegal patterns give an error message:

    >>> print(RegexpParser('X: {<foo>} {<bar>}'))
    Traceback (most recent call last):
      . . .
    ValueError: Illegal chunk pattern: {<foo>} {<bar>}
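The grammar examples above suggest that the kind of rule is determined by where the braces sit in the rule string. The sketch below classifies a single rule line on that basis. It is an assumption-laden illustration: `classify_rule` is a hypothetical helper, it expects the stage label (such as `X:`) to have been removed already, and it splits the comment at the first `#` without handling escapes.

```python
import re


def classify_rule(line):
    """Return (kind, comment) for one rule line of a chunk grammar."""
    rule, _, comment = line.partition("#")
    rule, comment = rule.strip(), comment.strip()
    if re.fullmatch(r"\{[^{}]+\}", rule):
        kind = "chunk"               # {pattern}
    elif re.fullmatch(r"\}[^{}]+\{", rule):
        kind = "chink"               # }pattern{
    elif re.fullmatch(r"[^{}]+\}\{[^{}]+", rule):
        kind = "split"               # left}{right
    elif re.fullmatch(r"[^{}]+\{\}[^{}]+", rule):
        kind = "merge"               # left{}right
    elif re.fullmatch(r"[^{}]*\{[^{}]+\}[^{}]*", rule):
        kind = "chunk with context"  # left{pattern}right
    else:
        raise ValueError("Illegal chunk pattern: " + rule)
    return kind, comment


kinds = [classify_rule(line)[0] for line in (
    "}<a><b>{ # chink rule",
    "<a>}{<b> # split rule",
    "<a>{}<b> # merge rule",
    "<a>{<b>}<c> # chunk rule w/ context",
)]
```

Under these assumptions, a string such as `{<foo>} {<bar>}` matches none of the shapes and is rejected, in line with the error shown above.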