propbank.doctest 6.5 KB

  1. .. Copyright (C) 2001-2019 NLTK Project
  2. .. For license information, see LICENSE.TXT
  3. ========
  4. PropBank
  5. ========
  6. The PropBank Corpus provides predicate-argument annotation for the
  7. entire Penn Treebank. Each verb in the treebank is annotated by a single
  8. instance in PropBank, containing information about the location of
  9. the verb, and the location and identity of its arguments:
  10. >>> from nltk.corpus import propbank
  11. >>> pb_instances = propbank.instances()
  12. >>> print(pb_instances) # doctest: +NORMALIZE_WHITESPACE
  13. [<PropbankInstance: wsj_0001.mrg, sent 0, word 8>,
  14. <PropbankInstance: wsj_0001.mrg, sent 1, word 10>, ...]
  15. Each PropBank instance defines the following member variables:
  16. - Location information: `fileid`, `sentnum`, `wordnum`
  17. - Annotator information: `tagger`
  18. - Inflection information: `inflection`
  19. - Roleset identifier: `roleset`
  20. - Verb (aka predicate) location: `predicate`
  21. - Argument locations and types: `arguments`
  22. The following examples show the types of these member variables:
  23. >>> inst = pb_instances[103]
  24. >>> (inst.fileid, inst.sentnum, inst.wordnum)
  25. ('wsj_0004.mrg', 8, 16)
  26. >>> inst.tagger
  27. 'gold'
  28. >>> inst.inflection
  29. <PropbankInflection: vp--a>
  30. >>> infl = inst.inflection
  31. >>> infl.form, infl.tense, infl.aspect, infl.person, infl.voice
  32. ('v', 'p', '-', '-', 'a')
  33. >>> inst.roleset
  34. 'rise.01'
  35. >>> inst.predicate
  36. PropbankTreePointer(16, 0)
  37. >>> inst.arguments # doctest: +NORMALIZE_WHITESPACE
  38. ((PropbankTreePointer(0, 2), 'ARG1'),
  39. (PropbankTreePointer(13, 1), 'ARGM-DIS'),
  40. (PropbankTreePointer(17, 1), 'ARG4-to'),
  41. (PropbankTreePointer(20, 1), 'ARG3-from'))
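The printed inflection `vp--a` and the five attribute values above suggest that an inflection is a five-character string with one character per field. The following is a minimal sketch of that decomposition, assuming this fixed layout. `Inflection` and `parse_inflection` are hypothetical names used only for illustration; they are not part of NLTK.

```python
from collections import namedtuple

# Hypothetical stand-in for the five fields shown above; not NLTK's own class.
Inflection = namedtuple("Inflection", "form tense aspect person voice")


def parse_inflection(text):
    """Split a five-character inflection string into named fields."""
    if len(text) != 5:
        raise ValueError("expected 5 characters, got %r" % (text,))
    # One character per field, in the order form, tense, aspect, person, voice.
    return Inflection(*text)


infl = parse_inflection("vp--a")
print(infl.form, infl.tense, infl.aspect, infl.person, infl.voice)
```

A `-` in a field is kept as-is here; the sketch does not interpret what the individual letters mean.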
  42. The locations of the predicate and of the arguments are encoded using
  43. `PropbankTreePointer` objects, as well as `PropbankChainTreePointer`
  44. objects and `PropbankSplitTreePointer` objects. A
  45. `PropbankTreePointer` consists of a `wordnum` and a `height`:
  46. >>> print(inst.predicate.wordnum, inst.predicate.height)
  47. 16 0
  48. This identifies the tree constituent that is headed by the word that
  49. is the `wordnum`\ 'th token in the sentence, and whose span is found
  50. by going `height` nodes up in the tree. This type of pointer is only
  51. useful if we also have the corresponding tree structure, since it
  52. includes empty elements such as traces in the word number count. The
  53. trees for 10% of the standard PropBank Corpus are contained in the
  54. `treebank` corpus:
  55. >>> tree = inst.tree
  56. >>> from nltk.corpus import treebank
  57. >>> assert tree == treebank.parsed_sents(inst.fileid)[inst.sentnum]
  58. >>> inst.predicate.select(tree)
  59. Tree('VBD', ['rose'])
  60. >>> for (argloc, argid) in inst.arguments:
  61. ...     print('%-10s %s' % (argid, argloc.select(tree).pformat(500)[:50]))
  62. ARG1       (NP-SBJ (NP (DT The) (NN yield)) (PP (IN on) (NP (
  63. ARGM-DIS   (PP (IN for) (NP (NN example)))
  64. ARG4-to    (PP-DIR (TO to) (NP (CD 8.04) (NN %)))
  65. ARG3-from  (PP-DIR (IN from) (NP (CD 7.90) (NN %)))
  66. PropBank tree pointers can be converted to standard tree locations,
  67. which are usually easier to work with, using the `treepos()` method:
  68. >>> treepos = inst.predicate.treepos(tree)
  69. >>> print(treepos, tree[treepos])
  70. (4, 0) (VBD rose)
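The relationship between a `(wordnum, height)` pointer and a tree position can be illustrated without the corpus. The sketch below is an assumption-laden illustration, not NLTK's implementation: trees are plain nested tuples of the form `(label, child, ...)`, leaves are strings, and the helper names are hypothetical. It finds the position of the `wordnum`'th leaf and then drops `height + 1` trailing indices, which is consistent with the example above, where a pointer of height 0 selects the preterminal `(VBD rose)`.

```python
def leaf_positions(tree, prefix=()):
    """Yield the tree position of every leaf, left to right."""
    _label, *children = tree
    for i, child in enumerate(children):
        if isinstance(child, str):
            yield prefix + (i,)
        else:
            yield from leaf_positions(child, prefix + (i,))


def pointer_to_position(tree, wordnum, height):
    """Convert a (wordnum, height) pointer into a tree position."""
    leaf_pos = list(leaf_positions(tree))[wordnum]
    if height >= len(leaf_pos):
        raise ValueError("height goes above the root of the tree")
    # Dropping one index reaches the preterminal; each extra index is one node up.
    return leaf_pos[: len(leaf_pos) - height - 1]


def subtree_at(tree, position):
    """Follow a tree position down from the root."""
    node = tree
    for i in position:
        node = node[i + 1]  # slot 0 of each tuple holds the label
    return node


toy = ("S",
       ("NP", ("DT", "The"), ("NN", "yield")),
       ("VP", ("VBD", "rose")))

pos = pointer_to_position(toy, wordnum=2, height=0)
print(pos, subtree_at(toy, pos))
```

In this toy tree, leaf counting includes every leaf, so a real tree would need its empty elements (such as traces) left in place for the word numbers to line up.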
  71. In some cases, argument locations will be encoded using
  72. `PropbankChainTreePointer`\ s (for trace chains) or
  73. `PropbankSplitTreePointer`\ s (for discontinuous constituents). Both
  74. of these objects contain a single member variable, `pieces`,
  75. containing a list of the constituent pieces. They also define the
  76. method `select()`, which will return a tree containing all the
  77. elements of the argument. (A new head node is created, labeled
  78. "*CHAIN*" or "*SPLIT*", since the argument is not a single constituent
  79. in the original tree.) Instance #6 contains an example of an argument
  80. that is both discontinuous and contains a chain:
  81. >>> inst = pb_instances[6]
  82. >>> inst.roleset
  83. 'expose.01'
  84. >>> argloc, argid = inst.arguments[2]
  85. >>> argloc
  86. <PropbankChainTreePointer: 22:1,24:0,25:1*27:0>
  87. >>> argloc.pieces
  88. [<PropbankSplitTreePointer: 22:1,24:0,25:1>, PropbankTreePointer(27, 0)]
  89. >>> argloc.pieces[0].pieces
  90. ... # doctest: +NORMALIZE_WHITESPACE
  91. [PropbankTreePointer(22, 1), PropbankTreePointer(24, 0),
  92. PropbankTreePointer(25, 1)]
  93. >>> print(argloc.select(inst.tree))
  94. (*CHAIN*
  95.   (*SPLIT* (NP (DT a) (NN group)) (IN of) (NP (NNS workers)))
  96.   (-NONE- *))
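The printed form `22:1,24:0,25:1*27:0` and the `pieces` lists above suggest a simple nesting: `*` separates the pieces of a chain, `,` separates the pieces of a split, and `wordnum:height` is a single pointer. The following sketch parses that notation into nested Python values under exactly those assumptions. `parse_pointer` is a hypothetical helper, and the `("CHAIN", ...)` and `("SPLIT", ...)` tuples are illustrative stand-ins for the pointer classes.

```python
def parse_pointer(text):
    """Parse pointer notation such as '22:1,24:0,25:1*27:0' into nested values."""
    # A chain binds most loosely, so split on '*' first.
    chain_pieces = text.split("*")
    if len(chain_pieces) > 1:
        return ("CHAIN", [parse_pointer(piece) for piece in chain_pieces])

    # Within one chain piece, ',' joins the parts of a discontinuous constituent.
    split_pieces = text.split(",")
    if len(split_pieces) > 1:
        return ("SPLIT", [parse_pointer(piece) for piece in split_pieces])

    # A plain pointer is 'wordnum:height'.
    wordnum, height = text.split(":")
    return (int(wordnum), int(height))


print(parse_pointer("22:1,24:0,25:1*27:0"))
```

The nesting mirrors the `pieces` output above: the chain has two pieces, the first of which is a split with three plain pointers.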
  97. The PropBank Corpus also provides access to the frameset files, which
  98. define the argument labels used by the annotations, on a per-verb
  99. basis. Each frameset file contains one or more predicates, such as
  100. 'turn' or 'turn_on', each of which is divided into coarse-grained word
  101. senses called rolesets. For each roleset, the frameset file provides
  102. descriptions of the argument roles, along with examples.
  103. >>> expose_01 = propbank.roleset('expose.01')
  104. >>> turn_01 = propbank.roleset('turn.01')
  105. >>> print(turn_01) # doctest: +ELLIPSIS
  106. <Element 'roleset' at ...>
  107. >>> for role in turn_01.findall("roles/role"):
  108. ...     print(role.attrib['n'], role.attrib['descr'])
  109. 0 turner
  110. 1 thing turning
  111. m direction, location
  112. >>> from xml.etree import ElementTree
  113. >>> print(ElementTree.tostring(turn_01.find('example')).decode('utf8').strip())
  114. <example name="transitive agentive">
  115. <text>
  116. John turned the key in the lock.
  117. </text>
  118. <arg n="0">John</arg>
  119. <rel>turned</rel>
  120. <arg n="1">the key</arg>
  121. <arg f="LOC" n="m">in the lock</arg>
  122. </example>
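Because a roleset is an ordinary `ElementTree` element, its role descriptions can be collected with standard XML queries. The sketch below runs on an inline XML fragment so that it does not need the corpus. The fragment is an illustrative reconstruction that assumes only the `roles/role` path and the `n` and `descr` attributes used above; the `id` attribute and the helper name `role_descriptions` are assumptions.

```python
from xml.etree import ElementTree

# Illustrative fragment; the real frameset file contains more than this.
ROLESET_XML = """
<roleset id="turn.01">
  <roles>
    <role n="0" descr="turner"/>
    <role n="1" descr="thing turning"/>
    <role n="m" descr="direction, location"/>
  </roles>
</roleset>
"""


def role_descriptions(roleset):
    """Map each role number (the 'n' attribute) to its description."""
    return {
        role.attrib["n"]: role.attrib["descr"]
        for role in roleset.findall("roles/role")
    }


roleset = ElementTree.fromstring(ROLESET_XML.strip())
roles = role_descriptions(roleset)
for n, descr in roles.items():
    print("ARG%s: %s" % (n.upper(), descr))
```

The same function should work unchanged on an element returned by `propbank.roleset()`, provided it has the structure assumed here.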
  123. Note that the standard corpus distribution only contains 10% of the
  124. treebank, so the parse trees are not available for instances starting
  125. at 9353:
  126. >>> inst = pb_instances[9352]
  127. >>> inst.fileid
  128. 'wsj_0199.mrg'
  129. >>> print(inst.tree) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
  130. (S (NP-SBJ (NNP Trinity)) (VP (VBD said) (SBAR (-NONE- 0) ...))
  131. >>> print(inst.predicate.select(inst.tree))
  132. (VB begin)
  133. >>> inst = pb_instances[9353]
  134. >>> inst.fileid
  135. 'wsj_0200.mrg'
  136. >>> print(inst.tree)
  137. None
  138. >>> print(inst.predicate.select(inst.tree))
  139. Traceback (most recent call last):
  140. . . .
  141. ValueError: Parse tree not avaialable
  142. However, if you supply your own version of the treebank corpus (by
  143. putting it before the nltk-provided version on `nltk.data.path`, or
  144. by creating a `ptb` corpus directory and using the
  145. `propbank_ptb` module), then you can access the trees for all
  146. instances.
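When only the standard distribution is installed, code that loops over all instances needs to cope with `inst.tree` being `None`. One possible pattern, sketched here with a stub in place of a real pointer, is to check for the missing tree before calling `select()`. `StubPointer` and `select_or_none` are hypothetical names introduced for this illustration only.

```python
class StubPointer:
    """Stand-in for a tree pointer: select() fails when no tree is given."""

    def select(self, tree):
        if tree is None:
            raise ValueError("parse tree not available")
        return tree


def select_or_none(pointer, tree):
    """Return pointer.select(tree), or None when the tree is missing."""
    if tree is None:
        return None
    return pointer.select(tree)


print(select_or_none(StubPointer(), None))
print(select_or_none(StubPointer(), ("VB", "begin")))
```

With real instances, the call would take the form `select_or_none(inst.predicate, inst.tree)`.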
  147. A list of the verb lemmas contained in PropBank is returned by the
  148. `propbank.verbs()` method:
  149. >>> propbank.verbs()
  150. ['abandon', 'abate', 'abdicate', 'abet', 'abide', ...]