- .. Copyright (C) 2001-2019 NLTK Project
- .. For license information, see LICENSE.TXT
- ========
- PropBank
- ========
- The PropBank Corpus provides predicate-argument annotation for the
- entire Penn Treebank. Each verb in the treebank is annotated by a single
- instance in PropBank, containing information about the location of
- the verb, and the location and identity of its arguments:
- >>> from nltk.corpus import propbank
- >>> pb_instances = propbank.instances()
- >>> print(pb_instances) # doctest: +NORMALIZE_WHITESPACE
- [<PropbankInstance: wsj_0001.mrg, sent 0, word 8>,
- <PropbankInstance: wsj_0001.mrg, sent 1, word 10>, ...]
- Each PropBank instance defines the following member variables:
- - Location information: `fileid`, `sentnum`, `wordnum`
- - Annotator information: `tagger`
- - Inflection information: `inflection`
- - Roleset identifier: `roleset`
- - Verb (aka predicate) location: `predicate`
- - Argument locations and types: `arguments`
- The following examples show the types of these member variables:
- >>> inst = pb_instances[103]
- >>> (inst.fileid, inst.sentnum, inst.wordnum)
- ('wsj_0004.mrg', 8, 16)
- >>> inst.tagger
- 'gold'
- >>> inst.inflection
- <PropbankInflection: vp--a>
- >>> infl = inst.inflection
- >>> infl.form, infl.tense, infl.aspect, infl.person, infl.voice
- ('v', 'p', '-', '-', 'a')
- >>> inst.roleset
- 'rise.01'
- >>> inst.predicate
- PropbankTreePointer(16, 0)
- >>> inst.arguments # doctest: +NORMALIZE_WHITESPACE
- ((PropbankTreePointer(0, 2), 'ARG1'),
- (PropbankTreePointer(13, 1), 'ARGM-DIS'),
- (PropbankTreePointer(17, 1), 'ARG4-to'),
- (PropbankTreePointer(20, 1), 'ARG3-from'))
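The five-character inflection code shown above (`vp--a`) packs form, tense, aspect, person and voice into one string. The following is a minimal sketch of how such a code could be expanded into readable values. It does not use NLTK; the function name `describe_inflection` is hypothetical, and the lookup tables are an assumption based on the constants defined on `PropbankInflection`, so check them against your installed version.

```python
# Lookup tables are an assumption modeled on the PropbankInflection constants;
# they are not read from NLTK at runtime.
FIELDS = {
    "form": {"i": "infinitive", "g": "gerund", "p": "participle", "v": "finite"},
    "tense": {"f": "future", "p": "past", "n": "present"},
    "aspect": {"p": "perfect", "o": "progressive", "b": "perfect and progressive"},
    "person": {"3": "third person"},
    "voice": {"a": "active", "p": "passive"},
}


def describe_inflection(code):
    """Map a five-character inflection code to readable field values."""
    if len(code) != 5:
        raise ValueError("expected 5 characters, got %r" % (code,))
    described = {}
    for (field, table), char in zip(FIELDS.items(), code):
        if char == "-":
            # '-' marks a field that is unspecified for this instance.
            described[field] = None
        else:
            described[field] = table.get(char, "unknown (%s)" % char)
    return described


print(describe_inflection("vp--a"))
```

Under these assumed tables, `vp--a` reads as a finite, past-tense, active verb with aspect and person unspecified, which agrees with the predicate `rose` in this instance.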
- The locations of the predicate and of the arguments are encoded using
- `PropbankTreePointer` objects, as well as `PropbankChainTreePointer`
- objects and `PropbankSplitTreePointer` objects. A
- `PropbankTreePointer` consists of a `wordnum` and a `height`:
- >>> print(inst.predicate.wordnum, inst.predicate.height)
- 16 0
- This identifies the tree constituent that is headed by the word that
- is the `wordnum`\ 'th token in the sentence, and whose span is found
- by going `height` nodes up in the tree. This type of pointer is only
- useful if we also have the corresponding tree structure, since it
- includes empty elements such as traces in the word number count. The
- trees for 10% of the standard PropBank Corpus are contained in the
- `treebank` corpus:
- >>> tree = inst.tree
- >>> from nltk.corpus import treebank
- >>> assert tree == treebank.parsed_sents(inst.fileid)[inst.sentnum]
- >>> inst.predicate.select(tree)
- Tree('VBD', ['rose'])
- >>> for (argloc, argid) in inst.arguments:
- ... print('%-10s %s' % (argid, argloc.select(tree).pformat(500)[:50]))
- ARG1 (NP-SBJ (NP (DT The) (NN yield)) (PP (IN on) (NP (
- ARGM-DIS (PP (IN for) (NP (NN example)))
- ARG4-to (PP-DIR (TO to) (NP (CD 8.04) (NN %)))
- ARG3-from (PP-DIR (IN from) (NP (CD 7.90) (NN %)))
- PropBank tree pointers can be converted to standard tree locations,
- which are usually easier to work with, using the `treepos()` method:
- >>> treepos = inst.predicate.treepos(tree)
- >>> print(treepos, tree[treepos])
- (4, 0) (VBD rose)
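The relationship between a `(wordnum, height)` pointer and a tree position can be illustrated without the corpus. The sketch below is not NLTK's implementation: it uses a hypothetical toy tree built from `(label, children)` pairs instead of `nltk.Tree`, and the helper names are made up for illustration. It finds the `wordnum`-th leaf, drops the leaf index to reach the part-of-speech node, and then climbs `height` further nodes. The toy tree contains no traces; in the real treebank trees, empty elements count as leaves too.

```python
# A toy tree as (label, children) pairs; leaves are plain strings.
# It stands in for nltk.Tree and is NOT the treebank parse of the sentence.
TOY = ("S", [
    ("NP", [("DT", ["The"]), ("NN", ["yield"])]),
    ("VP", [
        ("VBD", ["rose"]),
        ("PP", [("TO", ["to"]), ("NP", [("CD", ["8.04"]), ("NN", ["%"])])]),
    ]),
])


def leaf_positions(tree, prefix=()):
    """Yield the tree position (tuple of child indices) of each leaf, left to right."""
    _label, children = tree
    for i, child in enumerate(children):
        if isinstance(child, str):
            yield prefix + (i,)
        else:
            yield from leaf_positions(child, prefix + (i,))


def pointer_to_treepos(tree, wordnum, height):
    """Locate the wordnum-th leaf, drop the leaf index, then climb `height` nodes."""
    leaf = list(leaf_positions(tree))[wordnum]
    return leaf[: len(leaf) - height - 1]


def subtree(tree, pos):
    """Follow a tree position down from the root."""
    for i in pos:
        tree = tree[1][i]
    return tree


verb_pos = pointer_to_treepos(TOY, 2, 0)   # word 2 is 'rose', height 0
print(verb_pos, subtree(TOY, verb_pos))

pp_pos = pointer_to_treepos(TOY, 3, 1)     # word 3 is 'to', one node up
print(pp_pos, subtree(TOY, pp_pos)[0])
```

With height 0 the pointer selects the part-of-speech node directly above the word, which matches the `(VBD rose)` result in the doctest above; each additional unit of height moves one node closer to the root.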
- In some cases, argument locations will be encoded using
- `PropbankChainTreePointer`\ s (for trace chains) or
- `PropbankSplitTreePointer`\ s (for discontinuous constituents). Both
- of these objects contain a single member variable, `pieces`,
- containing a list of the constituent pieces. They also define the
- method `select()`, which will return a tree containing all the
- elements of the argument. (A new head node is created, labeled
- "*CHAIN*" or "*SPLIT*", since the argument is not a single constituent
- in the original tree.) Instance #6 contains an example of an argument
- that is both discontinuous and contains a chain:
- >>> inst = pb_instances[6]
- >>> inst.roleset
- 'expose.01'
- >>> argloc, argid = inst.arguments[2]
- >>> argloc
- <PropbankChainTreePointer: 22:1,24:0,25:1*27:0>
- >>> argloc.pieces
- [<PropbankSplitTreePointer: 22:1,24:0,25:1>, PropbankTreePointer(27, 0)]
- >>> argloc.pieces[0].pieces
- ... # doctest: +NORMALIZE_WHITESPACE
- [PropbankTreePointer(22, 1), PropbankTreePointer(24, 0),
- PropbankTreePointer(25, 1)]
- >>> print(argloc.select(inst.tree))
- (*CHAIN*
- (*SPLIT* (NP (DT a) (NN group)) (IN of) (NP (NNS workers)))
- (-NONE- *))
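The notation visible in the pointer's repr above (`22:1,24:0,25:1*27:0`) nests split pieces inside chain pieces. As a rough sketch of how that notation could be read, assuming `*` separates chain pieces, `,` separates split pieces, and `wordnum:height` is a plain pointer, the hypothetical function `parse_pointer` below returns nested tuples rather than NLTK pointer objects.

```python
def parse_pointer(text):
    """Parse pointer notation into nested tuples.

    Returns ('chain', [...]), ('split', [...]) or ('tree', wordnum, height).
    """
    # '*' is handled first, so a chain may contain split pieces.
    if "*" in text:
        return ("chain", [parse_pointer(part) for part in text.split("*")])
    if "," in text:
        return ("split", [parse_pointer(part) for part in text.split(",")])
    wordnum, height = text.split(":")
    return ("tree", int(wordnum), int(height))


parsed = parse_pointer("22:1,24:0,25:1*27:0")
print(parsed)
```

The result has the same shape as `argloc.pieces` in the doctest: a chain whose first piece is a split of three plain pointers and whose second piece is the plain pointer `27:0`.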
- The PropBank Corpus also provides access to the frameset files, which
- define the argument labels used by the annotations, on a per-verb
- basis. Each frameset file contains one or more predicates, such as
- 'turn' or 'turn_on', each of which is divided into coarse-grained word
- senses called rolesets. For each roleset, the frameset file provides
- descriptions of the argument roles, along with examples.
- >>> expose_01 = propbank.roleset('expose.01')
- >>> turn_01 = propbank.roleset('turn.01')
- >>> print(turn_01) # doctest: +ELLIPSIS
- <Element 'roleset' at ...>
- >>> for role in turn_01.findall("roles/role"):
- ... print(role.attrib['n'], role.attrib['descr'])
- 0 turner
- 1 thing turning
- m direction, location
- >>> from xml.etree import ElementTree
- >>> print(ElementTree.tostring(turn_01.find('example')).decode('utf8').strip())
- <example name="transitive agentive">
- <text>
- John turned the key in the lock.
- </text>
- <arg n="0">John</arg>
- <rel>turned</rel>
- <arg n="1">the key</arg>
- <arg f="LOC" n="m">in the lock</arg>
- </example>
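Because a roleset is an ordinary `ElementTree` element, its role descriptions can be joined with the arguments of an example. The sketch below runs without the corpus by parsing a hand-written XML fragment modeled on the `turn.01` output shown above; the fragment is an assumption about the relevant subset of the file, and the real frameset carries more attributes and examples.

```python
from xml.etree import ElementTree

# Hand-written fragment modeled on the turn.01 output shown above;
# it is not a copy of the real frameset file.
ROLESET_XML = """
<roleset>
  <roles>
    <role n="0" descr="turner"/>
    <role n="1" descr="thing turning"/>
    <role n="m" descr="direction, location"/>
  </roles>
  <example name="transitive agentive">
    <text>John turned the key in the lock.</text>
    <arg n="0">John</arg>
    <rel>turned</rel>
    <arg n="1">the key</arg>
    <arg f="LOC" n="m">in the lock</arg>
  </example>
</roleset>
"""

roleset = ElementTree.fromstring(ROLESET_XML)

# Index role descriptions by their 'n' attribute.
descriptions = {
    role.get("n"): role.get("descr") for role in roleset.findall("roles/role")
}

# Pair each argument of the example with the description of its role.
labeled = [
    (arg.text, descriptions[arg.get("n")])
    for arg in roleset.find("example").findall("arg")
]

for text, descr in labeled:
    print("%-12s %s" % (text, descr))
```

With the corpus installed, the same lookups should apply to the element returned by `propbank.roleset('turn.01')` in place of the parsed fragment.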
- Note that the standard corpus distribution only contains 10% of the
- treebank, so the parse trees are not available for instances starting
- at 9353:
- >>> inst = pb_instances[9352]
- >>> inst.fileid
- 'wsj_0199.mrg'
- >>> print(inst.tree) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
- (S (NP-SBJ (NNP Trinity)) (VP (VBD said) (SBAR (-NONE- 0) ...))
- >>> print(inst.predicate.select(inst.tree))
- (VB begin)
- >>> inst = pb_instances[9353]
- >>> inst.fileid
- 'wsj_0200.mrg'
- >>> print(inst.tree)
- None
- >>> print(inst.predicate.select(inst.tree))
- Traceback (most recent call last):
- . . .
- ValueError: Parse tree not avaialable
- However, if you supply your own version of the treebank corpus (by
- putting it before the nltk-provided version on `nltk.data.path`, or
- by creating a `ptb` directory and using the `propbank_ptb` corpus
- reader), then you can access the trees for all instances.
- A list of the verb lemmas contained in PropBank is returned by the
- `propbank.verbs()` method:
- >>> propbank.verbs()
- ['abandon', 'abate', 'abdicate', 'abet', 'abide', ...]