.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT

======================
Information Extraction
======================

Information Extraction standardly consists of three subtasks:

#. Named Entity Recognition
#. Relation Extraction
#. Template Filling
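As a toy illustration of the three subtasks, independent of NLTK and of the corpora discussed below, consider a single sentence with a hand-written entity lookup standing in for a real recognizer (all names here are illustrative assumptions):

```python
import re

# Toy illustration of the three IE subtasks (not NLTK code).
sentence = "Ms. Cohn is a partner in McGlashan & Sarrail in San Mateo ."

# 1. Named Entity Recognition: here just a hand-written lookup.
entities = {
    "Cohn": "PERSON",
    "McGlashan & Sarrail": "ORGANIZATION",
    "San Mateo": "LOCATION",
}

# 2. Relation Extraction: pair an ORGANIZATION with a LOCATION when the
#    text between the two mentions matches a simple "in" pattern.
org = next(m for m, c in entities.items() if c == "ORGANIZATION")
loc = next(m for m, c in entities.items() if c == "LOCATION")
between = sentence.split(org)[1].split(loc)[0]
relations = [("IN", org, loc)] if re.search(r"\bin\b", between) else []

# 3. Template Filling: slot the extracted pair into a fixed template.
template = {"relation": relations[0][0],
            "organization": relations[0][1],
            "location": relations[0][2]}
print(template)
```

Real systems replace the lookup with a trained recognizer and the string splitting with chunk structures, as the rest of this document shows.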
Named Entities
~~~~~~~~~~~~~~
The IEER corpus is marked up for a variety of Named Entities. A `Named
Entity`:dt: (more strictly, a Named Entity mention) is the name of an
entity belonging to a specified class. For example, the Named Entity
classes in IEER include PERSON, LOCATION, ORGANIZATION, DATE and so
on. Within NLTK, Named Entities are represented as subtrees within a
chunk structure: the class name is treated as the node label, while the
entity mention itself appears as the leaves of the subtree. This is
illustrated below, where we show an extract of the chunk
representation of document NYT_19980315.064:

>>> from nltk.corpus import ieer
>>> docs = ieer.parsed_docs('NYT_19980315')
>>> tree = docs[1].text
>>> print(tree) # doctest: +ELLIPSIS
(DOCUMENT
...
  ``It's
  a
  chance
  to
  think
  about
  first-level
  questions,''
  said
  Ms.
  (PERSON Cohn)
  ,
  a
  partner
  in
  the
  (ORGANIZATION McGlashan & Sarrail)
  firm
  in
  (LOCATION San Mateo)
  ,
  (LOCATION Calif.)
...)
Thus, the Named Entity mentions in this example are *Cohn*, *McGlashan &
Sarrail*, *San Mateo* and *Calif.*.
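The subtree representation can be mimicked without NLTK using plain tuples, which makes explicit how mentions are recovered from the leaves (a hypothetical sketch, not the `Tree` API):

```python
# Hypothetical stand-in for a chunk structure: strings are leaves and
# (label, words) tuples play the role of Named Entity subtrees.
document = ["said", "Ms.", ("PERSON", ["Cohn"]), ",", "a", "partner", "in",
            "the", ("ORGANIZATION", ["McGlashan", "&", "Sarrail"]),
            "firm", "in", ("LOCATION", ["San", "Mateo"]), ",",
            ("LOCATION", ["Calif."])]

def mentions(children):
    """Collect (class, mention) pairs from the Named Entity subtrees."""
    return [(label, " ".join(words))
            for label, words in (c for c in children if isinstance(c, tuple))]

print(mentions(document))
```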
The CoNLL2002 Dutch and Spanish data is treated similarly, although in
this case, the strings are also POS tagged.

>>> from nltk.corpus import conll2002
>>> for doc in conll2002.chunked_sents('ned.train')[27]:
...     print(doc)
(u'Het', u'Art')
(ORG Hof/N van/Prep Cassatie/N)
(u'verbrak', u'V')
(u'het', u'Art')
(u'arrest', u'N')
(u'zodat', u'Conj')
(u'het', u'Pron')
(u'moest', u'V')
(u'worden', u'V')
(u'overgedaan', u'V')
(u'door', u'Prep')
(u'het', u'Art')
(u'hof', u'N')
(u'van', u'Prep')
(u'beroep', u'N')
(u'van', u'Prep')
(LOC Antwerpen/N)
(u'.', u'Punc')
Relation Extraction
~~~~~~~~~~~~~~~~~~~

Relation Extraction standardly consists of identifying specified
relations between Named Entities. For example, assuming that we can
recognize ORGANIZATIONs and LOCATIONs in text, we might want to also
recognize pairs *(o, l)* of these kinds of entities such that *o* is
located in *l*.

The `sem.relextract` module provides some tools to help carry out a
simple version of this task. The `tree2semi_rel()` function splits a chunk
document into a list of two-member lists, each of which consists of a
(possibly empty) string followed by a `Tree` (i.e., a Named Entity):

>>> from nltk.sem import relextract
>>> pairs = relextract.tree2semi_rel(tree)
>>> for s, tree in pairs[18:22]:
...     print('("...%s", %s)' % (" ".join(s[-5:]), tree))
- ("...about first-level questions,'' said Ms.", (PERSON Cohn))
- ("..., a partner in the", (ORGANIZATION McGlashan & Sarrail))
- ("...firm in", (LOCATION San Mateo))
- ("...,", (LOCATION Calif.))
The function `semi_rel2reldict()` processes triples of these pairs, i.e.,
pairs of the form ``((string1, Tree1), (string2, Tree2), (string3,
Tree3))`` and outputs a dictionary (a `reldict`) in which ``Tree1`` is
the subject of the relation, ``string2`` is the filler
and ``Tree3`` is the object of the relation. ``string1`` and ``string3`` are
stored as left and right context respectively.

>>> reldicts = relextract.semi_rel2reldict(pairs)
>>> for k, v in sorted(reldicts[0].items()):
...     print(k, '=>', v) # doctest: +ELLIPSIS
filler => of messages to their own ``Cyberia'' ...
lcon => transactions.'' Each week, they post
objclass => ORGANIZATION
objsym => white_house
objtext => White House
rcon => for access to its planned
subjclass => CARDINAL
subjsym => hundreds
subjtext => hundreds
untagged_filler => of messages to their own ``Cyberia'' ...
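The windowing behind this step can be sketched in plain Python: each run of three consecutive (words, entity) pairs yields one dictionary (a hypothetical sketch, not NLTK's implementation; only a subset of the keys shown above is filled in):

```python
# Hypothetical sketch of semi_rel2reldict-style windowing (not NLTK's code).
pairs = [(["said", "Ms."], ("PERSON", "Cohn")),
         ([",", "a", "partner", "in", "the"],
          ("ORGANIZATION", "McGlashan & Sarrail")),
         (["firm", "in"], ("LOCATION", "San Mateo"))]

reldicts = []
for (s1, t1), (s2, t2), (s3, t3) in zip(pairs, pairs[1:], pairs[2:]):
    reldicts.append({
        "lcon": " ".join(s1),                   # left context
        "subjclass": t1[0], "subjtext": t1[1],  # first entity = subject
        "filler": " ".join(s2),                 # words between the entities
        "objclass": t2[0], "objtext": t2[1],    # second entity = object
        "rcon": " ".join(s3),                   # right context
    })

print(reldicts[0]["subjtext"], "|", reldicts[0]["filler"], "|",
      reldicts[0]["objtext"])
```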
The next example shows some of the values for two `reldict`\ s
corresponding to the ``'NYT_19980315'`` text extract shown earlier.

>>> for r in reldicts[18:20]:
...     print('=' * 20)
...     print(r['subjtext'])
...     print(r['filler'])
...     print(r['objtext'])
====================
Cohn
, a partner in the
McGlashan & Sarrail
====================
McGlashan & Sarrail
firm in
San Mateo
The function `extract_rels()` allows us to filter the `reldict`\ s
according to the classes of the subject and object named entities. In
addition, we can specify that the filler text has to match a given
regular expression, as illustrated in the next example. Here, we are
looking for pairs of entities in the IN relation, where IN has
signature <ORG, LOC>.

>>> import re
>>> IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')
>>> for fileid in ieer.fileids():
...     for doc in ieer.parsed_docs(fileid):
...         for rel in relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
...             print(relextract.rtuple(rel)) # doctest: +ELLIPSIS
[ORG: 'Christian Democrats'] ', the leading political forces in' [LOC: 'Italy']
[ORG: 'AP'] ') _ Lebanese guerrillas attacked Israeli forces in southern' [LOC: 'Lebanon']
[ORG: 'Security Council'] 'adopted Resolution 425. Huge yellow banners hung across intersections in' [LOC: 'Beirut']
[ORG: 'U.N.'] 'failures in' [LOC: 'Africa']
[ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
[ORG: 'U.N.'] 'partners on a more effective role in' [LOC: 'Africa']
[ORG: 'AP'] ') _ A bomb exploded in a mosque in central' [LOC: 'San`a']
[ORG: 'Krasnoye Sormovo'] 'shipyard in the Soviet city of' [LOC: 'Gorky']
[ORG: 'Kelab Golf Darul Ridzuan'] 'in' [LOC: 'Perak']
[ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
...
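The ``IN`` pattern can also be checked in isolation; its negative lookahead is there to reject fillers in which ``in`` is followed by an ``-ing`` form:

```python
import re

# Same filler pattern as above: accept "... in ..." but reject cases
# where "in" introduces an -ing form (e.g. "in raising").
IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')

for filler in ['firm in', 'based in', 'succeed in raising']:
    print(repr(filler), '->', bool(IN.match(filler)))
```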
The next example illustrates a case where the pattern is a disjunction
of roles that a PERSON can occupy in an ORGANIZATION.
- >>> roles = """
- ... (.*(
- ... analyst|
- ... chair(wo)?man|
- ... commissioner|
- ... counsel|
- ... director|
- ... economist|
- ... editor|
- ... executive|
- ... foreman|
- ... governor|
- ... head|
- ... lawyer|
- ... leader|
- ... librarian).*)|
- ... manager|
- ... partner|
- ... president|
- ... producer|
- ... professor|
- ... researcher|
- ... spokes(wo)?man|
- ... writer|
- ... ,\sof\sthe?\s* # "X, of (the) Y"
- ... """
- >>> ROLES = re.compile(roles, re.VERBOSE)
- >>> for fileid in ieer.fileids():
- ... for doc in ieer.parsed_docs(fileid):
- ... for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ieer', pattern=ROLES):
- ... print(relextract.rtuple(rel)) # doctest: +ELLIPSIS
[PER: 'Kivutha Kibwana'] ', of the' [ORG: 'National Convention Assembly']
[PER: 'Boban Boskovic'] ', chief executive of the' [ORG: 'Plastika']
[PER: 'Annan'] ', the first sub-Saharan African to head the' [ORG: 'United Nations']
[PER: 'Kiriyenko'] 'became a foreman at the' [ORG: 'Krasnoye Sormovo']
[PER: 'Annan'] ', the first sub-Saharan African to head the' [ORG: 'United Nations']
[PER: 'Mike Godwin'] ', chief counsel for the' [ORG: 'Electronic Frontier Foundation']
...
In the case of the CoNLL2002 data, we can include POS tags in the
query pattern. This example also illustrates how the output can be
presented as something that looks more like a clause in a logical language.

>>> de = """
... .*
... (
... de/SP|
... del/SP
... )
... """
>>> DE = re.compile(de, re.VERBOSE)
>>> rels = [rel for doc in conll2002.chunked_sents('esp.train')
...         for rel in relextract.extract_rels('ORG', 'LOC', doc, corpus='conll2002', pattern=DE)]
>>> for r in rels[:10]:
...     print(relextract.clause(r, relsym='DE')) # doctest: +NORMALIZE_WHITESPACE
DE(u'tribunal_supremo', u'victoria')
DE(u'museo_de_arte', u'alcorc\xf3n')
DE(u'museo_de_bellas_artes', u'a_coru\xf1a')
DE(u'siria', u'l\xedbano')
DE(u'uni\xf3n_europea', u'pek\xedn')
DE(u'ej\xe9rcito', u'rogberi')
DE(u'juzgado_de_instrucci\xf3n_n\xfamero_1', u'san_sebasti\xe1n')
DE(u'psoe', u'villanueva_de_la_serena')
DE(u'ej\xe9rcito', u'l\xedbano')
DE(u'juzgado_de_lo_penal_n\xfamero_2', u'ceuta')
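The lowercased, underscore-joined arguments in these clauses are normalized entity symbols; roughly the following transformation, though this is a guess at the behaviour rather than NLTK's exact helper:

```python
# Rough sketch of how an entity mention becomes a clause argument symbol
# (hypothetical; NLTK's own normalization may differ in details).
def to_symbol(mention):
    return "_".join(mention.lower().split())

print(to_symbol("Tribunal Supremo"))
print(to_symbol("White House"))
```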
- >>> vnv = """
- ... (
- ... is/V|
- ... was/V|
- ... werd/V|
- ... wordt/V
- ... )
- ... .*
- ... van/Prep
- ... """
- >>> VAN = re.compile(vnv, re.VERBOSE)
- >>> for doc in conll2002.chunked_sents('ned.train'):
- ... for r in relextract.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):
- ... print(relextract.clause(r, relsym="VAN"))
- VAN(u"cornet_d'elzius", u'buitenlandse_handel')
- VAN(u'johan_rottiers', u'kardinaal_van_roey_instituut')
- VAN(u'annie_lennox', u'eurythmics')