.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT

======================
Information Extraction
======================

Information Extraction standardly consists of three subtasks:

#. Named Entity Recognition
#. Relation Extraction
#. Template Filling

Named Entities
~~~~~~~~~~~~~~

The IEER corpus is marked up for a variety of Named Entities. A `Named
Entity`:dt: (more strictly, a Named Entity mention) is a name of an
entity belonging to a specified class. For example, the Named Entity
classes in IEER include PERSON, LOCATION, ORGANIZATION, DATE and so
on. Within NLTK, Named Entities are represented as subtrees within a
chunk structure: the class name is treated as the node label, while the
entity mention itself appears as the leaves of the subtree. This is
illustrated below, where we show an extract of the chunk
representation of document NYT_19980315.064:

    >>> from nltk.corpus import ieer
    >>> docs = ieer.parsed_docs('NYT_19980315')
    >>> tree = docs[1].text
    >>> print(tree) # doctest: +ELLIPSIS
    (DOCUMENT
    ...
      ``It's
      a
      chance
      to
      think
      about
      first-level
      questions,''
      said
      Ms.
      (PERSON Cohn)
      ,
      a
      partner
      in
      the
      (ORGANIZATION McGlashan & Sarrail)
      firm
      in
      (LOCATION San Mateo)
      ,
      (LOCATION Calif.)
    ...)

Thus, the Named Entity mentions in this example are *Cohn*, *McGlashan &
Sarrail*, *San Mateo* and *Calif.*.

The CoNLL2002 Dutch and Spanish data is treated similarly, although in
this case the strings are also POS tagged.

    >>> from nltk.corpus import conll2002
    >>> for doc in conll2002.chunked_sents('ned.train')[27]:
    ...     print(doc)
    (u'Het', u'Art')
    (ORG Hof/N van/Prep Cassatie/N)
    (u'verbrak', u'V')
    (u'het', u'Art')
    (u'arrest', u'N')
    (u'zodat', u'Conj')
    (u'het', u'Pron')
    (u'moest', u'V')
    (u'worden', u'V')
    (u'overgedaan', u'V')
    (u'door', u'Prep')
    (u'het', u'Art')
    (u'hof', u'N')
    (u'van', u'Prep')
    (u'beroep', u'N')
    (u'van', u'Prep')
    (LOC Antwerpen/N)
    (u'.', u'Punc')

Relation Extraction
~~~~~~~~~~~~~~~~~~~

Relation Extraction standardly consists of identifying specified
relations between Named Entities. For example, assuming that we can
recognize ORGANIZATIONs and LOCATIONs in text, we might want to also
recognize pairs *(o, l)* of these kinds of entities such that *o* is
located in *l*.

The `sem.relextract` module provides some tools to help carry out a
simple version of this task. The `tree2semi_rel()` function splits a chunk
document into a list of two-member lists, each of which consists of a
(possibly empty) string followed by a `Tree` (i.e., a Named Entity):

    >>> from nltk.sem import relextract
    >>> pairs = relextract.tree2semi_rel(tree)
    >>> for s, tree in pairs[18:22]:
    ...     print('("...%s", %s)' % (" ".join(s[-5:]), tree))
    ("...about first-level questions,'' said Ms.", (PERSON Cohn))
    ("..., a partner in the", (ORGANIZATION McGlashan & Sarrail))
    ("...firm in", (LOCATION San Mateo))
    ("...,", (LOCATION Calif.))

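
The splitting step can be pictured with a small pure-Python sketch. This
is not NLTK's implementation: the chunk structure is assumed to be a flat
list in which plain strings are ordinary tokens and ``(label, text)``
tuples stand in for Named Entity subtrees, and the name
``split_on_entities`` is hypothetical.

```python
def split_on_entities(chunks):
    """Group tokens into [preceding_tokens, entity] pairs.

    Assumption: plain strings are ordinary tokens and (label, text)
    tuples stand in for Named Entity subtrees.
    """
    pairs = []
    pending = []  # tokens seen since the previous entity
    for item in chunks:
        if isinstance(item, tuple):
            # An entity closes the current pair.
            pairs.append([pending, item])
            pending = []
        else:
            pending.append(item)
    # In this sketch, tokens after the last entity are dropped.
    return pairs


chunks = [
    "said", "Ms.", ("PERSON", "Cohn"),
    ",", "a", "partner", "in", "the", ("ORGANIZATION", "McGlashan & Sarrail"),
    "firm", "in", ("LOCATION", "San Mateo"),
    ",", ("LOCATION", "Calif."),
]
pairs = split_on_entities(chunks)
for tokens, entity in pairs:
    print(" ".join(tokens), "|", entity)
```

Each pair keeps the tokens as a list, so a caller can slice off the last
few words for display, as the doctest above does with ``s[-5:]``.
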
The function `semi_rel2reldict()` processes triples of these pairs, i.e.,
triples of the form ``((string1, Tree1), (string2, Tree2), (string3,
Tree3))`` and outputs, for each triple, a dictionary (a `reldict`) in
which ``Tree1`` is the subject of the relation, ``string2`` is the filler
and ``Tree2`` is the object of the relation. ``string1`` and ``string3`` are
stored as left and right context respectively.

    >>> reldicts = relextract.semi_rel2reldict(pairs)
    >>> for k, v in sorted(reldicts[0].items()):
    ...     print(k, '=>', v) # doctest: +ELLIPSIS
    filler => of messages to their own ``Cyberia'' ...
    lcon => transactions.'' Each week, they post
    objclass => ORGANIZATION
    objsym => white_house
    objtext => White House
    rcon => for access to its planned
    subjclass => CARDINAL
    subjsym => hundreds
    subjtext => hundreds
    untagged_filler => of messages to their own ``Cyberia'' ...

The next example shows some of the values for two `reldict`\ s
corresponding to the ``'NYT_19980315'`` text extract shown earlier.

    >>> for r in reldicts[18:20]:
    ...     print('=' * 20)
    ...     print(r['subjtext'])
    ...     print(r['filler'])
    ...     print(r['objtext'])
    ====================
    Cohn
    , a partner in the
    McGlashan & Sarrail
    ====================
    McGlashan & Sarrail
    firm in
    San Mateo

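
The way consecutive pairs turn into `reldict`\ s can be sketched in plain
Python. This is a simplified illustration, not NLTK's code: entities are
plain strings rather than `Tree`\ s, only five of the keys shown above are
filled in, and the name ``build_reldicts`` is hypothetical. It assumes, as
the Cohn / McGlashan & Sarrail output above suggests, that subject and
object come from adjacent pairs.

```python
def build_reldicts(pairs, window=5):
    """Slide over consecutive (tokens, entity) pairs and build simple dicts.

    Assumption: each pair is (list_of_tokens, entity_text).
    """
    reldicts = []
    for i in range(len(pairs) - 2):
        (left, subj), (filler, obj), (right, _) = pairs[i:i + 3]
        reldicts.append({
            "lcon": " ".join(left[-window:]),    # context before the subject
            "subjtext": subj,
            "filler": " ".join(filler),          # text between subject and object
            "objtext": obj,
            "rcon": " ".join(right[:window]),    # context after the object
        })
    return reldicts


pairs = [
    (["said", "Ms."], "Cohn"),
    ([",", "a", "partner", "in", "the"], "McGlashan & Sarrail"),
    (["firm", "in"], "San Mateo"),
    ([","], "Calif."),
]
for r in build_reldicts(pairs):
    print(r["subjtext"], "|", r["filler"], "|", r["objtext"])
```

With four pairs the sketch yields two dictionaries, because the last pair
in each window contributes only its string as right context.
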
The function `extract_rels()` allows us to filter the `reldict`\ s
according to the classes of the subject and object named entities. In
addition, we can specify that the filler text has to match a given
regular expression, as illustrated in the next example. Here, we are
looking for pairs of entities in the IN relation, where IN has
signature <ORG, LOC>.

    >>> import re
    >>> IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')
    >>> for fileid in ieer.fileids():
    ...     for doc in ieer.parsed_docs(fileid):
    ...         for rel in relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
    ...             print(relextract.rtuple(rel)) # doctest: +ELLIPSIS
    [ORG: 'Christian Democrats'] ', the leading political forces in' [LOC: 'Italy']
    [ORG: 'AP'] ') _ Lebanese guerrillas attacked Israeli forces in southern' [LOC: 'Lebanon']
    [ORG: 'Security Council'] 'adopted Resolution 425. Huge yellow banners hung across intersections in' [LOC: 'Beirut']
    [ORG: 'U.N.'] 'failures in' [LOC: 'Africa']
    [ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
    [ORG: 'U.N.'] 'partners on a more effective role in' [LOC: 'Africa']
    [ORG: 'AP'] ') _ A bomb exploded in a mosque in central' [LOC: 'San`a']
    [ORG: 'Krasnoye Sormovo'] 'shipyard in the Soviet city of' [LOC: 'Gorky']
    [ORG: 'Kelab Golf Darul Ridzuan'] 'in' [LOC: 'Perak']
    [ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
    [ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
    [ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
    [ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
    [ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
    [ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
    [ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
    ...

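
The negative lookahead ``(?!\b.+ing\b)`` in ``IN`` appears intended to
reject fillers in which *in* is followed by a word ending in *-ing*. The
check below uses only the standard ``re`` module and assumes the pattern is
applied with ``match`` from the start of the filler. The first two sample
fillers are taken from the output above; the third is made up for this
sketch.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')

samples = [
    "firm in",                                   # from the output above
    ", based in",                                # from the output above
    "success in supervising the transition of",  # hypothetical filler
]

for filler in samples:
    verdict = "match" if IN.match(filler) else "no match"
    print(repr(filler), "->", verdict)
```

The third filler is rejected because the only standalone *in* is followed
by *supervising*, which satisfies the lookahead's ``.+ing\b``.
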
The next example illustrates a case where the pattern is a disjunction
of roles that a PERSON can occupy in an ORGANIZATION.

    >>> roles = r"""
    ... (.*(
    ... analyst|
    ... chair(wo)?man|
    ... commissioner|
    ... counsel|
    ... director|
    ... economist|
    ... editor|
    ... executive|
    ... foreman|
    ... governor|
    ... head|
    ... lawyer|
    ... leader|
    ... librarian).*)|
    ... manager|
    ... partner|
    ... president|
    ... producer|
    ... professor|
    ... researcher|
    ... spokes(wo)?man|
    ... writer|
    ... ,\sof\sthe?\s*  # "X, of (the) Y"
    ... """
    >>> ROLES = re.compile(roles, re.VERBOSE)
    >>> for fileid in ieer.fileids():
    ...     for doc in ieer.parsed_docs(fileid):
    ...         for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ieer', pattern=ROLES):
    ...             print(relextract.rtuple(rel)) # doctest: +ELLIPSIS
    [PER: 'Kivutha Kibwana'] ', of the' [ORG: 'National Convention Assembly']
    [PER: 'Boban Boskovic'] ', chief executive of the' [ORG: 'Plastika']
    [PER: 'Annan'] ', the first sub-Saharan African to head the' [ORG: 'United Nations']
    [PER: 'Kiriyenko'] 'became a foreman at the' [ORG: 'Krasnoye Sormovo']
    [PER: 'Annan'] ', the first sub-Saharan African to head the' [ORG: 'United Nations']
    [PER: 'Mike Godwin'] ', chief counsel for the' [ORG: 'Electronic Frontier Foundation']
    ...

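
With ``re.VERBOSE``, whitespace and newlines inside the pattern are
ignored and ``#`` starts a comment, which is what lets the role list be
written one alternative per line. The following is a shortened sketch of
such a pattern, not the ``ROLES`` pattern above: the role list is cut down
and regrouped so that every role may occur anywhere in the filler, and the
name ``ROLES_SKETCH`` is hypothetical.

```python
import re

roles_sketch = r"""
.*\b(
    chair(wo)?man  |
    director       |
    partner        |
    spokes(wo)?man
)\b.*
|
,\sof\s(the\s)?     # "X, of (the) Y"
"""
ROLES_SKETCH = re.compile(roles_sketch, re.VERBOSE)

for filler in [", a partner in the", ", of the", "firm in"]:
    verdict = "match" if ROLES_SKETCH.match(filler) else "no match"
    print(repr(filler), "->", verdict)
```

Because literal spaces are ignored in verbose mode, the spaces in
``X, of the Y`` have to be written explicitly as ``\s``.
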
In the case of the CoNLL2002 data, we can include POS tags in the
query pattern. This example also illustrates how the output can be
presented as something that looks more like a clause in a logical language.

    >>> de = """
    ... .*
    ... (
    ... de/SP|
    ... del/SP
    ... )
    ... """
    >>> DE = re.compile(de, re.VERBOSE)
    >>> rels = [rel for doc in conll2002.chunked_sents('esp.train')
    ...         for rel in relextract.extract_rels('ORG', 'LOC', doc, corpus='conll2002', pattern=DE)]
    >>> for r in rels[:10]:
    ...     print(relextract.clause(r, relsym='DE')) # doctest: +NORMALIZE_WHITESPACE
    DE(u'tribunal_supremo', u'victoria')
    DE(u'museo_de_arte', u'alcorc\xf3n')
    DE(u'museo_de_bellas_artes', u'a_coru\xf1a')
    DE(u'siria', u'l\xedbano')
    DE(u'uni\xf3n_europea', u'pek\xedn')
    DE(u'ej\xe9rcito', u'rogberi')
    DE(u'juzgado_de_instrucci\xf3n_n\xfamero_1', u'san_sebasti\xe1n')
    DE(u'psoe', u'villanueva_de_la_serena')
    DE(u'ej\xe9rcito', u'l\xedbano')
    DE(u'juzgado_de_lo_penal_n\xfamero_2', u'ceuta')

    >>> vnv = """
    ... (
    ... is/V|
    ... was/V|
    ... werd/V|
    ... wordt/V
    ... )
    ... .*
    ... van/Prep
    ... """
    >>> VAN = re.compile(vnv, re.VERBOSE)
    >>> for doc in conll2002.chunked_sents('ned.train'):
    ...     for r in relextract.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):
    ...         print(relextract.clause(r, relsym="VAN"))
    VAN(u"cornet_d'elzius", u'buitenlandse_handel')
    VAN(u'johan_rottiers', u'kardinaal_van_roey_instituut')
    VAN(u'annie_lennox', u'eurythmics')

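
Patterns such as ``de/SP`` and ``van/Prep`` only make sense if the filler
is a string of ``word/TAG`` tokens. The sketch below illustrates that idea
with the standard library alone. It assumes the filler is built by joining
``(word, tag)`` pairs with spaces, and that a clause is formed from
lower-cased, underscore-joined entity tokens; the helpers
``tagged_filler`` and ``to_clause`` and the token lists are hypothetical.

```python
import re

DE = re.compile(r"""
    .*
    (
      de/SP |
      del/SP
    )
""", re.VERBOSE)


def tagged_filler(tokens):
    """Join (word, tag) pairs as 'word/tag' strings separated by spaces."""
    return " ".join(f"{word}/{tag}" for word, tag in tokens)


def to_clause(relsym, subj_tokens, obj_tokens):
    """Format a relation as relsym('subject_symbol', 'object_symbol')."""
    def symbol(tokens):
        return "_".join(tokens).lower()
    return f"{relsym}({symbol(subj_tokens)!r}, {symbol(obj_tokens)!r})"


filler = tagged_filler([("de", "SP")])
if DE.match(filler):
    print(to_clause("DE", ["Tribunal", "Supremo"], ["Victoria"]))
```

Including the tag in the pattern restricts matches to tokens with that
part of speech, which a pattern over bare words could not express.
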