.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT

=======
Chat-80
=======

Chat-80 was a natural language system which allowed the user to
interrogate a Prolog knowledge base in the domain of world
geography. It was developed in the early '80s by Warren and Pereira; see
`<http://acl.ldc.upenn.edu/J/J82/J82-3002.pdf>`_ for a description and
`<http://www.cis.upenn.edu/~pereira/oldies.html>`_ for the source
files.

The ``chat80`` module contains functions to extract data from the Chat-80
relation files ('the world database'), and convert them into a format
that can be incorporated in the FOL models of
``nltk.sem.evaluate``. The code assumes that the Prolog
input files are available in the NLTK corpora directory.
The Chat-80 World Database consists of the following files::

    world0.pl
    rivers.pl
    cities.pl
    countries.pl
    contain.pl
    borders.pl

This module uses a slightly modified version of ``world0.pl``, in which
a set of Prolog rules has been omitted. The modified file is named
``world1.pl``. Currently, the file ``rivers.pl`` is not read in, since
it uses a list rather than a string in the second field.

Reading Chat-80 Files
=====================

Chat-80 relations are like tables in a relational database. The
relation acts as the name of the table; the first argument acts as the
'primary key'; and subsequent arguments are further fields in the
table. In general, the name of the table provides a label for a unary
predicate whose extension is all the primary keys. For example,
relations in ``cities.pl`` are of the following form::

    'city(athens,greece,1368).'

Here, ``'athens'`` is the key, and will be mapped to a member of the
unary predicate *city*.
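
Although this is not Chat-80's actual parser, the step from a Prolog fact
string to a predicate symbol plus a list of fields can be sketched in a few
lines of Python (the helper name ``parse_fact`` is invented for
illustration):

```python
import re

def parse_fact(clause):
    """Split a simple Prolog fact such as "city(athens,greece,1368)."
    into its predicate symbol and a list of argument strings.
    (Illustrative only; not chat80's own reader.)"""
    match = re.match(r"(\w+)\(([^)]*)\)\.", clause.strip())
    if match is None:
        raise ValueError("not a simple Prolog fact: %r" % clause)
    predicate, args = match.groups()
    return predicate, args.split(",")

predicate, fields = parse_fact("city(athens,greece,1368).")
print(predicate, fields)   # city ['athens', 'greece', '1368']
```

The first element of ``fields`` is then the primary key described above.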

By analogy with NLTK corpora, ``chat80`` defines a number of 'items'
which correspond to the relations.

    >>> from nltk.sem import chat80
    >>> print(chat80.items) # doctest: +ELLIPSIS
    ('borders', 'circle_of_lat', 'circle_of_long', 'city', ...)

The fields in the table are mapped to binary predicates. The first
argument of the predicate is the primary key, while the second
argument is the data in the relevant field. Thus, in the above
example, the third field is mapped to the binary predicate
*population_of*, whose extension is a set of pairs such as
``('athens', '1368')``.
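
The schema-to-predicate mapping can be illustrated with plain Python sets.
The two rows below are a made-up sample in the shape of ``cities.pl``, and
the code is only a sketch of the idea, not chat80's implementation:

```python
# A made-up two-row sample in the shape of cities.pl.
rows = [("athens", "greece", "1368"),
        ("bangkok", "thailand", "1178")]
schema = ["city", "country", "population"]

# The first field supplies the unary predicate's extension...
unary = {schema[0]: {row[0] for row in rows}}

# ...and each later field pairs the primary key with that field's value,
# under a label derived from the schema ('country' -> 'country_of').
binary = {field + "_of": {(row[0], row[i]) for row in rows}
          for i, field in enumerate(schema[1:], start=1)}

print(sorted(unary["city"]))            # ['athens', 'bangkok']
print(sorted(binary["population_of"]))  # [('athens', '1368'), ('bangkok', '1178')]
```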

An exception to this general framework is required by the relations in
the files ``borders.pl`` and ``contain.pl``. These contain facts of the
following form::

    'borders(albania,greece).'

    'contains0(africa,central_africa).'

We do not want to form a unary concept out of the element in
the first field of these records, and we want the label of the binary
relation just to be ``'border'``/``'contain'`` respectively.

In order to drive the extraction process, we use 'relation metadata bundles'
which are Python dictionaries such as the following::

    city = {'label': 'city',
            'closures': [],
            'schema': ['city', 'country', 'population'],
            'filename': 'cities.pl'}

According to this, the file ``city['filename']`` contains a list of
relational tuples (or more accurately, the corresponding strings in
Prolog form) whose predicate symbol is ``city['label']`` and whose
relational schema is ``city['schema']``. The notion of a ``closure`` is
discussed in the next section.

Concepts
========

In order to encapsulate the results of the extraction, a class of
``Concept``\ s is introduced. A ``Concept`` object has a number of
attributes, in particular a ``prefLabel``, an ``arity`` and an
``extension``.

    >>> c1 = chat80.Concept('dog', arity=1, extension=set(['d1', 'd2']))
    >>> print(c1)
    Label = 'dog'
    Arity = 1
    Extension = ['d1', 'd2']

The ``extension`` attribute makes it easier to inspect the output of
the extraction.

    >>> schema = ['city', 'country', 'population']
    >>> concepts = chat80.clause2concepts('cities.pl', 'city', schema)
    >>> concepts
    [Concept('city'), Concept('country_of'), Concept('population_of')]
    >>> for c in concepts:  # doctest: +NORMALIZE_WHITESPACE
    ...     print("%s:\n\t%s" % (c.prefLabel, c.extension[:4]))
    city:
        ['athens', 'bangkok', 'barcelona', 'berlin']
    country_of:
        [('athens', 'greece'), ('bangkok', 'thailand'), ('barcelona', 'spain'), ('berlin', 'east_germany')]
    population_of:
        [('athens', '1368'), ('bangkok', '1178'), ('barcelona', '1280'), ('berlin', '3481')]

In addition, the ``extension`` can be further
processed: in the case of the ``'border'`` relation, we check that the
relation is **symmetric**, and in the case of the ``'contain'``
relation, we carry out the **transitive closure**. The closure
properties associated with a concept are indicated in the relation
metadata, as mentioned earlier.

    >>> borders = set([('a1', 'a2'), ('a2', 'a3')])
    >>> c2 = chat80.Concept('borders', arity=2, extension=borders)
    >>> print(c2)
    Label = 'borders'
    Arity = 2
    Extension = [('a1', 'a2'), ('a2', 'a3')]
    >>> c3 = chat80.Concept('borders', arity=2, closures=['symmetric'], extension=borders)
    >>> c3.close()
    >>> print(c3)
    Label = 'borders'
    Arity = 2
    Extension = [('a1', 'a2'), ('a2', 'a1'), ('a2', 'a3'), ('a3', 'a2')]
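
The two closure operations themselves are simple set manipulations. The
following sketch (not chat80's implementation) reproduces the symmetric
closure computed above, and the transitive closure used for ``'contain'``:

```python
def symmetric_closure(pairs):
    """Add (b, a) for every (a, b) in the relation."""
    return pairs | {(b, a) for (a, b) in pairs}

def transitive_closure(pairs):
    """Keep composing pairs until no new ones appear."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new

borders = {("a1", "a2"), ("a2", "a3")}
print(sorted(symmetric_closure(borders)))
# [('a1', 'a2'), ('a2', 'a1'), ('a2', 'a3'), ('a3', 'a2')]

contains = {("africa", "central_africa"), ("central_africa", "chad")}
print(sorted(transitive_closure(contains)))
# [('africa', 'central_africa'), ('africa', 'chad'), ('central_africa', 'chad')]
```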

The ``extension`` of a ``Concept`` object is then incorporated into a
``Valuation`` object.

Persistence
===========

The functions ``val_dump`` and ``val_load`` are provided to allow a
valuation to be stored in a persistent database and re-loaded, rather
than having to be re-computed each time.
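
The underlying idea can be sketched with the standard library alone;
``val_dump`` and ``val_load`` handle the details themselves, so the toy
valuation and file path below are purely illustrative:

```python
import os
import pickle
import tempfile

# A toy valuation standing in for the real extracted one.
valuation = {"city": {"athens", "bangkok"},
             "population_of": {("athens", "1368")}}

# Dump it once...
path = os.path.join(tempfile.mkdtemp(), "valuation.pkl")
with open(path, "wb") as f:
    pickle.dump(valuation, f)

# ...and re-load it later instead of re-reading the Prolog files.
with open(path, "rb") as f:
    reloaded = pickle.load(f)
print(reloaded == valuation)   # True
```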

Individuals and Lexical Items
=============================

As well as deriving relations from the Chat-80 data, we also create a
set of individual constants, one for each entity in the domain. The
individual constants are string-identical to the entities. For
example, given a data item such as ``'zloty'``, we add to the valuation
a pair ``('zloty', 'zloty')``. In order to parse English sentences that
refer to these entities, we also create a lexical item such as the
following for each individual constant::

    PropN[num=sg, sem=<\P.(P zloty)>] -> 'Zloty'

The set of rules is written to the file ``chat_pnames.fcfg`` in the
current directory.
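
Producing such a rule is a matter of string formatting. The helper below is
hypothetical (chat80 has its own rule writer), but it generates the pattern
shown above:

```python
def pname_rule(constant):
    """Format one production of the kind written to chat_pnames.fcfg.
    (The helper name is invented for illustration.)"""
    return "PropN[num=sg, sem=<\\P.(P %s)>] -> '%s'" % (
        constant, constant.capitalize())

print(pname_rule("zloty"))
# PropN[num=sg, sem=<\P.(P zloty)>] -> 'Zloty'
```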

SQL Query
=========

The ``city`` relation is also available in RDB form and can be queried
using SQL statements.

    >>> import nltk
    >>> q = "SELECT City, Population FROM city_table WHERE Country = 'china' and Population > 1000"
    >>> for answer in chat80.sql_query('corpora/city_database/city.db', q):
    ...     print("%-10s %4s" % answer)
    canton     1496
    chungking  1100
    mukden     1551
    peking     2031
    shanghai   5407
    tientsin   1795
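
The same query can be tried without the NLTK data files by building a
miniature ``city_table`` in an in-memory SQLite database; the three rows
below are a made-up subset of the real table:

```python
import sqlite3

# A miniature, made-up city_table; the real one ships with the NLTK data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_table (City text, Country text, Population int)")
conn.executemany("INSERT INTO city_table VALUES (?, ?, ?)",
                 [("canton", "china", 1496),
                  ("peking", "china", 2031),
                  ("athens", "greece", 1368)])

q = "SELECT City, Population FROM city_table WHERE Country = 'china' and Population > 1000"
for city, population in conn.execute(q):
    print("%-10s %4s" % (city, population))
```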

The (deliberately naive) grammar ``sql0.fcfg`` translates from English
to SQL:

    >>> nltk.data.show_cfg('grammars/book_grammars/sql0.fcfg')
    % start S
    S[SEM=(?np + WHERE + ?vp)] -> NP[SEM=?np] VP[SEM=?vp]
    VP[SEM=(?v + ?pp)] -> IV[SEM=?v] PP[SEM=?pp]
    VP[SEM=(?v + ?ap)] -> IV[SEM=?v] AP[SEM=?ap]
    NP[SEM=(?det + ?n)] -> Det[SEM=?det] N[SEM=?n]
    PP[SEM=(?p + ?np)] -> P[SEM=?p] NP[SEM=?np]
    AP[SEM=?pp] -> A[SEM=?a] PP[SEM=?pp]
    NP[SEM='Country="greece"'] -> 'Greece'
    NP[SEM='Country="china"'] -> 'China'
    Det[SEM='SELECT'] -> 'Which' | 'What'
    N[SEM='City FROM city_table'] -> 'cities'
    IV[SEM=''] -> 'are'
    A[SEM=''] -> 'located'
    P[SEM=''] -> 'in'

Given this grammar, we can express, and then execute, queries in English.

    >>> cp = nltk.parse.load_parser('grammars/book_grammars/sql0.fcfg')
    >>> query = 'What cities are in China'
    >>> for tree in cp.parse(query.split()):
    ...     answer = tree.label()['SEM']
    ...     q = " ".join(answer)
    ...     print(q)
    ...
    SELECT City FROM city_table WHERE Country="china"

    >>> rows = chat80.sql_query('corpora/city_database/city.db', q)
    >>> for r in rows: print("%s" % r, end=' ')
    canton chungking dairen harbin kowloon mukden peking shanghai sian tientsin

Using Valuations
----------------

In order to convert such an extension into a valuation, we use the
``make_valuation()`` method; setting ``read=True`` creates and returns
a new ``Valuation`` object which contains the results.

    >>> val = chat80.make_valuation(concepts, read=True)
    >>> 'calcutta' in val['city']
    True
    >>> [town for (town, country) in val['country_of'] if country == 'india']
    ['bombay', 'calcutta', 'delhi', 'hyderabad', 'madras']
    >>> dom = val.domain
    >>> g = nltk.sem.Assignment(dom)
    >>> m = nltk.sem.Model(dom, val)
    >>> m.evaluate(r'population_of(jakarta, 533)', g)
    True
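
Behind a call such as ``m.evaluate``, an atomic formula is true just in case
its argument tuple belongs to the predicate's extension. A minimal
dictionary-based sketch of that idea (not nltk's implementation, and with a
made-up two-pair extension):

```python
# A made-up fragment of a valuation.
valuation = {"population_of": {("jakarta", "533"), ("athens", "1368")}}

def evaluate_atom(pred, args, valuation):
    """An atomic formula pred(arg1, ..., argn) is true just in case
    the argument tuple belongs to the predicate's extension."""
    return tuple(args) in valuation[pred]

print(evaluate_atom("population_of", ["jakarta", "533"], valuation))   # True
```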