.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT

=======
Chat-80
=======

Chat-80 was a natural language system which allowed the user to
interrogate a Prolog knowledge base in the domain of world
geography. It was developed in the early '80s by Warren and Pereira; see
`<http://acl.ldc.upenn.edu/J/J82/J82-3002.pdf>`_ for a description and
`<http://www.cis.upenn.edu/~pereira/oldies.html>`_ for the source
files.

The ``chat80`` module contains functions to extract data from the Chat-80
relation files ('the world database'), and convert them into a format
that can be incorporated in the FOL models of
``nltk.sem.evaluate``. The code assumes that the Prolog
input files are available in the NLTK corpora directory.

The Chat-80 World Database consists of the following files::

    world0.pl
    rivers.pl
    cities.pl
    countries.pl
    contain.pl
    borders.pl

This module uses a slightly modified version of ``world0.pl``, in which
a set of Prolog rules has been omitted. The modified file is named
``world1.pl``. Currently, the file ``rivers.pl`` is not read in, since
it uses a list rather than a string in the second field.

Reading Chat-80 Files
=====================

Chat-80 relations are like tables in a relational database. The
relation acts as the name of the table; the first argument acts as the
'primary key'; and subsequent arguments are further fields in the
table. In general, the name of the table provides a label for a unary
predicate whose extension is all the primary keys. For example,
relations in ``cities.pl`` are of the following form::

    'city(athens,greece,1368).'

Here, ``'athens'`` is the key, and will be mapped to a member of the
unary predicate *city*.

By analogy with NLTK corpora, ``chat80`` defines a number of 'items'
which correspond to the relations.

>>> from nltk.sem import chat80
>>> print(chat80.items) # doctest: +ELLIPSIS
('borders', 'circle_of_lat', 'circle_of_long', 'city', ...)

The fields in the table are mapped to binary predicates. The first
argument of the predicate is the primary key, while the second
argument is the data in the relevant field. Thus, in the above
example, the third field is mapped to the binary predicate
*population_of*, whose extension is a set of pairs such as
``'(athens, 1368)'``.

An exception to this general framework is required by the relations in
the files ``borders.pl`` and ``contain.pl``. These contain facts of the
following form::

    'borders(albania,greece).'
    'contains0(africa,central_africa).'

We do not want to form a unary concept out of the element in
the first field of these records, and we want the label of the binary
relation to be just ``'border'`` and ``'contain'`` respectively.

In order to drive the extraction process, we use 'relation metadata bundles'
which are Python dictionaries such as the following::

    city = {'label': 'city',
            'closures': [],
            'schema': ['city', 'country', 'population'],
            'filename': 'cities.pl'}

According to this, the file ``city['filename']`` contains a list of
relational tuples (or more accurately, the corresponding strings in
Prolog form) whose predicate symbol is ``city['label']`` and whose
relational schema is ``city['schema']``. The notion of a ``closure`` is
discussed in the next section.
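
The fields of such a bundle are exactly the arguments needed to run the
extraction for one relation. The following is a minimal sketch, assuming
that ``clause2concepts()`` (demonstrated in the next section) also accepts
the bundle's closure list via its optional ``closures`` argument::

    from nltk.sem import chat80

    city = {'label': 'city',
            'closures': [],
            'schema': ['city', 'country', 'population'],
            'filename': 'cities.pl'}

    # Drive the extraction from the bundle's fields.
    concepts = chat80.clause2concepts(city['filename'],
                                      city['label'],
                                      city['schema'],
                                      closures=city['closures'])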

Concepts
========

In order to encapsulate the results of the extraction, a class of
``Concept``\ s is introduced. A ``Concept`` object has a number of
attributes, in particular a ``prefLabel``, an ``arity`` and an ``extension``.

>>> c1 = chat80.Concept('dog', arity=1, extension=set(['d1', 'd2']))
>>> print(c1)
Label = 'dog'
Arity = 1
Extension = ['d1', 'd2']

The ``extension`` attribute makes it easier to inspect the output of
the extraction.

>>> schema = ['city', 'country', 'population']
>>> concepts = chat80.clause2concepts('cities.pl', 'city', schema)
>>> concepts
[Concept('city'), Concept('country_of'), Concept('population_of')]
>>> for c in concepts: # doctest: +NORMALIZE_WHITESPACE
...     print("%s:\n\t%s" % (c.prefLabel, c.extension[:4]))
city:
    ['athens', 'bangkok', 'barcelona', 'berlin']
country_of:
    [('athens', 'greece'), ('bangkok', 'thailand'), ('barcelona', 'spain'), ('berlin', 'east_germany')]
population_of:
    [('athens', '1368'), ('bangkok', '1178'), ('barcelona', '1280'), ('berlin', '3481')]

In addition, the ``extension`` can be further
processed: in the case of the ``'border'`` relation, we check that the
relation is **symmetric**, and in the case of the ``'contain'``
relation, we carry out the **transitive closure**. The closure
properties associated with a concept are specified in the relation
metadata, as indicated earlier.

>>> borders = set([('a1', 'a2'), ('a2', 'a3')])
>>> c2 = chat80.Concept('borders', arity=2, extension=borders)
>>> print(c2)
Label = 'borders'
Arity = 2
Extension = [('a1', 'a2'), ('a2', 'a3')]

>>> c3 = chat80.Concept('borders', arity=2, closures=['symmetric'], extension=borders)
>>> c3.close()
>>> print(c3)
Label = 'borders'
Arity = 2
Extension = [('a1', 'a2'), ('a2', 'a1'), ('a2', 'a3'), ('a3', 'a2')]

The ``extension`` of a ``Concept`` object is then incorporated into a
``Valuation`` object.
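
In outline, this step pairs each concept's ``prefLabel`` with its
``extension`` and hands the pairs to ``nltk.sem.Valuation``; the helper
that actually performs it, ``make_valuation()``, is demonstrated in the
'Using Valuations' section below. A rough, non-authoritative sketch of
the idea, assuming the ``concepts`` list built earlier::

    import nltk

    # One (symbol, extension) pair per concept; Valuation accepts sets of
    # strings for unary relations and sets of tuples for binary ones.
    pairs = [(c.prefLabel, set(c.extension)) for c in concepts]
    val = nltk.sem.Valuation(pairs)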

Persistence
===========

The functions ``val_dump`` and ``val_load`` are provided to allow a
valuation to be stored in a persistent database and re-loaded, rather
than having to be re-computed each time.
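
A minimal sketch of the round trip, assuming that ``val_dump()`` takes
the module-level list of relation metadata bundles (``chat80.rels``)
together with a filename for the persistent store, and that
``val_load()`` rebuilds the ``Valuation`` from that file::

    from nltk.sem import chat80

    # Extract everything once and dump the resulting valuation to disk ...
    chat80.val_dump(chat80.rels, 'chat80.db')

    # ... then re-load it in later sessions instead of re-computing it.
    val = chat80.val_load('chat80.db')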

Individuals and Lexical Items
=============================

As well as deriving relations from the Chat-80 data, we also create a
set of individual constants, one for each entity in the domain. The
individual constants are string-identical to the entities. For
example, given a data item such as ``'zloty'``, we add to the valuation
a pair ``('zloty', 'zloty')``. In order to parse English sentences that
refer to these entities, we also create a lexical item such as the
following for each individual constant::

    PropN[num=sg, sem=<\P.(P zloty)>] -> 'Zloty'

The set of rules is written to the file ``chat_pnames.fcfg`` in the
current directory.
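
The following illustrative sketch (not the ``chat80`` API itself) shows
the shape of both products: the identity pairs that are added to the
valuation and the proper-name rules that are written out::

    # Hypothetical sample of entities from the extracted domain.
    entities = ['zloty', 'athens', 'india']

    # Each entity doubles as its own individual constant ...
    constant_pairs = [(e, e) for e in entities]    # e.g. ('zloty', 'zloty')

    # ... and yields one proper-name rule of the kind shown above.
    rules = ["PropN[num=sg, sem=<\\P.(P %s)>] -> '%s'" % (e, e.capitalize())
             for e in entities]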

SQL Query
=========

The ``city`` relation is also available in RDB form and can be queried
using SQL statements.

>>> import nltk
>>> q = "SELECT City, Population FROM city_table WHERE Country = 'china' and Population > 1000"
>>> for answer in chat80.sql_query('corpora/city_database/city.db', q):
...     print("%-10s %4s" % answer)
canton     1496
chungking  1100
mukden     1551
peking     2031
shanghai   5407
tientsin   1795

The (deliberately naive) grammar ``sql0.fcfg`` translates from English
to SQL:

>>> nltk.data.show_cfg('grammars/book_grammars/sql0.fcfg')
% start S
S[SEM=(?np + WHERE + ?vp)] -> NP[SEM=?np] VP[SEM=?vp]
VP[SEM=(?v + ?pp)] -> IV[SEM=?v] PP[SEM=?pp]
VP[SEM=(?v + ?ap)] -> IV[SEM=?v] AP[SEM=?ap]
NP[SEM=(?det + ?n)] -> Det[SEM=?det] N[SEM=?n]
PP[SEM=(?p + ?np)] -> P[SEM=?p] NP[SEM=?np]
AP[SEM=?pp] -> A[SEM=?a] PP[SEM=?pp]
NP[SEM='Country="greece"'] -> 'Greece'
NP[SEM='Country="china"'] -> 'China'
Det[SEM='SELECT'] -> 'Which' | 'What'
N[SEM='City FROM city_table'] -> 'cities'
IV[SEM=''] -> 'are'
A[SEM=''] -> 'located'
P[SEM=''] -> 'in'

Given this grammar, we can express, and then execute, queries in English.

>>> cp = nltk.parse.load_parser('grammars/book_grammars/sql0.fcfg')
>>> query = 'What cities are in China'
>>> for tree in cp.parse(query.split()):
...     answer = tree.label()['SEM']
...     q = " ".join(answer)
...     print(q)
...
SELECT City FROM city_table WHERE Country="china"

>>> rows = chat80.sql_query('corpora/city_database/city.db', q)
>>> for r in rows: print("%s" % r, end=' ')
canton chungking dairen harbin kowloon mukden peking shanghai sian tientsin

Using Valuations
-----------------

In order to convert such an extension into a valuation, we use the
``make_valuation()`` function; setting ``read=True`` creates and returns
a new ``Valuation`` object which contains the results.

>>> val = chat80.make_valuation(concepts, read=True)
>>> 'calcutta' in val['city']
True
>>> [town for (town, country) in val['country_of'] if country == 'india']
['bombay', 'calcutta', 'delhi', 'hyderabad', 'madras']
>>> dom = val.domain
>>> g = nltk.sem.Assignment(dom)
>>> m = nltk.sem.Model(dom, val)
>>> m.evaluate(r'population_of(jakarta, 533)', g)
True
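
Since ``val`` now backs a standard ``nltk.sem.Model``, the other model
checking utilities can be used as well. A small sketch, assuming the
``m`` and ``g`` objects constructed above, of how one might collect all
satisfiers of an open formula (here, the cities recorded for India)::

    from nltk.sem import Expression

    # Parse an open formula with free variable x and ask the model which
    # individuals satisfy it under the assignment g.
    expr = Expression.fromstring('country_of(x, india)')
    indian_cities = m.satisfiers(expr, 'x', g)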