- .. Copyright (C) 2001-2019 NLTK Project
- .. For license information, see LICENSE.TXT
- =========================================
- Loading Resources From the Data Package
- =========================================
- >>> import nltk.data
- Overview
- ~~~~~~~~
- The `nltk.data` module contains functions that can be used to load
- NLTK resource files, such as corpora, grammars, and saved processing
- objects.
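- For example, assuming the relevant packages have been installed with
- `nltk.download()`, a resource can be loaded with a single call (the
- sections below spell out the details):
- >>> # Load one of the sample grammars shipped with the NLTK data package.
- >>> grammar = nltk.data.load('grammars/sample_grammars/toy.cfg')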
- Loading Data Files
- ~~~~~~~~~~~~~~~~~~
- Resources are loaded using the function `nltk.data.load()`, which
- takes as its first argument a URL specifying what file should be
- loaded. The ``nltk:`` protocol loads files from the NLTK data
- distribution:
- >>> from __future__ import print_function
- >>> tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
- >>> tokenizer.tokenize('Hello. This is a test. It works!')
- ['Hello.', 'This is a test.', 'It works!']
- It is important to note that there should be no space following the
- colon (':') in the URL; 'nltk: tokenizers/punkt/english.pickle' will
- not work!
- The ``nltk:`` protocol is used by default if no protocol is specified:
- >>> nltk.data.load('tokenizers/punkt/english.pickle') # doctest: +ELLIPSIS
- <nltk.tokenize.punkt.PunktSentenceTokenizer object at ...>
- But it is also possible to load resources from ``http:``, ``ftp:``,
- and ``file:`` URLs, e.g. ``cfg = nltk.data.load('http://example.com/path/to/toy.cfg')``
- >>> # Load a grammar using an absolute path.
- >>> url = 'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg')
- >>> url.replace('\\', '/') # doctest: +ELLIPSIS
- 'file:...toy.cfg'
- >>> print(nltk.data.load(url)) # doctest: +ELLIPSIS
- Grammar with 14 productions (start state = S)
- S -> NP VP
- PP -> P NP
- ...
- P -> 'on'
- P -> 'in'
- The second argument to the `nltk.data.load()` function specifies the
- file format, which determines how the file's contents are processed
- before they are returned by ``load()``. The formats that are
- currently supported by the data module are described by the dictionary
- `nltk.data.FORMATS`:
- >>> for format, descr in sorted(nltk.data.FORMATS.items()):
- ...     print('{0:<7} {1:}'.format(format, descr)) # doctest: +NORMALIZE_WHITESPACE
- cfg A context free grammar.
- fcfg A feature CFG.
- fol A list of first order logic expressions, parsed with
- nltk.sem.logic.Expression.fromstring.
- json A serialized python object, stored using the json module.
- logic A list of first order logic expressions, parsed with
- nltk.sem.logic.LogicParser. Requires an additional logic_parser
- parameter
- pcfg A probabilistic CFG.
- pickle A serialized python object, stored using the pickle
- module.
- raw The raw (byte string) contents of a file.
- text The raw (unicode string) contents of a file.
- val A semantic valuation, parsed by
- nltk.sem.Valuation.fromstring.
- yaml A serialized python object, stored using the yaml module.
- `nltk.data.load()` will raise a ValueError if a bad format name is
- specified:
- >>> nltk.data.load('grammars/sample_grammars/toy.cfg', 'bar')
- Traceback (most recent call last):
- . . .
- ValueError: Unknown format type!
- By default, the ``"auto"`` format is used, which chooses a format
- based on the filename's extension. The mapping from file extensions
- to format names is specified by `nltk.data.AUTO_FORMATS`:
- >>> for ext, format in sorted(nltk.data.AUTO_FORMATS.items()):
- ...     print('.%-7s -> %s' % (ext, format))
- .cfg -> cfg
- .fcfg -> fcfg
- .fol -> fol
- .json -> json
- .logic -> logic
- .pcfg -> pcfg
- .pickle -> pickle
- .text -> text
- .txt -> text
- .val -> val
- .yaml -> yaml
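- For instance, a file with a ``.cfg`` extension is parsed as a
- context-free grammar without any explicit ``format`` argument:
- >>> toy = nltk.data.load('grammars/sample_grammars/toy.cfg')
- >>> type(toy).__name__
- 'CFG'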
- If `nltk.data.load()` is unable to determine the format based on the
- filename's extension, it will raise a ValueError:
- >>> nltk.data.load('foo.bar')
- Traceback (most recent call last):
- . . .
- ValueError: Could not determine format for foo.bar based on its file
- extension; use the "format" argument to specify the format explicitly.
- Note that by explicitly specifying the ``format`` argument, you can
- override the load method's default processing behavior. For example,
- to get the unprocessed contents of a file as a string, use
- ``format="text"`` (``format="raw"`` returns the raw bytes instead):
- >>> s = nltk.data.load('grammars/sample_grammars/toy.cfg', 'text')
- >>> print(s) # doctest: +ELLIPSIS
- S -> NP VP
- PP -> P NP
- NP -> Det N | NP PP
- VP -> V NP | VP PP
- ...
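- Conversely, ``format="raw"`` returns the unprocessed contents of the
- same file as a byte string:
- >>> b = nltk.data.load('grammars/sample_grammars/toy.cfg', 'raw')
- >>> isinstance(b, bytes)
- True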
- Making Local Copies
- ~~~~~~~~~~~~~~~~~~~
- .. This will not be visible in the html output: create a tempdir to play in.
- >>> import tempfile, os
- >>> tempdir = tempfile.mkdtemp()
- >>> old_dir = os.path.abspath('.')
- >>> os.chdir(tempdir)
- The function `nltk.data.retrieve()` copies a given resource to a local
- file. This can be useful, for example, if you want to edit one of the
- sample grammars.
- >>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg')
- Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy.cfg'
- >>> # Simulate editing the grammar.
- >>> with open('toy.cfg') as inp:
- ...     s = inp.read().replace('NP', 'DP')
- >>> with open('toy.cfg', 'w') as out:
- ...     _bytes_written = out.write(s)
- >>> # Load the edited grammar, & display it.
- >>> cfg = nltk.data.load('file:///' + os.path.abspath('toy.cfg'))
- >>> print(cfg) # doctest: +ELLIPSIS
- Grammar with 14 productions (start state = S)
- S -> DP VP
- PP -> P DP
- ...
- P -> 'on'
- P -> 'in'
- The second argument to `nltk.data.retrieve()` specifies the filename
- for the new copy of the file. By default, the source file's filename
- is used.
- >>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg', 'mytoy.cfg')
- Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'mytoy.cfg'
- >>> os.path.isfile('./mytoy.cfg')
- True
- >>> nltk.data.retrieve('grammars/sample_grammars/np.fcfg')
- Retrieving 'nltk:grammars/sample_grammars/np.fcfg', saving to 'np.fcfg'
- >>> os.path.isfile('./np.fcfg')
- True
- If a file with the specified (or default) filename already exists in
- the current directory, then `nltk.data.retrieve()` will raise a
- ValueError exception. It will *not* overwrite the file:
- >>> os.path.isfile('./toy.cfg')
- True
- >>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg') # doctest: +ELLIPSIS
- Traceback (most recent call last):
- . . .
- ValueError: File '...toy.cfg' already exists!
- .. This will not be visible in the html output: clean up the tempdir.
- >>> os.chdir(old_dir)
- >>> for f in os.listdir(tempdir):
- ...     os.remove(os.path.join(tempdir, f))
- >>> os.rmdir(tempdir)
- Finding Files in the NLTK Data Package
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- The `nltk.data.find()` function searches the NLTK data package for a
- given file, and returns a pointer to that file. This pointer can
- either be a `FileSystemPathPointer` (whose `path` attribute gives the
- absolute path of the file); or a `ZipFilePathPointer`, specifying a
- zipfile and the name of an entry within that zipfile. Both pointer
- types define an `open()` method, which can be used to read the
- contents of the file.
- >>> path = nltk.data.find('corpora/abc/rural.txt')
- >>> str(path) # doctest: +ELLIPSIS
- '...rural.txt'
- >>> print(path.open().read(60).decode())
- PM denies knowledge of AWB kickbacks
- The Prime Minister has
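- Whether `nltk.data.find()` returns a `FileSystemPathPointer` or a
- `ZipFilePathPointer` depends on how the resource was installed (some
- resources stay inside their zipfiles); `open()` behaves the same either
- way. For example, depending on the installation, the punkt model used
- earlier may be found either as a plain file or inside ``punkt.zip``:
- >>> path = nltk.data.find('tokenizers/punkt/english.pickle')
- >>> data = path.open().read(4)   # read a few bytes, zipped or not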
- Alternatively, the `nltk.data.load()` function can be used with the
- keyword argument ``format="raw"``:
- >>> s = nltk.data.load('corpora/abc/rural.txt', format='raw')[:60]
- >>> print(s.decode())
- PM denies knowledge of AWB kickbacks
- The Prime Minister has
- Similarly, the keyword argument ``format="text"`` returns the contents as a unicode string:
- >>> s = nltk.data.load('corpora/abc/rural.txt', format='text')[:60]
- >>> print(s)
- PM denies knowledge of AWB kickbacks
- The Prime Minister has
- Resource Caching
- ~~~~~~~~~~~~~~~~
- NLTK maintains a cache of the resources that have been loaded. If
- you load a resource that is already stored in the cache, then the
- cached copy is returned. This behavior can be seen in the trace
- output generated when ``verbose=True``:
- >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
- <<Loading nltk:grammars/book_grammars/feat0.fcfg>>
- >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
- <<Using cached copy of nltk:grammars/book_grammars/feat0.fcfg>>
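- Because the cached copy is returned, a second load of the same
- resource yields the very same object:
- >>> nltk.data.load('grammars/book_grammars/feat0.fcfg') is feat0
- True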
- If you wish to load a resource from its source, bypassing the cache,
- use the ``cache=False`` argument to `nltk.data.load()`. This can be
- useful, for example, if the resource is loaded from a local file, and
- you are actively editing that file:
- >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', cache=False, verbose=True)
- <<Loading nltk:grammars/book_grammars/feat0.fcfg>>
- The cache does *not* use weak references. A resource will not be
- automatically expunged from the cache when no more objects are using
- it. In the following example, when we clear the variable ``feat0``,
- the reference count for the feature grammar object drops to zero.
- However, the object remains cached:
- >>> del feat0
- >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg',
- ... verbose=True)
- <<Using cached copy of nltk:grammars/book_grammars/feat0.fcfg>>
- You can clear the entire contents of the cache using
- `nltk.data.clear_cache()`:
- >>> nltk.data.clear_cache()
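- After the cache has been cleared, the next load reads the resource
- from its source again, as the trace output shows:
- >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
- <<Loading nltk:grammars/book_grammars/feat0.fcfg>>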
- Retrieving other Data Sources
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Resources with a ``.fol`` extension are parsed as lists of first-order
- logic expressions (see the ``fol`` entry in `nltk.data.FORMATS` above):
- >>> formulas = nltk.data.load('grammars/book_grammars/background.fol')
- >>> for f in formulas: print(str(f))
- all x.(boxerdog(x) -> dog(x))
- all x.(boxer(x) -> person(x))
- all x.-(dog(x) & person(x))
- all x.(married(x) <-> exists y.marry(x,y))
- all x.(bark(x) -> dog(x))
- all x y.(marry(x,y) -> (person(x) & person(y)))
- -(Vincent = Mia)
- -(Vincent = Fido)
- -(Mia = Fido)
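- As the ``fol`` entry in `nltk.data.FORMATS` indicates, each line of the
- file is parsed into an `Expression` object from `nltk.sem.logic`:
- >>> from nltk.sem.logic import Expression
- >>> all(isinstance(f, Expression) for f in formulas)
- True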
- Regression Tests
- ~~~~~~~~~~~~~~~~
- Create a temp dir for tests that write files:
- >>> import tempfile, os
- >>> tempdir = tempfile.mkdtemp()
- >>> old_dir = os.path.abspath('.')
- >>> os.chdir(tempdir)
- The `retrieve()` function accepts all URL types:
- >>> urls = ['https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg',
- ... 'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg'),
- ... 'nltk:grammars/sample_grammars/toy.cfg',
- ... 'grammars/sample_grammars/toy.cfg']
- >>> for i, url in enumerate(urls):
- ...     nltk.data.retrieve(url, 'toy-%d.cfg' % i) # doctest: +ELLIPSIS
- Retrieving 'https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg', saving to 'toy-0.cfg'
- Retrieving 'file:...toy.cfg', saving to 'toy-1.cfg'
- Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-2.cfg'
- Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-3.cfg'
- Clean up the temp dir:
- >>> os.chdir(old_dir)
- >>> for f in os.listdir(tempdir):
- ...     os.remove(os.path.join(tempdir, f))
- >>> os.rmdir(tempdir)
- Lazy Loader
- -----------
- A lazy loader is a wrapper object that defers loading a resource until
- it is accessed or used in any way. This is mainly intended for
- internal use by NLTK's corpus readers.
- >>> # Create a lazy loader for toy.cfg.
- >>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg')
- >>> # Show that it's not loaded yet:
- >>> object.__repr__(ll) # doctest: +ELLIPSIS
- '<nltk.data.LazyLoader object at ...>'
- >>> # printing it is enough to cause it to be loaded:
- >>> print(ll)
- <Grammar with 14 productions>
- >>> # Show that it's now been loaded:
- >>> object.__repr__(ll) # doctest: +ELLIPSIS
- '<nltk.grammar.CFG object at ...>'
- >>> # Test that accessing an attribute also loads it:
- >>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg')
- >>> ll.start()
- S
- >>> object.__repr__(ll) # doctest: +ELLIPSIS
- '<nltk.grammar.CFG object at ...>'
- Buffered Gzip Reading and Writing
- ---------------------------------
- Write performance to gzip-compressed files is extremely poor when the
- files become large, and file creation can become a bottleneck in those
- cases. Read performance from large gzipped pickle files was improved in
- data.py by buffering the reads. A similar fix can be applied to writes
- by buffering them in memory first.
- This is mainly intended for internal use. The test below simply checks
- that reading and writing work as intended; it does not measure how much
- improvement the buffering provides.
- >>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'wb', size=2**10)
- >>> ans = []
- >>> for i in range(10000):
- ...     ans.append(str(i).encode('ascii'))
- ...     test.write(str(i).encode('ascii'))
- >>> test.close()
- >>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'rb')
- >>> test.read() == b''.join(ans)
- True
- >>> test.close()
- >>> import os
- >>> os.unlink('testbuf.gz')
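- For reference, the write-buffering idea can be sketched independently of
- NLTK. The class below is a minimal illustration only, not the actual
- `BufferedGzipFile` implementation: it accumulates writes in an in-memory
- buffer and only passes them to the gzip file once the buffer grows large.
- >>> import gzip, io
- >>> class SimpleBufferedGzipWriter:
- ...     """Illustrative sketch: batch small writes into large gzip writes."""
- ...     def __init__(self, filename, buffer_size=2**20):
- ...         self._file = gzip.GzipFile(filename, 'wb')
- ...         self._buffer = io.BytesIO()
- ...         self._buffer_size = buffer_size
- ...     def write(self, data):
- ...         self._buffer.write(data)
- ...         if self._buffer.tell() >= self._buffer_size:
- ...             self.flush()
- ...     def flush(self):
- ...         # Hand the accumulated bytes to gzip in one call.
- ...         self._file.write(self._buffer.getvalue())
- ...         self._buffer = io.BytesIO()
- ...     def close(self):
- ...         self.flush()
- ...         self._file.close()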
- JSON Encoding and Decoding
- --------------------------
- JSON serialization is used instead of pickle for some classes.
- >>> from nltk import jsontags
- >>> from nltk.jsontags import JSONTaggedEncoder, JSONTaggedDecoder, register_tag
- >>> @jsontags.register_tag
- ... class JSONSerializable:
- ...     json_tag = 'JSONSerializable'
- ...
- ...     def __init__(self, n):
- ...         self.n = n
- ...
- ...     def encode_json_obj(self):
- ...         return self.n
- ...
- ...     @classmethod
- ...     def decode_json_obj(cls, obj):
- ...         n = obj
- ...         return cls(n)
- ...
- >>> JSONTaggedEncoder().encode(JSONSerializable(1))
- '{"!JSONSerializable": 1}'
- >>> JSONTaggedDecoder().decode('{"!JSONSerializable": 1}').n
- 1
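- Tagged objects can also appear inside ordinary JSON containers. A small
- illustration, assuming (as in current NLTK versions) that the encoder and
- decoder recurse into lists:
- >>> JSONTaggedEncoder().encode([JSONSerializable(1), JSONSerializable(2)])
- '[{"!JSONSerializable": 1}, {"!JSONSerializable": 2}]'
- >>> [obj.n for obj in JSONTaggedDecoder().decode('[{"!JSONSerializable": 1}, {"!JSONSerializable": 2}]')]
- [1, 2]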
|