.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT

=========================================
Loading Resources From the Data Package
=========================================

    >>> import nltk.data

Overview
~~~~~~~~
The `nltk.data` module contains functions that can be used to load
NLTK resource files, such as corpora, grammars, and saved processing
objects.

Loading Data Files
~~~~~~~~~~~~~~~~~~
Resources are loaded using the function `nltk.data.load()`, which
takes as its first argument a URL specifying what file should be
loaded.  The ``nltk:`` protocol loads files from the NLTK data
distribution:

    >>> from __future__ import print_function
    >>> tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
    >>> tokenizer.tokenize('Hello. This is a test. It works!')
    ['Hello.', 'This is a test.', 'It works!']

It is important to note that there should be no space following the
colon (':') in the URL; 'nltk: tokenizers/punkt/english.pickle' will
not work!
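
A hedged illustration of the failure (marked ``+SKIP`` so it is not
executed; the exact exception and message depend on your NLTK version,
but the lookup fails because ' tokenizers/...' is not a valid resource
path):

    >>> nltk.data.load('nltk: tokenizers/punkt/english.pickle') # doctest: +SKIP
    Traceback (most recent call last):
      . . .
    LookupError: ...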

The ``nltk:`` protocol is used by default if no protocol is specified:

    >>> nltk.data.load('tokenizers/punkt/english.pickle') # doctest: +ELLIPSIS
    <nltk.tokenize.punkt.PunktSentenceTokenizer object at ...>

But it is also possible to load resources from ``http:``, ``ftp:``,
and ``file:`` URLs, e.g. ``cfg = nltk.data.load('http://example.com/path/to/toy.cfg')``:

    >>> # Load a grammar using an absolute path.
    >>> url = 'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg')
    >>> url.replace('\\', '/') # doctest: +ELLIPSIS
    'file:...toy.cfg'
    >>> print(nltk.data.load(url)) # doctest: +ELLIPSIS
    Grammar with 14 productions (start state = S)
        S -> NP VP
        PP -> P NP
        ...
        P -> 'on'
        P -> 'in'

The second argument to the `nltk.data.load()` function specifies the
file format, which determines how the file's contents are processed
before they are returned by ``load()``.  The formats that are
currently supported by the data module are described by the dictionary
`nltk.data.FORMATS`:

    >>> for format, descr in sorted(nltk.data.FORMATS.items()):
    ...     print('{0:<7} {1:}'.format(format, descr)) # doctest: +NORMALIZE_WHITESPACE
    cfg     A context free grammar.
    fcfg    A feature CFG.
    fol     A list of first order logic expressions, parsed with
            nltk.sem.logic.Expression.fromstring.
    json    A serialized python object, stored using the json module.
    logic   A list of first order logic expressions, parsed with
            nltk.sem.logic.LogicParser.  Requires an additional logic_parser
            parameter
    pcfg    A probabilistic CFG.
    pickle  A serialized python object, stored using the pickle
            module.
    raw     The raw (byte string) contents of a file.
    text    The raw (unicode string) contents of a file.
    val     A semantic valuation, parsed by
            nltk.sem.Valuation.fromstring.
    yaml    A serialized python object, stored using the yaml module.

`nltk.data.load()` will raise a ValueError if a bad format name is
specified:

    >>> nltk.data.load('grammars/sample_grammars/toy.cfg', 'bar')
    Traceback (most recent call last):
      . . .
    ValueError: Unknown format type!

By default, the ``"auto"`` format is used, which chooses a format
based on the filename's extension.  The mapping from file extensions
to format names is specified by `nltk.data.AUTO_FORMATS`:

    >>> for ext, format in sorted(nltk.data.AUTO_FORMATS.items()):
    ...     print('.%-7s -> %s' % (ext, format))
    .cfg     -> cfg
    .fcfg    -> fcfg
    .fol     -> fol
    .json    -> json
    .logic   -> logic
    .pcfg    -> pcfg
    .pickle  -> pickle
    .text    -> text
    .txt     -> text
    .val     -> val
    .yaml    -> yaml
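
For example, because ``toy.cfg`` ends in ``.cfg``, the ``"auto"``
format loads it as a context free grammar (the repr shown is the one
NLTK prints for a `CFG`, as in the Lazy Loader section below):

    >>> nltk.data.load('grammars/sample_grammars/toy.cfg')
    <Grammar with 14 productions>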

If `nltk.data.load()` is unable to determine the format based on the
filename's extension, it will raise a ValueError:

    >>> nltk.data.load('foo.bar')
    Traceback (most recent call last):
      . . .
    ValueError: Could not determine format for foo.bar based on its file
    extension; use the "format" argument to specify the format explicitly.

Note that by explicitly specifying the ``format`` argument, you can
override the load method's default processing behavior.  For example,
to get the unprocessed string contents of any file, simply use
``format="text"``:

    >>> s = nltk.data.load('grammars/sample_grammars/toy.cfg', 'text')
    >>> print(s) # doctest: +ELLIPSIS
    S -> NP VP
    PP -> P NP
    NP -> Det N | NP PP
    VP -> V NP | VP PP
    ...
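
With ``format="raw"``, the same contents come back as a byte string
instead:

    >>> nltk.data.load('grammars/sample_grammars/toy.cfg', 'raw')[:10]
    b'S -> NP VP'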

Making Local Copies
~~~~~~~~~~~~~~~~~~~
.. This will not be visible in the html output: create a tempdir to
   play in.

    >>> import tempfile, os
    >>> tempdir = tempfile.mkdtemp()
    >>> old_dir = os.path.abspath('.')
    >>> os.chdir(tempdir)

The function `nltk.data.retrieve()` copies a given resource to a local
file.  This can be useful, for example, if you want to edit one of the
sample grammars.

    >>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg')
    Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy.cfg'

    >>> # Simulate editing the grammar.
    >>> with open('toy.cfg') as inp:
    ...     s = inp.read().replace('NP', 'DP')
    >>> with open('toy.cfg', 'w') as out:
    ...     _bytes_written = out.write(s)

    >>> # Load the edited grammar, & display it.
    >>> cfg = nltk.data.load('file:///' + os.path.abspath('toy.cfg'))
    >>> print(cfg) # doctest: +ELLIPSIS
    Grammar with 14 productions (start state = S)
        S -> DP VP
        PP -> P DP
        ...
        P -> 'on'
        P -> 'in'

The second argument to `nltk.data.retrieve()` specifies the filename
for the new copy of the file.  By default, the source file's filename
is used.

    >>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg', 'mytoy.cfg')
    Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'mytoy.cfg'
    >>> os.path.isfile('./mytoy.cfg')
    True
    >>> nltk.data.retrieve('grammars/sample_grammars/np.fcfg')
    Retrieving 'nltk:grammars/sample_grammars/np.fcfg', saving to 'np.fcfg'
    >>> os.path.isfile('./np.fcfg')
    True

If a file with the specified (or default) filename already exists in
the current directory, then `nltk.data.retrieve()` will raise a
ValueError exception.  It will *not* overwrite the file:

    >>> os.path.isfile('./toy.cfg')
    True
    >>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg') # doctest: +ELLIPSIS
    Traceback (most recent call last):
      . . .
    ValueError: File '...toy.cfg' already exists!

.. This will not be visible in the html output: clean up the tempdir.

    >>> os.chdir(old_dir)
    >>> for f in os.listdir(tempdir):
    ...     os.remove(os.path.join(tempdir, f))
    >>> os.rmdir(tempdir)

Finding Files in the NLTK Data Package
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The `nltk.data.find()` function searches the NLTK data package for a
given file, and returns a pointer to that file.  This pointer can
either be a `FileSystemPathPointer` (whose `path` attribute gives the
absolute path of the file); or a `ZipFilePathPointer`, specifying a
zipfile and the name of an entry within that zipfile.  Both pointer
types define the `open()` method, which can be used to read the
contents of the file.

    >>> path = nltk.data.find('corpora/abc/rural.txt')
    >>> str(path) # doctest: +ELLIPSIS
    '...rural.txt'
    >>> print(path.open().read(60).decode())
    PM denies knowledge of AWB kickbacks
    The Prime Minister has

Alternatively, the `nltk.data.load()` function can be used with the
keyword argument ``format="raw"``:

    >>> s = nltk.data.load('corpora/abc/rural.txt', format='raw')[:60]
    >>> print(s.decode())
    PM denies knowledge of AWB kickbacks
    The Prime Minister has

Similarly, you can use the keyword argument ``format="text"``:

    >>> s = nltk.data.load('corpora/abc/rural.txt', format='text')[:60]
    >>> print(s)
    PM denies knowledge of AWB kickbacks
    The Prime Minister has

Resource Caching
~~~~~~~~~~~~~~~~
NLTK maintains a cache of resources that have been loaded.  If you
load a resource that is already stored in the cache, then the cached
copy will be returned.  This behavior can be seen in the trace output
generated when ``verbose=True``:

    >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
    <<Loading nltk:grammars/book_grammars/feat0.fcfg>>
    >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
    <<Using cached copy of nltk:grammars/book_grammars/feat0.fcfg>>

If you wish to load a resource from its source, bypassing the cache,
use the ``cache=False`` argument to `nltk.data.load()`.  This can be
useful, for example, if the resource is loaded from a local file, and
you are actively editing that file:

    >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', cache=False, verbose=True)
    <<Loading nltk:grammars/book_grammars/feat0.fcfg>>

The cache *no longer* uses weak references: a resource will not be
automatically expunged from the cache when no more objects are using
it.  In the following example, when we clear the variable ``feat0``,
the reference count for the feature grammar object drops to zero.
However, the object remains cached:

    >>> del feat0
    >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg',
    ...                        verbose=True)
    <<Using cached copy of nltk:grammars/book_grammars/feat0.fcfg>>

You can clear the entire contents of the cache using
`nltk.data.clear_cache()`:

    >>> nltk.data.clear_cache()
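
After the cache has been cleared, the next load reads the resource
from its source again:

    >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
    <<Loading nltk:grammars/book_grammars/feat0.fcfg>>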

Retrieving other Data Sources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Loading a file with the ``.fol`` extension returns a list of first
order logic expressions, as specified by `nltk.data.AUTO_FORMATS`:

    >>> formulas = nltk.data.load('grammars/book_grammars/background.fol')
    >>> for f in formulas: print(str(f))
    all x.(boxerdog(x) -> dog(x))
    all x.(boxer(x) -> person(x))
    all x.-(dog(x) & person(x))
    all x.(married(x) <-> exists y.marry(x,y))
    all x.(bark(x) -> dog(x))
    all x y.(marry(x,y) -> (person(x) & person(y)))
    -(Vincent = Mia)
    -(Vincent = Fido)
    -(Mia = Fido)
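
The items are parsed `Expression` objects rather than strings; for
instance, a universally quantified formula parses to an
`AllExpression` (class names here assume NLTK's `nltk.sem.logic`
module):

    >>> type(formulas[0]).__name__
    'AllExpression'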

Regression Tests
~~~~~~~~~~~~~~~~
Create a temp dir for tests that write files:

    >>> import tempfile, os
    >>> tempdir = tempfile.mkdtemp()
    >>> old_dir = os.path.abspath('.')
    >>> os.chdir(tempdir)

The `retrieve()` function accepts all URL types:

    >>> urls = ['https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg',
    ...         'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg'),
    ...         'nltk:grammars/sample_grammars/toy.cfg',
    ...         'grammars/sample_grammars/toy.cfg']
    >>> for i, url in enumerate(urls):
    ...     nltk.data.retrieve(url, 'toy-%d.cfg' % i) # doctest: +ELLIPSIS
    Retrieving 'https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg', saving to 'toy-0.cfg'
    Retrieving 'file:...toy.cfg', saving to 'toy-1.cfg'
    Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-2.cfg'
    Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-3.cfg'

Clean up the temp dir:

    >>> os.chdir(old_dir)
    >>> for f in os.listdir(tempdir):
    ...     os.remove(os.path.join(tempdir, f))
    >>> os.rmdir(tempdir)

Lazy Loader
-----------
A lazy loader is a wrapper object that defers loading a resource until
it is accessed or used in any way.  This is mainly intended for
internal use by NLTK's corpus readers.

    >>> # Create a lazy loader for toy.cfg.
    >>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg')

    >>> # Show that it's not loaded yet:
    >>> object.__repr__(ll) # doctest: +ELLIPSIS
    '<nltk.data.LazyLoader object at ...>'

    >>> # printing it is enough to cause it to be loaded:
    >>> print(ll)
    <Grammar with 14 productions>

    >>> # Show that it's now been loaded:
    >>> object.__repr__(ll) # doctest: +ELLIPSIS
    '<nltk.grammar.CFG object at ...>'

    >>> # Test that accessing an attribute also loads it:
    >>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg')
    >>> ll.start()
    S
    >>> object.__repr__(ll) # doctest: +ELLIPSIS
    '<nltk.grammar.CFG object at ...>'
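
As the changing `object.__repr__` output above shows, the loader
builds the real object on first use and then *becomes* it by swapping
its own ``__class__`` and ``__dict__``.  A minimal sketch of that idea
(a hypothetical ``Lazy`` proxy, not NLTK's implementation):

    >>> class Lazy:
    ...     """Defer building an object until it is first used."""
    ...     def __init__(self, loader):
    ...         self.__dict__['_loader'] = loader
    ...     def _become(self):
    ...         # Build the real object, then impersonate it.
    ...         obj = self.__dict__.pop('_loader')()
    ...         self.__dict__.update(obj.__dict__)
    ...         self.__class__ = obj.__class__
    ...     def __getattr__(self, name):
    ...         # Only called for attributes not found normally.
    ...         self._become()
    ...         return getattr(self, name)
    >>> lazy = Lazy(lambda: nltk.data.load('grammars/sample_grammars/toy.cfg'))
    >>> lazy.start()  # first attribute access triggers the load
    S
    >>> object.__repr__(lazy) # doctest: +ELLIPSIS
    '<nltk.grammar.CFG object at ...>'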

Buffered Gzip Reading and Writing
---------------------------------
Write performance to gzip-compressed files is extremely poor when the
files become large; file creation can become a bottleneck in those
cases.  Read performance from large gzipped pickle files was improved
in data.py by buffering the reads.  A similar fix can be applied to
writes by buffering them in memory first.

This is mainly intended for internal use.  The test below simply
checks that reading and writing work as intended; it does not measure
how much improvement the buffering provides.

    >>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'wb', size=2**10)
    >>> ans = []
    >>> for i in range(10000):
    ...     ans.append(str(i).encode('ascii'))
    ...     test.write(str(i).encode('ascii'))
    >>> test.close()
    >>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'rb')
    >>> test.read() == b''.join(ans)
    True
    >>> test.close()
    >>> import os
    >>> os.unlink('testbuf.gz')
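
For illustration, here is a minimal sketch of the write-buffering
idea, built from the standard ``gzip`` and ``io`` modules (a
hypothetical ``BufferedGzipWriter``, not NLTK's implementation):

    >>> import gzip, io
    >>> class BufferedGzipWriter:
    ...     """Collect writes in memory; push them to the gzip file
    ...     only when the buffer exceeds `size` bytes."""
    ...     def __init__(self, filename, size=2**20):
    ...         self._file = gzip.open(filename, 'wb')
    ...         self._buffer = io.BytesIO()
    ...         self._size = size
    ...     def write(self, data):
    ...         self._buffer.write(data)
    ...         if self._buffer.tell() > self._size:
    ...             self.flush()
    ...     def flush(self):
    ...         self._file.write(self._buffer.getvalue())
    ...         self._buffer = io.BytesIO()
    ...     def close(self):
    ...         self.flush()
    ...         self._file.close()
    >>> w = BufferedGzipWriter('sketchbuf.gz', size=2**10)
    >>> for i in range(1000):
    ...     w.write(str(i).encode('ascii'))
    >>> w.close()
    >>> f = gzip.open('sketchbuf.gz', 'rb')
    >>> f.read() == b''.join(str(i).encode('ascii') for i in range(1000))
    True
    >>> f.close()
    >>> os.unlink('sketchbuf.gz')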

JSON Encoding and Decoding
--------------------------
JSON serialization is used instead of pickle for some classes.

    >>> from nltk import jsontags
    >>> from nltk.jsontags import JSONTaggedEncoder, JSONTaggedDecoder, register_tag

    >>> @jsontags.register_tag
    ... class JSONSerializable:
    ...     json_tag = 'JSONSerializable'
    ...
    ...     def __init__(self, n):
    ...         self.n = n
    ...
    ...     def encode_json_obj(self):
    ...         return self.n
    ...
    ...     @classmethod
    ...     def decode_json_obj(cls, obj):
    ...         n = obj
    ...         return cls(n)
    ...

    >>> JSONTaggedEncoder().encode(JSONSerializable(1))
    '{"!JSONSerializable": 1}'
    >>> JSONTaggedDecoder().decode('{"!JSONSerializable": 1}').n
    1
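
The ``register_tag`` decorator records the class under its
``json_tag``, so decoding restores a full instance of the class and
encoding/decoding round-trips:

    >>> obj = JSONTaggedDecoder().decode(JSONTaggedEncoder().encode(JSONSerializable(5)))
    >>> type(obj).__name__, obj.n
    ('JSONSerializable', 5)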