- .. Copyright (C) 2001-2019 NLTK Project
- .. For license information, see LICENSE.TXT
- Crubadan Corpus Reader
- ======================
- Crubadan is an NLTK corpus reader for ngram files provided
- by the Crubadan project. It supports several languages.
- >>> from nltk.corpus import crubadan
- >>> crubadan.langs() # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
- ['abk', 'abn',..., 'zpa', 'zul']
- ----------------------------------------
- Language code mapping and helper methods
- ----------------------------------------
- The web crawler that generates the 3-gram frequencies works at the
- level of "writing systems" rather than languages. Writing systems
- are assigned internal 2-3 letter codes that require mapping to the
- standard ISO 639-3 codes. For more information, please refer to
- the README in the nltk_data/crubadan folder after installing it.
- To translate an ISO 639-3 code to its Crubadan code:
- >>> crubadan.iso_to_crubadan('eng')
- 'en'
- >>> crubadan.iso_to_crubadan('fra')
- 'fr'
- >>> crubadan.iso_to_crubadan('aaa')
- In reverse, to get the ISO 639-3 code for a Crubadan code:
- >>> crubadan.crubadan_to_iso('en')
- 'eng'
- >>> crubadan.crubadan_to_iso('fr')
- 'fra'
- >>> crubadan.crubadan_to_iso('aa')
- ---------------------------
- Accessing ngram frequencies
- ---------------------------
- On initialization the reader will create a dictionary of every
- language supported by the Crubadan project, mapping the ISO 639-3
- language code to its corresponding ngram frequency distribution.
- You can access an individual language's FreqDist and the ngrams within it as follows:
- >>> english_fd = crubadan.lang_freq('eng')
- >>> english_fd['the']
- 728135
- The example above accesses the FreqDist of English and returns the frequency of the ngram 'the'.
- An ngram that isn't found within the language will return 0:
- >>> english_fd['sometest']
- 0
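Raw counts such as the one above are hard to compare across languages with different corpus sizes, so a relative frequency can be more convenient. The sketch below uses a plain `collections.Counter` as a stand-in for a FreqDist (NLTK's FreqDist is a Counter subclass); `relative_frequency` and the toy counts are hypothetical and are not taken from the Crubadan data.

```python
from collections import Counter


def relative_frequency(fd, ngram):
    """Return the share of all counted ngrams in fd that are `ngram`."""
    total = sum(fd.values())
    # An empty distribution has no meaningful ratio; report 0.0 instead of dividing by zero.
    return fd[ngram] / total if total else 0.0


# Toy counts for illustration only (not real Crubadan frequencies).
toy_fd = Counter({"the": 6, "and": 3, "ing": 1})

print(relative_frequency(toy_fd, "the"))
print(relative_frequency(toy_fd, "sometest"))
```

As with the raw lookup, an ngram that was never counted contributes 0, so the ratio is 0.0 rather than an error.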
- A language that isn't supported will raise an exception:
- >>> crubadan.lang_freq('elvish')
- Traceback (most recent call last):
- ...
- RuntimeError: Unsupported language.