.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT

Crubadan Corpus Reader
======================

Crubadan is an NLTK corpus reader for the ngram files provided
by the Crubadan project. It supports many languages;
crubadan.langs() lists the codes of the supported languages:

    >>> from nltk.corpus import crubadan
    >>> crubadan.langs() # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ['abk', 'abn',..., 'zpa', 'zul']

----------------------------------------
Language code mapping and helper methods
----------------------------------------

The web crawler that generates the 3-gram frequencies works at the
level of "writing systems" rather than languages. Writing systems
are assigned internal 2-3 letter codes that must be mapped to the
standard ISO 639-3 codes. For more information, refer to the README
in the nltk_data/crubadan folder after installing the corpus.

To translate an ISO 639-3 code to a "Crubadan Code" (an unmapped code
such as 'aaa' returns None, so nothing is printed):

    >>> crubadan.iso_to_crubadan('eng')
    'en'
    >>> crubadan.iso_to_crubadan('fra')
    'fr'
    >>> crubadan.iso_to_crubadan('aaa')

In reverse, to get the ISO 639-3 code for a Crubadan Code:

    >>> crubadan.crubadan_to_iso('en')
    'eng'
    >>> crubadan.crubadan_to_iso('fr')
    'fra'
    >>> crubadan.crubadan_to_iso('aa')

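
The two lookups above behave like a pair of dictionaries built from one
mapping table. The following is a minimal sketch of that idea, assuming a
small in-memory table rather than the table the reader actually loads. The
names _TABLE, iso_to_code and code_to_iso are hypothetical and are not part
of NLTK; only the two code pairs shown above are taken from this document.

```python
# Hypothetical sample rows of (ISO 639-3 code, Crubadan code).
# Only these two pairs appear in the examples above.
_TABLE = [("eng", "en"), ("fra", "fr")]

# Build one dictionary per lookup direction from the same rows.
_ISO_TO_CODE = dict(_TABLE)
_CODE_TO_ISO = {code: iso for iso, code in _TABLE}


def iso_to_code(iso):
    """Return the Crubadan code for an ISO 639-3 code, or None if unmapped."""
    return _ISO_TO_CODE.get(iso)


def code_to_iso(code):
    """Return the ISO 639-3 code for a Crubadan code, or None if unmapped."""
    return _CODE_TO_ISO.get(code)
```

Using dict.get mirrors the behaviour seen in the examples, where an unknown
code produces None instead of raising an error.
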
---------------------------
Accessing ngram frequencies
---------------------------

On initialization the reader creates a dictionary of every
language supported by the Crubadan project, mapping each ISO 639-3
language code to its corresponding ngram frequency distribution.

You can access an individual language's FreqDist and the ngrams within it as follows:

    >>> english_fd = crubadan.lang_freq('eng')
    >>> english_fd['the']
    728135

The example above accesses the FreqDist of English and returns the frequency of the ngram 'the'.

An ngram that isn't found within the language returns 0:

    >>> english_fd['sometest']
    0

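
The zero result follows from how counter-like mappings treat missing keys.
The following is a minimal sketch that uses collections.Counter from the
standard library as a stand-in, since NLTK's FreqDist is built on it. The
names trigram_counts and frequency are hypothetical, and every count other
than the one for 'the' is invented for illustration.

```python
from collections import Counter

# Stand-in for one language's ngram counts. The value for 'the' reuses the
# figure shown above; the other entry is made up for illustration.
trigram_counts = Counter({"the": 728135, "and": 1000})


def frequency(counts, ngram):
    """Return the count for ngram, or 0 if it was never seen."""
    # Indexing a Counter with a missing key yields 0 and does not add the key.
    return counts[ngram]
```

Because a missing key is not inserted by the lookup, repeated queries for
unseen ngrams do not grow the distribution.
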
A language that isn't supported raises an exception:

    >>> crubadan.lang_freq('elvish')
    Traceback (most recent call last):
    ...
    RuntimeError: Unsupported language.