treetransforms.py 13 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339
  1. # Natural Language Toolkit: Tree Transformations
  2. #
  3. # Copyright (C) 2005-2007 Oregon Graduate Institute
  4. # Author: Nathan Bodenstab <bodenstab@cslu.ogi.edu>
  5. # URL: <http://nltk.org/>
  6. # For license information, see LICENSE.TXT
  7. """
  8. A collection of methods for tree (grammar) transformations used
  9. in parsing natural language.
  10. Although many of these methods are technically grammar transformations
  11. (ie. Chomsky Norm Form), when working with treebanks it is much more
  12. natural to visualize these modifications in a tree structure. Hence,
  13. we will do all transformation directly to the tree itself.
  14. Transforming the tree directly also allows us to do parent annotation.
  15. A grammar can then be simply induced from the modified tree.
  16. The following is a short tutorial on the available transformations.
  17. 1. Chomsky Normal Form (binarization)
  18. It is well known that any grammar has a Chomsky Normal Form (CNF)
  19. equivalent grammar where CNF is defined by every production having
  20. either two non-terminals or one terminal on its right hand side.
  21. When we have hierarchically structured data (ie. a treebank), it is
  22. natural to view this in terms of productions where the root of every
  23. subtree is the head (left hand side) of the production and all of
  24. its children are the right hand side constituents. In order to
  25. convert a tree into CNF, we simply need to ensure that every subtree
  26. has either two subtrees as children (binarization), or one leaf node
  27. (non-terminal). In order to binarize a subtree with more than two
  28. children, we must introduce artificial nodes.
  29. There are two popular methods to convert a tree into CNF: left
  30. factoring and right factoring. The following example demonstrates
  31. the difference between them. Example::
  32. Original Right-Factored Left-Factored
  33. A A A
  34. / | \ / \ / \
  35. B C D ==> B A|<C-D> OR A|<B-C> D
  36. / \ / \
  37. C D B C
  38. 2. Parent Annotation
  39. In addition to binarizing the tree, there are two standard
  40. modifications to node labels we can do in the same traversal: parent
  41. annotation and Markov order-N smoothing (or sibling smoothing).
  42. The purpose of parent annotation is to refine the probabilities of
  43. productions by adding a small amount of context. With this simple
  44. addition, a CYK (inside-outside, dynamic programming chart parse)
  45. can improve from 74% to 79% accuracy. A natural generalization from
  46. parent annotation is to grandparent annotation and beyond. The
  47. tradeoff becomes accuracy gain vs. computational complexity. We
  48. must also keep in mind data sparcity issues. Example::
  49. Original Parent Annotation
  50. A A^<?>
  51. / | \ / \
  52. B C D ==> B^<A> A|<C-D>^<?> where ? is the
  53. / \ parent of A
  54. C^<A> D^<A>
  55. 3. Markov order-N smoothing
  56. Markov smoothing combats data sparcity issues as well as decreasing
  57. computational requirements by limiting the number of children
  58. included in artificial nodes. In practice, most people use an order
  59. 2 grammar. Example::
  60. Original No Smoothing Markov order 1 Markov order 2 etc.
  61. __A__ A A A
  62. / /|\ \ / \ / \ / \
  63. B C D E F ==> B A|<C-D-E-F> ==> B A|<C> ==> B A|<C-D>
  64. / \ / \ / \
  65. C ... C ... C ...
  66. Annotation decisions can be thought about in the vertical direction
  67. (parent, grandparent, etc) and the horizontal direction (number of
  68. siblings to keep). Parameters to the following functions specify
  69. these values. For more information see:
  70. Dan Klein and Chris Manning (2003) "Accurate Unlexicalized
  71. Parsing", ACL-03. http://www.aclweb.org/anthology/P03-1054
  72. 4. Unary Collapsing
  73. Collapse unary productions (ie. subtrees with a single child) into a
  74. new non-terminal (Tree node). This is useful when working with
  75. algorithms that do not allow unary productions, yet you do not wish
  76. to lose the parent information. Example::
  77. A
  78. |
  79. B ==> A+B
  80. / \ / \
  81. C D C D
  82. """
  83. from __future__ import print_function
  84. from nltk.tree import Tree
  85. def chomsky_normal_form(
  86. tree, factor="right", horzMarkov=None, vertMarkov=0, childChar="|", parentChar="^"
  87. ):
  88. # assume all subtrees have homogeneous children
  89. # assume all terminals have no siblings
  90. # A semi-hack to have elegant looking code below. As a result,
  91. # any subtree with a branching factor greater than 999 will be incorrectly truncated.
  92. if horzMarkov is None:
  93. horzMarkov = 999
  94. # Traverse the tree depth-first keeping a list of ancestor nodes to the root.
  95. # I chose not to use the tree.treepositions() method since it requires
  96. # two traversals of the tree (one to get the positions, one to iterate
  97. # over them) and node access time is proportional to the height of the node.
  98. # This method is 7x faster which helps when parsing 40,000 sentences.
  99. nodeList = [(tree, [tree.label()])]
  100. while nodeList != []:
  101. node, parent = nodeList.pop()
  102. if isinstance(node, Tree):
  103. # parent annotation
  104. parentString = ""
  105. originalNode = node.label()
  106. if vertMarkov != 0 and node != tree and isinstance(node[0], Tree):
  107. parentString = "%s<%s>" % (parentChar, "-".join(parent))
  108. node.set_label(node.label() + parentString)
  109. parent = [originalNode] + parent[: vertMarkov - 1]
  110. # add children to the agenda before we mess with them
  111. for child in node:
  112. nodeList.append((child, parent))
  113. # chomsky normal form factorization
  114. if len(node) > 2:
  115. childNodes = [child.label() for child in node]
  116. nodeCopy = node.copy()
  117. node[0:] = [] # delete the children
  118. curNode = node
  119. numChildren = len(nodeCopy)
  120. for i in range(1, numChildren - 1):
  121. if factor == "right":
  122. newHead = "%s%s<%s>%s" % (
  123. originalNode,
  124. childChar,
  125. "-".join(
  126. childNodes[i : min([i + horzMarkov, numChildren])]
  127. ),
  128. parentString,
  129. ) # create new head
  130. newNode = Tree(newHead, [])
  131. curNode[0:] = [nodeCopy.pop(0), newNode]
  132. else:
  133. newHead = "%s%s<%s>%s" % (
  134. originalNode,
  135. childChar,
  136. "-".join(
  137. childNodes[max([numChildren - i - horzMarkov, 0]) : -i]
  138. ),
  139. parentString,
  140. )
  141. newNode = Tree(newHead, [])
  142. curNode[0:] = [newNode, nodeCopy.pop()]
  143. curNode = newNode
  144. curNode[0:] = [child for child in nodeCopy]
  145. def un_chomsky_normal_form(
  146. tree, expandUnary=True, childChar="|", parentChar="^", unaryChar="+"
  147. ):
  148. # Traverse the tree-depth first keeping a pointer to the parent for modification purposes.
  149. nodeList = [(tree, [])]
  150. while nodeList != []:
  151. node, parent = nodeList.pop()
  152. if isinstance(node, Tree):
  153. # if the node contains the 'childChar' character it means that
  154. # it is an artificial node and can be removed, although we still need
  155. # to move its children to its parent
  156. childIndex = node.label().find(childChar)
  157. if childIndex != -1:
  158. nodeIndex = parent.index(node)
  159. parent.remove(parent[nodeIndex])
  160. # Generated node was on the left if the nodeIndex is 0 which
  161. # means the grammar was left factored. We must insert the children
  162. # at the beginning of the parent's children
  163. if nodeIndex == 0:
  164. parent.insert(0, node[0])
  165. parent.insert(1, node[1])
  166. else:
  167. parent.extend([node[0], node[1]])
  168. # parent is now the current node so the children of parent will be added to the agenda
  169. node = parent
  170. else:
  171. parentIndex = node.label().find(parentChar)
  172. if parentIndex != -1:
  173. # strip the node name of the parent annotation
  174. node.set_label(node.label()[:parentIndex])
  175. # expand collapsed unary productions
  176. if expandUnary == True:
  177. unaryIndex = node.label().find(unaryChar)
  178. if unaryIndex != -1:
  179. newNode = Tree(
  180. node.label()[unaryIndex + 1 :], [i for i in node]
  181. )
  182. node.set_label(node.label()[:unaryIndex])
  183. node[0:] = [newNode]
  184. for child in node:
  185. nodeList.append((child, node))
  186. def collapse_unary(tree, collapsePOS=False, collapseRoot=False, joinChar="+"):
  187. """
  188. Collapse subtrees with a single child (ie. unary productions)
  189. into a new non-terminal (Tree node) joined by 'joinChar'.
  190. This is useful when working with algorithms that do not allow
  191. unary productions, and completely removing the unary productions
  192. would require loss of useful information. The Tree is modified
  193. directly (since it is passed by reference) and no value is returned.
  194. :param tree: The Tree to be collapsed
  195. :type tree: Tree
  196. :param collapsePOS: 'False' (default) will not collapse the parent of leaf nodes (ie.
  197. Part-of-Speech tags) since they are always unary productions
  198. :type collapsePOS: bool
  199. :param collapseRoot: 'False' (default) will not modify the root production
  200. if it is unary. For the Penn WSJ treebank corpus, this corresponds
  201. to the TOP -> productions.
  202. :type collapseRoot: bool
  203. :param joinChar: A string used to connect collapsed node values (default = "+")
  204. :type joinChar: str
  205. """
  206. if collapseRoot == False and isinstance(tree, Tree) and len(tree) == 1:
  207. nodeList = [tree[0]]
  208. else:
  209. nodeList = [tree]
  210. # depth-first traversal of tree
  211. while nodeList != []:
  212. node = nodeList.pop()
  213. if isinstance(node, Tree):
  214. if (
  215. len(node) == 1
  216. and isinstance(node[0], Tree)
  217. and (collapsePOS == True or isinstance(node[0, 0], Tree))
  218. ):
  219. node.set_label(node.label() + joinChar + node[0].label())
  220. node[0:] = [child for child in node[0]]
  221. # since we assigned the child's children to the current node,
  222. # evaluate the current node again
  223. nodeList.append(node)
  224. else:
  225. for child in node:
  226. nodeList.append(child)
  227. #################################################################
  228. # Demonstration
  229. #################################################################
  230. def demo():
  231. """
  232. A demonstration showing how each tree transform can be used.
  233. """
  234. from nltk.draw.tree import draw_trees
  235. from nltk import tree, treetransforms
  236. from copy import deepcopy
  237. # original tree from WSJ bracketed text
  238. sentence = """(TOP
  239. (S
  240. (S
  241. (VP
  242. (VBN Turned)
  243. (ADVP (RB loose))
  244. (PP
  245. (IN in)
  246. (NP
  247. (NP (NNP Shane) (NNP Longman) (POS 's))
  248. (NN trading)
  249. (NN room)))))
  250. (, ,)
  251. (NP (DT the) (NN yuppie) (NNS dealers))
  252. (VP (AUX do) (NP (NP (RB little)) (ADJP (RB right))))
  253. (. .)))"""
  254. t = tree.Tree.fromstring(sentence, remove_empty_top_bracketing=True)
  255. # collapse subtrees with only one child
  256. collapsedTree = deepcopy(t)
  257. treetransforms.collapse_unary(collapsedTree)
  258. # convert the tree to CNF
  259. cnfTree = deepcopy(collapsedTree)
  260. treetransforms.chomsky_normal_form(cnfTree)
  261. # convert the tree to CNF with parent annotation (one level) and horizontal smoothing of order two
  262. parentTree = deepcopy(collapsedTree)
  263. treetransforms.chomsky_normal_form(parentTree, horzMarkov=2, vertMarkov=1)
  264. # convert the tree back to its original form (used to make CYK results comparable)
  265. original = deepcopy(parentTree)
  266. treetransforms.un_chomsky_normal_form(original)
  267. # convert tree back to bracketed text
  268. sentence2 = original.pprint()
  269. print(sentence)
  270. print(sentence2)
  271. print("Sentences the same? ", sentence == sentence2)
  272. draw_trees(t, collapsedTree, cnfTree, parentTree, original)
  273. if __name__ == '__main__':
  274. demo()
  275. __all__ = ["chomsky_normal_form", "un_chomsky_normal_form", "collapse_unary"]