123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339 |
- # Natural Language Toolkit: Tree Transformations
- #
- # Copyright (C) 2005-2007 Oregon Graduate Institute
- # Author: Nathan Bodenstab <bodenstab@cslu.ogi.edu>
- # URL: <http://nltk.org/>
- # For license information, see LICENSE.TXT
- """
- A collection of methods for tree (grammar) transformations used
- in parsing natural language.
- Although many of these methods are technically grammar transformations
- (ie. Chomsky Norm Form), when working with treebanks it is much more
- natural to visualize these modifications in a tree structure. Hence,
- we will do all transformation directly to the tree itself.
- Transforming the tree directly also allows us to do parent annotation.
- A grammar can then be simply induced from the modified tree.
- The following is a short tutorial on the available transformations.
- 1. Chomsky Normal Form (binarization)
- It is well known that any grammar has a Chomsky Normal Form (CNF)
- equivalent grammar where CNF is defined by every production having
- either two non-terminals or one terminal on its right hand side.
- When we have hierarchically structured data (ie. a treebank), it is
- natural to view this in terms of productions where the root of every
- subtree is the head (left hand side) of the production and all of
- its children are the right hand side constituents. In order to
- convert a tree into CNF, we simply need to ensure that every subtree
- has either two subtrees as children (binarization), or one leaf node
- (non-terminal). In order to binarize a subtree with more than two
- children, we must introduce artificial nodes.
- There are two popular methods to convert a tree into CNF: left
- factoring and right factoring. The following example demonstrates
- the difference between them. Example::
- Original Right-Factored Left-Factored
- A A A
- / | \ / \ / \
- B C D ==> B A|<C-D> OR A|<B-C> D
- / \ / \
- C D B C
- 2. Parent Annotation
- In addition to binarizing the tree, there are two standard
- modifications to node labels we can do in the same traversal: parent
- annotation and Markov order-N smoothing (or sibling smoothing).
- The purpose of parent annotation is to refine the probabilities of
- productions by adding a small amount of context. With this simple
- addition, a CYK (inside-outside, dynamic programming chart parse)
- can improve from 74% to 79% accuracy. A natural generalization from
- parent annotation is to grandparent annotation and beyond. The
- tradeoff becomes accuracy gain vs. computational complexity. We
- must also keep in mind data sparcity issues. Example::
- Original Parent Annotation
- A A^<?>
- / | \ / \
- B C D ==> B^<A> A|<C-D>^<?> where ? is the
- / \ parent of A
- C^<A> D^<A>
- 3. Markov order-N smoothing
- Markov smoothing combats data sparcity issues as well as decreasing
- computational requirements by limiting the number of children
- included in artificial nodes. In practice, most people use an order
- 2 grammar. Example::
- Original No Smoothing Markov order 1 Markov order 2 etc.
- __A__ A A A
- / /|\ \ / \ / \ / \
- B C D E F ==> B A|<C-D-E-F> ==> B A|<C> ==> B A|<C-D>
- / \ / \ / \
- C ... C ... C ...
- Annotation decisions can be thought about in the vertical direction
- (parent, grandparent, etc) and the horizontal direction (number of
- siblings to keep). Parameters to the following functions specify
- these values. For more information see:
- Dan Klein and Chris Manning (2003) "Accurate Unlexicalized
- Parsing", ACL-03. http://www.aclweb.org/anthology/P03-1054
- 4. Unary Collapsing
- Collapse unary productions (ie. subtrees with a single child) into a
- new non-terminal (Tree node). This is useful when working with
- algorithms that do not allow unary productions, yet you do not wish
- to lose the parent information. Example::
- A
- |
- B ==> A+B
- / \ / \
- C D C D
- """
- from __future__ import print_function
- from nltk.tree import Tree
- def chomsky_normal_form(
- tree, factor="right", horzMarkov=None, vertMarkov=0, childChar="|", parentChar="^"
- ):
- # assume all subtrees have homogeneous children
- # assume all terminals have no siblings
- # A semi-hack to have elegant looking code below. As a result,
- # any subtree with a branching factor greater than 999 will be incorrectly truncated.
- if horzMarkov is None:
- horzMarkov = 999
- # Traverse the tree depth-first keeping a list of ancestor nodes to the root.
- # I chose not to use the tree.treepositions() method since it requires
- # two traversals of the tree (one to get the positions, one to iterate
- # over them) and node access time is proportional to the height of the node.
- # This method is 7x faster which helps when parsing 40,000 sentences.
- nodeList = [(tree, [tree.label()])]
- while nodeList != []:
- node, parent = nodeList.pop()
- if isinstance(node, Tree):
- # parent annotation
- parentString = ""
- originalNode = node.label()
- if vertMarkov != 0 and node != tree and isinstance(node[0], Tree):
- parentString = "%s<%s>" % (parentChar, "-".join(parent))
- node.set_label(node.label() + parentString)
- parent = [originalNode] + parent[: vertMarkov - 1]
- # add children to the agenda before we mess with them
- for child in node:
- nodeList.append((child, parent))
- # chomsky normal form factorization
- if len(node) > 2:
- childNodes = [child.label() for child in node]
- nodeCopy = node.copy()
- node[0:] = [] # delete the children
- curNode = node
- numChildren = len(nodeCopy)
- for i in range(1, numChildren - 1):
- if factor == "right":
- newHead = "%s%s<%s>%s" % (
- originalNode,
- childChar,
- "-".join(
- childNodes[i : min([i + horzMarkov, numChildren])]
- ),
- parentString,
- ) # create new head
- newNode = Tree(newHead, [])
- curNode[0:] = [nodeCopy.pop(0), newNode]
- else:
- newHead = "%s%s<%s>%s" % (
- originalNode,
- childChar,
- "-".join(
- childNodes[max([numChildren - i - horzMarkov, 0]) : -i]
- ),
- parentString,
- )
- newNode = Tree(newHead, [])
- curNode[0:] = [newNode, nodeCopy.pop()]
- curNode = newNode
- curNode[0:] = [child for child in nodeCopy]
- def un_chomsky_normal_form(
- tree, expandUnary=True, childChar="|", parentChar="^", unaryChar="+"
- ):
- # Traverse the tree-depth first keeping a pointer to the parent for modification purposes.
- nodeList = [(tree, [])]
- while nodeList != []:
- node, parent = nodeList.pop()
- if isinstance(node, Tree):
- # if the node contains the 'childChar' character it means that
- # it is an artificial node and can be removed, although we still need
- # to move its children to its parent
- childIndex = node.label().find(childChar)
- if childIndex != -1:
- nodeIndex = parent.index(node)
- parent.remove(parent[nodeIndex])
- # Generated node was on the left if the nodeIndex is 0 which
- # means the grammar was left factored. We must insert the children
- # at the beginning of the parent's children
- if nodeIndex == 0:
- parent.insert(0, node[0])
- parent.insert(1, node[1])
- else:
- parent.extend([node[0], node[1]])
- # parent is now the current node so the children of parent will be added to the agenda
- node = parent
- else:
- parentIndex = node.label().find(parentChar)
- if parentIndex != -1:
- # strip the node name of the parent annotation
- node.set_label(node.label()[:parentIndex])
- # expand collapsed unary productions
- if expandUnary == True:
- unaryIndex = node.label().find(unaryChar)
- if unaryIndex != -1:
- newNode = Tree(
- node.label()[unaryIndex + 1 :], [i for i in node]
- )
- node.set_label(node.label()[:unaryIndex])
- node[0:] = [newNode]
- for child in node:
- nodeList.append((child, node))
- def collapse_unary(tree, collapsePOS=False, collapseRoot=False, joinChar="+"):
- """
- Collapse subtrees with a single child (ie. unary productions)
- into a new non-terminal (Tree node) joined by 'joinChar'.
- This is useful when working with algorithms that do not allow
- unary productions, and completely removing the unary productions
- would require loss of useful information. The Tree is modified
- directly (since it is passed by reference) and no value is returned.
- :param tree: The Tree to be collapsed
- :type tree: Tree
- :param collapsePOS: 'False' (default) will not collapse the parent of leaf nodes (ie.
- Part-of-Speech tags) since they are always unary productions
- :type collapsePOS: bool
- :param collapseRoot: 'False' (default) will not modify the root production
- if it is unary. For the Penn WSJ treebank corpus, this corresponds
- to the TOP -> productions.
- :type collapseRoot: bool
- :param joinChar: A string used to connect collapsed node values (default = "+")
- :type joinChar: str
- """
- if collapseRoot == False and isinstance(tree, Tree) and len(tree) == 1:
- nodeList = [tree[0]]
- else:
- nodeList = [tree]
- # depth-first traversal of tree
- while nodeList != []:
- node = nodeList.pop()
- if isinstance(node, Tree):
- if (
- len(node) == 1
- and isinstance(node[0], Tree)
- and (collapsePOS == True or isinstance(node[0, 0], Tree))
- ):
- node.set_label(node.label() + joinChar + node[0].label())
- node[0:] = [child for child in node[0]]
- # since we assigned the child's children to the current node,
- # evaluate the current node again
- nodeList.append(node)
- else:
- for child in node:
- nodeList.append(child)
- #################################################################
- # Demonstration
- #################################################################
- def demo():
- """
- A demonstration showing how each tree transform can be used.
- """
- from nltk.draw.tree import draw_trees
- from nltk import tree, treetransforms
- from copy import deepcopy
- # original tree from WSJ bracketed text
- sentence = """(TOP
- (S
- (S
- (VP
- (VBN Turned)
- (ADVP (RB loose))
- (PP
- (IN in)
- (NP
- (NP (NNP Shane) (NNP Longman) (POS 's))
- (NN trading)
- (NN room)))))
- (, ,)
- (NP (DT the) (NN yuppie) (NNS dealers))
- (VP (AUX do) (NP (NP (RB little)) (ADJP (RB right))))
- (. .)))"""
- t = tree.Tree.fromstring(sentence, remove_empty_top_bracketing=True)
- # collapse subtrees with only one child
- collapsedTree = deepcopy(t)
- treetransforms.collapse_unary(collapsedTree)
- # convert the tree to CNF
- cnfTree = deepcopy(collapsedTree)
- treetransforms.chomsky_normal_form(cnfTree)
- # convert the tree to CNF with parent annotation (one level) and horizontal smoothing of order two
- parentTree = deepcopy(collapsedTree)
- treetransforms.chomsky_normal_form(parentTree, horzMarkov=2, vertMarkov=1)
- # convert the tree back to its original form (used to make CYK results comparable)
- original = deepcopy(parentTree)
- treetransforms.un_chomsky_normal_form(original)
- # convert tree back to bracketed text
- sentence2 = original.pprint()
- print(sentence)
- print(sentence2)
- print("Sentences the same? ", sentence == sentence2)
- draw_trees(t, collapsedTree, cnfTree, parentTree, original)
- if __name__ == '__main__':
- demo()
- __all__ = ["chomsky_normal_form", "un_chomsky_normal_form", "collapse_unary"]
|