bscheibel
/
technical_drawings_extraction


			
							123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374
							.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT

==========
 Chunking
==========

    >>> from nltk.chunk import *
    >>> from nltk.chunk.util import *
    >>> from nltk.chunk.regexp import *
    >>> from nltk import Tree

    >>> tagged_text = "[ The/DT cat/NN ] sat/VBD on/IN [ the/DT mat/NN ] [ the/DT dog/NN ] chewed/VBD ./."
    >>> gold_chunked_text = tagstr2tree(tagged_text)
    >>> unchunked_text = gold_chunked_text.flatten()

Chunking uses a special regexp syntax for rules that delimit the chunks. These
rules must be converted to 'regular' regular expressions before a sentence can
be chunked.

    >>> tag_pattern = "<DT>?<JJ>*<NN.*>"
    >>> regexp_pattern = tag_pattern2re_pattern(tag_pattern)
    >>> regexp_pattern
    '(<(DT)>)?(<(JJ)>)*(<(NN[^\\{\\}<>]*)>)'

Construct some new chunking rules.

    >>> chunk_rule = ChunkRule("<.*>+", "Chunk everything")
    >>> chink_rule = ChinkRule("<VBD|IN|\.>", "Chink on verbs/prepositions")
    >>> split_rule = SplitRule("<DT><NN>", "<DT><NN>",
    ...                        "Split successive determiner/noun pairs")


Create and score a series of chunk parsers, successively more complex.

    >>> chunk_parser = RegexpChunkParser([chunk_rule], chunk_label='NP')
    >>> chunked_text = chunk_parser.parse(unchunked_text)
    >>> print(chunked_text)
    (S
      (NP
        The/DT
        cat/NN
        sat/VBD
        on/IN
        the/DT
        mat/NN
        the/DT
        dog/NN
        chewed/VBD
        ./.))

    >>> chunkscore = ChunkScore()
    >>> chunkscore.score(gold_chunked_text, chunked_text)
    >>> print(chunkscore.precision())
    0.0

    >>> print(chunkscore.recall())
    0.0

    >>> print(chunkscore.f_measure())
    0

    >>> for chunk in sorted(chunkscore.missed()): print(chunk)
    (NP The/DT cat/NN)
    (NP the/DT dog/NN)
    (NP the/DT mat/NN)

    >>> for chunk in chunkscore.incorrect(): print(chunk)
    (NP
      The/DT
      cat/NN
      sat/VBD
      on/IN
      the/DT
      mat/NN
      the/DT
      dog/NN
      chewed/VBD
      ./.)

    >>> chunk_parser = RegexpChunkParser([chunk_rule, chink_rule],
    ...                                  chunk_label='NP')
    >>> chunked_text = chunk_parser.parse(unchunked_text)
    >>> print(chunked_text)
    (S
      (NP The/DT cat/NN)
      sat/VBD
      on/IN
      (NP the/DT mat/NN the/DT dog/NN)
      chewed/VBD
      ./.)
    >>> assert chunked_text == chunk_parser.parse(list(unchunked_text))

    >>> chunkscore = ChunkScore()
    >>> chunkscore.score(gold_chunked_text, chunked_text)
    >>> chunkscore.precision()
    0.5

    >>> print(chunkscore.recall())
    0.33333333...

    >>> print(chunkscore.f_measure())
    0.4

    >>> for chunk in sorted(chunkscore.missed()): print(chunk)
    (NP the/DT dog/NN)
    (NP the/DT mat/NN)

    >>> for chunk in chunkscore.incorrect(): print(chunk)
    (NP the/DT mat/NN the/DT dog/NN)

    >>> chunk_parser = RegexpChunkParser([chunk_rule, chink_rule, split_rule],
    ...                                  chunk_label='NP')
    >>> chunked_text = chunk_parser.parse(unchunked_text, trace=True)
    # Input:
     <DT>  <NN>  <VBD>  <IN>  <DT>  <NN>  <DT>  <NN>  <VBD>  <.>
    # Chunk everything:
    {<DT>  <NN>  <VBD>  <IN>  <DT>  <NN>  <DT>  <NN>  <VBD>  <.>}
    # Chink on verbs/prepositions:
    {<DT>  <NN>} <VBD>  <IN> {<DT>  <NN>  <DT>  <NN>} <VBD>  <.>
    # Split successive determiner/noun pairs:
    {<DT>  <NN>} <VBD>  <IN> {<DT>  <NN>}{<DT>  <NN>} <VBD>  <.>
    >>> print(chunked_text)
    (S
      (NP The/DT cat/NN)
      sat/VBD
      on/IN
      (NP the/DT mat/NN)
      (NP the/DT dog/NN)
      chewed/VBD
      ./.)

    >>> chunkscore = ChunkScore()
    >>> chunkscore.score(gold_chunked_text, chunked_text)
    >>> chunkscore.precision()
    1.0

    >>> chunkscore.recall()
    1.0

    >>> chunkscore.f_measure()
    1.0

    >>> chunkscore.missed()
    []

    >>> chunkscore.incorrect()
    []

    >>> chunk_parser.rules() # doctest: +NORMALIZE_WHITESPACE
    [<ChunkRule: '<.*>+'>, <ChinkRule: '<VBD|IN|\\.>'>,
     <SplitRule: '<DT><NN>', '<DT><NN>'>]

Printing parsers:

    >>> print(repr(chunk_parser))
    <RegexpChunkParser with 3 rules>
    >>> print(chunk_parser)
    RegexpChunkParser with 3 rules:
        Chunk everything
          <ChunkRule: '<.*>+'>
        Chink on verbs/prepositions
          <ChinkRule: '<VBD|IN|\\.>'>
        Split successive determiner/noun pairs
          <SplitRule: '<DT><NN>', '<DT><NN>'>

Regression Tests
~~~~~~~~~~~~~~~~
ChunkParserI
------------
`ChunkParserI` is an abstract interface -- it is not meant to be
instantiated directly.

    >>> ChunkParserI().parse([])
    Traceback (most recent call last):
      . . .
    NotImplementedError


ChunkString
-----------
ChunkString can be built from a tree of tagged tuples, a tree of
trees, or a mixed list of both:

    >>> t1 = Tree('S', [('w%d' % i, 't%d' % i) for i in range(10)])
    >>> t2 = Tree('S', [Tree('t0', []), Tree('t1', ['c1'])])
    >>> t3 = Tree('S', [('w0', 't0'), Tree('t1', ['c1'])])
    >>> ChunkString(t1)
    <ChunkString: '<t0><t1><t2><t3><t4><t5><t6><t7><t8><t9>'>
    >>> ChunkString(t2)
    <ChunkString: '<t0><t1>'>
    >>> ChunkString(t3)
    <ChunkString: '<t0><t1>'>

Other values generate an error:

    >>> ChunkString(Tree('S', ['x']))
    Traceback (most recent call last):
      . . .
    ValueError: chunk structures must contain tagged tokens or trees

The `str()` for a chunk string adds spaces to it, which makes it line
up with `str()` output for other chunk strings over the same
underlying input.

    >>> cs = ChunkString(t1)
    >>> print(cs)
     <t0>  <t1>  <t2>  <t3>  <t4>  <t5>  <t6>  <t7>  <t8>  <t9>
    >>> cs.xform('<t3>', '{<t3>}')
    >>> print(cs)
     <t0>  <t1>  <t2> {<t3>} <t4>  <t5>  <t6>  <t7>  <t8>  <t9>

The `_verify()` method makes sure that our transforms don't corrupt
the chunk string.  By setting debug_level=2, `_verify()` will be
called at the end of every call to `xform`.

    >>> cs = ChunkString(t1, debug_level=3)

    >>> # tag not marked with <...>:
    >>> cs.xform('<t3>', 't3')
    Traceback (most recent call last):
      . . .
    ValueError: Transformation generated invalid chunkstring:
      <t0><t1><t2>t3<t4><t5><t6><t7><t8><t9>

    >>> # brackets not balanced:
    >>> cs.xform('<t3>', '{<t3>')
    Traceback (most recent call last):
      . . .
    ValueError: Transformation generated invalid chunkstring:
      <t0><t1><t2>{<t3><t4><t5><t6><t7><t8><t9>

    >>> # nested brackets:
    >>> cs.xform('<t3><t4><t5>', '{<t3>{<t4>}<t5>}')
    Traceback (most recent call last):
      . . .
    ValueError: Transformation generated invalid chunkstring:
      <t0><t1><t2>{<t3>{<t4>}<t5>}<t6><t7><t8><t9>

    >>> # modified tags:
    >>> cs.xform('<t3>', '<t9>')
    Traceback (most recent call last):
      . . .
    ValueError: Transformation generated invalid chunkstring: tag changed

    >>> # added tags:
    >>> cs.xform('<t9>', '<t9><t10>')
    Traceback (most recent call last):
      . . .
    ValueError: Transformation generated invalid chunkstring: tag changed

Chunking Rules
--------------

Test the different rule constructors & __repr__ methods:

    >>> r1 = RegexpChunkRule('<a|b>'+ChunkString.IN_CHINK_PATTERN,
    ...                      '{<a|b>}', 'chunk <a> and <b>')
    >>> r2 = RegexpChunkRule(re.compile('<a|b>'+ChunkString.IN_CHINK_PATTERN),
    ...                      '{<a|b>}', 'chunk <a> and <b>')
    >>> r3 = ChunkRule('<a|b>', 'chunk <a> and <b>')
    >>> r4 = ChinkRule('<a|b>', 'chink <a> and <b>')
    >>> r5 = UnChunkRule('<a|b>', 'unchunk <a> and <b>')
    >>> r6 = MergeRule('<a>', '<b>', 'merge <a> w/ <b>')
    >>> r7 = SplitRule('<a>', '<b>', 'split <a> from <b>')
    >>> r8 = ExpandLeftRule('<a>', '<b>', 'expand left <a> <b>')
    >>> r9 = ExpandRightRule('<a>', '<b>', 'expand right <a> <b>')
    >>> for rule in r1, r2, r3, r4, r5, r6, r7, r8, r9:
    ...     print(rule)
    <RegexpChunkRule: '<a|b>(?=[^\\}]*(\\{|$))'->'{<a|b>}'>
    <RegexpChunkRule: '<a|b>(?=[^\\}]*(\\{|$))'->'{<a|b>}'>
    <ChunkRule: '<a|b>'>
    <ChinkRule: '<a|b>'>
    <UnChunkRule: '<a|b>'>
    <MergeRule: '<a>', '<b>'>
    <SplitRule: '<a>', '<b>'>
    <ExpandLeftRule: '<a>', '<b>'>
    <ExpandRightRule: '<a>', '<b>'>

`tag_pattern2re_pattern()` complains if the tag pattern looks problematic:

    >>> tag_pattern2re_pattern('{}')
    Traceback (most recent call last):
      . . .
    ValueError: Bad tag pattern: '{}'

RegexpChunkParser
-----------------

A warning is printed when parsing an empty sentence:

    >>> parser = RegexpChunkParser([ChunkRule('<a>', '')])
    >>> parser.parse(Tree('S', []))
    Warning: parsing empty text
    Tree('S', [])

RegexpParser
------------

    >>> parser = RegexpParser('''
    ... NP: {<DT>? <JJ>* <NN>*} # NP
    ... P: {<IN>}           # Preposition
    ... V: {<V.*>}          # Verb
    ... PP: {<P> <NP>}      # PP -> P NP
    ... VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
    ... ''')
    >>> print(repr(parser))
    <chunk.RegexpParser with 5 stages>
    >>> print(parser)
    chunk.RegexpParser with 5 stages:
    RegexpChunkParser with 1 rules:
        NP   <ChunkRule: '<DT>? <JJ>* <NN>*'>
    RegexpChunkParser with 1 rules:
        Preposition   <ChunkRule: '<IN>'>
    RegexpChunkParser with 1 rules:
        Verb   <ChunkRule: '<V.*>'>
    RegexpChunkParser with 1 rules:
        PP -> P NP   <ChunkRule: '<P> <NP>'>
    RegexpChunkParser with 1 rules:
        VP -> V (NP|PP)*   <ChunkRule: '<V> <NP|PP>*'>
    >>> print(parser.parse(unchunked_text, trace=True))
    # Input:
     <DT>  <NN>  <VBD>  <IN>  <DT>  <NN>  <DT>  <NN>  <VBD>  <.>
    # NP:
    {<DT>  <NN>} <VBD>  <IN> {<DT>  <NN>}{<DT>  <NN>} <VBD>  <.>
    # Input:
     <NP>  <VBD>  <IN>  <NP>  <NP>  <VBD>  <.>
    # Preposition:
     <NP>  <VBD> {<IN>} <NP>  <NP>  <VBD>  <.>
    # Input:
     <NP>  <VBD>  <P>  <NP>  <NP>  <VBD>  <.>
    # Verb:
     <NP> {<VBD>} <P>  <NP>  <NP> {<VBD>} <.>
    # Input:
     <NP>  <V>  <P>  <NP>  <NP>  <V>  <.>
    # PP -> P NP:
     <NP>  <V> {<P>  <NP>} <NP>  <V>  <.>
    # Input:
     <NP>  <V>  <PP>  <NP>  <V>  <.>
    # VP -> V (NP|PP)*:
     <NP> {<V>  <PP>  <NP>}{<V>} <.>
    (S
      (NP The/DT cat/NN)
      (VP
        (V sat/VBD)
        (PP (P on/IN) (NP the/DT mat/NN))
        (NP the/DT dog/NN))
      (VP (V chewed/VBD))
      ./.)

Test parsing of other rule types:

    >>> print(RegexpParser('''
    ... X:
    ...   }<a><b>{     # chink rule
    ...   <a>}{<b>     # split rule
    ...   <a>{}<b>     # merge rule
    ...   <a>{<b>}<c>  # chunk rule w/ context
    ... '''))
    chunk.RegexpParser with 1 stages:
    RegexpChunkParser with 4 rules:
        chink rule              <ChinkRule: '<a><b>'>
        split rule              <SplitRule: '<a>', '<b>'>
        merge rule              <MergeRule: '<a>', '<b>'>
        chunk rule w/ context   <ChunkRuleWithContext: '<a>', '<b>', '<c>'>

Illegal patterns give an error message:

    >>> print(RegexpParser('X: {<foo>} {<bar>}'))
    Traceback (most recent call last):
      . . .
    ValueError: Illegal chunk pattern: {<foo>} {<bar>}