lxml parsing with validation and target?

From: rob...@reportlab.com (Robin Becker)
Newsgroups: comp.lang.python
Subject: lxml parsing with validation and target?
Date: Tue, 2 Nov 2021 12:55:17 +0000
Message-ID: <mailman.170.1635857720.23718.python-list@python.org>
 by: Robin Becker - Tue, 2 Nov 2021 12:55 UTC

I'm having a problem using lxml.etree to build a validating parser that also uses a tree-building target. In my test
code, invalid XML is detected and an error is raised only when the target=ET.TreeBuilder() line in the parser setup
below is commented out.

With that line commented out, the validation error looks as expected:
> $ python tlxml.py invalid.rml
> re.compile('^.*(?:\\W|\\b)(?P<fn>dynamic_rml\\.dtd|rml\\.dtd|rml_0_2\\.dtd|rml_0_3\\.dtd|rml_1_0\\.dtd)$', re.MULTILINE)
> Resolving url='../rml.dtd' context=<lxml.etree._ParserContext object at 0x7f66103273c0> dtdPath='rml.dtd'
> Traceback (most recent call last):
> File "/home/robin/devel/reportlab/REPOS/rlextra/tmp/tlxml.py", line 78, in <module>
> tree = ET.parse(sys.argv[1],parser)
> File "src/lxml/etree.pyx", line 3521, in lxml.etree.parse
> File "src/lxml/parser.pxi", line 1859, in lxml.etree._parseDocument
> File "src/lxml/parser.pxi", line 1885, in lxml.etree._parseDocumentFromURL
> File "src/lxml/parser.pxi", line 1789, in lxml.etree._parseDocFromFile
> File "src/lxml/parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
> File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
> File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
> File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
> File "invalid.rml", line 23
> lxml.etree.XMLSyntaxError: No declaration for attribute x of element place1, line 23, column 55

When target=ET.TreeBuilder() is active, no validation takes place: the tree is built and passed to the primitive
tuple tree builder, so the output looks like
> $ python tlxml.py invalid.rml
> Resolving url='../rml.dtd' context=<lxml.etree._TargetParserContext object at 0x7f73d7b159c0> dtdPath='rml.dtd'
> ('document',
> {'filename': 'test_000_simple.pdf', 'invariant': '1'},
> ['\n\n',
> ('stylesheet',
> ........
> None,
> 44),
> '\n \t\t\n \t\t'],
> 40),
> '\n'],
> 35),
> '\n\n'],
> 2)

If I use the standard example EchoTarget instead, the invalid document is likewise accepted without a validation error.
So I assume that the target argument is what stops validation from happening. Is there a way to get validation to work
with a target?
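
For readers unfamiliar with the target interface mentioned above, here is a minimal sketch of an EchoTarget-style class. It is not the lxml documentation example: RecordingTarget and record_events are hypothetical names, and the sketch uses the standard library's xml.etree.ElementTree (which offers the same start/end/data/close target protocol) so that it runs without lxml or a DTD. It performs no validation; it only shows that the parser hands events to the target and returns whatever close() returns.

```python
import xml.etree.ElementTree as StdET


class RecordingTarget:
    """Hypothetical EchoTarget-like class: records parser events in a list."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        # Called for each opening tag with its attributes.
        self.events.append(("start", tag, dict(attrib)))

    def end(self, tag):
        # Called for each closing tag.
        self.events.append(("end", tag))

    def data(self, text):
        # Called for character data (possibly in several chunks).
        self.events.append(("data", text))

    def close(self):
        # The parser returns this value from its own close().
        return self.events


def record_events(xml_text):
    """Feed xml_text to a target parser and return the recorded events."""
    parser = StdET.XMLParser(target=RecordingTarget())
    parser.feed(xml_text)
    return parser.close()
```

Calling record_events('<a x="1">hi</a>') yields a list of start, data and end tuples rather than an element tree, which is the sense in which the target replaces the default tree builder.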

The test code (tlxml.py) is
######################################################################################################
from pprint import pprint
from lxml import etree as ET
import sys, os, re
from rlextra.rml2pdf.rml2pdf import CompatibleDTDNames as rmlDTDPat
# Matches any URL or path that ends in one of the known RML DTD file names.
rmlDTDPat = re.compile('^.*(?:\\W|\\b)(?P<fn>%s)$' % '|'.join((re.escape(_) for _ in rmlDTDPat)),re.M)

class TT:
    """Primitive tuple tree builder: element -> (tag, attrib, content, sourceline)."""
    def __init__(self):
        pass

    def __call__(self,e):
        return (e.tag,e.attrib or None,self.content(e),e.sourceline)

    def content(self,e):
        # Returns None for an empty element, else a list of text and child tuples.
        t = e.text
        if len(e)==0 and t is None:
            return t
        else:
            r = [].append
            if t is not None: r(t)
            for c in e:
                r(self(c))
                t = c.tail
                if t is not None:
                    r(t)
            return r.__self__   # the list that r appends to

class RMLDTDResolver(ET.Resolver):
    """Serves the DTD text shipped with rlextra instead of fetching the URL."""
    __dtds = None
    def resolve(self, url, id, context):
        m = rmlDTDPat.match(url)
        if m:
            if self.__dtds is None:
                # Read and cache both DTD files on first use.
                from rlextra import rml2pdf
                self.__dtds = {}
                for fn in ('rml.dtd','dynamic_rml.dtd'):
                    with open(os.path.join(os.path.dirname(rml2pdf.__file__),fn),'r') as _:
                        self.__dtds[fn] = _.read()
            fn = m.group('fn')
            # The cache keys are 'rml.dtd' and 'dynamic_rml.dtd'.
            dtdPath = 'rml.dtd' if fn.startswith('rml') else 'dynamic_rml.dtd'
            print(f"Resolving url={url!r} context={context!r} {dtdPath=}")
            return self.resolve_string(
                self.__dtds[dtdPath],
                context,
                )
        else:
            return None

parser = ET.XMLParser(
    load_dtd=True,
    dtd_validation=True,
    attribute_defaults=True,
    no_network=True,
    remove_comments=True,
    remove_pis=True,
    strip_cdata=True,
    resolve_entities=True,
    target=ET.TreeBuilder(), #if commented the parser validates
    )
parser.resolvers.add(RMLDTDResolver())
# With a target, ET.parse returns the result of the target's close(), here the root element.
tree = ET.parse(sys.argv[1],parser)
pprint(TT()(tree))
######################################################################################################
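
One possible workaround, offered only as a sketch and not as a confirmed answer to the question above, is to keep the target parser for building the tree and validate the result afterwards with a separate ET.DTD object. The toy DTD, the two sample documents and the function name parse_with_target_then_validate are all hypothetical stand-ins for rml.dtd and invalid.rml. Post-parse validation is not equivalent to dtd_validation=True: in particular, errors are reported after the whole tree has been built rather than during parsing.

```python
import io
from lxml import etree as ET

# Hypothetical toy DTD standing in for rml.dtd: place1 may carry y, but not x.
DTD_TEXT = (
    "<!ELEMENT document (place1)>\n"
    "<!ELEMENT place1 EMPTY>\n"
    "<!ATTLIST place1 y CDATA #IMPLIED>\n"
)
VALID_XML = b'<document><place1 y="1"/></document>'
INVALID_XML = b'<document><place1 x="1"/></document>'  # attribute x is undeclared

TOY_DTD = ET.DTD(io.StringIO(DTD_TEXT))


def parse_with_target_then_validate(xml_bytes, dtd):
    """Build the tree through a target parser, then validate it separately.

    Returns (root, ok); when ok is False, dtd.error_log holds the details.
    """
    parser = ET.XMLParser(target=ET.TreeBuilder())
    # With a target parser, fromstring returns the target's close() result,
    # which for ET.TreeBuilder is the root element.
    root = ET.fromstring(xml_bytes, parser)
    ok = dtd.validate(root)
    return root, ok
```

For example, parse_with_target_then_validate(INVALID_XML, TOY_DTD) returns the built root element together with False, and TOY_DTD.error_log then describes the undeclared attribute. Replacing dtd.validate(root) with dtd.assertValid(root) would raise ET.DocumentInvalid instead of returning a flag.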
