How do I handle whitespace with Python's ElementTree?

Problem:
When whitespace is insignificant, representation may be very significant.
Explanation:
In XML Schema Part 2: Datatypes Second Edition, the constraining facet whiteSpace is defined for types derived from string (http://www.w3.org/TR/xmlschema-2/#rf-whiteSpace). If this whiteSpace facet is replace or collapse, the value may be changed during normalization.
There is a note at the end of Section 4.3.6:
The notation #xA used here (and elsewhere in this specification)
represents the Universal Character Set (UCS) code point hexadecimal A
(line feed), which is denoted by U+000A. This notation is to be
distinguished from &#xA;, which is the XML character reference to that
same UCS code point.
Example:
If the datatype for an element elem has a whitespace constraint collapse, "<elem> text </elem>" should become "text" (leading and trailing whitespace removed), but "<elem>&#x20;text&#x20;</elem>" should become " text " (whitespace encoded by character references is not removed).
Questions:
So either the parser/tree builder handles this normalization or this is done afterwards.
Informed parsing:
Where do I provide the parser or tree builder with the information on how to normalize some XML element?
Is there something like set_whitespace_normalization('./country/neighbor', 'collapse')?
Is there a hook like normalize(content) in the parser or tree builder?
Post processing
How do I access the original content of some element?
Is there an elem.original_text that may return "&#x20;text&#x20;" (the character references kept intact)?
Is there an elem.unnormalized_text that may return " text " (the references resolved, but the whitespace not collapsed)?
I would like to use Python's xml.etree.ElementTree but I will consider any other XML library that does the job.
Disclaimer:
Of course it is bad style to declare whitespace insignificant (replace or collapse) and then to cheat by using character references. In most cases either the data or the schema should be changed to prevent that, but sometimes you have to work with foreign XML schemata and foreign XML documents. And the sheer existence of the note cited above indicates that the XML editors were aware of this dilemma and deliberately did not prevent it.

This appears to be a known bug in xml.etree.ElementTree: http://bugs.python.org/issue17582. According to that bug report, this is correctly handled in lxml.etree: https://pypi.python.org/pypi/lxml/.
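For illustration, a minimal sketch of the difference for the attribute-value case (the sample document and the expected outputs are assumptions based on the bug report, not verified against a particular library version):

import xml.etree.ElementTree as ET
from lxml import etree

# Character-referenced whitespace in an attribute value must survive
# XML attribute-value normalization; literal whitespace must not.
doc = b'<elem attr="a&#10;b"/>'

print(repr(ET.fromstring(doc).get('attr')))     # stdlib: may yield 'a b'
print(repr(etree.fromstring(doc).get('attr')))  # lxml: 'a\nb' per the report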

Related

PyQt5 incorrect label formatting with links

I have two issues with how PyQt is formatting my QLabels
Issue 1:
When hyperlinks are added, it displays as if there were no newlines in the string.
For the input text:
<a href="https://www.google.co.uk/">https://www.google.co.uk/</a>
<a href="https://www.google.co.uk/">https://www.google.co.uk/</a>
<a href="https://www.google.co.uk/">https://www.google.co.uk/</a>
it is shown with no newlines between the links.
Issue 2: Sometimes PyQt doesn't even detect the 'a' tag. This happens when the start of the string is not a hyperlink but it is then followed by newlines with hyperlinks, e.g. this input:
test
<a href="https://www.google.co.uk/">https://www.google.co.uk/</a>
<a href="https://www.google.co.uk/">https://www.google.co.uk/</a>
<a href="https://www.google.co.uk/">https://www.google.co.uk/</a>
Now the newlines are properly shown, but PyQt no longer detects the hyperlinks.
From the text property documentation of QLabel:
The text will be interpreted either as plain text or as rich text, depending on the text format setting; see setTextFormat(). The default setting is Qt::AutoText; i.e. QLabel will try to auto-detect the format of the text set.
The AutoText flag can only make a guess using simple tag syntax checks (basic tags without arguments, such as <b>, or document type declaration headers, like <html>).
This is obviously done for performance reasons.
If you are sure that you're always setting rich text content, use the appropriate Qt.TextFormat enum:
label.setTextFormat(QtCore.Qt.RichText)
Using the HTML-like syntax of rich text follows the same basic concept HTML has had since its birth, almost 30 years ago: line breaks between words in the document (text or tags) are ignored, and runs of multiple spaces are collapsed into one.
So, if you want line breaks, you have to use the appropriate <br> (or <br/> for XHTML) tag.
Also remember that the Qt rich text engine has limited HTML support, as described in the documentation about the Supported HTML Subset.
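Putting both points together, a minimal sketch (PyQt5 assumed; the sample text is made up):

from PyQt5 import QtCore, QtWidgets

app = QtWidgets.QApplication([])

text = 'test\n<a href="https://www.google.co.uk/">a link</a>'
label = QtWidgets.QLabel()
label.setTextFormat(QtCore.Qt.RichText)     # skip the AutoText guessing
label.setText(text.replace('\n', '<br>'))   # rich text ignores plain \n
label.setOpenExternalLinks(True)            # make the links clickable
label.show()
app.exec_()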

Why are non-ASCII characters escaped in attribute-values after writing an XML-file with lxml?

I'm trying to incrementally build an XML file with Python, using etree.xmlfile from lxml.
My input is an XML-file, where there are umlauts in attribute values. I read this in with lxml, make some changes to the names of the attributes, and then write it to a new file.
This is my code, broken down:
from lxml import etree

with etree.xmlfile(path_to_new_file, encoding="utf8") as xf:
    with xf.element("corpus"):
        for _, element in etree.iterparse(path_to_original_file, tag="comment"):
            new_element = transform_element(element)
            xf.write(new_element)
            del element
            del new_element
In the original file, I might have an element like this:
<comment title="Kübel">Some text with umlauts like this üä</comment>
But after processing, the same comment in the new file looks like this:
<comment title="K&#252;bel">Some text with umlauts like this &#252;&#228;</comment>
Do you have any idea what might cause this?
ü does not have to be escaped in an XML attribute value (or in a text node child of an element).
Probably the developer of the library was being overly cautious and called a generic string-escaping function, possibly to leverage its escaping of <, which always has to be escaped, and of ' or ", which have to be escaped when they match the delimiting quotation mark of the attribute value.
For precise escaping requirements concisely presented, see Simplified XML Escaping.
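A quick illustration with the stdlib helpers (the sample strings are made up): only markup characters need replacing; the umlauts pass through untouched.

from xml.sax.saxutils import escape, quoteattr

print(escape('Kübel & <Eimer>'))   # -> Kübel &amp; &lt;Eimer&gt;
print(quoteattr('Kübel "üä"'))     # wraps in quotes, escaping as needed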

Regex behaves differently for the same input string

I am trying to get a pdf page with a particular string and the string is:
"statement of profit or loss"
and I'm trying to accomplish this using following regex:
re.search('statement of profit or loss', text, re.IGNORECASE)
But even though the page contained the string "statement of profit or loss", the regex returned None.
On further investigating the document, I found that the characters "fi" in "profit", as written in the document, are more congested. When I copied the string from the document and pasted it into my code, it worked fine.
So, if I copy "statement of profit or loss" from the document and paste it into re.search() in my code, it works fine. But if I type "statement of profit or loss" manually in my code, re.search() returns None.
How can I avoid this behavior?
The 'congested' characters copied from your PDF are actually a single character: the fi ligature U+FB01: ﬁ.
Either it was entered as such in the source document, or the typesetting engine that was used to create the PDF replaced the combination f+i by ﬁ.
Combining two or more characters into a single glyph is a fairly usual operation for "nice typesetting", and is not limited to fi, fl, ff, and fj, although these are the most used combinations. (That is because in some fonts the long overhang of the f glyph jarringly touches or overlaps the next character.) Actually, you can have any number of ligatures; some Adobe fonts use a single ligature for Th.
Usually this is not a problem with text extraction, because the PDF can specify that certain glyphs must be decoded as a string of characters – the original characters. So possibly your PDF does not contain such a definition, or the typesetting engine did not bother, because the single character ﬁ is a valid Unicode character by itself (although its use is highly discouraged).
You can work around this by explicitly cleaning up your text strings before processing any further:
text = text.replace(u'\ufb01', 'fi')   # U+FB01 is the single fi ligature
– repeat this for other problematic ligatures which have a Unicode code point: ﬂ (U+FB02), ﬀ (U+FB00), ﬃ (U+FB03), ﬄ (U+FB04) (I possibly missed some more).
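Alternatively, a sketch using Unicode compatibility normalization, which decomposes all the Latin ligature code points (U+FB00..U+FB06) in one go:

import unicodedata

text = u'statement of pro\ufb01t or loss'   # contains the fi ligature
print(unicodedata.normalize('NFKC', text))  # statement of profit or loss

Note that NFKC rewrites every compatibility character, not just ligatures, so apply it only if that is acceptable for the rest of your data.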

read xml file using lxml get error EntityRef

I use lxml to read an XML file which has a structure like below:
<domain>http://www.trademe.co.nz</domain>
<start>http://www.trademe.co.nz/Browse/CategoryAttributeSearchResults.aspx?search=1&cid=5748&sidebar=1&rptpath=350-5748-4233-&132=FLAT&134=&153=&29=&122=0&122=0&59=0&59=0&178=0&178=0&sidebarSearch_keypresses=0&sidebarSearch_suggested=0</start>
and my python code is:
from lxml import etree
tree = etree.parse('metaWeb.xml')
When I run it, I get:
entityref: expecting ';' error
However, when I remove the & symbols from the XML file, everything is fine.
How can I solve that error?
The problem is that this isn't valid XML. In XML, a & symbol always starts an entity reference, like &#x4D2; for the character U+04D2 (aka Ӓ), &quot; for the character ", or some custom entity defined in your document/DTD/schema.*
If you want to put a literal & into a string, you have to replace it with something else, typically &amp;, which is a character entity reference for the ampersand character.
So, if you're sure there are no actual entity references in your document, just un-escaped ampersands, you can fix it pretty simply:
with open('metaWeb.xml') as f:
    xml = f.read().replace('&', '&amp;')
tree = etree.fromstring(xml)
However, a better solution, if possible, is to fix whatever program is generating this incorrect XML.
* This is not quite true; a numeric character reference is not actually an entity reference. Also, a character entity reference like &quot; or &amp; is the same as any other reference with replacement text; the entities just happen to be implicitly defined by the XML/HTML base DTDs. But lxml, like most XML software, uses the term "entity reference" slightly more broadly than the standard.
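If the file may contain genuine references alongside the bare ampersands, a more careful substitution is possible; the following sketch (the pattern is an approximation of the reference syntax, not a full XML grammar) escapes only ampersands that do not already start a reference:

import re

def fix_amps(xml):
    # leave &name;, &#10; and &#x4D2; style references alone
    return re.sub(r'&(?!(?:[A-Za-z][\w.-]*|#\d+|#x[0-9A-Fa-f]+);)',
                  '&amp;', xml)

print(fix_amps('search=1&cid=5748&amp;x=&#10;'))
# -> search=1&amp;cid=5748&amp;x=&#10;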
Replace & with &amp; in your xml file, otherwise your xml is not compliant with the XML standard.

Unescape _xHHHH_ XML escape sequences using Python

I'm using Python 2.x [not negotiable] to read XML documents [created by others] that allow the content of many elements to contain characters that are not valid XML characters by escaping them using the _xHHHH_ convention e.g. ASCII BEL aka U+0007 is represented by the 7-character sequence u"_x0007_". Neither the functionality that allows representation of any old character in the document nor the manner of escaping is negotiable. I'm parsing the documents using cElementTree or lxml [semi-negotiable].
Here is my best attempt at unescaping the parser output as efficiently as possible:
import re

def unescape(s,
             subber=re.compile(r'_x[0-9A-Fa-f]{4,4}_').sub,
             repl=lambda mobj: unichr(int(mobj.group(0)[2:6], 16)),
             ):
    if "_" in s:
        return subber(repl, s)
    return s
The above is biased by observing a very low frequency of "_" in typical text and a better-than-doubling of speed by avoiding the regex apparatus where possible.
The question: Any better ideas out there?
You might as well check for '_x' rather than just _; that won't matter much, but surely the two-character sequence is even rarer than the single underscore. Apart from such details, you do seem to be making the best of a bad situation!
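Applying that suggestion, a tweaked sketch (Python 2, matching the question):

import re

_sub = re.compile(r'_x[0-9A-Fa-f]{4,4}_').sub
_repl = lambda mobj: unichr(int(mobj.group(0)[2:6], 16))

def unescape(s):
    if '_x' in s:
        return _sub(_repl, s)
    return s

print(repr(unescape(u'bell _x0007_ here')))   # u'bell \x07 here'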
