Python XML: write " instead of &quot - python

I am using Python's xml.dom.minidom and everything works well, except that in text content it writes &quot; entities instead of ". This of course makes sense if a quote appears inside a tag (for example in an attribute value), but it bugs me in the text. How do I change this?

Looking at the source (Python 3.2, if it matters), this is hardcoded in the _write_data() function. You would need to modify the writexml() method of the text node class (Text in minidom), either by subclassing it or by simply editing it, so that it does not call that function but instead does something similar that escapes only <, > and &.
If you create a subclass outside of the package (instead of copying and hacking the package to make your own custom minidom), then it looks like, with a little care, you could make things work. You would create your own subclass of the text node class, modified as above, and then, to add text to the DOM, add an instance of your new class (or replace existing text nodes with instances of that class). You would need to set the ownerDocument attribute. Perhaps simplest would be to also subclass Document and override the createTextNode() method.
But I don't see a simpler way of doing what you want. It might be best to use a better DOM implementation.
PS: I have no idea whether this behaviour is required by the XML spec or not. Update: a quick scan of http://www.w3.org/TR/2008/REC-xml-20081126/#syntax suggests that only < and & must be encoded.
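The subclassing approach described above could look roughly like the following sketch. The class names PlainQuoteText and PlainQuoteDocument are made up for illustration, and using xml.sax.saxutils.escape (which escapes &, < and > but leaves quotes alone) is just one possible replacement for the hardcoded escaping, not something the answer prescribes.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape


class PlainQuoteText(minidom.Text):
    """Hypothetical text node that escapes only &, < and > when written."""

    def writexml(self, writer, indent="", addindent="", newl=""):
        # escape() leaves double quotes untouched.
        writer.write("%s%s%s" % (indent, escape(self.data), newl))


class PlainQuoteDocument(minidom.Document):
    """Hypothetical document whose createTextNode() returns PlainQuoteText."""

    def createTextNode(self, data):
        if not isinstance(data, str):
            raise TypeError("node contents must be a string")
        node = PlainQuoteText()
        node.data = data
        node.ownerDocument = self  # required, as noted in the answer
        return node


doc = PlainQuoteDocument()
root = doc.createElement("msg")
doc.appendChild(root)
root.appendChild(doc.createTextNode('say "hi" & <bye>'))

xml_text = root.toxml()
print(xml_text)
```

Attribute values still go through minidom's own escaping, so quotes inside attributes stay encoded.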

Related

How to use gettext with python >3.6 f-strings

Previously you would use gettext as follows:
_('Hey {},').format(username)
But what about Python's new f-strings?
f'Hey {username}'
'Hey {},' is contained in your translation dictionary as is.
If you use f'Hey {username},', the f-string is evaluated before gettext sees it, which produces a different string (for example 'Hey John,') that is not in the dictionary and so won't be translated.
In that case, the format method remains the only usable one, but you can get close to f-string features by using named parameters:
_('Hey {username},').format(username=username)
or if you have a dictionary containing your data, this cool trick where format picks the required information in the input dictionary:
d = {"username":"John", "city":"New York", "unused":"doesn't matter"}
_('Hey {username} from {city},').format(**d)
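To make the difference concrete, here is a small sketch that swaps the compiled .mo catalog for an in-memory dictionary. The DictTranslations class and the French translation are invented for the example; only gettext.NullTranslations comes from the standard library.

```python
import gettext


class DictTranslations(gettext.NullTranslations):
    """Hypothetical in-memory catalog standing in for a compiled .mo file."""

    def __init__(self, messages):
        super().__init__()
        self._messages = messages

    def gettext(self, message):
        # Fall back to the original message when no translation exists.
        return self._messages.get(message, message)


_ = DictTranslations({"Hey {username},": "Salut {username},"}).gettext

username = "John"

# The template is looked up first and formatted afterwards: translated.
translated = _("Hey {username},").format(username=username)

# The f-string is evaluated first, so the lookup key is 'Hey John,': no match.
untranslated = _(f"Hey {username},")

print(translated)
print(untranslated)
```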
My solution is to make a function f() which performs the f-string interpolation after gettext has been called.
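from copy import copy
from inspect import currentframe

def f(s):
    # Collect the caller's globals and locals so the string can see them.
    frame = currentframe().f_back
    kwargs = copy(frame.f_globals)
    kwargs.update(frame.f_locals)
    # Evaluate s as if it had been written as an f-string in the caller.
    return eval('f' + repr(s), kwargs)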
Now you just wrap _(...) in f() and don’t preface the string with an f:
f(_('Hey, {username}'))
Note of caution
I’m usually against the use of eval, as it can make a function unsafe, but I personally think it is justified here, so long as you’re aware of what’s being formatted. That said, use at your own risk.
Remember
This isn’t a perfect solution; it is just my solution. As PEP 498 puts it, each of the existing formatting methods “have their advantages, but in addition have disadvantages”, and that includes this one.
For example, if you need to change the expression inside the string, it will no longer match and therefore will not be translated unless you also update your .po file. Also, if you’re not the one translating the strings and you use an expression whose outcome is hard to predict, that can cause miscommunication or other issues in translation.

Sphinx extension: literal block with leading and/or trailing blank lines?

As far as I can tell, it is not possible to create a literal text block (e.g. with the code-block directive) that starts or ends with a blank line, because this would be ambiguous with regard to the reStructuredText syntax.
That's OK.
But now I want to create a custom directive that uses docutils's literal_block() node, and I want (within the code of my directive) to add empty lines at the beginning and/or end of the directive's contents.
Since this isn't possible in reStructuredText syntax, I'm planning to use the directive's options to specify the number of blank lines, but that's not my problem and not part of my question. Just in case you're wondering ...
Here's a minimal example of what I want to do:
from docutils import nodes
from docutils.parsers.rst import Directive

class MyDirective(Directive):
    has_content = True

    def run(self):
        # Add blank lines before and after the directive's content.
        text = '\n\n' + '\n'.join(self.content.data) + '\n\n'
        node = nodes.literal_block(text, text)
        print(node)
        return [node]

def setup(app):
    app.add_directive('mydirective', MyDirective)
It can be used like this:
.. mydirective::

   Hello, world!
This works, but the newlines I added in the directive are somehow swallowed by Sphinx (in both HTML and LaTeX output).
How can I avoid that?
The newlines are actually stored in the node object (as can be seen in the output of print()), but they seem to get lost somewhere later during Sphinx processing.
I don't know enough about the Sphinx machinery to track this down on my own; any help would be very much appreciated!
I would rather try with CSS margin-top and margin-bottom properties.
I found an answer to my own question, but it is far more complicated than I had hoped ...
I created a custom node class and added a literal_block instance as a child node.
I'm saving the number of empty lines as attributes of the custom node class.
Then I created "visit" and "depart" functions (actually only the latter is needed) for HTML and LaTeX that take the numbers from the node attributes and do some inelegant string replacement on self.body to fumble the newlines into place.
This works fine for both HTML and LaTeX but I'd be happy to hear about a more elegant solution!

PyYAML variables in multiline

I'm trying to get a multi-line string to use variables in PyYAML, but I'm not sure if this is even possible.
So, in YAML, you can assign a variable like:
current_host: &hostname myhost
But it doesn't seem to expand in the following:
test: |
  Hello, this is my string
  which is running on *hostname
Is this at all possible or am I going to have to use Python to parse it?
The anchor (&some_id) and reference, or alias, (*some_id) mechanism is essentially meant to make it possible to share complete nodes between parts of the tree representation that a YAML text describes. This is necessary, for example, so that one and the same complex item (a sequence/list or a mapping/dict) that occurs twice in a list loads as one and the same object (instead of two copies with the same values). A reference is therefore not expanded inside a scalar such as a block string.
So yes, you need to do the parsing in Python. You could start with the mechanism I provided in this answer and change the test
if node.value and node.value.startswith(self.d['escape'])
to find the escape character in any place in the scalar and take appropriate action.
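As a much simpler alternative to a custom loader, the substitution can also be done after loading. The sketch below is not the mechanism from the linked answer: it assumes the document has already been loaded into a plain dict (for example by yaml.safe_load), and it uses a hypothetical {current_host} placeholder that names the key, because anchor names such as hostname are no longer available once the document is loaded. The helper name expand_placeholders is made up.

```python
def expand_placeholders(data):
    """Return a copy of data with {key} placeholders in string values filled in."""
    return {
        key: value.format_map(data) if isinstance(value, str) else value
        for key, value in data.items()
    }


# Stand-in for what a YAML loader would return for a document such as:
#   current_host: myhost
#   test: |
#     Hello, this is my string
#     which is running on {current_host}
loaded = {
    "current_host": "myhost",
    "test": "Hello, this is my string\nwhich is running on {current_host}\n",
}

expanded = expand_placeholders(loaded)
print(expanded["test"])
```

Because this relies on str.format_map, any literal braces in the YAML strings would have to be doubled, and only top-level keys are handled.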
You can find the answer here.
Just use a + between lines, and enclose your strings in single quotes (').

Decoding RFC 2231 headers

Trying to address this issue, I'm trying to wrap my head around the various functions in the Python standard library aimed at supporting RFC 2231. The main aim of that RFC appears to be three-fold: allowing non-ASCII encoding in header parameters, noting the language of a given value, and allowing header parameters to span multiple lines. The email.utils module provides several functions to deal with various aspects of this. As far as I can tell, they work as follows:
decode_rfc2231 only splits the value of such a parameter into its parts, like this:
>>> email.utils.decode_rfc2231("utf-8''T%C3%A4st.txt")
['utf-8', '', 'T%C3%A4st.txt']
decode_params takes care of detecting RFC2231-encoded parameters. It collects parts which belong together, and also decodes the url-encoded string to a byte sequence. This byte sequence, however, is then encoded as latin1. And all values are enclosed in quotation marks. Furthermore, there is some special handling for the first argument, which still has to be a tuple of two elements, but those two get passed to the result without modification.
>>> email.utils.decode_params([
... (1,2),
... ("foo","bar"),
... ("name*","utf-8''T%C3%A4st.txt"),
... ("baz*0","two"),("baz*1","-part")])
[(1, 2), ('foo', '"bar"'), ('baz', '"two-part"'), ('name', ('utf-8', '', '"Täst.txt"'))]
collapse_rfc2231_value can be used to convert this triple of encoding, language and byte sequence into a proper unicode string. What has me confused, though, is the fact that if the input was such a triple, then the quotes will be carried over to the output. If, on the other hand, the input was a single quoted string, then these quotes will be removed.
>>> [(k, email.utils.collapse_rfc2231_value(v)) for k, v in
... email.utils.decode_params([
... (1,2),
... ("foo","bar"),
... ("name*","utf-8''T%C3%A4st.txt"),
... ("baz*0","two"),("baz*1","-part")])[1:]]
[('foo', 'bar'), ('baz', 'two-part'), ('name', '"Täst.txt"')]
So it seems that in order to use all this machinery, I'd have to add yet another step to unquote the third element of any tuple I'd encounter. Is this true, or am I missing some point here? I had to figure out a lot of the above with help from the source code, since the docs are a bit vague on the details. I cannot imagine what could be the point behind this selective unquoting. Is there a point to it?
What is the best reference on how to use these functions?
The best I found so far is the email.message.Message implementation. There, the process seems to be roughly the one outlined above, but every field gets unquoted via _unquotevalue after the decode_params, and only get_filename and get_boundary collapse their values, all others return a tuple instead. I hope there is something more useful.
Currently the functions from email.utils are rarely used outside of email.message. Most users seem to prefer using email.message.Message directly. There's even a somewhat old issue report on adding unit tests (which would certainly be usable as examples) to Python, even if I'm not sure how it relates to email.utils.
A short example I found is this blog post, which, however, doesn't contain more than one sentence and a few SLOCs of information about RFC 2231 parsing. The author notes, however, that many MTAs use RFC 2047 instead. Depending on your use case, that might also be an issue.
Judging from the few examples I could find, I assume your way of parsing using email.utils is the only way to go, even if the long list comprehension is somewhat ugly.
Because of the lack of examples, it could be wise to write a new RFC 2231 parser (if you really need a better, maybe faster or more beautiful codebase). A new implementation could be based on existing implementations like the Dovecot RFC2231 parser for compatibility reasons (you could even use the Dovecot unit test). As the C code seems quite complex to me, and since I can't find any Python implementation besides email.utils and Python 2 backports of it, the task of porting to Python won't be easy. Note that Dovecot is LGPL-licensed, which might be an issue in your project.
I think the email.utils RFC 2231 API has not been designed for easy standalone usage, but more as a pile of utility functions for use in email.message.Message.
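For comparison, this is roughly what going through email.message.Message looks like on Python 3. It is a minimal sketch with a made-up Content-Disposition header: get_filename() collapses the value for you, while get_param() returns the (charset, language, value) triple that still has to be passed to collapse_rfc2231_value.

```python
from email.message import Message
from email.utils import collapse_rfc2231_value

msg = Message()
# Example header, invented for this sketch.
msg["Content-Disposition"] = "attachment; filename*=utf-8''T%C3%A4st.txt"

# get_filename() unquotes and collapses the RFC 2231 value.
filename = msg.get_filename()

# get_param() unquotes but does not collapse: the result is a 3-tuple.
raw_param = msg.get_param("filename", header="content-disposition")
collapsed = collapse_rfc2231_value(raw_param)

print(filename)
print(raw_param[:2])
print(collapsed)
```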
Old question, but I could not find a complete answer that works on this. So this is what I ended up doing (on Python 2.7):
import email.utils
import urllib

def decode_rfc2231_header(header):
    """Decode an RFC 2231 header"""
    # Remove any quotes
    header = email.utils.unquote(header)
    encoding, language, value = email.utils.decode_rfc2231(header)
    # Percent-decode the value (urllib.unquote is the Python 2 API)
    value = urllib.unquote(value)
    return email.utils.collapse_rfc2231_value((encoding, language, value))
For example:
>>> name = u'èéêëēėęûüùúūàáâäæãåāāîïíīįì test ôöòóœøōõssśšłžźżçćčñń'
>>> encoded_header = email.utils.encode_rfc2231(name.encode("utf8"), 'utf8', 'en')
>>> print encoded_header
utf8'en'%C3%A8%C3%A9%C3%AA%C3%AB%C4%93%C4%97%C4%99%C3%BB%C3%BC%C3%B9%C3%BA%C5%AB%C3%A0%C3%A1%C3%A2%C3%A4%C3%A6%C3%A3%C3%A5%C4%81%C4%81%C3%AE%C3%AF%C3%AD%C4%AB%C4%AF%C3%AC%20test%20%C3%B4%C3%B6%C3%B2%C3%B3%C5%93%C3%B8%C5%8D%C3%B5ss%C5%9B%C5%A1%C5%82%C5%BE%C5%BA%C5%BC%C3%A7%C4%87%C4%8D%C3%B1%C5%84
>>> print decode_rfc2231_header(encoded_header)
èéêëēėęûüùúūàáâäæãåāāîïíīįì test ôöòóœøōõssśšłžźżçćčñń
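On Python 3 the same idea might be written as in the sketch below. The function name decode_rfc2231_header_py3 is made up, and percent-decoding with encoding="latin-1" is an assumption based on how collapse_rfc2231_value re-interprets each character of its text as a raw byte before decoding with the declared charset.

```python
import email.utils
import urllib.parse


def decode_rfc2231_header_py3(header):
    """Decode an RFC 2231 encoded parameter value (Python 3 sketch)."""
    # Remove any surrounding quotes.
    header = email.utils.unquote(header)
    charset, language, value = email.utils.decode_rfc2231(header)
    # Map each percent-encoded byte to one character, so that
    # collapse_rfc2231_value can turn the text back into raw bytes.
    value = urllib.parse.unquote(value, encoding="latin-1")
    return email.utils.collapse_rfc2231_value((charset, language, value))


encoded = email.utils.encode_rfc2231("T\u00e4st.txt", "utf-8", "en")
decoded = decode_rfc2231_header_py3(encoded)
print(encoded)
print(decoded)
```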

Pyparsing: how to implement special processing of C-style comments?

I want to take advantage of the cStyleComment variable, but rather than just ignoring these comments I want to process them specially. Is there any way to make pyparsing call my handler on the piece of input, which it recognizes as a comment, before it's going to be thrown away?
I'm processing some C code, which contain some "special" directives inside comments.
There is nothing inherent in any of the xxxStyleComment expressions that are defined in pyparsing that causes them to be ignored. They are there as a convenience, especially since some comment formats are easy to get wrong. They don't get ignored unless you call the ignore method on your larger grammar, as in:
cHeaderParser.ignore(cStyleComment)
(where cHeaderParser might be something you wrote to read through .h files to extract API information, for instance.)
And having pyparsing call back to a handler is built in: just use cStyleComment.setParseAction(commentHandler). Pyparsing can handle parse actions with any of these signatures:
def commentHandler(inputString, locn, tokens):
def commentHandler(locn, tokens):
def commentHandler(tokens):
def commentHandler():
If your commentHandler returns a string or list of strings, or a new ParseResults, these will be used to replace the input tokens - if it returns None, or omits the return statement, then the tokens object is used. You can also modify the tokens object in place (such as adding new results names).
So you could write something like this that would uppercase your comments:
def commentHandler(tokens):
    return tokens[0].upper()

cStyleComment.setParseAction(commentHandler)
(a parse action as simple as this could even be written cStyleComment.setParseAction(lambda t:t[0].upper()))
When writing a transforming parse action like this, one would likely use transformString rather than parseString:
print(cStyleComment.transformString(source))
This will print the original source, but all of the comments will be uppercased.
