Write a literal newline in ElementTree Attribute - python

from base64 import encodebytes
from io import BytesIO
from xml.etree.ElementTree import SubElement

subelement = SubElement(xml_tree, "image")
stream = BytesIO()
c.image.save(stream, format="PNG")
png = encodebytes(stream.getvalue()).decode("utf-8")
subelement.set("xlink:href", f"data:image/png;base64,{png}")
I am writing a very basic SVG image element and attempting to conform to RFC 2045, which requires that the base64 content be broken into lines (no more than 76 characters each) within the file.
The encoded data comes out line-wrapped, as expected:
<image xlink:href="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAApUAAALUCAIAAADVN145AAAKMWlDQ1BJQ0MgUHJvZmlsZQAAeJyd
...
The written data replaces the \n with &#10;. I need to have ElementTree literally write the \n to disk. Am I missing something? Or is there a workaround?

I think you have the correct result with the XML entity representation of the newline character. You're serializing data as XML, so you need to encode the value the way XML defines. So you wrap your image data twice: first with base64 encoding, then with XML encoding (which is incidentally 1:1 for most characters you care about).
Actually, if you put the newline character itself into the attribute, the XML parser would normalize it to a space when reading it back: XML attribute-value normalization replaces literal newlines in attribute values with spaces, which is exactly why the serializer escapes them as &#10;.
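You can check both behaviours with a quick ElementTree round trip. A minimal sketch, with the expected output shown in comments:
import xml.etree.ElementTree as ET

elem = ET.Element("image")
elem.set("href", "line1\nline2")

serialized = ET.tostring(elem)
print(serialized)                                    # b'<image href="line1&#10;line2" />'

# The &#10; reference survives parsing and comes back as a real newline...
print(repr(ET.fromstring(serialized).get("href")))   # 'line1\nline2'

# ...whereas a literal newline in the attribute is normalized to a space.
raw = b'<image href="line1\nline2" />'
print(repr(ET.fromstring(raw).get("href")))          # 'line1 line2'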

That RFC is about MIME encoding, and I think you are trying to be too literal in implementing its formatting rules when serializing that attribute as XML.
Note that many implementations may elect to encode the
local representation of various content types directly
rather than converting to canonical form first,
encoding, and then converting back to local
representation. In particular, this may apply to plain
text material on systems that use newline conventions
other than a CRLF terminator sequence. Such an
implementation optimization is permissible, but only
when the combined canonicalization-encoding step is
equivalent to performing the three steps separately.
Similarly, a CRLF sequence in the canonical form of the data
obtained after base64 decoding must be converted to a quoted-
printable hard line break, but ONLY when converting text data.
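If the consumer of the SVG does not actually require MIME-style line wrapping, one way to sidestep the issue entirely is to use base64.b64encode, which emits a single unwrapped line with no newlines at all. A minimal sketch (the bytes here are a placeholder for the c.image.save(...) output in the question, and the xlink prefix is assumed to be declared elsewhere, as in the original code):
from base64 import b64encode
from xml.etree.ElementTree import Element, SubElement, tostring

# Placeholder bytes standing in for the PNG data produced by c.image.save(...).
png_bytes = b"\x89PNG\r\n\x1a\n...fake image data..."

# b64encode produces one unwrapped line, so there are no newlines to escape.
png = b64encode(png_bytes).decode("ascii")

root = Element("svg")
image = SubElement(root, "image")
image.set("xlink:href", f"data:image/png;base64,{png}")

print(tostring(root).decode("ascii"))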

Related

The correct way to load and read a JSON file that contains special characters in Python

I'm working with a JSON file that contains some strings in an unknown encoding, like the example below:
"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
I have loaded this text using the json.load() function in a Python 3.7 environment and tried to encode/decode it with several methods I found around the Internet, but I still cannot get the proper string I expected. (In this case, it has to be Lê Nguyễn Phú.)
My question is: which encoding did they use, and how can I parse this text properly in Python?
The JSON file comes from an external source that I don't control, so I cannot know or change how the text was encoded.
[Updated] More details:
The JSON file looks like this:
{
    "content": "L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
}
Firstly, I loaded the JSON file:
import json

with open(json_path, 'r') as f:
    data = json.load(f)
But when I extract the content, it's not what I expected:
string = data.get('content', '')
print(string)
'Lê Nguyá»\x85n Phú'
Someone took "Lê Nguyễn Phú", encoded that as UTF-8, and then took the resulting series of bytes and lied to a JSON encoder by telling it that those bytes were the characters of a string. The JSON encoder then cooperatively produced garbage by encoding those characters. But it is reversible garbage. You can reverse this process using something like
json.loads(in_string).encode("latin_1").decode("utf_8")
Which decodes the string from the JSON, extracts the bytes from it (the 256 symbols in Latin-1 are in a 1-to-1 correspondence with the first 256 Unicode codepoints), and then re-decodes those bytes as UTF-8.
The big problem with this technique is that it only works if you are sure that all of your input is garbled in this fashion... there's no completely reliable way to look at an input and decide whether it should have this broken decoding applied to it. If you try to apply it to a validly-encoded string containing codepoints above U+00FF, it will crash. But if you try to apply it to a validly-encoded string containing only codepoints up to U+00FF, it will turn your perfectly good string into a different kind of garbage.
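Applied to the example from the question, the repair looks like this (a sketch; the JSON text is embedded as a raw string so the \uXXXX escapes reach the parser untouched):
import json

raw = r'{"content": "L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"}'

garbled = json.loads(raw)["content"]
print(repr(garbled))                               # 'LÃª Nguyá»\x85n PhÃº'

# Reverse the mis-encoding: back to bytes via latin-1, then decode as UTF-8.
fixed = garbled.encode("latin_1").decode("utf_8")
print(fixed)                                       # Lê Nguyễn Phú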

Decoding a byte with latin-1 characters to string with decimal representation

I am working on a migration project to upgrade a web server layer from Python 2.7.8 to Python 3.6.3, and I have hit a roadblock for some special cases.
When a request is received from a client, the payload is transmitted locally using pyzmq, which now deals in bytes in Python 3 instead of str (as it did in Python 2).
Now, the payload I am receiving is encoded using the iso-8859-1 (latin-1) scheme, and I can easily convert it into a string with payload.decode('latin-1') and pass it to the next service (svc-save-entity), which expects a string argument.
However, the subsequent service 'svc-save-entity' expects latin-1 chars (if present) to be represented as ASCII character references (such as &#233; for é) rather than in hex (such as \xe9 for é).
I am struggling to find an efficient way to achieve this conversion. Can any Python expert guide me here? Essentially I need the definition of a function, say decode_tostring():
payload = b'Banco Santander (M\xe9xico)' #payload is in bytes
payload_str = decode_tostring(payload) #function to convert into string
payload_str == 'Banco Santander (M&#233;xico)' #payload_str is a string using ASCII character references
Definition of decode_tostring() please. :)
The encode() and decode() methods accept a parameter called errors which allows you to specify how characters which are not representable in the specified encoding should be handled. The one you're looking for is XML numeric character reference replacement, which is fortunately one of the standard handlers provided in the codecs module.
Now, it's a little complex to actually do the replacement the way you want it, because the operation of replacing non-ASCII characters with their corresponding XML numeric character references happens during encoding, not decoding. After all, encoding is the process that takes in characters and emits bytes, so it's only during encoding that you can tell whether you have a character that is not part of ASCII. The cleanest way I can think of at the moment to get the transformation you want is to decode, re-encode, and re-decode, applying the XML entity reference replacement during the encoding step.
def decode_tostring(payload):
    return payload.decode('latin-1').encode('ascii', errors='xmlcharrefreplace').decode('ascii')
I wouldn't be surprised if there is a method somewhere out there that will replace all non-ASCII characters in a string with their XML numeric character refs and give you back a string, and if so, you could use it to replace the encoding and the second decoding. But I don't know of one. The closest I found at the moment was xml.sax.saxutils.escape(), but that only acts on certain specific characters.
This isn't really relevant to your main question, but I did want to clarify one thing: the numeric entities like &#233; are a feature of SGML, HTML, and XML, which are markup languages - a way to represent structured data as text. They have nothing to do with ASCII. A character encoding like ASCII is nothing more than a table of some characters and some byte sequences such that each character in the table is mapped to one byte sequence in the table and vice versa, with a few constraints to make the mapping unambiguous.
If you have a string with characters that are not in a particular encoding's table, you can't encode the string using that encoding. But what you can do is convert the string into a new string by replacing the characters which aren't in the table with sequences of characters that are in the table, and then encode the new string. There are many ways to do the replacement, of which XML numeric entity references are one example. Some of the other error handlers in Python's codecs module represent other approaches to this replacement.
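For comparison, here is a small sketch of a few of those standard error handlers applied to the same text, with the expected output in comments; it makes the different replacement strategies easy to see:
# Compare several standard codec error handlers on the same non-ASCII text.
s = 'Banco Santander (México)'

for handler in ('xmlcharrefreplace', 'backslashreplace', 'namereplace', 'replace'):
    print(handler, '->', s.encode('ascii', errors=handler).decode('ascii'))

# xmlcharrefreplace -> Banco Santander (M&#233;xico)
# backslashreplace -> Banco Santander (M\xe9xico)
# namereplace -> Banco Santander (M\N{LATIN SMALL LETTER E WITH ACUTE}xico)
# replace -> Banco Santander (M?xico)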

Converting Python 3 bytes object to string when bytes object apparently only contains characters

I'm new to Python 3 and it seems that I can't quite completely grasp unicode and character encoding.
I'm working with the output of another tool that returns the content of an HTML page as a bytes object. Other tools we use need this output to be of bytes type, but I'd like to convert the bytes output to a string for some parsing and comparison with other strings. For the cases I'm interested in, printing the output bytes object shows only characters and no \x or \u escapes. I'm a little confused about how best to do this and why the methods that create the desired output actually work.
I've read elsewhere that .decode() should be used in this context and this does work, but I don't understand why I am decoding an object that is already characters. From what I understand, decoding is intended for binary numbers, for example:
>>> b'\x41'.decode('utf-8')
'A'
In my understanding, all I really want to do is tell Python that an object that's been labeled as a bytes type object is actually a str object. Simply using the str() function on the bytes object also accomplishes this goal, but adds the "b" prefix and adds quotations around the string.
Here are the two solutions I'm working with:
>>> str(b'htmltext')
"b'htmltext'"
>>> b'htmltext'.decode('utf-8')
'htmltext'
Essentially, either of these solutions appears to achieve what I'm looking for, but the decode() obviously seems cleaner and, from what I've read, the recommended method. I'm wondering why decode() works, given that, apparently, I'm not converting binary numbers to characters. Furthermore, is there any reason other than the unappealing "b" and quotation marks in the output that str() would not be a valid solution here?
Don't confuse the developer-friendly representation of the bytes object with the data that is contained in it. You have binary data either way.
The developer representation makes it easy for you to see what is contained by showing anything that just happens to be a valid ASCII codepoint as that ASCII character, rather than the \xhh escape code. It's just easier to read text encoded as ASCII that way, and a lot of the world's text happens to be ASCII encoded.
You'll have a harder time when the data is not within the ASCII range however:
>>> 'Åæøéï'.encode('utf8')
b'\xc3\x85\xc3\xa6\xc3\xb8\xc3\xa9\xc3\xaf'
That's a UTF-8 byte sequence encoding text with accents. The above may be a little bit contrived, but most non-English text will include some non-ASCII text. Even English text can contain em-dashes or fancy quotes, and the b'...' bytes version of that is not nearly as readable as the properly decoded text version:
>>> '“Kragerø” is a town in Norway – in the province of Vestfold'.encode('utf8')
b'\xe2\x80\x9cKrager\xc3\xb8\xe2\x80\x9d is a town in Norway \xe2\x80\x93 in the province of Vestfold'
Note that the b'....' output is the result of using the repr() function on a bytes object; that calls the object.__repr__() method, which has the explicit purpose of producing a developer-friendly string for you. There is no dedicated object.__str__() method on a bytes object; the __repr__ method is called instead, even when you use the str() function. The proper way to convert a bytes value to a string is to decode it (using the correct codec for the data).
Of course, when you have binary data that represents something else, like, say, image data, then keep it as bytes. There is no text to decode there.
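A short sketch of that distinction, with the expected output in comments:
# str() on a bytes object just gives you the developer representation;
# .decode() is what actually recovers the text.
data = 'Kragerø'.encode('utf-8')

print(repr(data))            # b'Krager\xc3\xb8'
print(str(data))             # b'Krager\xc3\xb8'  (same as repr)
print(data.decode('utf-8'))  # Kragerø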

Separate binary data (blobs) in csv files

Is there any safe way of mixing binary with text data in a (pseudo)csv file?
One naive and partial solution would be:
using a compound field separator made of more than one character (e.g. the \a\b sequence)
saving each field as either text or binary data; the parser of the pseudo-csv would look for the \a\b sequence and read the data between separators according to a known rule (e.g. by means of a known header giving each field's name and type)
The core issue is that binary data is not guaranteed to not contain the \a\b sequence somewhere inside its body, before the actual end of the data.
The proper solution would be to save the individual blob fields in their own separate physical files and only include the filenames in a .csv, but this is not acceptable in this scenario.
Is there any proper and safe solution, either already implemented or applicable given these restrictions?
If you need everything in a single file, just use one of the methods for encoding binary as printable ASCII, and add the result to the CSV fields (letting the CSV module add and escape quotes as needed).
One such method is base64, but Python's base64 module also offers more space-efficient codecs such as base85 (available in Python 3.4 and above).
So, an example in Python 2.7 would be:
import csv, base64
import random
data = b''.join(chr(random.randrange(0,256)) for i in range(50))
writer = csv.writer(open("testfile.csv", "wt"))
writer.writerow(["some text", base64.b64encode(data)])
Of course, you have to do the proper base64 decoding on reading the file as well - but it is certainly better than trying to create an ad-hoc escaping method.
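For reference, the same idea in Python 3 might look like this (a sketch; os.urandom stands in for whatever binary blob you actually have):
import base64
import csv
import os

# Fifty random bytes standing in for an arbitrary binary blob.
data = os.urandom(50)

# Write: b64encode returns bytes, so decode to ASCII text before handing it to csv.
with open("testfile.csv", "w", newline="") as f:
    csv.writer(f).writerow(["some text", base64.b64encode(data).decode("ascii")])

# Read: base64-decode the second column to recover the original bytes.
with open("testfile.csv", newline="") as f:
    for text, blob_b64 in csv.reader(f):
        assert base64.b64decode(blob_b64) == data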

Storing VT100 escape codes in an XML file

I'm writing a Python program that logs terminal interaction (similar to the script program), and I'd like to store the log in XML format.
The problem is that the terminal interaction includes VT100 escape codes. Python doesn't complain if I write the data to a file as UTF-8 encoded, e.g.:
...
pid, fd = pty.fork()
if pid == 0:
    os.execvp("bash", ("bash", "-l"))
else:
    # Lots of TTY-related stuff here
    # see http://groups.google.com/group/comp.lang.python/msg/de40b36c6f0c53cc
    fout = codecs.open("session.xml", encoding="utf-8", mode="w")
    fout.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    fout.write("<session>\n")
    ...
    r, w, e = select.select([0, fd], [], [], 1)
    for f in r:
        if f == fd:
            fout.write("<entry><![CDATA[")
            buf = os.read(fd, 1024)
            fout.write(buf)
            fout.write("]]></entry>\n")
        else:
            ....
    fout.write("</session>")
    fout.close()
This script "works" in the sense that it writes a file to disk, but the resulting file is not proper utf-8, which causes XML parsers like etree to barf on the escape codes.
One way to deal with this is to filter out the escape codes first. But is it possible to do something like this where the escape codes are maintained and the resulting file can still be parsed by XML tools like etree?
Your problem is not that the control codes aren't proper UTF-8, they are, it's just ASCII ESC and friends are not proper XML characters, even inside a CDATA section.
The only valid XML characters in XML 1.0 with values less than U+0020 are U+0009 (tab), U+000A (newline) and U+000D (carriage return). If you want to record things involving other codes such as escape (U+001B), then you will have to escape them in some way. There is no other option.
As Charles said, most control codes may not be included in an XML 1.0 file at all.
However, if you can live with requiring XML 1.1, you can use them there. They can't be included as raw characters, but they can be included as character references, e.g. &#27; for the escape character.
Because you can't write character references in a CDATA section (they'd just be interpreted as ampersand-hash-...), you would have to lose the <![CDATA[ wrapper and manually escape the &, < and > characters to their entity-reference equivalents.
Note that you should do this anyway: CDATA sections do not absolve you of the responsibility for text escaping, because they will fail if the text inside includes the sequence ]]>. (Since you always have to do some escaping anyway, this makes CDATA sections pretty useless most of the time.)
XML 1.1 is more lenient about control codes, but not everything supports it and you still can't include the NUL character (&#0;). In general it's not a good idea to include control characters in XML. You could use an ad-hoc encoding scheme to fit binary in; base64 is popular, but not very human-readable. Alternatives might include using random characters from the Private Use Area as substitutes, if it's only ever your own application that will be handling the files, or encoding them as elements (e.g. <esc color="1"/>).
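As a concrete example of that kind of ad-hoc escaping, here is a small sketch (the helper name and the \xNN notation are my own, not a standard API) that keeps the log parseable by replacing the characters XML 1.0 forbids with a visible escape:
import re
from xml.sax.saxutils import escape

# Control characters that XML 1.0 does not allow (tab, newline and CR are fine).
_FORBIDDEN = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def xml_safe(text):
    # Escape &, < and > first, then make the remaining control codes visible.
    return _FORBIDDEN.sub(lambda m: '\\x%02x' % ord(m.group()), escape(text))

print(xml_safe('\x1b[1mbold\x1b[0m & done'))
# \x1b[1mbold\x1b[0m &amp; done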
Did you try putting your data inside a CDATA section? This should prevent the parser from trying to read the content of the tag.
http://en.wikipedia.org/wiki/CDATA
