Storing VT100 escape codes in an XML file - python

I'm writing a Python program that logs terminal interaction (similar to the script program), and I'd like to store the log in XML format.
The problem is that the terminal interaction includes VT100 escape codes. Python doesn't complain if I write the data to a file as UTF-8 encoded, e.g.:
...
pid, fd = pty.fork()
if pid == 0:
    os.execvp("bash", ("bash", "-l"))
else:
    # Lots of TTY-related stuff here
    # see http://groups.google.com/group/comp.lang.python/msg/de40b36c6f0c53cc
    fout = codecs.open("session.xml", encoding="utf-8", mode="w")
    fout.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    fout.write("<session>\n")
    ...
    r, w, e = select.select([0, fd], [], [], 1)
    for f in r:
        if f == fd:
            fout.write("<entry><![CDATA[")
            buf = os.read(fd, 1024)
            fout.write(buf)
            fout.write("]]></entry>\n")
        else:
            ...
    fout.write("</session>")
    fout.close()
This script "works" in the sense that it writes a file to disk, but the resulting file is not proper utf-8, which causes XML parsers like etree to barf on the escape codes.
One way to deal with this is to filter out the escape codes first. But is it possible to do something like this where the escape codes are maintained and the resulting file can still be parsed by XML tools like etree?

Your problem is not that the control codes aren't proper UTF-8 (they are); it's that ASCII ESC and friends are not valid XML characters, even inside a CDATA section.
The only valid XML characters in XML 1.0 with values below U+0020 are U+0009 (tab), U+000A (newline) and U+000D (carriage return). If you want to record other codes such as escape (U+001B), you will have to escape them in some way. There is no other option.
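For example, here is a minimal sketch of one possible escaping scheme (the backslash notation is purely an illustrative convention, not something XML tools will decode for you):

import re

# XML 1.0 forbids the C0 control characters other than tab, newline and CR.
_xml_illegal = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def escape_control_codes(text):
    # Replace each disallowed character with a visible token,
    # e.g. ESC (U+001B) becomes the four characters \x1b.
    return _xml_illegal.sub(lambda m: '\\x%02x' % ord(m.group()), text)

In the question's loop this would be fout.write(escape_control_codes(buf)) in place of fout.write(buf).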

As Charles said, most control codes may not be included in an XML 1.0 file at all.
However, if you can live with requiring XML 1.1, you can use them there. They can't be included as raw characters, but they can be included as character references, e.g.:

&#27;

Because you can't write character references in a CDATA section (they'd just be interpreted as ampersand-hash-...), you would have to drop the <![CDATA[ wrapper and manually escape the &, < and > characters to their entity-reference equivalents.
Note that you should do this anyway: CDATA sections do not absolve you of the responsibility for text escaping, because they will fail if the text inside includes the sequence ]]>. (Since you always have to do some escaping anyway, this makes CDATA sections pretty useless most of the time.)
XML 1.1 is more lenient about control codes, but not everything supports it, and you still can't include the NUL character (&#0;). In general it's not a good idea to include control characters in XML. You could use an ad-hoc encoding scheme to fit the binary in; base-64 is popular, but not very human-readable. Alternatives might include using characters from the Private Use Area as substitutes, if only your own application will ever handle the files, or encoding the control codes as elements (e.g. <esc color="1"/>).
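A minimal sketch of the base-64 option, reusing fout and buf from the question (the encoding attribute is only an illustration, not a standard):

import base64

# Store each chunk base64-encoded so the XML stays well-formed no matter
# what bytes the terminal produced; recover them later with base64.b64decode.
fout.write('<entry encoding="base64">')
fout.write(base64.b64encode(buf).decode("ascii"))
fout.write("</entry>\n")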

Did you try putting your data inside a CDATA section? This should prevent the parser from trying to interpret the content of the tag.
http://en.wikipedia.org/wiki/CDATA

Related

Write a literal newline in ElementTree Attribute

from base64 import encodebytes
from io import BytesIO
from xml.etree.ElementTree import SubElement

# xml_tree and c.image come from the surrounding (omitted) code.
subelement = SubElement(xml_tree, "image")
stream = BytesIO()
c.image.save(stream, format="PNG")
png = encodebytes(stream.getvalue()).decode("utf-8")
subelement.set("xlink:href", f"data:image/png;base64,{png}")
I am doing a very basic write of an SVG image element and attempting to conform to RFC 2045, which requires that I provide the base64 content with line ends within the file.
I get the serialized version:
<image xlink:href="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAApUAAALUCAIAAADVN145AAAKMWlDQ1BJQ0MgUHJvZmlsZQAAeJyd
...
The written data replaces the \n with &#10;. I need to have ElementTree literally write the \n to disk. Am I missing something? Or is there a workaround?
I think you have the correct result with the XML entity representation of the newline character. You're serializing data as XML, so you need to encode the value the way XML defines. So you wrap your image data twice: first with base64 encoding, then with XML encoding (which is incidentally 1:1 for most characters you care about).
Actually, if you put the newline character itself into the attribute, the XML parser would normalize it to a space when reading it back.
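A quick round-trip check with the standard library (using a plain attribute name here to sidestep the xlink namespace) shows the behaviour:

import xml.etree.ElementTree as ET

elem = ET.Element("image")
elem.set("href", "line1\nline2")      # a literal newline in the attribute value
data = ET.tostring(elem)
print(data)                           # b'<image href="line1&#10;line2" />'

parsed = ET.fromstring(data)
print(repr(parsed.get("href")))       # 'line1\nline2' -- the newline comes back intact

So nothing is lost; the entity is simply how XML spells that newline on disk.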
That RFC is about MIME encoding, and I think you are trying to be too literal in implementing those formatting rules when encoding that attribute as XML:
Note that many implementations may elect to encode the
local representation of various content types directly
rather than converting to canonical form first,
encoding, and then converting back to local
representation. In particular, this may apply to plain
text material on systems that use newline conventions
other than a CRLF terminator sequence. Such an
implementation optimization is permissible, but only
when the combined canonicalization-encoding step is
equivalent to performing the three steps separately.
Similarly, a CRLF sequence in the canonical form of the data
obtained after base64 decoding must be converted to a quoted-
printable hard line break, but ONLY when converting text data.

BOM character copied into JSON in Python 3

Inside my application, a user can upload a text file, and I need to read it and construct a JSON object for another API call.
I open the file with
f = open(file, encoding="utf-8")
get the first word, and construct the JSON object, ...
My problem is that some files (especially from a Microsoft environment) have a BOM at the beginning. The problem is that my JSON now has this character inside:
{
"word":"\\ufeffMyWord"
}
and of course, the API does not work from this point on.
I am obviously missing something, because shouldn't utf-8 remove the BOM? (Because it is not utf-8-sig.)
How do I overcome this?
No, the UTF-8 standard does not define a BOM character. That's because UTF-8 has no byte order ambiguity issue like UTF-16 and UTF-32 do. The Unicode consortium doesn't recommend using U+FEFF at the start of a UTF-8 encoded file, while the IETF actively discourages it if alternatives to specify the codec exist. From the Wikipedia article on BOM usage in UTF-8:
The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use.
[...]
The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."
The Unicode standard only 'permits' the BOM because it is a regular character, just like any other; it's a zero-width non-breaking space character. As a result, the Unicode consortium recommends it is not removed when decoding, to preserve information (in case it had a different meaning or you wanted to retain compatibility with tools that have come to rely on it).
You have two options:
Strip the BOM from the start of the string explicitly (a plain str.strip() will not remove it, because U+FEFF does not count as whitespace):
text = text.lstrip('\ufeff') # remove the BOM if present
(technically that'll remove any number of zero-width non-breaking space characters, but that is probably what you'd want anyway).
Open the file with the utf-8-sig codec instead. That codec was added to handle such files, explicitly removing the UTF-8 BOM byte sequence from the start if present, before decoding. It'll work on files without those bytes.
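For the second option, a minimal sketch (the file name is just a placeholder):

# utf-8-sig strips a leading BOM if one is present and otherwise behaves
# exactly like utf-8, so it is safe for files both with and without the marker.
with open("upload.txt", encoding="utf-8-sig") as f:
    text = f.read()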
The utf-8 codec doesn't remove the BOM (Byte Order Mark). You have to check whether the file starts with a BOM and remove it yourself, for example on the raw bytes:
import codecs

# raw is the undecoded bytes read from the file
if raw.startswith(codecs.BOM_UTF8):
    raw = raw[len(codecs.BOM_UTF8):]
    print("Removed BOM")
else:
    print("No BOM char, process your file")

encoding problems writing UTF-8 SQL statements to a local file

I'm writing SQL to a file on a server this way:
import codecs
f = codecs.open('translate.sql',mode='a',encoding='utf8',errors='strict')
and then writing SQL statements like this:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to'), lookup.get(q), kw.get(q)))
f.write(query)
I have confirmed that the text was okay when I pulled it. Here is the data from the dictionary (kw) passed out to a webpage:
46:埼玉県
47:熊谷市
42:お散歩デモ
It appears correct (I want it to be utf8 escaped).
But the file.write output is garbage (encoding problems):
INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(279,#last_story_id,62,'ãã©ã³ãã£ã¢ããã'); )
/* updating the story text on old story_id */
UPDATE story_question_response
SET answer = '大学ã®ãã­ã·ã§ã¯ãã¦å­¦çãæ¬å¤§éç½ã®è¢«ç½å°(岩æçã®å¤§è¹æ¸¡å¸)ã«æ´¾é£ãããããã¦ã¯ç¾å°ã®å¤ç¥­ãã®ãæ$
WHERE story_id = 65591
AND question_id = 41
AND group_id = 276;
using an explicit decode gives an error:
f.write(query.decode('utf8'))
I don't know what else to try.
Question: What am I doing wrong, in writing a utf8 file?
We don't have enough information to be sure, but I'd give decent odds that your file is actually perfectly valid UTF-8, and you're just viewing it as if it were something else.
For example, on Windows, if you open a file in Notepad, by default, it will only treat it as UTF-8 if it starts with a UTF-8 BOM (which no valid file ever should, but Microsoft likes them anyway); otherwise, it will treat it as whatever your default code page is. Which is probably some Latin-1 derivative like CP1252.
So, your string of kana and kanji ends up encoded as a bunch of three-byte UTF-8 sequences like '\xe6\xad\xa9'. Then, that gets displayed in Notepad as whatever each of those bytes happen to mean in CP1252, like æ­© (note that there's an invisible character between the two visible ones).
As a general rule, whenever you see weirdly-accented versions of lowercase A and E every 2 or 3 characters, that almost always means you've interpreted some CJK UTF-8 as some Latin-1-derived character set, because UTF-8 uses \xE3 through \xED as the prefix bytes for most CJK characters, and Latin-1 has accented lowercase A and E characters in that range. (Similarly, weirdly-accented capital A versions usually mean European or symbolic UTF-8 interpreted as Latin-1, especially when you've got stray Âs inserted into what looks like otherwise valid or almost-valid European text. If you look at the charts, you should be able to tell why.)
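You can reproduce the effect in a couple of lines, which is a quick way to convince yourself the file itself is fine and only the viewer's code page is wrong (a small Python 3 sketch):

# Encode a kanji as UTF-8, then misread those bytes as CP1252,
# the way a Western-locale Notepad would.
s = '歩'
utf8_bytes = s.encode('utf-8')        # b'\xe6\xad\xa9', as described above
print(utf8_bytes.decode('cp1252'))    # 'æ­©' -- the soft hyphen in the middle is invisible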
Assuming your input is utf8, you should probably use the following code to generate the query:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to').decode('utf8'), lookup.get(q).decode('utf8'), kw.get(q).decode('utf8')))
I would also suggest trying to output the contents of kw and lookup to some log file to debug this issue.
You should use encode on objects of class unicode, and decode on objects of class str in Python (see the small sketch below).
You should escape any string you insert into an SQL statement to prevent nasty SQL injections.
The code above doesn't include such escaping, so be careful.
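To make the encode/decode rule concrete, here is a tiny sketch under Python 2 semantics (which is what the code in this question uses):

u = u'お散歩'          # unicode object: code points
b = u.encode('utf8')   # unicode -> str (bytes); this is what belongs in the file
u2 = b.decode('utf8')  # str (bytes) -> unicode
assert u == u2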

Erasing all unreadable characters in tab-delimited txt

I am running a Python program to process tab-delimited txt data.
But it causes trouble because the data often contains characters such as U+001A or those listed in http://en.wikipedia.org/wiki/Newline#Unicode
(Worse, these characters are not even visible unless the file is opened in Sublime Text; they don't show up even in Notepad++.)
If the Python program is run on Linux it automatically ignores such characters, but on Windows it can't.
For example, if there is a U+001A in the txt, then the Python program will think that's the end of the file.
For another example, if there is a U+0085 in the txt, then the Python program will think that's the point where a new line starts.
So I just want a separate program that will erase EVERY unicode character that is not shown by ordinary file viewers like Notepad++ (and that program should work on Windows).
I do want to keep things like あ and ä. But I only want to delete things like U+001A and U+0085, which are not visible in Notepad++.
How can this be achieved?
There is no such thing as a "unicode character". A character is a character, and how it is encoded is a different matter. The capital letter "A" can be encoded in a lot of ways, amongst them UTF-8, EBCDIC, ASCII, etc.
If you want to delete every character that cannot be represented in ASCII, then you can use the following (Python 3):
a = 'aあäbc'
a.encode('ascii', 'ignore')
This will yield b'abc'.
And if there really are U+001A, i.e. SUBSTITUTE, characters in your document, most probably something has gone haywire in a prior encoding step.
Using unicodedata looks to be the best way to do it, as suggested by @Hyperboreus (Stripping non printable characters from a string in python), but as a quick hack you could do the following (in Python 2.x):
Open the source file in binary mode. This prevents Windows from truncating the read when it hits the EOF control character (Ctrl-Z, U+001A):
my_file = open("filename.txt", "rb")
Decode the file (this assumes the encoding was UTF-8):
my_str = my_file.read().decode("UTF-8")
Replace known "bad" code points:
my_str = my_str.replace(u"\u001A", "")
You could skip step 2 and instead replace the UTF-8 encoded value of each "bad" code point in step 3 (for example \x1A), but the method above also allows for UTF-16/32 source if required.
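A sketch of the unicodedata route mentioned at the top, keeping tab, newline and carriage return but dropping other control (Cc) and format (Cf) characters; adjust the categories to taste:

import unicodedata

def strip_invisible(text):
    # Keeps visible text such as あ and ä; removes U+001A, U+0085 and similar.
    keep = {u'\t', u'\n', u'\r'}
    return u''.join(ch for ch in text
                    if ch in keep or unicodedata.category(ch) not in ('Cc', 'Cf'))

Note that Cf also covers useful invisibles such as zero-width joiners, so trim the category list if you need those.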

How to properly write utf8 IPTC metadata with python library IPTCInfo?

After generating a JPEG thumbnail file with PIL, I would like to use IPTCInfo in order to write IPTC metadata containing French characters with accents. I was thinking about using the UTF-8 character encoding.
So I tried the following:
info = IPTCInfo(input_file, force=True, inp_charset='utf8')
info.data['credit'] = some_unicode_string
info.saveAs(output_file)
and many other variations:
info = IPTCInfo(input_file, force=True)
info = IPTCInfo(input_file, force=True, inp_charset='utf8')
info = IPTCInfo(input_file, force=True, inp_charset='utf_8')
info = IPTCInfo(input_file, force=True, inp_charset='utf8', out_charset='utf8')
info = IPTCInfo(input_file, force=True, inp_charset='utf_8', out_charset='utf_8')
...
While reading the metadata back with IPTCInfo preserves the unicode Python string, I always find weird characters when trying to read it with other pieces of software: OS X file information, ExifTool, Photoshop, ViewNX 2.
So what is the right way to write unicode with IPTCInfo and produce a standards-compliant file understandable by all software?
Something related to your question, coming from the IPTC forum:
Using the XMP packet makes things quite easy: UTF-8 is the default character set. Thus you can use and even mix different characters sets and scripts.
The IPTC IIM header is a bit more tricky: it includes a field to indicate which character set has been used for textual fields (for the IIM experts: this is dataset 1:90) but unfortunately this field has not been used by a vast majority of imaging software and only in most recent years some of them are using it.
Also in the IPTC EnvelopeRecord Tags, you will find:
90 CodedCharacterSet string[0,32]!
(values are entered in the form "ESC X Y[, ...]". The escape sequence for UTF-8 character coding is "ESC % G", but this is displayed as "UTF8" for convenience. Either string may be used when writing. The value of this tag affects the decoding of string values in the Application and NewsPhoto records. This tag is marked as "unsafe" to prevent it from being copied by default in a group operation because existing tags in the destination image may use a different encoding. When creating a new IPTC record from scratch, it is suggested that this be set to "UTF8" if special characters are a possibility)
See also -charset CHARSET
Certain meta information formats allow coded character sets other than plain ASCII. When reading, most known encodings are converted to the external character set according to the exiftool "-charset CHARSET" or -L option, or to UTF‑8 by default. When writing, the inverse conversion is performed. Alternatively, special characters may be converted to/from HTML character entities with the -E option.
Though the comment in the code of the IPTCInfo implementation of IPTC is not very encouraging, there is still a dictionary of encodings in the code which gives more clues.
In your code example, which seems to be correct :), you have:
info.data['credit'] = some_unicode_string
What do you call some_unicode_string? Are you sure it is a UTF-8 encoded str (which is not the same as a unicode object)?
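If some_unicode_string really is a unicode object, one thing worth trying (a guess based on that dictionary of encodings, not verified against every IPTCInfo version) is handing the library UTF-8 bytes explicitly:

info = IPTCInfo(input_file, force=True, inp_charset='utf8', out_charset='utf8')
# Pass UTF-8 encoded bytes rather than a unicode object; whether the
# library expects bytes or unicode here depends on its version.
info.data['credit'] = some_unicode_string.encode('utf8')
info.saveAs(output_file)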
