How to properly write UTF-8 IPTC metadata with the Python library IPTCInfo?

After generating a JPEG thumbnail file with PIL, I would like to use IPTCInfo in order to write IPTC metadata containing French characters with accents. I was thinking about using the UTF-8 character encoding.
So I tried the following:
info = IPTCInfo(input_file, force=True, inp_charset='utf8')
info.data['credit'] = some_unicode_string
info.saveAs(output_file)
and many other variations:
info = IPTCInfo(input_file, force=True)
info = IPTCInfo(input_file, force=True, inp_charset='utf8')
info = IPTCInfo(input_file, force=True, inp_charset='utf_8')
info = IPTCInfo(input_file, force=True, inp_charset='utf8', out_charset='utf8')
info = IPTCInfo(input_file, force=True, inp_charset='utf_8', out_charset='utf_8')
...
While reading the metadata back with IPTCInfo preserves the unicode Python string, I always find weird characters when trying to read it with other pieces of software: OS X file information, ExifTool, Photoshop, ViewNX 2.
So what is the right way to write unicode with IPTCInfo and produce a standards-compliant file understandable by all software?

Here is something related to your question, coming from the IPTC forum:
Using the XMP packet makes things quite easy: UTF-8 is the default character set. Thus you can use and even mix different characters sets and scripts.
The IPTC IIM header is a bit more tricky: it includes a field to indicate which character set has been used for textual fields (for the IIM experts: this is dataset 1:90) but unfortunately this field has not been used by a vast majority of imaging software and only in most recent years some of them are using it.
Also in the IPTC EnvelopeRecord Tags, you will find:
90 CodedCharacterSet string[0,32]!
(values are entered in the form "ESC X Y[, ...]". The escape sequence for UTF-8 character coding is "ESC % G", but this is displayed as "UTF8" for convenience. Either string may be used when writing. The value of this tag affects the decoding of string values in the Application and NewsPhoto records. This tag is marked as "unsafe" to prevent it from being copied by default in a group operation because existing tags in the destination image may use a different encoding. When creating a new IPTC record from scratch, it is suggested that this be set to "UTF8" if special characters are a possibility)
See also -charset CHARSET
Certain meta information formats allow coded character sets other than plain ASCII. When reading, most known encodings are converted to the external character set according to the exiftool "-charset CHARSET" or -L option, or to UTF-8 by default. When writing, the inverse conversion is performed. Alternatively, special characters may be converted to/from HTML character entities with the -E option.
The comment in IPTCInfo's implementation of IPTC is not very encouraging, but there is still a dictionary of encodings in the code which gives more clues.
In your code example, which otherwise seems correct, you are doing:
info.data['credit'] = some_unicode_string
What do you call some_unicode_string? Are you sure it's a UTF-8 encoded string (which is not the same thing as a unicode object)?
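If it is a Python 2 unicode object, it may be worth encoding it to UTF-8 bytes yourself before assigning it, since inp_charset/out_charset describe byte encodings. A minimal sketch, assuming the Python 2 iptcinfo module name and an example accented credit string:
# -*- coding: utf-8 -*-
from iptcinfo import IPTCInfo  # assumption: the Python 2 iptcinfo module

info = IPTCInfo(input_file, force=True, inp_charset='utf8', out_charset='utf8')

# Sketch: hand IPTCInfo UTF-8 encoded bytes rather than a unicode object.
some_unicode_string = u'Crédit photo : Müller'  # example value only
info.data['credit'] = some_unicode_string.encode('utf8')
info.saveAs(output_file)
Whether other software then displays it correctly also depends on the CodedCharacterSet (1:90) field discussed above, which many writers never set.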

Related

Unicode Emojis in Python from CSV files

I have some CSV data of some users' tweets.
In Excel it is displayed like this:
‰ÛÏIt felt like they were my friends and I was living the story with them‰Û #retired #IAN1
I imported this CSV file into Python, and there the same tweet appears like this (I am using PuTTY to connect to a server and I copied this from PuTTY's screen):
▒▒▒It felt like they were my friends and I was living the story with them▒۝ #retired #IAN1
I am wondering how to display these emoji characters properly. I am trying to separate all the words in this tweet, but I am not sure how I can separate those emoji Unicode characters.
In fact, you certainly have a loss of data…
I don’t know how you got your CSV file from users' tweets (you may want to explain that). But generally, CSV files are encoded in "cp1252" (or "windows-1252"), sometimes in "iso-8859-1". Nowadays, we can also find CSV files encoded in "utf-8".
If your tweets are encoded in "cp1252" or any 8-bit single-byte coded character set, the Emojis are lost (replaced by "?") or badly converted.
Then, if you open your CSV file in Excel, it will use its default encoding ("cp1252") and load the file with corrupted characters. You can try LibreOffice instead; it has a dialog box which allows you to choose the encoding more easily.
The copy/paste from PuTTY will also convert your characters depending on your console encoding… It is even worse!
If your CSV file uses "utf-8" encoding (or "utf-16", "utf-32") you have a better chance of preserving the Emojis. But there is still a problem: most Emojis have a code point greater than U+FFFF (65535 in decimal). For instance, Grinning Face "😀" has the code point U+1F600.
Characters like this are handled awkwardly in Python 2 (at least on "narrow" builds, which store them as a pair of surrogates); try this:
# coding: utf8
from __future__ import unicode_literals
emoji = u"😀"
print(u"emoji: " + emoji)
print(u"repr: " + repr(emoji))
print(u"len: {}".format(len(emoji)))
You’ll get (if your console allow it):
emoji: 😀
repr: u'\U0001f600'
len: 2
The first line won’t print if your console doesn’t allow Unicode,
The \U escape sequence is similar to the \u, but expects 8 hex digits, not 4.
Yes, this character has a length of 2!
EDIT: With Python 3, you get:
emoji: 😀
repr: '😀'
len: 1
No escape sequence for repr(),
the length is 1!
What you can do is post a fragment of your CSV file as an attachment; then someone could analyse it…
See also Unicode Literals in Python Source Code in the Python 2.7 documentation.
First of all, you shouldn't work with text copied from a console (let alone from a remote connection) because of formatting differences and how unreliable clipboards are. I'd suggest exporting your CSV and reading it directly.
I'm not quite sure what you are trying to do, but emojis often cannot be displayed in a console because most console fonts cannot render them. Would you mind explaining your issue further?
I would personally treat the whole string as Unicode, split it into individual characters, then rebuild the words based on spaces.
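For instance, a minimal sketch of that approach, assuming the file is UTF-8 encoded ('tweets.csv' is a hypothetical filename):
# -*- coding: utf-8 -*-
import io

# Read the CSV as UTF-8 so code points above U+FFFF survive intact.
with io.open('tweets.csv', encoding='utf-8') as f:
    text = f.read()

words = text.split()  # naive whitespace split; emoji stay attached to adjacent words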

Separate binary data (blobs) in csv files

Is there any safe way of mixing binary with text data in a (pseudo)csv file?
One naive and partial solution would be:
using a compound field separator, made of more than one character (e.g. the \a\b sequence)
saving each field as either text or binary data, which would require the parser of the pseudo-csv to look for the \a\b sequence and read the data between separators according to a known rule (e.g. by means of a known header with field name and field type)
The core issue is that binary data is not guaranteed to not contain the \a\b sequence somewhere inside its body, before the actual end of the data.
The proper solution would be to save the individual blob fields in their own separate physical files and only include the filenames in a .csv, but this is not acceptable in this scenario.
Is there any proper and safe solution, either already implemented or applicable given these restrictions?
If you need everything in a single file, just use one of the methods to encode binary as printable ASCII, and add the result to the CSV fields (letting the csv module add and escape quotes as needed).
One such method is base64; beyond Python's plain base64 codec, there are also more space-efficient codecs like base85 (base64.b85encode, available since Python 3.4).
So, an example in Python 2.7 would be:
import csv, base64
import random

# 50 random bytes standing in for the binary blob (Python 2.7: chr() returns a byte)
data = b''.join(chr(random.randrange(0, 256)) for i in range(50))

# open in binary mode: Python 2's csv module requires it
with open("testfile.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow(["some text", base64.b64encode(data)])
Of course, you have to do the proper base64 decoding on reading the file as well - but it is certainly better than trying to create an ad-hoc escaping method.
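For completeness, a sketch of the read side under the same assumptions (Python 2.7, the testfile.csv written above):
import csv, base64

with open("testfile.csv", "rb") as f:
    reader = csv.reader(f)
    for text, encoded in reader:
        blob = base64.b64decode(encoded)  # recover the original bytes
        print text, repr(blob)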

encoding problems writing UTF-8 SQL statements to a local file

I'm writing SQL to a file on a server this way:
import codecs
f = codecs.open('translate.sql',mode='a',encoding='utf8',errors='strict')
and then writing SQL statements like this:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to'), lookup.get(q), kw.get(q)))
f.write(query)
I have confirmed that the text was okay when I pulled it. Here is the data from the dictionary (kw) passed out to a webpage:
46:埼玉県
47:熊谷市
42:お散歩デモ
It appears correct (I want it to be utf8 escaped).
But the file.write output is garbage (encoding problems):
INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(279,#last_story_id,62,'ãã©ã³ãã£ã¢ããã'); )
/* updating the story text on old story_id */
UPDATE story_question_response
SET answer = '大学ã®ãã­ã·ã§ã¯ãã¦å­¦çãæ¬å¤§éç½ã®è¢«ç½å°(岩æçã®å¤§è¹æ¸¡å¸)ã«æ´¾é£ãããããã¦ã¯ç¾å°ã®å¤ç¥­ãã®ãæ$
WHERE story_id = 65591
AND question_id = 41
AND group_id = 276;
using an explicit decode gives an error:
f.write(query.decode('utf8'))
I don't know what else to try.
Question: What am I doing wrong, in writing a utf8 file?
We don't have enough information to be sure, but I'd give decent odds that your file is actually perfectly valid UTF-8, and you're just viewing it as if it were something else.
For example, on Windows, if you open a file in Notepad, by default, it will only treat it as UTF-8 if it starts with a UTF-8 BOM (which no valid file ever should, but Microsoft likes them anyway); otherwise, it will treat it as whatever your default code page is. Which is probably some Latin-1 derivative like CP1252.
So, your string of kana and kanji ends up encoded as a bunch of three-byte UTF-8 sequences like '\xe6\xad\xa9'. Then, that gets displayed in Notepad as whatever each of those bytes happen to mean in CP1252, like æ­© (note that there's an invisible character between the two visible ones).
As a general rule, whenever you see weirdly-accented versions of lowercase A and E every 2 or 3 characters, that almost always means you've interpreted some CJK UTF-8 as some Latin-1-derived character set, because UTF-8 uses \xE3 through \xED as the prefix bytes for most CJK characters, and Latin-1 has accented lowercase A and E characters in that range. (Similarly, weirdly-accented capital A versions usually mean European or symbolic UTF-8 interpreted as Latin-1, especially when you've got stray Âs inserted into what looks like otherwise valid or almost-valid European text. If you look at the charts, you should be able to tell why.)
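You can reproduce the effect in two lines (a small demo; 歩, U+6B69, is just an example CJK character):
# -*- coding: utf-8 -*-
s = u'歩'.encode('utf-8')  # the three bytes '\xe6\xad\xa9'
print s.decode('cp1252')   # prints æ­© (the middle byte, 0xAD, is an invisible soft hyphen)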
Assuming your input is utf8, you should probably use the following code to generate the query:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to').decode('utf8'), lookup.get(q).decode('utf8'), kw.get(q).decode('utf8')))
I would also suggest trying to output the contents of kw and lookup to some log file to debug this issue.
In Python 2, you should use encode on objects of class unicode and decode on objects of class str.
You should escape any string you insert into an SQL statement to prevent nasty SQL injections. The code above doesn't include such escaping, so be careful.
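If you ever execute such statements through a DB-API driver instead of writing them to a file, parameter binding avoids manual escaping altogether. A hedged sketch, assuming a MySQLdb-style connection object conn (hypothetical) and the kw/lookup/q names from the question; #last_story_id is kept verbatim from the question's SQL:
cur = conn.cursor()
# the driver escapes and encodes each bound value
cur.execute(
    """INSERT INTO story_question_response
    (group_id, story_id, question_id, answer)
    VALUES (%s, #last_story_id, %s, %s)""",
    (kw.get('to'), lookup.get(q), kw.get(q)),
)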

Storing VT100 escape codes in an XML file

I'm writing a Python program that logs terminal interaction (similar to the script program), and I'd like to store the log in XML format.
The problem is that the terminal interaction includes VT100 escape codes. Python doesn't complain if I write the data to a file as UTF-8 encoded, e.g.:
...
pid, fd = pty.fork()
if pid == 0:
    os.execvp("bash", ("bash", "-l"))
else:
    # Lots of TTY-related stuff here
    # see http://groups.google.com/group/comp.lang.python/msg/de40b36c6f0c53cc
    fout = codecs.open("session.xml", encoding="utf-8", mode="w")
    fout.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    fout.write("<session>\n")
    ...
    r, w, e = select.select([0, fd], [], [], 1)
    for f in r:
        if f == fd:
            fout.write("<entry><![CDATA[")
            buf = os.read(fd, 1024)
            fout.write(buf)
            fout.write("]]></entry>\n")
        else:
            ....
    fout.write("</session>")
    fout.close()
This script "works" in the sense that it writes a file to disk, but the resulting file is not proper utf-8, which causes XML parsers like etree to barf on the escape codes.
One way to deal with this is to filter out the escape codes first. But is it possible to do something like this where the escape codes are maintained and the resulting file can still be parsed by XML tools like etree?
Your problem is not that the control codes aren't proper UTF-8 (they are); it's that ASCII ESC and friends are not proper XML characters, even inside a CDATA section.
The only valid XML characters in XML 1.0 with values less than U+0020 are U+0009 (tab), U+000A (newline) and U+000D (carriage return). If you want to record things involving other codes, such as escape (U+001B), then you will have to escape them in some way. There is no other option.
As Charles said, most control codes may not be included in an XML 1.0 file at all.
However, if you can live with requiring XML 1.1, you can use them there. They can't be included as raw characters, but they can be included as character references, e.g.:
&#27;
Because you can't write character references in a CDATA section (they'd just be interpreted as ampersand-hash-...), you would have to lose the <![CDATA[ wrapper and manually escape the &, < and > characters to their entity-reference equivalents.
Note that you should do this anyway: CDATA sections do not absolve you of the responsibility for text escaping, because they will fail if the text inside includes the sequence ]]>. (Since you always have to do some escaping anyway, this makes CDATA sections pretty useless most of the time.)
XML 1.1 is more lenient about control codes, but not everything supports it, and you still can't include the NUL character (&#0;). In general it's not a good idea to include control characters in XML. You could use an ad-hoc encoding scheme to fit binary in; base64 is popular, but not very human-readable. Alternatives might include using random characters from the Private Use Area as substitutes, if it's only ever your own application that will be handling the files, or encoding the control codes as elements (e.g. <esc color="1"/>), as sketched below.
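A sketch of that last idea, replacing the XML-invalid control characters with a small element before writing (the <ctrl/> element name is made up for illustration):
import re

# C0 control characters that XML 1.0 forbids (tab, newline and CR are allowed)
_CONTROL = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def escape_control(text):
    # ESC (U+001B) becomes <ctrl code="27"/>, and so on.
    # Escape &, < and > in `text` first, since this emits literal markup.
    return _CONTROL.sub(lambda m: u'<ctrl code="%d"/>' % ord(m.group(0)), text)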
Did you try putting your data inside a CDATA section? This should prevent the parser from trying to read the content of the tag.
http://en.wikipedia.org/wiki/CDATA

how do I write a custom encoding in python to clean up my data?

I know I've done this before at another job, but I can't remember what I did.
I have a database that is full of varchar and memo fields that were cut and pasted from Office, webpages, and who knows where else. This is starting to cause encoding errors for me. Since Python has a very nice "decode" function to take a byte stream and translate it into Unicode, I thought I would just write my own encoding to fix this up. (For example, to take "smart quotes" and turn them into "standard quotes".)
But I can't remember how to get started. I think I copied one of the encodings that was close (cp1252.py) and then updated it.
Can anyone put me on the right path? Or suggest a better path?
EDIT: I've expanded this with a bit more detail.
If you are reasonably sure of the encoding of the text in the database, you can do text.decode('cp1252') to get a Unicode string. If the guess is wrong this will likely blow up with an exception, or the decoder will 'disappear' some characters.
Creating a decoder along the lines you describe (modifying cp1252.py) is easy. You just need to define the translation table from bytes to Unicode characters.
However, if not all of the text in the database has the same encoding, your decoder will need some rules to decide which is the correct mapping. In this case you may want to punt and use the chardet module, which can scan the text and guess the encoding.
Maybe the best approach would be to try to decode using the most likely encoding (cp1252) and, if that fails, fall back to chardet to guess the correct encoding, as sketched below.
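A sketch of that fallback, assuming the chardet package is installed:
import chardet

def to_unicode(raw_bytes):
    # Try the most likely encoding first; fall back to chardet's guess.
    try:
        return raw_bytes.decode('cp1252')
    except UnicodeDecodeError:
        guess = chardet.detect(raw_bytes)['encoding'] or 'utf-8'
        return raw_bytes.decode(guess, 'replace')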
If you use text.decode() and/or chardet, you'll end up with a Unicode string. Below is a simple routine which can translate characters in a Unicode string, e.g. "convert curly quotes to ASCII":
CHARMAP = [
    (u'\u201c\u201d', '"'),   # curly double quotes -> straight "
    (u'\u2018\u2019', "'"),   # curly single quotes -> straight '
]

# replace with text.decode('cp1252') or a chardet-based decode in real use
text = u'\u201cit\u2019s probably going to work\u201d, he said'

# build a character -> replacement dict from the pairs above
_map = dict((c, r) for chars, r in CHARMAP for c in list(chars))
fixed = ''.join(_map.get(c, c) for c in text)
print fixed
Output:
"it's probably going to work", he said
