I've received a unicode string from the wild that causes some of our psycopg2 statements to fail.
I have reduced the problem to an SSCE:
import psycopg2
conn = psycopg2.connect(...)
cur = conn.cursor()
x = u'\ud837'
cur.execute("SELECT %s", (x,))
print cur.fetchone()
Running this gives the following exception:
Traceback (most recent call last):
File ".../run.py", line 65, in <module>
cur.execute("SELECT %s AS test", (x,))
psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xb7
Based on some of the comments, it has become clear that this particular character is one half of a surrogate pair, which is invalid on its own.
Specifically then, I am looking for a mechanism to detect when a string contains an incomplete surrogate pair in Python 2.
One method I have found that raises an exception is x.encode('utf16').decode('utf16'); however, since I don't fully understand the risks involved, I'm wary of relying on it.
Edit: Reduced SSCE string to single character causing the problem, added information based on comments.
The string u'\ud837' consists of a lone member of a surrogate pair: two code units that must appear in sequence to form one logical character. As such, it does not represent a valid Unicode character on its own; instead, it is an implementation detail of the UTF-16 encoding, which uses surrogate pairs to pack the full code point range into 16-bit code units. Python 3 correctly rejects attempts to encode lone surrogates in any byte encoding, including the UTF-* variants.
The string probably originated from a system that internally uses UTF-16 (such as Java, C#, Windows, or Python 2 built with 16-bit Py_UNICODE) that naively shortened the string without taking care of surrogates.
Taking the regex from this answer, it should be possible to efficiently detect such strings using code such as:
import re
lone = re.compile(
ur'''(?x) # verbose expression (allows comments)
( # begin group
[\ud800-\udbff] # match leading surrogate
(?![\udc00-\udfff]) # but only if not followed by trailing surrogate
) # end group
| # OR
( # begin group
(?<![\ud800-\udbff]) # if not preceded by leading surrogate
[\udc00-\udfff] # match trailing surrogate
) # end group
''')
def invalid_unicode(s):
    assert isinstance(s, unicode)
    return lone.search(s) is not None
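Applied to the SSCE string from the question, and to a complete surrogate pair for contrast, the check should behave like this:
>>> invalid_unicode(u'\ud837')
True
>>> invalid_unicode(u'\ud83d\ude4f')  # a complete pair
False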
To detect that the string is invalid utf-8, just wrap an attempt to encode it inside a try/except before executing it in psycopg2.
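Note that on Python 2 encoding a lone surrogate to UTF-8 does not raise; it silently produces exactly the 0xed 0xa0 0xb7 bytes from the traceback above. So the try/except needs the UTF-16 round-trip mentioned in the question instead. A minimal sketch (the helper name is mine):
def encodable(s):
    # The UTF-16 round-trip raises a UnicodeError when the
    # string contains an unpaired surrogate.
    try:
        s.encode('utf-16').decode('utf-16')
        return True
    except UnicodeError:
        return False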
As for what caused the problem: the original string contained a specific character, \U000d8a85, encoded as a UTF-16 surrogate pair. So it's not merely that Postgres refuses to accept it as UTF-8; it really isn't valid UTF-8.
Related
I am working on a migration project to upgrade a layer of web server from python 2.7.8 to python 3.6.3 and I have hit a roadblock for some special cases.
When a request is received from a client, the payload is transmitted locally using pyzmq, which now deals in bytes in Python 3 instead of str (as it did in Python 2).
Now, the payload I am receiving is encoded using the iso-8859-1 (latin-1) scheme, and I can easily convert it into a string with payload.decode('latin-1') and pass it to the next service (svc-save-entity), which expects a string argument.
However, the subsequent service 'svc-save-entity' expects latin-1 chars (if present) to be represented as ASCII character references (such as &#233; for é) rather than in hex (such as \xe9 for é).
I am struggling to find an efficient way to achieve this conversion. Can any python expert guide me here? Essentially I need the definition of a function say decode_tostring():
payload = b'Banco Santander (M\xe9xico)' #payload is in bytes
payload_str = decode_tostring(payload) #function to convert into string
payload_str == 'Banco Santander (M&#233;xico)' #payload_str is a string with ASCII character references
Definition of decode_tostring() please. :)
The encode() and decode() methods accept a parameter called errors which allows you to specify how characters which are not representable in the specified encoding should be handled. The one you're looking for is XML numeric character reference replacement, fortunately available as the standard xmlcharrefreplace handler provided by the codecs module.
Now, it's a little complex to actually do the replacement the way you want it, because the operation of replacing non-ASCII characters with their corresponding XML numeric character references happens during encoding, not decoding. After all, encoding is the process that takes in characters and emits bytes, so it's only during encoding that you can tell whether you have a character that is not part of ASCII. The cleanest way I can think of at the moment to get the transformation you want is to decode, re-encode, and re-decode, applying the XML entity reference replacement during the encoding step.
def decode_tostring(payload):
    return payload.decode('latin-1').encode('ascii', errors='xmlcharrefreplace').decode('ascii')
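Applied to the payload from the question:
>>> decode_tostring(b'Banco Santander (M\xe9xico)')
'Banco Santander (M&#233;xico)'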
I wouldn't be surprised if there is a method somewhere out there that will replace all non-ASCII characters in a string with their XML numeric character refs and give you back a string, and if so, you could use it to replace the encoding and the second decoding. But I don't know of one. The closest I found at the moment was xml.sax.saxutils.escape(), but that only acts on certain specific characters.
This isn't really relevant to your main question, but I did want to clarify one thing: the numeric entities like &#233; are a feature of SGML, HTML, and XML, which are markup languages, a way to represent structured data as text. They have nothing to do with ASCII. A character encoding like ASCII is nothing more than a table of some characters and some byte sequences such that each character in the table is mapped to one byte sequence in the table and vice versa, with a few constraints to make the mapping unambiguous.
If you have a string with characters that are not in a particular encoding's table, you can't encode the string using that encoding. But what you can do is convert the string into a new string by replacing the characters which aren't in the table with sequences of characters that are in the table, and then encode the new string. There are many ways to do the replacement, of which XML numeric entity references are one example. Some of the other error handlers in Python's codecs module represent other approaches to this replacement.
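To make that concrete, here is how a few of the standard handlers rewrite the same out-of-table character during an ASCII encode (Python 3):
>>> 'México'.encode('ascii', errors='xmlcharrefreplace')
b'M&#233;xico'
>>> 'México'.encode('ascii', errors='backslashreplace')
b'M\\xe9xico'
>>> 'México'.encode('ascii', errors='replace')
b'M?xico'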
I am querying a MySQL database with sqlalchemy and getting the following error:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 498-499: unexpected end of data
A column in the table was defined as Unicode(500) so this error suggests to me that there is an entry that was truncated because it was longer than 500 characters. Is there a way to handle this error and still load the entry? Is there a way to find the errant entry and delete it other than trying to load every entry one by one (or in batches) until I get the error?
In short, you should change:
Unicode(500)
to:
Unicode(500, unicode_errors='ignore', convert_unicode='force')
(Python 2 code follows, but the principles hold in Python 3; only some of the output will differ.)
What's going on is that when you decode a bytestring, Python raises the error you saw if the bytes are not valid for the encoding in question.
>>> u = u'ABCDEFGH\N{TRADE MARK SIGN}'
>>> u
u'ABCDEFGH\u2122'
>>> print(u)
ABCDEFGH™
>>> s = u.encode('utf-8')
>>> s
'ABCDEFGH\xe2\x84\xa2'
>>> truncated = s[:-1]
>>> truncated
'ABCDEFGH\xe2\x84'
>>> truncated.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/cliffdyer/.virtualenvs/edx-platform/lib/python2.7/encodings/utf_8.py",
line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-9: unexpected
end of data
Python provides different optional modes of handling decode errors, though. Raising an exception is the default, but you can also drop the malformed bytes (errors='ignore') or convert them to the official Unicode replacement character, U+FFFD (errors='replace').
>>> truncated.decode('utf-8', errors='replace')
u'ABCDEFGH\ufffd'
>>> truncated.decode('utf-8', errors='ignore')
u'ABCDEFGH'
This is exactly what's happening within the column handling.
Looking at the Unicode and String classes in sqlalchemy/sql/sqltypes.py, it looks like there is a unicode_errors argument that you can pass to the constructor which passes its value through to the encoder's errors argument. There is also a note that you will need to set convert_unicode='force' to make it work.
Thus Unicode(500, unicode_errors='ignore', convert_unicode='force') should solve your problem, if you're okay with truncating the ends of your data.
If you have some control over the database, you should be able to prevent this issue in the future by defining your database to use the utf8mb4 character set. (Don't just use utf8, or it will fail on four-byte UTF-8 characters, including most emoji.) Then you will be guaranteed to have valid UTF-8 stored in and returned from your database.
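The four-byte point is easy to verify from Python; for instance, the 🙏 emoji (U+1F64F) encodes to four bytes of UTF-8, which a three-byte-max utf8 column cannot store:
>>> u'\U0001f64f'.encode('utf-8')
'\xf0\x9f\x99\x8f'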
In short, your MySQL setup is incorrect in that it truncates UTF-8 characters mid-sequence. I would double-check that MySQL is actually configured for UTF-8, both in the session character set and in the tables themselves.
I would suggest switching to PostgreSQL (seriously) to avoid this kind of problem: not only does PostgreSQL understand UTF-8 properly in its default configuration, but it also never truncates a string to fit a column, choosing to raise an error instead:
psql (9.5.3, server 9.5.3)
Type "help" for help.
testdb=> create table foo(bar varchar(4));
CREATE TABLE
testdb=> insert into foo values ('aaaaa');
ERROR: value too long for type character varying(4)
This is also in keeping with the Zen of Python:
Explicit is better than implicit.
and
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
Redefine the column you are storing into as a BLOB. After loading the data, run various diagnostics, such as
SELECT MAX(LENGTH(col)) FROM ... -- to see what the longest is in _bytes_.
Copy the data into another BLOB column and do
ALTER TABLE t MODIFY col2 TEXT CHARACTER SET utf8 ... -- to see if it converts correctly
If that succeeds, then do
SELECT MAX(CHAR_LENGTH(col2)) ... -- to see if the longest is more than 500 _characters_.
After you have tried a few things like that, we can see what direction to take next.
I have a list of variables with Unicode characters, some of them for chemicals like ozone gas: 'O\u2083'. All of them are stored in a sqlite database, which is read by Python code to produce O₃. However, when I read it back I get 'O\\u2083'. The sqlite database is created from a csv file that contains the string 'O\u2083' among others. I understand that \u2083 is not being stored in the sqlite database as a Unicode character but as 6 separate characters (which would be \,u,2,0,8,3). Is there any way to recognize Unicode escapes in this context? My first option would be to write a function that recognizes such character sequences and replaces them with the Unicode characters they denote. Is there anything like this already implemented?
SQLite allows you to read/write Unicode text directly. u'O\u2083' is two characters u'O' and u'\u2083' (your question has a typo: 'u\2083' != '\u2083').
I understand that u\2083 is not being stored in sqlite database as unicode character but as 6 unicode characters (which would be u,\,2,0,8,3)
Don't confuse u'u\2083' and u'\u2083': the latter is a single character while the former is 4-character sequence: u'u', u'\x10' ('\20' is interpreted as octal in Python), u'8', u'3'.
If you save a single Unicode character u'\u2083' into a SQLite database, it is stored as a single Unicode character (the internal representation of Unicode inside the database is irrelevant as long as the abstraction holds).
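A round-trip through the sqlite3 module shows the abstraction holding (a minimal sketch; the in-memory database and table name are mine):
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE chem (formula TEXT)')
conn.execute('INSERT INTO chem VALUES (?)', (u'O\u2083',))
formula = conn.execute('SELECT formula FROM chem').fetchone()[0]
print(len(formula))  # 2 -- still two characters, not seven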
On Python 2, if there is no from __future__ import unicode_literals at the top of the module then 'abc' string literal creates a bytestring instead of a Unicode string -- in that case both 'u\2083' and '\u2083' are sequences of bytes, not text characters (\uxxxx is not recognized as a unicode escape sequence inside bytestrings).
If you have a byte string (length 7), decode the Unicode escape.
>>> s = 'O\u2083'
>>> len(s)
7
>>> s
'O\\u2083'
>>> print(s)
O\u2083
>>> u = s.decode('unicode-escape')
>>> len(u)
2
>>> u
u'O\u2083'
>>> print(u)
O₃
Caveat: Your console/IDE used to print the character needs to use an encoding that supports the character or you'll get a UnicodeEncodeError when printing. The font must support the symbol as well.
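On Python 3, where str has no .decode() method, the equivalent repair round-trips through bytes first; this is safe here because the escaped form is pure ASCII:
s = 'O\\u2083'                        # 7 characters: O \ u 2 0 8 3
u = s.encode('ascii').decode('unicode-escape')
print(len(u), u)                      # 2 O₃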
It's important to remember that everything is bytes. To turn bytes into something useful to you, you pretty much have to know what encoding was used when the data was produced. There are too many ambiguous cases to determine the encoding by analyzing the data. When you send data out of your program, it all goes back to bytes again. Depending on whether you're using Python 2.x or 3.x, you'll have a very different experience with Unicode and Python.
You can, however, attempt encoding and simply use "replace" on errors. For example, the_string.encode("utf-8","replace") will try to encode as utf-8 and will replace problem characters with a ?. You could also anticipate problem characters and replace them beforehand, but that gets unmanageable quickly. Take a look at the error handlers in the codecs module for more replacement options.
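For example, with a target encoding that genuinely lacks the character (ASCII here, using the O₃ string from the question):
>>> u'O\u2083'.encode('ascii', 'replace')
'O?'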
This is a follow-up to Converting to Emoji. In that question, the OP had a json.dumps()-encoded file with an emoji represented as a surrogate pair - \ud83d\ude4f. S/he was having problems reading the file and translating the emoji correctly, and the correct answer was to json.loads() each line from the file, and the json module would handle the conversion from surrogate pair back to (I'm assuming UTF8-encoded) emoji.
So here is my situation: say I have just a regular Python 3 unicode string with a surrogate pair in it:
emoji = "This is \ud83d\ude4f, an emoji."
How do I process this string to get a representation of the emoji out of it? I'm looking to get something like this:
"This is 🙏, an emoji."
# or
"This is \U0001f64f, an emoji."
I've tried:
print(emoji)
print(emoji.encode("utf-8")) # also tried "ascii", "utf-16", and "utf-16-le"
json.loads(emoji) # and `.encode()` with various codecs
Generally I get an error similar to UnicodeEncodeError: XXX codec can't encode character '\ud83d' in position 8: surrogates not allowed.
I'm running Python 3.5.1 on Linux, with $LANG set to en_US.UTF-8. I've run these samples both in the Python interpreter on the command line, and within IPython running in Sublime Text - there don't appear to be any differences.
You've conflated a literal string \ud83d in a json file on disk (six characters: \ u d 8 3 d) with a single character u'\ud83d' (specified using a string literal in Python source code) in memory. It is the difference between len(r'\ud83d') == 6 and len('\ud83d') == 1 on Python 3.
If you see a '\ud83d\ude4f' Python string (2 characters) then there is a bug upstream. Normally, you shouldn't get such a string. If you get one and you can't fix the upstream source that generates it, you can repair it using the surrogatepass error handler:
>>> "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')
'🙏'
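The same round-trip repairs the full string from the question:
>>> emoji = "This is \ud83d\ude4f, an emoji."
>>> emoji.encode('utf-16', 'surrogatepass').decode('utf-16')
'This is 🙏, an emoji.'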
Python 2 was more permissive.
Note: even if your json file contains the literal \ud83d\ude4f (12 characters), you shouldn't get the surrogate pair:
>>> print(ascii(json.loads(r'"\ud83d\ude4f"')))
'\U0001f64f'
Notice: the result is 1 character ('\U0001f64f'), not the surrogate pair ('\ud83d\ude4f').
Because this is a recurring question and the error message is slightly obscure, here is a more detailed explanation.
Surrogates are a way to express Unicode code points bigger than U+FFFF.
Recall that Unicode was originally specified to contain 65,536 characters, but that it was soon found that this was not enough to accommodate all the glyphs of the world.
As an extension mechanism for the (otherwise fixed-width) UTF-16 encoding, a reserved area was set up to contain a mechanism for expressing code points outside the Basic Multilingual Plane: Any code point in this special area would have to be followed by another character code from the same area, and together, they would express a code point with a number larger than the old limit.
(Strictly speaking, the surrogates area is divided into two halves; the first surrogate in a pair needs to come from the High Surrogates half, and the second, from the Low Surrogates. Confusingly, the High Surrogates U+D800-U+DBFF have lower code point numbers than the Low Surrogates U+DC00-U+DFFF.)
This is a legacy mechanism to support the UTF-16 encoding specifically, and should not be used in other encodings; they do not need it, and the applicable standards specifically say that this is disallowed.
In other words, while U+12345 can be expressed with the surrogate pair U+D808 U+DF45, you should simply express it directly instead unless you are specifically using UTF-16.
In some more detail, here is how this would be expressed in UTF-8 as a single character:
0xF0 0x92 0x8D 0x85
And here is the corresponding surrogate sequence:
0xED 0xA0 0x88
0xED 0xBD 0x85
As already suggested in the accepted answer, you can round-trip with something like
>>> "\ud808\udf45".encode('utf-16', 'surrogatepass').decode('utf-16').encode('utf-8')
b'\xf0\x92\x8d\x85'
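For the curious, the arithmetic behind the pair is mechanical; this follows the standard UTF-16 algorithm for U+12345:
cp = 0x12345
hi = 0xD800 + ((cp - 0x10000) >> 10)     # lead (high) surrogate
lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)   # trail (low) surrogate
print(hex(hi), hex(lo))                  # 0xd808 0xdf45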
Perhaps see also http://www.russellcottrell.com/greek/utilities/surrogatepaircalculator.htm
I'm writing SQL to a file on a server this way:
import codecs
f = codecs.open('translate.sql',mode='a',encoding='utf8',errors='strict')
and then writing SQL statements like this:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to'), lookup.get(q), kw.get(q)))
f.write(query)
I have confirmed that the text was okay when I pulled it. Here is the data from the dictionary (kw) passed out to a webpage:
46:埼玉県
47:熊谷市
42:お散歩デモ
It appears correct (I want it to be utf8 escaped).
But the file.write output is garbage (encoding problems):
INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(279,#last_story_id,62,'ãã©ã³ãã£ã¢ããã'); )
/* updating the story text on old story_id */
UPDATE story_question_response
SET answer = '大å¦ã®ããã·ã§ã¯ãã¦å¦çãæ¬å¤§éç½ã®è¢«ç½å°(岩æçã®å¤§è¹æ¸¡å¸)ã«æ´¾é£ãããããã¦ã¯ç¾å°ã®å¤ç¥ãã®ãæ$
WHERE story_id = 65591
AND question_id = 41
AND group_id = 276;
using an explicit decode gives an error:
f.write(query.decode('utf8'))
I don't know what else to try.
Question: What am I doing wrong, in writing a utf8 file?
We don't have enough information to be sure, but I'd give decent odds that your file is actually perfectly valid UTF-8, and you're just viewing it as if it were something else.
For example, on Windows, if you open a file in Notepad, by default it will only treat it as UTF-8 if it starts with a UTF-8 BOM (which UTF-8 files generally shouldn't have, but Microsoft likes them anyway); otherwise, it will treat it as whatever your default code page is, which is probably some Latin-1 derivative like CP1252.
So, your string of kana and kanji ends up encoded as a bunch of three-byte UTF-8 sequences like '\xe6\xad\xa9'. Then, that gets displayed in Notepad as whatever each of those bytes happen to mean in CP1252, like æ© (note that there's an invisible character between the two visible ones).
As a general rule, whenever you see weirdly-accented versions of lowercase A and E every 2 or 3 characters, that almost always means you've interpreted some CJK UTF-8 as some Latin-1-derived character set, because UTF-8 uses \xE3 through \xED as the prefix bytes for most CJK characters, and Latin-1 has accented lowercase A and E characters in that range. (Similarly, weirdly-accented capital A versions usually mean European or symbolic UTF-8 interpreted as Latin-1, especially when you've got stray Âs inserted into what looks like otherwise valid or almost-valid European text. If you look at the charts, you should be able to tell why.)
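This is easy to reproduce; taking 歩 (U+6B69, which occurs in the data above) as an example:
>>> s = u'\u6b69'.encode('utf-8')   # 歩 as three UTF-8 bytes
>>> s
'\xe6\xad\xa9'
>>> print s.decode('cp1252')        # æ, an invisible soft hyphen, ©
æ­©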
Assuming your input is utf8, you should probably use the following code to generate the query:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to').decode('utf8'), lookup.get(q).decode('utf8'), kw.get(q).decode('utf8')))
I would also suggest trying to output the contents of kw and lookup to some log file to debug this issue.
In Python 2, you should use encode on objects of class unicode, and decode on objects of class str.
You should also escape any string you insert into a SQL statement, to prevent nasty SQL injections. The code above doesn't include such escaping, so be careful.
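If these statements will eventually be executed through a DB-API driver, the cleanest fix is to let the driver bind the parameters instead of interpolating them into the SQL text. A sketch assuming MySQLdb; the connection details are placeholders, and last_story_id stands in for the #last_story_id variable handling:
import MySQLdb

conn = MySQLdb.connect(host='...', user='...', passwd='...', db='...',
                       charset='utf8', use_unicode=True)
cur = conn.cursor()
cur.execute(
    "INSERT INTO story_question_response"
    " (group_id, story_id, question_id, answer)"
    " VALUES (%s, %s, %s, %s)",
    (kw.get('to'), last_story_id, lookup.get(q), kw.get(q)),
)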