Reliable way of handling non-ASCII characters in Python?

I have a column in a spreadsheet whose header contains non-ASCII characters, thus:
'ï»¿Campaign'
If I pop this string into the interpreter, I get:
'\xc3\xaf\xc2\xbb\xc2\xbfCampaign'
The string is one of the keys in the rows of a csv.DictReader().
When I try to populate a new dict with the value of this key:
spends['ï»¿Campaign'] = 2
I get:
KeyError: '\xc3\xaf\xc2\xbb\xc2\xbfCampaign'
If I print the value of the keys of row, I can see that it is '\xef\xbb\xbfCampaign'
Obviously then I can just update my program to access this key thus:
spends['\xef\xbb\xbfCampaign']
But is there a "better" way of doing this in Python? Indeed, if the value of this key ever changes to contain other non-ASCII characters, what is an all-encompassing way of handling any non-ASCII characters that may arise?

Your specific problem is the first three bytes of the file, "\xef\xbb\xbf". That's the UTF-8 encoding of the byte order mark (BOM), which is often prepended to text files to indicate they're encoded using UTF-8. You should strip these bytes. See Removing BOM from gzip'ed CSV in Python.
Second, you're decoding with the wrong codec. "ï»¿" is what you get if you decode those bytes using the Windows-1252 character set. That's why the bytes look different if you use these characters in a source file. See the Python 2 Unicode howto.
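The two points above can be sketched as follows. This is a minimal Python 3 sketch (the question itself is on Python 2), and the file contents and column names are made up for illustration. It first shows that the three BOM bytes read as Windows-1252 give the characters seen in the spreadsheet header, then reads the CSV with the 'utf-8-sig' codec, which drops a leading BOM if one is present.

```python
import csv
import io

# The three BOM bytes, misread as Windows-1252, give the characters from the question.
bom = b"\xef\xbb\xbf"
mojibake = bom.decode("cp1252")

# Hypothetical file contents: a UTF-8 CSV that starts with a BOM.
raw = bom + "Campaign,Spend\nSpring,2\n".encode("utf-8")

# 'utf-8-sig' decodes UTF-8 and removes a leading BOM if there is one.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline="")
rows = list(csv.DictReader(text))
header = list(rows[0].keys())
```

With a real file, open(path, encoding='utf-8-sig', newline='') would take the place of the in-memory wrapper, and the key is then plain 'Campaign'.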

In general, you should decode a bytestring into Unicode text using the corresponding character encoding as soon as possible on input. And, in reverse, encode Unicode text into a bytestring as late as possible on output. Some APIs such as io.open() can do it implicitly so that your code sees only Unicode.
Unfortunately, the csv module does not support Unicode directly on Python 2. See UnicodeReader and UnicodeWriter in the doc examples. You could create an analog of them for csv.DictReader or, as an alternative, just pass UTF-8 encoded bytestrings to the csv module.
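A minimal sketch of the "decode early, encode late" rule using io.open(), which behaves the same on Python 2.6+ and Python 3. The file name and text are hypothetical; the point is only that the program sees Unicode text while the file holds bytes.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")  # hypothetical file

# Output boundary: text is encoded to UTF-8 bytes only when it is written.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"Campaign: M\u00e9xico")

# Input boundary: bytes are decoded to Unicode as soon as they are read.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# On disk the file holds bytes; U+00E9 takes two bytes in UTF-8.
with io.open(path, "rb") as f:
    raw = f.read()
```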

Related

Decoding a byte with latin-1 characters to string with decimal representation

I am working on a migration project to upgrade a layer of web server from python 2.7.8 to python 3.6.3 and I have hit a roadblock for some special cases.
When a request is received from a client, payload is transmitted locally using pyzmq which now interacts in bytes in python3 instead of str (as it is in python2).
Now, the payload which I am receiving is encoded using iso-8859-1 (latin-1) scheme and I can easily convert it into string as payload.decode('latin-1') and pass it to next service (svc-save-entity) which expects string argument.
However, the subsequent service 'svc-save-entity' expects latin-1 chars (if present) to be represented in ASCII Character Reference (such as &#233; for é) rather than in Hex (such as \xe9 for é).
I am struggling to find an efficient way to achieve this conversion. Can any python expert guide me here? Essentially I need the definition of a function say decode_tostring():
payload = b'Banco Santander (M\xe9xico)'  # payload is in bytes
payload_str = decode_tostring(payload)  # function to convert into string
payload_str == 'Banco Santander (M&#233;xico)'  # payload_str is a string in ASCII Character Reference
Definition of decode_tostring() please. :)

The encode() and decode() methods accept a parameter called errors which allows you to specify how characters which are not representable in the specified encoding should be handled. The one you're looking for is XML numeric character reference replacement, which is fortunately one of the standard handlers provided in the codecs module.
Now, it's a little complex to actually do the replacement the way you want it, because the operation of replacing non-ASCII characters with their corresponding XML numeric character references happens during encoding, not decoding. After all, encoding is the process that takes in characters and emits bytes, so it's only during encoding that you can tell whether you have a character that is not part of ASCII. The cleanest way I can think of at the moment to get the transformation you want is to decode, re-encode, and re-decode, applying the XML entity reference replacement during the encoding step.
def decode_tostring(payload):
    return payload.decode('latin-1').encode('ascii', errors='xmlcharrefreplace').decode('ascii')
I wouldn't be surprised if there is a method somewhere out there that will replace all non-ASCII characters in a string with their XML numeric character refs and give you back a string, and if so, you could use it to replace the encoding and the second decoding. But I don't know of one. The closest I found at the moment was xml.sax.saxutils.escape(), but that only acts on certain specific characters.
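As a quick check, here is the function applied to the payload from the question (Python 3). The html.unescape() call is an addition for illustration only: it shows that the numeric references can be turned back into the original characters by a consumer that needs them.

```python
import html


def decode_tostring(payload):
    # bytes -> text via latin-1, text -> ASCII bytes with &#NNN; references,
    # then ASCII bytes -> text again
    return payload.decode("latin-1").encode("ascii", errors="xmlcharrefreplace").decode("ascii")


payload = b"Banco Santander (M\xe9xico)"
payload_str = decode_tostring(payload)

# Reverse direction: numeric character references back to characters.
restored = html.unescape(payload_str)
```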
This isn't really relevant to your main question, but I did want to clarify one thing: the numeric entities like &#233; are a feature of SGML, HTML, and XML, which are markup languages - a way to represent structured data as text. They have nothing to do with ASCII. A character encoding like ASCII is nothing more than a table of some characters and some byte sequences such that each character in the table is mapped to one byte sequence in the table and vice versa, with a few constraints to make the mapping unambiguous.
If you have a string with characters that are not in a particular encoding's table, you can't encode the string using that encoding. But what you can do is convert the string into a new string by replacing the characters which aren't in the table with sequences of characters that are in the table, and then encode the new string. There are many ways to do the replacement, of which XML numeric entity references are one example. Some of the other error handlers in Python's codecs module represent other approaches to this replacement.
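To compare the replacement strategies mentioned above, this Python 3 sketch encodes the same word to ASCII with each of the standard error handlers. The sample word is taken from the question; the variable names are arbitrary.

```python
text = "M\u00e9xico"  # the word from the question, with U+00E9

# Encode to ASCII once per error handler and collect the resulting bytes.
results = {
    handler: text.encode("ascii", errors=handler)
    for handler in ("ignore", "replace", "backslashreplace",
                    "xmlcharrefreplace", "namereplace")
}

# The default handler, 'strict', raises instead of substituting anything.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Only some of these substitutions can be reversed by the receiver ('ignore' and 'replace' lose information), which is one reason to pick the handler according to what the consuming service expects.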

How to convert "\uxxxx\uxxxx" to u'\Uxxxxxxxx'?

I have a text file which is filled with Unicode characters written as "\ud83d\udca5", but Python doesn't seem to like them.
But if I replace it with u'\U0001f4a5', which seems to be its Python escape style (Charbase), it works.
Is there a way to convert them all into the u"\Uxxxxxxxx" escape format that Python can understand?
Thanks.

You're mixing up Unicode and encoded strings. u'\U0001f4a5' is a Unicode object, Python's internal datatype for handling strings. (In Python 3, the u is optional since now all strings are Unicode objects).
Files, on the other hand, use encodings. UTF-8 is the most common one, but it's just one means of storing a Unicode object in a byte-oriented file or stream. When opening such a file, you need to specify the encoding so Python can translate the bytes into meaningful Unicode objects.
In your case, it seems you need to open the file using the UTF-16 codec instead of UTF-8.
with open("myfile.txt", encoding="utf-16") as f:
    s = f.read()
will give you the proper contents if the codec is in fact UTF-16. If it doesn't look right, try "utf-16-le" or "utf-16-be".
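If the file turns out not to be UTF-16 but instead contains the escapes as literal text (a backslash, a "u", and four hex digits, twice), a different approach is needed. The Python 3 sketch below rests on that assumption, which the question does not confirm, and shows two ways to join such a surrogate pair into the single character U+1F4A5.

```python
import json

# Assumption: the file holds these 12 ASCII characters, not an encoded emoji.
raw = "\\ud83d\\udca5"

# Option 1: JSON strings use the same \uXXXX escapes, and json.loads joins
# surrogate pairs. This only works if the text is valid inside a JSON string
# (no unescaped double quotes or control characters).
via_json = json.loads('"' + raw + '"')

# Option 2: 'unicode_escape' gives two separate surrogates; a UTF-16 round
# trip with 'surrogatepass' merges them into one code point.
via_codec = (
    raw.encode("ascii")
    .decode("unicode_escape")
    .encode("utf-16", "surrogatepass")
    .decode("utf-16")
)
```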

Reading unicode characters from file/sqlite database and using it in Python

I have a list of variables with Unicode characters, some of them for chemicals such as ozone gas: 'O\u2083'. All of them are stored in a SQLite database, which is read by Python code to produce O3. However, when I read it I get 'O\\u2083'. The SQLite database is created from a CSV file that contains the string 'O\u2083' among others. I understand that \u2083 is not being stored in the SQLite database as a Unicode character but as 6 Unicode characters (which would be \,u,2,0,8,3). Is there any way to recognize Unicode characters in this context? My first option is to write a function that recognizes such sets of characters and replaces them with the Unicode characters. Is there anything like this already implemented?

SQLite allows you to read/write Unicode text directly. u'O\u2083' is two characters u'O' and u'\u2083' (your question has a typo: 'u\2083' != '\u2083').
I understand that u\2083 is not being stored in sqlite database as unicode character but as 6 unicode characters (which would be u,\,2,0,8,3)
Don't confuse u'u\2083' and u'\u2083': the latter is a single character while the former is 4-character sequence: u'u', u'\x10' ('\20' is interpreted as octal in Python), u'8', u'3'.
If you save a single Unicode character u'\u2083' into a SQLite database, it is stored as a single Unicode character (the internal representation of Unicode inside the database is irrelevant as long as the abstraction holds).
On Python 2, if there is no from __future__ import unicode_literals at the top of the module then 'abc' string literal creates a bytestring instead of a Unicode string -- in that case both 'u\2083' and '\u2083' are sequences of bytes, not text characters (\uxxxx is not recognized as a unicode escape sequence inside bytestrings).
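The distinctions above can be checked directly. This is a Python 3 sketch, where every string literal is already Unicode; the table and column names are made up for illustration.

```python
import sqlite3
import unicodedata

typo = "u\2083"      # 'u', the octal escape \20 (U+0010), then '8' and '3'
intended = "\u2083"  # a single character
char_name = unicodedata.name(intended)

# A real Unicode character survives a round trip through SQLite unchanged.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gases (name TEXT)")   # hypothetical table
conn.execute("INSERT INTO gases VALUES (?)", ("O\u2083",))
stored = conn.execute("SELECT name FROM gases").fetchone()[0]
conn.close()
```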

If you have a byte string (length 7), decode the Unicode escape.
>>> s = 'O\u2083'
>>> len(s)
7
>>> s
'O\\u2083'
>>> print(s)
O\u2083
>>> u = s.decode('unicode-escape')
>>> len(u)
2
>>> u
u'O\u2083'
>>> print(u)
O₃
Caveat: Your console/IDE used to print the character needs to use an encoding that supports the character or you'll get a UnicodeEncodeError when printing. The font must support the symbol as well.
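The session above is Python 2. On Python 3, str has no decode() method, so a rough equivalent goes through bytes first. This sketch assumes the stored text is pure ASCII apart from the escapes; non-ASCII input would make the encode step fail.

```python
s = "O\\u2083"  # 7 characters: O, backslash, u, 2, 0, 8, 3

# Text -> ASCII bytes -> text with the \uXXXX escape interpreted.
u = s.encode("ascii").decode("unicode_escape")
```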

It's important to remember everything is bytes. To pull bytes into something useful to you, you kind of have to know what encoding is used when you pull in data. There are too many ambiguous cases to determine encoding by analyzing the data. When you send data out of your program, it's all back out to bytes again. Depending on whether you're using Python 2.x or 3.x you'll have a very different experience with Unicode and Python.
You can, however, attempt the encoding and simply do a "replace" on errors. For example, the_string.encode("utf-8", "replace") will try to encode as UTF-8 and will replace problem characters with a ?. You could also anticipate problem characters and replace them beforehand, but that gets unmanageable quickly. Take a look at the codecs module for more replacement options.
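A small Python 3 sketch of the "replace" handler in both directions. The input bytes are a made-up example, and the encoding step targets ASCII rather than UTF-8, because UTF-8 can represent almost every character and so rarely has anything to replace.

```python
data = b"M\xe9xico"  # hypothetical input: latin-1 encoded bytes

# Guessing the wrong codec with errors='replace' does not raise, but the
# undecodable byte becomes U+FFFD, the replacement character.
wrong = data.decode("utf-8", errors="replace")

# Knowing the real encoding gives the intended text.
right = data.decode("latin-1")

# Encoding to a codec that lacks the character substitutes '?'.
lossy = right.encode("ascii", errors="replace")
```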

Python 2: Comparing a unicode and a str

This topic is already on StackOverflow but I didn't find any satisfying solution:
I have some strings in Unicode coming from a server and I have some hardcoded strings in the code which I'd like to match against. I do understand why I can't just use ==, but I have not succeeded in converting them properly (I don't care whether I have to do str -> unicode or unicode -> str).
I tried encode and decode but it didn't give any result.
Here is what I receive...
fromServer = {unicode} u'Führerschein nötig'
fromCode = {str} 'Führerschein nötig'
(as you can see, it is german!)
How can I make them compare equal in Python 2?

First, make sure you declare the encoding of your Python source file at the top of the file. E.g. if your file is encoded as latin-1:
# -*- coding: latin-1 -*-
And second, always store text as Unicode strings:
fromCode = u'Führerschein nötig'
If you get bytes from somewhere, convert them to Unicode with str.decode before working with the text. For text files, specify the encoding when opening the file, e.g.:
import codecs
# use codecs.open to open a text file
f = codecs.open('unicode.rst', encoding='utf-8')
Code which compares byte strings with Unicode strings will often fail at random, depending on system settings, or whatever encoding happens to be used for a text file. Don't rely on it, always make sure you compare either two unicode strings or two byte strings.
Python 3 changed this behaviour: it will not try to convert any strings. 'a' and b'a' are considered objects of different types, and comparing them with == will always return False.
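The rule "compare text with text, or bytes with bytes" can be sketched on Python 3 as follows. The bytes value stands in for a Python 2 str, and UTF-8 is an assumed encoding chosen for the example.

```python
from_server = "F\u00fchrerschein n\u00f6tig"   # text, as received
from_code = from_server.encode("utf-8")        # bytes, like a Python 2 str

# Mixed comparison: never equal on Python 3, and no implicit conversion.
mixed = (from_code == from_server)

# Convert one side so both operands have the same type.
as_text = (from_code.decode("utf-8") == from_server)
as_bytes = (from_code == from_server.encode("utf-8"))
```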

tested on 2.7
for German umlauts latin-1 is used.
if 'Führerschein nötig'.decode('latin-1') == u'Führerschein nötig':
    print('yes....')
yes....

Python & fql: getting "Dami\u00e1n" instead of "Damián"

I created a file containing a dictionary with data written in Spanish (i.e. Damián, etc.):
fileNameX.write(json.dumps(dictionaryX, indent=4))
The data come from some fql fetching operations, i.e.:
select name from user where uid in XXX
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n".
I've tried some options:
ensure_ascii=False:
fileNameX.write(json.dumps(dictionaryX, indent=4, ensure_ascii=False))
But I get an error (UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position XXX: ordinal not in range(128)).
encode(encoding='latin-1'):
dictionaryX.append({
    'name': unicodeVar.encode(encoding='latin-1'),
    ...
})
But I get another error (UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position XXX: invalid continuation byte)
To sum up, I've tried several possibilities, but have less than a clue. I'm lost. Please, I need help. Thanks!

You have many options, and have stumbled upon something rather complicated that depends on your Python version and which you absolutely must understand fully in order to write correct code. Generally the approach taken in 3.x is stricter and a bit harder to work with, but it is much less likely that you will make a mistake or get yourself into a complicated situation. (Based on the exact symptoms you report, you seem to be using 2.x.)
json.dumps has different behaviour in 2.x and 3.x. In 2.x, it produces a str, which is a byte-string (unknown encoding). In 3.x, it still produces a str, but now str in 3.x is a proper Unicode string.
JSON is inherently a Unicode-supporting format, but it expects files to be in UTF-8 encoding. However, please understand that JSON supports \u style escapes in strings. When you read in this data, you will get the correctly decoded string back. The reading code produces unicode objects (no matter whether you use 2.x or 3.x) when it reads strings out of the JSON.
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n"
á cannot be represented in ASCII. It gets encoded as \u00e1 by default, to avoid the other problems you had. This happens even in 3.x.
ensure_ascii=False
This disables the previous encoding. In 2.x, it means you get a unicode object instead - a real Unicode object, preserving the original á character. In 3.x, it means that the character is not explicitly translated. But either way, ensure_ascii=False means that json.dumps will give you a Unicode string.
Unicode strings must be encoded to be written to a file. There is no such thing as "unicode data"; Unicode is an abstraction. In 2.x, this encoding is implicitly 'ascii' when you feed a Unicode object to file.write; it was expecting a str. To get around this, you can use the codecs module, or explicitly encode as 'utf-8' before writing. In 3.x, the encoding is set with the encoding keyword argument when you open the file (the default is again probably not what you want).
encode(encoding='latin-1')
Here, you are encoding before producing the dictionary, so that you have a str object in your data. Now a problem occurs because when there are str objects in your data, the JSON encoder assumes, by default, that they represent Unicode strings in UTF-8 encoding. This can be changed, in 2.x, using the encoding keyword argument to json.dumps. (In 3.x, the encoder will simply refuse to serialize bytes objects, i.e. non-Unicode strings!)
However, if your goal is simply to get the data into the file directly, then json.dumps is the wrong tool for you. Have you wondered what that s in the name is for? It stands for "string"; this is the special case. The ordinary case, in fact, is writing directly to a file! (Instead of giving you a string and expecting you to write it yourself.) Which is what json.dump (no 's') does. Again, the JSON standard expects UTF-8 encoding, and again 2.x has an encoding keyword parameter that defaults to UTF-8 (you should leave this alone).
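A Python 3 sketch of the points above, with a made-up dictionary and file name. It shows that the default \u escapes decode back to the original text, and that json.dump with ensure_ascii=False plus an explicit UTF-8 file encoding writes the character itself.

```python
import json
import os
import tempfile

data = {"name": "Dami\u00e1n"}  # hypothetical record

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False keeps the character itself in the output string.
literal = json.dumps(data, ensure_ascii=False)

# json.dump writes straight to a file; the encoding is chosen in open().
path = os.path.join(tempfile.mkdtemp(), "people.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, indent=4, ensure_ascii=False)

# Reading back gives the same text whichever form was written.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
```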

Use codecs.open() instead of open() to open fileNameX with a specific encoding, for example encoding='utf-8'.
Also, use json.dump() instead of json.dumps().

Since the string has a \u inside it that means it's a Unicode string. The string is actually correct! Your problem lies in displaying the string. If you print the string, Python's output encoding should print it in the proper encoding for your environment.
For example, this is what I get inside IDLE on Windows:
>>> print u'Dami\u00e1n'
Damián
