I have a text file filled with Unicode characters written as "\ud83d\udca5", but Python doesn't seem to like them.
But if I replace it with u'\U0001f4a5', which seems to be its Python escape style (according to Charbase), it works.
Is there a way to convert them all into the u"\Uxxxxxxxx" escape format that Python can understand?
Thanks.
You're mixing up Unicode and encoded strings. u'\U0001f4a5' is a Unicode object, Python's internal datatype for handling strings. (In Python 3, the u is optional since now all strings are Unicode objects).
Files, on the other hand, use encodings. UTF-8 is the most common one, but it's just one means of storing a Unicode object in a byte-oriented file or stream. When opening such a file, you need to specify the encoding so Python can translate the bytes into meaningful Unicode objects.
In your case, it seems you need to open the file using the UTF-16 codec instead of UTF-8.
with open("myfile.txt", encoding="utf-16") as f:
    s = f.read()
will give you the proper contents if the codec is in fact UTF-16. If it doesn't look right, try "utf-16-le" or "utf-16-be".
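As a rough illustration of why the codec matters (a Python 3 sketch, not part of the original answer; the temporary file name is purely illustrative), the same character produces different bytes under UTF-8 and UTF-16, and the pair D83D DCA5 from the question is its UTF-16 form:

```python
import os
import tempfile

boom = "\U0001f4a5"  # the character from the question

# The same code point, stored under two different encodings.
utf16_bytes = boom.encode("utf-16-be")  # surrogate pair D83D DCA5
utf8_bytes = boom.encode("utf-8")       # four-byte UTF-8 sequence

# Write the file with an explicit encoding, then read it back the same way.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")  # illustrative path
    with open(path, "w", encoding="utf-16") as f:
        f.write(boom)
    with open(path, encoding="utf-16") as f:
        round_tripped = f.read()

print(utf16_bytes.hex(), utf8_bytes.hex(), round_tripped == boom)
```

Reading the same file with the wrong codec would either raise a decoding error or produce unrelated characters, which is why the encoding has to be known rather than guessed.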
In JS, encoding means converting a string with special characters into an escaped, usable string. For example, encodeURIComponent converts spaces to %20 and so on, so that the result is usable in URIs.
So encoding here means converting to a particular format.
In Python 2.7, I have a string: 奥多比. To convert it into UTF-8 format, however, I need to use the decode() function.
Like: "奥多比".decode("utf-8") == u'\u5965\u591a\u6bd4'
I want to understand how the meaning of encode and decode changes between languages. To me, essentially, I should be doing "奥多比".encode("utf-8").
What am I missing here?
You appear to be confusing Unicode text (represented in Python 2 as the unicode type, indicated by the u prefix on the literal syntax), with one of the standard Unicode encodings, UTF-8.
You are not creating UTF-8; you are creating a Unicode text object by decoding from a UTF-8 byte stream.
The byte string literal "奥多比" is a sequence of binary data, bytes. You either entered these in a text editor and saved the file as UTF-8 (and told Python to treat your source code as UTF-8 by starting the file with a PEP 263 codec header), or you typed it into the Python interactive prompt in a terminal that was configured to send UTF-8 data.
I strongly urge you to read more about the difference between bytes, codecs and Unicode text. The following links are highly recommended:
Ned Batchelder's Pragmatic Unicode
The Python Unicode HOWTO
Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
In Python 2, "奥多比" has type str, i.e. it is a sequence of bytes. To convert it to a Unicode string, you need to decode this sequence of bytes using a codec. Simply put, the codec specifies how the bytes should be converted to a sequence of Unicode code points. See the Unicode HOWTO for a more in-depth article on this.
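The direction of the two operations can be sketched as follows. This is a Python 3 illustration (an assumption, since the question uses Python 2.7), where bytes plays the role of the Python 2 str and str plays the role of the Python 2 unicode type:

```python
# UTF-8 bytes, as they would be stored in a file or sent by a terminal.
raw = b"\xe5\xa5\xa5\xe5\xa4\x9a\xe6\xaf\x94"

text = raw.decode("utf-8")      # decode: bytes -> Unicode text
encoded = text.encode("utf-8")  # encode: Unicode text -> bytes

print(len(raw), len(text))  # nine bytes, three code points
print(encoded == raw)
```

So decode goes from a byte representation to text, and encode goes from text to a byte representation; UTF-8 is the byte side in both calls.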
I have a column in a spreadsheet whose header contains non-ASCII characters thus:
'ï»¿Campaign'
If I pop this string into the interpreter, I get:
'\xc3\xaf\xc2\xbb\xc2\xbfCampaign'
The string is one of the keys in the rows of a csv.DictReader().
When I try to populate a new dict with the value of this key:
spends['ï»¿Campaign'] = 2
I get:
KeyError: '\xc3\xaf\xc2\xbb\xc2\xbfCampaign'
If I print the value of the keys of row, I can see that it is '\xef\xbb\xbfCampaign'
Obviously then I can just update my program to access this key thus:
spends['\xef\xbb\xbfCampaign']
But is there a "better" way of doing this in Python? Indeed, if the value of this key ever changes to contain other non-ASCII characters, what is an all-encompassing way of handling any non-ASCII characters that may arise?
Your specific problem is the first three bytes of the file, "\xef\xbb\xbf". That's the UTF-8 encoding of the byte order mark (BOM), which is often prepended to text files to indicate they're encoded using UTF-8. You should strip these bytes. See Removing BOM from gzip'ed CSV in Python.
Second, you're decoding with the wrong codec. "ï»¿" is what you get if you decode those bytes using the Windows-1252 character set. That's why the bytes look different if you use these characters in a source file. See the Python 2 Unicode HOWTO.
In general, you should decode a bytestring into Unicode text using the corresponding character encoding as soon as possible on input. And, in reverse, encode Unicode text into a bytestring as late as possible on output. Some APIs such as io.open() can do it implicitly so that your code sees only Unicode.
Unfortunately, the csv module does not support Unicode directly on Python 2. See UnicodeReader and UnicodeWriter in the doc examples. You could create their analog for csv.DictReader or, as an alternative, just pass UTF-8 encoded bytestrings to the csv module.
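For comparison, here is a minimal sketch of how the BOM could be handled on Python 3, where the csv module works on decoded text. It assumes the "utf-8-sig" codec is appropriate for the file; the sample bytes and the Spend column are made up for illustration:

```python
import csv
import io

# Hypothetical file contents: a UTF-8 BOM followed by a small CSV.
data = b"\xef\xbb\xbfCampaign,Spend\r\nSpring,2\r\n"

# Decoding the BOM bytes as Windows-1252 yields the three stray characters.
mojibake = data[:3].decode("cp1252")

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8-sig", newline="")
rows = list(csv.DictReader(stream))

# The header is now plain "Campaign", with no BOM attached.
spends = {rows[0]["Campaign"]: int(rows[0]["Spend"])}
print(spends)
```

With a real file, the same effect should come from open(path, encoding="utf-8-sig", newline="") in place of the in-memory stream.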
When reading about codecs, encoding and decoding, I found that I should use the encode function on the string directly, and that worked fine. After that, I read about what Unicode and ASCII are, in addition to the different UTF encodings.
But when reading further, I found that most people seem to import the codecs module and use encode from the module. I don't see much of a difference between str.encode and codecs.encode. Does it matter which one I use? I'm just specifying the encoding I need in the encode function.
Also, when reading the thread "python string encode / decode", I looked at the link in the accepted answer, which points to a slide show that is supposed to "completely demystify unicode and utf", but on one of the slides the author says that UTF is used to translate numbers to characters, which I can't see being correct.
From my understanding, based on http://www.rrn.dk/the-difference-between-utf-8-and-unicode (which was also quoted in another SO thread), UTF does not translate numbers to characters. It translates binary numbers to numbers found in Unicode or whichever other character set is being used. So UTF would translate a binary number to a number, and then Unicode would translate that number to a character. So did he get it wrong when trying to completely demystify this?
The Python doc pages for these two functions are here:
https://docs.python.org/2/library/stdtypes.html#str.encode
https://docs.python.org/2/library/codecs.html#codecs.encode
str.encode() is called on a string object like this:
"this is a string".encode()
codecs.encode() is called with a string as an argument like this:
codecs.encode("this is a string")
They each take an optional encoding argument.
str.encode()'s default encoding is the current default, according to the doc page, but according to the Unicode HOWTO, that's "ascii".
codecs.encode()'s default encoding is "ascii"
Both functions take an errors argument that defaults to "strict".
It looks like they're pretty much the same except for the way they're called.
codecs.encode(obj, encoding='utf-8', errors='strict')
encode text to bytes, text to text, and bytes to bytes
str.encode(encoding="utf-8", errors="strict")
encode text to bytes
so, I think 2.⊆1.
One difference is what codecs you can use. str.encode is fine for casting among string codecs, but try converting a string to base64.
str.encode("base64")
LookupError: 'base64' is not a text encoding; use codecs.encode() to handle arbitrary codecs
but this will work
codecs.encode(str.encode(), "base64")
or this
base64.encodebytes(str.encode())
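The difference can be sketched like this on Python 3 (variable names are illustrative). The example assumes that the "base64" codec and base64.encodebytes produce the same bytes, which is how the standard library implements that codec:

```python
import base64
import codecs

text = "this is a string"
raw = text.encode("utf-8")  # text -> bytes: fine with str.encode

# str.encode only accepts text encodings, so "base64" is rejected.
try:
    text.encode("base64")
    rejected = False
except LookupError:
    rejected = True

b64_via_codecs = codecs.encode(raw, "base64")  # bytes -> bytes
b64_via_module = base64.encodebytes(raw)       # same transformation
rot13 = codecs.encode("abc", "rot13")          # text -> text

print(rejected, b64_via_codecs == b64_via_module, rot13)
```

In everyday code, str.encode and bytes.decode cover the text encodings, and codecs.encode is mostly needed for the bytes-to-bytes and text-to-text transforms.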
The code below causes a UnicodeDecodeError:
#-*- coding:utf-8 -*-
s="中文"
u=u"123"
u=s+u
I know it's because the Python interpreter uses ASCII to decode s.
Why doesn't the Python interpreter use the file's encoding (UTF-8) for decoding?
Implicit decoding cannot know what source encoding was used. That information is not stored with strings.
All that Python has after importing is a byte string with characters representing bytes in the range 0-255. You could have imported that string from another module, or read it from a file object, etc. The fact that the parser knew what encoding was used for those bytes doesn't even matter for plain byte strings.
As such, it is always better to decode bytes explicitly, rather than rely on the implicit decoding. Either use a Unicode literal for s as well, or explicitly decode using str.decode():
u = s.decode('utf8') + u
The types of the 2 strings are different - the first is a normal string, second is a unicode string, hence the error.
So, instead of doing s="中文", do the following to get Unicode strings for both:
s=u"中文"
u=u"123"
u=s+u
The code works perfectly fine on Python 3.
However, in Python 2, if you do not add a u before a string literal, you are constructing a string of bytes. When one wants to combine a string of bytes and a string of characters, one either has to decode the string of bytes, or encode the string of characters. Python 2.x opted for the former. In order to prevent accidents (for example, someone appending binary data to a user input and thus generating garbage), the Python developers chose ascii as the encoding for that conversion.
You can add a line
from __future__ import unicode_literals
after the #coding declaration so that literals without u or b prefixes are always character and not byte literals.
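The behaviour discussed in this thread can be sketched on Python 3, where the two types are bytes and str. This is an illustration under that assumption rather than the Python 2 code from the question:

```python
# Byte string, standing in for the Python 2 literal "中文" in a UTF-8 source file.
s = "中文".encode("utf-8")
u = "123"  # Unicode text

# Python 3 refuses to mix the two types instead of guessing an encoding.
try:
    s + u
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding as ASCII, which Python 2 attempts implicitly, fails on these bytes.
try:
    s.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding explicitly with the known encoding works.
combined = s.decode("utf-8") + u
print(mixed_ok, ascii_ok, combined)
```

Either way, the fix is the same as in Python 2: state the encoding at the point where bytes become text.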
I created a file containing a dictionary with data written in Spanish (i.e. Damián, etc.):
fileNameX.write(json.dumps(dictionaryX, indent=4))
The data come from some fql fetching operations, i.e.:
select name from user where uid in XXX
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n".
I've tried some options:
ensure_ascii=False:
fileNameX.write(json.dumps(dictionaryX, indent=4, ensure_ascii=False))
But I get an error (UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position XXX: ordinal not in range(128)).
encode(encoding='latin-1'):
dictionaryX.append({
    'name': unicodeVar.encode(encoding='latin-1'),
    ...
})
But I get another error (UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position XXX: invalid continuation byte)
To sum up, I've tried several possibilities, but have less than a clue. I'm lost. Please, I need help. Thanks!
You have many options, and have stumbled upon something rather complicated that depends on your Python version and which you absolutely must understand fully in order to write correct code. Generally the approach taken in 3.x is stricter and a bit harder to work with, but it is much less likely that you will make a mistake or get yourself into a complicated situation. (Based on the exact symptoms you report, you seem to be using 2.x.)
json.dumps has different behaviour in 2.x and 3.x. In 2.x, it produces a str, which is a byte-string (unknown encoding). In 3.x, it still produces a str, but now str in 3.x is a proper Unicode string.
JSON is inherently a Unicode-supporting format, but it expects files to be in UTF-8 encoding. However, please understand that JSON supports \u style escapes in strings. When you read this data back in, the escapes are decoded and you get the original string back. The reading code produces unicode objects (no matter whether you use 2.x or 3.x) when it reads strings out of the JSON.
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n"
á cannot be represented in ASCII. It gets encoded as \u00e1 by default, to avoid the other problems you had. This happens even in 3.x.
ensure_ascii=False
This disables the previous encoding. In 2.x, it means you get a unicode object instead - a real Unicode object, preserving the original á character. In 3.x, it means that the character is not explicitly translated. But either way, ensure_ascii=False means that json.dumps will give you a Unicode string.
Unicode strings must be encoded to be written to a file. There is no such thing as "unicode data"; Unicode is an abstraction. In 2.x, this encoding is implicitly 'ascii' when you feed a Unicode object to file.write; it was expecting a str. To get around this, you can use the codecs module, or explicitly encode as 'utf-8' before writing. In 3.x, the encoding is set with the encoding keyword argument when you open the file (the default is again probably not what you want).
encode(encoding='latin-1')
Here, you are encoding before producing the dictionary, so that you have a str object in your data. Now a problem occurs because when there are str objects in your data, the JSON encoder assumes, by default, that they represent Unicode strings in UTF-8 encoding. This can be changed, in 2.x, using the encoding keyword argument to json.dumps. (In 3.x, the encoder will simply refuse to serialize bytes objects, i.e. non-Unicode strings!)
However, if your goal is simply to get the data into the file directly, then json.dumps is the wrong tool for you. Have you wondered what that s in the name is for? It stands for "string"; this is the special case. The ordinary case, in fact, is writing directly to a file! (Instead of giving you a string and expecting you to write it yourself.) Which is what json.dump (no 's') does. Again, the JSON standard expects UTF-8 encoding, and again 2.x has an encoding keyword parameter that defaults to UTF-8 (you should leave this alone).
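A minimal Python 3 sketch of the points above, assuming a UTF-8 output file; the sample data and the temporary path are hypothetical stand-ins for dictionaryX and fileNameX:

```python
import json
import os
import tempfile

dictionaryX = [{"name": "Damián"}]  # hypothetical sample data

# Default behaviour: non-ASCII characters are written as \u escapes,
# and loading the result gives the original text back.
escaped = json.dumps(dictionaryX)
restored = json.loads(escaped)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")  # illustrative path
    # json.dump writes straight to the file; the file object does the encoding.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dictionaryX, f, indent=4, ensure_ascii=False)
    with open(path, "rb") as f:
        file_bytes = f.read()
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(escaped)
print(reloaded == dictionaryX)
```

Both forms are valid JSON and load to the same data; ensure_ascii=False only changes how the file looks when opened in an editor.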
Use codecs.open() to open fileNameX with a specific encoding like encoding='utf-8' for example instead of using open().
Also, use json.dump() to write to the file directly.
Since the string has a \u inside it that means it's a Unicode string. The string is actually correct! Your problem lies in displaying the string. If you print the string, Python's output encoding should print it in the proper encoding for your environment.
For example, this is what I get inside IDLE on Windows:
>>> print u'Dami\u00e1n'
Damián