I have a string S="Test" in Python. I want to encode the string into the CP1256, ISO-8859-1, ISO-8859-2, ISO-8859-6, ISO-8859-15 and Windows-1252 formats. How can I encode the string into these formats?
I don't know why Slava Bacherikov deleted his answer, but it was the right answer, so I'll repeat it with more detail.
str.encode is exactly what you want:
Return an encoded version of the string as a bytes object. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Codec Base Classes. For a list of possible encodings, see section Standard Encodings.
If you follow that link to Standard Encodings, you'll see a nice table that shows you the names to use for each of these (you can use either the main codec name, or any of the aliases).
So:
encoded_bytes = [S.encode(codec) for codec in
                 ('cp1256', 'iso-8859-1', 'iso-8859-2', 'iso-8859-6',
                  'iso-8859-15', 'windows-1252')]
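As a rough sketch of the loop above (Python 3 assumed; the helper name encode_all and the sample strings are illustrative and not part of the original answer), the following records the result for each codec and stores None where strict encoding fails:

```python
def encode_all(text, codec_names, errors="strict"):
    """Encode text with each codec; map codec name -> bytes, or None on failure."""
    results = {}
    for name in codec_names:
        try:
            results[name] = text.encode(name, errors)
        except UnicodeEncodeError:
            # The character has no representation in this code page.
            results[name] = None
    return results


CODECS = ("cp1256", "iso-8859-1", "iso-8859-2", "iso-8859-6",
          "iso-8859-15", "windows-1252")

ascii_results = encode_all("Test", CODECS)   # pure ASCII text
euro_results = encode_all("\u20ac", CODECS)  # the euro sign is not in every code page

for name in CODECS:
    print(name, ascii_results[name], euro_results[name])
```

Pure-ASCII text such as "Test" yields the same bytes in all six code pages; the codecs only differ for characters outside ASCII.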
While you could use codecs.encode as the other answers suggest, there's really no good reason to do so, and one good reason not to: str.encode enforces the fact that you're calling it on a str object, and using a codec that translates str to bytes; you'll get an exception if you accidentally use it on an already-encoded bytes or a list or something.
All of the above is assuming you're using Python 3. If you're using Python 2, a str is already encoded. So, if you can start with a unicode object, like u"Test" instead of "Test", do that; if not, you will want to decode it first. Unfortunately, Python 2 won't enforce that; if you call str.encode it will actually decode it with sys.getdefaultencoding, which will usually be ASCII, which will lead to silly errors.
It's what the codecs module is for:
codecs.encode(S,'CP1256')
Just use the codecs module
import codecs
codecs.encode("hello", "iso-8859-6")
If you first want to check whether Python knows a certain encoding, just use:
format_name = "iso-8859-6"
try:
    codecs.lookup(format_name)
except LookupError:
    print("Encoding {} can't be found".format(format_name))
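The same check can be wrapped in a small helper. This is only a sketch; the function name is_known_encoding and the candidate names are made up for illustration:

```python
import codecs


def is_known_encoding(name):
    """Return True if Python can find a codec registered under this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


for candidate in ("iso-8859-6", "windows-1252", "no-such-codec"):
    if is_known_encoding(candidate):
        # CodecInfo.name is the canonical name the alias resolves to.
        print(candidate, "->", codecs.lookup(candidate).name)
    else:
        print("Encoding {} can't be found".format(candidate))
```

Because codecs.lookup resolves aliases, this also shows which canonical codec a name such as "windows-1252" maps to.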
I’m trying to understand how encoding unicode works in Python 2.7. So far it has been easy to find the solution, but I haven’t found any clear explanation of what is going on here. Here is an example.
The introduction
We have a unicode variable we received, called filter_type
filter_type = u'some_välüe'
We put this into a dict and pass this into the python library urllib.urlencode.
Like so:
urllib.urlencode({"param": ..., "filter_type": filter_type})
The issue.
Inside urllib.urlencode it loops around the data given to it and wraps the keys and values into the str() builtin function to get a string representation of each key and value before encoding it into a url.
We get an error similar to the following:
{UnicodeEncodeError}'ascii' codec can't encode character u'\xf1' in position 42: ordinal not in range(128).
You get this same error by doing str(u'some_välüe').
So after some research and digging into this it looks like when you wrap unicode values in the str() it tries to encode the value into the default encoding that is set. (my assumption)
>>> import sys
>>> sys.getdefaultencoding()
ascii
The solution.
So we can fix this by encoding these unicode strings with utf-8.
filter_type = u'some_välüe'.encode('utf-8')
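For comparison only, here is a sketch of the same operation in Python 3, where the function lives in urllib.parse (this is not the asker's Python 2.7 environment). There, urlencode accepts both text and already-encoded bytes and produces the same query string for either:

```python
from urllib.parse import urlencode

filter_type = "some_v\u00e4l\u00fce"  # 'some_välüe'

# Text value: urlencode encodes it as UTF-8 by default before percent-escaping.
from_text = urlencode({"filter_type": filter_type})

# Already-encoded bytes: the bytes are percent-escaped as they are.
from_bytes = urlencode({"filter_type": filter_type.encode("utf-8")})

print(from_text)
print(from_bytes)
```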
The question.
But here is the question. Earlier I mentioned that urllib.urlencode wraps keys and values in the str() function.
These values are already encoded now, so:
What does str() do in this case?
Does the representation of a unicode object change when it’s encoded to utf-8?
If it does, why did str() try to encode the unicode object to ascii (the default) in the first place?
When reading about codecs, encoding and decoding, I found that I should use the encode function on the string directly, and that worked fine. After that I read about what Unicode and ASCII are, as well as the different UTF encodings.
But when reading further I found that most people seem to import the codecs module and use encode from the module. I don't see much of a difference between str.encode and codecs.encode. Does it matter which one I use? I'm just specifying the encoding I need in the encode function.
Also, when reading the thread "python string encode / decode", I looked at the link in the accepted answer, which shows a slide show that is supposed to "completely demystify unicode and utf", but on one of the slides he says that UTF is used to translate numbers to characters, which I don't think is correct.
From my understanding, based on http://www.rrn.dk/the-difference-between-utf-8-and-unicode, which was also quoted in another SO thread, UTF does not translate numbers to characters. It translates binary numbers to numbers found in Unicode or whichever other character set is being used. So UTF would translate a binary number to a number, and Unicode would then translate that number to a character. So did he get it wrong when trying to completely demystify this?
The Python doc pages for these two functions are here:
https://docs.python.org/2/library/stdtypes.html#str.encode
https://docs.python.org/2/library/codecs.html#codecs.encode
str.encode() is called on a string object like this:
"this is a string".encode()
codecs.encode() is called with a string as an argument like this:
codecs.encode("this is a string")
They each take an optional encoding argument.
str.encode()'s default encoding is the current default, according to the doc page, but according to the Unicode HOWTO, that's "ascii"
codecs.encode()'s default encoding is "ascii"
Both functions take an errors argument that defaults to "strict".
It looks like they're pretty much the same except for the way they're called.
1. codecs.encode(obj, encoding='utf-8', errors='strict')
encodes text to bytes, text to text, and bytes to bytes
2. str.encode(encoding="utf-8", errors="strict")
encodes text to bytes
So I think 2 ⊆ 1: everything str.encode can do, codecs.encode can do as well.
One difference is which codecs you can use. str.encode is fine for text encodings (str to bytes), but try converting a string s to base64 with it:
s.encode("base64")
LookupError: 'base64' is not a text encoding; use codecs.encode() to handle arbitrary codecs
But this will work:
codecs.encode(s.encode(), "base64")
or this (base64.encodestring in older versions; it was removed in Python 3.9):
base64.encodebytes(s.encode())
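The difference can be tried out with a short sketch (Python 3 assumed; the sample text "hello" is arbitrary):

```python
import base64
import codecs

text = "hello"

try:
    text.encode("base64")  # str.encode only accepts text encodings
except LookupError as exc:
    print("str.encode refused:", exc)

# Both routes need bytes as input, so the text is encoded to UTF-8 first.
via_codecs = codecs.encode(text.encode("utf-8"), "base64")
via_module = base64.encodebytes(text.encode("utf-8"))

print(via_codecs, via_module)
```

Both calls return bytes, including a trailing newline.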
Why in Python 3 would the following code
print(str(b"Hello"))
output b'Hello' instead of just Hello, as happens with regular text strings? Explicitly creating a str object from the most closely related binary string type looks like it should be easy, so this result seems counter-intuitive.
In Python 3, bytes.__str__ is not defined, so bytes.__repr__ is used instead, when you use str() on the object. Note that print() also calls str() on objects passed in, so the call is entirely redundant here.
If you are expecting text, decode explicitly instead:
print(b'Hello'.decode('ascii'))
The str() type can handle bytes objects explicitly, but only if (again) you provide an explicit codec to decode the bytes with first:
print(str(b'Hello', 'ascii'))
The documentation is very explicit about this behaviour:
If neither encoding nor errors is given, str(object) returns object.__str__(), which is the “informal” or nicely printable string representation of object. For string objects, this is the string itself. If object does not have a __str__() method, then str() falls back to returning repr(object).
If at least one of encoding or errors is given, object should be a bytes-like object (e.g. bytes or bytearray). In this case, if object is a bytes (or bytearray) object, then str(bytes, encoding, errors) is equivalent to bytes.decode(encoding, errors).
and
Passing a bytes object to str() without the encoding or errors arguments falls under the first case of returning the informal string representation.
Emphasis mine.
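The three cases described above can be compared side by side in a minimal Python 3 sketch (the variable names are illustrative):

```python
data = b"Hello"

as_repr = str(data)             # no encoding given: falls back to the repr
decoded = data.decode("ascii")  # explicit decode
via_str = str(data, "ascii")    # str() with an encoding decodes as well

print(as_repr)
print(decoded)
print(via_str)
```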
Why do you want this to "work"? A bytes object is a bytes object, and this is the form its string representation takes in Python 3. To convert its contents to a proper text string (a str in Python 3, which would be a "unicode" object in Python 2), you have to decode it to text.
And for that you need to know the encoding.
Try the following instead:
print(b"Hello".decode("latin-1"))
Note the assumed "latin-1" text codec, which maps every byte outside the ASCII range (128-255) directly to the Unicode code point with the same number, so decoding never fails. It is close to, but not identical with, cp1252, the code page Windows uses by default for Western European languages.
The "utf-8" codec can represent a much larger range of characters and is the preferred encoding for international text, but if your byte string is not valid UTF-8 you may get a UnicodeDecodeError in the process.
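A small sketch of that trade-off, assuming Python 3 and a made-up byte string:

```python
raw = b"caf\xe9"  # 'café' encoded as Latin-1

# Every byte maps to a code point in Latin-1, so this cannot fail.
as_latin1 = raw.decode("latin-1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE9 starts a multi-byte sequence that is never completed.
    utf8_ok = False

# Decoding with the wrong codec does not fail either; it just yields wrong text.
mojibake = "caf\u00e9".encode("utf-8").decode("latin-1")

print(as_latin1, utf8_ok, mojibake)
```

So "latin-1 never raises" is not the same as "latin-1 is right": you still need to know which encoding the bytes were written in.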
Please read http://www.joelonsoftware.com/articles/Unicode.html to properly understand what text is about.
Beforehand, sorry for my English...
Hey, I had this problem some weeks ago. It works as the people above said.
Here is a tip for when you don't care about errors during decoding. In that case you can use:
bytesText.decode(textEncoding, 'ignore')
Example:
>>> b'text \xab text'.decode('utf-8', 'ignore')  # Using UTF-8 is nice, as you might know!
'text  text'
# As you can see, the lone byte \xab (« in Latin-1) was ignored :D
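'ignore' is only one of the standard error handlers. A short sketch comparing it with two others on the same bytes (Python 3.5 or later assumed, since 'backslashreplace' did not work for decoding before that):

```python
raw = b"text \xab text"  # 0xAB alone is not a valid UTF-8 sequence

ignored = raw.decode("utf-8", "ignore")            # drop the bad byte
replaced = raw.decode("utf-8", "replace")          # substitute U+FFFD
escaped = raw.decode("utf-8", "backslashreplace")  # keep it as a visible escape

print(repr(ignored), repr(replaced), repr(escaped))
```

Note that 'ignore' silently loses data, so 'replace' or 'backslashreplace' may be preferable when you want to see that something went wrong.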
I created a file containing a dictionary with data written in Spanish (i.e. Damián, etc.):
fileNameX.write(json.dumps(dictionaryX, indent=4))
The data come from some fql fetching operations, i.e.:
select name from user where uid in XXX
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n".
I've tried some options:
ensure_ascii=False:
fileNameX.write(json.dumps(dictionaryX, indent=4, ensure_ascii=False))
But I get an error (UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position XXX: ordinal not in range(128)).
encode(encoding='latin-1'):
dictionaryX.append({
    'name': unicodeVar.encode(encoding='latin-1'),
    ...
})
But I get another error (UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position XXX: invalid continuation byte).
To sum up, I've tried several possibilities, but have less than a clue. I'm lost. Please, I need help. Thanks!
You have many options, and have stumbled upon something rather complicated that depends on your Python version and which you absolutely must understand fully in order to write correct code. Generally the approach taken in 3.x is stricter and a bit harder to work with, but it is much less likely that you will make a mistake or get yourself into a complicated situation. (Based on the exact symptoms you report, you seem to be using 2.x.)
json.dumps has different behaviour in 2.x and 3.x. In 2.x, it produces a str, which is a byte-string (unknown encoding). In 3.x, it still produces a str, but now str in 3.x is a proper Unicode string.
JSON is inherently a Unicode-supporting format, but it expects files to be in UTF-8 encoding. However, please understand that JSON supports \u style escapes in strings. When you read this data back in, the escapes are decoded and you get the original string back. The reading code produces Unicode strings (unicode objects in 2.x, str in 3.x) when it reads strings out of the JSON.
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n"
á cannot be represented in ASCII. It gets encoded as \u00e1 by default, to avoid the other problems you had. This happens even in 3.x.
ensure_ascii=False
This disables the previous encoding. In 2.x, it means you get a unicode object instead - a real Unicode object, preserving the original á character. In 3.x, it means that the character is not explicitly translated. But either way, ensure_ascii=False means that json.dumps will give you a Unicode string.
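A minimal Python 3 sketch of the two settings (the record is made-up sample data):

```python
import json

record = {"name": "Dami\u00e1n"}  # 'Damián'

escaped = json.dumps(record)                      # default: ensure_ascii=True
literal = json.dumps(record, ensure_ascii=False)  # keeps the á as it is

print(escaped)
print(literal)

# Both forms decode to the same data.
assert json.loads(escaped) == json.loads(literal) == record
```

Either output is valid JSON; the escaped form is just the pure-ASCII spelling of the same string.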
Unicode strings must be encoded to be written to a file. There is no such thing as "unicode data"; Unicode is an abstraction. In 2.x, this encoding is implicitly 'ascii' when you feed a Unicode object to file.write; it was expecting a str. To get around this, you can use the codecs module, or explicitly encode as 'utf-8' before writing. In 3.x, the encoding is set with the encoding keyword argument when you open the file (the default is again probably not what you want).
encode(encoding='latin-1')
Here, you are encoding before producing the dictionary, so that you have a str object in your data. Now a problem occurs because when there are str objects in your data, the JSON encoder assumes, by default, that they represent Unicode strings in UTF-8 encoding. This can be changed, in 2.x, using the encoding keyword argument to json.dumps. (In 3.x, the encoder will simply refuse to serialize bytes objects, i.e. non-Unicode strings!)
However, if your goal is simply to get the data into the file directly, then json.dumps is the wrong tool for you. Have you wondered what that s in the name is for? It stands for "string"; this is the special case. The ordinary case, in fact, is writing directly to a file! (Instead of giving you a string and expecting you to write it yourself.) Which is what json.dump (no 's') does. Again, the JSON standard expects UTF-8 encoding, and again 2.x has an encoding keyword parameter that defaults to UTF-8 (you should leave this alone).
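In Python 3, where json.dump has no encoding parameter and the file object does the encoding, writing directly to a file might look like the sketch below. The sample data and the temporary path are placeholders, not the asker's real fileNameX or dictionaryX:

```python
import json
import os
import tempfile

records = [{"name": "Dami\u00e1n"}]  # hypothetical sample data

# Temporary path used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "names.json")

# The file object does the encoding; json.dump just writes text to it.
with open(path, "w", encoding="utf-8") as handle:
    json.dump(records, handle, indent=4, ensure_ascii=False)

with open(path, "rb") as handle:
    raw = handle.read()  # the UTF-8 bytes that ended up on disk

with open(path, encoding="utf-8") as handle:
    loaded = json.load(handle)

print(raw)
```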
Use codecs.open() instead of open() to open fileNameX with a specific encoding, for example encoding='utf-8'.
Also, use json.dump() to write to the file directly.
Since the string has a \u inside it that means it's a Unicode string. The string is actually correct! Your problem lies in displaying the string. If you print the string, Python's output encoding should print it in the proper encoding for your environment.
For example, this is what I get inside IDLE on Windows:
>>> print u'Dami\u00e1n'
Damián
I am trying to write some strings to a file (the strings have been given to me by the HTML parser BeautifulSoup).
I can use "print" to display them, but when I use file.write() I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 6: ordinal not in range(128)
How can I parse this?
If I type 'python unicode' into Google, I get about 14 million results; the first is the official doc which describes the whole situation in excruciating detail; and the fourth is a more practical overview that will pretty much spoon-feed you an answer, and also make sure you understand what's going on.
You really do need to read and understand these sorts of overviews, however long they seem. There really isn't any getting around it. Text is hard. There is no such thing as "plain text", there hasn't been a reasonable facsimile for years, and there never really was, although we spent decades pretending there was. But Unicode is at least a standard.
You also should read http://www.joelonsoftware.com/articles/Unicode.html .
This error occurs when you pass a Unicode string containing non-ASCII characters (code points above 127) to something that expects an ASCII bytestring. The default encoding for a Python 2 bytestring is ASCII, "which handles exactly 128 (English) characters". This is why trying to convert characters above 127 produces the error.
The unicode() constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings.
The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors.
For example,
s = u'La Pe\xf1a'
print s.encode('latin-1')
or
write(s.encode('latin-1'))
will encode the text using latin-1 before it is printed or written.
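An alternative to encoding by hand before every write is to open the file with an explicit encoding. The following is a sketch only; it assumes Python 3, and the sample text and throwaway path are invented for the demonstration:

```python
import io
import os
import tempfile

text = "Price: \u00a3100"  # contains U+00A3, the pound sign

path = os.path.join(tempfile.mkdtemp(), "out.txt")  # throwaway path for the demo

# io.open accepts an encoding argument; in Python 3 it is the same
# function as the built-in open.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

with io.open(path, "rb") as handle:
    raw = handle.read()

print(raw)
```

io.open also exists in Python 2.6 and later, where the text written would have to be a unicode object (for example u"Price: \u00a3100").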
The answer to your question is "use codecs". The appended code also shows some gettext magic, FWIW. http://wiki.wxpython.org/Internationalization
import codecs
import gettext
import wx
localedir = './locale'
langid = wx.LANGUAGE_DEFAULT # use OS default; or use LANGUAGE_JAPANESE, etc.
domain = "MyApp"
mylocale = wx.Locale(langid)
mylocale.AddCatalogLookupPathPrefix(localedir)
mylocale.AddCatalog(domain)
translater = gettext.translation(domain, localedir,
                                 [mylocale.GetCanonicalName()], fallback = True)
translater.install(unicode = True)
# translater.install() installs the gettext _() translater function into our namespace...
msg = _("A message that gettext will translate, probably putting Unicode in here")
# use codecs.open() to convert Unicode strings to UTF8
Logfile = codecs.open(logfile_name, 'w', encoding='utf-8')
Logfile.write(msg + '\n')
Despite Google being full of hits on this problem, I found it rather hard to find this simple solution (it is actually in the Python docs about Unicode, but rather buried).
So ... HTH...
GaJ