Is u'string' the same as 'string'.decode('XXX')

Is u'string' the same as 'string'.decode('XXX') - python

Although the title is a question, the short answer is apparently no. I've tried in the shell. The real question is why?
ps: string is some non-ascii characters like Chinese and XXX is the current encoding of string
>>> u'中文' == '中文'.decode('gbk')
False
//The first one is u'\xd6\xd0\xce\xc4' while the second one u'\u4e2d\u6587'
The example is above. I am using windows chinese simplyfied. The default encoding is gbk, so is the python shell. And I got the two unicode object unequal.
UPDATES
a = '中文'.decode('gbk')
>>> a
u'\u4e2d\u6587'
>>> print a
中文
>>> b = u'中文'
>>> print b
ÖÐÎÄ

Yes, str.decode() usually returns a unicode string, if the codec successfully can decode the bytes. But the values only represent the same text if the correct codec is used.
Your sample text is not using the right codec; you have text that is GBK encoded, decoded as Latin1:
>>> print u'\u4e2d\u6587'
中文
>>> u'\u4e2d\u6587'.encode('gbk')
'\xd6\xd0\xce\xc4'
>>> u'\u4e2d\u6587'.encode('gbk').decode('latin1')
u'\xd6\xd0\xce\xc4'
The values are indeed not equal, because they are not the same text.
Again, it is important that you use the right codec; a different codec will result in very different results:
>>> print u'\u4e2d\u6587'.encode('gbk').decode('latin1')
ÖÐÎÄ
I encoded the sample text to Latin-1, not GBK or UTF-8. Decoding may have succeeded, but the resulting text is not readable.
Note also that pasting non-ASCII characters only work because the Python interpreter has detected my terminal codec correctly. I can paste text from my browser into my terminal, which then passes the text to Python as UTF-8-encoded data. Because Python has asked the terminal what codec was used, it was able to decode back again from the u'....' Unicode literal value. When printing the encoded.decode('utf8') unicode result, Python once more auto-encodes the data to fit my terminal encoding.
To see what codec Python detected, print sys.stdin.encoding:
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
Similar decisions have to be made when dealing with different sources of text. Reading string literals from the source file, for example, requires that you either use ASCII only (and use escape codes for everything else), or provide Python with an explicit codec notation at the top of the file.
I urge you to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
to gain a more complete understanding on how Unicode works, and how Python handles Unicode.

Assuming Python2.7 by the title.
The answer is no. No because when you issue string.decode(XXX) you'll get a Unicode depending on the codec you pass as argument.
When you use u'string' the codec is inferred by the shell's current encoding, or if it's a file, you'll get ascii as default or whatever # coding: utf-8 special comment you insert at the beginning of the script.
Just to clearify, if codec XXX is ensured to always be the same codec used for the script's input (either the shell or the file) then both approaches behave pretty much the same.
Hope this helps!

Related

encoding issue. Replace special character

I have a dictionary that looks like this:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Now I'm trying to replace the \xf6 with ö ,
but trying .replace('\xf6', 'ö') returns an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position
0: ordinal not in range(128)
How can I fix this?

Now encoding is a mine field, and I might be off on this one - please correct me if that's the case.
From what I've gathered over the years is that Python2 assumes ASCII unless you defined a encoding at the top of your script. Mainly because either it's compiled that way or the OS/Terminal uses ASCII as it's primary encoding.
With that said, what you see in your example data:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Is the ASCII representation of a unicode string. Some how Python needs to tell you there's an ö in there - but it can't with ASCII because ö has no representation in the ASCII table.
But when you try to replace it using:
x.replace('\xf6', 'ö')
You're trying to find a ASCII character/string called \xf6 that is outside of the accepted bytes ranges of ASCII, so that will raise an exception. And you're trying to replace it with another invalid ASCII character and that will cause the same exception.
Hence why you get the "'ascii' codec can't decode byte...' message.
You can do unicode replacements like this:
a = u'Ganztags ge\xf6ffnet'
a.replace(u'\xf6', u'ö')
This will tell Python to find a unicode string, and replace it with another unicode string.
But the output data will result in the same thing in the example above, because \xf6 is ö in unicode.
What you want to do, is encode your string into something you want to use, for instance - UTF-8:
a.encode('UTF-8')
'Ganztags ge\xc3\xb6ffnet'
And define UTF-8 as your primary encoding by placing this at the top of your code:
#!/usr/bin/python
# -*- coding: UTF-8
This should in theory make your application a little easier to work with.
And you can from then on work with UTF-8 as your base model.
But there's no way that I know of, to convert your representation into a ASCII ö, because there really isn't such a thing. There's just different ways Python will do this encoding magic for you to make you believe it's possible to "just write ö".
In Python3 most of the strings you encounter will either be bytes data or treated a bit differently from Python2. And for the most part it's a lot easier.
There's numerous ways to change the encoding that is not part of the standard praxis. But there are ways to do it.
The closest to "good" praxis, would be the locale:
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')
I also had a horrendous solution and approach to this years back, it looked something like this (it was a great bodge for me at the time):
Python - Encoding string - Swedish Letters
tl;dr:
Your code usually assume/use ASCII as it's encoder/decoder.
ö is not a part of ASCII, there for you'll always see \xf6 if you've some how gotten unicode characters. Normally, if you print u'Ganztags ge\xf6ffnet' it will be shown as a Ö because of automatic encoding, if you need to verify if input matches that string, you have to compare them u'ö' == u'ö', if other systems depend on this data, encode it with something they understand .encode('UTF-8'). But replacing \xf6 with ö is the same thing, just that ö doesn't exist in ASCII and you need to do u'ö' - which, will result in the same data at the end.

As you are using German language, you should be aware of non ascii characters. You know whether your system prefers Latin1 (Windows console and some Unixes), UTF8 (most Linux variants), or native unicode (Windows GUI).
If you can process everything as native unicode things are cleaner and you should just accept the fact that u'ö' and u'\xf6' are the same character - the latter is simply independant of the python source file charset.
If you have to output byte strings of store them in files, you should encode them in UTF8 (can process any unicode character but characters of code above 127 use more than 1 byte) or Latin1 (one byte per character, but only supports unicode code point below 256)
In that case just use an explicit encoding to convert your unicode strings to byte strings:
print u'Ganztags ge\xf6ffnet'.encode('Latin1') # or .encode('utf8')
should give what you expect.

Python os.walk() umlauts u'\u0308'

I'm on a OSX machine and running Python 2.7. I'm trying to do a os.walk on a smb share.
for root, dirnames, filenames in os.walk("./test"):
for filename in filenames:
print filename
matchObj = re.match( r".*ö.*",filename,re.UNICODE)
if i use the above code it works as long as the filename do not contain umlauts.
In my shell the umlauts are printed fine but when I copy them back to a utf8 formated Textdeditor (in my case Sublime), I get:
screenshot
Expected:
filename.jpeg
filename_ö.jpg
Of course the regex fails with that.
if i hardcode the filename like:
re.match( r".*ö.*",'filename_ö',re.UNICODE)
it works fine.
I tried:
os.walk(u"./test")
filename.decode('utf8')
but gives me:
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0308' in position 10: ordinal not in range(128)
u'\u0308' are the dots above the umlauts.
I'm overlooking something stupid i guess?

Unicode characters can be represented in various forms; there's "ö", but then there's also the possibility to represent that same character using an "o" and separate combining diacritics. OS X generally prefers the separated variant, and your editor doesn't seem to handle that very gracefully, nor do these two separate characters match your regex.
You need to normalize your Unicode data if you require one way or the other in particular. See unicodedata.normalize. You want the NFC normalized form.

There are several issues:
The screenshot as #deceze explained is due to Unicode normalization. Note: it is not necessary for the codepoints to look different e.g., ö (U+00f6) and ö (U+006f U+0308) look the same in my browser
r".*ö.*" is a bytestring in Python 2 and the value depends on the encoding declaration at the top of your Python source file (something like: # -*- coding: utf-8 -*-) e.g., if the declared encoding is utf-8 then 'ö' bytestring is a sequence of two bytes: '\xc3\xb6'.
There is no way for the regex engine to know the actual encoding that should be used to interpret input bytestrings.
You should not use bytestrings, to represent text; use Unicode instead (either use u'' literals or add from __future__ import unicode_literals at the top)
filename.decode('utf8') raises UnicodeEncodeError if you use os.walk(u"./test") because filename is Unicode already. Python 2 tries to encode filename implicitly using the default encoding that is 'ascii'. Do not decode Unicode: drop .decode('utf-8')
btw, the last two issues are impossible in Python 3: r".*ö.*" is a Unicode literal, and you can't create a bytestring with literal non-ascii characters there, and there is no .decode() method (you would get AttributeError if you try to decode Unicode). You could run your script on Python 3, to detect Unicode-related bugs.

'ascii' codec can't encode character u'\xe9'

I already tried all previous answers and solution.
I am trying to use this value, which gave me encoding related error.
ar = [u'http://dbpedia.org/resource/Anne_Hathaway', u'http://dbpedia.org/resource/Jodie_Bain', u'http://dbpedia.org/resource/Wendy_Divine', u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', u'http://dbpedia.org/resource/Baaba_Maal']
So I tried,
d = [x.decode('utf-8') for x in ar]
which gives:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 31: ordinal not in range(128)
I tried out
d = [x.encode('utf-8') for x in ar]
which removes error but changes the original content
original value was u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' which converted to 'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno' while using encode
what is correct way to deal with this scenario?
Edit
Error comes when I feed these links in
req = urllib2.Request()

The second version of your string is the correct utf-8 representation of your original unicode string. If you want to have a meaningful comparison, you have to use the same representation for both the stored string and the user input string. The sane thing to do here is to always use Unicode string internally (in your code), and make sure both your user inputs and stored strings are correctly decoded to unicode from their respective encodings at your system's boundaries (storage subsystem and user inputs subsystem).
Also you seem to be a bit confused about unicode and encodings, so reading this and this might help.

Unicode strings in python are "raw" unicode, so make sure to .encode() and .decode() them as appropriate. Using utf8 encoding is considered a best practice among multiple dev groups all over the world.
To encode use the quote function from the urllib2 library:
from urllib2 import quote
escaped_string = quote(unicode_string.encode('utf-8'))
To decode, use unquote:
from urllib2 import unquote
src = "http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno"
unicode_string = unquote(src).decode('utf-8')
Also, if you're more interested in Unicode and UTF-8 work, check out Unicode HOWTO and

In your Unicode list, u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' is an ASCII safe way to represent a Unicode string. When encoded in a form that supports the full Western European character set, such as UTF-8, it's: http://dbpedia.org/resource/José_Elías_Moreno
Your .encode("UTF-8") is correct and would have looked ok in a UTF-8 editor or browser. What you saw after the encode was an ASCII safe representation of UTF-8.
For example, your trouble chars were é and í.
é = 00E9 Unicode = C3A9 UTF-8
í = 00ED Unicode = C3AD UTF-8
In short, your .encode() method is correct and should be used for writing to files or to a browser.

Able to run Python code with Unicode string in Eclipse, but getting UnicodeEncodeError when running via command line or Idle.

I've experienced this a lot, where I'll decode/encode some string of Unicode in Eclipse (PyDev), and it runs fine and how I expected, but then when I launch the same script from the command line (for example) instead, I'll get encoding errors.
Is there any simple explanation for this? Is Eclipse doing something to the Unicode/manipulating it in some different way?
EDIT:
Example:
value = u'\u2019'.decode( 'utf-8', 'ignore' )
return value
This works in Eclipse (PyDev) but not if I run it in Idle or on the command line.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 135: ordinal not in range(128)

Just wanted to add why it worked on PyDev: it has a special sitecustomize that'll customize python through sys.setdefaultencoding to use the encoding of the PyDev console.
Note that the response from bobince is correct, if you have a unicode string, you have to use the encode() method to transform it into a proper string (you'd use decode if you had a string and wanted to transform it into a unicode).

value = u'\u2019'.decode( 'utf-8', 'ignore' )
Byte strings are DECODED into Unicode strings.
Unicode strings are ENCODED into byte strings.
So if you say someunicodestring.decode, it tries to coerce the Unicode string to a byte string, in order to be able to decode it (back to Unicode!). Being an implicit conversion, this encoding step will plump for the default encoding, which may differ between different environments, and is likely to be the ‘safe’ value ascii, which will certainly produce the error you mention as ASCII can't contain the character U+2019. It's almost never a good idea to rely on the default encoding.
So it doesn't make sense to try to decode a Unicode string. I'm pretty sure you mean:
value = u'\u2019'.encode('utf-8')
(ignore is redundant for encoding to UTF-8 as there is no character that this encoding can't represent.)

Unicode utf-8/utf-16 encoding in Python

In python:
u'\u3053\n'
Is it utf-16?
I'm not really aware of all the unicode/encoding stuff, but this type of thing is coming up in my dataset,
like if I have a=u'\u3053\n'.
print gives an exception and
decoding gives an exception.
a.encode("utf-16") > '\xff\xfeS0\n\x00'
a.encode("utf-8") > '\xe3\x81\x93\n'
print a.encode("utf-8") > πüô
print a.encode("utf-16") >  ■S0
What's going on here?

It's a unicode character that doesn't seem to be displayable in your terminals encoding. print tries to encode the unicode object in the encoding of your terminal and if this can't be done you get an exception.
On a terminal that can display utf-8 you get:
>>> print u'\u3053'
こ
Your terminal doesn't seem to be able to display utf-8, else at least the print a.encode("utf-8") line should produce the correct character.

You ask:
u'\u3053\n'
Is it utf-16?
The answer is no: it's unicode, not any specific encoding. utf-16 is an encoding.
To print a Unicode string effectively to your terminal, you need to find out what encoding that terminal is willing to accept and able to display. For example, the Terminal.app on my laptop is set to UTF-8 and with a rich font, so:
(source: aleax.it)
...the Hiragana letter displays correctly. On a Linux workstation I have a terminal program that keeps resetting to Latin-1 so it would mangle things somewhat like yours -- I can set it to utf-8, but it doesn't have huge number of glyphs in the font, so it would display somewhat-useless placeholder glyphs instead.

Character U+3053 "HIRAGANA LETTER KO".
The \xff\xfe bit at the start of the UTF-16 binary format is the encoded byte order mark (U+FEFF), then "S0" is \x5e\x30, then there's the \n from the original string. (Each of the characters has its bytes "reversed" as it's using little endian UTF-16 encoding.)
The UTF-8 form represents the same Hiragana character in three bytes, with the bit pattern as documented here.
Now, as for whether you should really have it in your data set... where is this data coming from? Is it reasonable for it to have Hiragana characters in it?

Here's the Unicode HowTo Doc for Python 2.6.2:
http://docs.python.org/howto/unicode.html
Also see the links in the Reference section of that document for other explanations, including one by Joel Spolsky.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.