string.decode() function in python2 - python

I am converting some code from Python 2 to Python 3, but I don't understand Python 2's encode/decode behaviour well enough to work out what I should be doing in Python 3.
In python2, I can do the following things:
>>> c = '\xe5\xb8\x90\xe6\x88\xb7'
>>> print c
帐户
>>> c.decode('utf8')
u'\u5e10\u6237'
What did I just do there? Doesn't the 'u' prefix mean unicode? Shouldn't the utf8 be '\xe5\xb8\x90\xe6\x88\xb7' since that is what I input in the first place?

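Your variable c was not declared as a unicode string (with the 'u' prefix), so it is a byte string, and decode turns those bytes into Unicode characters using the encoding you name. If you decode it with 'latin1' instead, each byte maps to the code point with the same value, so the result looks just like your input: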
>>> c.decode('latin1')
u'\xe5\xb8\x90\xe6\x88\xb7'
Note that the result of decode is a unicode string:
>>> type(c)
<type 'str'>
>>> type(c.decode('latin1'))
<type 'unicode'>
If you declare c as a unicode string and keep the same input, you will not print the same characters:
>>> c=u'\xe5\xb8\x90\xe6\x88\xb7'
>>> print c
å¸æ·
If you use the input '\u5e10\u6237', you will print the initial characters:
>>> c=u'\u5e10\u6237'
>>> print c
帐户
Encoding and decoding are just a matter of using a table of correspondence value<->character. The catch is that the same value does not render as the same character under every encoding (i.e. table).
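For the Python 3 side of the conversion, here is a rough sketch of the same idea. It assumes the usual mapping (Python 2 str becomes bytes, Python 2 unicode becomes str); the variable names are only illustrative.

```python
# Python 3 sketch: one byte sequence, two decoding tables.
raw = b'\xe5\xb8\x90\xe6\x88\xb7'      # the six bytes typed in the question

as_utf8 = raw.decode('utf-8')          # UTF-8 uses three bytes per character here
as_latin1 = raw.decode('latin-1')      # Latin-1 uses one byte per character

print(ascii(as_utf8))                  # '\u5e10\u6237'
print(len(as_utf8), len(as_latin1))    # 2 6

# Encoding with the same table gives the original bytes back.
assert as_utf8.encode('utf-8') == raw
assert as_latin1.encode('latin-1') == raw
```

So decode goes from bytes to text and encode goes from text to bytes; the '\xe5...' form and the '\u5e10...' form are the two sides of that conversion.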
The main difficulty is when you don't know the encoding of an input string that you have to handle. Some tools can try to guess it, but it is not always successful (see https://superuser.com/questions/301552/how-to-auto-detect-text-file-encoding).
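A very naive way to "guess" is simply to try a few candidate encodings in order. The sketch below is Python 3, and decode_with_candidates is a hypothetical helper, not a library function. It is no substitute for a real detection tool.

```python
# Python 3 sketch of a trial-and-error decoder (hypothetical helper).
def decode_with_candidates(data, candidates=('utf-8', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue   # bytes are not valid in this encoding, try the next
    raise ValueError('none of the candidate encodings fit')

text, used = decode_with_candidates(b'D\xf6rfli')
print(used)   # latin-1, because the byte 0xf6 is not valid UTF-8 here
```

Note that Latin-1 accepts every possible byte, so it has to come last, and a decode that succeeds is not proof that the guess was right.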

Related

Python Unicode Casting on Variable Bug

I've found this weird Python 2 behavior involving unicode and variables:
>>> u"\u2730".encode('utf-8').encode('hex')
'e29cb0'
This is the expected result I need, but I want to dynamically control the first part (u"\u2730").
>>> type(u"\u2027")
<type 'unicode'>
Good, so the first part is cast as unicode. Now declaring a string variable and casting it to unicode:
>>> a='20'
>>> b='27'
>>> myvar='\u'+a+b.decode('utf-8')
>>> type(myvar)
<type 'unicode'>
>>> print myvar
\u2027
It seems that now I can use the variable in my original code, right?
>>> myvar.encode('utf-8').encode('hex')
'5c7532303237'
The result, as you can see, is not the original one. It seems that Python is treating 'myvar' as a string instead of unicode. Am I missing something?
Anyway, my final goal is to loop over Unicode from \u0000 to \uFFFF, cast each character to a string and cast that string to hex. Is there an easy way?
unichr() in Python 2 or chr() in Python 3 are the ways to construct a character from a number. \uxxxx escape codes are only interpreted when typed directly in a string literal in the source code.
Python 2:
>>> a='20'
>>> b='27'
>>> unichr(int(a+b,16))
u'\u2027'
Python 3:
>>> a='20'
>>> b='27'
>>> chr(int(a+b,16))
'‧'
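For the stated goal of walking U+0000 to U+FFFF and getting the hex of each character, a possible Python 3 sketch follows. utf8_hex is a made-up helper name, and it assumes UTF-8 is the encoding wanted, as in the question.

```python
# Python 3 sketch: hex of the UTF-8 encoding for code points built from numbers.
def utf8_hex(code_point):
    """Return the UTF-8 bytes of one code point as a hex string."""
    return chr(code_point).encode('utf-8').hex()

print(utf8_hex(int('20' + '27', 16)))   # e280a7
print(utf8_hex(0x2730))                 # e29cb0

# Walking U+0000..U+FFFF: surrogates (U+D800..U+DFFF) are not characters
# and cannot be encoded as UTF-8, so they are skipped here.
table = {
    cp: utf8_hex(cp)
    for cp in range(0x10000)
    if not 0xD800 <= cp <= 0xDFFF
}
print(len(table))                       # 63488
```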
You are confusing the Unicode escape sequence with the literal characters \ and u. It's like confusing r"\n" (or "\\n") with an actual newline. You want to use codecs.raw_unicode_escape_decode, or decode the str with 'unicode_escape':
>>> import codecs
>>> a='20'
>>> b='27'
>>> myvar='\u'+a+b.decode('utf-8')
>>> myvar
u'\\u2027'
>>> codecs.raw_unicode_escape_decode(myvar)
(u'\u2027', 6)
>>> print(codecs.raw_unicode_escape_decode(myvar)[0])
‧
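In Python 3 a str has no decode method, so the same trick needs a detour through bytes. This is a rough sketch, assuming the escape text is pure ASCII as it is here:

```python
import codecs

a = '20'
b = '27'
escaped = '\\u' + a + b            # six characters: backslash, u, 2, 0, 2, 7

# str has no .decode() in Python 3, so go through bytes first.
decoded = escaped.encode('ascii').decode('unicode_escape')
print(len(escaped), len(decoded))  # 6 1

# The codecs function returns (text, number of bytes consumed).
text, consumed = codecs.raw_unicode_escape_decode(escaped.encode('ascii'))
print(ascii(text), consumed)       # '\u2027' 6
```

If the digits are already known, chr(int(a + b, 16)) is the more direct route.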

Python: instantly decoding after encode

Found in legacy:
somevar.encode('utf-8').decode('utf-8')
Can this construction be useful for anything other than catching encoding errors?
Experimentation in Python 2.7.6 interpreter:
a = u"string"
a
Output: u'string'
b = a.encode('utf-8').decode('utf-8')
b
Output: u'string'
b = a.decode('utf-8').encode('utf-8')
b
Output: 'string'
a = "string"
a
Output: 'string'
b = a.encode('utf-8').decode('utf-8')
b
Output: u'string'
b = a.decode('utf-8').encode('utf-8')
b
Output: 'string'
Note that whether the original string is Unicode or not, the output of encode -> decode will be a Unicode string. The output of decode -> encode will not be a unicode string. One small note: since strings are immutable, the line as you posted it is useless for anything besides checking for UnicodeErrors, because it discards the return value of the calls.
The only real effect of the encode -> decode construct is that all strings passed through it (and caught from the return) will be Unicode strings. Why you would want to do this instead of unicode_string = unicode(normal_string, encoding='UTF-8') I have no idea.
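If the legacy line ends up in Python 3 code, the picture changes a little. A small sketch of what encode -> decode does there:

```python
# Python 3 sketch: what encode -> decode does now.
text = 'string'
round_trip = text.encode('utf-8').decode('utf-8')
print(round_trip == text)            # True: same text back
print(type(text.encode('utf-8')))    # <class 'bytes'>
print(hasattr(text, 'decode'))       # False: str cannot be decoded again

# The round trip still acts as a validity check: a lone surrogate
# is a legal str element but has no UTF-8 encoding.
try:
    '\ud800'.encode('utf-8')
    valid = True
except UnicodeEncodeError:
    valid = False
print(valid)                         # False
```

So in Python 3 the construct is a no-op for valid text, and the decode -> encode order from the experiment above is not available on str at all.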

What is the type of an "utf8" string encoding in Python?

I'm using Python 2.7
I'm reading a file containing "iso-8859-1" coded information.
After parsing, I get the results in strings, e.g. s1:
>>> s1
'D\xf6rfli'
>>> type(s1)
<type 'str'>
>>> s2=s1.decode("iso-8859-1").encode("utf8")
>>> s2
'D\xc3\xb6rfli'
>>> type(s2)
<type 'str'>
>>> print s1, s2
D�rfli Dörfli
>>>
Why is the type of s2 still a str after the call to .encode?
How can I convert it from str to utf-8?
str in Python 2 means an encoded string, i.e. a sequence of bytes. This is documented behavior. The decoded str would be of type unicode.
UTF-8 is an encoding, as well as ISO-8859-1. So you just decode your string and then encode in another encoding, producing data of the same type.
By contrast, in Python 3 str is a text string (in Unicode), and calling encode on it gives you an instance of bytes.
So, in Python 2, a UTF-8 string will be str, because it is encoded.
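To make the Python 3 contrast concrete, here is a short sketch of the same transcoding, where the types show which side of the encoding each value is on:

```python
# Python 3 sketch of the same transcoding.
s1 = b'D\xf6rfli'                      # bytes, as read from an ISO-8859-1 file
text = s1.decode('iso-8859-1')         # str: decoded text
s2 = text.encode('utf-8')              # bytes again, now UTF-8

print(type(s1).__name__, type(text).__name__, type(s2).__name__)
# bytes str bytes
print(s2)                              # b'D\xc3\xb6rfli'
```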
I second the recommendation by Ned: take a look at the presentation he links to (oh my, is it his own talk?). It helped me a lot when I was struggling with these things.
I'm not sure if this answers your questions, but here's what I observed.
If you just want to get the string into a printable form, just stop after calling decode. I'm not sure why you are trying to encode into UTF-8 after successfully converting from ISO-8859-1 into unicode.
>>> s1 = 'D\xf6rfli'
>>> s1
'D\xf6rfli'
>>> s2 = s1.decode("iso-8859-1")
>>> s2
u'D\xf6rfli'
>>> print s2
Dörfli
>>>

Python: convert a dot separated hex values to string?

In this post: Print a string as hex bytes? I learned how to print a string as an "array" of hex bytes; now I need something the other way around:
So for example the input would be: 73.69.67.6e.61.74.75.72.65 and the output would be a string.
You can use the built-in binascii module. Note, however, that the result is the raw bytes, which read directly as text only for ASCII-encoded characters.
binascii.unhexlify(hexstr)
Your input string will need to be dotless however, but that is quite easy with a simple
string = string.replace('.','')
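Putting the two steps together, a minimal Python 3 sketch using the input from the question:

```python
import binascii

dotted = '73.69.67.6e.61.74.75.72.65'
hexstr = dotted.replace('.', '')       # unhexlify needs contiguous hex digits

raw = binascii.unhexlify(hexstr)       # bytes, not text
print(raw)                             # b'signature'
print(raw.decode('ascii'))             # signature
```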
Another (arguably safer) method would be to use base64 in the following way:
import base64
encoded = base64.b16encode(b'data to be encoded')
print (encoded)
data = base64.b16decode(encoded)
print (data)
or in your example:
data = base64.b16decode(b"7369676e6174757265", True)
print (data.decode("utf-8"))
The string can be sanitised before input into the b16decode method.
Note that I am using Python 3.2; in Python 2 you may not need the b prefix in front of the string to denote bytes.
Example was found here
Without binascii:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(chr(int(e, 16)) for e in a.split('.'))
'signature'
>>>
or better:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(e.decode('hex') for e in a.split('.'))
'signature'
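PS: it also works with non-ASCII characters, as long as you stay with byte strings in a single encoding (here the console encodes ö as the byte 0x94):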
>>> a='.'.join(x.encode('hex') for x in 'Hellö Wörld!')
>>> a
'48.65.6c.6c.94.20.57.94.72.6c.64.21'
>>> print "".join(e.decode('hex') for e in a.split('.'))
Hellö Wörld!
>>>
EDIT:
No need for a generator expression here (thx to thg435):
a.replace('.', '').decode('hex')
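The 'hex' codec on str is Python 2 only. A rough Python 3 equivalent is sketched below with two hypothetical helpers, which also make the text encoding explicit (UTF-8 is assumed as the default):

```python
# Python 3 sketch: dot-separated hex <-> text, with an explicit encoding.
def to_dotted_hex(text, encoding='utf-8'):
    """Encode text and join the bytes as two-digit hex separated by dots."""
    return '.'.join(format(byte, '02x') for byte in text.encode(encoding))

def from_dotted_hex(dotted, encoding='utf-8'):
    """Reverse of to_dotted_hex: strip the dots, unhex, then decode."""
    return bytes.fromhex(dotted.replace('.', '')).decode(encoding)

print(from_dotted_hex('73.69.67.6e.61.74.75.72.65'))   # signature
print(to_dotted_hex('Hell\xf6'))                       # 48.65.6c.6c.c3.b6
```

With UTF-8, ö takes two bytes (c3.b6), whereas the 94 in the session above is what a cp437-style console produces, so both ends have to agree on the encoding.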
Use string split to get a list of strings, then base 16 for decoding the bytes.
>>> inp="73.69.67.6e.61.74.75.72.65"
>>> ''.join((chr(int(i,16)) for i in inp.split('.')))
'signature'
>>>

Why is it when I print something, there is always a unicode next to it? (Python)

[u'Iphones', u'dont', u'receieve', u'messages']
Is there a way to print it without the "u" in front of it?
What you are seeing is the __repr__() representation of each unicode string, which includes the u to make the type clear. If you don't want the u you could print the object converted with str (using __str__), where l is your list. This works for me:
print [str(x) for x in l]
Probably better is to read up on Python unicode and encode using the particular codec you want, for example UTF-8:
print [x.encode('utf-8') for x in l]
[edit]: to clarify repr and why the u is there - the goal of repr is to provide a convenient string representation, "to return a string that would yield an object with the same value when passed to eval()". I.e. you can copy and paste the printed output and get the same object (a list of unicode strings).
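For comparison, a small Python 3 sketch of the same repr behaviour. There every str is already Unicode, so no u prefix is shown; ast.literal_eval is used here as a safer stand-in for eval.

```python
import ast

words = ['Iphones', 'dont', 'receieve', 'messages']

shown = repr(words)                       # what print(words) displays
print(shown)                              # ['Iphones', 'dont', 'receieve', 'messages']
print(ast.literal_eval(shown) == words)   # True: the repr round-trips

# To print the words without list brackets or quotes, join them instead.
print(' '.join(words))                    # Iphones dont receieve messages
```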
Python contains string classes for both unicode strings and regular strings. The u before a string indicates that it is a unicode string.
>>> mystrings = [u'Iphones', u'dont', u'receieve', u'messages']
>>> [str(s) for s in mystrings]
['Iphones', 'dont', 'receieve', 'messages']
>>> type(u'Iphones')
<type 'unicode'>
>>> type('Iphones')
<type 'str'>
See http://docs.python.org/library/stdtypes.html#sequence-types-str-unicode-list-tuple-buffer-xrange for more information about the string types available in Python.
