Python Unicode Casting on Variable Bug - python

I've found out this weird python2 behavior related to unicode and variable:
>>> u"\u2730".encode('utf-8').encode('hex')
'e29cb0'
This is the expected result I need, but I want to dynamically control the first part (u"\u2730").
>>> type(u"\u2027")
<type 'unicode'>
Good, so the first part is cast as unicode. Now declaring a string variable and casting it to unicode:
>>> a='20'
>>> b='27'
>>> myvar='\u'+a+b.decode('utf-8')
>>> type(myvar)
<type 'unicode'>
>>> print myvar
\u2027
It seems that now I can use the variable in my original code, right?
>>> myvar.encode('utf-8').encode('hex')
'5c7532303237'
The result, as you can see, is not the original one. It seems that Python is treating 'myvar' as a string instead of unicode. Am I missing something?
Anyway, my final goal is to loop over Unicode from \u0000 to \uFFFF, cast each character to a string, and convert that string to hex. Is there an easy way?

unichr() in Python 2 or chr() in Python 3 is the way to construct a character from a number. \uxxxx escape codes can only be typed directly in code.
Python 2:
>>> a='20'
>>> b='27'
>>> unichr(int(a+b,16))
u'\u2027'
Python 3:
>>> a='20'
>>> b='27'
>>> chr(int(a+b,16))
'‧'
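For the asker's end goal (walking U+0000 through U+FFFF and getting each character's UTF-8 bytes as hex), a minimal Python 3 sketch might look like the following. The helper names utf8_hex and bmp_utf8_hex are made up for illustration. Skipping the surrogate range U+D800 to U+DFFF is an assumption: those code points cannot be encoded as UTF-8, so chr(cp).encode('utf-8') raises UnicodeEncodeError for them.

```python
def utf8_hex(code_point):
    """Return the UTF-8 encoding of one code point as a hex string."""
    return chr(code_point).encode('utf-8').hex()


def bmp_utf8_hex():
    """Yield (code_point, hex) for U+0000..U+FFFF, skipping surrogates."""
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no UTF-8 encoding
        yield cp, utf8_hex(cp)


a, b = '20', '27'
print(utf8_hex(int(a + b, 16)))  # hex of the character built from '20' + '27'
print(utf8_hex(0x2730))          # the character from the question
```

Because bmp_utf8_hex is a generator, the 63,488 results are produced one at a time rather than held in memory all at once.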

You are confusing the Unicode escape sequence with the literal characters \ and u. It's like confusing r"\n" (or "\\n") with an actual newline. You want to decode the str with 'unicode_escape':
>>> import codecs
>>> a='20'
>>> b='27'
>>> myvar='\u'+a+b.decode('utf-8')
>>> myvar
u'\\u2027'
>>> myvar.decode('unicode_escape')
u'\u2027'
>>> print(myvar.decode('unicode_escape'))
‧
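The same idea in Python 3, as a rough sketch: str has no .decode() there, so one option (an assumption about what fits best, not the only route) is to encode the text to ASCII bytes first and then decode those bytes with the 'unicode_escape' codec. The variable names are illustrative only.

```python
# Six separate characters: backslash, u, 2, 0, 2, 7 (not one escape sequence)
raw = '\\u' + '20' + '27'

# Interpret the escape: bytes -> str via the 'unicode_escape' codec
decoded = raw.encode('ascii').decode('unicode_escape')

print(len(raw), len(decoded))          # lengths before and after decoding
print(raw.encode('utf-8').hex())       # hex of the six literal characters
print(decoded.encode('utf-8').hex())   # hex of the single real character
```

This only suits input that is pure ASCII; for building a character from a known number, chr(int(a + b, 16)) is simpler.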

Related

string.decode() function in python2

So I am converting some code from Python 2 to Python 3. I don't understand the Python 2 encode/decode functionality well enough to even determine what I should be doing in Python 3.
In python2, I can do the following things:
>>> c = '\xe5\xb8\x90\xe6\x88\xb7'
>>> print c
帐户
>>> c.decode('utf8')
u'\u5e10\u6237'
What did I just do there? Doesn't the 'u' prefix mean unicode? Shouldn't the utf8 be '\xe5\xb8\x90\xe6\x88\xb7' since that is what I input in the first place?
Your variable c was not declared as unicode (with the prefix 'u'), so it is a byte string. If you decode it using the 'latin1' encoding, each byte maps to the code point with the same value, so the result looks like your input:
>>> c.decode('latin1')
u'\xe5\xb8\x90\xe6\x88\xb7'
Note that the result of decode is a unicode string:
>>> type(c)
<type 'str'>
>>> type(c.decode('latin1'))
<type 'unicode'>
If you declare c as a unicode and keep the same input, you will not print the same characters:
>>> c=u'\xe5\xb8\x90\xe6\x88\xb7'
>>> print c
å¸æ·
If you use the input '\u5e10\u6237', you will print the initial characters:
>>> c=u'\u5e10\u6237'
>>> print c
帐户
Encoding and decoding are just a matter of using a correspondence table between values and characters. The catch is that the same value does not render as the same character under every encoding (i.e. table).
The main difficulty is when you don't know the encoding of an input string that you have to handle. Some tools can try to guess it, but it is not always successful (see https://superuser.com/questions/301552/how-to-auto-detect-text-file-encoding).
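To make the "table" idea concrete, here is a small Python 3 sketch (variable names are illustrative) that decodes the same six bytes with two different codecs. In Python 3 the byte string must be written with a b prefix, and decoding always goes from bytes to str.

```python
data = b'\xe5\xb8\x90\xe6\x88\xb7'   # the six bytes from the question

as_utf8 = data.decode('utf-8')       # UTF-8 table: 3 bytes per character here
as_latin1 = data.decode('latin-1')   # Latin-1 table: 1 byte per character

print(len(as_utf8), len(as_latin1))
print([hex(ord(ch)) for ch in as_utf8])
print([hex(ord(ch)) for ch in as_latin1])

# Encoding with the same table that was used for decoding gives the bytes back
print(as_utf8.encode('utf-8') == data, as_latin1.encode('latin-1') == data)
```

Both decodes succeed, which is why a wrong guess at the encoding often produces garbage text rather than an error.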

Python: convert a dot separated hex values to string?

In this post: Print a string as hex bytes? I learned how to print a string as an "array" of hex bytes. Now I need to go the other way around:
So for example the input would be: 73.69.67.6e.61.74.75.72.65 and the output would be a string.
You can use the built-in binascii module. Do note, however, that this function will only work on ASCII encoded characters.
binascii.unhexlify(hexstr)
Your input string will need to be dotless, but that is quite easy with a simple
string = string.replace('.','')
Another (arguably safer) method would be to use base64 in the following way:
import base64
encoded = base64.b16encode(b'data to be encoded')
print (encoded)
data = base64.b16decode(encoded)
print (data)
or in your example:
data = base64.b16decode(b"7369676e6174757265", True)
print (data.decode("utf-8"))
The string can be sanitised before input into the b16decode method.
Note that I am using python 3.2 and you may not necessarily need the b out the front of the string to denote bytes.
Example was found here
Without binascii:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(chr(int(e, 16)) for e in a.split('.'))
'signature'
>>>
or better:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(e.decode('hex') for e in a.split('.'))
PS: this also works with non-ASCII bytes (in the session below, the console encoding maps ö to the single byte 0x94):
>>> a='.'.join(x.encode('hex') for x in 'Hellö Wörld!')
>>> a
'48.65.6c.6c.94.20.57.94.72.6c.64.21'
>>> print "".join(e.decode('hex') for e in a.split('.'))
Hellö Wörld!
>>>
EDIT:
No need for a generator expression here (thx to thg435):
a.replace('.', '').decode('hex')
Use string split to get a list of strings, then base 16 for decoding the bytes.
>>> inp="73.69.67.6e.61.74.75.72.65"
>>> ''.join((chr(int(i,16)) for i in inp.split('.')))
'signature'
>>>
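The str.decode('hex') calls above exist only in Python 2. A minimal Python 3 sketch of the same conversion, in both directions, could use bytes.fromhex; the function names and the default 'ascii' encoding are assumptions chosen for illustration.

```python
def dotted_hex_to_text(dotted, encoding='ascii'):
    """Turn '73.69.67' style input into text."""
    return bytes.fromhex(dotted.replace('.', '')).decode(encoding)


def text_to_dotted_hex(text, encoding='ascii'):
    """Turn text into dot-separated two-digit hex bytes."""
    return '.'.join(format(byte, '02x') for byte in text.encode(encoding))


print(dotted_hex_to_text('73.69.67.6e.61.74.75.72.65'))
print(text_to_dotted_hex('signature'))
```

For non-ASCII text, pass the encoding explicitly (for example 'utf-8') to both functions so that the bytes are interpreted with the same table they were produced with.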

Python strip() unicode string?

How can you use string methods like strip() on a unicode string? And can't you access characters of a unicode string like with ordinary strings? (e.g. mystring[0:4])
It's working as usual, as long as they are actually unicode, not str (note: every string literal must be preceded by u, like in this example):
>>> a = u"coțofană"
>>> a
u'co\u021bofan\u0103'
>>> a[-1]
u'\u0103'
>>> a[2]
u'\u021b'
>>> a[3]
u'o'
>>> a.strip(u'ă')
u'co\u021bofan'
Maybe it's a bit late to answer to this, but if you are looking for the library function and not the instance method, you can use that as well.
Just use:
yourunicodestring = u' a unicode string with spaces all around '
unicode.strip(yourunicodestring)
In some cases it's easier to use this one, for example inside a map function like:
unicodelist=[u'a',u' a ',u' foo is just...foo ']
map (unicode.strip,unicodelist)
You can do every string operation. In fact, in Python 3, all strs are unicode.
>>> my_unicode_string = u"abcşiüğ"
>>> my_unicode_string[4]
u'i'
>>> my_unicode_string[3]
u'\u015f'
>>> print(my_unicode_string[3])
ş
>>> my_unicode_string[3:]
u'\u015fi\xfc\u011f'
>>> print(my_unicode_string[3:])
şiüğ
>>> print(my_unicode_string.strip(u"ğ"))
abcşiü
See the Python docs on Unicode strings and the following section on string methods. Unicode strings support all of the same methods and operations as normal ASCII strings.
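For comparison, a short Python 3 sketch of the same operations, where no u prefix is needed and the unbound method is str.strip rather than unicode.strip. The non-ASCII characters are written as escapes here only to keep the snippet independent of the file's encoding.

```python
word = 'co\u021bofan\u0103'       # the word from the first answer

print(word[2] == '\u021b')        # indexing works per character
print(word[0:4])                  # slicing works per character too

stripped = word.strip('\u0103')   # remove the trailing a-with-breve

# The unbound method is handy with map(), as in the Python 2 answer
cleaned = list(map(str.strip, ['a', ' a ', ' foo is just...foo ']))
print(cleaned)
```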

Convert UTF-8 octets to unicode code points

I have a set of UTF-8 octets and I need to convert them back to unicode code points. How can I do this in Python?
e.g. UTF-8 octet ['0xc5','0x81'] should be converted to 0x141 codepoint.
Python 3.x:
In Python 3.x, str is the class for Unicode text, and bytes is for containing octets.
If by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert to bytes like this:
>>> bytes(int(x,0) for x in ['0xc5', '0x81'])
b'\xc5\x81'
You can then convert to str (ie: Unicode) using the str constructor...
>>> str(b'\xc5\x81', 'utf-8')
'Ł'
...or by calling .decode('utf-8') on the bytes object:
>>> b'\xc5\x81'.decode('utf-8')
'Ł'
>>> hex(ord('Ł'))
'0x141'
Pre-3.x:
Prior to 3.x, the str type was a byte array, and unicode was for Unicode text.
Again, if by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert them like this:
>>> ''.join(chr(int(x,0)) for x in ['0xc5', '0x81'])
'\xc5\x81'
You can then convert to unicode using the constructor...
>>> unicode('\xc5\x81', 'utf-8')
u'\u0141'
...or by calling .decode('utf-8') on the str:
>>> '\xc5\x81'.decode('utf-8')
u'\u0141'
In lovely 3.x, where all strs are Unicode, and bytes are what strs used to be:
>>> s = str(bytes([0xc5, 0x81]), 'utf-8')
>>> s
'Ł'
>>> ord(s)
321
>>> hex(ord(s))
'0x141'
Which is what you asked for.
>>> l = ['0xc5','0x81']
>>> s = ''.join([chr(int(c, 16)) for c in l]).decode('utf8')
>>> s
u'\u0141'
>>> "".join((chr(int(x,16)) for x in ['0xc5','0x81'])).decode("utf8")
u'\u0141'
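If the octet list can hold more than one character, the Python 3 steps shown earlier can be wrapped in a small helper that returns every code point. This is only a sketch; the function name octets_to_code_points is hypothetical, and it assumes the octets are strings like '0xc5' that form valid UTF-8.

```python
def octets_to_code_points(octets):
    """Convert ['0xc5', '0x81', ...] into a list of hex code points."""
    raw = bytes(int(x, 0) for x in octets)   # base 0 honours the '0x' prefix
    return [hex(ord(ch)) for ch in raw.decode('utf-8')]


print(octets_to_code_points(['0xc5', '0x81']))
print(octets_to_code_points(['0xc5', '0x81', '0x41']))
```

Invalid or truncated sequences raise UnicodeDecodeError, which is usually what you want when validating input.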

Why is it when I print something, there is always a unicode next to it? (Python)

[u'Iphones', u'dont', u'receieve', u'messages']
Is there a way to print it without the "u" in front of it?
What you are seeing is the __repr__() representation of the unicode string which includes the u to make it clear. If you don't want the u you could print the object (using __str__) - this works for me:
print [str(x) for x in l]
Probably better is to read up on Python unicode and encode using the particular codec you want, for example UTF-8:
print [x.encode('utf-8') for x in l]
[edit]: to clarify repr and why the u is there - the goal of repr is to provide a convenient string representation, "to return a string that would yield an object with the same value when passed to eval()". I.e., you can copy and paste the printed output and get the same object (list of unicode strings).
Python contains string classes for both unicode strings and regular strings. The u before a string indicates that it is a unicode string.
>>> mystrings = [u'Iphones', u'dont', u'receieve', u'messages']
>>> [str(s) for s in mystrings]
['Iphones', 'dont', 'receieve', 'messages']
>>> type(u'Iphones')
<type 'unicode'>
>>> type('Iphones')
<type 'str'>
See http://docs.python.org/library/stdtypes.html#sequence-types-str-unicode-list-tuple-buffer-xrange for more information about the string types available in Python.
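The repr-versus-str distinction can be seen in Python 3 as well, even though the u prefix no longer appears there. The sketch below (variable names are illustrative) shows that printing a list uses repr() for each element, while joining the strings gives plain text.

```python
words = ['Iphones', 'dont', 'receieve', 'messages']

as_list = str(words)             # a list shows repr() of each element
joined = ' '.join(words)         # plain text, no quotes or brackets

print(as_list)
print(joined)

# repr() output is meant to be valid source that rebuilds the object
round_trip = eval(repr(words))
print(round_trip == words)
```

If the aim is simply readable output, joining or printing the elements one by one avoids the repr form entirely.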
