Decode unicode string in python

I'd like to decode the following string:
t\u028c\u02c8m\u0251\u0279o\u028a\u032f
It should be the IPA of 'tomorrow' as given in a JSON string from http://rhymebrain.com/talk?function=getWordInfo&word=tomorrow
My understanding is that it should be something like:
x = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f'
print x.decode()
I have tried the solutions from here, here, here, and here (and several others that more or less apply), and several permutations of their parts, but I can't get it to work.
Thank you

You need a u before your string (in Python 2.x, which you appear to be using) to indicate that this is a unicode string:
>>> x = u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # note the u
>>> print x
tʌˈmɑɹoʊ̯
If you have already stored the string in a variable, you can use the following constructor to convert the string into unicode:
>>> s = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # your string has a unicode-escape encoding but is not unicode
>>> x = unicode(s, encoding='unicode-escape')
>>> print x
tʌˈmɑɹoʊ̯
>>> x
u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # a unicode string
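Since the string in the question comes from a JSON response, the escapes can also be left to a JSON parser. The following is a minimal Python 3 sketch, assuming the raw response text still contains literal backslashes; the names raw, ipa and same are illustrative only.

```python
import json

# Raw JSON text as it might arrive from the service: the backslashes are
# literal characters here (raw string), not yet interpreted by Python.
raw = r'"t\u028c\u02c8m\u0251\u0279o\u028a\u032f"'

# json.loads interprets the \uXXXX escapes and returns a str.
ipa = json.loads(raw)
print(ipa)

# Alternative for escaped text that is not JSON: go through bytes and the
# unicode_escape codec (this assumes the escaped text is pure ASCII).
same = raw.strip('"').encode('ascii').decode('unicode_escape')
print(same == ipa)
```

In Python 3 every str is already Unicode, so no u prefix or unicode() call is needed.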

Related

How to convert String(Word) to hexadecimal Binary with leading \x in Python 3

I have following question with Python 3.
How can I convert a string (word) to hexadecimal with a leading \x in Python 3.x?
Example:
with integer:
>>> x = 319
>>> x_hex = '{0:04x}'.format(x)
Now it looks like this:
>>> print(x_hex)
013f
and to convert it to the right format:
>>> y = bytearray.fromhex(x_hex)
>>> print(y)
b'\x01?'
Now my Question:
How do I do this with a word or with long numbers?
When I use the binascii.hexlify tool, the string is wrong for my task:
Example:
>>> import binascii
>>> word = "hello012"
>>> word_2byte = bytes(word, encoding='ascii')
>>> word_hex = binascii.hexlify(word_2byte)
>>> print(word_hex)
b'68656c6c6f303132'
The output from binascii.hexlify is correct, but how do I get this format:
b'\x68\x65\x6c\x6c\x6f\x30\x31\x32'
Thank you for any help :-)
Encoding to bytes is all that is required; there is no difference between b'\x68' and b'h', b'\x65' and b'e', etc.
If you want the string representation to look like that, you will need to build it yourself:
>>> ''.join('\\x{:02x}'.format(c) for c in word_2byte)
'\\x68\\x65\\x6c\\x6c\\x6f\\x30\\x31\\x32'
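The two points above can be put side by side in a short Python 3 sketch: the bytes themselves are the same however they are written, and the \xNN form only exists as display text. The helper name to_hex_escapes is made up for this illustration.

```python
def to_hex_escapes(data: bytes) -> str:
    """Return text in which every byte is written as a \\xNN escape."""
    return ''.join('\\x{:02x}'.format(b) for b in data)


word = "hello012"
encoded = word.encode('ascii')

# The bytes object is identical whichever way the literal is spelled.
print(encoded == b'\x68\x65\x6c\x6c\x6f\x30\x31\x32')

# The escaped form is just a str built for display purposes.
escaped = to_hex_escapes(encoded)
print(escaped)

# For an integer such as 319, int.to_bytes produces the bytes directly.
number_bytes = (319).to_bytes(2, 'big')
print(number_bytes)
```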

How can I print unicode without using u'\uXXXX'

I'm trying to make a program to iterate through Japanese characters (Python 2.7) and return/yield them in a printable format, but I cannot convert the hexadecimal numbers (3040-309f) into a format that can print the characters. I have found that using u'\u' works, but when I attempt to convert the numbers into that format using unicode('\u3040'), it is different from u'\u3040'. The code explains it better.
>>> s1 = u'\u309d'
>>> s2 = unicode("\u209d")
>>> print type(s1) == type(s2)
True
>>> print s1 == s2
False
>>> print s1, s2
ゝ \u209d
I have tried using UTF-8 and latin-1 for s2 as the second argument, but it does nothing. Also, I found that you can do u'\u{0}'.format(u'3040'), but I cannot make u'3040' in my iterator, and u'\u{0}'.format(unicode('3040')) raises an error.
In byte string literals, the \uhhhh escape sequence is not interpreted, so you get a literal 6 characters instead.
Converting that to Unicode only decodes the string as ASCII data, not as a Python escape sequence.
You could decode from the unicode_escape encoding instead:
>>> "\u209d".decode('unicode_escape')
u'\u209d'
>>> print "\u209d".decode('unicode_escape')
₝
There are several downsides to this, however. Any other \ escape sequences also get decoded:
>>> '\\n'
'\\n'
>>> '\\n'.decode('unicode_escape')
u'\n'
so you may have to double the backslashes first if you want literal backslashes to be retained:
>>> '\\n'.replace('\\', '\\\\').decode('unicode_escape')
u'\\n'
But be very careful that you are not in fact trying to treat JSON data as Python string literals. JSON also uses the same escape sequence format but should instead be treated as JSON; decode with json.loads() instead:
>>> import json
>>> json.loads('"\u209d"')
u'\u209d'
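For the original goal of iterating over the range 3040-309f, it may be simpler to skip escape sequences altogether and build each character from its integer code point. This is a minimal sketch in Python 3, where chr() accepts any code point (in Python 2.7 the corresponding built-in is unichr()); the generator name hiragana_block is made up for this example.

```python
def hiragana_block():
    """Yield one character per code point from U+3040 to U+309F inclusive."""
    for code_point in range(0x3040, 0x30A0):
        yield chr(code_point)


chars = list(hiragana_block())

# The character from the question, U+309D, is part of the sequence.
print(chars[0x309D - 0x3040])
print(len(chars))
```

Some code points in a block may be unassigned, so a terminal font may not render every yielded character.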

How do you decode an ascii string in python?

For example, in your python shell(IDLE):
>>> a = "\x3cdiv\x3e"
>>> print a
The result you get is:
<div>
but if a is an ascii encoded string:
>>> a = "\\x3cdiv\\x3e" ## it's the actual \x3cdiv\x3e string if you read it from a file
>>> print a
The result you get is:
\x3cdiv\x3e
Now what I really want from a is <div>, so I did this:
>>> b = a.decode("ascii")
>>> print b
BUT surprisingly I did NOT get the result I want, it's still:
\x3cdiv\x3e
So basically what do I do to convert a, which is \x3cdiv\x3e to b, which should be <div>?
Thanks
>>> a = rb"\x3cdiv\x3e"
>>> a.decode('unicode_escape')
'<div>'
Also check out some interesting codecs.
With Python 3.x, you would adapt Kabie's answer to
a = b"\x3cdiv\x3e"
a.decode('unicode_escape')
or
a = b"\x3cdiv\x3e"
a.decode('ascii')
Both calls return the string '<div>'; the bytes object itself is:
>>> a
b'<div>'
What is the b prefix for?
Bytes literals are always prefixed with 'b' or 'B'; they produce an
instance of the bytes type instead of the str type. They may only
contain ASCII characters; bytes with a numeric value of 128 or greater
must be expressed with escapes.
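The difference between the two codecs shows up only when the backslashes are literal characters, as they are when the text is read from a file. A minimal Python 3 sketch of that case, with illustrative variable names:

```python
# Bytes as read from a file: a backslash, 'x', '3', 'c', and so on.
raw = b"\\x3cdiv\\x3e"

# The ascii codec maps each byte to a character and leaves the escapes alone.
as_ascii = raw.decode('ascii')
print(as_ascii)

# The unicode_escape codec interprets the \xNN sequences.
unescaped = raw.decode('unicode_escape')
print(unescaped)
```

As noted for the earlier question, unicode_escape also interprets other backslash sequences such as \n, so it is only appropriate when the input really follows Python's escape syntax.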

Python: convert a dot separated hex values to string?

In this post: Print a string as hex bytes? I learned how to print a string as an "array" of hex bytes; now I need the other way around:
So for example the input would be: 73.69.67.6e.61.74.75.72.65 and the output would be a string.
You can use the built-in binascii module. Do note, however, that this function will only work on ASCII-encoded characters.
binascii.unhexlify(hexstr)
Your input string will need to be dotless, however, which is easy to arrange with a simple
string = string.replace('.','')
Another (arguably safer) method is to use base64 in the following way:
import base64
encoded = base64.b16encode(b'data to be encoded')
print (encoded)
data = base64.b16decode(encoded)
print (data)
or in your example:
data = base64.b16decode(b"7369676e6174757265", True)
print (data.decode("utf-8"))
The string can be sanitised before input into the b16decode method.
Note that I am using Python 3.2; on other versions you may not need the b prefix in front of the string to denote bytes.
Example was found here
Without binascii:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(chr(int(e, 16)) for e in a.split('.'))
'signature'
>>>
or better (Python 2 only, where str objects support the 'hex' codec):
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(e.decode('hex') for e in a.split('.'))
PS: works with unicode:
>>> a='.'.join(x.encode('hex') for x in 'Hellö Wörld!')
>>> a
'48.65.6c.6c.94.20.57.94.72.6c.64.21'
>>> print "".join(e.decode('hex') for e in a.split('.'))
Hellö Wörld!
>>>
EDIT:
No need for a generator expression here (thx to thg435):
a.replace('.', '').decode('hex')
Use string split to get a list of strings, then base 16 for decoding the bytes.
>>> inp="73.69.67.6e.61.74.75.72.65"
>>> ''.join((chr(int(i,16)) for i in inp.split('.')))
'signature'
>>>
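The 'hex' codec used in some of the answers above is not available on str in Python 3. A rough Python 3 equivalent, sketched here with made-up helper names, goes through bytes.fromhex and an explicit text encoding (UTF-8 is assumed):

```python
def dotted_hex_to_text(dotted, encoding='utf-8'):
    """Turn '73.69.67' style input into text by way of bytes.fromhex."""
    return bytes.fromhex(dotted.replace('.', '')).decode(encoding)


def text_to_dotted_hex(text, encoding='utf-8'):
    """Inverse direction: one two-digit hex value per encoded byte."""
    return '.'.join('{:02x}'.format(b) for b in text.encode(encoding))


result = dotted_hex_to_text("73.69.67.6e.61.74.75.72.65")
print(result)

# Non-ASCII text round-trips as long as both sides use the same encoding.
round_trip = dotted_hex_to_text(text_to_dotted_hex('Hellö Wörld!'))
print(round_trip)
```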

Convert UTF-8 octets to unicode code points

I have a set of UTF-8 octets and I need to convert them back to Unicode code points. How can I do this in Python?
e.g. UTF-8 octet ['0xc5','0x81'] should be converted to 0x141 codepoint.
Python 3.x:
In Python 3.x, str is the class for Unicode text, and bytes is for containing octets.
If by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert to bytes like this:
>>> bytes(int(x,0) for x in ['0xc5', '0x81'])
b'\xc5\x81'
You can then convert to str (ie: Unicode) using the str constructor...
>>> str(b'\xc5\x81', 'utf-8')
'Ł'
...or by calling .decode('utf-8') on the bytes object:
>>> b'\xc5\x81'.decode('utf-8')
'Ł'
>>> hex(ord('Ł'))
'0x141'
Pre-3.x:
Prior to 3.x, the str type was a byte array, and unicode was for Unicode text.
Again, if by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert them like this:
>>> ''.join(chr(int(x,0)) for x in ['0xc5', '0x81'])
'\xc5\x81'
You can then convert to unicode using the constructor...
>>> unicode('\xc5\x81', 'utf-8')
u'\u0141'
...or by calling .decode('utf-8') on the str:
>>> '\xc5\x81'.decode('utf-8')
u'\u0141'
In lovely 3.x, where all strs are Unicode, and bytes are what strs used to be:
>>> s = str(bytes([0xc5, 0x81]), 'utf-8')
>>> s
'Ł'
>>> ord(s)
321
>>> hex(ord(s))
'0x141'
Which is what you asked for.
>>> l = ['0xc5','0x81']
>>> s = ''.join([chr(int(c, 16)) for c in l]).decode('utf8')
>>> s
u'\u0141'
>>> "".join((chr(int(x,16)) for x in ['0xc5','0x81'])).decode("utf8")
u'\u0141'
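The answers above all handle a single character. A small Python 3 sketch that generalises to any number of octets might look like this; the function name octets_to_code_points is hypothetical, and the input is assumed to be a list of strings such as '0xc5'.

```python
def octets_to_code_points(octets):
    """Decode a list of '0x..' octet strings as UTF-8 and return the
    code points of the resulting characters as hex strings."""
    raw = bytes(int(octet, 16) for octet in octets)
    text = raw.decode('utf-8')
    return [hex(ord(ch)) for ch in text]


single = octets_to_code_points(['0xc5', '0x81'])
print(single)

# Two characters: U+0141 followed by a plain ASCII 'a' (0x61).
several = octets_to_code_points(['0xc5', '0x81', '0x61'])
print(several)
```

Invalid or truncated UTF-8 input raises UnicodeDecodeError, which is usually preferable to silently producing wrong code points.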
