I have a set of UTF-8 octets and I need to convert them back to Unicode code points. How can I do this in Python?
For example, the UTF-8 octets ['0xc5','0x81'] should be converted to the code point 0x141.
Python 3.x:
In Python 3.x, str is the class for Unicode text, and bytes is for containing octets.
If by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert to bytes like this:
>>> bytes(int(x,0) for x in ['0xc5', '0x81'])
b'\xc5\x81'
You can then convert to str (i.e. Unicode) using the str constructor...
>>> str(b'\xc5\x81', 'utf-8')
'Ł'
...or by calling .decode('utf-8') on the bytes object:
>>> b'\xc5\x81'.decode('utf-8')
'Ł'
>>> hex(ord('Ł'))
'0x141'
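The steps above can be folded into one small helper. This is only a sketch for Python 3; the function name octets_to_code_points is made up for illustration, and it assumes every item is a string such as '0xc5'.

```python
def octets_to_code_points(octets):
    """Decode strings like '0xc5' as UTF-8 and return the code points (hypothetical helper)."""
    # int(x, 0) infers the base from the '0x' prefix.
    raw = bytes(int(x, 0) for x in octets)
    # decode() raises UnicodeDecodeError if the octets are not valid UTF-8.
    text = raw.decode('utf-8')
    return [ord(ch) for ch in text]


code_points = octets_to_code_points(['0xc5', '0x81'])
print([hex(cp) for cp in code_points])  # ['0x141']
```

Returning a list rather than a single integer keeps the helper usable when the octets encode more than one character.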
Pre-3.x:
Prior to 3.x, the str type was a byte array, and unicode was for Unicode text.
Again, if by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert them like this:
>>> ''.join(chr(int(x,0)) for x in ['0xc5', '0x81'])
'\xc5\x81'
You can then convert to unicode using the constructor...
>>> unicode('\xc5\x81', 'utf-8')
u'\u0141'
...or by calling .decode('utf-8') on the str:
>>> '\xc5\x81'.decode('utf-8')
u'\u0141'
In lovely 3.x, where all strs are Unicode, and bytes are what strs used to be:
>>> s = str(bytes([0xc5, 0x81]), 'utf-8')
>>> s
'Ł'
>>> ord(s)
321
>>> hex(ord(s))
'0x141'
Which is what you asked for.
>>> l = ['0xc5','0x81']
>>> s = ''.join([chr(int(c, 16)) for c in l]).decode('utf8')
>>> s
u'\u0141'
>>> "".join((chr(int(x,16)) for x in ['0xc5','0x81'])).decode("utf8")
u'\u0141'
I have a problem.
I get data like this:
hex_num='0EE6'
data_decode=str(codecs.decode(hex_num, 'hex'))[(0):(80)]
print(data_decode)
>>>b'\x0e\xe6'
And I want to encode it like this:
data_enc=str(codecs.encode(data_decode, 'hex'))[(2):(6)]
print(str(int(data_enc,16)))
>>>TypeError: encoding with 'hex' codec failed (TypeError: a bytes-like object is required, not 'str')
If I write this:
data_enc=str(codecs.encode(b'\x0e\xe6', 'hex'))[(2):(6)]
print(str(int(data_enc,16)))
>>>3814
It returns the number I want (3814).
Please help.
You can remove the quotation marks like this: data = b'\x0e\xe6'
The Python 3 documentation states:
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
When b is within a string, it will not behave like a string literal prefix, so you have to remove the quotations for the literal to work, and convert the text to bytes directly.
Corrected code:
import codecs
data = b'\x0e\xe6'
data_enc=str(codecs.encode(data, 'hex'))[(2):(6)]
print(str(int(data_enc,16)))
Output:
3814
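The error in the question comes from wrapping the decoded bytes in str(), which yields the text "b'\x0e\xe6'" rather than the bytes themselves. A minimal Python 3 sketch that starts from the original hex_num and keeps the value as bytes might look like this (the variable names follow the question; direct is an extra name added here):

```python
import codecs

hex_num = '0EE6'

# Keep the decoded value as bytes; do not wrap it in str().
data_decode = codecs.decode(hex_num, 'hex')        # b'\x0e\xe6'

# Encoding back to hex yields bytes (b'0ee6'); decode to get plain text.
data_enc = codecs.encode(data_decode, 'hex').decode('ascii')
value = int(data_enc, 16)
print(value)  # 3814

# If only the number is needed, the round trip through bytes can be skipped.
direct = int(hex_num, 16)
```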
To convert a hex string to binary data, binascii.unhexlify is a convenient method, e.g.:
>>> hex_num='0EE6'
>>> import binascii
>>> binascii.unhexlify(hex_num)
b'\x0e\xe6'
Then, to convert the binary data to an integer, int.from_bytes gives you control over the endianness of the data and whether it is signed, e.g.:
>>> bytes_data = b'\x0e\xe6'
>>> int.from_bytes(bytes_data, byteorder='little', signed=False)
58894
>>> int.from_bytes(bytes_data, byteorder='big', signed=False)
3814
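The two calls can be combined into one function. This is a sketch only; hex_to_int is a hypothetical name, and the defaults (big-endian, unsigned) are an assumption chosen to match the 3814 result in the question.

```python
import binascii


def hex_to_int(hex_text, byteorder='big', signed=False):
    """Convert a hex string such as '0EE6' to an integer (hypothetical helper)."""
    raw = binascii.unhexlify(hex_text)  # raises binascii.Error on odd length or non-hex input
    return int.from_bytes(raw, byteorder=byteorder, signed=signed)


print(hex_to_int('0EE6'))                      # 3814
print(hex_to_int('0EE6', byteorder='little'))  # 58894
```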
I'd like to decode the following string:
t\u028c\u02c8m\u0251\u0279o\u028a\u032f
It should be the IPA of 'tomorrow' as given in a JSON string from http://rhymebrain.com/talk?function=getWordInfo&word=tomorrow
My understanding is that it should be something like:
x = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f'
print x.decode()
I have tried the solutions from here, here, here, and here (and several others that more or less apply), and several permutations of their parts, but I can't get it to work.
Thank you
You need a u before your string (in Python 2.x, which you appear to be using) to indicate that this is a unicode string:
>>> x = u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # note the u
>>> print x
tʌˈmɑɹoʊ̯
If you have already stored the string in a variable, you can use the following constructor to convert the string into unicode:
>>> s = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # your string has a unicode-escape encoding but is not unicode
>>> x = unicode(s, encoding='unicode-escape')
>>> print x
tʌˈmɑɹoʊ̯
>>> x
u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # a unicode string
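For readers on Python 3, where unicode() does not exist, a rough equivalent is sketched below. Since the data arrives as JSON, json.loads handles the \uXXXX escapes on its own; the "ipa" key is an assumed field name used only for illustration. The second half shows the unicode_escape codec for the case where only the escaped text is available.

```python
import json

# Raw JSON text as received: the backslashes are literal characters here.
# The "ipa" key is an assumption, not taken from the actual response.
raw_json = r'{"ipa": "t\u028c\u02c8m\u0251\u0279o\u028a\u032f"}'
ipa = json.loads(raw_json)["ipa"]
print(ipa)

# Without the JSON wrapper, decode the escapes with the unicode_escape codec.
escaped = r't\u028c\u02c8m\u0251\u0279o\u028a\u032f'
decoded = escaped.encode('ascii').decode('unicode_escape')
```

Both routes should give the same nine-code-point string; the JSON route is usually the safer one because it follows the JSON escaping rules rather than Python's.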
I'm using Python 2.7
I'm reading a file containing "iso-8859-1" coded information.
After parsing, I get the results in strings, ie s1:
>>> s1
'D\xf6rfli'
>>> type(s1)
<type 'str'>
>>> s2=s1.decode("iso-8859-1").encode("utf8")
>>> s2
'D\xc3\xb6rfli'
>>> type(s2)
<type 'str'>
>>> print s1, s2
D�rfli Dörfli
>>>
Why is the type of s2 still a str after the call to .encode?
How can I convert it from str to utf-8?
str in Python 2 means an encoded string, i.e. a sequence of bytes. This is documented behavior. The decoded str would be of type unicode.
UTF-8 is an encoding, just like ISO-8859-1. So you decode your string and then encode it in another encoding, which produces data of the same type.
By contrast, in Python 3 str is a text string (in Unicode), and calling encode on it gives you an instance of bytes.
So, in Python 2, a UTF-8 string will be str, because it is encoded.
I second the recommendation by Ned: take a look at the presentation he links to (oh my, is it his own talk?). It helped me a lot when I was struggling with these things.
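The type distinction described above is easier to see in Python 3, where the two kinds of data have different types. A small sketch, using the same bytes as the question:

```python
raw = b'D\xf6rfli'                # ISO-8859-1 encoded bytes, as read from the file
text = raw.decode('iso-8859-1')   # str (Unicode text)
utf8 = text.encode('utf-8')       # bytes again, now UTF-8 encoded

print(type(text).__name__, type(utf8).__name__)  # str bytes
print(utf8)                                      # b'D\xc3\xb6rfli'
```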
I'm not sure if this answers your questions, but here's what I observed.
If you just want to get the string into a printable form, stop after calling decode. I'm not sure why you are trying to encode into UTF-8 after successfully converting from ISO-8859-1 into unicode.
>>> s1 = 'D\xf6rfli'
>>> s1
'D\xf6rfli'
>>> s2 = s1.decode("iso-8859-1")
>>> s2
u'D\xf6rfli'
>>> print s2
Dörfli
>>>
In this post: Print a string as hex bytes? I learned how to print a string as an "array" of hex bytes. Now I need something the other way around:
So for example the input would be: 73.69.67.6e.61.74.75.72.65 and the output would be a string.
You can use the built-in binascii module. Do note, however, that this function will only work on ASCII-encoded characters.
binascii.unhexlify(hexstr)
Your input string will need to be dotless, however, which is easy to achieve with a simple
string = string.replace('.','')
Another (arguably safer) method would be to use base64 in the following way:
import base64
encoded = base64.b16encode(b'data to be encoded')
print (encoded)
data = base64.b16decode(encoded)
print (data)
or in your example:
data = base64.b16decode(b"7369676e6174757265", True)
print (data.decode("utf-8"))
The string can be sanitised before input into the b16decode method.
Note that I am using Python 3.2; in other versions you may not need the b prefix in front of the string to denote bytes.
Example was found here
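The sanitising step mentioned above could look roughly like this in Python 3. The function name dotted_hex_to_text is hypothetical, and the sketch assumes the only characters to strip are the dots from the question's input.

```python
import base64


def dotted_hex_to_text(dotted, encoding='utf-8'):
    """Turn '73.69.67...' into text via base64.b16decode (hypothetical helper)."""
    cleaned = dotted.replace('.', '').encode('ascii')  # b16decode expects bytes-like input
    # casefold=True accepts lowercase hex digits as well as uppercase.
    return base64.b16decode(cleaned, casefold=True).decode(encoding)


print(dotted_hex_to_text("73.69.67.6e.61.74.75.72.65"))  # signature
```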
Without binascii:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(chr(int(e, 16)) for e in a.split('.'))
'signature'
>>>
or better:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(e.decode('hex') for e in a.split('.'))
PS: works with unicode:
>>> a='.'.join(x.encode('hex') for x in 'Hellö Wörld!')
>>> a
'48.65.6c.6c.94.20.57.94.72.6c.64.21'
>>> print "".join(e.decode('hex') for e in a.split('.'))
Hellö Wörld!
>>>
EDIT:
No need for a generator expression here (thx to thg435):
a.replace('.', '').decode('hex')
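The str.decode('hex') call above exists only in Python 2. In Python 3 a comparable one-liner can be sketched with bytes.fromhex; the variable names below are illustrative, and the text is assumed to be ASCII.

```python
dotted = "73.69.67.6e.61.74.75.72.65"

# Strip the dots, parse the hex digits into bytes, then decode to text.
raw = bytes.fromhex(dotted.replace('.', ''))
text = raw.decode('ascii')
print(text)  # signature

# The reverse direction: one two-digit hex value per byte, joined with dots.
back = '.'.join(format(b, '02x') for b in text.encode('ascii'))
```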
Use string split to get a list of strings, then base 16 for decoding the bytes.
>>> inp="73.69.67.6e.61.74.75.72.65"
>>> ''.join((chr(int(i,16)) for i in inp.split('.')))
'signature'
>>>
I'm wondering how to convert a unicode string like u'é' to its unicode character code u'\xe9'.
You can use Python's repr() function:
>>> unicode_char = u'é'
>>> repr(unicode_char)
"u'\\xe9'"
ord will give you the numeric value, but you'll have to convert it into hex:
>>> ord(u'é')
233
u'é' and u'\xe9' are exactly the same; they are just different representations:
>>> u'é' == u'\xe9'
True
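In Python 3 the same ideas can be sketched as follows: ord gives the code point, hex or a format spec renders it, and the unicode_escape codec produces the escaped spelling. The variable names are illustrative.

```python
ch = '\u00e9'                  # the character é

code_point = ord(ch)           # 233
print(hex(code_point))         # 0xe9
print(f'U+{code_point:04X}')   # U+00E9

# The escaped spelling as text: a backslash followed by 'xe9'.
escaped = ch.encode('unicode_escape').decode('ascii')
print(escaped)                 # \xe9
```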