A string that represents encoded characters - python

I use python 2.7 and I have the following string: mystr = '\xde\x05\xd7\x05\xe9\x05\xd1\x05'
I want to get the real unicode string out of it: myuni = u'\u05de\u05d7\u05e9\u05d1'.
The encoding is "cp1255".
How can I get this done?
Thank you!

You don't have CP1255 data. You have UTF-16 (little endian) data instead:
>>> mystr = '\xde\x05\xd7\x05\xe9\x05\xd1\x05'
>>> mystr.decode('utf-16-le')
u'\u05de\u05d7\u05e9\u05d1'
CP1255 looks like this:
>>> u'\u05de\u05d7\u05e9\u05d1'.encode('cp1255')
'\xee\xe7\xf9\xe1'
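For readers on Python 3, the same check can be sketched as follows. This is only a minimal illustration, assuming the data arrives as a bytes object; the variable names are made up for the example.

```python
# Python 3 sketch: the bytes from the question, held as a bytes object.
raw = b'\xde\x05\xd7\x05\xe9\x05\xd1\x05'

# Every second byte is 0x05, which is typical of UTF-16-LE text in the
# Hebrew block (U+0590-U+05FF), so decode it as UTF-16 little endian.
text = raw.decode('utf-16-le')

# Encoding the same text as CP1255 gives one byte per letter instead.
cp1255_bytes = text.encode('cp1255')

print(ascii(text))
print(cp1255_bytes)
```

Comparing the two byte strings side by side is usually the quickest way to tell which encoding you actually have.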

Related

How do I replace \xc3 etc. with umlauts?

I have an output of spannkr \xc3\xa4ftig, da\xc3\x9f unser in Python. How do I replace this with umlauts?
The German characters are already there, but encoded as utf-8. If you want to see the umlauts etc in the interpreter then you can decode to str:
>>> bs = b'spannkr \xc3\xa4ftig, da\xc3\x9f unser'
>>> s = bs.decode('utf-8')
>>> print(s)
spannkr äftig, daß unser
It's possible that you are dealing with a str that somehow contains utf-8 encoded data. In this case you need to perform an extra step:
>>> s = 'spannkr \xc3\xa4ftig, da\xc3\x9f unser'
>>> bs = s.encode('raw-unicode-escape') # encode to bytes without double-encoding
>>> print(bs)
b'spannkr \xc3\xa4ftig, da\xc3\x9f unser'
>>> decoded = bs.decode('utf-8')
>>> print(decoded)
spannkr äftig, daß unser
There isn't an easy way to distinguish between incorrectly embedded spaces and the spaces between words. You would need to use some kind of spellchecker or natural language application.
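The extra encode step described above can also be written with Latin-1, which maps code points 0-255 to the same byte values. Here is a minimal Python 3 sketch, assuming every character in the str is below U+0100; the helper name repair_utf8_in_str is made up for this example.

```python
def repair_utf8_in_str(s):
    """Reinterpret a str whose characters are really UTF-8 byte values.

    Hypothetical helper: Latin-1 maps code points 0-255 to identical byte
    values, so encoding with it recovers the original bytes.
    Raises UnicodeEncodeError if s contains characters above U+00FF, and
    UnicodeDecodeError if the recovered bytes are not valid UTF-8.
    """
    return s.encode('latin-1').decode('utf-8')


broken = 'spannkr \xc3\xa4ftig, da\xc3\x9f unser'
fixed = repair_utf8_in_str(broken)
print(fixed)
```

Unlike raw-unicode-escape, Latin-1 fails loudly when the str contains characters above U+00FF, which may be preferable if you want to notice input that was not mis-decoded in the first place.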

python string to hex with escaped hex values

I have a string like "Some characters \x00\x80\x34 and then some other characters". How can I convert the regular characters to their hex equivalent, while converting \x00 to the actual 00 hex value?
binascii.hexlify() considers '\', 'x', '0', '0' as actual characters.
Later edit:
The string itself is produced by another function. When I print it, it actually prints "\x00".
As I understand it, you are trying to convert only the characters that are not already hex escapes to hex. It would help if you gave a sample input string that you are trying to convert.
You can also convert to hex using just the built-in encode and decode methods (Python 2 only), which should take care of what you are trying to do. I ran the following three lines in a terminal on my machine, and they gave the output you are expecting. Hope it helps:
aStr = "Some characters \x00\x80\x34 and then some other characters"
aStr.encode("hex")
aStr.encode("hex").decode("hex")
It's unclear what you're asking, since binascii.hexlify should work:
>>> import binascii
>>> s = "\x00\x80\x34"
>>> binascii.hexlify(s)
'008034'
>>> s = "foobar \x00\x80\x34 foobar"
>>> binascii.hexlify(s)
'666f6f6261722000803420666f6f626172'
foobar = 666f6f626172, space = 20
↳ https://docs.python.org/3/library/binascii.html
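If the string really does contain a literal backslash followed by x00 (which the "Later edit" in the question suggests), the escapes have to be interpreted before hexlifying. Below is a rough Python 3 sketch of that idea, assuming the input is ASCII text whose escapes are all of the \xNN form; the helper name hex_with_escapes is hypothetical.

```python
def hex_with_escapes(text):
    # Hypothetical helper. Step 1: turn literal "\xNN" sequences into the
    # characters they name. Step 2: map each character back to one byte
    # with Latin-1. Assumes ASCII input whose escapes are all \xNN.
    raw = text.encode('latin-1').decode('unicode_escape').encode('latin-1')
    return raw.hex()


# A raw string literal, so the backslashes are real characters here.
literal = r"ab \x00\x80\x34 cd"
print(hex_with_escapes(literal))
```

On Python 3, bytes.hex() does the same job as binascii.hexlify but returns a str.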

Decode unicode string in python

I'd like to decode the following string:
t\u028c\u02c8m\u0251\u0279o\u028a\u032f
It should be the IPA of 'tomorrow' as given in a JSON string from http://rhymebrain.com/talk?function=getWordInfo&word=tomorrow
My understanding is that it should be something like:
x = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f'
print x.decode()
I have tried the solutions from several other questions (and several others that more or less apply), and several permutations of their parts, but I can't get it to work.
Thank you
You need a u before your string (in Python 2.x, which you appear to be using) to indicate that this is a unicode string:
>>> x = u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # note the u
>>> print x
tʌˈmɑɹoʊ̯
If you have already stored the string in a variable, you can use the following constructor to convert the string into unicode:
>>> s = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # your string has a unicode-escape encoding but is not unicode
>>> x = unicode(s, encoding='unicode-escape')
>>> print x
tʌˈmɑɹoʊ̯
>>> x
u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # a unicode string
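Since the text comes from a JSON response, another option is to let a JSON parser interpret the \u escapes. A small sketch, assuming you have the value with its surrounding double quotes exactly as it appears in the JSON document (Python 3 syntax; the variable names are illustrative):

```python
import json

# The escaped text as it would appear inside the JSON document,
# including the surrounding double quotes.
json_fragment = r'"t\u028c\u02c8m\u0251\u0279o\u028a\u032f"'

# json.loads interprets the \uXXXX escapes and returns a text string.
ipa = json.loads(json_fragment)
print(ipa)
```

In practice you would normally call json.loads on the whole response body, and every string inside the result would already be decoded.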

Python: convert a dot separated hex values to string?

In this post: Print a string as hex bytes? I learned how to print a string as an "array" of hex bytes. Now I need to go the other way around:
So for example the input would be: 73.69.67.6e.61.74.75.72.65 and the output would be a string.
You can use the built-in binascii module. Do note, however, that this function only accepts ASCII hex digits as input.
binascii.unhexlify(hexstr)
Your input string will need to be dotless, however, but that is quite easy with a simple
string = string.replace('.','')
Another (arguably safer) method would be to use base64 in the following way:
import base64
encoded = base64.b16encode(b'data to be encoded')
print (encoded)
data = base64.b16decode(encoded)
print (data)
or in your example:
data = base64.b16decode(b"7369676e6174757265", True)
print (data.decode("utf-8"))
The string can be sanitised before input into the b16decode method.
Note that I am using Python 3.2; on Python 2 you may not need the b prefix in front of the string to denote bytes.
Without binascii:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(chr(int(e, 16)) for e in a.split('.'))
'signature'
>>>
or better:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(e.decode('hex') for e in a.split('.'))
PS: works with unicode:
>>> a='.'.join(x.encode('hex') for x in 'Hellö Wörld!')
>>> a
'48.65.6c.6c.94.20.57.94.72.6c.64.21'
>>> print "".join(e.decode('hex') for e in a.split('.'))
Hellö Wörld!
>>>
EDIT:
No need for a generator expression here (thx to thg435):
a.replace('.', '').decode('hex')
Use string split to get a list of strings, then base 16 for decoding the bytes.
>>> inp="73.69.67.6e.61.74.75.72.65"
>>> ''.join((chr(int(i,16)) for i in inp.split('.')))
'signature'
>>>
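The .decode('hex') calls above exist only on Python 2. On Python 3 the same idea could look roughly like this; the function name dotted_hex_to_text and the UTF-8 default are assumptions made for the sketch.

```python
def dotted_hex_to_text(dotted, encoding='utf-8'):
    # Hypothetical helper: strip the dots, parse the hex digits into bytes,
    # then decode the bytes with the given encoding (UTF-8 is an assumption).
    return bytes.fromhex(dotted.replace('.', '')).decode(encoding)


print(dotted_hex_to_text("73.69.67.6e.61.74.75.72.65"))
```

bytes.fromhex raises ValueError on anything that is not a pair of hex digits, so malformed input is reported rather than silently skipped.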

Convert UTF-8 octets to unicode code points

I have a set of UTF-8 octets and I need to convert them back to Unicode code points. How can I do this in Python?
e.g. UTF-8 octet ['0xc5','0x81'] should be converted to 0x141 codepoint.
Python 3.x:
In Python 3.x, str is the class for Unicode text, and bytes is for containing octets.
If by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert to bytes like this:
>>> bytes(int(x,0) for x in ['0xc5', '0x81'])
b'\xc5\x81'
You can then convert to str (ie: Unicode) using the str constructor...
>>> str(b'\xc5\x81', 'utf-8')
'Ł'
...or by calling .decode('utf-8') on the bytes object:
>>> b'\xc5\x81'.decode('utf-8')
'Ł'
>>> hex(ord('Ł'))
'0x141'
Pre-3.x:
Prior to 3.x, the str type was a byte array, and unicode was for Unicode text.
Again, if by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert them like this:
>>> ''.join(chr(int(x,0)) for x in ['0xc5', '0x81'])
'\xc5\x81'
You can then convert to unicode using the constructor...
>>> unicode('\xc5\x81', 'utf-8')
u'\u0141'
...or by calling .decode('utf-8') on the str:
>>> '\xc5\x81'.decode('utf-8')
u'\u0141'
In lovely 3.x, where all strs are Unicode, and bytes are what strs used to be:
>>> s = str(bytes([0xc5, 0x81]), 'utf-8')
>>> s
'Ł'
>>> ord(s)
321
>>> hex(ord(s))
'0x141'
Which is what you asked for.
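To see why 0xc5 0x81 comes out as 0x141, the two-byte case can be worked through by hand. This is only an illustration of the bit layout for a two-byte sequence, not a replacement for the built-in decoder; the variable names are made up.

```python
lead, cont = 0xC5, 0x81

# A two-byte UTF-8 sequence has the form 110xxxxx 10yyyyyy.
assert lead & 0b11100000 == 0b11000000   # lead byte starts with 110
assert cont & 0b11000000 == 0b10000000   # continuation byte starts with 10

# Keep the 5 payload bits of the lead byte and the 6 payload bits of the
# continuation byte, then join them: xxxxxyyyyyy.
code_point = ((lead & 0b00011111) << 6) | (cont & 0b00111111)
print(hex(code_point))
```

Here the lead byte contributes 0x05 and the continuation byte contributes 0x01, so the result is (0x05 << 6) | 0x01 = 0x141.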
>>> l = ['0xc5','0x81']
>>> s = ''.join([chr(int(c, 16)) for c in l]).decode('utf8')
>>> s
u'\u0141'
>>> "".join((chr(int(x,16)) for x in ['0xc5','0x81'])).decode("utf8")
u'\u0141'
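The one-liners above are Python 2 and return the decoded text rather than the code point numbers. A possible Python 3 version that handles more than one character might look like this, assuming the octets arrive as '0x..' strings; the function name octets_to_code_points is hypothetical.

```python
def octets_to_code_points(octets):
    # Hypothetical helper: '0xc5'-style strings -> bytes -> text -> code points.
    # int(x, 16) accepts an optional '0x' prefix.
    data = bytes(int(x, 16) for x in octets)
    return [hex(ord(ch)) for ch in data.decode('utf-8')]


print(octets_to_code_points(['0xc5', '0x81']))
print(octets_to_code_points(['0xc5', '0x81', '0x41']))
```

Invalid or truncated sequences raise UnicodeDecodeError, which is probably what you want when validating input.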
