How to programmatically retrieve the Unicode char from hexadecimals?

Given a list of hexadecimal strings that correspond to Unicode code points, how can I programmatically retrieve the Unicode characters?
E.g. Given the list:
>>> l = ['9359', '935A', '935B']
how to achieve this list:
>>> u = [u'\u9359', u'\u935A', u'\u935B']
>>> u
['鍙', '鍚', '鍛']
I've tried this but it throws a SyntaxError:
>>> u'\u' + l[0]
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape

\uhhhh escapes are only valid in string literals, you can't use those to turn arbitrary hex values into characters. In other words, they are part of a larger syntax, and can't be used stand-alone.
Decode the hex value to an integer and pass it to the chr() function (or, on Python 2, the unichr() function):
[chr(int(v, 16)) for v in l]
You could ask Python to interpret a string containing literal \uhhhh text as a Unicode string literal with the unicode_escape codec, but that feels like overkill for individual codepoints:
[(b'\\u' + v.encode('ascii')).decode('unicode_escape') for v in l]
Note the double backslash in the prefix added, and that we have to create byte strings for this to work at all.
Demo:
>>> l = ['9359', '935A', '935B']
>>> [chr(int(v, 16)) for v in l]
['鍙', '鍚', '鍛']
>>> [(b'\\u' + v.encode('ascii')).decode('unicode_escape') for v in l]
['鍙', '鍚', '鍛']
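As a minimal sketch, the two approaches can be wrapped in helper functions. The names hex_to_char and hex_to_char_via_escape are hypothetical and do not come from the answer above. The sketch also assumes you may meet code points above U+FFFF, which need more than four hex digits. A \u escape reads exactly four digits, so the second helper pads to eight digits and uses \U instead.

```python
def hex_to_char(hex_code):
    """Return the character for a hex code point string such as '9359'."""
    return chr(int(hex_code, 16))


def hex_to_char_via_escape(hex_code):
    """Same result via the unicode_escape codec (hypothetical helper).

    Pads to eight digits and uses \\U so that code points above U+FFFF
    are handled as well.
    """
    escaped = b'\\U' + hex_code.zfill(8).encode('ascii')
    return escaped.decode('unicode_escape')


codes = ['9359', '935A', '935B', '1F600']
print([hex_to_char(v) for v in codes])
print([hex_to_char_via_escape(v) for v in codes])

# With a plain \u prefix only the first four digits are consumed,
# so '1F600' becomes U+1F60 followed by a literal '0'.
truncated = (b'\\u' + b'1F600').decode('unicode_escape')
print(len(truncated))
```

For most cases chr(int(v, 16)) remains the simpler choice, since it needs no padding and no round trip through bytes.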

Related

joining strings together to make a unicode character

I am trying to create a random Unicode generator and made a function that can create 16-bit Unicode characters. This is my code:
import random
import string
def rand_unicode():
    list = []
    list.append(str(random.randint(0,1)))
    for i in range(0,3):
        if random.randint(0,1):
            list.append(string.ascii_letters[random.randint(0, \
                len(string.ascii_letters))-1].upper())
        else:
            list.append(str(random.randint(0,9)))
    return ''.join(list)
print(rand_unicode())
The problem is that whenever I try to add a '\u' in the print statement, Python gives me the following error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
I tried raw strings but that only gives me output like '\u0070' without turning it into a unicode character. How can I properly connect the strings to create a unicode character? Any help is appreciated.

From:
The problem is that whenever I try to add a '\u' in the print statement, Python gives me the following error:
it sounds like the problem may be in code you haven't included in your question:
print('\u' + rand_unicode())
This won't do what you expect, because the '\u' is interpreted before the strings are concatenated. See Process escape sequences in a string in Python and try:
print(bytes('\\u' + rand_unicode(), 'us-ascii').decode('unicode_escape'))

A unicode escape sequence such as \u0070 is a single character. It is not the concatenation of \u and the ordinal.
>>> '\u0070' == 'p'
True
>>> '\u0070' == (r'\u' + '0070')
False
To convert an ordinal to a unicode character, you can pass the numerical ordinal to the chr builtin function. Use int(literal, 16) to convert a hex-literal ordinal to a numerical one:
>>> ordinal = '0070'
>>> chr(int(ordinal, 16)) # convert literal to number to unicode
'p'
>>> chr(int(rand_unicode(), 16))
'ᚈ'
Note that creating a literal ordinal is not required. You can directly create the numerical ordinal:
>>> chr(112) # convert decimal number to unicode
'p'
>>> chr(0x0070) # convert hexadecimal number to unicode
'p'
>>> chr(random.randint(0, 0x10FFFF))
'嚟'
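Note that rand_unicode() in the question draws from all ASCII letters, so it can return digits such as 'G' that int(..., 16) rejects with a ValueError. A minimal sketch that sidesteps string building by picking the numerical ordinal directly is shown here. The name rand_bmp_char is hypothetical, and limiting the range to U+0000..U+FFFF while skipping surrogates is an assumption about what "16-bit Unicode characters" should mean.

```python
import random


def rand_bmp_char(rng=random):
    """Return one random character from U+0000..U+FFFF (hypothetical helper).

    Surrogate code points (U+D800..U+DFFF) are skipped because they are
    not characters on their own.
    """
    while True:
        code_point = rng.randint(0x0000, 0xFFFF)
        if not 0xD800 <= code_point <= 0xDFFF:
            return chr(code_point)


sample = rand_bmp_char()
print(repr(sample), hex(ord(sample)))
```

Many code points in this range are unassigned or non-printing, so the result will not always display as a visible glyph.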

python3 how to turn unicode codepoint into unicode char

I know this type of question is asked a lot, but no answer was able to help me with my specific setup.
I have a list of ONLY Unicode code points, in this form:
304E
304F
...
There is no U+XXXX and no '\XXXX' version.
I've tried to use string manipulation to recreate such strings
so I can simply print the corresponding Unicode character.
What I tried:
x = u'\\u' + listString
x = '\\u' + listString
x = '\u' + listString
The first two, when printed, just give me a '\uXXXX' string, and I have no idea
how to make it print the character rather than that string.
The last one gives me this error:
(unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
It's probably just something I don't get about Unicode and string manipulation, but I hope someone can help me out here.
Thanks in advance o/

You can use chr to get the character for a unicode code point:
>>> chr(0x304E)
'ぎ'
You can use int to convert a hexadecimal string to an integer:
>>> int('304E', 16)
12366
>>> chr(int('304E', 16))
'ぎ'

How to convert a full ascii string to hex in python?

I have this string:
string = "{'id':'other_aud1_aud2','kW':15}"
And, simply put, I would like to turn my string into a hex string like this: '7b276964273a276f746865725f617564315f61756432272c276b57273a31357d'
I have been trying binascii.hexlify(string), but it keeps raising:
TypeError: a bytes-like object is required, not 'str'
Also, it only needs to work with the following method: bytearray.fromhex(data['string_hex']).decode()
Here is the entire code:
string_data = "{'id':'"+self.id+"','kW':"+str(value)+"}"
print(string_data)
string_data_hex = hexlify(string_data)
get_json = bytearray.fromhex(data['string_hex']).decode()
Also, this is Python 3.6.

You can encode() the string:
from binascii import hexlify, unhexlify
string = "{'id':'other_aud1_aud2','kW':15}"
h = hexlify(string.encode())
print(h.decode())
# 7b276964273a276f746865725f617564315f61756432272c276b57273a31357d
s = unhexlify(h).decode()
print(s)
# {'id':'other_aud1_aud2','kW':15}
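As an alternative sketch, bytes objects in Python 3.5 and later have a built-in hex() method, so the same round trip can be done without binascii. The variable names hex_string and round_trip are illustrative, and UTF-8 is assumed as the encoding on both sides.

```python
string = "{'id':'other_aud1_aud2','kW':15}"

# str -> bytes -> hex text
hex_string = string.encode('utf-8').hex()
print(hex_string)

# hex text -> bytes -> str, using the decoding call from the question
round_trip = bytearray.fromhex(hex_string).decode('utf-8')
print(round_trip)
```

Both directions must use the same encoding, otherwise non-ASCII characters will not survive the round trip.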

The tricky bit here is that a Python 3 string is a sequence of Unicode characters, which is not the same as a sequence of ASCII characters.
In Python2, the str type and the bytes type are synonyms, and there is a separate type, unicode, that represents a sequence of Unicode characters. This makes it something of a mystery, if you have a string: is it a sequence of bytes, or is it a sequence of characters in some character-set?
In Python3, str now means unicode and we use bytes for what used to be str. Given a string—a sequence of Unicode characters—we use encode to convert it to some byte-sequence that can represent it, if there is such a sequence:
>>> 'hello'.encode('ascii')
b'hello'
>>> 'sch\N{latin small letter o with diaeresis}n'
'schön'
>>> 'sch\N{latin small letter o with diaeresis}n'.encode('utf-8')
b'sch\xc3\xb6n'
but:
>>> 'sch\N{latin small letter o with diaeresis}n'.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 3: ordinal not in range(128)
Once you have the bytes object, you already know what to do. In Python2, if you have a str, you have a bytes object; in Python3, use .encode with your chosen encoding.
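To tie this back to the question, here is a small sketch of the full path from a non-ASCII string to hex text. It assumes UTF-8 is the chosen encoding. The names utf8_hex and ascii_ok are illustrative only.

```python
from binascii import hexlify

text = 'sch\N{latin small letter o with diaeresis}n'

# Encode first, then hexlify the resulting bytes.
utf8_hex = hexlify(text.encode('utf-8')).decode('ascii')
print(utf8_hex)

# The same string has no ASCII byte sequence, so encoding fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```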

How to format bytes from strings to hex values

I am reading in a binary file with big5 encoded characters. When I read a double byte sequence, they would appear as a list of bytes e.g.
>>> bytes = ['0xa6', '0x7b']
If I modify these string bytes:
>>> big5_str = ''
>>> for hexVal in bytes:
...     newHexVal = '\\' + hexVal[1:]
...     big5_str += newHexVal
so they appear as:
>>> print big5_str
\xa6\x7b
but big5_str actually still has the escaped '\' in the string:
>>> big5_str
'\\xa6\\x7b'
and if I decode using big5, I only get the same string back (due to the double backslash):
>>> print big5_str.decode('big5')
\xa6\x7b
If I explicitly code the byte sequence as hex values:
>>> bytes2 = '\xa6\x7b'
>>> print bytes2.decode('big5')
州
My question is: how can I read these bytes and format them in '\x**' form, with a non-escaped backslash, so that they are recognized as bytes rather than strings?
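One possible approach, sketched in Python 3 rather than the Python 2 of the question, is to skip the backslash text entirely: convert each '0x..' string to an integer and build a bytes object from those integers. The name hex_strings is a hypothetical replacement for the question's bytes list, chosen so the built-in bytes type is not shadowed.

```python
hex_strings = ['0xa6', '0x7b']

# int(s, 16) accepts the '0x' prefix when base 16 is given explicitly.
raw = bytes(int(s, 16) for s in hex_strings)
print(raw)

# Decode the real bytes with the big5 codec.
decoded = raw.decode('big5')
print(decoded)
```

The '\x**' notation is only how a bytes value is written in source code or shown by repr(), so there is nothing to format once the integers are in a bytes object.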

How to convert an accented character in an unicode string to its unicode character code using Python?

I'm just wondering how to convert a Unicode string like u'é' to its Unicode character code u'\xe9'.

You can use Python's repr() function:
>>> unicode_char = u'é'
>>> repr(unicode_char)
"u'\\xe9'"

ord will give you the numeric value, but you'll have to convert it into hex:
>>> ord(u'é')
233
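A short sketch of that hex conversion step, written for Python 3. The variable names are illustrative. It shows the built-in hex(), a zero-padded form via format(), and the unicode_escape codec for getting the escape text itself.

```python
ch = 'é'
code_point = ord(ch)

as_hex = hex(code_point)            # string with a '0x' prefix
padded = format(code_point, '04X')  # four uppercase hex digits
print(as_hex, padded)

# The unicode_escape codec produces the backslash escape as ASCII text.
escape_text = ch.encode('unicode_escape').decode('ascii')
print(escape_text)
```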

u'é' and u'\xe9' are exactly the same, they are just different representations:
>>> u'é' == u'\xe9'
True
