How to turn unicode character into \Uxxxxxxxx format in python 3

I have a Unicode character like 🏆 and I want to get back the \Uxxxxxxxx format, but so far I couldn't find an easy way. I already tried:
text = '🏆'
text.encode('utf-32').decode('utf-8')
returns error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
text.encode('utf-32').decode('unicode-escape')
returns ÿþ
How can I make it return \U000XXXXX? I know I can get the character from \U000XXXXX by doing:
string = "foo bar foo \U000XXXXX"
string.encode('utf-8').decode('unicode-escape')
returns "foo bar foo 🏆"

For a byte string:
>>> text = '🏆'
>>> text.encode('unicode-escape')
b'\\U0001f3c6'
For a Unicode string:
>>> text.encode('unicode-escape').decode('ascii')
'\\U0001f3c6'
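Note that unicode-escape only produces the eight-digit \U form for characters above U+FFFF; ASCII is left as-is, and other characters come out as \xNN or \uXXXX. If every character should come back in \UXXXXXXXX form, a minimal sketch could format ord() directly (to_long_escape is a made-up helper name, not something from the answer above):

```python
def to_long_escape(text):
    """Return every character of text as an 8-hex-digit \\UXXXXXXXX escape."""
    # ord() gives the code point; :08x pads it to eight lowercase hex digits.
    return ''.join('\\U{:08x}'.format(ord(ch)) for ch in text)


print(to_long_escape('🏆'))    # \U0001f3c6
print(to_long_escape('a🏆'))   # \U00000061\U0001f3c6

# For comparison, the codec picks the shortest escape per character:
print('a'.encode('unicode-escape'))   # b'a'
print('ぎ'.encode('unicode-escape'))  # b'\\u304e'
```

The result is plain ASCII, so it can be turned back into the original text with .encode('ascii').decode('unicode_escape').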

Related

joining strings together to make a unicode character

I am trying to create a random Unicode generator and made a function that can create 16-bit Unicode characters. This is my code:
import random
import string
def rand_unicode():
    list = []
    list.append(str(random.randint(0,1)))
    for i in range(0,3):
        if random.randint(0,1):
            list.append(string.ascii_letters[random.randint(0, \
            len(string.ascii_letters))-1].upper())
        else:
            list.append(str(random.randint(0,9)))
    return ''.join(list)
print(rand_unicode())
The problem is that whenever I try to add a '\u' in the print statement, Python gives me the following error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
I tried raw strings but that only gives me output like '\u0070' without turning it into a unicode character. How can I properly connect the strings to create a unicode character? Any help is appreciated.

From your description:
"The problem is that whenever I try to add a '\u' in the print statement, Python gives me the following error:"
it sounds like the problem may be in code you haven't included in your question:
print('\u' + rand_unicode())
This won't do what you expect, because the '\u' is interpreted before the strings are concatenated. See "Process escape sequences in a string in Python" and try:
print(bytes('\\u' + rand_unicode(), 'us-ascii').decode('unicode_escape'))

A unicode escape sequence such as \u0070 is a single character. It is not the concatenation of \u and the ordinal.
>>> '\u0070' == 'p'
True
>>> '\u0070' == (r'\u' + '0070')
False
To convert an ordinal to a unicode character, you can pass the numerical ordinal to the chr builtin function. Use int(literal, 16) to convert a hex-literal ordinal to a numerical one:
>>> ordinal = '0070'
>>> chr(int(ordinal, 16)) # convert literal to number to unicode
'p'
>>> chr(int(rand_unicode(), 16))
'ᚈ'
Note that creating a literal ordinal is not required. You can directly create the numerical ordinal:
>>> chr(112) # convert decimal number to unicode
'p'
>>> chr(0x0070) # convert hexadecimal number to unicode
'p'
>>> chr(random.randint(0, 0x10FFF))
'嚟'
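One caveat when picking a random number over the whole range: chr() also accepts the surrogate code points U+D800 to U+DFFF, and a lone surrogate cannot be encoded as UTF-8, so printing it usually fails. A rough sketch that skips them (random_char and its max_code_point parameter are hypothetical names, and many of the remaining code points are unassigned or unprintable anyway):

```python
import random


def random_char(max_code_point=0x10FFFF):
    """Return one random character, retrying if a surrogate is drawn."""
    while True:
        cp = random.randint(0, max_code_point)
        # U+D800..U+DFFF are surrogates: valid for chr(), but not encodable.
        if not 0xD800 <= cp <= 0xDFFF:
            return chr(cp)


ch = random_char(0xFFFF)  # restrict to 16-bit code points, as in the question
print('U+{:04X}'.format(ord(ch)), repr(ch))
```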

python3 how to turn unicode codepoint into unicode char

I know this type of question is asked a lot, but no answer was able to help me with my specific setup.
I have a list of ONLY Unicode code points, in this form:
304E
304F
...
No U+XXXX, no '\XXXX' version.
I've tried to use string manipulation to recreate such strings
so I can simply print the corresponding character.
What I tried:
x = u'\\u' + listString
x = '\\u' + listString
x = '\u' + listString
The first two, when printed, just give me a '\uXXXX' string, and I have no idea
how to make it print the character instead of that string.
The last one gives me this error:
(unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
It's probably just something I don't get about Unicode and string manipulation, but I hope someone can help me out here.
Thanks in advance o/

You can use chr to get the character for a unicode code point:
>>> chr(0x304E)
'ぎ'
You can use int to convert a hexadecimal string to an integer:
>>> int('304E', 16)
12366
>>> chr(int('304E', 16))
'ぎ'
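Applied to the whole list from the question, this might look like the sketch below. It assumes the code points arrive as one bare hex string per entry, possibly with surrounding whitespace or empty lines; hex_lines_to_chars is just an illustrative name:

```python
def hex_lines_to_chars(lines):
    """Convert bare hex code points such as '304E' into characters."""
    # strip() tolerates trailing newlines; blank entries are skipped.
    return [chr(int(line.strip(), 16)) for line in lines if line.strip()]


code_points = ['304E', '304F\n', '']
print(''.join(hex_lines_to_chars(code_points)))  # ぎく
```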

How to programmatically retrieve the unicode char from hexadecimals?

Given a list of hexadecimal strings that correspond to Unicode code points, how can I programmatically retrieve the Unicode characters?
E.g. Given the list:
>>> l = ['9359', '935A', '935B']
how to achieve this list:
>>> u = [u'\u9359', u'\u935A', u'\u935B']
>>> u
['鍙', '鍚', '鍛']
I've tried this but it throws a SyntaxError:
>>> u'\u' + l[0]
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape

\uhhhh escapes are only valid in string literals; you can't use them to turn arbitrary hex values into characters. In other words, they are part of a larger syntax, and can't be used stand-alone.
Decode the hex value to an integer and pass it to the chr() function (or, on Python 2, the unichr() function):
[chr(int(v, 16)) for v in l]
You could ask Python to interpret a string containing literal \uhhhh text as a Unicode string literal with the unicode_escape codec, but that feels like overkill for individual codepoints:
[(b'\\u' + v.encode('ascii')).decode('unicode_escape') for v in l]
Note the double backslash in the prefix added, and that we have to create byte strings for this to work at all.
Demo:
>>> l = ['9359', '935A', '935B']
>>> [chr(int(v, 16)) for v in l]
['鍙', '鍚', '鍛']
>>> [(b'\\u' + v.encode('ascii')).decode('unicode_escape') for v in l]
['鍙', '鍚', '鍛']

Python and encoding, again

I have the following code snippet in Python (2.7.8) on Windows:
text1 = 'áéíóú'
text2 = text1.encode("utf-8")
and I get the following exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
Any ideas?

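You forgot to specify that you are dealing with a unicode string. In Python 2, calling .encode() on a byte str first decodes it implicitly with the ASCII codec, and that implicit decode is what fails here: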
text1 = u'áéíóú' #prefix string with "u"
text2 = text1.encode("utf-8")
In Python 3 this behavior has changed: every str is Unicode, so you don't need the prefix.
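For comparison, a small Python 3 sketch of the same round trip. The text is written with \x escapes here only so the result does not depend on how the source file is saved:

```python
text1 = '\xe1\xe9\xed\xf3\xfa'     # 'áéíóú', a str of five code points
encoded = text1.encode('utf-8')    # str -> bytes

print(type(encoded).__name__)      # bytes
print(encoded)                     # b'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'

decoded = encoded.decode('utf-8')  # bytes -> str
print(decoded == text1)            # True
```

In Python 3, bytes objects have no .encode() method at all, so the implicit ASCII decode from Python 2 cannot happen.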

I have tried the following code in Linux with Python 2.7:
>>> text1 = 'áéíóú'
>>> text1
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
>>> type(text1)
<type 'str'>
>>> text1.decode("utf-8")
u'\xe1\xe9\xed\xf3\xfa'
>>> print '\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
áéíóú
>>> print u'\xe1\xe9\xed\xf3\xfa'
áéíóú
>>> u'\xe1\xe9\xed\xf3\xfa'.encode('utf-8')
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba is the UTF-8 encoding of áéíóú, and \xe1\xe9\xed\xf3\xfa is the sequence of Unicode code points of áéíóú.
text1 is a UTF-8 encoded byte string, so it can only be turned into a unicode string by decoding it:
text1.decode("utf-8")
A unicode string can be encoded to a UTF-8 byte string:
u'\xe1\xe9\xed\xf3\xfa'.encode('utf-8')

Generating random Unicode between certain range

I am trying to generate random Unicode characters whose code points start with one of two given number+letter prefixes.
I have tried the following, but I am getting an error.
def rand_unicode():
    b = ['03','20']
    l = ''.join([random.choice('ABCDEF0123456789') for x in xrange(2)])
    return unicode(u'\u'+random.choice(b)+l,'utf8')
The error I am getting:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence
I use Python 2.6.

Yeah, uh, that's not how.
return unichr(random.choice((0x300, 0x2000)) + random.randint(0, 0xff))
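On Python 3, where unichr() no longer exists, the same idea can be sketched with chr(). This assumes the two prefixes '03' and '20' from the question, i.e. the ranges U+0300 to U+03FF and U+2000 to U+20FF:

```python
import random


def rand_unicode():
    """Return a random character from U+0300..U+03FF or U+2000..U+20FF."""
    base = random.choice((0x0300, 0x2000))       # the '03' or '20' prefix
    return chr(base + random.randint(0, 0xFF))   # plus two random hex digits


ch = rand_unicode()
print('U+{:04X}'.format(ord(ch)), repr(ch))
```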
