How to turn unicode character into \Uxxxxxxxx format in python 3 - python

I have an unicode character like 🏆 and I want to get back the \Uxxxxxxxx format. But until now, couldn't find an easy way. Already tried:
text = 🏆
text.encode('utf-32').decode('utf-8')
returns error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
text.encode('utf-32').decode('unicode-escape')
returns ÿþ
How to make it return \U000XXXXX ? I know I can get the character from \U000XXXXX making:
string = "foo bar foo \U000XXXXX"
string.encode('utf-8').decode('unicode-escape')
returns "foo bar foo 🏆"

For a byte string:
>>> text = '🏆'
>>> text.encode('unicode-escape')
b'\\U0001f3c6'
for a Unicode string:
>>> text.encode('unicode-escape').decode('ascii')
'\\U0001f3c6'

Related

joining strings together to make a unicode character

I am trying to create a random unicode generator and made a function that can create 16bit unicode charaters. This is my code:
import random
import string
def rand_unicode():
list = []
list.append(str(random.randint(0,1)))
for i in range(0,3):
if random.randint(0,1):
list.append(string.ascii_letters[random.randint(0, \
len(string.ascii_letters))-1].upper())
else:
list.append(str(random.randint(0,9)))
return ''.join(list)
print(rand_unicode())
The problem is that whenever I try to add a '\u' in the print statement, Python gives me the following error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
I tried raw strings but that only gives me output like '\u0070' without turning it into a unicode character. How can I properly connect the strings to create a unicode character? Any help is appreciated.
From:
The problem is that whenever I try to add a '\u' in the print statement, Python gives me the following error:
it sounds like the problem may be in code you haven't included in your question:
print('\u' + rand_unicode())
This won't do what you expect, because the '\u' is interpreted before the strings are concatenated. See Process escape sequences in a string in Python and try:
print(bytes('\\u' + rand_unicode(), 'us-ascii').decode('unicode_escape'))
A unicode escape sequence such as \u0070 is a single character. It is not the concatenation of \u and the ordinal.
>>> '\u0070' == 'p'
True
>>> '\u0070' == (r'\u' + '0070')
False
To convert an ordinal to a unicode character, you can pass the numerical ordinal to the chr builtin function. Use int(literal, 16) to convert a hex-literal ordinal to a numerical one:
>>> ordinal = '0070'
>>> chr(int(ordinal, 16)) # convert literal to number to unicode
'p'
>>> chr(int(rand_unicode(), 16))
'ᚈ'
Note that creating a literal ordinal is not required. You can directly create the numerical ordinal:
>>> chr(112) # convert decimal number to unicode
'p'
>>> chr(0x0070) # convert hexadecimal number to unicode
'p'
>>> chr(random.randint(0, 0x10FFF))
'嚟'

python3 how to turn unicode codepoint into unicode char

i know this type is asked alot but no answer was able to specifically help me with my problemsetup.
i have a list of ONLY Unicode codepoints so in this form:
304E
304F
...
No U+XXXX no '\XXXX' version.
Now i've tried to use stringmanipulation to recreate such strings
so i can simply print the corresponding unichar.
what i tried:
x = u'\\u' + listString
x = '\\u' + listString
x = '\u' + listString
the first 2 when printed just give me a '\uXXXX' string, but no idea
how to make it print the char not that string.
the last one gives me this error:
(unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
probably just something i dont get about unicode and stringmanipulation but i hope someone can help me out here.
Thanks in advance o/
You can use chr to get the character for a unicode code point:
>>> chr(0x304E)
'ぎ'
You can use int to convert a hexadecimal string to an integer:
>>> int('304E', 16)
12366
>>> chr(int('304E', 16))
'ぎ'

How to programmatically retrieve the unicode char from hexademicals?

Given a list of hexadecimals that corresponds to the unicode, how to programmatically retrieve the unicode char?
E.g. Given the list:
>>> l = ['9359', '935A', '935B']
how to achieve this list:
>>> u = [u'\u9359', u'\u935A', u'\u935B']
>>> u
['鍙', '鍚', '鍛']
I've tried this but it throws a SyntaxError:
>>> u'\u' + l[0]
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
\uhhhh escapes are only valid in string literals, you can't use those to turn arbitrary hex values into characters. In other words, they are part of a larger syntax, and can't be used stand-alone.
Decode the hex value to an integer and pass it to the chr() function (or, on Python 2, the unichr() function):
[chr(int(v, 16)) for v in l] #
You could ask Python to interpret a string containing literal \uhhhh text as a Unicode string literal with the unicode_escape codec, but feels like overkill for individual codepoints:
[(b'\\u' + v.encode('ascii')).decode('unicode_escape') for v in l]
Note the double backslash in the prefix added, and that we have to create byte strings for this to work at all.
Demo:
>>> l = ['9359', '935A', '935B']
>>> [chr(int(v, 16)) for v in l]
['鍙', '鍚', '鍛']
>>> [(b'\\u' + v.encode('ascii')).decode('unicode_escape') for v in l]
['鍙', '鍚', '鍛']

Python and encoding, again

I have the next code snippet in Python (2.7.8) on Windows:
text1 = 'áéíóú'
text2 = text1.encode("utf-8")
and i have the next error exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
Any ideas?
You forgot to specify that you are dealing with a unicode string:
text1 = u'áéíóú' #prefix string with "u"
text2 = text1.encode("utf-8")
In python 3 this behavior has changed, and any string is unicode, so you don't need to specify it.
I have tried the following code in Linux with Python 2.7:
>>> text1 = 'áéíóú'
>>> text1
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
>>> type(text1)
<type 'str'>
>>> text1.decode("utf-8")
u'\xe1\xe9\xed\xf3\xfa'
>>> print '\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
áéíóú
>>> print u'\xe1\xe9\xed\xf3\xfa'
áéíóú
>>> u'\xe1\xe9\xed\xf3\xfa'.encode('utf-8')
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba is the utf-8 coding of áéíóú. And \xe1\xe9\xed\xf3\xfa is the unicode coding of áéíóú.
text1 is encoded by utf-8, it only can be decoded to unicode by:
text1.decode("utf-8")
an unicode string can be encoded to an utf-8 string:
u'\xe1\xe9\xed\xf3\xfa'.encode('utf-8')

Generating random Unicode between certain range

I am trying to generate random Unicode characters with two starting number+letter combination..
I have tried the following below but I am getting an error.
def rand_unicode():
b = ['03','20']
l = ''.join([random.choice('ABCDEF0123456789') for x in xrange(2)])
return unicode(u'\u'+random.choice(b)+l,'utf8')
The error I am getting:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence
I use Python 2.6.
Yeah, uh, that's not how.
return unichr(random.choice((0x300, 0x2000)) + random.randint(0, 0xff))

Categories