Python 3: how to turn a Unicode code point into a Unicode char

I know this kind of question is asked a lot, but no answer specifically helped with my setup.
I have a list of bare Unicode code points only, in this form:
304E
304F
...
No U+XXXX and no '\XXXX' form.
I've tried using string manipulation to build such strings
so I can simply print the corresponding Unicode character.
What I tried:
x = u'\\u' + listString
x = '\\u' + listString
x = '\u' + listString
The first two, when printed, just give me a '\uXXXX' string, and I have no idea
how to make it print the character instead of that string.
The last one gives me this error:
(unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
This is probably just something I don't understand about Unicode and string manipulation, but I hope someone can help me out here.
Thanks in advance o/

You can use chr to get the character for a unicode code point:
>>> chr(0x304E)
'ぎ'
You can use int to convert a hexadecimal string to an integer:
>>> int('304E', 16)
12366
>>> chr(int('304E', 16))
'ぎ'
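Applied to the whole list from the question, the two calls combine into a short comprehension. This is a minimal sketch, assuming the list holds bare hex strings like the ones shown; the name codepoints_to_chars is made up for illustration:

```python
def codepoints_to_chars(hex_codes):
    """Convert bare hex code point strings (e.g. '304E') to characters."""
    return [chr(int(code, 16)) for code in hex_codes]

chars = codepoints_to_chars(['304E', '304F'])
print(''.join(chars))
```

int() raises ValueError for anything that is not valid hex, so a stray '...' entry in the list would need to be filtered out first.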

Related

How can I get the string of 5 digit hexadecimal in Python?

I have an integer that I converted to hexadecimal as follows:
int_N = 193402
hex_value = hex(int_N)
it gives me the following hex: 0x2f37a.
I want to convert the hexadecimal to string.
I tried this:
bytes.fromhex(hex_value[2:]).decode('ASCII')
# [2:] to get rid of the 0x
however, it gives me this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 1: ordinal not in range(128)
Then I tried with decode('utf-8') instead of ASCII, but it gave me this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 1: invalid start byte
Any suggestions on how to fix that? Why is it not converting the hexadecimal '0x2f37a' to a string?
After reading some documentation, I assume the hexadecimal needs an even number of digits in order to convert to a string, but I wasn't able to make it even, since I'm using hex() and that is the value it gave me.
Thanks and really appreciate any help!
You should look at the struct and binascii module.
import struct
import binascii
int_N = 193402
s = struct.Struct(">l")
val = s.pack(int_N)
output = binascii.hexlify(val)
print(output)  # b'0002f37a' (hexlify returns bytes on Python 3)
Find out more about C-type packing at PyMOTW-3.
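If struct feels heavy for this, the built-in int.to_bytes gives the same four big-endian bytes for a non-negative value like this one. A small sketch; the width of 4 bytes is my assumption, chosen to match the ">l" format above:

```python
int_N = 193402

# 4 bytes, big-endian: the same layout struct's ">l" produces for this value
raw = int_N.to_bytes(4, byteorder='big')
hex_text = raw.hex()  # a str, not bytes

print(raw)       # b'\x00\x02\xf3z'
print(hex_text)  # 0002f37a
```

Note that to_bytes raises OverflowError if the value does not fit in the requested width.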
If you simply want to convert it to a string, no other requirements, then this works (I'll pad it to 8 characters here):
int_N = 193402
s = hex(int_N)[2:].rjust(8, '0') # get rid of '0x' and pad to 8 characters
print(s, type(s))
Output:
0002f37a <class 'str'>
...proving that it's a string type. If you're interested in getting the individual bytes, then something like this will demonstrate:
for b in bytes.fromhex(s):
    print(b, type(b))
Output:
0 <class 'int'>
2 <class 'int'>
243 <class 'int'>
122 <class 'int'>
... showing all four bytes (from eight hex digits) and proving they're integers. The key here is an even number of hex digits (I chose 8) so that fromhex() can decode it. An odd number of digits will give a ValueError.
Now you can either use the string or the bytes as you please.
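If you don't want a fixed width of 8, you can pad only when needed. A rough sketch, assuming a non-negative integer; even_hex is a made-up helper name:

```python
def even_hex(n):
    """Return n as lowercase hex digits, left-padded to an even length."""
    digits = format(n, 'x')
    if len(digits) % 2:
        digits = '0' + digits
    return digits

s = even_hex(193402)
print(s)                 # 02f37a
print(bytes.fromhex(s))  # b'\x02\xf3z'
```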
Format numbers the way you like with f-strings (format strings). Here are examples of various forms of hexadecimal and binary:
>>> n=193402
>>> f'{n:x} {n:08x} {n:#x} {n:020b}'
'2f37a 0002f37a 0x2f37a 00101111001101111010'
See Format Specification Mini-Language.

joining strings together to make a unicode character

I am trying to create a random Unicode generator and made a function that creates 16-bit Unicode characters. This is my code:
import random
import string
def rand_unicode():
    list = []
    list.append(str(random.randint(0,1)))
    for i in range(0,3):
        if random.randint(0,1):
            list.append(string.ascii_letters[random.randint(0, \
                len(string.ascii_letters))-1].upper())
        else:
            list.append(str(random.randint(0,9)))
    return ''.join(list)
print(rand_unicode())
The problem is that whenever I try to add a '\u' in the print statement, Python gives me the following error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
I tried raw strings but that only gives me output like '\u0070' without turning it into a unicode character. How can I properly connect the strings to create a unicode character? Any help is appreciated.
From:
The problem is that whenever I try to add a '\u' in the print statement, Python gives me the following error:
it sounds like the problem may be in code you haven't included in your question:
print('\u' + rand_unicode())
This won't do what you expect, because the '\u' is interpreted before the strings are concatenated. See Process escape sequences in a string in Python and try:
print(bytes('\\u' + rand_unicode(), 'us-ascii').decode('unicode_escape'))
A unicode escape sequence such as \u0070 is a single character. It is not the concatenation of \u and the ordinal.
>>> '\u0070' == 'p'
True
>>> '\u0070' == (r'\u' + '0070')
False
To convert an ordinal to a unicode character, you can pass the numerical ordinal to the chr builtin function. Use int(literal, 16) to convert a hex-literal ordinal to a numerical one:
>>> ordinal = '0070'
>>> chr(int(ordinal, 16)) # convert literal to number to unicode
'p'
>>> chr(int(rand_unicode(), 16))
'ᚈ'
Note that creating a literal ordinal is not required. You can directly create the numerical ordinal:
>>> chr(112) # convert decimal number to unicode
'p'
>>> chr(0x0070) # convert hexadecimal number to unicode
'p'
>>> chr(random.randint(0, 0x10FFF))
'嚟'
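For a random 16-bit character specifically, one possible sketch is below. It assumes you want to skip the surrogate range U+D800..U+DFFF, since those code points are not characters on their own; random_bmp_char is a name I made up:

```python
import random

def random_bmp_char(rng=random):
    """Pick a random code point in U+0000..U+FFFF, skipping surrogates."""
    while True:
        cp = rng.randint(0, 0xFFFF)
        if not 0xD800 <= cp <= 0xDFFF:
            return chr(cp)

ch = random_bmp_char()
print(f'U+{ord(ch):04X}', repr(ch))
```

This can still return control characters or unassigned code points, so the printed result will not always be a visible glyph.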

How to programmatically retrieve the unicode char from hexadecimals?

Given a list of hexadecimal strings that correspond to Unicode code points, how can I programmatically retrieve the Unicode characters?
E.g. Given the list:
>>> l = ['9359', '935A', '935B']
how to achieve this list:
>>> u = [u'\u9359', u'\u935A', u'\u935B']
>>> u
['鍙', '鍚', '鍛']
I've tried this but it throws a SyntaxError:
>>> u'\u' + l[0]
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
\uhhhh escapes are only valid in string literals, you can't use those to turn arbitrary hex values into characters. In other words, they are part of a larger syntax, and can't be used stand-alone.
Decode the hex value to an integer and pass it to the chr() function (or, on Python 2, the unichr() function):
[chr(int(v, 16)) for v in l]
You could ask Python to interpret a string containing literal \uhhhh text as a Unicode string literal with the unicode_escape codec, but that feels like overkill for individual codepoints:
[(b'\\u' + v.encode('ascii')).decode('unicode_escape') for v in l]
Note the double backslash in the prefix added, and that we have to create byte strings for this to work at all.
Demo:
>>> l = ['9359', '935A', '935B']
>>> [chr(int(v, 16)) for v in l]
['鍙', '鍚', '鍛']
>>> [(b'\\u' + v.encode('ascii')).decode('unicode_escape') for v in l]
['鍙', '鍚', '鍛']
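As a quick sanity check, the conversion can be reversed with ord() and a format spec. This is only a sketch to confirm the round trip; it assumes the input uses upper-case, four-digit hex like the list above:

```python
l = ['9359', '935A', '935B']
chars = [chr(int(v, 16)) for v in l]

# ord() gives the code point back; '04X' formats it as upper-case hex
round_trip = [format(ord(c), '04X') for c in chars]
print(round_trip == l)  # True
```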

Remove bad character "\xC2" python string

I have the following code:
string_msg = '\x80\x01\x00\x00\x00\x00\x53\x58\x00\x1C\x00\x00\x00\x08\x00\x01\x00\x74\x00\x00\x00\x0A\x00\x54\x00\x00\x00\x03'
print(string_msg)
if sys.version < '3':
    print(":".join("{:02x}".format(ord(c)) for c in string_msg))
else:
    print(":".join("{:02x}".format(c) for c in string_msg.encode()))
In python 2, the result is:
80:01:00:00:00:00:53:58:00:1c:00:00:00:08:00:01:00:74:00:00:00:0a:00:54:00:00:00:03
But in python 3, the result is:
c2:80:01:00:00:00:00:53:58:00:1c:00:00:00:08:00:01:00:74:00:00:00:0a:00:54:00:00:00:03
Right now I need to run this code in Python 3, so I want to remove the leading "c2" byte. I tried several snippets I found on this forum, such as:
string_msg = string_msg[1:]
string_msg.replace('\xC2', '')
string_msg = ''.join([i if ord(i) < 130 else '' for i in string_msg])
The result is always the same:
01:00:00:00:00:53:58:00:1c:00:00:00:08:00:01:00:74:00:00:00:0a:00:54:00:00:00:03
That also removes the second byte, 80. So my question is: how can I remove just the first byte, c2, and why is the second byte also removed when I try?
The issue is that string_msg is a bytestring on Python 2 and despite looking the same it is a Unicode string on Python 3 -- a byte b'\x80' is a completely different concept from a Unicode codepoint u'\x80': the same Unicode codepoint can be represented using different bytes in different encodings and vice versa the same byte may represent different characters in different encodings.
If string_msg is a sequence of bytes then use b'' literal:
data = b'\x80\x01\x00\x00\x00\x00\x53\x58\x00\x1C\x00\x00\x00\x08'
print(":".join(map("{:02x}".format, bytearray(data))))
# -> 80:01:00:00:00:00:53:58:00:1c:00:00:00:08
You can convert text made up of the first 256 code points to the corresponding byte values by encoding it as ISO 8859-1 (latin-1).
>>> '\x80'.encode('latin-1')
b'\x80'
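To see where the extra c2 comes from, here is a small Python 3 sketch comparing the two encodings on a shortened version of the message. The helper name hex_dump is mine, not from the question:

```python
def hex_dump(data):
    """Join byte values as colon-separated two-digit hex."""
    return ":".join("{:02x}".format(b) for b in data)

text = '\x80\x01\x00'  # a str (Unicode) in Python 3

print(hex_dump(text.encode('utf-8')))    # c2:80:01:00
print(hex_dump(text.encode('latin-1')))  # 80:01:00
```

The c2 is not a character in the string; it is the first of the two bytes UTF-8 uses for U+0080. That is why deleting the first character of the string removes the 80 as well.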

UTF-8 latin-1 conversion issues, python django

OK, so my issue is that I have the string '\222\222\223\225', which is stored as latin-1 in the DB. What I get from Django (by printing it) is the following string, 'ââââ¢', which I assume is the UTF conversion of it. Now I need to pass the string into a function that does this operation:
strdecryptedPassword + chr(ord(c) - 3 - intCounter - 30)
I get this error:
chr() arg not in range(256)
If I try to encode the string as latin-1 first I get this error:
'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)
I have read a bunch on how character encoding works, and there is something I am missing because I just don't get it!
Your first error, 'chr() arg not in range(256)', probably means you have underflowed the value, because chr cannot take negative numbers. I don't know what the encryption algorithm is supposed to do when intCounter + 33 is more than the actual character value; you'll have to check what to do in that case.
About the second error: you must decode(), not encode(), a regular (Python 2) string object to get a proper representation of your data. encode() takes a unicode object (those starting with u') and generates a regular string to be output or written to a file. decode() takes a string object and generates a unicode object with the corresponding code points. The unicode() call does this when given a string object; you could also call a.decode('latin-1') instead.
>>> a = '\222\222\223\225'
>>> u = unicode(a,'latin-1')
>>> u
u'\x92\x92\x93\x95'
>>> print u.encode('utf-8')
ÂÂÂÂ
>>> print u.encode('utf-16')
ÿþ
>>> print u.encode('latin-1')
>>> for c in u:
...     print chr(ord(c) - 3 - 0 - 30)
...
q
q
r
t
>>> for c in u:
...     print chr(ord(c) - 3 - 200 - 30)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
ValueError: chr() arg not in range(256)
As Vinko notes, Latin-1 or ISO 8859-1 doesn't have printable characters for the octal string you quote. According to my notes for 8859-1, "C1 Controls (0x80 - 0x9F) are from ISO/IEC 6429:1992. It does not define names for 80, 81, or 99". The code point names are as Vinko lists them:
\222 = 0x92 => PRIVATE USE TWO
\223 = 0x93 => SET TRANSMIT STATE
\225 = 0x95 => MESSAGE WAITING
The correct UTF-8 encoding of those is (Unicode, binary, hex):
U+0092 = %11000010 %10010010 = 0xC2 0x92
U+0093 = %11000010 %10010011 = 0xC2 0x93
U+0095 = %11000010 %10010101 = 0xC2 0x95
The LATIN SMALL LETTER A WITH CIRCUMFLEX is ISO 8859-1 code 0xE2 and hence Unicode U+00E2; in UTF-8, that is %11000011 %10100010 or 0xC3 0xA2.
The CENT SIGN is ISO 8859-1 code 0xA2 and hence Unicode U+00A2; in UTF-8, that is %11000010 %10100010 or 0xC2 0xA2.
So, whatever else you are seeing, you do not seem to be seeing a UTF-8 encoding of ISO 8859-1. All else apart, you are seeing but 5 bytes where you would have to see 8.
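The byte values for the C1 controls and for U+00E2 are easy to confirm on Python 3 (the session earlier in this thread is Python 2). A quick sketch; utf8_hex is just an illustrative helper:

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of one character as space-separated hex."""
    return ' '.join(f'0x{b:02X}' for b in ch.encode('utf-8'))

for cp in (0x92, 0x93, 0x95, 0xE2):
    print(f'U+{cp:04X} = {utf8_hex(chr(cp))}')
```

Each of these code points takes two bytes in UTF-8, which is where the count of 8 bytes for the four-character string comes from.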
Added:
The previous part of the answer addresses the 'UTF-8 encoding' claim, but ignores the rest of the question, which says:
Now I need to pass the string into a function that does this operation:
strdecryptedPassword + chr(ord(c) - 3 - intCounter - 30)
I get this error: chr() arg not in range(256). If I try to encode the
string as Latin-1 first I get this error: 'latin-1' codec can't encode
characters in position 0-3: ordinal not in range(256).
You don't actually show us how intCounter is defined, but if it increments gently per character, sooner or later 'ord(c) - 3 - intCounter - 30' is going to be negative (and, by the way, why not combine the constants and use 'ord(c) - intCounter - 33'?), at which point, chr() is likely to complain. You would need to add 256 if the value is negative, or use a modulus operation to ensure you have a positive value between 0 and 255 to pass to chr(). Since we can't see how intCounter is incremented, we can't tell if it cycles from 0 to 255 or whether it increases monotonically. If the latter, then you need an expression such as:
chr(mod(ord(c) - mod(intCounter, 256) + 479, 256))
where 256 - 33 = 223, of course, and 479 = 256 + 223. This guarantees that the value passed to chr() is positive and in the range 0..255 for any input character c and any value of intCounter (and, because the mod() function never gets a negative argument, it also works regardless of how mod() behaves when its arguments are negative).
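In Python itself the % operator already returns a non-negative result for a positive modulus, so the extra offset is not needed there. A hedged sketch of the same idea, using modulus 256 so the result covers 0..255; whether wrap-around is what the original encryption scheme intended is an assumption, and shift_char is a made-up name:

```python
def shift_char(c, int_counter):
    """Undo the shift, wrapping into 0..255 instead of going negative."""
    return chr((ord(c) - int_counter - 33) % 256)

print(shift_char('\x92', 0))         # q
print(ord(shift_char('\x92', 200)))  # 169
```

This only makes sense if every input character has a code point below 256, which should hold for text that was stored as Latin-1.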
Well, it's because it's been encrypted with some terrible scheme that just changes the ord() of the character by some request, so the string coming out of the database has been encrypted and this decrypts it. What you supplied above does not seem to work. In the database it is latin-1; Django converts it to unicode, but I cannot pass it to the function as unicode, and when I try to encode it to latin-1 I see that error.
