UTF-8 latin-1 conversion issues, python django - python

OK, so my issue is that I have the string '\222\222\223\225', which is stored as latin-1 in the DB. What I get from Django (by printing it) is the string 'ââââ¢', which I assume is the UTF conversion of it. Now I need to pass the string into a function that
does this operation:
strdecryptedPassword + chr(ord(c) - 3 - intCounter - 30)
I get this error:
chr() arg not in range(256)
If I try to encode the string as latin-1 first I get this error:
'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)
I have read a bunch on how character encoding works, and there is something I am missing because I just don't get it!

Your first error, 'chr() arg not in range(256)', probably means you have underflowed the value, because chr() cannot take negative numbers. I don't know what the encryption algorithm is supposed to do when intCounter + 33 is more than the character's ordinal value; you'll have to check what to do in that case.
About the second error: you must decode(), not encode(), a regular string object to get a proper representation of your data. encode() takes a unicode object (those starting with u') and generates a regular string to be output or written to a file. decode() takes a string object and generates a unicode object with the corresponding code points. This is what the unicode() call does when given a string object; you could also call a.decode('latin-1') instead.
>>> a = '\222\222\223\225'
>>> u = unicode(a,'latin-1')
>>> u
u'\x92\x92\x93\x95'
>>> print u.encode('utf-8')
ÂÂÂÂ
>>> print u.encode('utf-16')
ÿþ
>>> print u.encode('latin-1')
>>> for c in u:
...     print chr(ord(c) - 3 - 0 - 30)
...
q
q
r
t
>>> for c in u:
...     print chr(ord(c) - 3 - 200 - 30)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
ValueError: chr() arg not in range(256)
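The session above is Python 2. As a rough Python 3 equivalent (where bytes plays the role of the old str, and str the role of unicode), the same round trip might look like the sketch below. The counter value of 0 is an assumption, as in the session above.

```python
# Python 3 sketch of the decode/encode round trip shown above.
raw = b'\222\222\223\225'         # the stored bytes, written with octal escapes
text = raw.decode('latin-1')      # bytes -> str (code points U+0092, U+0093, U+0095)

utf8 = text.encode('utf-8')       # str -> bytes, two bytes per character here
latin1 = text.encode('latin-1')   # str -> bytes, identical to the input

# The shift from the question, with intCounter assumed to be 0.
shifted = ''.join(chr(ord(c) - 3 - 0 - 30) for c in text)

print(utf8)            # b'\xc2\x92\xc2\x92\xc2\x93\xc2\x95'
print(latin1 == raw)   # True
print(shifted)         # qqrt
```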

As Vinko notes, Latin-1 or ISO 8859-1 doesn't have printable characters for the octal string you quote. According to my notes for 8859-1, "C1 Controls (0x80 - 0x9F) are from ISO/IEC 6429:1992. It does not define names for 80, 81, or 99". The code point names are as Vinko lists them:
\222 = 0x92 => PRIVATE USE TWO
\223 = 0x93 => SET TRANSMIT STATE
\225 = 0x95 => MESSAGE WAITING
The correct UTF-8 encoding of those is (Unicode, binary, hex):
U+0092 = %11000010 %10010010 = 0xC2 0x92
U+0093 = %11000010 %10010011 = 0xC2 0x93
U+0095 = %11000010 %10010101 = 0xC2 0x95
The LATIN SMALL LETTER A WITH CIRCUMFLEX is ISO 8859-1 code 0xE2 and hence Unicode U+00E2; in UTF-8, that is %11000011 %10100010 or 0xC3 0xA2.
The CENT SIGN is ISO 8859-1 code 0xA2 and hence Unicode U+00A2; in UTF-8, that is %11000010 %10100010 or 0xC2 0xA2.
So, whatever else you are seeing, you do not seem to be seeing a UTF-8 encoding of ISO 8859-1. All else apart, you are seeing but 5 bytes where you would have to see 8.
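These byte patterns are easy to check from a Python 3 prompt. The helper name utf8_bits below is made up for this illustration; it only wraps str.encode and format.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of ch as a (binary, hex) pair of strings."""
    encoded = ch.encode('utf-8')
    binary = ' '.join(format(b, '08b') for b in encoded)
    hexa = ' '.join(format(b, '02X') for b in encoded)
    return binary, hexa

for ch in '\x92\x93\x95\xe2':
    binary, hexa = utf8_bits(ch)
    print('U+%04X = %s = %s' % (ord(ch), binary, hexa))

# Four Latin-1 characters from the C1 range take eight bytes in UTF-8.
print(len('\x92\x92\x93\x95'.encode('utf-8')))   # 8
```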
Added:
The previous part of the answer addresses the 'UTF-8 encoding' claim, but ignores the rest of the question, which says:
Now I need to pass the string into a function that does this operation:
strdecryptedPassword + chr(ord(c) - 3 - intCounter - 30)
I get this error: chr() arg not in range(256). If I try to encode the
string as Latin-1 first I get this error: 'latin-1' codec can't encode
characters in position 0-3: ordinal not in range(256).
You don't actually show us how intCounter is defined, but if it increments gently per character, sooner or later 'ord(c) - 3 - intCounter - 30' is going to be negative (and, by the way, why not combine the constants and use 'ord(c) - intCounter - 33'?), at which point, chr() is likely to complain. You would need to add 256 if the value is negative, or use a modulus operation to ensure you have a positive value between 0 and 255 to pass to chr(). Since we can't see how intCounter is incremented, we can't tell if it cycles from 0 to 255 or whether it increases monotonically. If the latter, then you need an expression such as:
chr(mod(ord(c) - mod(intCounter, 256) + 479, 256))
where 256 - 33 = 223, of course, and 479 = 256 + 223. This guarantees that the value passed to chr() is non-negative and in the range 0..255 for any input character c in the range 0..255 and any value of intCounter (and, because the mod() function never gets a negative argument, it also works regardless of how mod() behaves when its arguments are negative).
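In Python itself the wrap-around is shorter, because the % operator returns a non-negative result whenever the modulus is positive, so no offset is needed. A minimal sketch, assuming ord(c) is already in 0..255 and that wrapping modulo 256 is what the original scheme intends (the question does not say); shift_char is a made-up name:

```python
def shift_char(c, int_counter):
    """Undo the shift for one character, wrapping the result into 0..255.

    Assumes ord(c) is in 0..255; how int_counter changes from character
    to character is not shown in the question.
    """
    return chr((ord(c) - int_counter - 33) % 256)

print(shift_char('\x92', 0))          # q
print(ord(shift_char('\x92', 200)))   # 169
```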

Well, it's because it's been encrypted with some terrible scheme that just changes the ord() of each character by some amount, so the string coming out of the database has been encrypted and this decrypts it. What you supplied above does not seem to work. In the database it is latin-1 and Django converts it to unicode, but I cannot pass it to the function as unicode, and when I try to encode it to latin-1 I see that error.
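One possible explanation, offered as a guess rather than a diagnosis: bytes 0x92, 0x93 and 0x95 are printable in Windows-1252 (right single quote, left double quote, bullet), and some databases (MySQL, for one) treat their "latin1" charset as Windows-1252. If that happened here, Django would hand back code points above 255, which would account for both error messages and for the 'ââââ¢' output. A Python 3 sketch under that assumption:

```python
# Assumption: the database layer decoded the stored bytes as Windows-1252.
stored = b'\x92\x92\x93\x95'
from_db = stored.decode('cp1252')      # what Django would then return

print([hex(ord(c)) for c in from_db])  # ['0x2019', '0x2019', '0x201c', '0x2022']

try:
    from_db.encode('latin-1')          # none of these code points is below 256
except UnicodeEncodeError as exc:
    print(exc)

# Printing the UTF-8 bytes on a Latin-1 terminal would show mostly 'â'.
mojibake = from_db.encode('utf-8').decode('latin-1')
print(''.join(c for c in mojibake if c.isprintable()))   # ââââ¢

recovered = from_db.encode('cp1252')   # back to the original byte values
print(recovered == stored)             # True

# The shift from the question now sees ordinals below 256 (counter assumed 0).
print(''.join(chr(b - 33) for b in recovered))   # qqrt
```

If the assumption holds, encoding with 'cp1252' rather than 'latin-1' before calling the decryption function would be the thing to try.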

Related

Why are the byte representations for extended ASCII characters from bytes() different from chr()?

(I am working in python)
Suppose I have this list of integers
a = [170, 140, 139, 180, 225, 200]
and I want to find the raw byte representation of the ASCII character each integer is mapped to. Since these are all greater than 127, they fall in the Extended ASCII set. I was originally using the chr() method in python to get the character and then encode() to get the raw byte representation.
a_bytes = [chr(decimal).encode() for decimal in a]
Using this method, I saw that for numbers greater than 127, the corresponding ASCII character is represented by 2 bytes.
[b'\xc2\xaa', b'\xc2\x8c', b'\xc2\x8b', b'\xc2\xb4', b'\xc3\xa1', b'\xc3\x88']
But when I used the bytes() method, it appears that each character had one byte.
a_bytes2 = bytes(a)
>>> b'\xaa\x8c\x8b\xb4\xe1\xc8'
So why is it different when I use chr().encode() versus bytes()?
There is no such thing as "Extended ASCII". ASCII is defined as bytes (and code points) in the range 0-127. Most standard single-byte code pages (which are used to convert from bytes to code points) use ASCII for bytes 0-127 and then map 128-255 to whatever is convenient for the code page. Russian code pages map those bytes to Cyrillic code points for example.
In your example, .encode() defaults to the multi-byte UTF-8 encoding, which encodes code points 0-127 as single ASCII bytes and follows multi-byte encoding rules for any code point above 127. chr() converts an integer to its corresponding, fixed Unicode code point.
So you have to choose an appropriate encoding to see what a byte in that encoding represents as a character. As you can see below, it varies:
>>> a = [170, 140, 139, 180, 225, 200]
>>> ''.join(chr(x) for x in a) # Unicode code points
'ª\x8c\x8b´áÈ'
>>> bytes(a).decode('latin1') # ISO-8859-1, also matches first 256 Unicode code points.
'ª\x8c\x8b´áÈ'
>>> bytes(a).decode('cp1252') # USA and Western Europe
'ªŒ‹´áÈ'
>>> bytes(a).decode('cp1251') # Russian, Serbian, Bulgarian, ...
'ЄЊ‹ґбИ'
>>> bytes(a).decode('cp1250') # Central and Eastern Europe
'ŞŚ‹´áČ'
>>> bytes(a).decode('ascii') # these bytes aren't defined for ASCII
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xaa in position 0: ordinal not in range(128)
Also, when displaying bytes the Python default is to display printable ASCII characters as characters and anything else (unprintable control characters and >127 values) as escape codes:
>>> bytes([0,1,2,97,98,99,49,50,51,170,140,139,180,225,200])
b'\x00\x01\x02abc123\xaa\x8c\x8b\xb4\xe1\xc8'
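To make the difference concrete: if chr() is paired with a single-byte encoding such as latin-1 instead of the UTF-8 default, it produces the same bytes as bytes(). A short sketch using the list from the question:

```python
a = [170, 140, 139, 180, 225, 200]

utf8_bytes = [chr(n).encode() for n in a]              # default encoding is UTF-8
latin1_bytes = [chr(n).encode('latin-1') for n in a]   # one byte per code point

print(utf8_bytes[0])      # b'\xc2\xaa'
print(latin1_bytes[0])    # b'\xaa'
print(b''.join(latin1_bytes) == bytes(a))   # True
```

This works for latin-1 specifically because its 256 byte values map one-to-one onto the first 256 Unicode code points.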

Trying To Create Table To Braille In Python

I've been trying to create a system which turns a table of 1's and 0's into a Braille character, but it keeps giving me this error:
File "brail.py", line 16
stringToWrite=u"\u"+brail([1,1,1,0,0,0,1,1])
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
My current code is:
def brail(brailList):
    if len(brailList) == 8:
        brailList.reverse()
        brailHelperList=[0x80,0x40,0x20,0x10,0x8,0x4,0x2,0x1]
        brailNum=0x0
        for num in range(len(brailList)):
            if brailList[num] == 1:
                brailNum+=brailHelperList[num]
        stringToReturn="28"+str(hex(brailNum))[2:len(str(hex(brailNum)))]
        return stringToReturn
    else:
        return "String Needs To Be 8 In Length"
fileWrite=open('Write.txt','w',encoding="utf-8")
stringToWrite=u"\u"+brail([1,1,1,0,0,0,1,1])
fileWrite.write(stringToWrite)
fileWrite.close()
It works when I do fileWrite.write(u"\u28c7"), but when I call a function which should return that exact same thing, it errors.
\u is the unicode escape sequence for Python literal strings. A 4 hex digit unicode code point is expected to follow the escape sequence. It is a syntax error if the code point is missing or is too short.
>>> '\u28c7'
'⣇'
>>> '\u'
File "<stdin>", line 1
'\u'
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
If you are using Python 3 then the u string prefix is not required as strings are stored as unicode internally. The u prefix was maintained for compatibility with Python 2 code.
That's the cause of the exception, however, you don't need to construct the unicode code point like that. You can use the ord() and chr() functions:
from unicodedata import lookup
braille_start = ord(lookup('BRAILLE PATTERN BLANK'))  # U+2800
# ...and as the last line of brail(), instead of building a hex string:
    return chr(braille_start + brailNum)
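Put together, a complete version might look like the sketch below. The name braille_char and the choice to raise ValueError are illustrative, not taken from the question; the bit order (first list element is the lowest bit) matches what brail() computes after its reverse().

```python
from unicodedata import lookup

BRAILLE_START = ord(lookup('BRAILLE PATTERN BLANK'))   # 0x2800

def braille_char(bits):
    """Map a list of eight 0/1 values to a Braille pattern character.

    bits[0] is the lowest bit of the offset from U+2800, bits[7] the highest.
    """
    if len(bits) != 8:
        raise ValueError('expected exactly 8 bits')
    offset = sum(bit << position for position, bit in enumerate(bits))
    return chr(BRAILLE_START + offset)

print(braille_char([1, 1, 1, 0, 0, 0, 1, 1]) == '\u28c7')   # True
```

Working with integers also sidesteps a pitfall of the hex-string approach: an offset below 0x10 has only one hex digit, so "28" plus that digit would not form a four-digit code point.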
You can rewrite
stringToWrite=u"\u"+brail([1,1,1,0,0,0,1,1])
as
stringToWrite="\\u{0}".format(brail([1, 1, 1, 0, 0, 0, 1, 1]))
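All strings are unicode in Python 3, so you don't need the leading "u". Note, however, that this only avoids the syntax error: it writes the six literal characters \u28c7 to the file rather than the Braille character itself.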
def braille(brailleString):
    brailleList = []
    brailleList[:0]=brailleString
    if len(brailleList) > 8:
        brailleList=brailleList[0:8]
    if len(brailleList) < 8:
        while len(brailleList) < 8:
            brailleList.append('0')
    brailleList1=[
        int(brailleList[0]),
        int(brailleList[1]),
        int(brailleList[2]),
        int(brailleList[4]),
        int(brailleList[5]),
        int(brailleList[6]),
        int(brailleList[3]),
        int(brailleList[7]),
    ]
    brailleList1.reverse()
    brailleHelperList=[128,64,32,16,8,4,2,1]
    brailleNum=0
    for num in range(len(brailleList1)):
        if brailleList1[num] == 1:
            brailleNum+=brailleHelperList[num]
    brailleStart = 10240
    return chr(brailleStart+brailleNum)
fileWrite=open('Write.txt','w',encoding="utf-16")
fileWrite.write(braille('11111111'))
fileWrite.close()
# Think of the braille function's string as having a separator in the middle, with the 1s and 0s running vertically down each column

joining strings together to make a unicode character

I am trying to create a random Unicode generator and made a function that can create 16-bit Unicode characters. This is my code:
import random
import string
def rand_unicode():
    list = []
    list.append(str(random.randint(0,1)))
    for i in range(0,3):
        if random.randint(0,1):
            list.append(string.ascii_letters[random.randint(0, \
                len(string.ascii_letters))-1].upper())
        else:
            list.append(str(random.randint(0,9)))
    return ''.join(list)
print(rand_unicode())
The problem is that whenever I try to add a '\u' in the print statement, Python gives me the following error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
I tried raw strings but that only gives me output like '\u0070' without turning it into a unicode character. How can I properly connect the strings to create a unicode character? Any help is appreciated.
From:
The problem is that whenever I try to add a '\u' in the print statement, Python gives me the following error:
it sounds like the problem may be in code you haven't included in your question:
print('\u' + rand_unicode())
This won't do what you expect, because the '\u' is interpreted before the strings are concatenated. See Process escape sequences in a string in Python and try:
print(bytes('\\u' + rand_unicode(), 'us-ascii').decode('unicode_escape'))
A unicode escape sequence such as \u0070 is a single character. It is not the concatenation of \u and the ordinal.
>>> '\u0070' == 'p'
True
>>> '\u0070' == (r'\u' + '0070')
False
To convert an ordinal to a unicode character, you can pass the numerical ordinal to the chr builtin function. Use int(literal, 16) to convert a hex-literal ordinal to a numerical one:
>>> ordinal = '0070'
>>> chr(int(ordinal, 16)) # convert literal to number to unicode
'p'
>>> chr(int(rand_unicode(), 16))
'ᚈ'
Note that creating a literal ordinal is not required. You can directly create the numerical ordinal:
>>> chr(112) # convert decimal number to unicode
'p'
>>> chr(0x0070) # convert hexadecimal number to unicode
'p'
>>> chr(random.randint(0, 0x10FFF))
'嚟'
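One caveat when feeding rand_unicode() into int(..., 16): string.ascii_letters contains letters beyond F, which are not hex digits, so the conversion can raise ValueError. A sketch that avoids the string step entirely and picks a code point directly is shown below. The function names are made up for illustration, and covering the whole 16-bit range (rather than the narrower range the question's first digit implies) is an assumption.

```python
import random

def rand_bmp_char(rng=random):
    """Return one random character from U+0000..U+FFFF, skipping surrogates."""
    while True:
        code_point = rng.randint(0, 0xFFFF)
        # U+D800..U+DFFF are surrogate code points, not characters.
        if not 0xD800 <= code_point <= 0xDFFF:
            return chr(code_point)

def rand_hex4(rng=random):
    """Return a random 4-digit string made only of valid hex digits."""
    return ''.join(rng.choice('0123456789ABCDEF') for _ in range(4))

ch = rand_bmp_char()
print('U+%04X' % ord(ch))
print(rand_hex4())
```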

python3 how to turn unicode codepoint into unicode char

I know this kind of question is asked a lot, but no answer was able to help me with my specific setup.
I have a list of ONLY Unicode code points, in this form:
304E
304F
...
No U+XXXX, no '\XXXX' version.
Now I've tried to use string manipulation to recreate such strings
so I can simply print the corresponding character.
What I tried:
x = u'\\u' + listString
x = '\\u' + listString
x = '\u' + listString
The first two, when printed, just give me a '\uXXXX' string, and I have no idea
how to make it print the character rather than that string.
The last one gives me this error:
(unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
It's probably just something I don't get about Unicode and string manipulation, but I hope someone can help me out here.
Thanks in advance o/
You can use chr to get the character for a unicode code point:
>>> chr(0x304E)
'ぎ'
You can use int to convert a hexadecimal string to an integer:
>>> int('304E', 16)
12366
>>> chr(int('304E', 16))
'ぎ'
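Applied to a whole list, this might look like the following sketch. The two sample values are the ones shown in the question; the variable names are illustrative.

```python
code_points = ['304E', '304F']   # sample entries in the form from the question

# Parse each hex string as an integer, then map it to its character.
chars = [chr(int(cp, 16)) for cp in code_points]

print(''.join(chars))
```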

How can I convert utf8 code number to unicode code number in Python3

I want to generate a list of all UTF-8 characters.
I wrote the code below, but it didn't work well.
I think that is because chr() expects a Unicode code point, but I gave it a UTF-8 code number.
I think I have to convert the UTF-8 code number to a Unicode code point, but I don't know how.
How can I do that? Or do you know a better way?
def utf8_2byte():
    characters = []
    # first byte range: [C2-DF]
    for first in range(0xC2, 0xDF + 1):
        # second byte range: [80-BF]
        for second in range(0x80, 0xBF + 1):
            num = (first << 8) + second
            line = [hex(num), chr(num)]
            characters.append(line)
    return characters
I expect:
# UTF8 code number, UTF8 character
[0xc380,À]
[0xc381,Á]
[0xc382,Â]
actually:
[0xc380,쎀]
[0xc381,쎁]
[0xc382,쎂]
In python 3, chr takes unicode codepoints, not utf-8. U+C380 is in the Hangul range. Instead you can use bytearray for the decode
>>> bytearray((0xc3, 0x80)).decode('utf-8')
'À'
There are other methods too, such as struct or ctypes. Anything that assembles the integers into a bytes-like object that can then be decoded will do.
Unicode is a character set, while UTF-8 is an encoding: an algorithm that converts Unicode code points to bytes at the machine level, and vice versa.
The code point 0xC380 is 쎀 in the Unicode standard.
The bytes 0xC3 0x80 are À when you decode them as UTF-8.
>>> s = "쎀"
>>> hex(ord(s))
'0xc380'
>>> b = bytes.fromhex("C3 80")
>>> b
b'\xc3\x80'
>>> b.decode("utf8")
'À'
>>> bytes((0xc3, 0x80)).decode("utf8")
'À'
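Applied to the function from the question, the change is small: build a two-byte sequence and decode it, rather than passing the combined number to chr(). A sketch that keeps the question's name and output shape:

```python
def utf8_2byte():
    """List every two-byte UTF-8 sequence with the character it decodes to."""
    characters = []
    for first in range(0xC2, 0xDF + 1):        # lead byte range [C2-DF]
        for second in range(0x80, 0xBF + 1):   # continuation byte range [80-BF]
            char = bytes((first, second)).decode('utf-8')
            num = (first << 8) + second        # the "UTF-8 code number"
            characters.append([hex(num), char])
    return characters

chars = utf8_2byte()
print(len(chars))   # 1920
print(chars[64])    # ['0xc380', 'À']
```

The 1920 entries correspond to code points U+0080 through U+07FF, which is exactly the range that UTF-8 encodes in two bytes.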
