Can't decode an improperly encoded string with à character - python

I'm trying to convert this:
"LIAISONS Ã NEW YORK"
to this:
"LIAISONS à NEW YORK"
The output of print(ascii(value)) is
'LIAISONS \xc3 NEW YORK'
I tried encoding in cp1252 first and decoding after to utf8 but I get this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 9: invalid continuation byte
I also tried to encode in Latin-1/ISO-8859-2, but that does not work either.
How can I do this?

You can't go from your input value to your desired output, because the data is no longer complete.
If your input value was an actual Mojibake re-coding from UTF-8 to a Latin encoding, then you'd have two bytes for the à codepoint:
>>> target = "LIAISONS à NEW YORK"
>>> target.encode('UTF-8').decode('latin1')
'LIAISONS Ã\xa0 NEW YORK'
That's because the UTF-8 encoding for à is C3 A0:
>>> 'à'.encode('utf8').hex()
'c3a0'
In your input, the A0 byte (which doesn't map to a printable character in most Latin-based codecs) has been filtered out somewhere. You can't re-create it from thin air, because the C3 byte of the UTF-8 pair can precede any number of other bytes, all resulting in valid output:
>>> b'\xc3\xa1'.decode('utf8')
'á'
>>> b'\xc3\xa2'.decode('utf8')
'â'
>>> b'\xc3\xa3'.decode('utf8')
'ã'
>>> b'\xc3\xa4'.decode('utf8')
'ä'
and you can't easily pick one of those, not without additional natural language processing. The bytes 80-A0 and AD are all valid continuation bytes in UTF-8 for this case, but none of those bytes result in a printable Latin-1 character, so there are at least 18 different possibilities here.
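As a rough illustration in Python 3 (the variable names here are made up for this sketch, not taken from the question), the following contrasts intact Mojibake, which can be repaired, with the damaged value, and lists the characters that C3 could have been part of if the lost byte was one of those named above:

```python
# Intact Mojibake keeps both bytes of the UTF-8 pair, so it round-trips.
intact = "LIAISONS Ã\xa0 NEW YORK"
repaired = intact.encode("latin1").decode("utf8")
print(repaired)

# The value from the question has lost the second byte, so decoding fails.
damaged = "LIAISONS \xc3 NEW YORK"
try:
    damaged.encode("latin1").decode("utf8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)

# Candidate second bytes named in the answer: 80 through A0, plus AD.
missing_bytes = list(range(0x80, 0xA1)) + [0xAD]
candidates = [bytes([0xC3, b]).decode("utf8") for b in missing_bytes]
print(len(candidates), "".join(candidates))
```

Every entry in candidates is a legitimate character, which is why the original cannot be recovered without knowing the language of the text.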

Why are the byte representations for extended ASCII characters from bytes() different from chr()?

(I am working in Python.)
Suppose I have this list of integers
a = [170, 140, 139, 180, 225, 200]
and I want to find the raw byte representation of the ASCII character each integer is mapped to. Since these are all greater than 127, they fall in the Extended ASCII set. I was originally using the chr() method in python to get the character and then encode() to get the raw byte representation.
a_bytes = [chr(decimal).encode() for decimal in a]
Using this method, I saw that for numbers greater than 127, the corresponding ASCII character is represented by 2 bytes.
[b'\xc2\xaa', b'\xc2\x8c', b'\xc2\x8b', b'\xc2\xb4', b'\xc3\xa1', b'\xc3\x88']
But when I used the bytes() method, it appears that each character had one byte.
>>> a_bytes2 = bytes(a)
>>> a_bytes2
b'\xaa\x8c\x8b\xb4\xe1\xc8'
So why is it different when I use chr().encode() versus bytes()?
There is no such thing as "Extended ASCII". ASCII is defined as bytes (and code points) in the range 0-127. Most standard single-byte code pages (which are used to convert from bytes to code points) use ASCII for bytes 0-127 and then map 128-255 to whatever is convenient for the code page. Russian code pages map those bytes to Cyrillic code points for example.
In your example, .encode() defaults to the multi-byte UTF-8 encoding, which maps code points 0-127 to the same single bytes as ASCII and follows multi-byte encoding rules for any code point above 127. chr() converts an integer to its corresponding, fixed Unicode code point, whereas bytes() stores each integer directly as one byte without applying any encoding.
So you have to choose an appropriate encoding to see what a byte in that encoding represents as a character. As you can see below, it varies:
>>> a = [170, 140, 139, 180, 225, 200]
>>> ''.join(chr(x) for x in a) # Unicode code points
'ª\x8c\x8b´áÈ'
>>> bytes(a).decode('latin1') # ISO-8859-1, also matches first 256 Unicode code points.
'ª\x8c\x8b´áÈ'
>>> bytes(a).decode('cp1252') # USA and Western Europe
'ªŒ‹´áÈ'
>>> bytes(a).decode('cp1251') # Russian, Serbian, Bulgarian, ...
'ЄЊ‹ґбИ'
>>> bytes(a).decode('cp1250') # Central and Eastern Europe
'ŞŚ‹´áČ'
>>> bytes(a).decode('ascii') # these bytes aren't defined for ASCII
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xaa in position 0: ordinal not in range(128)
Also, when displaying bytes the Python default is to display printable ASCII characters as characters and anything else (unprintable control characters and >127 values) as escape codes:
>>> bytes([0,1,2,97,98,99,49,50,51,170,140,139,180,225,200])
b'\x00\x01\x02abc123\xaa\x8c\x8b\xb4\xe1\xc8'
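To tie this back to the question, here is a small Python 3 sketch (variable names are illustrative only) that puts the three approaches side by side. It assumes Latin-1 is the encoding you want when you expect one byte per value, because Latin-1 maps code points 0-255 directly to byte values 0-255:

```python
a = [170, 140, 139, 180, 225, 200]

# chr() gives code points; UTF-8 needs two bytes for each of these.
utf8_bytes = [chr(n).encode("utf-8") for n in a]

# Latin-1 maps code points 0-255 straight to byte values 0-255.
latin1_bytes = [chr(n).encode("latin-1") for n in a]

# bytes() does no encoding at all: each integer becomes one byte.
raw = bytes(a)

print(utf8_bytes)
print(latin1_bytes)
print(b"".join(latin1_bytes) == raw)
```

The last comparison holds because encoding with Latin-1 and calling bytes() on the integers produce the same byte values for anything in the range 0-255.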

Problem with .decode('utf-8').upper() and special characters (but only inside the string)

I would like to capitalise the letter at a given position in a string. I have a problem with special letters, Polish letters to be specific, for example "ą". Ideally the solution would also work for French, Spanish, etc. (ç, è, and so on).
dobry="costąm"
print(dobry[4].decode('utf-8').upper())
I obtain:
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 0: unexpected end of data
while for this:
print("ą".decode('utf-8').upper())
I obtain Ą as desired.
More curiously, it works fine for the letters at positions 0-3, while for:
print(dobry[5].decode('utf-8').upper())
I get the same error.
The string actually looks like this:
>>> list(dobry)
['c', 'o', 's', 't', '\xc4', '\x85', 'm']
So, dobry[5] == '\x85' because the letter ą is represented by two bytes. To solve this, simply use Python 3 instead of Python 2.
UTF-8 may use more than one byte to encode a character, so iterating over a bytestring and manipulating individual bytes won't always work. It's better to decode to Python 2's unicode type. Perform your manipulations, then re-encode to UTF-8.
>>> dobry="costąm"
>>> udobry = unicode(dobry, 'utf-8')
>>> changed = udobry[:4] + udobry[4].upper() + udobry[5:]
>>> new_dobry = changed.encode('utf-8')
>>> print new_dobry
costĄm
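In Python 3, str is already a sequence of code points, so the same idea needs no decode or encode step. A minimal sketch, where upper_at is a hypothetical helper name and not something from the answers:

```python
def upper_at(text: str, index: int) -> str:
    """Return text with the character at index upper-cased."""
    return text[:index] + text[index].upper() + text[index + 1:]

print(upper_at("costąm", 4))
print(upper_at("garçon", 3))
```

This assumes each letter is a single code point; combining sequences need extra care when indexing.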
As @tripleee commented, non-ASCII characters may not map to a single Unicode codepoint: "ą" could be the single codepoint U+0105 LATIN SMALL LETTER A WITH OGONEK or it could be composed of "a" followed by U+0328 COMBINING OGONEK.
In the decomposed string the "a" character can be capitalised, and "A" followed by COMBINING OGONEK will result in "Ą" (though it may look like two separate characters in the Python REPL, or the terminal, depending on the terminal settings).
Note that you need to take the extra character into account when indexing.
It's also possible to normalise the decomposed string to the single-codepoint (composed) version using the tools in the unicodedata module:
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'costa\u0328m') == u"costąm"
True
but this may cause problems if, for example, you are returning the changed string to a system that expects the combining character to be preserved.
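The effect of the combining character on indexing can be seen in a short Python 3 sketch (variable names are illustrative):

```python
import unicodedata

decomposed = "costa\u0328m"   # "a" followed by U+0328 COMBINING OGONEK
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(composed == "cost\u0105m")
print(composed[4].upper() == "\u0104")
```

After NFC normalisation the letter occupies a single index, so position-based capitalisation behaves the same as for the precomposed string.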
What about this instead:
print(dobry.decode('utf-8')[5].upper())

How can I convert utf8 code number to unicode code number in Python3

I want to generate a list of all UTF-8 characters.
I wrote the code below, but it didn't work well.
I think that is because chr() expects a Unicode code number, but I gave it a UTF-8 code number.
I think I have to convert the UTF-8 code number to a Unicode code number, but I don't know how.
How can I do this? Or do you know a better way?
def utf8_2byte():
    characters = []
    # first byte range: [C2-DF]
    for first in range(0xC2, 0xDF + 1):
        # second byte range: [80-BF]
        for second in range(0x80, 0xBF + 1):
            num = (first << 8) + second
            line = [hex(num), chr(num)]
            characters.append(line)
    return characters
I expect:
# UTF8 code number, UTF8 character
[0xc380,À]
[0xc381,Á]
[0xc382,Â]
actually:
[0xc380,쎀]
[0xc381,쎁]
[0xc382,쎂]
In Python 3, chr() takes Unicode code points, not UTF-8 byte values. U+C380 is in the Hangul range. Instead, you can build a bytearray and decode it:
>>> bytearray((0xc3, 0x80)).decode('utf-8')
'À'
There are other methods too, such as struct or ctypes. Anything that assembles the raw byte values into a bytes-like object will do.
Unicode is a character set, while UTF-8 is an encoding: an algorithm that converts Unicode code points to bytes and back.
The code point 0xC380 (U+C380) is 쎀 in the Unicode standard.
The byte sequence C3 80 is À when you decode it as UTF-8.
>>> s = "쎀"
>>> hex(ord(s))
'0xc380'
>>> b = bytes.fromhex("C3 80")
>>> b
b'\xc3\x80'
>>> b.decode("utf8")
'À'
>>> bytes((0xc3, 0x80)).decode("utf8")
'À'
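Applied to the function from the question, a possible Python 3 fix is to build a two-byte bytes object and decode it rather than passing the combined number to chr(). This is only a sketch that keeps the asker's byte ranges and output shape:

```python
def utf8_2byte():
    """List every two-byte UTF-8 sequence with the character it decodes to."""
    characters = []
    for first in range(0xC2, 0xDF + 1):        # lead byte range: C2-DF
        for second in range(0x80, 0xBF + 1):   # continuation byte range: 80-BF
            raw = bytes([first, second])
            characters.append([hex((first << 8) + second), raw.decode("utf-8")])
    return characters

table = utf8_2byte()
print(len(table))
print(table[64:67])
```

The hex string is still the UTF-8 byte pair, as in the expected output, while the character comes from decoding those bytes.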

how to decode an ascii string with backslash x \x codes

I am trying to decode a Brazilian Portuguese text:
'Demais Subfun\xc3\xa7\xc3\xb5es 12'
It should be
'Demais Subfunções 12'
>> a.decode('unicode_escape')
>> a.encode('unicode_escape')
>> a.decode('ascii')
>> a.encode('ascii')
all give:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13:
ordinal not in range(128)
on the other hand this gives:
>> print a.encode('utf-8')
Demais Subfun├â┬º├â┬Áes 12
>> print a
Demais Subfunções 12
You have binary data that is not ASCII encoded. The \xhh escape sequences indicate your data is encoded with a different codec, and you are seeing Python produce a representation of the data using the repr() function, which can be re-used as a Python literal that re-creates the exact same value. This representation is very useful when debugging a program.
In other words, the \xhh escape sequences represent individual bytes, and the hh is the hex value of that byte. You have 4 bytes with hex values C3, A7, C3 and B5 that do not map to printable ASCII characters, so Python uses the \xhh notation instead.
You have UTF-8 data instead; decode it as such:
>>> 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
u'Demais Subfun\xe7\xf5es 12'
>>> print 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
Demais Subfunções 12
The C3 A7 bytes together encode U+00E7 LATIN SMALL LETTER C WITH CEDILLA, while the C3 B5 bytes encode U+00F5 LATIN SMALL LETTER O WITH TILDE.
ASCII happens to be a subset of the UTF-8 codec, which is why all the other letters can be represented as such in the Python repr() output.
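The answer above uses Python 2. A rough Python 3 equivalent, assuming the data is available as a bytes object (and, for the second part, assuming it was mis-decoded as Latin-1 at some earlier step), might look like this:

```python
raw = b"Demais Subfun\xc3\xa7\xc3\xb5es 12"
text = raw.decode("utf-8")
print(text)

# If the bytes were already mis-decoded as Latin-1 into a str,
# encoding back to Latin-1 recovers the original bytes first.
mojibake = raw.decode("latin-1")
print(mojibake.encode("latin-1").decode("utf-8") == text)
```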

UTF-8 latin-1 conversion issues, python django

OK, so my issue is that I have the string '\222\222\223\225', which is stored as Latin-1 in the DB. What I get from Django (by printing it) is the string 'ââââ¢', which I assume is the UTF conversion of it. Now I need to pass the string into a function that does this operation:
strdecryptedPassword + chr(ord(c) - 3 - intCounter - 30)
I get this error:
chr() arg not in range(256)
If I try to encode the string as latin-1 first I get this error:
'latin-1' codec can't encode characters in position 0-3: ordinal not
in range(256)
I have read a bunch on how character encoding works, and there is something I am missing because I just don't get it!
Your first error, 'chr() arg not in range(256)', probably means the value has underflowed, because chr() cannot take negative numbers. I don't know what the encryption algorithm is supposed to do when intCounter + 33 is more than the character's ordinal value; you'll have to check what to do in that case.
About the second error: you must decode(), not encode(), a regular string object to get a proper representation of your data. encode() takes a unicode object (those starting with u') and generates a regular string to be output or written to a file. decode() takes a string object and generates a unicode object with the corresponding code points. This is done with the unicode() call when generating from a string object; you could also call a.decode('latin-1') instead.
>>> a = '\222\222\223\225'
>>> u = unicode(a,'latin-1')
>>> u
u'\x92\x92\x93\x95'
>>> print u.encode('utf-8')
ÂÂÂÂ
>>> print u.encode('utf-16')
ÿþ
>>> print u.encode('latin-1')
>>> for c in u:
...     print chr(ord(c) - 3 - 0 - 30)
...
q
q
r
t
>>> for c in u:
...     print chr(ord(c) - 3 - 200 - 30)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
ValueError: chr() arg not in range(256)
As Vinko notes, Latin-1 or ISO 8859-1 doesn't have printable characters for the octal string you quote. According to my notes for 8859-1, "C1 Controls (0x80 - 0x9F) are from ISO/IEC 6429:1992. It does not define names for 80, 81, or 99". The code point names are as Vinko lists them:
\222 = 0x92 => PRIVATE USE TWO
\223 = 0x93 => SET TRANSMIT STATE
\225 = 0x95 => MESSAGE WAITING
The correct UTF-8 encoding of those is (Unicode, binary, hex):
U+0092 = %11000010 %10010010 = 0xC2 0x92
U+0093 = %11000010 %10010011 = 0xC2 0x93
U+0095 = %11000010 %10010101 = 0xC2 0x95
The LATIN SMALL LETTER A WITH CIRCUMFLEX is ISO 8859-1 code 0xE2 and hence Unicode U+00E2; in UTF-8, that is %11000011 %10100010 or 0xC3 0xA2.
The CENT SIGN is ISO 8859-1 code 0xA2 and hence Unicode U+00A2; in UTF-8, that is %11000010 %10100010 or 0xC2 0xA2.
So, whatever else you are seeing, you do not seem to be seeing a UTF-8 encoding of ISO 8859-1. All else apart, you are seeing but 5 bytes where you would have to see 8.
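These encodings can be checked with a few lines of Python 3. This is only a verification sketch, and the names encodings, stored and as_utf8 are made up for it:

```python
# UTF-8 bytes (as hex) for the code points discussed above.
encodings = {
    f"U+{ord(ch):04X}": ch.encode("utf-8").hex()
    for ch in ("\x92", "\x93", "\x95", "\u00e2", "\u00a2")
}
print(encodings)

# The four Latin-1 bytes from the question, re-encoded as UTF-8.
stored = b"\222\222\223\225"
as_utf8 = stored.decode("latin-1").encode("utf-8")
print(len(as_utf8), as_utf8)
```

Each C1 control character takes two bytes in UTF-8, which is where the figure of 8 bytes comes from.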
Added:
The previous part of the answer addresses the 'UTF-8 encoding' claim, but ignores the rest of the question, which says:
Now I need to pass the string into a function that does this operation:
strdecryptedPassword + chr(ord(c) - 3 - intCounter - 30)
I get this error: chr() arg not in range(256). If I try to encode the
string as Latin-1 first I get this error: 'latin-1' codec can't encode
characters in position 0-3: ordinal not in range(256).
You don't actually show us how intCounter is defined, but if it increments gently per character, sooner or later 'ord(c) - 3 - intCounter - 30' is going to be negative (and, by the way, why not combine the constants and use 'ord(c) - intCounter - 33'?), at which point chr() is going to complain. You would need to add 256 if the value is negative, or use a modulus operation to ensure you have a value between 0 and 255 to pass to chr(). Since we can't see how intCounter is incremented, we can't tell if it cycles from 0 to 255 or whether it increases monotonically. If the latter, then you need an expression such as:
chr((ord(c) - (intCounter % 256) + 479) % 256)
where 256 - 33 = 223, of course, and 479 = 256 + 223. This guarantees that the value passed to chr() is in the range 0..255 for any input character c with an ordinal in 0..255 and any non-negative value of intCounter (and, because the modulus operator never gets a negative left operand, it also works regardless of how the language's modulus behaves for negative values).
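A minimal Python 3 sketch of the wrapping idea follows. shift_char is a hypothetical helper, and wrapping modulo 256 is an assumption about what the original scheme intends, since the question does not show it. Python's % operator already returns a non-negative result for a positive modulus, so the simpler combined form is enough here:

```python
def shift_char(c: str, int_counter: int) -> str:
    """Shift a character down by int_counter + 33, wrapping within 0..255."""
    return chr((ord(c) - int_counter - 33) % 256)

u = b"\222\222\223\225".decode("latin-1")
print([shift_char(c, 0) for c in u])

# A large counter no longer raises ValueError; the value wraps around.
wrapped = shift_char("\x92", 200)
print(ord(wrapped))
```

With a counter of 0 this reproduces the q, q, r, t output shown in the earlier answer.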
Well, it's because it's been encrypted with some terrible scheme that just changes the ord() of the character by some request, so the string coming out of the database has been encrypted and this decrypts it. What you supplied above does not seem to work. In the database it is Latin-1, and Django converts it to unicode, but I cannot pass it to the function as unicode, and when I try to encode it to Latin-1 I see that error.
