Python 2 byte strings that are not encoded in UTF-8

I maintain an API that gets text input in multiple languages. We would like all strings to be encoded in UTF-8.
Most of the solutions that previous developers have tried involved calling encode and decode willy-nilly, which just leads to confusing, unmaintainable code.
For simplicity I am just defining x here, but let's imagine it could have been sent to my API. This string is encoded in Latin-1:
x = '\xe9toile' # x is a byte string in python 2
x.encode('utf-8')
results in
*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
The only way that I know of to encode it to UTF-8 is to first decode it as Latin-1 and then do the encoding:
>>> x.decode('latin-1')
u'\xe9toile'
>>> (x.decode('latin-1')).encode('utf-8')
'\xc3\xa9toile'
What if I did not know that the byte string was encoded in Latin-1? How would I be able to encode it to UTF-8?
What would I do if x were in some Chinese encoding that I don't know?
x = '\u54c8\u54c8'
x is always a byte string.
Any help would be appreciated.

If x is a byte string then it doesn't make sense for you to encode it. Text encodings are a way to represent text as bytes. You first have to turn your bytes into text by decoding them and then encode that text into your target encoding.
What if I did not know that the byte string was encoded in Latin-1? How would I be able to encode it to UTF-8?
You can try to guess the encoding but you can't always be right:
>>> u'Vlh'.encode('cp037')
'\xe5\x93\x88'
>>> u'哈'.encode('utf-8')
'\xe5\x93\x88'
This example is a little contrived but there's no way to know if the bytes '\xe5\x93\x88' represent 哈 or Vlh unless you know the original encoding.
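If you really must accept bytes of unknown origin, a third-party detector such as chardet can make an educated guess. A minimal sketch (the detected encoding is only a guess and can be wrong):
import chardet  # third-party: pip install chardet

raw = '\xe9toile'                    # bytes in an unknown encoding
guess = chardet.detect(raw)          # e.g. {'encoding': 'ISO-8859-1', 'confidence': ...}
if guess['encoding'] is not None:
    utf8_bytes = raw.decode(guess['encoding']).encode('utf-8')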
The most sensible solution would be to just have your clients encode their text as UTF-8 and then you decode the bytes you receive as UTF-8.
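For instance, a minimal sketch of that contract (handle_payload is a made-up name, not something from your API):
def handle_payload(raw_bytes):
    # Clients must send UTF-8; decode exactly once and reject anything else.
    try:
        return raw_bytes.decode('utf-8')   # bytes -> unicode text
    except UnicodeDecodeError:
        raise ValueError('payload must be valid UTF-8')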


Python unicode accent a (à) hex

I have a string from bs4 that is
s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
\u00c3\u00a0 should be an accented a (à). I have gotten it to show up in the console partly correct as
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
with
str2 = u'%s' % s
print(str2.encode('utf-8').decode('unicode-escape'))
but it's decoding c3 and a0 separately, so I get a tilde-A (Ã) instead of an accented a. I know that c3 a0 is the UTF-8 hex for an accented a. I have no idea what's going on; I got here using Google and by combining the answers I found. This entire character-encoding thing seems like a big mess to me.
The way it is supposed to be is
311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
EDIT:
Andrey's method worked when printing it out, but when I try to use urlopen with the string I get UnicodeEncodeError: 'ascii' codec can't encode character '\xe0' in position 60: ordinal not in range(128).
After using unquote(str, ":/") it gives UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128).
Transform the string back into bytes using .encode('latin-1'), then decode the unicode-escapes \u, transform everything into bytes again using the "wrong" 'latin-1' encoding, and finally, decode "properly" as 'utf-8':
s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
s.encode('latin-1').decode('raw_unicode_escape').encode('latin-1').decode('utf-8')
gives:
'vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html'
It works for the same reason as explained in this answer.
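For illustration, here is the same chain broken into steps (a Python 3 sketch, assuming the string really contains literal backslash-u sequences, which is what this chain is built to undo):
s = r"lassa-st\u00c3\u00a0-la-me-creatura"      # r'' keeps the \u00c3\u00a0 as literal text

step1 = s.encode('latin-1')                     # bytes, still containing the literal text \u00c3\u00a0
step2 = step1.decode('raw_unicode_escape')      # the \uXXXX escapes become code points U+00C3, U+00A0
step3 = step2.encode('latin-1')                 # those code points map 1:1 to the bytes 0xC3 0xA0
result = step3.decode('utf-8')                  # 0xC3 0xA0 is the UTF-8 encoding of 'à'

print(result)                                   # lassa-stà-la-me-creatura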
Assuming Python 2:
This is a byte string with Unicode escapes. The Unicode escapes were incorrectly generated for some UTF-8-encoded data:
>>> s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
>>> s.decode('unicode-escape')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
Now it is a Unicode string, but it appears mis-decoded since the code points resemble UTF-8 bytes. It turns out the latin1 (also iso-8859-1) codec maps the first 256 code points directly to bytes 0-255, so use this trick to convert back to a byte string:
>>> s.decode('unicode-escape').encode('latin1')
'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
Now it can be decoded correctly as UTF-8:
>>> s.decode('unicode-escape').encode('latin1').decode('utf8')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xe0-la-me-creatura.html'
It is a Unicode string, so Python displays its repr() value, which shows code points above U+007F as escape codes. print it to see the actual value assuming your terminal is correctly configured with an encoding that supports the characters printed:
>>> print(s.decode('unicode-escape').encode('latin1').decode('utf8'))
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
Ideally, fix the problem that generated this string incorrectly in the first place instead of working around the mess.

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 3: invalid start byte

I am using the repl.it Python web IDE, and I really can't solve a problem with it.
I was trying to decode a string, but it seems that there's no way to do it.
import base64
ciphertext = 'FxM7o1wl/7wE9CHPNzbB944feDFXbTSVaJfaLsUMzH5EP4xZRz7Sq8O3y7+jPbXIMVRxpvJZZm7ugqQ4fwpJwtvnB0/BoU+hhGeEZZZ0fFj1irm/zg3bsxOoxBJx4B3U'
ciphertext = base64.b64decode(ciphertext)
print ciphertext
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 3: invalid start byte
You cannot print ciphertext, since it is a sequence of nonsensical binary bytes, not text at all (I checked).
Your terminal assumes that whatever you print is UTF-8, and this is not. Hence the error. If you had a ciphertext of VGhpcyB3aWxsIGJlIHByaW50ZWQuCg==, that would print without problems, since it decodes to valid UTF-8 (valid 7-bit ASCII, actually).
If you want to display the ciphertext, you can replace non-UTF8 characters with spaces, or you can print the ciphertext as hex.
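For example, a small Python 2 sketch of those two options, using the ciphertext from the question:
import base64
import binascii

ciphertext = base64.b64decode('FxM7o1wl/7wE9CHPNzbB944feDFXbTSVaJfaLsUMzH5EP4xZRz7Sq8O3y7+jPbXIMVRxpvJZZm7ugqQ4fwpJwtvnB0/BoU+hhGeEZZZ0fFj1irm/zg3bsxOoxBJx4B3U')

# show the raw bytes as hex instead of feeding them straight to the terminal
print binascii.hexlify(ciphertext)
# or keep the unprintable bytes visible as escape sequences
print repr(ciphertext)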
But, actually, what you should really do is decrypt it before printing (and, once you have done that, verify it is UTF-8 text and not, say, encoded in ISO-8859-15 or another charset; if it is, you can use the appropriate codec. This answer also supplies useful information on charsets).

How to allow encode('utf-8') twice without getting error in python?

I have a legacy code segment that always calls encode('utf-8') for me when I pass in a unicode string (directly from the database). Is there a way to convert the unicode string to some other form so that it can be encoded to 'utf-8' again without getting an error? I am not allowed to change the legacy code segment.
I've tried decoding it first but it returns this error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
If I leave the unicode string as is it returns
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 986: ordinal not in range(128)
If I change the legacy code not to call encode('utf-8') it works, but that is not a viable option.
Edit:
Here is the code snippet
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
if __name__ == "__main__":
    # 1
    a = u'贸易'
    # 2
    a = a.decode('utf-8')
    # 3
    a.encode('utf-8')
For some reason, if I skip #2 I don't get the error I mentioned above. I double-checked the type of the string; both seem to be unicode and hold the same characters, but the string in the code I am working on does not allow me to encode or decode to utf-8, while the same characters in some other snippet allow me to do that.
Consider the following cases:
If you want a unicode string, and you already have a unicode string, you need do nothing.
If you want a bytestring, and you already have a bytestring, you need do nothing.
If you have a unicode string and want a bytestring, you encode it.
If you have a bytestring and want a unicode string, you decode it.
In none of these cases is it appropriate to encode or decode more than once.
In order for encode('utf-8') to make sense, the string must be a unicode string (or contain all-ASCII characters...). So, unless it's a unicode instance already, you have to decode it first from whatever encoding it's in to a unicode string, after which you can pass it into your legacy interface.
At no point does it make sense for anything to be double-encoded: encoding takes a string and transforms it into a series of bytes; decoding takes a series of bytes and transforms them back into a string. The confusion only arises because Python 2 uses the str type for both plain-ASCII strings and byte sequences.
>>> u'é'.encode('utf-8') # unicode string
'\xc3\xa9' # bytes, not unicode string
>>> '\xc3\xa9'.decode('utf-8')
u'\xe9' # unicode string
>>> u'\xe9' == u'é'
True
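So, for the legacy situation in the question, one option is to normalize whatever you have to a unicode string before handing it over, so that the legacy encode('utf-8') is the only encode that ever runs. A rough Python 2 sketch (to_unicode and legacy_segment are made-up names, not part of the question's code):
def to_unicode(value, source_encoding='utf-8'):
    # Return value as a unicode string, decoding at most once.
    if isinstance(value, unicode):
        return value                          # already text, nothing to do
    return value.decode(source_encoding)      # bytes -> text

# the legacy code can now safely call .encode('utf-8') on the result
legacy_segment(to_unicode(value_from_db))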

Encoding and Decoding of text in Python

I am currently working with a Python script (App Engine) that takes text input from the user and stores it in the database for re-distribution later.
The encoding of the incoming text is unknown, and I need to have it encoded only once.
Example Texts from clients:
This%20is%20a%20test
This is a test
Now, in Python, what I thought I could do is decode it and then encode it, so both samples become:
This%20is%20a%20test
This%20is%20a%20test
The code that I am using is as follows:
#
# Encode as UTF-8
#
pl = pl.encode('UTF-8')
#
# Unquote the string, then requote to ensure consistent encoding
#
pl = urllib.quote(urllib.unquote(pl))
Where pl is from the POST parameter for payload.
The Issue
The issue is that sometimes I get special characters (Chinese, Arabic) and I get the following error.
'ascii' codec can't encode character u'\xc3' in position 0: ordinal not in range(128)
..snip..
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc3' in position 0: ordinal not in range(128)
Does anyone know the best way to process the string, given the above issue?
Thanks.
Replace
pl = pl.encode('UTF-8')
with
pl = pl.decode('UTF-8')
since you're trying to decode a byte-string into a string of characters.
A design issue with Python 2 lets you .encode a bytestring (which is already encoded) by automatically decoding it as ASCII (which is why it apparently works for ASCII strings, failing only for non-ASCII bytes).
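A quick interpreter session illustrating that implicit ASCII step (Python 2):
>>> 'abc'.encode('utf-8')        # byte string: silently decoded as ASCII first, then encoded
'abc'
>>> '\xc3\xa9'.encode('utf-8')   # non-ASCII bytes: the implicit ASCII decode blows up
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)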

how to convert string from known encoding to utf-8 on the fly in python?

I know about the codecs library, but I don't want to write the string to a file.
Is there a way to hold the resulting string in a variable?
Let's assume you have a string s encoded in encoding. To get the same string in UTF-8, you can use
s.decode(encoding).encode("utf-8")
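For example, a minimal Python 2 sketch, assuming the bytes are known to be Latin-1:
s = '\xe9toile'                                  # byte string in a known encoding (Latin-1 here)
utf8_bytes = s.decode('latin-1').encode('utf-8')
print repr(utf8_bytes)                           # '\xc3\xa9toile'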
If you have an ASCII-encoded byte string f:
1) f1 = unicode(f)
2) f2 = f1.encode('utf-8')
This way, you get rid of errors like "UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 5: ordinal not in range(128)".
