I'm currently converting some customer-entered strings to JSON. I've already made a dict out of the strings, and now am just doing:
json.dumps(some_dict)
Problem is, for some of the customer-entered data, it seems they've somehow entered garbled stuff, and trying to dump it to JSON breaks the whole thing:
{'FIRST_NAME': 'sdffg\xed', 'LAST_NAME': 'sdfsadf'}
Which then gets me:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)
I have no control over where the data comes from, so I can't prevent this upfront. So, now that this bad data already exists, I was thinking of just replacing unknown/bad characters with some placeholder character, or deleting them. How can I do this?
{'FIRST_NAME': 'sdffg\xed', 'LAST_NAME': 'sdfsadf'}
is a Python dictionary whose keys and values are byte strings. That cannot be represented in JSON because JSON has no concept of bytes. JSON string values are always Unicode, so to faithfully reproduce a Python dict, you have to make sure all the textual keys and values are unicode (u'...') strings.
Python will let you get away with 'FIRST_NAME' because it is limited to plain ASCII; most popular byte encodings are ASCII supersets, so Python can reasonably safely implicitly decode the string as ASCII. But that is not the case for strings with bytes outside the range 0x00-0x7F, such as 'sdffg\xed'. You should .decode the byte str to a unicode string before putting it in the dictionary. (Really you should try to ensure that your textual data is kept in Unicode strings for all of your application processing, converting to byte strings only when input is loaded from a non-Unicode source and when output has to go to a non-Unicode destination. So you shouldn't have ended up with byte content in a dictionary at this point. Check where that input is coming from - you probably should be doing the decode() step further up.)
You can decode to Unicode and skip or replace non-ASCII characters by using:
>>> 'sdffg\xed'.decode('ascii', 'ignore')
u'sdffg'
>>> 'sdffg\xed'.decode('ascii', 'replace')
u'sdffg\uFFFD' # U+FFFD = �. Unicode string, json.dump can serialise OK
but it seems a shame to throw away potentially useful data. If you can guess the encoding that was used to create the byte string you can keep the subset of non-ASCII characters that are recoverable. If the byte 0xED represents the character U+00ED i-acute (í), then .decode('iso-8859-1') or possibly .decode('cp1252') may be the encoding you are looking for.
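On Python 3 (where str is Unicode and bytes are a separate type), the same error-handler options look like this; treating the source as Latin-1 is only a guess about the original encoding:

```python
raw = b'sdffg\xed'

# Drop undecodable bytes entirely
print(raw.decode('ascii', 'ignore'))   # sdffg

# Replace them with U+FFFD, the replacement character
print(raw.decode('ascii', 'replace'))  # sdffg�

# If the data was really Latin-1/cp1252, recover the character instead
print(raw.decode('latin-1'))           # sdffgí
```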
json.dumps attempts to decode byte strings to Unicode, as ASCII unless you supply an encoding. So, you need to make sure that your strings will decode as ASCII. Luckily for us, unicode() also decodes its argument as ASCII when no encoding is specified, and it lets us ignore the bad bytes. So ...
copy = {}
for k, v in d.items():
    copy[k] = unicode(v, errors='ignore')
json.dumps(copy)
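For reference, a Python 3 version of the same clean-up (where the lossy decode has to be explicit, since bytes are never implicitly decoded) might look like:

```python
import json

d = {'FIRST_NAME': b'sdffg\xed', 'LAST_NAME': b'sdfsadf'}

# Decode every byte-string value, dropping undecodable bytes
copy = {k: v.decode('ascii', 'ignore') for k, v in d.items()}

print(json.dumps(copy))  # {"FIRST_NAME": "sdffg", "LAST_NAME": "sdfsadf"}
```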
Related
I've logged a lot of texts that were decoded to unicode (UTF-8) from byte strings.
Example:
From upstream I received a lot of byte strings, like:
b_st = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xfe\x00'
I saved those on my computer after decoding them:
b_un = b_st.decode("utf-8", "replace")
As you can see, the initial byte string has bytes that are invalid in UTF-8 (e.g. \xff), so those were replaced.
After that I tried to recover the byte string from that unicode text by doing b_un.encode("utf-8"), but it returns a different byte string, not the same as the original.
Is it possible to recover the original byte string?
PS. I didn't decode those texts intentionally; I hadn't read up on the default behavior of a class that automatically converts any text to unicode when necessary.
replace is a lossy codec error handler, replacing any un-decodable bytes with \ufffd (the Unicode replacement character). As such, it is impossible to recover your original image.
If you're handed a byte string, you can write it to a file by using a binary io object:
with open(filename, 'wb') as f:
    f.write(byte_string)
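A small Python 3 sketch makes the lossiness concrete: once 'replace' has run, re-encoding no longer reproduces the original bytes, because every invalid byte has become U+FFFD (which itself encodes as b'\xef\xbf\xbd'):

```python
b_st = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00'

# Lossy decode: every byte invalid in UTF-8 becomes U+FFFD
b_un = b_st.decode('utf-8', 'replace')

# Re-encoding gives back *different* bytes; the originals are gone
round_trip = b_un.encode('utf-8')
print(round_trip == b_st)  # False
```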
I already tried all the previous answers and solutions.
I am trying to use this value, which gives me an encoding-related error.
ar = [u'http://dbpedia.org/resource/Anne_Hathaway', u'http://dbpedia.org/resource/Jodie_Bain', u'http://dbpedia.org/resource/Wendy_Divine', u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', u'http://dbpedia.org/resource/Baaba_Maal']
So I tried,
d = [x.decode('utf-8') for x in ar]
which gives:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 31: ordinal not in range(128)
I tried out
d = [x.encode('utf-8') for x in ar]
which removes the error but changes the original content.
The original value was u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', which encode converted to 'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno'.
what is correct way to deal with this scenario?
Edit
Error comes when I feed these links in
req = urllib2.Request()
The second version of your string is the correct UTF-8 representation of your original unicode string. If you want a meaningful comparison, you have to use the same representation for both the stored string and the user-input string. The sane thing to do here is to always use Unicode strings internally (in your code), and make sure both your user inputs and stored strings are correctly decoded to unicode from their respective encodings at your system's boundaries (storage subsystem and user-input subsystem).
Also you seem to be a bit confused about unicode and encodings, so reading this and this might help.
Unicode strings in Python are "raw" Unicode, so make sure to .encode() and .decode() them as appropriate. Using UTF-8 encoding is considered a best practice among many development groups all over the world.
To encode use the quote function from the urllib2 library:
from urllib2 import quote
escaped_string = quote(unicode_string.encode('utf-8'))
To decode, use unquote:
from urllib2 import unquote
src = "http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno"
unicode_string = unquote(src).decode('utf-8')
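Under Python 3 the same functions live in urllib.parse; quote percent-encodes the UTF-8 bytes of the non-ASCII characters, and passing safe=':/' keeps the URL structure intact:

```python
from urllib.parse import quote, unquote

url = 'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno'

# quote() encodes the str as UTF-8 and percent-escapes the non-safe bytes
escaped = quote(url, safe=':/')
print(escaped)  # http://dbpedia.org/resource/Jos%C3%A9_El%C3%ADas_Moreno

# unquote() reverses the transformation
print(unquote(escaped) == url)  # True
```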
Also, if you're more interested in Unicode and UTF-8 work, check out the Unicode HOWTO.
In your Unicode list, u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' is an ASCII safe way to represent a Unicode string. When encoded in a form that supports the full Western European character set, such as UTF-8, it's: http://dbpedia.org/resource/José_Elías_Moreno
Your .encode("UTF-8") is correct and would have looked OK in a UTF-8 editor or browser. What you saw after the encode was an ASCII-safe representation of the UTF-8.
For example, your trouble chars were é and í.
é = 00E9 Unicode = C3A9 UTF-8
í = 00ED Unicode = C3AD UTF-8
In short, your .encode() method is correct and should be used for writing to files or to a browser.
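The byte values listed above are easy to verify in Python 3, where encode makes the UTF-8 representation explicit:

```python
# U+00E9 (é) and U+00ED (í) as UTF-8 byte sequences
print('\xe9'.encode('utf-8'))  # b'\xc3\xa9'
print('\xed'.encode('utf-8'))  # b'\xc3\xad'
```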
I would like to include picture bytes into a JSON, but I struggle with a encoding issue:
import urllib
import json
data = urllib.urlopen('https://www.python.org/static/community_logos/python-logo-master-v3-TM-flattened.png').read()
json.dumps({'picture' : data})
UnicodeDecodeError: 'utf8' codec can't decode byte 0x89 in position 0: invalid start byte
I don't know how to deal with this issue since I am handling an image, so I am a bit confused by this encoding issue. I am using Python 2.7. Can anyone help me? :)
JSON data expects to handle Unicode text. Binary image data is not text, so when the json.dumps() function tries to decode the bytestring to unicode using UTF-8 (the default) that decoding fails.
You'll have to wrap your binary data in a text-safe encoding first, such as Base-64:
json.dumps({'picture' : data.encode('base64')})
Of course, this then assumes that the receiver expects your data to be wrapped this way.
If your API endpoint has been so badly designed that it expects your image bytes to be passed in as text, then the alternative is to pretend that your bytes are really text: if you first decode them as Latin-1 you can map those bytes straight to Unicode codepoints:
json.dumps({'picture' : data.decode('latin-1')})
With the data already a unicode object, the json library will then proceed to treat it as text. This does mean that it will replace non-ASCII codepoints with \uhhhh escapes.
The best solution that comes to my mind for this situation, space-wise, is Base85 encoding, which represents four bytes as five characters. You could also map every byte to the corresponding character in the U+0000-U+00FF range and then dump that in the JSON.
But still, those could be overkill methods for this; ease-wise, Base64 would be the winner.
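On Python 3 the str.encode('base64') codec no longer exists; the base64 module does the wrapping (it also provides b85encode for the Base85 variant mentioned above), and the JSON round trip recovers the exact bytes:

```python
import base64
import json

# Sample binary header standing in for the real image bytes
data = b'\x89PNG\r\n\x1a\n'

# Wrap the bytes in a text-safe encoding before dumping
payload = json.dumps({'picture': base64.b64encode(data).decode('ascii')})

# The receiver reverses both steps to get the original bytes back
restored = base64.b64decode(json.loads(payload)['picture'])
print(restored == data)  # True
```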
I am trying to write data in a StringIO object using Python and then ultimately load this data into a postgres database using psycopg2's copy_from() function.
When I first did this, copy_from() threw an error: ERROR: invalid byte sequence for encoding "UTF8": 0xc92. So I followed this question.
I figured out that my Postgres database has UTF8 encoding.
The file/StringIO object I am writing my data into shows its encoding as the following:
setgid Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators
I tried to encode every string that I am writing to the intermediate file/StringIO object into UTF-8. To do this, I used .encode(encoding='UTF-8',errors='strict') for every string.
This is the error I got now:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)
What does it mean? How do I fix it?
EDIT:
I am using Python 2.7
Some pieces of my code:
I read from a MySQL database that has data encoded in UTF-8 as per MySQL Workbench.
This is a few lines code for writing my data (that's obtained from MySQL db) to StringIO object:
# Populate the table_data variable with rows delimited by \n and columns delimited by \t
row_num = 0
for row in cursor.fetchall():
    # Separate rows in a table by new line delimiter
    if row_num != 0:
        table_data.write("\n")
    col_num = 0
    for cell in row:
        # Separate cells in a row by tab delimiter
        if col_num != 0:
            table_data.write("\t")
        table_data.write(cell.encode(encoding='UTF-8', errors='strict'))
        col_num = col_num + 1
    row_num = row_num + 1
This is the code that writes to Postgres database from my StringIO object table_data:
cursor = db_connection.cursor()
cursor.copy_from(table_data, <postgres_table_name>)
The problem is that you're calling encode on a str object.
A str is a byte string, usually representing text encoded in some way, like UTF-8. When you call encode on that, it first has to be decoded back to text, so the text can be re-encoded. By default, Python does that by calling s.decode(sys.getdefaultencoding()), and getdefaultencoding() usually returns 'ascii'.
So, you're taking UTF-8-encoded text, decoding it as if it were ASCII, then re-encoding it in UTF-8.
The general solution is to explicitly call decode with the right encoding, instead of letting Python use the default, and then encode the result.
But when the right encoding is already the one you want, the easier solution is to just skip the .decode('utf-8').encode('utf-8') and just use the UTF-8 str as the UTF-8 str that it already is.
Or, alternatively, if your MySQL wrapper has a feature to let you specify an encoding and get back unicode values for CHAR/VARCHAR/TEXT columns instead of str values (e.g., in MySQLdb, you pass use_unicode=True to the connect call, or charset='UTF-8' if your database is too old to auto-detect it), just do that. Then you'll have unicode objects, and you can call .encode('utf-8') on them.
In general, the best way to deal with Unicode problems is the last one—decode everything as early as possible, do all the processing in Unicode, and then encode as late as possible. But either way, you have to be consistent. Don't call str on something that might be a unicode; don't concatenate a str literal to a unicode or pass one to its replace method; etc. Any time you mix and match, Python is going to implicitly convert for you, using your default encoding, which is almost never what you want.
As a side note, this is one of the many things that Python 3.x's Unicode changes help with. First, str is now Unicode text, not encoded bytes. More importantly, if you have encoded bytes, e.g., in a bytes object, calling encode will give you an AttributeError instead of trying to silently decode so it can re-encode. And, similarly, trying to mix and match Unicode and bytes will give you an obvious TypeError, instead of an implicit conversion that succeeds in some cases and gives a cryptic message about an encode or decode you didn't ask for in others.
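That Python 3 behaviour is easy to demonstrate: bytes objects have no encode method at all, so the silent decode-then-re-encode simply can't happen:

```python
data = 'caf\xe9'.encode('utf-8')   # bytes: b'caf\xc3\xa9'

try:
    data.encode('utf-8')           # bytes objects have no encode method
except AttributeError:
    print('AttributeError: no silent decode-and-re-encode in Python 3')
```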
I am trying to convert an incoming byte string that contains non-ASCII characters into a valid UTF-8 string such that I can dump it as JSON.
b = '\x80'
u8 = b.encode('utf-8')
j = json.dumps(u8)
I expected j to be '\xc2\x80' but instead I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
In my situation, 'b' is coming from mysql via google protocol buffers and is filled out with some blob data.
Any ideas?
EDIT:
I have Ethernet frames that are stored in a MySQL table as a blob (please, everyone, stay on topic and keep from discussing why there are packets in a table). The table collation is UTF-8 and the db layer (sqlalchemy, non-ORM) is grabbing the data and creating structs (Google protocol buffers) which store the blob as a Python 'str'. In some cases I use the protocol buffers directly without any issue. In other cases, I need to expose the same data via JSON. What I noticed is that when json.dumps() does its thing, '\x80' can be replaced with the Unicode replacement character (\ufffd, iirc).
You need to examine the documentation for the software API that you are using. BLOB is an acronym: BINARY Large Object.
If your data is in fact binary, the idea of decoding it to Unicode is of course a nonsense.
If it is in fact text, you need to know what encoding to use to decode it to Unicode.
Then you use json.dumps(a_Python_object) ... if you encode it to UTF-8 yourself, json will decode it back again:
>>> import json
>>> json.dumps(u"\u0100\u0404")
'"\\u0100\\u0404"'
>>> json.dumps(u"\u0100\u0404".encode('utf8'))
'"\\u0100\\u0404"'
>>>
UPDATE about latin1:
u'\x80' is a useless meaningless C1 control character -- the encoding is extremely unlikely to be Latin-1. Latin-1 is "a snare and a delusion" -- all 8-bit bytes are decoded to Unicode without raising an exception. Don't confuse "works" and "doesn't raise an exception".
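For comparison, Python 3's json.dumps escapes non-ASCII text to \uhhhh sequences by default but rejects bytes outright instead of decoding them behind your back:

```python
import json

# Unicode text is escaped to ASCII by default (ensure_ascii=True)
print(json.dumps('\u0100\u0404'))  # "\u0100\u0404"

# Bytes raise TypeError rather than being implicitly decoded
try:
    json.dumps('\u0100\u0404'.encode('utf8'))
except TypeError:
    print('bytes are rejected, not implicitly decoded')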
Use b.decode('name of source encoding') to get a unicode version. This was surprising to me when I learned it. E.g.:
In [123]: 'foo'.decode('latin-1')
Out[123]: u'foo'
I think what you are trying to do is decode the string object from some encoding (do you know what that encoding is?) to get a unicode object:
unicode_b = b.decode('some_encoding')
and then re-encode the unicode object back to a string object using the utf_8 encoding:
b = unicode_b.encode('utf_8')
This uses the unicode object as a translator. Without knowing the original encoding of the string, I can't say for certain, but there is the possibility that the conversion will not go as expected; the unicode object is not meant for blindly converting strings from one encoding to another. Work with the unicode object assuming you do know what the encoding is; if you don't, there really isn't a way to find out without trial and error. Then convert back to the encoded string when you want a string object back.
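As a concrete Python 3 sketch of that round trip: if the \x80 byte from the question actually came from a cp1252 source (purely an assumption about the original encoding), it decodes to the euro sign and can then be re-encoded as valid UTF-8. If the guess is wrong, the output is wrong too, just without an exception:

```python
b = b'\x80'

# Assumption: the byte came from a cp1252 source, where 0x80 is the euro sign
u = b.decode('cp1252')
print(u)                  # €

# Re-encode to UTF-8 now that we have a proper unicode object
print(u.encode('utf-8'))  # b'\xe2\x82\xac'
```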