How to solve this double encoding? - python

I'm developing a website that uses Python to process requests and a MySQL database to store information.
All my tables are utf8 and I also use utf8 in the Content-Type header.
I have this code to establish connection to the db:
database_connection = MySQLdb.connect(
    host=database_host,
    user=database_username,
    passwd=database_password,
    db=database_name,
    use_unicode=True,
)
cursor = database_connection.cursor()
cursor.execute("SET NAMES utf8")
cursor.execute("SET CHARACTER SET utf8")
cursor.execute("SET character_set_connection=utf8")
Running a simple test on my GoDaddy hosting, printing the result of a simple SELECT query like this:
print results.encode("utf-8")
Shows a double-encoded string (every non-ASCII character is turned into two different special characters). But if I leave out the encode call, it raises an encoding error for each non-ASCII letter.

It sounds as though results contains a Unicode string that was incorrectly decoded from a byte string coming from the database. I.e. when you read the data from the database, it decoded the byte string as Latin-1 rather than the UTF-8 it really is.
So if you fix the decoding of the database contents, then you should be in business.
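A minimal sketch of that fix with MySQLdb (assuming the stored data really is UTF-8): either tell the driver the connection charset up front, or undo the wrong decode on values you have already read.
import MySQLdb

# Preferred: let the driver decode UTF-8 correctly from the start;
# passing charset="utf8" replaces the manual SET NAMES / SET CHARACTER SET calls.
database_connection = MySQLdb.connect(
    host=database_host,
    user=database_username,
    passwd=database_password,
    db=database_name,
    charset="utf8",
    use_unicode=True,
)

# Workaround for values that were already decoded as Latin-1:
# re-encode with Latin-1 to recover the original bytes, then decode as UTF-8.
fixed = results.encode("latin-1").decode("utf-8")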

I use something like this, which I found on the internet during one of my own encoding hunts. You can keep chaining encodings until you find one that fits.
Also, as others said, try fixing the source first. This hack is just to figure out what encoding is actually being returned. Hope this helps.
# This method is a simple recursive hack that tries to find a compatible encoding
# for the problematic field. It does not guarantee a match; if none is found,
# the error code "ENC_ERR" is returned.
def findencoding(field, level):
    print "level: " + str(level)
    try:
        if level == 0:
            field = field.encode('cp1252')
        elif level == 1:
            field = field.encode('cp1254')
        else:
            return "ENC_ERR"
    except Exception:
        field = findencoding(field, level + 1)
    return field

Related

Python error - Unicode/Ascii problems with value pulled out of MySql database

This has been asked a million times, but nothing I have tried works and every existing answer covers a slightly different issue. I'm losing my mind over it!
I have a Python Script which pulls data from a MySql database - all works well.
Database Information:
I believe the information in the database is correct. I am trying to parse multiple records into Word documents, so I am not too bothered about accuracy; even if the bad characters are simply removed, that is fine.
The charset of the database is UTF-8 and the field I am working with is VARCHAR.
I am using the mysql.connector Python module to connect.
However, I am getting errors, and I've realised it's because of values with unicode in them, such as this:
The value of this item is "DOMAIN\x08oardroom"; the \x08 is a backspace control character, so it displays as "DOMAINoardroom".
I have tried:
text = order[11].encode().decode("utf-8")
text = order[11].encode("ascii", errors="ignore").decode()
text = str(order[11].encode("utf8", errors="ignore"))
The latter does work, but it outputs the value as "b'DOMAIN\x08oardroom'" because str() was called on a bytes object.
I can print(text) to the screen without a problem. However, when I try to write it to a Word document (using the docx module), it produces an error:
table = document.add_table(rows=total_orders*2, cols=1)
row = table.rows[0].cells
row[0].text = row_text
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
I am not particularly fussy about how the unicode is handled (it can simply be removed if needed); I just need it to parse without error.
Any thoughts or advice here?
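One way to do that (a sketch, not from the original thread; clean_text is a hypothetical helper) is to strip control characters before handing the value to python-docx:
import re

# Remove ASCII control characters (python-docx rejects them as not XML-compatible).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def clean_text(value):
    if isinstance(value, (bytes, bytearray)):
        value = value.decode("utf-8", errors="ignore")
    return _CONTROL_CHARS.sub("", value)

row[0].text = clean_text(order[11])  # "DOMAIN\x08oardroom" becomes "DOMAINoardroom"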

Python MySQL connector returns bytearray instead of regular string value

I am loading data from one table into pandas and then inserting that data into a new table. However, instead of a normal string value I am seeing a bytearray:
bytearray(b'TM16B0I8'), when it should be TM16B0I8.
What am I doing wrong here?
My code:
engine_str = 'mysql+mysqlconnector://user:pass@localhost/db'
engine = sqlalchemy.create_engine(engine_str, echo=False, encoding='utf-8')
connection = engine.connect()
th_df = pd.read_sql('select ticket_id, history_date', con=connection)
for row in th_df.to_dict(orient="records"):
    var_ticket_id = row['ticket_id']
    var_history_date = row['history_date']
    query = 'INSERT INTO new_table(ticket_id, history_date)....'
For some reason the Python MySQL connector only returns bytearrays (more info in How return str from mysql using mysql.connector?), but you can decode them into unicode strings with
var_ticket_id = row['ticket_id'].decode()
var_history_date = row['history_date'].decode()
Make sure you are using the right collation and encoding. I happened to be using utf8mb4_bin for one of my website's db tables; changing it to utf8mb4_general_ci did the trick.
Producing a bytearray is now the expected behaviour.
It changed with mysql-connector-python 8.0.24 (2021-04-20). According to the v8.0.24 release notes, "Binary columns were returned as strings instead of 'bytes' or 'bytearray'" behaviour was a bug that was fixed in that release.
So producing a Python bytearray is the correct behaviour if the database column is a binary type (e.g. BINARY or VARBINARY). Previously it produced a Python string, but now it produces a bytearray.
So either change the data type in the database to a non-binary type, or convert the bytearray to a string in your code. If the column is nullable, you'll have to check for that first, since invoking decode() on None would produce an error. You'll also have to be sure the bytes represent a valid string in the character encoding being used for the decoding/conversion.
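A minimal sketch of that conversion (to_str is a hypothetical helper, assuming the bytes are valid UTF-8 and reusing the loop from the question):
def to_str(value, encoding="utf-8"):
    """Decode bytes/bytearray column values; pass None and str through unchanged."""
    if value is None:
        return None
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value

for row in th_df.to_dict(orient="records"):
    var_ticket_id = to_str(row['ticket_id'])
    var_history_date = to_str(row['history_date'])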
Much easier (see How to return str from MySQL using mysql.connector?): adding mysql-connector-python==8.0.17 to requirements.txt resolved this issue for me.
"pip install mysql-connector-python" from the terminal.

pickle unicode strings with non-ascii characters to mysql in django

Consider that I have a dictionary that I want to store in the db using Python's pickle.
My question is: which Django model field should I use?
So far I've been using a CharField, but there seems to be an error:
I pickle a u'\xe9' (i.e. 'é'), and I get:
Incorrect string value: '\xE1, ist...' for column 'edition' at row 1
(the ,"ist..." was because I have more text after the 'É').
I'm using
data = dict()
data['foo'] = input_that_has_the_caracter
to_save_in_db = cPickle.dumps(data)
Should I use a binary field and pickle with a protocol that uses binary? I would have to change the db in order to do that, so it is better to be sure first...
You should check whether you are using a proper encoding for your table AND column in your database backend (I'm assuming MySQL since your error message seems to come from it). In MySQL, columns can have a different encoding than the table. Check that it's UTF-8.
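If you do go the binary route the question asks about, a sketch (assuming a Django model with a BinaryField; the model name Book is only illustrative, and the field is named edition after the column in the error message) could look like this, since a binary pickle never passes through the column's text encoding:
import pickle

from django.db import models

class Book(models.Model):  # hypothetical model name
    edition = models.BinaryField()

# Store the dictionary as a binary pickle; no text encoding is involved.
data = {'foo': input_that_has_the_caracter}
book = Book(edition=pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))
book.save()

# Read it back; bytes() covers backends that return a memoryview/buffer.
restored = pickle.loads(bytes(Book.objects.get(pk=book.pk).edition))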

Force character set conversion

I have an application that writes data to a Microsoft SQL Server. The database's character set is CP1252, and the incoming data to be saved is in UTF-8. The data may contain characters that cannot be converted to CP1252, which throws an exception on insert.
The database guys said that I should just crunch the data to CP1252 forcibly, like this:
some_value = some_value.encode('CP1252', 'replace')
But SQLAlchemy does the conversion automatically and I don't see a way to force the conversion.
engine = sqlalchemy.create_engine(
    'mssql+pyodbc://...',
    encoding='CP1252',
    convert_unicode=True,
)
It is critical that the data is saved, even with some missing characters. How can I implement this? Note that I'm using a lot of database reflection in this case.
I don't see a problem.
some_value = some_value.encode('CP1252', 'replace').decode('CP1252')
If some_value isn't actually a unicode string but raw UTF-8 bytes:
some_value = some_value.decode("utf-8").encode('cp1252', 'replace').decode('cp1252')
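A quick illustration of what 'replace' does with characters that CP1252 cannot represent (this snippet is just a demonstration, not from the question):
# "Ω" has no CP1252 equivalent, "é" does; 'replace' turns the former into "?".
some_value = u"\u03a9 caf\xe9"  # u"Ω café"
cleaned = some_value.encode('cp1252', 'replace').decode('cp1252')
# cleaned == u"? café"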

Getting error when INSERT into MySQL

_mysql_exceptions.Warning: Incorrect string value: '\xE7\xB9\x81\xE9\xAB\x94...' for column 'html' at row 1
def getSource(theurl, moved=0):
    if moved == 1:
        theurl = urllib2.urlopen(theurl).geturl()
    urlReq = urllib2.Request(theurl)
    urlReq.add_header('User-Agent', random.choice(agents))
    urlResponse = urllib2.urlopen(urlReq)
    htmlSource = urlResponse.read()
    return htmlSource

new_u = Url(source_url=source_url, source_url_short=source_url_short, source_url_hash=source_url_hash, html=htmlSource)
new_u.save()
Why is this happening?
I am basically downloading the HTML of a page... and then saving it to a database using Django.
It only happens sometimes....and sometimes it works fine.
Edit: it seems like I have to set the database to UTF-8? What is the command to do that?
You basically need to ensure a proper string encoding. E.g. the string you provide to Django is not UTF-8 encoded and therefore some characters can't be resolved.
Some helpful advice on how to find the encoding of the requested page can be found here: urllib2 read to Unicode
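A sketch of that idea (Python 2, matching the question's urllib2 code; the UTF-8 fallback is an assumption): decode the page with the charset the server declares before saving it.
# Decode the downloaded page using the charset from the HTTP headers,
# falling back to UTF-8 when the server does not declare one (assumption).
urlResponse = urllib2.urlopen(urlReq)
charset = urlResponse.headers.getparam('charset') or 'utf-8'
htmlSource = urlResponse.read().decode(charset, 'replace')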
There are two ways to go if you want to alter the character set in MySQL.
The first is the default for the database, see MySQL Alter database, and the second is per table: MySQL Alter Table.
The database setting gives the default charset for, I believe, new tables. This can be overridden on a per-table basis, which you need to do since you already have tables. "utf8" is a supported character set.
Also have a look at Blog about UTF8 with django and MySQL.
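A sketch of those two statements, run here through MySQLdb (the connection details and the names mydb and myapp_url are placeholders):
import MySQLdb

connection = MySQLdb.connect(host="localhost", user="user", passwd="password", db="mydb")
cursor = connection.cursor()

# Default character set for new tables in the database (placeholder name "mydb").
cursor.execute("ALTER DATABASE mydb CHARACTER SET utf8 COLLATE utf8_general_ci")

# Convert an existing table, including its current contents (placeholder name "myapp_url").
cursor.execute("ALTER TABLE myapp_url CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci")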
