Force character set conversion - python

I have an application that writes data to a Microsoft SQL Server. The database's character set is CP1252, and the incoming data to be saved is UTF-8. The data may contain characters that cannot be converted to CP1252, which throws an exception on insert.
The database guys said that I should just crunch the data to CP1252 forcibly, like this:
some_value = some_value.encode('CP1252', 'replace')
But SQLAlchemy does the conversion automatically and I don't see a way to force the conversion.
engine = sqlalchemy.create_engine(
    'mssql+pyodbc://...',
    encoding='CP1252',
    convert_unicode=True,
)
It is critical that the data is saved, even with some missing characters. How can I implement this? Note that I'm using a lot of database reflection in this case.

I don't see a problem.
some_value = some_value.encode('CP1252', 'replace').decode('CP1252')
If some_value isn't actually a unicode string but raw UTF-8 data:
some_value = some_value.decode("utf-8").encode('cp1252', 'replace').decode('cp1252')
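Both cases above can be folded into a single helper. This is a minimal sketch, assuming Python 3 `str`/`bytes` semantics and that any raw input is valid UTF-8; the name `force_cp1252` is hypothetical and not part of SQLAlchemy.

```python
def force_cp1252(value):
    """Return text that contains only characters representable in CP1252."""
    # Raw UTF-8 input is decoded to text first (assumes it is valid UTF-8).
    if isinstance(value, (bytes, bytearray)):
        value = bytes(value).decode("utf-8")
    # Unencodable characters become '?'; decoding again yields clean text.
    return value.encode("cp1252", "replace").decode("cp1252")


print(force_cp1252("caf\u00e9 \u4e2d"))  # prints: café ?
```

Applying such a helper to each string value before it reaches the session means the driver only ever sees characters it can convert.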

Related

How to partially overwrite blob in an sqlite3 db in python via SQLAlchemy?

I have a db that contains a blob column with the binary representation as follows
The value that I'm interested in is encoded as a little-endian unsigned long long (8 bytes) in the marked bytes. Reading this value works fine like this:
from struct import unpack

p = session.query(Properties).filter((Properties.object_id==1817012) & (Properties.name.like("%OwnerUniqueID"))).one()
id = unpack("<Q", p.value[-8:])[0]
id in the above example is 1657266.
Now what I would like to do is the reverse. I have the row object p and a number in decimal format (using the same 1657266 for testing purposes), and I want to write that number in little-endian format to those same 8 bytes.
I've been trying to do so via SQL statement
UPDATE properties SET value = (SELECT substr(value, 1, length(value)-8) || x'b249190000000000' FROM properties WHERE object_id=1817012 AND name LIKE '%OwnerUniqueID%') WHERE object_id=1817012 AND name LIKE '%OwnerUniqueID%'
But when I do it like that I then can't read it anymore. At least not with SQLAlchemy. When I try the same code as above, I get the error message Could not decode to UTF-8 column 'properties_value' with text '☻' so it looks like it's written in a different format.
Interestingly using a normal select statement in DB Browser still works fine and the blob is still displayed exactly as in the screenshot above.
Now ideally I'd like to be able to write just those 8 bytes using the SQLAlchemy ORM but I'd settle for a raw SQL statement if that's what it takes.
I managed to get it to work with SQLAlchemy by basically reversing the process that I used to read it. In hindsight using the + to concatenate and the [:-8] to slice the correct part seems pretty obvious.
from struct import pack

p = session.query(Properties).filter((Properties.object_id==1817012) & (Properties.name.like("%OwnerUniqueID"))).one()
p.value = p.value[:-8] + pack("<Q", 1657266)
By turning on ECHO for SQLAlchemy I got the following raw SQL statement:
UPDATE properties SET value=? WHERE properties.object_id = ? AND properties.name = ?
(<memory at 0x000001B93A266A00>, 1817012, 'BP_ThrallComponent_C.OwnerUniqueID')
Which is not particularly helpful if you want to do the same thing manually I suppose.
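The slice-and-concatenate step can be checked on its own, without a database. This is a minimal sketch; `replace_trailing_u64` is a hypothetical helper name and the blob prefix is made up for illustration.

```python
from struct import pack, unpack


def replace_trailing_u64(blob, new_value):
    """Return a copy of blob whose last 8 bytes hold new_value as a little-endian uint64."""
    if len(blob) < 8:
        raise ValueError("blob is shorter than 8 bytes")
    return bytes(blob[:-8]) + pack("<Q", new_value)


original = b"made-up-prefix" + pack("<Q", 1)  # illustrative blob, not real data
updated = replace_trailing_u64(original, 1657266)

print(updated[-8:].hex())             # b249190000000000
print(unpack("<Q", updated[-8:])[0])  # 1657266
```

The hex output matches the literal used in the raw SQL statement from the question, so both approaches write the same 8 bytes.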
It's worth noting that the raw SQL statement in my question works not only when reading the value with DB Browser but also with the game client that uses the db in question. Only SQLAlchemy seems to have trouble, apparently because it tries to decode the value as UTF-8.
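A plausible explanation, not confirmed in this thread, is that SQLite's `||` operator produces a TEXT value even when both operands are blobs, so the updated cell is stored as text and the Python driver then tries to decode it as UTF-8. The sketch below uses an in-memory database with a made-up two-row table to compare the storage class with and without an explicit `CAST(... AS BLOB)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE properties (object_id INTEGER, value BLOB)")
for object_id in (1, 2):
    # Illustrative blob: a short prefix followed by 8 zero bytes.
    conn.execute("INSERT INTO properties VALUES (?, ?)",
                 (object_id, b"\x02abc" + bytes(8)))

# Row 1: plain concatenation, as in the question.
conn.execute(
    "UPDATE properties SET value = "
    "substr(value, 1, length(value) - 8) || x'b249190000000000' "
    "WHERE object_id = 1"
)
# Row 2: the same expression, explicitly cast back to a blob.
conn.execute(
    "UPDATE properties SET value = "
    "CAST(substr(value, 1, length(value) - 8) || x'b249190000000000' AS BLOB) "
    "WHERE object_id = 2"
)

# typeof() reports the storage class SQLite actually used for each row.
types = dict(conn.execute("SELECT object_id, typeof(value) FROM properties"))
print(types)
```

If the first row reports `text` and the second `blob`, wrapping the concatenation in `CAST(... AS BLOB)` would be the raw-SQL counterpart of the ORM fix.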

Python error - Unicode/Ascii problems with value pulled out of MySql database

This has been asked a million times but every single thing I try hasn't worked and all are for slightly different issues. I'm losing my mind over it!
I have a Python Script which pulls data from a MySql database - all works well.
Database Information:
I believe the information in the database is correct. I am trying to parse multiple records into word documents - that is why I am not too bothered about accuracy - even if the bad characters are removed - that is fine.
The Charset of the database is UTF-8 and the field I am working with is VarChar
I am using mysql.connector python module to connect
However, I am getting errors and I've realised it's because of values with unicode in, such as this:
The value of this item is "DOMAINoardroom".
I have tried:
text = order[11].encode().decode("utf-8")
text = order[11].encode("ascii", errors="ignore").decode()
text = str(order[11].encode("utf8", errors="ignore"))
The last one does work; however, it outputs b'DOMAIN\x08oardroom' because the result is bytes.
I can get it to accept the text by print(text) to the screen. However when I try to output it to a word document (using the docx module), it produces an error:
table = document.add_table(rows=total_orders*2, cols=1)
row = table.rows[0].cells
row[0].text = row_text
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
I am not particularly fussy over how it handles the unicode, e.g. remove it if needed, but I just need it to parse without error.
Any thoughts or advice here?
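One possible approach, offered as a sketch rather than a confirmed fix: the ValueError says control characters are not allowed, so they can be removed before the text is assigned to the cell. The helper name `strip_xml_illegal` is hypothetical, and the pattern covers only the C0 control characters that XML 1.0 forbids (tab, newline and carriage return are kept).

```python
import re

# C0 control characters that XML 1.0 does not allow (tab, LF and CR are allowed).
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_xml_illegal(text):
    """Remove control characters that cannot appear in an XML document."""
    return _XML_ILLEGAL.sub("", text)


print(strip_xml_illegal("DOMAIN\x08oardroom"))  # prints: DOMAINoardroom
```

With this in place, the assignment would become something like `row[0].text = strip_xml_illegal(row_text)`.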

Python MySQL connector returns bytearray instead of regular string value

I am loading data from one table into pandas and then inserting that data into new table. However, instead of normal string value I am seeing bytearray.
bytearray(b'TM16B0I8') it should be TM16B0I8
What am I doing wrong here?
My code:
engine_str = 'mysql+mysqlconnector://user:pass@localhost/db'
engine = sqlalchemy.create_engine(engine_str, echo=False, encoding='utf-8')
connection = engine.connect()
th_df = pd.read_sql('select ticket_id, history_date', con=connection)
for row in th_df.to_dict(orient="records"):
    var_ticket_id = row['ticket_id']
    var_history_date = row['history_date']
    query = 'INSERT INTO new_table(ticket_id, history_date)....'
For some reason the Python MySQL connector only returns bytearrays (more info in How return str from mysql using mysql.connector?), but you can decode them into unicode strings with
var_ticket_id = row['ticket_id'].decode()
var_history_date = row['history_date'].decode()
Make sure you are using the right collation and encoding. One of my website's db tables used UTF8MB4_BIN; changing it to utf8mb4_general_ci did the trick.
Producing a bytearray is now the expected behaviour.
It changed with mysql-connector-python 8.0.24 (2021-04-20). According to the v8.0.24 release notes, the earlier behaviour ("Binary columns were returned as strings instead of 'bytes' or 'bytearray'") was a bug that was fixed in that release.
So producing a Python bytearray is the correct behaviour if the database column is a binary type (e.g. binary or varbinary). Previously it produced a Python string, but now it produces a bytearray.
So either change the data type in the database to a non-binary type, or convert the bytearray to a string in your code. If the column is nullable, you'll have to check for that first, since calling decode() on None raises an error. You'll also have to be sure the bytes represent a valid string in the character encoding used for decoding.
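The checks described above can be collected in one small function. This is a minimal sketch; `to_text` is a hypothetical helper name, and UTF-8 is an assumed default that should match the column's actual character set.

```python
def to_text(value, encoding="utf-8"):
    """Decode bytes/bytearray column values to str, passing None and str through."""
    if value is None:
        return None  # nullable column: nothing to decode
    if isinstance(value, (bytes, bytearray)):
        # Raises UnicodeDecodeError if the bytes are not valid in `encoding`.
        return value.decode(encoding)
    return value


print(to_text(bytearray(b"TM16B0I8")))  # prints: TM16B0I8
```

In the loop from the question this would read `var_ticket_id = to_text(row['ticket_id'])`.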
Much easier...
How to return str from MySQL using mysql.connector?
Adding mysql-connector-python==8.0.17 to requirements.txt resolved this issue for me
"pip install mysql-connector-python" from terminal

pickle unicode strings with non-ascii caracters to mysql in django

Consider that I have a dictionary that I want to store in the db using Python's pickle.
My question is: which Django model field should I use?
So far I've been using a CharField, but there seems to be an error:
I pickle a u'\xe9' (i.e. 'é'), and I get:
Incorrect string value: '\xE1, ist...' for column 'edition' at row 1
(the ", ist..." is there because I have more text after the 'é').
I'm using
data = dict();
data['foo'] = input_that_has_the_caracter
to_save_in_db = cPickle.dumps(data)
Should I use a binary field and pickle with a protocol that uses binary? Because I have to change the db in order to do that, so it is better to be sure first...
You should check that you are using a proper encoding for both the table AND the column in your database backend (I'm assuming MySQL, since your error message seems to come from it). In MySQL, a column can have a different encoding from its table. Check that both are UTF-8.
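An alternative that is not suggested in this thread, shown only as a sketch: if switching to a binary column is not practical, the pickle can be wrapped in base64 so that the stored value is plain ASCII and survives any text column encoding. The helper names `dumps_text` and `loads_text` are hypothetical, and the example assumes Python 3.

```python
import base64
import pickle


def dumps_text(obj):
    """Pickle obj and return the result as an ASCII-only string."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def loads_text(text):
    """Inverse of dumps_text. Only unpickle data you wrote yourself."""
    return pickle.loads(base64.b64decode(text))


data = {"foo": "\u00e9, ist"}
stored = dumps_text(data)          # safe to put in a CharField/TextField
print(loads_text(stored) == data)  # prints: True
```

The trade-off is that base64 makes the stored value roughly a third larger, so the field length needs to allow for that.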

How to solve this double encoding?

I'm developing a website using python to preprocess request and a MySQL database to store information.
All my tables are utf8 and I also use utf8 as Content-type.
I have this code to establish connection to the db:
database_connection = MySQLdb.connect(host = database_host, user = database_username, passwd = database_password, db = database_name, use_unicode = True)
cursor = database_connection.cursor()
cursor.execute("""SET NAMES utf8;""");
cursor.execute("""SET CHARACTER SET utf8;""");
cursor.execute("""SET character_set_connection=utf8;""");
Running a simple test on my GoDaddy hosting printing the results of a simple SELECT query like this:
print results.encode("utf-8")
Shows a double-encoded string (every non-ASCII character is turned into two different special characters). But if I leave out the encode call, it gives an encoding error for each non-ASCII letter.
It sounds as though results contains a Unicode string that was incorrectly decoded from a byte string coming from the database. I.e. when you read the data from the database, it decoded the byte string as Latin-1 rather than the UTF-8 it really is.
So if you fix the decoding of the database contents, then you should be in business.
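The effect described above can be reproduced and reversed in a few lines. This is a minimal sketch assuming Python 3 strings and assuming the wrong decoding really was Latin-1; the sample word is made up.

```python
original = "caf\u00e9"  # 'café'

# UTF-8 bytes wrongly decoded as Latin-1: each non-ASCII character becomes two.
mojibake = original.encode("utf-8").decode("latin-1")

# Reversing the wrong decoding step recovers the intended text.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(len(original), len(mojibake))  # prints: 4 5
print(repaired == original)          # prints: True
```

This kind of repair is only a diagnostic aid; the lasting fix is to make the connection decode the bytes as UTF-8 in the first place.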
I use something like this which I found on the internet during one of my own encoding hunts. You can keep on chaining encoding styles to find a fit.
Also, as others said, try fixing the source first. This hack is just to figure out what encoding is being actually returned. Hope this helps.
# This method is a simple recursive hack that tries to find a compatible encoding for the problematic field.
# It does not guarantee a successful encoding match. If no match is found, the error code ENC_ERR is returned.
def findencoding(field, level):
    print("level: " + str(level))
    try:
        if level == 0:
            field = field.encode('cp1252')
        elif level == 1:
            field = field.encode('cp1254')
        else:
            return "ENC_ERR"
    except Exception:
        # This encoding failed, so try the next one in the chain.
        field = findencoding(field, level + 1)
    return field
