How to enforce utf8mb4 on table creation with to_sql?

I'm importing some data from an API in Python, formatting it and saving it to a MySQL database with to_sql.
results, types, valid = self.process_data(data, [])
if valid:
    results.to_sql(
        con=self.db.connection,
        name="degreed_" + method,
        if_exists="replace",
        index=False,
        dtype=types,
    )
In my connection I have specified utf8mb4 as the charset:
self.connection = create_engine(
    'mysql+mysqlconnector://{0}:{1}@{2}/{3}?charset=utf8mb4'.format(
        database_username, database_password, database_ip, database_name))
and in my types I have text columns as:
NVARCHAR(length=500, collation='utf8mb4_bin').
However, I still get the error message:
COLLATION 'utf8mb4_bin' is not valid for CHARACTER SET 'utf8'
In MySQL my character_set_client is utf8mb4 and the default table charset is utf8mb4. Why is the character set utf8?
Apologies if I'm doing anything stupid here, I'm quite new to sqlalchemy and mysql in general.

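It turns out the issue was that I was using NVARCHAR instead of VARCHAR. MySQL treats NVARCHAR as a NATIONAL VARCHAR, which uses its utf8 (3-byte) character set, so the column was being created as utf8 and the utf8mb4_bin collation was rejected.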
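A minimal sketch of that fix, assuming SQLAlchemy is installed and that the dtype mapping passed to to_sql is built by hand. The column names and the mysql_ddl helper are hypothetical. Compiling each type against the MySQL dialect renders the DDL fragment without connecting to a database, which makes it easy to compare what VARCHAR and NVARCHAR would emit:

```python
from sqlalchemy.dialects import mysql
from sqlalchemy.types import NVARCHAR, VARCHAR

# Hypothetical dtype mapping for to_sql; the column names are made up.
types = {
    "title": VARCHAR(length=500, collation="utf8mb4_bin"),
    "summary": VARCHAR(length=500, collation="utf8mb4_bin"),
}


def mysql_ddl(column_type):
    """Render a column type as MySQL DDL without touching a database."""
    return column_type.compile(dialect=mysql.dialect())


varchar_ddl = mysql_ddl(types["title"])
nvarchar_ddl = mysql_ddl(NVARCHAR(length=500, collation="utf8mb4_bin"))

# Print both fragments side by side for comparison.
print(varchar_ddl)
print(nvarchar_ddl)
```

The types dictionary can then be passed as dtype=types in the to_sql call from the question.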

Related

pymysql warning: 1300 invalid utf8 character string: 'FFD8FF' when upload image to database [duplicate]

I'm trying to insert binary data into a longblob column in MySQL using MySQLdb from Python 2.7, but I get an encoding warning that I don't know how to get around:
./test.py:11: Warning: Invalid utf8 character string: '8B0800'
curs.execute(sql, (blob,))
Here is the table definition:
CREATE TABLE test_table (
    id int(11) NOT NULL AUTO_INCREMENT,
    gzipped longblob,
    PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
And the test code:
#!/usr/bin/env python
import sys
import MySQLdb
blob = open("/tmp/some-file.gz", "rb").read()
sql = "INSERT INTO test_table (gzipped) VALUES (%s)"
conn = MySQLdb.connect(db="unprocessed", user="some_user", passwd="some_pass", charset="utf8", use_unicode=True)
curs = conn.cursor()
curs.execute(sql, (blob,))
I've searched here and elsewhere for the answer, but unfortunately although many questions seem like they are what I'm looking for, the posters don't appear to be having encoding issues.
Questions:
What is causing this warning?
How do I get rid of it?

After some more searching I've found the answers.
It is actually MySQL generating this warning.
It can be avoided by using _binary before the binary parameter.
https://bugs.mysql.com/bug.php?id=79317
So the Python code needs to be updated as follows:
sql = "INSERT INTO test_table (gzipped) VALUES (_binary %s)"
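To see why the server complains, here is a small standard-library sketch. It builds a gzip payload in memory (rather than reading the file from the question) and checks whether the bytes are valid UTF-8. The helper name is_valid_utf8 is made up for illustration:

```python
import gzip

# Build a small gzip payload in memory instead of reading /tmp/some-file.gz.
blob = gzip.compress(b"hello world")

# Every gzip stream starts with the magic bytes 1f 8b.
magic = blob[:2]


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(magic.hex())
print(is_valid_utf8(blob))
```

The byte 0x8b cannot start a UTF-8 sequence, which matches the '8B0800' reported in the warning. Prefixing the placeholder with _binary tells MySQL not to treat the value as a character string in the connection charset.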

Problems inserting utf8 data to PostgreSQL with Python

I am reading Scandinavian-language websites with a web crawler and wish to insert them into my PostgreSQL database.
Originally I set my PostgreSQL database encoding to UTF-8, then manually tried to insert the problematic characters like this:
Insert into name (surname) VALUES ('Børre');
This was done in the Windows psql shell.
This gave me the following error: ERROR: invalid byte sequence for encoding "UTF8": 0x9b. So after some googling I changed the client encoding to latin1. Now that statement was successful. The server encoding is still utf8.
When I do the same insert through my Python script, the name appears in my database as B°rre. If I change the client encoding back to utf8, I also get entries with wrong special characters.
My Python script is UTF-8 encoded, and it prints the name correctly.
Insert statement:
con = psycopg2.connect(*database details*)
print("Opened database successfully")
cur = con.cursor()
# Insert name
query = "INSERT INTO name (surname) VALUES (%s) RETURNING id"
data = ('børre',)  # trailing comma makes this a one-element tuple
cur.execute(query, data)
As previously stated, print(personObject.surname) gives 'Børre'
If I try the following:
query = "INSERT INTO name (surname) VALUES (%s) RETURNING id"
data = ('børre'.encode('utf-8'))
cur.execute(query,data)
I get the following in my database:
\x62c383c2b8727265

psycopg2 doesn't understand PostgreSQL queries; it just converts the arguments it is given into their PostgreSQL representation.
If you give it a bytes object, it converts it to a PostgreSQL BYTEA literal, and
data = ('børre'.encode('utf-8')) gives you bytes.
So don't do that; use a string.
The code fragment you have at the top should work.
In the stored value I see ø encoded as hex c383c2b8; decoded as UTF-8, that hex is the two characters Ã and ¸. It looks to me like Python thinks your script is not written in UTF-8, but in some other codepage.
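The double encoding can be reproduced with the standard library alone. This sketch assumes the "other codepage" is Latin-1 (Windows-1252 maps these two bytes to the same characters). The string is written with an escape so the result does not depend on how this snippet itself is saved:

```python
# 'børre', written with an escape so the source file encoding does not matter.
original = "b\u00f8rre"

# Correct UTF-8 bytes for the name.
utf8_bytes = original.encode("utf-8")

# Misreading those bytes as Latin-1 turns the two-byte ø into two characters.
misread = utf8_bytes.decode("latin-1")

# Encoding the misread text as UTF-8 again yields the doubled sequence.
double_encoded = misread.encode("utf-8")

print(utf8_bytes.hex())
print(double_encoded.hex())
```

The second hex string is the same byte sequence that appears in the database value from the question, which supports the idea that the source text was decoded with the wrong codepage before being encoded.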

Use the client_encoding keyword, e.g.:
conn = psycopg2.connect("dbname='foo' user='dbuser' password='mypass' client_encoding='utf8'")

MySQLdb can't initialize character set utf-8 error

I am trying to insert some Arabic words into the arabic_word column of the hanswehr2 table in my MariaDB database using the MySQLdb driver.
I was getting a latin-1 encode error. After reading around, I found out that the MySQLdb driver defaults to latin-1 and that I had to explicitly set utf-8 as my charset in the mariadb.connect() function.
The entire database is set to utf-8.
Code:
def insert_into_db(arabic_word, definition):
    try:
        conn = mariadb.connect('localhost', 'root', 'xyz1234passwd', 'hans_wehr', charset='utf-8', use_unicode=True)
        conn.autocommit(True)
        cur = conn.cursor()
        cur.execute("INSERT INTO hanswehr2 (arabic_word , definition) VALUES (%s,%s)", (arabic_word, definition,))
    except mariadb.Error as e:
        print(e)
        sys.exit(1)
However now I get the following error:
/usr/bin/python2.7 /home/heisenberg/hans_wehr/main.py
Total lines 87672
(2019, "Can't initialize character set utf-8 (path: /usr/share/mysql/charsets/)")
Process finished with exit code 1
I have told the Python MySQL driver to use the utf-8 character set; however, it seems to ignore this.
Any inputs would be highly appreciated.

The charset alias for UTF-8 in MySQL is utf8 (no hyphen).
See https://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html for available charsets.
Note: if you need non-BMP Unicode code points, such as emojis, use utf8mb4 for the connection charset and for the varchar columns.
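As a rough way to tell whether a given string needs utf8mb4, one can check for code points above U+FFFF. This is a standard-library sketch; the helper name needs_utf8mb4 and the sample strings are made up for illustration:

```python
def needs_utf8mb4(text):
    """Return True if any code point lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


# A Latin name, an Arabic word, and an emoji (U+1F600).
samples = ["B\u00f8rre", "\u0643\u062a\u0627\u0628", "\U0001F600"]

for sample in samples:
    # Show how many bytes each string takes in UTF-8 and whether it needs utf8mb4.
    print(len(sample.encode("utf-8")), needs_utf8mb4(sample))
```

Arabic text sits inside the BMP, so MySQL's 3-byte utf8 can store it; the emoji takes four bytes in UTF-8 and cannot be stored without utf8mb4.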

There is a thing called a collation, which defines how the characters of a character set are compared and sorted for specific languages.
https://softwareengineering.stackexchange.com/questions/95048/what-is-the-difference-between-collation-and-character-set
I think you need to specify it when creating your database table or in the connection string. See:
store arabic in SQL database
More on the Python MySQL connection:
https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlconnection-set-charset-collation.html

Using SQLAlchemy and pymysql, how can I set the connection to utilize utf8mb4?

I discovered (the hard way) that MySQL's UTF8 character set is only 3 bytes. A bit of research shows I can fix this by changing the tables to utilize the utf8mb4 collation and get the full 4 bytes UTF should be.
I've done so. My database, tables and columns have all been ALTERed to utilize this charset. However, I still receive this message if I have data that has unicode code points larger than U+FFFF:
Illegal mix of collations (utf8mb4_general_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='"
I discovered I have the following settings:
> show variables like '%collation%';
collation_connection utf8_general_ci
collation_database utf8mb4_general_ci
collation_server utf8mb4_general_ci
The collation_server was set by making changes to my.cnf. My question, is how do I change the connection one? I currently connect to the database using SQL Alchemy and pymysql like this:
connect_string = 'mysql+pymysql://{}:{}@{}:{}/{}?charset=utf8'.format(DB_USER, DB_PASS, DB_HOST, DB_PORT, DATABASE)
engine = create_engine(connect_string, convert_unicode=True, echo=False)
session = sessionmaker()
session.configure(bind=engine)
What can I do to change from utf8_general_ci to utf8mb4_general_ci when connecting via SQL Alchemy?

Change the connect_string to use charset=utf8mb4:
connect_string = 'mysql+pymysql://{}:{}@{}:{}/{}?charset=utf8mb4'.format(DB_USER, DB_PASS, DB_HOST, DB_PORT, DATABASE)
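If the credentials may contain characters such as @ or /, the URL can be assembled with percent-encoding first. This is a sketch using only the standard library; build_mysql_url and the credentials shown are hypothetical placeholders, and nothing here opens a connection:

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, database, charset="utf8mb4"):
    """Assemble a SQLAlchemy-style URL string; nothing is connected here."""
    return "mysql+pymysql://{}:{}@{}:{}/{}?charset={}".format(
        quote_plus(user), quote_plus(password), host, port, database, charset
    )


# Placeholder credentials, not taken from the question.
connect_string = build_mysql_url("app_user", "p@ss/word", "localhost", 3306, "mydb")
print(connect_string)
```

The resulting string can be passed to create_engine in place of the hand-formatted one. After connecting, running show variables like '%collation%' again should confirm whether collation_connection changed.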

Python MySQLdb escape_string

MySQLdb is a Python module for communicating with a MySQL database. escape_string is a method provided by MySQLdb to escape certain characters in SQL. For example, SQL like 'Update table Set col = "My"s"' will cause an error, so escape_string adds a '\' before the " in My"s.
However, in a multibyte encoding like gbk, which uses two bytes to store a Chinese character, escape_string scans for characters to escape one byte at a time, which causes some characters to be escaped incorrectly. For example, the Chinese character '昞' has the bytes '\x95\x5c'. If the SQL is 'update table set col = "昞"', then MySQLdb.escape_string(sql) returns update table set col = "昞\", which is wrong and cannot be executed correctly.
Has anyone come across this problem?
P.S. I googled the problem and found that PHP has a method, mysqli_set_charset, that solves this case, so I wonder whether there is an equivalent in Python.

This problem is most likely caused by the default character set for your connection being latin1 rather than a Unicode one. There are a couple of different things you can try. From this post:
conn = mysql.connect(host='127.0.0.1',
                     user='user',
                     passwd='passwd',
                     db='db',
                     charset='utf8',
                     use_unicode=True)
Then run your query like this:
cursor.execute('INSERT INTO mytable VALUES (null, %s)',
               ('\x95\x5c',))
Apparently a similar problem was solved by running the following query first:
SET NAMES 'gbk'
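The failure described in the question can be illustrated without a database. The sketch below is not MySQLdb's actual implementation; naive_escape is a made-up stand-in for any escaper that works byte by byte without knowing the connection charset. The input is the byte pair the question reports for the character in GBK:

```python
# Byte sequence the question reports for the character in GBK.
raw = b"\x95\x5c"


def naive_escape(data):
    """Escape backslashes and double quotes byte by byte, ignoring the charset."""
    out = bytearray()
    for byte in data:
        if byte in (0x5C, 0x22):  # backslash or double quote
            out.append(0x5C)
        out.append(byte)
    return bytes(out)


escaped = naive_escape(raw)
print(raw.hex(), "->", escaped.hex())
```

The second byte of the character is 0x5c, the same value as an ASCII backslash, so a charset-unaware escaper inserts an extra backslash after it. Telling the connection which charset is in use, and passing the value as a query parameter rather than escaping it by hand, avoids this.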
