MySQLdb can't initialize character set utf-8 error - python

I am trying to insert some Arabic words into the arabic_word column of the hanswehr2 table in my MariaDB database using the MySQLdb driver.
I was getting a latin-1 encode error. After reading around, I found out that the MySQLdb driver defaults to latin-1 and that I had to explicitly set utf-8 as my charset in the mariadb.connect() call.
The entire database is set to utf-8.
Code:
def insert_into_db(arabic_word, definition):
    try:
        conn = mariadb.connect('localhost', 'root', 'xyz1234passwd', 'hans_wehr', charset='utf-8', use_unicode=True)
        conn.autocommit(True)
        cur = conn.cursor()
        cur.execute("INSERT INTO hanswehr2 (arabic_word , definition) VALUES (%s,%s)", (arabic_word, definition,))
    except mariadb.Error, e:
        print e
        sys.exit(1)
However now I get the following error:
/usr/bin/python2.7 /home/heisenberg/hans_wehr/main.py
Total lines 87672
(2019, "Can't initialize character set utf-8 (path: /usr/share/mysql/charsets/)")
Process finished with exit code 1
I have told the Python MySQL driver to use the utf-8 character set, but it seems to ignore this.
Any input would be highly appreciated.

The charset alias for UTF-8 in MySQL is utf8 (no hyphen).
See https://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html for available charsets.
Note: if you need non-BMP Unicode code points, such as emoji, use utf8mb4 for both the connection charset and the character set of the varchar columns.
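To illustrate the naming mismatch: Python spells the codec "utf-8", while MySQL expects "utf8" or "utf8mb4". The sketch below uses a hypothetical helper, mysql_charset (it is not part of MySQLdb), to translate a Python codec name before it is handed to connect(). The connect() call itself is only shown in a comment because it needs a live server.

```python
import codecs


def mysql_charset(python_name, full_unicode=True):
    """Translate a Python codec name into a MySQL charset name.

    Hypothetical helper for illustration; it only knows about UTF-8.
    """
    # codecs.lookup() normalizes spellings such as "UTF8", "utf_8" and "utf-8".
    canonical = codecs.lookup(python_name).name
    if canonical != "utf-8":
        raise ValueError("no mapping defined for %r" % python_name)
    # utf8mb4 covers all of Unicode; MySQL's utf8 is limited to the BMP.
    return "utf8mb4" if full_unicode else "utf8"


charset = mysql_charset("utf-8")
print(charset)

# With a running server this would be used as, for example:
# conn = MySQLdb.connect(host, user, passwd, db, charset=charset, use_unicode=True)
```

The point is only that the value passed as charset= has to be a name MySQL knows, not the Python codec name.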

A collation is the set of rules MySQL uses to compare and sort the characters of a character set, and some collations are tailored to specific languages.
https://softwareengineering.stackexchange.com/questions/95048/what-is-the-difference-between-collation-and-character-set
I think you need to specify the character set and collation when creating your database table or in the connection settings. See the related question "store arabic in SQL database".
More on setting the charset and collation on a Python MySQL connection:
https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlconnection-set-charset-collation.html

Related

How to enforce utf8mb4 on table creation with to_sql?

I'm importing some data from an API in Python, formatting it and saving it to a MySQL database with to_sql.
results, types, valid = self.process_data(data, [])
if valid:
    results.to_sql(
        con=self.db.connection,
        name="degreed_" + method,
        if_exists="replace",
        index=False,
        dtype=types,
    )
In my connection I have specified utf8mb4 as the charset:
self.connection = create_engine(
    'mysql+mysqlconnector://{0}:{1}@{2}/{3}?charset=utf8mb4'.
    format(database_username, database_password, database_ip, database_name))
and in my types I have text columns as:
NVARCHAR(length=500, collation='utf8mb4_bin').
However, I still get the error message:
COLLATION 'utf8mb4_bin' is not valid for CHARACTER SET 'utf8'
In MySQL my character_set_client is utf8mb4 and the default table charset is utf8mb4. Why is the character set utf8?
Apologies if I'm doing anything stupid here, I'm quite new to sqlalchemy and mysql in general.
Turns out the issue was that I was using NVARCHAR instead of VARCHAR. In MySQL, NVARCHAR always uses the predefined national character set, utf8, so the utf8mb4_bin collation was rejected for it.

Problems inserting utf8 data to PostgreSQL with Python

I am reading Scandinavian-language websites with a web crawler and wish to insert the results into my PostgreSQL database.
Originally I set the encoding of my PostgreSQL database to UTF-8, then tried to manually insert a value containing the problematic characters, like this:
Insert into name (surname) VALUES ('Børre');
This was done in the Windows psql shell.
This gave me the following error: ERROR: invalid byte sequence for encoding "UTF8": 0x9b. After some googling I changed the client encoding to latin1, and the statement then succeeded. The server encoding is still utf8.
When I do the same insert through my Python script, the name appears in my database as B°rre. If I change the client encoding back to utf8, I also get entries with wrong special characters.
My Python script is UTF-8 encoded, and it prints the name correctly.
Insert statement:
con = psycopg2.connect(*database details*)
print("Opened database successfully")
cur = con.cursor()
#INSERT NAME
query = "INSERT INTO name (surname) VALUES (%s) RETURNING id"
data = ('børre')
cur.execute(query,data)
As previously stated, print(personObject.surname) gives 'Børre'
If I try the following:
query = "INSERT INTO name (surname) VALUES (%s) RETURNING id"
data = ('børre'.encode('utf-8'))
cur.execute(query,data)
I get the following in my database:
\x62c383c2b8727265
psycopg2 doesn't parse PostgreSQL queries; it just converts the arguments it is given into their PostgreSQL representation.
If you give it a bytes object, it will convert it to a PostgreSQL BYTEA literal.
data = ('børre'.encode('utf-8')) gives you a bytes object.
So don't do that; pass a string.
The code fragment you have at the top should work.
In the stored value I see ø encoded as hex c383c2b8, which decodes as UTF-8 to two characters, Ã and ¸. It looks to me like Python thinks your script is not written in UTF-8 but in some other code page.
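The double encoding described above can be reproduced in plain Python. This is a sketch of one way to arrive at exactly the bytes from the question, assuming the UTF-8 bytes of the source file were misread as Latin-1 before being encoded again; it does not prove that this is what happened in the asker's setup.

```python
word = "børre"

# Correct UTF-8 bytes for the word: ø becomes the two bytes c3 b8.
utf8_bytes = word.encode("utf-8")

# Misread those bytes as Latin-1, so each byte turns into its own character.
misread = utf8_bytes.decode("latin-1")

# Encode the misread text as UTF-8 a second time.
double_encoded = misread.encode("utf-8")

print(utf8_bytes.hex())
print(misread)
print(double_encoded.hex())
```

The last line prints the same hex digits that ended up in the database after the \x prefix.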
Use the client_encoding keyword, e.g.:
conn = psycopg2.connect("dbname='foo' user='dbuser' password='mypass' client_encoding='utf8'")
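The 0x9b byte and the ° in the question are both consistent with a Windows console running on code page 850. That code page is an assumption; the question does not say which one the console used. Under that assumption, a short check shows why the client encoding matters:

```python
# Assumption: the Windows console uses code page 850.
console = "cp850"

# The console sends this single byte for ø; it is not valid UTF-8 on its own.
console_byte = "ø".encode(console)
print(console_byte)

# A Latin-1 byte for ø, rendered by a cp850 console, shows up as a degree sign.
displayed = "ø".encode("latin-1").decode(console)
print(displayed)
```

In other words, client_encoding has to describe the bytes the client really sends and expects, whichever code page that turns out to be.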

inserting unicode values in mysql using python

I want to insert Unicode text into a MySQL table, so I have written the code below.
I am using the Flask framework.
import MySQLdb as ms
db = ms.connect("localhost","username","password","dbname")
cursor = db.cursor()
my_text = "का" #this is marathi lang word
enc = my_text.encode('utf-8') #after encoding still it shows me marathi word
db_insert = "INSERT INTO TEXT(text) VALUES '{0}'"
cursor.execute(db_insert,(enc))
db.commit()
It gives me the following error:
TypeError: not all arguments converted during string formatting on line cursor.execute()
How can I get rid of this error?
Put this at the beginning of the source file:
# -*- coding: utf-8 -*-
And don't encode something that is already encoded: remove my_text.encode('utf-8').
Use charset="utf8", use_unicode=True in the connection call.
The CHARACTER SET of the table/column must be utf8 or utf8mb4; latin1 will not work correctly.
See also: Python checklist
You need to pass a sequence (a list or tuple) as the params argument of cursor.execute:
db_insert = "INSERT INTO TEXT(text) VALUES (%s)"
# notice the comma at the end to convert it to tuple
cursor.execute(db_insert,(enc,))
The documentation explains the reason:
"Why the tuple? Because the DB API requires you to pass in any parameters as a sequence."
Alternatively, you could even use named parameters:
db_insert = "INSERT INTO TEXT(text) VALUES (%(my_text)s)"
#and then you can pass a dict with key and value
cursor.execute(db_insert, {'my_text': enc})
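Both halves of the problem can be seen without a database. The sketch below assumes that the driver fills in parameters with Python's % operator, which is my understanding of MySQLdb but is not confirmed in this thread; plain % formatting is used here only as a stand-in for cursor.execute.

```python
enc = "का"

# Parentheses alone do not make a tuple; the trailing comma does.
not_a_tuple = (enc)
one_tuple = (enc,)
print(type(not_a_tuple).__name__, type(one_tuple).__name__)

# The original statement has no %s placeholder, so %-formatting has
# nowhere to put the argument and raises the TypeError from the question.
try:
    "INSERT INTO TEXT(text) VALUES '{0}'" % one_tuple
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# With a %s placeholder and a one-element tuple the formatting succeeds.
# (A real driver also escapes and quotes the value; plain % does not.)
formatted = "INSERT INTO TEXT(text) VALUES (%s)" % one_tuple
print(formatted)
```

So the fix needs both changes: a %s placeholder in the SQL and a real sequence for the parameters.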

Python MySQLdb escape_string

MySQLdb is a Python module for communicating with a MySQL database. escape_string is a function provided by MySQLdb to escape certain characters in SQL. For example, SQL like 'Update table Set col = "My"s"' will cause an error, so escape_string adds a '\' before the " in My"s.
However, in a multibyte encoding like gbk, which uses two bytes to store a Chinese character, escape_string looks for characters to escape one byte at a time, which causes some characters to be escaped incorrectly. For example, take the Chinese character '昞', whose bytes are '\x95\x5c': if the SQL is 'update table set col = "昞"', then MySQLdb.escape_string(sql) returns: update table set col = "昞\", which is wrong and cannot be executed correctly.
Has anyone else come across this problem?
P.S. I googled the problem and found that PHP has a function, mysqli_set_charset, that solves this case, so I wonder whether there is an equivalent in Python.
This problem is most likely caused by the default character set for your connection being latin1 instead of a Unicode charset. There are a couple of different things you can try. One option, taken from another post, is to set the charset when connecting:
conn = mysql.connect(host='127.0.0.1',
                     user='user',
                     passwd='passwd',
                     db='db',
                     charset='utf8',
                     use_unicode=True)
Then run your query like this:
cursor.execute('INSERT INTO mytable VALUES (null, %s)',
               ('\x95\x5c',))
Apparently a similar problem was solved by running the following query first:
SET NAMES 'gbk'
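Why the connection charset matters for escaping can be shown with the two bytes from the question. Both functions below are hypothetical stand-ins written for this illustration; they are not MySQLdb's implementation. One escapes raw bytes without knowing the charset, the other escapes the decoded text and encodes afterwards.

```python
# The two GBK bytes from the question; the second one equals an ASCII backslash.
raw = b"\x95\x5c"
char = raw.decode("gbk")


def naive_escape(data):
    """Byte-wise escaping that knows nothing about the charset (hypothetical)."""
    return data.replace(b"\\", b"\\\\").replace(b'"', b'\\"')


def charset_aware_escape(text, charset):
    """Escape the decoded text first, then encode it (hypothetical)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return escaped.encode(charset)


# The trail byte is mistaken for a backslash and doubled, corrupting the value.
print(naive_escape(raw))

# On decoded text there is no backslash character to escape, so the bytes survive.
print(charset_aware_escape(char, "gbk") == raw)
```

Telling the driver the real charset, through charset= or SET NAMES, is what lets the escaping work on characters rather than on individual bytes.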

Pymssql, How to use it to read unicode data from MSSQL2008

I've used pymssql-1.0.2 and freetds-0.82.7 on ubuntu-10.10.
Also, I have a mssql2008 server on windows-7.
I can connect with mssql from ubuntu using pymssql and freetds.
But I can't get unicode data from mssql database. Database collation is Cyrillic_General_CI_AS.
My freetds.conf file looks like this:
[mssql2008]
host=10.0.0.34
port=1433
tds version=7.0
My code looks like this:
conn = pymssql.connect(host=r'10.0.0.34\mssql2008', user='***', password='***', database='eoffice', as_dict=True, charset='iso-8859-1')
crms = conn.cursor()
crms.execute('SELECT cc_Name FROM tblHR_CodeClass')
for row in crms.fetchall():
    # show the first row only
    print(u"Succeeded! Test data: " + row['cc_Name'])
    break
Expected result is: "Өмнөговь аймаг"
Actual result is: "ªìíºãîâü àéìàã"
When I use the 'UTF-8' charset, the fetchall() call throws an error, meaning that utf8 can't decode data that is outside the range of the code page.
How can I get the Unicode data as it is stored in the MSSQL database?
Please lend a hand!
Regards,
Orgil
Is it really Unicode data? I.e., is the cc_Name column varchar or nvarchar? It sounds like it's varchar--in which case, try using cp1251 or windows-1251 as the charset instead of iso-8859-1.
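The suggestion can be checked against the sample string from the question without a server. This sketch reverses the iso-8859-1 decoding and decodes the same bytes as cp1251 instead. The final translate step is a hypothetical workaround of my own: cp1251 has no Ө or ө, so the bytes 0xAA and 0xBA come out as the Ukrainian letters Є and є.

```python
actual = "ªìíºãîâü àéìàã"

# Undo the wrong decoding: recover the raw bytes, then decode them as cp1251.
raw = actual.encode("iso-8859-1")
as_cp1251 = raw.decode("cp1251")
print(as_cp1251)

# Ө and ө are not in cp1251; 0xAA and 0xBA decode to Є and є instead.
# Hypothetical workaround: map those two letters by hand.
mongolian_fix = str.maketrans({"Є": "Ө", "є": "ө"})
repaired = as_cp1251.translate(mongolian_fix)
print(repaired)
```

With cp1251 every letter except those two already matches the expected result, which supports the guess that the column is varchar in a Cyrillic code page rather than nvarchar.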
