1366 mysql incorrect string value for Zhuyin Fuhao - python

The "Incorrect string value error" is raised from MySQLdb.
_mysql_exceptions.OperationalError: (1366, "Incorrect string value: '\\xF0\\xA0\\x84\\x8E\\xE2\\x8B...' for column 'from_url' at row 1")
But I have already set both the connection charset and the from_url encoding to utf8, and this worked without problems for millions of records.
I think the issue is related to the special character u'\U0002010e' (a Chinese special character "ㄋ").
The value that causes the exception:
u'http://www.ettoday.net/news/20120227/27879.htm?fb_action_ids=305328666231772&
fb_action_types=og.likes&fb_source=aggregation&fb_aggregation_id=288381481237582
\u7c89\u53ef\u611b\U0002010e\u22ef http://www.ettoday.net/news/20120221/26254.h
tm?fb_action_ids=305330026231636&fb_action_types=og.likes&fb_source=aggregation&
fb_aggregation_id=288381481237582 \u597d\u840c\u53c8\u22ef'
But this character can be encoded as UTF-8 in Python without any problem:
>>> u'\U0002010e'.encode('utf8')
'\xf0\xa0\x84\x8e'
So why can't MySQL accept this character?

The character you are using is outside the BMP, therefore it requires 4 bytes to store. Using the utf8 charset is not enough; you must have MySQL 5.5 or greater and use the utf8mb4 charset instead.
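As a rough illustration of the point above, the following Python 3 sketch finds the characters in a string that lie outside the BMP and reports how many bytes each needs in UTF-8. The helper name non_bmp_chars is hypothetical, and the sample is only a fragment of the value from the question.

```python
def non_bmp_chars(text):
    """Return the characters in `text` that lie outside the Basic Multilingual Plane."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

# Fragment of the value from the question (hypothetical test input).
sample = "\u7c89\u53ef\u611b\U0002010e\u22ef"

offenders = non_bmp_chars(sample)

# Map each offending character to its length in bytes when encoded as UTF-8.
utf8_sizes = {ch: len(ch.encode("utf-8")) for ch in offenders}

for ch, size in utf8_sizes.items():
    print("U+%05X needs %d bytes in UTF-8" % (ord(ch), size))
```

Any string for which this list is non-empty cannot be stored through MySQL's 3-byte utf8 charset; it would need utf8mb4 on both the column and the connection.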

Check the character set you have configured for MySQL and make sure it is one that accepts the UTF-8 data you are sending.

Related

Can I insert UTF8 encoded characters into a Latin-1 table if I know only Latin-1 characters will be used?

I have 10 tables in a database. 9 of them only store data with standard ascii 1-byte characters supported by Latin-1. 1 of them requires that I store special characters that are only supported by UTF8. I would like to use the same MySQL connection object (using Python's PyMySQL library) to populate all 10 tables.
Previously, when creating the MySQL connection object, I did not specify the character set and it defaulted to Latin-1. That was fine when I was only populating the 9 Latin-1 tables. Now that I am populating the UTF8 table, I modified the connection object by passing in the parameter charset='utf8mb4' to the PyMySQL connection object function:
import pymysql.cursors

# Connect to the database
connection = pymysql.connect(host='localhost',
                             user='user',
                             password='passwd',
                             db='db',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
Now I am confident that, when inserting into my UTF8 MySQL table, all of my data is being stored fine. However, I am unsure if problems may arise when using my UTF8 connection object and inserting into the Latin-1 tables. After my first rounds of testing, everything looks great.
Is there anything I have overlooked? Are there any potential issues with inserting UTF8 encoded characters into a Latin-1 table?
It can be done. But... You must set some things correctly, else you will get any of several forms of garbage.
If the bytes in your client are UTF-8 encoded, then you must tell MySQL that fact. This is usually done on the connect string. Your charset='utf8mb4' connection argument does that. Here are some Python-specific tips: http://mysql.rjweb.org/doc.php/charcoll#python
Meanwhile, the column(s) in the table(s) can be either latin1 or utf8 (since you are sure the data is limited to the characters that are common between them).
A character example: é is hex E9 in latin1 and C3A9 in MySQL's utf8 (or utf8mb4). The conversion will occur during INSERT and SELECT if you correctly state the client's encoding.
(For your purposes, either utf8 or utf8mb4 will work.)
If you have further trouble, see "Trouble with utf8 characters; what I see is not what I stored" and/or provide SHOW CREATE TABLE and the hex of an offending character.
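The é example can be checked from Python 3, along with a rough pre-flight test for whether a string would survive conversion to a Latin-1 column. This is only a sketch: fits_charset is a hypothetical helper, and Python's latin-1 codec is ISO-8859-1, whereas MySQL's latin1 is closer to Windows-1252, so a few characters (such as €) that MySQL accepts are rejected here.

```python
def fits_charset(text, codec="latin-1"):
    """Return True if every character in `text` can be encoded with `codec`."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True

# The same character has different byte representations in the two encodings.
latin1_hex = "\u00e9".encode("latin-1").hex()   # é as Latin-1
utf8_hex = "\u00e9".encode("utf-8").hex()       # é as UTF-8
print(latin1_hex, utf8_hex)

# Strings limited to the common repertoire pass; anything else does not.
print(fits_charset("caf\u00e9"))
print(fits_charset("\U0002010e"))
```

A check like this could be run before inserting into the Latin-1 tables, so that unexpected characters are caught in the client instead of being silently replaced with "?" by the server.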
Hi. UTF-8 and Latin-1 are different encodings, and some characters are not supported by both, so problems may occur if you pass UTF-8 data that has no Latin-1 equivalent. Double encoding can also occur in this process.
Here is a link to insert utf8 to latin
I had the same problem and solved it by using the CONVERT and CAST functions:
mycursor.execute(
    """INSERT INTO `topics` (`title`, parent_id)
       VALUES (CONVERT(CAST(CONVERT(%s USING utf8) AS BINARY) USING latin1), 0)""",
    (name,))

SQLAlchemy UnicodeEncodeError on '😕' from SQL Server to Postgres

Some text is being fetched from an nvarchar column on a SQL Server database using SQLAlchemy and then written to a Postgres database with column type character varying. Python throws the following exception when the source data contains the character 😕:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d'
in position 34: surrogates not allowed
The SQLAlchemy column type is set to String. I tried setting the column type to Unicode and separately set the column collation to SQL_Latin1_General_CP1_CI_AS, with no luck. The driver being used is FreeTDS.
Why can't Python encode this string? Does the problem lie with our use of SQLAlchemy, Python or Postgres? This error is making me 😕.
The code point U+1F615 (😕) is represented in UTF-16 by the two surrogates \ud83d and \ude15. Somehow the data from your SQL Server, which uses UTF-16 internally, was decoded as UCS-2, so the surrogate pair was not combined into a single character. So the problem is on the SQL Server side.
If you cannot read the data correctly, you have to repair the Unicode strings manually, like so (Python 3):
import re

def surrogate_to_unicode(sur):
    # Combine a high/low surrogate pair into the single code point it encodes.
    a, b = sur
    return chr(0x10000 + ((ord(a) - 0xd800) << 10) + (ord(b) - 0xdc00))

text = '\ud83d\ude15'
text = re.sub('[\ud800-\udbff][\udc00-\udfff]', lambda g: surrogate_to_unicode(g.group(0)), text)
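An alternative sketch of the same repair, not taken from the answer above, is to round-trip the string through UTF-16 with Python 3's surrogatepass error handler, which lets the codec recombine the pairs. The function name merge_surrogates is hypothetical, and this assumes every surrogate in the string is part of a valid pair; a lone surrogate would make the decode step raise UnicodeDecodeError.

```python
def merge_surrogates(text):
    """Recombine UTF-16 surrogate pairs that were decoded as separate code points."""
    # 'surrogatepass' lets the encoder emit the surrogates as raw UTF-16 code units;
    # decoding those bytes as UTF-16 then joins each pair into one code point.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")

broken = "This error is making me \ud83d\ude15"
fixed = merge_surrogates(broken)

# The repaired string can now be encoded as UTF-8 without raising an error.
payload = fixed.encode("utf-8")
```

Either approach only works around the symptom; fixing the driver or connection settings so that the data arrives correctly decoded remains the cleaner solution.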

SQLAlchemy Returning UTF-8 as Latin1 Strings

I have a MySQL database encoded in UTF-8, but when I connect to it with SQLAlchemy (Python 2.7), I get back strings with Latin1 Unicode characters in them.
So, the Dutch spelling of Belgium (België) comes out as
'Belgi\xeb'
rather than
'Belgi\xc3\xab'
or, ideally the Unicode object
u'Belgi\xeb'
According to docs (http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#custom-dbapi-args):
MySQLdb will accommodate Python unicode objects if the use_unicode=1 parameter, or the charset parameter, is passed as a connection argument.
Without this setting, many MySQL server installations default to a latin1 encoding for client connections.
You need to use
create_engine('mysql+mysqldb://HOSTNAME/DATABASE?charset=utf8')
rather than just
create_engine('mysql+mysqldb://HOSTNAME/DATABASE')
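The byte strings quoted in the question can be reproduced in plain Python 3, which may make it easier to recognise which encoding a connection is actually delivering. This is only an illustration of the encodings involved; it does not touch SQLAlchemy or MySQLdb.

```python
name = "Belgi\u00eb"  # België

# What a latin1 client connection delivers: one byte for ë.
as_latin1 = name.encode("latin-1")

# What a utf8 client connection delivers: two bytes for ë.
as_utf8 = name.encode("utf-8")

# Reading UTF-8 bytes as if they were Latin-1 produces the familiar mojibake.
misread = as_utf8.decode("latin-1")

print(as_latin1, as_utf8, misread)
```

Seeing 'Belgi\xeb' as a byte string therefore suggests the connection is converting to latin1, which is consistent with the missing charset argument.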

how to pass character ′ prime in wxpython?

When I try to insert the text 2′BR into MySQL through a wxPython TextCtrl, I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128).
The problem is the character ′. I tried encode('utf8'), but it still doesn't work. When I insert it into MySQL manually and then query it, it shows up as 2?BR. Here is the insertion code. Thanks.
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (str(self.Text.GetValue())))
I assume you're using the unicode version of wxPython under Python 2 (not Python 3).
The problem arises when you call the str constructor on the result of self.Text.GetValue().
wxPython accepts all kinds of characters and returns unicode strings. In your example, TextCtrl.GetValue() returns the unicode string u"2′BR".
str() tries to convert it into a byte string using the default encoding, which is ASCII. ASCII can only represent 128 characters, and the prime character "′" is not one of them. That's why you get this error.
What is the encoding of your MySQL database? If you want to use non-ASCII characters like the prime "′", you should set your database encoding to UTF-8.
Then you should be able to do:
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (self.Text.GetValue(),))
You won't be able to successfully insert a character that doesn't exist in your database encoding.
I think the prime "′" (Unicode code point U+2032) doesn't even exist in Latin-1.
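The claim about the prime character can be checked with a short Python 3 sketch that tries to encode it with each codec. The helper can_encode is hypothetical and exists only for this illustration.

```python
PRIME = "\u2032"  # the prime character from the question

def can_encode(ch, codec):
    """Return True if `ch` is representable in `codec`."""
    try:
        ch.encode(codec)
    except UnicodeEncodeError:
        return False
    return True

# Try the three encodings discussed in this thread.
results = {codec: can_encode(PRIME, codec) for codec in ("ascii", "latin-1", "utf-8")}
print(results)

# In UTF-8 the prime takes three bytes.
prime_utf8 = PRIME.encode("utf-8")
```

Only UTF-8 succeeds, which matches the advice above: a database or column in ASCII or Latin-1 has no way to store this character.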

Does the MySQLdb module in Python return utf8-encoded strings or unicode in this case?

Using MySQLdb I connect to a database where everything is stored in the utf8 encoding.
If I do
cursor.execute("SET NAMES utf8")
and then fetch some data from the database with another statement, does that mean that the strings in
cursor.execute("SELECT ...")
cursor.fetchall()
will be unicode? Or do I first have to convert them with
mystr.decode("utf8")
to unicode?
From the docs:
connect(parameters...)
...
use_unicode
If True, CHAR and VARCHAR and TEXT columns are returned as Unicode strings, using the configured character set. It is best to set the default encoding in the server configuration, or client configuration (read with read_default_file). If you change the character set after connecting (MySQL-4.1 and later), you'll need to put the correct character set name in connection.charset.
If False, text-like columns are returned as normal strings, but you can always write Unicode strings.
This must be a keyword parameter.
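In other words, whether a manual decode is needed depends on use_unicode, not on SET NAMES alone. Code that has to cope with either setting could normalise fetched values with a small helper like the one below. This is a Python 3 sketch; to_text is a hypothetical name, and it assumes that with use_unicode disabled the driver hands back byte strings in the connection's character set.

```python
def to_text(value, charset="utf-8"):
    """Decode `value` if the driver returned bytes; pass anything else through."""
    if isinstance(value, bytes):
        return value.decode(charset)
    return value

# Simulated column values: one as raw bytes, one already decoded, one NULL.
raw_value = b"Belgi\xc3\xab"
decoded_value = "Belgi\u00eb"

print(to_text(raw_value) == to_text(decoded_value))
print(to_text(None))
```

With use_unicode=True and a matching charset passed to connect(), the helper becomes a no-op, which is the simpler configuration to aim for.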
