I am trying to create a CSV of data from a MySQL RDB to move it over to Amazon Redshift. However, one of the fields contains descriptions, and some of those descriptions contain the '’' character (the right single quotation mark). Previously, when I ran the code, it gave me
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 62: character maps to <undefined>
I then tried using REPLACE to attempt to get rid of the right single quotation marks.
db = pymysql.connect(host='host', port=3306, user="user", passwd="password", db="db", autocommit=True)
cur = db.cursor()
#cur.execute("call inv1_view_prod.`Email_agg`")
cur.execute("""select field_1,
field_2,
field_3,
field_4,
replace(field_4_desc,"’","") field_4_desc,
field_5,
field_6,
field_7
from table_name """)
emails = cur.fetchall()
with open('O:\file\path\to\file_name.csv','w') as fileout:
    writer = csv.writer(fileout)
    writer.writerows(emails)
time.sleep(1)
However, this gave me the error:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 132: ordinal not in range(256)
I noticed that 132 is the position of the right single quotation mark in the SQL statement, so I believe the code itself may be having an issue with it. I tried using the regular straight apostrophe instead of the right single quotation mark in the REPLACE statement, but this did not replace the character and I still got the original error. Does anyone know why it won't accept the single quote and how to fix it?
\u2019 is the Unicode escape for ’ (UTF-8 hex E28099), which is a "RIGHT SINGLE QUOTATION MARK". The direct equivalent in MySQL's latin1 (which is really Windows cp1252) is hex 92. Some word processing products use that character instead of the apostrophe (').
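The byte values quoted above can be checked directly. This is a minimal sketch, assuming Python 3 and using Python's cp1252 codec as a stand-in for MySQL's latin1. Python's own latin-1 codec is strict ISO-8859-1 and cannot represent this character at all, which matches the second error message in the question.

```python
# Python 3 sketch: check the byte values of the right single quotation mark.
quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK

utf8_bytes = quote.encode("utf-8")      # three bytes in UTF-8
cp1252_bytes = quote.encode("cp1252")   # a single byte in Windows-1252

# Python's strict latin-1 codec only covers code points below 256,
# so encoding U+2019 with it raises UnicodeEncodeError.
try:
    quote.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```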
You are getting the error messages not because you can't handle the character, but because the configuration fails to declare which encoding is used where.
"132" seems irrelevant: decimal 132 is hex 84, which in latin1 is „ (UTF-8 hex E2809E).
Notes on Python: http://mysql.rjweb.org/doc.php/charcoll#python
Notes on other charset issues: Trouble with UTF-8 characters; what I see is not what I stored
Without knowing the schema or the Python configuration, I can't be more specific.
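As one illustration of declaring the encoding explicitly, here is a minimal sketch assuming Python 3. The helper name write_rows_csv and the sample rows are hypothetical. In the question's script the rows would come from cur.fetchall(), and the connection would also need a charset declared (pymysql.connect accepts a charset argument such as 'utf8mb4').

```python
import csv
import os
import tempfile

def write_rows_csv(rows, path):
    # Name the file encoding explicitly instead of relying on the platform
    # default (cp1252 on many Windows setups). newline="" is what the csv
    # module documentation recommends for files passed to csv.writer.
    with open(path, "w", encoding="utf-8", newline="") as fileout:
        csv.writer(fileout).writerows(rows)

# Hypothetical rows standing in for the result of cur.fetchall().
sample_rows = [(1, "customer\u2019s note"), (2, "plain text")]
csv_path = os.path.join(tempfile.mkdtemp(), "file_name.csv")
write_rows_csv(sample_rows, csv_path)

# Read the raw bytes back to see how the quotation mark was stored.
with open(csv_path, "rb") as fh:
    written = fh.read()
```

UTF-8 is chosen here only because it can represent every character; whether the loading side expects UTF-8 should be checked against its own documentation.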
Related
I am using Python 2.7 and MySQLdb 1.2.3. I tried everything I found on stackoverflow and other forums to handle encoding errors my script is throwing.
My script reads data from all tables in a source MySQL DB, writes them to a Python StringIO.StringIO object, and then loads that data from the StringIO object into a Postgres database using the psycopg2 library's copy_from command. The Postgres database is apparently UTF-8 encoded (I found this by looking at Properties--Definition of the database in pgAdmin).
I found out that my source MySQL database has some tables in latin1_swedish_ci encoding while others in utf_8 encoding format (Found this from TABLE_COLLATION in information_schema.tables).
I wrote all this code on the top of my Python script based on my research on the internet.
db_conn = MySQLdb.connect(host=host,user=user,passwd=passwd,db=db, charset="utf8", init_command='SET NAMES UTF8' ,use_unicode=True)
db_conn.set_character_set('utf8')
db_conn_cursor = db_conn.cursor()
db_conn_cursor.execute('SET NAMES utf8;')
db_conn_cursor.execute('SET CHARACTER SET utf8;')
db_conn_cursor.execute('SET character_set_connection=utf8;')
I still get the UnicodeEncodeError below with this line: cell = str(cell).replace("\r", " ").replace("\n", " ").replace("\t", '').replace("\"", "") #Remove unwanted characters from column value,
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 47: ordinal not in range(128)
I wrote the following line of code to clean cells in every table of source MySQL database when writing to StringIO object.
cell = str(cell).replace("\r", " ").replace("\n", " ").replace("\t", '').replace("\"", "") #Remove unwanted characters from column value
Please help.
str(cell) is trying to convert cell to ASCII. ASCII only supports characters with ordinals less than 128. What is cell?
If cell is a unicode string, just do cell.encode("utf8"), and that will return a bytestring encoded as UTF-8.
...or really iirc. If you pass mysql unicode, then the database will automagically convert it to utf8...
You could also try,
cell = unicode(cell).replace("\r", " ").replace("\n", " ").replace("\t", '').replace("\"", "")
or just use a 3rd party library. There is a good one that will fix text for you.
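For comparison, here is a minimal sketch of the same cleanup assuming Python 3, where str is already Unicode and no implicit ASCII conversion happens. The function name clean_cell is hypothetical, and decoding bytes as UTF-8 is an assumption about the source data.

```python
def clean_cell(cell):
    # In Python 3, str() on text never goes through the ASCII codec.
    # Bytes are decoded explicitly; UTF-8 is an assumption about the data.
    if isinstance(cell, bytes):
        text = cell.decode("utf-8")
    else:
        text = str(cell)
    # Same replacements as in the question.
    return (text.replace("\r", " ")
                .replace("\n", " ")
                .replace("\t", "")
                .replace('"', ""))

cleaned = clean_cell("it\u2019s\ta \"test\"\n")
encoded = cleaned.encode("utf-8")  # bytes, ready for a UTF-8 target
```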
I made the mistake of accidentally using non-ASCII characters in a form that was submitted into a database using SQLAlchemy, running on Flask. Basically, rather than using the ASCII hyphen -, I used the Unicode en-dash –. I am now trying to go back and replace all occurrences of the en-dash with a hyphen in my database.
Let's say I have a users table, and the column I'm trying to change is called occupation. I'm able to figure out which entries in my database have the invalid character, because when I run:
User.query.get(id)
if the user has an invalid ASCII character, it returns
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 167: ordinal not in range(128)
So how can I go about replacing all occurrences of the en-dash with a hyphen in the occupation column for all rows in my DB?
I was able to fix this by running a script over all entries in my database, and replacing the ones with faulty characters.
from user.models import *
for u in User.query.all():
    # \u2013 is unicode for en-dash
    if u"\u2013" in u.occupation:
        # replace with normal hyphen
        u.occupation = u.occupation.replace(u"\u2013", "-")
db.session.commit()
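A variation on the replacement step, sketched for Python 3: str.translate can map several typographic characters to ASCII in one pass. The mapping and the function name to_ascii_punctuation are assumptions for illustration, not part of the original script.

```python
# Map typographic characters to ASCII stand-ins (extend as needed).
PUNCTUATION_MAP = str.maketrans({
    "\u2013": "-",   # en dash
    "\u2019": "'",   # right single quotation mark
})

def to_ascii_punctuation(text):
    # Characters not listed in the map are left unchanged.
    return text.translate(PUNCTUATION_MAP)

fixed = to_ascii_punctuation("Editor \u2013 children\u2019s books")
```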
When I try to insert the text 2′BR into MySQL through a wxPython TextCtrl, I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128).
The problem is the character ′. I tried encode('utf8'), but it still doesn't work. When I insert it into MySQL manually and then query it, it shows up as 2?BR. Here is the insertion code. Thanks.
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (str(self.Text.GetValue())))
I assume you're using the unicode version of wxPython under Python 2 (not Python 3).
The problem arises when you're calling the str constructor on the result of self.Text.GetValue().
wxPython accepts all kinds of characters and returns unicode strings. In your example, TextCtrl.GetValue() returns the unicode string u"2′BR".
str() tries to convert it into a byte string using the default encoding, which is ASCII. ASCII can only represent 128 characters, and the prime character "′" is not one of them. That's why you get this error.
What is the encoding of your MySQL database? If you want to use strange characters like the "′" prime, you should set your database encoding to utf-8.
Then you should be able to do:
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (self.Text.GetValue(),))
You won't be able to successfully insert a character that doesn't exist in your database encoding.
I think the prime "′" (Unicode code point U+2032) doesn't even exist in latin-1.
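To illustrate both points, here is a minimal sketch assuming Python 3. The standard-library sqlite3 module is used only as a stand-in so the example runs without a server; it uses ? placeholders where the MySQL drivers use %s. The helper can_encode is hypothetical.

```python
import sqlite3

def can_encode(text, encoding):
    # True if every character of text exists in the given encoding.
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

title = "2\u2032BR"  # contains the prime character

# Pass the text itself as a parameter, in a one-element tuple; no str() call.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TKtable (title TEXT)")
conn.execute("INSERT INTO TKtable (title) VALUES (?)", (title,))
stored = conn.execute("SELECT title FROM TKtable").fetchone()[0]
conn.close()
```

Calling can_encode(title, "latin-1") is a quick way to check whether a given column encoding could hold the value at all.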
While fetching data from an unknown/old/inconsistent MySQL database into a Postgres UTF-8 db using the Python (Django) ORM, I sometimes get incorrectly encoded data as a result.
Target: grégory
> a
u'gr\xe3\xa9gory'
> print a
grã©gory
I tried several decode/encode tricks without success:
> print a.encode('utf-8').decode('latin1')
grã©gory
> print a.encode('utf-8').decode('latin1')
grã©gory
> print a.decode('latin-1')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)
Even with some unicode_escape.
I guess the string has been incorrectly converted to lowercase at some point, changing \xc3 to \xe3. The lowercase conversion has assumed latin1 encoding when it was actually utf-8.
>>> print 'gr\xc3\xa9gory'.decode('utf8')
grégory
Since the problem was the lower(), I could fix it by doing:
print a.upper().encode('latin1').lower()
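The same reasoning can be replayed in Python 3, where the final step is an explicit decode instead of printing bytes. This is a minimal sketch with illustrative variable names, and the trick only recovers text that was entirely lowercase to begin with.

```python
# Reproduce the damage: UTF-8 bytes read as latin-1, then lowercased.
original = "gr\u00e9gory"
damaged = original.encode("utf-8").decode("latin-1").lower()

# Undo it: upper() turns \xe3 back into \xc3, the latin-1 encode recovers
# the raw UTF-8 bytes, and the decode yields real text again.
repaired = damaged.upper().encode("latin-1").decode("utf-8").lower()
```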
Try this:
print a.decode('latin1')
I'm trying to get certain Arabic strings from a web page and then store them in a database.
The first problem
The only way I could is to specify how many letters are they by using . and use unicode, like this:
import urllib,re
content=urllib.urlopen("http://example.com/content.html").read()
content = unicode(content,"utf-8")
Strings = re.findall("<Strong>...........</strong>",content) # it will work fine and fetch it but only strings with 11 char or letter (11 place)
Second problem
When I tried to write it to text file it displays:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
And when I've tried to store it into database it displays:
ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1)' at line 1")
What I've think about is to fetch it then encode it into base64 then store it into db
but still got an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
The only way I could is to specify how many letters are they by using . and use unicode, like this
OK... is that a problem? Other than the general unreliability of hacking strings out of HTML with regex, obviously - consider using a proper parser (eg lxml.html et al).
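If a regex is kept anyway, a non-greedy capture group removes the need to count characters. This is a minimal sketch assuming Python 3; the sample HTML is made up for illustration.

```python
import re

# Hypothetical page content; in the question it comes from urllib.
content = ("<Strong>\u0627\u0644\u0642\u0635\u064a\u0631</strong> text "
           "<strong>abc</strong>")

# (.*?) matches as few characters as possible up to the closing tag, so
# strings of any length are captured. IGNORECASE covers <Strong>/<strong>.
strings = re.findall(r"<strong>(.*?)</strong>", content,
                     flags=re.IGNORECASE | re.DOTALL)
```

This still shares the general fragility of regex-based HTML scraping; it only removes the fixed-length restriction.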
When I tried to write it to text file it displays: UnicodeEncodeError
Files are bytes, so to write to a text file you have to encode the characters back to bytes. eg
with open('file.txt', 'w') as fp:
    fp.write(content.encode('utf-8'))
If you try to write characters directly, Python will guess an encoding, typically ASCII, which will then fail as above because Arabic is not representable in ASCII.
And when I've tried to store it into database it displays: ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1)'
Post code? I don't think that's a Unicode problem. It looks more like you were creating a query with the content in it, without surrounding that content with quotes. Don't do that - use parameterised queries.
c.execute('INSERT INTO something VALUES ('+content+')') # fails, and security horror
c.execute('INSERT INTO something VALUES (%s)', (content,)) # fine
What I've think about is to fetch it then encode it into base64
Again, base64 operates on bytes, not characters, so encode first.
content.encode('utf-8').encode('base64')
but you shouldn't have to encode to base64 to store Unicode characters in a database. Ensure you are using table columns with a UTF-8 collation, and use UTF-8 as the connection charset, and no extra processing should be necessary.
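For completeness: the str.encode('base64') form above only exists in Python 2. A minimal Python 3 sketch of the same encode-first idea uses the base64 module; the sample string is illustrative.

```python
import base64

content = "\u0627\u0644\u0642\u0635\u064a\u0631"

# Characters -> UTF-8 bytes -> base64 bytes (ASCII-only).
encoded = base64.b64encode(content.encode("utf-8"))

# And back again: base64 bytes -> UTF-8 bytes -> characters.
decoded = base64.b64decode(encoded).decode("utf-8")
```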