Replace Character For All Rows using SQLAlchemy on Flask - python

I made the mistake of accidentally submitting non-ASCII characters through a form into a database, using SQLAlchemy running on Flask. Specifically, rather than using the ASCII hyphen -, I used the Unicode en-dash –. I am now trying to go back and replace all occurrences of the en-dash with a hyphen in my database.
Let's say I have a users table, and the column I'm trying to change is called occupation. I'm able to figure out which entries in my database have the invalid character, because when I run:
User.query.get(id)
and the user has a non-ASCII character in that column, it raises:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 167: ordinal not in range(128)
So how can I go about replacing all occurrences of the en-dash with a hyphen in the occupation column for all rows in my DB?

I was able to fix this by running a script over all entries in my database, and replacing the ones with faulty characters.
from user.models import *

for u in User.query.all():
    # \u2013 is the Unicode code point for the en-dash
    if u"\u2013" in u.occupation:
        # replace it with a normal ASCII hyphen
        u.occupation = u.occupation.replace(u"\u2013", "-")

db.session.commit()
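Worth noting (my addition, not part of the original answer): the same replacement can be pushed down to the database as a single bulk UPDATE, assuming the same User model and db session, so the rows never have to be loaded into Python:

from sqlalchemy import func

# One UPDATE ... SET occupation = REPLACE(occupation, '–', '-');
# synchronize_session=False skips reconciling the in-memory session.
User.query.update(
    {User.occupation: func.replace(User.occupation, u"\u2013", "-")},
    synchronize_session=False,
)
db.session.commit()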

Related

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019'

I am trying to create a CSV of data from a MySQL RDB to move it over to Amazon Redshift. However, one of the fields contains descriptions, and some of those descriptions contain the '’' character (the right single quotation mark). Previously, when I ran the code, it gave me:
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 62: character maps to <undefined>
I then tried using REPLACE to attempt to get rid of the right single quotation marks.
import csv
import time

import pymysql

db = pymysql.connect(host='host', port=3306, user="user", passwd="password",
                     db="db", autocommit=True)
cur = db.cursor()
#cur.execute("call inv1_view_prod.`Email_agg`")
cur.execute("""select field_1,
                      field_2,
                      field_3,
                      field_4,
                      replace(field_4_desc,"’","") field_4_desc,
                      field_5,
                      field_6,
                      field_7
               from table_name """)
emails = cur.fetchall()
with open(r'O:\file\path\to\file_name.csv', 'w') as fileout:
    writer = csv.writer(fileout)
    writer.writerows(emails)
time.sleep(1)
However, this gave me the error:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 132: ordinal not in range(256)
And I noticed 132 is the position of the right single quotation mark in the SQL statement, so I believe the code itself may be having an issue with it. I tried using a regular straight apostrophe instead of the right single quotation mark in the REPLACE statement, but this did not replace the character and still came back with the original error. Does anyone know why it won't accept the single quote, and how to fix it?
\u2019 is Unicode for ’, UTF-8 hex E28099, which is a "RIGHT SINGLE QUOTATION MARK". The direct latin1 equivalent is hex 92. Some word processing products use that instead of apostrophe (').
You are getting the error messages, not because you can't handle the character, but because the configuration fails to declare which encoding is used where.
"132" seems irrelevant: 132 84 E2809E „ „
Notes on Python: http://mysql.rjweb.org/doc.php/charcoll#python
Notes on other charset issues: Trouble with UTF-8 characters; what I see is not what I stored
Without knowing the schema or the Python configuration, I can't be more specific.
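To illustrate "declare which encoding is used where" (a sketch I'm adding; assumes Python 3 and the same pymysql connection details as the question): name the charset on the connection and the encoding on the output file, so nothing is left to a guessed default:

import csv
import pymysql

# Declare the encoding at both ends (connection details are placeholders).
db = pymysql.connect(host='host', port=3306, user="user", passwd="password",
                     db="db", autocommit=True, charset='utf8mb4')
cur = db.cursor()
cur.execute("select field_4_desc from table_name")

# utf-8 can represent U+2019, so the write won't raise UnicodeEncodeError;
# newline='' lets the csv module do its own line-ending handling.
with open('file_name.csv', 'w', encoding='utf-8', newline='') as fileout:
    writer = csv.writer(fileout)
    writer.writerows(cur.fetchall())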

SQLAlchemy UnicodeEncodeError on '😕' from SQL Server to Postgres

Some text is being fetched from an nvarchar column on a SQL Server database using SQLAlchemy and then written to a Postgres database with column type character varying. Python throws the following exception when the source data contains the character 😕:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 34: surrogates not allowed
The SQLAlchemy column type is set to String. I tried setting the column type to Unicode and separately set the column collation to SQL_Latin1_General_CP1_CI_AS, with no luck. The driver being used is FreeTDS.
Why can't Python encode this string? Does the problem lie with our use of SQLAlchemy, Python or Postgres? This error is making me 😕.
The code point U+1F615 (😕) can be represented by the two surrogates \ud83d and \ude15. Somehow your SQL Server, which internally uses UTF-16, was decoded as UCS-2, so the surrogate pair was not combined into a single character. So the problem is on the SQL Server side.
If you cannot correctly read the data, you have to manually repair the Unicode strings, like so (Python 3):

import re

def surrogate_to_unicode(sur):
    # Recombine a high/low surrogate pair into a single code point.
    a, b = sur
    return chr(0x10000 + ((ord(a) - 0xd800) << 10) + (ord(b) - 0xdc00))

text = '\ud83d\ude15'
text = re.sub('[\ud800-\udbff][\udc00-\udfff]',
              lambda g: surrogate_to_unicode(g.group(0)), text)
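A shorter equivalent (my addition; assumes the string holds well-formed surrogate pairs): round-trip through UTF-16 with the surrogatepass error handler, which reassembles the pairs:

text = '\ud83d\ude15'
# surrogatepass lets the lone surrogates through to UTF-16 bytes;
# decoding those bytes recombines them into U+1F615.
fixed = text.encode('utf-16', 'surrogatepass').decode('utf-16')
print(fixed)  # 😕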

How to pass the prime character ′ in wxPython?

When I try to insert the text 2′BR into MySQL through a wxPython TextCtrl, it gives me this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128).
The problem is the character ′; I tried encode('utf8') and it still doesn't work. When I insert it into MySQL manually and then query it, it shows up as 2?BR. Here is the insertion code. Thanks.
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (str(self.Text.GetValue())))
I assume you're using the unicode version of wxPython under Python 2 (not Python 3).
The problem arises when you're calling the str constructor on the result of self.Text.GetValue().
wxPython accepts all kinds of characters and returns Unicode strings. In your example, TextCtrl.GetValue() returns the Unicode string u"2′BR".
str() tries to convert it into a byte string using the default encoding, which is ASCII. ASCII can only represent 128 characters, and the prime character "′" is not one of them. That's why you get this error.
What is the encoding of your MySQL database? If you want to use special characters like the prime "′", you should set your database encoding to UTF-8.
Then you should be able to do:
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (self.Text.GetValue(),))
You won't be able to successfully insert a character that doesn't exist in your database encoding.
I think the prime "′" (code point U+2032) doesn't even exist in latin-1.
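To make that concrete (a sketch I'm adding; the question doesn't name its MySQL driver, so MySQLdb and the connection details are assumptions): declare the connection charset so the driver handles Unicode strings itself:

import MySQLdb

# charset='utf8' sets the connection character set; use_unicode=True makes
# the driver accept and return unicode objects directly.
db = MySQLdb.connect(host='localhost', user='user', passwd='password',
                     db='db', charset='utf8', use_unicode=True)
cur = db.cursor()
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (u"2\u2032BR",))
db.commit()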

Arabic regex and MySQLdb in Python

I've been trying to extract certain Arabic strings from a web page and then store them in a db.
The first problem
The only way I could manage it was to specify how many letters there are by using . and to use unicode, like this:
import urllib, re

content = urllib.urlopen("http://example.com/content.html").read()
content = unicode(content, "utf-8")
# works and fetches them, but only strings of exactly 11 characters (11 dots)
Strings = re.findall("<Strong>...........</strong>", content)
Second problem
When I tried to write it to a text file, I got:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
And when I tried to store it in the database, I got:
ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1)' at line 1")
What I thought about was fetching it, encoding it into base64, and then storing it in the db,
but still got an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
The only way I could manage it was to specify how many letters there are by using . and to use unicode
OK... is that a problem? Other than the general unreliability of hacking strings out of HTML with regex, obviously. Consider using a proper parser (e.g. lxml.html), as sketched below.
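For example, a parser-based sketch (my addition; assumes lxml is installed and the target strings sit in <strong> tags):

import urllib
import lxml.html

# Parse the page instead of regexing it; every <strong> element is
# captured regardless of its length.
doc = lxml.html.parse(urllib.urlopen("http://example.com/content.html"))
strings = [el.text_content() for el in doc.getroot().iter("strong")]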
When I tried to write it to a text file, I got: UnicodeEncodeError
Files are bytes, so to write to a text file you have to encode the characters back to bytes, e.g.:
with open('file.txt', 'w') as fp:
    fp.write(content.encode('utf-8'))
If you try to write characters directly, Python will guess an encoding, typically ASCII, which will then fail as above, because Arabic is not representable in ASCII.
And when I tried to store it in the database, I got: ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1)'
Post code? I don't think that's a Unicode problem. It looks more like you were building a query with the content pasted directly into it, without surrounding that content with quotes. Don't do that - use parameterised queries.
c.execute('INSERT INTO something VALUES ('+content+')') # fails, and security horror
c.execute('INSERT INTO something VALUES (%s)', (content,)) # fine
What I thought about was fetching it, encoding it into base64
Again, base64 operates on bytes, not characters, so encode first:
content.encode('utf-8').encode('base64')
But you shouldn't have to base64-encode anything to store Unicode characters in a database. Ensure you are using table columns with a UTF-8 collation and UTF-8 as the connection charset, and no extra processing should be necessary; see the sketch below.
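A sketch of both halves of that advice (my addition, not from the original answer; assumes MySQLdb, and the table and column names are illustrative):

import MySQLdb

content = u"\u0627\u0644\u0642\u0635\u064a\u0631"  # the Arabic word from the error message

# UTF-8 on the connection...
db = MySQLdb.connect(host='localhost', user='user', passwd='password',
                     db='db', charset='utf8', use_unicode=True)
c = db.cursor()

# ...and UTF-8 on the column (one-time schema change).
c.execute("ALTER TABLE something MODIFY content_col TEXT "
          "CHARACTER SET utf8 COLLATE utf8_general_ci")

c.execute('INSERT INTO something VALUES (%s)', (content,))
db.commit()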

Python Unicode CSV export (using Django)

I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front-end form. However, I've been getting this error when a Unicode single quote appears in the input:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 200: ordinal not in range(128)
I've been trying to convert the unicode to ascii using the code below, but still get a similar error.
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
I've sifted through dozens of websites and learned a lot about Unicode; however, I'm still not able to convert this Unicode string to ASCII. I don't care if the algorithm removes the Unicode characters. The commented lines show some of the options I've tried, but the error persists.
import csv
import unicodedata
...
#message = unicode(unicodedata.normalize(
#    'NFKD', contact.message).encode('ascii', 'ignore'))
#dmessage = (contact.message).encode('utf-8', 'ignore')
#dmessage = contact.message.decode("utf-8")
#dmessage = "%s" % dmessage
dmessage = contact.message
csv_writer.writerow([
    dmessage,
])
Does anyone have any advice on removing Unicode characters so I can export them to CSV? This seemingly easy problem has kept my head spinning. Any help is much appreciated.
Thanks,
Joe
You can't encode the Unicode character u'\u2019' (U+2019 Right Single Quotation Mark) into ASCII, because ASCII doesn't have that character in it. ASCII is only the basic Latin alphabet, digits and punctuation; you don't get any accented letters or ‘smart quotes’ like this character.
So you will have to choose another encoding. Now normally the sensible thing to do would be to export to UTF-8, which can hold any Unicode character. Unfortunately for you if your target users are using Office (and they probably are), they're not going to be able to read UTF-8-encoded characters in CSV. Instead Excel will read the files using the system default code page for that machine (also misleadingly known as the ‘ANSI’ code page), and end up with mojibake like ’ instead of ’.
So that means you have to guess the user's system default code page if you want the characters to show up correctly. For Western users, that will be code page 1252. Users with non-Western Windows installs will see the wrong characters, but there's nothing you can do about that (other than organise a letter-writing campaign to Microsoft to just drop the stupid nonsense with ANSI already and use UTF-8 like everyone else).
Code page 1252 can contain U+2019 (’), but obviously there are many more characters it can't represent. To avoid getting UnicodeEncodeError for those characters you can use the ignore argument (or replace to replace them with question marks).
dmessage = contact.message.encode('cp1252', 'ignore')
Alternatively, to give up and remove all non-ASCII characters, so that everyone gets an equally bad experience regardless of locale:
dmessage = contact.message.encode('ascii', 'ignore')
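To illustrate the difference between the error handlers (my addition; assumes Python 3, where encode() returns a bytes object):

msg = u"It\u2019s here"
print(msg.encode('cp1252', 'ignore'))   # keeps the quote as cp1252 byte 0x92
print(msg.encode('ascii', 'ignore'))    # b'Its here'  - the quote is silently dropped
print(msg.encode('ascii', 'replace'))   # b'It?s here' - the quote becomes ?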
Encoding is a pain, but if you're working in django have you tried smart_unicode(str) from django.utils.encoding? I find that usually does the trick.
The only other option I've found is to use the built-in python encode() and decode() for strings, but you have to specify the encoding for those and honestly, it's a pain.
[caveat: I'm not a djangoist; django may have a better solution].
General non-django-specific answer:
If you have a smallish number of known non-ASCII characters and there are user-acceptable ASCII equivalents for them, you can set up a translation table and use the unicode.translate method:
smashcii = {
    0x2019: u"'",
    # etc
}

smashed = input_string.translate(smashcii)
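A runnable sketch of that table in action (my addition; the en-dash entry is an illustrative extra):

# Map each troublesome code point to an ASCII stand-in.
smashcii = {
    0x2019: u"'",  # right single quotation mark -> apostrophe
    0x2013: u"-",  # en-dash -> hyphen
}

message = u"It\u2019s a test \u2013 really"
print(message.translate(smashcii))  # It's a test - really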
