Arabic regex and MySQLdb in Python

I'm trying to extract certain Arabic strings from a web page and then store them in a database.
The first problem
The only way I could get a match was to spell out the number of letters with . and use unicode, like this:
import urllib,re
content=urllib.urlopen("http://example.com/content.html").read()
content = unicode(content,"utf-8")
Strings = re.findall("<Strong>...........</strong>",content) # works, but only fetches strings that are exactly 11 characters long
Second problem
When I tried to write it to a text file it displays:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
And when I tried to store it in the database it displays:
ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1)' at line 1")
What I thought of was to fetch it, encode it into base64, and then store it in the db,
but I still got an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

The only way I could get a match was to spell out the number of letters with . and use unicode, like this
OK... is that a problem? Other than the general unreliability of hacking strings out of HTML with regex, obviously - consider using a proper parser (eg lxml.html et al).
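If the regex is kept as a stopgap, a non-greedy capture group removes the need to count characters. This is only a sketch in Python 3: the sample markup is made up, the Arabic word is decoded from the bytes shown in the question's MySQL error, and the names `word`, `pattern` and `strong_texts` are illustrative.

```python
import re

# The Arabic word from the question's error message, decoded from its UTF-8 bytes.
word = b'\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1'.decode('utf-8')

# Hypothetical page content with two <strong> elements of different lengths.
content = "<Strong>" + word + "</strong> and <strong>" + word * 2 + "</strong>"

# (.+?) matches as few characters as possible up to the next closing tag.
# re.IGNORECASE covers <Strong> vs <strong>; re.DOTALL lets the text span newlines.
pattern = re.compile(r"<strong>(.+?)</strong>", re.IGNORECASE | re.DOTALL)

strong_texts = pattern.findall(content)
```

A real HTML parser remains the more robust choice; the pattern above still breaks on nested or malformed tags.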
When I tried to write it to a text file it displays: UnicodeEncodeError
Files are bytes, so to write to a text file you have to encode the characters back to bytes, e.g.:
with open('file.txt', 'w') as fp:
    fp.write(content.encode('utf-8'))
If you try to write characters directly, Python will guess an encoding, typically ASCII, which will then fail as above because Arabic is not representable in ASCII.
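For comparison, a sketch of the same write in Python 3, where `open()` takes an `encoding` argument and performs the conversion itself (in Python 2, `io.open` accepts the same argument). The temporary path and the sample text are assumptions for the demo.

```python
import os
import tempfile

# Sample text: the Arabic word from the question's error message.
content = b'\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1'.decode('utf-8')

# Throwaway location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'file.txt')

# Naming the encoding explicitly avoids relying on the platform default.
with open(path, 'w', encoding='utf-8') as fp:
    fp.write(content)

# Read the raw bytes back to confirm what ended up on disk.
with open(path, 'rb') as fp:
    raw = fp.read()
```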
And when I tried to store it in the database it displays: ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1)'
Post code? I don't think that's a Unicode problem. It looks more like you were creating a query with the content in it, without surrounding that content with quotes. Don't do that - use parameterised queries.
c.execute('INSERT INTO something VALUES ('+content+')') # fails, and security horror
c.execute('INSERT INTO something VALUES (%s)', (content,)) # fine
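To make the difference concrete without a MySQL server, here is a runnable sketch that uses the standard-library `sqlite3` module as a stand-in. That substitution is an assumption for the demo: `sqlite3` uses `?` as its placeholder where MySQLdb uses `%s`, and the table and column names are made up.

```python
import sqlite3

# Arabic text followed by characters that would break a hand-built query.
content = b'\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1'.decode('utf-8') + "'); DROP TABLE something; --"

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE something (body TEXT)')

# The value travels separately from the SQL text, so no quoting or escaping is needed.
c.execute('INSERT INTO something VALUES (?)', (content,))
conn.commit()

stored = c.execute('SELECT body FROM something').fetchone()[0]
```

The stored value comes back unchanged, quotes and all, because the driver never splices it into the statement.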
What I thought of was to fetch it, then encode it into base64
Again, base64 operates on bytes, not characters, so encode first.
content.encode('utf-8').encode('base64')
But you shouldn't have to encode to base64 to store Unicode characters in a database. Ensure your table columns use a UTF-8 character set (utf8, or utf8mb4 if you need characters outside the Basic Multilingual Plane) with a matching collation, use the same charset for the connection, and no extra processing should be necessary.
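For completeness: the `'base64'` codec on strings is Python 2 only. If base64 were still wanted in Python 3, a minimal sketch with the standard `base64` module could look like this (the variable names are illustrative):

```python
import base64

content = b'\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1'.decode('utf-8')

# Characters -> UTF-8 bytes -> base64 bytes (ASCII only).
encoded = base64.b64encode(content.encode('utf-8'))

# Reverse the two steps in the opposite order to get the text back.
decoded = base64.b64decode(encoded).decode('utf-8')
```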

Related

Inserting a unicode API response containing emojis into Mysql using Python mysql.connector

I'm connecting to the Facebook Graph API using Python and the curl response delivers a bunch of data in Unicode format. I am trying to insert this data into a mysql database using the python mysql.connector driver but I keep running into encoding errors.
Specifically, I get this type of error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 40: ordinal not in range(128)
or
  File "/Library/Python/2.7/site-packages/mysql/connector/cursor_cext.py", line 243, in execute
    raise errors.ProgrammingError(str(err))
mysql.connector.errors.ProgrammingError: 'ascii' codec can't encode character u'\xa0' in position 519: ordinal not in range(128)
My database fields are all utf8mb4 and I believe my encoding is all UTF8 as well. So I can't figure out why I'm getting ASCII errors.
The error happens on the 'caption' field of the Instagram posts being returned, which includes emojis, so I'm 99% sure this is the problem. When I comment out this line, everything else works as expected.
So far I have tried:
Adding use_unicode=True, charset='utf8' to the mysql.connector.connect command (according to the docs this is the default anyway)
Adding #!/usr/bin/python and # encoding=utf8 to the top of the script
Adding use_unicode=True, charset='ascii' to the mysql.connector.connect command because why not try it
Trying combinations of caption.decode('utf') and caption.encode('utf8') on the variable before the MySQL insert statement
I can't find any reference to ASCII in the mysql.connector documentation, so I'm not sure why it's trying to do the conversion.
In reference to the second error above, when going to that line of cursor_cext.py in the mysql.connector package the lines look like this:
try:
    if isunicode(operation):
        stmt = operation.encode(self._cnx.python_charset)
    else:
        stmt = operation
except (UnicodeDecodeError, UnicodeEncodeError) as err:
    raise errors.ProgrammingError(str(err))
I previously did something similar in PHP with the old Instagram API. Now that Instagram has moved to the Facebook Graph API, I decided to use Python because it appeared easier, but I don't know where to go with these errors.
When you combine Unicode and byte strings in Python 2 (eg. "a" + u"a"), there's an implicit coercion calling .decode() on the byte string ("a"). The default codec for this method is ASCII in Python 2.
Encoding errors that happen during implicit coercion can be pretty tricky to track down.
Implicit coercion is gone in Python 3, so both user code and library code are forced to keep str and bytes separate.
I suggest you upgrade to Python 3 if you can.
It might not immediately make your code work, but it's more likely that you will find out where to explicitly set the encoding.
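A small Python 3 sketch of what "no implicit coercion" means in practice: mixing `str` and `bytes` fails immediately with a `TypeError`, which points at the exact place where a `.decode()` is missing. The sample caption is made up; it contains a no-break space (`\u00a0`, UTF-8 bytes `c2 a0`), the character and byte named in the question's error messages.

```python
# Bytes as they might arrive from an HTTP response (hypothetical sample).
caption_bytes = 'caf\u00e9\u00a0'.encode('utf-8')
prefix = 'Caption: '

mixing_failed = False
try:
    prefix + caption_bytes  # str + bytes: Python 3 refuses instead of guessing ASCII
except TypeError:
    mixing_failed = True

# Decoding explicitly, with a named codec, is the fix.
combined = prefix + caption_bytes.decode('utf-8')
```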

Unicode Encode Error 'latin-1' codec can't encode character '\u2019'

I am trying to create a CSV of data from a MySQL RDB to move it over to Amazon Redshift. However, one of the fields contains descriptions, and some of those descriptions contain the '’' character (the right single quotation mark). Previously, when I ran the code, it gave me:
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 62: character maps to <undefined>
I then tried using REPLACE to attempt to get rid of the right single quotation marks.
db = pymysql.connect(host='host', port=3306, user="user", passwd="password", db="db", autocommit=True)
cur = db.cursor()
#cur.execute("call inv1_view_prod.`Email_agg`")
cur.execute("""select field_1,
field_2,
field_3,
field_4,
replace(field_4_desc,"’","") field_4_desc,
field_5,
field_6,
field_7
from table_name """)
emails = cur.fetchall()
with open('O:\file\path\to\file_name.csv','w') as fileout:
    writer = csv.writer(fileout)
    writer.writerows(emails)
time.sleep(1)
However, this gave me the error:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 132: ordinal not in range(256)
I noticed that 132 is the position of the right single quotation mark in the SQL statement, so I believe the code itself may be having an issue with it. I tried using the regular straight apostrophe instead of the right single quotation mark in the REPLACE statement, but this did not replace the character and I got the original error again. Does anyone know why it won't accept the single quote and how to fix it?
\u2019 is the Unicode escape for ’ (UTF-8 hex E28099), the "RIGHT SINGLE QUOTATION MARK". In MySQL's latin1, which is based on cp1252, the equivalent is hex 92. Some word processing products use that character instead of the apostrophe (').
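The byte values can be checked from Python 3. Note that Python's `latin-1` codec is strict ISO-8859-1, which has no slot for this character; hex 92 comes from Windows-1252 (`cp1252`). That difference is consistent with the `'latin-1' codec can't encode character '\u2019'` error in the question.

```python
ch = '\u2019'  # RIGHT SINGLE QUOTATION MARK

utf8_bytes = ch.encode('utf-8')      # three bytes: e2 80 99
cp1252_bytes = ch.encode('cp1252')   # one byte: 92

# Strict ISO-8859-1 only covers U+0000..U+00FF, so this encode fails.
try:
    ch.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```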
You are getting the error messages, not because you can't handle the character, but because the configuration fails to declare which encoding is used where.
"132" seems irrelevant. (As a character code, decimal 132 is hex 84, which in cp1252 is „, UTF-8 hex E2809E.)
Notes on Python: http://mysql.rjweb.org/doc.php/charcoll#python
Notes on other charset issues: Trouble with UTF-8 characters; what I see is not what I stored
Without knowing the schema or the Python configuration, I can't be more specific.
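On the Python side, one place where the encoding can be declared is the output file. Below is a sketch for Python 3 with made-up rows standing in for `cur.fetchall()` and a temporary path instead of the real one. The connection has its own setting as well: `pymysql.connect` accepts a `charset` argument (for example `charset='utf8mb4'`), which is not exercised here because it needs a live server.

```python
import csv
import os
import tempfile

# Stand-in for the rows returned by cur.fetchall(); the second field holds U+2019.
rows = [(1, 'It\u2019s a description'), (2, 'plain')]

path = os.path.join(tempfile.mkdtemp(), 'file_name.csv')

# encoding= states how text becomes bytes; newline='' is what the csv module expects.
with open(path, 'w', encoding='utf-8', newline='') as fileout:
    csv.writer(fileout).writerows(rows)

# Read the file back with the same declared encoding.
with open(path, encoding='utf-8', newline='') as filein:
    read_back = list(csv.reader(filein))
```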

SQLAlchemy UnicodeEncodeError on '😕' from SQL Server to Postgres

Some text is being fetched from an nvarchar column on a SQL Server database using SQLAlchemy and then written to a Postgres database with column type character varying. Python throws the following exception when the source data contains the character 😕:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 34: surrogates not allowed
The SQLAlchemy column type is set to String. I tried setting the column type to Unicode and separately set the column collation to SQL_Latin1_General_CP1_CI_AS, with no luck. The driver being used is FreeTDS.
Why can't Python encode this string? Does the problem lie with our use of SQLAlchemy, Python or Postgres? This error is making me 😕.
The code point U+1F615 (😕) is represented in UTF-16 by the two surrogates \ud83d and \ude15. Somehow the data from your SQL Server, which uses UTF-16 internally, was decoded as UCS-2, so the surrogate pair was not combined into a single character. So the problem lies with SQL Server.
If you cannot correctly read the data, you have to manually correct the unicode strings, like so (python3):
import re

def surrogate_to_unicode(sur):
    # Combine a high/low surrogate pair into the single code point it encodes.
    a, b = sur
    return chr(0x10000 + ((ord(a) - 0xd800) << 10) + (ord(b) - 0xdc00))

text = '\ud83d\ude15'
text = re.sub('[\ud800-\udbff][\udc00-\udfff]', lambda g: surrogate_to_unicode(g.group(0)), text)
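An alternative sketch that avoids the arithmetic: Python 3's `surrogatepass` error handler lets the surrogates through when encoding to UTF-16, and decoding the result pairs them up again. This assumes every surrogate in the text is part of a valid high/low pair; a lone surrogate would make the decode step fail.

```python
# Text containing the two unpaired-looking surrogates for U+1F615.
text = 'confused \ud83d\ude15 face'

# Encode the surrogates as raw UTF-16 code units, then decode them as a proper pair.
fixed = text.encode('utf-16', 'surrogatepass').decode('utf-16')
```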

how to pass character ′ prime in wxpython?

When I try to insert the text 2′BR into MySQL through a wxPython TextCtrl, it gives me this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128).
The problem is the character ′. I tried encode('utf8'), but it still doesn't work. When I insert it into MySQL manually and then query it, it shows up as 2?BR. Here is the insertion code. Thanks.
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (str(self.Text.GetValue())))
I assume you're using the unicode build of wxPython under Python 2 (not Python 3).
The problem arises when you call the str constructor on the result of self.Text.GetValue().
wxPython accepts all kinds of characters and returns unicode strings. In your example, TextCtrl.GetValue() returns the unicode string u"2′BR".
str() tries to convert it to a byte string using the default encoding, which is ASCII. ASCII can represent only 128 characters, and the prime character "′" is not one of them. That's why you get this error.
What is the encoding of your MySQL database? If you want to use non-ASCII characters like the prime "′", you should set your database encoding to UTF-8.
Then you should be able to do:
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (self.Text.GetValue(),))
You won't be able to successfully insert a character that doesn't exist in your database encoding.
The prime "′" (Unicode code point U+2032) doesn't exist in latin-1.
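Both points can be checked in Python 3 without wxPython or a database. In this sketch `title` stands in for the value returned by `GetValue()`; it also shows why the trailing comma in `(self.Text.GetValue(),)` matters.

```python
prime = '\u2032'  # PRIME
title = '2' + prime + 'BR'

# UTF-8 can represent the prime (as the three bytes e2 80 b2).
utf8_bytes = title.encode('utf-8')

# latin-1 stops at U+00FF, so the prime cannot be encoded.
try:
    title.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Parentheses alone do not make a tuple; the trailing comma does.
not_a_tuple = (title)
params = (title,)
```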

Python MySQL Unicode Error

I am trying to insert a query that contains é - or \xe9 (INSERT INTO tbl1 (text) VALUES ("fiancé")) into a MySQL table in Python using the _mysql module.
My query is in unicode, and when I call _mysql.connect(...).query(query) I get a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position X: ordinal not in range(128).
Obviously the call to query causes a conversion of the unicode string to ASCII somehow, but the question is why? My DB is in utf8 and the connection is opened with the flags use_unicode=True and charset='utf8'. Is unicode simply not supported with _mysql or MySQLdb? Am I missing something else?
Thanks!
I know this doesn't directly answer your question, but why aren't you using prepared statements? That will do two things: probably fix your problem, and almost certainly fix the SQLi bug you've almost certainly got.
If you won't do that, are you absolutely certain your string itself is unicode? If you're just naively using plain string literals in Python 2.7, they are byte strings, and combining them with unicode forces an implicit ASCII conversion.
