How to pass the prime character ′ in wxPython? - python

When I try to insert the text 2′BR into MySQL through a wxPython TextCtrl, I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128).
The problem is the character ′; I tried encode('utf8') and it still doesn't work. When I insert the text into MySQL manually and then query it, it comes back as 2?BR. Here is the insertion code. Thanks.
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (str(self.Text.GetValue())))

I assume you're using the unicode version of wxPython under Python 2 (not Python 3).
The problem arises when you're calling the str constructor on the result of self.Text.GetValue().
wxPython accepts all kinds of characters and returns unicode strings. In your example, TextCtrl.GetValue() returns the unicode string u"2′BR".
str() tries to convert it into a byte string using the default encoding, which is ASCII. ASCII can only represent 128 characters, and the prime character "′" is not one of them. That's why you get this error.
What is the encoding of your MySQL database? If you want to store unusual characters like the prime "′", you should set your database encoding to UTF-8.
Then you should be able to do:
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (self.Text.GetValue(),))
You won't be able to successfully insert a character that doesn't exist in your database encoding.
I don't think the prime "′" (U+2032) even exists in Latin-1.
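
For reference, here's a minimal end-to-end sketch of that approach, assuming the MySQLdb driver (the connection parameters are made up):

import MySQLdb

# charset='utf8' plus use_unicode=True tell the driver to exchange text
# with the server as UTF-8, so no implicit ascii conversion happens.
db = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                     db="mydb", charset="utf8", use_unicode=True)
cur = db.cursor()

title = u"2\u2032BR"  # what self.Text.GetValue() returns: a unicode string
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (title,))
db.commit()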

Related

SQLAlchemy UnicodeEncodeError on '😕' from SQL Server to Postgres

Some text is being fetched from an nvarchar column on a SQL Server database using SQLAlchemy and then written to a Postgres database with column type character varying. Python throws the following exception when the source data contains the character 😕:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 34: surrogates not allowed
The SQLAlchemy column type is set to String. I tried setting the column type to Unicode and separately set the column collation to SQL_Latin1_General_CP1_CI_AS, with no luck. The driver being used is FreeTDS.
Why can't Python encode this string? Does the problem lie with our use of SQLAlchemy, Python or Postgres? This error is making me 😕.
The code point U+1F615 (😕) is represented in UTF-16 by the two surrogates \ud83d and \ude15. Somehow the data from your SQL Server, which internally uses UTF-16, was decoded as UCS-2, so the surrogate pairs were never combined into real code points. So the problem is on the SQL Server side.
If you cannot read the data correctly, you have to repair the unicode strings manually, like so (Python 3):
import re

def surrogate_to_unicode(sur):
    # Combine a UTF-16 surrogate pair (two characters) into one code point.
    a, b = sur
    return chr(0x10000 + ((ord(a) - 0xd800) << 10) + (ord(b) - 0xdc00))

text = '\ud83d\ude15'
text = re.sub('[\ud800-\udbff][\udc00-\udfff]',
              lambda m: surrogate_to_unicode(m.group(0)), text)
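
As a side note, on Python 3 the same repair can be done without a regex by round-tripping through UTF-16 with the surrogatepass error handler (a sketch of the trick only, not of the SQLAlchemy layer):

text = '\ud83d\ude15'
# surrogatepass lets the lone surrogates be encoded; decoding the
# resulting UTF-16 bytes combines each pair into its real code point.
text = text.encode('utf-16', 'surrogatepass').decode('utf-16')
print(text)  # 😕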

How to check the encoding of string data before inserting it into a PostgreSQL table using Python?

I want to insert data into a table in PostgreSQL; the data is a string or an array of strings. The problem is that the data contains some words (strings) which are not valid UTF-8, and that causes the following error:
DataError: invalid byte sequence for encoding "UTF8": 0xe2 0x80 0x20
The code I am using is the following (it is in Python):
cur.execute("INSERT INTO tst1 (tweet,words,tag,unknown_tags) VALUES (%s,%s,%s,%s);", (row[1],words,tags[0],tags[1:]))
In order to prevent inserting data with a broken encoding, I was wondering if there is a way in Python (or in PostgreSQL) to check the data's encoding before inserting it into the tst1 table?
Always use unicode strings internally. You should decode your byte strings directly at the point of input, e.g.:
try:
    tags = tags.decode('utf8')
except UnicodeDecodeError:
    # do whatever you like when the input is invalid, e.g. skip or repair it
    pass
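
A slightly fuller sketch of that idea, applied to the INSERT above; the to_unicode helper and the 'replace' error policy are my own additions, not part of the original code:

def to_unicode(s, encoding='utf8'):
    # Decode byte strings at the point of input; pass unicode through.
    if isinstance(s, str):
        return s.decode(encoding, 'replace')
    return s

tweet = to_unicode(row[1])
words = [to_unicode(w) for w in words]
tags = [to_unicode(t) for t in tags]
cur.execute("INSERT INTO tst1 (tweet,words,tag,unknown_tags) VALUES (%s,%s,%s,%s);",
            (tweet, words, tags[0], tags[1:]))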

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)

I am trying to write data to a StringIO object using Python and then ultimately load this data into a postgres database using psycopg2's copy_from() function.
When I first did this, copy_from() threw an error: ERROR: invalid byte sequence for encoding "UTF8": 0xc92. So I followed this question.
I figured out that my Postgres database has UTF8 encoding.
The file/StringIO object I am writing my data into shows its encoding as the following:
setgid Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators
I tried to encode every string that I write to the intermediate file/StringIO object as UTF-8. To do this I used .encode(encoding='UTF-8', errors='strict') on every string.
This is the error I got now:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)
What does it mean? How do I fix it?
EDIT:
I am using Python 2.7
Some pieces of my code:
I read from a MySQL database that has data encoded in UTF-8 as per MySQL Workbench.
These are a few lines of code for writing my data (obtained from the MySQL db) to the StringIO object:
# Populate the table_data variable with rows delimited by \n and columns delimited by \t
row_num = 0
for row in cursor.fetchall():
    # Separate rows in a table by a newline delimiter
    if row_num != 0:
        table_data.write("\n")
    col_num = 0
    for cell in row:
        # Separate cells in a row by a tab delimiter
        if col_num != 0:
            table_data.write("\t")
        table_data.write(cell.encode(encoding='UTF-8', errors='strict'))
        col_num = col_num + 1
    row_num = row_num + 1
This is the code that writes to the Postgres database from my StringIO object table_data:
cursor = db_connection.cursor()
cursor.copy_from(table_data, <postgres_table_name>)
The problem is that you're calling encode on a str object.
A str is a byte string, usually representing text encoded in some way, like UTF-8. When you call encode on that, it first has to be decoded back to text, so the text can be re-encoded. By default, Python does that by calling s.decode(sys.getdefaultencoding()), and getdefaultencoding() usually returns 'ascii'.
So, you're taking UTF-8 encoded text, decoding it as if it were ASCII, then re-encoding it in UTF-8.
The general solution is to explicitly call decode with the right encoding, instead of letting Python use the default, and then encode the result.
But when the right encoding is already the one you want, the easier solution is to skip the .decode('utf-8').encode('utf-8') round-trip and use the UTF-8 str as the UTF-8 str it already is.
Or, alternatively, if your MySQL wrapper has a feature to let you specify an encoding and get back unicode values for CHAR/VARCHAR/TEXT columns instead of str values (e.g., in MySQLdb, you pass use_unicode=True to the connect call, or charset='utf8' if your database is too old to auto-detect it), just do that. Then you'll have unicode objects, and you can call .encode('utf-8') on them.
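
For instance, a minimal sketch of that last option, assuming MySQLdb (the connection details and table name are invented):

import StringIO
import MySQLdb

# use_unicode=True makes the driver hand back unicode objects for
# CHAR/VARCHAR/TEXT columns instead of encoded str bytes.
db = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                     db="mydb", charset="utf8", use_unicode=True)
cursor = db.cursor()
cursor.execute("SELECT * FROM source_table")

table_data = StringIO.StringIO()
for row in cursor.fetchall():
    # Cells are already unicode here; encode exactly once, on the way out.
    line = u"\t".join(unicode(cell) for cell in row)
    table_data.write(line.encode('utf-8') + "\n")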
In general, the best way to deal with Unicode problems is the last one—decode everything as early as possible, do all the processing in Unicode, and then encode as late as possible. But either way, you have to be consistent. Don't call str on something that might be a unicode; don't concatenate a str literal to a unicode or pass one to its replace method; etc. Any time you mix and match, Python is going to implicitly convert for you, using your default encoding, which is almost never what you want.
As a side note, this is one of the many things that Python 3.x's Unicode changes help with. First, str is now Unicode text, not encoded bytes. More importantly, if you have encoded bytes, e.g., in a bytes object, calling encode will give you an AttributeError instead of trying to silently decode so it can re-encode. And, similarly, trying to mix and match Unicode and bytes will give you an obvious TypeError, instead of an implicit conversion that succeeds in some cases and gives a cryptic message about an encode or decode you didn't ask for in others.

Arabic regex and MySQLdb in Python

I've been trying to extract certain Arabic strings from a web page and then store them in a database.
The first problem
The only way I could manage is to specify how many letters there are, using . in the pattern and unicode, like this:
import urllib
import re

content = urllib.urlopen("http://example.com/content.html").read()
content = unicode(content, "utf-8")
# Works and fetches the strings, but only those exactly 11 characters long
strings = re.findall("<strong>...........</strong>", content)
Second problem
When I tried to write the strings to a text file, I got:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
And when I tried to store them in the database, I got:
ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1)' at line 1")
What I thought of was to fetch the strings and encode them into base64 before storing them in the db,
but still got an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
The only way I could manage is to specify how many letters there are, using . in the pattern and unicode, like this
OK... is that a problem? Other than the general unreliability of hacking strings out of HTML with regex, obviously - consider using a proper parser (e.g. lxml.html et al.).
When I tried to write the strings to a text file, I got: UnicodeEncodeError
Files are bytes, so to write to a text file you have to encode the characters back to bytes, e.g.:
with open('file.txt', 'w') as fp:
    fp.write(content.encode('utf-8'))
If you try to write characters directly, Python will guess an encoding, typically ASCII, which will then fail as above because Arabic is not representable in ASCII.
And when I tried to store them in the database, I got: ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1)'
Post code? I don't think that's a Unicode problem. It looks more like you were creating a query with the content in it, without surrounding that content with quotes. Don't do that - use parameterised queries.
c.execute('INSERT INTO something VALUES ('+content+')') # fails, and security horror
c.execute('INSERT INTO something VALUES (%s)', (content,)) # fine
What I thought of was to fetch the strings and encode them into base64
Again, base64 operates on bytes, not characters, so encode first.
content.encode('utf-8').encode('base64')
but you shouldn't have to encode to base64 to store Unicode characters in a database. Ensure you are using table columns with a UTF-8 collation, and use UTF-8 as the connection charset, and no extra processing should be necessary.
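
Putting those pieces together, a rough sketch (the connection details, table, and column are invented):

# -*- coding: utf-8 -*-
import MySQLdb

db = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                     db="mydb", charset="utf8", use_unicode=True)
c = db.cursor()
# content is the unicode string scraped above; the parameterised query
# quotes it and encodes it as UTF-8 for us.
c.execute("INSERT INTO pages (body) VALUES (%s)", (content,))
db.commit()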

Python MySQL Unicode Error

I am trying to run a query that contains é, i.e. \xe9 (INSERT INTO tbl1 (text) VALUES ("fiancé")), against a MySQL table in Python using the _mysql module.
My query is in unicode, and when I call _mysql.connect(...).query(query) I get a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position X: ordinal not in range(128).
Obviously the call to query causes a conversion of the unicode string to ASCII somehow, but the question is why? My DB is in utf8 and the connection is opened with the flags use_unicode=True and charset='utf8'. Is unicode simply not supported with _mysql or MySQLdb? Am I missing something else?
Thanks!
I know this doesn't directly answer your question, but why aren't you using prepared statements? That will do two things: probably fix your problem, and fix the SQL injection bug you've almost certainly got.
If you won't do that, are you absolutely certain your string itself is unicode? If you're just naively using strings in Python 2.7, it is probably being forced into an ASCII string.
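
For what it's worth, a sketch of the parameterised route through the higher-level MySQLdb wrapper instead of raw _mysql (connection details are invented):

# -*- coding: utf-8 -*-
import MySQLdb

conn = MySQLdb.connect(db="mydb", charset="utf8", use_unicode=True)
cur = conn.cursor()
# The driver encodes the unicode parameter itself, so no implicit
# ascii conversion happens and the value is safely escaped.
cur.execute("INSERT INTO tbl1 (text) VALUES (%s)", (u"fianc\xe9",))
conn.commit()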
