Some text is being fetched from an nvarchar column on a SQL Server database using SQLAlchemy and then written to a Postgres database with column type character varying. Python throws the following exception when the source data contains the character 😕:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d'
in position 34: surrogates not allowed
The SQLAlchemy column type is set to String. I tried setting the column type to Unicode and separately set the column collation to SQL_Latin1_General_CP1_CI_AS, with no luck. The driver being used is FreeTDS.
Why can't Python encode this string? Does the problem lie with our use of SQLAlchemy, Python or Postgres? This error is making me 😕.
The codepoint \U1f615 (😕) can be represented by the two surrogates \ud83d and \ude15. Somehow your SQL-Server, which uses internally UTF16 was decoded as UCS2, so that the surrogates are not properly decoded. So the problem is the SQL-Server.
If you cannot correctly read the data, you have to manually correct the unicode strings, like so (python3):
def surrogate_to_unicode(sur):
a, b = sur
return chr(0x10000 + ((ord(a)-0xd800)<<10) + (ord(b)-0xdc00))
text = '\ud83d\ude15'
text = re.sub('[\ud800-\udbff][\udc00-\udfff]', lambda g:surrogate_to_unicode(g.group(0)), text)
Related
I have a panda dataframe with text columns ('testdf'). I am using the below code to insert to TEST table in oracle database
from sqlalchemy import create_engine, Unicode, NVARCHAR
engine = create_engine("oracle+cx_oracle://{user}:{pw}#xxxxx.xxxxx.xx:1521/{db}"
.format(user="xxx",
pw="xxx",
db="xx"))
testdf.to_sql("TEST", con = engine, if_exists = 'append')
But it returns an error with encoding as below:
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f447' in position 237: character maps to <undefined>
How can I solve this problem? I am using Python 3, Jupyter Notebook with Anaconda
This is a common question. I think this answer is a good one. Or this one.
The problem is that Python is trying to convert your data (which is encoded in Unicode) into some other character set to insert into the database, and that other character set doesn't include \U0001f447 (which is in your dataframe). This answer points out that if you look at the full error traceback and not just the error message, it will tell you which charset it's trying to convert into.
There's a few different options. The easiest is probably to pass ?charset=utf8 to cx_oracle in your connect string. This tells cx_oracle to send strings as Unicode.
"oracle+cx_oracle://{user}:{pw}#xxxxx.xxxxx.xx:1521/{db}?charset=utf8"
You could also try setting the NLS_LANG environment variable. This will tell the Oracle server to expect Unicode from your Python application.
os.environ['NLS_LANG']= 'AMERICAN_AMERICA.AL32UTF8'
I want to insert data into a table in PostgreSQL, data is string or array of strings, the problem is that data contain some words (or string) which are not in utf8, and it makes the following error :
DataError: invalid byte sequence for encoding "UTF8": 0xe2 0x80 0x20
the code that I am using is the following and it is in Python:
cur.execute("INSERT INTO tst1 (tweet,words,tag,unknown_tags) VALUES (%s,%s,%s,%s);", (row[1],words,tags[0],tags[1:]))
In order to prevent inserting data with inappropriate Unicode, I was wondering if there is a way in python (or in PostgreSQL) to check the Unicode of the data before insert it into tst1 table?
Use always unicode strings internally. You should decode your strings directly at the point of input, e.g.:
try:
tags = tags.decode('utf8')
except UnicodeDecodeError:
# do what ever you like to do, if input is invalid
I am using Python 2.7 and MySQLdb 1.2.3. I tried everything I found on stackoverflow and other forums to handle encoding errors my script is throwing.
My script reads data from all tables in a source MySQL DB, writes them in a python StringIO.StringIO object, and then loads that data from StringIO object to Postgres database (which apparently is in UTF-8 encoding format. I found this by looking into Properties--Definition of database in pgadmin) using psycopg2 library's copy_from command.
I found out that my source MySQL database has some tables in latin1_swedish_ci encoding while others in utf_8 encoding format (Found this from TABLE_COLLATION in information_schema.tables).
I wrote all this code on the top of my Python script based on my research on the internet.
db_conn = MySQLdb.connect(host=host,user=user,passwd=passwd,db=db, charset="utf8", init_command='SET NAMES UTF8' ,use_unicode=True)
db_conn.set_character_set('utf8')
db_conn_cursor = db_conn.cursor()
db_conn_cursor.execute('SET NAMES utf8;')
db_conn_cursor.execute('SET CHARACTER SET utf8;')
db_conn_cursor.execute('SET character_set_connection=utf8;')
I still get the UnicodeEncodeError below with this line: cell = str(cell).replace("\r", " ").replace("\n", " ").replace("\t", '').replace("\"", "") #Remove unwanted characters from column value,
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 47: ordinal not in range(128)
I wrote the following line of code to clean cells in every table of source MySQL database when writing to StringIO object.
cell = str(cell).replace("\r", " ").replace("\n", " ").replace("\t", '').replace("\"", "") #Remove unwanted characters from column value
Please help.
str(cell) is trying to convert cell to ASCII. ASCII only supports characters with ordinals less than 255. What is cell?
If cell is a unicode string, just do cell.encode("utf8"), and that will return a bytestring encoded as utf 8
...or really iirc. If you pass mysql unicode, then the database will automagically convert it to utf8...
You could also try,
cell = unicode(cell).replace("\r", " ").replace("\n", " ").replace("\t", '').replace("\"", "")
or just use a 3rd party library. There is a good one that will fix text for you.
When I am trying to insert this text 2′BR info MySql through wx.python Textctrl, it gives me such error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128).
The problem is in the character ′ and I tried encode('utf8') still doesn't work. When I insert it into MySql manually, then I query it it show me as 2?BR. Here is the code of insertion. Thanks.
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (str(self.Text.GetValue())))
I assume you're using the unicode version of wxPython under Python 2 (not Python 3).
The problem arises when you're calling the str constructor on the result of self.Text.GetValue().
wxPython accept all kind of characters and return unicode strings. In your example, Textctrl.GetValue() return the unicode string u"2′BR"
str() try to convert it into a string, using the default encoding, which is ascii. Ascii can only represents 128 characters. The prime character "′", is not represented in ascii. That's why you have this error.
What is the encoding of your MySQL database? If you want to use strange characters like the "′" prime, you should set your database encoding to utf-8.
Then you should be able to do:
cur.execute("INSERT INTO TKtable (title) VALUES (%s)", (self.Text.GetValue(),))
You won't be able to successfully insert a character that doesn't exist in your database encoding.
I think the prime "′" (code 2032 in utf-8) prime doesn't even exist in latin-1.
I've tried to get certain Arabic string from web page then store these strings into db.
The first problem
The only way I could is to specify how many letters are they by using . and use unicode, like this:
import urllib,re
content=urllib.urlopen("http://example.com/content.html").read()
content = unicode(content,"utf-8")
Strings = re.findall("<Strong>...........</strong>",content) # it will work fine and fetch it but only strings with 11 char or letter (11 place)
Second problem
When I tried to write it to text file it displays:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
And when I've tried to store it into database it displays:
ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1)' at line 1")
What I've think about is to fetch it then encode it into base64 then store it into db
but still got an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
The only way I could is to specify how many letters are they by using . and use unicode, like this
OK... is that a problem? Other than the general unreliability of hacking strings out of HTML with regex, obviously - consider using a proper parser (eg lxml.html et al).
When I tried to write it to text file it displays: UnicodeEncodeError
Files are bytes, so to write to a text file you have to encode the characters back to bytes. eg
with open('file.txt', 'w') as fp:
fp.write(content.encode('utf-8'))
if you try to write characters directly, Python will guess an encoding, typically ASCII, which will then fail as above because Arabic is not representable in ASCII.
And when I've tried to store it into database it displays: ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1)'
Post code? I don't think that's a Unicode problem. It looks more like you were creating a query with the content in it, without surrounding that content with quotes. Don't do that - use parameterised queries.
c.execute('INSERT INTO something VALUES ('+content+')') # fails, and security horror
c.execute('INSERT INTO something VALUES (%s)', (content,)) # fine
What I've think about is to fetch it then encode it into base64
Again, base64 operates on bytes, not characters, so encode first.
content.encode('utf-8').encode('base64')
but you shouldn't have to encode to base64 to store Unicode characters in a database. Ensure you are using table columns with a UTF-8 collation, and use UTF-8 as the connection charset, and no extra processing should be necessary.