Insert pandas df to Oracle database using Python

I have a pandas dataframe with text columns ('testdf'). I am using the code below to insert it into the TEST table in an Oracle database:
from sqlalchemy import create_engine, Unicode, NVARCHAR

engine = create_engine("oracle+cx_oracle://{user}:{pw}@xxxxx.xxxxx.xx:1521/{db}"
                       .format(user="xxx",
                               pw="xxx",
                               db="xx"))
testdf.to_sql("TEST", con=engine, if_exists='append')
But it fails with the following encoding error:
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f447' in position 237: character maps to <undefined>
How can I solve this problem? I am using Python 3 in a Jupyter Notebook with Anaconda.

This is a common question, and there are good existing answers to it.
The problem is that Python is trying to convert your data (which is Unicode) into some other character set before inserting it into the database, and that other character set doesn't include \U0001f447 (which appears in your dataframe). If you look at the full error traceback, not just the error message, it will tell you which charset it is trying to convert into.
There are a few different options. The easiest is probably to pass encoding parameters to cx_Oracle in your connect string (cx_Oracle takes encoding and nencoding arguments rather than a MySQL-style charset). This tells cx_Oracle to send and receive strings as UTF-8.
"oracle+cx_oracle://{user}:{pw}@xxxxx.xxxxx.xx:1521/{db}?encoding=UTF-8&nencoding=UTF-8"
You could also try setting the NLS_LANG environment variable (before cx_Oracle opens its first connection). This tells the Oracle client libraries to expect Unicode from your Python application.
import os
os.environ['NLS_LANG'] = 'AMERICAN_AMERICA.AL32UTF8'
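Putting both pieces together, a minimal sketch (the hostname, credentials and dataframe are placeholders, and the encoding parameters assume a reasonably recent cx_Oracle):

import os
# NLS_LANG must be set before cx_Oracle opens its first connection
os.environ['NLS_LANG'] = 'AMERICAN_AMERICA.AL32UTF8'

import pandas as pd
from sqlalchemy import create_engine

# encoding/nencoding make cx_Oracle use UTF-8 for VARCHAR2 and NVARCHAR2 data
engine = create_engine(
    "oracle+cx_oracle://xxx:xxx@xxxxx.xxxxx.xx:1521/xx"
    "?encoding=UTF-8&nencoding=UTF-8"
)

testdf = pd.DataFrame({"text_col": ["plain text", "emoji \U0001f447"]})
testdf.to_sql("TEST", con=engine, if_exists="append", index=False)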

Related

Inserting a unicode API response containing emojis into Mysql using Python mysql.connector

I'm connecting to the Facebook Graph API using Python, and the curl response delivers a bunch of data in Unicode format. I am trying to insert this data into a MySQL database using the Python mysql.connector driver, but I keep running into encoding errors.
Specifically, I get this type of error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 40: ordinal not in range(128)
or
File "/Library/Python/2.7/site-packages/mysql/connector/cursor_cext.py", line 243, in execute raise errors.ProgrammingError(str(err)) mysql.connector.errors.ProgrammingError: 'ascii' codec can't encode character u'\xa0' in position 519: ordinal not in range(128)
My database fields are all utf8mb4 and I believe my encoding is all UTF8 as well. So I can't figure out why I'm getting ASCII errors.
The error is happening on the 'caption' field of the Instagram posts being returned, which includes emojis, so I'm 99% sure this is the problem; when I comment out this line, everything else works as expected.
So far I have tried:
Adding use_unicode=True, charset='utf8' to the mysql.connector.connect command (according to the docs this is the default anyway)
Adding #!/usr/bin/python and # encoding=utf8 to the top of the script
Adding use_unicode=True, charset='ascii' to the mysql.connector.connect command, because why not try it
Tried combinations of caption.decode('utf') and caption.encode('utf8') on the variable before the MySQL insert directive.
I can't find any reference to ASCII in the mysql.connector documentation, so I'm not sure why it's trying to do the conversion.
In reference to the second error above: going to that line of cursor_cext.py in the mysql.connector package, the code looks like this:
try:
    if isunicode(operation):
        stmt = operation.encode(self._cnx.python_charset)
    else:
        stmt = operation
except (UnicodeDecodeError, UnicodeEncodeError) as err:
    raise errors.ProgrammingError(str(err))
I have previously done something similar with PHP successfully using the old Instagram API, but now that they have changed to the Facebook Graph API for Instagram, I decided to use Python as it appeared easier. Now I don't know where to go with these errors.
When you combine Unicode and byte strings in Python 2 (e.g. "a" + u"a"), there's an implicit coercion that calls .decode() on the byte string ("a"). The default codec for this method in Python 2 is ASCII.
Encoding errors that happen during implicit coercion can be pretty tricky to track down.
Implicit coercion is gone in Python 3, so both user code and library code are forced to keep str and bytes separate.
I suggest you upgrade to Python 3 if you can.
It might not immediately make your code work, but it's more likely that you will find out where to explicitly set the encoding.
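A minimal illustration of the difference (not the asker's code):

# Python 2: mixing bytes and unicode triggers an implicit coercion,
#   b"caption: " + u"\xa0"  ->  UnicodeDecodeError ('ascii' codec ...)
# because the byte string is silently decoded with the default ASCII codec.

# Python 3 keeps str and bytes separate, so you encode explicitly where you choose:
caption = u"nice view \U0001F447"
payload = caption.encode("utf-8")  # explicit, and UTF-8 can represent emoji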

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 637: ordinal not in range(256)

This is driving me crazy, and I have tried different suggestions from the community, but it doesn't seem to work. I even tried recreating the db using just utf-8 and it still gives me this error.
Basically I am using the pymysql module and writing to the db:
openconnect = pymysql.connect(host='xxxx', port=3306, user='xxx', passwd='xxx', db='xxxx')
opencursor = openconnect.cursor()
One of my columns is having the problem. The column that causes the issue is subject, and I tried the below:
subject = (df.Subject[i])
subject.encode('latin-1', 'ignore')  # note: encode() returns a new string; the result is discarded here
and then when I try to write to the db it fails.
If I try subject.encode('latin-1') it also fails.
I have two options: either fix the encoding, or set the collation on pymysql to use utf-8. I verified the db; the collation on MySQL is set to utf-8. Really appreciate your input on this.
Still struggling with this.
Cheers,
Kabeer
I was able to resolve the issue by defining a charset in pymysql connect:
openconnect = pymysql.connect(
    host='xxxx', port=3306, user='xxx', passwd='xxx', db='xxxx',
    charset='utf8'
)
Please note it is utf8 and not utf-8.
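One caveat worth adding: MySQL's utf8 charset stores at most three bytes per character, so four-byte characters such as emoji will still fail with it; utf8mb4 is the usual fix. A sketch, assuming the table columns are also utf8mb4 (the host, credentials, table and column names are placeholders):

import pymysql

openconnect = pymysql.connect(
    host='xxxx', port=3306, user='xxx', passwd='xxx', db='xxxx',
    charset='utf8mb4'  # 4-byte-safe superset of utf8
)
opencursor = openconnect.cursor()
subject = u'en dash \u2013 and emoji \U0001F600'
# parameterised insert; pymysql encodes using the connection charset
opencursor.execute('INSERT INTO tbl (subject) VALUES (%s)', (subject,))
openconnect.commit()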

SQLAlchemy UnicodeEncodeError on 'πŸ˜•' from SQL Server to Postgres

Some text is being fetched from an nvarchar column on a SQL Server database using SQLAlchemy and then written to a Postgres database with column type character varying. Python throws the following exception when the source data contains the character πŸ˜•:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 34: surrogates not allowed
The SQLAlchemy column type is set to String. I tried setting the column type to Unicode and separately set the column collation to SQL_Latin1_General_CP1_CI_AS, with no luck. The driver being used is FreeTDS.
Why can't Python encode this string? Does the problem lie with our use of SQLAlchemy, Python or Postgres? This error is making me πŸ˜•.
The code point U+1F615 (😕) is represented in UTF-16 by the surrogate pair \ud83d \ude15. Somehow your SQL Server data, which is stored internally as UTF-16, was decoded as UCS-2, so the surrogate pair was never combined into a single character. The problem is therefore on the SQL Server side of the pipeline.
If you cannot get the data read correctly, you have to repair the Unicode strings manually, like so (Python 3):
import re

def surrogate_to_unicode(sur):
    a, b = sur
    return chr(0x10000 + ((ord(a) - 0xd800) << 10) + (ord(b) - 0xdc00))

text = '\ud83d\ude15'
text = re.sub('[\ud800-\udbff][\udc00-\udfff]',
              lambda g: surrogate_to_unicode(g.group(0)), text)
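An equivalent repair that leans on the codec machinery instead of doing the arithmetic by hand is to round-trip through UTF-16 with the surrogatepass error handler:

text = '\ud83d\ude15'
fixed = text.encode('utf-16', 'surrogatepass').decode('utf-16')
print(fixed)  # 😕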

Arabic regex and MySQLdb in Python

I've tried to get certain Arabic strings from a web page and then store these strings in a db.
The first problem
The only way I could match them is to specify how many letters they have by using . and using unicode, like this:
import urllib, re

content = urllib.urlopen("http://example.com/content.html").read()
content = unicode(content, "utf-8")
Strings = re.findall("<Strong>...........</strong>", content)
# works and fetches them, but only strings of exactly 11 characters (one . per letter)
The second problem
When I tried to write it to a text file it displays:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
And when I tried to store it in the database it displays:
ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1)' at line 1")
What I've thought about is to fetch it, then encode it into base64, then store it in the db, but I still got an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
The only way I could match them is to specify how many letters they have by using . and using unicode
OK... is that a problem? Other than the general unreliability of extracting strings from HTML with regex, obviously: consider using a proper parser (e.g. lxml.html et al).
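For example, with lxml (a sketch; the URL follows the question, and lxml normalises tag names to lowercase):

import lxml.html

doc = lxml.html.parse("http://example.com/content.html")
strings = doc.xpath("//strong/text()")  # every <strong> text, regardless of length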
When I tried to write it to a text file it displays: UnicodeEncodeError
Files hold bytes, so to write to a text file you have to encode the characters back into bytes, e.g.
with open('file.txt', 'w') as fp:
    fp.write(content.encode('utf-8'))
If you try to write characters directly, Python will guess an encoding, typically ASCII, which then fails as above because Arabic is not representable in ASCII.
And when I've tried to store it into database it displays: ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\xd8\xa7\xd9\x84\xd9\x82\xd8\xb5\xd9\x8a\xd8\xb1)'
Post the code? I don't think that's a Unicode problem. It looks more like you created a query with the content embedded in it, without surrounding that content with quotes. Don't do that: use parameterised queries.
c.execute('INSERT INTO something VALUES (' + content + ')')  # fails, and a security horror
c.execute('INSERT INTO something VALUES (%s)', (content,))   # fine
What I've thought about is to fetch it, then encode it into base64
Again, base64 operates on bytes, not characters, so encode first:
content.encode('utf-8').encode('base64')
But you shouldn't have to encode to base64 to store Unicode characters in a database. Ensure you are using table columns with a UTF-8 collation, use UTF-8 as the connection charset, and no extra processing should be necessary.
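A sketch of that setup with MySQLdb (the connection details are placeholders, and content stands in for the fetched Arabic text):

# -*- coding: utf-8 -*-
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='xxx', passwd='xxx',
                       db='xxxx', charset='utf8', use_unicode=True)
c = conn.cursor()
content = u'\u0645\u062b\u0627\u0644'  # sample Arabic text ("مثال")
c.execute('INSERT INTO something VALUES (%s)', (content,))
conn.commit()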

Python MySQL Unicode Error

I am trying to insert a query that contains é, i.e. \xe9 (INSERT INTO tbl1 (text) VALUES ("fiancé")), into a MySQL table in Python using the _mysql module.
My query is in unicode, and when I call _mysql.connect(...).query(query) I get UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position X: ordinal not in range(128).
Obviously the call to query somehow converts the unicode string to ASCII, but the question is why? My DB is utf8 and the connection is opened with the flags use_unicode=True and charset='utf8'. Is unicode simply not supported with _mysql or MySQLdb? Am I missing something else?
Thanks!
I know this doesn't directly answer your question, but why aren't you using parameterised queries (prepared statements)? That will do two things: probably fix your problem, and almost certainly fix the SQL-injection bug you've almost certainly got.
If you won't do that, are you absolutely certain your string itself is unicode? If you're just naively using strings in Python 2.7, it is probably being coerced into an ASCII byte string.
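For what it's worth, _mysql is the thin wrapper around the C API; the higher-level MySQLdb interface gives you parameterised queries directly. A sketch, assuming a utf8 database (connection details are placeholders):

# -*- coding: utf-8 -*-
import MySQLdb

conn = MySQLdb.connect(db='mydb', user='xxx', passwd='xxx',
                       charset='utf8', use_unicode=True)
cur = conn.cursor()
# the driver escapes and encodes the parameter itself
cur.execute('INSERT INTO tbl1 (text) VALUES (%s)', (u'fianc\xe9',))
conn.commit()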
