Python pyodbc connections to IBM Netezza Erroring

So. This issue is almost exactly the same as the one discussed here -- but the fix (such as it is) discussed in that post doesn't fix things for me.
I'm trying to use Python 2.7.5 and pyodbc 3.0.7 to connect from an Ubuntu 12.04 64bit machine to an IBM Netezza Database. I'm using unixODBC to handle specifying a DSN. This DSN works beautifully from the isql CLI -- so I know it's configured correctly, and unixODBC is ticking right along.
The code is currently dead simple, and easy to reproduce in a REPL:
In [1]: import pyodbc
In [2]: conn = pyodbc.connect(dsn='NZSQL')
In [3]: curs = conn.cursor()
In [4]: curs.execute("SELECT * FROM DB..FOO ORDER BY created_on DESC LIMIT 10")
Out[4]: <pyodbc.Cursor at 0x1a70ab0>
In [5]: curs.fetchall()
---------------------------------------------------------------------------
InvalidOperation Traceback (most recent call last)
<ipython-input-5-ad813e4432e9> in <module>()
----> 1 curs.fetchall()
/usr/lib/python2.7/decimal.pyc in __new__(cls, value, context)
546 context = getcontext()
547 return context._raise_error(ConversionSyntax,
--> 548 "Invalid literal for Decimal: %r" % value)
549
550 if m.group('sign') == "-":
/usr/lib/python2.7/decimal.pyc in _raise_error(self, condition, explanation, *args)
3864 # Errors should only be risked on copies of the context
3865 # self._ignored_flags = []
-> 3866 raise error(explanation)
3867
3868 def _ignore_all_flags(self):
InvalidOperation: Invalid literal for Decimal: u''
So I get a connection, the query returns correctly, and then when I try to get a row... asplode.
Anybody ever managed to do this?

Turns out pyodbc can't gracefully convert all of Netezza's types. The table I was working with had two that are problematic:
A column of type NUMERIC(7,2)
A column of type NVARCHAR(255)
The NUMERIC column causes a decimal conversion error on NULL. The NVARCHAR column returns a utf-16-le encoded string, which is a pain in the ass.
I haven't found a good driver-or-wrapper-level solution yet. This can be hacked by casting types in the SQL statement:
SELECT
    foo::FLOAT AS was_numeric
  , bar::VARCHAR(255) AS was_nvarchar
I'll post here if I find a lower-level answer.
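In the meantime, one lower-level direction worth trying is pyodbc's output converters (Connection.add_output_converter, available in newer pyodbc releases than the 3.0.7 used above; check your version). This is only a sketch, not verified against the Netezza driver, and the assumption that NUMERIC comes back as text bytes is mine:

import decimal
import pyodbc

conn = pyodbc.connect(dsn='NZSQL')

def convert_nvarchar(raw):
    # NVARCHAR comes back utf-16-le encoded (see above); decode it here.
    return raw.decode('utf-16-le') if raw is not None else None

def convert_numeric(raw):
    # Guard against the empty value that blows up Decimal() on NULLs.
    # Assumes the driver hands the number back as text bytes.
    if raw is None or raw == '':
        return None
    return decimal.Decimal(raw)

conn.add_output_converter(pyodbc.SQL_WVARCHAR, convert_nvarchar)
conn.add_output_converter(pyodbc.SQL_NUMERIC, convert_numeric)

curs = conn.cursor()
curs.execute("SELECT * FROM DB..FOO ORDER BY created_on DESC LIMIT 10")
print curs.fetchall()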

I've just encountered the same problem and found a different solution.
I managed to solve the issue by:
Making sure the following attributes are part of my driver options in the odbc.ini file:
UnicodeTranslationOption = utf16
CharacterTranslationOption = all
Adding the following environment variables:
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:[NETEZZA_LIB_FILES_PATH]
ODBCINI=[ODBC_INI_FULL_PATH]
NZ_ODBC_INI_PATH=[ODBC_INI_FOLDER_PATH]
In my case the values are:
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nz/lib
ODBCINI=/etc/odbc.ini
NZ_ODBC_INI_PATH=/etc
I'm using CentOS 6 and have also installed both the 'unixODBC' and 'unixODBC-devel' packages.
Hope it helps someone.
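For what it's worth, the two odbc-related variables can also be set from the Python script itself before pyodbc loads the driver. A sketch using my example paths above (LD_LIBRARY_PATH generally has to be exported before the interpreter starts, so it is not handled here):

import os

# Must be set before pyodbc/unixODBC loads the Netezza driver.
os.environ['ODBCINI'] = '/etc/odbc.ini'   # full path to odbc.ini
os.environ['NZ_ODBC_INI_PATH'] = '/etc'   # folder containing odbc.ini

import pyodbc
conn = pyodbc.connect(dsn='NZSQL')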

I'm not sure what your error is, but the code below is allowing me to connect to Netezza via ODBC:
# Connect via pyodbc to listed data sources on machine
import pyodbc
print pyodbc.dataSources()
print "Connecting via ODBC"
# get a connection, if a connect cannot be made an exception will be raised here
conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=<myserver>;PORT=<myport>;DATABASE=<mydbschema>;UID=<user>;PWD=<password>;")
print "Connected!\n"
# you can then use conn cursor to perform queries
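Continuing from the connection above, a minimal query to sanity-check it (this assumes the _V_TABLE system view is visible, which it normally is on Netezza):

curs = conn.cursor()
curs.execute("SELECT COUNT(*) FROM _V_TABLE")   # count of tables visible to the user
print curs.fetchone()[0]
curs.close()
conn.close()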

The Netezza Linux client package includes /usr/local/nz/lib/ODBC_README, which lists all the values for those two attributes:
UnicodeTranslationOption:
Specify translation option for Unicode.
Possible value strings are:
utf8 : unicode data is in utf-8 encoding
utf16 : unicode data is in utf-16 encoding
utf32 : unicode data is in utf-32 encoding
Do not add '-' in the value strings listed above, e.g. "utf-8" is not
a valid string value. These value strings are case insensitive.
On windows this option is not available as windows DM always passes
unicode data in utf16.
CharacterTranslationOption ("Optimize for ASCII character set" on Windows):
Specify translation option for character encodings.
Possible value strings are:
all : Support all character encodings
latin9 : Support Latin9 character encoding only
Do not add '-' in the value strings listed above, e.g. "latin-9"
is not a valid string value. These value strings are case
insensitive.
NPS uses the Latin9 character encoding for char and varchar
datatypes. The character encoding on many Windows systems
is similar, but not identical to this. For the ASCII subset
(letters a-z, A-Z, numbers 0-9 and punctuation marks) they are
identical. If your character data in CHAR or VARCHAR datatypes is
only in this ASCII subset then it will be faster if this box is
checked. If your data has special characters such as the Euro
sign (€) then keep the box unchecked to accurately convert
the encoding. Characters in the NCHAR or NVARCHAR data types
will always be converted appropriately.

Related

Insert pandas df to Oracle database using Python

I have a pandas dataframe with text columns ('testdf'). I am using the code below to insert into the TEST table in an Oracle database:
from sqlalchemy import create_engine, Unicode, NVARCHAR

engine = create_engine(
    "oracle+cx_oracle://{user}:{pw}@xxxxx.xxxxx.xx:1521/{db}".format(
        user="xxx", pw="xxx", db="xx"
    )
)
testdf.to_sql("TEST", con=engine, if_exists='append')
But it returns an error with encoding as below:
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f447' in position 237: character maps to <undefined>
How can I solve this problem? I am using Python 3 and Jupyter Notebook with Anaconda.
This is a common question. I think this answer is a good one. Or this one.
The problem is that Python is trying to convert your data (which is encoded in Unicode) into some other character set to insert into the database, and that other character set doesn't include \U0001f447 (which is in your dataframe). This answer points out that if you look at the full error traceback and not just the error message, it will tell you which charset it's trying to convert into.
There are a few different options. The easiest is probably to pass ?charset=utf8 to cx_oracle in your connect string. This tells cx_oracle to send strings as Unicode.
"oracle+cx_oracle://{user}:{pw}#xxxxx.xxxxx.xx:1521/{db}?charset=utf8"
You could also try setting the NLS_LANG environment variable. This will tell the Oracle server to expect Unicode from your Python application.
os.environ['NLS_LANG']= 'AMERICAN_AMERICA.AL32UTF8'
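Putting both suggestions together, a sketch (the host and credentials are the same placeholders as above, and testdf is the dataframe from the question); note NLS_LANG needs to be set before the Oracle client libraries are loaded:

import os

# Tell the Oracle client to expect Unicode (AL32UTF8) from this process;
# set it before cx_Oracle / the Oracle client libraries are loaded.
os.environ['NLS_LANG'] = 'AMERICAN_AMERICA.AL32UTF8'

from sqlalchemy import create_engine

# charset=utf8 asks cx_oracle to send strings as Unicode (see above).
engine = create_engine(
    "oracle+cx_oracle://{user}:{pw}@xxxxx.xxxxx.xx:1521/{db}?charset=utf8".format(
        user="xxx", pw="xxx", db="xx"
    )
)

testdf.to_sql("TEST", con=engine, if_exists='append')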

Python unicode string rejected by psycopg

I've received a unicode string from the wild that causes some of our psycopg2 statements to fail.
I have reduced the problem down to a SSCE:
import psycopg2
conn = psycopg2.connect(...)
cur = conn.cursor()
x = u'\ud837'
cur.execute("SELECT %s", (x,))
print cur.fetchone()
Running this gives the following exception:
Traceback (most recent call last):
File ".../run.py", line 65, in <module>
cur.execute("SELECT %s AS test", (x,))
psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xb7
Based on some of the comments, it has become clear that this particular character is one half of a surrogate pair, making it invalid to live on its own.
Specifically then, I am looking for a mechanism to detect when a string contains an incomplete surrogate pair in Python 2.
One such method I have found that leads to an exception is trying x.encode('utf16').decode('utf16'); however, since I don't totally understand the associated risks, I would be somewhat concerned about using it here.
Edit: Reduced SSCE string to single character causing the problem, added information based on comments.
The string u'\ud837' consists of a lone member of a surrogate pair, two code units that must appear in sequence to form a logical character. As such, it does not define a Unicode code point on its own - instead, it is an implementation detail of the UTF-16 encoding, which uses surrogate pairs to pack the full code point range into 16-bit code units. Python 3 correctly rejects attempts to encode lone surrogates in any byte encoding, including the UTF-* variants.
The string probably originated from a system that internally uses UTF-16 (such as Java, C#, Windows, or Python 2 built with 16-bit Py_UNICODE) that naively shortened the string without taking care of surrogates.
Taking the regex from this answer, it should be possible to efficiently detect such strings using code such as:
import re

lone = re.compile(
    ur'''(?x)                # verbose expression (allows comments)
    (                        # begin group
        [\ud800-\udbff]      # match leading surrogate
        (?![\udc00-\udfff])  # but only if not followed by trailing surrogate
    )                        # end group
    |                        # OR
    (                        # begin group
        (?<![\ud800-\udbff]) # if not preceded by leading surrogate
        [\udc00-\udfff]      # match trailing surrogate
    )                        # end group
    ''')

def invalid_unicode(s):
    assert isinstance(s, unicode)
    return lone.search(s) is not None
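For example, applied to the string from the question (same Python 2 session):

>>> invalid_unicode(u'\ud837')
True
>>> invalid_unicode(u'\U0001d837')   # a complete, non-surrogate code point
False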
To detect that the string is invalid utf-8, just wrap an attempt to encode it inside a try/except before executing it in psycopg2.
As for what caused the problem, there is a specific character in the middle of the string that is utf-16 encoded: \U000d8a85. So it's not that Postgres does not consider it utf-8, it really isn't.

Python Encoding - Could not decode to utf8

I have an sqlite database that was populated by an external program. Im trying to read the data with python. When I attempt to read the data I get the following error:
OperationalError: Could not decode to UTF-8
If I open the database in SQLite Manager and look at the data in the offending record(s) using the inbuilt browse and search, it looks fine; however, if I export the table as CSV, I notice the character £ in the offending records has become £
If I read the CSV in Python, the £ in the offending records is still read as £, but that's not a problem; I can parse this manually. However, I need to be able to read the data directly from the database, without the intermediate step of converting to CSV.
I have looked at some answers online for similar questions, I have so far tried setting "text_factory = str" and I have also tried changing the datatype of the column from TEXT to BLOB using sqlite manager, but still get the error.
My code below results in the OperationalError: Could not decode to UTF-8
conn = sqlite3.connect('test.db')
conn.text_factory = str
curr = conn.cursor()
curr.execute('''SELECT xml_dump FROM hands_1 LIMIT 5000 , 5001''')
row = curr.fetchone()
All the records above 5000 in the database have this character problem and hence produce the error.
Any help appreciated.
Python is trying to be helpful by converting pieces of text (stored as bytes in a database) into a python str object for you. In order to do this conversion, python has to guess what letter each byte (or group of bytes) returned by your query represents. The default guess is an encoding called utf-8. Obviously, this guess is wrong in your case.
The solution is to give python a little hint as to how to do the mapping from bytes to letters (i.e., unicode characters). You've already come close with the line
conn.text_factory = str
However (based on your response in the comments above), since you are using python 3, str is the default text factory, so that line will do nothing new for you (see the docs).
What happens behind the scenes with this line is that python tries to convert the bytes returned by the query using the str function, kind of like:
your_string = str(the_bytes, 'utf-8') # actually uses `conn.text_factory`, not `str`
...but you want a different encoding where 'utf-8' is. Since you can't change the default encoding of the str function, you will have to mimic it some other way. You can use a one-off nameless function called a lambda for this:
conn.text_factory = lambda x: str(x, 'latin1')
Now when the database is handing the bytes to python, python will try to map them to letters using the 'latin1' scheme instead of the 'utf-8' scheme. Of course, I don't know if latin1 is the correct encoding of your data. Realistically, you will have to try a handful of encodings to find the right one. I would try the following first:
'iso-8859-1'
'utf-16'
'utf-32'
'latin1'
You can find a more complete list here.
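A quick way to experiment is to loop over those candidates against the query that fails and see which one decodes cleanly. A sketch, reusing the database and query from your post:

import sqlite3

candidates = ['iso-8859-1', 'utf-16', 'utf-32', 'latin1']

for enc in candidates:
    conn = sqlite3.connect('test.db')
    # decode the stored bytes with the candidate encoding instead of utf-8
    conn.text_factory = lambda b, enc=enc: str(b, enc)
    curr = conn.cursor()
    try:
        curr.execute('''SELECT xml_dump FROM hands_1 LIMIT 5000 , 5001''')
        row = curr.fetchone()
        print(enc, 'decoded OK:', row[0][:60] if row else row)
    except (UnicodeDecodeError, sqlite3.OperationalError) as err:
        print(enc, 'failed:', err)
    finally:
        conn.close()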
Another option is to simply let the bytes coming out of the database remain as bytes. Whether this is a good idea for you depends on your application. You can do it by setting:
conn.text_factory = bytes
If the text in the database is actually mostly encoded in UTF-8, but you're still seeing this error (Could not decode to UTF-8), then the problem may be that one or more rows have bogus data that is not valid UTF-8. By default, Python's decode() function throws an exception when it sees text like that. If you are in this situation and want to simply ignore these errors, you can set up a text_factory like this:
conn = sqlite3.connect('my-database.db')
conn.text_factory = lambda b: b.decode(errors = 'ignore')

Why Chinese garbled when use webpy but it's normal when use MySQLdb?

I created a database in MySQL and use webpy to build my web server.
But the behaviour for Chinese characters is strangely different between webpy and MySQLdb when using them to access the database.
Below is my problem:
My table t_test (utf8 database):
id name
1 测试
the utf8 code for "测试" is: \xe6\xb5\x8b\xe8\xaf\x95
when using MySQLdb to do "select" like this:
c=conn.cursor()
c.execute("SELECT * FROM t_test")
items = c.fetchall()
c.close()
print "items=%s, name=%s"%(eval_items, eval_items[1])
the result is normal, it prints:
items=(127L, '\xe6\xb5\x8b\xe8\xaf\x95'), name=测试
But when I use webpy to do the same thing:
db = web.database(dbn='mysql', host="127.0.0.1",
                  user='test', pw='test', db='db_test', charset="utf8")
eval_items=db.select('t_test')
comment=eval_items[0].name
print "comment code=%s"%repr(comment)
print "comment=%s"%comment.encode("utf8")
Chinese garbling occurred; the print result is:
comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
comment=忙碌鈥姑€
I know webpy's database layer also depends on MySQLdb, but the behaviour is so different between these two ways. Why?
BTW, for the reason above, I could just use MySQLdb directly to solve my Chinese character garbling problem, but then I lose the column names in the table, which is ungraceful. I want to know how I can solve it with webpy.
Indeed, something very wrong is taking place: as you said in your comment, the utf-8 bytes for "测试" are E6 B5 8B E8 AF 95, which display correctly on my utf-8 terminal here:
>>> d
'\xe6\xb5\x8b\xe8\xaf\x95'
>>> print d
测试
But look at the bytes on your "comment" unicode object:
comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
Meaning part of your content is the utf-8 bytes for the comment (the chars represented as "\xYY") and part is encoded as Unicode code points (the chars represented as \uYYYY) - this indicates serious garbage.
MySQL has some catches to properly decode (utf-8 or otherwise) encoded text - one of which is passing a proper "charset" parameter to the connection. But you did that already.
One thing you can try is to pass the connection the option use_unicode=False, and decode the utf-8 strings in your own code.
db = web.database(dbn='mysql', host="127.0.0.1",
                  user='test', pw='test', db='db_test', charset="utf8",
                  use_unicode=False)
Check the options to connect for this and other parameters you might try:
http://mysql-python.sourceforge.net/MySQLdb.html
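If the driver then really hands you the raw utf-8 bytes, the manual decode is straightforward, something like the sketch below (as the update further down shows, this did not actually happen in your setup, so treat it as the intended usage rather than a confirmed fix):

import web

db = web.database(dbn='mysql', host="127.0.0.1",
                  user='test', pw='test', db='db_test',
                  charset="utf8", use_unicode=False)

eval_items = db.select('t_test')
raw = eval_items[0].name          # expected: '\xe6\xb5\x8b\xe8\xaf\x95'
comment = raw.decode("utf-8")     # decode the utf-8 bytes yourself
print comment.encode("utf-8")     # 测试 on a utf-8 terminal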
Regardless of getting it to work correctly with the hints above, I have a workaround for you: it looks like the Unicode characters (not the utf-8 raw bytes in the unicode objects) in your encoded string are encoded in one of these encodings:
("cp1258", "cp1252", "palmos", "cp1254")
Of these, cp1252 is almost the same as "latin1" - which is the default charset MySQL uses
if it does not get the "charset" argument in the connection. But it is not only a matter
of web2py not passing it to the database, as you are getting mangled chars, not
just the wrong encoding - it is as if web2py is encoding and decoding your string back and forth, and ignoring encoding errors.
From all of these encodings I could retrieve your original "测试" string, as a utf-8 byte string, doing, for example:
comment = comment.encode("cp1252", errors="ignore")
So, placing this line might work for you now, but guessing around with unicode is never good -
the proper thing is to narrow down what is making web2py give you those semi-decoded utf-8 strings in the first place, and make it stop there.
update
I checked here; this is what is happening: the correct utf-8 string '\xe6\xb5\x8b\xe8\xaf\x95' is read from MySQL, and before being delivered to you (in the use_unicode=True case) these bytes are decoded as if they were "cp1252" - this yields the incorrect u'\xe6\xb5\u2039\xe8\xaf\u2022' unicode. It is probably a web2py error, i.e. it does not pass your "charset=utf8" parameter to the actual connection. When you set "use_unicode=False", instead of giving you the raw bytes, it apparently picks the incorrect unicode and encodes it using "utf-8" - this yields the '\xc3\xa6\xc2\xb5\xe2\x80\xb9\xc3\xa8\xc2\xaf\xe2\x80\xa2' sequence you mentioned in the comments below (which is even more incorrect).
All in all, the workaround I mentioned above seems the only way to retrieve the original, correct string - that is, given the wrong unicode, do u'\xe6\xb5\u2039\xe8\xaf\u2022'.encode("cp1252", errors="ignore") - short of setting up the database connection some other way (or maybe updating web2py or the mysql drivers, if possible).
update 2
I further checked the code in the web2py dal.py file itself - it attempts to set up the connection as utf-8 by default - but it looks like it will try both the MySQLdb and pymysql drivers. If you have both installed, try uninstalling pymysql and leaving only MySQLdb.

Sqlite database non-ascii character error with python SQLalchemy library

When I use an sqlite3 database with the sqlalchemy library, I get this error:
sqlalchemy.exc.ProgrammingError: (ProgrammingError)
You must not use 8-bit bytestrings unless you use a text_factory that can
interpret 8-bit bytestrings (like text_factory = str).
It is highly recommended that you instead just switch your application
to Unicode strings.
u'INSERT INTO model_pair (user, password) VALUES (?, ?)' ('\xebE\xc2\xe4.\t312874#gg.com', '123456')
and here is some test data:
呆呆 3098920#gg.com 11111
�言 9707996#gg.com 11111
wwwj55572#gg?€? 11111
I have configured the database encoding as utf-8 or gbk, but neither succeeds.
When inserting, I tried str.decode('gbk'); it gets stuck on chars like € and gives an error like the one above.
Can anyone tell me how to get around this error?
try to change '\xebE\xc2\xe4.\t312874#gg.com' to u'\xebE\xc2\xe4.\t312874#gg.com'
also, you could try to do '\xebE\xc2\xe4.\t312874#gg.com'.decode("utf-8"), but it gives an error "invalid continuation byte", perhaps your string is not valid utf-8 after all?
btw, do mention if you are running python 2.x or 3.x; there is a difference.
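In other words, decode the 8-bit byte strings to unicode before they reach SQLAlchemy. A rough Python 2 sketch (the encoding is only a guess; based on your sample data it might be gbk, so adjust as needed):

# -*- coding: utf-8 -*-
raw_user = '\xebE\xc2\xe4.\t312874#gg.com'   # 8-bit byte string from your data

# Decode to a unicode object first; 'replace' substitutes a placeholder for
# bytes that do not decode (like the mangled euro sign) instead of raising.
user = raw_user.decode('gbk', 'replace')
password = u'123456'

# then pass the unicode values to the insert, e.g.:
# conn.execute(model_pair.insert().values(user=user, password=password))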
