psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0x00 - python

I do bulk PostgreSQL 9.3.9 data inserts from Python 3.4. I've been using SQLAlchemy, which works fine for normal data processing, but for a while now I've been using psycopg2 directly so I can use the copy_from function, which I found faster for bulk inserts. The issue is that when using copy_from, the bulk inserts fail whenever the data contains certain special characters. When I remove the highlighted line the insert runs successfully.
Error
Traceback (most recent call last):
  File "/vagrant/apps/data_script/data_update.py", line 1081, in copy_data_to_db
    'surname', 'other_name', 'reference_number', 'balance'), sep="|", null='None')
psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0x00
CONTEXT:  COPY source_file_raw, line 98: "94|1|99|2015-09-03 10:17:34|False|True|John|Doe|A005-001\008020-01||||||..."
Code producing the error
cursor.copy_from(data_list, 'source_file_raw',
                 columns=('id', 'partner_id', 'pos_row', 'loaded_at', 'has_error',
                          'can_be_loaded', 'surname', 'other_name',
                          'reference_number', .............),
                 sep="|", null='None')
The db connection
import psycopg2
import psycopg2.extras  # needed for DictCursor

pg_conn_string = ("host='%s' port='%s' dbname='%s' user='%s' password='%s'"
                  % (con_host, con_port, con_db, con_user, con_pass))
conn = psycopg2.connect(pg_conn_string)
conn.set_isolation_level(0)  # autocommit
if cursor_type == 'dict':
    cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
else:
    cursor = conn.cursor()
return cursor
So the baffling thing is that SQLAlchemy can do the bulk inserts even when those "special characters" are present, but using psycopg2 directly fails. I'm thinking there must be a way for me to escape this, or to tell psycopg2 to handle the insert smartly. Or am I missing a setting somewhere?
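A plausible explanation, going by the CONTEXT line in the traceback: SQLAlchemy sends parameterized INSERTs, so the backslash in A005-001\008020-01 arrives literally, whereas COPY's text format treats a backslash followed by octal digits as an escape, turning \008 into a 0x00 byte, which PostgreSQL cannot store in a text column. Below is a minimal sketch of a workaround that escapes backslashes while building the buffer handed to copy_from; rows and sanitize are hypothetical names, while the table and column names come from the question.

import io

def sanitize(value):
    # Escape backslashes so COPY's text format cannot reinterpret them
    # (e.g. \008 becoming a NUL byte), and drop any literal NUL characters,
    # which PostgreSQL text columns reject. A full solution would also
    # escape the separator "|" and embedded newlines.
    return str(value).replace('\\', '\\\\').replace('\x00', '')

buf = io.StringIO()
for row in rows:  # rows: the raw tuples the question builds data_list from
    buf.write('|'.join(sanitize(v) for v in row) + '\n')  # str(None) == 'None' matches null='None'
buf.seek(0)

cursor.copy_from(buf, 'source_file_raw', sep='|', null='None',
                 columns=('id', 'partner_id', 'pos_row'))  # etc., as above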

Related

Python3 ODBC Execute Many

I am trying to copy data from one Oracle table to another in a different schema using the pyodbc library. Here is what I'm doing.
source = SomeString  # source Oracle DataTable connection string
target = SomeString  # target Oracle DataTable connection string
Connecting to data source to retrieve data:
source_data = pyodbc.connect(source)
source_cursor = source_data.cursor()
Connect to the target data source:
target = pyodbc.connect(target)
target_cursor = target.cursor()
I now declare my source data query:
source_query = "SELECT * FROM TABLE WHERE TYPE = 'X'"
I put the data in a DataFrame and then convert it to a list:
data = pd.read_sql(source_query, source_data)  # pandas imported as pd; read via the source connection
data = data.values.tolist()
I am now trying to insert the data in my "data" list into my target table. I declare an insert statement and then run executemany as follows:
sql = "INSERT INTO SCHEMA.TABLE (column1, column2, etc...) VALUES (?, ?, etc..)"
Now, since I have my data and target connection established, I execute the following:
target_cursor.executemany(sql, data)
The weird part is that the code inserts the first row into the new table properly and then fails, and nothing else happens.
Can you please guide me on how to fix this?
I get the following error:
C:\WinPy3770x64\python-3.7.7.amd64\lib\encodings\utf_16_le.py in decode(input, errors)
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_16_le_decode(input, errors, True)
17
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 184-185: illegal encoding
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
<ipython-input-70-ec2d225d6132> in <module>
----> 1 target_cursor.executemany(sql_statement, data)
SystemError: <class 'pyodbc.Error'> returned a result with an error set
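There is no answer in this scrape, but for reference, here are the steps above assembled into one runnable sketch with placeholder connection strings. The setdecoding/setencoding calls are not from the question: they are pyodbc connection settings that are a common mitigation when executemany dies mid-batch with a utf-16 decode error (compare the setencoding fix further down this page), so treat them as an assumption to test.

import pandas as pd
import pyodbc

source = "DSN=source_oracle;UID=user;PWD=secret"  # placeholder connection strings
target = "DSN=target_oracle;UID=user;PWD=secret"

source_conn = pyodbc.connect(source)
target_conn = pyodbc.connect(target)

# Assumption, not from the question: pin the encodings the driver uses.
for conn in (source_conn, target_conn):
    conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
    conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-16-le')
    conn.setencoding(encoding='utf-8')

source_query = "SELECT * FROM TABLE WHERE TYPE = 'X'"  # placeholder query
data = pd.read_sql(source_query, source_conn)
# NaN is not None; replace it so the driver sends real NULLs.
data = data.where(pd.notnull(data), None).values.tolist()

sql = "INSERT INTO SCHEMA.TABLE (column1, column2) VALUES (?, ?)"  # placeholder
target_cursor = target_conn.cursor()
target_cursor.executemany(sql, data)
target_conn.commit()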

What's the cause of this UnicodeDecodeError with an nvarchar field using pyodbc and MSSQL?

I can read from an MSSQL database by sending queries in Python through pypyodbc.
Mostly unicode characters are handled correctly, but I've hit a certain character that causes an error.
The field in question is of type nvarchar(50) and begins with this character "􀄑" which renders for me a bit like this...
-----
|100|
|111|
-----
If that number is hex 0x100111 then it's the Supplementary Private Use Area-B character U+100111. Though interestingly, if it's binary 0b100111 then it's an apostrophe (0b100111 = 39, ASCII '); could the wrong encoding have been used when the data was uploaded? This field stores part of a Chinese postal address.
The error message includes
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data
Here it is in full...
Traceback (most recent call last):
  File "question.py", line 19, in <module>
    results.fetchone()
  File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 1869, in fetchone
    value_list.append(buf_cvt_func(from_buffer_u(alloc_buffer)))
  File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 482, in UCS_dec
    uchar = buffer.raw[i:i + ucs_length].decode(odbc_decoding)
  File "/VIRTUAL_ENVIRONMENT_DIR/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data
Here's some minimal reproducing code...
import pypyodbc
connection_string = (
"DSN=sqlserverdatasource;"
"UID=REDACTED;"
"PWD=REDACTED;"
"DATABASE=obi_load")
connection = pypyodbc.connect(connection_string)
cursor = connection.cursor()
query_sql = (
    "SELECT address_line_1 "
    "FROM address "
    "WHERE address_id = 'REDACTED' ")  # note: T-SQL equality is =, not ==
with cursor.execute(query_sql) as results:
row = results.fetchone() # This is the line that raises the error.
print row
Here is a chunk of my /etc/freetds/freetds.conf
[global]
; tds version = 4.2
; dump file = /tmp/freetds.log
; debug flags = 0xffff
; timeout = 10
; connect timeout = 10
text size = 64512
[sqlserver]
host = REDACTED
port = 1433
tds version = 7.0
client charset = UTF-8
I've also tried client charset = UTF-16 and omitting that line altogether.
Here's the relevant chunk from my /etc/odbc.ini
[sqlserverdatasource]
Driver = FreeTDS
Description = ODBC connection via FreeTDS
Trace = No
Servername = sqlserver
Database = REDACTED
Here's the relevant chunk from my /etc/odbcinst.ini
[FreeTDS]
Description = TDS Driver (Sybase/MS SQL)
Driver = /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
Setup = /usr/lib/x86_64-linux-gnu/odbc/libtdsS.so
CPTimeout =
CPReuse =
UsageCount = 1
I can work around this issue by fetching results in a try/except block and throwing away any rows that raise a UnicodeDecodeError, but is there a proper solution? Can I throw away just the undecodable character, or is there a way to fetch this row without raising an error?
It's not inconceivable that some bad data has ended up in the database.
I've Googled around and checked this site's related questions, but have had no luck.
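(For what it's worth, the row-skipping workaround described above might look like the sketch below; whether the cursor can keep advancing after a decode error is driver-dependent, so this is illustrative only.)

rows = []
while True:
    try:
        row = results.fetchone()
    except UnicodeDecodeError:
        continue  # discard the row that failed to decode
    if row is None:
        break  # end of the result set
    rows.append(row)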
I fixed the issue myself by using this:
conn.setencoding('utf-8')
immediately before creating a cursor, where conn is the connection object.
I was fetching tens of millions of rows with fetchall(), and was in the middle of a transaction that would be extremely expensive to undo manually, so I couldn't afford to simply skip the invalid ones.
Source where I found the solution: https://github.com/mkleehammer/pyodbc/issues/112#issuecomment-264734456
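In context, that looks something like this (a sketch assuming pyodbc's connection API; the question itself uses pypyodbc, to which setencoding does not apply as-is):

import pyodbc

conn = pyodbc.connect(connection_string)
conn.setencoding('utf-8')  # set before any cursor is created
cursor = conn.cursor()
cursor.execute(query_sql)
row = cursor.fetchone()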
This problem was eventually worked around. I suspect the text had a character of one encoding hammered into a field with a different declared encoding by some hacky method when the table was being set up.

Passing a properly formatted Windows pathname to SQL Server via pyodbc

I need python to issue a query via pyodbc to insert a .PNG file into a table as a blob. The problem seems to have something to do with how the path to the file is represented. Here's some code:
OutFilePath = 'd:\\DilerBW_images\\'
OutFileName = SubjID+'_'+str(UserID)+'_'+DateTime+'.png'
print OutFilePath+OutFileName
qstring = ('insert into [wpic-smir].[Diler_BW].[Beckwith].[PlotImages](ID, UserID, ImageType, MoodType, ImageIndex, ImageData)'
'select '+str(SubjID)+' as ID, '+str(UserID)+' as UserID,'
'1 as ImageType,'
'NULL as MoodType,'
'NULL as ImageIndex,'
'* from OPENROWSET(Bulk \''+OutFilePath+OutFileName+'\', SINGLE_BLOB) as ImageData')
print qstring
cursor.execute(qstring)
conn.commit()
And here's the output:
d:\DilerBW_images\999123_999123_2015-01-20_14-25-07.013000.png
insert into [wpic-smir].[Diler_BW].[Beckwith].[PlotImages](ID, UserID, ImageType, MoodType, ImageIndex, ImageData)select 999123 as ID, 999123 as UserID,1 as ImageType,NULL as MoodType,NULL as ImageIndex,* from OPENROWSET(Bulk 'd:\DilerBW_images\999123_999123_2015-01-20_14-25-07.013000.png', SINGLE_BLOB) as ImageData
Now, here's the error I get:
Traceback (most recent call last):
File "c:\pythonscripts\DilerBW_Plot_Single_report_2_EH.py", line 253, in <module>
cursor.execute(qstring)
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][SQL Server Native Client 11.0][SQL Server]Cannot bulk load because the file "d:\\DilerBW_images\\999123_999123_2015-01-20_14-25-07.013000.png" could not be opened. Operating system error code 3(The system cannot find the path specified.). (4861) (SQLExecDirectW)')
Sorry this is so long. Notice that the file path in the error message includes double backslashes, where the query only includes single ones. I have looked extensively at the various ways to build the path string (os.path.sep, raw strings, os.path.join), but it doesn't seem to matter, and I'm not certain the input string is the problem. Again, if I cut and paste the query as it's presented in the output into SSMS and execute it, it works fine.
Thanx.
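An aside on the doubled backslashes, which are likely a red herring: pyodbc surfaces the error tuple through repr, and repr escapes backslashes, so a path held with single backslashes prints doubled there. A quick check (file name hypothetical):

path = 'd:\\DilerBW_images\\file.png'
print(path)         # d:\DilerBW_images\file.png -- what SQL Server receives
print(repr(path))   # 'd:\\DilerBW_images\\file.png' -- what the error tuple shows

Also worth noting, as an observation rather than a confirmed fix: OPENROWSET(BULK ...) opens the file on the SQL Server machine itself, so operating system error 3 usually means the d: path does not exist from the server's point of view, even if it exists on the client.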

Error using a Django cursor to save escaped characters

I have a URL which I want to save into the MySQL database using the "cursor" tool offered by Django, but I keep getting a "not enough arguments for format string" error because the URL contains escaped (non-ASCII) characters. The testing code is fairly short:
test.py
import os
import runconfig #configuration file
os.environ['DJANGO_SETTINGS_MODULE'] = runconfig.django_settings_module
from django.db import connection,transaction
c = connection.cursor()
url = "http://www.academicjournals.org/ijps/PDF/pdf2011/18mar/G%C3%B3mez-Berb%C3%ADs et al.pdf"
dbquery = "INSERT INTO main_crawl_document SET url="+url
c.execute(dbquery)
transaction.commit_unless_managed()
The full error message is
Traceback (most recent call last):
File "./test.py", line 14, in <module>
c.execute(dbquery)
File "/usr/local/lib/python2.6/site-packages/django/db/backends/util.py", line 38, in execute
sql = self.db.ops.last_executed_query(self.cursor, sql, params)
File "/usr/local/lib/python2.6/site-packages/django/db/backends/__init__.py", line 505, in last_executed_query
return smart_unicode(sql) % u_params
TypeError: not enough arguments for format string
Can anybody help me?
You're opening yourself up to possible SQL injection. Instead, use c.execute() properly, with %s as the placeholder (the paramstyle Django's MySQL backend expects):
url = "http://www.academicjournals.org/ijps/PDF/pdf2011/18mar/G%C3%B3mez-Berb%C3%ADs et al.pdf"
dbquery = "INSERT INTO main_crawl_document SET url=%s"
c.execute(dbquery, (url,))
transaction.commit_unless_managed()
The .execute method accepts a sequence of parameters to escape, assuming it's the normal DB-API method (which it is with Django).
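As for why the original version failed: the traceback shows Django rebuilding the query for logging with smart_unicode(sql) % u_params, so the literal % signs in the URL (%C3, %B3, ...) get read as format directives with nothing to fill them. Passing the URL as a parameter keeps it out of the format string entirely; if you ever must inline a literal %, double it as %%.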

python encoding problem with mysqldb

I am having trouble with encodings in Python while using xlrd and MySQLdb.
I am reading an Excel file which contains Turkish characters.
When I print the value, like print sheet.cell(rownum,19).value, it writes İstanbul to the console, which is correct. (Win7, Lucida Console, encoding cp1254.)
However, if I insert that value into the database like
sql = "INSERT INTO city (name) VALUES('"+sheet.cell(rownum,19).value+"')"
cursor.execute (sql)
db.commit()
it gives this error:
Traceback (most recent call last):
  File "excel_employer.py", line 112, in <module>
    cursor.execute (sql_deneme)
  File "C:\Python27\lib\site-packages\MySQLdb\cursors.py", line 157, in execute
    query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0130' in position 41: ordinal not in range(256)
If I change the SQL to
sql = "INSERT INTO city (name) VALUES('"+sheet.cell(rownum,19).value.encode('utf8')+"')"
the value is inserted without any error, but it becomes Ä°stanbul in the database.
Could you give me any idea how I can store the value İstanbul as it is?
As @Kazark said, maybe the encoding of your MySQL connector is not set.
conn = MySQLdb.connect(
    host="localhost",
    user="root",
    passwd="root",
    port=3306,
    db="test1",
    init_command="set names utf8"  # make the connection speak UTF-8
)
Try this when you initialize your MySQL connector, but be sure the content being inserted is UTF-8.
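A follow-up sketch, not from the answer: with the connection set to UTF-8 (MySQLdb also accepts charset="utf8" and use_unicode=True in connect), you can pass the cell value as a %s parameter and let the driver do the encoding, instead of concatenating it into the SQL:

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="root", passwd="root",
                       port=3306, db="test1",
                       charset="utf8", use_unicode=True)
cursor = conn.cursor()
cursor.execute("INSERT INTO city (name) VALUES (%s)",
               (sheet.cell(rownum, 19).value,))  # unicode value straight from xlrd
conn.commit()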
