Python - Use binary COPY table FROM with psycopg2

I'm trying to adapt the "Use binary COPY table FROM with psycopg2"
example from @Mike T to my data, but I'm having some problems.
import psycopg2
import numpy as np
from struct import pack
from io import BytesIO
from datetime import datetime
conn = psycopg2.connect(host = 'x', database = 'x', user = 'x')
curs = conn.cursor()
DROP TABLE IF EXISTS test_test;
CREATE TABLE test_test(
id_from_database INT PRIMARY KEY,
version VARCHAR,
information TEXT
);
data = [(3,1,'hello hello!!'), (2,'123','test test???!'),(3,9, 'bye bye :)')]
dtype = [('id_from_database', 'object'),('version', 'object'),('information', 'object')]
data = np.array(data,dtype=dtype)
def prepare_text(dat):
    cpy = BytesIO()
    for row in dat:
        cpy.write('\t'.join([repr(x) for x in row]) + '\n')
    return(cpy)

def prepare_binary(dat):
    pgcopy_dtype = [('num_fields','>i2')]
    for field, dtype in dat.dtype.descr:
        pgcopy_dtype += [(field + '_length', '>i4'),
                         (field, dtype.replace('<', '>'))]
    pgcopy = np.empty(dat.shape, pgcopy_dtype)
    pgcopy['num_fields'] = len(dat.dtype)
    for i in range(len(dat.dtype)):
        field = dat.dtype.names[i]
        pgcopy[field + '_length'] = dat.dtype[i].alignment
        pgcopy[field] = dat[field]
    cpy = BytesIO()
    cpy.write(pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))
    cpy.write(pgcopy.tostring())  # all rows
    cpy.write(pack('!h', -1))  # file trailer
    #print("cpy")
    #print(cpy)
    return(cpy)

###

def time_pgcopy(dat, table, binary):
    print('Processing copy object for ' + table)
    tstart = datetime.now()
    cpy = prepare_binary(dat)
    tendw = datetime.now()
    print('Copy object prepared in ' + str(tendw - tstart) + '; ' +
          str(cpy.tell()) + ' bytes; transfering to database')
    cpy.seek(0)
    curs.copy_expert('COPY ' + table + ' FROM STDIN WITH BINARY', cpy)
    conn.commit()
    tend = datetime.now()
    print('Database copy time: ' + str(tend - tendw))
    print(' Total time: ' + str(tend - tstart))
    return

print(time_pgcopy(data, 'test_test', binary=True))
I'm getting this error:
curs.copy_expert('COPY ' + table + ' FROM STDIN WITH BINARY', cpy)
psycopg2.DataError: incorrect binary data format
CONTEXT: COPY test_test, line 1, column id_from_database
What am I doing wrong?
Thank you :)
(I cannot comment on the original question because I don't have enough reputation.)

cpgcopy might be relevant here.
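For what it's worth, the 'object' dtype in the question looks like a likely culprit: numpy cannot serialize an object-dtype structured array into the raw, fixed-width bytes that binary COPY expects, so the payload sent for id_from_database is not a valid big-endian integer. Below is only a sketch of the kind of concrete dtypes the binary approach needs, not a drop-in fix; the 'S3'/'S13' widths and the sample ids are illustrative, not taken from the question:

import numpy as np

# Hypothetical fixed-width layout: '>i4' matches INT, 'S3'/'S13' are
# made-up byte-string widths for the two text columns.
data = [(1, b'1', b'hello hello!!'),
        (2, b'123', b'test test???!'),
        (3, b'9', b'bye bye :)')]
dtype = [('id_from_database', '>i4'), ('version', 'S3'), ('information', 'S13')]
data = np.array(data, dtype=dtype)

Note that the per-field length written into the copy stream has to be each field's actual byte size, and variable-length text columns bring their own subtleties (padding, per-value lengths), so for ragged string data the plain text-mode prepare_text() path may be the simpler route.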

Related

Python SQLite3 - cursor.execute - no error

This is a piece of code which needs to perform the following functionality:
Dump all table names in a database
From each table, search for a column with either Latitude or Longitude in it
Store these coordinates as a JSON file
The code was tested and working on a single database. However, once it was put into another piece of code that calls it with different databases, it no longer enters line 49. There is no error either, so I am struggling to see what the issue is, as I have not changed anything.
Code snippet (line 48 is the bottom line):
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print (cursor)
for tablerow in cursor.fetchall():
I am running this in the /tmp/ dir due to an earlier error with sqlite not working outside the temp.
Any questions please ask them.
Thanks!!
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sqlite3
import os
import sys
filename = sys.argv[1]
def validateFile(filename):
    filename, fileExt = os.path.splitext(filename)
    print ("[Jconsole] Python: Filename being tested - " + filename)
    if fileExt == '.db':
        databases(filename)
    elif fileExt == '.json':
        jsons(fileExt)
    elif fileExt == '':
        blank()
    else:
        print ('Unsupported format')
        print (fileExt)

def validate(number):
    try:
        number = float(number)
        if -90 <= number <= 180:
            return True
        else:
            return False
    except ValueError:
        pass

def databases(filename):
    dbName = sys.argv[2]
    print (dbName)
    idCounter = 0
    mainList = []
    lat = 0
    lon = 0
    with sqlite3.connect(filename) as conn:
        conn.row_factory = sqlite3.Row
        cursor = conn.cursor()
        cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
        print (cursor)
        for tablerow in cursor.fetchall():
            print ("YAY1")
            table = tablerow[0]
            cursor.execute('SELECT * FROM {t}'.format(t=table))
            for row in cursor:
                print(row)
                print ("YAY")
                tempList = []
                for field in row.keys():
                    tempList.append(str(field))
                    tempList.append(str(row[field]))
                for i in tempList:
                    if i in ('latitude', 'Latitude'):
                        index = tempList.index(i)
                        if validate(tempList[index + 1]):
                            idCounter += 1
                            tempList.append(idCounter)
                            (current_item, next_item) = \
                                (tempList[index], tempList[index + 1])
                            lat = next_item
                    if i in ('longitude', 'Longitude'):
                        index = tempList.index(i)
                        if validate(tempList[index + 1]):
                            (current_item, next_item) = \
                                (tempList[index], tempList[index + 1])
                            lon = next_item
                result = '{ "id": ' + str(idCounter) \
                    + ', "content": "' + dbName + '", "title": "' \
                    + str(lat) + '", "className": "' + str(lon) \
                    + '", "type": "box"},'
                mainList.append(result)
    file = open('appData.json', 'a')
    for item in mainList:
        file.write('%s\n' % item)
    file.close()

# {
# ...."id": 1,
# ...."content": "<a class='thumbnail' href='./img/thumbs/thumb_IMG_20161102_151122.jpg'>IMG_20161102_151122.jpg</><span><img src='./img/thumbs/thumb_IMG_20161102_151122.jpg' border='0' /></span></a>",
# ...."title": "50.7700721944444",
# ...."className": "-0.8727045",
# ...."start": "2016-11-02 15:11:22",
# ...."type": "box"
# },

def jsons(filename):
    print ('JSON')

def blank():
    print ('blank')

validateFile(filename)
Fixed.
The issue was up here:
filename, fileExt = os.path.splitext(filename)
The filename variable was being overwritten without the file extension, so when SQLite searched for the file it didn't find it.
Strangely, no error appeared, but it is fixed now by changing the filename variable to filename1.
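Roughly, the fix described, sketched against the original function (only the renamed variable comes from the answer; the rest mirrors the question's code):

def validateFile(filename):
    # Split into a new variable so the original path keeps its extension.
    filename1, fileExt = os.path.splitext(filename)
    print ("[Jconsole] Python: Filename being tested - " + filename1)
    if fileExt == '.db':
        databases(filename)  # pass the untouched path so sqlite3 can find the file
    elif fileExt == '.json':
        jsons(fileExt)
    elif fileExt == '':
        blank()
    else:
        print ('Unsupported format')
        print (fileExt)

The silent failure also makes sense: sqlite3.connect() happily creates a new, empty database when given a path that does not exist, so the table query simply returned no rows instead of raising an error.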

Import a CSV to Google Fusion Table with python

From http://fuzzytolerance.info/blog/2012/01/13/2012-01-14-updating-google-fusion-table-from-a-csv-file-using-python/ I have edited his code to import the necessary modules; however, I get the following error: "AttributeError: 'module' object has no attribute 'urlencode'". I run the code, am prompted to enter my password, enter my own Google account password, and then the code gives me the error message. Perhaps I need to define a password somewhere?
I wonder if anyone can please troubleshoot my code, or advise me on how to avoid this error, or even advise me of an easier way to import a CSV into a Google Fusion Table that I own.
Here is my code:
import csv
from decimal import *
import getpass
from fusiontables.authorization.clientlogin import ClientLogin
from fusiontables import ftclient
nameAgeNick = 'C:\\Users\\User\\Desktop\\NameAgeNickname.txt'
# check to see if something is an integer
def isInt(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

# check to see if something is a float
def isFloat(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

# open the CSV file
ifile = open(nameAgeNick, "rb")
reader = csv.reader(ifile)

# GFT table ID
tableID = "tableid"

# your username
username = "username"

# prompt for your password - you can hardcode it but this is more secure
password = getpass.getpass("Enter your password:")

# Get token and connect to GFT
token = ClientLogin().authorize(username, password)
ft_client = ftclient.ClientLoginFTClient(token)

# Loop through the CSV data and upload
# Assumptions for my data: if it's a float less than 0, it's a percentage
# Floats are being rounded to 1 significant digit
# Non-numbers are wrapped in a single quote for string-type in the updatate statement
# The first row is the column names and matches exactly the column names in Fustion tables
# The first column is the unique ID I'll use to select the record for updating in Fusion Tables
rownum = 0
setList = list()
nid = 0
for row in reader:
    # Save header row.
    if rownum == 0:
        header = row
    else:
        colnum = 0
        setList[:] = []
        for col in row:
            thedata = col
            # This bit rounds numbers and turns numbers < 1 into percentages
            if isFloat(thedata):
                if isInt(thedata) is False:
                    if float(thedata) < 1:
                        thedata = float(thedata) * 100
                    thedata = round(float(thedata), 1)
            else:
                thedata = "'" + thedata + "'"
            # make sql where clause for row
            setList.append(header[colnum] + "=" + str(thedata))
            nid = row[0]
            colnum += 1
        # get rowid and update the record
        rowid = ft_client.query("select ROWID from " + tableID + " where ID = " + nid).split("\n")[1]
        print( rowid)
        print( ft_client.query("update " + tableID + " set " + ",".join(map(str, setList)) + " where rowid = '" + rowid + "'"))
    rownum += 1
ifile.close()
And this is the module where the error occurs:
#!/usr/bin/python
#
# Copyright (C) 2010 Google Inc.
""" ClientLogin.
"""
__author__ = 'kbrisbin@google.com (Kathryn Brisbin)'

import urllib, urllib2

class ClientLogin():
    def authorize(self, username, password):
        auth_uri = 'https://www.google.com/accounts/ClientLogin'
        authreq_data = urllib.urlencode({  # <-- HERE IS THE ERROR
            'Email': username,
            'Passwd': password,
            'service': 'fusiontables',
            'accountType': 'HOSTED_OR_GOOGLE'})
        auth_req = urllib2.Request(auth_uri, data=authreq_data)
        auth_resp = urllib2.urlopen(auth_req)
        auth_resp_body = auth_resp.read()
        auth_resp_dict = dict(
            x.split('=') for x in auth_resp_body.split('\n') if x)
        return auth_resp_dict['Auth']
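A side note, not necessarily the cause here: this ClientLogin module is Python 2 code, and urllib.urlencode simply does not exist in Python 3, where it moved to urllib.parse (and urllib2 became urllib.request). Purely for reference, the same request in Python 3 syntax would look roughly like this:

from urllib.parse import urlencode
from urllib.request import Request, urlopen

auth_uri = 'https://www.google.com/accounts/ClientLogin'
authreq_data = urlencode({
    'Email': 'username',
    'Passwd': 'password',
    'service': 'fusiontables',
    'accountType': 'HOSTED_OR_GOOGLE'}).encode('ascii')  # Request wants bytes
auth_req = Request(auth_uri, data=authreq_data)
auth_resp = urlopen(auth_req)

Another common source of this particular AttributeError is a local file named urllib.py shadowing the standard library module; neither possibility can be confirmed from the question alone.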

Need help to improve performance of my python code

Hello to all passionate programmers out there. I need your help with my code.
My Goal: To efficiently move data from Amazon S3 to Amazon Redshift.
Basically, I am moving all CSV files in my S3 bucket to Redshift using the code below. I parse through part of each file, build a table structure, and then use the COPY command to load the data into Redshift.
'''
Created on Feb 25, 2015
@author: Siddartha.Reddy
'''
import sys
from boto.s3 import connect_to_region
from boto.s3.connection import Location
import csv
import itertools
import psycopg2
''' ARGUMENTS TO PASS '''
AWS_KEY = sys.argv[1]
AWS_SECRET_KEY = sys.argv[2]
S3_DOWNLOAD_PATH = sys.argv[3]
REDSHIFT_SCHEMA = sys.argv[4]
TABLE_NAME = sys.argv[5]
UTILS = S3_DOWNLOAD_PATH.split('/')
class UTIL():
    global UTILS

    def bucket_name(self):
        self.BUCKET_NAME = UTILS[0]
        return self.BUCKET_NAME

    def path(self):
        self.PATH = ''
        offset = 0
        for value in UTILS:
            if offset == 0:
                offset += 1
            else:
                self.PATH = self.PATH + value + '/'
        return self.PATH[:-1]

def GETDATAINMEMORY():
    conn = connect_to_region(Location.USWest2, aws_access_key_id=AWS_KEY,
                             aws_secret_access_key=AWS_SECRET_KEY,
                             is_secure=False, host='s3-us-west-2.amazonaws.com')
    ut = UTIL()
    BUCKET_NAME = ut.bucket_name()
    PATH = ut.path()
    filelist = conn.lookup(BUCKET_NAME)
    ''' Fetch part of the data from S3 '''
    for path in filelist:
        if PATH in path.name:
            DATA = path.get_contents_as_string(headers={'Range': 'bytes=%s-%s' % (0, 100000000)})
            return DATA

def TRAVERSEDATA():
    DATA = GETDATAINMEMORY()
    CREATE_TABLE_QUERY = 'CREATE TABLE ' + REDSHIFT_SCHEMA + '.' + TABLE_NAME + '( '
    JUNKED_OUT = DATA[3:]
    PROCESSED_DATA = JUNKED_OUT.split('\n')
    CSV_DATA = csv.reader(PROCESSED_DATA, delimiter=',')
    COUNTER, STRING, NUMBER = 0, 0, 0
    COLUMN_TYPE = []

    ''' GET COLUMN NAMES AND COUNT '''
    for line in CSV_DATA:
        NUMBER_OF_COLUMNS = len(line)
        COLUMN_NAMES = line
        break

    ''' PROCESS COLUMN NAMES '''
    a = 0
    for REMOVESPACE in COLUMN_NAMES:
        TEMPHOLDER = REMOVESPACE.split(' ')
        temp1 = ''
        for x in TEMPHOLDER:
            temp1 = temp1 + x
        COLUMN_NAMES[a] = temp1
        a = a + 1

    ''' GET COLUMN DATA TYPES '''
    # print(NUMBER_OF_COLUMNS,COLUMN_NAMES,COUNTER)
    # print(NUMBER_OF_COLUMNS)
    i, j, a = 0, 500, 0
    while COUNTER < NUMBER_OF_COLUMNS:
        for COLUMN in itertools.islice(CSV_DATA, i, j + 1):
            if COLUMN[COUNTER].isdigit():
                NUMBER = NUMBER + 1
            else:
                STRING = STRING + 1
        if NUMBER == 501:
            COLUMN_TYPE.append('INTEGER')
            # print('I CAME IN')
            NUMBER = 0
        else:
            COLUMN_TYPE.append('VARCHAR(2500)')
            STRING = 0
        COUNTER = COUNTER + 1
        # print(COUNTER)
    COUNTER = 0

    ''' BUILD SCHEMA '''
    while COUNTER < NUMBER_OF_COLUMNS:
        if COUNTER == 0:
            CREATE_TABLE_QUERY = CREATE_TABLE_QUERY + COLUMN_NAMES[COUNTER] + ' ' + COLUMN_TYPE[COUNTER] + ' NOT NULL,'
        else:
            CREATE_TABLE_QUERY = CREATE_TABLE_QUERY + COLUMN_NAMES[COUNTER] + ' ' + COLUMN_TYPE[COUNTER] + ' ,'
        COUNTER += 1
    CREATE_TABLE_QUERY = CREATE_TABLE_QUERY[:-2] + ')'
    return CREATE_TABLE_QUERY

def COPY_COMMAND():
    S3_PATH = 's3://' + S3_DOWNLOAD_PATH
    COPY_COMMAND = "COPY " + REDSHIFT_SCHEMA + "." + TABLE_NAME + " from '" + S3_PATH + "' credentials 'aws_access_key_id=" + AWS_KEY + ";aws_secret_access_key=" + AWS_SECRET_KEY + "' REGION 'us-west-2' csv delimiter ',' ignoreheader as 1 TRIMBLANKS maxerror as 500"
    return COPY_COMMAND

def S3TOREDSHIFT():
    conn = psycopg2.connect("dbname='xxx' port='5439' user='xxx' host='xxxxxx' password='xxxxx'")
    cursor = conn.cursor()
    cursor.execute('DROP TABLE IF EXISTS ' + REDSHIFT_SCHEMA + "." + TABLE_NAME)
    SCHEMA = TRAVERSEDATA()
    print(SCHEMA)
    cursor.execute(SCHEMA)
    COPY = COPY_COMMAND()
    print(COPY)
    cursor.execute(COPY)
    conn.commit()

S3TOREDSHIFT()
Current Challenges:
Challenges with creating the table structure:
Field lengths: right now I am just hardcoding the VARCHAR fields to 2500. All my files are > 30 GB, and parsing through the whole file to calculate the length of each field takes a lot of processing time.
Determining if a column is null: I am simply hardcoding the first column as NOT NULL using the COUNTER variable (all my files have ID as the first column). I would like to know if there is a better way of doing it.
Is there any data structure I can use? I am always interested in learning new ways to improve performance, so if you have any suggestions please feel free to comment.
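On the field-length point, one possibility (a sketch only, reusing the same 500-row sample the script already reads; the function and variable names here are made up for illustration) is to track the maximum width and digit-ness of every column in a single pass instead of fixing VARCHAR(2500):

import csv
import itertools

def infer_columns(rows, sample_size=500):
    """Guess a column type per field from a sample of parsed CSV rows."""
    reader = iter(rows)
    names = next(reader)                      # header row
    max_len = [0] * len(names)
    all_digits = [True] * len(names)
    for row in itertools.islice(reader, sample_size):
        for i, value in enumerate(row):
            max_len[i] = max(max_len[i], len(value))
            if not value.isdigit():
                all_digits[i] = False
    types = []
    for i in range(len(names)):
        if all_digits[i]:
            types.append('INTEGER')
        else:
            # pad the observed width, since this is only a sample
            types.append('VARCHAR(%d)' % max(32, max_len[i] * 2))
    return list(zip(names, types))

# e.g. column_defs = infer_columns(csv.reader(PROCESSED_DATA, delimiter=','))

Since it only looks at a sample, the widths can still be underestimates; the maxerror setting already present in the COPY command is what would catch the stragglers.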

separate line output by groups

My Python script checks mysqldump files, and if there are any problems the script prints:
Dump is old for db;
Dump is not complete for db;
Dump is empty for db;
MySQL dump does not exist for db;
The script logs these records to the file line by line.
My question: is there a way to format the output in the file like this:
Dump is old for db;
Dump is old for db;
Dump is old for db;
Dump is not complete for db;
Dump is not complete for db;
Dump is not complete for db;
Dump is empty for db;
Dump is empty for db;
Dump is empty for db;
Because now my file looks like:
Dump is old for db;
Dump is empty for db;
Dump is old for db;
MySQL dump does not exist for db;
...
etc
Here is my small script :)
#!/bin/env python
import psycopg2
import sys,os
from subprocess import Popen, PIPE
from datetime import datetime
import smtplib
con = None
today = datetime.now().strftime("%Y-%m-%d")
log_dump_fail = '/tmp/mysqldump_FAIL'
log_fail = open(log_dump_fail,'w').close()
log_fail = open(log_dump_fail, 'a')
sender = 'PUT_SENDER_NAME_HERE'
receiver = ['receiver_name']
smtp_daemon_host = 'localhost'
def db_backup_file_does_not_exist(db_backup_file):
    if not os.path.exists(db_backup_file): return True
    else: return False

def dump_health(last_dump_row, file_name, db):
    last_row = last_dump_row.rsplit(" ")
    tms = ''.join(last_row[4:5])
    status = last_row[1:3]
    if (status) and (tms != today):
        log_fail.write("\nDB is old for "+ str(db) + str(file_name) + ", \nDump finished at " + str(''.join(tms)))
        log_fail.write("\n-------------------------------------------")
    elif not (status) and (tms == None):
        log_fail.write("\nDump is not complete for "+str(db) + str(file_name) + " , end of file is not correct")
        log_fail.write("\n-------------------------------------------")

suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
def humansize(nbytes):
    if nbytes == 0: return '0 B'
    i = 0
    while nbytes >= 1024 and i < len(suffixes)-1:
        nbytes /= 1024.
        i += 1
    f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
    return '%s %s' % (f, suffixes[i])

def dump_size(dump_file, file_name, db):
    size = os.path.getsize(dump_file)
    if (size < 1024):
        human_readable = humansize(size)
        log_fail.write("\nDump is empty for " +str(db) + "\n" +"\t" + str (file_name)+", file size is " + str(human_readable))
        log_fail.write("\n-------------------------------------------")

def report_to_noc(subject, text):
    TEXT = text
    SUBJECT = subject
    message = 'Subject: %s\n\n%s' % (SUBJECT, TEXT)
    server = smtplib.SMTP(smtp_daemon_host)
    server.sendmail(sender, receiver, message)
    server.quit()

try:
    con = psycopg2.connect(database='**', user='***', password='***', host='****')
    cur = con.cursor()
    cur.execute("""\
        select ad.servicename, (select name from servers where id = ps.server_id) as servername
        from packages as p, account_data as ad, package_servers as ps
        where p.id=ad.package_id and
        p.date_deleted IS NULL and
        p.id=ps.package_id and
        p.aktuel IS NULL and
        p.pre_def_package_id = 4 and
        p.mother_package_id !=0 and
        ps.subservice_id=5 and
        p.mother_package_id NOT IN (select id from packages where date_deleted IS NOT NULL)
        ORDER BY servername;
        """)
    while (1):
        row = cur.fetchone()
        if row == None:
            break
        db = row[0]
        server_name = str(row[1])
        if (''.join(server_name) == 'SKIP_THIS') or (''.join(server_name) == 'SKIP_THIS'):
            continue
        else:
            db_backup_file = '/storage/backup/db/mysql/' + str(db) + '/current/' + str(db) + '.mysql.gz'
            db_backup_file2 = '/storage/backup/' + str(''.join(server_name.split("DB"))) + '/mysql/' + str(db) + '/current/' + str(db) + '.mysql.gz'
            db_file_does_not_exist = False
            db_file2_does_not_exist = False
            if db_backup_file_does_not_exist(db_backup_file):
                db_file_does_not_exist = True
            if db_backup_file_does_not_exist(db_backup_file2):
                db_file2_does_not_exist = True
            if db_file_does_not_exist and db_file2_does_not_exist:
                log_fail.write("\nMySQL dump does not exist for " + str(db) + "\n" + "\t" + str(db_backup_file2) + "\n" + "\t" + str(db_backup_file))
                log_fail.write("\n-------------------------------------------")
                continue
            elif (db_file_does_not_exist) and not (db_file2_does_not_exist):
                p_zcat = Popen(["zcat", db_backup_file2], stdout=PIPE)
                p_tail = Popen(["tail", "-2"], stdin=p_zcat.stdout, stdout=PIPE)
                dump_status = str(p_tail.communicate()[0])
                dump_health(dump_status, db_backup_file2, db)
                dump_size(db_backup_file2, db_backup_file2, db)
            elif (db_file2_does_not_exist) and not (db_file_does_not_exist):
                p_zcat = Popen(["zcat", db_backup_file], stdout=PIPE)
                p_tail = Popen(["tail", "-2"], stdin=p_zcat.stdout, stdout=PIPE)
                dump_status = str(p_tail.communicate()[0])
                dump_health(dump_status, db_backup_file, db)
                dump_size(db_backup_file, db_backup_file, db)
    con.close()
except psycopg2.DatabaseError, e:
    print 'Error %s' % e
    sys.exit(1)

log_fail.close()

if os.path.getsize(log_dump_fail) > 0:
    subject = "Not all MySQL dumps completed successfully. Log file backup:" + str(log_dump_fail)
    fh = open(log_dump_fail, 'r')
    text = fh.read()
    fh.close()
    report_to_noc(subject, text)
else:
    subject = "MySQL dump completed successfully for all DBs, listed in PC"
    text = "Hello! \nI am notifying you that I checked mysqldump files this morning.\nThere is nothing to worry about. :)"
    report_to_noc(subject, text)
You can process your log file after it has been written.
One option is to read your file and sort the lines:
lines = open('log.txt').readlines()
lines.sort()
open('log_sorted.txt', 'w').write("".join(lines))
This won't emit an empty line between log types.
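If a blank line between the groups is wanted, a small variation (a sketch, not part of the original answer) using itertools.groupby on the sorted lines:

from itertools import groupby

lines = sorted(open('log.txt').readlines())
with open('log_grouped.txt', 'w') as out:
    for _, group in groupby(lines):
        out.writelines(group)
        out.write('\n')  # blank separator between message types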
Another option is to use a Counter:
from collections import Counter
lines = open('log.txt').readlines()
counter = Counter()
for line in lines:
    counter[line] += 1

out_file = open('log_sorted.txt', 'w')
for line, num in counter.iteritems():
    out_file.write(line * num + "\n")
Looks like you want to group the output of the script, rather than log the info as it comes while searching.
Easiest would be to maintain four lists, one each for old, empty, not complete, and does not exist. In the script, add the db names to the appropriate list instead of logging, and then dump the lists one by one into the file with appropriate prefixes ("Dump is empty for " + dbname).
For example, remove all the log_fail.write() calls from the functions, replace them with list.append(), and write a separate function that writes to the log file as you like:
Add lists:
db_dump_is_old_list = []
db_dump_is_empty_list = []
db_dump_is_not_complete_list = []
db_dump_does_not_exist_list = []
Modify the Functions:
def dump_health(last_dump_row, file_name, db):
    last_row = last_dump_row.rsplit(" ")
    tms = ''.join(last_row[4:5])
    status = last_row[1:3]
    if (status) and (tms != today):
        db_dump_is_old_list.append(str(db))
        #log_fail.write("\nDB is old for "+ str(db) + str(file_name) + ", \nDump finished at " + str(''.join(tms)))
        #log_fail.write("\n-------------------------------------------")
    elif not (status) and (tms == None):
        db_dump_is_not_complete_list.append(str(db))
        #log_fail.write("\nDump is not complete for "+str(db) + str(file_name) + " , end of file is not correct")
        #log_fail.write("\n-------------------------------------------")

def dump_size(dump_file, file_name, db):
    size = os.path.getsize(dump_file)
    if (size < 1024):
        human_readable = humansize(size)
        db_dump_is_empty_list.append(str(db))
        #log_fail.write("\nDump is empty for " +str(db) + "\n" +"\t" + str (file_name)+", file size is " + str(human_readable))
        #log_fail.write("\n-------------------------------------------")

# ...and in the main loop:
if db_file_does_not_exist and db_file2_does_not_exist:
    db_dump_does_not_exist_list.append(str(db))
    #log_fail.write("\nMySQL dump does not exist for " + str(db) + "\n" + "\t" + str(db_backup_file2) + "\n" + "\t" + str(db_backup_file))
    #log_fail.write("\n-------------------------------------------")
    continue
And add a logger function:
def dump_info_to_log_file():
    log_dump_fail = '/tmp/mysqldump_FAIL'
    log_fail = open(log_dump_fail, 'w').close()
    log_fail = open(log_dump_fail, 'a')
    for dbname in db_dump_is_old_list:
        log_fail.write("Dump is Old for " + str(dbname))
    log_fail.write("\n\n")
    for dbname in db_dump_is_empty_list:
        log_fail.write("Dump is Empty for " + str(dbname))
    log_fail.write("\n\n")
    for dbname in db_dump_is_not_complete_list:
        log_fail.write("Dump is Not Complete for " + str(dbname))
    log_fail.write("\n\n")
    for dbname in db_dump_does_not_exist_list:
        log_fail.write("Dump Does Not Exist for " + str(dbname))
    log_fail.close()
Or you could simply log as you are doing, and then read in the file, sort and write back the file.
Thank you all for the interesting ideas.
I have really tried all options :)
To my mind:
With the Counter object, the pro is fewer lines of code.
But the con is many read/write operations. The log file is not big; however, I decided to reduce the number of reads/writes.
With arrays, the con is more lines of code :) but the pro is that the file is written only once.
So I implemented arrays.. :)
Thank you guys!!!

Python non-ascii characters

I have a Python file that creates and populates a table in MS SQL. The only sticking point is that the code breaks if there are any non-ASCII characters or single apostrophes (and there are quite a few of each). Although I can run the replace function to rid the strings of apostrophes, I would prefer to keep them intact. I have also tried converting the data into UTF-8, but no luck there either.
Below are the error messages I get:
"'ascii' codec can't encode character u'\u2013' in position..." (for non-ASCII characters)
and for the single quotes:
<class 'pyodbc.ProgrammingError'>: ('42000', "[42000] [Microsoft][ODBC SQL Server Driver][SQL Server] Incorrect syntax near 'S, 230 X 90M.; Eligibilty....
When I try to encode the string in UTF-8, I instead get the following error message:
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xe2 in position 219: ordinal not in range(128)
The Python code is included below. I believe the point in the code where this breaks is after the following line: InsertValue = str(row.GetValue(CurrentField['Name'])).
# -*- coding: utf-8 -*-
import pyodbc
import sys
import arcpy
import arcgisscripting
gp = arcgisscripting.create(9.3)
SQL_KEYWORDS = ['PERCENT', 'SELECT', 'INSERT', 'DROP', 'TABLE']
#SourceFGDB = '###'
#SourceTable = '###'
SourceTable = sys.argv[1]
TempInputName = sys.argv[2]
SourceTable2 = sys.argv[3]
#---------------------------------------------------------------------------------------------------------------------
# Target Database Settings
#---------------------------------------------------------------------------------------------------------------------
TargetDatabaseDriver = "{SQL Server}"
TargetDatabaseServer = "###"
TargetDatabaseName = "###"
TargetDatabaseUser = "###"
TargetDatabasePassword = "###"
# Get schema from FGDB table.
# This should be an ordered list of dictionary elements [{'FGDB_Name', 'FGDB_Alias', 'FGDB_Type', FGDB_Width, FGDB_Precision, FGDB_Scale}, {}]
if not gp.Exists(SourceTable):
    print ('- The source does not exist.')
    sys.exit(102)

#### Should see if it is actually a table type. Could be a Feature Data Set or something...
print(' - Processing Items From : ' + SourceTable)

FieldList = []
Field_List = gp.ListFields(SourceTable)

print(' - Getting number of rows.')
result = gp.GetCount_management(SourceTable)
Number_of_Features = gp.GetCount_management(SourceTable)
print(' - Number of Rows: ' + str(Number_of_Features))

print(' - Getting fields.')
Field_List1 = gp.ListFields(SourceTable, 'Layer')
Field_List2 = gp.ListFields(SourceTable, 'Comments')
Field_List3 = gp.ListFields(SourceTable, 'Category')
Field_List4 = gp.ListFields(SourceTable, 'State')
Field_List5 = gp.ListFields(SourceTable, 'Label')
Field_List6 = gp.ListFields(SourceTable, 'DateUpdate')
Field_List7 = gp.ListFields(SourceTable, 'OBJECTID')

for Current_Field in Field_List1 + Field_List2 + Field_List3 + Field_List4 + Field_List5 + Field_List6 + Field_List7:
    print(' - Field Found: ' + Current_Field.Name)
    if Current_Field.AliasName in SQL_KEYWORDS:
        Target_Name = Current_Field.Name + '_'
    else:
        Target_Name = Current_Field.Name
    print(' - Alias : ' + Current_Field.AliasName)
    print(' - Type : ' + Current_Field.Type)
    print(' - Length : ' + str(Current_Field.Length))
    print(' - Scale : ' + str(Current_Field.Scale))
    print(' - Precision: ' + str(Current_Field.Precision))
    FieldList.append({'Name': Current_Field.Name, 'AliasName': Current_Field.AliasName, 'Type': Current_Field.Type, 'Length': Current_Field.Length, 'Scale': Current_Field.Scale, 'Precision': Current_Field.Precision, 'Unique': 'UNIQUE', 'Target_Name': Target_Name})

# Create table in SQL Server based on FGDB table schema.
cnxn = pyodbc.connect(r'DRIVER={SQL Server};SERVER=###;DATABASE=###;UID=sql_webenvas;PWD=###')
cursor = cnxn.cursor()

#### DROP the table first?
try:
    DropTableSQL = 'DROP TABLE dbo.' + TempInputName + '_Test;'
    print DropTableSQL
    cursor.execute(DropTableSQL)
    cnxn.commit()
except:
    print('WARNING: Can not drop table - may not exist: ' + TempInputName + '_Test')

CreateTableSQL = ('CREATE TABLE ' + TempInputName + '_Test '
                  ' (Layer varchar(500), Comments varchar(5000), State int, Label varchar(500), DateUpdate DATETIME, Category varchar(50), OBJECTID int)')
cursor.execute(CreateTableSQL)
cnxn.commit()

# Cursor through each row in the FGDB table, get values, and insert into the SQL Server Table.
# We got Number_of_Features earlier, just use that.
Number_Processed = 0
print(' - Processing ' + str(Number_of_Features) + ' features.')
rows = gp.SearchCursor(SourceTable)
row = rows.Next()
while row:
    if Number_Processed % 10000 == 0:
        print(' - Processed ' + str(Number_Processed) + ' of ' + str(Number_of_Features))
    InsertSQLFields = 'INSERT INTO ' + TempInputName + '_Test ('
    InsertSQLValues = 'VALUES ('
    for CurrentField in FieldList:
        InsertSQLFields = InsertSQLFields + CurrentField['Target_Name'] + ', '
        InsertValue = str(row.GetValue(CurrentField['Name']))
        if InsertValue in ['None']:
            InsertValue = 'NULL'
        # Use an escape quote for the SQL.
        InsertValue = InsertValue.replace("'","' '")
        if CurrentField['Type'].upper() in ['STRING', 'CHAR', 'TEXT']:
            if InsertValue == 'NULL':
                InsertSQLValues = InsertSQLValues + "NULL, "
            else:
                InsertSQLValues = InsertSQLValues + "'" + InsertValue + "', "
        elif CurrentField['Type'].upper() in ['GEOMETRY']:
            ## We're not handling geometry transfers at this time.
            if InsertValue == 'NULL':
                InsertSQLValues = InsertSQLValues + '0' + ', '
            else:
                InsertSQLValues = InsertSQLValues + '1' + ', '
        else:
            InsertSQLValues = InsertSQLValues + InsertValue + ', '
    InsertSQLFields = InsertSQLFields[:-2] + ')'
    InsertSQLValues = InsertSQLValues[:-2] + ')'
    InsertSQL = InsertSQLFields + ' ' + InsertSQLValues
    ## print InsertSQL
    cursor.execute(InsertSQL)
    cnxn.commit()
    Number_Processed = Number_Processed + 1
    row = rows.Next()

print(' - Processed all ' + str(Number_Processed))
del row
del rows
James, I believe the real issue is that you are not using Unicode across the board. Try the following:
Make sure that the input file you are using to populate the DB is in UTF-8 and that you are reading it with the UTF-8 codec.
Make sure your DB is actually storing the data as Unicode.
When you retrieve data from the file or from the DB, or want to manipulate strings (with the + operator, for instance), you need to make sure that all parts are proper Unicode. You can NOT use the str() method. You need to use unicode(), as Dave pointed out. If you define strings in your code, use u'my string' instead of 'my string' (otherwise it is not considered Unicode).
Also, please provide the full stack trace and the exception name.
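A minimal illustration of that advice (a sketch only; it borrows names from the question's code and picks the Comments column arbitrarily), in Python 2 like the rest of the post: decode once on input, build the statement as unicode, encode once at the boundary.

value = row.GetValue(CurrentField['Name'])
if isinstance(value, str):
    InsertValue = value.decode('utf-8')   # byte string from the source -> unicode
else:
    InsertValue = unicode(value)          # numbers, dates, None, ...
InsertValue = InsertValue.replace(u"'", u"''")  # standard SQL escaping of single quotes
InsertSQL = u"INSERT INTO " + TempInputName + u"_Test (Comments) VALUES ('" + InsertValue + u"')"
cursor.execute(InsertSQL.encode('utf-8'))  # encode at the boundary

Parameterized queries (cursor.execute(sql, params) with ? placeholders in pyodbc) would sidestep both the quoting and most of the encoding juggling, though that goes beyond the Unicode point being made here.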
I'm going to use my psychic debugging skills and say you are trying to str()ify something and getting an error with the ascii codec. What you really should do is to use the utf-8 codec instead like this:
insert_value_uni = unicode(row.GetValue(CurrentField['Name']))
InsertValue = insert_value_uni.encode('utf-8')
Or you can take the view that only ASCII is allowed and use the awesomely named Unicode Hammer
In general you want to convert to unicode on data input, and convert to the desired encoding on output.
So it will be easier to find your problem if you do this. This means changing all strings to unicode, 'INSERT INTO ' to u'INSERT INTO '. (Notice the "u" before the string)
Then, when you send the string to be executed, convert it to the desired encoding, "utf8".
cursor.execute(InsertSQL.encode("utf8")) # Where InsertSQL is unicode
Also, you should add the encoding string to the top of your source code.
This means adding the encoding cookie to one of the first two lines of the file:
#!/usr/bin/python
# -*- coding: <encoding name> -*-
If you're pulling data from a file to build your string, you can use codecs.open to automatically convert from a specific encoding to unicode on load.
When I converted my str() to unicode, that solved the problem. A simple answer, and I appreciate everyone's help on this.
