I have a function which, when passed server, database, table, and access details, connects to a table in SQL Server and reads all of its contents into a pandas DataFrame:
import pypyodbc
import pandas as pd

def GET_DATA(source_server, source_database, source_table, source_username, source_password):
    print('******* GETTING DATA ', source_server, '.', source_database, '.', source_table, '.', source_username, '*******')
    data_collected = []
    #SOURCE
    connection = pypyodbc.connect('Driver={ODBC Driver 17 for SQL Server};'
                                  'Server=' + source_server + ';'
                                  'Database=' + source_database + ';'
                                  'uid=' + source_username + ';pwd=' + source_password)
    #OPEN THE CONNECTION
    cursor = connection.cursor()
    #BUILD THE COMMAND
    SQLCommand = "SELECT * FROM " + source_database + ".dbo." + source_table
    #RUN THE QUERY
    cursor.execute(SQLCommand)
    #GET RESULTS
    results = cursor.fetchone()
    columnList = [column[0] for column in cursor.description]
    #print(type(columnList))
    while results:
        data_collected.append(results)
        results = cursor.fetchone()
    df_column = pd.DataFrame(columnList).transpose()
    df_result = pd.DataFrame(data_collected)
    df = pd.concat([df_column, df_result])
    print('GET_DATA COMPLETE!')
    return df
Most of the time this works fine; however, for reasons I can't identify, I sometimes get this error:
sequence item 0: expected str instance, bytes found
What is causing this, and how do I account for it?
Thanks!
I found a much better way of extracting data from SQL Server into pandas:
import pyodbc
import pandas as pd

def GET_DATA_TO_PANDAS(source_server, source_database, source_table, source_username, source_password):
    print('***** STARTING DATA TO PANDAS ********* ')
    con = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};'
                         'Server=' + source_server + ';'
                         'Database=' + source_database + ';'
                         'uid=' + source_username + ';pwd=' + source_password)
    #BUILD QUERY
    query = "SELECT * FROM " + source_database + ".dbo." + source_table
    df = pd.read_sql(query, con)
    return df
Used this link - https://www.quora.com/How-do-I-get-data-directly-from-databases-DB2-Oracle-MS-SQL-Server-into-Pandas-DataFrames-using-Python
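For reference, a minimal call of the helper above might look like this (the server, database, table, and credentials are placeholders, not values from the original post):

df = GET_DATA_TO_PANDAS('my_server', 'my_database', 'my_table', 'my_user', 'my_password')
print(df.head())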
I experienced a similar issue in one of my projects. This exception was raised by the Microsoft ODBC driver. In my view, the issue probably occurred while fetching the results from the DB, maybe at this line:
cursor.fetchone()
As far as I understand it, the reason for this exception is the size of the data received from SQL Server by Python. There may be one specific huge row in the DB that is causing it. If the row has unicode or non-ASCII characters, the driver exceeds its buffer length and cannot convert the nvarchar data to bytes and then from a bytes object back to a string. When it encounters such special characters, the driver hands a bytes object back to Python instead of a string, and I think that's the reason for the exception.
Digging a bit deeper into that specific data row might help you.
There is a similar issue reported elsewhere, and the Microsoft ODBC driver's known-issues page may also be worth checking.
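If you cannot track down the offending row, one defensive option is to decode any bytes values yourself while fetching. This is only a sketch under the assumption that the stray bytes are UTF-16-LE encoded nvarchar payloads (which is not guaranteed), and fetch_all_decoded is a hypothetical helper, not part of the original code:

def fetch_all_decoded(cursor, encoding='utf-16-le'):
    # Fetch every row, decoding bytes values so downstream string handling
    # never mixes str and bytes.
    rows = []
    row = cursor.fetchone()
    while row:
        rows.append(tuple(
            value.decode(encoding, errors='replace') if isinstance(value, bytes) else value
            for value in row))
        row = cursor.fetchone()
    return rows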
I got the same error using Python 3, as follows: I defined an MS SQL column as nchar, stored an empty string (which in Python 3 is unicode), then retrieved the row with the pypyodbc call cursor.fetchone(). It failed on this line:
if raw_data_parts != []:
    if py_v3:
        if target_type != SQL_C_BINARY:
            raw_value = ''.join(raw_data_parts)
            # FAILS WITH "sequence item 0: expected str instance, bytes found"
            ...
Changing the column datatype to nvarchar in the database fixed it.
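If you cannot change the schema by hand, the same fix can be applied with a one-off statement from Python; the table and column names below are placeholders, not taken from the original post:

# Hypothetical example: widen an nchar column to nvarchar so the driver returns str
cursor.execute("ALTER TABLE my_table ALTER COLUMN my_col nvarchar(100)")
connection.commit()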
I am pretty new to Python development. I have a long Python script that "clones" a database and adds additional stored functions and procedures. Cloning here means copying only the schema of the DB. These steps work fine.
My question is about pymysql insert execution:
I have to copy some table contents into the new DB. I don't get any SQL error. If I debug or print the created INSERT INTO command, it is correct (I've tested it in an SQL editor). The insert execution seems correct because the result contains the exact row count... but all rows are missing from the destination table in the destination DB...
(Of course the DB_* variables have been defined!)
import pymysql
import datetime

liveDbConn = pymysql.connect(DB_HOST, DB_USER, DB_PWD, LIVE_DB_NAME)
testDbConn = pymysql.connect(DB_HOST, DB_USER, DB_PWD, TEST_DB_NAME)
tablesForCopy = ['role', 'permission']
for table in tablesForCopy:
    with liveDbConn.cursor() as liveCursor:
        # Get name of columns
        liveCursor.execute("DESCRIBE `%s`;" % (table))
        columns = ''
        for column in liveCursor.fetchall():
            columns += '`' + column[0] + '`,'
        columns = columns.strip(',')

        # Get and convert values
        values = ''
        liveCursor.execute("SELECT * FROM `%s`;" % (table))
        for result in liveCursor.fetchall():
            data = []
            for item in result:
                if type(item) == type(None):
                    data.append('NULL')
                elif type(item) == type('str'):
                    data.append("'" + item + "'")
                elif type(item) == type(datetime.datetime.now()):
                    data.append("'" + str(item) + "'")
                else:  # for numeric values
                    data.append(str(item))
            v = '(' + ', '.join(data) + ')'
            values += v + ', '
        values = values.strip(', ')

    print("### table: %s" % (table))
    testDbCursor = testDbConn.cursor()
    testDbCursor.execute("INSERT INTO `" + TEST_DB_NAME + "`.`" + table + "` (" + columns + ") VALUES " + values + ";")
    print("Result: {}".format(testDbCursor._result.message))

liveDbConn.close()
testDbConn.close()
Result is:
### table: role
Result: b"'Records: 16 Duplicates: 0 Warnings: 0"
### table: permission
Result: b'(Records: 222 Duplicates: 0 Warnings: 0'
What am I doing wrong? Thanks!
You have 2 main issues here:
You don't use conn.commit() (which here would be either liveDbConn.commit() or testDbConn.commit()). Changes to the database will not be reflected without committing them. Note that all changes need committing, but a SELECT, for example, does not.
Your query is open to SQL Injection. This is a serious problem.
Table names cannot be parameterized, so there's not much we can do about that, but you'll want to parameterize your values. I've made multiple corrections to the code in relation to type checking as well as parameterization.
for table in tablesForCopy:
    with liveDbConn.cursor() as liveCursor:
        liveCursor.execute("SELECT * FROM `%s`;" % (table))
        name_of_columns = [item[0] for item in liveCursor.description]
        insert_list = []
        for result in liveCursor.fetchall():
            data = []
            for item in result:
                if item is None:  # test identity against the None singleton
                    data.append(None)  # lets the driver send a real SQL NULL
                elif isinstance(item, str):  # use isinstance to check types
                    data.append(item)
                elif isinstance(item, datetime.datetime):
                    data.append(item.strftime('%Y-%m-%d %H:%M:%S'))
                else:  # for numeric values
                    data.append(str(item))
            insert_list.append(data)

    columns = ', '.join('`{}`'.format(col) for col in name_of_columns)
    placeholders = ', '.join(['%s'] * len(insert_list[0]))
    testDbCursor = testDbConn.cursor()
    testDbCursor.executemany("INSERT INTO `{}`.`{}` ({}) VALUES ({})".format(
        TEST_DB_NAME, table, columns, placeholders),
        insert_list)
    testDbConn.commit()
From this github thread, I notice that executemany does not work as expected in psycopg2; it instead sends each entry as a single query. You'll need to use execute_batch:
from psycopg2.extras import execute_batch

execute_batch(testDbCursor,
              "INSERT INTO `{}`.`{}` ({}) VALUES ({})".format(
                  TEST_DB_NAME, table, columns, placeholders),
              insert_list)
testDbConn.commit()
How to insert data into a table using Python pymysql
Find my solution below
import pymysql
import datetime

# Create a connection object
dbServerName = "127.0.0.1"
port = 8889
dbUser = "root"
dbPassword = ""
dbName = "blog_flask"
# charSet = "utf8mb4"
conn = pymysql.connect(host=dbServerName, user=dbUser, password=dbPassword, db=dbName, port=port)

try:
    # Create a cursor object
    cursor = conn.cursor()
    # Insert a row into the MySQL table
    now = datetime.datetime.utcnow()
    my_datetime = now.strftime('%Y-%m-%d %H:%M:%S')
    cursor.execute('INSERT INTO posts (post_id, post_title, post_content, '
                   'filename, post_time) VALUES (%s, %s, %s, %s, %s)',
                   (5, 'title2', 'description2', 'filename2', my_datetime))
    conn.commit()
except Exception as e:
    print("Exception occurred: {}".format(e))
finally:
    conn.close()
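If you need to insert several rows at once, cursor.executemany() accepts the same parameterized statement plus a sequence of tuples. A hedged sketch only: the rows below are made up, and it assumes the same posts table with an open connection and cursor as above:

rows = [
    (6, 'title3', 'description3', 'filename3', my_datetime),
    (7, 'title4', 'description4', 'filename4', my_datetime),
]
cursor.executemany('INSERT INTO posts (post_id, post_title, post_content, '
                   'filename, post_time) VALUES (%s, %s, %s, %s, %s)', rows)
conn.commit()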
I have been trying to loop through a list as a parameter for a database query and convert the results into xlsx format, using the pyodbc, pandas, and xlsxwriter modules.
However, the message below keeps appearing despite much trial and error:
The first argument to execute must be a string or unicode query.
Could this have something to do with the query itself or the module 'pandas'?
Thank you.
This is for exporting a query result to an Excel spreadsheet using pandas and pyodbc, with Python 3.7.
import os
import time
import pyodbc
import pandas as pd

# Database connection
conn = pyodbc.connect(driver='xxxxxx', server='xxxxxxx', database='xxxxxx',
                      user='xxxxxx', password='xxxxxxxx')
cursor = conn.cursor()

depts = ['Human Resources', 'Accounting', 'Marketing']

query = """
SELECT *
FROM device ID
WHERE
    Department like ?
    AND
    Status like 'Active'
"""

target = r'O:\Example'
today = target + os.sep + time.strftime('%Y%m%d')
if not os.path.exists(today):
    os.mkdir(today)

for i in depts:
    cursor.execute(query, i)
    #workbook = Workbook(today + os.sep + i + '.xlsx')
    #worksheet = workbook.add_worksheet()
    data = cursor.fetchall()
    P_data = pd.read_sql(data, conn)
    P_data.to_excel(today + os.sep + i + '.xlsx')
When you read data into a dataframe using pandas.read_sql(), pandas expects the first argument to be a query to execute (in string format), not the results from the query.
Instead of your line:
P_data = pd.read_sql(data, conn)
You'd want to use:
P_data = pd.read_sql(query, conn)
And to filter on the departments, you'd want to serialize the list into an SQL syntax string:
depts = ['Human Resources','Accounting','Marketing']
# gives you the string to use in your sql query:
depts_query_string = "('{query_vals}')".format(query_vals="','".join(depts))
To use the new SQL string in your query, use str.format:
query = """
SELECT *
FROM device ID
WHERE
Department in {query_vals}
AND
Status like 'Active'
""".format(query_vals=depts_query_string)
All together now:
import os
import time
import pyodbc
import pandas as pd

# Database connection
conn = pyodbc.connect(driver='xxxxxx', server='xxxxxxx', database='xxxxxx',
                      user='xxxxxx', password='xxxxxxxx')
cursor = conn.cursor()

depts = ['Human Resources', 'Accounting', 'Marketing']
# gives you the string to use in your sql query:
depts_query_string = "('{query_vals}')".format(query_vals="','".join(depts))

query = """
SELECT *
FROM device ID
WHERE
    Department in {query_vals}
    AND
    Status like 'Active'
""".format(query_vals=depts_query_string)

target = r'O:\Example'
today = target + os.sep + time.strftime('%Y%m%d')
if not os.path.exists(today):
    os.mkdir(today)

for i in depts:
    #workbook = Workbook(today + os.sep + i + '.xlsx')
    #worksheet = workbook.add_worksheet()
    P_data = pd.read_sql(query, conn)
    P_data.to_excel(today + os.sep + i + '.xlsx')
Once you have your query sorted, you can just load directly into a dataframe with the following command.
P_data = pd.read_sql_query(query, conn)
P_data.to_excel('desired_filename.format')
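If you do want one file per department, a hedged alternative to the string formatting above is to keep the original ? placeholder and let pandas bind the parameter for you. This is only a sketch, reusing the depts, today, and conn names from the code above:

per_dept_query = """
SELECT *
FROM device ID
WHERE Department like ? AND Status like 'Active'
"""
for i in depts:
    # params is forwarded to the pyodbc cursor, so the ? placeholder is bound per department
    dept_df = pd.read_sql(per_dept_query, conn, params=[i])
    dept_df.to_excel(today + os.sep + i + '.xlsx')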
We have an assignment where we have to write and query from a database file directly without the use of any sqlite api functions. We were given the site https://www.sqlite.org/fileformat.html as guidance but I can't seem to get the database file to look like anything readable.
Here's a basic example of what I'm trying to do with the sqlite3 library from python
import sqlite3
con = sqlite3.connect("test.db")
cur = con.cursor()
cur.execute("PRAGMA page_size = 4096")
cur.execute("PRAGMA encoding = 'UTF-8'")
dropTable = "DROP TABLE IF EXISTS Employee"
createTable = "CREATE TABLE Employee(first int, second int, third int)"
cur.execute(dropTable)
cur.execute(createTable)
cur.execute("INSERT INTO Employee VALUES (1, 2, 3)")
con.commit()
con.close()
When I open the database file, it starts with "SQLite format 3" followed by a bunch of weird symbols. The same thing happened when I made the database from the actual CSV file given. There are some readable parts, but most of it is unreadable symbols that look nothing like the format specified by the website. I'm a little overwhelmed right now, so I would appreciate anyone pointing me in the right direction on how to begin making sense of this.
Here is an example of reading that binary file using FileIO and struct:
from io import FileIO
from struct import unpack

def read_header(db):
    # read the entire database header into one bytearray
    header = bytearray(100)
    db.readinto(header)

    # print out a few of the values in the header
    # any number that is more than one byte requires unpacking
    # strings require decoding
    print('header string: ' + header[0:15].decode('utf-8'))  # note that this ignores the null byte at header[15]
    page_size = unpack('>h', header[16:18])[0]
    print('page_size = ' + str(page_size))
    print('write version: ' + str(header[18]))
    print('read version: ' + str(header[19]))
    print('reserved space: ' + str(header[20]))
    print('Maximum embedded payload fraction: ' + str(header[21]))
    print('Minimum embedded payload fraction: ' + str(header[22]))
    print('Leaf payload fraction: ' + str(header[23]))
    file_change_counter = unpack('>i', header[24:28])[0]
    print('File change counter: ' + str(file_change_counter))
    sqlite_version_number = unpack('>i', header[96:])[0]
    print('SQLITE_VERSION_NUMBER: ' + str(sqlite_version_number))

db = FileIO('test.db', mode='r')
read_header(db)
This only reads the database header, and ignores most of the values in the header.
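If you want to keep going past the 100-byte file header, the next bytes of page 1 are the b-tree page header described in the same file-format document. This is a small follow-on sketch, not part of the original answer, and it assumes it is called immediately after read_header(db) so the file position is still at offset 100:

def read_page1_btree_header(db):
    # first 8 bytes of the b-tree page header (leaf pages have no right-child pointer)
    page_header = bytearray(8)
    db.readinto(page_header)
    print('page type: ' + str(page_header[0]))  # 13 means a leaf table b-tree page
    print('first freeblock offset: ' + str(unpack('>h', page_header[1:3])[0]))
    print('number of cells: ' + str(unpack('>h', page_header[3:5])[0]))
    print('start of cell content area: ' + str(unpack('>h', page_header[5:7])[0]))
    print('fragmented free bytes: ' + str(page_header[7]))

read_page1_btree_header(db)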
I have 74 relatively large pandas DataFrames (about 34,600 rows and 8 columns each) that I am trying to insert into a SQL Server database as quickly as possible. After doing some research, I learned that the good ole pandas.to_sql function is not good for such large inserts into a SQL Server database, which was the initial approach that I took (very slow - almost an hour for the application to complete, versus about 4 minutes when using a MySQL database).
This article and many other StackOverflow posts have been helpful in pointing me in the right direction; however, I've hit a roadblock:
I am trying to use SQLAlchemy's Core rather than the ORM, for reasons explained in the link above. So I am converting the dataframe to a dictionary using pandas.to_dict and then doing an execute() and insert():
self._session_factory.engine.execute(
    TimeSeriesResultValues.__table__.insert(),
    data)
# 'data' is a list of dictionaries.
The problem is that the insert is not getting any values -- they appear as a bunch of empty parentheses -- and I get this error:
(pyodbc.IntegrityError) ('23000', "[23000] [FreeTDS][SQL Server]Cannot
insert the value NULL into the column...
There are values in the list of dictionaries that I passed in, so I can't figure out why the values aren't showing up.
EDIT:
Here's the example I'm going off of:
def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name": 'NAME ' + str(i)} for i in range(n)]
    )
    print("SQLAlchemy Core: Total time for " + str(n) +
          " records " + str(time.time() - t0) + " secs")
I've got some sad news for you: SQLAlchemy doesn't actually implement bulk imports for SQL Server; it's just going to issue the same slow individual INSERT statements that to_sql does. I would say your best bet is to script something up using the bcp command-line tool. Here is a script that I've used in the past, but no guarantees:
from subprocess import check_output, call
import pandas as pd
import numpy as np
import os

pad = 0.1
tablename = 'sandbox.max.pybcp_test'
overwrite = True
raise_exception = True
server = 'P01'
trusted_connection = True
username = None
password = None
delimiter = '|'

df = pd.read_csv('D:/inputdata.csv', encoding='latin', error_bad_lines=False)

def get_column_def_sql(col):
    if col.dtype == object:
        width = col.str.len().max() * (1 + pad)
        return '[{}] varchar({})'.format(col.name, int(width))
    elif np.issubdtype(col.dtype, float):
        return '[{}] float'.format(col.name)
    elif np.issubdtype(col.dtype, int):
        return '[{}] int'.format(col.name)
    else:
        if raise_exception:
            raise NotImplementedError('data type {} not implemented'.format(col.dtype))
        else:
            print('Warning: casting column {} as varchar; data type {} not implemented'.format(col.name, col.dtype))
            width = col.str.len().max() * (1 + pad)
            return '[{}] varchar({})'.format(col.name, int(width))

def create_table(df, tablename, server, trusted_connection, username, password, pad):
    if trusted_connection:
        login_string = '-E'
    else:
        login_string = '-U {} -P {}'.format(username, password)

    col_defs = []
    for col in df:
        col_defs += [get_column_def_sql(df[col])]
    query_string = 'CREATE TABLE {}\n({})\nGO\nQUIT'.format(tablename, ',\n'.join(col_defs))
    if overwrite == True:
        query_string = "IF OBJECT_ID('{}', 'U') IS NOT NULL DROP TABLE {};".format(tablename, tablename) + query_string

    query_file = 'c:\\pybcp_tempqueryfile.sql'
    with open(query_file, 'w') as f:
        f.write(query_string)

    o = call('sqlcmd -S {} {} -i {}'.format(server, login_string, query_file), shell=True)
    if o != 0:
        raise BaseException("Failed to create table")
    # o = call('del {}'.format(query_file), shell=True)

def call_bcp(df, tablename):
    if trusted_connection:
        login_string = '-T'
    else:
        login_string = '-U {} -P {}'.format(username, password)
    temp_file = 'c:\\pybcp_tempqueryfile.csv'

    # remove the delimiter and change the encoding of the data frame to latin so sql server can read it
    df.loc[:, df.dtypes == object] = df.loc[:, df.dtypes == object].apply(
        lambda col: col.str.replace(delimiter, '').str.encode('latin'))
    df.to_csv(temp_file, index=False, sep='|', errors='ignore')
    o = call('bcp sandbox.max.pybcp_test2 in c:\\pybcp_tempqueryfile.csv -S "localhost" -T -t^| -r\\n -c', shell=True)
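For completeness, a hedged usage sketch of the two helpers above, reusing the module-level settings already defined in the script (nothing here is from the original answer beyond those names):

# create the target table from the DataFrame's dtypes, then bulk-load the CSV with bcp
create_table(df, tablename, server, trusted_connection, username, password, pad)
call_bcp(df, tablename)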
This has just recently been updated as of SQLAlchemy 1.3.0, in case anyone else needs to know. It should make your dataframe.to_sql statement much faster.
https://docs.sqlalchemy.org/en/latest/changelog/migration_13.html#support-for-pyodbc-fast-executemany
engine = create_engine(
    "mssql+pyodbc://scott:tiger@mssql2017:1433/test?driver=ODBC+Driver+13+for+SQL+Server",
    fast_executemany=True)
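A hedged usage sketch: once the engine is created this way, to_sql benefits automatically. The DataFrame, table name, and chunksize below are illustrative, not values from the original post:

df.to_sql('TimeSeriesResultValues', engine, if_exists='append', index=False, chunksize=1000)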
I have this code:
import cx_Oracle

dsn = cx_Oracle.makedsn(hostname, port, sid)
orcl = cx_Oracle.connect(username + '/' + password + '@' + dsn)
curs = orcl.cursor()
sql = "select TEMPLATE from my_table where id ='6'"
curs.execute(sql)
rows = curs.fetchall()
print rows
template = rows[0][0]
orcl.close()
print template.read()
When I do print rows, I get this:
[(<cx_Oracle.LOB object at 0x0000000001D49990>,)]
However, when I do print template.read(), I get this error:
cx_Oracle.DatabaseError: Invalid handle!
So how do I get and read this data? Thanks.
I've found that this happens when the connection to Oracle is closed before the cx_Oracle.LOB.read() method is used.
orcl = cx_Oracle.connect(usrpass + '@' + dbase)
c = orcl.cursor()
c.execute(sq)
dane = c.fetchall()
orcl.close()  # connection closed before reading the LOB into a str
wkt = dane[0][0].read()
And I get: DatabaseError: Invalid handle!
But the following code works:
orcl = cx_Oracle.connect(usrpass + '@' + dbase)
c = orcl.cursor()
c.execute(sq)
dane = c.fetchall()
wkt = dane[0][0].read()
orcl.close()  # connection closed after reading the LOB into a str
Figured it out. I have to do something like this:
curs.execute(sql)
for row in curs:
    print row[0].read()
You basically have to loop through the rows returned by fetchall():
dsn = cx_Oracle.makedsn(hostname, port, sid)
orcl = cx_Oracle.connect(username + '/' + password + '@' + dsn)
curs = orcl.cursor()
sql = "select TEMPLATE from my_table where id ='6'"
curs.execute(sql)
rows = curs.fetchall()
for x in rows:
    list_ = list(x)
    print(list_)
There should be an extra comma in the for loop; see the code below, where I have added a comma after x so each one-element row is unpacked.
dsn = cx_Oracle.makedsn(hostname, port, sid)
orcl = cx_Oracle.connect(username + '/' + password + '@' + dsn)
curs = orcl.cursor()
sql = "select TEMPLATE from my_table where id ='6'"
curs.execute(sql)
rows = curs.fetchall()
for x, in rows:
    print(x)
I had the same problem in a slightly different context. I needed to query a 27,000+ row table, and it turns out that cx_Oracle cuts the connection to the DB after a while.
While a connection to the DB is open, you can use the read() method of the cx_Oracle.LOB object to turn it into a string. But if the query brings back a table that is too big, it won't work, because the connection will stop at some point, and when you try to read the results of the query you'll get an error on the cx_Oracle objects.
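One way to avoid dealing with cx_Oracle.LOB objects at all is an output type handler that fetches CLOB columns directly as strings. This is a hedged sketch based on the pattern in the cx_Oracle documentation, not something from the original answer, and it assumes an already-open connection object:

import cx_Oracle

def output_type_handler(cursor, name, default_type, size, precision, scale):
    # fetch CLOB columns as plain strings instead of LOB locators
    if default_type == cx_Oracle.CLOB:
        return cursor.var(cx_Oracle.LONG_STRING, arraysize=cursor.arraysize)

connection.outputtypehandler = output_type_handler
# subsequent queries (including pandas.read_sql) now return str values for CLOB columns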
I tried many things, like setting connection.callTimeout = 0 (according to the documentation, this means it will wait indefinitely), or using fetchall() and then putting the results into a dataframe or numpy array, but I could never read the cx_Oracle.LOB objects.
If I run the query using pandas.read_sql(query, connection), the dataframe ends up containing cx_Oracle.LOB objects whose connection is closed, making them useless. (Again, this only happens if the table is very big.)
In the end I found a way of getting around this by querying and creating a CSV file immediately afterwards, even though I know it's not ideal.
import cx_Oracle
import pandas as pd

def csv_from_sql(sql: str, path: str = "dataframe.csv") -> bool:
    # 'config' is an external settings object holding the credentials
    try:
        with cx_Oracle.connect(config.username, config.password, config.database,
                               encoding=config.encoding) as connection:
            connection.callTimeout = 0
            data = pd.read_sql(sql, con=connection)
            data.to_csv(path)
            print("FILE CREATED")
    except cx_Oracle.Error as error:
        print(error)
        return False
    finally:
        print("PROCESS ENDED\n")
    return True

def make_query(sql: str, path: str = "dataframe.csv") -> pd.DataFrame:
    if csv_from_sql(sql, path):
        dataframe = pd.read_csv(path)
        return dataframe
    return pd.DataFrame()
This took a long time (about 4 to 5 minutes) to bring in my 27,000+ row table, but it worked when everything else didn't.
If anyone knows a better way, it would be helpful for me too.