Below is my code to connect to Teradata through a JDBC connection using JayDeBeApi. When I run the query in the RazorSQL GUI, it takes only 15 seconds. When I run it through the code below, it took over 20 minutes just to run query1.
Is there anything wrong with JayDeBeApi, or can I make it faster by optimizing my query or using JPype?
#-*- coding: utf-8 -*-
import jaydebeapi
import jpype
import pandas as pd
import numpy as np
import collections
query_dict=collections.OrderedDict()
connection = jaydebeapi.connect('com.teradata.jdbc.TeraDriver', ['my_db_name','my_username','my_password'], ['/Applications/drivers/tdgssconfig.jar','/Applications/drivers/terajdbc4.jar'],)
cur = connection.cursor()
query_name_list = ['query1', 'query2']
query1 = """select ......"""
query2 = """ select ....."""

# collect the query strings into an ordered dict keyed by name
for name in query_name_list:
    query_dict[name] = locals()[name]
print(query_dict.keys())

# run each query and fetch the full result set
for name, tera_query in query_dict.items():
    cur.execute(tera_query)
    print("executing ... ")
    result = cur.fetchall()
    print("fetching results ... ")
I've already posted about some performance considerations. Here they are again:
...fetching of large result sets with the JPype implementation causes JNI calls for every single cell value, which creates a lot of overhead...
Minimize the size of your result set. Do aggregations using SQL functions (see the sketch below).
Give the newest implementation of JPype1 a try. There have been some performance improvements.
Switch your runtime to Jython (JayDeBeApi works on Jython as well).
Implement the DB queries and data extraction directly in Java and call that logic using JPype, through an interface that does not return a large data set.
Try to improve the JPype and JayDeBeApi code.
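One way to act on the first point with JayDeBeApi is to aggregate on the server and pull the (much smaller) result in batches. Here is a minimal sketch reusing the connection setup from the question; the table and column names in the aggregated query are made up, and the batching uses the standard DB-API fetchmany():

import jaydebeapi

# same connection setup as in the question
connection = jaydebeapi.connect(
    'com.teradata.jdbc.TeraDriver',
    ['my_db_name', 'my_username', 'my_password'],
    ['/Applications/drivers/tdgssconfig.jar',
     '/Applications/drivers/terajdbc4.jar'],
)
cur = connection.cursor()

# hypothetical query: aggregate on the server so only a small
# summary crosses the JDBC/JNI boundary instead of raw rows
cur.execute("""
    select some_group_col, count(*) as cnt, sum(some_value) as total
    from some_table
    group by some_group_col
""")

# fetch in modest batches; each cell still costs a JNI call,
# so the real win comes from shrinking the result set itself
while True:
    batch = cur.fetchmany(1000)
    if not batch:
        break
    for row in batch:
        pass  # process row

cur.close()
connection.close()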
BTW: the subject "... if 2 driver files" is a bit misleading. The number of driver files is definitely not related to the performance issue.
Related
I'm executing a long Python function that reads data from two MySQL tables, appends it, transforms it, and writes the outputs back to SQL. For some reason the first write, which uses 'append', works fine, but the second, which uses 'replace', freezes: the script keeps running and I can't execute SQL commands through the terminal, yet nothing happens. The write is very small; when I disconnect (restart the MySQL service) and do the write as a separate line of code, there's no problem and it takes less than a second. Here's the code I'm using:
from sqlalchemy import create_engine
import pandas as pd
def local_connect(login_local, password_local, database_local):
    engine_input = "mysql://" + login_local + ":" + password_local + "@localhost/" + database_local
    engine = create_engine(engine_input)
    con = engine.connect()
    return con

con_ = local_connect(login_local, password_local, database_local)
...
aaa.to_sql(name='aaa', con=con_, if_exists='append', index=True)
bbb.to_sql(name='bbb', con=con_, if_exists='replace', index=True)
aaa and bbb are pandas dataframes.
EDIT: solved it by reconnecting after the first to_sql:
con_.close()
con_=connect.local_connect(login_local,password_local,database_local)
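Putting the edit together, the working sequence looks roughly like this (a sketch reusing the local_connect helper and the aaa/bbb dataframes from above):

con_ = local_connect(login_local, password_local, database_local)
aaa.to_sql(name='aaa', con=con_, if_exists='append', index=True)

# reconnect before the 'replace' write, per the edit above
con_.close()
con_ = local_connect(login_local, password_local, database_local)
bbb.to_sql(name='bbb', con=con_, if_exists='replace', index=True)
con_.close()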
What is the reason and a better way to do this?
I am trying to learn how to get Microsoft SQL query results using Python and the pyodbc module, and have run into an issue: I'm not getting the same results with the same query that I use in Microsoft SQL Server Management Studio.
I've looked at the pyodbc documentation and set up my connection correctly... at least I'm not getting any connection errors at execution. The only issue seems to be returning the table data.
import pyodbc
import sys
import csv
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=<server>;DATABASE=<db>;UID=<uid>;PWD=<PWD>')
cursor = cnxn.cursor()
cursor.execute("""
SELECT request_id
From audit_request request
where request.reception_datetime between '2019-08-18' and '2019-08-19' """)
rows = cursor.fetchall()
for row in cursor:
    print(row.request_id)
When I run the above code I get this in the Python terminal window:
Process returned 0 (0x0) execution time : 0.331 s
Press any key to continue . . .
I tried this same query in SQL Server Management Studio and it returns the results I am looking for. There must be something I'm missing as far as displaying the results using Python.
You're not actually setting your cursor up to be used. You should have something like this before executing:
cursor = cnxn.cursor()
Learn more here: https://github.com/mkleehammer/pyodbc/wiki/Connection#cursor
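For reference, a minimal sketch of that cursor workflow end to end, reusing the connection string and query from the question (server, database, and credentials are placeholders):

import pyodbc

cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=<server>;DATABASE=<db>;UID=<uid>;PWD=<PWD>')
cursor = cnxn.cursor()
cursor.execute("""
    SELECT request_id
    FROM audit_request request
    WHERE request.reception_datetime BETWEEN '2019-08-18' AND '2019-08-19'
""")

# fetchall() returns pyodbc Row objects; columns are accessible by name
for row in cursor.fetchall():
    print(row.request_id)

cnxn.close()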
Hi everyone,
I'm trying to wrap my head around Microsoft SQL Server 2017 and Python scripts. In general, I'm trying to store a table I took from a website (using bs4) in a pandas DataFrame and then simply put the results into a temp SQL table. I entered the following code (I'm skipping parts of the Python script because it does work when run in Python; keep in mind I'm calling the script from Microsoft SQL Server 2017):
CREATE PROC OTC
AS
BEGIN
EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import bs4 as bs
import pandas as pd
import requests
....
r = requests.get(url, verify = False)
html = r.text
soup = bs.BeautifulSoup(html, "html.parser")
data_date = str(soup.find(id="ctl00_SPWebPartManager1_g_4be2cf24_5a47_472d_a6ab_4248c8eb10eb_ctl00_lDate").contents)
t_tab1 = soup.find(id="ctl00_SPWebPartManager1_g_4be2cf24_5a47_472d_a6ab_4248c8eb10eb_ctl00_NiaROGrid1_DataGrid1")
df = parse_html_table(1,t_tab1)
print(df)
OutputDataSet=df
'
I tried the Microsoft tutorials and simply couldn't understand how to handle the inputs/outputs to get the result as a SQL table.
Furthermore, I get the error:
"
import bs4 as bs
ImportError: No module named 'bs4'
"
I'm obviously missing a lot here. What do I need to add to the SQL code? Does SQL Server even support bs4, or only pandas? And would I then need to find another solution, like writing to a CSV?
Thanks for any help or advice you can offer.
To use pip to install a Python package on SQL Server 2017:
On the server, open a command prompt as administrator.
Then cd to {instance directory}\PYTHON_SERVICES\Scripts
(for example: C:\Program Files\Microsoft SQL Server\MSSQL14.SQL2017\PYTHON_SERVICES\Scripts).
Then execute pip install {package name}.
Once you have the necessary package(s) installed and the script executes successfully, simply setting the variable OutputDataSet to a pandas data frame will result in the contents of that data frame being returned as a result set from the stored procedure.
If you want to capture that result set in a table (perhaps a temporary table), you can use INSERT...EXEC (e.g. INSERT MyTable(Col1, Col2) EXEC sp_execute_external_script ...).
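As a minimal illustration of that last point, the Python portion of the script only has to end with a DataFrame assigned to OutputDataSet; inside sp_execute_external_script that assignment is what becomes the procedure's result set (the DataFrame contents here are just placeholder data standing in for the parsed bs4 table):

import pandas as pd

# placeholder data standing in for the table parsed with bs4
df = pd.DataFrame({
    "item": ["apple", "orange"],
    "price": [1.0, 2.0],
})

# inside sp_execute_external_script this variable is picked up
# by SQL Server and returned as the procedure's result set
OutputDataSet = df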
I am trying to make the best of an AWS server and had the idea to use an in-memory database across multiple threads (using SQLite3 in Python). I found this command online:
conn = sqlite3.connect('file::memory:?cache=shared')
but then I get this vague error:
sqlite3.OperationalError: unable to open database file
Is it even possible to do this anymore?
It is still possible. I just verified against Python 3.6.0 and Python 2.7.13 on macOS.
sqlite3.connect("file::memory:?cache=shared") is indeed the correct way to connect to the DB.
import sqlite3
p = sqlite3.connect("file::memory:?cache=shared")
p.execute('CREATE TABLE foo (bar, baz)')
p.execute("INSERT INTO foo VALUES ('apple', 'orange')")
p.commit()
and in another Python shell:
import sqlite3
q = sqlite3.connect("file::memory:?cache=shared")
list(q.execute('SELECT * FROM foo'))
my output is [(u'apple', u'orange')]
To answer your question "Is it even possible to do this anymore?": yes, it is. So the problem lies in your system, as you confirmed it works on AWS (in the comments below).
In Python 3.4+ and SQLite 3.7.13+, you can use this approach:
sqlite3.connect("file:memory?cache=shared&mode=memory", uri=True)
I have a query that returns over 125K rows.
The goal is to write a script that iterates through the rows and, for each one, populates a second table with data processed from the result of the query.
To develop the script, I created a duplicate database with a small subset of the data (4126 rows).
On the small database, the following code works:
import os
import sys
import random
import mysql.connector
cnx = mysql.connector.connect(user='dbuser', password='thePassword',
                              host='127.0.0.1',
                              database='db')
cnx_out = mysql.connector.connect(user='dbuser', password='thePassword',
                                  host='127.0.0.1',
                                  database='db')
ins_curs = cnx_out.cursor()
curs = cnx.cursor(dictionary=True)
#curs = cnx.cursor(dictionary=True,buffered=True)  #fail

with open('sql\\getRawData.sql') as fh:
    sql = fh.read()

curs.execute(sql, params=None, multi=False)
result = curs.fetchall()   #<=== script stops at this point
print(len(result))         #<=== this line never executes
print(curs.column_names)

curs.close()
cnx.close()
cnx_out.close()
sys.exit()
The line curs.execute(sql, params=None, multi=False) succeeds on both the large and small databases.
If I use curs.fetchone() in a loop, I can read all records.
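For reference, a sketch of the fetchone() loop that does work, reusing the connection details from the script above:

import mysql.connector

cnx = mysql.connector.connect(user='dbuser', password='thePassword',
                              host='127.0.0.1', database='db')
curs = cnx.cursor(dictionary=True)

with open('sql\\getRawData.sql') as fh:
    sql = fh.read()

curs.execute(sql)

# pull one row at a time instead of calling fetchall()
row = curs.fetchone()
while row is not None:
    # ... process the dictionary-style row here ...
    row = curs.fetchone()

curs.close()
cnx.close()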
If I alter the line:
curs = cnx.cursor(dictionary=True)
to read:
curs = cnx.cursor(dictionary=True,buffered=True)
The script hangs at curs.execute(sql, params=None, multi=False).
I can find no documentation on any limits to fetchall(), nor can I find any way to increase the buffer size, and no way to tell how large a buffer I even need.
There are no exceptions raised.
How can I resolve this?
I was having this same issue, first on a query that returned ~70k rows and then on one that only returned around 2k rows (and for me, RAM was also not the limiting factor). I switched from using mysql.connector (i.e. the mysql-connector-python package) to MySQLdb (i.e. the mysql-python package) and was then able to fetchall() on large queries with no problem. Both packages seem to follow the Python DB API, so for me MySQLdb was a drop-in replacement for mysql.connector, with no code changes necessary beyond the line that sets up the connection. YMMV if you're leveraging something specific to mysql.connector.
Pragmatically speaking, if you don't have a specific reason to be using mysql.connector, the solution is simply to switch to a package that works better!
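A sketch of what that swap can look like, keeping the rest of the script unchanged; MySQLdb comes from the mysql-python (or mysqlclient) package, and DictCursor stands in for mysql.connector's dictionary=True:

import MySQLdb
import MySQLdb.cursors

# the only line that really changes: MySQLdb uses passwd/db
# instead of password/database
cnx = MySQLdb.connect(user='dbuser', passwd='thePassword',
                      host='127.0.0.1', db='db',
                      cursorclass=MySQLdb.cursors.DictCursor)

curs = cnx.cursor()
with open('sql\\getRawData.sql') as fh:
    sql = fh.read()

curs.execute(sql)
result = curs.fetchall()   # completes even for large result sets
print(len(result))

curs.close()
cnx.close()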