Python/Hive interface slow with fetchone(), hangs with fetchall()

Python/Hive interface slow with fetchone(), hangs with fetchall() - python

I have a python script that is querying HiveServer2 using pyhs2, like so:
import pyhs2;
conn = pyhs2.connect(host=localhost,
port=10000,
user='user',
password='password',
database='default');
cur = conn.cursor();
cur.execute("SELECT name,data,number,time FROM table WHERE date = '2014-01-01' AND number in (1,5,6,22) ORDER BY name,time ASC");
line = cur.fetchone();
while line is not None:
<do some processing, including writing to stdout>
.
.
.
line = cur.fetchone();
I have also tried using fetchall() instead of fetchone(), but that just seems to hang forever.
My query runs just fine and returns ~270 million rows. For testing, I dumped the output from Hive into a flat, tab-delimited file and wrote the guts of my python script against that, so I didn't have to wait for the query to finish everytime I ran. My script that reads the flat file will finish in ~20 minutes. What confuses me is that I don't see that same performance when I directly query Hive. In fact, it takes about 5 times longer to finish processing. I am pretty new to Hive, and python so maybe I am making some bone-headed error, but examples that I see online show a set up such as this. I just want to iterate through my Hive return, getting one row at a time as quickly as possible, much like I did using my flat file. Any suggestions?
P.S. I have found this question that sounds similar:
Python slow on fetchone, hangs on fetchall
but that ended up being a SQLite issue, and I have no control over my Hive set up.

Have you considered using fetchmany().
That would be the DBAPI answer for pulling data in chunks (bigger one, where the overhead is an issue, and smaller than all rows, where memory is an issue).

Related

Python - Slow Execution of Stored Procedure in a Loop

I'm writing a python script which connects with Oracle DB. I'm collecting specific Reference ID into a variable and then executing a Stored Procedure in a For Loop. It's working fine but its taking very long time to complete.
Here's a code:
sql = f"SELECT STATEMENT"
cursor.execute(sql)
result = cursor.fetchall()
for i in result:
cursor.callproc('DeleteStoredProcedure', [i[0]])
print("Deleted:", i[0])
The first SQL SELECT Statement collect around 600 Ref IDs but its taking around 3 mins to execute Stored Procedure which is very long if we have around 10K or more record.
BTW, the Stored Procedure is configured to delete rows from three different tables based on the reference ID. And its running quickly from Oracle Toad.
Is there any way to improve the performance?

I think you could create just one store procedure that execute the SELECT STATEMENT and do what ever DeleteStoredProcedure does.
Or, you can use threads to execute every stored procedure https://docs.python.org/3/library/threading.html

Python code doesn't run the SQL stored procedure completely

I am not proficient in Python but I have written a python code that executes a stored procedure (SQL server) which within it contains multiple stored procedures therefore it usually takes 5 mins or so to run on SSMS.
I can see the stored procedure runs halfway through without error when I run the Python code which makes me think that somehow it needs more time to execute when coding in python.
I found other posts where people suggested subprocess but I don't know how to code this. Below is an example of a (not mine) python code to execute the stored procedure.
mydb_lock = pyodbc.connect('Driver={SQL Server Native Client 11.0};'
'Server=localhost;'
'Database=InterelRMS;'
'Trusted_Connection=yes;'
'MARS_Connection=yes;'
'user=sa;'
'password=Passw0rd;')
mycursor_lock = mydb_lock.cursor()
sql_nodes = "Exec IVRP_Nodes"
mycursor_lock.execute(sql_nodes)
mydb_lock.commit()
How can I edit the above code to use the subprocess? Is the subprocess the right choice? Any other method you can suggest?
Many thanks.
Python 2.7 and 3
SQL Server
UPDATE 04/04/2022:
#AlwaysLearning, I tried
NEWcnxn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password+';Connection Timeout=0')
But there was no change. What I noticed is that to check how much of the code it executes, I inserted the following two lines of code right after each other somewhere in the nested procedure where I thought the SP stopped.
INSERT INTO CheckTable (OrgID,Stage,Created) VALUES(#OrgID,2.5331,getdate())
INSERT INTO CheckTable (OrgID,Stage,Created) VALUES(#OrgID,2.5332,getdate())
Only the first query is completed. I use Azure DB if that helps.
UPDATE 05/04/2022:
I tried what #AlwaysLearning suggested, after my connection, I added, NEWconxn.timeout=4000 and it's working now

I tried what #AlwaysLearning suggested, after my connection, I added, NEWconxn.timeout=4000 and it's working now. Many thanks.

SQL Stored Procedures not finishing when called from Python

I'm trying to call a stored procedure in my MSSQL database from a python script, but it does not run completely when called via python. This procedure consolidates transaction data into hour/daily blocks in a single table which is later grabbed by the python script. If I run the procedure in SQL studio, it completes just fine.
When I run it via my script, it gets cut short about 2/3's of the way through. Currently I found a work around, by making the program sleep for 10 seconds before moving on to the next SQL statement, however this is not time efficient and unreliable as some procedures may not finish in that time. I'm looking for a more elegant way to implement this.
Current Code:
cursor.execute("execute mySP")
time.sleep(10)
cursor.commit()
The most related article I can find to my issue is here:
make python wait for stored procedure to finish executing
I tried the solution using Tornado and I/O generators, but ran into the same issue as listed in the article, that was never resolved. I also tried the accepted solution to set a runningstatus field in the database by my stored procedures. At the beginnning of my SP Status is updated to 1 in RunningStatus, and when the SP finished Status is updated to 0 in RunningStatus. Then I implemented the following python code:
conn=pyodbc_connect(conn_str)
cursor=conn.cursor()
sconn=pyodbc_connect(conn_str)
scursor=sconn.cursor()
cursor.execute("execute mySP")
cursor.commit()
while 1:
q=scursor.execute("SELECT Status FROM RunningStatus").fetchone()
if(q[0]==0):
break
When I implement this, the same problem happens as before with my storedprocedure finishing executing prior to it actually being complete. If I eliminate my cursor.commit(), as follows, I end up with the connection just hanging indefinitely until I kill the python process.
conn=pyodbc_connect(conn_str)
cursor=conn.cursor()
sconn=pyodbc_connect(conn_str)
scursor=sconn.cursor()
cursor.execute("execute mySP")
while 1:
q=scursor.execute("SELECT Status FROM RunningStatus").fetchone()
if(q[0]==0):
break
Any assistance in finding a more efficient and reliable way to implement this, as opposed to time.sleep(10) would be appreciated.

As OP found out, inconsistent or imcomplete processing of stored procedures from application layer like Python may be due to straying from best practices of TSQL scripting.
As #AaronBetrand highlights in this Stored Procedures Best Practices Checklist blog, consider the following among other items:
Explicitly and liberally use BEGIN ... END blocks;
Use SET NOCOUNT ON to avoid messages sent to client for every row affected action, possibly interrupting workflow;
Use semicolons for statement terminators.
Example
CREATE PROCEDURE dbo.myStoredProc
AS
BEGIN
SET NOCOUNT ON;
SELECT * FROM foo;
SELECT * FROM bar;
END
GO

speed improvement in postgres INSERT command

I am writing a program to load data into a particular database. This is what I am doing right now ...
conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'"%dbName)
cur = conn.cursor()
lRows = len(rows)
i, iN = 0, 1000
while True:
if iN >= lRows:
# write the last of the data, and break ...
iN = lRows
values = [dict(zip(header, r)) for r in rows[i:iN]]
cur.executemany( insertString, values )
conn.commit()
break
values = [dict(zip(header, r)) for r in rows[i:iN]]
cur.executemany( insertString, values )
conn.commit()
i += 1000
iN += 1000
cur.close()
conn.close()
I am aware about this question about the use of the COPY command. However, I need to do some bookkeeping on my files before I can upload the files into a database. Hence I am using Python in this manner.
I have a couple of questions on how to make things faster ...
Would it be better (or possible) to do many cur.executemany() statements and a single conn.commit() at the end? This means that I will put a single conn.commit() statement just before the cur.close() statement.
I have always seen other people use cur.executemany() for batches of like 1000 or so records. Is this generally the case or is it possible to just do an cur.executemany() on the entire set of records that I read from the file. I would potentially have hundreds of thousands of records, or maybe a little over a million records. (I have sufficient RAM to fit the entire file in memory). How do I know the upper limit of the number of records that I can upload at any one time.
I am making a fresh connection to the database for every file that I am opening. I am doing this because, this process is taking me many days to complete and I dont want issues with connection to corrupt the entirety of the data, if the connection is lost at any time. I have over a thousand files that I need to go through. Are these thousand connections that we are making going to be a significant part of the time that is used for the process?
Are there any other things that I am doing in the program that I shouldn't be doing that can shorten the total time for the process?
Thanks very much for any help that I can get. Sorry for the questions being so basic. I am just starting with databases in Python, and for some reason, I don't seem to have any definitive answer to any of these questions right now.

As you mentioned at p.3 you are worried about database connection, that might break, so if you use one conn.commit() only after all inserts, you can easily loose already inserted, but not commited data if your connection breaks before conn.commit(). If you do conn.commit() after each cur.executemany(), you won't loose everything, only the last batch. So, it's up to you and depends on a workflow you need to support.
The number of records per batch is a trade-off between insertion speed and other things. You need to choose value that satisfies your requirements, you can test your script with 1000 records per batch, with 10000 per batch and check the difference.
The case of inserting whole file within one cur.executemany() has an advantage of an atomicity: if it has been executed, that means all records from this particular file have been inserted, so we're back to p. 1.
I think the cost of establishing a new connection in your case does not really matter. Let's say, if it takes one second to establish new connection, with 1000 files it will be 1000 seconds spent on connection within days.
The program itself looks fine, but I would still recommend you to take a look on COPY TO command with UNLOGGED or TEMPORARY tables, it will really speed up your imports.

PyODBC execute stored procedure does not complete

I have the following code, and the stored procedure is used to call several stored procedures. I can run the stored procedure and it will complete without issues in SQL 2012. I am using Python 3.3.
cnxn = pyodbc.connect('DRIVER={SQL Server};Server=.\SQLEXPRESS;Database=MyDatabase;Trusted_Connection=yes;')
cursor = cnxn.cursor()
cnxn.timeout = 0
cnxn.autocommit = True
cursor.execute("""exec my_SP""")
The python code is executing, I have determined this from inserting numerous prints.
I did see the other question regarding python waiting for the SP to finish. I tried adding a 'time.sleep()' after the execute, and varying the time (up to 120 seconds) no change.
The stored procedure appears to be partially executing, based on the results. The data suggests that it is even interrupting one of the sub-stored procedures, yet it is fine when the SP is run from query analyzer.
My best guess would be that this is something SQL config related, but I am lost in where to look.
Any thoughts?

Adding SET NOCOUNT OFF to my proc worked for me.

I had the same issue and solved it with a combination of setting a locking variable (see answer from Ben Caine in this thread: make python wait for stored procedure to finish executing) and adding
"SET NOCOUNT ON"
after "CREATE PROCEDURE ... AS"

Just a follow up; I have had limited success using the time features located at the link below, and reducing the level of nesting stored procedures.
At the level that I was calling in the above, there were 4 layers of nested SP's; pyodbc seems to behave a little better when you have 3 layers or less. Doesn't make a lot of sense to me, but it works.
make python wait for stored procedure to finish executing
Any input on the rationale behind this would be greatly appreciated.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python/Hive interface slow with fetchone(), hangs with fetchall() - python

Have you considered using fetchmany(). That would be the DBAPI answer for pulling data in chunks (bigger one, where the overhead is an issue, and smaller than all rows, where memory is an issue).

Related

Python - Slow Execution of Stored Procedure in a Loop

Python code doesn't run the SQL stored procedure completely

SQL Stored Procedures not finishing when called from Python

speed improvement in postgres INSERT command

PyODBC execute stored procedure does not complete

Categories

Resources