How to concurrently run queries using Impala with Python code?

Context
I use Python (3.7) to run several queries on a Hadoop server.
After several tests, I think Impala is the most efficient engine to query the database, so I set up a connection using the Ibis framework in order to force the use of Impala (Hive is used by default).
Considering the number of queries, I'm trying to run them concurrently.
I think I'm getting close, but I'm stuck with a problem when trying to share the connection to the server, established with Ibis, across the multiple processes I start.
I'm quite new to Python, but I'm going to do my best to explain my problem clearly and to use the right vocabulary. Please forgive me in advance for any mistake...!
How the queries are submitted
For submitting my queries, the code looks like this:
Connection to the database:
hdfs = ibis.hdfs_connect(host='X.X.X.X', port=Y)
client = ibis.impala.connect(host='X.X.X.X',port=Y,hdfs_client=hdfs)
Creation of the query (done several times):
query = "SELECT ... FROM ... WHERE ..."
Send the query and retrieve the results (done for each query):
query = client.sql(query)
data = query.execute(limit=None)
What has been done to concurrently run these queries
For now, I've created a Process class using multiprocessing, and I'm passing it the client parameter, which I thought would enable the connection, and a list containing the information required to configure the queries to run on the server:
import multiprocessing

class ParallelDataRetrieving(multiprocessing.Process):
    """Process in charge of retrieving the data."""

    def __init__(self, client, someInformations):
        multiprocessing.Process.__init__(self)
        self.client = client
        self.someInformations = someInformations

    def run(self):
        """Code to run during the execution of the process."""
        cpt1 = 0
        while cpt1 < len(self.someInformations):
            query = "SELECT ... FROM ... WHERE ..."  # built from self.someInformations[cpt1]
            query = self.client.sql(query)
            data = query.execute(limit=None)
            # Some work on the data...
            cpt1 += 1
        return 0
Then, from the main script, I (try to) establish the connection and start several processes using this connection:
hdfs = ibis.hdfs_connect(host='X.X.X.X', port=Y)
client = ibis.impala.connect(host='X.X.X.X',port=Y,hdfs_client=hdfs)
process_1 = ParallelDataRetrieving(client,someInformations)
process_1.start()
process_2 = ...
But this code does not work. I get the error "TypeError: can't pickle _thread.lock objects".
From what I understand, this comes from the fact that multiprocessing uses pickle to "encapsulate" the parameters and transfer them to the processes (whose memory runs separately on Windows), and it does not seem to be possible to pickle the "client" parameter.
I then found several ideas on the internet that try to solve this issue, but none of them seems applicable to my particular case (Ibis, Impala...):
I tried to create the connection directly in the run method of the Process object (which means one connection per process), as shown in the sketch after this list: this leads to "BrokenPipeError: [Errno 32] Broken pipe".
I tried to use multiprocessing.sharedctypes.RawValue, but if this is the right solution, I'm not very confident I implemented it correctly in my code...
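For reference, here is a minimal sketch of the per-process-connection idea from the first attempt above (host, port and the query text are placeholders, and this is only the pattern I attempted, not a confirmed fix):
import multiprocessing
import ibis

class ParallelDataRetrieving(multiprocessing.Process):
    """Variant where each process opens its own Ibis/Impala connection."""

    def __init__(self, someInformations):
        multiprocessing.Process.__init__(self)
        self.someInformations = someInformations

    def run(self):
        # The connection is created inside the child process, so nothing
        # un-picklable has to cross the process boundary.
        hdfs = ibis.hdfs_connect(host='X.X.X.X', port=Y)
        client = ibis.impala.connect(host='X.X.X.X', port=Y, hdfs_client=hdfs)
        for info in self.someInformations:
            query = "SELECT ... FROM ... WHERE ..."  # built from info
            data = client.sql(query).execute(limit=None)
            # Some work on the data...
        return 0

if __name__ == '__main__':
    # The guard is required on Windows, where child processes are spawned rather than forked.
    process_1 = ParallelDataRetrieving(someInformations)
    process_1.start()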
That is pretty much my situation at the moment. I will keep trying to solve this problem, but as something of a newcomer to Python and to multiprocessing of database queries, I thought a more advanced user could probably help me!
Thank you in advance for the time you will devote to this request!

Related

Python code doesn't run the SQL stored procedure completely

I am not proficient in Python, but I have written Python code that executes a stored procedure (SQL Server) which itself contains multiple stored procedures, so it usually takes 5 minutes or so to run in SSMS.
I can see the stored procedure run halfway through without error when I run the Python code, which makes me think that it somehow needs more time to execute when called from Python.
I found other posts where people suggested subprocess, but I don't know how to code this. Below is an example of (not my own) Python code that executes the stored procedure.
import pyodbc

mydb_lock = pyodbc.connect('Driver={SQL Server Native Client 11.0};'
                           'Server=localhost;'
                           'Database=InterelRMS;'
                           'Trusted_Connection=yes;'
                           'MARS_Connection=yes;'
                           'user=sa;'
                           'password=Passw0rd;')
mycursor_lock = mydb_lock.cursor()
sql_nodes = "Exec IVRP_Nodes"
mycursor_lock.execute(sql_nodes)
mydb_lock.commit()
How can I edit the above code to use subprocess? Is subprocess the right choice? Any other method you can suggest?
Many thanks.
Python 2.7 and 3
SQL Server
UPDATE 04/04/2022:
@AlwaysLearning, I tried
NEWcnxn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password+';Connection Timeout=0')
But there was no change. To check how much of the code it executes, I inserted the following two lines of code right after each other somewhere in the nested procedure where I thought the SP stopped:
INSERT INTO CheckTable (OrgID,Stage,Created) VALUES(@OrgID,2.5331,getdate())
INSERT INTO CheckTable (OrgID,Stage,Created) VALUES(@OrgID,2.5332,getdate())
Only the first query completed. I use an Azure DB, if that helps.
UPDATE 05/04/2022:
I tried what @AlwaysLearning suggested: after my connection, I added NEWconxn.timeout = 4000 and it's working now. Many thanks.
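For anyone who lands here later, a minimal sketch of that fix, assuming pyodbc (server, database and credentials are placeholders; pyodbc's Connection.timeout is the per-query timeout in seconds, with 0 meaning no limit):
import pyodbc

server, database, username, password = 'myserver', 'mydb', 'user', 'secret'  # placeholders
NEWcnxn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};'
                         'SERVER=' + server + ';DATABASE=' + database +
                         ';UID=' + username + ';PWD=' + password +
                         ';Connection Timeout=0')  # login timeout in the connection string
NEWcnxn.timeout = 4000  # query timeout in seconds, so the long stored procedure is not cut short

cursor = NEWcnxn.cursor()
cursor.execute("Exec IVRP_Nodes")
NEWcnxn.commit()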

SQL Stored Procedures not finishing when called from Python

I'm trying to call a stored procedure in my MSSQL database from a Python script, but it does not run completely when called via Python. This procedure consolidates transaction data into hourly/daily blocks in a single table which is later grabbed by the Python script. If I run the procedure in SQL Studio, it completes just fine.
When I run it via my script, it gets cut short about two-thirds of the way through. Currently I have found a workaround by making the program sleep for 10 seconds before moving on to the next SQL statement, but this is not time efficient and is unreliable, as some procedures may not finish in that time. I'm looking for a more elegant way to implement this.
Current Code:
cursor.execute("execute mySP")
time.sleep(10)
cursor.commit()
The most related article I can find to my issue is here:
make python wait for stored procedure to finish executing
I tried the solution using Tornado and I/O generators, but ran into the same issue as listed in the article, which was never resolved. I also tried the accepted solution of setting a running-status field in the database from my stored procedures: at the beginning of my SP, Status is updated to 1 in RunningStatus, and when the SP finishes, Status is updated to 0 in RunningStatus. Then I implemented the following Python code:
conn = pyodbc_connect(conn_str)
cursor = conn.cursor()
sconn = pyodbc_connect(conn_str)
scursor = sconn.cursor()

cursor.execute("execute mySP")
cursor.commit()
while 1:
    q = scursor.execute("SELECT Status FROM RunningStatus").fetchone()
    if q[0] == 0:
        break
When I implement this, the same problem happens as before, with my stored procedure finishing executing prior to it actually being complete. If I eliminate my cursor.commit(), as follows, I end up with the connection just hanging indefinitely until I kill the Python process.
conn = pyodbc_connect(conn_str)
cursor = conn.cursor()
sconn = pyodbc_connect(conn_str)
scursor = sconn.cursor()

cursor.execute("execute mySP")
while 1:
    q = scursor.execute("SELECT Status FROM RunningStatus").fetchone()
    if q[0] == 0:
        break
Any assistance in finding a more efficient and reliable way to implement this, as opposed to time.sleep(10), would be appreciated.
As the OP found out, inconsistent or incomplete processing of stored procedures from an application layer like Python may be due to straying from best practices of T-SQL scripting.
As @AaronBertrand highlights in his Stored Procedures Best Practices Checklist blog post, consider the following among other items:
Explicitly and liberally use BEGIN ... END blocks;
Use SET NOCOUNT ON to avoid messages being sent to the client for every row-affecting action, which can interrupt workflow;
Use semicolons for statement terminators.
Example
CREATE PROCEDURE dbo.myStoredProc
AS
BEGIN
    SET NOCOUNT ON;

    SELECT * FROM foo;

    SELECT * FROM bar;
END
GO
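On the Python side, a pattern that is often paired with the above (a sketch, and only an assumption here, since the original answer does not show it) is to drain every result set the procedure produces before committing, so the driver lets the whole batch run to completion:
import pyodbc

conn = pyodbc.connect(conn_str)  # conn_str as in the question
cursor = conn.cursor()
cursor.execute("execute mySP")

# Consume every result set / row count the procedure emits;
# nextset() returns False once nothing is left.
while cursor.nextset():
    pass

conn.commit()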

What is the most efficient way to run independent processes from the same application in Python

I have a script that, in the end, executes two functions. It polls for data on a time interval (it runs as a daemon, and the data is retrieved from a shell command run on the local system) and, once it receives this data, will: 1) function 1 - first write this data to a log file, and 2) function 2 - observe the data and then send an email IF that data meets certain criteria.
The logging will happen every time, but the alert may not. The issue is that, in cases where an alert needs to be sent, if the email connection stalls or takes a lengthy amount of time to connect to the server, it obviously causes the next polling of the data to stall (for an undisclosed amount of time, depending on the server), and in my case it is very important that the polling interval remains consistent (for analytics purposes).
What is the most efficient way, if any, to keep the email process working independently of the logging process while still operating within the same application and depending on the same data? I was considering creating a separate thread for the mailer, but that kind of seems like overkill in this case.
I'd rather not set a short timeout on the email connection, because I want to give the process some chance to connect to the server, while still allowing the logging to be written consistently on the given interval. Some code:
def send(self, msg_):
    """
    Send the alert message
    :param str msg_: the message to send
    """
    self.msg_ = msg_
    ar = alert.Alert()
    ar.send_message(msg_)

def monitor(self):
    """
    Post to the log file and
    send the alert message when
    applicable
    """
    read = r.SensorReading()
    msg_ = read.get_message()  # the data
    if msg_:  # if there is data in general...
        x = read.get_failed()  # store bad data
        msg_ += self.write_avg(read)
        msg_ += "==============================================="
        self.ctlog.update_templog(msg_)  # write general data to log
        if x:
            self.send(x)  # if bad data, send...
This is exactly the kind of case you want to use threading/subprocesses for. Fork off a thread for the email, which times out after a while, and keep your daemon running normally.
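A minimal sketch of that idea, reusing the send/monitor names from the snippet above (the daemon flag and the exact call site are assumptions, not part of the original answer):
import threading

def send_async(self, msg_):
    """Send the alert in a background thread so a slow SMTP
    connection cannot delay the next polling cycle."""
    t = threading.Thread(target=self.send, args=(msg_,), daemon=True)
    t.start()

# ...and inside monitor(), replace the blocking call
#     self.send(x)
# with
#     self.send_async(x)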
Possible approaches that come to mind:
Multiprocessing
Multithreading
Parallel Python
My personal choice would be multiprocessing, as you clearly mentioned independent processes; you wouldn't want a crashing thread to interrupt the other function.
You may also refer to this before making your design choice: Multiprocessing vs Threading Python
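If process isolation is the priority, a comparable sketch with multiprocessing (again just an illustration; alert.Alert comes from the question's code and is assumed to be importable in the worker):
from multiprocessing import Process

def alert_worker(msg_):
    # Runs in a separate process, so a crash here cannot take down the logger.
    ar = alert.Alert()
    ar.send_message(msg_)

# inside monitor():
#     Process(target=alert_worker, args=(x,), daemon=True).start()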
Thanks everyone for the responses; it helped very much. I went with threading, but also updated the code to be sure it handled failing threads. I ran some regressions and found that the subsequent processes were no longer being interrupted by stalled connections and the log was being updated on a consistent schedule. Thanks again!!

Save to database inside thread

I'm working with Django.
I have a thread whose purpose is to take a queued list of database items and modify them.
Here is my model:
from django.db import models

class MyModel(models.Model):
    boolean = models.BooleanField(editable=False)
and the problematic code :
from queue import Queue
from threading import Thread

def unqueue():
    while not queue.empty():
        myList = queue.get()
        for i in myList:
            if not i.boolean:
                break
            i.boolean = False
            i.save()  # <== error because database table is locked
        queue.task_done()

queue = Queue()
thread = Thread(target=unqueue)
thread.start()

def addToQueue(pk_list):  # can be called multiple times simultaneously
    list = []
    for pk in pk_list:
        list.append(MyModel.objects.get(pk=pk))
    queue.put(list)
I know the code is missing a lot of checks, etc. I simplified it here to make it clearer.
What can I do to be able to save to my DB while inside the thread?
EDIT: I need to be synchronous because i.boolean (and other properties in my real code) mustn't be overwritten.
I tried to create a dedicated table in the database, but it didn't work; I still have the same issue.
EDIT 2: I should mention that I'm using SQLite3. I tried to see if I could lock/unlock specific tables in SQLite, and it seems that locking applies to the entire DB only. That is probably why using a dedicated table wasn't helpful.
That is bad for me, because I need to access different tables simultaneously from different threads. Is that possible?
EDIT 3 : It seems that my problem is the one listed here
https://docs.djangoproject.com/en/1.8/ref/databases/#database-is-locked-errors
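For what it's worth, the workaround those docs describe boils down to raising SQLite's lock timeout in settings.py (a sketch, assuming the default sqlite3 backend; the 20-second value is only an example):
# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
        'OPTIONS': {
            'timeout': 20,  # seconds to wait on the lock before raising "database is locked"
        },
    }
}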
Are you sure you need a synchronized queue? Maybe an asynchronous solution will solve your problem? See Need a thread-safe asynchronous message queue.
The solution I found was to change the database.
SQLite doesn't allow concurrent access like that.
I switched to MySQL and it works now.

What is the proper way of using MySQLdb connections and cursors across multiple functions in Python

I'm kind of new to Python and its MySQLdb connector.
I'm writing an API to return some data from a database using the RESTful approach. In PHP, I wrapped the Connection management part in a class, acting as an abstraction layer for MySQL queries.
In Python:
I define the connection early on in the script: con = mdb.connect('localhost', 'user', 'passwd', 'dbname')
Then, in all subsequent methods:
import MySQLdb as mdb

def insert_func():
    with con:
        cur = con.cursor(mdb.cursors.DictCursor)
        cur.execute("INSERT INTO table (col1, col2, col3) VALUES (%s, %s, %s)", (val1, val2, val3))
        rows = cur.fetchall()
        # do something with the results
        return someval
etc.
I use mdb.cursors.DictCursor because I prefer to be able to access database columns in an associative array manner.
Now the problems start popping up:
In one function, I issue an insert query to create a 'group' with a unique 'groupid'.
This 'group' has a creator. Every user in the database holds a JSON array in the 'groups' column of his/her row in the table.
So when I create a new group, I want to assign the groupid to the user that created it.
I update the user's record using a similar function.
I've wrapped the 'insert' and 'update' parts in two separate function defs.
The first time I run the script, everything works fine.
The second time I run the script, the script runs endlessly (I suspect due to some idle connection to the MySQL database).
When I interrupt it using CTRL + C, I get one of the following errors:
"'Cursor' object has no attribute 'connection'"
"commands out of sync; you can't run this command now"
or any other KeyboardInterrupt exception, as would be expected.
It seems to me that these errors are caused by some erroneous way of handling connections and cursors in my code.
I read that it is good practice to use with con: so that the connection will automatically close itself after the query. I use 'with' on 'con' in each function, so the connection is closed, but I decided to define the connection globally, for any function to use it. This seems incompatible with the with con: context management. I suspect the cursor needs to be 'context managed' in a similar way, but I do not know how to do this (to my knowledge, PHP doesn't use cursors for MySQL, so I have no experience using them).
I now have the following questions:
Why does it work the first time but not the second? (It will, however, work again, once, after the CTRL + C interrupt.)
How should I go about using connections and cursors when using multiple functions (that can be called upon in sequence)?
I think there are two main issues going on here: one appears to be the Python code and the other is the structure of how you're interacting with your DB.
First, you're not closing your connection. This depends on your application's needs; you have to decide how long it should stay open. Reference this SO question:
from contextlib import closing

with closing(connection.cursor()) as cursor:
    ...  # use the cursor

# cursor closed. Guaranteed.
connection.close()
Right now, you have to interrupt your program with Ctrl+C because there's no reason for your with statement to stop running.
Second, start thinking about your interactions with the DB in terms of 'transactions': do something, commit it to the DB; if it didn't work, roll back; if it did, close the connection. Here's a tutorial.
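A sketch of what that can look like with MySQLdb (connection parameters and table/column names are placeholders; the open/commit/rollback/close layout is one reasonable shape, not the only one):
import MySQLdb as mdb
from contextlib import closing

def insert_func(val1, val2, val3):
    # Open late: the connection lives only as long as this unit of work.
    con = mdb.connect('localhost', 'user', 'passwd', 'dbname')
    try:
        with closing(con.cursor(mdb.cursors.DictCursor)) as cur:
            cur.execute(
                "INSERT INTO my_table (col1, col2, col3) VALUES (%s, %s, %s)",
                (val1, val2, val3),
            )
        con.commit()      # it worked: make the change permanent
    except mdb.Error:
        con.rollback()    # it failed: undo the partial work
        raise
    finally:
        con.close()       # close early: always release the connection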
With connections, as with file handles, the rule of thumb is open late, close early.
So I would recommend sharing connections only where they are trying to do one thing. Or, if you multiprocess, then each process gets a connection, again following open late, close early. And if you are doing a sequential operation (say, in a loop), open and close outside the loop. Having global connections can get messy, mainly because now you have to keep track of which function uses the connection at what time, and what it tries to do with it.
The issue of "cannot run command now" is because your keyboard interrupt kills the active connection.
As to part one of your question: the endless hang could be anywhere. Each instance of Python will get its own connection, so when you run it the second time it should get its own connection. Open up a MySQL client and run
show full processlist
to see what's going on.
