I've created a fairly simple script that works with multiprocessing and SQL. The aim of this exercise is to obtain the lowest execution time:
import sqlite3
import time
from multiprocessing import Pool

def Query(query):
    conn = sqlite3.connect("DB.db")
    cur = conn.cursor()
    cur.execute(query)
    cur.close()
    conn.close()
    return
if __name__ == '__main__':
    conn = sqlite3.connect("DB.db")
    cur = conn.cursor()

    start = time.time()
    cur.execute(QUERY)
    cur.execute(QUERY)
    cur.execute(QUERY)
    end = time.time()
    TIME1 = end - start

    cur.execute('PRAGMA journal_mode=wal')
    conn.commit()

    start = time.time()
    pool = Pool(processes=2)
    pool.imap(Query, [QUERY, QUERY, QUERY])
    pool.close()
    pool.join()
    end = time.time()
    TIME2 = end - start

    cur.close()
    conn.close()
The average result over 20 runs is 13.43 for TIME1 and 10.39 for TIME2.
Shouldn't it be lower than that? Am I doing something wrong?
For my answer, I will assume that your query only reads things from the database.
Before you try to make something faster, you need to know what exactly is preventing the process from being faster.
Not all speed problems are amenable to improvement by multiprocessing!
So what you really need to do is profile the application to see where it spends its time.
Since SQLite does caching of queries, I would suggest timing each execution of the query in the single process separately.
I would suspect that the first query takes longer than the following ones.
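For example, a minimal sketch of per-execution timing (QUERY is the same placeholder query as in the question's code, and fetchall is added so the full cost of each query is measured):

import sqlite3
import time

conn = sqlite3.connect("DB.db")
cur = conn.cursor()
for i in range(3):
    start = time.time()
    cur.execute(QUERY)
    cur.fetchall()  # actually pull the rows so the work is measured
    print("run %d took %.3f seconds" % (i, time.time() - start))
cur.close()
conn.close()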
Also consider the overhead in the multiprocessing case. The query has to be pickled and sent to the worker process via IPC. Then each worker has to create a connection and cursor and close them afterwards. In a real world situation, your query function would have done something with the data, e.g. return it to the master process which also requires pickling it and sending it via IPC.
Since all workers access the same database, at some point the reading from the database will become the bottleneck.
If your query modifies the database, access will be serialized anyway to prevent corruption.
Related
I am using Python 3.6.8 and have a function that needs to run 77 times. The function pulls data out of PostgreSQL, does a statistical analysis, and then puts the results back into the database. I can only run 3 processes at a time, because running them one at a time takes far too long (about 10 minutes per function call) and I can only have 3 DB connections open at once. I am trying to use the Pool class from multiprocessing, but it seems to start all of the calls at once, which causes a "too many connections" error. Am I using Pool correctly? If not, what should I use to limit it to ONLY 3 function calls starting and finishing at the same time?
from datetime import datetime
from multiprocessing import Pool
import psycopg2

def AnalysisOf3Years(data):
    # FUNCTION RAN HERE
    pass

######START OF THE PROGRAM######
print("StartTime ((FINAL 77)): ", datetime.now())
con = psycopg2.connect(host="XXX.XXX.XXX.XXX", port="XXXX", user="USERNAME", password="PASSWORD", dbname="DATABASE")
cur = con.cursor()
date = datetime.today()
cur.execute("SELECT Value FROM table")
Array = cur.fetchall()
Data = []
con.close()
for value in Array:
    Data.append([value, date])
p = Pool(3)
p.map(AnalysisOf3Years, Data)
print("EndTime ((FINAL 77)): ", datetime.now())
It appears you only briefly need your database connection, with the bulk of the script's time spent processing the data. If this is the case, you may wish to pull the data out once and then write it to disk. You can then load this data fresh from disk in each new instance of your program, without having to worry about your connection limit to the database.
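A rough sketch of that idea, assuming a pickle file as the hand-off format (the connection details are the same placeholders as in the question):

import pickle
import psycopg2

# dump the query result to disk once, using a single connection
con = psycopg2.connect(host="XXX.XXX.XXX.XXX", port="XXXX", user="USERNAME", password="PASSWORD", dbname="DATABASE")
cur = con.cursor()
cur.execute("SELECT Value FROM table")
rows = cur.fetchall()
con.close()

with open("data.pkl", "wb") as f:
    pickle.dump(rows, f)

# each worker or separate program instance then loads the data from disk
# instead of opening its own database connection
with open("data.pkl", "rb") as f:
    rows = pickle.load(f)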
If you want to look into connection pooling, you may wish to use pgbouncer. This is a separate program that sits between your main program and the database, pooling the number of connections you give it. You are then free to write your script as a single-threaded program, and you can spawn as many instances as your machine can cope with.
It's hard to tell why your program is misbehaving, as the indentation appears to be wrong. But at a guess, it would seem like you do not create and use your pool inside a __main__ guard, which on certain OSes can lead to all sorts of weird issues.
You would expect well behaving code to look something like:
from multiprocessing import Pool

def double(x):
    return x * 2

if __name__ == '__main__':
    # means the pool only gets created in the main parent process
    # and not in the child pool processes
    with Pool(3) as pool:
        result = pool.map(double, range(5))
        assert result == [0, 2, 4, 6, 8]
You can use the SQLAlchemy Python package, which has database connection pooling as standard functionality.
It does work with Postgres and many other database backends.
from sqlalchemy import create_engine

engine = create_engine('postgresql://me@localhost/mydb',
                       pool_size=3, max_overflow=0)
pool_size is the maximum number of connections to the database; you can set it to 3.
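As a rough sketch of how the pooled engine would then be used (the connection string and query are placeholders, not from the question):

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@localhost/mydb',
                       pool_size=3, max_overflow=0)

with engine.connect() as conn:
    # the connection is checked out of the pool here and returned on exit;
    # at most 3 connections will ever be open at once
    result = conn.execute(text("SELECT 1"))
    print(result.fetchall())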
This page has some examples of how to use it with Postgres:
https://docs.sqlalchemy.org/en/13/core/pooling.html
Based on your use case you might also be interested in
SingletonThreadPool
https://docs.sqlalchemy.org/en/13/core/pooling.html#sqlalchemy.pool.SingletonThreadPool
A Pool that maintains one connection per thread.
I have a simple function that writes the output of some calculations in a sqlite table. I would like to use this function in parallel using multi-processing in Python. My specific question is how to avoid conflict when each process tries to write its result into the same table? Running the code gives me this error: sqlite3.OperationalError: database is locked.
import sqlite3
from multiprocessing import Pool

conn = sqlite3.connect('test.db')
c = conn.cursor()
c.execute("CREATE TABLE table_1 (id int, output int)")

def write_to_file(a_tuple):
    index = a_tuple[0]
    input = a_tuple[1]
    output = input + 1
    c.execute('INSERT INTO table_1 (id, output) VALUES (?,?)', (index, output))

if __name__ == "__main__":
    p = Pool()
    results = p.map(write_to_file, [(1,10), (2,11), (3,13), (4,14)])
    p.close()
    p.join()
Traceback (most recent call last):
sqlite3.OperationalError: database is locked
Using a Pool is a good idea.
I see three possible solutions to this problem.
First, instead of having the pool worker trying to insert data into the database, let the worker return the data to the parent process.
In the parent process, use imap_unordered instead of map.
This is an iterable that starts providing values as soon as they become available. The parent can then insert the data into the database.
This will serialize the access to the database, preventing the problem.
This solution would be preferred if the data to be inserted into the database is relatively small but updates happen very often, i.e. if it takes the same amount of time or more to update the database than to calculate the data.
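A minimal sketch of this first approach, reusing the names from the question (the worker is renamed calculate here, since it no longer writes anything itself):

import sqlite3
from multiprocessing import Pool

def calculate(a_tuple):
    index, value = a_tuple
    # do the work in the child, but return the result instead of writing it here
    return (index, value + 1)

if __name__ == "__main__":
    conn = sqlite3.connect('test.db')
    c = conn.cursor()
    c.execute("CREATE TABLE IF NOT EXISTS table_1 (id int, output int)")
    p = Pool()
    # imap_unordered yields results as soon as any worker finishes;
    # only the parent touches the database, so access is serialized
    for index, output in p.imap_unordered(calculate, [(1, 10), (2, 11), (3, 13), (4, 14)]):
        c.execute('INSERT INTO table_1 (id, output) VALUES (?, ?)', (index, output))
    conn.commit()
    p.close()
    p.join()
    conn.close()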
Second, you could use a Lock. A worker should then
acquire the lock,
open the database,
insert the values,
close the database,
release the lock.
This will avoid the overhead of sending the data to the parent process. But instead you may have workers stalling while they wait to write their data into the database.
This would be a preferred solution if the amount of data to be inserted is large but it takes much longer to calculate the data than to insert it into the database.
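A sketch of this second approach, sharing a multiprocessing.Lock with the workers through the pool initializer (it assumes table_1 has already been created, as in the question):

import sqlite3
from multiprocessing import Pool, Lock

lock = None

def init(l):
    # store the shared lock in a global inside each worker process
    global lock
    lock = l

def write_to_file(a_tuple):
    index, value = a_tuple
    output = value + 1
    with lock:  # acquire the lock
        conn = sqlite3.connect('test.db')  # open the database
        conn.execute('INSERT INTO table_1 (id, output) VALUES (?, ?)', (index, output))
        conn.commit()
        conn.close()  # close the database
    # the lock is released when the with-block exits

if __name__ == "__main__":
    l = Lock()
    p = Pool(initializer=init, initargs=(l,))
    p.map(write_to_file, [(1, 10), (2, 11), (3, 13), (4, 14)])
    p.close()
    p.join()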
Third, you could have each worker write to its own database, and merge them afterwards. You can do this directly in sqlite or even in Python. Although with a large amount of data I'm not sure if the latter has advantages.
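For the third approach, the merge step can be done with sqlite's ATTACH. A sketch, assuming each worker wrote its rows into worker_0.db .. worker_3.db with the same table_1 schema (the file names are illustrative):

import sqlite3

main = sqlite3.connect('test.db')
for i in range(4):
    # attach each per-worker database and copy its rows into the main table
    main.execute("ATTACH DATABASE ? AS worker", ('worker_%d.db' % i,))
    main.execute("INSERT INTO table_1 SELECT id, output FROM worker.table_1")
    main.commit()
    main.execute("DETACH DATABASE worker")
main.close()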
The database is locked to protect your data from corruption.
I believe you cannot have many processes accessing the same database at the same time, at least NOT with
conn = sqlite3.connect('test.db')
c = conn.cursor()
If each process must access the database, you should consider closing at least the cursor object c (and, perhaps less strictly, the connection object conn) within each process and reopening it when the process needs it again. Somehow, the other processes need to wait for the current one to release the lock before another process can acquire the lock. (There are many ways to achieve the waiting.)
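A minimal sketch of that pattern, with each worker opening its own short-lived connection; relying on sqlite's busy timeout is just one of the possible ways to do the waiting:

import sqlite3
from multiprocessing import Pool

def write_to_file(a_tuple):
    index, value = a_tuple
    output = value + 1
    # a per-process connection; the timeout makes it wait for the write lock
    # instead of failing immediately with "database is locked"
    conn = sqlite3.connect('test.db', timeout=30.0)
    c = conn.cursor()
    c.execute('INSERT INTO table_1 (id, output) VALUES (?, ?)', (index, output))
    conn.commit()
    c.close()
    conn.close()

if __name__ == "__main__":
    p = Pool()
    p.map(write_to_file, [(1, 10), (2, 11), (3, 13), (4, 14)])
    p.close()
    p.join()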
Setting the isolation_level to 'EXCLUSIVE' fixed it for me:
conn = sqlite3.connect('test.db', isolation_level='EXCLUSIVE')
(Please note: there is a question called "SQLite3 and Multiprocessing", but that question is actually about multithreading, and so is the accepted answer, so this isn't a duplicate.)
I'm implementing a multiprocess script; each process will need to write some results into an sqlite table. My program keeps crashing with database is locked (with sqlite, only one DB modification is allowed at a time).
Here's an example of what I have:
import sys
from multiprocessing import Pool

# cur and con are a module-level sqlite3 cursor and connection
# (their setup is omitted in the question)

def scan(n):
    n = n + 1  # Some calculation
    cur.execute("INSERT INTO hello (n) VALUES ('" + str(n) + "')")
    con.commit()
    con.close()
    return True

if __name__ == '__main__':
    pool = Pool(processes=int(sys.argv[1]))
    for status in pool.imap_unordered(scan, range(0, 9999)):
        if status:
            print "ok"
    pool.close()
I've tried using a lock by declaring it in the main block and using it as a global in scan(), but it didn't stop me from getting database is locked.
What is the proper way of making sure only one INSERT statement will get issued at the same time in a multiprocess Python script?
EDIT:
I'm running on a Debian-based Linux.
This will happen if the write lock can't be grabbed within (by default) a 5-second timeout. In general, make sure your code COMMITs its transactions with sufficient frequency, thereby releasing the lock and letting other processes have a chance to grab it. If you want to wait for longer, you can do that:
db = sqlite3.connect(filename, timeout=30.0)
...waits for 30 seconds.
I have a python script that uses pyodbc to call an MSSQL stored procedure, like so:
cursor.execute("exec MyProcedure #param1 = '" + myparam + "'")
I call this stored procedure inside a loop, and I notice that sometimes, the procedure gets called again before it was finished executing the last time. I know this because if I add the line
time.sleep(1)
after the execute line, everything works fine.
Is there a more elegant and less time-costly way to say, "sleep until the exec is finished"?
Update (Divij's solution): This code is currently not working for me:
from tornado import gen
import pyodbc

@gen.engine
def func(*args, **kwargs):
    # connect to db
    cnxn_str = """
        Driver={SQL Server Native Client 11.0};
        Server=172.16.111.235\SQLEXPRESS;
        Database=CellTestData2;
        UID=sa;
        PWD=Welcome!;
    """
    cnxn = pyodbc.connect(cnxn_str)
    cnxn.autocommit = True
    cursor = cnxn.cursor()
    for _ in range(5):
        yield gen.Task(cursor.execute, 'exec longtest')
    return

func()
I know this is old, but I just spent several hours trying to figure out how to make my Python code wait for a stored proc on MSSQL to finish.
The issue is not with asynchronous calls.
The key to resolving this issue is to make sure that your procedure does not return any messages until it's finished running. Otherwise, pyodbc interprets the first message from the proc as the end of it.
Run your procedure with SET NOCOUNT ON. Also, make sure any PRINT statements or RAISERROR you might use for debugging are muted.
Add a BIT parameter like @muted to your proc and only raise your debugging messages if it's 0.
In my particular case, I'm executing a proc to process a loaded table and my application was exiting and closing the cursor before the procedure finished running because I was getting row counts and debugging messages.
So to summarize, do something along the lines of
cursor.execute('SET NOCOUNT ON; EXEC schema.proc @muted = 1')
and pyodbc will wait for the proc to finish.
Here's my workaround:
In the database, I make a table called RunningStatus with just one field, status, which is a bit, and just one row, initially set to 0.
At the beginning of my stored procedure, I execute the line
update RunningStatus set status = 1;
And at the end of the stored procedure,
update RunningStatus set status = 0;
In my Python script, I open a new connection and cursor to the same database. After my execute line, I simply add
while 1:
    q = status_check_cursor.execute('select status from RunningStatus').fetchone()
    if q[0] == 0:
        break
You need to make a new connection and cursor, because any calls from the old connection will interrupt the stored procedure and potentially cause status to never go back to 0.
It's a little janky but it's working great for me!
I have found a solution which does not require "muting" your stored procedures or altering them in any way. According to the pyodbc wiki:
nextset()
This method will make the cursor skip to the next available result
set, discarding any remaining rows from the current result set. If
there are no more result sets, the method returns False. Otherwise, it
returns a True and subsequent calls to the fetch methods will return
rows from the next result set.
This method is primarily used if you have stored procedures that
return multiple results.
To wait for a stored procedure to finish execution before moving on with the rest of the program, use the following code after executing the code that runs the stored procedure in the cursor.
slept = 0
while cursor.nextset():
    if slept >= TIMEOUT:
        break
    time.sleep(1)
    slept += 1
You could also change the time.sleep() value from 1 second to a little under a second to minimize extra wait time, but I don't recommend calling it very many times a second.
Here is a full program showing how this code would be implemented:
import time
import pyodbc

connection = pyodbc.connect('DRIVER={SQL Server};SERVER=<hostname>;PORT=1433;DATABASE=<database name>;UID=<database user>;PWD=password;CHARSET=UTF-8;')
cursor = connection.cursor()

TIMEOUT = 20  # Max number of seconds to wait for procedure to finish execution
params = ['value1', 2, 'value3']
cursor.execute("BEGIN EXEC dbo.sp_StoredProcedureName ?, ?, ? END", *params)

# here's where the magic happens with the nextset() function
slept = 0
while cursor.nextset():
    if slept >= TIMEOUT:
        break
    time.sleep(1)
    slept += 1

cursor.close()
connection.close()
There's no Python built-in that allows you to wait for an asynchronous call to finish. However, you can achieve this behaviour using Tornado's IOLoop. Tornado's gen interface allows you to register a function call as a Task and return to the next line in your function once the call has finished executing. Here's an example using gen and gen.Task:
from tornado import gen

@gen.engine
def func(*args, **kwargs):
    for _ in range(5):
        yield gen.Task(async_function_call, arg1, arg2)
    return
In the example, execution of func resumes after async_function_call is finished. This way subsequent calls to async_function_call won't overlap, and you won't have to pause execution of the main process with the time.sleep call.
I think my way is a little bit more crude, but at the same time much easier to understand:
cursor = connection.cursor()
SQLCommand = ("IF EXISTS(SELECT 1 FROM msdb.dbo.sysjobs J "
              "JOIN msdb.dbo.sysjobactivity A ON A.job_id = J.job_id "
              "WHERE J.name = 'dbo.SPNAME' AND A.run_requested_date IS NOT NULL "
              "AND A.stop_execution_date IS NULL) "
              "select 'The job is running!' "
              "ELSE select 'The job is not running.'")
cursor.execute(SQLCommand)
results = cursor.fetchone()
sresult = str(results)
while "The job is not running" in sresult:
    time.sleep(1)
    cursor.execute(SQLCommand)
    results = cursor.fetchone()
    sresult = str(results)
while "SPNAME" return "the job is not running" from the jobactivity table sleep 1 second and check the result again.
this work for sql job, for SP should like in another table
Goal:
run ~40 huge queries in a db using SQLAlchemy with Threads or Processes, put the corresponding SQLA ResultProxies in a Queue.Queue (being handled by multiprocessing.Manager)
at the same time, write the results to .csv files with a number of Processes that consume said Queue
Current state:
QueryThread and WriteThread classes that run the queries and write the data; since the queries take some time to run, there is no significant performance loss due to how the GIL handles threading
writing the files on the other hand takes forever; in fact, even though the original idea was to run multiple threads of the WriteThread class, the best performance is obtained with a single thread.
Hence the idea to use multiprocessing; I want to be able to write the output concurrently and not be CPU bound but rather I/O bound.
Background aside, here's the issue (which is essentially a design question) - the multiprocessing library works by pickling objects and then piping the data to other spawned processes; but the ResultProxy objects and shared Queue that I'm trying to use in the WriteWorker Process aren't picklable, which results in the following message (not verbatim, but close enough):
pickle.PicklingError: Can't pickle object in WriteWorker.start()
So the question for you helpful folks is: any ideas on a potential design pattern or approach that would avoid this issue? This seems like a simple, classic producer-consumer problem; I imagine the solution is straightforward and I'm just overthinking it.
any help or feedback is appreciated! thanks :)
edit: here's some relevant snippets of code, let me know if there's any other context I can provide
from the parent class:
#init manager and queues
self.manager = multiprocessing.Manager()
self.query_queue = self.manager.Queue()
self.write_queue = self.manager.Queue()
def _get_data(self):
    #spawn a pool of query processes, and pass them query queue instance
    for i in xrange(self.NUM_QUERY_THREADS):
        qt = QueryWorker.QueryWorker(self.query_queue, self.write_queue, self.config_values, self.args)
        qt.daemon = True
        # qt.setDaemon(True)
        qt.start()

    #populate query queue
    self.parse_sql_queries()

    #spawn a pool of writer processes, and pass them output queue instance
    for i in range(self.NUM_WRITE_THREADS):
        wt = WriteWorker.WriteWorker(self.write_queue, self.output_path, self.WRITE_BUFFER, self.output_dict)
        wt.daemon = True
        # wt.setDaemon(True)
        wt.start()

    #wait on the queues until everything has been processed
    self.query_queue.join()
    self.write_queue.join()
and from the QueryWorker class:
def run(self):
    while True:
        # grab a query tuple from the query queue
        query_tupe = self.query_queue.get()
        table = query_tupe[0]
        query = query_tupe[1]
        query_num = query_tupe[2]
        if query and table:
            # grab a connection from the pool, run the query
            connection = self.engine.connect()
            print 'Running query #' + str(query_num) + ': ' + table
            try:
                result = connection.execute(query)
            except:
                print 'Error while running query #' + str(query_num) + ': \n\t' + str(query) + '\nError: ' + str(sys.exc_info()[1])
            # place result handle tuple into out queue
            self.out_queue.put((table, result))
        # signal to the queue that the job is done
        self.query_queue.task_done()
The simple answer is to avoid using the ResultProxy directly. Instead, get the data out of the ResultProxy in the worker using its fetchall() or fetchmany(number_to_fetch) methods, and then pass the plain data into the multiprocessing queue.
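A sketch of a worker along those lines (the function name query_worker and its parameters are illustrative, not the question's exact class; engine stands for a SQLAlchemy engine created inside the worker process, as in the question's QueryWorker):

def query_worker(engine, query_queue, out_queue):
    # runs inside a worker process; only plain Python data crosses the
    # process boundary, never a ResultProxy
    while True:
        table, query, query_num = query_queue.get()
        if query and table:
            connection = engine.connect()
            try:
                result = connection.execute(query)
                rows = result.fetchall()  # plain tuples pickle cleanly
            except Exception:
                rows = []
            finally:
                connection.close()
            # hand the fetched rows (not the ResultProxy) to the writers
            out_queue.put((table, rows))
        query_queue.task_done()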