Python: Improving performance - Writing to database in separate thread - python

I am running a Python app where, for various reasons, I have to host my program on a server in one part of the world and my database in another.
I tested with a simple script. From my home, which is in a country neighboring the database server, writing and retrieving a row takes about 0.035 seconds (which is a nice speed imo), compared to 0.16 seconds when my Python server on the other side of the world performs the same action.
This is an issue because I am trying to keep my Python app as fast as possible, so I was wondering if there is a smart way to do this?
As I am running my code synchronously, my program waits every time it has to write to the db, which is about 3 times a second, so the time adds up. Is it possible to run the connection to the database in a separate thread or something, so it doesn't halt the whole program while it tries to send data to the database? Or can this be done using asyncio (I have no experience with async code)?
I am really struggling to figure out a good way to solve this issue.
In advance, many thanks!

Yes, you can create a thread that does the writes in the background. In your case, it seems reasonable to have a queue where the main thread puts things to be written and the db thread gets and writes them. The queue can have a maximum depth so that when too much is pending, the main thread waits. You could also do something different, like dropping things that arrive too fast. Or use a db with synchronization and write to a local copy. You may also be able to speed up the writes a bit by committing several at once.
This is a sketch of a worker thread:
import threading
import queue

class SqlWriterThread(threading.Thread):

    def __init__(self, db_connect_info, maxsize=8):
        super().__init__()
        self.db_connect_info = db_connect_info
        self.q = queue.Queue(maxsize)
        # TODO: Can expose q.put directly if you don't need to
        # intercept the call
        # self.put = self.q.put
        self.start()

    def put(self, statement):
        print(f"DEBUG: Putting\n{statement}")
        self.q.put(statement)

    def run(self):
        db_conn = None
        while True:
            # get all the statements you can, waiting on first
            statements = [self.q.get()]
            try:
                while True:
                    statements.append(self.q.get(block=False))
            except queue.Empty:
                pass
            try:
                # early exit before connecting if channel is closed.
                if statements[0] is None:
                    return
                if not db_conn:
                    # do_my_sql_connect is a placeholder for your driver's connect call
                    db_conn = do_my_sql_connect(self.db_connect_info)
                try:
                    print("Debug: Executing\n", "--------\n".join(f"{id(s)} {s}" for s in statements))
                    # todo: need to detect closed connection, then reconnect and restart loop
                    cursor = db_conn.cursor()
                    for statement in statements:
                        if statement is None:
                            return
                        cursor.execute(*statement)
                finally:
                    # commit belongs to the connection, not the cursor
                    db_conn.commit()
            finally:
                for _ in statements:
                    self.q.task_done()

sql_writer = SqlWriterThread(('user', 'host', 'credentials'))
sql_writer.put(('execute some stuff',))
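The answer also mentions dropping writes that arrive too fast instead of making the main thread wait. A minimal sketch of that variant, assuming a bounded queue.Queue like the one above; the helper name try_enqueue and the placeholder statements are made up for illustration:

```python
import queue

def try_enqueue(q, statement):
    """Queue a statement without blocking; return False if it was dropped."""
    try:
        q.put_nowait(statement)
        return True
    except queue.Full:
        # The writer thread is behind: drop the statement instead of waiting.
        return False

pending = queue.Queue(maxsize=2)
results = [try_enqueue(pending, ("INSERT ...", i)) for i in range(3)]
print(results)  # [True, True, False]
```

Whether dropping is acceptable depends on the data; for anything that must not be lost, the blocking put in the sketch above is the safer default.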

Related

Set function timeout without having to use contextlib [duplicate]

I looked online and found some SO discussions and ActiveState recipes for running code with a timeout. It looks like there are some common approaches:
Use a thread that runs the code, and join it with a timeout. If the timeout elapses, kill the thread. This is not directly supported in Python (it relies on the private _Thread__stop function), so it is bad practice.
Use signal.SIGALRM, but this approach does not work on Windows!
Use a subprocess with a timeout, but this is too heavy: what if I want to start interruptible tasks often? I don't want to fire up a process for each one!
So, what is the right way? I'm not asking about workarounds (e.g. use Twisted and async IO), but an actual way to solve the actual problem: I have some function and I want to run it only with some timeout. If the timeout elapses, I want control back. And I want it to work on Linux and Windows.
A completely general solution to this really, honestly does not exist. You have to use the right solution for a given domain.
If you want timeouts for code you fully control, you have to write it to cooperate. Such code has to be able to break up into little chunks in some way, as in an event-driven system. You can also do this by threading if you can ensure nothing will hold a lock too long, but handling locks right is actually pretty hard.
If you want timeouts because you're afraid code is out of control (for example, if you're afraid the user will ask your calculator to compute 9**(9**9)), you need to run it in another process. This is the only easy way to sufficiently isolate it. Running it in your event system or even a different thread will not be enough. It is also possible to break things up into little chunks similar to the other solution, but requires very careful handling and usually isn't worth it; in any event, that doesn't allow you to do the same exact thing as just running the Python code.
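For the first case (code you fully control), cooperation can be as simple as checking a deadline between small chunks of work. A minimal sketch under that assumption; the function sum_squares, the DeadlineExceeded exception and the chunk size are invented for illustration:

```python
import time

class DeadlineExceeded(Exception):
    """Raised by cooperative code when it notices its time budget is spent."""

def sum_squares(n, timeout, chunk=10_000):
    """Sum i*i for i < n, checking the deadline before each chunk of work."""
    deadline = time.monotonic() + timeout
    total = 0
    for start in range(0, n, chunk):
        if time.monotonic() > deadline:
            raise DeadlineExceeded(f"gave up after {timeout} seconds")
        total += sum(i * i for i in range(start, min(start + chunk, n)))
    return total

print(sum_squares(10, timeout=5.0))  # 285
```

This only works because the function is written to check the clock itself; it cannot interrupt a single call that never returns, which is the situation the other-process advice addresses.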
What you might be looking for is the multiprocessing module. If subprocess is too heavy, then this may not suit your needs either.
import time
import multiprocessing

def do_this_other_thing_that_may_take_too_long(duration):
    time.sleep(duration)
    return 'done after sleeping {0} seconds.'.format(duration)

if __name__ == '__main__':  # needed on platforms that spawn workers, e.g. Windows
    pool = multiprocessing.Pool(1)
    print('starting....')
    res = pool.apply_async(do_this_other_thing_that_may_take_too_long, [8])
    for timeout in range(1, 10):
        try:
            print('{0}: {1}'.format(timeout, res.get(timeout)))
        except multiprocessing.TimeoutError:
            print('{0}: timed out'.format(timeout))
    print('end')
If it's network related you could try:
import socket
socket.setdefaulttimeout(number)
I found this in the eventlet library:
http://eventlet.net/doc/modules/timeout.html
from eventlet.timeout import Timeout
timeout = Timeout(seconds, exception)
try:
    ... # execution here is limited by timeout
finally:
    timeout.cancel()
For "normal" Python code that doesn't linger for prolonged times in C extensions or I/O waits, you can achieve your goal by setting a trace function with sys.settrace() that aborts the running code when the timeout is reached.
Whether that is sufficient or not depends on how co-operating or malicious the code you run is. If it's well-behaved, a tracing function is sufficient.
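A minimal sketch of the tracing approach, assuming well-behaved pure-Python code; the names run_with_trace_timeout, TraceTimeout and spin are invented for illustration. The trace function runs on every call and line event, so it can raise inside the traced code once the deadline has passed:

```python
import sys
import time

class TraceTimeout(Exception):
    """Raised inside the traced code once the deadline has passed."""

def run_with_trace_timeout(func, timeout):
    deadline = time.monotonic() + timeout

    def tracer(frame, event, arg):
        # Called for 'call' and 'line' events in frames entered from func().
        if time.monotonic() > deadline:
            raise TraceTimeout(f"exceeded {timeout} seconds")
        return tracer  # keep tracing this frame line by line

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        return func()
    finally:
        sys.settrace(previous)

def spin():
    n = 0
    while True:  # pure-Python loop, so line events keep firing
        n += 1

try:
    run_with_trace_timeout(spin, 0.2)
    timed_out = False
except TraceTimeout:
    timed_out = True
print(timed_out)
```

Tracing slows the code down considerably, and it cannot fire while the interpreter is inside a single long C call such as time.sleep(60), which is the limitation the answer describes.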
Another way is to use faulthandler:
import time
import faulthandler
faulthandler.enable()
try:
    # dump the tracebacks of all threads if this takes longer than 3 seconds
    faulthandler.dump_traceback_later(3)
    time.sleep(10)
finally:
    faulthandler.cancel_dump_traceback_later()
N.B.: The faulthandler module has been part of the stdlib since Python 3.3.
If you're running code that you expect to die after a set time, then you should write it properly so that there aren't any negative effects on shutdown, no matter whether it's a thread or a subprocess. A command pattern with undo would be useful here.
So, it really depends on what the thread is doing when you kill it. If it's just crunching numbers, who cares if you kill it. If it's interacting with the filesystem and you kill it, then maybe you should really rethink your strategy.
What is supported in Python when it comes to threads? Daemon threads and joins. Why does Python let the main thread exit if you've joined a daemon while it's still active? Because it's understood that someone using daemon threads will (hopefully) write the code in a way that it won't matter when that thread dies. Giving a timeout to a join and then letting main die, and thus taking any daemon threads with it, is perfectly acceptable in this context.
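The daemon-thread-plus-join idea can be sketched as follows. This is only an illustration under the assumption that abandoning the worker is harmless; the helper name call_with_timeout is made up:

```python
import threading
import time

def call_with_timeout(func, timeout, *args):
    """Run func(*args) in a daemon thread; raise TimeoutError if it overruns."""
    outcome = {}

    def target():
        try:
            outcome["value"] = func(*args)
        except Exception as exc:  # hand errors back to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        # The worker is NOT killed; it keeps running until it finishes
        # or the main thread exits and takes the daemon with it.
        raise TimeoutError(f"no result after {timeout} seconds")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]

print(call_with_timeout(pow, 1.0, 2, 10))  # 1024
try:
    call_with_timeout(time.sleep, 0.1, 2)
except TimeoutError as exc:
    print("timed out:", exc)
```

The caller gets control back on time, but the abandoned thread still consumes resources, so this fits number crunching better than work that touches the filesystem.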
I solved it this way.
It worked great for me (on Windows, and it is not heavy at all). I hope it is useful for someone.
import threading
import time

class LongFunctionInside(object):
    lock_state = threading.Lock()
    working = False

    def long_function(self, timeout):
        self.working = True
        timeout_work = threading.Thread(name="thread_name", target=self.work_time, args=(timeout,))
        timeout_work.daemon = True
        timeout_work.start()
        while True:  # endless/long work
            time.sleep(0.1)  # at this rate the CPU is almost not used
            if not self.working:  # the timeout thread has cleared the flag
                break
        self.set_state(True)

    def work_time(self, sleep_time):
        # thread function that just sleeps for the specified time; on wake-up,
        # if the function is still working, it sets the guarded flag to False
        time.sleep(sleep_time)
        if self.working:
            self.set_state(False)

    def set_state(self, state):  # guarded state change
        with self.lock_state:
            self.working = state

lw = LongFunctionInside()
lw.long_function(10)
The main idea is to create a thread that just sleeps in parallel to the "long work" and, on wake-up (after the timeout), changes the guarded state variable; the long function checks that variable during its work.
I'm pretty new to Python programming, so if this solution has fundamental errors, such as resource, timing, or deadlock problems, please respond.
Solving it with the 'with' construct, merged with the solution from the thread "Timeout function if it takes too long to finish", which works better:
import threading, time

class Exception_TIMEOUT(Exception):
    pass

class linwintimeout:

    def __init__(self, f, seconds=1.0, error_message='Timeout'):
        self.seconds = seconds
        self.thread = threading.Thread(target=f)
        self.thread.daemon = True
        self.error_message = error_message

    def handle_timeout(self):
        raise Exception_TIMEOUT(self.error_message)

    def __enter__(self):
        # run f in the daemon thread and wait at most self.seconds for it
        self.thread.start()
        self.thread.join(self.seconds)

    def __exit__(self, type, value, traceback):
        if self.thread.is_alive():
            return self.handle_timeout()

def function():
    while True:
        print("keep printing ...")
        time.sleep(1)

try:
    with linwintimeout(function, seconds=5.0, error_message='exceeded timeout of %s seconds' % 5.0):
        pass
except Exception_TIMEOUT as e:
    print(" attention !! exceeded timeout, giving up ... %s " % e)

Can't pickle psycopg2.extensions.connection objects when using pool.imap, but can be done in individual processes

I am trying to build an application which will "check out" a cell, which is a square covering a part of land in a geographic database, and perform an analysis of the features within that cell. Since I have many cells to process, I am using a multiprocessing approach.
I had it somewhat working inside of my object like this:
class DistributedGeographicConstraintProcessor:
    ...
    def _process_cell(self, conn_string):
        conn = pg2.connect(conn_string)
        try:
            cur = conn.cursor()
            cell_id = self._check_out_cell(cur)
            conn.commit()
            print(f"processing cell_id {cell_id}...")
            for constraint in self.constraints:
                # print(f"processing {constraint.name()}...")
                query = constraint.prepare_distributed_query(self.job, self.grid)
                cur.execute(query, {
                    "buffer": constraint.buffer(),
                    "cell_id": cell_id,
                    "name": constraint.name(),
                    "simplify_tolerance": constraint.simplify_tolerance()
                })
            # TODO: do a final race condition check to further suppress duplicates
            self._check_in_cell(cur, cell_id)
            conn.commit()
        finally:
            del cur
            conn.close()
        return None

    def run(self):
        while True:
            if not self._job_finished():
                params = [self.conn_string] * self.num_cores
                processes = []
                for param in params:
                    process = mp.Process(target=self._process_cell, args=(param,))
                    processes.append(process)
                    sleep(0.1)  # Prevent multiple processes from checking out the same grid square
                    process.start()
                for process in processes:
                    process.join()
            else:
                self._finalize_job()
                break
But the problem is that it will only start four processes and wait until they all finish before starting four new processes.
I want to make it so when one process finishes its work, it will begin working on the next cell immediately, even if its co-processes are not yet finished.
I am unsure about how to implement this and I have tried using a pool like this:
def run(self):
    pool = mp.Pool(self.num_cores)
    unprocessed_cells = self._unprocessed_cells()
    for i in pool.imap(self._process_cell, unprocessed_cells):
        print(i)
But this just tells me that the connection is not able to be pickled:
TypeError: can't pickle psycopg2.extensions.connection objects
But I do not understand why, because it is the exact same function that I am using in the imap function as in the Process target.
I have already looked at these threads, here is why they do not answer my question:
Error Connecting To PostgreSQL can't pickle psycopg2.extensions.connection objects - The answer here only indicates that multiple processes cannot share the same connection. I am aware of this, and am initializing the process inside the function which is being executed in the child process. Also, as I mentioned, it works when I map the function to individual Process instances, with the same function with the same inputs.
Multiprocessing result of a psycopg2 request. “Can't pickle psycopg2.extensions.connection objects” - There is no answer nor any comments on this question, and the code is not intact anyway: the author refers to a function that is not specified in the question, and in any case it is obvious that they are blatantly trying to share the same cursor between processes.
My guess is that you're attaching some connection object to self; try to rewrite your solution using functions only (no classes/methods).
Here is a simplified version of a single producer/multiple workers solution I used some time ago:
import time
from multiprocessing import Pool

def worker(param):
    # connect to pg
    # do work
    pass

def main():
    pool = Pool(processes=NUM_PROC)
    tasks = []
    for param in params:
        t = pool.apply_async(worker, args=(param,))
        tasks.append(t)
    pool.close()
    # poll once a second until every task has finished
    finished = False
    while not finished:
        finished = True
        for t in tasks:
            if not t.ready():
                finished = False
                break
        time.sleep(1)
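The guess about self can be checked without PostgreSQL. pool.imap has to pickle the callable to send it to a worker, and pickling a bound method such as self._process_cell pickles the whole instance, including anything stored on it. mp.Process on a fork-based platform does not pickle its target, which would explain why the first version worked. A minimal sketch, in which the Processor class and a threading.Lock standing in for an unpicklable connection are assumptions for illustration:

```python
import pickle
import threading

class Processor:
    def __init__(self):
        # Stand-in for a psycopg2 connection: locks cannot be pickled either.
        self.conn = threading.Lock()

    def process_cell(self, cell_id):
        return cell_id * 2

def bound_method_picklable(obj):
    """Try to pickle obj.process_cell the way a Pool would have to."""
    try:
        pickle.dumps(obj.process_cell)
        return True
    except (TypeError, pickle.PicklingError):
        # Pickling the bound method drags in obj and its unpicklable attribute.
        return False

print(bound_method_picklable(Processor()))  # False
```

A module-level worker function that receives only the connection string and opens its own connection avoids the problem, because nothing unpicklable travels with it.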

Assistance with Python multithreading

Currently, I have a list of URLs to grab contents from, and I am doing it serially. I would like to change it to grab them in parallel. This is pseudocode. I would like to ask: is the design sound? I understand that .start() starts the thread; however, my database is not updated. Do I need to use q.get()? Thanks.
import threading
import queue

q = queue.Queue()

def do_database(url):
    """ grab url then input to database """
    webdata = grab_url(url)
    try:
        insert_data_into_database(webdata)
    except:
        ....
    else:
        < do I need to do anything with the queue after each db operation is done?>

def put_queue(q, url):
    q.put(do_database(url))

for myfiles in currentdir:
    url = myfiles + some_other_string
    t = threading.Thread(target=put_queue, args=(q, url))
    t.daemon = True
    t.start()
It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, it sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import queue

NUM_THREADS = 5  # whatever
q = queue.Queue()
END_OF_DATA = object()  # a unique object

class Worker(threading.Thread):
    def run(self):
        while True:
            url = q.get()
            if url is END_OF_DATA:
                break
            webdata = grab_url(url)
            try:
                # Does your database support concurrent updates
                # from multiple threads? If not, need to put
                # this in a "with some_global_mutex:" block.
                insert_data_into_database(webdata)
            except:
                #....
                pass

threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for myfiles in currentdir:
    url = myfiles + some_other_string
    q.put(url)

# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
    q.put(END_OF_DATA)

# Shut down cleanly. `daemon` is way overused.
for t in threads:
    t.join()
You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).
For the DB, you have to commit before your changes become effective. But committing after every insert is not optimal; committing after a batch of changes gives much better performance.
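The batching point can be illustrated with the standard library's sqlite3 module. This is a sketch only: the in-memory database, the pages table and the example rows are assumptions, and a networked database is where the saving from fewer commits would matter most:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the example
conn.execute("CREATE TABLE pages (url TEXT, body TEXT)")

rows = [(f"http://example.com/{i}", f"body {i}") for i in range(100)]

# One executemany and one commit for the whole batch,
# instead of a commit after every single insert.
conn.executemany("INSERT INTO pages (url, body) VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(count)  # 100
```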
As for parallelism, Python wasn't built for it. For your use case, I suppose using Python with gevent would be a painless solution.
Here is a much more efficient pseudo-implementation, FYI:
import gevent
from gevent.monkey import patch_all
patch_all()  # to use with urllib, etc
from gevent.queue import Queue

def web_worker(q, url):
    result = grab_something(url)  # placeholder: fetch the URL
    q.put(result)

def db_worker(q):
    buf = []
    while True:
        buf.append(q.get())
        if len(buf) > 20:
            insert_stuff_in_buf_to_db(buf)  # placeholder: bulk insert
            db_commit()  # placeholder: one commit per batch
            buf = []

def run(urls):
    q = Queue()
    gevent.spawn(db_worker, q)
    for url in urls:
        gevent.spawn(web_worker, q, url)

run(urls)
Plus, since this implementation is entirely single-threaded, you can safely manipulate data shared between workers, such as the queue, the db connection, global variables, etc.

Integrating with deluged api - twisted deferred

I have a very simple script that monitors a file transfer progress, comparing its actual size with the target then calculating its hash, comparing with the desired hash and firing up a few extra things when everything seems alright.
I've replaced the tool used for the file transfers (wget) with deluged, which has a neat api to integrate with.
Instead of comparing the file progress and the hashes, I now only need to know when deluged has finished downloading the files. To achieve that, I was able to modify this script to my needs, but I'm stuck trying to wrap my head around the Twisted framework, which deluged makes use of.
To try getting over it, I grabbed one sample script from twisted deferred documentation, wrapped a class around it and attempted to use the same concept I'm using on this script I mentioned.
Now, I don't know exactly what to do with the reactor object, since it's basically a blocking loop that can't be restarted.
This is my sample code I'm working with:
from twisted.internet import reactor, defer
import time

class DummyDataGetter:
    done = False
    result = 0

    def getDummyData(self, x):
        d = defer.Deferred()
        # simulate a delayed result by asking the reactor to fire the
        # Deferred in 2 seconds time with the result x * 3
        reactor.callLater(2, d.callback, x * 3)
        return d

    def assignResult(self, d):
        """
        Data handling function to be added as a callback: handles the
        data by printing the result
        """
        self.result = d
        self.done = True
        reactor.stop()

    def run(self):
        d = self.getDummyData(3)
        d.addCallback(self.assignResult)
        reactor.run()

getter = DummyDataGetter()
getter.run()
while not getter.done:
    time.sleep(0.5)
print(getter.result)
# then somewhere else I want to get dummy data again
getter = DummyDataGetter()
getter.run()  # this throws an exception of type error.ReactorNotRestartable
while not getter.done:
    time.sleep(0.5)
print(getter.result)
My questions are:
Should reactor be fired in another thread to prevent it blocking the code?
If so, how would I add more callbacks to this reactor living in a separate thread? Simply by doing something similar to reactor.callLater(2, d.callback, x * 3), from my main thread?
If not, what is the technique to overcome this problem of not being able to start and stop the reactor twice or more in the same process?
OK, easiest approach I found to this is to simply have a separate script called using subprocess.Popen, dump the statuses of the torrents and anything else needed into the stdout (serialized using JSON) and pipe that into the calling script.
Way less traumatic than learning twisted, but of course far away from optimal.
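The subprocess approach can be sketched like this. The child here is an inline stand-in that prints a made-up status dictionary; in the real setup it would be the separate script that talks to deluged, and the field names are assumptions:

```python
import json
import subprocess
import sys

# Stand-in for the separate script: it prints one JSON document to stdout.
CHILD_CODE = (
    "import json; "
    "print(json.dumps({'name': 'example.iso', 'progress': 100.0}))"
)

def fetch_status():
    """Run the child process and parse the JSON it wrote to stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with an error
    )
    return json.loads(completed.stdout)

status = fetch_status()
print(status["name"], status["progress"])
```

Because each call starts a fresh process, the reactor in the child is started only once per process, which sidesteps ReactorNotRestartable at the cost of process start-up time.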

separate threads in pygtk application

I'm having some problems threading my pyGTK application. I give the thread some time to complete its task, if there is a problem I just continue anyway but warn the user. However once I continue, this thread stops until gtk.main_quit is called. This is confusing me.
The relevant code:
class MTP_Connection(threading.Thread):
    def __init__(self, HOME_DIR, username):
        self.filename = HOME_DIR + "mtp-dump_" + username
        threading.Thread.__init__(self)

    def run(self):
        #test run
        for i in range(1, 10):
            time.sleep(1)
            print i
..........................
start_time = time.time()
conn = MTP_Connection(self.HOME_DIR, self.username)
conn.start()
progress_bar = ProgressBar(self.tree.get_widget("progressbar"),
                           update_speed=100, pulse_mode=True)
while conn.isAlive():
    while gtk.events_pending():
        gtk.main_iteration()
    if time.time() - start_time > 5:
        self.write_info("problems closing connection.")
        break
#after this the program continues normally, but my conn thread stops
Firstly, don't subclass threading.Thread, use Thread(target=callable).start().
Secondly, and this is probably the cause of your apparent block: gtk.main_iteration takes a parameter, block, which defaults to True, so your call to gtk.main_iteration will actually block when there are no events to iterate on. This can be solved with:
gtk.main_iteration(block=False)
However, there is no real explanation why you would use this hacked up loop rather than the actual gtk main loop. If you are already running this inside a main loop, then I would suggest that you are doing the wrong thing. I can expand on your options if you give us a bit more detail and/or the complete example.
Thirdly, and this only came up later: Always always always always make sure you have called gtk.gdk.threads_init in any pygtk application with threads. GTK+ has different code paths when running threaded, and it needs to know to use these.
I wrote a small article about pygtk and threads that offers you a small abstraction so you never have to worry about these things. That post also includes a progress bar example.
