Goal:
run ~40 huge queries against a database using SQLAlchemy, with threads or processes, and put the resulting SQLAlchemy ResultProxies into a Queue.Queue handled by a multiprocessing.Manager
at the same time, write the results to .csv files with a number of processes that consume that queue
Current state:
QueryThread and WriteThread classes that run the queries and write the data; since the queries spend most of their time waiting on the database, there is no significant performance loss from how the GIL handles threading
writing the files, on the other hand, takes forever; in fact, even though the original idea was to run multiple threads of the WriteThread class, the best performance is obtained with a single thread.
Hence the idea to use multiprocessing: I want to be able to write the output concurrently and be I/O bound rather than CPU bound.
Background aside, here's the issue (which is essentially a design question): the multiprocessing library works by pickling objects and piping the data to the spawned processes, but the ResultProxy objects and the shared queue that I'm trying to use in the WriteWorker process aren't picklable, which results in the following message (not verbatim, but close enough):
pickle.PicklingError: Can't pickle object in WriteWorker.start()
So the question for you helpful folks is: any ideas on a design pattern or approach that would avoid this issue? This seems like a simple, classic producer-consumer problem, so I imagine the solution is straightforward and I'm just overthinking it.
any help or feedback is appreciated! thanks :)
Edit: here are some relevant snippets of code; let me know if there's any other context I can provide.
from the parent class:
#init manager and queues
self.manager = multiprocessing.Manager()
self.query_queue = self.manager.Queue()
self.write_queue = self.manager.Queue()
def _get_data(self):
#spawn a pool of query processes, and pass them query queue instance
for i in xrange(self.NUM_QUERY_THREADS):
qt = QueryWorker.QueryWorker(self.query_queue, self.write_queue, self.config_values, self.args)
qt.daemon = True
# qt.setDaemon(True)
qt.start()
#populate query queue
self.parse_sql_queries()
#spawn a pool of writer processes, and pass them output queue instance
for i in range(self.NUM_WRITE_THREADS):
wt = WriteWorker.WriteWorker(self.write_queue, self.output_path, self.WRITE_BUFFER, self.output_dict)
wt.daemon = True
# wt.setDaemon(True)
wt.start()
#wait on the queues until everything has been processed
self.query_queue.join()
self.write_queue.join()
and from the QueryWorker class:
def run(self):
while True:
#grabs host from query queue
query_tupe = self.query_queue.get()
table = query_tupe[0]
query = query_tupe[1]
query_num = query_tupe[2]
if query and table:
#grab connection from pool, run the query
connection = self.engine.connect()
print 'Running query #' + str(query_num) + ': ' + table
try:
result = connection.execute(query)
except:
print 'Error while running query #' + str(query_num) + ': \n\t' + str(query) + '\nError: ' + str(sys.exc_info()[1])
#place result handle tuple into out queue
self.out_queue.put((table, result))
#signals to queue job is done
self.query_queue.task_done()
The simple answer is to avoid putting the ResultProxy itself onto the queue. Instead, fetch the data from the ResultProxy with result.fetchall() or result.fetchmany(number_to_fetch) and pass the plain rows into the multiprocessing queue.
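As a rough sketch of what that change could look like inside the QueryWorker's run() from the question (the chunk size is an arbitrary example; table, out_queue and query_queue are the names already used above):
result = connection.execute(query)
while True:
    rows = result.fetchmany(10000)
    if not rows:
        break
    # plain lists of tuples pickle cleanly, unlike ResultProxy objects
    self.out_queue.put((table, [tuple(row) for row in rows]))
result.close()
self.query_queue.task_done()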
Related
I have this task which is sort of I/O bound and CPU bound at the same time.
Basically I am getting a list of queries from a user, google search them (via custom-search-api), store each query results in a .txt file, and storing all results in a results.txt file.
I was thinking that maybe parallelism might be an advantage here.
My whole task is wrapped with an Object which has 2 member fields which I am supposed to use across all threads/processes (a list and a dictionary).
Therefore, when I use multiprocessing I get weird results (I assume that it is because of my shared resources).
i.e:
class MyObject(object):
_my_list = []
_my_dict = {}
_my_dict contains key:value pairs of "query_name":list().
_my_list is a list of queries to search in google. It is safe to assume that it is not written into.
For each query : I search it on google, grab the top results and store it in _my_dict
I want to do this in parallel. I thought that threading might be good, but it seems to slow the work down.
Here is how I attempted to do it (this is the method that does the entire job per query):
def _do_job(self, query):
""" search the query on google (via http)
save results on a .txt file locally. """
This is the method that is supposed to execute all jobs for all queries in parallel:
def find_articles(self):
p = Pool(processes=len(self._my_list))
p.map_async(self._do_job, self._my_list)
p.close()
p.join()
self._create_final_log()
The above execution does not work, I get corrupted results...
When I use multithreading however, the results are fine, but very slow:
def find_articles(self):
thread_pool = []
for vendor in self._vendors_list:
self._search_validate_cache(vendor)
        thread = threading.Thread(target=self._search_validate_cache, args=(vendor,))
thread_pool.append(thread)
thread.start()
for thread in thread_pool:
thread.join()
self._create_final_log()
Any help would be appreciated, thanks!
I have encountered this while doing similar projects in the past (multiprocessing doesn't work efficiently, single-threaded is too slow, and spawning a thread per query launches too many at once and hits a bottleneck). I found that an efficient way to complete a task like this is to create a thread pool with a limited number of threads. Logically, the fastest way to complete the task is to use as much network capacity as possible without creating a bottleneck, which is why the number of threads actively making requests at any one time is capped.
In your case, cycling through a list of queries with a thread pool and a callback function would be a quick and easy way to go through all the data. Obviously, there are a lot of factors that affect this, such as network speed and finding the right thread-pool size to avoid a bottleneck, but overall I've found this to work well.
import threading
class MultiThread:
def __init__(self, func, list_data, thread_cap=10):
"""
Parameters
----------
func : function
Callback function to multi-thread
threads : int
Amount of threads available in the pool
list_data : list
List of data to multi-thread index
"""
self.func = func
self.thread_cap = thread_cap
        self.thread_pool = []
        self._lock = threading.Lock()  # guards current_index across worker threads
self.current_index = -1
self.total_index = len(list_data) - 1
self.complete = False
self.list_data = list_data
def start(self):
for _ in range(self.thread_cap):
thread = threading.Thread(target=self._wrapper)
self.thread_pool += [thread]
thread.start()
    def _wrapper(self):
        while not self.complete:
            with self._lock:  # prevent two threads from grabbing the same index
                if self.current_index < self.total_index:
                    self.current_index += 1
                    index = self.current_index
                else:
                    self.complete = True
                    return
            self.func(self.list_data[index])
def wait_on_completion(self):
for thread in self.thread_pool:
thread.join()
import requests #, time
_my_dict = {}
base_url = "https://www.google.com/search?q="
s = requests.Session()
def example_callback_func(query):
global _my_dict
# code to grab data here
r = s.get(base_url+query)
_my_dict[query] = r.text # whatever parsed results
print(r, query)
#start_time = time.time()
_my_list = ["examplequery"+str(n) for n in range(100)]
mt = MultiThread(example_callback_func, _my_list, thread_cap=30)
mt.start()
mt.wait_on_completion()
# output queries to file
#print("Time:{:2f}".format(time.time()-start_time))
You could also open the file and output whatever you need to as you go, or output all the data at the end. Obviously, my example here isn't exactly what you need, but it's a solid boilerplate with a lightweight class I made that will greatly reduce the time it takes. It uses a thread pool to call a callback function that takes a single parameter (the query).
In my test here, it completed cycling 100 queries in ~2 seconds. I could definitely play with the thread cap and get the timings lower before I find the bottleneck.
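If several threads end up writing to shared state (the dict above, or one combined results file), one option is to guard those writes with a lock. This is my own addition, not part of the snippet above; it reuses s, base_url and _my_dict from that snippet:
import threading

_dict_lock = threading.Lock()

def locked_callback_func(query):
    # same request as example_callback_func above
    r = s.get(base_url + query)
    with _dict_lock:                      # serialize updates to the shared dict
        _my_dict[query] = r.text
    with open(query + ".txt", "w") as f:  # per-query file, so no sharing needed
        f.write(r.text)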
I am running a python app where I for various reasons have to host my program on a server in one part of the world and then have my database in another.
I tested via a simple script: from my home, which is in a country neighboring the database server, the time to write and retrieve a row from the database is about 0.035 seconds (which is a nice speed, imo), compared to 0.16 seconds when my Python server at the other end of the world performs the same action.
This is an issue, as I am trying to keep my Python app as fast as possible, so I was wondering if there is a smart way to do this.
As I am running my code synchronously my program is waiting every time it has to write to the db, which is about 3 times a second so the time adds up. Is it possible to run the connection to the database in a separate thread or something, so it doesn't halt the whole program while it tries to send data to the database? Or can this be done using asyncio (I have no experience with async code)?
I am really struggling figuring out a good way to solve this issue.
In advance, many thanks!
Yes, you can create a thread that does the writes in the background. In your case, it seems reasonable to have a queue where the main thread puts things to be written and the db thread gets and writes them. The queue can have a maximum depth so that when too much stuff is pending, the main thread waits. You could also do something different, like dropping writes that arrive too quickly, or use a db with synchronization and write to a local copy. You may also be able to speed up the writes a bit by committing several at once.
This is a sketch of a worker thread
import threading
import queue
class SqlWriterThread(threading.Thread):
def __init__(self, db_connect_info, maxsize=8):
super().__init__()
self.db_connect_info = db_connect_info
self.q = queue.Queue(maxsize)
# TODO: Can expose q.put directly if you don't need to
# intercept the call
# self.put = q.put
self.start()
def put(self, statement):
print(f"DEBUG: Putting\n{statement}")
self.q.put(statement)
def run(self):
db_conn = None
while True:
# get all the statements you can, waiting on first
statements = [self.q.get()]
try:
while True:
                    statements.append(self.q.get(block=False))
except queue.Empty:
pass
try:
# early exit before connecting if channel is closed.
if statements[0] is None:
return
if not db_conn:
db_conn = do_my_sql_connect()
try:
print("Debug: Executing\n", "--------\n".join(f"{id(s)} {s}" for s in statements))
                    # todo: need to detect a closed connection, then reconnect and restart the loop
cursor = db_conn.cursor()
for statement in statements:
if statement is None:
return
cursor.execute(*statement)
finally:
                    db_conn.commit()
finally:
for _ in statements:
self.q.task_done()
sql_writer = SqlWriterThread(('user', 'host', 'credentials'))
sql_writer.put(('execute some stuff',))
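One possible way (my addition, not part of the sketch above) to shut the writer down cleanly, using the None sentinel that run() already checks for:
sql_writer.q.join()   # wait until every queued statement has been handled
sql_writer.put(None)  # the sentinel makes run() return
sql_writer.join()     # wait for the thread itself to exit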
I am trying to build an application which will "check out" a cell, which is a square covering a part of land in a geographic database, and perform an analysis of the features within that cell. Since I have many cells to process, I am using a multiprocessing approach.
I had it somewhat working inside of my object like this:
class DistributedGeographicConstraintProcessor:
...
def _process_cell(self, conn_string):
conn = pg2.connect(conn_string)
try:
cur = conn.cursor()
cell_id = self._check_out_cell(cur)
conn.commit()
print(f"processing cell_id {cell_id}...")
for constraint in self.constraints:
# print(f"processing {constraint.name()}...")
query = constraint.prepare_distributed_query(self.job, self.grid)
cur.execute(query, {
"buffer": constraint.buffer(),
"cell_id": cell_id,
"name": constraint.name(),
"simplify_tolerance": constraint.simplify_tolerance()
})
# TODO: do a final race condition check to further suppress duplicates
self._check_in_cell(cur, cell_id)
conn.commit()
finally:
del cur
conn.close()
return None
def run(self):
while True:
if not self._job_finished():
params = [self.conn_string] * self.num_cores
processes = []
for param in params:
process = mp.Process(target=self._process_cell, args=(param,))
processes.append(process)
                    sleep(0.1)  # Prevent multiple processes from checking out the same grid square
process.start()
for process in processes:
process.join()
else:
self._finalize_job()
break
But the problem is that it will only start four processes and wait until they all finish before starting four new processes.
I want to make it so when one process finishes its work, it will begin working on the next cell immediately, even if its co-processes are not yet finished.
I am unsure about how to implement this and I have tried using a pool like this:
def run(self):
pool = mp.Pool(self.num_cores)
unprocessed_cells = self._unprocessed_cells()
for i in pool.imap(self._process_cell, unprocessed_cells):
print(i)
But this just tells me that the connection is not able to be pickled:
TypeError: can't pickle psycopg2.extensions.connection objects
But I do not understand why, because it is the exact same function that I am using in the imap function as in the Process target.
I have already looked at these threads, here is why they do not answer my question:
Error Connecting To PostgreSQL can't pickle psycopg2.extensions.connection objects - The answer here only indicates that multiple processes cannot share the same connection. I am aware of this, and I am initializing the connection inside the function that is executed in the child process. Also, as I mentioned, it works when I map the function to individual Process instances, with the same function and the same inputs.
Multiprocessing result of a psycopg2 request. “Can't pickle psycopg2.extensions.connection objects” - There is no answer nor any comments on that question, and the code is not intact anyway - the author refers to a function that is not specified in the question, and in any case it is obvious that they are trying to share the same cursor between processes.
My guess is that you're attaching some connection object to self; try to rewrite your solution using functions only (no classes/methods).
Here is a simplified version of a single producer/multiple workers solution I used some time ago:
from multiprocessing import Pool
import time

def worker(param):
    # connect to pg
    # do work
    pass

def main():
    pool = Pool(processes=NUM_PROC)
    tasks = []
    for param in params:
        t = pool.apply_async(worker, args=(param,))
        tasks.append(t)
    pool.close()

    finished = False
    while not finished:
        finished = True
        for t in tasks:
            if not t.ready():
                finished = False
                break
        time.sleep(1)
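A related pattern worth mentioning (my own sketch, not from the answer above): if you want one connection per worker process while still using a Pool, create the connection in a Pool initializer so it never needs to be pickled. The query and the cell-processing body here are placeholders:
from multiprocessing import Pool
import psycopg2

_conn = None  # one connection per worker process, created after the fork/spawn

def init_worker(conn_string):
    global _conn
    _conn = psycopg2.connect(conn_string)  # built inside the child, never pickled

def process_cell(cell_id):
    with _conn.cursor() as cur:
        cur.execute("SELECT %s", (cell_id,))  # placeholder for the real per-cell work
    _conn.commit()
    return cell_id

def run(conn_string, cell_ids, num_cores=4):
    with Pool(processes=num_cores, initializer=init_worker, initargs=(conn_string,)) as pool:
        for cell_id in pool.imap_unordered(process_cell, cell_ids):
            print(f"finished cell {cell_id}")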
TL;DR: I am getting different results after running the code with threading, with multiprocessing, and single-threaded. I need guidance on troubleshooting.
Hello, I apologize in advance if this may be a bit too generic, but I need a bit of help troubleshooting an issue and I am not sure how best to proceed.
Here is the story: I have a bunch of data indexed into a Solr collection (~250m items), and all items in that collection have a session id. Some items can share the same session id. I am combing through the collection to extract all items that have the same session, massage the data a bit, and spit out another JSON file for indexing later.
The code has two main functions:
proc_day - accepts a day and processes all the sessions for that day
and
proc_session - does everything that needs to happen for a single session.
Multiprocessing is implemented on proc_day, so each day is processed by a separate process; the proc_session function can be run with threads. Below is the code I am using for threading/multiprocessing. It accepts a function, a list of arguments, and a number of threads/processes. It then creates a queue from the input args, creates the processes/threads, and lets them work through it. I am not posting the actual processing code, since it generally runs fine single-threaded without any issues, but I can post it if needed.
autoprocs.py
import sys
import logging
from multiprocessing import Process, Queue,JoinableQueue
import time
import multiprocessing
import os
def proc_proc(func,data,threads,delay=10):
if threads < 0:
return
q = JoinableQueue()
procs = []
for i in range(threads):
thread = Process(target=proc_exec,args=(func,q))
thread.daemon = True;
thread.start()
procs.append(thread)
for item in data:
q.put(item)
logging.debug(str(os.getpid()) + ' *** Processes started and data loaded into queue waiting')
s = q.qsize()
while s > 0:
logging.info(str(os.getpid()) + " - Proc Queue Size is:" + str(s))
s = q.qsize()
time.sleep(delay)
for p in procs:
logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
p.join(1)
logging.debug(str(os.getpid()) + ' - *** Main Proc waiting')
q.join()
logging.debug(str(os.getpid()) + ' - *** Done')
def proc_exec(func,q):
p = multiprocessing.current_process()
logging.debug(str(os.getpid()) + ' - Starting:{},{}'.format(p.name, p.pid))
while True:
d = q.get()
try:
logging.debug(str(os.getpid()) + " - Starting to Process {}".format(d))
func(d)
sys.stdout.flush()
logging.debug(str(os.getpid()) + " - Marking Task as Done")
q.task_done()
except:
logging.error(str(os.getpid()) + " - Exception in subprocess execution")
logging.error(sys.exc_info()[0])
logging.debug(str(os.getpid()) + 'Ending:{},{}'.format(p.name, p.pid))
autothreads.py:
import threading
import logging
import time
from queue import Queue
def thread_proc(func,data,threads):
if threads < 0:
return "Thead Count not specified"
q = Queue()
for i in range(threads):
thread = threading.Thread(target=thread_exec,args=(func,q))
thread.daemon = True
thread.start()
for item in data:
q.put(item)
logging.debug('*** Main thread waiting')
s = q.qsize()
while s > 0:
logging.debug("Queue Size is:" + str(s))
s = q.qsize()
time.sleep(1)
logging.debug('*** Main thread waiting')
q.join()
logging.debug('*** Done')
def thread_exec(func,q):
while True:
d = q.get()
#logging.debug("Working...")
try:
func(d)
except:
pass
q.task_done()
I am running into problems validating the data after the code runs under different multiprocessing/threading configs. There is a lot of data, so I really need to get multiprocessing working. Here are the results of my test yesterday.
Only with multiprocessing - 10 procs:
Days Processed 30
Sessions Found 3,507,475
Sessions Processed 3,514,496
Files 162,140
Data Output: 1.9GB
multiprocessing and multithreading - 10 procs 10 threads
Days Processed 30
Sessions Found 3,356,362
Sessions Processed 3,272,402
Files 424,005
Data Output: 2.2GB
just threading - 10 threads
Days Processed 31
Sessions Found 3,595,263
Sessions Processed 3,595,263
Files 733,664
Data Output: 3.3GB
Single process/ no threading
Days Processed 31
Sessions Found 3,595,263
Sessions Processed 3,595,263
Files 162,190
Data Output: 1.9GB
These counts were gathered by grepping and counting entries in the log files (one per main process). The first thing that jumps out is that the days processed don't match. However, I manually checked the log files, and it looks like a log entry was simply missing; there are follow-on log entries indicating that the day was actually processed. I have no idea why it was omitted.
I really don't want to write more code to validate this code; it just seems like a terrible waste of time. Is there any alternative?
I gave some general hints in the comments above. I think there are multiple problems with your approach, at very different levels of abstraction. You are also not showing all of the relevant code.
The issue might very well be
in the method you are using to read from Solr, or in how you prepare the data before feeding it to your workers.
in the architecture you have come up with for distributing the work among multiple processes.
in your logging infrastructure (as you have pointed out yourself).
in your analysis approach.
You have to go through all of these points, and given the complexity of the issue, nobody here will be able to identify the exact problems for you.
Regarding points (3) and (4):
If you are not sure about the completeness of your log files, you should perform the analysis based on the payload output of your processing engine. What I am trying to say: the log files probably are just a side product of your data processing. The primary product is the thing you should analyze. Of course it is also important to get your logs right. But these two problems should be treated independently.
My contribution regarding point (2) in the list above:
What is especially suspicious about your multiprocessing-based solution is the way you wait for the workers to finish. You do not seem sure which method you should use to wait for your workers, so you apply three different ones:
First, you monitor the size of the queue in a while loop and wait for it to become 0. This is a non-canonical approach that might actually work.
Secondly, you join() your processes in a weird way:
for p in procs:
logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
p.join(1)
Why are you defining a timeout of one second here without responding to whether the process actually terminated within that time frame? You should either really join a process, i.e. wait until it has terminated, or specify a timeout and, if that timeout expires before the process finishes, treat that situation specially. Your code does not distinguish these situations, so p.join(1) is like writing time.sleep(1) instead.
Thirdly, you join the queue.
So, after making sure that q.qsize() returns 0 and after waiting for another second, do you really think that joining the queue is important? Does it make any difference? One of these approaches should be enough, and you need to think about which of these criteria is most important to your problem. That is, one of these conditions should deterministically imply the other two.
All this looks like a quick & dirty hack of a multiprocessing solution, while you yourself are not really sure how that solution should behave. One of the most important insights I have obtained while working on concurrency architectures: you, the architect, must be 100 % aware of how the communication and control flow works in your system. Not properly monitoring and controlling the state of your worker processes may very well be the source of the issues you are observing.
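For what it's worth, here is a minimal sketch of one canonical way to wait (my own illustration, with do_work as a placeholder for your per-item processing): workers exit on a None sentinel, the parent joins the queue once, then joins each process with no timeout.
from multiprocessing import Process, JoinableQueue

def worker(q):
    while True:
        item = q.get()
        try:
            if item is None:   # sentinel: this worker is done
                return
            do_work(item)      # placeholder for the real per-item job
        finally:
            q.task_done()

def run_all(data, num_procs=10):
    q = JoinableQueue()
    procs = [Process(target=worker, args=(q,)) for _ in range(num_procs)]
    for p in procs:
        p.start()
    for item in data:
        q.put(item)
    for _ in procs:
        q.put(None)            # one sentinel per worker
    q.join()                   # every put() has been matched by a task_done()
    for p in procs:
        p.join()               # no timeout: wait for actual termination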
I figured it out. I followed Jan-Philip's advice and started examining the output data of the multiprocess/multithreaded runs. It turned out that an object that does all this processing of the data from Solr was shared among threads. I did not have any locking mechanisms, so in some cases it had mixed data from multiple sessions, which caused inconsistent output. I validated this by instantiating a new object for every thread, and the counts matched up. It is a bit slower, but still workable.
Thanks
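For reference, a minimal sketch of what that per-thread-object fix could look like inside thread_exec from autothreads.py above. SessionProcessor and its process method are hypothetical stand-ins for the object that was previously shared, and the func parameter is dropped in this variant:
def thread_exec(q):
    processor = SessionProcessor()  # hypothetical: one instance per worker thread
    while True:
        d = q.get()
        try:
            processor.process(d)    # each thread works only on its own instance
        except Exception:
            logging.exception("Exception in thread execution")
        q.task_done()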
Currently, I have a list of URLs to grab content from, and I am doing it serially. I would like to change it to grab them in parallel. This is pseudocode. I would like to ask: is the design sound? I understand that .start() starts the thread; however, my database is not updated. Do I need to use q.get()? Thanks.
import threading
import Queue
q = Queue.Queue()
def do_database(url):
""" grab url then input to database """
webdata = grab_url(url)
try:
insert_data_into_database(webdata)
except:
....
else:
< do I need to do anything with the queue after each db operation is done?>
def put_queue(q, url ):
q.put( do_database(url) )
for myfiles in currentdir:
url = myfiles + some_other_string
t=threading.Thread(target=put_queue,args=(q,url))
t.daemon=True
t.start()
It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, it sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import Queue
NUM_THREADS = 5 # whatever
q = Queue.Queue()
END_OF_DATA = object() # a unique object
class Worker(threading.Thread):
def run(self):
while True:
url = q.get()
if url is END_OF_DATA:
break
webdata = grab_url(url)
try:
# Does your database support concurrent updates
# from multiple threads? If not, need to put
# this in a "with some_global_mutex:" block.
insert_data_into_database(webdata)
except:
#....
threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
t.start()
for myfiles in currentdir:
url = myfiles + some_other_string
q.put(url)
# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
q.put(END_OF_DATA)
# Shut down cleanly. `daemon` is way overused.
for t in threads:
t.join()
You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).
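If Twisted feels heavyweight, the same single-threaded, event-driven idea can be sketched with asyncio and aiohttp (my substitution, not the answer's recommendation; urls and insert_data_into_database come from the question, everything else is illustrative):
import asyncio
import aiohttp

async def fetch_and_store(session, url):
    async with session.get(url) as resp:
        webdata = await resp.text()
    # blocking DB call; acceptable here since writes still happen one at a time
    insert_data_into_database(webdata)

async def main(urls):
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch_and_store(session, url) for url in urls))

asyncio.run(main(urls))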
For the DB, you have to commit before your changes become effective. But committing for every insert is not optimal; committing after a batch of changes gives much better performance.
For parallelism, Python wasn't born for this. For your use case, I suppose using Python with gevent would be a painless solution.
Here is a much more efficient pseudo implementation FYI:
import gevent
from gevent.monkey import patch_all
patch_all() # to use with urllib, etc
from gevent.queue import Queue
def web_worker(q, url):
grab_something
    q.put(result)
def db_worker(q):
buf = []
while True:
buf.append(q.get())
if len(buf) > 20:
insert_stuff_in_buf_to_db
db_commit
buf = []
def run(urls):
    q = Queue()
    gevent.spawn(db_worker, q)
    jobs = [gevent.spawn(web_worker, q, url) for url in urls]
    gevent.joinall(jobs)  # wait for all the fetches to finish
run(urls)
plus, since this implementation is totally single threaded, you can safely manipulate shared data between workers like queue, db connection, global variables etc.