Background
I have some code that looks like this right now.
failed_player_ids: Set[str] = set()
for player_id in player_ids:
    success = player_api.send_results(
        player_id, user=user, send_health_results=True
    )
    if not success:
        failed_player_ids.add(player_id)
This code works well, but the problem is that each call takes about 5 seconds. The rate limit is 2000 calls per minute, so I am way under the maximum capacity, and I want to parallelize this to speed things up. This is my first time using the multiprocessing library in Python, so I am a little confused about how to proceed. I can describe what I want to do in words.
In my current code I loop through the list of player IDs; if the API response is a success I do nothing, and if it failed I make a note of that player ID.
I am not sure how to implement a parallel version of this code. I have some idea, but I am a little confused.
This is what I have thought of so far:
from multiprocessing import Pool
from typing import List, Optional

num_processors_to_use = 5  # This number can be increased to get more speed

def send_player_result(player_id_list: List[str]) -> Optional[str]:
    for player_id in player_id_list:
        success = player_api.send_results(player_id, user=user, send_health_results=True)
        if not success:
            return player_id

# Caller
with Pool(processes=num_processors_to_use) as pool:
    responses = pool.map(
        func=send_player_result,
        iterable=player_id_list,
    )
failed_player_ids = set(responses)
Any comments and suggestions would help.
If you are using the function map, then each item of the iterable player_id_list will be passed as a separate task to the function send_player_result. Consequently, this function should no longer expect to be passed a list of player IDs, but rather a single player ID. And, as you know by now, if your tasks are largely I/O bound, then multithreading is a better model. You can use either:
from multiprocessing.dummy import Pool
# or
from multiprocessing.pool import ThreadPool
You will probably want to greatly increase the number of threads (but not greater than the size of player_id_list):
# from multiprocessing import Pool
from multiprocessing.dummy import Pool
from typing import Set

def send_player_result(player_id):
    success = player_api.send_results(player_id, user=user, send_health_results=True)
    return success

# The __main__ guard is only required on Windows if you are doing multiprocessing:
if __name__ == '__main__':
    pool_size = 5  # This number can be increased to get more concurrency

    # Caller
    failed_player_ids: Set[str] = set()
    with Pool(pool_size) as pool:
        results = pool.map(func=send_player_result, iterable=player_id_list)
    for idx, success in enumerate(results):
        if not success:
            # The call failed for argument player_id_list[idx]:
            failed_player_ids.add(player_id_list[idx])
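For reference, the same pattern can also be written with the standard library's concurrent.futures module. This is just an alternative sketch, assuming the same player_api, user and player_id_list as in the question; the worker count is illustrative and should be tuned against your rate limit.

from concurrent.futures import ThreadPoolExecutor
from typing import Set

def send_player_result(player_id: str) -> bool:
    return player_api.send_results(player_id, user=user, send_health_results=True)

failed_player_ids: Set[str] = set()
with ThreadPoolExecutor(max_workers=50) as executor:
    # executor.map preserves input order, so results can be zipped back to the ids
    for player_id, success in zip(player_id_list, executor.map(send_player_result, player_id_list)):
        if not success:
            failed_player_ids.add(player_id)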
Related
I have this task which is sort of I/O bound and CPU bound at the same time.
Basically, I get a list of queries from a user, Google-search each of them (via a custom-search API), store each query's results in a .txt file, and store all results in a results.txt file.
I was thinking that maybe parallelism might be an advantage here.
My whole task is wrapped in an object which has 2 member fields that I am supposed to use across all threads/processes (a list and a dictionary).
Therefore, when I use multiprocessing I get weird results (I assume that it is because of my shared resources).
i.e.:
class MyObject(object):
    _my_list = []
    _my_dict = {}
_my_dict contains key:value pairs of "query_name":list().
_my_list is a list of queries to search on Google. It is safe to assume that it is not written to.
For each query: I search it on Google, grab the top results, and store them in _my_dict.
I want to do this in parallel. I thought that threading might be good, but it seems to slow the work down.
Here is how I attempted to do it (this is the method that does the entire job per query):
def _do_job(self, query):
    """ search the query on google (via http)
    save results on a .txt file locally. """
This is the method that is supposed to execute all jobs for all queries in parallel:
def find_articles(self):
    p = Pool(processes=len(self._my_list))
    p.map_async(self._do_job, self._my_list)
    p.close()
    p.join()
    self._create_final_log()
The above execution does not work; I get corrupted results...
When I use multithreading, however, the results are fine, but very slow:
def find_articles(self):
    thread_pool = []
    for vendor in self._vendors_list:
        self._search_validate_cache(vendor)
        thread = threading.Thread(target=self._search_validate_cache, args=(vendor,))
        thread_pool.append(thread)
        thread.start()
    for thread in thread_pool:
        thread.join()
    self._create_final_log()
Any help would be appreciated, thanks!
I have encountered this while doing similar projects in the past (multiprocessing doesn't work efficiently, single-threaded is too slow, and starting a thread per query quickly hits a bottleneck). I found that an efficient way to complete a task like this is to create a thread pool with a limited number of threads. Logically, the fastest way to complete the task is to use as many network resources as possible without a bottleneck, which is why the number of threads actively making requests at any one time is capped.
In your case, cycling through a list of queries with a thread pool and a callback function would be a quick and easy way to go through all the data. Obviously, there are a lot of factors that affect this, such as network speed and finding the correct thread-pool size to avoid a bottleneck, but overall I've found this to work well.
import threading

class MultiThread:

    def __init__(self, func, list_data, thread_cap=10):
        """
        Parameters
        ----------
        func : function
            Callback function to multi-thread
        list_data : list
            List of data to multi-thread over
        thread_cap : int
            Maximum number of threads available in the pool
        """
        self.func = func
        self.thread_cap = thread_cap
        self.thread_pool = []
        self.current_index = -1
        self.total_index = len(list_data) - 1
        self.complete = False
        self.list_data = list_data

    def start(self):
        for _ in range(self.thread_cap):
            thread = threading.Thread(target=self._wrapper)
            self.thread_pool += [thread]
            thread.start()

    def _wrapper(self):
        while not self.complete:
            if self.current_index < self.total_index:
                self.current_index += 1
                self.func(self.list_data[self.current_index])
            else:
                self.complete = True

    def wait_on_completion(self):
        for thread in self.thread_pool:
            thread.join()
import requests  # , time

_my_dict = {}
base_url = "https://www.google.com/search?q="
s = requests.sessions.session()

def example_callback_func(query):
    global _my_dict
    # code to grab data here
    r = s.get(base_url + query)
    _my_dict[query] = r.text  # whatever parsed results
    print(r, query)

# start_time = time.time()
_my_list = ["examplequery" + str(n) for n in range(100)]

mt = MultiThread(example_callback_func, _my_list, thread_cap=30)
mt.start()
mt.wait_on_completion()

# output queries to file

# print("Time:{:2f}".format(time.time() - start_time))
You could also open the file and output whatever you need to as you go, or output the data at the end. Obviously, my replica here isn't exactly what you need, but it's solid boilerplate with a lightweight class I made that will greatly reduce the time this takes. It uses a thread pool to call a callback function that takes a single parameter (the query).
In my test here, it completed cycling 100 queries in ~2 seconds. I could definitely play with the thread cap and get the timings lower before I find the bottleneck.
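If you would rather write results to the file as they arrive instead of at the end, one option (just a sketch, reusing the s session and base_url above; the file name is illustrative) is to guard the shared file handle with a lock inside the callback:

import threading

write_lock = threading.Lock()
out_file = open("results.txt", "a")

def example_callback_func(query):
    r = s.get(base_url + query)
    with write_lock:  # only one thread writes to the file at a time
        out_file.write(query + "\n")
        out_file.write(r.text + "\n")  # or whatever parsed results you need
# remember to close out_file once mt.wait_on_completion() has returned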
I have code that makes unique combinations of elements. There are 6 types, and there are about 100 of each. So there are 100^6 combinations. Each combination has to be calculated, checked for relevance and then either be discarded or saved.
The relevant bit of the code looks like this:
def modconffactory():
    for transmitter in totaltransmitterdict.values():
        for reciever in totalrecieverdict.values():
            for processor in totalprocessordict.values():
                for holoarray in totalholoarraydict.values():
                    for databus in totaldatabusdict.values():
                        for multiplexer in totalmultiplexerdict.values():
                            newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
                            data_I_need = dosomethingwith(newconfiguration)
                            saveforlateruse_if_useful(data_I_need)
Now this takes a long time and that is fine, but now I realize this process (making the configurations and then calculations for later use) is only using 1 of my 8 processor cores at a time.
I've been reading up about multithreading and multiprocessing, but I only see examples of different processes, not how to multithread one process. In my code I call two functions: 'dosomethingwith()' and 'saveforlateruse_if_useful()'. I could make those into separate processes and have them run concurrently with the for-loops, right?
But what about the for-loops themselves? Can I speed up that one process? Because that is where the time consumption is. (<-- This is my main question)
Is there a cheat? For instance, compiling to C so that the OS multithreads it automatically?
I only see examples of different processes, not how to multithread one process
There is multithreading in Python, but it is very ineffective because of the GIL (Global Interpreter Lock). So if you want to use all of your processor cores and get real parallelism, you have no other choice than to use multiple processes, which can be done with the multiprocessing module (well, you could also use another language that doesn't have this problem).
An approximate example of multiprocessing usage for your case:
import multiprocessing

WORKERS_NUMBER = 8

def modconffactoryProcess(generator, step, offset, conn):
    """
    Function to be invoked by every worker process.

    generator: iterable object, the very top one of all you are iterating over,
    in your case, totaltransmitterdict.values()

    We are passing a whole iterable object to every worker; they all will iterate
    over it. To ensure they will not waste time by doing the same things
    concurrently, we will assume this: each worker will process only every stepTH
    item, starting with the offsetTH one. step must be equal to WORKERS_NUMBER,
    and offset must be a unique number for each worker, varying from 0 to
    WORKERS_NUMBER - 1.

    conn: a multiprocessing.Connection object, allowing the worker to communicate
    with the main process
    """
    for i, transmitter in enumerate(generator):
        if i % step == offset:
            for reciever in totalrecieverdict.values():
                for processor in totalprocessordict.values():
                    for holoarray in totalholoarraydict.values():
                        for databus in totaldatabusdict.values():
                            for multiplexer in totalmultiplexerdict.values():
                                newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
                                data_I_need = dosomethingwith(newconfiguration)
                                saveforlateruse_if_useful(data_I_need)
    conn.send('done')

def modconffactory():
    """
    Function to launch all the worker processes and wait until they all complete
    their tasks
    """
    processes = []
    generator = totaltransmitterdict.values()
    for i in range(WORKERS_NUMBER):
        conn, childConn = multiprocessing.Pipe()
        process = multiprocessing.Process(target=modconffactoryProcess, args=(generator, WORKERS_NUMBER, i, childConn))
        process.start()
        processes.append((process, conn))
    # Here we have created, started and saved to a list all the worker processes
    working = True
    finishedProcessesNumber = 0
    try:
        while working:
            for process, conn in processes:
                if conn.poll():  # Check if any messages have arrived from a worker
                    message = conn.recv()
                    if message == 'done':
                        finishedProcessesNumber += 1
                        if finishedProcessesNumber == WORKERS_NUMBER:
                            working = False
    except KeyboardInterrupt:
        print('Aborted')
You can adjust WORKERS_NUMBER to your needs.
Same with multiprocessing.Pool:
import multiprocessing

WORKERS_NUMBER = 8

def modconffactoryProcess(transmitter):
    for reciever in totalrecieverdict.values():
        for processor in totalprocessordict.values():
            for holoarray in totalholoarraydict.values():
                for databus in totaldatabusdict.values():
                    for multiplexer in totalmultiplexerdict.values():
                        newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
                        data_I_need = dosomethingwith(newconfiguration)
                        saveforlateruse_if_useful(data_I_need)

def modconffactory():
    pool = multiprocessing.Pool(WORKERS_NUMBER)
    pool.map(modconffactoryProcess, totaltransmitterdict.values())
You probably would like to use .map_async instead of .map
Both snippets do the same thing, but I would say the first one gives you more control over the program.
I suppose the second one is the easiest, though :)
But the first one should give you an idea of what is happening in the second one.
multiprocessing docs: https://docs.python.org/3/library/multiprocessing.html
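If you do switch to .map_async as suggested above, a minimal sketch (reusing modconffactoryProcess, totaltransmitterdict and WORKERS_NUMBER from the snippets) could look like this:

def modconffactory():
    pool = multiprocessing.Pool(WORKERS_NUMBER)
    async_result = pool.map_async(modconffactoryProcess, totaltransmitterdict.values())
    pool.close()          # no more tasks will be submitted
    # ... do other work here if you like; async_result.ready() tells you if it is done
    async_result.wait()   # block until all workers have finished
    pool.join()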
You can run your function this way:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))
https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers
I basically want to be able to create a chord consisting of a group of chains. Beyond the fact that I can't seem to make that work, there is the issue that all the sub-chains have to complete before the chord callback is fired.
So my thought was to create a while loop like:
data = [foo.delay(i) for i in bar]
complete = {}
L = len(data)
cnt = 0
while cnt != L:
    for i in data:
        ID = i.task_id
        try:
            complete[ID]
        except KeyError:
            if i.status == 'SUCCESS':
                complete[ID] = run_hourly.delay(i.result)
                cnt += 1
        if cnt >= L:
            return complete.values()
The idea was that when a task was ready it could be acted on without having to wait for other tasks to complete.
The problem I'm having is that the status of some tasks never gets past the 'PENDING' state.
All tasks will reach the 'SUCCESS' state if I add a time.sleep(x) line to the for loop, but with a large number of sub-tasks in data that solution becomes grossly inefficient.
I'm using memcached as my results backend and rabbitmq. My guess is that the speed of the for loop that iterates over data, calling attributes of its tasks, creates a race condition that breaks the connection to celery's messaging, which leaves these zombie tasks stuck in the 'PENDING' state. But then again I could be completely wrong, and it certainly wouldn't be the first time.
My questions
Why is time.sleep(foo) needed to avoid a perpetually PENDING task when iterating over a list of just launched tasks?
When a celery task is performing a loop, is it blocking? When I try to shut down the worker that gets stuck in an infinite loop, I am unable to do so and have to manually find the Python process running the worker and kill it. If I leave the worker running, the Python process running it will eventually start to consume several gigs of memory, growing exponentially and seemingly without bound.
Any insight on this matter would be appreciated. I'm also open to suggestions on ways to avoid the while loop entirely.
I appreciate your time. Thank you.
My chord consisting of a group of chains was being constructed and executed from within a celery task, which will create problems if you need to access the results of those tasks. Below is a summary of what I was trying to do, what I ended up doing, and what I think I learned in the process, so maybe it can help someone else.
--common_tasks.py--
from cel_test.celery import app

@app.task(ignore_result=False)
def qry(sql, writing=False):
    import psycopg2
    conn = psycopg2.connect(dbname='space_test', user='foo', host='localhost')
    cur = conn.cursor()
    cur.execute(sql)
    if writing == True:
        cur.close()
        conn.commit()
        conn.close()
        return
    a = cur.fetchall()
    cur.close()
    conn.close()
    return a
--foo_tasks.py --
from __future__ import absolute_import
from celery import chain, chord, group
import celery
from cel_test.celery import app
from weather.parse_weather import parse_weather
from cel_test.common_tasks import qry
import cPickle as PKL
import csv
import requests

@app.task(ignore_result=False)
def write_the_csv(data, file_name, sql):
    with open(file_name, 'wb') as fp:
        a = csv.writer(fp, delimiter=',')
        for i in data:
            a.writerows(i)
    qry.delay(sql, True)
    return True

@app.task(ignore_result=False)
def idx_point_qry(DIR='/bar/foo/'):
    with open(''.join([DIR, 'thing.pkl']), 'rb') as f:
        a = PKL.load(f)
    return [(i, a[i]) for i in a]

@app.task(ignore_result=False)
def load_page(url):
    page = requests.get(url, **{'timeout': 30})
    if page.status_code == 200:
        return page.json()

@app.task(ignore_result=False)
def run_hourly(page, info):
    return parse_weather(page, info).hourly()

@app.task(ignore_result=False)
def pool_loop(info):
    data = []
    for iz in info:
        a = chain(load_page.s(url, iz), run_hourly.s())()
        data.append(a)
    return data

@app.task(ignore_result=False)
def run_update(file_name, sql, writing=True):
    chain(idx_point_qry.s(), pool_loop.s(file_name, sql), write_the_csv.s(writing, sql))()
    return True
--- separate.py file ---
from foo_tasks import *

def update_hourly_weather():
    fn = '/weather/csv_data/hourly_weather.csv'
    run_update.delay(fn, "SELECT * from hourly_weather();")
    return True

update_hourly_weather()
I tried 30 or so combinations of .py files listed above along with several combinations of the code existing in them. I tried chords, groups, chains, launching tasks from different tasks, combining tasks.
A few combinations did end up working, but I had to call .get() directly on data in the write_the_csv task, and celery was throwing a warning that in 4.0 calling get() within a task would raise an error, so I thought I ought not to be doing that.
In essence my problem was (and still is) poor design of the tasks and the flow between them, which was leading me to the problem of having to synchronize tasks from within other tasks.
The while loop I proposed in my question was an attempt to asynchronously launch tasks when another task's status had become COMPLETE, then launch another task when that one had completed, and so on, as opposed to synchronously letting celery do so through a call to chord or chain. What I seemed to find (and I'm not sure I'm right about this) was that from within a celery task you don't have access to the scope needed to discern such things. The following is a quote from the state section of the celery documentation on tasks: "asserting the world is the responsibility of the task".
I was launching tasks that the calling task had no idea existed, and as such it was unable to know anything about them.
My solution was to load and iterate through the .pkl file and launch a group of chains with a callback synchronously. I basically replaced the pool_loop task with the code below and launched the tasks synchronously instead of asynchronously.
--the_end.py--
from foo_tasks import *
from common_tasks import *

def update_hourly_weather():
    fn = '/weather/csv_data/hourly_weather.csv'
    dd = []
    idx = idx_point_qry()
    for i in idx:
        dd.append(chain(load_page.s(i), run_hourly.s(i)))
    vv = chord(dd)(write_the_csv.s(fn, "SELECT * from hourly_weather();"))
    return

update_hourly_weather()
Currently, I have a list of URLs to grab contents from, and I am doing it serially. I would like to change it to grab them in parallel. This is pseudocode. I would like to ask: is the design sound? I understand that .start() starts the thread; however, my database is not updated. Do I need to use q.get()? Thanks.
import threading
import Queue

q = Queue.Queue()

def do_database(url):
    """ grab url then input to database """
    webdata = grab_url(url)
    try:
        insert_data_into_database(webdata)
    except:
        ....
    else:
        < do I need to do anything with the queue after each db operation is done?>

def put_queue(q, url):
    q.put(do_database(url))

for myfiles in currentdir:
    url = myfiles + some_other_string
    t = threading.Thread(target=put_queue, args=(q, url))
    t.daemon = True
    t.start()
It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, it sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import Queue

NUM_THREADS = 5  # whatever

q = Queue.Queue()
END_OF_DATA = object()  # a unique object

class Worker(threading.Thread):
    def run(self):
        while True:
            url = q.get()
            if url is END_OF_DATA:
                break
            webdata = grab_url(url)
            try:
                # Does your database support concurrent updates
                # from multiple threads? If not, need to put
                # this in a "with some_global_mutex:" block.
                insert_data_into_database(webdata)
            except:
                # ....
                pass

threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for myfiles in currentdir:
    url = myfiles + some_other_string
    q.put(url)

# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
    q.put(END_OF_DATA)

# Shut down cleanly. `daemon` is way overused.
for t in threads:
    t.join()
You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).
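A rough sketch of what that can look like with Twisted's Agent (assuming Twisted is installed; insert_data_into_database is the placeholder name from the question, and error handling is omitted):

from twisted.internet import reactor, defer
from twisted.web.client import Agent, readBody

agent = Agent(reactor)

def fetch_and_store(url):
    # Issue the request, read the body, then hand it to the database function.
    d = agent.request(b'GET', url.encode('ascii'))
    d.addCallback(readBody)
    d.addCallback(insert_data_into_database)
    return d

def crawl(urls):
    deferreds = [fetch_and_store(u) for u in urls]
    dl = defer.DeferredList(deferreds, consumeErrors=True)
    dl.addCallback(lambda _: reactor.stop())  # stop once every request has finished
    reactor.run()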
For the DB, you have to commit before your changes become effective. But committing after every insert is not optimal; committing after bulk changes gives much better performance.
As for parallelism, Python wasn't born for it. For your use case, I suppose using Python with gevent would be a painless solution.
Here is a much more efficient pseudo-implementation FYI:
import gevent
from gevent.monkey import patch_all
patch_all()  # to use with urllib, etc
from gevent.queue import Queue

def web_worker(q, url):
    grab_something
    q.put(result)

def db_worker(q):
    buf = []
    while True:
        buf.append(q.get())
        if len(buf) > 20:
            insert_stuff_in_buf_to_db
            db_commit
            buf = []

def run(urls):
    q = Queue()
    gevent.spawn(db_worker, q)
    for url in urls:
        gevent.spawn(web_worker, q, url)

run(urls)
Plus, since this implementation is totally single-threaded, you can safely manipulate shared data between workers, like the queue, the DB connection, global variables, etc.
I'm doing data-scraping calls with urllib2, yet each of them takes around 1 second to complete. I wanted to test whether I could multi-thread the URL-call loop using threads with different offsets.
I'm doing this now with my update_items() method, where the first and second parameters are the offset and limit of the loop:
import threading
t1 = threading.Thread(target=trade.update_items(1, 100))
t2 = threading.Thread(target=trade.update_items(101, 200))
t3 = threading.Thread(target=trade.update_items(201, 300))
t1.start()
t2.start()
t3.start()
#t1.join()
#t2.join()
#t3.join()
As shown in the code, I tried to comment out the join() calls to prevent waiting on the threads, but it seems I've got the idea of this library wrong. I inserted print() functions into the update_items() method, and funnily enough it shows that it's still looping in a serial routine, not with all 3 threads in parallel like I wanted.
My normal scraping protocol takes about 5 hours to complete, and it's only very small pieces of data, but the HTTP call always takes some time. I want to multi-thread this task to shorten the fetching to around 30-45 minutes.
To get multiple URLs in parallel, limited to 20 connections at a time:
import urllib2
from multiprocessing.dummy import Pool

def generate_urls():  # generate some dummy urls
    for i in range(100):
        yield 'http://example.com?param=%d' % i

def get_url(url):
    try:
        return url, urllib2.urlopen(url).read(), None
    except EnvironmentError as e:
        return url, None, e

pool = Pool(20)  # limit number of concurrent connections
for url, result, error in pool.imap_unordered(get_url, generate_urls()):
    if error is None:
        print result,
Paul Seeb has correctly diagnosed your issue.
You are calling trade.update_items, and then passing the result to the threading.Thread constructor. Thus, you get serial behavior: your threads don't do any work, and the creation of each one is delayed until the update_items call returns.
The correct form is threading.Thread(target=trade.update_items, args=(1, 100)) for the first line, and similarly for the later ones. This will pass the update_items function as the thread entry point and (1, 100) as its positional arguments.
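A short sketch of the corrected version, assuming the same trade.update_items method from the question:

import threading

# Pass the callable and its arguments separately; each thread then runs
# update_items itself, so the three ranges are fetched in parallel.
threads = [
    threading.Thread(target=trade.update_items, args=(1, 100)),
    threading.Thread(target=trade.update_items, args=(101, 200)),
    threading.Thread(target=trade.update_items, args=(201, 300)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all three ranges to finish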