I have a list of around 28K numbers named "y", and I loop over it calling an API to send a message to each number. This takes a lot of time (1.2797 seconds per call, to be exact).
Code:
import timeit
import requests
start = timeit.default_timer()
for i in y:
    data = {'From': 'XXXX', 'To': str(i),
            'Body': "ABC ABC"}
    requests.post('https://xxxx:xx@api.xxx.com/v1/Accounts/xxx/Sms/send', data=data)
stop = timeit.default_timer()
print('Time: ', stop - start)
How can I reduce the time this takes?
Asyncio and multithreading are two possible ways to speed up your code. In the examples here both end up running the blocking call in threads, so they basically do the same thing under the hood:
Threaded
import timeit
import threading
import time
y = list(range(50))
def post_data(server, data, sleep_time=1.5):
    # time.sleep stands in for the real network call
    time.sleep(sleep_time)
    # requests.post(server, data=data)
start = timeit.default_timer()
server = 'https://xxxx:xx@api.xxx.com/v1/Accounts/xxx/Sms/send'
threads = []
for i in y:
    # If you don't need to wait for your threads, don't hold them in memory after they are done;
    # call threading.Thread(target=..., args=...).start() instead.
    # This is especially important if you want to send a large number of messages.
    threads.append(threading.Thread(target=post_data,
                                    args=(server, {'From': 'XXXX', 'To': str(i), 'Body': "ABC ABC"})))
    threads[-1].start()
for thread in threads:
    # optional if you want to wait for completion of the concurrent posts
    thread.join()
stop = timeit.default_timer()
print('Time: ', stop - start)
Asyncio
Referring to this answer.
import timeit
import time
import asyncio
from concurrent.futures import ThreadPoolExecutor
y = list(range(50))
_executor = ThreadPoolExecutor(len(y))
# Create the loop explicitly; get_event_loop() without a running loop is deprecated in recent Python versions
loop = asyncio.new_event_loop()
def post_data(server, data, sleep_time=1.5):
    # time.sleep stands in for the real network call
    time.sleep(sleep_time)
    # requests.post(server, data=data)
async def post_data_async(server, data):
    return await loop.run_in_executor(_executor, lambda: post_data(server, data))
async def run(y, server):
    return await asyncio.gather(*[post_data_async(server, {'From': 'XXXX', 'To': str(i), 'Body': "ABC ABC"})
                                  for i in y])
start = timeit.default_timer()
server = 'https://xxxx:xx@api.xxx.com/v1/Accounts/xxx/Sms/send'
loop.run_until_complete(run(y, server))
stop = timeit.default_timer()
print('Time: ', stop - start)
When using an API that does not support asyncio but would profit from concurrency, like your use-case, I'd tend towards using threading as it's easier to read IMHO. If your API/Library does support asyncio, go for it! It's great!
On my machine, with a list of 50 elements, the asyncio solution clocks in at 1.515 seconds of runtime while the threaded solution needs about 1.509 seconds, when executing 50 instances of time.sleep(1.5).
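With roughly 28K numbers, starting one thread per message may be more than the machine or the API tolerates. The same idea can be sketched with a bounded pool. This is only a sketch under assumptions: send_message is a hypothetical stand-in that sleeps instead of calling requests.post, and the max_workers value of 20 is arbitrary; neither comes from the original post.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_message(number, delay=0.05):
    """Hypothetical stand-in for the blocking requests.post call."""
    time.sleep(delay)  # simulates network latency
    return str(number)

def send_all(numbers, max_workers=20):
    """Send to every number using at most max_workers threads at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map preserves input order and the list() call waits for every result
        return list(executor.map(send_message, numbers))

numbers = list(range(100))
start = time.perf_counter()
sent = send_all(numbers)
elapsed = time.perf_counter() - start
print(f"Sent {len(sent)} messages in {elapsed:.2f}s")
```

The pool size is the knob to tune: a larger value overlaps more requests, but the API's own rate limits may become the real ceiling.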
Related
I am getting this error when using the "submit" method of ProcessPoolExecutor.
Exception has occurred: TypeError
'Future' object is not iterable
File "C:......\test3.py", line 28, in
for f in as_completed(res):
import time
import json
import os
import requests
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from concurrent.futures import as_completed
BAN_API_URL = 'https://api-adresse.data.gouv.fr/search/'
def get_french_addresses(request):
    print(f"Started task with pid: {os.getpid()} fetch addresses: {request['search_field']}")
    query_params = {'q': request['search_field'], 'type': 'housenumber', 'autocomplete': 1}
    response = requests.get(BAN_API_URL, params=query_params)
    print(f"Finished task with pid: {os.getpid()} to address: {request['search_field']}")
    return json.loads(response.text)
request_data = [
    {'search_field': '17 rue saint maur'},
    {'search_field': '35 boulevard voltaire'},
    {'search_field': '32 rue rivoli'},
    {'search_field': 'Route de la Croqueterie'},
]
if __name__ == '__main__':
    start_time = time.time()
    # Execute asynchronously with multi threads
    with ProcessPoolExecutor() as executor:
        res = executor.submit(get_french_addresses, request_data)
        print(res)
        for f in as_completed(res):
            print(f.result())
    end_time = time.time()
    print(f'Total time to run multithreads: {end_time - start_time:2f}s')
You are using submit, which passes all of the data to the function at once and returns a single Future. What you want is map, which passes the data one item at a time, like so:
res = executor.map(get_french_addresses, request_data)
Or, if you need to keep using submit, you will have to split your data yourself:
res = []
with ProcessPoolExecutor() as executor:
    for item in request_data:
        res.append(executor.submit(get_french_addresses, item))
    print(res)
    for f in as_completed(res):
        print(f.result())
The simplest edit that avoids the error is to change
for f in as_completed(res):
to
for f in as_completed([res]):
However, this way it will be almost equivalent to a synchronous call (I say 'almost' because some code could still execute between submit and as_completed, but because of the GIL it should either be async itself or invoke some IO).
If you want the function get_french_addresses to return data asynchronously (as it processes it), it must be rewritten to support that.
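The submit-per-item approach can be sketched so that each result stays tied to the input that produced it. This is a minimal sketch under assumptions: lookup is a hypothetical stand-in for get_french_addresses that does no network access, and ThreadPoolExecutor is swapped in for ProcessPoolExecutor only to keep the example light; both executors share the same submit and as_completed interface.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def lookup(request):
    """Hypothetical stand-in for get_french_addresses (no network access)."""
    field = request["search_field"]
    return {"query": field, "length": len(field)}

request_data = [
    {"search_field": "17 rue saint maur"},
    {"search_field": "35 boulevard voltaire"},
    {"search_field": "32 rue rivoli"},
]

results = {}
with ThreadPoolExecutor() as executor:
    # One future per item; remember which input each future belongs to
    future_to_query = {
        executor.submit(lookup, item): item["search_field"] for item in request_data
    }
    # as_completed takes an iterable of futures and yields each one as it finishes
    for future in as_completed(future_to_query):
        results[future_to_query[future]] = future.result()

print(results)
```

Iterating over the dictionary passes its keys (the futures) to as_completed, which is why a single Future object on its own triggers the "not iterable" error.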
I use Python multiprocessing to compute scores on DNA sequences from a large file.
For that I wrote the script below.
I use a Linux machine with 48 CPUs in a Python 3.8 environment.
The code works fine: it finishes the work correctly and prints the processing time at the end.
Problem: when I use the htop command, I find that all 48 processes are still alive.
I don't know why, and I don't know what to add to my script to avoid this.
import csv
import sys
import datetime
import concurrent.futures
from itertools import combinations
import psutil
import time
nb_cpu = psutil.cpu_count(logical=False)
def fun_job(seq_1, seq_2):  # seq_i : (id, string)
    start = time.time()
    score_dist = compute_score_dist(seq_1[1], seq_2[1])
    end = time.time()
    return seq_1[0], seq_2[0], score_dist, end - start  # id seq1, id seq2, score, time
def help_fun_job(nested_pair):
    return fun_job(nested_pair[0], nested_pair[1])
def compute_using_multi_processing(list_comb_ids, dict_ids_seqs):
    start = time.perf_counter()
    with concurrent.futures.ProcessPoolExecutor(max_workers=nb_cpu) as executor:
        results = executor.map(help_fun_job,
                               [((pair_ids[0], dict_ids_seqs[pair_ids[0]]), (pair_ids[1], dict_ids_seqs[pair_ids[1]]))
                                for pair_ids in list_comb_ids])
    save_results_to_csv(results)
    finish = time.perf_counter()
    proccessing_time = str(datetime.timedelta(seconds=round(finish - start, 2)))
    print(f' Processing time Finished in {proccessing_time} hh:mm:ss')
def main():
    print("nb_cpu in this machine : ", nb_cpu)
    file_path = sys.argv[1]
    dict_ids_seqs = get_dict_ids_seqs(file_path)
    list_ids = list(dict_ids_seqs)  # This will convert the dict_keys to a list
    list_combined_ids = list(combinations(list_ids, 2))
    compute_using_multi_processing(list_combined_ids, dict_ids_seqs)
if __name__ == '__main__':
    main()
Thank you for your help.
Edit: adding the complete code for fun_job (after @Booboo's answer)
from Bio import Align
def fun_job(seq_1, seq_2):  # seq_i : (id, string)
    start = time.time()
    aligner = Align.PairwiseAligner()
    aligner.mode = 'global'
    score_dist = aligner.score(seq_1[1], seq_2[1])
    end = time.time()
    return seq_1[0], seq_2[0], score_dist, end - start  # id seq1, id seq2, score, time
When the with ... as executor: block exits, there is an implicit call to executor.shutdown(wait=True). This will wait for all pending futures to be done executing "and the resources associated with the executor have been freed", which presumably includes terminating the processes in the pool (if possible?). Why your program terminates (or does it?), or at least why all the futures have completed executing while the processes have not terminated, is a bit of a mystery. But you haven't provided the code for fun_job, so who can say why this is so?
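The implicit shutdown can be observed directly. This is a small sketch, not a diagnosis of the question's problem: ThreadPoolExecutor is used as a stand-in because it shares the Executor interface with ProcessPoolExecutor, and square is a hypothetical worker function. It shows that futures are finished when the block exits and that the executor then refuses new work; it does not show whether operating-system processes have gone away.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    """Hypothetical worker function."""
    return x * x

with ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(square, 7)

# Leaving the with block called executor.shutdown(wait=True),
# so the submitted future has finished by this point
finished = future.done()
value = future.result()

# An executor that has been shut down refuses new work
try:
    executor.submit(square, 8)
    rejected = False
except RuntimeError:
    rejected = True

print(finished, value, rejected)
```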
One thing you might try is to switch to using the multiprocessing.pool.Pool class from the multiprocessing module. It supports a terminate method, which is implicitly called when its context manager with block exits, that explicitly attempts to terminate all processes in the pool:
#import concurrent.futures
import multiprocessing
... # etc.
def compute_using_multi_processing(list_comb_ids, dict_ids_seqs):
    start = time.perf_counter()
    with multiprocessing.Pool(processes=nb_cpu) as executor:
        results = executor.map(help_fun_job,
                               [((pair_ids[0], dict_ids_seqs[pair_ids[0]]), (pair_ids[1], dict_ids_seqs[pair_ids[1]]))
                                for pair_ids in list_comb_ids])
    save_results_to_csv(results)
    finish = time.perf_counter()
    proccessing_time = str(datetime.timedelta(seconds=round(finish - start, 2)))
    print(f' Processing time Finished in {proccessing_time} hh:mm:ss')
I've been trying to wrap my head around multiprocessing using an old python bitcoin mining program. Although relatively useless for mining, I figured this would be a great way to explore multiprocessing. However, I've hit a wall when it comes to stopping the processes when one of them achieves the goal they are all working towards.
I want to kill all multiprocessing pools when one of them finds the solution. Then allow the program to continue. I have tried terminate() and join(). I've attempted to include an Event(). I've tried using Process instead of Pool with the direction of a similar issue here: Killing a multiprocessing process when condition is met. However, same problem. How can I stop all processes after a condition is met without exiting the program with something like sys.exit() that would kill the entire program?
I also tried apply_async, following the direction from this post: Python Multiprocess Pool. How to exit the script when one of the worker process determines no more work needs to be done? However, it did not solve the problem of needing to continue executing the final functions of the program. In fact, it actually slowed the program significantly.
For clarity, I've included the code I tried based on the above mentioned link here:
from multiprocessing import Pool
from hashlib import sha256
import time
def SHA256(text):
    return sha256(text.encode("ascii")).hexdigest()
def solution_helper(args):
    solution, nonce = do_job(args)
    if solution:
        print(f"\nNonce Found: {nonce}\n")
        return True
    else:
        return False
class Mining():
    def __init__(self, workers, initargs):
        self.pool = Pool(processes=workers, initargs=initargs)
    def callback(self, result):
        if result:
            print('Solution Found...Terminating Processes...')
            self.pool.terminate()
    def do_job(self):
        for args in values:
            start_nonce = args[0]
            end_nonce = args[1]
            prefix_str = '0'*difficulty
            self.pool.apply_async(solution_helper, args=args, callback=self.callback)
            start = time.time()
            for nonce in range(start_nonce, end_nonce):
                text = str(block_number) + transactions + previous_hash + str(nonce)
                new_hash = SHA256(text)
                if new_hash.startswith(prefix_str):
                    print(f"Hashing: {text}")
                    print(f"\nSuccessfully mined bitcoin with nonce value: {nonce}\n")
                    print(f"New hash: {new_hash}")
                    total_time = str((time.time()-start))
                    print(f"\nEnd mning... Mining took {total_time} seconds\n")
                    return new_hash, nonce
        self.pool.close()
        self.pool.join()
        print('.Goodbye.')
block_number = 5
transactions = """
bill->steve->20,
jan->phillis->45
"""
previous_hash = '0000000b7c7723e4d3a8654c975fe4dd23d4d37f22d0ea7e5abde2225d1567dc6'
values = [(20000, 100000), (100000, 1000000), (1000000, 10000000), (10000000, 100000000)]
difficulty = 4
m = Mining(5, values)
m.do_job()
Here's the basic concept. It works great to start the processes, but I cannot figure out how to stop them:
from multiprocessing import Pool
from hashlib import sha256
import functools
MAX_NONCE = 1000000000
def SHA256(text):
    return sha256(text.encode("ascii")).hexdigest()
def nonce(block_number, transactions, previous_hash, prefix_str):
    import time
    start = time.time()
    for nonce in range(MAX_NONCE):
        text = str(block_number) + transactions + previous_hash + str(nonce)
        new_hash = SHA256(text)
        if new_hash.startswith(prefix_str):
            print(f"\nYay! Successfully mined bitcoins with nonce value:{nonce}")
            total_time = str((time.time()-start))
            print(f"\nend mining. Mining took: {total_time} seconds\n")
            print(new_hash + "\n")
def mine(block_number, transactions, previous_hash, prefix_zeros):
    from multiprocessing import Pool
    with Pool(4) as p:
        prefix_str = '0'*prefix_zeros
        p.map(nonce(block_number, transactions, previous_hash, prefix_str), [20000, 40000, 60000, 80000, 100000])
if __name__=='__main__':
    transactions="""
bill->steve->20,
jan->phillis->45
"""
    difficulty=7
    print("\nstart mining\n")
    new_hash = mine(5, transactions, '0000000b7c7723e4d3a8654c975fe4dd23d4d37f22d0ea7e5abde2225d1567dc6', difficulty)
    # Do some other things... Here is where I'd like to get to after the multiproccesses are killed
    print(f"\nMission Complete...{new_hash}\n")  # This never gets a chance to happen
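One possible way to stop all workers once any of them succeeds, sketched under assumptions rather than taken from the posts above: split the nonce space into small chunks, consume the results with imap_unordered, and leave the pool's with block as soon as a chunk reports a match. Exiting a multiprocessing.Pool context manager calls terminate(), which stops the remaining workers while the main program carries on. The names search_range, make_tasks and mine, and the chunk size, are illustrative choices.

```python
from hashlib import sha256
from multiprocessing import Pool

def search_range(task):
    """Scan [start, end) and return (nonce, hash) for the first match, else None."""
    prefix, start, end, target = task
    for nonce in range(start, end):
        digest = sha256((prefix + str(nonce)).encode("ascii")).hexdigest()
        if digest.startswith(target):
            return nonce, digest
    return None

def make_tasks(prefix, target, max_nonce, chunk_size):
    """Split the nonce space into small independent chunks."""
    return [(prefix, start, min(start + chunk_size, max_nonce), target)
            for start in range(0, max_nonce, chunk_size)]

def mine(prefix, difficulty, max_nonce=10_000_000, chunk_size=50_000, workers=4):
    """Return (nonce, hash) from whichever chunk reports a match first, else None."""
    tasks = make_tasks(prefix, "0" * difficulty, max_nonce, chunk_size)
    with Pool(workers) as pool:
        # imap_unordered yields each chunk's result as soon as it is ready
        for found in pool.imap_unordered(search_range, tasks):
            if found is not None:
                # Returning leaves the with block, which terminates the pool
                return found
    return None
```

Calling result = mine(str(block_number) + transactions + previous_hash, difficulty) under an if __name__ == '__main__': guard returns control to the caller, so the code after it still runs. Smaller chunks let the pool stop sooner at the cost of more scheduling overhead, and the nonce returned is not necessarily the smallest valid one.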
I have a database record set (approx. 1000 rows) and I am currently iterating through the rows, enriching each record with data from an extra DB query.
Doing that raises the overall processing time to maybe 100 seconds.
What I want to do is split the work across 2-4 processes.
I am using Python 2.7 to have AWS Lambda compatibility.
def handler(event, context):
    try:
        records = connection.get_users()
        mandrill_client = open_mandrill_connection()
        mandrill_messages = get_mandrill_messages()
        mandrill_template = 'POINTS weekly-report-to-user'
        start_time = time.time()
        messages = build_messages(mandrill_messages, records)
        print("OVERALL: %s seconds ---" % (time.time() - start_time))
        send_mandrill_message(mandrill_client, mandrill_template, messages)
        connection.close_database_connection()
        return "Process Completed"
    except Exception as e:
        print(e)
Following is the function which I want to put into threads:
def build_messages(messages, records):
    for record in records:
        record = dict(record)
        stream = get_user_stream(record)
        data = compile_loyalty_stream(stream)
        messages['to'].append({
            'email': record['email'],
            'type': 'to'
        })
        messages['merge_vars'].append({
            'rcpt': record['email'],
            'vars': [
                {
                    'name': 'total_points',
                    'content': record['total_points']
                },
                {
                    'name': 'total_week',
                    'content': record['week_points']
                },
                {
                    'name': 'stream_greek',
                    'content': data['el']
                },
                {
                    'name': 'stream_english',
                    'content': data['en']
                }
            ]
        })
    return messages
What I have tried is importing the multiprocessing library:
from multiprocessing.pool import ThreadPool
Created a pool inside the try block and mapped the function inside this pool:
pool = ThreadPool(4)
messages = pool.map(build_messages_in, itertools.izip(itertools.repeat(mandrill_messages), records))
def build_messages_in(a_b):
    build_msg(*a_b)
def build_msg(a, b):
    return build_messages(a, b)
def get_user_stream(record):
    response = []
    i = 0
    for mod, mod_id, act, p, act_created in izip(record['models'], record['model_ids'], record['actions'],
                                                 record['points'], record['action_creation']):
        information = get_reference(mod, mod_id)
        if information:
            response.append({
                'action': act,
                'points': p,
                'created': act_created,
                'info': information
            })
            if (act == 'invite_friend') \
                    or (act == 'donate') \
                    or (act == 'bonus_500_general') \
                    or (act == 'bonus_1000_general') \
                    or (act == 'bonus_500_cancel') \
                    or (act == 'bonus_1000_cancel'):
                response[i]['info']['date_ref'] = act_created
                response[i]['info']['slug'] = 'attiki'
            if (act == 'bonus_500_general') \
                    or (act == 'bonus_1000_general') \
                    or (act == 'bonus_500_cancel') \
                    or (act == 'bonus_1000_cancel'):
                response[i]['info']['title'] = ''
            i += 1
    return response
Finally, I removed the for loop from the build_messages function.
What I get as a result is: 'NoneType' object is not iterable.
Is this the correct way of doing this?
Your code seems pretty involved, so you cannot be sure that multithreading will lead to any performance gain when applied at a high level. It's therefore worth digging down to the step that contributes the most latency and considering how to approach that specific bottleneck. See here for greater discussion on threading limitations.
If, for example, as we discussed in the comments, you can pinpoint a single task that is taking a long time, then you could try to parallelize it using multiprocessing instead, to leverage more of your CPU power. Here is a generic example that is hopefully simple enough to understand and mirrors your Postgres queries without going into your own code base; I think that would be an unfeasible amount of effort tbh.
import multiprocessing as mp
import time
import random
import datetime as dt
MAILCHIMP_RESPONSE = [x for x in range(1000)]
def chunks(l, n):
    n = max(1, n)
    return [l[i:i + n] for i in range(0, len(l), n)]
def db_query():
    ''' Delayed response from database '''
    time.sleep(0.01)
    return random.random()
def do_queries(query_list):
    ''' The function that takes all your query ids and executes them
    sequentially for each id '''
    results = []
    for item in query_list:
        query = db_query()
        # Your super-quick processing of the Postgres response
        processing_result = query * 2
        results.append([item, processing_result])
    return results
def single_processing():
    ''' As you do now - equivalent to get_reference '''
    result_of_process = do_queries(MAILCHIMP_RESPONSE)
    return result_of_process
def multi_process(chunked_data, queue):
    ''' Same as single_processing, except we put our results in queue rather
    than returning them '''
    result_of_process = do_queries(chunked_data)
    queue.put(result_of_process)
def multiprocess_handler():
    ''' Divide and conquer on our db requests. We split the mailchimp response
    into a series of chunks and fire our queries simultaneously. Thus, each
    concurrent process has a smaller number of queries to make '''
    num_processes = 4  # depending on cores/resources
    # Floor division keeps the chunk size an integer on both Python 2 and 3
    size_chunk = len(MAILCHIMP_RESPONSE) // num_processes
    chunked_queries = chunks(MAILCHIMP_RESPONSE, size_chunk)
    queue = mp.Queue()  # This is going to combine all the results
    processes = [mp.Process(target=multi_process,
                            args=(chunked_queries[x], queue)) for x in range(num_processes)]
    for p in processes:
        p.start()
    divide_and_conquer_result = []
    for p in processes:
        # One queue.get() per process; results arrive in completion order
        divide_and_conquer_result.extend(queue.get())
    return divide_and_conquer_result
if __name__ == '__main__':
    # print(...) with a single argument behaves the same on Python 2.7 and 3
    start_single = dt.datetime.now()
    single_process = single_processing()
    print("Single process took {}".format(dt.datetime.now() - start_single))
    print("Number of records processed = {}".format(len(single_process)))
    start_multi = dt.datetime.now()
    multi = multiprocess_handler()
    print("Multi process took {}".format(dt.datetime.now() - start_multi))
    print("Number of records processed = {}".format(len(multi)))
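For the ThreadPool attempt in the question, one likely contributor to the NoneType error is that build_messages_in does not return anything, so pool.map collects a list of None values. A minimal sketch of a per-record worker that returns its result, with the merge done afterwards in the main thread, could look like the following. It is written for Python 3, fetch_stream is a hypothetical stand-in for the per-record database query, and the record fields are reduced to a few illustrative ones.

```python
import time
from multiprocessing.pool import ThreadPool

def fetch_stream(record):
    """Hypothetical stand-in for the per-record database query."""
    time.sleep(0.01)  # simulates query latency
    return {"el": "stream-el-%d" % record["id"], "en": "stream-en-%d" % record["id"]}

def build_entry(record):
    """Build the pieces for one record and return them to the caller."""
    data = fetch_stream(record)
    to_entry = {"email": record["email"], "type": "to"}
    merge_entry = {
        "rcpt": record["email"],
        "vars": [
            {"name": "stream_greek", "content": data["el"]},
            {"name": "stream_english", "content": data["en"]},
        ],
    }
    return to_entry, merge_entry

def build_messages(messages, records, workers=4):
    """Run the slow per-record work in threads, then merge in the main thread."""
    pool = ThreadPool(workers)
    try:
        entries = pool.map(build_entry, records)  # results keep the input order
    finally:
        pool.close()
        pool.join()
    for to_entry, merge_entry in entries:
        messages["to"].append(to_entry)
        messages["merge_vars"].append(merge_entry)
    return messages

records = [{"id": i, "email": "user%d@example.com" % i} for i in range(20)]
messages = build_messages({"to": [], "merge_vars": []}, records)
print(len(messages["to"]), len(messages["merge_vars"]))
```

Keeping the appends out of the worker avoids several threads mutating the shared messages dictionary at once.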
I have a scheduler job that checks the live status of 1000 users at a time. The server accepts at most 50 users per API request, so with the for loop below it takes around 66 seconds for 1000 users (i.e. 20 API calls).
from apscheduler.schedulers.blocking import BlockingScheduler
import requests
sched = BlockingScheduler()
def shcdulerjob():
    """
    """
    uidlist = todays_userslist()  # Get around 1000 users from table
    # -- DIVIDE LIST BY GIVEN SIZE (here 50)
    split_list = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)]
    idlists = split_list(uidlist, 50)  # SERVER MAX LIMIT - 50 ids/request
    for idlist in idlists:
        apiurl = some_server_url + "&ids="+str(idlist)
        resp = requests.get(apiurl)
        save_status(resp.json())  # -- Save status to db
if __name__ == "__main__":
    sched.add_job(shcdulerjob, 'interval', minutes=10)
    sched.start()
So,
Is there any workaround to reduce the time required to fetch from the API?
Does APScheduler provide any multiprocessing option to process such API requests in a single job?
You could try Python's thread pool from the concurrent.futures module, if the server allows concurrent requests. That way you parallelise the processing instead of the scheduling itself.
There are some good examples provided in the documentation here (if you're using Python 2, there is a sort of equivalent module),
e.g.
import concurrent.futures
import multiprocessing
import requests
import time
import json
cpu_start_time = time.process_time()
clock_start_time = time.time()
queue = multiprocessing.Queue()
uri = "http://localhost:5000/data.json"
users = [str(user) for user in range(1, 50)]
with concurrent.futures.ThreadPoolExecutor(multiprocessing.cpu_count()) as executor:
    # executor.map yields results in the same order as users, so zip pairs them up correctly
    for user_id, result in zip(
            users,
            executor.map(lambda x: requests.get(uri, params={'id': x}).content, users)
    ):
        queue.put((user_id, result))
while not queue.empty():
    user_id, rs = queue.get()
    print("User ", user_id, json.loads(rs.decode()))
cpu_end_time = time.process_time()
clock_end_time = time.time()
print("Took {0:.03}s [{1:.03}s]".format(cpu_end_time-cpu_start_time, clock_end_time-clock_start_time))
If you want to use a process pool, just make sure you don't use shared resources, e.g. the queue, and write your data out independently.
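Applied to the batches of 50 ids from the question, the thread-pool idea might look like the sketch below. It rests on assumptions: fetch_status is a hypothetical stand-in that sleeps instead of calling requests.get, the "online" status is made up, and max_workers=5 is an arbitrary choice that should respect whatever concurrency the server allows.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def split_list(lst, size):
    """Split lst into consecutive batches of at most size items."""
    return [lst[i:i + size] for i in range(0, len(lst), size)]

def fetch_status(id_batch):
    """Hypothetical stand-in for requests.get(...).json() on one batch."""
    time.sleep(0.05)  # simulates the round trip to the server
    return {user_id: "online" for user_id in id_batch}

def fetch_all(user_ids, batch_size=50, max_workers=5):
    """Fetch every batch concurrently and merge the per-batch results."""
    statuses = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        batches = split_list(user_ids, batch_size)
        for batch_result in executor.map(fetch_status, batches):
            # Merging happens in the calling thread, one batch at a time
            statuses.update(batch_result)
    return statuses

user_ids = list(range(1000))
statuses = fetch_all(user_ids)
print(len(split_list(user_ids, 50)), len(statuses))
```

Inside the scheduled job, the merge step is where a call such as save_status would go, so database writes stay in a single thread.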