I have a Python function that requests data via an API and involves a rotating, expiring key. The volume of requests necessitates some parallelization of the function. I am doing this with ThreadPool from the multiprocessing.pool module. Example code:
import json

import requests
from multiprocessing.pool import ThreadPool
from tqdm import tqdm
# Input is a list-of-dicts results of a previous process.
results = [...]
# Process starts by retrieving an authorization key.
headers = {"authorization": get_new_authorization()}
# api_call() is called on each existing result with the retrieved key.
results = thread(api_call, [(headers, result) for result in results])
# Function calls API with passed headers for given URL and returns dict.
def api_call(headers_plus_result):
    headers, result = headers_plus_result
    r = requests.get(result["url"], headers=headers)
    return json.loads(r.text)
# Threading function with default num_threads.
def thread(worker, jobs, num_threads=5):
    pool = ThreadPool(num_threads)
    results = list()
    for result in tqdm(pool.imap_unordered(worker, jobs), total=len(jobs)):
        if result:
            results.append(result)
    pool.close()
    pool.join()
    if results:
        return results
# Function to get new authorization key.
def get_new_authorization():
    ...
    return auth_key
I am trying to modify my mapping process so that, when the first worker fails (i.e. the authorization key expires), all other processes are paused until a new authorization key is retrieved. Then, the processes proceed with the new key.
Should this be inserted into the actual thread() function? If I handle the exception in the api_call function itself, I don't see how I can stop the pool manager or update the headers being passed to the other workers.
Additionally: is using ThreadPool even the best method if I want this kind of flexibility?
A simpler possibility might be to use a multiprocessing.Event and a shared variable. The Event would indicate whether the authentication is currently valid, and the shared variable would contain the authentication.
import multiprocessing as mp

event = mp.Event()
sharedAuthentication = mp.Array('u', 100)  # 100 = max length
So a worker would run:
event.wait()
authentication = sharedAuthentication.value
Your main thread would initially set the authentication with
sharedAuthentication.value = ....
event.set()
and later modify the authentication with
event.clear()
... calculate new authentication
sharedAuthentication.value = .....
event.set()
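The snippets above use multiprocessing primitives; since the question's workers are ThreadPool threads that already share memory, a threading.Event plus an ordinary shared variable guarded by a lock is enough. Below is a minimal sketch of how this could be wired into the original code; the names (auth_ok, shared, the 401 check) are illustrative assumptions, not a fixed API.
import json
import threading

import requests

auth_ok = threading.Event()      # set while the current key is believed valid
auth_lock = threading.Lock()     # ensures only one thread performs a refresh
shared = {"headers": {"authorization": get_new_authorization()}}
auth_ok.set()

def api_call(result):
    while True:
        auth_ok.wait()           # pause here while another thread refreshes the key
        r = requests.get(result["url"], headers=shared["headers"])
        if r.status_code != 401:             # assumption: 401 means the key expired
            return json.loads(r.text)
        # Key expired: exactly one thread refreshes it, the others block on the event.
        if auth_lock.acquire(blocking=False):
            try:
                auth_ok.clear()
                shared["headers"] = {"authorization": get_new_authorization()}
                auth_ok.set()
            finally:
                auth_lock.release()
With this version api_call takes a result directly, so the pool would be invoked as thread(api_call, results) instead of zipping the headers into the job list.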
I am fetching JSON data from a website (https://www.nseindia.com/) using the Python requests library. The website only returns data if correct cookies are provided, so I use a Selenium webdriver to get the cookies and then fetch data for 850 different stocks.
Now, my code is written so that if the cookies are wrong, Selenium should open again and get a new cookie value. But the problem is that when I am using concurrent.futures, the tasks run very fast (due to the concurrency), and it opens a new driver for each symbol until new cookies are found. My code is as below:
# Initially get cookies
cookie_dict = get_cookies()
for cookie in cookie_dict:
    if cookie == "bm_sv" or cookie == "nsit" or cookie == "nseappid":
        session.cookies.set(cookie, cookie_dict[cookie])
# This function is used in the ThreadPoolExecutor
def final(u):
    try:
        data = session.get(u, headers=headers).json()
        print(data['data'][0]['CH_SYMBOL'])
        list_done.append(data['data'][0]['CH_SYMBOL'])
    except:
        print("Error")
        cookie_dict = get_cookies()
        for cookie in cookie_dict:
            if cookie == "bm_sv" or cookie == "nsit" or cookie == "nseappid":
                session.cookies.set(cookie, cookie_dict[cookie])
        data = session.get(u, headers=headers).json()
        print(data['data'][0]['CH_SYMBOL'])
        list_done.append(data['data'][0]['CH_SYMBOL'])
As you can see, if an exception is hit, it should get new cookies. But as I mentioned, the futures keep running for the other stocks, and exceptions keep occurring until new cookies are found.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    executor.map(final, urls)
Is there a way I can change my code, or some built-in facility in futures, so that I can pause everything until I get new cookies and continue running only once the new cookies have been set?
You could use the wait() function of concurrent.futures to monitor for the first exception. First, you have to change final so that it raises on failure. You should not append to a shared list inside this function; you should return the value:
def final(u):
    data = session.get(u, headers=headers).json()
    print(data['data'][0]['CH_SYMBOL'])
    return data['data'][0]['CH_SYMBOL']
Second, instead of using .map(), you submit the work and store the futures:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(final, u) for u in urls]
Then you use wait() to wait for an exception to occur. Either there is an exception and you handle it before continuing, or there is no error and you unwrap the futures:
from concurrent.futures import wait, FIRST_EXCEPTION

completed, not_done = wait(futures, return_when=FIRST_EXCEPTION)
if not_done:
    # An exception occurred before all futures completed,
    # handle it here.
    pass

# Here, handle the results of the futures:
# `completed` contains both finished and failed futures,
# `not_done` contains pending futures.
Note that you should check, for each future, whether it succeeded or failed before unwrapping it, and you should resubmit the futures that failed. Also, the wait() function does not cancel pending futures, so the not_done futures are still progressing.
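A minimal sketch of how the refresh-and-resubmit loop could look, reusing the question's session and get_cookies() and the raising version of final() above; the cancellation and retry details are illustrative, not a definitive implementation:
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_EXCEPTION

def set_session_cookies():
    # Refresh the shared session with the question's get_cookies() helper.
    cookie_dict = get_cookies()
    for name in ("bm_sv", "nsit", "nseappid"):
        if name in cookie_dict:
            session.cookies.set(name, cookie_dict[name])

def fetch_all(urls):
    symbols = []
    pending = list(urls)
    with ThreadPoolExecutor(max_workers=5) as executor:
        while pending:
            futures = {executor.submit(final, u): u for u in pending}
            # Block until the first failure (or until everything succeeded).
            done, not_done = wait(futures, return_when=FIRST_EXCEPTION)
            for fut in not_done:
                fut.cancel()          # queued futures are cancelled; running ones are not
            wait(not_done)            # let the still-running futures finish
            pending = []
            for fut, url in futures.items():
                if fut.cancelled() or fut.exception() is not None:
                    pending.append(url)           # retry this URL in the next round
                else:
                    symbols.append(fut.result())
            if pending:
                set_session_cookies()             # refresh cookies before resubmitting
    return symbols
Here symbols plays the role of list_done, and the loop keeps retrying the failed URLs with fresh cookies until every symbol has been fetched, so it assumes the cookie refresh eventually succeeds.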
I want to make repeated requests to a server that will return some tasks. The response from the server will be a dictionary with a list of functions that need to be called. For example:
{
    "tasks": [
        {
            "function": "HelloWorld",
            "id": 1212
        },
        {
            "function": "GoodbyeWorld",
            "id": 1222
        }
    ]
}
NOTE: I'm dummying it down.
For each of these tasks, I will run the specified function using multiprocessing. Here is an example of my code:
r = requests.get('https://localhost:5000', auth=('user', 'pass'))
data = r.json()

if len(data["tasks"]) > 0:
    manager = multiprocessing.Manager()
    for task in data["tasks"]:
        if task["function"] == "HelloWorld":
            helloObj = HelloWorldClass()
            hello = multiprocessing.Process(target=helloObj.helloWorld)
            hello.start()
            hello.join()
        elif task["function"] == "GoodbyeWorld":
            byeObj = GoodbyeWorldClass()
            bye = multiprocessing.Process(target=byeObj.byeWorld)
            bye.start()
            bye.join()
The problem is, I want to make repeated requests and fill the data["tasks"] array as the other processes are running. If I throw everything into some while loop, it will only make a request after all the processes from the initial response are done (when join() has returned for all processes).
Can anyone help me to make repeated requests and fill the array continuously? Please let me know if I need to make any clarifications.
If I understood you correctly, you need something like this:
import time
from multiprocessing import Process

import requests

from task import FunctionFactory


def get_tasks():
    resp = requests.get('https://localhost:5000', auth=('user', 'pass'))
    data = resp.json()
    return data['tasks']


if __name__ == '__main__':
    procs = {}

    for _ in range(10):
        tasks = get_tasks()
        if not tasks:
            time.sleep(5)
            continue

        for task in tasks:
            if task['id'] in procs:
                # This task has already been submitted for execution.
                continue
            func = FunctionFactory.build(task['function'])
            proc = Process(target=func)
            proc.start()
            procs[task['id']] = proc

    # Wait for all the submitted tasks to finish.
    for proc in procs.values():
        proc.join()
Here, the function get_tasks is used to request a list of dictionaries with id and function keys from the server. In the main section, a procs dictionary maps each id to a running Process instance, which executes the function built by a FunctionFactory from the received task's function name. If there is already a running task with the same id, the new one is ignored.
With this approach, you can request tasks as often as needed (here, 10 requests are used in a for loop) and start processes to execute them in parallel. In the end, you just wait for all the submitted tasks to finish.
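The snippet imports FunctionFactory from a task module without showing it; as a hypothetical illustration, such a factory could be as simple as a name-to-callable registry (the module layout and mapping below are assumptions, not part of the original answer):
# task.py -- hypothetical module assumed by the snippet above.

def hello_world():
    print("Hello, world")


def goodbye_world():
    print("Goodbye, world")


class FunctionFactory:
    # Maps the "function" name received from the server to a callable.
    _registry = {
        "HelloWorld": hello_world,
        "GoodbyeWorld": goodbye_world,
    }

    @classmethod
    def build(cls, name):
        try:
            return cls._registry[name]
        except KeyError:
            raise ValueError("Unknown task function: %s" % name)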
You have a bug in your program: you should call the joins after you've created all the processes. join() blocks until the process has finished, which in your case happens before you start the next one, and that practically makes your whole program run sequentially.
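A minimal sketch of that fix against the question's loop; data, HelloWorldClass, and GoodbyeWorldClass are the question's own names:
import multiprocessing

procs = []
for task in data["tasks"]:
    if task["function"] == "HelloWorld":
        target = HelloWorldClass().helloWorld
    elif task["function"] == "GoodbyeWorld":
        target = GoodbyeWorldClass().byeWorld
    else:
        continue
    proc = multiprocessing.Process(target=target)
    proc.start()         # start every process first ...
    procs.append(proc)

for proc in procs:
    proc.join()          # ... and only then wait for all of them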
What do I mean by "deterministic time"? For example, AWS offers a service called AWS Lambda. A process started as a Lambda function has a time limit; after that, the Lambda function stops execution and the task is assumed to have finished with an error. An example task: send data to an HTTP endpoint. Depending on the network connection to the endpoint, or other factors, sending the data can take a long time. If I need to send the same data to many endpoints, then the full processing time is the time of one send multiplied by the number of endpoints, which increases the chance that the Lambda function will be stopped before all the data has been sent to all endpoints.
To solve this, I need to send the data to the different endpoints in parallel using threads.
The problem with threads is that a started thread can't be stopped. If an HTTP request takes more time than the Lambda function's time limit allows, the Lambda function will be aborted and return an error. So I need to use a timeout on the HTTP request to abort it if it takes more time than expected.
If the HTTP request is cancelled by the timeout, or the endpoint returns an error, I need to save the unprocessed data somewhere so the data is not lost. The time needed to save the unprocessed data can be predicted, because I control the storage where the data will be saved.
The last part that consumes time is the procedure or loop where the threads are scheduled with executor.submit(). If there is only one endpoint, or a small number of them, the consumed time is small and there is no need to control it. But if I am dealing with many endpoints, I have to take it into account.
So basically the full time consists of:
scheduling threads
http request execution
saving unprocessed data
Here is an example of how I can manage time using threads:
import concurrent.futures
from functools import partial
import time

import requests

start = time.time()


def send_data(data):
    host = 'http://127.0.0.1:5000/endpoint'
    try:
        result = requests.post(host, json=data, timeout=(0.1, 0.5))
        # print('done')
        if result.status_code == 200:
            return {'status': 'ok'}
        if result.status_code != 200:
            return {'status': 'error', 'msg': result.text}
    except requests.exceptions.Timeout as err:
        return {'status': 'error', 'msg': 'timeout'}


def get_data(n):
    return {"wait": n}


def done_cb(a, b, future):
    pass  # save unprocessed data


def main():
    executor = concurrent.futures.ThreadPoolExecutor()
    futures = []
    max_time = 0.5
    for i in range(1):
        future = executor.submit(send_data, *[{"wait": 10}])
        future.add_done_callback(partial(done_cb, 2, 3))
        futures.append(future)
        if time.time() - start > max_time:
            print('stopping creating new threads')
            # save unprocessed data
            break
    try:
        for item in concurrent.futures.as_completed(futures, timeout=1):
            item.result()
    except concurrent.futures.TimeoutError as err:
        pass
I was thinking about how I could use the asyncio library instead of threads to do the same thing.
import asyncio
import time
from functools import partial

import requests

start = time.time()


def send_data(data):
    ...


def get_data(n):
    return {"wait": n}


def done_callback(a, b, future):
    pass  # save unprocessed data


def main(loop):
    max_time = 0.5
    futures = []
    start_appending = time.time()
    for i in range(1):
        event_data = get_data(1)
        future = loop.run_in_executor(None, send_data, event_data)
        future.add_done_callback(partial(done_callback, 2, 3))
        futures.append(future)
        if time.time() - start > max_time:
            print('stopping creating new futures')
            # save unprocessed data
            break
    finished, unfinished = loop.run_until_complete(
        asyncio.wait(futures, timeout=1)
    )


_loop = asyncio.get_event_loop()
result = main(_loop)
The send_data() function is the same as in the previous code snippet.
Because the requests library is not async code, I use run_in_executor() to create the future object. The main problem I have is that done_callback() is not executed when the thread started by the executor has done its job, but only when the futures are "processed" by the asyncio.wait() expression.
Basically, I am seeking a way to start executing an asyncio future the way ThreadPoolExecutor starts executing threads, without waiting for the asyncio.wait() expression to call done_callback(). If you have other ideas on how to write Python code that works with threads or coroutines and completes in deterministic time, please share them; I will be glad to read them.
And another question. If a thread or future finishes its job, it can return a result that I can use in done_callback(), for example to remove a message from a queue by the id returned in the result. But if the thread or future was cancelled, I don't have a result, and I have to use functools.partial() to pass additional data into done_callback that helps me understand which data the callback was called for. If the passed data is small, this is not a problem; if the data is big, I need to put the data into an array/list/dictionary and pass only an index into the callback, or put the full data into the callback.
Can I somehow get access, from a done_callback() that was triggered on a cancelled future/thread, to the variable that was passed to that future/thread?
You can use asyncio.wait_for to wait for a future (or multiple futures, when combined with asyncio.gather) and cancel them in case of a timeout. Unlike threads, asyncio supports cancellation, so you can cancel a task whenever you feel like it, and it will be cancelled at the first blocking call it makes (typically a network call).
Note that for this to work, you should be using asyncio-native libraries such as aiohttp for HTTP. Trying to combine requests with asyncio using run_in_executor will appear to work for simple tasks, but it will not bring you the benefits of using asyncio, such as being able to spawn a massive number of tasks without encumbering the OS, or the possibility of cancellation.
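A minimal sketch of that approach, assuming aiohttp is available; the endpoint URL, the time budget, and the save_unprocessed() helper are illustrative, not a definitive implementation:
import asyncio

import aiohttp


async def send_data(session, url, data):
    async with session.post(url, json=data) as resp:
        return {'status': 'ok' if resp.status == 200 else 'error',
                'msg': await resp.text()}


def save_unprocessed(urls):
    # Illustrative stand-in for "save unprocessed data somewhere".
    print('unsent endpoints:', urls)


async def main(endpoints, data, max_time=0.5):
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(send_data(session, url, data))
                 for url in endpoints]
        try:
            # Cancel everything still running once the time budget is spent.
            return await asyncio.wait_for(
                asyncio.gather(*tasks, return_exceptions=True), timeout=max_time)
        except asyncio.TimeoutError:
            # Record the endpoints whose sends were cancelled or never finished.
            save_unprocessed([url for url, t in zip(endpoints, tasks)
                              if t.cancelled() or not t.done()])
            return None


asyncio.run(main(['http://127.0.0.1:5000/endpoint'], {'wait': 1}))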
In Celery I want to get the task status for all the tasks with a specific task name. For that I tried the code below.
import celery.events.state

# Celery state instance.
stat = celery.events.state.State()

# tasks_by_type will return a list of tasks.
query = stat.tasks_by_type("my_task_name")

# Print tasks.
print(query)
Right now I'm getting an empty list from this code.
celery.events.state.State() is a data structure used to keep track of the state of Celery workers and tasks. When calling State(), you get an empty state object with no data.
You should use app.events.Receiver (stream processing) or celery.events.snapshot (batch processing) to capture a state that contains the tasks.
Sample Code:
from celery import Celery


def my_monitor(app):
    state = app.events.State()

    def announce_failed_tasks(event):
        state.event(event)
        # The task name is sent only with the -received event, and state
        # will keep track of this for us.
        task = state.tasks.get(event['uuid'])

        print('TASK FAILED: %s[%s] %s' % (
            task.name, task.uuid, task.info(),))

    with app.connection() as connection:
        recv = app.events.Receiver(connection, handlers={
            'task-failed': announce_failed_tasks,
            '*': state.event,
        })
        recv.capture(limit=None, timeout=None, wakeup=True)


if __name__ == '__main__':
    app = Celery(broker='amqp://guest@localhost//')
    my_monitor(app)
This isn't natively supported. Depending on the backend (Mongo, Redis, etc), you may or may not be able to introspect the contents of a queue and find out what's in it. Even if you do, you'll miss items currently in progress.
That said, you could manage this yourself:
from celery.result import AsyncResult

result = mytask.delay(...)
my_datastore.save("mytask", result.id)

...

for id in my_datastore.find(task="mytask"):
    res = AsyncResult(id)
    print(res.state)
In Celery you can easily find the status of a task by accessing it through its task ID, if you want to access it from another function.
Sample Code:-
@task(name='Sum_of_digits')
def ABC(x, y):
    return x + y
Add this task for processing
res = ABC.delay(1, 2)
Now use the task result res to fetch the state, status, and result (res.get()):
print(f"id={res.id}, state={res.state}, status={res.status}")
How can I run background tasks on App Engine?
You may use the Task Queue Python API.
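As a small, hedged illustration of that API (the /worker URL and the handler names below are made up, following the webapp style used later in this page):
from google.appengine.api import taskqueue
from google.appengine.ext import webapp


class EnqueueHandler(webapp.RequestHandler):
    def get(self):
        # Enqueue a background task; App Engine will POST it to /worker later.
        taskqueue.add(url='/worker', params={'key': self.request.get('key')})
        self.response.out.write('task enqueued')


class WorkerHandler(webapp.RequestHandler):
    def post(self):
        key = self.request.get('key')
        # ... do the slow work here, outside the user-facing request ...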
GAE is a very useful tool for building scalable web applications. A few of the limitations pointed out by many are the lack of support for background tasks, the lack of periodic tasks, and the strict limit on how much time each HTTP request may take; if a request exceeds that time limit, the operation is terminated, which makes running time-consuming tasks impossible.
How do you run a background task?
In GAE, code is executed only when there is an HTTP request, and there is a strict time limit (I think 10 seconds) on how long the code can take, so if there are no requests, no code is executed. One suggested workaround was to use an external box to send requests continuously, in effect creating a background task; but then we need an external box and we depend on one more element. The other alternative was sending a 302 redirect response so that the client re-sends the request; this also makes us dependent on an external element, the client. What if that external box is GAE itself? Everyone who has used a functional language with no looping construct knows the alternative: recursion replaces the loop. So what if we complete part of the computation and do an HTTP GET on the same URL with a very short timeout, say 1 second? This creates a loop (recursion) in PHP code running on Apache.
<?php
$i = 0;
if(isset($_REQUEST["i"])){
    $i = $_REQUEST["i"];
    sleep(1);
}
$ch = curl_init("http://localhost".$_SERVER["PHP_SELF"]."?i=".($i+1));
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 1);
curl_exec($ch);
print "hello world\n";
?>
Somehow this does not work on GAE. So what if we do an HTTP GET on some other URL, say url2, which does an HTTP GET on the first URL? This seems to work in GAE. The code for this looks like this:
class FirstUrl(webapp.RequestHandler):
    def get(self):
        self.response.out.write("ok")
        time.sleep(2)
        urlfetch.fetch("http://"+self.request.headers["HOST"]+'/url2')


class SecondUrl(webapp.RequestHandler):
    def get(self):
        self.response.out.write("ok")
        time.sleep(2)
        urlfetch.fetch("http://"+self.request.headers["HOST"]+'/url1')


application = webapp.WSGIApplication([('/url1', FirstUrl), ('/url2', SecondUrl)])


def main():
    run_wsgi_app(application)


if __name__ == "__main__":
    main()
Since we have found a way to run a background task, let's build abstractions for a periodic task (timer) and a looping construct that spans many HTTP requests (foreach).
Timer
Building a timer is now straightforward. The basic idea is to have a list of timers and the interval at which each should be called; once we reach that interval, we call the callback function. We will use memcache to maintain the timer list. To find out when to call a callback, we store a key in memcache with the interval as its expiration time. We periodically (say every 5 seconds) check whether that key is present; if it is not, we call the callback and set the key again with the interval.
def timer(func, interval):
    timerlist = memcache.get('timer')
    if(None == timerlist):
        timerlist = []
    timerlist.append({'func':func, 'interval':interval})
    memcache.set('timer-'+func, '1', interval)
    memcache.set('timer', timerlist)


def checktimers():
    timerlist = memcache.get('timer')
    if(None == timerlist):
        return False
    for current in timerlist:
        if(None == memcache.get('timer-'+current['func'])):
            #reset interval
            memcache.set('timer-'+current['func'], '1', current['interval'])
            #invoke callback function
            try:
                eval(current['func']+'()')
            except:
                pass
            return True
    return False
Foreach
This is needed when we want to do a long-running computation, say performing some operation on 1000 database rows or fetching 1000 URLs. The basic idea is to maintain a list of callbacks and arguments in memcache and invoke the callback with one argument at a time.
def foreach(func, args):
    looplist = memcache.get('foreach')
    if(None == looplist):
        looplist = []
    looplist.append({'func':func, 'args':args})
    memcache.set('foreach', looplist)


def checkloops():
    looplist = memcache.get('foreach')
    if(None == looplist):
        return False
    if((len(looplist) > 0) and (len(looplist[0]['args']) > 0)):
        arg = looplist[0]['args'].pop(0)
        func = looplist[0]['func']
        if(len(looplist[0]['args']) == 0):
            looplist.pop(0)
        if((len(looplist) > 0) and (len(looplist[0]['args']) > 0)):
            memcache.set('foreach', looplist)
        else:
            memcache.delete('foreach')
        try:
            eval(func+'('+repr(arg)+')')
        except:
            pass
        return True
    else:
        return False


# instead of
#   foreach index in range(0, 1000):
#       someoperaton(index)
# we will say
#   foreach('someoperaton', range(0, 1000))
Building a program that fetches a list of URLs every hour is now straightforward. Here is the code:
def getone(url):
    try:
        result = urlfetch.fetch(url)
        if(result.status_code == 200):
            memcache.set(url, '1', 60*60)
            #process result.content
    except:
        pass


def getallurl():
    #list of urls to be fetched
    urllist = ['http://www.google.com/', 'http://www.cnn.com/', 'http://www.yahoo.com', 'http://news.google.com']
    fetchlist = []
    for url in urllist:
        if (memcache.get(url) is None):
            fetchlist.append(url)
    #this is equivalent to
    #for url in fetchlist: getone(url)
    if(len(fetchlist) > 0):
        foreach('getone', fetchlist)


#register the timer callback
timer('getallurl', 3*60)
The complete code is here: http://groups.google.com/group/httpmr-discuss/t/1648611a54c01aa
I have been running this code on App Engine for a few days without much problem.
Warning: we make heavy use of urlfetch. The limit on the number of urlfetch calls per day is 160,000, so be careful not to reach that limit.
You can find more about cron jobs in Python App Engine here.
An upcoming version of the runtime will have some kind of periodic execution engine à la cron. See this message on the App Engine group.
So, all the SDK pieces appear to work, but my testing indicates this isn't running on the production servers yet. I set up an "every 1 minutes" cron job that logs when it runs, and it hasn't been called yet.
Hard to say when this will be available, though...
Using the deferred Python library, which is built on top of the Task Queue API, is the easiest way of running a background task on App Engine with Python.
import logging

from google.appengine.ext import deferred


def do_something_expensive(a, b, c=None):
    logging.info("Doing something expensive!")
    # Do your work here


# Somewhere else
deferred.defer(do_something_expensive, "Hello, world!", 42, c=True)
If you want to run background periodic tasks, see this question (AppEngine cron)
If your tasks are not periodic, see Task Queue Python API or Task Queue Java API
There is a cron facility built into App Engine.
Please refer to:
https://developers.google.com/appengine/docs/python/config/cron?hl=en
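As a small illustration, a cron.yaml along these lines schedules a handler periodically; the /tasks/summary URL and the schedule are example values, not taken from the linked docs:
cron:
- description: periodic summary job
  url: /tasks/summary
  schedule: every 24 hours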
Use the Task Queue - http://code.google.com/appengine/docs/java/taskqueue/overview.html