How can I run background tasks on App Engine?
You may use the Task Queue Python API.
GAE is very useful tool to build scalable web applications. Few of the limitations pointed out by many are no support for background tasks, lack of periodic tasks and strict limit on how much time each HTTP request takes, if a request exceeds that time limit the operation is terminated, which makes running time consuming tasks impossible.
How to run background task ?
In GAE the code is executed only when there is a HTTP request. There is a strict time limit (i think 10secs) on how long the code can take. So if there are no requests then code is not executed. One of the suggested work around was use an external box to send requests continuously, so kind of creating a background task. But for this we need an external box and now we dependent on one more element. The other alternative was sending 302 redirect response so that client re-sends the request, this also makes us dependent on external element which is client. What if that external box is GAE itself ? Everyone who has used functional language which does not support looping construct in the language is aware of the alternative ie recursion is the replacement to loop. So what if we complete part of the computation and do a HTTP GET on the same url with very short time out say 1 second ? This creates a loop(recursion) on php code running on apache.
<?php
$i = 0;
if(isset($_REQUEST["i"])){
$i= $_REQUEST["i"];
sleep(1);
}
$ch = curl_init("http://localhost".$_SERVER["PHP_SELF"]."?i=".($i+1));
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 1);
curl_exec($ch);
print "hello world\n";
?>
Some how this does not work on GAE. So what if we do HTTP GET on some other url say url2 which does HTTP GET on the first url ? This seem to work in GAE. Code for this looks like this.
class FirstUrl(webapp.RequestHandler):
def get(self):
self.response.out.write("ok")
time.sleep(2)
urlfetch.fetch("http://"+self.request.headers["HOST"]+'/url2')
class SecondUrl(webapp.RequestHandler):
def get(self):
self.response.out.write("ok")
time.sleep(2)
urlfetch.fetch("http://"+self.request.headers["HOST"]+'/url1')
application = webapp.WSGIApplication([('/url1', FirstUrl), ('/url2', SecondUrl)])
def main():
run_wsgi_app(application)
if __name__ == "__main__":
main()
Since we found out a way to run background task, lets build abstractions for periodic task (timer) and a looping construct which spans across many HTTP requests (foreach).
Timer
Now building timer is straight forward. Basic idea is to have list of timers and the interval at which each should be called. Once we reach that interval call the callback function. We will use memcache to maintain the timer list. To find out when to call callback, we will store a key in memcache with interval as expiration time. We periodically (say 5secs) check if that key is present, if not present then call the callback and again set that key with interval.
def timer(func, interval):
timerlist = memcache.get('timer')
if(None == timerlist):
timerlist = []
timerlist.append({'func':func, 'interval':interval})
memcache.set('timer-'+func, '1', interval)
memcache.set('timer', timerlist)
def checktimers():
timerlist = memcache.get('timer')
if(None == timerlist):
return False
for current in timerlist:
if(None == memcache.get('timer-'+current['func'])):
#reset interval
memcache.set('timer-'+current['func'], '1', current['interval'])
#invoke callback function
try:
eval(current['func']+'()')
except:
pass
return True
return False
Foreach
This is needed when we want to do long taking computation say doing some operation on 1000 database rows or fetch 1000 urls etc. Basic idea is to maintain list of callbacks and arguments in memcache and each time invoke callback with the argument.
def foreach(func, args):
looplist = memcache.get('foreach')
if(None == looplist):
looplist = []
looplist.append({'func':func, 'args':args})
memcache.set('foreach', looplist)
def checkloops():
looplist = memcache.get('foreach')
if(None == looplist):
return False
if((len(looplist) > 0) and (len(looplist[0]['args']) > 0)):
arg = looplist[0]['args'].pop(0)
func = looplist[0]['func']
if(len(looplist[0]['args']) == 0):
looplist.pop(0)
if((len(looplist) > 0) and (len(looplist[0]['args']) > 0)):
memcache.set('foreach', looplist)
else:
memcache.delete('foreach')
try:
eval(func+'('+repr(arg)+')')
except:
pass
return True
else:
return False
# instead of
# foreach index in range(0, 1000):
# someoperaton(index)
# we will say
# foreach('someoperaton', range(0, 1000))
Now building a program which fetches list of urls every one hour is straight forward. Here is the code.
def getone(url):
try:
result = urlfetch.fetch(url)
if(result.status_code == 200):
memcache.set(url, '1', 60*60)
#process result.content
except :
pass
def getallurl():
#list of urls to be fetched
urllist = ['http://www.google.com/', 'http://www.cnn.com/', 'http://www.yahoo.com', 'http://news.google.com']
fetchlist = []
for url in urllist:
if (memcache.get(url) is None):
fetchlist.append(url)
#this is equivalent to
#for url in fetchlist: getone(url)
if(len(fetchlist) > 0):
foreach('getone', fetchlist)
#register the timer callback
timer('getallurl', 3*60)
complete code is here http://groups.google.com/group/httpmr-discuss/t/1648611a54c01aa
I have been running this code on appengine for few days without much problem.
Warning: We make heavy use of urlfetch. The limit on no of urlfetch per day is 160000. So be careful not to reach that limit.
You can find more about cron jobs in Python App Engine here.
Up and coming version of runtime will have some kind of periodic execution engine a'la cron. See this message on AppEngine group.
So, all the SDK pieces appear to work, but my testing indicates this isn't running on the production servers yet-- I set up an "every 1 minutes" cron that logs when it runs, and it hasn't been called yet
Hard to say when this will be available, though...
Using the Deferred Python Library is the easiest way of doing background task on Appengine using Python which is built on top of TaskQueue API.
from google.appengine.ext import deferred
def do_something_expensive(a, b, c=None):
logging.info("Doing something expensive!")
# Do your work here
# Somewhere else
deferred.defer(do_something_expensive, "Hello, world!", 42, c=True)
If you want to run background periodic tasks, see this question (AppEngine cron)
If your tasks are not periodic, see Task Queue Python API or Task Queue Java API
There is a cron facility built into app engine.
Please refer to:
https://developers.google.com/appengine/docs/python/config/cron?hl=en
Use the Task Queue - http://code.google.com/appengine/docs/java/taskqueue/overview.html
Related
What I mean by "deterministic time"? For example AWS offer a service "AWS Lambda". The process started as lambda function has time limit, after that lambda function will stop execution and will assume that task was finished with error. And example task - send data to http endpoint. Depending of a network connection to http endpoint, or other factors, process of sending data can take a long time. If I need to send the same data to the many endpoints, then full process time will take one process time times endpoints amount. Which increase a chance that lambda function will be stopped before all data will be send to all endpoints.
To solve this I need to send data to different endpoints in parallel mode using threads.
The problem with threads - started thread can't be stopped. If http request will take more time than it dedicated by lambda function time limit, lambda function will be aborted and return error. So I need to use timeout with http request, to abort it, if it take more time than expected.
If http request will be canceled by timeout or endpoint will return error, I need to save not processed data somewhere to not lost the data. The time needed to save unprocessed data can be predicted, because I control the storage where data will be saved.
And the last part that consume time - procedure or loop where threads are scheduled executor.submit(). If there is only one endpoint or small number of them then the consumed time will be small. And there is no necessary to control this. But if I have deal with many endpoints, I have to take this into account.
So basically full time will consists of:
scheduling threads
http request execution
saving unprocessed data
There is example of how I can manage time using threads
import concurrent.futures
from functools import partial
import requests
import time
start = time.time()
def send_data(data):
host = 'http://127.0.0.1:5000/endpoint'
try:
result = requests.post(host, json=data, timeout=(0.1, 0.5))
# print('done')
if result.status_code == 200:
return {'status': 'ok'}
if result.status_code != 200:
return {'status': 'error', 'msg': result.text}
except requests.exceptions.Timeout as err:
return {'status': 'error', 'msg': 'timeout'}
def get_data(n):
return {"wait": n}
def done_cb(a, b, future):
pass # save unprocessed data
def main():
executor = concurrent.futures.ThreadPoolExecutor()
futures = []
max_time = 0.5
for i in range(1):
future = executor.submit(send_data, *[{"wait": 10}])
future.add_done_callback(partial(done_cb, 2, 3))
futures.append(future)
if time.time() - s_time > max_time:
print('stopping creating new threads')
# save unprocessed data
break
try:
for item in concurrent.futures.as_completed(futures, timeout=1):
item.result()
except concurrent.futures.TimeoutError as err:
pass
I was thinking of how I can use asyncio library instead of threads, to do the same thing.
import asyncio
import time
from functools import partial
import requests
start = time.time()
def send_data(data):
...
def get_data(n):
return {"wait": n}
def done_callback(a,b, future):
pass # save unprocessed data
def main(loop):
max_time = 0.5
futures = []
start_appending = time.time()
for i in range(1):
event_data = get_data(1)
future = (loop.run_in_executor(None, send_data, event_data))
future.add_done_callback(partial(done_callback, 2, 3))
futures.append(future)
if time.time() - s_time > max_time:
print('stopping creating new futures')
# save unprocessed data
break
finished, unfinished = loop.run_until_complete(
asyncio.wait(futures, timeout=1)
)
_loop = asyncio.get_event_loop()
result = main(_loop)
Function send_data() the same as in previous code snipped.
Because request library is not async code I use run_in_executor() to create future object. The main problems I have is that done_callback() is not executed when the thread that started but executor done it's job. But only when the futures will be "processed" by asyncio.wait() expression.
Basically I seeking the way to start execute asyncio future, like ThreadPoolExecutor start execute threads, and not wait for asyncio.wait() expression to call done_callback(). If you have other ideas how to write python code that will work with threads or coroutines and will complete in deterministic time. Please share it, I will be glad to read them.
And other question. If thread or future done its job, it can return result, that I can use in done_callback(), for example to remove message from queue by id returned in result. But if thread or future was canceled, I don't have result. And I have to use functools.partial() pass in done_callback additional data, that can help me to understand for what data this callback was called. If passed data are small this is not a problem. If data will be big, I need to put data in array/list/dictionary and pass in callback only index of array or put "full data: in callback.
Can I somehow get access to variable that was passed to future/thread, from done_callback(), that was triggered on canceled future/thread?
You can use asyncio.wait_for to wait for a future (or multiple futures, when combined with asyncio.gather) and cancel them in case of a timeout. Unlike threads, asyncio supports cancellation, so you can cancel a task whenever you feel like it, and it will be cancelled at the first blocking call it makes (typically a network call).
Note that for this to work, you should be using asyncio-native libraries such as aiohttp for HTTP. Trying to combine requests with asyncio using run_in_executor will appear to work for simple tasks, but it will not bring you the benefits of using asyncio, such as being able to spawn a massive number of tasks without encumbering the OS, or the possibility of cancellation.
I have a REST API, which returns a JSON response. I need to keep track of one of the fields of the response, and listen for any changes in the value of this field. If the value reaches a certain threshold, I need to perform some task (say print an alert message). How can I accomplish this? Right now, I have a daemon which runs periodically, making an HTTP request and obtaining the value. What is the correct approach to do this, if I want to perform the action the moment the variable reaches the threshold?
This is what I currently have -
import daemon, time
from daemon import runner
NUMBER_OF_MINUTES = 10
def doSomething():
print "Yay, we got there"
def getSomeData():
url = "www.somewebsite.com/getdata?id=somevalue&name=someothervalue"
response = requests.get(url)
json_data = response.json()
myField = json_data['somefield']
if myField > threshold:
doSomething()
def run()
while True:
getSomeData()
time.sleep(60*NUMBER_OF_MINUTES)
if __name__ == '__main__':
run()
Do you have control of the API? If so, add a websocket endpoint that your frontend app can connect to. Your API can then let your frontend app know whenever the value changes, through whatever data structure is appropriate.
If you don't have control of the API, your current polling solution is about as good as it gets.
I am developing a bot with module pyTelegramBotAPI, which is working via webhooks with installed Flask+Gunicorn as server for webhooks. Gunicorn is working with 5 workers for better speed, structure of my project looks like that:
app.py
bot.py
In bot.py i have a function for processing updates:
def process_the_update(update):
logger.info(update)
update = types.Update.de_json(update)
bot.process_new_updates([update])
In app.py i imported this function, so, whenever an update comes, app.py will call this function, and bot will handle the update. In my bot user can call a command, which will use an external api to get some information. The problem is, that this external api has limit for 3 requests per second. I need to configure a bot with such rate limit. First i thought of doing it with Queue with code like this:
lock_queue = Queue(1)
requests_queue = Queue(3)
def api_request(argument):
if lock_queue.empty():
try:
requests_queue.put_nowait(time.time())
except queue.Full:
lock_queue.put(1)
first_request_time = requests_queue.get()
logger.info('First request time: ' + str(first_request_time))
current_time = time.time()
passed_time = current_time - first_request_time
if passed_time >= 1:
requests_queue.put_nowait(time.time())
lock_queue.get()
else:
logger.info(passed_time)
time.sleep(1 - passed_time)
requests_queue.put_nowait(time.time())
lock_queue.get()
else:
lock_queue.put(1)
first_request_time = vk_requests_queue.get()
logger.info('First request time: ' + str(first_request_time))
current_time = time.time()
passed_time = current_time - first_request_time
if passed_time >= 1:
requests_queue.put_nowait(time.time())
lock_queue.get()
else:
logger.info(passed_time)
time.sleep(1 - passed_time)
requests_queue.put_nowait(time.time())
lock_queue.get()
result = make_api_request(argument) # requests are made too by external module.
return result
The logic was that, as i thought, because module pyTelegramBotAPI uses threads for faster updates handling, all threads would check requests_queue, which will have time of 3 last api_requests, and so the time of the first of 3 requests made will be compared to the current time (to check, if a second passed). And, because i needed to be sure, that only one thread would do this kind of comparison simultaneously, i made lock_queue.
But, the problem is that, firstly, gunicorn uses 5 workers, so there will be always possibility, that all messages from users will be handled in different processes, and these processes will have their own queues. And, secondly, even if i set number of workers to default (1 worker), i still get 429 error, so i think, that my code won't work as i wanted at all.
I thought of making rate limit with redis, so every time in every thread and process bot will check the time of last 3 requests, but still i am not sure, that this is the right way, and i am not sure, how to write this.
Would be glad, if someone suggests any ideas or even examples of code (external api does not provide any x-rate-limit header)
Wrote this function, using redis to count requests (based on this https://www.binpress.com/tutorial/introduction-to-rate-limiting-with-redis/155 tutorial)
import redis
r_db = redis.Redis(port=port, db=db)
def limit_request(request_to_make, limit=3, per=1, request_name='test', **kwargs):
over_limit_lua_ = '''
local key_name = KEYS[1]
local limit = tonumber(ARGV[1])
local duration = ARGV[2]
local key = key_name .. '_num_of_requests'
local count = redis.call('INCR', key)
if tonumber(count) > limit then
local time_left = redis.call('PTTL', key)
return time_left
end
redis.call('EXPIRE', key, duration)
return -2
'''
if not hasattr(r_db, 'over_limit_lua'):
r_db.over_limit_lua = r_db.register_script(over_limit_lua_)
request_possibility = int(r_db.over_limit_lua(keys=request_name, args=[limit, per]))
if request_possibility > 0:
time.sleep(request_possibility / 1000.0)
return limit_request(request_to_make, limit, per, request_name, **kwargs)
else:
request_result = request_to_make(**kwargs)
return request_result
I am working on a small python/flask project, which interfaces a heavy computation routine with a browser interface. For practical reasons, I have to keep the computation in a background process and reload/redirect the page (with output results) when the computation is done. The following is a minimal code of what I have so far (in reverse order):
interface.py
from flask import Flask
from threading import Thread
import time
app = Flask(__name__)
# step 4: rerender browser with output data
#app.route('/done')
def done(data_to_pass):
# rerender browser's html here?
print data_to_pass
return data_to_pass
# step 3: heavy computation routine
def background():
print "start runing backgroun process"
time.sleep(3) # simulate heavy computation routine
data = 'done from background'
done(data)
# step 2: initiate background process
def init():
t = Thread(target=background)
t.daemon = True
t.start()
# step 1: home interface
#app.route('/')
def front_end():
init()
return 'initiate bachground process'
if __name__ == '__main__':
app.run()
When the interface.py is running, accessing 127.0.0.1:5000 get a string initiate bachground process in the browser. However, the final data (string done from background in this case) only been processed in the server's terminal, not to the browser.
I believe this procedure is commonly done for most of the server, but I can't find any flask solution... Or do I go in the wrong direction?
If you want to know when the process is finished I would suggest to use one of:
Long polling
WebSocket
However you can reload whole page:
window.location.reload()
it is a good practice to return from the server only the result of the background process and update only related fragment of the page.
I need a webserver which routes the incoming requests to back-end workers by batching them every 0.5 second or when it has 50 http requests whichever happens earlier. What will be a good way to implement it in python/tornado or any other language?
What I am thinking is to publish the incoming requests to a rabbitMQ queue and then somehow batch them together before sending to the back-end servers. What I can't figure out is how to pick multiple requests from the rabbitMq queue. Could someone point me to right direction or suggest some alternate apporach?
I would suggest using a simple python micro web framework such as bottle. Then you would send the requests to a background process via a queue (thus allowing the connection to end).
The background process would then have a continuous loop that would check your conditions (time and number), and do the job once the condition is met.
Edit:
Here is an example webserver that batches the items before sending them to any queuing system you want to use (RabbitMQ always seemed overcomplicated to me with Python. I have used Celery and other simpler queuing systems before). That way the backend simply grabs a single 'item' from the queue, that will contain all required 50 requests.
import bottle
import threading
import Queue
app = bottle.Bottle()
app.queue = Queue.Queue()
def send_to_rabbitMQ(items):
"""Custom code to send to rabbitMQ system"""
print("50 items gathered, sending to rabbitMQ")
def batcher(queue):
"""Background thread that gathers incoming requests"""
while True:
batcher_loop(queue)
def batcher_loop(queue):
"""Loop that will run until it gathers 50 items,
then will call then function 'send_to_rabbitMQ'"""
count = 0
items = []
while count < 50:
try:
next_item = queue.get(timeout=.5)
except Queue.Empty:
pass
else:
items.append(next_item)
count += 1
send_to_rabbitMQ(items)
#app.route("/add_request", method=["PUT", "POST"])
def add_request():
"""Simple bottle request that grabs JSON and puts it in the queue"""
request = bottle.request.json['request']
app.queue.put(request)
if __name__ == '__main__':
t = threading.Thread(target=batcher, args=(app.queue, ))
t.daemon = True # Make sure the background thread quits when the program ends
t.start()
bottle.run(app)
Code used to test it:
import requests
import json
for i in range(101):
req = requests.post("http://localhost:8080/add_request",
data=json.dumps({"request": 1}),
headers={"Content-type": "application/json"})