Setup: Python 2.7.15, Tornado 5.1
I have a web-server machine that handles ~40 /recommend requests per second.
The average response time is 25ms, but the variance is large (some requests take more than 500ms).
Each request generates between 1 and 8 Elasticsearch queries (HTTP requests) internally.
Each Elasticsearch query can take between 1 and 150ms.
The Elasticsearch requests are handled synchronously via the elasticsearch-dsl library.
The goal is to reduce the I/O waiting time (queries to Elasticsearch) and handle more requests per second, so that I can reduce the number of machines.
There is one hard constraint: the average handling time (25ms) must not increase.
I found some tornado-elasticsearch implementations on the web, but since I only need one Elasticsearch endpoint (/_search), I am trying to implement it myself.
Below is a stripped-down implementation of my web server. With the same load (~40 requests per second), the average response time increased to 200ms!
Digging in, I see that the internal async handling time (the queries to Elasticsearch) is not stable: the time taken by each fetch call varies, and the overall average (in an ab load test) is high.
I'm using ab to simulate the load, and I measure it internally by printing the current fetch time, the average fetch time, and the maximum fetch time.
When doing one request at a time (concurrency 1):
ab -p es-query-rcom.txt -T application/json -n 1000 -c 1 -k 'http://localhost:5002/recommend'
my output looks like: [avg req_time: 3, dur: 3] [current req_time: 2, dur: 3] [max req_time: 125, dur: 125] reqs: 8000
But when I increase the concurrency (up to 8):
ab -p es-query-rcom.txt -T application/json -n 1000 -c 8 -k 'http://localhost:5002/recommend'
my output now looks like: [avg req_time: 6, dur: 13] [current req_time: 4, dur: 4] [max req_time: 73, dur: 84] reqs: 8000
The average request is now 2x slower (or 4x by my own measurements)!
What am I missing here? Why do I see this degradation?
import tornado
from tornado.httpclient import AsyncHTTPClient
from tornado.options import define, options
from tornado.httpserver import HTTPServer
from web_handler import WebHandler
SERVICE_NAME = 'web_server'
class Statistics(object):
def __init__(self):
self.total_requests = 0
self.total_requests_time = 0
self.total_duration = 0
self.max_time = 0
self.max_duration = 0
class RcomService(object):
def __init__(self):
print 'initializing RcomService...'
AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient", max_clients=3)
self.stats = Statistics()
def start(self, port):
define("port", default=port, type=int)
db = self.get_db(self.stats)
routes = self.generate_routes(db)
app = tornado.web.Application(routes)
http_server = HTTPServer(app, xheaders=True)
def generate_routes(db):
return [
(r"/recommend", WebHandler, dict(db=db))
def get_db(stats):
return {
'stats': stats
def main():
port = 5002
print 'starting %s on port %s' % (SERVICE_NAME, port)
rcom_service = RcomService()
if __name__ == '__main__':
import time
import ujson
from tornado import gen
from tornado.gen import coroutine
from tornado.httpclient import AsyncHTTPClient
from tornado.web import RequestHandler
class WebHandler(RequestHandler):
def initialize(self, db):
self.stats = db['stats']
def post(self, *args, **kwargs):
result = yield self.wrapper_innear_loop([{}, {}, {}, {}, {}, {}, {}, {}]) # dummy queries (empty)
'res': result
def wrapper_innear_loop(self, queries):
result = []
for q in queries: # queries are performed serially
res = yield self.async_fetch_gen(q)
raise gen.Return(result)
def async_fetch_gen(self, query):
url = 'http://localhost:9200/my_index/_search'
headers = {
'Content-Type': 'application/json',
'Connection': 'keep-alive'
http_client = AsyncHTTPClient()
start_time = int(round(time.time() * 1000))
response = yield http_client.fetch(url, method='POST', body=ujson.dumps(query), headers=headers)
end_time = int(round(time.time() * 1000))
duration = end_time - start_time
body = ujson.loads(response.body)
request_time = int(round(response.request_time * 1000))
self.stats.total_requests += 1
self.stats.total_requests_time += request_time
self.stats.total_duration += duration
if self.stats.max_time < request_time:
self.stats.max_time = request_time
if self.stats.max_duration < duration:
self.stats.max_duration = duration
duration_avg = self.stats.total_duration / self.stats.total_requests
time_avg = self.stats.total_requests_time / self.stats.total_requests
print "[avg req_time: " + str(time_avg) + ", dur: " + str(duration_avg) + \
"] [current req_time: " + str(request_time) + ", dur: " + str(duration) + "] [max req_time: " + \
str(self.stats.max_time) + ", dur: " + str(self.stats.max_duration) + "] reqs: " + \
str(self.stats.total_requests)
raise gen.Return(body)
I tried switching the async client class (Simple vs. curl) and changing the max_clients size, but I don't understand what the best tuning is in my case.
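One thing that may be worth ruling out is queueing inside the HTTP client: the server above is configured with max_clients=3, while the second ab run keeps up to 8 requests in flight. The sketch below is a toy model, not a measurement of Tornado itself. It assumes every fetch takes a fixed service time and that fetches beyond the limit wait for a free slot; the function name and the numbers are hypothetical.

```python
import heapq


def durations_with_client_limit(arrivals_ms, max_clients, service_ms):
    """Return per-request wall-clock durations (queue wait + service time).

    Toy model: at most `max_clients` fetches run at once; the rest wait
    in arrival order for the earliest free slot.
    """
    free_at = [0] * max_clients  # time at which each client slot becomes free
    heapq.heapify(free_at)
    durations = []
    for arrival in sorted(arrivals_ms):
        slot_free = heapq.heappop(free_at)   # earliest available slot
        start = max(arrival, slot_free)      # wait if every slot is busy
        finish = start + service_ms
        heapq.heappush(free_at, finish)
        durations.append(finish - arrival)
    return durations


if __name__ == "__main__":
    burst = [0] * 8  # 8 fetches issued at the same instant (assumed)
    for limit in (3, 8):
        d = durations_with_client_limit(burst, limit, service_ms=3)
        print("max_clients=%d mean duration=%.2f ms" % (limit, sum(d) / len(d)))
```

In this model the per-fetch service time never changes, yet the mean wall-clock duration nearly doubles once the limit is lower than the number of fetches in flight. Whether that explains the gap between req_time and dur here would need to be confirmed by raising max_clients and re-running ab.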
The increased time may be because with concurrency 1 the CPU was under-utilized, while with concurrency 8 it is over 100% utilized and cannot keep up with all the requests. For example, suppose an abstract CPU can process 1000 operations/sec, sending a request takes 50 CPU ops, and reading a request result takes another 50 CPU ops. At 5 RPS the CPU is 50% utilized, and the average request time is 50 ms (to send the request) + request time + 50 ms (to read the result). But at, say, 40 RPS (8 times more than 5 RPS), the CPU would be over-utilized at 400%, and some finished requests would be waiting to be parsed, so the average request time becomes 50 ms + request time + CPU wait time + 50 ms.
To sum up, my advice would be to check the CPU utilization under both loads and, to be sure, to profile how much time it takes to send a request and parse a response. The CPU may be your bottleneck.
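The arithmetic in this example can be written out as a short sketch. The figures (1000 ops/sec, 50 ops to send, 50 ops to read) are the illustrative numbers from the answer, not measurements, and the function name is made up for the illustration.

```python
def cpu_utilization(rps, ops_per_request, cpu_ops_per_sec):
    """Fraction of CPU capacity needed to keep up with `rps` requests/sec."""
    return float(rps) * ops_per_request / cpu_ops_per_sec


if __name__ == "__main__":
    ops_per_request = 50 + 50  # send + read, as in the example above
    for rps in (5, 40):
        load = cpu_utilization(rps, ops_per_request, cpu_ops_per_sec=1000)
        print("%d RPS -> %.0f%% CPU" % (rps, load * 100))
```

Any value above 1.0 (100%) means work arrives faster than the CPU can process it, so finished responses have to wait before they are parsed.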
Server: Ubuntu 14.04, 2 cores, 4 GB RAM.
Stack: gunicorn -k gevent with Flask.
The service behind Flask does some Redis reads and writes of small keys and values, using the Python library redis==3.4.1.
The production problem: when more people call the same API at the same time, the API response gets slower and more time is spent in Redis operations, rising from 10ms to 100ms or even higher.
import time
import functools
import redis
from flask import Flask, request, jsonify
app = Flask(__name__)
pool = redis.ConnectionPool(host='',
r = redis.StrictRedis(
def timer(func):
def decorator(*args, **kwargs):
s = time.time()
data = request.json or request.form.to_dict()
r = func(data, *args, **kwargs)
end = time.time()
print('spend: {}'.format(int(end * 1000 - s * 1000)))
return r
return decorator
def get_no():
z = r.get('test2')
print('room_no: {}'.format(z))
if not z:
return get_no()
if player_num() > 100:
return get_no()
return z
def player_num():
return r.incrby('room_num')
def create_no():
if r.setnx('lock', 1):
n = r.incrby('test2')
return n
#app.route('/test', methods=['POST', 'GET'])
def test(data):
# no = get_no()
# print(no)
z = r.incrby('incry_4')
return jsonify(dict(code=200))
In addition, I ran some tests on my local machine with the wrk tool and found that the more connections I use, the longer the API takes to respond. I want to know why the API takes more time when running with -k gevent.
I am currently using an API that rate limits me to 3000 requests per 10 seconds. I have 10,000 URLs that are fetched using Tornado because of its asynchronous I/O.
How do I go about implementing a rate limit that respects the API limit?
from tornado import ioloop, httpclient

i = 0

def handle_request(response):
    global i
    i -= 1
    if i == 0:
        ioloop.IOLoop.instance().stop()  # all responses received

http_client = httpclient.AsyncHTTPClient()
for url in open('urls.txt'):
    i += 1
    http_client.fetch(url.strip(), handle_request, method='HEAD')
ioloop.IOLoop.instance().start()
You can check which interval of 3000 requests the value of i falls in. For example, if i is between 3000 and 6000, you can set a timeout of 10 seconds on every request until 6000. After 6000, just double the timeout, and so on.
import functools
from tornado import ioloop
from tornado.httpclient import AsyncHTTPClient

# i and handle_request are the global counter and callback from the question
http_client = AsyncHTTPClient()
timeout = 10
interval = 3000
for url in open('urls.txt'):
    i += 1
    if i <= interval:
        # i is at most 3000:
        # just fetch the request without any delay
        http_client.fetch(url.strip(), handle_request, method='GET')
        continue # skip the rest of the loop
    if i % interval == 1:
        # i is now 3001, or 6001, and so on ...
        timeout += timeout # double the timeout for the next 3000 calls
    loop = ioloop.IOLoop.current()
    loop.call_later(timeout, callback=functools.partial(http_client.fetch, url.strip(), handle_request, method='GET'))
Note: I only tested this code with a small number of requests. The value of i may change while the loop is running, because handle_request decrements it. If that's the case, you should maintain another variable similar to i and perform the subtraction on that.
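An alternative to doubling the timeout is to delay each batch of 3000 by a fixed 10-second step, so the first batch starts immediately, the second after 10 seconds, the third after 20 seconds, and so on. The sketch below only computes the delays. It is a swapped-in variation on the approach above, the helper name batch_delay is hypothetical, and whether the API counts its 10-second window in exactly this way is an assumption.

```python
def batch_delay(index, limit=3000, window=10):
    """Seconds to wait before sending the request at 0-based position `index`.

    Requests are grouped into batches of `limit`; each batch is scheduled
    one `window` later than the previous one.
    """
    return (index // limit) * window


if __name__ == "__main__":
    for index in (0, 2999, 3000, 6000, 9999):
        print(index, batch_delay(index))
```

Each delay could then be passed to loop.call_later in place of the doubling timeout. Because responses from one batch may still be in flight when the next batch starts, a small safety margin on window may be needed.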
I have a scheduler job that needs to check the live status of 1000 users at a time, but the server accepts at most 50 users per API request. With the for loop below, it takes around 66 seconds for 1000 users (i.e. 20 API calls).
import requests
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

def shcdulerjob():
    uidlist = todays_userslist()  # get around 1000 users from the table
    split_list = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)]
    idlists = split_list(uidlist, 50)  # SERVER MAX LIMIT - 50 ids/request
    for idlist in idlists:
        apiurl = some_server_url + "&ids=" + str(idlist)
        resp = requests.get(apiurl)
        save_status(resp.json())  # save status to db

if __name__ == "__main__":
    sched.add_job(shcdulerjob, 'interval', minutes=10)
    sched.start()
Is there any way to reduce the time required to fetch from the API?
Does Python's APScheduler provide any multiprocessing option to process such API requests in a single job?
You could try Python's thread pool from the concurrent.futures module, if the server allows concurrent requests. That way you would parallelise the processing, rather than the scheduling itself.
There are some good examples in the documentation. (If you're using Python 2, there is a roughly equivalent module.)
import concurrent.futures
import multiprocessing
import requests
import time
import json

cpu_start_time = time.process_time()
clock_start_time = time.time()
queue = multiprocessing.Queue()
uri = "http://localhost:5000/data.json"
users = [str(user) for user in range(1, 50)]
with concurrent.futures.ThreadPoolExecutor(multiprocessing.cpu_count()) as executor:
    # executor.map runs the requests in worker threads and yields results in input order
    for user_id, result in zip(
            users,
            executor.map(lambda x: requests.get(uri, params={'id': x}).content, users)):
        queue.put((user_id, result))
while not queue.empty():
    user_id, rs = queue.get()
    print("User ", user_id, json.loads(rs.decode()))
cpu_end_time = time.process_time()
clock_end_time = time.time()
print("Took {0:.03}s [{1:.03}s]".format(cpu_end_time-cpu_start_time, clock_end_time-clock_start_time))
If you want to use a process pool instead, just make sure you don't use shared resources such as the queue, and write your data out independently.
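Applied to the batching problem in the question, the thread-pool idea might look like the sketch below (Python 3). Here fetch_status is a stand-in for the real requests.get call so that the example runs without a server. The batch size of 50 comes from the question, while max_workers=5 is an arbitrary assumption that should be set to whatever concurrency the server tolerates.

```python
from concurrent.futures import ThreadPoolExecutor


def split_list(lst, size):
    """Split `lst` into consecutive chunks of at most `size` items."""
    return [lst[i:i + size] for i in range(0, len(lst), size)]


def fetch_status(id_batch):
    """Placeholder for the real HTTP call; returns a fake status per id."""
    return {uid: "online" for uid in id_batch}


def fetch_all(ids, batch_size=50, max_workers=5):
    """Fetch every batch concurrently and merge the results into one dict."""
    statuses = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves batch order and runs the calls in threads
        for result in executor.map(fetch_status, split_list(ids, batch_size)):
            statuses.update(result)
    return statuses


if __name__ == "__main__":
    print(len(fetch_all(list(range(1000)))))
```

The results are merged in the calling thread, so no shared queue or locking is needed. Saving to the database could happen in the same place, once per batch.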
import redis
import datetime
import time
import json
import sys
import threading
import gevent
from gevent import monkey
def main(chan):
redis_host = ''
r = redis.client.StrictRedis(host=redis_host, port=6379)
while True:
def getpkg():
package = {'time': time.time(),
'signature' : 'content'
return package
#test 2: complex data
now = json.dumps(getpkg())
# send it
r.publish(chan, now)
print 'Sending {0}'.format(now)
print 'data type is %s' % type(now)
def zerg_rush(n):
for x in range(n):
t = threading.Thread(target=main, args=(x,))
if __name__ == '__main__':
num_of_chan = 10
cnt = 0
stop_cnt = 21
while True:
print 'Waiting'
cnt += 1
if cnt == stop_cnt:
import redis
import threading
import time
import json
import gevent
from gevent import monkey
def callback(ind):
redis_host = ''
r = redis.client.StrictRedis(host=redis_host, port=6379)
sub = r.pubsub()
start = False
avg = 0
tot = 0
sum = 0
while True:
for m in sub.listen():
if not start:
start = True
got_time = time.time()
decoded = json.loads(m['data'])
sent_time = float(decoded['time'])
dur = got_time - sent_time
tot += 1
sum += dur
avg = sum / tot
print decoded #'Recieved: {0}'.format(m['data'])
file_name = 'logs/sub_%s' % ind
f = open(file_name, 'a')
f.write('processing no. %s' % tot)
f.write('it took %s' % dur)
f.write('current avg: %s\n' % avg)
def zerg_rush(n):
for x in range(n):
t = threading.Thread(target=callback, args=(x,))
def main():
num_of_chan = 10
while True:
print 'Waiting'
if __name__ == '__main__':
I am testing Redis pub/sub as a replacement for rsh to communicate with remote boxes.
One of the things I tested was how the number of channels affects the latency of publish and pubsub.listen().
Test: one publisher and one subscriber per channel (each publisher publishes once per second). I increased the number of channels and observed the latency (the time from the moment the publisher publishes a message to the moment the subscriber receives it via listen).
num of chan--------------avg latency in seconds
Note: tested on a RHEL 6.4 VM with 2 CPUs, 4 GB RAM and 1 NIC.
What can I do to maintain low latency with a high number of channels?
Redis is single-threaded, so adding more CPUs won't help. Maybe more RAM? If so, how much more?
Is there anything I can do code-wise, or is the bottleneck in Redis itself?
Maybe the limitation comes from the way my test code is written with threading?
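The latency measurement described here can be sketched without Redis. The payload shape (a JSON object with a time field) follows the publisher code above; the class name and the sample numbers are made up for illustration.

```python
import json


class LatencyTracker(object):
    """Running average of (receive time - send time) over messages."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def record(self, raw_message, received_at):
        """Decode one published message and return its latency in seconds."""
        sent_at = float(json.loads(raw_message)["time"])
        latency = received_at - sent_at
        self.count += 1
        self.total += latency
        return latency

    @property
    def average(self):
        """Mean latency so far, or 0.0 before any message is recorded."""
        return self.total / self.count if self.count else 0.0


if __name__ == "__main__":
    tracker = LatencyTracker()
    tracker.record(json.dumps({"time": 100.0}), received_at=100.5)
    tracker.record(json.dumps({"time": 200.0}), received_at=200.25)
    print(tracker.average)
```

If the publisher and subscriber run on different hosts, this compares two different clocks, so any clock skew between the machines would show up as latency.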
Redis Cluster vs ZeroMQ in Pub/Sub, for horizontally scaled distributed systems
Accepted answer says "You want to minimize latency, I guess. The number of channels is irrelevant. The key factors are the number of publishers and number of subscribers, message size, number of messages per second per publisher, number of messages received by each subscriber, roughly. ZeroMQ can do several million small messages per second from one node to another; your bottleneck will be the network long before it's the software. Most high-volume pubsub architectures therefore use something like PGM multicast, which ZeroMQ supports."
From my testing, I don't know whether this is true (the claim that the number of channels is irrelevant).
For example, I ran the following tests.
1) One channel: 100 publishers publishing to a channel with 1 subscriber listening, each publisher publishing once per second. Latency was 0.00965 seconds.
2) The same test with 1000 publishers. Latency was 0.00808 seconds.
Now, during my channel testing:
300 channels with 1 publisher and 1 subscriber each resulted in 0.0621 seconds. That is only 600 connections, fewer than in the 1000-publisher test above, yet the latency is significantly worse.
I am trying to write a web benchmarking tool based on Twisted. Twisted is a fantastic asynchronous framework for web applications. Since I have only been using the framework for two weeks, I have run into a problem:
When I compare my benchmarking tool with ApacheBench, the results differ greatly at the same concurrency. Here is the result of my tool:
python 50000 50
speed:1063(q/s), worker:50, interval:7, req_made:7493, req_done:7443, req_error:0
And Here is the result of Apache Bench:
ab -c 50 -n 50000
Server Software: nginx/1.4.1
Server Hostname:
Server Port: 8080
Document Path: /a.txt
Document Length: 6 bytes
Concurrency Level: 50
Time taken for tests: 6.89937 seconds
Complete requests: 50000
Failed requests: 0
Write errors: 0
Total transferred: 12501750 bytes
HTML transferred: 300042 bytes
Requests per second: 8210.27 [#/sec] (mean)
Time per request: 6.090 [ms] (mean)
Time per request: 0.122 [ms] (mean, across all concurrent requests)
Transfer rate: 2004.62 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.8 0 4
Processing: 1 5 3.4 5 110
Waiting: 0 2 3.6 2 109
Total: 1 5 3.5 5 110
Percentage of the requests served within a certain time (ms)
50% 5
66% 6
75% 6
80% 6
90% 7
95% 7
98% 8
99% 8
100% 110 (longest request)
On the same URL and concurrency, ApacheBench reaches about 8000 req/sec, while pyab manages only about 1000 req/sec.
Here is my code:
from twisted.internet import reactor,threads
from twisted.internet.protocol import Protocol
from twisted.internet.defer import Deferred
from twisted.web.client import Agent
from twisted.web.client import HTTPConnectionPool
from twisted.web.http_headers import Headers
from twisted.python import log
import time, os, stat, logging, sys
from collections import Counter
#filename= "/%s/log/%s.%s" % (RUN_DIR,RUN_MODULE,RUN_TIME),
format="%(asctime)s [%(levelname)s] %(message)s",
observer = log.PythonLoggingObserver()
class IgnoreBody(Protocol):
def __init__(self, deferred, tl):
self.deferred = deferred = tl
def dataReceived(self, bytes):
def connectionLost(self, reason):
class Pyab:
def __init__( self, n = 50000, concurrency = 100, url=''):
self.n = n
self.url = url
self.pool = HTTPConnectionPool(reactor, persistent=True)
self.pool.maxPersistentPerHost = concurrency
self.agent = Agent(reactor, connectTimeout = 5, pool = self.pool)
#self.agent = Agent(reactor, connectTimeout = 5)
self.time_start = time.time()
self.max_worker = concurrency
self.cnt = Counter({
'worker' : 0 ,
'req_made' : 0,
'req_done' : 0,
'req_error' : 0,
def monitor( self ):
interval = int(time.time() - self.time_start)
speed = 0
if interval != 0:
speed = int( self.cnt['req_done'] / interval )
log.msg("speed:%d(q/s), worker:%d, interval:%d, req_made:%d, req_done:%d, req_error:%d"
% (speed, self.cnt['worker'], interval, self.cnt['req_made'], self.cnt['req_done'], self.cnt['req_error']), logLevel=logging.WARNING)
reactor.callLater(1, lambda : self.monitor())
def start( self ):
self.keeprunning = True
def stop( self ):
self.keeprunning = False
def readMore( self ):
while self.cnt['worker'] < self.max_worker and self.cnt['req_done'] < self.n :
if self.keeprunning and self.cnt['req_done'] < self.n:
reactor.callLater( 0.0001, lambda: self.readMore() )
def make_request( self ):
d = self.agent.request(
Headers({'User-Agent': ['Twisted Web Client Example']}),
self.cnt['worker'] += 1
self.cnt['req_made'] += 1
def cbResponse(resp):
self.cnt['worker'] -= 1
self.cnt['req_done'] += 1
log.msg('response received')
finished = Deferred()
resp.deliverBody(IgnoreBody(finished, self))
return finished
def cbError(error):
self.cnt['worker'] -= 1
self.cnt['req_error'] += 1
log.msg(error, logLevel=logging.ERROR)
if __name__ == '__main__' :
if len(sys.argv) < 4:
print "Usage: %s <n> <concurrency> <url>" % (sys.argv[0])
ab = Pyab(n=int(sys.argv[1]), concurrency=int(sys.argv[2]), url=sys.argv[3])
Is there anything wrong with my code? Thanks!
When I last used it, ab was known to have dozens of serious bugs. Sometimes that would cause it to report massively inflated results. Sometimes it would report negative results. Sometimes it would crash. I'd try another tool, like httperf, as a sanity check.
However, if your server is actually that fast, then you might have another issue.
Even if ab has been fixed, you're talking here about a C program versus a Python program running on CPython. 8x slower than C in Python is not actually all that bad, so I don't expect there is actually anything wrong with your program, except that it doesn't make use of spawnProcess and multi-core concurrency.
For starters, see if you get any better results on PyPy.
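To illustrate the multi-core point: a single reactor process effectively uses one core for this kind of work, so one option is to run several copies of the benchmark and give each a share of the total. The sketch below only does the bookkeeping of dividing the request count and the concurrency among workers. The helper name and the core count are hypothetical, and actually launching the workers (for example with spawnProcess) is left out.

```python
def split_evenly(total, workers):
    """Divide `total` into `workers` integer shares that differ by at most 1."""
    base, extra = divmod(total, workers)
    return [base + 1 if i < extra else base for i in range(workers)]


if __name__ == "__main__":
    cores = 4  # assumed core count, for illustration only
    print(split_evenly(50000, cores))  # requests per worker
    print(split_evenly(50, cores))     # concurrent connections per worker
```

Each worker would then be started with its own share of n and concurrency, and the per-worker request rates added up to get the total.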