How to lock using a context manager - Python

I have been trying to understand context managers better, and the deeper I get into them the more problems I seem to find. My current problem is that I have no lock, which means two or more threads could end up using the same shared value, while I only want each value to be in use by one thread at a time.
import random
import threading
import time

list_op_proxy = [
    "https://123.123.12.21:12345",
    "http://123.123.12.21:54321",
]

proxy_dict = dict(zip(list_op_proxy, ['available'] * len(list_op_proxy)))
proxy_dict['http://123.123.12.21:987532'] = "busy"


class AvailableProxies:
    def __enter__(self):
        while True:
            available = [att for att, value in proxy_dict.items() if "available" in value]
            if available:
                self.proxy = random.choice(available)
                proxy_dict[self.proxy] = "busy"
                return self.proxy
            else:
                continue

    def __exit__(self, exc_type, exc_val, exc_tb):
        proxy_dict[self.proxy] = "available"


def handler(name):
    with AvailableProxies() as proxy:
        print(f"{name} | Proxy in use: {proxy}")
        # Adding 2 seconds so we can see whether it actually waits for availability
        time.sleep(2)


for i in range(5):
    threading.Thread(target=handler, args=(f'Thread {i}',)).start()
As you can see, in my context manager I loop over the dict and pick a random key whose value is set to "available"; if one is found I mark it as "busy", do some work, and on exit release it by setting the value back to "available". My problem is that in rare cases two or more threads end up getting the same proxy, which I want to prevent. Only one thread at a time should be able to enter the part of the context manager that marks a proxy as busy, so no other thread can take it.
How can I add a lock so that only one thread at a time can set a proxy to busy, and it can never happen that two or more threads mark the same proxy as busy?

You just need to acquire the lock while looking for a proxy and release it once a proxy has been found (the usage is the same as in your previous question, regardless of whether you are inside a context manager). I also added some more debug messages:
import random
import threading
import time

list_op_proxy = [
    "https://123.123.12.21:12345",
    "http://123.123.12.21:54321",
]

proxy_dict = dict(zip(list_op_proxy, ['available'] * len(list_op_proxy)))
proxy_dict['http://123.123.12.21:987532'] = "busy"

proxy_lock = threading.Lock()


class AvailableProxies:
    def __enter__(self):
        proxy_lock.acquire()
        self.proxy = None
        while not self.proxy:
            available = [
                att for att, value in proxy_dict.items() if "available" in value
            ]
            if available:
                print('%d proxies available' % len(available))
                self.proxy = random.choice(available)
                proxy_dict[self.proxy] = "busy"
                break
            else:
                print("Waiting ... not proxy available")
                time.sleep(.2)
                continue
        proxy_lock.release()
        return self.proxy

    def __exit__(self, exc_type, exc_val, exc_tb):
        proxy_dict[self.proxy] = "available"


def handler(name):
    with AvailableProxies() as proxy:
        print(f"{name} | Proxy in use: {proxy}")
        # Sleep briefly so the other threads have to wait for availability
        time.sleep(.1)


for j in range(5):
    threads = [threading.Thread(target=handler, args=(i, )) for i in range(3)]
    [t.start() for t in threads]
    [t.join() for t in threads]
    print("---")
Out:
2 proxies available
0 | Proxy in use: http://123.123.12.21:54321
1 proxies available
1 | Proxy in use: https://123.123.12.21:12345
Waiting ... not proxy available
2 proxies available
2 | Proxy in use: https://123.123.12.21:12345
---
2 proxies available
0 | Proxy in use: http://123.123.12.21:54321
1 proxies available
1 | Proxy in use: https://123.123.12.21:12345
Waiting ... not proxy available
2 proxies available
2 | Proxy in use: http://123.123.12.21:54321
---
2 proxies available
0 | Proxy in use: https://123.123.12.21:12345
1 proxies available
1 | Proxy in use: http://123.123.12.21:54321
Waiting ... not proxy available
2 proxies available
2 | Proxy in use: https://123.123.12.21:12345
---
2 proxies available
0 | Proxy in use: https://123.123.12.21:12345
1 proxies available
1 | Proxy in use: http://123.123.12.21:54321
Waiting ... not proxy available
2 proxies available
2 | Proxy in use: https://123.123.12.21:12345
---
2 proxies available
0 | Proxy in use: http://123.123.12.21:54321
1 proxies available
1 | Proxy in use: https://123.123.12.21:12345
Waiting ... not proxy available
2 proxies available
2 | Proxy in use: http://123.123.12.21:54321
---
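
As a small follow-up sketch (not part of the accepted code above): the wait-and-sleep loop can be avoided with a threading.Condition, and using the condition as a context manager also guarantees the lock is released even if an exception occurs while searching. This reuses proxy_dict and the imports from the snippet above; the Condition object and the notify() call are my additions, not something the original answer requires:

proxy_cond = threading.Condition()

class AvailableProxies:
    def __enter__(self):
        # Block until at least one proxy is available, without busy-waiting.
        with proxy_cond:
            while True:
                available = [p for p, state in proxy_dict.items() if state == "available"]
                if available:
                    self.proxy = random.choice(available)
                    proxy_dict[self.proxy] = "busy"
                    return self.proxy
                proxy_cond.wait()  # lock is released while waiting, re-acquired on wake-up

    def __exit__(self, exc_type, exc_val, exc_tb):
        with proxy_cond:
            proxy_dict[self.proxy] = "available"
            proxy_cond.notify()  # wake one thread waiting for a proxy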

Related

How to tell a beanstalkc receiver to wait X seconds before reserving a task?

I have 2 beanstalkc receivers watching the same tube "tubename".
I would like one beanstalkc receiver to have priority over the other. In order to achieve this, I would like to tell the lowest-priority beanstalkc receiver to wait for tasks to be X seconds old before reserving them.
I found "reserve-with-timeout", but I don't really understand it, nor have I managed to make it work for my use case.
class MyBeanstalkReceiver():
    def __init__(self, host=beanstalkc.DEFAULT_HOST, port=beanstalkc.DEFAULT_PORT,
                 tube="default", timeout=1):
        self.tube = tube
        self.host = host
        self.port = port
        self.timeout = timeout

    def run(self):
        while True:
            self.run_once()

    def run_once(self):
        job = self._get_task()
        try:
            body = job.body
            data = json.loads(body)
            self.job(data)
        except Exception as e:
            job.delete()

    def job(self, data):
        print(data)

    def beanstalk(self):
        beanstalk = beanstalkc.Connection(host=self.host, port=self.port)
        beanstalk.use(self.tube)
        beanstalk.watch(self.tube)
        return beanstalk

    def _get_task(self):
        return self.beanstalk().reserve(self.timeout)
And my 2 beanstalkc receivers:
# receiver 1
w = MyBeanstalkReceiver(hosts=["localhost:14711"], tube="tubename", timeout=1)
w.run()
# receiver 2
w = MyBeanstalkReceiver(hosts=["localhost:14711"], tube="tubename", timeout=10000)
w.run()
Between the 2 receivers, with timeouts of 1 and 10000, nothing changes when I send tasks over the tube: both end up handling the same number of tasks put into the tube "tubename".
Any idea on how to make "receiver 1" take priority over "receiver 2"?
The timeout in reserve is for how long the client will wait before returning without a job.
You may be looking for put (with a delay), where the job is not released until it has been in the queue for at least n seconds.
There is also a priority per job. If a receiver can see several jobs at the same time, it will be given the ones with a higher priority (i.e. closer to 0) before those with a lower priority (a larger number).
Beanstalkd does not differentiate between priorities of the clients or receivers.
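For illustration only, here is a minimal producer sketch showing per-job delay and priority with beanstalkc; the tube name and port come from the question, the payloads are placeholders, and the keyword arguments are the library's put() parameters:

import json
import beanstalkc

conn = beanstalkc.Connection(host='localhost', port=14711)
conn.use('tubename')

# Normal job: visible to every watcher immediately, default priority.
conn.put(json.dumps({'task': 'normal'}))

# Delayed job: stays invisible for 30 seconds before any receiver can reserve it.
conn.put(json.dumps({'task': 'later'}), delay=30)

# Urgent job: lower numbers are served first when several jobs are ready.
conn.put(json.dumps({'task': 'urgent'}), priority=0)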

How to create a ``depends_on`` relationship between scheduled and queued jobs in python-rq

I have a web service (Python 3.7, Flask 1.0.2) with a workflow consisting of 3 steps:
Step 1: Submitting a remote compute job to a commercial queuing system (IBM's LSF)
Step 2: Polling every 61 seconds for the remote compute job status (61 seconds because of cached job status results)
Step 3: Data post-processing if step 2 returns remote compute job status == "DONE"
The remote compute job is of arbitrary length (between seconds and days) and each step is dependent on the completion of the previous one:
with Connection(redis.from_url(current_app.config['REDIS_URL'])):
    q = Queue()
    job1 = q.enqueue(step1)
    job2 = q.enqueue(step2, depends_on=job1)
    job3 = q.enqueue(step3, depends_on=job2)
However, eventually all workers (there are 4) end up polling (step 2 of 4 client requests), while they should be working on step 1 of other incoming requests and on step 3 of workflows that have already passed step 2.
Workers should be released after each poll. They should periodically come back to step 2 for the next poll (at most every 61 seconds per job) and, if the poll does not return "DONE", re-queue the poll job.
At this point in time I started to use rq-scheduler (because the interval and re-queueing features sounded promising):
with Connection(redis.from_url(current_app.config['REDIS_URL'])):
    q = Queue()
    s = Scheduler('default')
    job1 = q.enqueue(step1, REQ_ID)
    job2 = Job.create(step2, (REQ_ID,), depends_on=job1)
    job2.meta['interval'] = 61
    job2.origin = 'default'
    job2.save()
    s.enqueue_job(job2)
    job3 = q.enqueue(step3, REQ_ID, depends_on=job2)
Job2 is created correctly (including the depends_on relationship to job1), but s.enqueue_job() executes it straight away, ignoring its relationship to job1. (The doc-string of Scheduler.enqueue_job() actually says that the job is executed immediately ...)
How can I create the depends_on relationship between job1, job2 and job3, when job2 is put in the scheduler and not the queue? (Or, how can I hand job2 to the scheduler, without it executing job2 straight away and waiting for job1 to finish?)
For testing purposes the steps look like this:
def step1():
    print(f'*** --> [{datetime.utcnow()}] JOB [ 1 ] STARTED...', flush=True)
    time.sleep(20)
    print(f' <-- [{datetime.utcnow()}] JOB [ 1 ] FINISHED', flush=True)
    return True

def step2():
    print(f' --> [{datetime.utcnow()}] POLL JOB [ 2 ] STARTED...', flush=True)
    time.sleep(10)
    print(f' <-- [{datetime.utcnow()}] POLL JOB [ 2 ] FINISHED', flush=True)
    return True

def step3():
    print(f' --> [{datetime.utcnow()}] JOB [ 3 ] STARTED...', flush=True)
    time.sleep(10)
    print(f'*** <-- [{datetime.utcnow()}] JOB [ 3 ] FINISHED', flush=True)
    return True
And the output I receive is this:
worker_1 | 14:44:57 default: project.server.main.tasks.step1(1) (d40256a2-904f-4ce3-98da-6e49b5d370c9)
worker_2 | 14:44:57 default: project.server.main.tasks.step2(1) (3736909c-f05d-4160-9a76-01bb1b18db58)
worker_2 | --> [2019-11-04 14:44:57.341133] POLL JOB [ 2 ] STARTED...
worker_1 | *** --> [2019-11-04 14:44:57.342142] JOB [ 1 ] STARTED...
...
job2 is not waiting for job1 to complete ...
#requirements.txt
Flask==1.0.2
Flask-Bootstrap==3.3.7.1
Flask-Testing==0.7.1
Flask-WTF==0.14.2
redis==3.3.11
rq==0.13
rq_scheduler==0.9.1
My solution to this problem uses rq only (and no longer rq_scheduler):
Upgrade to the latest python-rq package:
# requirements.txt
...
rq==1.1.0
Create a dedicated queue for the polling jobs, and enqueue jobs accordingly (with the depends_on relationship):
with Connection(redis.from_url(current_app.config['REDIS_URL'])):
    q = Queue('default')
    p = Queue('pqueue')
    job1 = q.enqueue(step1)
    job2 = p.enqueue(step2, depends_on=job1)  # step2 enqueued in polling queue
    job3 = q.enqueue(step3, depends_on=job2)
Derive a dedicated worker for the polling queue. It inherits from the standard Worker class:
class PWorker(rq.worker.Worker):

    def execute_job(self, *args, **kwargs):
        seconds_between_polls = 65
        job = args[0]
        if 'lastpoll' in job.meta:
            job_timedelta = (datetime.utcnow() - job.meta["lastpoll"]).total_seconds()
            if job_timedelta < seconds_between_polls:
                sleep_period = seconds_between_polls - job_timedelta
                time.sleep(sleep_period)
        job.meta['lastpoll'] = datetime.utcnow()
        job.save_meta()
        super().execute_job(*args, **kwargs)
The PWorker extends the execute_job method by adding a 'lastpoll' timestamp to the job's meta data.
If a poll job comes in that already has a lastpoll timestamp, the worker checks whether the time since lastpoll is greater than 65 seconds. If it is, it writes the current time to 'lastpoll' and executes the poll. If not, it sleeps until the 65 s are up, then writes the current time to 'lastpoll' and executes the poll. A job coming in without a lastpoll timestamp is polling for the first time, and the worker creates the timestamp and executes the poll.
Create a dedicated exception (to be thrown by the task function) and an exception handler to deal with it:
# exceptions.py

class PACError(Exception):
    pass

class PACJobRun(PACError):
    pass

class PACJobExit(PACError):
    pass


# exception_handlers.py

def poll_exc_handler(job, exc_type, exc_value, traceback):
    if exc_type is PACJobRun:
        requeue_job(job.get_id(), connection=job.connection)
        return False  # no further exception handling
    else:
        return True  # further exception handling


# tasks.py

def step2():
    # GET request to remote compute job portal API for status
    # if response == "RUN":
    raise PACJobRun
    return True
When the custom exception handler catches the custom exception (which means the remote compute job is still running), it requeues the job in the polling queue.
Slot the custom exception handler into the exception handling hierarchy:
# manage.py
#cli.command('run_pworker')
def run_pworker():
redis_url = app.config['REDIS_URL']
redis_connection = redis.from_url(redis_url)
with rq.connections.Connection(redis_connection):
pworker = PWorker(app.config['PQUEUE'], exception_handlers=[poll_exc_handler])
pworker.work()
The nice thing about this solution is that it extends the standard functionality of python-rq with only a few lines of extra code. On the other hand, there is the added complexity of an extra queue and worker …
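For completeness, the 'default' queue still needs ordinary workers next to the polling worker. A minimal sketch of such a companion script (the Redis URL is a placeholder and the script itself is my assumption; only the queue name comes from the snippets above):

# run_worker.py -- hypothetical companion to run_pworker
import redis
import rq

redis_connection = redis.from_url('redis://localhost:6379/0')  # placeholder Redis URL

with rq.connections.Connection(redis_connection):
    # A plain rq worker handles the step1/step3 jobs on the 'default' queue,
    # while PWorker instances drain the 'pqueue' polling queue.
    worker = rq.Worker(['default'])
    worker.work()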

Tornado server using most of the CPU with tornado-sockjs and only two clients

I am using Tornado 4.4.2 with pypy 5.9.0 and python 2.7.13,
hosted on Ubuntu 16.04.3 LTS.
When a new client logs in, a new class instance is created and passed the socket, so a dialog can be maintained. I am using a global clients[] list to hold the instances. The initial dialog looks like:
clients = []

class RegisterWebSocket(SockJSConnection):
    # initialize the class and handle on-open (some things left out)

    def on_open(self, info):
        self.ipaddress = info.headers['X-Real-Ip']

    def on_message(self, data):
        coinlist = []
        msg = json.loads(data)
        if 'coinlist' in msg:
            coinlist = msg['coinlist']
        if 'currency' in msg:
            currency = msg['currency']
        tz = pendulum.timezone('America/New_York')
        started = pendulum.now(tz).to_day_datetime_string()
        ws = WebClientUpdater(self, self.clientid, coinlist, currency,
                              started, self.ipaddress)
        clients.append(ws)
The WebClientUpdater class is shown below; I use a Tornado PeriodicCallback to update the clients with their specific info every 20 seconds:
class WebClientUpdater(SockJSConnection):
    def __init__(self, ws, id, clist, currency, started, ipaddress):
        super(WebClientUpdater, self).__init__(ws.session)
        self.ws = ws
        self.id = id
        self.coinlist = clist
        self.currency = currency
        self.started = started
        self.ipaddress = ipaddress
        self.location = loc
        self.loop = tornado.ioloop.PeriodicCallback(self.updateCoinList,
                                                    20000, io_loop=tornado.ioloop.IOLoop.instance())
        self.loop.start()
        self.send_msg('welcome ' + id)

    def updateCoinList(self):
        pdata = db.getPricesOfCoinsInCurrency(self.coinlist, self.currency)
        self.send(dict(priceforcoins=pdata))

    def send_msg(self, msg):
        self.send(msg)
I also start a 60-second PeriodicCallback at startup to monitor the clients for closed connections and remove them from the clients[] list. I call it from the startup code like this:
if __name__ == "__main__":
    app = make_app()
    app.listen(options.port)
    ScheduleSocketCleaning()
and
def ScheduleSocketCleaning():
    def cleanSocketHouse():
        print "checking sockets"
        for x in clients:
            if x.is_closed:
                x = None
        clients[:] = [y for y in clients if not y.is_closed]

    loop = tornado.ioloop.PeriodicCallback(cleanSocketHouse, 60000,
                                           io_loop=tornado.ioloop.IOLoop.instance())
    loop.start()
If I monitor the server using top, I see that it typically uses 4% CPU with immediate bursts to 60+%, but after a few hours it climbs into the 90% range and stays there.
I have used strace and I see an enormous number of stat calls on the same files, with errors shown in the strace -c view, but I cannot find any errors in a text file written with -o trace.log. How can I find those errors?
But I also notice that most of the time is consumed in epoll_wait.
% time     seconds  usecs/call     calls    errors syscall
 41.61    0.068097           7      9484           epoll_wait
 26.65    0.043617           0    906154      2410 stat
 15.77    0.025811           0    524072           read
 10.90    0.017840         129       138           brk
  2.41    0.003937           9       417           madvise
  2.04    0.003340           0    524072           lseek
  0.56    0.000923           3       298           sendto
  0.06    0.000098           0     23779           gettimeofday
100.00    0.163663                1989527      2410 total
Notice 2410 errors above.
When I view the strace output stream attached to the pid, I just see endless stat calls on the same files.
Can someone advise me on how to better debug this situation? With only two clients and 20 seconds between client updates, I would expect the CPU usage (there are no other users of the site during this prototype stage) to be less than 1% or thereabouts.
You need to stop PeriodicCallbacks, otherwise it's a memory leak. You do that by simply calling .stop() on the PeriodicCallback object. One way to deal with that is in your periodic cleaning task:
def cleanSocketHouse():
    global clients
    new_clients = []
    for client in clients:
        if client.is_closed:
            # I don't know why you call it loop,
            # .timer would be more appropriate
            client.loop.stop()
        else:
            new_clients.append(client)
    clients = new_clients
I'm not sure how accurate .is_closed is (some testing is required). The other way is to alter updateCoinList. The .send() method should fail when the client is no longer connected, right? Therefore try: except: should do the trick:
def updateCoinList(self):
    global clients
    pdata = db.getPricesOfCoinsInCurrency(self.coinlist, self.currency)
    try:
        self.send(dict(priceforcoins=pdata))
    except Exception:
        # log exception?
        self.loop.stop()
        clients.remove(self)  # you should probably use set instead of list
If .send() actually doesn't fail (for whatever reason, I'm not that familiar with Tornado), then stick to the first solution.
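A third option, just a sketch on my part and assuming sockjs-tornado invokes on_close on the connection object when the client disconnects, is to stop the timer right there instead of waiting for the cleaning task:

class WebClientUpdater(SockJSConnection):
    # __init__, updateCoinList and send_msg as in the question ...

    def on_close(self):
        # Stop the periodic timer as soon as the client disconnects, so the
        # callback no longer keeps this object (and its DB queries) alive.
        self.loop.stop()
        if self in clients:
            clients.remove(self)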

Extract text from 200k domains with scrapy

My problem is: I want to extract all valuable text from some domain, for example www.example.com. So I go to this website, visit all the links up to a maximal depth of 2, and write them to a CSV file.
I wrote a module in scrapy which solves this problem using one process and yielding multiple crawlers, but it is inefficient - I am able to crawl ~1k domains / ~5k websites per hour, and as far as I can see my bottleneck is the CPU (because of the GIL?). After leaving my PC running for some time I found that my network connection was broken.
When I tried to use multiple processes I just got an error from twisted: Multiprocessing of Scrapy Spiders in Parallel Processes. So this means I must learn twisted, which I would say is deprecated compared to asyncio, but that is only my opinion.
So I have a couple of ideas about what to do:
Fight back, try to learn twisted, and implement multiprocessing with a distributed queue backed by Redis, but I don't feel that scrapy is the right tool for this type of job.
Go with pyspider - which has all the features I need (but I've never used it).
Go with nutch - which is very complex (but I've never used it).
Try to build my own distributed crawler, but after crawling 4 websites I already found 4 edge cases: SSL, duplication, timeouts. On the other hand it would be easy to add some modifications like focused crawling.
What solution do you recommend?
Edit1: Sharing code
class ESIndexingPipeline(object):
    def __init__(self):
        # self.text = set()
        self.extracted_type = []
        self.text = OrderedSet()
        import html2text
        self.h = html2text.HTML2Text()
        self.h.ignore_links = True
        self.h.images_to_alt = True

    def process_item(self, item, spider):
        body = item['body']
        body = self.h.handle(str(body, 'utf8')).split('\n')
        first_line = True
        for piece in body:
            piece = piece.strip(' \n\t\r')
            if len(piece) == 0:
                first_line = True
            else:
                e = ''
                if not self.text.empty() and not first_line and not regex.match(piece):
                    e = self.text.pop() + ' '
                e += piece
                self.text.add(e)
                first_line = False
        return item

    def open_spider(self, spider):
        self.target_id = spider.target_id
        self.queue = spider.queue

    def close_spider(self, spider):
        self.text = [e for e in self.text if comprehension_helper(langdetect.detect, e) == 'en']
        if spider.write_to_file:
            self._write_to_file(spider)

    def _write_to_file(self, spider):
        concat = "\n".join(self.text)
        self.queue.put([self.target_id, concat])
And the call:
def execute_crawler_process(targets, write_to_file=True, settings=None, parallel=800, queue=None):
    if settings is None:
        settings = DEFAULT_SPIDER_SETTINGS

    # causes that runners work sequentially
    @defer.inlineCallbacks
    def crawl(runner):
        n_crawlers_batch = 0
        done = 0
        n = float(len(targets))
        for url in targets:
            # print("target: ", url)
            n_crawlers_batch += 1
            r = runner.crawl(
                TextExtractionSpider,
                url=url,
                target_id=url,
                write_to_file=write_to_file,
                queue=queue)
            if n_crawlers_batch == parallel:
                print('joining')
                n_crawlers_batch = 0
                d = runner.join()
                # todo: print before yield
                done += n_crawlers_batch
                yield d  # download rest of data
        if n_crawlers_batch < parallel:
            d = runner.join()
            done += n_crawlers_batch
            yield d

        reactor.stop()

    def f():
        runner = CrawlerProcess(settings)
        crawl(runner)
        reactor.run()

    p = Process(target=f)
    p.start()
The spider is not particularly interesting.
You can use Scrapy-Redis. It is basically a Scrapy spider that fetches URLs to crawl from a queue in Redis.
The advantage is that you can start many concurrent spiders so you can crawl faster. All the instances of the spider will pull the URLs from the queue and wait idle when they run out of URLs to crawl.
The repository of Scrapy-Redis comes with an example project to implement this.
I use Scrapy-Redis to fire up 64 instances of my crawler to scrape 1 million URLs in around 1 hour.
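To give an idea of how little code that takes, here is a sketch only: the settings keys and the RedisSpider base class come from scrapy-redis, while the Redis URL, spider name, redis_key and parse output are made up for the example:

# settings.py -- minimal scrapy-redis wiring
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # schedule requests through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # duplicate filter shared by all instances
SCHEDULER_PERSIST = True                                     # keep the queue between runs
REDIS_URL = "redis://localhost:6379"                         # placeholder Redis instance

# myspider.py -- every running instance pops start URLs from the shared Redis list
from scrapy_redis.spiders import RedisSpider

class TextSpider(RedisSpider):
    name = "text_spider"
    redis_key = "text_spider:start_urls"   # push domains here, e.g. with LPUSH

    def parse(self, response):
        yield {"url": response.url, "body": response.text}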

Benchmarking tool using twisted

I am trying to write a web benchmarking tool based on twisted. Twisted is a fantastic asynchronous framework for web applications. I have only been using the framework for two weeks, and I am facing a problem:
When I compare this benchmarking tool with ApacheBench, the results differ greatly at the same concurrency. Here is the result of my tool:
python pyab.py 50000 50 http://xx.com/a.txt
speed:1063(q/s), worker:50, interval:7, req_made:7493, req_done:7443, req_error:0
And here is the result of ApacheBench:
ab -c 50 -n 50000 http://xx.com/a.txt
Server Software: nginx/1.4.1
Server Hostname: 203.90.245.26
Server Port: 8080
Document Path: /a.txt
Document Length: 6 bytes
Concurrency Level: 50
Time taken for tests: 6.89937 seconds
Complete requests: 50000
Failed requests: 0
Write errors: 0
Total transferred: 12501750 bytes
HTML transferred: 300042 bytes
Requests per second: 8210.27 [#/sec] (mean)
Time per request: 6.090 [ms] (mean)
Time per request: 0.122 [ms] (mean, across all concurrent requests)
Transfer rate: 2004.62 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.8      0       4
Processing:     1    5   3.4      5     110
Waiting:        0    2   3.6      2     109
Total:          1    5   3.5      5     110
Percentage of the requests served within a certain time (ms)
50% 5
66% 6
75% 6
80% 6
90% 7
95% 7
98% 8
99% 8
100% 110 (longest request)
On the same URL and concurrency, ApacheBench reaches about 8000 req/sec, while pyab only manages about 1000 req/sec.
Here is my code (pyab.py):
from twisted.internet import reactor, threads
from twisted.internet.protocol import Protocol
from twisted.internet.defer import Deferred
from twisted.web.client import Agent
from twisted.web.client import HTTPConnectionPool
from twisted.web.http_headers import Headers
from twisted.python import log
import time, os, stat, logging, sys
from collections import Counter

logging.basicConfig(
    #filename= "/%s/log/%s.%s" % (RUN_DIR,RUN_MODULE,RUN_TIME),
    format="%(asctime)s [%(levelname)s] %(message)s",
    level=logging.WARNING,
    #level=logging.DEBUG,
    stream=sys.stdout
)

#log.startLogging(sys.stdout)
observer = log.PythonLoggingObserver()
observer.start()


class IgnoreBody(Protocol):
    def __init__(self, deferred, tl):
        self.deferred = deferred
        self.tl = tl

    def dataReceived(self, bytes):
        pass

    def connectionLost(self, reason):
        self.deferred.callback(None)


class Pyab:
    def __init__(self, n=50000, concurrency=100, url='http://203.90.245.26:8080/a.txt'):
        self.n = n
        self.url = url
        self.pool = HTTPConnectionPool(reactor, persistent=True)
        self.pool.maxPersistentPerHost = concurrency
        self.agent = Agent(reactor, connectTimeout=5, pool=self.pool)
        #self.agent = Agent(reactor, connectTimeout=5)
        self.time_start = time.time()
        self.max_worker = concurrency

        self.cnt = Counter({
            'worker': 0,
            'req_made': 0,
            'req_done': 0,
            'req_error': 0,
        })

    def monitor(self):
        interval = int(time.time() - self.time_start)
        speed = 0
        if interval != 0:
            speed = int(self.cnt['req_done'] / interval)
        log.msg("speed:%d(q/s), worker:%d, interval:%d, req_made:%d, req_done:%d, req_error:%d"
                % (speed, self.cnt['worker'], interval, self.cnt['req_made'], self.cnt['req_done'], self.cnt['req_error']), logLevel=logging.WARNING)
        reactor.callLater(1, lambda: self.monitor())

    def start(self):
        self.keeprunning = True
        self.monitor()
        self.readMore()

    def stop(self):
        self.keeprunning = False

    def readMore(self):
        while self.cnt['worker'] < self.max_worker and self.cnt['req_done'] < self.n:
            self.make_request()

        if self.keeprunning and self.cnt['req_done'] < self.n:
            reactor.callLater(0.0001, lambda: self.readMore())
        else:
            reactor.stop()

    def make_request(self):
        d = self.agent.request(
            'GET',
            #'http://examplexx.com/',
            #'http://example.com/',
            #'http://xa.xingcloud.com/v4/qvo/WDCXWD7500AADS-00M2B0_WD-WCAV5E38536685366?update0=ref0%2Ccor&update1=nation%2Ccn&action0=visit&_ts=1376397973636',
            #'http://203.90.245.26:8080/a.txt',
            self.url,
            Headers({'User-Agent': ['Twisted Web Client Example']}),
            None)

        self.cnt['worker'] += 1
        self.cnt['req_made'] += 1

        def cbResponse(resp):
            self.cnt['worker'] -= 1
            self.cnt['req_done'] += 1
            log.msg('response received')
            finished = Deferred()
            resp.deliverBody(IgnoreBody(finished, self))
            return finished

        def cbError(error):
            self.cnt['worker'] -= 1
            self.cnt['req_error'] += 1
            log.msg(error, logLevel=logging.ERROR)

        d.addCallback(cbResponse)
        d.addErrback(cbError)


if __name__ == '__main__':
    if len(sys.argv) < 4:
        print "Usage: %s <n> <concurrency> <url>" % (sys.argv[0])
        sys.exit()

    ab = Pyab(n=int(sys.argv[1]), concurrency=int(sys.argv[2]), url=sys.argv[3])
    ab.start()
    reactor.run()
Is there anything wrong with my code? Thanks!
When I last used it, ab was known to have dozens of serious bugs. Sometimes that would cause it to report massively inflated results. Sometimes it would report negative results. Sometimes it would crash. I'd try another tool, like httperf, as a sanity check.
However, if your server is actually that fast, then you might have another issue.
Even if ab has been fixed, you're talking here about a C program versus a Python program running on CPython. 8x slower than C in Python is not actually all that bad, so I don't expect there is actually anything wrong with your program, except that it doesn't make use of spawnProcess and multi-core concurrency.
For starters, see if you get any better results on PyPy.
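To make the multi-core point concrete, here is a rough sketch of my own (not from the answer) that simply splits the request count across one process per CPU core; it assumes the question's script is importable as pyab and that each child process can run its own reactor:

# multi_pyab.py -- hypothetical wrapper: one Pyab benchmark per CPU core
import sys
from multiprocessing import Process, cpu_count

def run_shard(n, concurrency, url):
    # Import inside the child so each process gets its own reactor.
    from twisted.internet import reactor
    from pyab import Pyab  # the class shown in the question (assumed module name)

    Pyab(n=n, concurrency=concurrency, url=url).start()
    reactor.run()

if __name__ == '__main__':
    total_n, concurrency, url = int(sys.argv[1]), int(sys.argv[2]), sys.argv[3]
    cores = cpu_count()
    procs = [Process(target=run_shard,
                     args=(total_n // cores, max(1, concurrency // cores), url))
             for _ in range(cores)]
    [p.start() for p in procs]
    [p.join() for p in procs]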
