Python Threads not finishing - python

I'm currently testing something with Threading/ workpool; I create 400 Threads which download a total of 5000 URLS... The problem is that some of the 400 threads are "freezing", when looking into my Processes I see that +- 15 threads in every run freeze, and after a time eventually close 1 by 1.
My question is if there is a way to have some sort of 'timer' / 'counter' that kills a thread if it isn't finished after x seconds.
# download2.py - Download many URLs using multiple threads.
import os
import urllib2
import workerpool
import datetime
from threading import Timer
class DownloadJob(workerpool.Job):
"Job for downloading a given URL."
def __init__(self, url):
self.url = url # The url we'll need to download when the job runs
def run(self):
try:
url = urllib2.urlopen(self.url).read()
except:
pass
# Initialize a pool, 400 threads in this case
pool = workerpool.WorkerPool(size=400)
# Loop over urls.txt and create a job to download the URL on each line
print datetime.datetime.now()
for url in open("urls.txt"):
job = DownloadJob(url.strip())
pool.put(job)
# Send shutdown jobs to all threads, and wait until all the jobs have been completed
pool.shutdown()
pool.wait()
print datetime.datetime.now()

The problem is that some of the 400 threads are "freezing"...
That's most likely because of this line...
url = urllib2.urlopen(self.url).read()
By default, Python will wait forever for a remote server to respond, so if a one of your URLs points to a server which is ignoring the SYN packet, or is otherwise just really slow, the thread could potentially be blocked forever.
You can use the timeout parameter of urlopen() set a limit as to how long the thread will wait for the remote host to respond...
url = urllib2.urlopen(self.url, timeout=5).read() # Time out after 5 seconds
...or you can set it globally instead with socket.setdefaulttimeout() by putting these lines at the top of your code...
import socket
socket.setdefaulttimeout(5) # Time out after 5 seconds

urlopen accepts a timeout value, that would be the best way to handle it I think.
But I agree with the commenter that 400 threads is probably way too many

Related

How to test a single GET request in parallel for specified count?

I want to parallely send a GET request for the specified count say 100 times. How to achieve this using JMeter or Python ?
I tried bzm parallel executor but that doesn't workout.
import requests
import threading
totalRequests = 0
numberOfThreads = 10
threads = [0] * numberOfThreads
def worker(thread):
r = requests.get("url")
threads[thread] = 0 # free thread
while totalRequests < 100:
for thread in range(numberOfThreads):
if threads[thread] == 0:
threads[thread] = 1 # occupy thread
t = threading.Thread(target=worker, args=(thread,))
t.start()
totalRequests += 1
In JMeter:
Add Thread Group to your Test Plan and configure it like:
Add HTTP Request sampler as a child of the Thread Group and specify protocol, host, port, path and parameters:
if you're not certain regarding properly configuring the HTTP Request sampler - you can just record the request using your browser and JMeter's HTTP(S) Test Script Recorder or JMeter Chrome Extension
For Python the correct would be using Locust framework as I believe you're interested in metrics like response times, latencies and so on. The official website is down at the moment
so in the meantime you can check https://readthedocs.org/projects/locust/

part of the requests delay the whole process in grequests

I have 600 urls to request, when I use grequests, I found that sometimes it finishes so fast within 10 secs, but sometimes just stuck there(can't reach the statement of printing 'done').
Here is my code:
l=[]
urls=[...]
reqs=[grequests.get(i) for i in urls]
rs=grequests.map(reqs)
print('done')
Is it because majority of the requests are already finished within 10 secs(with status 200), just a few requests are still waiting for response? I am using Pycharm, in the variable monitor window, I can indeed see that lots of requests already get 200 status. How can I fix this problem? Is there any params that I can set to limit the maximum request time for each request, therefore if some requests exceed a certain time, it will return anyway.
Just set the timeout:
import grequests
l=[]
urls = [ ]
reqs=[grequests.get(i, timeout=1) for i in urls] # timeout in sec
rs=grequests.map(reqs)
print(rs)

How to write python code that will work with threads or coroutines and will complete in deterministic time?

What I mean by "deterministic time"? For example AWS offer a service "AWS Lambda". The process started as lambda function has time limit, after that lambda function will stop execution and will assume that task was finished with error. And example task - send data to http endpoint. Depending of a network connection to http endpoint, or other factors, process of sending data can take a long time. If I need to send the same data to the many endpoints, then full process time will take one process time times endpoints amount. Which increase a chance that lambda function will be stopped before all data will be send to all endpoints.
To solve this I need to send data to different endpoints in parallel mode using threads.
The problem with threads - started thread can't be stopped. If http request will take more time than it dedicated by lambda function time limit, lambda function will be aborted and return error. So I need to use timeout with http request, to abort it, if it take more time than expected.
If http request will be canceled by timeout or endpoint will return error, I need to save not processed data somewhere to not lost the data. The time needed to save unprocessed data can be predicted, because I control the storage where data will be saved.
And the last part that consume time - procedure or loop where threads are scheduled executor.submit(). If there is only one endpoint or small number of them then the consumed time will be small. And there is no necessary to control this. But if I have deal with many endpoints, I have to take this into account.
So basically full time will consists of:
scheduling threads
http request execution
saving unprocessed data
There is example of how I can manage time using threads
import concurrent.futures
from functools import partial
import requests
import time
start = time.time()
def send_data(data):
host = 'http://127.0.0.1:5000/endpoint'
try:
result = requests.post(host, json=data, timeout=(0.1, 0.5))
# print('done')
if result.status_code == 200:
return {'status': 'ok'}
if result.status_code != 200:
return {'status': 'error', 'msg': result.text}
except requests.exceptions.Timeout as err:
return {'status': 'error', 'msg': 'timeout'}
def get_data(n):
return {"wait": n}
def done_cb(a, b, future):
pass # save unprocessed data
def main():
executor = concurrent.futures.ThreadPoolExecutor()
futures = []
max_time = 0.5
for i in range(1):
future = executor.submit(send_data, *[{"wait": 10}])
future.add_done_callback(partial(done_cb, 2, 3))
futures.append(future)
if time.time() - s_time > max_time:
print('stopping creating new threads')
# save unprocessed data
break
try:
for item in concurrent.futures.as_completed(futures, timeout=1):
item.result()
except concurrent.futures.TimeoutError as err:
pass
I was thinking of how I can use asyncio library instead of threads, to do the same thing.
import asyncio
import time
from functools import partial
import requests
start = time.time()
def send_data(data):
...
def get_data(n):
return {"wait": n}
def done_callback(a,b, future):
pass # save unprocessed data
def main(loop):
max_time = 0.5
futures = []
start_appending = time.time()
for i in range(1):
event_data = get_data(1)
future = (loop.run_in_executor(None, send_data, event_data))
future.add_done_callback(partial(done_callback, 2, 3))
futures.append(future)
if time.time() - s_time > max_time:
print('stopping creating new futures')
# save unprocessed data
break
finished, unfinished = loop.run_until_complete(
asyncio.wait(futures, timeout=1)
)
_loop = asyncio.get_event_loop()
result = main(_loop)
Function send_data() the same as in previous code snipped.
Because request library is not async code I use run_in_executor() to create future object. The main problems I have is that done_callback() is not executed when the thread that started but executor done it's job. But only when the futures will be "processed" by asyncio.wait() expression.
Basically I seeking the way to start execute asyncio future, like ThreadPoolExecutor start execute threads, and not wait for asyncio.wait() expression to call done_callback(). If you have other ideas how to write python code that will work with threads or coroutines and will complete in deterministic time. Please share it, I will be glad to read them.
And other question. If thread or future done its job, it can return result, that I can use in done_callback(), for example to remove message from queue by id returned in result. But if thread or future was canceled, I don't have result. And I have to use functools.partial() pass in done_callback additional data, that can help me to understand for what data this callback was called. If passed data are small this is not a problem. If data will be big, I need to put data in array/list/dictionary and pass in callback only index of array or put "full data: in callback.
Can I somehow get access to variable that was passed to future/thread, from done_callback(), that was triggered on canceled future/thread?
You can use asyncio.wait_for to wait for a future (or multiple futures, when combined with asyncio.gather) and cancel them in case of a timeout. Unlike threads, asyncio supports cancellation, so you can cancel a task whenever you feel like it, and it will be cancelled at the first blocking call it makes (typically a network call).
Note that for this to work, you should be using asyncio-native libraries such as aiohttp for HTTP. Trying to combine requests with asyncio using run_in_executor will appear to work for simple tasks, but it will not bring you the benefits of using asyncio, such as being able to spawn a massive number of tasks without encumbering the OS, or the possibility of cancellation.

Batching and queueing in a real-time webserver

I need a webserver which routes the incoming requests to back-end workers by batching them every 0.5 second or when it has 50 http requests whichever happens earlier. What will be a good way to implement it in python/tornado or any other language?
What I am thinking is to publish the incoming requests to a rabbitMQ queue and then somehow batch them together before sending to the back-end servers. What I can't figure out is how to pick multiple requests from the rabbitMq queue. Could someone point me to right direction or suggest some alternate apporach?
I would suggest using a simple python micro web framework such as bottle. Then you would send the requests to a background process via a queue (thus allowing the connection to end).
The background process would then have a continuous loop that would check your conditions (time and number), and do the job once the condition is met.
Edit:
Here is an example webserver that batches the items before sending them to any queuing system you want to use (RabbitMQ always seemed overcomplicated to me with Python. I have used Celery and other simpler queuing systems before). That way the backend simply grabs a single 'item' from the queue, that will contain all required 50 requests.
import bottle
import threading
import Queue
app = bottle.Bottle()
app.queue = Queue.Queue()
def send_to_rabbitMQ(items):
"""Custom code to send to rabbitMQ system"""
print("50 items gathered, sending to rabbitMQ")
def batcher(queue):
"""Background thread that gathers incoming requests"""
while True:
batcher_loop(queue)
def batcher_loop(queue):
"""Loop that will run until it gathers 50 items,
then will call then function 'send_to_rabbitMQ'"""
count = 0
items = []
while count < 50:
try:
next_item = queue.get(timeout=.5)
except Queue.Empty:
pass
else:
items.append(next_item)
count += 1
send_to_rabbitMQ(items)
#app.route("/add_request", method=["PUT", "POST"])
def add_request():
"""Simple bottle request that grabs JSON and puts it in the queue"""
request = bottle.request.json['request']
app.queue.put(request)
if __name__ == '__main__':
t = threading.Thread(target=batcher, args=(app.queue, ))
t.daemon = True # Make sure the background thread quits when the program ends
t.start()
bottle.run(app)
Code used to test it:
import requests
import json
for i in range(101):
req = requests.post("http://localhost:8080/add_request",
data=json.dumps({"request": 1}),
headers={"Content-type": "application/json"})

Tweaking celery for high performance

I'm trying to send ~400 HTTP GET requests and collect the results.
I'm running from django.
My solution was to use celery with gevent.
To start the celery tasks I call get_reports :
def get_reports(self, clients, *args, **kw):
sub_tasks = []
for client in clients:
s = self.get_report_task.s(self, client, *args, **kw).set(queue='io_bound')
sub_tasks.append(s)
res = celery.group(*sub_tasks)()
reports = res.get(timeout=30, interval=0.001)
return reports
#celery.task
def get_report_task(self, client, *args, **kw):
report = send_http_request(...)
return report
I use 4 workers:
manage celery worker -P gevent --concurrency=100 -n a0 -Q io_bound
manage celery worker -P gevent --concurrency=100 -n a1 -Q io_bound
manage celery worker -P gevent --concurrency=100 -n a2 -Q io_bound
manage celery worker -P gevent --concurrency=100 -n a3 -Q io_bound
And I use RabbitMq as the broker.
And although it works much faster than running the requests sequentially (400 requests took ~23 seconds), I noticed that most of that time was overhead from celery itself, i.e. if I changed get_report_task like this:
#celery.task
def get_report_task(self, client, *args, **kw):
return []
this whole operation took ~19 seconds.
That means that I spend 19 seconds only on sending all the tasks to celery and getting the results back
The queuing rate of messages to rabbit mq is seems to be bound to 28 messages / sec and I think that this is my bottleneck.
I'm running on a win 8 machine if that matters.
some of the things I've tried:
using redis as broker
using redis as results backend
tweaking with those settings
BROKER_POOL_LIMIT = 500
CELERYD_PREFETCH_MULTIPLIER = 0
CELERYD_MAX_TASKS_PER_CHILD = 100
CELERY_ACKS_LATE = False
CELERY_DISABLE_RATE_LIMITS = True
I'm looking for any suggestions that will help speed things up.
Are you really running on Windows 8 without a Virtual Machine? I did the following simple test on 2 Core Macbook 8GB RAM running OS X 10.7:
import celery
from time import time
#celery.task
def test_task(i):
return i
grp = celery.group(test_task.s(i) for i in range(400))
tic1 = time(); res = grp(); tac1 = time()
print 'queued in', tac1 - tic1
tic2 = time(); vals = res.get(); tac2 = time()
print 'executed in', tac2 - tic2
I'm using Redis as broker, Postgres as a result backend and default worker with --concurrency=4. Guess what is the output? Here it is:
queued in 3.5009469986
executed in 2.99818301201
Well it turnes out I had 2 separate issues.
First off, the task was a member method. After extracting it out of the class, the time went down to about 12 seconds. I can only assume it has something to do with the pickling of self.
The second thing was the fact that it ran on windows.
After running it on my linux machine, the run time was less than 2 seconds.
Guess windows just isn't cut for high performance..
How about using twisted instead? You can reach for much simpler application structure. You can send all 400 requests from the django process at once and wait for all of them to finish. This works simultaneously because twisted sets the sockets into non-blocking mode and only reads the data when its available.
I had a similar problem a while ago and I've developed a nice bridge between twisted and django. I'm running it in production environment for almost a year now. You can find it here: https://github.com/kowalski/featdjango/. In simple words it has the main application thread running the main twisted reactor loop and the django view results is delegated to a thread. It use a special threadpool, which exposes methods to interact with reactor and use its asynchronous capabilities.
If you use it, your code would look like this:
from twisted.internet import defer
from twisted.web.client import getPage
import threading
def get_reports(self, urls, *args, **kw):
ct = threading.current_thread()
defers = list()
for url in urls:
# here the Deferred is created which will fire when
# the call is complete
d = ct.call_async(getPage, args=[url] + args, kwargs=kw)
# here we keep it for reference
defers.append(d)
# here we create a Deferred which will fire when all the
# consiting Deferreds are completed
deferred_list = defer.DeferredList(defers, consumeErrors=True)
# here we tell the current thread to wait until we are done
results = ct.wait_for_defer(deferred_list)
# the results is a list of the form (C{bool} success flag, result)
# below unpack it
reports = list()
for success, result in results:
if success:
reports.append(result)
else:
# here handle the failure, or just ignore
pass
return reports
This still is something you can optimize a lot. Here, every call to getPage() would create a separate TCP connection and close it when its done. This is as optimal as it can be, providing that each of your 400 requests is sent to a different host. If this is not a case, you can use a http connection pool, which uses persistent connections and http pipelineing. You instantiate it like this:
from feat.web import httpclient
pool = httpclient.ConnectionPool(host, port, maximum_connections=3)
Than a single request is perform like this (this goes instead the getPage() call):
d = ct.call_async(pool.request, args=(method, path, headers, body))

Categories