What is the fastest way to do http requests in Python - python

I am trying to build a web application fuzzer. It will take a wordlist and a url from the user and will do request to those urls. At the end, It will give output according to their responses' status codes.
I have written some code, it does ~600req/s in local (takes about 8 seconds to finish 4600 lines of wordlist) but since I'm using requests library I was thinking if there is a faster way to do so.
Only time consuming part as I analyzed is fuzz() and req() functions as they are doing the most job. I have also other functions but those that I've shown must be enough for you to understand (I didn't want to put so much code).
def __init__(self):
self.statusCodes = [200, 204, 301, 302, 307, 403]
self.session = requests.Session()
self.headers = {
'User-Agent': 'x',
'Connection': 'Closed'
}
def req(self, URL):
# request to only one url
try:
r = self.session.head(URL, allow_redirects=False, headers=self.headers, timeout=3)
if r.status_code in self.statusCodes:
if r.status_code == 301:
self.directories.append(URL)
self.warning("301", URL)
return
self.success(r.status_code, URL)
return
return
except requests.exceptions.ConnectTimeout:
return
except requests.exceptions.ConnectionError:
self.error("Connection error")
sys.exit(1)
def fuzz(self):
pool = ThreadPool(self.threads)
pool.map(self.req, self.URLList)
pool.close()
pool.join()
return
#self.threads is number of threads
#self.URLList is a list of full urls
'__init__' ((<MWAF.MWAF instance at 0x7f554cd8dcb0>, 'http://localhost', '/usr/share/wordlists/seclists/Discovery/Web-Content/common.txt', 25), {}) 0.00362110137939453125 sec
#each req is around this
'req' ((<MWAF.MWAF instance at 0x7f554cd8dcb0>, 'http://localhost/webedit'), {}) 0.00855112075805664062 sec
'fuzz' ((<MWAF.MWAF instance at 0x7f554cd8dcb0>,), {}) 7.39054012298583984375 sec
Whole Program
[*] 7.39426517487

You may wish to combine multiple processes with multiple threads. As 400 threads in 20 processes outperform 400 threads in 4 processes while performing an I/O-bound task shows, there's an optimal number of threads per process -- the more the higher percentage of time they spend waiting for I/O.
On the higher order of vanishing, you can try reusing prepared requests to save on object creation time. (I'm not sure if that'll have an effect -- requests might e.g. treat them as immutable so it would create a new object each time anyway. But this may still cut on input validation time or something.)

Related

How to make concurrent threaded requests to test response time of list of URLs in python

i am writing a python script for request testing (i am a beginner at this) where i have a list of urls that i want to test using multiple concurrent requests for ex if i have 10 urls and 100 input number of threads then 100 independent connections should be made and they should access those urls randomly and in the end return average response time of each url.
out = []
CONNECTIONS = 100
TIMEOUT = 50
json_str=[]
tlds = open('sampleurl.txt').read().splitlines()
for data in tlds:
json_str.append(''.join(data ))
def load_url(data, timeout):
response = requests.post('http://example.com', headers=headers,data=data,timeout=timeout)
return response.status_code
with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
future_to_url = (executor.submit(load_url, data, TIMEOUT) for data in json_str)
time1 = time.time()
for future in concurrent.futures.as_completed(future_to_url):
try:
data = future.result()
except Exception as exc:
data = str(type(exc))
finally:
out.append(data)
time2 = time.time()
print(f'Took {time2-time1:.2f} s')
print(pd.Series(out).value_counts())
i tried this but it stops after testing each url only once. i want the code to continue running and open the url multiple times until all connections are exhausted.
I cannot see what the json will store but since it stores data, you are processing each request only once using this piece of code
future_to_url = (executor.submit(load_url, data, TIMEOUT) for data in json_str)
but you want that job to be submitted to executor service multiple times then you can create a loop around it and make it run multiple times.
while count < 10:
count += 1
future_to_url = (executor.submit(load_url, data, TIMEOUT) for data in json_str)
This will submit multiple requests to the Executor service.

Maximize number of parallel requests (aiohttp)

tl;dr: how do I maximize number of http requests I can send in parallel?
I am fetching data from multiple urls with aiohttp library. I'm testing its performance and I've observed that somewhere in the process there is a bottleneck, where running more urls at once just doesn't help.
I am using this code:
import asyncio
import aiohttp
async def fetch(url, session):
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'}
try:
async with session.get(
url, headers=headers,
ssl = False,
timeout = aiohttp.ClientTimeout(
total=None,
sock_connect = 10,
sock_read = 10
)
) as response:
content = await response.read()
return (url, 'OK', content)
except Exception as e:
print(e)
return (url, 'ERROR', str(e))
async def run(url_list):
tasks = []
async with aiohttp.ClientSession() as session:
for url in url_list:
task = asyncio.ensure_future(fetch(url, session))
tasks.append(task)
responses = asyncio.gather(*tasks)
await responses
return responses
loop = asyncio.get_event_loop()
asyncio.set_event_loop(loop)
task = asyncio.ensure_future(run(url_list))
loop.run_until_complete(task)
result = task.result().result()
Running this with url_list of varying length (tests against https://httpbin.org/delay/2) I see that adding more urls to be run at once helps only up to ~100 urls and then total time starts to grow proportionally to number of urls (or in other words, time per one url does not decrease). This suggests that something fails when trying to process these at once. In addition, with more urls in 'one batch' I am occasionally receiving connection timeout errors.
Why is it happening? What exactly limits the speed here?
How can I check what is the maximum number of parallel requests I can send on a given computer? (I mean an exact number - not approx by 'trial and error' as above)
What can I do to increase the number of requests processed at once?
I am runnig this on Windows.
EDIT in response to comment:
This is the same data with limit set to None. Only slight improvement in the end and there are many connection timeout errors with 400 urls sent at once. I ended up using limit = 200 on my actual data.
By default aiohttp limits number of simultaneous connections to 100. It achieves by setting default limit to TCPConnector object that is used by ClientSession. You can bypass it by creating and passing custom connector to session:
connector = aiohttp.TCPConnector(limit=None)
async with aiohttp.ClientSession(connector=connector) as session:
# ...
Note however that you probably don't want to set this number too high: your network capacity, CPU, RAM and target server have their own limits and try to make enormous amount of connection can lead to increasing failures.
Optimal number can probably be found only through experiments on concrete machine.
Unrelated:
You don't have to create tasks without reason. Most asyncio api accept regular coroutines. For example, your last lines of code can be altered this way:
loop = asyncio.get_event_loop()
loop.run_until_complete(run(url_list))
Or even to just asyncio.run(run(url_list)) (doc) if you're using Python 3.7

Python multiprocess Pool vs Process

I'm new to Python multiprocessing. I don't quite understand the difference between Pool and Process. Can someone suggest which one I should use for my needs?
I have thousands of http GET requests to send. After sending each and getting the response, I want to store to response (a simple int) to a (shared) dict. My final goal is to write all data in the dict to a file.
This is not CPU intensive at all. All my goal is the speed up sending the http GET requests because there are too many. The requests are all isolated and do not depend on each other.
Shall I use Pool or Process in this case?
Thanks!
----The code below is added on 8/28---
I programmed with multiprocessing. The key challenges I'm facing are:
1) GET request can fail sometimes. I have to set 3 retries to minimize the need to rerun my code/all requests. I only want to retry the failed ones. Can I achieve this with async http requests without using Pool?
2) I want to check the response value of every requests, and have exception handling
The code below is simplified from my actual code. It is working fine, but I wonder if it's the most efficient way of doing things. Can anyone give any suggestions? Thanks a lot!
def get_data(endpoint, get_params):
response = requests.get(endpoint, params = get_params)
if response.status_code != 200:
raise Exception("bad response for " + str(get_params))
return response.json()
def get_currency_data(endpoint, currency, date):
get_params = {'currency': currency,
'date' : date
}
for attempt in range(3):
try:
output = get_data(endpoint, get_params)
# additional return value check
# ......
return output['value']
except:
time.sleep(1) # I found that sleeping for 1s almost always make the retry successfully
return 'error'
def get_all_data(currencies, dates):
# I have many dates, but not too many currencies
for currency in currencies:
results = []
pool = Pool(processes=20)
for date in dates:
results.append(pool.apply_async(get_currency_data, args=(endpoint, date)))
output = [p.get() for p in results]
pool.close()
pool.join()
time.sleep(10) # Unfortunately I have to give the server some time to rest. I found it helps to reduce failures. I didn't write the server. This is not something that I can control
Neither. Use asynchronous programming. Consider the below code pulled directly from that article (credit goes to Paweł Miech)
#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession
async def fetch(url, session):
async with session.get(url) as response:
return await response.read()
async def run(r):
url = "http://localhost:8080/{}"
tasks = []
# Fetch all responses within one Client session,
# keep connection alive for all requests.
async with ClientSession() as session:
for i in range(r):
task = asyncio.ensure_future(fetch(url.format(i), session))
tasks.append(task)
responses = await asyncio.gather(*tasks)
# you now have all response bodies in this variable
print(responses)
def print_responses(result):
print(result)
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(4))
loop.run_until_complete(future)
Just maybe create a URL's array, and instead of the given code, loop against that array and issue each one to fetch.
EDIT: Use requests_futures
As per #roganjosh comment below, requests_futures is a super-easy way to accomplish this.
from requests_futures.sessions import FuturesSession
sess = FuturesSession()
urls = ['http://google.com', 'https://stackoverflow.com']
responses = {url: sess.get(url) for url in urls}
contents = {url: future.result().content
for url, future in responses.items()
if future.result().status_code == 200}
EDIT: Use grequests to support Python 2.7
You can also us grequests, which supports Python 2.7 for performing asynchronous URL calling.
import grequests
urls = ['http://google.com', 'http://stackoverflow.com']
responses = grequests.map(grequests.get(u) for u in urls)
print([len(r.content) for r in rs])
# [10475, 250785]
EDIT: Using multiprocessing
If you want to do this using multiprocessing, you can. Disclaimer: You're going to have a ton of overhead by doing this, and it won't be anywhere near as efficient as async programming... but it is possible.
It's actually pretty straightforward, you're mapping the URL's through the http GET function:
import requests
urls = ['http://google.com', 'http://stackoverflow.com']
from multiprocessing import Pool
pool = Pool(8)
responses = pool.map(requests.get, urls)
The size of the pool will be the number of simultaneously issues GET requests. Sizing it up should increase your network efficiency, but it'll add overhead on the local machine for communication and forking.
Again, I don't recommend this, but it certainly is possible, and if you have enough cores it's probably faster than doing the calls synchronously.

Faster Scraping of JSON from API: Asynchronous or?

I need to scrape roughly 30GB of JSON data from a website API as quickly as possible. I don't need to parse it -- I just need to save everything that shows up on each API URL.
I can request quite a bit of data at a time -- say 1MB or even 50MB 'chunks' (API parameters are encoded in the URL and allow me to select how much data I want per request)
the API places a limit of 1 request per second.
I would like to accomplish this on a laptop and 100MB/sec internet connection
Currently, I am accomplishing this (synchronously & too slowly) by:
-pre-computing all of the (encoded) URL's I want to scrape
-using Python 3's requests library to request each URL and save the resulting JSON one-by-one in separate .txt files.
Basically, my synchronous, too-slow solution looks like this (simplified slightly):
#for each pre-computed encoded URL do:
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
with open('json_output.txt', 'w') as outfile:
json.dump(curr_url_request.json(), outfile)
What would be a better/faster way to do this? Is there a straight-forward way to accomplish this asynchronously but respecting the 1-request-per-second threshold? I have read about grequests (no longer maintained?), twisted, asyncio, etc but do not have enough experience to know whether/if one of these is the right way to go.
EDIT
Based on Kardaj's reply below, I decided to give async Tornado a try. Here's my current Tornado version (which is heavily based on one of the examples in their docs). It successfully limits concurrency.
The hangup is, how can I do an overall rate-limit of 1 request per second globally across all workers? (Kardaj, the async sleep makes a worker sleep before working, but does not check whether other workers 'wake up' and request at the same time. When I tested it, all workers grab a page and break the rate limit, then go to sleep simultaneously).
from datetime import datetime
from datetime import timedelta
from tornado import httpclient, gen, ioloop, queues
URLS = ["https://baconipsum.com/api/?type=meat",
"https://baconipsum.com/api/?type=filler",
"https://baconipsum.com/api/?type=meat-and-filler",
"https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1"]
concurrency = 2
def handle_request(response):
if response.code == 200:
with open("FOO"+'.txt', "wb") as thisfile:#fix filenames to avoid overwrite
thisfile.write(response.body)
#gen.coroutine
def request_and_save_url(url):
try:
response = yield httpclient.AsyncHTTPClient().fetch(url, handle_request)
print('fetched {0}'.format(url))
except Exception as e:
print('Exception: {0} {1}'.format(e, url))
raise gen.Return([])
#gen.coroutine
def main():
q = queues.Queue()
tstart = datetime.now()
fetching, fetched = set(), set()
#gen.coroutine
def fetch_url(worker_id):
current_url = yield q.get()
try:
if current_url in fetching:
return
#print('fetching {0}'.format(current_url))
print("Worker {0} starting, elapsed is {1}".format(worker_id, (datetime.now()-tstart).seconds ))
fetching.add(current_url)
yield request_and_save_url(current_url)
fetched.add(current_url)
finally:
q.task_done()
#gen.coroutine
def worker(worker_id):
while True:
yield fetch_url(worker_id)
# Fill a queue of URL's to scrape
list = [q.put(url) for url in URLS] # this does not make a list...it just puts all the URLS into the Queue
# Start workers, then wait for the work Queue to be empty.
for ii in range(concurrency):
worker(ii)
yield q.join(timeout=timedelta(seconds=300))
assert fetching == fetched
print('Done in {0} seconds, fetched {1} URLs.'.format(
datetime.now() - tstart, len(fetched)))
if __name__ == '__main__':
import logging
logging.basicConfig()
io_loop = ioloop.IOLoop.current()
io_loop.run_sync(main)
You are parsing the content and then serializing it again. You can just write the content directly to a file.
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
with open('json_output.txt', 'w') as outfile:
outfile.write(curr_url_request.content)
That probably removes most of the processing overhead.
tornado has a very powerful asynchronous client. Here's a basic code that may do the trick:
from tornado.httpclient import AsyncHTTPClient
import tornado
URLS = []
http_client = AsyncHTTPClient()
loop = tornado.ioloop.IOLoop.current()
def handle_request(response):
if response.code == 200:
with open('json_output.txt', 'a') as outfile:
outfile.write(response.body)
#tornado.gen.coroutine
def queue_requests():
results = []
for url in URLS:
nxt = tornado.gen.sleep(1) # 1 request per second
res = http_client.fetch(url, handle_request)
results.append(res)
yield nxt
yield results # wait for all requests to finish
loop.add_callback(loop.stop)
loop.add_callback(queue_requests)
loop.start()
This is a straight-forward approach that may lead to too many connections with the remote server. You may have to resolve such problem using a sliding window while queuing the requests.
In case of request timeouts or specific headers required, feel free to read the doc

Recall functions in multiprocessing's child process

I'm using multiprocessing library (not new in python but new in multiprocessing). It seems that I lack of understanding how it works.
What I try to do: I send a lot of http requests to server and if I receive connection error it means that remote service is down and I restart it using paramiko and then resend a request. I use multiprocessing to load all available processors because there are about 70000 requests and it takes about 24 hours to process them all using one processor.
My code:
# Send request here
def send_request(server, url, data, timeout):
try:
return requests.post(server + url, json=data, timeout=(timeout or 60))
except Exception:
return None
# Try to get json from response
def do_gw_requests(request, data):
timeout = 0
response = send_request(server, request, data, timeout)
if response is not None:
response_json = json.loads(response.text)
else:
response_json = None
return response_json
# Function that recall itself if service is down
def safe_build(data):
exception_message = ""
response = {}
try:
response = do_gw_requests("/rgw_find_route", data)
if response is None:
# Function that uses paramiko to start service
# It will not end until service is up
start_service()
while response is None:
safe_build(data)
--some other work here--
return response, exception_message
# Multiprocessing lines in main function
pool = Pool(2)
# build_single_route prepares data, calls safe_build once and write logs
result = pool.map_async(build_single_route, args)
pool.close()
pool.join()
My problem is if service already down at the start of script (and potentially if service got down in the middle of script's work) I can't get non-empty response for two first requests. Script starts, send two first requests (I send them in loop by two), finds out that service is down (response become None), restarts service, resends requests and seems gets None again and again and again (in endless loop). If I remove loop while response is None: then first two requests will process as if they was None and other requests will process as expected. But I need every request result that's why I resend bad requests.
So it recall function with same data again and again but without success. It's very strange as for me. Can anyone please explain what am I doing wrong here?
It seems that problem not with behavior of Pool workers as I expected. response is a local variable of function and thus it become not None after reviving of service at the second call of safe_build, it's still None in the first call. response, _ = safe_build(data) seems work.

Categories