Recall functions in multiprocessing's child process - python

I'm using the multiprocessing library (I'm not new to Python, but I am new to multiprocessing). It seems that I lack an understanding of how it works.
What I'm trying to do: I send a lot of HTTP requests to a server, and if I receive a connection error it means the remote service is down, so I restart it using paramiko and then resend the request. I use multiprocessing to load all available processors because there are about 70,000 requests and processing them all on one processor takes about 24 hours.
My code:
# Send request here
def send_request(server, url, data, timeout):
    try:
        return requests.post(server + url, json=data, timeout=(timeout or 60))
    except Exception:
        return None

# Try to get json from response
def do_gw_requests(request, data):
    timeout = 0
    response = send_request(server, request, data, timeout)
    if response is not None:
        response_json = json.loads(response.text)
    else:
        response_json = None
    return response_json

# Function that recall itself if service is down
def safe_build(data):
    exception_message = ""
    response = {}
    try:
        response = do_gw_requests("/rgw_find_route", data)
        if response is None:
            # Function that uses paramiko to start service
            # It will not end until service is up
            start_service()
            while response is None:
                safe_build(data)
        --some other work here--
    return response, exception_message
# Multiprocessing lines in main function
pool = Pool(2)
# build_single_route prepares data, calls safe_build once and write logs
result = pool.map_async(build_single_route, args)
pool.close()
pool.join()
My problem is that if the service is already down at the start of the script (and potentially if it goes down in the middle of the script's work), I can't get a non-empty response for the first two requests. The script starts, sends the first two requests (I send them in a loop, two at a time), finds out that the service is down (response becomes None), restarts the service, resends the requests, and seems to get None again and again and again (in an endless loop). If I remove the loop while response is None: then the first two requests are processed as if they were None and the other requests are processed as expected. But I need the result of every request, which is why I resend the bad requests.
So it recalls the function with the same data again and again, but without success. That seems very strange to me. Can anyone please explain what I am doing wrong here?

It seems that the problem is not with the behavior of the Pool workers, as I expected. response is a local variable of the function, and so it only becomes non-None after the service is revived in the second call of safe_build; it is still None in the first call. response, _ = safe_build(data) seems to work.
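In other words, the result of the recursive call has to be assigned back to the local response variable, or the retry can be written as a plain loop instead of recursion. A minimal sketch of the loop version, assuming do_gw_requests and start_service behave as described above:

def safe_build(data):
    exception_message = ""
    # keep retrying the same request until the service answers
    response = do_gw_requests("/rgw_find_route", data)
    while response is None:
        start_service()  # blocks until the service is up again
        response = do_gw_requests("/rgw_find_route", data)
    # --some other work here--
    return response, exception_message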

Related

Long running requests with asyncio and aiohttp

Apologies for asking what may be considered a redundant question, but I'm finding it extremely difficult to figure out the current recommended best practices for using asyncio and aiohttp.
I'm working with an API that ultimately returns a link to a generated CSV file. There are two steps in using the API.
1. Submit a request that triggers a long-running process and returns a status URL.
2. Poll the status URL until the status_code is 201, then get the URL of the CSV file from the headers.
Here's a stripped down example of how I can successfully do this synchronously with requests.
import time
import requests

def submit_request(id):
    """Submit request to create CSV for specified id"""
    body = {'id': id}
    response = requests.get(
        url='https://www.example.com/endpoint',
        json=body
    )
    response.raise_for_status()
    return response

def get_status(request_response):
    """Check whether the CSV has been created."""
    status_response = requests.get(
        url=request_response.headers['Location']
    )
    status_response.raise_for_status()
    return status_response

def get_data_url(id, poll_interval=10):
    """Submit request to create CSV for specified ID, wait for it to finish,
    and return the URL of the CSV.
    Wait between status requests based on poll_interval.
    """
    response = submit_request(id)
    while True:
        status_response = get_status(response)
        if status_response.status_code == 201:
            break
        time.sleep(poll_interval)
    data_url = status_response.headers['Location']
    return data_url
What I'd like to do is be able to submit a group of requests at once, and then wait on all of them to be finished. But I'm not clear on how to structure this with asyncio and aiohttp.
One option would be to first submit all of the requests and then use asyncio.gather (or something) to get all of the status URLs. Then start another event loop where I continuously poll the status URLs until they have all completed and I end up with a list of data URLs.
Alternatively, I suppose I could create a single function that submits the request, gets the status URL, and then polls that until it completes. In that case I would just have a single event loop where I submit each of the IDs that I want processed.
If some pseudo code for those options would be useful I can try to provide it. I've looked at a lot of different examples where you submit requests for a bunch of URLs asynchronously -- this for example -- but I'm finding that I get a bit lost when trying to translate them to this slightly more complicated scenario where I submit the request and then get back a new URL to poll.
FYI based on the comments above my current solution is something like this.
import asyncio
import aiohttp

async def get_data_url(session, id):
    url = 'https://www.example.com/endpoint'
    body = {'id': id}
    async with session.post(url=url, json=body) as response:
        response.raise_for_status()
        status_url = response.headers['Location']
    while True:
        async with session.get(url=status_url) as status_response:
            status_response.raise_for_status()
            if status_response.status == 201:
                return status_response.headers['Location']
        await asyncio.sleep(10)

async def main(access_token, id):
    headers = {'token': access_token}
    async with aiohttp.ClientSession(headers=headers) as session:
        data_url = await get_data_url(session, id)
        return data_url
This works, though I'm still not sure about best practices for submitting a set of IDs. I think asyncio.gather would work but it looks like it's deprecated. Ideally I would have a queue of, say, 100 IDs and only have 5 requests running at any given time. I've found some examples like this but they depend on asyncio.Queue, which is also deprecated.
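For what it's worth, here is a minimal sketch of one way to cap concurrency at 5 while processing a batch of IDs, using asyncio.Semaphore together with asyncio.gather (as far as I can tell, neither is deprecated in current Python; only their explicit loop arguments were removed). It reuses the get_data_url coroutine from the snippet above:

import asyncio
import aiohttp

async def bounded_get_data_url(semaphore, session, id):
    # at most max_concurrent of these bodies run at any given time
    async with semaphore:
        return await get_data_url(session, id)

async def main(access_token, ids, max_concurrent=5):
    headers = {'token': access_token}
    semaphore = asyncio.Semaphore(max_concurrent)
    async with aiohttp.ClientSession(headers=headers) as session:
        tasks = [bounded_get_data_url(semaphore, session, id) for id in ids]
        return await asyncio.gather(*tasks)

# data_urls = asyncio.run(main(access_token, ids))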

What is the fastest way to do http requests in Python

I am trying to build a web application fuzzer. It will take a wordlist and a URL from the user and send requests to those URLs. At the end, it will produce output according to the responses' status codes.
I have written some code; it does ~600 req/s locally (it takes about 8 seconds to finish a 4600-line wordlist), but since I'm using the requests library I was wondering if there is a faster way to do it.
The only time-consuming parts, as far as I can tell, are the fuzz() and req() functions, as they do most of the work. I also have other functions, but the ones I've shown should be enough for you to understand (I didn't want to include too much code).
def __init__(self):
    self.statusCodes = [200, 204, 301, 302, 307, 403]
    self.session = requests.Session()
    self.headers = {
        'User-Agent': 'x',
        'Connection': 'Closed'
    }

def req(self, URL):
    # request to only one url
    try:
        r = self.session.head(URL, allow_redirects=False, headers=self.headers, timeout=3)
        if r.status_code in self.statusCodes:
            if r.status_code == 301:
                self.directories.append(URL)
                self.warning("301", URL)
                return
            self.success(r.status_code, URL)
            return
        return
    except requests.exceptions.ConnectTimeout:
        return
    except requests.exceptions.ConnectionError:
        self.error("Connection error")
        sys.exit(1)

def fuzz(self):
    pool = ThreadPool(self.threads)
    pool.map(self.req, self.URLList)
    pool.close()
    pool.join()
    return

# self.threads is the number of threads
# self.URLList is a list of full urls
'__init__' ((<MWAF.MWAF instance at 0x7f554cd8dcb0>, 'http://localhost', '/usr/share/wordlists/seclists/Discovery/Web-Content/common.txt', 25), {}) 0.00362110137939453125 sec
#each req is around this
'req' ((<MWAF.MWAF instance at 0x7f554cd8dcb0>, 'http://localhost/webedit'), {}) 0.00855112075805664062 sec
'fuzz' ((<MWAF.MWAF instance at 0x7f554cd8dcb0>,), {}) 7.39054012298583984375 sec
Whole Program
[*] 7.39426517487
You may wish to combine multiple processes with multiple threads. As 400 threads in 20 processes outperform 400 threads in 4 processes while performing an I/O-bound task shows, there's an optimal number of threads per process -- the higher the share of time they spend waiting for I/O, the more threads per process pay off.
As a second-order optimization, you can try reusing prepared requests to save on object creation time. (I'm not sure if that will have an effect -- requests might, e.g., treat them as immutable, so it would create a new object each time anyway. But it may still cut down on input validation time or something.)
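To illustrate the processes-plus-threads idea, here is a minimal sketch assuming a standalone check_url helper rather than the class method above; the counts of processes and threads per process are placeholders you would have to tune:

import requests
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool  # thread pool with the same API as Pool

def check_url(url):
    # HEAD one URL; swallow network errors for the sake of the sketch
    try:
        return url, requests.head(url, allow_redirects=False, timeout=3).status_code
    except requests.RequestException:
        return url, None

def check_chunk(chunk):
    # each process runs its own pool of threads over its slice of the wordlist
    with ThreadPool(100) as threads:
        return threads.map(check_url, chunk)

def fuzz_all(urls, processes=4):
    # split the URL list into one chunk per process
    chunks = [urls[i::processes] for i in range(processes)]
    with Pool(processes) as procs:
        nested = procs.map(check_chunk, chunks)
    return [result for chunk in nested for result in chunk]

if __name__ == '__main__':
    print(fuzz_all(['http://localhost/webedit', 'http://localhost/admin']))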

Python multiprocess Pool vs Process

I'm new to Python multiprocessing. I don't quite understand the difference between Pool and Process. Can someone suggest which one I should use for my needs?
I have thousands of HTTP GET requests to send. After sending each one and getting the response, I want to store the response (a simple int) in a (shared) dict. My final goal is to write all the data in the dict to a file.
This is not CPU intensive at all. My only goal is to speed up sending the HTTP GET requests because there are so many of them. The requests are all isolated and do not depend on each other.
Shall I use Pool or Process in this case?
Thanks!
----The code below is added on 8/28---
I programmed it with multiprocessing. The key challenges I'm facing are:
1) GET requests can fail sometimes. I have to set 3 retries to minimize the need to rerun my code/all requests. I only want to retry the failed ones. Can I achieve this with async HTTP requests without using Pool?
2) I want to check the response value of every request, and have exception handling.
The code below is simplified from my actual code. It is working fine, but I wonder if it's the most efficient way of doing things. Can anyone give any suggestions? Thanks a lot!
def get_data(endpoint, get_params):
    response = requests.get(endpoint, params=get_params)
    if response.status_code != 200:
        raise Exception("bad response for " + str(get_params))
    return response.json()

def get_currency_data(endpoint, currency, date):
    get_params = {'currency': currency,
                  'date': date
                  }
    for attempt in range(3):
        try:
            output = get_data(endpoint, get_params)
            # additional return value check
            # ......
            return output['value']
        except:
            time.sleep(1)  # I found that sleeping for 1s almost always makes the retry succeed
    return 'error'

def get_all_data(currencies, dates):
    # I have many dates, but not too many currencies
    for currency in currencies:
        results = []
        pool = Pool(processes=20)
        for date in dates:
            results.append(pool.apply_async(get_currency_data, args=(endpoint, currency, date)))
        output = [p.get() for p in results]
        pool.close()
        pool.join()
        time.sleep(10)  # Unfortunately I have to give the server some time to rest. I found it helps to reduce failures. I didn't write the server. This is not something that I can control.
Neither. Use asynchronous programming. Consider the code below, pulled directly from that article (credit goes to Paweł Miech).
#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def run(r):
    url = "http://localhost:8080/{}"
    tasks = []
    # Fetch all responses within one Client session,
    # keep connection alive for all requests.
    async with ClientSession() as session:
        for i in range(r):
            task = asyncio.ensure_future(fetch(url.format(i), session))
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        # you now have all response bodies in this variable
        print(responses)

def print_responses(result):
    print(result)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(4))
loop.run_until_complete(future)
To adapt it, just create an array of URLs and, instead of the loop in the given code, iterate over that array and issue each one to fetch.
EDIT: Use requests_futures
As per #roganjosh comment below, requests_futures is a super-easy way to accomplish this.
from requests_futures.sessions import FuturesSession

sess = FuturesSession()
urls = ['http://google.com', 'https://stackoverflow.com']
responses = {url: sess.get(url) for url in urls}
contents = {url: future.result().content
            for url, future in responses.items()
            if future.result().status_code == 200}
EDIT: Use grequests to support Python 2.7
You can also use grequests, which supports Python 2.7 for performing asynchronous URL calls.
import grequests

urls = ['http://google.com', 'http://stackoverflow.com']
responses = grequests.map(grequests.get(u) for u in urls)
print([len(r.content) for r in responses])
# [10475, 250785]
EDIT: Using multiprocessing
If you want to do this using multiprocessing, you can. Disclaimer: You're going to have a ton of overhead by doing this, and it won't be anywhere near as efficient as async programming... but it is possible.
It's actually pretty straightforward: you're mapping the URLs through the HTTP GET function:
import requests
urls = ['http://google.com', 'http://stackoverflow.com']
from multiprocessing import Pool
pool = Pool(8)
responses = pool.map(requests.get, urls)
The size of the pool will be the number of simultaneously issued GET requests. Sizing it up should increase your network efficiency, but it'll add overhead on the local machine for communication and forking.
Again, I don't recommend this, but it certainly is possible, and if you have enough cores it's probably faster than doing the calls synchronously.
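To address the retry requirement from the question (retry only the failed requests, up to 3 times, without a Pool), here is a rough sketch with aiohttp; the endpoint/params shape and the 'error' fallback are taken from the question's code, the remaining names are illustrative:

import asyncio
import aiohttp

async def get_currency_data(session, endpoint, currency, date, retries=3):
    # retry only this one request; other requests are unaffected
    for attempt in range(retries):
        try:
            params = {'currency': currency, 'date': date}
            async with session.get(endpoint, params=params) as response:
                if response.status == 200:
                    output = await response.json()
                    return output['value']
        except aiohttp.ClientError:
            pass
        await asyncio.sleep(1)  # brief pause before retrying, as in the question
    return 'error'

async def get_all_data(endpoint, currency, dates):
    async with aiohttp.ClientSession() as session:
        tasks = [get_currency_data(session, endpoint, currency, date)
                 for date in dates]
        return await asyncio.gather(*tasks)

# values = asyncio.run(get_all_data(endpoint, currency, dates))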

How to send a repeat request in urllib2.urlopen in Python if the first call is just stuck?

I am making a call to a URL in Python using urllib2.urlopen in a while(True) loop.
My URL keeps changing every time (as a particular parameter of the URL changes each time).
My code looks as follows:
def get_url(url):
    '''Get json page data using a specified API url'''
    response = urlopen(url)
    data = str(response.read().decode('utf-8'))
    page = json.loads(data)
    return page
I am calling the above method from the main function, changing the url every time I make the call.
What I observe is that after a few calls to the function, suddenly (I don't know why), the code gets stuck at the statement
response = urlopen(url)
and it just waits and waits...
How do I best handle this situation?
I want to make sure that if it does not respond within say 10 seconds, I make the same call again.
I read about
response = urlopen(url, timeout=10)
but then what about the repeated call if this fails?
Depending on how many retries you want to attempt, use a try/except inside a loop:
while True:
    try:
        response = urlopen(url, timeout=10)
        break
    except:
        # do something with the error
        pass

# do something with response
data = str(response.read().decode('utf-8'))
...
This will silence all exceptions, which may not be ideal (more on that here: Handling urllib2's timeout? - Python)
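If you'd rather not silence everything, here is a sketch that retries only on timeouts and lets other errors propagate (assuming Python 2's urllib2, as in the question):

import socket
import urllib2

while True:
    try:
        response = urllib2.urlopen(url, timeout=10)
        break
    except urllib2.URLError as e:
        # urlopen wraps connect timeouts in URLError; re-raise anything else
        if not isinstance(e.reason, socket.timeout):
            raise
    except socket.timeout:
        # a timeout during the read phase is raised directly
        pass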
With this method you can retry once.
def get_url(url, trial=1):
    '''Get json page data using a specified API url'''
    try:
        response = urlopen(url, timeout=10)
        data = str(response.read().decode('utf-8'))
        page = json.loads(data)
        return page
    except:
        if trial == 1:
            return get_url(url, trial=2)
        else:
            return
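If one retry is not enough, the same idea generalizes to a loop; a minimal sketch with a hypothetical retries parameter:

def get_url(url, retries=3, timeout=10):
    '''Get json page data, retrying up to retries times before giving up'''
    for attempt in range(retries):
        try:
            response = urlopen(url, timeout=timeout)
            data = str(response.read().decode('utf-8'))
            return json.loads(data)
        except Exception:
            if attempt == retries - 1:
                return None  # all attempts failed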

Using boto to invoke lambda functions how do I do so asynchronously?

So I'm using boto to invoke my lambda functions and test my backend. I want to invoke them asynchronously. I have noted that "invoke_async" is deprecated and should not be used. Instead, you should use "invoke" with an InvocationType of "Event" to run the function asynchronously.
I can't seem to figure out how to get the responses from the functions when they return though. I have tried the following:
payload3 = b"""{
    "latitude": 39.5732160891,
    "longitude": -119.672918997,
    "radius": 100
}"""
client = boto3.client('lambda')
for x in range(0, 5):
    response = client.invoke(
        FunctionName="loadSpotsAroundPoint",
        InvocationType='Event',
        Payload=payload3
    )
    time.sleep(15)
    print(json.loads(response['Payload'].read()))
    print("\n")
Even though I tell the code to sleep for 15 seconds, the response variable is still empty when I try to print it. If I change the InvocationType to "RequestResponse" it all works fine and the response variable prints, but that is synchronous. Am I missing something easy? How do I execute some code, for example print out the result, when the async invocation returns?
Thanks.
There is a difference between an 'async AWS lambda invocation' and 'async python code'. When you set the InvocationType to 'Event', by definition, it does not ever send back a response.
In your example, invoke() returns immediately without the function's result, and nothing is started in the background to fill that value in at a later time (thank goodness!). So, when you look at response 15 seconds later, there is still no result in it.
It seems what you really want is the RequestResponse invocation type, with asynchronous Python code. You have a bunch of options to choose from, but my favorite is concurrent.futures. Another is threading.
Here's an example using concurrent.futures:
(If you're using Python2 you'll need to pip install futures)
from concurrent.futures import ThreadPoolExecutor
import json

payload = {...}

with ThreadPoolExecutor(max_workers=5) as executor:
    futs = []
    for x in xrange(0, 5):
        futs.append(
            executor.submit(client.invoke,
                            FunctionName="loadSpotsAroundPoint",
                            InvocationType="RequestResponse",
                            Payload=bytes(json.dumps(payload))
            )
        )
    results = [fut.result() for fut in futs]

print results
Another pattern you might want to look into is to use the Event invocation type, and have your Lambda function push messages to SNS, which are then consumed by another Lambda function. You can check out a tutorial for SNS-triggered lambda functions here.
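For illustration, a rough sketch of what the publishing half of that pattern could look like inside the worker Lambda; the topic ARN and the result shape are hypothetical:

import boto3
import json

sns = boto3.client('sns')

def handler(event, context):
    # ... do the actual work here ...
    result = {"spots": []}  # hypothetical result of the work
    # publish the outcome so a second, SNS-triggered Lambda can consume it
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:lambda-results',  # hypothetical topic
        Message=json.dumps(result),
    )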
An asynchronously executed AWS Lambda function doesn't return the result of execution. If an asynchronous invocation request is successful (i.e. there were no errors due to permissions, etc), AWS Lambda immediately returns the HTTP status code 202 ACCEPTED and bears no further responsibility for communicating any information about the outcome of this asynchronous invocation.
From the documentation of AWS Lambda Invoke action:
Response Syntax
HTTP/1.1 StatusCode
X-Amz-Function-Error: FunctionError
X-Amz-Log-Result: LogResult
Payload
Response Elements
If the action is successful, the service sends back the following HTTP response.
StatusCode
The HTTP status code will be in the 200 range for a successful request. For the RequestResponse invocation type this status code will be 200. For the Event invocation type this status code will be 202. For the DryRun invocation type the status code will be 204.
[...]
The response returns the following as the HTTP body.
Payload
It is the JSON representation of the object returned by the Lambda function. This is present only if the invocation type is RequestResponse.
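To make that concrete, here is a minimal sketch of an Event-type invocation using the function name and payload from the question; the only useful thing to check on the returned dict is the status code, since there is no function result to read:

import boto3

client = boto3.client('lambda')
response = client.invoke(
    FunctionName='loadSpotsAroundPoint',
    InvocationType='Event',
    Payload=b'{"latitude": 39.5732160891, "longitude": -119.672918997, "radius": 100}',
)
# 202 ACCEPTED means the invocation was queued; the outcome is not reported here
print(response['StatusCode'])  # 202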
The following is a Python function that accepts the name of the Lambda function to invoke and the payload to send to that function.
It invokes the Lambda function via the boto3 client.
import boto3, json, typing

def invokeLambdaFunction(*, functionName:str=None, payload:typing.Mapping[str, str]=None):
    if functionName == None:
        raise Exception('ERROR: functionName parameter cannot be NULL')
    payloadStr = json.dumps(payload)
    payloadBytesArr = bytes(payloadStr, encoding='utf8')
    client = boto3.client('lambda')
    return client.invoke(
        FunctionName=functionName,
        InvocationType="RequestResponse",
        Payload=payloadBytesArr
    )
And usage:
if __name__ == '__main__':
    payloadObj = {"something" : "1111111-222222-333333-bba8-1111111"}
    response = invokeLambdaFunction(functionName='myLambdaFuncName', payload=payloadObj)
    print(f'response:{response}')
