So the idea is to collect responses for 1 million queries and store them in a dictionary. I want it to be asynchronous because requests.post takes 1 second per query, and I want to keep the loop going while it waits for a response. After some research I have something like this.
async def get_response(id):
    query_json = id2json_dict[id]
    response = requests.post('some_url', json=query_json, verify=False)
    return eval(response.text)

async def main(id_list):
    for unique_id in id_list:
        id2response_dict[unique_id] = get_response(unique_id)
I know this is not asynchronous; how do I use "await" in it to make it truly asynchronous?
The requests-async package provides asyncio support for requests... https://github.com/encode/requests-async
Either that or use aiohttp.
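If you go the aiohttp route instead, a minimal sketch of the same idea might look like this (untested; 'some_url' and the dicts come from the question, and ssl=False mirrors the question's verify=False):

import asyncio
import aiohttp

async def get_response(session, id):
    query_json = id2json_dict[id]
    # ssl=False disables certificate verification, like verify=False in requests.
    async with session.post('some_url', json=query_json, ssl=False) as response:
        return await response.json()

async def main(id_list):
    async with aiohttp.ClientSession() as session:
        tasks = [get_response(session, unique_id) for unique_id in id_list]
        results = await asyncio.gather(*tasks)
        return dict(zip(id_list, results))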
Not tested, but this should work:
import asyncio
import requests

async def get_response(id):
    query_json = id2json_dict[id]
    # We need to_thread here because requests.post is synchronous.
    response = await asyncio.to_thread(
        requests.post,
        'some_url', json=query_json, verify=False
    )
    return eval(response.text)  # response.json() would be safer than eval

async def main(id_list):
    tasks = [
        get_response(unique_id) for unique_id in id_list
    ]
    results = await asyncio.gather(*tasks)
    id2response_dict = {
        unique_id: result
        for (unique_id, result) in zip(id_list, results)
    }
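One caveat, given the 1 million queries mentioned in the question: launching every request at once can exhaust memory and the thread pool. A minimal sketch that caps concurrency with a semaphore (MAX_CONCURRENCY is an assumed tuning knob, not something from the question; all tasks are still created, but only that many requests run at a time):

import asyncio
import requests

MAX_CONCURRENCY = 100  # assumed limit; tune for your server and machine

async def get_response_limited(id, semaphore):
    async with semaphore:
        query_json = id2json_dict[id]
        response = await asyncio.to_thread(
            requests.post, 'some_url', json=query_json, verify=False
        )
        return response.json()  # used here instead of eval(response.text)

async def main(id_list):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    tasks = [get_response_limited(unique_id, semaphore) for unique_id in id_list]
    results = await asyncio.gather(*tasks)
    return dict(zip(id_list, results))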
I am attempting to make an API request, pull down specific chunks of the response, and ultimately save them into a file for later processing. I also want to mention up front that the script works fully until I begin to pull larger sets of data.
When I open the params to a larger date range, I receive:
aiohttp.client_exceptions.ContentTypeError: 0, message='Attempt to decode JSON with unexpected mimetype: text/html'
async def get_dataset(session, url):
    async with session.get(url=url, headers=headers, params=params) as resp:
        dataset = await resp.json()
        return dataset['time_entries']

async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        for page in range(1, total_pages):
            url = "https://api.harvestapp.com/v2/time_entries?page=" + str(page)
            tasks.append(asyncio.ensure_future(get_dataset(session, url)))
        dataset = await asyncio.gather(*tasks)
If I keep my params small enough, then it works without issue. But with too large a date range the error pops up, and nothing past the snippet I shared above runs.
More for reference:
url_address = "https://api.harvestapp.com/v2/time_entries/"

headers = {
    "Content-Type": 'application/json',
    "Authorization": authToken,
    "Harvest-Account-ID": accountID
}

params = {
    "from": StartDate,
    "to": EndDate
}
Any ideas on what would cause this to work on certain data sizes but fail on larger sets? I am assuming the JSON is becoming malformed at some point, but I am unsure of how to examine that and/or prevent it from happening, since I am able to pull multiple pages from the API and append them successfully on the smaller data pulls.
OP: Thank you to the others who gave answers. I discovered the issue and implemented a solution. A friend pointed out that aiohttp can return that error message if the response is an error page instead of the expected JSON content, i.e. an HTML page returned for HTTP 429 Too Many Requests. I looked up the API limits and found they are set to 100 requests per 15 seconds.
My solution was to implement the asyncio-throttle module, which allowed me to limit the requests and time period directly. You can find it on the developer's GitHub.
Here is my updated code with the implementation, very simple! For my case I needed to limit my requests to 100 per 15 seconds, which you can see below as well.
from asyncio_throttle import Throttler

async def get_dataset(session, url, throttler):
    while True:
        async with throttler:
            async with session.get(url=url, headers=headers, params=params) as resp:
                dataset = await resp.json()
                return dataset['time_entries']

async def main():
    tasks = []
    throttler = Throttler(rate_limit=100, period=15)
    async with aiohttp.ClientSession() as session:
        try:
            for page in range(1, total_pages):
                url = "https://api.harvestapp.com/v2/time_entries?page=" + str(page)
                tasks.append(asyncio.ensure_future(get_dataset(session, url, throttler)))
            dataset = await asyncio.gather(*tasks)
        except Exception as exc:
            # The original snippet ends mid-try; a minimal handler is added here
            # so the block is syntactically complete.
            print(exc)
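For anyone hitting the same ContentTypeError, a quick way to confirm it really is a rate-limit (or other HTML error) page is to check the status before decoding. This is a sketch, not part of the original solution:

async def get_dataset(session, url, throttler):
    async with throttler:
        async with session.get(url=url, headers=headers, params=params) as resp:
            if resp.status != 200:
                # An HTML error page (e.g. HTTP 429) lands here instead of
                # blowing up inside resp.json().
                body = await resp.text()
                raise RuntimeError("Request failed: {} {}".format(resp.status, body[:200]))
            dataset = await resp.json()
            return dataset['time_entries']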
I want to use HTTPX (within FastAPI, if that matters) to make asynchronous http requests to an outside API and store the responses as individual variables for processing in slightly different ways depending on which URL was fetched. I'm modifying the code from this StackOverflow answer.
import asyncio
import httpx

async def perform_request(client, url):
    response = await client.get(url)
    return response.text

async def gather_tasks(*urls):
    async with httpx.AsyncClient() as client:
        tasks = [perform_request(client, url) for url in urls]
        result = await asyncio.gather(*tasks)
        return result

async def f():
    url1 = "https://api.com/object=562"
    url2 = "https://api.com/object=383"
    url3 = "https://api.com/object=167"
    url4 = "https://api.com/object=884"

    result = await gather_tasks(url1, url2, url3, url4)
    # print(result[0])
    # print(result[1])
    # DO THINGS WITH url2, SOMETHING ELSE WITH url4, ETC.

if __name__ == '__main__':
    asyncio.run(f())
What's the best way to access the individual responses? (If I use result[n] I wouldn't know which response I'm working with.)
And I'm pretty new to httpx and async operations in general so please share if you have any suggestions for how to achieve it in a better way.
Regardless of AsyncIO, I would probably put the logic inside gather_tasks. There you know the response, and you can define whatever if/else logic you need to proceed down the right path.
In my opinion you have two options:
1 - Process the request right away
In this case f would only initialize the urls and trigger the processing, everything else would happen inside gather_tasks.
2 - "Enrich" the response
In gather_tasks you can determine which kind of operation comes next and "attach" some sort of code to the response to identify it. For example, you could return a dict with two keys: response and operation. This would be the most explicit way of doing it, but you could also use a list or a tuple; you just need to know where the response and the "next step code" sit within them.
This is useful if the further processing must happen later instead of right away.
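A minimal sketch of option 2 (the operation tags and the URL check are hypothetical, just to show the shape):

async def gather_tasks(*urls):
    async with httpx.AsyncClient() as client:
        tasks = [perform_request(client, url) for url in urls]
        responses = await asyncio.gather(*tasks)
        # "operation" is a made-up tag saying what to do with each response later.
        return [
            {"url": url, "response": resp,
             "operation": "summarize" if "object=383" in url else "store"}
            for url, resp in zip(urls, responses)
        ]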
Makes sense?
I was able to get the response of one API from another, but I am unable to store it somewhere (in a file or something) before returning the response.
response = RedirectResponse(url="/apiname/") (I want to access a POST request with headers and a body.)
I want to store this response content without returning it.
Yes, if I return the function I get the results, but when I print it I don't see the results.
Also, if I send a POST request I get an "Entity not found" error.
I read the Starlette and FastAPI docs but couldn't find a workaround. Callbacks also didn't help.
I didn't find a way to store the response without returning it using FastAPI/Starlette directly, but I found a workaround for completing this task.
For people trying to implement the same thing, please consider this approach.
import json

import requests
from fastapi import Request

def test_function(request: Request, path_parameter: path_param):
    request_example = {"test": "in"}
    host = request.client.host
    data_source_id = path_parameter.id

    # Use the extracted id in the URLs.
    get_test_url = f"http://{host}/test/{data_source_id}/"
    get_inp_url = f"http://{host}/test/{data_source_id}/inp"

    test_get_response = requests.get(get_test_url)
    inp_post_response = requests.post(get_inp_url, json=request_example)
    if inp_post_response.status_code == 200:
        print(json.loads(test_get_response.content.decode('utf-8')))
Please let me know if there are better approaches.
I have the same problem: I needed to call the third-party API asynchronously. So I tried many ways and came to a solution with the requests-async library (the code below uses its successor, http3), and it works for me.
import http3
from fastapi import FastAPI

app = FastAPI()
client = http3.AsyncClient()

async def call_api(url: str):
    r = await client.get(url)
    return r.text

@app.get("/")
async def root():
    ...
    result_1 = await call_api('url_1')
    result_2 = await call_api('url_2')
    ...
You can also use httpx; this video shows the same approach using httpx.
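For reference, an httpx version might look roughly like this (a sketch, reusing the placeholder URLs from the snippet above):

import httpx
from fastapi import FastAPI

app = FastAPI()

async def call_api(url: str) -> str:
    # A fresh client per call keeps the sketch simple; a shared client would also work.
    async with httpx.AsyncClient() as client:
        r = await client.get(url)
        return r.text

@app.get("/")
async def root():
    result_1 = await call_api('url_1')
    result_2 = await call_api('url_2')
    return {"result_1": result_1, "result_2": result_2}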
I'm new to Python multiprocessing. I don't quite understand the difference between Pool and Process. Can someone suggest which one I should use for my needs?
I have thousands of HTTP GET requests to send. After sending each one and getting the response, I want to store the response (a simple int) in a (shared) dict. My final goal is to write all the data in the dict to a file.
This is not CPU intensive at all. My only goal is to speed up sending the HTTP GET requests because there are so many. The requests are all isolated and do not depend on each other.
Shall I use Pool or Process in this case?
Thanks!
----The code below is added on 8/28---
I programmed with multiprocessing. The key challenges I'm facing are:
1) GET requests can fail sometimes. I have to set 3 retries to minimize the need to rerun my code / all requests. I only want to retry the failed ones. Can I achieve this with async HTTP requests without using Pool?
2) I want to check the response value of every request, and have exception handling.
The code below is simplified from my actual code. It is working fine, but I wonder if it's the most efficient way of doing things. Can anyone give any suggestions? Thanks a lot!
import time
import requests
from multiprocessing import Pool

def get_data(endpoint, get_params):
    response = requests.get(endpoint, params=get_params)
    if response.status_code != 200:
        raise Exception("bad response for " + str(get_params))
    return response.json()

def get_currency_data(endpoint, currency, date):
    get_params = {'currency': currency,
                  'date': date}
    for attempt in range(3):
        try:
            output = get_data(endpoint, get_params)
            # additional return value check
            # ......
            return output['value']
        except Exception:
            time.sleep(1)  # I found that sleeping for 1s almost always makes the retry succeed
    return 'error'

def get_all_data(currencies, dates):
    # I have many dates, but not too many currencies
    for currency in currencies:
        results = []
        pool = Pool(processes=20)
        for date in dates:
            results.append(pool.apply_async(get_currency_data,
                                            args=(endpoint, currency, date)))
        output = [p.get() for p in results]
        pool.close()
        pool.join()
        # Unfortunately I have to give the server some time to rest. I found it helps to
        # reduce failures. I didn't write the server; this is not something I can control.
        time.sleep(10)
Neither. Use asynchronous programming. Consider the code below, pulled directly from that article (credit goes to Paweł Miech).
#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def run(r):
    url = "http://localhost:8080/{}"
    tasks = []

    # Fetch all responses within one Client session,
    # keep connection alive for all requests.
    async with ClientSession() as session:
        for i in range(r):
            task = asyncio.ensure_future(fetch(url.format(i), session))
            tasks.append(task)

        responses = await asyncio.gather(*tasks)
        # you now have all response bodies in this variable
        print(responses)

def print_responses(result):
    print(result)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(4))
loop.run_until_complete(future)
Just create an array of URLs and, instead of the given code, loop over that array and pass each one to fetch.
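That change might look something like this (a sketch based on the run() above; the URLs here are placeholders):

async def run(urls):
    async with ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch(url, session)) for url in urls]
        responses = await asyncio.gather(*tasks)
        print(responses)

urls = ['http://localhost:8080/0', 'http://localhost:8080/1']  # placeholder list
loop = asyncio.get_event_loop()
loop.run_until_complete(run(urls))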
EDIT: Use requests_futures
As per @roganjosh's comment below, requests_futures is a super-easy way to accomplish this.
from requests_futures.sessions import FuturesSession

sess = FuturesSession()
urls = ['http://google.com', 'https://stackoverflow.com']
responses = {url: sess.get(url) for url in urls}

contents = {url: future.result().content
            for url, future in responses.items()
            if future.result().status_code == 200}
EDIT: Use grequests to support Python 2.7
You can also use grequests, which supports Python 2.7, for performing asynchronous URL calls.
import grequests

urls = ['http://google.com', 'http://stackoverflow.com']
responses = grequests.map(grequests.get(u) for u in urls)
print([len(r.content) for r in responses])
# [10475, 250785]
EDIT: Using multiprocessing
If you want to do this using multiprocessing, you can. Disclaimer: You're going to have a ton of overhead by doing this, and it won't be anywhere near as efficient as async programming... but it is possible.
It's actually pretty straightforward: you map the URLs through the HTTP GET function:
import requests
urls = ['http://google.com', 'http://stackoverflow.com']
from multiprocessing import Pool
pool = Pool(8)
responses = pool.map(requests.get, urls)
The size of the pool will be the number of simultaneously issued GET requests. Sizing it up should increase your network efficiency, but it'll add overhead on the local machine for communication and forking.
Again, I don't recommend this, but it certainly is possible, and if you have enough cores it's probably faster than doing the calls synchronously.
I'm trying to rewrite this Python 2.7 code to the new async world order:
def get_api_results(func, iterable):
    pool = multiprocessing.Pool(5)
    for res in pool.map(func, iterable):
        yield res
map() blocks until all results have been computed, so I'm trying to rewrite this as an async implementation that will yield results as soon as they are ready. Like map(), return values must be returned in the same order as iterable. I tried this (I need requests because of legacy auth requirements):
import asyncio

import requests

def get(i):
    r = requests.get('https://example.com/api/items/%s' % i)
    return i, r.json()

async def get_api_results():
    loop = asyncio.get_event_loop()
    futures = []
    for n in range(1, 11):
        futures.append(loop.run_in_executor(None, get, n))
    async for f in futures:
        k, v = await f
        yield k, v

for r in get_api_results():
    print(r)
but with Python 3.6 I'm getting:
File "scratch.py", line 16, in <module>
    for r in get_api_results():
TypeError: 'async_generator' object is not iterable
How can I accomplish this?
Regarding your older (2.7) code: multiprocessing is considered a powerful drop-in replacement for the much simpler threading module for concurrently processing CPU-intensive tasks, where threading does not work so well. Your code is probably not CPU bound, since it just needs to make HTTP requests, and threading might have been enough for solving your problem.
However, instead of using threading directly, Python 3+ has a nice module called concurrent.futures that offers a cleaner API via cool Executor classes. This module is also available for Python 2.7 as an external package.
The following code works on python 2 and python 3:
# For python 2, first run:
#
#     pip install futures
#
from __future__ import print_function

import requests
from concurrent import futures

URLS = [
    'http://httpbin.org/delay/1',
    'http://httpbin.org/delay/3',
    'http://httpbin.org/delay/6',
    'http://www.foxnews.com/',
    'http://www.cnn.com/',
    'http://europe.wsj.com/',
    'http://www.bbc.co.uk/',
    'http://some-made-up-domain.coooom/',
]

def fetch(url):
    r = requests.get(url)
    r.raise_for_status()
    return r.content

def fetch_all(urls):
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        print("All URLs submitted.")
        for future in futures.as_completed(future_to_url):
            url = future_to_url[future]
            if future.exception() is None:
                yield url, future.result()
            else:
                # print('%r generated an exception: %s' % (
                #     url, future.exception()))
                yield url, None

for url, s in fetch_all(URLS):
    status = "{:,.0f} bytes".format(len(s)) if s is not None else "Failed"
    print('{}: {}'.format(url, status))
This code uses futures.ThreadPoolExecutor, based on threading. A lot of the magic is in as_completed() used here.
Your Python 3.6 code above uses run_in_executor(), which (with the default None executor) runs the blocking call in a futures.ThreadPoolExecutor and does not really use asynchronous IO!!
If you really want to go forward with asyncio, you will need to use an HTTP client that supports asyncio, such as aiohttp. Here is an example code:
import asyncio
import aiohttp

async def fetch(session, url):
    print("Getting {}...".format(url))
    async with session.get(url) as resp:
        text = await resp.text()
        return "{}: Got {} bytes".format(url, len(text))

async def fetch_all():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, "http://httpbin.org/delay/{}".format(delay))
                 for delay in (1, 1, 2, 3, 3)]
        for task in asyncio.as_completed(tasks):
            print(await task)
    return "Done."

loop = asyncio.get_event_loop()
resp = loop.run_until_complete(fetch_all())
print(resp)
loop.close()
As you can see, asyncio also has an as_completed(), now using real asynchronous IO, utilizing only one thread on one process.
You put your event loop in another coroutine. Don't do that. The event loop is the outermost 'driver' of async code, and should be run synchronously.
If you need to process the fetched results, write more coroutines that do so. They could take the data from a queue, or could be driving the fetching directly.
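A queue-based split might look roughly like this (a sketch, not from the original answer; get() is the blocking requests-based function from the question, and loop is the running event loop):

async def produce(queue, loop):
    for n in range(1, 11):
        # Run the blocking get() in the default executor and queue each result.
        await queue.put(await loop.run_in_executor(None, get, n))
    await queue.put(None)  # sentinel: nothing more to fetch

async def consume(queue):
    while True:
        item = await queue.get()
        if item is None:
            break
        k, v = item
        # process each result as it arrives
        print(k, v)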
You could have a main function that fetches and processes results, for example:
async def main(loop):
    for n in range(1, 11):
        future = loop.run_in_executor(None, get, n)
        k, v = await future
        # do something with the result

loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))
I'd make the get() function properly async too using an async library like aiohttp so you don't have to use the executor at all.
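That could look roughly like this (a sketch; the URL pattern comes from the question, and gather() keeps results in input order, matching the map()-like requirement):

import asyncio
import aiohttp

async def get(session, i):
    # aiohttp replaces the blocking requests.get call.
    async with session.get('https://example.com/api/items/%s' % i) as resp:
        return i, await resp.json()

async def main():
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(get(session, n) for n in range(1, 11)))
        for k, v in results:
            print(k, v)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())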