I have a server which waits for a request containing a picture:
@app.route("/uploader_ios", methods=['POST'])
def upload_file_ios():
    imagefile = request.files['imagefile']
I can submit a POST request quite easily using requests in Python like so:
url = "<myserver>/uploader_ios"
files = {'imagefile': open(fname, 'rb')}
%time requests.post(url, files=files).json() # 2.77s
However, what I would like to do is submit 1,000 or perhaps 100,000 requests at the same time. I wanted to try to do this using asyncio because I have been able to use it for GET requests without a problem. However, I can't seem to create a valid POST request that the server accepts.
My attempt is below:
import aiohttp
import asyncio
import json
# Testing with small amount
concurrent = 2
url_list = ['<myserver>/uploader_ios'] * 10
def handle_req(data):
    return json.loads(data)['English']
def chunked_http_client(num_chunks, s):
    # Use semaphore to limit number of requests
    semaphore = asyncio.Semaphore(num_chunks)

    # Return co-routine that will work asynchronously and respect
    # locking of semaphore
    @asyncio.coroutine
    def http_get(url):
        nonlocal semaphore
        with (yield from semaphore):
            # Attach files
            files = aiohttp.FormData()
            files.add_field('imagefile', open(fname, 'rb'))
            response = yield from s.request('post', url, data=files)
            print(response)
            body = yield from response.content.read()
            yield from response.wait_for_close()
        return body
    return http_get
def run_experiment(urls, _session):
    http_client = chunked_http_client(num_chunks=concurrent, s=_session)
    # http_client returns futures, save all the futures to a list
    tasks = [http_client(url) for url in urls]
    dfs_route = []
    # wait for futures to be ready then iterate over them
    for future in asyncio.as_completed(tasks):
        data = yield from future
        try:
            out = handle_req(data)
            dfs_route.append(out)
        except Exception as err:
            print("Error {0}".format(err))
    return dfs_route
with aiohttp.ClientSession() as session:  # We create a persistent connection
    loop = asyncio.get_event_loop()
    calc_routes = loop.run_until_complete(run_experiment(url_list, session))
The issue is that the response I get is:
.../uploader_ios) [400 BAD REQUEST]>
I am assuming this is because I am not correctly attaching the image file.
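For what it's worth, here is a minimal sketch of attaching the file with aiohttp.FormData using the newer async/await syntax; the 'imagefile' field name matches the Flask handler above, while the filename and content type are assumptions:
import asyncio
import aiohttp

async def upload(session, url, fname):
    # Build the multipart body; open the file per request so every POST
    # sends a fresh file object.
    data = aiohttp.FormData()
    data.add_field('imagefile',
                   open(fname, 'rb'),
                   filename=fname,              # assumed filename
                   content_type='image/jpeg')   # assumed content type
    async with session.post(url, data=data) as resp:
        return await resp.json()

async def upload_many(url, fname, n=10, concurrency=2):
    semaphore = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        async def bounded():
            async with semaphore:
                return await upload(session, url, fname)
        return await asyncio.gather(*(bounded() for _ in range(n)))

# results = asyncio.run(upload_many('<myserver>/uploader_ios', fname))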
I have the following test code:
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
'http://www.cnn.com/',
'http://europe.wsj.com/',
'http://www.bbc.co.uk/',
'http://some-made-up-domain.com/']
# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
I need to use the concurrent.futures.ThreadPoolExecutor part of the code in a FastAPI endpoint.
My concern is the impact of the number of API calls and the use of threads: creating too many threads and the related consequences, such as starving the host, crashing the application and/or the host.
Any thoughts or gotchas on this approach?
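A minimal sketch of one way to keep the thread count bounded: create a single ThreadPoolExecutor with a fixed max_workers and hand the blocking calls to it via run_in_executor, instead of creating a new pool per request (the pool size of 10 and the endpoint path are arbitrary assumptions):
import asyncio
import concurrent.futures
import urllib.request

from fastapi import FastAPI

app = FastAPI()

# One shared, bounded pool for the whole application keeps the number of
# threads predictable regardless of how many requests arrive.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=10)

URLS = ['http://www.foxnews.com/', 'http://www.cnn.com/', 'http://www.bbc.co.uk/']

def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

@app.get('/sizes')
async def sizes():
    loop = asyncio.get_running_loop()
    # Offload the blocking urllib calls to the bounded pool.
    futures = [loop.run_in_executor(executor, load_url, url, 60) for url in URLS]
    results = await asyncio.gather(*futures, return_exceptions=True)
    return {url: (len(r) if isinstance(r, (bytes, bytearray)) else repr(r))
            for url, r in zip(URLS, results)}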
You should rather use the HTTPX library, which provides an async API. As described in this answer, you spawn a Client and reuse it every time you need it. To make asynchronous requests with HTTPX, you'll need an AsyncClient.
You could control the connection pool size as well, using the limits keyword argument on the Client, which takes an instance of httpx.Limits. For example:
limits = httpx.Limits(max_keepalive_connections=5, max_connections=10)
client = httpx.AsyncClient(limits=limits)
You can adjust the above per your needs. As per the documentation on Pool limit configuration:
max_keepalive_connections, number of allowable keep-alive connections, or None to always allow. (Defaults 20)
max_connections, maximum number of allowable connections, or None for no limits. (Default 100)
keepalive_expiry, time limit on idle keep-alive connections in seconds, or None for no limits. (Default 5)
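For illustration, all three knobs can be set together (the values below are arbitrary):
import httpx

limits = httpx.Limits(max_keepalive_connections=20,
                      max_connections=100,
                      keepalive_expiry=5.0)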
If you would like to adjust the timeout as well, you can use the timeout parameter to set a timeout on an individual request, or on a Client/AsyncClient instance, which results in the given timeout being used as the default for requests made with this client (see the implementation of the Timeout class as well). You can specify the timeout behavior in fine-grained detail; for example, setting the read timeout parameter specifies the maximum duration to wait for a chunk of data to be received (i.e., a chunk of the response body). If HTTPX is unable to receive data within this time frame, a ReadTimeout exception is raised. If set to None instead of some positive numerical value, there will be no timeout on read. The default is a 5-second timeout on all operations.
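For example, a per-request override on top of a client-level default might look like this (the values are arbitrary):
import httpx

# Client-level default: 5s for connect/write/pool, 30s for read.
client = httpx.AsyncClient(timeout=httpx.Timeout(5.0, read=30.0))

# Per-request override: disable the read timeout for one slow endpoint.
# response = await client.get('https://www.example.com', timeout=httpx.Timeout(5.0, read=None))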
You can use await client.aclose() to explicitly close the AsyncClient when you are done with it (this could be done inside a shutdown event handler, for instance).
To run multiple asynchronous operations—as you need to request five different URLs, when your API endpoint is called—you can use the awaitable asyncio.gather(). It will execute the async operations and return a list of results in the same order the awaitables (tasks) were passed to that function.
Working Example:
from fastapi import FastAPI
import httpx
import asyncio

URLS = ['https://www.foxnews.com/',
        'https://edition.cnn.com/',
        'https://www.nbcnews.com/',
        'https://www.bbc.co.uk/',
        'https://www.reuters.com/']

limits = httpx.Limits(max_keepalive_connections=5, max_connections=10)
timeout = httpx.Timeout(5.0, read=15.0)  # 15s timeout on read. 5s timeout elsewhere.
client = httpx.AsyncClient(limits=limits, timeout=timeout)
app = FastAPI()

@app.on_event('shutdown')
async def shutdown_event():
    await client.aclose()

async def send(url, client):
    return await client.get(url)

@app.get('/')
async def main():
    tasks = [send(url, client) for url in URLS]
    responses = await asyncio.gather(*tasks)
    return [r.text[:50] for r in responses]  # return the first 50 chars of each response
If you would like to avoid reading the entire response body into RAM, you could use Streaming responses, as described in this answer and demonstrated below:
# ... rest of the code is the same as above
from fastapi.responses import StreamingResponse

async def send(url, client):
    req = client.build_request('GET', url)
    return await client.send(req, stream=True)

async def iter_content(responses):
    for r in responses:
        async for chunk in r.aiter_text():
            yield chunk[:50]  # return the first 50 chars of each response
            yield '\n'
            break
        await r.aclose()

@app.get('/')
async def main():
    tasks = [send(url, client) for url in URLS]
    responses = await asyncio.gather(*tasks)
    return StreamingResponse(iter_content(responses), media_type='text/plain')
import requests
import json
from tqdm import tqdm
# list of links to loop through
links =['https://www.google.com/','https://www.google.com/','https://www.google.com/']
# for loop over the links using requests
data = []
for link in tqdm(range(len(links))):
    response = requests.get(links[link])
    response = response.json()
    data.append(response)
The above for loop goes through the whole list of links, but it is time-consuming when I try to loop over around a thousand links. Any help?
The simplest way is to make it multithreaded. The best way is probably asynchronous.
Multithreaded solution:
import requests
from tqdm.contrib.concurrent import thread_map
links =['https://www.google.com/','https://www.google.com/','https://www.google.com/']
def get_data(url):
    response = requests.get(url)
    response = response.json() # Do note this might fail at times
    return response
data = thread_map(get_data, links)
Or without using tqdm.contrib.concurrent.thread_map:
import requests
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
links =['https://www.google.com/','https://www.google.com/','https://www.google.com/']
def get_data(url):
    response = requests.get(url)
    response = response.json() # Do note this might fail at times
    return response
executor = ThreadPoolExecutor()
data = list(tqdm(executor.map(get_data, links), total=len(links)))
As suggested in the comments, you can use asyncio and aiohttp.
import asyncio
import aiohttp
urls = ["your", "links", "here"]
# create aio connector
conn = aiohttp.TCPConnector(limit_per_host=100, limit=0, ttl_dns_cache=300)
# set number of parallel requests - if you are requesting different domains you are likely to be able to set this higher, otherwise you may be rate limited
PARALLEL_REQUESTS = 10
# Create results array to collect results
results = []
async def gather_with_concurrency(n):
    # Create semaphore for async i/o
    semaphore = asyncio.Semaphore(n)
    # create an aiohttp session using the previous connector
    session = aiohttp.ClientSession(connector=conn)

    # await logic for get request
    async def get(url):
        async with semaphore:
            async with session.get(url, ssl=False) as response:
                obj = await response.read()
                # once object is acquired we append to list
                results.append(obj)

    # wait for all requests to be gathered and then close session
    await asyncio.gather(*(get(url) for url in urls))
    await session.close()
# get async event loop
loop = asyncio.get_event_loop()
# run using number of parallel requests
loop.run_until_complete(gather_with_concurrency(PARALLEL_REQUESTS))
# Close connection
conn.close()
# loop through results and do something to them
for res in results:
    do_something(res)
I have tried to comment the code as well as possible.
I have used BS4 to parse requests in this manner (in the do_something logic), but it will really depend on your use case.
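As a hypothetical example, a do_something that pulls the page title out of each response with BeautifulSoup could look like this:
from bs4 import BeautifulSoup

def do_something(res):
    # res is the raw bytes appended to results above
    soup = BeautifulSoup(res, 'html.parser')
    title = soup.title.get_text(strip=True) if soup.title else None
    print(title)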
I'm trying to use HTTPS proxies within async requests, making use of the asyncio library. When it comes to using HTTP proxies, there is a clear instruction here, but I get stuck in the case of HTTPS proxies. Moreover, I would like to reuse the same session instead of creating a new session every time I send a request.
Here is what I've tried so far (the proxies used within the script are taken directly from a free proxy site, so consider them placeholders):
import asyncio
import aiohttp
from bs4 import BeautifulSoup
proxies = [
'http://89.22.210.191:41258',
'http://91.187.75.48:39405',
'http://103.81.104.66:34717',
'http://124.41.213.211:41828',
'http://93.191.100.231:3128'
]
async def get_text(url):
    global proxies,proxy_url
    while True:
        check_url = proxy_url
        proxy = f'http://{proxy_url}'
        print("trying using:",check_url)

        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(url,proxy=proxy,ssl=False) as resp:
                    return await resp.text()
            except Exception:
                if check_url == proxy_url:
                    proxy_url = proxies.pop()
async def field_info(field_link):
    text = await get_text(field_link)
    soup = BeautifulSoup(text,'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))
if __name__ == '__main__':
    proxy_url = proxies.pop()
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]

    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(asyncio.gather(*(field_info(url) for url in links)))
    loop.run_until_complete(future)
    loop.close()
How can I use https proxies within the script along with reusing the same session?
This script creates a dictionary proxy_session_map, where the keys are proxies and the values are sessions. That way we know which session belongs to which proxy.
If there's some error using a proxy, I add it to the disabled_proxies set so I won't use it again:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from random import choice
proxies = [
'http://89.22.210.191:41258',
'http://91.187.75.48:39405',
'http://103.81.104.66:34717',
'http://124.41.213.211:41828',
'http://93.191.100.231:3128'
]
disabled_proxies = set()
proxy_session_map = {}
async def get_text(url):
    while True:
        try:
            available_proxies = [p for p in proxies if p not in disabled_proxies]

            if available_proxies:
                proxy = choice(available_proxies)
            else:
                proxy = None

            if proxy not in proxy_session_map:
                proxy_session_map[proxy] = aiohttp.ClientSession(timeout = aiohttp.ClientTimeout(total=5))

            print("trying using:",proxy)

            async with proxy_session_map[proxy].get(url,proxy=proxy,ssl=False) as resp:
                return await resp.text()

        except Exception as e:
            if proxy:
                print("error, disabling:",proxy)
                disabled_proxies.add(proxy)
            else:
                # we haven't used proxy, so return empty string
                return ''
async def field_info(field_link):
    text = await get_text(field_link)
    soup = BeautifulSoup(text,'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))
async def main():
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]

    tasks = [field_info(url) for url in links]
    await asyncio.gather(
        *tasks
    )

    # close all sessions:
    for s in proxy_session_map.values():
        await s.close()

if __name__ == '__main__':
    asyncio.run(main())
Prints (for example):
trying using: http://89.22.210.191:41258
trying using: http://124.41.213.211:41828
trying using: http://124.41.213.211:41828
error, disabling: http://124.41.213.211:41828
trying using: http://93.191.100.231:3128
error, disabling: http://124.41.213.211:41828
trying using: http://103.81.104.66:34717
BeautifulSoup to get image name from P class picture tag in Python
Scrape instagram public information from google cloud functions [duplicate]
Webscraping using R - the full website data is not loading
Facebook Public Data Scraping
How it is encode in javascript?
... and so on.
I am trying to retrieve 10 images through an API which returns JSON data, by first making a request to the API and then storing 10 image URLs from the returned JSON data in a list. In my original iteration I made individual requests to those URLs and saved the response content to a file. My code is given below with my API key removed for obvious reasons:
def get_image(search_term):
    number_images = 10
    images = requests.get("https://pixabay.com/api/?key=insertkey&q={}&per_page={}".format(search_term,number_images))
    images_json_dict = images.json()
    hits = images_json_dict["hits"]
    urls = []
    for i in range(len(hits)):
        urls.append(hits[i]["webformatURL"])

    count = 0
    for url in urls:
        picture_request = requests.get(url)
        if picture_request.status_code == 200:
            try:
                with open(dir_path+r'\\images\\{}.jpg'.format(count),'wb') as f:
                    f.write(picture_request.content)
            except:
                os.mkdir(dir_path+r'\\images\\')
                with open(dir_path+r'\\images\\{}.jpg'.format(count),'wb') as f:
                    f.write(picture_request.content)
            count += 1
This was working fine apart from the fact that it was very slow. It took maybe 7 seconds to pull in those 10 images and save them in a folder. I read here that it's possible to use Session() in the requests library to improve performance, and I'd like to have those images as quickly as possible. I've modified the code as shown below; however, the problem I'm having is that the get request on the session object returns a requests.sessions.Session object rather than a response code, and there is also no .content method to retrieve the content (I've added comments to the relevant lines of code below). I'm relatively new to programming, so I'm uncertain if this is even the best way to do this. My question is: how can I use sessions to retrieve the image content now that I am using Session(), or is there some smarter way to do this?
def get_image(search_term):
    number_images = 10
    images = requests.get("https://pixabay.com/api/?key=insertkey&q={}&per_page={}".format(search_term,number_images))
    images_json_dict = images.json()
    hits = images_json_dict["hits"]
    urls = []
    for i in range(len(hits)):
        urls.append(hits[i]["webformatURL"])

    count = 0
    # Now using Session()
    picture_request = requests.Session()
    for url in urls:
        picture_request.get(url)
        # This will no longer work as picture_request is an object
        if picture_request == 200:
            try:
                with open(dir_path+r'\\images\\{}.jpg'.format(count),'wb') as f:
                    # This will no longer work as there is no .content method
                    f.write(picture_request.content)
            except:
                os.mkdir(dir_path+r'\\images\\')
                with open(dir_path+r'\\images\\{}.jpg'.format(count),'wb') as f:
                    # This will no longer work as there is no .content method
                    f.write(picture_request.content)
            count += 1
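For what it's worth, Session.get() returns a Response object just like requests.get() does, so it only needs to be captured in a variable; a minimal sketch of the loop above with that change (urls, dir_path and count are the names from the snippet):
import requests

session = requests.Session()
for url in urls:
    picture_response = session.get(url)          # a Response, served from the pooled connection
    if picture_response.status_code == 200:
        with open(dir_path+r'\\images\\{}.jpg'.format(count),'wb') as f:
            f.write(picture_response.content)    # .content works exactly as before
        count += 1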
Assuming you want to stick with the requests library, you will need to use threading to create multiple parallel instances.
The concurrent.futures library has a convenient constructor for creating multiple threads: concurrent.futures.ThreadPoolExecutor.
fetch() is used for downloading images.
fetch_all() is used for creating the thread pool; you can select how many threads you want to run by passing the threads argument.
get_urls() is your function for retrieving the list of URLs. You should pass your token (key) and search_term.
Note: In case your Python version is older than 3.6, you should replace the f-strings (f"{args}") with regular formatting calls ("{}".format(args)).
import os
import requests
from concurrent import futures
def fetch(url, session = None):
    if session:
        r = session.get(url, timeout = 60.)
    else:
        r = requests.get(url, timeout = 60.)
    r.raise_for_status()
    return r.content
def fetch_all(urls, session = requests.session(), threads = 8):
    with futures.ThreadPoolExecutor(max_workers = threads) as executor:
        future_to_url = {executor.submit(fetch, url, session = session): url for url in urls}
        for future in futures.as_completed(future_to_url):
            url = future_to_url[future]
            if future.exception() is None:
                yield url, future.result()
            else:
                print(f"{url} generated an exception: {future.exception()}")
                yield url, None
def get_urls(search_term, number_images = 10, token = "", session = requests.session()):
    # use the passed-in session so the connection is reused
    r = session.get(f"https://pixabay.com/api/?key={token}&q={search_term}&per_page={number_images}")
    r.raise_for_status()
    urls = [hit["webformatURL"] for hit in r.json().get("hits", [])]
    return urls
if __name__ == "__main__":
root_dir = os.getcwd()
session = requests.session()
urls = get_urls("term", token = "token", session = session)
for url, content in fetch_all(urls, session = session):
if content is not None:
f_dir = os.path.join(root_dir, "images")
if not os.path.isdir(f_dir):
os.makedirs(f_dir)
with open(os.path.join(f_dir, os.path.basename(url)), "wb") as f:
f.write(content)
Also, I recommend you look at aiohttp. I will not provide an example here, but will give you a link to an article on a similar task, where you can read more about it.
I'm trying to rewrite this Python2.7 code to the new async world order:
def get_api_results(func, iterable):
    pool = multiprocessing.Pool(5)
    for res in pool.map(func, iterable):
        yield res
map() blocks until all results have been computed, so I'm trying to rewrite this as an async implementation that will yield results as soon as they are ready. Like map(), return values must be returned in the same order as iterable. I tried this (I need requests because of legacy auth requirements):
import requests
def get(i):
    r = requests.get('https://example.com/api/items/%s' % i)
    return i, r.json()

async def get_api_results():
    loop = asyncio.get_event_loop()
    futures = []
    for n in range(1, 11):
        futures.append(loop.run_in_executor(None, get, n))

    async for f in futures:
        k, v = await f
        yield k, v

for r in get_api_results():
    print(r)
but with Python 3.6 I'm getting:
File "scratch.py", line 16, in <module>
for r in get_api_results():
TypeError: 'async_generator' object is not iterable
How can I accomplish this?
Regarding your older (2.7) code: multiprocessing is considered a powerful drop-in replacement for the much simpler threading module for concurrently processing CPU-intensive tasks, where threading does not work so well. Your code is probably not CPU-bound, since it just needs to make HTTP requests, so threading might have been enough for solving your problem.
However, instead of using threading directly, Python 3+ has a nice module called concurrent.futures that provides a cleaner API via cool Executor classes. This module is also available for Python 2.7 as an external package.
The following code works on python 2 and python 3:
# For python 2, first run:
#
# pip install futures
#
from __future__ import print_function
import requests
from concurrent import futures
URLS = [
'http://httpbin.org/delay/1',
'http://httpbin.org/delay/3',
'http://httpbin.org/delay/6',
'http://www.foxnews.com/',
'http://www.cnn.com/',
'http://europe.wsj.com/',
'http://www.bbc.co.uk/',
'http://some-made-up-domain.coooom/',
]
def fetch(url):
    r = requests.get(url)
    r.raise_for_status()
    return r.content
def fetch_all(urls):
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        print("All URLs submitted.")
        for future in futures.as_completed(future_to_url):
            url = future_to_url[future]
            if future.exception() is None:
                yield url, future.result()
            else:
                # print('%r generated an exception: %s' % (
                #     url, future.exception()))
                yield url, None

for url, s in fetch_all(URLS):
    status = "{:,.0f} bytes".format(len(s)) if s is not None else "Failed"
    print('{}: {}'.format(url, status))
This code uses futures.ThreadPoolExecutor, based on threading. A lot of the magic is in as_completed() used here.
Your Python 3.6 code above uses run_in_executor(), which (when passed None) runs the blocking calls in asyncio's default concurrent.futures.ThreadPoolExecutor, and does not really use asynchronous IO!
If you really want to go forward with asyncio, you will need to use an HTTP client that supports asyncio, such as aiohttp. Here is an example code:
import asyncio
import aiohttp
async def fetch(session, url):
    print("Getting {}...".format(url))
    async with session.get(url) as resp:
        text = await resp.text()
    return "{}: Got {} bytes".format(url, len(text))

async def fetch_all():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, "http://httpbin.org/delay/{}".format(delay))
                 for delay in (1, 1, 2, 3, 3)]
        for task in asyncio.as_completed(tasks):
            print(await task)
    return "Done."
loop = asyncio.get_event_loop()
resp = loop.run_until_complete(fetch_all())
print(resp)
loop.close()
As you can see, asyncio also has an as_completed(), now using real asynchronous IO, utilizing only one thread on one process.
You put your event loop in another co-routine. Don't do that. The event loop is the outermost 'driver' of async code, and should be run synchronously.
If you need to process the fetched results, write more coroutines that do so. They could take the data from a queue, or could be driving the fetching directly.
You could have a main function that fetches and processes results, for example:
async def main(loop):
    for n in range(1, 11):
        future = loop.run_in_executor(None, get, n)
        k, v = await future
        # do something with the result

loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))
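A hedged sketch of the queue-based alternative mentioned above: producer tasks push each (key, value) pair into an asyncio.Queue while a consumer coroutine processes them as they arrive (get is the function defined in the question; the helper names are made up for illustration):
import asyncio

async def produce(loop, queue, n):
    result = await loop.run_in_executor(None, get, n)  # get() from the question above
    await queue.put(result)

async def consume(queue, total):
    for _ in range(total):
        k, v = await queue.get()
        # do something with the result
        print(k, v)

async def main_with_queue(loop):
    queue = asyncio.Queue()
    nums = list(range(1, 11))
    producers = [produce(loop, queue, n) for n in nums]
    await asyncio.gather(consume(queue, len(nums)), *producers)

# loop.run_until_complete(main_with_queue(loop))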
I'd make the get() function properly async too using an async library like aiohttp so you don't have to use the executor at all.
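And if the legacy auth can also be handled by aiohttp, a hedged sketch of the fully async version (no executor needed, and gather() preserves the input order):
import asyncio
import aiohttp

async def get(session, i):
    async with session.get('https://example.com/api/items/%s' % i) as resp:
        return i, await resp.json()

async def get_api_results():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(get(session, n) for n in range(1, 11)))

# for k, v in asyncio.get_event_loop().run_until_complete(get_api_results()):
#     print(k, v)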