Python: best way to check a list of URLs

I have a file defining a list of RSS feeds:
RSS_FEEDS = [
    "https://www.fanpage.it/feed/",
    "https://www.ilfattoquotidiano.it/feed/",
    "https://forbes.it/feed/",
    "https://formiche.net/feed/",
]
I wrote the following test:
import requests
from feeds import RSS_FEEDS
for rssfeed in RSS_FEEDS:
    response = requests.get(rssfeed)
    assert response.status_code == 200
Are there more efficient (download less stuff) ways?
How would you handle a slow response vs a dead link?
The above would just tell me if the URL is fetchable, but how could I assess if it's a valid RSS stream?

You could also solve it using the aiohttp library together with asyncio, like this:
from aiohttp import ClientSession
from asyncio import gather, create_task, run, set_event_loop, set_event_loop_policy
from traceback import format_exc
import sys

# This is necessary on my Windows computer
if sys.version_info[0] == 3 and sys.version_info[1] >= 8 and sys.platform.startswith('win'):  # Check for operating system
    from asyncio import ProactorEventLoop, WindowsSelectorEventLoopPolicy
    set_event_loop(ProactorEventLoop())
    set_event_loop_policy(WindowsSelectorEventLoopPolicy())  # Bug is not present in Linux

RSS_FEEDS = [
    "https://www.fanpage.it/feed/",
    "https://www.ilfattoquotidiano.it/feed/",
    "https://forbes.it/feed/",
    "https://formiche.net/feed/",
]

async def GetRessource(url: str, session: ClientSession) -> dict:
    try:
        async with session.get(url) as response:
            if response.status == 200:
                return response.status
            else:
                r: str = await response.text()
                print(f"Error, got response code: {response.status} message: {r}")
    except Exception:
        print(f"General Exception:\n{format_exc()}")
    return {}

async def GetUrls() -> None:
    async with ClientSession() as session:
        Tasks: list = [create_task(GetRessource(url, session)) for url in RSS_FEEDS]
        Results: list = await gather(*Tasks, return_exceptions=False)
        for result in Results:
            assert result == 200

async def main():
    await GetUrls()

if __name__ == "__main__":
    run(main())
Result of Results:
200
200
200
200
It's checking the URLs in parallel.
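If you also want to bound slow responses and download less data, a minimal sketch, assuming the servers answer HEAD requests (some don't, in which case fall back to GET), could be:

from aiohttp import ClientSession, ClientTimeout
from asyncio import gather, run

FEEDS = [
    "https://www.fanpage.it/feed/",
    "https://www.ilfattoquotidiano.it/feed/",
]

async def check(url: str, session: ClientSession) -> int:
    # HEAD asks only for headers, so none of the feed body is downloaded
    async with session.head(url, allow_redirects=True) as response:
        return response.status

async def main() -> None:
    # total=10 caps the whole request (connect + response) at 10 seconds;
    # a slow server raises asyncio.TimeoutError instead of hanging the test
    async with ClientSession(timeout=ClientTimeout(total=10)) as session:
        statuses = await gather(*(check(url, session) for url in FEEDS))
        assert all(s == 200 for s in statuses)

if __name__ == "__main__":
    run(main())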

To optimize network usage, add a timeout parameter to the GET request to limit how long you wait for a response, and a stream parameter so that only a portion of the response is downloaded in chunks rather than the entire file (see the streaming sketch after the code below).
To handle a slow or dead link, the timeout parameter raises an exception if the response takes too long; catch and handle the exceptions raised by the request, such as Timeout, ConnectionError, and HTTPError (e.g. retry, log the error).
To validate an RSS stream, use a library like feedparser to parse the response and determine whether it is a valid RSS feed, and look for the elements/attributes an RSS feed requires (e.g. channel, item, title, link).
import requests
import feedparser
from requests.exceptions import Timeout, ConnectionError, HTTPError
from feeds import RSS_FEEDS

for rssfeed in RSS_FEEDS:
    try:
        response = requests.get(rssfeed, timeout=5)
        response.raise_for_status()
        feed = feedparser.parse(response.content)
        if not feed.bozo:
            pass  # feed is valid
        else:
            pass  # feed is invalid
    except (Timeout, ConnectionError, HTTPError) as e:
        # handle exceptions here (e.g. retry, log error)
        pass
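The stream parameter mentioned above is not used in the snippet; a minimal sketch of it, assuming the first kilobyte of the feed is enough to see whether it looks like RSS/Atom (a full check still needs feedparser), could be:

import requests

def looks_like_rss(url: str, timeout: float = 5.0) -> bool:
    # stream=True defers the body download; iter_content then fetches
    # only the first chunk instead of the whole feed
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        first_chunk = next(response.iter_content(chunk_size=1024), b"")
    return b"<rss" in first_chunk or b"<feed" in first_chunk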

Related

Is having a concurrent.futures.ThreadPoolExecutor call dangerous in a FastAPI endpoint?

I have the following test code:
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
I need to use the concurrent.futures.ThreadPoolExecutor part of the code in a FastAPI endpoint.
My concern is the impact of the number of API calls combined with the threads they spawn: creating too many threads could starve the host, crashing the application and/or the host.
Any thoughts or gotchas on this approach?
You should rather use the HTTPX library, which provides an async API. As described in this answer, you spawn a Client and reuse it every time you need it. To make asynchronous requests with HTTPX, you'll need an AsyncClient.
You could control the connection pool size as well, using the limits keyword argument on the Client, which takes an instance of httpx.Limits. For example:
limits = httpx.Limits(max_keepalive_connections=5, max_connections=10)
client = httpx.AsyncClient(limits=limits)
You can adjust the above per your needs. As per the documentation on Pool limit configuration:
max_keepalive_connections, number of allowable keep-alive connections, or None to always allow. (Default: 20)
max_connections, maximum number of allowable connections, or None for no limits. (Default: 100)
keepalive_expiry, time limit on idle keep-alive connections in seconds, or None for no limits. (Default: 5)
If you would like to adjust the timeout as well, you can use the timeout parameter to set a timeout on an individual request, or on a Client/AsyncClient instance, which results in the given timeout being used as the default for requests made with this client (see the implementation of the Timeout class as well). You can specify the timeout behavior in fine-grained detail; for example, setting the read timeout parameter specifies the maximum duration to wait for a chunk of data to be received (i.e., a chunk of the response body). If HTTPX is unable to receive data within this time frame, a ReadTimeout exception is raised. If set to None instead of some positive numerical value, there will be no timeout on read. The default is a 5-second timeout on all operations.
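As a minimal sketch, assuming an AsyncClient already exists, a per-request override could look like:

import httpx

per_request_timeout = httpx.Timeout(10.0, connect=3.0, read=15.0)

async def fetch(client: httpx.AsyncClient, url: str) -> str:
    # The timeout passed here applies to this request only and overrides
    # the default the AsyncClient was created with
    response = await client.get(url, timeout=per_request_timeout)
    return response.text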
You can use await client.aclose() to explicitly close the AsyncClient when you are done with it (this could be done inside a shutdown event handler, for instance).
To run multiple asynchronous operations (as you need to request five different URLs when your API endpoint is called), you can use the awaitable asyncio.gather(). It will execute the async operations and return a list of results in the same order the awaitables (tasks) were passed to it.
Working Example:
from fastapi import FastAPI
import httpx
import asyncio
URLS = ['https://www.foxnews.com/',
        'https://edition.cnn.com/',
        'https://www.nbcnews.com/',
        'https://www.bbc.co.uk/',
        'https://www.reuters.com/']

limits = httpx.Limits(max_keepalive_connections=5, max_connections=10)
timeout = httpx.Timeout(5.0, read=15.0)  # 15s timeout on read. 5s timeout elsewhere.
client = httpx.AsyncClient(limits=limits, timeout=timeout)
app = FastAPI()

@app.on_event('shutdown')
async def shutdown_event():
    await client.aclose()

async def send(url, client):
    return await client.get(url)

@app.get('/')
async def main():
    tasks = [send(url, client) for url in URLS]
    responses = await asyncio.gather(*tasks)
    return [r.text[:50] for r in responses]  # return the first 50 chars of each response
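Assuming the file above is saved as app.py, you can serve it with uvicorn app:app and call the / endpoint; each call fans out the five requests concurrently through the shared AsyncClient, bounded by the pool limits set earlier.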
If you would like to avoid reading the entire response body into RAM, you could use Streaming responses, as described in this answer and demonstrated below:
# ... rest of the code is the same as above
from fastapi.responses import StreamingResponse

async def send(url, client):
    req = client.build_request('GET', url)
    return await client.send(req, stream=True)

async def iter_content(responses):
    for r in responses:
        async for chunk in r.aiter_text():
            yield chunk[:50]  # return the first 50 chars of each response
            yield '\n'
            break
        await r.aclose()

@app.get('/')
async def main():
    tasks = [send(url, client) for url in URLS]
    responses = await asyncio.gather(*tasks)
    return StreamingResponse(iter_content(responses), media_type='text/plain')

python looping fast through links

import requests
import json
from tqdm import tqdm

# list of links to loop through
links = ['https://www.google.com/', 'https://www.google.com/', 'https://www.google.com/']

# for loop over the links using requests
data = []
for link in tqdm(range(len(links))):
    response = requests.get(links[link])
    response = response.json()
    data.append(response)
The above for loop works through the whole list of links, but it is time consuming; when I tried it on around a thousand links it was far too slow. Any help?
The simplest way is to make it multithreaded. The best way is probably asynchronous.
Multithreaded solution:
import requests
from tqdm.contrib.concurrent import thread_map

links = ['https://www.google.com/', 'https://www.google.com/', 'https://www.google.com/']

def get_data(url):
    response = requests.get(url)
    response = response.json()  # Do note this might fail at times
    return response

data = thread_map(get_data, links)
Or without using tqdm.contrib.concurrent.thread_map:
import requests
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

links = ['https://www.google.com/', 'https://www.google.com/', 'https://www.google.com/']

def get_data(url):
    response = requests.get(url)
    response = response.json()  # Do note this might fail at times
    return response

executor = ThreadPoolExecutor()
data = list(tqdm(executor.map(get_data, links), total=len(links)))
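Note that ThreadPoolExecutor() with no arguments defaults to min(32, os.cpu_count() + 4) worker threads on Python 3.8+; for around a thousand links you may want to pass max_workers explicitly (e.g. ThreadPoolExecutor(max_workers=32)) and either call executor.shutdown() when done or use the executor as a context manager.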
As suggested in the comments, you can use asyncio and aiohttp.
import asyncio
import aiohttp

urls = ["your", "links", "here"]

# create aio connector
conn = aiohttp.TCPConnector(limit_per_host=100, limit=0, ttl_dns_cache=300)

# set number of parallel requests - if you are requesting different domains you are
# likely to be able to set this higher, otherwise you may be rate limited
PARALLEL_REQUESTS = 10

# Create results list to collect results
results = []

async def gather_with_concurrency(n):
    # Create semaphore for async i/o
    semaphore = asyncio.Semaphore(n)
    # create an aiohttp session using the previous connector
    session = aiohttp.ClientSession(connector=conn)

    # await logic for get request
    async def get(url):
        async with semaphore:
            async with session.get(url, ssl=False) as response:
                obj = await response.read()
                # once the object is acquired we append it to the list
                results.append(obj)

    # wait for all requests to be gathered and then close the session
    await asyncio.gather(*(get(url) for url in urls))
    await session.close()

# get async event loop
loop = asyncio.get_event_loop()
# run using number of parallel requests
loop.run_until_complete(gather_with_concurrency(PARALLEL_REQUESTS))
# Close connection
conn.close()

# loop through results and do something to them
for res in results:
    do_something(res)
I have tried to comment the code as well as possible.
I have used BS4 to parse requests in this manner (in the do_something logic), but it will really depend on your use case.
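As an illustration only, a minimal sketch of such a do_something, assuming the response bodies are HTML pages parsed with BeautifulSoup, might look like:

from bs4 import BeautifulSoup

def do_something(res: bytes) -> None:
    # res is a raw body collected by gather_with_concurrency
    soup = BeautifulSoup(res, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else "<no title>"
    print(title)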

Can't use https proxies along with reusing the same session within a script built upon asyncio

I'm trying to use an https proxy within async requests, making use of the asyncio library. When it comes to using an http proxy, there is a clear instruction here, but I get stuck in the case of an https proxy. Moreover, I would like to reuse the same session, not create a new session every time I send a request.
What I've tried so far (the proxies used within the script are taken directly from a free proxy site, so consider them placeholders):
import asyncio
import aiohttp
from bs4 import BeautifulSoup

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

async def get_text(url):
    global proxies, proxy_url
    while True:
        check_url = proxy_url
        proxy = f'http://{proxy_url}'
        print("trying using:", check_url)
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(url, proxy=proxy, ssl=False) as resp:
                    return await resp.text()
            except Exception:
                if check_url == proxy_url:
                    proxy_url = proxies.pop()

async def field_info(field_link):
    text = await get_text(field_link)
    soup = BeautifulSoup(text, 'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

if __name__ == '__main__':
    proxy_url = proxies.pop()
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(asyncio.gather(*(field_info(url) for url in links)))
    loop.run_until_complete(future)
    loop.close()
How can I use https proxies within the script along with reusing the same session?
This script creates a dictionary, proxy_session_map, where keys are proxies and values are sessions. That way we know which session belongs to which proxy.
If there's an error using a proxy, I add it to the disabled_proxies set so it won't be used again:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from random import choice

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

disabled_proxies = set()
proxy_session_map = {}

async def get_text(url):
    while True:
        try:
            available_proxies = [p for p in proxies if p not in disabled_proxies]
            if available_proxies:
                proxy = choice(available_proxies)
            else:
                proxy = None

            if proxy not in proxy_session_map:
                proxy_session_map[proxy] = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=5))

            print("trying using:", proxy)

            async with proxy_session_map[proxy].get(url, proxy=proxy, ssl=False) as resp:
                return await resp.text()

        except Exception as e:
            if proxy:
                print("error, disabling:", proxy)
                disabled_proxies.add(proxy)
            else:
                # we haven't used a proxy, so return an empty string
                return ''

async def field_info(field_link):
    text = await get_text(field_link)
    soup = BeautifulSoup(text, 'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

async def main():
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    tasks = [field_info(url) for url in links]
    await asyncio.gather(*tasks)

    # close all sessions:
    for s in proxy_session_map.values():
        await s.close()

if __name__ == '__main__':
    asyncio.run(main())
Prints (for example):
trying using: http://89.22.210.191:41258
trying using: http://124.41.213.211:41828
trying using: http://124.41.213.211:41828
error, disabling: http://124.41.213.211:41828
trying using: http://93.191.100.231:3128
error, disabling: http://124.41.213.211:41828
trying using: http://103.81.104.66:34717
BeautifulSoup to get image name from P class picture tag in Python
Scrape instagram public information from google cloud functions [duplicate]
Webscraping using R - the full website data is not loading
Facebook Public Data Scraping
How it is encode in javascript?
... and so on.

Combining async and sync requests in python?

I am trying to make a request to server A, where the response will be a list of requests, which I will make to server B.
Currently the request to server A is just a simple sync request, like this:
import requests
req = requests.get('https://server-a.com')
data = req.json()
list_of_requests = data['requests'] # requests for server B
Since list_of_requests can be a few thousand items long, I would like to use async to speed up the requests to B.
I've looked at several examples of async HTTP requests using aiohttp, such as this one from
https://towardsdatascience.com/fast-and-async-in-python-accelerate-your-requests-using-asyncio-62dafca83c33
import aiohttp
import asyncio
import json
import os
from aiohttp import ClientSession
from aiohttp import ClientResponseError as HTTPError  # added so the except clause below resolves

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes?q=isbn:"
LIST_ISBN = [
    '9780002005883',
    '9780002238304',
    '9780002261982',
    '9780006163831',
    '9780006178736',
    '9780006280897',
    '9780006280934',
    '9780006353287',
    '9780006380832',
    '9780006470229',
]

def extract_fields_from_response(response):
    """Extract fields from API's response"""
    item = response.get("items", [{}])[0]
    volume_info = item.get("volumeInfo", {})
    title = volume_info.get("title", None)
    subtitle = volume_info.get("subtitle", None)
    description = volume_info.get("description", None)
    published_date = volume_info.get("publishedDate", None)
    return (
        title,
        subtitle,
        description,
        published_date,
    )

async def get_book_details_async(isbn, session):
    """Get book details using Google Books API (asynchronously)"""
    url = GOOGLE_BOOKS_URL + isbn
    try:
        response = await session.request(method='GET', url=url)
        response.raise_for_status()
        print(f"Response status ({url}): {response.status}")
    except HTTPError as http_err:
        print(f"HTTP error occurred: {http_err}")
    except Exception as err:
        print(f"An error occurred: {err}")
    response_json = await response.json()
    return response_json

async def run_program(isbn, session):
    """Wrapper for running program in an asynchronous manner"""
    try:
        response = await get_book_details_async(isbn, session)
        parsed_response = extract_fields_from_response(response)
        print(f"Response: {json.dumps(parsed_response, indent=2)}")
    except Exception as err:
        print(f"Exception occurred: {err}")

async def main():
    # wrapped in a coroutine so the snippet runs as a script
    async with ClientSession() as session:
        await asyncio.gather(*[run_program(isbn, session) for isbn in LIST_ISBN])

if __name__ == "__main__":
    asyncio.run(main())
However, all of the examples I have looked at start with the list of requests already defined. My question is, what is the proper pythonic way/pattern of combining a single sync request and then using that request to 'spawn' async tasks?
Thanks a bunch!
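One common pattern, as a minimal sketch of my own (assuming server B accepts plain GET requests and that each item in list_of_requests is a URL), is to do the single blocking request first and then hand its result to an async fetcher:

import asyncio
import aiohttp
import requests

def get_request_list() -> list:
    # plain synchronous call to server A
    resp = requests.get('https://server-a.com')
    resp.raise_for_status()
    return resp.json()['requests']

async def fetch(session: aiohttp.ClientSession, url: str) -> bytes:
    async with session.get(url) as response:
        return await response.read()

async def fetch_all(urls: list) -> list:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

if __name__ == '__main__':
    list_of_requests = get_request_list()                 # sync part
    results = asyncio.run(fetch_all(list_of_requests))    # async part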

Concurrent POST requests w/ attachment Python

I have a server which waits for a request containing a picture:
@app.route("/uploader_ios", methods=['POST'])
def upload_file_ios():
    imagefile = request.files['imagefile']
I can submit a post request quite easily using requests in python like so:
url = "<myserver>/uploader_ios"
files = {'imagefile': open(fname, 'rb')}
%time requests.post(url, files=files).json() # 2.77s
However, what I would like to do is submit 1000 or perhaps 100,000 requests at the same time. I wanted to try to do this using asyncio because I have been able to use it for GET requests without a problem. However, I can't seem to create a valid POST request that the server accepts.
My attempt is below:
import aiohttp
import asyncio
import json

# Testing with small amount
concurrent = 2
url_list = ['<myserver>/uploader_ios'] * 10

def handle_req(data):
    return json.loads(data)['English']

def chunked_http_client(num_chunks, s):
    # Use semaphore to limit number of requests
    semaphore = asyncio.Semaphore(num_chunks)

    @asyncio.coroutine
    # Return co-routine that will work asynchronously and respect
    # locking of semaphore
    def http_get(url):
        nonlocal semaphore
        with (yield from semaphore):
            # Attach files (fname is the image path from the earlier snippet)
            files = aiohttp.FormData()
            files.add_field('imagefile', open(fname, 'rb'))
            response = yield from s.request('post', url, data=files)
            print(response)
            body = yield from response.content.read()
            yield from response.wait_for_close()
        return body
    return http_get

def run_experiment(urls, _session):
    http_client = chunked_http_client(num_chunks=concurrent, s=_session)
    # http_client returns futures, save all the futures to a list
    tasks = [http_client(url) for url in urls]
    dfs_route = []
    # wait for futures to be ready then iterate over them
    for future in asyncio.as_completed(tasks):
        data = yield from future
        try:
            out = handle_req(data)
            dfs_route.append(out)
        except Exception as err:
            print("Error {0}".format(err))
    return dfs_route

with aiohttp.ClientSession() as session:  # We create a persistent connection
    loop = asyncio.get_event_loop()
    calc_routes = loop.run_until_complete(run_experiment(url_list, session))
The issue is that the response I get is:
.../uploader_ios) [400 BAD REQUEST]>
I am assuming this is because I am not correctly attaching the image-file
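A minimal sketch of a more explicit attachment, assuming the 400 comes from the multipart field missing a filename/content type (Flask's request.files only picks up parts that carry a filename), could be:

import aiohttp

def build_form(fname):
    # Explicitly setting filename and content_type makes the part show up
    # in Flask's request.files; adjust the content type to your image format
    form = aiohttp.FormData()
    form.add_field('imagefile',
                   open(fname, 'rb'),
                   filename='image.jpg',
                   content_type='image/jpeg')
    return form

Inside http_get you would then use response = yield from s.request('post', url, data=build_form(fname)).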
