I am using asyncio in Python to fetch data from a large number of URLs.
Here I request yahoo.com 1000 times, and ~90% of the requests fail. Reducing the number of parallel requests reduces the failure rate. I am trying to understand why this happens.
import asyncio
import aiohttp

could_not_fetch = []

# Fetch data from the URL. Almost 90% of the requests fail here.
async def fetch_page(url, session, id):
    try:
        async with session.get(url, timeout=3) as res:
            html = await res.text()
            return html
    except Exception:
        could_not_fetch.append(id)

async def process(id, url, session):
    html = await fetch_page(url, session, id)

# Use aiohttp.ClientSession() to request data from the URLs, also passing their index.
async def dispatch(urls):
    async with aiohttp.ClientSession() as session:
        coros = (process(id, url, session) for id, url in enumerate(urls))
        return await asyncio.gather(*coros)

# Use asyncio to get data from yahoo.com 1000 times. If I reduce the number to 100, far fewer (~10%) of the requests fail.
def main():
    loop = asyncio.get_event_loop()
    loop.run_until_complete(dispatch(1000 * ['https://yahoo.com/']))
    print('could_not_fetch', len(could_not_fetch))

if __name__ == '__main__':
    main()
I am trying to understand why these requests fail and how to fix this while still making 1k requests at a time.
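Since lowering the number of parallel requests lowers the failure rate, one thing worth trying is to schedule all 1000 requests but cap how many are in flight at once. The following is a minimal sketch, not a verified fix for this script. It assumes a stand-in coroutine `fake_fetch` (hypothetical; it only sleeps) in place of the real `session.get` call, and an arbitrary limit of 100.

```python
import asyncio

MAX_IN_FLIGHT = 100  # assumed limit; tune it for your network and the target server

in_flight = 0
peak_in_flight = 0

async def fake_fetch(url):
    # Hypothetical stand-in for fetch_page(); replace the sleep with the real request.
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.001)
    in_flight -= 1
    return url

async def bounded_fetch(sem, url):
    # At most MAX_IN_FLIGHT coroutines get past this point at the same time.
    async with sem:
        return await fake_fetch(url)

async def dispatch(urls):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    return await asyncio.gather(*(bounded_fetch(sem, url) for url in urls))

results = asyncio.run(dispatch(1000 * ['https://yahoo.com/']))
print(len(results), 'fetched, peak concurrency', peak_in_flight)
```

With the semaphore placed around the request, a per-request timeout only starts counting once the request is actually allowed to run, so time spent waiting for a free slot no longer eats into it. Whether that is what causes the failures here is an assumption that would need to be checked against the real exceptions being raised.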
Related
I have a static URL, headers, and data.
Is it possible to make a million POST requests simultaneously with Python?
This is file.py:
import json
import requests
url = "https://abcd.com"
headers = "headers"
body = "body"
resp = requests.post(url, headers=headers, data=body)
json_resp = json.loads(resp.content)["data"]
print(json_resp)
You might want to use some python tools for that such as:
https://locust.io/
Your file would look like:
from locust import HttpUser, task, between

# url, headers and body are the values defined in file.py
class QuickstartUser(HttpUser):
    @task
    def task_name(self):
        self.client.post(url, headers=headers, data=body)
You could feed it to locust in such a way:
locust --headless --users <number_of_user> -f <your_file.py>
You can do this in several ways. The best method is asynchronous I/O, which is exactly what async work is designed for.
The second method is ThreadPoolExecutor, which I do not really recommend.
Here is an example of the async approach.
# modified fetch function with semaphore
import random
import asyncio
from aiohttp import ClientSession

async def fetch(url, session):
    async with session.get(url) as response:
        delay = response.headers.get("DELAY")
        date = response.headers.get("DATE")
        print("{}:{} with delay {}".format(date, response.url, delay))
        return await response.read()

async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)

async def run(r):
    url = "http://localhost:8080/{}"
    tasks = []
    # create instance of Semaphore
    sem = asyncio.Semaphore(1000)

    # Create client session that will ensure we dont open new connection
    # per each request.
    async with ClientSession() as session:
        for i in range(r):
            # pass Semaphore and session to every GET request
            task = asyncio.ensure_future(bound_fetch(sem, url.format(i), session))
            tasks.append(task)

        responses = asyncio.gather(*tasks)
        await responses

number = 10000
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(number))
loop.run_until_complete(future)
I have the code below, which sends GET requests to an HTTP endpoint. Doing them one at a time is very slow, so the code sends them 50 at a time. However, I need to add the results to a set (I figured a set would be fastest, because this script returns duplicate objects). Right now it just returns the objects as a string, 50 at a time, when I need them separated so I can sort them once they are all in a set. I'm new to Python, so I'm not sure what else to try.
import asyncio
from aiohttp import ClientSession

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def run(r):
    url = "http://httpbin.org/get"
    tasks = []

    # Fetch all responses within one Client session,
    # keep connection alive for all requests.
    async with ClientSession() as session:
        for i in range(r):
            task = asyncio.ensure_future(fetch(url.format(i), session))
            tasks.append(task)

        responses = await asyncio.gather(*tasks)
        # you now have all response bodies in this variable
        print(responses)

def print_responses(result):
    print(result)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(20))
loop.run_until_complete(future)
Right now it just dumps all of the responses into one result. I need it to add each response to a set so I can work with the data later.
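`asyncio.gather` returns a list with one entry per request, so the individual bodies are already separated and can be passed straight to `set()`. A minimal sketch of that idea follows, assuming a stand-in coroutine `fake_fetch` (hypothetical) that returns repeating byte strings instead of calling the real endpoint:

```python
import asyncio

async def fake_fetch(i):
    # Hypothetical stand-in for fetch(); returns a body that repeats every 5 requests.
    await asyncio.sleep(0)
    return 'object-{}'.format(i % 5).encode()

async def run(r):
    tasks = [asyncio.ensure_future(fake_fetch(i)) for i in range(r)]
    responses = await asyncio.gather(*tasks)  # list with one body per request
    return set(responses)                     # duplicate bodies collapse here

unique = asyncio.run(run(20))
for body in sorted(unique):
    print(body)
```

`response.read()` returns `bytes`, which are hashable, so the same approach should work with the real `fetch`. If each response contains several objects (for example a JSON array), the body would need to be parsed first and the individual items added to the set instead.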
I'm new to web development, and I'm testing my site by sending HTTP GET requests to see how well it handles the load. With my code I can send multiple GET requests, but how can I make the loop never stop, so that it sends the GET requests over and over again? I'm sorry for my bad English; I hope you understand my question.
import time
import datetime
import asyncio
import aiohttp

domain = 'http://myserver.com'
a = '{}/page1?run={}'.format(domain, time.time())
b = '{}/page2?run={}'.format(domain, time.time())

async def get(url):
    print('GET: ', url)
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            t = '{0:%H:%M:%S}'.format(datetime.datetime.now())
            print('Done: {}, {} ({})'.format(t, response.url, response.status))

loop = asyncio.get_event_loop()
tasks = [
    asyncio.ensure_future(get(a)),
    asyncio.ensure_future(get(b))
]
loop.run_until_complete(asyncio.wait(tasks))
If you want something to happen over and over, add a for or while loop - see https://docs.python.org/3/tutorial/index.html
async def get(url):
    async with aiohttp.ClientSession() as session:
        while True:
            print('GET: ', url)
            async with session.get(url) as response:
                t = '{0:%H:%M:%S}'.format(datetime.datetime.now())
                print('Done: {}, {} ({})'.format(t, response.url, response.status))
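For a load test it may help to pause briefly between requests and to be able to stop the loop. A rough sketch of that variation follows. `fake_request`, the `delay` value and the `max_requests` parameter are all hypothetical additions for illustration; `fake_request` stands in for `session.get(url)` so the example runs without a server.

```python
import asyncio

calls = []

async def fake_request(url):
    # Hypothetical stand-in for session.get(url); records the call and returns a status.
    await asyncio.sleep(0)
    calls.append(url)
    return 200

async def get_repeatedly(url, delay=0.01, max_requests=None):
    # max_requests=None loops forever; a number stops the loop after that many requests.
    sent = 0
    while max_requests is None or sent < max_requests:
        status = await fake_request(url)
        print('Done:', url, status)
        sent += 1
        await asyncio.sleep(delay)  # short pause between requests
    return sent

async def main():
    urls = ['http://myserver.com/page1', 'http://myserver.com/page2']
    # Each URL gets its own loop; gather runs both loops concurrently.
    return await asyncio.gather(*(get_repeatedly(u, max_requests=3) for u in urls))

counts = asyncio.run(main())
print(counts, len(calls))
```

Leaving `max_requests` at `None` gives the never-ending behaviour asked for in the question; the program then has to be stopped with Ctrl+C.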
I am writing a script that gets the HTML of almost 20,000 pages and parses it to extract just a portion of each page.
I managed to get the content of the 20,000 pages into a dataframe with asynchronous requests using asyncio and aiohttp, but the script still waits for all the pages to be fetched before parsing them.
async def get_request(session, url, params=None):
    async with session.get(url, headers=HEADERS, params=params) as response:
        return await response.text()

async def get_html_from_url(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(get_request(session, url))
        html_page_response = await asyncio.gather(*tasks)
    return html_page_response

html_pages_list = asyncio_loop.run_until_complete(get_html_from_url(urls))
Once I have the content of each page, I use multiprocessing's Pool to parallelize the parsing.
def get_whatiwant_from_html(html_content):
    parsed_html = BeautifulSoup(html_content, "html.parser")
    clean = parsed_html.find("div", class_="class").get_text()
    # Some re.subs
    clean = re.sub("", "", clean)
    clean = re.sub("", "", clean)
    clean = re.sub("", "", clean)
    return clean

pool = Pool(4)
what_i_want = pool.map(get_whatiwant_from_html, html_pages_list)
The following code interleaves the fetching and the parsing asynchronously, but I would like to integrate multiprocessing into it:
async def process(url, session):
    html = await get_request(session, url)
    # get_whatiwant_from_html is a regular function, so it is called without
    # await and blocks the event loop while it parses
    return get_whatiwant_from_html(html)

async def dispatch(urls):
    async with aiohttp.ClientSession() as session:
        coros = (process(url, session) for url in urls)
        return await asyncio.gather(*coros)

result = asyncio.get_event_loop().run_until_complete(dispatch(urls))
Is there any obvious way to do this? I thought about creating 4 processes that each run the asynchronous calls but the implementation looks a bit complex and I'm wondering if there is another way.
I am very new to asyncio and aiohttp so if you have anything to advise me to read to get a better understanding, I will be very happy.
You can use ProcessPoolExecutor.
With run_in_executor you keep the I/O in your main asyncio process
and run the heavy CPU calculations in separate processes.
import asyncio
import concurrent.futures
from functools import partial

import aiohttp

async def get_data(session, url, params=None):
    loop = asyncio.get_event_loop()
    async with session.get(url, headers=HEADERS, params=params) as response:
        html = await response.text()
        # None selects the loop's default executor, set to the process pool below
        data = await loop.run_in_executor(None, partial(get_whatiwant_from_html, html))
        return data

async def get_data_from_urls(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(get_data(session, url))
        result_data = await asyncio.gather(*tasks)
    return result_data

asyncio_loop = asyncio.get_event_loop()
executor = concurrent.futures.ProcessPoolExecutor(max_workers=10)
asyncio_loop.set_default_executor(executor)
results = asyncio_loop.run_until_complete(get_data_from_urls(urls))
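If the main goal is simply to start parsing before every page has arrived, `asyncio.as_completed` is another option: it hands back results in the order they finish. A minimal sketch follows, assuming hypothetical stand-ins `fake_get` (sleeps instead of doing network I/O) and `parse` (a trivial placeholder for `get_whatiwant_from_html`):

```python
import asyncio

async def fake_get(url, delay):
    # Hypothetical stand-in for get_request(); sleeps instead of doing network I/O.
    await asyncio.sleep(delay)
    return '<div class="class">{}</div>'.format(url)

def parse(html):
    # Placeholder for get_whatiwant_from_html(); strips the wrapping tag.
    return html[len('<div class="class">'):-len('</div>')]

async def fetch_and_parse(jobs):
    parsed = []
    tasks = [fake_get(url, delay) for url, delay in jobs]
    for finished in asyncio.as_completed(tasks):
        html = await finished       # next page to finish, regardless of submit order
        parsed.append(parse(html))  # parsed right away, while others are still in flight
    return parsed

jobs = [('slow', 0.1), ('fast', 0.01), ('medium', 0.05)]
order = asyncio.run(fetch_and_parse(jobs))
print(order)
```

In this sketch the parsing still runs in the event loop thread, so a slow parser would delay the other downloads. For CPU-heavy parsing, the `parse(html)` call can be handed to `loop.run_in_executor` as in the answer above.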
You can increase your parsing speed by changing your BeautifulSoup parser from html.parser to lxml, which is by far the fastest, followed by html.parser. html5lib is the slowest of them all.
Your bottleneck is not processing but I/O. You might want multiple threads rather than multiple processes.
For example, here is a template program that scrapes quotes. Each request is slow, but because the requests run in multiple threads, the whole task completes faster.
from concurrent.futures import ThreadPoolExecutor
import random, time
from bs4 import BeautifulSoup as bs
import requests

URL = 'http://quotesondesign.com/wp-json/posts'

def quote_stream():
    '''
    Quoter streamer
    '''
    param = dict(page=random.randint(1, 1000))
    quo = requests.get(URL, params=param)
    if quo.ok:
        data = quo.json()
        author = data[0]['title'].strip()
        content = bs(data[0]['content'], 'html5lib').text.strip()
        print(f'{content}\n-{author}\n')
    else:
        print('Connection Issues :(')

def multi_qouter(workers=4):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        _ = [executor.submit(quote_stream) for i in range(workers)]

if __name__ == '__main__':
    now = time.time()
    multi_qouter(workers=4)
    print(f'Time taken {time.time()-now:.2f} seconds')
In your case, create a function that performs the task you want from start to finish. This function would accept the URL and any other necessary parameters as arguments. After that, create another function that calls the first one in different threads, each thread having its own URL. So instead of for i in range(..), use for url in urls. You can run 2000 threads at once, but I would prefer chunks of, say, 200 running in parallel.
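The chunking idea can be sketched roughly as follows. `handle_url`, `chunks` and `run_in_chunks` are hypothetical names; `handle_url` stands in for the start-to-finish function described above and does no network I/O here, so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_url(url):
    # Hypothetical worker: do everything for one URL from start to finish.
    # Replace the body with requests.get(url) plus parsing.
    return len(url)

def chunks(items, size):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]

def run_in_chunks(urls, chunk_size=200):
    results = []
    for chunk in chunks(urls, chunk_size):
        # One pool per chunk: the next chunk starts only after this one finishes.
        with ThreadPoolExecutor(max_workers=len(chunk)) as executor:
            results.extend(executor.map(handle_url, chunk))
    return results

urls = ['http://example.com/page/{}'.format(i) for i in range(450)]
results = run_in_chunks(urls)
print(len(results))
```

A single `ThreadPoolExecutor(max_workers=200)` mapped over all URLs gives a similar cap on concurrency without waiting at chunk boundaries, so explicit chunks are mostly useful when something has to happen between batches, such as saving partial results.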
My code for API requests behaves non-deterministically. I use asyncio to make asynchronous requests because I want to send many requests and the data changes frequently (that is why I send the same request 30 times). Sometimes the code finishes quickly, in about 0.5 s, but sometimes it gets stuck after sending, for example, half of the requests. Can anyone see a bug in the code that could cause this? Or is it caused by delays in the server's responses?
import asyncio
import time
from aiohttp import ClientSession

async def fetch(url, session):
    async with session.get(url) as response:
        data = await response.json()
        print(data)
        return await response.read()

async def run(r):
    url = "https://www.bitstamp.net/api/ticker/"
    tasks = []
    async with ClientSession() as session:
        for i in range(r):
            task = asyncio.ensure_future(fetch(url.format(i), session))
            tasks.append(task)

        responses = asyncio.gather(*tasks)
        await responses

t1 = time.time()
number = 30
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(number))
loop.run_until_complete(future)
t2 = time.time()
print(t2 - t1)
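If the stalls come from slow server responses (an assumption; nothing above confirms the cause), one way to keep a single slow request from holding up the whole batch is to put a timeout around each one. A minimal sketch follows, using a hypothetical `fake_request` that sleeps instead of calling the API, and an arbitrary timeout of 0.1 s:

```python
import asyncio

async def fake_request(delay):
    # Hypothetical stand-in for fetch(); `delay` simulates the server's response time.
    await asyncio.sleep(delay)
    return {'delay': delay}

async def fetch_with_timeout(delay, timeout):
    try:
        return await asyncio.wait_for(fake_request(delay), timeout=timeout)
    except asyncio.TimeoutError:
        return None  # treat a stalled request as missing instead of waiting on it

async def run(delays, timeout=0.1):
    return await asyncio.gather(*(fetch_with_timeout(d, timeout) for d in delays))

results = asyncio.run(run([0.0, 5.0, 0.01]))
print(results)
```

The request that would take 5 s is cancelled after the timeout and shows up as `None`, while the other two return normally. Logging which requests time out is also a simple way to tell server-side delays apart from a bug in the client code.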