I am scraping blog URLs from a main page, and later I iterate over all the URLs to retrieve the text on each one.
Will a generator be faster if I move the loop into blogscraper and make it yield some_text? I guess the app will still be single-threaded, and it won't request the next pages while computing the text from the HTML.
Should I use asyncio, or are there better modules to make it parallel? Related: "Create generator that yields coroutine results as the coroutines finish".
Later I also want to build a small REST app for displaying the results.
def readmainpage(self):
    blogurls = []
    while nextPage:
        r = requests.get(url)
        ...
        blogurls += [new_url]
    return blogurls

def blogscraper(self, url):
    r = requests.get(url)
    ...
    return sometext

def run(self):
    blog_list = self.readmainpage()
    for blog in blog_list:
        data = self.blogscraper(blog['url'])
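Using the threading package, you can run your top-level function (or object initialization) concurrently. Each request runs in its own thread inside the same process (threads, not sub-processes), so the time spent waiting on the network overlaps. For example, if fetching a single page takes 2 minutes and you have 10 pages, fetching them in threads takes roughly 2 minutes in total instead of 20. See: Threading in Python 3.x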
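A minimal sketch of the threaded approach, using concurrent.futures from the standard library rather than raw threading.Thread objects. The fake_fetch function and the example.com URLs are hypothetical stand-ins for requests.get and the real blog URLs; the sleep only imitates network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url).text; sleeps to mimic latency.
    time.sleep(0.2)
    return f"text of {url}"


def scrape_all(urls, max_workers=10):
    # executor.map runs fake_fetch in worker threads and keeps the input order.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_fetch, urls))


if __name__ == "__main__":
    urls = [f"https://example.com/blog/{i}" for i in range(10)]
    start = time.perf_counter()
    texts = scrape_all(urls)
    print(len(texts), f"{time.perf_counter() - start:.2f}s")
```

Because the threads spend their time waiting rather than computing, the total time should stay close to the time of one fetch as long as max_workers is at least the number of URLs.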
With asyncio, you can try the aiohttp module:
pip install aiohttp
As an example, the code could look something like this. Some improvements are possible, but they depend on your code.
import asyncio
import socket
from urllib.parse import urlparse

import aiohttp


class YourClass:
    def __init__(self):
        self.url = "..."
        self.session = None  # created in __call__, inside the running event loop

    async def fetch(self, url):
        async with self.session.get(url) as resp:
            assert resp.status == 200
            return await resp.text()

    async def readmainpage(self):
        blogurls = []
        while nextPage:
            r = await self.fetch(self.url)
            # ...
            blogurls += [new_url]
        return blogurls

    async def blogscraper(self, url):
        r = await self.fetch(url)
        # ... extract the text here; the raw HTML is returned as a placeholder
        return r

    async def __call__(self):
        url_parsed = urlparse(self.url)
        # Create the session inside a coroutine so it is bound to the running
        # loop; "async with" also closes it when the block ends.
        async with aiohttp.ClientSession(
            headers={"Referer": f"{url_parsed.scheme}://{url_parsed.netloc}"},
            auto_decompress=True,
            # ssl=False skips certificate checks (replaces deprecated verify_ssl=False)
            connector=aiohttp.TCPConnector(family=socket.AF_INET, ssl=False),
        ) as session:
            self.session = session
            blog_list = await self.readmainpage()
            # Schedule every blog download at once, then wait for all results
            tasks = [asyncio.create_task(self.blogscraper(blog['url'])) for blog in blog_list]
            for data in await asyncio.gather(*tasks):
                print(data)


def main():
    fetcher = YourClass()
    asyncio.run(fetcher())


if __name__ == "__main__":
    main()
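The question also asks for a generator that yields results as the coroutines finish. One way to sketch that is an async generator built on asyncio.as_completed. The fake_scrape coroutine and its delays are hypothetical stand-ins for the aiohttp request plus parsing, so this only illustrates the ordering, not real scraping.

```python
import asyncio


async def fake_scrape(url, delay):
    # Hypothetical stand-in for an aiohttp request plus HTML parsing.
    await asyncio.sleep(delay)
    return f"text of {url}"


async def scrape_as_completed(jobs):
    # Start every download at once, then yield each result as soon as it is ready.
    tasks = [asyncio.create_task(fake_scrape(url, delay)) for url, delay in jobs]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def collect(jobs):
    # Consume the async generator; results arrive in completion order.
    return [text async for text in scrape_as_completed(jobs)]


if __name__ == "__main__":
    jobs = [("a", 0.3), ("b", 0.1), ("c", 0.2)]
    print(asyncio.run(collect(jobs)))
```

Unlike asyncio.gather, which returns everything at the end in input order, this hands each text to the caller as soon as its download completes.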
Related
Below are the Python code (program.py) and the requirements file (requirements.txt).
The function async def get_title_range() is not working properly; it raises the following error:
httpx.HTTPStatusError: Redirect response '301 Moved Permanently' for
url 'https://talkpython.fm/episodes/show/271' Redirect location:
'https://talkpython.fm/episodes/show/271/unlock-the-mysteries-of-time-pythons-datetime-that-is'
For more information check: https://httpstatuses.com/301
Python code, based on python 3.9 (program.py):
import asyncio
import datetime
import httpx
import bs4
from colorama import Fore

global loop


async def get_html(episode_number: int) -> str:
    print(Fore.YELLOW + f"Getting HTML for episode {episode_number}", flush=True)
    url = f"https://talkpython.fm/episodes/show/{episode_number}"
    async with httpx.AsyncClient() as client:
        resp = await client.get(url)
        resp.raise_for_status()
        return resp.text


def get_title(html: str, episode_number: int) -> str:
    print(Fore.CYAN + f"Getting TITLE for episode {episode_number}", flush=True)
    soup = bs4.BeautifulSoup(html, 'html.parser')
    header = soup.select_one('h1')
    if not header:
        return "MISSING"
    return header.text.strip()


def main():
    t0 = datetime.datetime.now()
    global loop
    loop = asyncio.get_event_loop()
    loop.run_until_complete(get_title_range())
    dt = datetime.datetime.now() - t0
    print(f"Done in {dt.total_seconds():.2f} sec.")


async def get_title_range():
    tasks = []
    for n in range(270, 280):
        tasks.append((n, loop.create_task(get_html(n))))
    for n, t in tasks:
        html = await t
        title = get_title(html, n)
        print(Fore.WHITE + f"Title found: {title}", flush=True)


if __name__ == '__main__':
    main()
The requirements (requirements.txt):
bs4
colorama
httpx
Here is my code using the requests-html AsyncHTMLSession in FastAPI:
@app.get('/')
async def ScrapeData(pages: Optional[int] = 1):
    crawle = Crawler()
    for page in range(1, pages + 1):
        url = f"url here"
        asession = AsyncHTMLSession()
        r = await asession.get(url)
        await r.html.arender(sleep=1)
        widget = r.html.xpath('//*[@id="widgetContent"]')[0]
        items = widget.find('div')
        crawle.GetData(items)
    return crawle.data
You need to explicitly enable redirects in httpx (unlike in requests). From their docs:
Unlike requests, HTTPX does not follow redirects by default.
We differ in behaviour here because auto-redirects can easily mask unnecessary network calls being made.
You can still enable behaviour to automatically follow redirects, but you need to do so explicitly...
response = client.get(url, follow_redirects=True)
Or else instantiate a client, with redirect following enabled by default...
client = httpx.Client(follow_redirects=True)
I'm creating an optimized multi-threading app using asyncio and want to add a rotating proxy into the mix.
Starting with a sample taken from this outstanding article:
Speed Up Your Python Program With Concurrency
I added a rotating proxy and it stopped working. The code simply exits the function after touching the line for the proxy.
This little snippet of code works, but not when added to the main script as shown in the screenshot above.
import asyncio
import random as rnd


async def download_site():
    proxy_list = [
        ('38.39.205.220:80'),
        ('38.39.204.100:80'),
        ('38.39.204.101:80'),
        ('38.39.204.94:80')
    ]
    await asyncio.sleep(1)
    proxy = rnd.choice(proxy_list)
    print(proxy)


asyncio.run(download_site())
And here's the full sample:
import asyncio
import time
import aiohttp

# Sample code taken from here:
# https://realpython.com/python-concurrency/#asyncio-version
# Info for adding headers for the proxy (Scroll toward the bottom)
# https://docs.aiohttp.org/en/stable/client_advanced.html
# Good read to possibly improve performance on large lists of URLs
# https://asyncio.readthedocs.io/en/latest/webscraper.html

# RUN THIS METHOD TO SEE HOW IT WORKS.
# # Original Code (working...)
# async def download_site(session, url):
#     async with session.get(url, proxy="http://proxy.com") as response:
#         print("Read {0} from {1}".format(response.content_length, url))


def get_proxy(self):
    proxy_list = [
        (754, '38.39.205.220:80'),
        (681, '38.39.204.100:80'),
        (682, '38.39.204.101:80'),
        (678, '38.39.204.94:80')
    ]
    proxy = random.choice(proxy_list)
    print(proxy[1])
    return proxy


async def download_site(session, url):
    proxy_list = [
        ('38.39.205.220:80'),
        ('38.39.204.100:80'),
        ('38.39.204.101:80'),
        ('38.39.204.94:80')
    ]
    await asyncio.sleep(1)
    proxy = rnd.choice(proxy_list)
    print(proxy)
    async with session.get(url, proxy="http://" + proxy) as response:
        print("Read {0} from {1}".format(response.content_length, url))


async def download_all_sites(sites):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in sites:
            task = asyncio.ensure_future(download_site(session, url))
            tasks.append(task)
        await asyncio.gather(*tasks, return_exceptions=True)


# Modified to loop thru only 1 URL to make debugging simple
if __name__ == "__main__":
    sites = [
        "https://www.jython.org",
        # "http://olympus.realpython.org/dice",
    ]  # * 80
    start_time = time.time()
    asyncio.get_event_loop().run_until_complete(download_all_sites(sites))
    duration = time.time() - start_time
    print(f"Downloaded {len(sites)} sites in {duration} seconds")
Thank you for any help you can offer.
You use return_exceptions=True but you don't actually check the returned results for errors. You can use asyncio.as_completed to handle exceptions and get the earliest next result:
import asyncio
import random
import traceback

import aiohttp

URLS = ("https://stackoverflow.com",)
# Overall per-request time limit, in seconds
TIMEOUT = aiohttp.ClientTimeout(total=5)
PROXIES = (
    "http://38.39.205.220:80",
    "http://38.39.204.100:80",
    "http://38.39.204.101:80",
    "http://38.39.204.94:80",
)


def get_proxy():
    return random.choice(PROXIES)


async def download_site(session, url):
    proxy = get_proxy()
    print(f"Got proxy: {proxy}")
    async with session.get(url, proxy=proxy, timeout=TIMEOUT) as resp:
        print(f"{url}: {resp.status}")
        return await resp.text()


async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in URLS:
            tasks.append(asyncio.create_task(download_site(session, url)))
        # Handle each download as soon as it finishes, reporting any failure
        for coro in asyncio.as_completed(tasks):
            try:
                html = await coro
            except Exception:
                traceback.print_exc()
            else:
                print(len(html))


if __name__ == "__main__":
    asyncio.run(main())
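If you would rather keep asyncio.gather with return_exceptions=True, the returned list has to be inspected, because exceptions come back as ordinary values instead of being raised. The following is only a sketch: flaky_download is a hypothetical stand-in for download_site, and the NameError it raises imitates the kind of error that can be hidden this way.

```python
import asyncio


async def flaky_download(url):
    # Hypothetical stand-in for download_site(); fails for URLs containing "bad".
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise NameError("name 'rnd' is not defined")
    return f"content of {url}"


async def download_all(urls):
    results = await asyncio.gather(
        *(flaky_download(u) for u in urls), return_exceptions=True
    )
    ok, failed = {}, {}
    # gather keeps the input order, so results line up with urls.
    for url, result in zip(urls, results):
        if isinstance(result, BaseException):
            failed[url] = result
        else:
            ok[url] = result
    return ok, failed


if __name__ == "__main__":
    ok, failed = asyncio.run(download_all(["https://good.example", "https://bad.example"]))
    print(ok)
    print(failed)
```

Without the isinstance check, a failing task looks exactly like the symptom in the question: the coroutine stops at the failing line and nothing is reported.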
Hi, I'm stuck on a problem with asyncio. I'm using Python 3.7.3.
Sorry, I'm not a native English speaker.
I'm writing a script to get a lyric from Genius.
This is my script.
Each lookup makes up to 6 requests if the result never arrives.
I send the GET requests 2 at a time, at almost the same time, so it runs as 2*3.
I check each result, and once I have the lyric I want to stop the other tasks,
so that fewer requests are made.
So I used cancel() and tried to raise asyncio.exceptions.CancelledError when I got the lyric, but it doesn't work well.
It shows RuntimeError: Event loop is closed, and I don't know why.
Could someone familiar with this situation please help me?
import asyncio
import requests
from bs4 import BeautifulSoup, Comment

#Lyrics__Container-sc-1ynbvzw-2


class Lyric():
    def __init__(self, artist, song_name):
        self.artist = artist
        self.song_name = song_name
        self.__gtask = []
        self.__canceled = False
        self.__lyric = ''
        self.genius_url = self.make_genius_url(artist, song_name)
        lyric = self.lyric_from_genius(self.genius_url)
        print(lyric)

    def make_genius_url(self, artist, song_name):
        search_song = f'{artist} {song_name}'
        search_song = search_song.replace(' ', '-')
        print(search_song)
        return f'https://genius.com/{search_song}-lyrics'

    def get_soup(self, url):
        r = requests.get(url)
        if r.status_code == 200:
            soup = BeautifulSoup(r.content, 'lxml')
            return soup
        else:
            return False

    def scrape_genius(self, url):
        soup = self.get_soup(url)
        if soup and not self.__canceled:
            lyric_soup = soup.select('.song_body-lyrics .lyrics p')
            if lyric_soup:
                self.__canceled = True
                tags = lyric_soup[0].find_all(['a', 'i'])
                for tag in tags:
                    tag.unwrap()
                print('The lyrics start here.')
                print(lyric_soup[0].text)
                self.__lyric = lyric_soup[0].text
                self.__gtask.cancel()
            else:
                print('Could not get the lyric information.')
        else:
            if self.__canceled:
                print('Lyric already retrieved')
            else:
                print('No lyric information')
            self.__gtask.cancel()

    def lyric_from_genius(self, url):
        async def main_loop(url):
            sem = asyncio.Semaphore(2)

            async def get_lyric_soup(url):
                async with sem:
                    await self.loop.run_in_executor(None, self.scrape_genius, url)

            # main_loop processing
            for _ in range(6):
                self.__gtask += [get_lyric_soup(url)]
            return await asyncio.gather(*self.__gtask)

        try:
            self.loop = asyncio.new_event_loop()
            self.loop.run_until_complete(main_loop(url))
        except asyncio.exceptions.CancelledError as e:
            print("*** CancelledError ***", e)
        finally:
            if self.__lyric:
                return self.__lyric
            else:
                print('Could not get the song information in 5 requests.')


Lyric = Lyric('kamal', 'blue')
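One possible shape for "stop the remaining requests once a lyric is found" is sketched below, under loud assumptions: attempt is a hypothetical awaitable stand-in for a single lyric request (here only attempt number 3 succeeds), and the function names are made up for illustration. It keeps real Task objects, which do have cancel(), instead of a plain list of coroutines.

```python
import asyncio


async def attempt(n, sem):
    # Hypothetical stand-in for one lyric request; only attempt 3 "finds" it.
    async with sem:
        await asyncio.sleep(0.05)
        return "la la la" if n == 3 else None


async def first_lyric(attempts=6, concurrency=2):
    sem = asyncio.Semaphore(concurrency)
    pending = {asyncio.create_task(attempt(n, sem)) for n in range(attempts)}
    lyric = None
    while pending and lyric is None:
        # Wake up whenever at least one attempt has finished.
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            if task.result():
                lyric = task.result()
                break
    # Cancel the attempts that have not finished, then let them settle.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return lyric


if __name__ == "__main__":
    print(asyncio.run(first_lyric()))
```

Note that cancelling a task cannot interrupt a blocking requests.get call that is already running in an executor thread; it only prevents further awaiting, so an awaitable HTTP client would be needed for the cancellation to save requests in practice.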
I am writing a simple producer/consumer app to call multiple URLs asynchronously.
In the following code, if I set conn_count=1 and add 2 items to the Queue, it works fine, as only one consumer is created. But if I set conn_count=2 and add 4 items to the Queue, only 3 requests are made. The other request fails with ClientConnectorError.
Can you please help me debug the reason for the failure with multiple consumers? Thank you.
I am using an echo server I created.
Server:
import os
import logging.config
import yaml
from aiohttp import web
import json


def start():
    setup_logging()
    app = web.Application()
    app.router.add_get('/', do_get)
    app.router.add_post('/', do_post)
    web.run_app(app)


async def do_get(request):
    return web.Response(text='hello')


async def do_post(request):
    data = await request.json()
    return web.Response(text=json.dumps(data))


def setup_logging(
    default_path='logging.yaml',
    default_level=logging.INFO,
    env_key='LOG_CFG'
):
    path = default_path
    value = os.getenv(env_key, None)
    if value:
        path = value
    if os.path.exists(path):
        with open(path, 'rt') as f:
            config = yaml.safe_load(f.read())
        logging.config.dictConfig(config)
    else:
        logging.basicConfig(level=default_level)


if __name__ == '__main__':
    start()
Client:
import asyncio
import collections
import json
import sys
import async_timeout
from aiohttp import ClientSession, TCPConnector

MAX_CONNECTIONS = 100
URL = 'http://localhost:8080'
InventoryAccount = collections.namedtuple("InventoryAccount", "op_co customer_id")


async def produce(queue, num_consumers):
    for i in range(num_consumers * 2):
        await queue.put(InventoryAccount(op_co=i, customer_id=i * 100))
    for j in range(num_consumers):
        await queue.put(None)


async def consumer(n, queue, session, responses):
    print('consumer {}: starting'.format(n))
    while True:
        try:
            account = await queue.get()
            if account is None:
                queue.task_done()
                break
            else:
                print(f"Consumer {n}, Updating cloud prices for account: opCo = {account.op_co!s}, customerId = {account.customer_id!s}")
                params = {'opCo': account.op_co, 'customerId': account.customer_id}
                headers = {'content-type': 'application/json'}
                with async_timeout.timeout(10):
                    print(f"Consumer {n}, session state " + str(session.closed))
                    async with session.post(URL,
                                            headers=headers,
                                            data=json.dumps(params)) as response:
                        assert response.status == 200
                        responses.append(await response.text())
                queue.task_done()
        except:
            e = sys.exc_info()[0]
            print(f"Consumer {n}, Error updating cloud prices for account: opCo = {account.op_co!s}, customerId = {account.customer_id!s}. {e}")
            queue.task_done()
    print('consumer {}: ending'.format(n))


async def start(loop, session, num_consumers):
    queue = asyncio.Queue(maxsize=num_consumers)
    responses = []
    consumers = [asyncio.ensure_future(loop=loop, coro_or_future=consumer(i, queue, session, responses)) for i in range(num_consumers)]
    await produce(queue, num_consumers)
    await queue.join()
    for consumer_future in consumers:
        consumer_future.cancel()
    return responses


async def run(loop, conn_count):
    async with ClientSession(loop=loop, connector=TCPConnector(verify_ssl=False, limit=conn_count)) as session:
        result = await start(loop, session, conn_count)
        print("Result: " + str(result))


if __name__ == '__main__':
    conn_count = 2
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(run(loop, conn_count))
    finally:
        loop.close()
Reference:
https://pymotw.com/3/asyncio/synchronization.html
https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
https://hackernoon.com/asyncio-for-the-working-python-developer-5c468e6e2e8e
First of all, here's the code:
import random
import asyncio
from aiohttp import ClientSession
import csv

headers = []


def extractsites(file):
    sites = []
    readfile = open(file, "r")
    reader = csv.reader(readfile, delimiter=",")
    raw = list(reader)
    for a in raw:
        sites.append((a[1]))
    return sites


async def bound_fetch(sem, url):
    async with sem:
        print("doing request for " + url)
        async with ClientSession() as session:
            async with session.get(url) as response:
                responseheader = await response.headers
                print(headers)


async def run():
    urls = extractsites("cisco-umbrella.csv")
    tasks = []
    sem = asyncio.Semaphore(100)
    for i in urls:
        task = asyncio.ensure_future(bound_fetch(sem, "http://" + i))
        tasks.append(task)
    headers = await asyncio.wait(*tasks)
    print(headers)


def main():
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(run())
    loop.run_until_complete(future)


if __name__ == '__main__':
    main()
As per my last question, I'm following this blog post:
https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
I tried to keep my code as close as possible to the example implementation, but it is still not making any requests or printing the headers in bound_fetch as I want.
Can somebody spot what's wrong with this code?
response.headers is a regular property, so there is no need to put await before it.
asyncio.wait, on the other hand, accepts a single collection of futures (not unpacked arguments) and returns a (done, pending) pair.
It looks like you should replace the await asyncio.wait(*tasks) call with await asyncio.gather(*tasks) (gather doc)
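To make the difference concrete, here is a small sketch that calls both functions on the same tasks. fetch_headers is a hypothetical stand-in for the aiohttp request and returns a plain dict instead of real response headers.

```python
import asyncio


async def fetch_headers(url):
    # Hypothetical stand-in for an aiohttp request returning response.headers.
    await asyncio.sleep(0.01)
    return {"url": url}


async def with_gather(urls):
    tasks = [asyncio.ensure_future(fetch_headers(u)) for u in urls]
    # gather takes the awaitables unpacked and returns results in input order.
    return await asyncio.gather(*tasks)


async def with_wait(urls):
    tasks = [asyncio.ensure_future(fetch_headers(u)) for u in urls]
    # wait takes one collection and returns two sets of tasks, not results.
    done, pending = await asyncio.wait(tasks)
    return {task.result()["url"] for task in done}, pending


if __name__ == "__main__":
    print(asyncio.run(with_gather(["a", "b"])))
    print(asyncio.run(with_wait(["a", "b"])))
```

With gather you get the header dicts directly; with wait you still have to call result() on each finished task, and the done set carries no ordering.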