I've created a script in Python using pyppeteer to collect the links of different posts from a webpage and then parse the title of each post by visiting its target page using those collected links. Although the content is static, I'd like to know how pyppeteer works in such cases.
I tried to pass the browser variable from the main() function to the fetch() and browse_all_links() functions so that I can reuse the same browser over and over again.
My current approach:
import asyncio
from pyppeteer import launch

url = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page, url):
    await page.goto(url)
    linkstorage = []
    await page.waitForSelector('.summary .question-hyperlink')
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    return linkstorage

async def browse_all_links(page, link):
    await page.goto(link)
    await page.waitForSelector('h1 > a')
    title = await page.querySelectorEval('h1 > a', '(e => e.innerText)')
    print(title)

async def main():
    browser = await launch(headless=False, autoClose=False)
    [page] = await browser.pages()
    links = await fetch(page, url)
    tasks = [await browse_all_links(page, url) for url in links]
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(main())
The above script fetches some titles but spits out the following error at some point within the execution:
Possible to select <a> with specific text within the quotes?
Crawler Runs Too Slow
How do I loop a list of ticker to scrape balance sheet info?
How to retrive the url of searched video from youtbe using python
VBA-JSON to import data from all pages in one table
Is there an algorithm that detects semantic visual blocks in a webpage?
find_all only scrape the last value
#ERROR STARTS
Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Runtime.releaseObject): Cannot find context with specified id')>
pyppeteer.errors.NetworkError: Protocol error (Runtime.releaseObject): Cannot find context with specified id
Future exception was never retrieved
As it's been two days since this question was posted with no answer yet, I will take this opportunity to address the issue in a way that I think might be helpful to you.
There are 15 links but you are getting only 7; this is probably because websockets is losing the connection and the page is not reachable anymore.
List comprehension
tasks = [await browse_all_links(page,url) for url in links] — what do you expect this list to be? If it succeeds, it will be a list of None elements, so your next line of code will throw an error!
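If you did want to keep asyncio.gather, the comprehension would have to collect coroutine objects instead of awaiting them. A minimal sketch of that alternative (note that with a single shared page the concurrent navigations would still interfere with each other, which is why the solution below simply drops gather):

# Hypothetical alternative: build coroutine objects (no await inside the
# comprehension) so asyncio.gather has something to schedule. With one shared
# page they would all drive the same tab, so this only illustrates the syntax.
tasks = [browse_all_links(page, link) for link in links]
await asyncio.gather(*tasks)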
Solution
Downgrade websockets from 7.0 to 6.0.
Remove this line of code: await asyncio.gather(*tasks).
I am using Python 3.6, so I had to change the last line of code.
You don't need to change it if you are using Python 3.7, which I think you are.
import asyncio
from pyppeteer import launch

url = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page, url):
    await page.goto(url)
    linkstorage = []
    await page.waitForSelector('.summary .question-hyperlink')
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    return linkstorage

async def browse_all_links(page, link):
    await page.goto(link)
    await page.waitForSelector('h1 > a')
    title = await page.querySelectorEval('h1 > a', '(e => e.innerText)')
    print(title)

async def main():
    browser = await launch(headless=False, autoClose=False)
    [page] = await browser.pages()
    links = await fetch(page, url)
    tasks = [await browse_all_links(page, url) for url in links]
    # await asyncio.gather(*tasks)
    await browser.close()

if __name__ == '__main__':
    # asyncio.run(main())
    asyncio.get_event_loop().run_until_complete(main())
Output
(testenv) C:\Py\pypuppeteer1>python stack3.py
Scrapy Shell response.css returns an empty array
Scrapy real-time spider
Why do I get KeyError while reading data with get request?
Scrapy spider can't redefine custom_settings according to args
Custom JS Script using Lua in Splash UI
Can someone explain why and how this piece of code works [on hold]
How can I extract required data from a list of strings?
Scrapy CrawlSpider rules for crawling single page
how to scrape a web-page with search bar results, when the search query does not appear in the url
Nested for loop keeps repeating
Get all tags except a list of tags BeautifulSoup
Get current URL using Python and webbot
How to login to site and send data
Unable to append value to colums. Getting error IndexError: list index out of range
NextSibling.Innertext not working. “Object doesn't support this property”
I want to scrape a website asynchronously using a list of Tor circuits with different exit nodes, making sure each exit node only makes a request every 5 seconds.
For testing purposes, I'm using the website https://books.toscrape.com/ and I'm lowering the sleep time, number of circuits, and number of pages to scrape.
I'm getting the following two errors when I use the --tor argument, both related to the torpy package:
'TorWebScraper' object has no attribute 'circuits'
'_GeneratorContextManager' object has no attribute 'create_stream'
Here is the relevant code causing the error:
async with aiohttp.ClientSession() as session:
    for circuit in self.circuits:
        async with circuit.create_stream() as stream:
            async with session.get(url, proxy=stream.proxy) as response:
                await asyncio.sleep(20e-3)
                text = await response.text()
                return url, text
There is more context here.
Your error is caused by the fact that your code starts an asyncio loop on object init, which is not a good practice:
class WebScraper(object):
    def __init__(self, urls: List[str]):
        self.urls = urls
        self.all_data = []
        self.master_dict = {}
        asyncio.run(self.run())
        # ^^^^^^

class TorWebScraper(WebScraper):
    def __init__(self, urls: List[str]):
        super().__init__(urls)
        # ^^^^^ this already called run() from the parent class
        self.circuits = get_circuits(3)
        asyncio.run(self.run())
        # ^^^^^ now run() is being called a second time
Ideally, to avoid issues like this, you should keep the logic code in your classes and separate out the run code into your script. In other words, move asyncio.run to asyncio.run(scrape_test_website()).
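A minimal sketch of that separation, reusing the names from your code (get_circuits and run() are assumed to keep their existing behavior; scrape_test_website is just an illustrative entry point):

import asyncio
from typing import List

class WebScraper(object):
    def __init__(self, urls: List[str]):
        self.urls = urls
        self.all_data = []
        self.master_dict = {}
        # no asyncio.run() here: the constructor only stores state

class TorWebScraper(WebScraper):
    def __init__(self, urls: List[str]):
        super().__init__(urls)           # safe now: the parent no longer starts a loop
        self.circuits = get_circuits(3)  # circuits exist before any request runs

async def scrape_test_website(urls: List[str]):
    # all event-loop handling lives in the script, not in the classes
    scraper = TorWebScraper(urls)
    await scraper.run()  # run() holds the async scraping logic, as before

if __name__ == '__main__':
    asyncio.run(scrape_test_website(['https://books.toscrape.com/']))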
I'm currently designing a discord bot that scrapes a web page that is constantly updating for patches related to a PBE server. I have the bot running through Heroku successfully right now. The issue I'm running into is I want to create an automated (timed loop) refresh that will reload the website I have requested. As it currently stands, it only loads one instance of the website and if that website changes/updates, none of my content will update as I'm using the "old" request of the website.
Is there a way for me to bury code inside a function so that I can create a timed loop or do I only need to create one around my website request? How would that look? Thanks!
from bs4 import BeautifulSoup
from urllib.request import urlopen
from discord.ext import commands
import discord

# what I want the commands to start with
bot = commands.Bot(command_prefix='!')

# instantiating discord client
token = "************************************"
client = discord.Client()

# begin the scraping of passed in web page
URL = "*********************************"
page = urlopen(URL)
soup = BeautifulSoup(page, 'html.parser')
# using soup to find all header tags with the news-title class and storing them in pbe_titles
pbe_titles = soup.find_all('h1', attrs={'class': 'news-title'})

linksAndTitles = []
counter = 0

# finding tags that start with 'a' as in a href and appending those titles/links
for tag in pbe_titles:
    for anchor in tag.find_all('a'):
        linksAndTitles.append(tag.text.strip())
        linksAndTitles.append(anchor['href'])

# counts number of lines stored inside linksAndTitles list
for i in linksAndTitles:
    counter = counter + 1
print(counter)

# separates list by line so that it looks nice when printing
allPatches = '\n'.join(str(line) for line in linksAndTitles[:counter])
# stores the first two lines in list which is the current pbe patch title and link
currPatch = '\n'.join(str(line) for line in linksAndTitles[:2])

# command that allows user to type in exactly what patch they want to see information for based off date
@bot.command(name='patch')
async def pbe_patch(ctx, *, arg):
    if any(item.startswith(arg) for item in linksAndTitles):
        await ctx.send(arg + " exists!")
    else:
        await ctx.send('The date you entered: ' + '"' + arg + '"' + ' does not have a patch associated with it or that patch expired.')

# command that displays the current, most up to date, patch
@bot.command(name='current')
async def current_patch(ctx):
    response = currPatch
    await ctx.send(response)

bot.run(token)
I've played around with while True: loops, but whenever I nest anything inside of them, I can't access the code in other places.
discord has a special decorator, tasks.loop, to run some code periodically:
from discord.ext import tasks

@tasks.loop(seconds=5.0)
async def scrape():
    # ... your scraping code ...

# ... your commands ...

scrape.start()
bot.run(token)
and it will repeat the scrape function every 5 seconds.
Documentation: tasks
On Linux I would eventually use the standard cron service to run a script periodically. That script could scrape the data and save it to a file or a database, and Discord could read from that file or database. But cron checks its tasks every 1 minute, so it can't run a task more often than that.
EDIT:
Minimal working code.
I use the page http://books.toscrape.com, which was created for learning scraping.
I changed a few elements. There is no need to create a client when there is a bot, because a bot is a special kind of client.
I keep the title and link as a dictionary
{
    'title': tag.text.strip(),
    'link': url + anchor['href'],
}
so later it is easier to create text like
title: A Light in the ...
link: http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
import os
import discord
from discord.ext import commands, tasks
from bs4 import BeautifulSoup
from urllib.request import urlopen

# default values at start (before `scrape` assigns new values)
# because some function may try to use these variables before `scrape` creates them
links_and_titles = []  # PEP8: `lower_case_names`
counter = 0
items = []

bot = commands.Bot(command_prefix='!')

@tasks.loop(seconds=5)
async def scrape():
    global links_and_titles
    global counter
    global items

    url = "http://books.toscrape.com/"
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    #pbe_titles = soup.find_all('h1', attrs={'class': 'news-title'})
    pbe_titles = soup.find_all('h3')

    # remove previous content
    links_and_titles = []
    for tag in pbe_titles:
        for anchor in tag.find_all('a'):
            links_and_titles.append({
                'title': tag.text.strip(),
                'link': url + anchor['href'],
            })

    counter = len(links_and_titles)
    print('counter:', counter)

    items = [f"title: {x['title']}\nlink: {x['link']}" for x in links_and_titles]

@bot.command(name='patch')
async def pbe_patch(ctx, *, arg=None):
    if arg is None:
        await ctx.send('Use: !patch date')
    elif any(item['title'].startswith(arg) for item in links_and_titles):
        await ctx.send(arg + " exists!")
    else:
        await ctx.send(f'The date you entered: "{arg}" does not have a patch associated with it or that patch expired.')

@bot.command(name='current')
async def current_patch(ctx, *, number=1):
    if items:
        responses = items[:number]
        text = '\n----\n'.join(responses)
        await ctx.send(text)
    else:
        await ctx.send('no patches')

scrape.start()

token = os.getenv('DISCORD_TOKEN')
bot.run(token)
PEP 8 -- Style Guide for Python Code
I've created a script using Python in association with pyppeteer to keep clicking on the next page button until there are no more pages. While clicking on the next page button, the script throws this error: pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded, pointing at this line: await page.waitForNavigation(). It can parse name and item_type from the landing page of that site, though. I know I can issue POST http requests with the appropriate payload to get data from there, but my intention is to make use of pyppeteer and keep clicking on the next page button while parsing the required fields.
website address
import asyncio
from pyppeteer import launch

link = "https://www.e-ports.com/ships"

async def get_content():
    wb = await launch(headless=True)
    [page] = await wb.pages()
    await page.goto(link)
    while True:
        await page.waitForSelector(".common_card", {'visible': True})
        elements = await page.querySelectorAll('.common_card')
        for element in elements:
            name = await element.querySelectorEval('span.title > a', 'e => e.innerText')
            item_type = await element.querySelectorEval('.bottom > span', 'e => e.innerText')
            print(name.strip(), item_type.strip())
        try:
            await page.click("button.btn-next")
            await page.waitForNavigation()
        except Exception:
            break

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(get_content())
Btw, if I manually click on the next page button for the first time, it accomplishes the rest successfully.
I don't know the exact syntax in Pyppeteer, but the common pattern for waitForNavigation in Puppeteer's JavaScript API is this one:
await Promise.all([
    page.waitForNavigation(),
    page.click("button.btn-next"),
])
With both calls inside the promised array, Promise.all resolves only when both have completed, i.e. the click has happened and the navigation it triggers has finished.
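In pyppeteer the same idea translates to asyncio.gather, which schedules the navigation wait and the click together so the navigation triggered by the click is not missed. A sketch only, not tested against that particular site:

# Python counterpart of Promise.all: start waiting for the navigation
# at the same time as the click that triggers it.
await asyncio.gather(
    page.waitForNavigation(),
    page.click("button.btn-next"),
)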
I'd like to embed some async code in my Python project to make the HTTP request part asynchronous. For example, I read params from Kafka, use these params to generate some URLs, and put the URLs into a list. If the length of the list is greater than 1000, I send the list to aiohttp to batch-fetch the responses.
I cannot change the whole project from sync to async, so I can only change the HTTP request part.
The code example is:
import asyncio
import aiohttp

async def async_request(url):
    async with aiohttp.ClientSession() as client:
        resp = await client.get(url)
        result = await resp.json()
        return result

async def do_batch_request(url_list, result):
    task_list = []
    for url in url_list:
        task = asyncio.create_task(async_request(url))
        task_list.append(task)
    batch_response = await asyncio.gather(*task_list)
    result.extend(batch_response)

def batch_request(url_list):
    batch_response = []
    asyncio.run(do_batch_request(url_list, batch_response))
    return batch_response

url_list = []
for msg in kafka_consumer:
    url = msg['url']
    url_list.append(url)
    if len(url_list) >= 1000:
        batch_response = batch_request(url_list)
        parse(batch_response)
        ....
As we know, asyncio.run will create an event loop to run the async function and then close the event loop. My question is: will my method influence the performance of the async code? And do you have a better way for my situation?
There's no serious problem with your approach, and you'll get a speed benefit from asyncio. The only possible problem is that if you later want to do something async elsewhere in the code, you won't be able to do it concurrently with batch_request.
There's not much to do about that if you don't want to change the whole project from sync to async, but if in the future you want to run batch_request in parallel with something else, keep in mind that you can run it in a thread and wait for the result asynchronously.
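For example, a minimal sketch of that thread-based option using the standard run_in_executor (batch_request is the function from your code; batch_request_async is a hypothetical wrapper name):

import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

async def batch_request_async(url_list):
    # Run the blocking batch_request (which starts its own event loop via
    # asyncio.run) in a worker thread and await its result without blocking
    # the caller's event loop.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, batch_request, url_list)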
I'm using Beautiful Soup 4 to scrape some text from a webpage into a Discord bot.
@commands.command(hidden=True)
async def roster(self):
    """A script using BS4."""
    url = "http://www.clandestine.pw/roster.html"
    async with aiohttp.get(url) as response:
        soupObject = BeautifulSoup(await response.text(), "html.parser")
    try:
        txt = soupObject.find("font", attrs={'size': '4'}).get_text()
        await self.bot.say(txt)
    except:
        await self.bot.say("Not found!")
Running the command, this returns "ThaIIen" (as it should). If I simply change find to find_all, it returns "Not found!" Why? Shouldn't that return every font size 4 text in the document?
find_all("font", attrs={'size': '4'}) will return:
[font_tag1, font_tag2, font_tag3 ....]
find("font", attrs={'size': '4'}) will return:
font_tag1
.get_text() is a method of a Tag object, not of the list object, so calling find_all(...).get_text() raises an exception, which your bare except then turns into "Not found!".
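A small sketch of what the find_all variant would need to look like, assuming you want to send the text of every size-4 font tag (the selector and bot call are taken from your code):

# find_all returns a list of Tag objects, so call get_text() on each element.
tags = soupObject.find_all("font", attrs={'size': '4'})
if tags:
    for tag in tags:
        await self.bot.say(tag.get_text())
else:
    await self.bot.say("Not found!")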