Difficulty with find_all in BS4 - python

I'm using Beautiful Soup 4 to scrape some text from a webpage into a Discord bot.
@commands.command(hidden=True)
async def roster(self):
    """A script using BS4."""
    url = "http://www.clandestine.pw/roster.html"
    async with aiohttp.get(url) as response:
        soupObject = BeautifulSoup(await response.text(), "html.parser")
    try:
        txt = soupObject.find("font", attrs={'size': '4'}).get_text()
        await self.bot.say(txt)
    except:
        await self.bot.say("Not found!")
Running the command, this returns "ThaIIen" (as it should). If I simply change find to find_all, it returns "Not found!" Why? Shouldn't that return every font size 4 text in the document?

find_all("font", attrs={'size': '4'}) will return:
[font_tag1, font_tag2, font_tag3 ....]
find("font", attrs={'size': '4'}) will return:
font_tag1
.get_text() is a method of a Tag object, not of the list that find_all returns, so calling find_all(...).get_text() raises an exception, which your bare except then turns into "Not found!".
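A minimal sketch of how the same command could handle the list that find_all returns, reusing the soupObject and self.bot.say from the question: call get_text() on each tag in the list rather than on the list itself.

# find_all returns a list of Tag objects, so iterate over it and
# call get_text() on each tag rather than on the list
tags = soupObject.find_all("font", attrs={'size': '4'})
texts = [tag.get_text() for tag in tags]
await self.bot.say("\n".join(texts))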

Related

Is there a way for my discord bot to access the tenor website and then just add all gifs under a certain tag to its output?

Title says it all. I have this discord bot that basically uploads cat gifs whenever a certain keyword or command is used. With my current code, I have to manually add the tenor/gif link to the output set so it can display that gif. Instead, I want the bot to just post any gifs of cats from tenor or any other gif website. I'm pretty sure those websites have a tag feature that assigns, for example, the tag "cat" to a cat gif. I want to know which gifs are tagged "cat" and just add those gifs to the bot's output set. Is there a way I can do this?
import discord
import os
import random

client = discord.Client()

cat_pictures = ["cats", "cat"]
cat_encouragements = [
    "https://tenor.com/view/dimden-cat-cute-cat-cute-potato-gif-20953746",
    "https://tenor.com/view/dimden-cute-cat-cute-cat-potato-gif-20953747",
    "https://tenor.com/view/cute-cat-cute-cat-dimden-gif-19689251",
    "https://tenor.com/view/dimden-cute-cat-cute-cat-potato-gif-21657791",
    "https://tenor.com/view/cats-kiss-gif-10385036",
    "https://tenor.com/view/cute-kitty-best-kitty-alex-cute-pp-kitty-omg-yay-cute-kitty-munchkin-kitten-gif-15917800",
    "https://tenor.com/view/cute-cat-oh-yeah-awesome-cats-amazing-gif-15805236",
    "https://tenor.com/view/cat-broken-cat-cat-drinking-cat-licking-cat-air-gif-20661740",
    "https://tenor.com/view/funny-animals-cute-chicken-cat-fight-dinner-time-gif-8953000"]

@client.event
async def on_ready():
    print('We have logged in as catbot '.format(client))

@client.event
async def on_message(message):
    if message.author == client.user:
        return
    if message.content.startswith('!help'):
        await message.channel.send('''
I only have two commands right now which are !cat which posts an image of a cat. !cats which gives you a video/gif
''')
    if any(word in message.content for word in cat_pictures):
        await message.channel.send(random.choice(cat_encouragements))
    if any(word in message.content for word in cat_apology):
        await message.channel.send(random.choice(cat_sad))
    if any(word in message.content for word in cat_dog):
        await message.channel.send(random.choice(cat_dogs))

client.run(os.getenv('TOKEN'))
If you want to get data from another page then you have to learn how to "scrape".
For some pages you may need requests (or urllib) to get the HTML from the server and beautifulsoup (or lxml) to search for data in that HTML. Pages often use JavaScript to add elements, so you may also need Selenium to control a real web browser which can run JavaScript (because requests, urllib, beautifulsoup and lxml can't run JavaScript).
But first you should check if the page has an API for developers, which returns the data in a simpler way - as JSON - so you don't have to search through HTML.
As @ChrisDoyle noticed, there is documentation for the tenor API.
This documentation even shows an example in Python (using requests) which gets the JSON data. The example only needs to be extended to show how to pull the URLs out of the JSON, because the JSON also contains other information - image sizes, gifs, small gifs, animated gifs, mp4, etc.
This is my version, based on the example from the documentation:
import requests

# set the apikey and limit
API_KEY = "LIVDSRZULELA"  # test value
search_term = "cat"

def get_urls(search, limit=8):
    payload = {
        'key': API_KEY,
        'limit': limit,
        'q': search,
    }

    # our test search
    # get the top 8 GIFs for the search term
    r = requests.get("https://g.tenor.com/v1/search", params=payload)

    results = []

    if r.status_code == 200:
        data = r.json()
        #print('[DEBUG] data:', data)
        for item in data['results']:
            #print('[DEBUG] item:', item)
            for media in item['media']:
                #print('[DEBUG] media:', media)
                #for key, value in media.items():
                #    print(f'{key:10}:', value['url'])
                #print('----')
                if 'tinygif' in media:
                    results.append(media['tinygif']['url'])
    else:
        results = []

    return results

# --- main ---

cat_encouragements = get_urls('cat')

for url in cat_encouragements:
    print(url)
Which gives URLs pointing directly to the tiny gif images:
https://media.tenor.com/images/eff22afc2220e9df92a7aa2f53948f9f/tenor.gif
https://media.tenor.com/images/e0f28542d811073f2b3d223e8ed119f3/tenor.gif
https://media.tenor.com/images/75b3c8eca95d917c650cd574b91db7f7/tenor.gif
https://media.tenor.com/images/80aa0a25bee9defa1d1d7ecaab75f3f4/tenor.gif
https://media.tenor.com/images/042ef64f591bdbdf06edf17e841be4d9/tenor.gif
https://media.tenor.com/images/1e9df4c22da92f1197b997758c1b3ec3/tenor.gif
https://media.tenor.com/images/6562518088b121eab2d19917b65ee793/tenor.gif
https://media.tenor.com/images/eafc0f0bef6d6fd135908eaba24393ac/tenor.gif
If you uncomment some of the print() calls in the code, you will see more information.
For example, the links from media.items() for a single image:
nanowebm : https://media.tenor.com/videos/513b211140bedc05d5ab3d8bc3456c29/webm
tinywebm : https://media.tenor.com/videos/7c1777a988eedb267a6b7d7ed6aaa858/webm
mp4 : https://media.tenor.com/videos/146935e698960bf723a1cd8031f6312f/mp4
loopedmp4 : https://media.tenor.com/videos/e8be91958367e8dc4e6a079298973362/mp4
nanomp4 : https://media.tenor.com/videos/4d46f8b4e95a536d2e25044a0a288968/mp4
tinymp4 : https://media.tenor.com/videos/390f512fd1900b47a7d2cc516dd3283b/mp4
tinygif : https://media.tenor.com/images/eff22afc2220e9df92a7aa2f53948f9f/tenor.gif
mediumgif : https://media.tenor.com/images/c90bf112a9292c442df9310ba5e140fd/tenor.gif
nanogif : https://media.tenor.com/images/6f6eb54b99e34a8128574bd860d70b2f/tenor.gif
gif : https://media.tenor.com/images/8ab88b79885ab587f84cbdfbc3b87835/tenor.gif
webm : https://media.tenor.com/videos/926d53c9889d7604da6745cd5989dc3c/webm
In the code I use API_KEY = "LIVDSRZULELA" from the documentation, but you should register on the page to get your own unique API_KEY.
API keys taken from documentation usually have restrictions or always return the same data - they are created only for tests, not for use in a real application.
The documentation shows more methods to get and filter images, e.g. to get trending images.
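One way to plug get_urls() into the bot from the question (a sketch reusing the client, cat_pictures and random already defined in the OP's code) is to fetch fresh URLs when a cat keyword appears, instead of keeping a hard-coded list:

@client.event
async def on_message(message):
    if message.author == client.user:
        return
    # fetch fresh cat gif URLs from the tenor API instead of a hard-coded list
    if any(word in message.content for word in cat_pictures):
        urls = get_urls('cat')
        if urls:
            await message.channel.send(random.choice(urls))

Note that requests is blocking, so in a real bot you may want to run get_urls() in an executor or rewrite it with aiohttp, so the event loop is not blocked while the request runs.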

Creating a timed loop inside Discord Bot script to reload web page (web scraper bot)

I'm currently designing a discord bot that scrapes a web page that is constantly updating for patches related to a PBE server. I have the bot running through Heroku successfully right now. The issue I'm running into is I want to create an automated (timed loop) refresh that will reload the website I have requested. As it currently stands, it only loads one instance of the website and if that website changes/updates, none of my content will update as I'm using the "old" request of the website.
Is there a way for me to bury code inside a function so that I can create a timed loop or do I only need to create one around my website request? How would that look? Thanks!
from bs4 import BeautifulSoup
from urllib.request import urlopen
from discord.ext import commands
import discord

# what I want the commands to start with
bot = commands.Bot(command_prefix='!')

# instantiating discord client
token = "************************************"
client = discord.Client()

# begin the scraping of passed in web page
URL = "*********************************"
page = urlopen(URL)
soup = BeautifulSoup(page, 'html.parser')
# using soup to find all header tags with the news-title class and storing them in pbe_titles
pbe_titles = soup.find_all('h1', attrs={'class': 'news-title'})

linksAndTitles = []
counter = 0

# finding tags that start with 'a' as in a href and appending those titles/links
for tag in pbe_titles:
    for anchor in tag.find_all('a'):
        linksAndTitles.append(tag.text.strip())
        linksAndTitles.append(anchor['href'])

# counts number of lines stored inside linksAndTitles list
for i in linksAndTitles:
    counter = counter + 1
print(counter)

# separates list by line so that it looks nice when printing
allPatches = '\n'.join(str(line) for line in linksAndTitles[:counter])
# stores the first two lines in list which is the current pbe patch title and link
currPatch = '\n'.join(str(line) for line in linksAndTitles[:2])

# command that allows user to type in exactly what patch they want to see information for based off date
@bot.command(name='patch')
async def pbe_patch(ctx, *, arg):
    if any(item.startswith(arg) for item in linksAndTitles):
        await ctx.send(arg + " exists!")
    else:
        await ctx.send('The date you entered: ' + '"' + arg + '"' + ' does not have a patch associated with it or that patch expired.')

# command that displays the current, most up to date, patch
@bot.command(name='current')
async def current_patch(ctx):
    response = currPatch
    await ctx.send(response)

bot.run(token)
I've played around with while True: loops, but whenever I nest anything inside of them, I can't access that code in other places.
discord has a special decorator, tasks.loop, to run some code periodically:
from discord.ext import tasks

@tasks.loop(seconds=5.0)
async def scrape():
    # ... your scraping code ...

# ... your commands ...

scrape.start()
bot.run(token)
and it will repeat the function scrape every 5 seconds.
Documentation: tasks
On Linux I would eventually use the standard cron service to run some script periodically. That script could scrape the data and save it to a file or database, and the Discord bot could read from that file or database. But cron checks its tasks every 1 minute, so it can't run a task more often than that.
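A sketch of that cron-driven variant, with hypothetical names (scrape_patches.py, /tmp/patches.json, a placeholder URL) and the OP's news-title selector: cron runs the script, the script scrapes and saves JSON, and the bot's commands would read that file instead of scraping the page themselves.

# scrape_patches.py - example standalone script for cron, e.g. with the entry:
#   * * * * * /usr/bin/python3 /home/user/scrape_patches.py
import json
from urllib.request import urlopen
from bs4 import BeautifulSoup

URL = "https://example.com/pbe-patches"   # placeholder for the real patch page
page = urlopen(URL)
soup = BeautifulSoup(page, 'html.parser')

links_and_titles = []
for tag in soup.find_all('h1', attrs={'class': 'news-title'}):
    for anchor in tag.find_all('a'):
        links_and_titles.append({'title': tag.text.strip(), 'link': anchor['href']})

# the Discord bot reads this file in its commands instead of scraping itself
with open('/tmp/patches.json', 'w') as f:
    json.dump(links_and_titles, f)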
EDIT:
Minimal working code.
I use the page http://books.toscrape.com, which was created for practising scraping.
I changed a few elements. There is no need to create a client when there is a bot, because a bot is a special kind of client.
I keep the title and link as a dictionary
{
    'title': tag.text.strip(),
    'link': url + anchor['href'],
}
so later it is easier to create text like
title: A Light in the ...
link: http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
import os
import discord
from discord.ext import commands, tasks
from bs4 import BeautifulSoup
from urllib.request import urlopen

# default values at start (before `scrape` assigns new values)
# because some function may try to use these variables before `scrape` creates them
links_and_titles = []   # PEP8: `lower_case_names`
counter = 0
items = []

bot = commands.Bot(command_prefix='!')

@tasks.loop(seconds=5)
async def scrape():
    global links_and_titles
    global counter
    global items

    url = "http://books.toscrape.com/"
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')

    #pbe_titles = soup.find_all('h1', attrs={'class': 'news-title'})
    pbe_titles = soup.find_all('h3')

    # remove previous content
    links_and_titles = []

    for tag in pbe_titles:
        for anchor in tag.find_all('a'):
            links_and_titles.append({
                'title': tag.text.strip(),
                'link': url + anchor['href'],
            })

    counter = len(links_and_titles)
    print('counter:', counter)

    items = [f"title: {x['title']}\nlink: {x['link']}" for x in links_and_titles]

@bot.command(name='patch')
async def pbe_patch(ctx, *, arg=None):
    if arg is None:
        await ctx.send('Use: !patch date')
    elif any(item['title'].startswith(arg) for item in links_and_titles):
        await ctx.send(arg + " exists!")
    else:
        await ctx.send(f'The date you entered: "{arg}" does not have a patch associated with it or that patch expired.')

@bot.command(name='current')
async def current_patch(ctx, *, number=1):
    if items:
        responses = items[:number]
        text = '\n----\n'.join(responses)
        await ctx.send(text)
    else:
        await ctx.send('no patches')

scrape.start()

token = os.getenv('DISCORD_TOKEN')
bot.run(token)
PEP 8 -- Style Guide for Python Code

I want to use kafka producer with python beautifulsoup to send message to kafka broker

I am using kafka-python and BeautifulSoup to scrape a website that I visit often, and to send a message to a Kafka broker with a Python producer.
What I want to do is: whenever a new post is uploaded to the website (it is a community site, a bit like Reddit, mostly used by Korean hip-hop fans to share information etc.), that post should be sent to the Kafka broker.
However, my problem is that within the while loop, only the latest post keeps being sent to the Kafka broker repeatedly.
This is not what I want.
Also, the second problem is that when a new post is loaded,
an HTTP Error 502: Bad Gateway error occurs on
soup = BeautifulSoup(urllib.request.urlopen("http://hiphople.com/kboard").read(), "html.parser")
and no messages are sent anymore.
this is dataScraping.py
from bs4 import BeautifulSoup
import re
import urllib.request

pattern = re.compile('[0-9]+')

def parseContent():
    soup = BeautifulSoup(urllib.request.urlopen("http://hiphople.com/kboard").read(), "html.parser")
    for div in soup.find_all("tr", class_="notice"):
        div.decompose()

    key_num = pattern.findall(soup.find_all("td", class_="no")[0].text)
    category = soup.find_all("td", class_="categoryTD")[0].find("span").text
    author = soup.find_all("td", class_="author")[0].find("span").text
    title = soup.find_all("td", class_="title")[0].find("a").text
    link = "http://hiphople.com" + soup.find_all("td", class_="title")[0].find("a").attrs["href"]

    soup2 = BeautifulSoup(urllib.request.urlopen(link).read(), "html.parser")
    content = str(soup2.find_all("div", class_="article-content")[0].find_all("p"))
    content = re.sub("<.+?>", "", content, 0).strip()
    content = re.sub("\xa0", "", content, 0).strip()

    result = {"key_num": key_num, "catetory": category, "title": title, "author": author, "content": content}
    return result

if __name__ == "__main__":
    print("data scraping from website")
if __name__ == "__main__":
print("data scraping from website")
and this is PythonWebScraping.py
import json
from kafka import KafkaProducer
from dataScraping import parseContent

def json_serializer(data):
    return json.dumps(data).encode("utf-8")

producer = KafkaProducer(acks=1, compression_type="gzip", bootstrap_servers=["localhost:9092"],
                         value_serializer=json_serializer)

if __name__ == "__main__":
    while (True):
        result = parseContent()
        producer.send("hiphople", result)
Please let me know how to fix my code so I can send newly created post to kafka broker as I expected.
Your function is working, but it's true that it returns only one event. I did not get a 502 Bad Gateway; you may be hitting DDoS protection because you access the URL too many times. Try adding delays/sleep, or your IP may have been banned to stop it from scraping the URL.
For your second error: your function returns only one (the last) message, and you send that result to Kafka on every iteration, which is why you see the same message over and over again.
You are scraping and taking only the last event. What did you want your function to do?
prevResult = ""
while (True):
    result = parseContent()
    if prevResult != result:
        prevResult = result
        print(result)
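A sketch of a polling loop along those lines, assuming parseContent() keeps returning a dict whose key_num identifies the newest post and that producer is the KafkaProducer from PythonWebScraping.py: only send to Kafka when key_num changes, and sleep between polls so the site is less likely to block you with a 502.

import time

last_key = None

while True:
    try:
        result = parseContent()
    except Exception as exc:   # e.g. HTTP Error 502 while the page reloads
        print("fetch failed, retrying later:", exc)
        time.sleep(60)
        continue

    # send only when the newest post differs from the one sent last time
    if result["key_num"] != last_key:
        last_key = result["key_num"]
        producer.send("hiphople", result)

    time.sleep(30)   # poll politely; adjust the interval as needed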

Can't go on clicking on the next page button while scraping certain fields from a website

I've created a script using Python in association with pyppeteer to keep clicking on the next page button until there are no more pages. While clicking on the next page button, the script throws this error: pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded, pointing at this line: await page.waitForNavigation(). It can parse name and item_type from the landing page of that site, though. I know I can issue POST HTTP requests with an appropriate payload to get data from there, but my intention is to make use of pyppeteer and keep clicking on the next page button while parsing the required fields.
website address
import asyncio
from pyppeteer import launch

link = "https://www.e-ports.com/ships"

async def get_content():
    wb = await launch(headless=True)
    [page] = await wb.pages()
    await page.goto(link)
    while True:
        await page.waitForSelector(".common_card", {'visible': True})
        elements = await page.querySelectorAll('.common_card')
        for element in elements:
            name = await element.querySelectorEval('span.title > a', 'e => e.innerText')
            item_type = await element.querySelectorEval('.bottom > span', 'e => e.innerText')
            print(name.strip(), item_type.strip())
        try:
            await page.click("button.btn-next")
            await page.waitForNavigation()
        except Exception:
            break

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(get_content())
Btw, if I manually click on the next page button for the first time, it accomplishes the rest successfully.
I don't know the exact syntax in pyppeteer, but the common pattern for waitForNavigation (in Puppeteer's JavaScript API) is this one:
await Promise.all([
    page.waitForNavigation(),
    page.click("button.btn-next")
])
With both promises inside the array, Promise.all resolves only when every one of them has finished its desired action, so the navigation wait is already registered when the click triggers the navigation.
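In pyppeteer the same idea can be expressed with asyncio.gather, which awaits several coroutines together. A minimal sketch, assuming the same button.btn-next selector and page object as in the question, that could replace the try block at the end of the loop in get_content():

import asyncio

async def click_and_wait(page):
    # start the navigation wait and the click together, so the wait
    # is already registered when the click triggers the navigation
    await asyncio.gather(
        page.waitForNavigation(),
        page.click("button.btn-next"),
    )

The while loop would then call await click_and_wait(page) inside its try block and break on exception, as before.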

Script throws some error at some point within the execution

I've created a script in Python using pyppeteer to collect the links of different posts from a webpage and then parse the title of each post by going into their target pages, reusing those collected links. Although the content is static, I'd like to know how pyppeteer works in such cases.
I tried to supply the browser variable from the main() function to the fetch() and browse_all_links() functions so that I can reuse the same browser over and over again.
My current approach:
import asyncio
from pyppeteer import launch

url = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page, url):
    await page.goto(url)
    linkstorage = []
    await page.waitForSelector('.summary .question-hyperlink')
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    return linkstorage

async def browse_all_links(page, link):
    await page.goto(link)
    await page.waitForSelector('h1 > a')
    title = await page.querySelectorEval('h1 > a', '(e => e.innerText)')
    print(title)

async def main():
    browser = await launch(headless=False, autoClose=False)
    [page] = await browser.pages()
    links = await fetch(page, url)
    tasks = [await browse_all_links(page, url) for url in links]
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(main())
The above script fetches some titles but spits out the following error at some point within the execution:
Possible to select <a> with specific text within the quotes?
Crawler Runs Too Slow
How do I loop a list of ticker to scrape balance sheet info?
How to retrive the url of searched video from youtbe using python
VBA-JSON to import data from all pages in one table
Is there an algorithm that detects semantic visual blocks in a webpage?
find_all only scrape the last value
#ERROR STARTS
Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Runtime.releaseObject): Cannot find context with specified id')>
pyppeteer.errors.NetworkError: Protocol error (Runtime.releaseObject): Cannot find context with specified id
Future exception was never retrieved
As it's been two days since this question was posted and no one has answered yet, I will take this opportunity to address the issue in a way I think might be helpful to you.
There are 15 links but you are getting only 7; this is probably because websockets is losing its connection and the page is not reachable anymore.
List comprehension
tasks = [await browse_all_links(page,url) for url in links] - what do you expect this list to be? If it is successful, it will be a list of None elements, so your next line of code will throw an error!
Solution
downgrade websockets 7.0 to websockets 6.0
remove this line of code: await asyncio.gather(*tasks)
I am using Python 3.6, so I had to change the last line of code. You don't need to change it if you are using Python 3.7, which I think you are.
import asyncio
from pyppeteer import launch

url = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page, url):
    await page.goto(url)
    linkstorage = []
    await page.waitForSelector('.summary .question-hyperlink')
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    return linkstorage

async def browse_all_links(page, link):
    await page.goto(link)
    await page.waitForSelector('h1 > a')
    title = await page.querySelectorEval('h1 > a', '(e => e.innerText)')
    print(title)

async def main():
    browser = await launch(headless=False, autoClose=False)
    [page] = await browser.pages()
    links = await fetch(page, url)
    tasks = [await browse_all_links(page, url) for url in links]
    #await asyncio.gather(*tasks)
    await browser.close()

if __name__ == '__main__':
    #asyncio.run(main())
    asyncio.get_event_loop().run_until_complete(main())
Output
(testenv) C:\Py\pypuppeteer1>python stack3.py
Scrapy Shell response.css returns an empty array
Scrapy real-time spider
Why do I get KeyError while reading data with get request?
Scrapy spider can't redefine custom_settings according to args
Custom JS Script using Lua in Splash UI
Can someone explain why and how this piece of code works [on hold]
How can I extract required data from a list of strings?
Scrapy CrawlSpider rules for crawling single page
how to scrape a web-page with search bar results, when the search query does not appear in the url
Nested for loop keeps repeating
Get all tags except a list of tags BeautifulSoup
Get current URL using Python and webbot
How to login to site and send data
Unable to append value to colums. Getting error IndexError: list index out of range
NextSibling.Innertext not working. “Object doesn't support this property”
