beautiful soup how to avoid writing too many try catch blocks? - python

I am using the Beautiful Soup library to extract data from webpages. Sometimes an element cannot be found in the webpage at all, and if we then try to access a sub-element we get an error like 'NoneType' object has no attribute 'find'.
For example, take the code below:
import requests
from bs4 import BeautifulSoup

res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
primary_name = soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text
company_number = soup.find('p', id="company-number").find('strong').text
If I want to handle the error, I have to write something like below.
try:
    primary_name = soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text
except:
    primary_name = None
try:
    company_number = soup.find('p', id="company-number").find('strong').text.strip()
except:
    company_number = None
And if there are too many elements, we end up with lots of try/except statements. What I actually want is to write the code in the manner below.
def error_handler(_):
    try:
        return _
    except:
        return None

primary_name = error_handler(soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text)
# this will still raise the error
I know the above code won't work, because the argument is evaluated before error_handler is even called, so the error is still raised. If you have any idea how to make this code look cleaner, please show me.

I don't know if this is the most efficient way, but you can pass a lambda expression to the error_handler:
def error_handler(_):
    try:
        return _()
    except:
        return None

primary_name = error_handler(lambda: soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text)
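Going one step further, the same helper can drive any number of fields from a mapping of names to lambdas — a sketch reusing the question's selectors; catching AttributeError instead of using a bare except keeps unrelated bugs visible:

def error_handler(fn):
    try:
        return fn()
    except AttributeError:  # raised when a find() in the chain returns None
        return None

# hypothetical field -> extractor mapping; the selectors come from the question
extractors = {
    "primary_name": lambda: soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text,
    "company_number": lambda: soup.find('p', id="company-number").find('strong').text.strip(),
}

data = {field: error_handler(fn) for field, fn in extractors.items()}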

So you are looking for a way to handle exceptions for a lot of elements.
For this, I will assume that you (like any other scraper) use a for loop.
You can handle exceptions as follows:
soup = BeautifulSoup(somehtml, "html.parser")
a_big_list_of_data = soup.find_all("div", {"class": "cards"})
for items in a_big_list_of_data:
    try:
        name = items.find_all("h3", {"id": "name"})
        price = items.find_all("h5", {"id": "price"})
    except:
        continue
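Since find_all returns an empty list rather than raising when nothing matches, the same loop can also be written with explicit checks instead of try/except — a minimal sketch under the same assumed markup:

for items in a_big_list_of_data:
    names = items.find_all("h3", {"id": "name"})
    prices = items.find_all("h5", {"id": "price"})
    if not names or not prices:  # find_all returns [] when nothing matches
        continue  # skip cards that are missing either field
    name = names[0].text
    price = prices[0].text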

Related

Handling try except multiple times while web scraping BeautifulSoup

While web scraping with BeautifulSoup I have to write try/except multiple times. See the code below:
try:
    addr1 = soup.find('span', {'class' : 'addr1'}).text
except:
    addr1 = ''
try:
    addr2 = soup.find('span', {'class' : 'addr2'}).text
except:
    addr2 = ''
try:
    city = soup.find('strong', {'class' : 'city'}).text
except:
    city = ''
The problem is that I have to write try/except multiple times, and that is very annoying. I want to write a function to handle the exceptions.
I tried to use the following function, but it is still showing an error:
def datascraping(var):
    try:
        return var
    except:
        return None

addr1 = datascraping(soup.find('span', {'class' : 'addr1'}).text)
addr2 = datascraping(soup.find('span', {'class' : 'addr2'}).text)
Can anyone help me to solve the issue?
Use a for loop that iterates through a sequence containing your arguments. Then use a conditional statement that checks if the return value is None, prior to attempting to get the text attribute. Then store the results in a dictionary. This way there is no need to use try/except at all.
seq = [('span', 'addr1'), ('span', 'addr2'), ('strong', 'city')]
results = {}
for tag, value in seq:
    var = soup.find(tag, {'class': value})
    if var is not None:
        results[value] = var.text
    else:
        results[value] = ''
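To get the original variables back, you can read them out of the dictionary afterwards; the keys are the class values from seq:

addr1 = results.get('addr1', '')
addr2 = results.get('addr2', '')
city = results.get('city', '')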

Removing tags from a selected number by iterating using beautifulsoup

I am trying to clean HTML data using Beautiful Soup. I want to remove a set of tags, along with the data inside those tags, that run consecutively from et_pb_row_inner et_pb_row_inner_2 to et_pb_row_inner et_pb_row_inner_22.
The code I was trying is like this:
def madisonsymphony(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    for h in soup.find_all('header'):
        try:
            h.extract()
        except:
            pass
    for f in soup.find_all('footer'):
        try:
            f.extract()
        except:
            pass
    tophead = soup.find("div", {"id": "top-header"})
    tophead.extract()
    for x in range(2, 23):
        mydiv = soup.find("div", {"class": "et_pb_row_inner et_pb_row_inner_{}".format(x)})
        mydiv.extract()
    text = soup.getText(separator=u' ')
    return text
I got it working by individually specifying each class name using find(), but how is it possible to do this in a general manner?
You could use a regex to find all the <div> tags whose class matches those attributes and ends in 2 or higher.
So basically the regex r'et_pb_row_inner et_pb_row_inner_([2-9]|[\d]{2,}).*' is saying: find every et_pb_row_inner et_pb_row_inner_ that ends in a single digit from 2 through 9, or in a number that is two or more digits long.
import re

def madisonsymphony(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    for h in soup.find_all('header'):
        try:
            h.extract()
        except:
            pass
    for f in soup.find_all('footer'):
        try:
            f.extract()
        except:
            pass
    tophead = soup.find("div", {"id": "top-header"})
    tophead.extract()
    for mydiv in soup.find_all("div", {"class": re.compile(r'et_pb_row_inner et_pb_row_inner_([2-9]|[\d]{2,}).*')}):
        mydiv.extract()
    text = soup.getText(separator=u' ')
    return text
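As a quick sanity check of the pattern itself (a standalone sketch, independent of BeautifulSoup), it keeps _2 and above but rejects _0 and _1:

import re

pattern = re.compile(r'et_pb_row_inner et_pb_row_inner_([2-9]|[\d]{2,}).*')
for cls in ("et_pb_row_inner et_pb_row_inner_1",
            "et_pb_row_inner et_pb_row_inner_2",
            "et_pb_row_inner et_pb_row_inner_22"):
    print(cls, "->", bool(pattern.match(cls)))
# et_pb_row_inner et_pb_row_inner_1 -> False
# et_pb_row_inner et_pb_row_inner_2 -> True
# et_pb_row_inner et_pb_row_inner_22 -> True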
This way you don't need to hard-code the range 2 through 22; it'll just go from 2 to whatever the last value is. The other way to do it is to just use slicing.
def madisonsymphony(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    for h in soup.find_all('header'):
        try:
            h.extract()
        except:
            pass
    for f in soup.find_all('footer'):
        try:
            f.extract()
        except:
            pass
    tophead = soup.find("div", {"id": "top-header"})
    tophead.extract()
    mydivs = soup.find_all("div", {"class": re.compile(r'et_pb_row_inner et_pb_row_inner_.*')})
    for mydiv in mydivs[2:]:  # skip the first two elements (indexes 0 and 1) and continue to the end
        mydiv.extract()
    text = soup.getText(separator=u' ')
    return text
The problem with that is you have to assume there's a _0 and a _1. If for whatever reason the numbering starts at _1, then you're keeping the _2 row, because it would be the second element in the list.
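A hedged alternative that avoids both assumptions is to filter on the numeric suffix itself; BeautifulSoup matches a class regex against each individual class of a tag, so the pattern below targets only the numbered class:

import re

numbered = re.compile(r'^et_pb_row_inner_(\d+)$')

def remove_numbered_rows(soup, start=2):
    # extract every div whose numbered class has a suffix >= start
    for mydiv in soup.find_all("div", {"class": numbered}):
        for cls in mydiv.get("class", []):
            m = numbered.match(cls)
            if m and int(m.group(1)) >= start:
                mydiv.extract()
                break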

How to use lists outside of for loop?

I'm trying to monitor prices on an e-shop. How I do it: I scrape the prices, then wait some time, do it again, and compare the lists of prices to see if anything changed. But my lists of prices are built inside the for loops, and when I want to compare them outside the for loop, it says my lists are undefined. I hope you'll understand this better from my code.
import requests, smtplib, time
from bs4 import BeautifulSoup

url = "https://www.sportisimo.sk/panska-vysoka-trekova-obuv/?riadenie=najlacnejs%C3%AD&dostupnost=vsetko&znacka[]=28"

getPage = requests.get(url)
page_soup = BeautifulSoup(getPage.content, "html.parser")
old_containers = page_soup.findAll("div", {"class": "product-box__in"})
for old_container in old_containers:
    price_container = old_container.findAll("span", {"class": "price"})
    product_price_old = price_container[0].text.strip()
    print(product_price_old)

print("---------------------")
time.sleep(10)

getPage_new = requests.get(url)
page_soup_new = BeautifulSoup(getPage_new.content, "html.parser")
new_containers = page_soup_new.findAll("div", {"class": "product-box__in"})
for new_container in new_containers:
    price_container = new_container.findAll("span", {"class": "price"})
    product_price_new = price_container[0].text.strip()
    print(product_price_new)

if product_price_old != product_price_new:
    for new_container in new_containers:
        name_container = new_container.findAll("div", {"class": "product-box__name"})
        product_name = name_container[0].h2.a.text.strip()
        print(f"Zmenený produkt: {product_name}")
        print(f"Pôvodná cena: {product_price_old}")
        print(f"Nová cena: {product_price_new}")
It tells me that product_price_old and product_price_new are undefined. How can I fix this to make it work, please?
Thank you
You could start by defining
product_price_old = []
product_price_new = []
before the loops, and then inside each loop append the scraped value to the matching list:
product_price_old.append(price_container[0].text.strip())
# or whatever format you want your data in
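Putting it together, here is a minimal sketch of the whole comparison, keeping the question's URL and selectors; zipping the lists positionally assumes the product order does not change between the two requests:

import time
import requests
from bs4 import BeautifulSoup

url = "https://www.sportisimo.sk/panska-vysoka-trekova-obuv/?riadenie=najlacnejs%C3%AD&dostupnost=vsetko&znacka[]=28"

def scrape(url):
    # return parallel lists of product names and prices
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    names, prices = [], []
    for box in soup.find_all("div", {"class": "product-box__in"}):
        names.append(box.find("div", {"class": "product-box__name"}).h2.a.text.strip())
        prices.append(box.find("span", {"class": "price"}).text.strip())
    return names, prices

names, old_prices = scrape(url)
time.sleep(10)
_, new_prices = scrape(url)

for name, old, new in zip(names, old_prices, new_prices):
    if old != new:
        print(f"Zmenený produkt: {name}")
        print(f"Pôvodná cena: {old}")
        print(f"Nová cena: {new}")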

How to get webelements value if found else skip

I have been trying to improve my knowledge of Python, and I think the code is pretty straightforward. However, I dislike the coding style a bit: I use too many try/except blocks in places where they might not be needed, and I'd also like to avoid silently swallowed exceptions.
My goal is basically to have a payload ready before scraping, as you will see at the top of the code; those fields should always be declared before scraping. What I'm trying to do is scrape those different pieces of data, and if the data isn't found, skip it or set the value to [], None or False (depending on what we are trying to do).
I have read a bit about the getattr and isinstance functions, but I'm not sure if there might be a better way than using lots of try/except blocks as a cover for when an element isn't found on the webpage.
import requests
from bs4 import BeautifulSoup

payload = {
    "name": "Untitled",
    "view": None,
    "image": None,
    "hyperlinks": []
}

site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"
response = requests.get(site_url)
bs4 = BeautifulSoup(response.text, "html.parser")

try:
    payload['name'] = "{} {}".format(
        bs4.find('meta', {'property': 'og:site_name'})["content"],
        bs4.find('meta', {'name': 'twitter:domain'})["content"]
    )
except Exception:  # noqa
    pass

try:
    payload['view'] = "{} in total".format(
        bs4.find('div', {'class': 'grid--cell ws-nowrap mb8'}).text.strip().replace("\r\n", "").replace(" ", ""))
except Exception:
    pass

try:
    payload['image'] = bs4.find('meta', {'itemprop': 'image primaryImageOfPage'})["content"]
except Exception:
    pass

try:
    payload['hyperlinks'] = [hyperlinks['href'] for hyperlinks in bs4.find_all('a', {'class': 'question-hyperlink'})]
except Exception:  # noqa
    pass

print(payload)
EDIT:
An example of getting an incorrect value is to point any of the bs4 find calls at something that doesn't exist:
site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"
response = requests.get(site_url)
bs4 = BeautifulSoup(response.text, "html.parser")
print(bs4.find('meta', {'property': 'og:site_name'})["content"]) # Should be found
print(bs4.find('meta', {'property': 'og:site_name_TEST'})["content"]) # Should give us an error due to not found
From the documentation, find returns None when it doesn't find anything while find_all returns an empty list []. You can check that the results are not None before trying to index.
import requests
from bs4 import BeautifulSoup

payload = {
    "name": "Untitled",
    "view": None,
    "image": None,
    "hyperlinks": []
}

site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"
response = requests.get(site_url)
bs4 = BeautifulSoup(response.text, "html.parser")

try:
    prop = bs4.find('meta', {'property': 'og:site_name'})
    name = bs4.find('meta', {'name': 'twitter:domain'})
    if prop is not None and name is not None:
        payload['name'] = "{} {}".format(prop["content"], name["content"])

    div = bs4.find('div', {'class': 'grid--cell ws-nowrap mb8'})
    if div is not None:
        payload['view'] = "{} in total".format(div.text.strip().replace("\r\n", "").replace(" ", ""))

    itemprop = bs4.find('meta', {'itemprop': 'image primaryImageOfPage'})
    if itemprop is not None:
        payload['image'] = itemprop["content"]

    payload['hyperlinks'] = [hyperlinks['href'] for hyperlinks in bs4.find_all('a', {'class': 'question-hyperlink'})]
except Exception:  # noqa
    pass

print(payload)
So you can use one try/except. If you want to handle exceptions differently you can have different except blocks for them.
try:
    ...
except ValueError:
    value_error_handler()
except TypeError:
    type_error_handler()
except Exception:
    catch_all()

Beautifulsoup if class exists

Is there a way to make BeautifulSoup look for a class and if it exists then run the script? I am trying this:
if soup.find_all("div", {"class": "info"}) == True:
    print("Tag Found")
I've also tried the following, but it didn't work and gave an error about having too many attributes:
if soup.has_attr("div", {"class": "info"}):
    print("Tag Found")
You're very close... soup.find_all will return an empty list if it doesn't find any matches. Your control statement is checking its return value against a literal bool. Instead you need to check its truthiness by omitting the == True:
if soup.find_all("div", {"class": "info"}):
    print("Tag Found")
Why not simply this:
if soup.find("div", {"class": "info"}) is not None:
    print("Tag Found")
