While web scraping with BeautifulSoup I have to write try/except multiple times. See the code below:
try:
    addr1 = soup.find('span', {'class': 'addr1'}).text
except:
    addr1 = ''
try:
    addr2 = soup.find('span', {'class': 'addr2'}).text
except:
    addr2 = ''
try:
    city = soup.find('strong', {'class': 'city'}).text
except:
    city = ''
The problem is that I have to write try/except multiple times, which is very annoying. I want to write a function that handles the exception.
I tried to use the following function, but it still raises an error:
def datascraping(var):
    try:
        return var
    except:
        return None

addr1 = datascraping(soup.find('span', {'class': 'addr1'}).text)
addr2 = datascraping(soup.find('span', {'class': 'addr2'}).text)
Can anyone help me to solve the issue?
Use a for loop that iterates over a sequence of your arguments, check whether the return value is None before trying to get the text attribute, and store the results in a dictionary. This way there is no need for try/except at all.
seq = [('span', 'addr1'), ('span', 'addr2'), ('strong', 'city')]
results = {}
for tag, value in seq:
    var = soup.find(tag, {'class': value})
    if var is not None:
        results[value] = var.text
    else:
        results[value] = ''
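If you would rather keep individual variables than a results dictionary, the same None check fits in a small helper. A minimal sketch (the name find_text is mine, not from the question):
def find_text(soup, tag, class_name, default=''):
    # Return the element's text, or the default when the element is missing
    el = soup.find(tag, {'class': class_name})
    return el.text if el is not None else default

addr1 = find_text(soup, 'span', 'addr1')
addr2 = find_text(soup, 'span', 'addr2')
city = find_text(soup, 'strong', 'city')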
I am using the Beautiful Soup library to extract data from web pages. Sometimes an element cannot be found on the page, and if we then try to access a sub-element we get an error like 'NoneType' object has no attribute 'find'.
For example, take the code below:
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
primary_name = soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text
company_number = soup.find('p', id="company-number").find('strong').text
If I want to handle the error, I have to write something like this:
try:
    primary_name = soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text
except:
    primary_name = None
try:
    company_number = soup.find('p', id="company-number").find('strong').text.strip()
except:
    company_number = None
And if there are many elements, we end up with lots of try/except blocks. I would like to write code in the following manner instead:
def error_handler(_):
    try:
        return _
    except:
        return None

primary_name = error_handler(soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text)
# this will still raise the error
I know the above code doesn't work, because the argument expression is evaluated before error_handler is even called, so the error is raised anyway.
If you have any idea how to make this code cleaner, please show me.
I don't know if this is the most efficient way, but you can pass a lambda expression to error_handler:
def error_handler(_):
    try:
        return _()
    except:
        return None

primary_name = error_handler(lambda: soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text)
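If you need a fallback other than None, the same helper can take a default. A small sketch building on the answer above (the default parameter is my addition, not part of the original answer):
def error_handler(func, default=None):
    # Evaluation is deferred: the lambda only runs inside the try block
    try:
        return func()
    except Exception:
        return default

company_number = error_handler(
    lambda: soup.find('p', id="company-number").find('strong').text.strip(),
    default='')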
So, you are looking for a way to handle exceptions for a lot of elements.
I will assume that, like most scrapers, you use a for loop.
You can handle the exceptions as follows:
soup = BeautifulSoup(somehtml)
a_big_list_of_data = soup.find_all("div", {"class": "cards"})
for items in a_big_list_of_data:
    try:
        name = items.find_all("h3", {"id": "name"})
        price = items.find_all("h5", {"id": "price"})
    except:
        continue
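One caveat: find_all never raises when nothing matches (it returns an empty list), so the exception only appears once you access .text or chain another lookup. Catching AttributeError instead of a bare except keeps unrelated bugs visible. A minimal variant of the loop above:
for items in a_big_list_of_data:
    try:
        # .text raises AttributeError when the preceding find() returned None
        name = items.find("h3", {"id": "name"}).text
        price = items.find("h5", {"id": "price"}).text
    except AttributeError:
        # A lookup returned None; skip this card
        continue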
import requests as r
from bs4 import BeautifulSoup as bs

def get_property_details(url):
    page = r.get(url)
    soup = bs(page.content, 'html.parser')
    div = soup.find_all('div', class_='list-card-heading uk-grid')
    try:
        prop_info = {}
        property_list = []
        count = 0
        for d in div:
            details = d.find_all('li')
            prop_info['Price'] = d.find('b', class_='list-price').get_text().strip().replace('$', '').replace('/mo', '').replace('\xa0', '')
            prop_info['Type'] = d.find('span', class_='rent_type').get_text().strip()
            for index, detail in enumerate(details):
                if index == 0:
                    prop_info['Bed'] = detail.text.strip().replace(' bds', '')
                elif index == 1:
                    prop_info['Bath'] = detail.text.strip().replace(' ba', '')
                elif index == 2:
                    prop_info['SQFT'] = detail.text.strip().replace(' sqft', '')
                else:
                    break
            # This prints out the correct property details
            print(prop_info)
            # This is not working, it adds the same property repeatedly
            property_list.append(prop_info)
    except Exception as e:
        print(e)
    # list of dictionaries
    return property_list

URL = 'https://www.forrentbyowner.com/?showpage=/classifieds/&f=Oklahoma'
property_info = get_property_details(URL)
print(property_info)
Indeed, you are appending the same object every time. Remember that Python passes arguments using a "call by object reference" system: immutable objects such as strings, numbers, and tuples behave like call-by-value arguments, while mutable objects behave like call-by-reference arguments. So you are appending the same dictionary reference on every iteration. To avoid that, move prop_info = {} inside the for loop so each iteration creates a new dictionary:
...
try:
    property_list = []
    count = 0
    for d in div:
        prop_info = {}
        details = d.find_all('li')
        prop_info['Price'] = ...
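The effect is easy to reproduce in isolation. A minimal demonstration of the aliasing, independent of the scraping code:
shared = {}
items = []
for i in range(3):
    shared['n'] = i
    items.append(shared)   # appends a reference to the same dict each time
print(items)               # [{'n': 2}, {'n': 2}, {'n': 2}]

fresh_items = []
for i in range(3):
    fresh_items.append({'n': i})   # a new dict per iteration
print(fresh_items)                 # [{'n': 0}, {'n': 1}, {'n': 2}]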
Here is slightly modified code; you don't need the extra dictionary or the inner for loop:
import requests
from bs4 import BeautifulSoup as bs

def get_property_details(url):
    page = requests.get(url)
    soup = bs(page.content, 'html.parser')
    div = soup.find_all('div', class_='list-card-heading uk-grid')
    # prop_info = {}
    property_list = []
    try:
        # count = 0
        for d in div:
            details = d.find_all('li')
            property_list.append({'price': d.find('b', class_='list-price').get_text().strip().replace('$', '').replace('/mo', '').replace('\xa0', ''),
                                  'Type': d.find('span', class_='rent_type').get_text().strip(),
                                  'Bed': details[0].text.strip().replace(' bds', '').replace(' bd', ''),
                                  'Bath': details[1].text.strip().replace(' ba', ''),
                                  'SQFT': details[2].text.strip().replace(' sqft', '')})
    except Exception as e:
        print(e)
    # list of dictionaries
    return property_list

URL = 'https://www.forrentbyowner.com/?showpage=/classifieds/&f=Oklahoma'
property_info = get_property_details(URL)
print(property_info)
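As a side note, the repeated .strip()/.replace() chains could be factored into a small helper. Just a sketch, and the name clean_text is my own, not from the question:
def clean_text(text, *remove):
    # Strip surrounding whitespace and drop each unwanted substring
    text = text.strip()
    for token in remove:
        text = text.replace(token, '')
    return text

# e.g. 'price': clean_text(d.find('b', class_='list-price').get_text(), '$', '/mo', '\xa0')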
Hope this helps. A small suggestion: edit your question so that it can be understood more easily.
I have been trying to improve my Python knowledge, and I think the code below is fairly straightforward. However, I dislike the style I have ended up with: I use too many try/except blocks in places where they might not be needed, and I would also like to avoid silenced exceptions.
My goal is basically to have a payload ready before scraping, as you will see at the top of the code; those defaults should always be declared before scraping. I then try to scrape the different pieces of data, and if a piece is not found, it should be skipped or its value set to [], None or False (depending on what we are trying to do).
I have read a bit about the getattr and isinstance functions, but I'm not sure whether there is a better way than covering every lookup with a try/except in case the element is not found on the page.
import requests
from bs4 import BeautifulSoup

payload = {
    "name": "Untitled",
    "view": None,
    "image": None,
    "hyperlinks": []
}

site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"
response = requests.get(site_url)
bs4 = BeautifulSoup(response.text, "html.parser")

try:
    payload['name'] = "{} {}".format(
        bs4.find('meta', {'property': 'og:site_name'})["content"],
        bs4.find('meta', {'name': 'twitter:domain'})["content"]
    )
except Exception:  # noqa
    pass

try:
    payload['view'] = "{} in total".format(
        bs4.find('div', {'class': 'grid--cell ws-nowrap mb8'}).text.strip().replace("\r\n", "").replace(" ", ""))
except Exception:
    pass

try:
    payload['image'] = bs4.find('meta', {'itemprop': 'image primaryImageOfPage'})["content"]
except Exception:
    pass

try:
    payload['hyperlinks'] = [hyperlinks['href'] for hyperlinks in bs4.find_all('a', {'class': 'question-hyperlink'})]
except Exception:  # noqa
    pass

print(payload)
EDIT:
An example of getting an incorrect value is pointing any of the bs4 find calls at something that doesn't exist:
site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"
response = requests.get(site_url)
bs4 = BeautifulSoup(response.text, "html.parser")
print(bs4.find('meta', {'property': 'og:site_name'})["content"]) # Should be found
print(bs4.find('meta', {'property': 'og:site_name_TEST'})["content"]) # Should give us an error due to not found
From the documentation, find returns None when it doesn't find anything, while find_all returns an empty list []. You can check that the results are not None before trying to index them.
import requests
from bs4 import BeautifulSoup

payload = {
    "name": "Untitled",
    "view": None,
    "image": None,
    "hyperlinks": []
}

site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"
response = requests.get(site_url)
bs4 = BeautifulSoup(response.text, "html.parser")

try:
    prop = bs4.find('meta', {'property': 'og:site_name'})
    name = bs4.find('meta', {'name': 'twitter:domain'})
    if prop is not None and name is not None:
        payload['name'] = "{} {}".format(prop["content"], name["content"])

    div = bs4.find('div', {'class': 'grid--cell ws-nowrap mb8'})
    if div is not None:
        payload['view'] = "{} in total".format(div.text.strip().replace("\r\n", "").replace(" ", ""))

    itemprop = bs4.find('meta', {'itemprop': 'image primaryImageOfPage'})
    if itemprop is not None:
        payload['image'] = itemprop["content"]

    payload['hyperlinks'] = [hyperlinks['href'] for hyperlinks in bs4.find_all('a', {'class': 'question-hyperlink'})]
except Exception:  # noqa
    pass

print(payload)
This way you can use a single try/except. If you want to handle different exceptions differently, you can add separate except blocks for them:
try:
    ...
except ValueError:
    value_error_handler()
except TypeError:
    type_error_handler()
except Exception:
    catch_all()
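If the None checks themselves start to repeat, they can be folded into a tiny helper as well. A sketch under the same assumptions (the name attr_of is mine):
def attr_of(element, key, default=None):
    # Return the attribute value when the element was found, else the default
    return element.get(key, default) if element is not None else default

payload['image'] = attr_of(
    bs4.find('meta', {'itemprop': 'image primaryImageOfPage'}), 'content')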
I am using Python 3.9.1 with Selenium and BeautifulSoup to create my first web scraper for Tesco's website (a mini project to teach myself). However, when I run the code shown below, I receive an attribute error:
Traceback (most recent call last):
  File "c:\Users\Ozzie\Dropbox\My PC (DESKTOP-HFVRPAV)\Desktop\Tesco\Tesco.py", line 37, in <module>
    clean_product_data = process_products(html)
  File "c:\Users\Ozzie\Dropbox\My PC (DESKTOP-HFVRPAV)\Desktop\Tesco\Tesco.py", line 23, in process_products
    weight = product_price_weight.find("span",{"class":"weight"}).text.strip()
AttributeError: 'NoneType' object has no attribute 'find'
I am unsure what is going wrong: the title and URL sections work fine, but the weight and price sections raise this error. When I tried printing the product_price and product_price_weight variables, they returned the values I expected (I won't post that here; it's just very long HTML).
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
import time
from bs4 import BeautifulSoup

driver = webdriver.Chrome(ChromeDriverManager().install())

def process_products(html):
    clean_product_list = []
    soup = BeautifulSoup(html, 'html.parser')
    products = soup.find_all("div", {"class": "product-tile-wrapper"})
    for product in products:
        data_dict = {}
        product_details = product.find("div", {"class": "product-details--content"})
        product_price = product.find("div", {"class": "price-control-wrapper"})
        product_price_weight = product.find("div", {"class": "price-per-quantity-weight"})
        data_dict['title'] = product_details.find('a').text.strip()
        data_dict['product_url'] = ('tesco.com') + (product_details.find('a')['href'])
        weight = product_price_weight.find("span", {"class": "weight"}).text.strip()
        data_dict['price'] = product_price.find("span", {"class": "value"}).text.strip()
        data_dict['price' + weight] = product_price_weight.find("span", {"class": "value"}).text.strip()
        clean_product_list.append(data_dict)
    return clean_product_list

master_list = []
for i in range(1, 3):
    print(i)
    driver.get(f"https://www.tesco.com/groceries/en-GB/shop/fresh-food/all?page={i}&count=48")
    html = driver.page_source
    driver.maximize_window()
    clean_product_data = process_products(html)
    master_list.extend(clean_product_data)

print(master_list)
Any help is much appreciated.
You can fix this by updating your process_products function. Note that there are cases where the variable you call .find() on is None, which simply means that the previous .find() did not match any element with the given parameters.
Take this example. Let's say this part of the code has been executed:
product_details = product.find("div", {"class": "product-details--content"})
If it finds an element matching that tag and class, it returns a bs4 Tag object; otherwise it returns None. So let's say it returned None. Your product_details variable is then None, and the next line of your code does this:
data_dict['title'] = product_details.find('a').text.strip()
# Another way of saying it is:
# data_dict['title'] = None.find('a').text.strip()  ## Clearly an ERROR
So what I did here is wrap each lookup in a try/except to catch those errors and fill in empty strings, indicating that the variable you call .find() on is probably None or that something else went wrong (the point is that no usable data came back). You could also write this as if/else checks, but I think try/except reads better here.
def process_products(html):
    clean_product_list = []
    soup = BeautifulSoup(html, 'html.parser')
    products = soup.find_all("div", {"class": "product-tile-wrapper"})
    for product in products:
        data_dict = {}
        product_details = product.find("div", {"class": "product-details--content"})
        product_price = product.find("div", {"class": "price-control-wrapper"})
        product_price_weight = product.find("div", {"class": "price-per-quantity-weight"})
        try:
            data_dict['title'] = product_details.find('a').text.strip()
            data_dict['product_url'] = ('tesco.com') + (product_details.find('a')['href'])
        except BaseException as no_prod_details:
            '''
            This would mean that your product_details variable might be None, so we catch
            the error and set the data to empty strings, indicating that .find() failed.
            '''
            data_dict['title'] = ''
            data_dict['product_url'] = ''
        try:
            data_dict['price'] = product_price.find("span", {"class": "value"}).text.strip()
        except BaseException as no_prod_price:
            # Same here
            data_dict['price'] = ''
        try:
            weight = product_price_weight.find("span", {"class": "weight"}).text.strip()
            data_dict['price' + weight] = product_price_weight.find("span", {"class": "value"}).text.strip()
        except BaseException as no_prod_price_weigth:
            # Same here again
            weight = ''
            data_dict['price' + weight] = ''
        clean_product_list.append(data_dict)
    return clean_product_list
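One caveat on the sketch above: BaseException also swallows things like KeyboardInterrupt and SystemExit. Catching AttributeError, which is what a lookup on None actually raises, is usually enough:
try:
    data_dict['title'] = product_details.find('a').text.strip()
except AttributeError:
    # product_details (or the <a> inside it) was not found
    data_dict['title'] = ''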
from collections import defaultdict
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

r = requests.get("http://www.walmart.com/search/?query=marvel&cat_id=4096_530598")
soup = BeautifulSoup(r.content)
g_data = soup.find_all("div", {"class": "tile-content"})
data = defaultdict(list)

for tile in g_data:
    # the "tile" value in g_data contains what you are looking for...
    # find the product titles
    try:
        title = tile.find("a", "js-product-title")
        data['Product Title'].append(title.text)
    except:
        data['Product Title'].append("")
    # find the prices
    try:
        price = tile.find('span', 'price price-display').text.strip()
        data['Price'].append(price)
    except:
        data['Price'].append("")
    # find the stars
    try:
        g_star = tile.find("div", {"class": "stars stars-small tile-row"}).find('span', 'visuallyhidden').text.strip()
        data['Stars'].append(g_star)
    except:
        data['Stars'].append("")
    try:
        dd_starring = tile.find('dd', {"class": "media-details-multi-line media-details-artist-dd module"}).text.strip()
        data['Starring'].append(dd_starring)
    except:
        data['Starring'].append("")
    try:
        running_time = tile.find_all('dl', {"class": "media-details dl-horizontal copy-mini"})
        for dd_run in running_time:
            running = dd_run.find_all('dd')[1:2]
            for run in running:
                # print run.text.strip()
                data['Running Time'].append(run.text.strip())
    except:
        data['Running Time'].append("")
    try:
        dd_format = tile.findAll('dd', {"class": "media-details-multi-line"})[1:2]
        for formatt in dd_format:
            data['Format'].append(formatt.text.strip())
    except:
        data['Format'].append("")
    try:
        div_shipping = tile.find_all('div', {"data-offer-shipping-pass-eligible": "false"})
        data['Shipping'].append("")
    except:
        freeshipping = "Free Shipping"
        data['Shipping'].append(freeshipping)

df = pd.DataFrame(data)
df
I want to access the dd elements that have no class name. How can I access them?
For example, row no. 11 has a director field and a few others have a release date.
Currently I am accessing them by position, using slices like [1:2] and so on, but that is not flexible and doesn't populate my table correctly.
Is there a function to do this?
Substitute the Starring and Running Time parts with:
try:
    dd_starring = tile.find('dd', {"class": "media-details-artist-dd"}).text.strip()
    data['Starring'].append(dd_starring)
except:
    data['Starring'].append("")

try:
    running = tile.find('dt', {'class': 'media-details-running-time'})
    running_time = running.find_next("dd")
    data['Running Time'].append(running_time.text)
except:
    data['Running Time'].append("")
This should run now. It seems that BeautifulSoup can get confused when you select on multiple classes, so you can get the actors just by the single CSS class media-details-artist-dd. For the running time I employed a simple trick :)
EDIT: Changed the code to find the dt for Running Time and then grab the next dd. The previous code had an extra, unneeded part. It should work now.
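For reference, the "trick" is find_next, which walks forward through the document from a starting element. A minimal, self-contained illustration (the HTML here is made up for the example):
from bs4 import BeautifulSoup

html = "<dl><dt class='media-details-running-time'>Running Time</dt><dd>120 minutes</dd></dl>"
soup = BeautifulSoup(html, "html.parser")
dt = soup.find('dt', {'class': 'media-details-running-time'})
print(dt.find_next('dd').text)  # 120 minutes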