How to get a web element's value if found, else skip - Python

I have been trying to improve my Python knowledge, and I think the code below is fairly straightforward. However, I dislike the coding style somewhat: I use try/except in places where it might not be needed in the first place, and I would like to avoid silently swallowed exceptions.
My goal is to have a ready payload before scraping, as you will see at the top of the code; those keys should always be declared before scraping. I then try to scrape the different pieces of data. If a piece of data is not found, it should be skipped or the value set to [], None or False (depending on what we are trying to do).
I have read a bit about the getattr and isinstance functions, but I'm not sure whether there is a better way than using lots of try/except blocks to cover the case where an element is not found on the webpage.
import requests
from bs4 import BeautifulSoup

payload = {
    "name": "Untitled",
    "view": None,
    "image": None,
    "hyperlinks": []
}

site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"
response = requests.get(site_url)
bs4 = BeautifulSoup(response.text, "html.parser")

try:
    payload['name'] = "{} {}".format(
        bs4.find('meta', {'property': 'og:site_name'})["content"],
        bs4.find('meta', {'name': 'twitter:domain'})["content"]
    )
except Exception:  # noqa
    pass

try:
    payload['view'] = "{} in total".format(
        bs4.find('div', {'class': 'grid--cell ws-nowrap mb8'}).text.strip().replace("\r\n", "").replace(" ", ""))
except Exception:
    pass

try:
    payload['image'] = bs4.find('meta', {'itemprop': 'image primaryImageOfPage'})["content"]
except Exception:
    pass

try:
    payload['hyperlinks'] = [hyperlinks['href'] for hyperlinks in bs4.find_all('a', {'class': 'question-hyperlink'})]
except Exception:  # noqa
    pass

print(payload)
EDIT:
An example of getting an incorrect value is to change any of the bs4 find elements to something that does not exist, e.g.:
site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"
response = requests.get(site_url)
bs4 = BeautifulSoup(response.text, "html.parser")
print(bs4.find('meta', {'property': 'og:site_name'})["content"]) # Should be found
print(bs4.find('meta', {'property': 'og:site_name_TEST'})["content"]) # Should give us an error due to not found

From the documentation, find returns None when it doesn't find anything while find_all returns an empty list []. You can check that the results are not None before trying to index.
import requests
from bs4 import BeautifulSoup

payload = {
    "name": "Untitled",
    "view": None,
    "image": None,
    "hyperlinks": []
}

site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"
response = requests.get(site_url)
bs4 = BeautifulSoup(response.text, "html.parser")

try:
    prop = bs4.find('meta', {'property': 'og:site_name'})
    name = bs4.find('meta', {'name': 'twitter:domain'})
    if prop is not None and name is not None:
        payload['name'] = "{} {}".format(prop["content"], name["content"])

    div = bs4.find('div', {'class': 'grid--cell ws-nowrap mb8'})
    if div is not None:
        payload['view'] = "{} in total".format(div.text.strip().replace("\r\n", "").replace(" ", ""))

    itemprop = bs4.find('meta', {'itemprop': 'image primaryImageOfPage'})
    if itemprop is not None:
        payload['image'] = itemprop["content"]

    payload['hyperlinks'] = [hyperlinks['href'] for hyperlinks in bs4.find_all('a', {'class': 'question-hyperlink'})]
except Exception:  # noqa
    pass

print(payload)
So you can use one try/except. If you want to handle exceptions differently you can have different except blocks for them.
try:
    ...
except ValueError:
    value_error_handler()
except TypeError:
    type_error_handler()
except Exception:
    catch_all()
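If the same "find, check for None, then index" pattern repeats, it can also be folded into a small helper. A minimal sketch (find_attr is just an illustrative name, not part of BeautifulSoup):

def find_attr(soup, name, attrs, key, default=None):
    # Return tag[key] if the tag exists and carries that attribute, otherwise return the default.
    tag = soup.find(name, attrs)
    if tag is not None and tag.has_attr(key):
        return tag[key]
    return default

payload['image'] = find_attr(bs4, 'meta', {'itemprop': 'image primaryImageOfPage'}, 'content')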

Related

Handling try except multiple times while web scraping BeautifulSoup

While web scraping with BeautifulSoup, I have to write try/except multiple times. See the code below:
try:
    addr1 = soup.find('span', {'class' : 'addr1'}).text
except:
    addr1 = ''
try:
    addr2 = soup.find('span', {'class' : 'addr2'}).text
except:
    addr2 = ''
try:
    city = soup.find('strong', {'class' : 'city'}).text
except:
    city = ''
The problem is that I have to write try/except multiple times, and that is very annoying. I want to write a function to handle the exceptions.
I tried to use the following function, but it is still showing an error:
def datascraping(var):
    try:
        return var
    except:
        return None

addr1 = datascraping(soup.find('span', {'class' : 'addr1'}).text)
addr2 = datascraping(soup.find('span', {'class' : 'addr2'}).text)
Can anyone help me to solve the issue?
Use a for loop that iterates through a sequence containing your arguments. Then use a conditional statement that checks if the return value is None, prior to attempting to get the text attribute. Then store the results in a dictionary. This way there is no need to use try/except at all.
seq = [('span', 'addr1'), ('span', 'addr2'), ('strong', 'city')]
results = {}
for tag, value in seq:
    var = soup.find(tag, {'class': value})
    if var is not None:
        results[value] = var.text
    else:
        results[value] = ''
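If the same fields need to be extracted from many pages, the loop can be wrapped in a small reusable function. A minimal sketch under that assumption (extract_fields is just an illustrative name):

def extract_fields(soup, seq, default=''):
    # Map each class name to the tag's text, falling back to a default when the tag is missing.
    results = {}
    for tag, value in seq:
        var = soup.find(tag, {'class': value})
        results[value] = var.text if var is not None else default
    return results

results = extract_fields(soup, [('span', 'addr1'), ('span', 'addr2'), ('strong', 'city')])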

beautiful soup how to avoid writing too many try catch blocks?

I am using the Beautiful Soup library to extract data from webpages. Sometimes an element cannot be found in the webpage at all, and if we then try to access a sub-element we get an error like 'NoneType' object has no attribute 'find'.
For example, take the code below:
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
primary_name = soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text
company_number = soup.find('p', id="company-number").find('strong').text
If I want to handle the error, I have to write something like below.
try:
    primary_name = soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text
except:
    primary_name = None
try:
    company_number = soup.find('p', id="company-number").find('strong').text.strip()
except:
    company_number = None
And if there are many elements, we end up with lots of try/except statements. I would actually like to write the code in the manner below.
def error_handler(_):
    try:
        return _
    except:
        return None

primary_name = error_handler(soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text)
# this will still raise the error
I know that the above code won't work, because the argument expression is evaluated before it is passed to error_handler, so the error is still raised.
If you have any idea how to make this code look cleaner, please show me.
I don't know if this is the most efficient way, but you can pass a lambda expression to the error_handler:
def error_handler(_):
    try:
        return _()
    except:
        return None

primary_name = error_handler(lambda: soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text)
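The same wrapper should work for the other field from the question as well, for example:

company_number = error_handler(lambda: soup.find('p', id="company-number").find('strong').text.strip())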
So, you are looking for a way to handle exceptions for a lot of elements.
For this, I will assume that (like most scrapers) you use a for loop.
You can handle the exceptions as follows:
soup = BeautifulSoup(somehtml)
a_big_list_of_data = soup.find_all("div", {"class": "cards"})
for items in a_big_list_of_data:
    try:
        name = items.find_all("h3", {"id": "name"})
        price = items.find_all("h5", {"id": "price"})
    except:
        continue

try and except statement not being called

I am web scraping eBay for an item's information. The listings are not very consistent with some of the info I need, so I am using a try/except statement so the code continues when an IndexError is raised, but for some reason the try/except statement is not being called when the condition is met. Any ideas why this is happening and how to fix it? I have debugged the code but can't find the issue. Thanks.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0'}
my_url = 'https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1311&_nkw=sm-r800&_sacat=0&LH_TitleDesc=0' \
         '&_osacat=0&_odkw=samsung+watch '

def get_data(url):
    r = requests.get(url, headers=headers)
    soup = bs(r.content, features='html.parser')
    return soup

def parse_data(soup):
    product_list = []
    results = soup.find_all('div', {'class': 's-item__info clearfix'})
    for item in results:
        try:
            products = {'Title': item.find_all('a', {'class': 's-item__link'})[0].h3.text,
                        'Price': float(item.find('span', {'class': 's-item__price'}).text[1:]),
                        'Product Rating': float(item.find('div', {'class': 's-item__reviews'}).a.div.find('span', {
                            'class': 'clipped'}).text.strip(' ')[0]),
                        'Watchers': float(item.find('div', {'class': 's-item__details clearfix'}).find('span', {
                            'class': 's-item__hotness s-item__itemHotness'}).text.split(' ')[0])
                        }
            product_list.append(products)
        except IndexError:
            continue
        return product_list

def output(product_list):
    df = pd.DataFrame(product_list)
    df.to_csv('Samsung Watch Data.csv', index=False)
    print('Saved to CSV')
    return

my_soup = get_data(my_url)
data = parse_data(my_soup)
output(data)
The problem is your return statement, which is causing your loop to end early, because return finishes the function execution. To make this easier to see:
def f():
    for i in range(3):
        print(i)
        return

f()
0
# Nothing else happens here
To get all of the numbers, I need the return to be at the end of the loop:
def f():
    for i in range(3):
        print(i)
    return

f()
0
1
2
# Now I get all of the numbers
So move your return to the end of your loop, and unindent it:
def parse_data(soup):
    product_list = []
    results = soup.find_all('div', {'class': 's-item__info clearfix'})
    for item in results:
        try:
            products = {'Title': item.find_all('a', {'class': 's-item__link'})[0].h3.text,
                        'Price': float(item.find('span', {'class': 's-item__price'}).text[1:]),
                        'Product Rating': float(item.find('div', {'class': 's-item__reviews'}).a.div.find('span', {
                            'class': 'clipped'}).text.strip(' ')[0]),
                        'Watchers': float(item.find('div', {'class': 's-item__details clearfix'}).find('span', {
                            'class': 's-item__hotness s-item__itemHotness'}).text.split(' ')[0])
                        }
            product_list.append(products)
        except IndexError:
            continue
    return product_list  # <---- Here
To ignore any exceptions
Instead of except IndexError use except Exception. This will catch any kind of exception your code might throw, though I'd definitely print what kind of error occurred. Catching specific errors is usually better practice:
try:
    # some code
except Exception as e:
    print(f"Caught an exception: {e}")
    continue

skipping Error 404 with BeautifulSoup

I'm trying to scrape some URLs with BeautifulSoup. The URLs I'm scraping come from a Google Analytics API call; some of them aren't working properly, so I need to find a way to skip them.
I tried to add this:
except urllib2.HTTPError:
    continue
But I got the following syntax error:
except urllib2.HTTPError:
^
SyntaxError: invalid syntax
Here is my full code:
rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.konbini.com'

def print_results(results):
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'

urllist = [mystring + x for x in rawdata]

for row in urllist:
    # query the website and return the html to the variable 'page'
    page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        continue
    soup = BeautifulSoup(page, 'html.parser')
    # Take out the <div> of name and get its value
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    share = name_box.text.strip()  # strip() is used to remove starting and trailing
    # save the data in tuple
    sharelist.append((row, share))

print(sharelist)
Your except statement is not preceded by a try statement. You should use the following pattern:
try:
    page = urllib2.urlopen(row)
except urllib2.HTTPError:
    continue
Also note the indentation levels. Code executed under the try clause must be indented, as well as the except clause.
Two errors:
1. No try statement
2. No indentation
Use this:
for row in urllist:
    # query the website and return the html to the variable 'page'
    try:
        page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        continue
If you just want to catch a 404, you need to check the returned code or re-raise the error, or else you will catch and ignore more than just the 404:
import urllib2
from bs4 import BeautifulSoup
from urlparse import urljoin

def print_results(results):
    base = 'http://www.konbini.com'
    rawdata = []
    sharelist = []
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'
    # use urljoin to join to the base url
    urllist = [urljoin(base, h) for h in rawdata]
    for url in urllist:
        # query the website and return the html to the variable 'page'
        try:  # need to open with try
            page = urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            if e.getcode() == 404:  # check the return code
                continue
            raise  # if other than 404, raise the error
        soup = BeautifulSoup(page, 'html.parser')
        # Take out the <div> of name and get its value
        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()  # strip() is used to remove starting and trailing
        # save the data in tuple
        sharelist.append((url, share))
    print(sharelist)
As already mentioned by others:
1. The try statement is missing.
2. Proper indentation is missing.
You should use an IDE or editor so that you don't run into such problems. Some good IDEs and editors are:
IDE: Eclipse with the PyDev plugin
Editor: Visual Studio Code
Anyway, here is the code with the try added and the indentation fixed:
rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.konbini.com'

def print_results(results):
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'

urllist = [mystring + x for x in rawdata]

for row in urllist:
    # query the website and return the html to the variable 'page'
    try:
        page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        continue
    soup = BeautifulSoup(page, 'html.parser')
    # Take out the <div> of name and get its value
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    share = name_box.text.strip()  # strip() is used to remove starting and trailing
    # save the data in tuple
    sharelist.append((row, share))

print(sharelist)
Your syntax error is due to the fact that you're missing a try with your except statement.
try:
    # code that might throw HTTPError
    page = urllib2.urlopen(row)
except urllib2.HTTPError:
    continue

How to create a looping url programmatically to scrape

I have this code I am trying to scrape with; however, I am lost as to how I can make the Python code loop over the result pages and save everything so I can write it all to a .csv file. Any help would be greatly appreciated :)
import requests
from bs4 import BeautifulSoup

url = "http://www.yellowpages.com/search?search_terms=bodyshop&geo_location_terms=Fort+Lauderdale%2C+FL"
r = requests.get(url)
soup = BeautifulSoup(r.content)

links = soup.find_all("a")
for link in links:
    print "<a href='%s'>%s</a>" % (link.get("href"), link.text)

g_data = soup.find_all("div", {"class": "info"})
for item in g_data:
    print item.contents[0].find_all("a", {"class": "business-name"})[0].text
    try:
        print item.contents[1].find_all("span", {"itemprop": "streetAddress"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("span", {"itemprop": "adressLocality"})[0].text.replace(',', '')
    except:
        pass
    try:
        print item.contents[1].find_all("span", {"itemprop": "adressRegion"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("span", {"itemprop": "postalCode"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("li", {"class": "primary"})[0].text
    except:
        pass
I know that with this code:
url_page2 = url + '&page=' + str(2) + '&s=relevance'
I can loop to the second page, but how could one loop to all the page results of the website and make the results available in a .csv file?
Make an endless loop that increments the page number starting from 1, and exit it when you get no results. Define a list of fields to extract, and rely on the itemprop attribute to get the field values. Collect the items in a list of dictionaries, which you can later write into a csv file:
from pprint import pprint

import requests
from bs4 import BeautifulSoup

url = "http://www.yellowpages.com/search?search_terms=bodyshop&geo_location_terms=Fort%20Lauderdale%2C%20FL&page={page}&s=relevance"
fields = ["name", "streetAddress", "addressLocality", "addressRegion", "postalCode", "telephone"]

data = []
index = 1
while True:
    # format into a separate variable so the {page} placeholder in the template stays intact
    page_url = url.format(page=index)
    index += 1

    response = requests.get(page_url)
    soup = BeautifulSoup(response.content, "html.parser")

    page_results = soup.select('div.result')
    # exiting the loop if no results
    if not page_results:
        break

    for item in page_results:
        result = dict.fromkeys(fields)
        for field in fields:
            try:
                result[field] = item.find(itemprop=field).get_text(strip=True)
            except AttributeError:
                pass
        data.append(result)

    break  # DELETE ME

pprint(data)
For the first page, it prints:
[{'addressLocality': u'Fort Lauderdale,',
'addressRegion': u'FL',
'name': u"Abernathy's Paint And Body Shop",
'postalCode': u'33315',
'streetAddress': u'1927 SW 1st Ave',
'telephone': u'(954) 522-8923'},
...
{'addressLocality': u'Fort Lauderdale,',
'addressRegion': u'FL',
'name': u'Mega Auto Body Shop',
'postalCode': u'33304',
'streetAddress': u'828 NE 4th Ave',
'telephone': u'(954) 523-9331'}]
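To finish the .csv part of the question, the collected list of dictionaries can be written out with csv.DictWriter. A minimal sketch (the output filename is arbitrary):

import csv

# on Python 3, open the file with open('bodyshops.csv', 'w', newline='') instead
with open('bodyshops.csv', 'wb') as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()    # header row with the field names
    writer.writerows(data)  # one row per result dictionary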
