Hi, I am trying to scrape and am wondering whether there is a one-liner or simple way to handle None.
If None, do something; if not None, do something else. What would be the most Pythonic way of handling None that references the value itself?
Right now what I have is
discount = soup.find_all('span', {"class": "jsx-3024393758 label discount"})
if len(discount) == 0:
    discount = ""
else:
    discount = discount[0].text
If you only want to grab the text of the first element, I would recommend using find() instead of find_all().
To check whether the element exists, you can use an if statement:
discount = e.text if (e := soup.find('span', {"class":"jsx-3024393758 label discount"})) else ''
or a try/except:
try:
    discount = soup.find('span', {"class": "jsx-3024393758 label discount"}).text
except AttributeError:  # find() returned None
    discount = ''
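Another compact option (my own sketch, not from the answers above): getattr() with a default covers the None case in one line, with no walrus operator or try/except:
# getattr() falls back to '' when find() returns None
discount = getattr(soup.find('span', {"class": "jsx-3024393758 label discount"}), 'text', '')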
Related
While web scraping with BeautifulSoup I have to write try/except multiple times. See the code below:
try:
    addr1 = soup.find('span', {'class': 'addr1'}).text
except:
    addr1 = ''
try:
    addr2 = soup.find('span', {'class': 'addr2'}).text
except:
    addr2 = ''
try:
    city = soup.find('strong', {'class': 'city'}).text
except:
    city = ''
The problem is that I have to write try/except multiple times, and that is very annoying. I want to write a function to handle the exceptions.
I tried to use the following function, but it still raises an error:
def datascraping(var):
    try:
        return var
    except:
        return None

addr1 = datascraping(soup.find('span', {'class': 'addr1'}).text)
addr2 = datascraping(soup.find('span', {'class': 'addr2'}).text)
Can anyone help me to solve the issue?
Use a for loop that iterates over a sequence of your arguments, check whether the return value is None before attempting to get the text attribute, and store the results in a dictionary. This way there is no need for try/except at all.
seq = [('span', 'addr1'), ('span', 'addr2'), ('strong', 'city')]
results = {}
for tag, value in seq:
    var = soup.find(tag, {'class': value})
    if var is not None:
        results[value] = var.text
    else:
        results[value] = ''
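After the loop, each value can be read back out of the dictionary; missing elements simply end up as '':
addr1 = results['addr1']  # '' if the element was not found
city = results['city']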
I am using the Beautiful Soup library to extract data from web pages. Sometimes an element cannot be found in the page at all, and if we then try to access a sub-element we get an error like 'NoneType' object has no attribute 'find'.
Take, for example, the code below:
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
primary_name = soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text
company_number = soup.find('p', id="company-number").find('strong').text
If I want to handle the errors, I have to write something like the following:
try:
    primary_name = soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text
except:
    primary_name = None
try:
    company_number = soup.find('p', id="company-number").find('strong').text.strip()
except:
    company_number = None
And if there are too many elements, we end up with lots of try/except statements. I actually want to write code in the following manner:
def error_handler(_):
    try:
        return _
    except:
        return None

primary_name = error_handler(soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text)
# this will still raise the error
I know the above code won't work: the argument expression is evaluated before error_handler is even called, so it still raises the error.
If you have any idea how to make this code look cleaner, please show me.
I don't know if this is the most efficient way, but you can pass a lambda expression to the error_handler:
def error_handler(_):
    try:
        return _()
    except:
        return None

primary_name = error_handler(lambda: soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text)
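The same wrapper handles the other field from the question; the lambda delays evaluation until it runs inside the try block:
company_number = error_handler(lambda: soup.find('p', id="company-number").find('strong').text.strip())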
So, you are looking for a way to handle exceptions for a lot of elements.
For this, I will assume that you, like most scrapers, use a for loop.
You can handle the exceptions as follows:
soup = BeautifulSoup(somehtml, "html.parser")
a_big_list_of_data = soup.find_all("div", {"class": "cards"})
for items in a_big_list_of_data:
    try:
        # find() returns None for a missing element, so .text raises AttributeError
        name = items.find("h3", {"id": "name"}).text
        price = items.find("h5", {"id": "price"}).text
    except AttributeError:
        continue
I'm building a real estate web scraper and I'm having problems when a certain index doesn't exist in the HTML.
How can I fix this? The code that is having this trouble is:
info_extra = container.find_all('div', class_="info-right text-xs-right")[0].text
I'm new to web scraping, so I'm kinda lost.
Thanks!
One general way is to check the length before you attempt to access the index.
divs = container.find_all('div', class_="info-right text-xs-right")
if len(divs) > 0:
    info_extra = divs[0].text
else:
    info_extra = None
You can simplify this further by knowing that an empty list is falsy.
divs = container.find_all('div', class_="info-right text-xs-right")
if divs:
    info_extra = divs[0].text
else:
    info_extra = None
You can simplify even further by using the walrus operator :=
if (divs := container.find_all('div', class_="info-right text-xs-right")):
    info_extra = divs[0].text
else:
    info_extra = None
Or all in one line:
info_extra = divs[0].text if (divs := container.find_all('div', class_="info-right text-xs-right")) else None
I'm new to web scraping too, and most of my problems come from asking for an element that doesn't exist on the page.
Have you tried a try/except block?
try:
    info_extra = container.find_all('div', class_="info-right text-xs-right")[0].text
except IndexError:  # the result list was empty
    info_extra = None
https://docs.python.org/3/tutorial/errors.html
Good luck
First of all, you should always check data before doing anything with it.
Now, if there is just one result on the site for your selector:
info_extra_element = container.select_one('div.info-right.text-xs-right')
if info_extra_element:
    info_extra = info_extra_element.text
else:
    # On the unexpected situation where the selector couldn't be found,
    # report it and do something to prevent your program from crashing.
    print("selector couldn't be found on the page")
    info_extra = ''
If there is a list of elements that match your selector:
info_extra_elements = container.select('div.info-right.text-xs-right')
info_extra_texts = []
for element in info_extra_elements:
    info_extra_texts.append(element.text)
PS.
Based on this answer, it's good practice to use a CSS selector when you want to filter based on class.
The find method can be used when you just want to filter based on the element tag, as sketched below.
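A minimal sketch of that split, reusing the selector from this question:
# filter based on class: CSS selector
element = container.select_one('div.info-right.text-xs-right')
# filter based on element tag alone: find
element = container.find('div')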
<span class="cname">
<em class="multiple">2017</em> Ford
</span>
<span class="cname">
Toyota
</span>
I want to get only "Ford" and "Toyota" from the spans.
test.find_element_by_class_name('cname').text
returns "2017 Ford" and "Toyota". So how can I get just the particular text of each span?
Pure XPath solution:
//span[@class='cname']//text()[not(parent::em[@class='multiple'])]
And if you also want to filter out whitespace-only text nodes:
//span[@class='cname']//text()[not(parent::em[@class='multiple']) and not(normalize-space()='')]
Both return text nodes rather than elements, so Selenium will probably fail.
Take a look here: https://sqa.stackexchange.com/a/33097 on how to get a text node.
Otherwise use this answer: https://stackoverflow.com/a/67518169/3710053
EDIT:
Another way to go is this XPath:
//span[@class='cname']
and then use Python to keep only the direct text() nodes (see EDIT 2 below).
EDIT 2
all_text = driver.find_element_by_xpath("//span[@class='cname']").text
child_text = driver.find_element_by_xpath("//span[@class='cname']/em[@class='multiple']").text
parent_text = all_text.replace(child_text, '')
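With the sample HTML from the question, all_text would be "2017 Ford" and child_text would be "2017", so (assuming those values) the replace leaves just the parent's own text:
print(parent_text.strip())  # -> "Ford"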
You can also check whether the text is an integer: if it is, skip it (or do something else); otherwise print it, for //span[@class='cname'].
Code:
from selenium.webdriver.common.by import By

cname_list = driver.find_elements(By.XPATH, "//span[@class='cname']")
for cname in cname_list:
    if cname.text.isdigit():
        print("It is an integer")
    else:
        print(cname.text)
or
cname_list = driver.find_elements(By.XPATH, "//span[@class='cname']")
for cname in cname_list:
    try:
        # .text is always a str, so test for an integer by attempting a conversion
        int(cname.text)
        print("We don't like int for this use case")  # if you don't want this you can simply remove this line
    except ValueError:
        print(cname.text)
You can get the parent element's text without the child element's text as follows:
total_text = driver.find_element_by_xpath(parent_div_element_xpath).text
child_text = driver.find_element_by_xpath(child_div_element_xpath).text
parent_only_text = total_text.replace(child_text, '')
So in your specific case, try the following:
total_text = driver.find_element_by_xpath("//span[@class='cname']").text
child_text = driver.find_element_by_xpath("//*[@class='multiple']").text
parent_only_text = total_text.replace(child_text, '')
Or, to be more precise:
father = driver.find_element_by_xpath("//span[@class='cname']")
total_text = father.text
child_text = father.find_element_by_xpath(".//*[@class='multiple']").text
parent_only_text = total_text.replace(child_text, '')
In a general case you can define and use the following method:
def get_text_excluding_children(driver, element):
    return driver.execute_script("""
        return jQuery(arguments[0]).contents().filter(function() {
            return this.nodeType == Node.TEXT_NODE;
        }).text();
    """, element)
The element argument passed here is the WebElement returned by driver.find_element.
In your particular case you can find the element with:
element = driver.find_element_by_xpath("//span[@class='cname']")
and then pass it to get_text_excluding_children, which will return the required text.
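Putting the two steps together (a sketch; note the snippet above assumes jQuery is loaded on the page):
element = driver.find_element_by_xpath("//span[@class='cname']")
print(get_text_excluding_children(driver, element).strip())  # e.g. "Ford" for the first span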
I am trying to scrape an ID attribute from some HTML. It exists twice, and every time I print it, I get it twice. This is how I scrape it:
for review in soup.find_all("div", {"class": "reviewContainer"}):
    for review2 in review.findAll(True, {'id': True}):
        if len(review2) > 0:
            userid = review2['id']
            print(userid)
        else:
            userid = "N/A"
            print(userid)
Output:
ID_123
ID_123
ID_456
ID_456
I tried adding review2['id'].next_element to get only the first occurrence, but I get an error. Is there a solution so that I get the first found element instead of getting it twice?
Try adding a conditional check to see if you've already found that userid before:
for review in soup.find_all("div", {"class": "reviewContainer"}):
    userid_found = []
    for review2 in review.findAll(True, {'id': True}):
        if len(review2) > 0:
            userid = review2['id']
            if userid not in userid_found:
                userid_found.append(userid)
                print(userid)
        else:
            userid = "N/A"
            print(userid)
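A variant of the same idea (my own sketch, not from the thread): track the seen IDs in a set that lives outside the loop, which also dedupes when the same ID shows up in more than one reviewContainer:
seen_ids = set()
for review in soup.find_all("div", {"class": "reviewContainer"}):
    for tagged in review.find_all(True, {'id': True}):
        userid = tagged['id']
        if userid not in seen_ids:  # only print each ID the first time it appears
            seen_ids.add(userid)
            print(userid)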