Is there a way to make BeautifulSoup look for a class and, if it exists, run the rest of the script? I am trying this:
if soup.find_all("div", {"class": "info"}) == True:
    print("Tag Found")
I've also tried the following, but it didn't work and gave an error about having too many attributes:
if soup.has_attr("div", {"class": "info"}):
    print("Tag Found")
You're very close. soup.find_all will return an empty list if it doesn't find any matches, but your if statement is comparing its return value against a literal bool. Instead, check its truthiness by omitting the == True:
if soup.find_all("div", {"class": "info"}):
    print("Tag Found")
Why not simply this:
if soup.find("div", {"class": "info"}) is not None:
    print("Tag Found")
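To make the difference concrete, here is a minimal self-contained sketch (the HTML snippet is invented for illustration) showing why == True fails while a plain truthiness check and an is not None check both work:

```python
from bs4 import BeautifulSoup

html = '<div class="info">hello</div><div class="other">world</div>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns a (possibly empty) list; a non-empty list is truthy,
# but it is never the literal True.
matches = soup.find_all("div", {"class": "info"})
print(matches == True)   # False, even though a match exists
print(bool(matches))     # True

# find returns the first matching Tag, or None when nothing matches.
print(soup.find("div", {"class": "nope"}) is None)  # True
```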
Related
I am using the Beautiful Soup library to extract data from webpages. Sometimes an element cannot be found in the page at all, and if we then try to access a sub-element of it we get an error like 'NoneType' object has no attribute 'find'.
For example, take the code below:
import requests
from bs4 import BeautifulSoup

res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
primary_name = soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text
company_number = soup.find('p', id="company-number").find('strong').text
If I want to handle the error, I have to write something like below.
try:
    primary_name = soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text
except:
    primary_name = None

try:
    company_number = soup.find('p', id="company-number").find('strong').text.strip()
except:
    company_number = None
And if there are many elements, we end up with lots of try/except blocks. I actually want to write code in the manner below.
def error_handler(_):
    try:
        return _
    except:
        return None

primary_name = error_handler(soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text)
# this will still raise the error
I know the code above won't work, because the argument expression is evaluated before error_handler is even called, so the error is still raised.
If you have any idea how to make this code look cleaner, please show me.
I don't know if this is the most efficient way, but you can pass a lambda expression to error_handler, which delays evaluation until inside the try block:
def error_handler(_):
    try:
        return _()
    except:
        return None

primary_name = error_handler(lambda: soup.find('div', {"class": "company-header"}).find('p', {"class": "heading-xlarge"}).text)
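As a quick self-contained check of the approach (the HTML, company name, and missing element are invented for illustration): because the lambda delays evaluation, the chained .find() calls only run inside the try block, and a missing element yields None instead of raising AttributeError:

```python
from bs4 import BeautifulSoup

def error_handler(f):
    try:
        return f()
    except AttributeError:
        return None

soup = BeautifulSoup(
    '<div class="company-header"><p class="heading-xlarge">Acme Ltd</p></div>',
    "html.parser")

# The element exists, so the lambda evaluates normally.
name = error_handler(lambda: soup.find('div', {"class": "company-header"})
                     .find('p', {"class": "heading-xlarge"}).text)
print(name)  # Acme Ltd

# This element is missing: the first .find() returns None, the chained
# .find() raises AttributeError inside the lambda, and we get None back.
number = error_handler(lambda: soup.find('p', id="company-number").find('strong').text)
print(number)  # None
```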
So you are looking for a way to handle exceptions for a lot of elements.
For this, I will assume that you (like most scrapers) use a for loop.
You can handle the exceptions as follows:
soup = BeautifulSoup(somehtml)
a_big_list_of_data = soup.find_all("div", {"class": "cards"})
for items in a_big_list_of_data:
    try:
        name = items.find_all("h3", {"id": "name"})
        price = items.find_all("h5", {"id": "price"})
    except:
        continue
I wrote a for loop that I thought was extracting the text from the HTML elements I had indicated, using the BeautifulSoup library. It looks like this:
import urllib.parse

import requests
from bs4 import BeautifulSoup as bsoup

url = "https://www.researchgate.net/profile/David_Severson"
r = requests.get(url)
data = r.text
soup = bsoup(data, "lxml")
item = soup.find("div", {"class": "section section-research"})
papers = [paper for paper in item.find_all("div", {"class": "nova-o-stack__item"})]
for p in papers:
    title = p.find("div", {"class": "nova-e-text nova-e-text--size-l nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__title nova-v-publication-item__title--clamp-3"})
    abstract = p.find("div", {"class": "nova-e-text nova-e-text--size-m nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__description nova-v-publication-item__description--clamp-3"})
    views = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__metrics"})
    date = p.find("li", {"class": "nova-e-list__item publication-item-meta-items__meta-data-item"})
    authors = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__person-list"})
    link = p.find("a", {"class": "nova-e-badge nova-e-badge--color-green nova-e-badge--display-block nova-e-badge--luminosity-high nova-e-badge--size-l nova-e-badge--theme-solid nova-e-badge--radius-m nova-v-publication-item__type"}, href=True)
    if link:
        full_link = urllib.parse.urljoin("https://www.researchgate.net/", link["href"])
        print(full_link)
    print(p.text)
I noticed that it was printing out more than what I had indicated in the body of the loop. After trying to debug each of the individual items (title, abstract, etc.), I realized the loop was not accessing them at all.
For example, if I commented them all out, or removed them entirely, it still gave the exact same output:
for p in papers:
    print(p.text)
    print("")
(This gives me exactly the same output as the code with the contents in the body.)
Somehow the loop is not even reading the elements it is supposed to use while iterating through p. How can I get it to recognize the code contained therein, and extract the desired elements as defined in the body of the loop?
The problem is the space in the class name you specified.
papers = [paper for paper in item.find_all("div", {"class": "nova-o-stack__item"})]
I removed the space and your code worked, so retry it using this line.
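To illustrate why a stray space matters, here is a minimal sketch (the HTML is invented): BeautifulSoup compares the class string you pass against the tag's classes, so a trailing space makes it a different string that matches nothing, and find_all returns an empty list for the loop to iterate over:

```python
from bs4 import BeautifulSoup

html = ('<div class="nova-o-stack__item">paper 1</div>'
        '<div class="nova-o-stack__item">paper 2</div>')
soup = BeautifulSoup(html, "html.parser")

# The exact class name matches both divs...
exact = soup.find_all("div", {"class": "nova-o-stack__item"})
print(len(exact))       # 2

# ...but the same name with a trailing space is a different string and
# matches nothing, so a loop over the result silently does nothing.
with_space = soup.find_all("div", {"class": "nova-o-stack__item "})
print(len(with_space))  # 0
```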
I have written code to extract the URL and title of a book from a page using BeautifulSoup.
But it is not extracting the name of the book, Astounding Stories of Super-Science April 1930, between the > and </a> tags.
How can I extract the name of the book?
I have tried the find_next method recommended in another question, but I get an AttributeError on that.
HTML:
<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
(English)
</li>
Code below:
def make_soup(BASE_URL):
    r = requests.get(BASE_URL, verify=False)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_text_urls(html):
    soup = make_soup(BASE_URL)
    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a['title']
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass

extract_text_urls(filename)
You should use the text attribute of the element. The following works for me:
def make_soup(BASE_URL):
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_text_urls(html):
    soup = make_soup(BASE_URL)
    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a.text
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass

extract_text_urls('http://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)')
I get the following output for the element in question
//www.gutenberg.org/ebooks/29390 Astounding Stories of Super-Science April 1930
According to the BeautifulSoup documentation, the .string property should accomplish what you are trying to do, by editing your original listing this way:
# ...
try:
    print li.a['href'], li.a['title']
    print "\n"
    print li.a.string
except KeyError:
    pass
# ...
You probably want to guard it with something like
if "extiw" in li.a.get('class', []):
    print li.a.string
since, in your example, only the anchors of class extiw contain a book title. (Note that in Beautiful Soup 4 the class attribute is multi-valued and comes back as a list, so test membership with in rather than comparing it to a string.)
Thanks @wilbur for pointing out the optimal solution.
I did not see how you can extract the text within the tag. I would do something like this:
from bs4 import BeautifulSoup as bs
from urllib2 import urlopen as uo

soup = bs(uo(html))
for li in soup.findAll('li'):
    a = li.find('a')
    book_title = a.contents[0]
    print book_title
To get just the text that is not inside any tags, use the get_text() method. It is described in the documentation here.
I can't test it because I don't know the URL of the page you are trying to scrape, but you can probably just call it on the li tag, since there doesn't seem to be any other text.
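A small, self-contained sketch of that idea, using the li markup from the question: get_text() concatenates all the text in the element with the tags stripped, so it picks up both the book title and the trailing language note:

```python
from bs4 import BeautifulSoup

html = '''<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
(English)
</li>'''

soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")
# get_text() pulls the text out of every descendant, tags stripped.
print(li.get_text())
```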
Try replacing this:
for li in soup.findAll('li'):
    try:
        try:
            print li.a['href'], li.a['title']
            print "\n"
        except KeyError:
            pass
    except TypeError:
        pass
with this:
for li in soup.findAll('li'):
    try:
        print(li.get_text())
        print("\n")
    except TypeError:
        pass
I am new to Beautiful Soup and Python in general, but my question is: how would I go about specifying a class that is dynamic (it embeds a productId)? Can I use a mask, or search on part of the class, i.e. "product_summary*"?
<li class="product_summary clearfix {productId: 247559}">
</li>
I want to get the product_info and also the product_image (src) data below the product_summary class list, but I don't know how to call find_all when my class is dynamic. I hope this makes sense. My goal is to insert this data into a MySQL table, so my thought is that I need to store all the data in variables at the highest (product_summary) level. Thanks in advance for any help.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

url = Request('http://www.shopwell.com/sodas/c/22', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(url).read()
soup = BeautifulSoup(webpage, "html.parser")

product_info = soup.find_all("div", {"class": "product_info"})
for item in product_info:
    detail_link = item.find("a", {"class": "detail_link"}).text
    try:
        detail_link_h2 = ""
        detail_link_h2 = item.h2.text.replace("\n", "")
    except:
        pass
    try:
        detail_link_h3 = ""
        detail_link_h3 = item.h3.text.replace("\n", "")
    except:
        pass
    try:
        detail_link_h4 = item.h4.text.replace("\n", "")
    except:
        pass
    print(detail_link_h2 + ", " + detail_link_h3 + ", " + detail_link_h4)

product_image = soup.find_all("div", {"class": "product_image"})
for item in product_image:
    img1 = item.find("img")
    print(img1)
I think you can use a regular expression, like this:
import re
product_image = soup.find_all("div", {"class": re.compile("^product_image")})
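As a quick self-contained check of this (the HTML is invented for illustration): when you pass a compiled pattern for class, BeautifulSoup tests it against each of the tag's classes individually, so a prefix pattern still matches a tag that carries extra classes:

```python
import re
from bs4 import BeautifulSoup

html = ('<div class="product_image large">img here</div>'
        '<div class="product_info">info here</div>')
soup = BeautifulSoup(html, "html.parser")

# The pattern is tried against each class, so "product_image large"
# matches via its first class; "product_info" does not.
matches = soup.find_all("div", {"class": re.compile("^product_image")})
print(len(matches))           # 1
print(matches[0].get_text())  # img here
```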
Use:
soup.find_all("li", class_="product_summary")
Or just:
soup.find_all(class_="product_summary")
See the documentation for searching by CSS class.
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_
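To tie this back to the question's markup, a minimal sketch: BeautifulSoup splits the class attribute on whitespace, so the li from the question carries product_summary as one of its classes (the dynamic {productId: 247559} part just becomes extra entries in the class list), and class_="product_summary" finds it directly:

```python
from bs4 import BeautifulSoup

html = '<li class="product_summary clearfix {productId: 247559}">soda</li>'
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li", class_="product_summary")
print(len(items))  # 1
# The dynamic part survives as extra entries in the class list.
print(items[0]["class"])
```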
I am scraping website data using Beautiful Soup. I want the anchor text (My name is nick) of the following, but I have searched a lot on Google and can't find a solution to my query.
news_panel = soup.findAll('div', {'class': 'menuNewsPanel_MenuNews1'})
for news in news_panel:
    temp = news.find('h2')
    print temp
Output:
<h2 class="menuNewsHl2_MenuNews1">My name is nick</h2>
But I want output like this: My name is nick
Just grab the text attribute:
>>> soup = BeautifulSoup('''<h2 class="menuNewsHl2_MenuNews1">My name is nick</h2>''')
>>> soup.text
u'My name is nick'
Your error is probably occurring because you don't have that specific tag in your input string.
Check if temp is not None:
news_panel = soup.findAll('div', {'class': 'menuNewsPanel_MenuNews1'})
for news in news_panel:
    temp = news.find('h2')
    if temp:
        print temp.text
Or put your print statement in a try ... except block:
news_panel = soup.findAll('div', {'class': 'menuNewsPanel_MenuNews1'})
for news in news_panel:
    try:
        print news.find('h2').text
    except AttributeError:
        continue
Try using this:
all_string = soup.find_all("h2")[0].get_text()