Extract content of <a> tag - python

I have written code to extract the url and title of a book using BeautifulSoup from a page.
But it is not extracting the name of the book Astounding Stories of Super-Science April 1930 between > and </a> tags.
How can I extract the name of the book?
I have tried the findnext method recommended in another question, but I get an AttributeError on that.
HTML:
<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
(English)
</li>
Code below:
def make_soup(BASE_URL):
    r = requests.get(BASE_URL, verify = False)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup
def extract_text_urls(html):
    soup = make_soup(BASE_URL)
    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a['title']
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass
extract_text_urls(filename)

You should use the text attribute of the element. The following works for me:
import requests
from bs4 import BeautifulSoup
def make_soup(BASE_URL):
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup
def extract_text_urls(html):
    # use the url passed in, not an undefined global
    soup = make_soup(html)
    for li in soup.findAll('li'):
        try:
            try:
                # .text is the content between <a ...> and </a>
                print li.a['href'], li.a.text
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass
extract_text_urls('http://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)')
I get the following output for the element in question
//www.gutenberg.org/ebooks/29390 Astounding Stories of Super-Science April 1930
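To see the difference between the title attribute and the link text without fetching the page, here is a minimal sketch that parses the HTML snippet from the question directly. It assumes Python 3 and bs4; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# The <li> from the question, pasted in as a string
html = """
<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
(English)
</li>
"""

soup = BeautifulSoup(html, "html.parser")
first_link = soup.li.a  # first <a> inside the first <li>

href = first_link["href"]         # attribute lookup
title_attr = first_link["title"]  # the title attribute, not the book name
link_text = first_link.text       # the text between <a ...> and </a>

print(href, title_attr, link_text)
```

So li.a['title'] only gives the "ebook:29390" attribute value, while li.a.text gives the book name.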

According to the BeautifulSoup documentation the .string property should accomplish what you are trying to do, by editing your original listing this way:
# ...
try:
    print li.a['href'], li.a['title']
    print "\n"
    print li.a.string
except KeyError:
    pass
# ...
You probably want to surround it with something like
if "extiw" in li.a.get('class', []):
    print li.a.string
since, in your example, only the anchors of class extiw contain a book title.
Thanks @wilbur for pointing out the optimal solution.
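A small sketch of the class check, assuming Python 3 and bs4 and reusing the HTML from the question. In bs4 the class attribute comes back as a list (for example ["extiw"]), so a membership test is used rather than comparing against a plain string.

```python
from bs4 import BeautifulSoup

html = """
<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
(English)
</li>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for a in soup.find_all("a"):
    # a.get("class", []) avoids a KeyError for anchors without a class
    if "extiw" in a.get("class", []):
        titles.append(a.string)

# The image anchor wraps only an <img>, so it has no string of its own
image_link_string = soup.find("a", class_="image").string

print(titles, image_link_string)
```

Note that .string is None for the image anchor, which is one more reason to filter on the class first.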

I did not see how you can extract the text within the tag. I would do something like this:
from bs4 import BeautifulSoup as bs
from urllib2 import urlopen as uo
soup = bs(uo(html), 'html.parser')
for li in soup.findAll('li'):
    a = li.find('a')
    # contents[0] is the first child of the <a> tag, here the title text
    book_title = a.contents[0]
    print book_title

To get just the text, without any of the tags, use the get_text() method. It is described in the Beautiful Soup documentation.
I can't test it because I don't know the url of the page you are trying to scrape, but you can probably just do it with the li tag since there doesn't seem to be any other text.
Try replacing this:
for li in soup.findAll('li'):
    try:
        try:
            print li.a['href'], li.a['title']
            print "\n"
        except KeyError:
            pass
    except TypeError:
        pass
with this:
for li in soup.findAll('li'):
    try:
        print(li.get_text())
        print("\n")
    except TypeError:
        pass
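One thing to watch with get_text() on the whole li: in the HTML from the question the list item also contains "(English)", so that text is included too. A minimal sketch, assuming Python 3 and bs4, comparing the two:

```python
from bs4 import BeautifulSoup

html = """
<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
(English)
</li>
"""

soup = BeautifulSoup(html, "html.parser")
li = soup.li

# All text under the <li>, pieces stripped and joined with a space
whole_item = li.get_text(" ", strip=True)

# Only the text of the first link
link_only = li.a.get_text()

print(whole_item)
print(link_only)
```

If only the book name is wanted, calling get_text() on li.a rather than on li is probably the safer choice.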

Related

How to get the text of div class?

#Days
try:
    days2.append(link2.find_all('div',{'class':'list-card variable-text list-card-img-overlay'}).text)
except Exception:
    days2.append('N/A')
#Views
try:
    views.append(link2.find_all('div',{'class':'Text-c11n-8-53-2__sc-aiai24-0 duChdW'}[2]).text)
except Exception:
    views.append('N/A')
https://www.zillow.com/manhattan-new-york-ny-10023/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%2210023%22%2C%22mapBounds%22%3A%7B%22west%22%3A-73.9951494604187%2C%22east%22%3A-73.9682415395813%2C%22south%22%3A40.763770638446054%2C%22north%22%3A40.7898340773195%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A61637%2C%22regionType%22%3A7%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A15%7D
I keep getting N/A instead of 2 hours and 88 views.
If you want to extract the time on Zillow data you mentioned in the comments, you could search for divs with the appropriate class name.
from bs4 import BeautifulSoup
html = """
<div class="hdp__sc-qe1dn6-1 jqZymu"><div class="Text-c11n-8-53-2__sc-aiai24-0 iBdXNb">Time on Zillow</div><div class="Text-c11n-8-53-2__sc-aiai24-0 duChdW">36 minutes</div></div>
"""
soup = BeautifulSoup(html, 'html.parser')
time = soup.find('div', class_="Text-c11n-8-53-2__sc-aiai24-0 duChdW")
print(time.text)
# 36 minutes
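The N/A values in the question most likely come from the except branch: find_all() returns a ResultSet (a list of tags), and a ResultSet has no .text, so the AttributeError is swallowed by except Exception. A minimal sketch of that, assuming Python 3 and bs4 and using the same HTML fragment as above:

```python
from bs4 import BeautifulSoup

html = """
<div class="hdp__sc-qe1dn6-1 jqZymu"><div class="Text-c11n-8-53-2__sc-aiai24-0 iBdXNb">Time on Zillow</div><div class="Text-c11n-8-53-2__sc-aiai24-0 duChdW">36 minutes</div></div>
"""

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all("div", class_="Text-c11n-8-53-2__sc-aiai24-0 duChdW")

# Asking the whole ResultSet for .text fails
try:
    matches.text
    error_raised = False
except AttributeError:
    error_raised = True

# Read the text from each tag in the list instead
values = [div.get_text(strip=True) for div in matches]

print(error_raised, values)
```

To pick one particular match, index the list after the find_all() call (for example matches[0]) rather than indexing the attribute dictionary.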

Why does Beautifulsoup.find() not give the specific result?

I have this code below, and I am trying to get 'Oswestry, England' as the result.
label = soup.findAll('span',{'class':"ProfileHeaderCard-locationText"})
print(label)
But, it doesn't give me a value.
Here is what the HTML code looks like:
<span class="ProfileHeaderCard-locationText u-dir" dir="ltr">
<a data-place-id="5b756a1991aa8648" href="/search?q=place%3A5b756a1991aa8648">Oswestry, England</a>
</span>
When I print label, the result is the HTML code I posted above.
Here is my full code:
import requests as req
from bs4 import BeautifulSoup
usernames = []  # list of usernames
location_list = []
for x in usernames:
    url= "https://twitter.com/" + x
    try:
        html = req.get(url)
    except Exception as e:
        print("Failed to")
        continue
    soup = BeautifulSoup(html.text,'html.parser')
    try:
        label = soup.find('span',{'class':"ProfileHeaderCard-locationText"})
        label_formatted = label.string.lstrip()
        label_formatted = label_formatted.rstrip()
        if label_formatted != "":
            location_list.append(label_formatted)
            print(x + ' : ' + label_formatted)
        else:
            print('Not found')
    except:
        print('Not found')
You should call find, not find_all to get a single element. Then use the .text attribute to get the text content.
label = soup.find('span',{'class':"ProfileHeaderCard-locationText"})
print(label.text)
For anyone having the same problem, I was able to get the inner text from the HTML by just doing this:
label2 = soup.findAll('span',{"class":"ProfileHeaderCard-locationText"})[0].get_text()
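It seems that you were searching for a span tag whose class attribute exactly matches your query class. If the span has two classes and the whole attribute string is compared, that test fails and no results are returned. Note that Beautiful Soup 4 matches a single class name against each of the tag's classes, so there the lookup itself should succeed.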
Using css selectors, you could try your solution as:
from bs4 import BeautifulSoup as BS
soup = BS('''<span class="ProfileHeaderCard-locationText u-dir">.....</span>''', 'html.parser')
soup.select('span.ProfileHeaderCard-locationText')
returns span tags that contain your prescribed class.
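A minimal sketch, assuming Python 3 and bs4 (with its usual CSS selector support installed), run against the span from the question. It suggests that in bs4 both the dictionary lookup and the CSS selector find a span that carries two classes, and that the part of the question's code that fails is .string: the span contains whitespace plus an a tag, so .string is None, while get_text() still works.

```python
from bs4 import BeautifulSoup

html = '''<span class="ProfileHeaderCard-locationText u-dir" dir="ltr">
<a data-place-id="5b756a1991aa8648" href="/search?q=place%3A5b756a1991aa8648">Oswestry, England</a>
</span>'''

soup = BeautifulSoup(html, "html.parser")

# A single class name matches even though the tag has two classes
by_find = soup.find("span", {"class": "ProfileHeaderCard-locationText"})
by_select = soup.select_one("span.ProfileHeaderCard-locationText")

# .string only works when the tag has exactly one child
span_string = by_find.string

# get_text() collects the text of all descendants
location = by_select.get_text(strip=True)

print(span_string, location)
```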

Python, Beautiful Soup + how to parse dynamic class?

I am new to Beautiful Soup and Python in general, but my question is how would I go about specifying a class that is dynamic (productId)? Can I use a mask or search part of the class, i.e. "product summary*"
<li class="product_summary clearfix {productId: 247559}">
</li>
I want to get the product_info and also the product_image (src) data below the product_summary class list, but I don't know how to find_all when my class is dynamic. Hope this makes sense. My goal is to insert this data into a MySQL table, so my thought is I need to store all data into variables at the highest (product summary) level. Thanks in advance for any help.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = Request('http://www.shopwell.com/sodas/c/22', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(url).read()
soup = BeautifulSoup(webpage)
product_info = soup.find_all("div", {"class": "product_info"})
for item in product_info:
    detail_link = item.find("a", {"class": "detail_link"}).text
    try:
        detail_link_h2 = ""
        detail_link_h2 = item.h2.text.replace("\n", "")
    except:
        pass
    try:
        detail_link_h3 = ""
        detail_link_h3 = item.h3.text.replace("\n", "")
    except:
        pass
    try:
        detail_link_h4 = item.h4.text.replace("\n", "")
    except:
        pass
    print(detail_link_h2 + ", " + detail_link_h3 + ", " + detail_link_h4)
product_image = soup.find_all("div", {"class": "product_image"})
for item in product_image:
    img1 = item.find("img")
    print(img1)
I think you can use regular expressions like this:
import re
product_image = soup.find_all("div", {"class": re.compile("^product_image")})
Use:
soup.find_all("li", class_="product_summary")
Or just:
soup.find_all(class_="product_summary")
See the documentation for searching by CSS class.
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_
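Putting the class_ lookup together with the goal of keeping everything at the product_summary level, here is a rough sketch, assuming Python 3 and bs4. Only the li opening tag comes from the question; the markup inside it, the file paths, and the dictionary keys are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Only the <li> tag is from the question; the inner markup is hypothetical
html = """
<li class="product_summary clearfix {productId: 247559}">
<div class="product_image"><img src="/images/cola.png"/></div>
<div class="product_info"><a class="detail_link" href="/cola">Cola</a></div>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

products = []
for li in soup.find_all("li", class_="product_summary"):
    # bs4 splits the class attribute on whitespace, so join it back together
    class_value = " ".join(li["class"])
    match = re.search(r"productId:\s*(\d+)", class_value)

    image = li.find("div", class_="product_image").find("img")
    link = li.find("a", class_="detail_link")

    products.append({
        "product_id": match.group(1) if match else None,
        "image_src": image["src"],
        "name": link.get_text(strip=True),
    })

print(products)
```

Looping over the li elements and searching inside each one keeps the image and the info of the same product together, which should make the later database insert simpler.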

Can print but not return html table: "TypeError: ResultSet object is not an iterator"

Python newbie here. Python 2.7 with beautifulsoup 3.2.1.
I'm trying to scrape a table from a simple page. I can easily get it to print, but I can't get it to return to my view function.
The following works:
@app.route('/process')
def process():
    queryURL = 'http://example.com'
    br.open(queryURL)
    html = br.response().read()
    soup = BeautifulSoup(html)
    table = soup.find("table")
    print table
    return 'All good'
I can also return html successfully. But when I try to return table instead of return 'All good' I get the following error:
TypeError: ResultSet object is not an iterator
I also tried:
br.open(queryURL)
html = br.response().read()
soup = BeautifulSoup(html)
table = soup.find("table")
out = []
for row in table.findAll('tr'):
    colvals = [col.text for col in row.findAll('td')]
    out.append('\t'.join(colvals))
return table
With no success. Any suggestions?
You're trying to return an object rather than its text, so return table.text should be what you are looking for. Full modified code:
def process():
    queryURL = 'http://example.com'
    br.open(queryURL)
    html = br.response().read()
    soup = BeautifulSoup(html)
    table = soup.find("table")
    return table.text
EDIT:
Since I understand now that you want the HTML code that forms the site instead of the values, you can do something like this example I made:
import urllib
url = urllib.urlopen('http://www.xpn.org/events/concert-calendar')
htmldata = url.readlines()
url.close()
for tag in htmldata:
    if '<th' in tag:
        print tag
    if '<tr' in tag:
        print tag
    if '<thead' in tag:
        print tag
    if '<tbody' in tag:
        print tag
    if '<td' in tag:
        print tag
As far as I know you can't do this with BeautifulSoup, because BeautifulSoup is more for parsing the HTML or printing it in a nice-looking manner. You can do what I did instead: have a for loop go through the HTML line by line and print each line that contains one of the tags.
If you want to store the output in a list to use later, you would do something like:
htmlCodeList = []
for tag in htmldata:
    if '<th' in tag:
        htmlCodeList.append(tag)
    if '<tr' in tag:
        htmlCodeList.append(tag)
    if '<thead' in tag:
        htmlCodeList.append(tag)
    if '<tbody' in tag:
        htmlCodeList.append(tag)
    if '<td' in tag:
        htmlCodeList.append(tag)
This saves each matching HTML line as a new element of the list, so the line with <td> would be at index 0, the next set of tags at index 1, etc.
After @Heinst pointed out that I was trying to return an object and not a string, I also found a more elegant solution: convert the BeautifulSoup object into a string and return it:
return str(table)
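To illustrate the difference between the two return values discussed above, here is a minimal sketch, assuming Python 3 and bs4 and a made-up one-row table. str() on a tag gives its HTML markup back as a plain string, while get_text() gives only the cell contents.

```python
from bs4 import BeautifulSoup

# Hypothetical page, standing in for the one fetched in the question
html = "<html><body><table><tr><td>a</td><td>b</td></tr></table></body></html>"

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

markup = str(table)            # the table as an HTML string
text_only = table.get_text()   # only the text inside the cells

print(markup)
print(text_only)
```

Either value is an ordinary string, which is what a view function would normally be expected to return.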

Using beautifulsoup to scrape <h2> tag

I am scraping data from a website using Beautiful Soup. I want the text (My name is nick) of the following h2 tag. I have searched Google a lot but can't find a solution that works.
news_panel = soup.findAll('div', {'class': 'menuNewsPanel_MenuNews1'})
for news in news_panel:
    temp = news.find('h2')
    print temp
output :
<h2 class="menuNewsHl2_MenuNews1">My name is nick</h2>
But I want output like this: My name is nick
Just grab the text attribute:
>>> soup = BeautifulSoup('''<h2 class="menuNewsHl2_MenuNews1">My name is nick</h2>''')
>>> soup.text
u'My name is nick'
Your error is probably occurring because you don't have that specific tag in your input string.
Check if temp is not None
news_panel = soup.findAll('div', {'class': 'menuNewsPanel_MenuNews1'})
for news in news_panel:
    temp = news.find('h2')
    if temp:
        print temp.text
or put your print statement in a try ... except block
news_panel = soup.findAll('div', {'class': 'menuNewsPanel_MenuNews1'})
for news in news_panel:
    try:
        print news.find('h2').text
    except AttributeError:
        continue
Try using this:
all_string=soup.find_all("h2")[0].get_text()
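The None check from the earlier answer can be sketched in Python 3 with bs4 like this. The second div, which has no h2, is made up to show why the check is needed; find() returns None there, and calling .text on None would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# First div is modelled on the question; the second one is hypothetical
html = """
<div class="menuNewsPanel_MenuNews1"><h2 class="menuNewsHl2_MenuNews1">My name is nick</h2></div>
<div class="menuNewsPanel_MenuNews1"><p>No headline here</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

headlines = []
for news in soup.find_all("div", class_="menuNewsPanel_MenuNews1"):
    h2 = news.find("h2")
    # find() returns None when the div has no <h2>
    if h2 is not None:
        headlines.append(h2.get_text(strip=True))

print(headlines)
```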
