Using beautifulsoup to scrape <h2> tag - python

I am scraping website data using Beautiful Soup. I want the anchor text (My name is nick) from the markup below, but I have searched Google a lot and can't find a solution.
news_panel = soup.findAll('div', {'class': 'menuNewsPanel_MenuNews1'})
for news in news_panel:
    temp = news.find('h2')
    print temp
Output:
<h2 class="menuNewsHl2_MenuNews1">My name is nick</h2>
But I want output like this: My name is nick

Just grab the text attribute:
>>> soup = BeautifulSoup('''<h2 class="menuNewsHl2_MenuNews1">My name is nick</h2>''')
>>> soup.text
u'My name is nick'

Your error is probably occurring because you don't have that specific tag in your input string.
Check if temp is not None
news_panel = soup.findAll('div', {'class': 'menuNewsPanel_MenuNews1'})
for news in news_panel:
    temp = news.find('h2')
    if temp:
        print temp.text
or put your print statement in a try ... except block
news_panel = soup.findAll('div', {'class': 'menuNewsPanel_MenuNews1'})
for news in news_panel:
    try:
        print news.find('h2').text
    except AttributeError:
        continue

Try using this:
all_string = soup.find_all("h2")[0].get_text()

Related

How can I extract a specific item attribute from an ebay listing using BeautifulSoup?

def get_data(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

current_data = get_data(link)
x = current_data.find_all(text="Style Code:")
I'm trying to get the style code of a shoe off eBay, but the problem is that it doesn't have a specific class or any kind of unique identifier, so I can't just use find() to get the data. Currently I search by text to find 'Style Code:', but how can I get to the next div? An example of a shoe product page would be this.
soup.select_one('span.ux-textspans:-soup-contains("Style Code:")').find_next('span').get_text(strip=True)
Try this,
spans = soup.find_all('span', attrs={'class': 'ux-textspans'})
style_code = None
for idx, span in enumerate(spans):
    if span.text == 'Style Code:':
        style_code = spans[idx + 1].text
        break
print(style_code)
# 554724-371
Since there are lots of similar spans (all with class 'ux-textspans'), you need to iterate through them and take the span immediately after 'Style Code:'.
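Here is a minimal, runnable version of that pattern against made-up markup (the class name mimics the eBay page, but the HTML itself is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for the eBay item-specifics markup (hypothetical sample)
html = '''
<div>
  <span class="ux-textspans">Brand:</span><span class="ux-textspans">Nike</span>
  <span class="ux-textspans">Style Code:</span><span class="ux-textspans">554724-371</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

spans = soup.find_all('span', attrs={'class': 'ux-textspans'})
style_code = None
for idx, span in enumerate(spans):
    if span.text == 'Style Code:':
        style_code = spans[idx + 1].text  # the value sits in the very next span
        break
print(style_code)  # 554724-371
```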

Scrape href not working with python

I have copies of this very code that I am trying to run, and every time I copy it line by line it isn't working right. I am more than frustrated and can't figure out where it is failing. What I am trying to do is go to a website, scrape the different ratings pages, which are labelled A, B, C... etc. Then I go to each page to pull the total number of pages it uses. I am trying to scrape the <span class='letter-pages' href='/ratings/A/1' and so on. What am I doing wrong?
import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
hrefs = []
ratings = []
ks = []
pages_scrape = []
for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])
for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9] + good_ratings)
    # elif good_ratings.startswith('/401k'):
    #     ks.append(url[:-9] + good_ratings)
del ratings[0]
del ratings[27:]
print(ratings)
for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find('span', class_='letter-pages'):
        # Not working here
        pages_scrape.append(href.attrs['href'])
        # Will print all the anchor tags with hrefs if I remove the above comment.
        print(href)
You are trying to get the href prematurely: you are extracting the attribute directly from a span tag that has nested a tags, rather than from a list of a tags.
for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    for a in span.find_all('a'):
        href = a.get('href')
        pages_scrape.append(href)
I didn't test this on all pages, but it worked for the first one. You pointed out that on some of the pages the content wasn't getting scraped, which is due to the span search returning None. To get around this you can do something like:
for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    if span:
        for a in span.find_all('a'):
            href = a.get('href')
            pages_scrape.append(href)
            print(href)
    else:
        print('span.letter-pages not found on ' + each_rating)
Depending on your use case you might want to do something different, but this will indicate to you which pages don't match your scraping model and need to be manually investigated.
You probably meant to do find_all instead of find -- so change
for href in soup.find('span', class_='letter-pages'):
to
for href in soup.find_all('span', class_='letter-pages'):
You want to be iterating over a list of tags, not a single tag. find gives you a single tag object; when you iterate over a single tag, you get its children (NavigableString and nested tag objects). find_all gives you the list of tag objects you want.
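To illustrate the difference, here is a sketch against toy markup (the hrefs and structure are made up, modeled on the question):

```python
from bs4 import BeautifulSoup

# Toy markup: one letter-pages span containing two anchors
html = ('<span class="letter-pages">'
        '<a href="/ratings/A/1">1</a><a href="/ratings/A/2">2</a>'
        '</span>')
soup = BeautifulSoup(html, 'html.parser')

# find returns one Tag; iterating it walks that tag's children
span = soup.find('span', class_='letter-pages')
print([child.name for child in span])  # ['a', 'a']

# find_all returns a list of matching Tags, which is what you loop over
spans = soup.find_all('span', class_='letter-pages')
print([s.name for s in spans])  # ['span']
```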

Extract content of <a> tag

I have written code to extract the URL and title of a book from a page using BeautifulSoup.
But it is not extracting the name of the book, Astounding Stories of Super-Science April 1930, between the <a ...> and </a> tags.
How can I extract the name of the book?
I have tried the find_next method recommended in another question, but I get an AttributeError on that.
HTML:
<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
(English)
</li>
Code below:
def make_soup(BASE_URL):
    r = requests.get(BASE_URL, verify=False)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_text_urls(html):
    soup = make_soup(BASE_URL)
    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a['title']
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass

extract_text_urls(filename)
You should use the text attribute of the element. The following works for me:
def make_soup(BASE_URL):
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_text_urls(html):
    soup = make_soup(BASE_URL)
    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a.text
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass

extract_text_urls('http://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)')
I get the following output for the element in question
//www.gutenberg.org/ebooks/29390 Astounding Stories of Super-Science April 1930
According to the BeautifulSoup documentation, the .string property should accomplish what you are trying to do; edit your original listing this way:
# ...
try:
    print li.a['href'], li.a['title']
    print "\n"
    print li.a.string
except KeyError:
    pass
# ...
You probably want to surround it with something like
if li.a['class'] == "extiw":
    print li.a.string
since, in your example, only the anchors of class extiw contain a book title.
Thanks @wilbur for pointing out the optimal solution.
I did not see how you can extract the text within the tag. I would do something like this:
from bs4 import BeautifulSoup as bs
from urllib2 import urlopen as uo

soup = bs(uo(html))
for li in soup.findAll('li'):
    a = li.find('a')
    book_title = a.contents[0]
    print book_title
To get just the text that is not inside any tags use the get_text() method. It is in the documentation here.
I can't test it because I don't know the url of the page you are trying to scrape, but you can probably just do it with the li tag since there doesn't seem to be any other text.
Try replacing this:
for li in soup.findAll('li'):
    try:
        try:
            print li.a['href'], li.a['title']
            print "\n"
        except KeyError:
            pass
    except TypeError:
        pass
with this:
for li in soup.findAll('li'):
    try:
        print(li.get_text())
        print("\n")
    except TypeError:
        pass
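As a quick, self-contained check of get_text() against the question's own <li> markup (pasted here as a literal, so no network access is needed):

```python
from bs4 import BeautifulSoup

# The <li> markup from the question, as an inline string
html = '''<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png"/></a>
(English)
</li>'''
soup = BeautifulSoup(html, 'html.parser')

li = soup.find('li')
title = li.a.get_text()  # text of the first <a> inside the <li>
print(title)  # Astounding Stories of Super-Science April 1930
```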

Getting youtube link element from source code

I am looking at http://www.bing.com/videos/search?q=kohli and trying to look up video URLs.
The anchor tag contains a YouTube link, but inside a JSON dictionary which I want to extract.
redditFile = urllib2.urlopen("http://www.bing.com/videos?q=" + urllib.quote_plus(word))
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml)
productDivs = soup.findAll('div', attrs={'class': 'dg_u'})
for div in productDivs:
    print div.find('a')['vrhm']  # This element contains youtube urls but print does not display it
    if div.find('div', {"class": "vthumb", 'smturl': True}) is not None:
        print div.find('div', {"class": "vthumb", 'smturl': True})['smturl']  # this gives link to micro video
How can I get the YouTube link from the a tag's vrhm attribute?
You can use json.loads to load a dictionary from a JSON string.
The for loop can be modified as
>>> productDivs = soup.findAll('div', attrs={'class': 'dg_u'})
>>> for div in productDivs:
...     a_dict = json.loads(div.a['vrhm'])
...     print a_dict['p']
https://www.youtube.com/watch?v=bWbrWI3PBss
https://www.youtube.com/watch?v=bWbrWI3PBss
https://www.youtube.com/watch?v=PbTx2Fjth-0
https://www.youtube.com/watch?v=pB1Kjx-eheY
..
..
What it does:
div.a['vrhm'] extracts the vrhm attribute of the immediate a child of the div.
a_dict = json.loads(div.a['vrhm']) parses the JSON string and creates the dictionary a_dict.
print a_dict['p'] — a_dict is a plain Python dictionary; use it as you usually would.
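A self-contained sketch of the same steps, using made-up markup in which the vrhm attribute carries a JSON payload the way the Bing page does:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical markup: vrhm holds a JSON string with the video URL under "p"
html = '''<div class="dg_u"><a vrhm='{"p": "https://www.youtube.com/watch?v=bWbrWI3PBss"}'>video</a></div>'''
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div', attrs={'class': 'dg_u'})
a_dict = json.loads(div.a['vrhm'])  # parse the attribute value as JSON
print(a_dict['p'])  # https://www.youtube.com/watch?v=bWbrWI3PBss
```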

scraping using beautiful soup

I am scraping an article using BeautifulSoup. I want to scrape all of the p tags within the article body except those in a certain section. Could someone give me a hint as to what I am doing wrong? I don't get an error; it just doesn't print anything different. At the moment it grabs the word "Print" from the unwanted section and prints it with the other p tags.
Section I want to ignore: soup.find("div", {'class': 'add-this'})
url: http://www.un.org/apps/news/story.asp?NewsID=47549&Cr=burundi&Cr1=#.U0vmB8fTYig
# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Retrieve all of the paragraphs
tags = soup.find("div", {'id': 'fullstory'}).find_all('p')
for tag in tags:
    ptags = soup.find("div", {'class': 'add-this'})
    for tag in ptags:
        txt.write(tag.nextSibling.text.encode('utf-8') + '\n' + '\n')
    else:
        txt.write(tag.text.encode('utf-8') + '\n' + '\n')
One option is to just pass recursive=False in order not to search p tags inside any other elements of a fullstory div:
tags = soup.find("div", {'id': 'fullstory'}).find_all('p', recursive=False)
for tag in tags:
    print tag.text
This grabs only the top-level paragraphs from the div and prints the complete article:
10 April 2014 The United Nations today called on the Government...
...
...follow up with the Government on these concerns.
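To see recursive=False in action without fetching the page, here is a sketch with contrived markup that reuses the div id and class from the question:

```python
from bs4 import BeautifulSoup

# Contrived structure: one paragraph directly under #fullstory and one
# nested inside an inner "add-this" div (names taken from the question)
html = '''<div id="fullstory">
<p>Article paragraph</p>
<div class="add-this"><p>Print</p></div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

story = soup.find('div', {'id': 'fullstory'})
all_p = story.find_all('p')                   # descends into nested divs
top_p = story.find_all('p', recursive=False)  # direct children only

print([p.text for p in all_p])  # ['Article paragraph', 'Print']
print([p.text for p in top_p])  # ['Article paragraph']
```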
