scraping using beautiful soup - python

I am scraping an article using BeautifulSoup. I want to scrape all of the p tags within the article body, aside from one particular section. I was wondering if someone could give me a hint as to what I am doing wrong? I don't get an error; the output just isn't any different. At the moment the script grabs the word "Print" from the undesirable section and prints it along with the other p tags.
Section I want to ignore: soup.find("div", {'class': 'add-this'})
url: http://www.un.org/apps/news/story.asp?NewsID=47549&Cr=burundi&Cr1=#.U0vmB8fTYig
# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Retrieve all of the paragraphs
tags = soup.find("div", {'id': 'fullstory'}).find_all('p')
for tag in tags:
    ptags = soup.find("div", {'class': 'add-this'})
    for tag in ptags:
        txt.write(tag.nextSibling.text.encode('utf-8') + '\n' + '\n')
    else:
        txt.write(tag.text.encode('utf-8') + '\n' + '\n')

One option is to pass recursive=False so that find_all() does not search for p tags inside any other elements of the fullstory div:
tags = soup.find("div", {'id': 'fullstory'}).find_all('p', recursive=False)
for tag in tags:
    print tag.text
This grabs only the top-level paragraphs from the div and prints the complete article:
10 April 2014 The United Nations today called on the Government...
...
...follow up with the Government on these concerns.
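An alternative sketch, assuming (as in the question) that the add-this div sits inside the fullstory div: remove the unwanted section from the tree with decompose() first, then collect every remaining paragraph:

import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen(url).read())
story = soup.find("div", {'id': 'fullstory'})

# Drop the unwanted section from the parse tree entirely
unwanted = story.find("div", {'class': 'add-this'})
if unwanted:
    unwanted.decompose()

for tag in story.find_all('p'):
    print tag.text

This also works when the unwanted section contains p tags nested deeper than the top level.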


loop to extract text from multiple tags in a <div> tag with a Beautiful Soup parse

I am trying to run a loop in a web scraping script that uses Beautiful Soup to extract data from this Page. The loop goes through each div tag and extracts 4 different pieces of information: an h3, a div, and 2 span tags. But when I add the ".text" option I get errors from 'date', 'soldprice', and 'shippingprice'. The error says:
AttributeError: 'NoneType' object has no attribute 'text'
I can get the text value from 'title', but nothing else when I put ".text" at the end of the line or in the print function. The script overall extracts the correct information when it is run; however, I don't want the HTML tags.
results = soup.find_all("div", {"class": "s-item__info clearfix"})  # to separate the section of text for each item on the page
for item in results:
    product = {
        'title': item.find("h3", attrs={"class": "s-item__title s-item__title--has-tags"}).text,
        'date': item.find("div", attrs={"class": "s-item__title--tag"}),  # .find("span", attrs={"class": "POSITIVE"}),
        'soldprice': item.find("span", attrs={"class": "s-item__price"}),
        'shippingprice': item.find("span", attrs={"class": "s-item__shipping s-item__logisticsCost"}),
    }
    print(product)
The problem is that before the offers there is another div with class="s-item__info clearfix" that has no date, soldprice, or shippingprice. You have to add a find to search only in the offers section:
results = soup.find('div', class_='srp-river-results clearfix').find_all("div", {"class": "s-item__info clearfix"})
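If you would rather keep searching the whole page, a defensive sketch: guard each lookup before calling .text, since find() returns None for a missing tag (the cause of the AttributeError). text_or_none here is a hypothetical helper, not part of Beautiful Soup:

def text_or_none(parent, name, cls):
    # find() returns None when nothing matches, so check before .text
    tag = parent.find(name, attrs={"class": cls})
    return tag.text if tag else None

for item in results:
    product = {
        'title': text_or_none(item, "h3", "s-item__title s-item__title--has-tags"),
        'date': text_or_none(item, "div", "s-item__title--tag"),
        'soldprice': text_or_none(item, "span", "s-item__price"),
        'shippingprice': text_or_none(item, "span", "s-item__shipping s-item__logisticsCost"),
    }
    print(product)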

BeautifulSoup 4 - How to replace anchor tag with its text

I have html content in my document. I need to replace all the anchor tags with their respective texts using BeautifulSoup.
My input is
html = '''They are also much more <a href="#">fuel-efficient</a> than rockets.'''
Expected output
"They are also much more fuel-efficient than rockets."
Here is my code
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    replacement_string = a.string
    # I get all the anchor tags here. I need to perform the
    # replace operation here.

# Should display 'They are also much more fuel-efficient than rockets.'
print(replaced_html_string)
I was able to replace the elements of the anchor tag but not the whole tag itself.
You don't really need to separate out all the tags to get the text. Just use .text:
soup = BeautifulSoup(html, 'html.parser')
print(soup.text)
gives:
'They are also much more fuel-efficient than rockets.'
Or in your way:
res = str(soup)
for i in soup.find_all('a'):
    res = res.replace(str(i), i.text)
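For completeness, BeautifulSoup can also do the replacement on the tree itself: Tag.unwrap() removes a tag but keeps its contents, so each anchor is replaced by its text in place:

soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    a.unwrap()  # drops the <a> tag, keeps its children (the text)
print(str(soup))

Unlike the string replace above, this keeps working even when the anchor text also appears elsewhere in the document.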

Scrape href not working with python

I have copies of this very code that I am trying to do, and every time I copy it line by line it isn't working right. I am more than frustrated and can't seem to figure out where it is going wrong. What I am trying to do is go to a website, scrape the different ratings pages, which are labelled A, B, C ... etc. Then I go to each site to pull the total number of pages they are using. I am trying to scrape the <span class='letter-pages' href='/ratings/A/1'> and so on. What am I doing wrong?
import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
hrefs = []
ratings = []
ks = []
pages_scrape = []

for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])
for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9] + good_ratings)
    # elif good_ratings.startswith('/401k'):
    #     ks.append(url[:-9] + good_ratings)
del ratings[0]
del ratings[27:]
print(ratings)

for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find('span', class_='letter-pages'):
        # Not working here
        pages_scrape.append(href.attrs['href'])
        # Will print all the anchor tags with hrefs if I remove the above comment.
        print(href)
You are trying to get the href prematurely: you are extracting the attribute directly from the span tag that has nested a tags, rather than from a list of a tags.
for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    for a in span.find_all('a'):
        href = a.get('href')
        pages_scrape.append(href)
I didn't test this on all pages, but it worked for the first one. You pointed out that on some of the pages the content wasn't getting scraped, which is due to the span search returning None. To get around this you can do something like:
for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    if span:
        for a in span.find_all('a'):
            href = a.get('href')
            pages_scrape.append(href)
            print(href)
    else:
        print('span.letter-pages not found on ' + each_rating)
Depending on your use case you might want to do something different, but this will indicate to you which pages don't match your scraping model and need to be manually investigated.
You probably meant to do find_all instead of find -- so change
for href in soup.find('span', class_='letter-pages'):
to
for href in soup.find_all('span', class_='letter-pages'):
You want to be iterating over a list of tags, not a single tag. find gives you a single Tag object, and when you iterate over a single tag you get its children, which can be NavigableString objects. find_all gives you the list of Tag objects you want.
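A small sketch of the difference, using made-up markup similar to the ratings pages:

from bs4 import BeautifulSoup

doc = BeautifulSoup(
    '<span class="letter-pages">'
    '<a href="/ratings/A/1">1</a> <a href="/ratings/A/2">2</a>'
    '</span>', 'html.parser')

# find() returns one Tag; iterating it walks its children, which
# here are two <a> Tags and a whitespace NavigableString between them.
for child in doc.find('span', class_='letter-pages'):
    print(repr(child))

# find_all() returns a list of matching Tags to iterate over.
for span in doc.find_all('span', class_='letter-pages'):
    print(span.a['href'])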

Scraping all h1 tags contents with beautiful soup

I am trying to scrape some review data with beautiful soup, and it will only let me grab a single element:
BASE_URL = "http://consequenceofsound.net/category/reviews/album-reviews/"
html = urlopen(BASE_URL + section_url).read()
soup = BeautifulSoup(html, "lxml")
meta = soup.find("div", {"class": "content"}).h1
wordage = [s.contents for s in meta]
This lets me grab a single review's title from the page. When I change find to find_all, though, I can't access h1 on that line, so I end up with code like this:
meta = soup.find("div", {"class": "content"})
wordage = [s.h1 for s in meta]
and I'm unable to find a way to isolate the contents.
meta = soup.find_all("div", {"class": "content"})
wordage = [s.h1 for s in meta if s.h1 not in ([], None)]
link = [s.a['href'] for s in wordage]
Note the addition of the 'not in' check. It seems that, on occasion, empty and NoneType entries get added into the soup, so this is an important measure.
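If all you need are the titles and links, a CSS selector can express the same thing in one pass. A sketch, assuming each review title is an h1 inside a div with class content:

titles = [h1.get_text(strip=True) for h1 in soup.select('div.content h1')]
links = [h1.a['href'] for h1 in soup.select('div.content h1') if h1.a]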

Webcrawler BeautifulSoup - how do I get titles from links without class tags

The site I am trying to gather data from is http://www.boxofficemojo.com/yearly/chart/?yr=2015&p=.htm. Right now I want to get all the titles of the movies on this page and later move onto the rest of the data (studio, etc.) and additional data inside each of the links. This is what I have so far:
import requests
from bs4 import BeautifulSoup
from urllib2 import urlopen

def trade_spider(max_pages):
    page = 0
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2015&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'div': 'body'}):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            title = link.string
            print title
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_name in soup.findAll('section', {'id': 'postingbody'}):
        print item_name.text

trade_spider(1)
The section I am having trouble with is
for link in soup.findAll('a', {'div':'body'}):
    href = 'http://www.boxofficemojo.com' + link.get('href')
The issue is that on the site there's no identifying class that all the links share. The links just have an <a href> tag.
How can I get all the titles of the links on this page?
One possible way is to use the .select() method, which accepts a CSS selector as its parameter:
for link in soup.select('td > b > font > a[href^=/movies/?]'):
    ......
    ......
A brief explanation of the CSS selector being used:
td > b : find all td elements, then from each td get its direct child b element
> font : from the filtered b elements, get the direct child font element
> a[href^=/movies/?] : from the filtered font elements, return the direct child a element whose href attribute value starts with "/movies/?"
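A minimal sketch of that selector in context. One caveat: newer BeautifulSoup releases delegate .select() to soupsieve, which requires the attribute value to be quoted:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.boxofficemojo.com/yearly/chart/?yr=2015&p=.htm')
soup = BeautifulSoup(page.text, 'html.parser')

# Quoting the href prefix keeps the selector valid in current soupsieve
for link in soup.select('td > b > font > a[href^="/movies/?"]'):
    print(link.string, link.get('href'))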
Sorry for not giving a full answer, but here's a clue.
I have made-up names for these problems in scraping.
When I use the find() and find_all() methods, I call it Abstract Identification, since you could get random data when tag class/id names are not data oriented.
Then there's Nested Identification. That's when you have to find data not by using the find() and find_all() methods, but by literally crawling through a nest of tags. This requires more proficiency in BeautifulSoup.
Nested Identification is a longer process that's generally messy, but it is sometimes the only solution.
So how do you do it? When you have hold of a <class 'bs4.element.Tag'> object, you can locate tags that are stored as attributes of the tag object.
from bs4 import element, BeautifulSoup as BS

html = '' +\
    '<body>' +\
    '<h3>' +\
    '<p>Some text to scrape</p>' +\
    '<p>Some text NOT to scrape</p>' +\
    '</h3>' +\
    '\n\n' +\
    '<strong>' +\
    '<p>Some more text to scrape</p>' +\
    '\n\n' +\
    '<a href="/some-important-link">Some Important Link</a>' +\
    '</strong>' +\
    '</body>'

soup = BS(html, 'html.parser')

# Starting point to extract a link
h3_tag = soup.find('h3')  # finds the first h3 tag in the soup object
child_of_h3__p = h3_tag.p  # locates the first p tag in the h3 tag

# climbing in the nest
child_of_h3__forbidden_p = h3_tag.p.next_sibling
# or
# child_of_h3__forbidden_p = child_of_h3__p.next_sibling

# sometimes `.next_sibling` will yield '' or '\n'; think of this element as a
# tag separator, in which case you need to keep using `.next_sibling`
# to get past the separator and onto the tag.

# Grab the tag below the h3 tag, which is the strong tag.
# We need to go up 1 tag, and down 2 from our current object
# (down 2 so we skip the tag separator).
tag_below_h3 = child_of_h3__p.parent.next_sibling.next_sibling

# Here are 3 different ways to get to the link tag using Nested Identification

# 1.) getting a list of children from our object
children_tags = tag_below_h3.contents
p_tag = children_tags[0]
tag_separator = children_tags[1]
a_tag = children_tags[2]  # or children_tags[-1] to get the last tag
print(a_tag)
print('1.) We found the link: %s' % a_tag['href'])

# 2.) there's only 1 <a> tag, so we can just grab it directly
a_href = tag_below_h3.a['href']
print('\n2.) We found the link: %s' % a_href)

# 3.) using next_sibling to crawl
tag_separator = tag_below_h3.p.next_sibling
a_tag = tag_below_h3.p.next_sibling.next_sibling  # or tag_separator.next_sibling
print('\n3.) We found the link: %s' % a_tag['href'])

print('\nWe also found a tag separator: %s' % repr(tag_separator))
# our tag separator is a NavigableString
if type(tag_separator) == element.NavigableString:
    print('\nNavigableStrings are usually plain text that resides inside a tag.')
    print('In this case, however, it is a tag separator.\n')
Now, if I remember right, accessing a certain tag or a tag separator will change the object from a Tag to a NavigableString, in which case you need to pass it through BeautifulSoup to be able to use methods such as find(). To check for this you can do like so:
from bs4 import element, BeautifulSoup

# ... do some Beautiful Soup data mining
# reach a NavigableString object
if type(formerly_a_tag_obj) == element.NavigableString:
    formerly_a_tag_obj = BeautifulSoup(formerly_a_tag_obj, 'html.parser')  # is now a soup
