Parsing multiple items using BeautifulSoup in Python - python

I'm trying to parse HTML from a website, where there are multiple elements having the same class ID. I can't seem to find a solution; I manage to get one item but not all of them.
Here's a bit of the HTML I'm trying to parse :
<h1>Synonymes travail</h1>
<div class="container-bloc1">
<strong> Nom</strong>
<br/>
-
<i><a class="lien2" href="/fr/accouchement.html"> accouchement </a></i>
:
<a class="lien3" href="/fr/gésine.html"> gésine</a>
<br/>
-
<i> <a class="lien2" href="/fr/action.html"> action </a></i>
:
<a class="lien3" href="/fr/activité.html"> activité</a>
,
<a class="lien3" href="/fr/labeur.html"> labeur</a>
</div>
In Python, I wrote it like this :
from bs4 import BeautifulSoup
import requests
import csv
source = requests.get("http://www.synonymes.net/fr/travail.html").text
soup = BeautifulSoup(source, "lxml")
for synonyme in soup.find_all("div", class_="container-bloc1"):
    print(synonyme)
    synonymesdumot = synonyme.find("a", class_="lien2").text
    print(synonymesdumot)
    for synonymesautres in synonyme.find_all("a", class_="lien3").text:
        print(synonymesautres)
The first part is working, since there is only one "lien2" in the HTML file. I could do the same for "lien3" but I'd only get one item, and I want all of them.
What am I doing wrong here? Thanks for your help guys!

If you run the code as it is in your question, you get an AttributeError, because .find_all() returns a collection of tags (a ResultSet, more specifically) that has no text attribute. Each of its elements, which are of type bs4.element.Tag, does have one. So you need to read the text attribute of each tag inside the for loop:
for synonymesautres in synonyme.find_all("a", class_="lien3"):
    print(synonymesautres.text)
Output:
le
travail
manque
de
travail
travail
fatigant
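To check the fix without hitting the live site, here is a minimal sketch that runs against the HTML excerpt from the question. It uses the built-in html.parser instead of lxml (an assumption, made only to avoid an extra dependency), and the names headwords and synonyms are illustrative, not part of the original code.

```python
from bs4 import BeautifulSoup

# HTML excerpt copied from the question
html = """
<div class="container-bloc1">
<strong> Nom</strong>
<br/>
-
<i><a class="lien2" href="/fr/accouchement.html"> accouchement </a></i>
:
<a class="lien3" href="/fr/gésine.html"> gésine</a>
<br/>
-
<i> <a class="lien2" href="/fr/action.html"> action </a></i>
:
<a class="lien3" href="/fr/activité.html"> activité</a>
,
<a class="lien3" href="/fr/labeur.html"> labeur</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

headwords = []  # text of every "lien2" link
synonyms = []   # text of every "lien3" link
for bloc in soup.find_all("div", class_="container-bloc1"):
    # find_all() returns a ResultSet, so read the text of each Tag in turn
    for link in bloc.find_all("a", class_="lien2"):
        headwords.append(link.get_text(strip=True))
    for link in bloc.find_all("a", class_="lien3"):
        synonyms.append(link.get_text(strip=True))

print(headwords)
print(synonyms)
```

Note that the excerpt contains two "lien2" links, so find() alone would return only the first of them as well.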

Related

with BeautifulSoup extract text from div in a href in loop

<div class="ELEMENT1">
<div class="ELEMENT2">
<div class="ELEMENT3">valeur1</div>
<div class="ELEMENT4">
<svg class="ELEMENT5 ">
<a href="ELEMENT6" target="ELEMENT7" class="ELEMENT8">
<div>TEXT</div>
Hello to all,
My request is the following:
From the piece of code above, I want to write a loop that
extracts TEXT if and only if the div class is ELEMENT4 AND the svg class is ELEMENT5 (because there are other, different ones).
Thank you for your help,
eddy
You'll need urllib.request (urllib2 on Python 2) or some other library that can fetch a URL's HTML. You also need to import Beautiful Soup. Fetch the URL, store the parsed page in a variable, then reshape the output in whatever way serves your needs.
For example:
from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen("the_url")  # placeholder URL
content = BeautifulSoup(page.read().decode("utf-8"), "html.parser")  # decode the bytes as UTF-8, then parse
divs = content.find_all("div")  # all div elements in the document
Then you could use regexp to find the actual text inside the element.
Good luck on your assignment!
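For the specific condition in the question, a loop along the following lines might work. This is only a sketch: the question's HTML is truncated, so the closing tags, the nesting (svg and a as siblings inside the ELEMENT4 div), and the second block with class OTHER are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical completion of the truncated markup from the question
html = """
<div class="ELEMENT1">
  <div class="ELEMENT2">
    <div class="ELEMENT3">valeur1</div>
    <div class="ELEMENT4">
      <svg class="ELEMENT5"></svg>
      <a href="ELEMENT6" target="ELEMENT7" class="ELEMENT8"><div>TEXT</div></a>
    </div>
  </div>
  <div class="ELEMENT2">
    <div class="ELEMENT4">
      <svg class="OTHER"></svg>
      <a href="x" class="ELEMENT8"><div>SKIPPED</div></a>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

texts = []
for block in soup.find_all("div", class_="ELEMENT4"):
    # keep only blocks that also contain an svg with class ELEMENT5
    if block.find("svg", class_="ELEMENT5") is None:
        continue
    for link in block.find_all("a"):
        texts.append(link.get_text(strip=True))

print(texts)
```

If the real page nests the elements differently, the inner find() and find_all() calls would need to be adjusted to match.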

Getting an error using Beautifulsoup find_all() .get('href')

I'm trying to scrape an HTML page for links under a specific class called "category-list".
Each link resides under an h4 tag (I'm ignoring its parent h3 tag):
<ul class="category-list">
<li class="category-item">
<h3>
<a href="/derdubor/c/alarm_og_sikkerhet/">
Alarm og sikkerhet
</a>
</h3>
<ul>
<li>
<h4>
<a href="/derdubor/c/alarm_og_sikkerhet/brannsikring/">
<span class="category-has-customers">
Brannsikring
</span>
(1)
</a>
</h4>
</li>
</ul>
</li>
...
My code for scraping the html is the following:
r = request.urlopen(str_top_url)
soup = BeautifulSoup(r.read(),'html.parser')
tag_category_list = soup.find('ul', class_ = 'category-list')
tag_items = tag_category_list.find_all('h4')
for tag_item in tag_items.find_all('a'):
    print(tag_item.get('href'))
I get the error:
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item..."
Reading the BeautifulSoup manual on crummy, it looks like you can use the same methods belonging to the BeautifulSoup class on a tag object?
I can't seem to figure out what I'm doing wrong...
I've tried numerous answers here on Stack Overflow, but to no avail...
Regards MH
The problem is in this line: for tag_item in tag_items.find_all('a'):. You should first iterate through tag_items and then through the find_all('a') items of each one. Here is the edited code:
from bs4 import BeautifulSoup
# sample markup from the question, including the <a> tags that carry the href values
soup = BeautifulSoup('<ul class="category-list"><li class="category-item"><h3><a href="/derdubor/c/alarm_og_sikkerhet/">Alarm og sikkerhet</a></h3><ul><li><h4><a href="/derdubor/c/alarm_og_sikkerhet/brannsikring/"><span class="category-has-customers">Brannsikring</span>(1)</a></h4></li></ul></li></ul>', 'html.parser')
tag_category_list = soup.find('ul', class_ = 'category-list')
tag_items = tag_category_list.find_all('h4')
for elm in tag_items:
    for tag_item in elm.find_all('a'):
        print(tag_item.get('href'))
And here is the result:
/derdubor/c/alarm_og_sikkerhet/brannsikring/
The problem is that tag_items is a ResultSet, not a Tag.
From the Beautiful Soup documentation:
AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings–a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all().
So this nested loop should work:
for tag_item in tag_items:
    for link in tag_item.find_all('a'):
        print(link.get('href'))
Or, if you were only expecting one h4, change find_all('h4') to find('h4').
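As an alternative to the nested loop, a single CSS selector can express "every a inside an h4 inside the category list". The sketch below assumes the markup from the question (closing tags added so that it parses on its own); the variable name hrefs is illustrative.

```python
from bs4 import BeautifulSoup

# Markup from the question, with the closing tags filled in
html = """
<ul class="category-list">
  <li class="category-item">
    <h3><a href="/derdubor/c/alarm_og_sikkerhet/">Alarm og sikkerhet</a></h3>
    <ul>
      <li>
        <h4>
          <a href="/derdubor/c/alarm_og_sikkerhet/brannsikring/">
            <span class="category-has-customers">Brannsikring</span>
            (1)
          </a>
        </h4>
      </li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of Tags, so the href is read per element;
# the link under h3 is not matched because it has no h4 ancestor
hrefs = [a.get("href") for a in soup.select("ul.category-list h4 a")]
print(hrefs)
```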

Python HTML Parsing with BS4

I'm trying to parse HTML using Python and Beautiful Soup, and I'm having trouble extracting one very specific piece of data. This is the kind of markup I'm dealing with:
<div class="big_div">
<div class="smaller div">
<div class="other div">
<div class="this">A</div>
<div class="that">2213</div>
<div class="other div">
<div class="this">B</div>
<div class="that">215</div>
<div class="other div">
<div class="this">C</div>
<div class="that">253</div>
The HTML repeats the same structure, as you can see, with only the values differing; my problem is locating a specific value. I want to locate the 253 in the last div. I would appreciate any help, as this is a recurring problem for me when parsing HTML.
Thank you in advance!
So far I've tried to parse for it, but because the class names are the same I have no idea how to navigate to it. I've tried using a for loop too, but made little to no progress.
You can pass the string argument to find(); see the Beautiful Soup documentation on the string argument.
from bs4 import BeautifulSoup
# html is assumed to hold the HTML source of the page you want to scrape,
# and req_text the exact text you want to find.
soup = BeautifulSoup(html, 'lxml')
req_div = soup.find('div', string=req_text)
req_div will then hold the div element whose text is req_text, or None if there is no match.
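If the value itself is not known in advance, one option is to locate the label ("C") and then step to its sibling. The sketch below assumes the markup from the question with the missing closing tags filled in; label and value are illustrative names.

```python
from bs4 import BeautifulSoup

# Markup from the question; the closing </div> tags are an assumption
html = """
<div class="big_div">
  <div class="smaller div">
    <div class="other div">
      <div class="this">A</div>
      <div class="that">2213</div>
    </div>
    <div class="other div">
      <div class="this">B</div>
      <div class="that">215</div>
    </div>
    <div class="other div">
      <div class="this">C</div>
      <div class="that">253</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find the label div whose text is exactly "C" ...
label = soup.find("div", class_="this", string="C")
# ... then take the next sibling div with class "that"
value = label.find_next_sibling("div", class_="that").get_text(strip=True)
print(value)

# alternatively, take the last "that" div on the page
last_value = soup.find_all("div", class_="that")[-1].get_text(strip=True)
print(last_value)
```

The first approach keeps working if the order of the blocks changes; the second depends on the wanted block being last.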

beautifulsoup CSS Select - find a tag in which a particular attribute (style for ex) is not present

My first post here on SO. Thanks for helping us noobs for so long. Coming straight to the point:
Scenario:
I am working on an existing program that reads the CSS selector as a string from a configuration file, which makes the program dynamic: it can scrape any site just by changing the configured CSS selector.
Problem:
I am trying to scrape a site which is rendering items as one of the 2 options below:
Option1:
.........
<div class="price">
<span class="price" style="color:red;margin-right:0.1in">
<del>$299</del>
</span>
<span class="price">
$195
</span>
</div>
soup = soup.select("span.price") - this doesn't work as I need second span tag or last span tag :(
Option2:
.........
<div class="price">
<span class="price">
$199
</span>
</div>
soup = soup.select("span.price") - this works great!
Question:
In both the above options I want to be able to get the last span tag ($195 or $199) and don't care about the $299. Basically I just want to extract the final sale price and not the original price.
So the 2 ways I know as of now are:
1) Always get the last span tag
2) Always get the span tag which doesn't have style attribute
Now, I know the not operator and last-of-type are not present in bs4 (only nth-of-type is available), so I am stuck here. Any suggestions are helpful.
Edit: Since this is an existing program, I can't use soup.find_all() or any other method apart from soup.select(). Sorry :(
Thanks!
You can search for the span tag without the style attribute:
prices = soup.select('span.price')
no_style = [price for price in prices if 'style' not in price.attrs]
>> [<span class="price">$199</span>]
This might be a good time to use a function. In this case BeautifulSoup passes each tag to the function (here a lambda), which tests whether the tag's name is span and whether it has a style attribute. If so, BeautifulSoup adds the tag to its list of results. To get the spans without a style attribute instead, negate the second test (not tag.has_attr('style')).
HTML = '''\
<div class='price'>
<span class='price' style='color: red; margin-right: 0.1in'>
<del>$299</del>
</span>
<span class='price'>
$195
</span>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(HTML, 'lxml')
for item in soup.find_all(lambda tag: tag.name=='span' and tag.has_attr('style')):
    print (item)
The code inside the select function needs to change to:
def select(soup, the_variable_you_pass):
    # return the last matching tag inside the price div
    return soup.find('div', attrs={'class': 'price'}).find_all(the_variable_you_pass)[-1]
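Since the question restricts the solution to soup.select() with a configurable selector string, it may be worth noting that newer Beautiful Soup releases (4.7 and later, which delegate select() to the Soup Sieve package) do accept :not() and :last-of-type. The sketch below assumes such a version is installed; the helper sale_prices and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# The two layouts described in the question
html_option1 = """
<div class="price">
  <span class="price" style="color:red;margin-right:0.1in"><del>$299</del></span>
  <span class="price">$195</span>
</div>
"""
html_option2 = """
<div class="price">
  <span class="price">$199</span>
</div>
"""

def sale_prices(html, selector):
    """Return the stripped text of every tag matched by the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(selector)]

# Selector strings that could live in the configuration file
no_style = "span.price:not([style])"        # spans without a style attribute
last_span = "span.price:last-of-type"       # last span among its siblings

print(sale_prices(html_option1, no_style))
print(sale_prices(html_option2, no_style))
print(sale_prices(html_option1, last_span))
print(sale_prices(html_option2, last_span))
```

On older Beautiful Soup versions these selectors are not supported, so filtering the result of select("span.price") in Python remains the fallback.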

Unable to get correct link in BeautifulSoup

I'm trying to parse a bit of HTML and I'd like to extract the link that matches a particular pattern. I'm using the find method with a regular expression but it doesn't get me the correct link. Here's my snippet. Could someone tell me what I'm doing wrong?
from BeautifulSoup import BeautifulSoup
import re
html = """
<div class="entry">
<a target="_blank" href="http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/">RT</a>
<a target="_blank" href="http://www.imdb.com/video/imdb/vi2496267289/">Trailer</a> –
<a target="_blank" href="http://www.imdb.com/title/tt1196141/">IMDB</a> –
</div>
"""
soup = BeautifulSoup(html)
print soup.find('a', href = re.compile(r".*title/tt.*"))['href']
I should be getting the second link but BS always returns the first link. The href of the first link doesn't even match my regex so why does it return it?
Thanks.
find only returns the first <a> tag. You want findAll.
Can't answer your question, but anyway your (originally) posted code has an import typo. Change
import BeautifulSoup
to
from BeautifulSoup import BeautifulSoup
Then, your output (using beautifulsoup version 3.1.0.1) will be:
http://www.imdb.com/title/tt1196141/
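For reference, here is a sketch of the same lookup with the current bs4 package and Python 3 (both assumptions, since the question uses BeautifulSoup 3 and Python 2). The pattern is shortened to title/tt because Beautiful Soup matches a compiled regular expression anywhere in the attribute value.

```python
import re

from bs4 import BeautifulSoup

# HTML from the question, without the separators between the links
html = """
<div class="entry">
<a target="_blank" href="http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/">RT</a>
<a target="_blank" href="http://www.imdb.com/video/imdb/vi2496267289/">Trailer</a>
<a target="_blank" href="http://www.imdb.com/title/tt1196141/">IMDB</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first <a> whose href matches the pattern
link = soup.find("a", href=re.compile(r"title/tt"))
print(link["href"])
```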
