<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>
I want the description after scraping it:
My Name is Alis I am a python learner...
I tried a lot of things but could not figure out the best way. Can you give me a general solution for this?
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Your sample html here")
soup.td.div('div')[2].contents[-1]
This will return the string you are looking for (the unicode string, with any applicable whitespace, it should be noted).
This works by parsing the html, grabbing the first td tag and its contents, grabbing any div tags within the first div tag, selecting the 3rd item in the list (list index 2), and grabbing the last of its contents.
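For readers on the newer bs4 package, the same navigation can be sketched as below. This is a minimal sketch, not the original answer's code: it assumes bs4 is installed, uses the standard-library "html.parser" backend, and adds a second lookup (by_label, a name made up here) that anchors on the "My Description" label instead of on position.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>"""

soup = BeautifulSoup(HTML, "html.parser")

# Same path as the answer: every <div> under the first <div> of the <td>,
# third match (index 2), last child node, with whitespace stripped.
by_index = soup.td.div.find_all("div")[2].contents[-1].strip()

# Position-independent alternative (an assumption about what is stable in
# the page): take the text node that follows the <br> after the label.
label = soup.find("b", string="My Description")
by_label = label.find_next_sibling("br").next_sibling.strip()

print(by_index)
print(by_label)
```

Both lookups return the same string for this sample; the label-based one should keep working if more div elements are added before the description.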
In BeautifulSoup, there are A LOT of ways to do this, so this answer probably hasn't taught you much and I genuinely recommend you read the tutorial that David suggested.
Have you tried reading the examples provided in the documentation? The quick start is located here: http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick Start
Edit:
To find
You would load your html up via
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("My html here")
myDiv = soup.find("div", { "class" : "class-a" })
Also remember that you can do most of this in the Python console, using dir() and help() to walk through what you're trying to do. It might make life easier to try IPython or IDLE, both of which have very friendly consoles for beginners.
Related
I'm trying to parse HTML from a website, where there are multiple elements having the same class ID. I can't seem to find a solution; I manage to get one item but not all of them.
Here's a bit of the HTML I'm trying to parse :
<h1>Synonymes travail</h1>
<div class="container-bloc1">
<strong> Nom</strong>
<br/>
-
<i><a class="lien2" href="/fr/accouchement.html"> accouchement </a></i>
:
<a class="lien3" href="/fr/gésine.html"> gésine</a>
<br/>
-
<i> <a class="lien2" href="/fr/action.html"> action </a></i>
:
<a class="lien3" href="/fr/activité.html"> activité</a>
,
<a class="lien3" href="/fr/labeur.html"> labeur</a>
</div>
In Python, I wrote it like this :
from bs4 import BeautifulSoup
import requests
import csv
source = requests.get("http://www.synonymes.net/fr/travail.html").text
soup = BeautifulSoup(source, "lxml")
for synonyme in soup.find_all("div", class_="container-bloc1"):
    print(synonyme)
    synonymesdumot = synonyme.find("a", class_="lien2").text
    print(synonymesdumot)
    for synonymesautres in synonyme.find_all("a", class_="lien3").text:
        print(synonymesautres)
The first part is working, since there is only one "lien2" in the HTML file. I could do the same for "lien3" but I'd only get one item, and I want all of them.
What am I doing wrong here? Thanks for your help guys!
If you run the code as it is in your question, you get an AttributeError, because .find_all() returns a collection of tags (a ResultSet, more specifically) that has no text attribute; each of its elements, which are of type bs4.element.Tag, does. So you need to read the text attribute of each tag inside the for loop:
for synonymesautres in synonyme.find_all("a", class_="lien3"):
    print(synonymesautres.text)
Output:
le
travail
manque
de
travail
travail
fatigant
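To try the fix without fetching the live page, here is a small sketch that runs against the HTML fragment quoted in the question. It assumes bs4 is installed and swaps in the standard-library "html.parser" backend so lxml is not required; the variable names bloc and synonyms are made up for illustration.

```python
from bs4 import BeautifulSoup

# Fragment copied from the question.
HTML = """<div class="container-bloc1">
<strong> Nom</strong>
<br/>
-
<i><a class="lien2" href="/fr/accouchement.html"> accouchement </a></i>
:
<a class="lien3" href="/fr/gésine.html"> gésine</a>
<br/>
-
<i> <a class="lien2" href="/fr/action.html"> action </a></i>
:
<a class="lien3" href="/fr/activité.html"> activité</a>
,
<a class="lien3" href="/fr/labeur.html"> labeur</a>
</div>"""

soup = BeautifulSoup(HTML, "html.parser")
bloc = soup.find("div", class_="container-bloc1")

# find_all() returns a ResultSet; the text has to be read per element.
synonyms = [a.get_text(strip=True) for a in bloc.find_all("a", class_="lien3")]
print(synonyms)
```

get_text(strip=True) is used instead of .text only to drop the leading spaces in the link labels.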
I'm parsing HTML with Python and Beautiful Soup, and I need to extract one very specific piece of data. This is the kind of markup I'm dealing with:
<div class="big_div">
<div class="smaller div">
<div class="other div">
<div class="this">A</div>
<div class="that">2213</div>
<div class="other div">
<div class="this">B</div>
<div class="that">215</div>
<div class="other div">
<div class="this">C</div>
<div class="that">253</div>
As you can see, the same HTML structure repeats, with only the values being different. My problem is locating a specific value: I want the 253 in the last div. I would appreciate any help, as this is a recurring problem when parsing HTML.
Thank you in advance!
So far I've tried to parse for it but because the names are the same I have no idea how to navigate through it. I've tried using the for loop too but made little to no progress at all.
You can pass the string argument to find to match a tag by its text. See the BS docs for the string argument.
"""Suppose html is the object holding html code of your web page that you want to scrape
and req_text is some text that you want to find"""
soup = BeautifulSoup(html, 'lxml')
req_div = soup.find('div', string=req_text)
req_div will contain the div element which you want.
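Applied to the markup in the question, this could look like the sketch below. The closing tags and nesting in HTML are an assumption (the question's snippet is truncated), as is the idea of looking up the value by its neighbouring label "C"; "html.parser" is used in place of lxml so that only bs4 is needed.

```python
from bs4 import BeautifulSoup

# Assumed complete form of the truncated snippet from the question.
HTML = """<div class="big_div">
  <div class="smaller div">
    <div class="other div"><div class="this">A</div><div class="that">2213</div></div>
    <div class="other div"><div class="this">B</div><div class="that">215</div></div>
    <div class="other div"><div class="this">C</div><div class="that">253</div></div>
  </div>
</div>"""

soup = BeautifulSoup(HTML, "html.parser")

# Find the label div whose text is exactly "C" ...
label = soup.find("div", class_="this", string="C")
# ... then step to the value div next to it.
value = label.find_next_sibling("div", class_="that").get_text(strip=True)
print(value)
```

If the label is not known in advance and only the position matters, soup.find_all("div", class_="that")[-1] would give the last value instead.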
My first post here on SO. Thanks for helping us noobs for so long. Coming straight to the point:
Scenario:
I am working on an existing program that reads a CSS selector as a string from a configuration file, so that the program is dynamic and can scrape any site just by changing the configured selector.
Problem:
I am trying to scrape a site which is rendering items as one of the 2 options below:
Option1:
.........
<div class="price">
<span class="price" style="color:red;margin-right:0.1in">
<del>$299</del>
</span>
<span class="price">
$195
</span>
</div>
soup = soup.select("span.price") - this doesn't work as I need second span tag or last span tag :(
Option2:
.........
<div class="price">
<span class="price">
$199
</span>
</div>
soup = soup.select("span.price") - this works great!
Question:
In both the above options I want to be able to get the last span tag ($195 or $199) and don't care about the $299. Basically I just want to extract the final sale price and not the original price.
So the 2 ways I know as of now are:
1) Always get the last span tag
2) Always get the span tag which doesn't have style attribute
Now, I know the not operator, last-of-type are not present in bs4 (only nth-of-type is available) so I am stuck here. Any suggestions are helpful.
Edit: Since this is an existing program, I can't use soup.find_all() or any other method apart from soup.select(). Sorry :(
Thanks!
You can search for the span tag without the style attribute:
prices = soup.select('span.price')
no_style = [price for price in prices if 'style' not in price.attrs]
>> [<span class="price">$199</span>]
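If the program can only pass a selector string to soup.select(), newer Beautiful Soup releases may make the filter unnecessary. The sketch below assumes bs4 4.7 or later, where select() is backed by the Soup Sieve package and accepts :not() and :last-of-type; it does not apply to the older versions discussed in the question. The helper name sale_price and the two constants are made up for illustration.

```python
from bs4 import BeautifulSoup

# The two renderings described in the question.
OPTION_1 = """<div class="price">
<span class="price" style="color:red;margin-right:0.1in"><del>$299</del></span>
<span class="price">$195</span>
</div>"""
OPTION_2 = """<div class="price"><span class="price">$199</span></div>"""

# Candidate values for the configuration file.
NO_STYLE = "span.price:not([style])"               # span without a style attribute
LAST_SPAN = "div.price > span.price:last-of-type"  # last span inside the price div


def sale_price(html, selector):
    """Return the stripped text of every tag matched by the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


for selector in (NO_STYLE, LAST_SPAN):
    print(selector, sale_price(OPTION_1, selector), sale_price(OPTION_2, selector))
```

Either selector string can go into the configuration unchanged, so the calling code keeps using soup.select() only.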
This might be a good time to use a function. In this case BeautifulSoup passes each tag to the function (written as a lambda below), which tests whether the tag's name is span and whether it has a style attribute. If both are true, BeautifulSoup appends the tag to its list of results.
HTML = '''\
<div class='price'>
<span class='price' style='color: red; margin-right: 0.1in'>
<del>$299</del>
</span>
<span class='price'>
$195
</span>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(HTML, 'lxml')
for item in soup.find_all(lambda tag: tag.name=='span' and tag.has_attr('style')):
    print(item)
The code inside the select function needs to change to:
def select(soup, the_variable_you_pass):
    # Return the last matching tag inside the price div.
    return soup.find('div', attrs={'class': 'price'}).find_all(the_variable_you_pass)[-1]
I have a html that contains:
<b>
<p align="left">TXT1</p>
</b>
<p align="left">
<b>NR1</b>
<b>TXT2</b>
TXT3
<b>TXT4</b>
TXT5
</p>
When I do:
import urllib
from BeautifulSoup import BeautifulSoup
html = urllib.urlopen('url')
htmlr = html.read()
soup = BeautifulSoup(htmlr)
print soup
I get something different:
<p align="left">TXT1</p>
<p align="left">NR1 <b>TXT2</b> TXT3 <b>TXT4</b>
TXT5</p>
I am analyzing HTML document layout, so losing tags is quite frustrating. Why is it happening and what's the best way to stop it? Help much appreciated!
EDIT: I need to handle badly formed HTML documents for information extraction purposes. If their creator wanted some text to be rendered bold, I have to take it into account, even if the person created invalid HTML.
The HTML is invalid. You can't have a <p> inside a <b>. BeautifulSoup is attempting to perform error recovery (as do browsers).
The best way to stop it is to fix the HTML.
HTML Tidy appears to correctly repair the invalid HTML. They have a web implementation of it here: http://infohound.net/tidy/
I entered:
<b><p>hello world</p></b>
and got this result:
<p><b>hello world</b></p>
There appears to be a Python version here:
http://www.egenix.com/products/python/mxExperimental/mxTidy/
You could try html5lib instead of BeautifulSoup. Html5lib implements the HTML5 parser algorithm, so it should result in producing the same DOM as a modern browser does.
Disclaimer: I've not tried the html5lib parser myself, so I don't know its current stability level.
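If the goal is only to know which text the author marked as bold, one more option is to parse with a backend that keeps the nesting as written and then check each text node's ancestors. The sketch below assumes the newer bs4 package with the standard-library "html.parser" backend, which, as far as I know, does not re-nest a p inside a b the way the old BeautifulSoup 3 parser does; the dictionary name bold is made up for illustration.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
HTML = """<b>
<p align="left">TXT1</p>
</b>
<p align="left">
<b>NR1</b>
<b>TXT2</b>
TXT3
<b>TXT4</b>
TXT5
</p>"""

soup = BeautifulSoup(HTML, "html.parser")

# Map each non-empty text node to whether any ancestor is a <b> tag.
bold = {}
for node in soup.find_all(string=True):
    text = node.strip()
    if text:
        bold[text] = node.find_parent("b") is not None

print(bold)
```

Checking ancestors rather than the direct parent is what lets TXT1 count as bold even though its immediate parent is the p element.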
Same as Quentin suggested.
If you want the <p> element to be bold then use inline CSS instead of <b> tag.
<p style='font-weight:bold;' align="left">TXT1</p>
<p align="left">
<b>NR1</b>
<b>TXT2</b>
TXT3
<b>TXT4</b>
TXT5
</p>
all.
I have a huge HTML file which contains tags like these:
<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l onmousedown="return clk(this.href,'','','','6','','0CDEQFjACOAM')">
I need to extract all the urls from this page in python.
In a loop:
Find occurrences of <h3 class="r"> one by one.
Extract the url
http://xrayoptics.by.ru/database/misc/goog2text.py I need to re-write this script to extract all the links found on google.
How can I achieve that?
Thanks.
from BeautifulSoup import BeautifulSoup
html = """<html>
...
<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l
onmousedown="return clk(this.href,'','','','6','','0CDEQFjACOAM')">
text</a>
</h3>
...
<h3>Don't find me!</h3>
<h3 class="r"><a>Don't find me!</a></h3>
<h3 class="r"><a class="l">Don't error on missing href!</a></h3>
...
</html>
"""
soup = BeautifulSoup(html)
for h3 in soup.findAll("h3", {"class": "r"}):
    for a in h3.findAll("a", {"class": "l", "href": True}):
        print a["href"]
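On Python 3 with the newer bs4 package, the same filter can be written as a single CSS selector. This is a sketch under those assumptions (bs4 installed, standard-library "html.parser" backend), not a drop-in replacement for the script in the question; the "..." placeholders from the sample above are left out.

```python
from bs4 import BeautifulSoup

# Sample markup from the answer above.
HTML = """<html>
<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l
onmousedown="return clk(this.href,'','','','6','','0CDEQFjACOAM')">
text</a>
</h3>
<h3>Don't find me!</h3>
<h3 class="r"><a>Don't find me!</a></h3>
<h3 class="r"><a class="l">Don't error on missing href!</a></h3>
</html>"""

soup = BeautifulSoup(HTML, "html.parser")

# <a class="l"> with an href attribute, anywhere inside <h3 class="r">.
urls = [a["href"] for a in soup.select("h3.r a.l[href]")]
print(urls)
```

The [href] part of the selector plays the same role as "href": True in the nested loops: links without an href are skipped rather than raising a KeyError.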
I'd use XPath; see here for a question about which package would be appropriate in Python.
You can use a regular expression (RegEx) for that.
This RegEx will catch all URLs beginning with http and ending at the closing quote ("):
http([^\"]+)
And this is how it's done in Python:
import re
myRegEx = re.compile("http([^\"]+)")
myResults = myRegEx.search('<source>')
Replace '<source>' with the variable storing the source code you want to search for URLs.
myResults.start() and myResults.end() now return the starting and ending position of the first matching URL (search() stops at the first match and returns None when nothing matches). Use the myResults.group() function to get the string that matched the RegEx.
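Since the question asks for all the URLs, findall() may be a better fit than search(). The sketch below uses a slightly stricter pattern than the one above (it requires the opening quote and the :// part); the sample source string and the example.com link are made up for illustration.

```python
import re

# Hypothetical page source; replace with the HTML you downloaded.
source = '''<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l>text</a>
</h3>
<a href="https://example.com/page">other</a>'''

# Capture everything between double quotes that starts with http:// or https://.
url_pattern = re.compile(r'"(https?://[^"]+)"')

# findall() returns the captured group for every match, in document order.
urls = url_pattern.findall(source)
print(urls)
```

Note that a pattern like this also picks up quoted URLs outside href attributes (in scripts, for example), which is one reason the parser-based answers above are usually safer.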
If anything isn't clear yet, just ask.