I was reading: Is it possible for BeautifulSoup to work in a case-insensitive manner?
But that is not what I need. I'm looking for all img tags in a web page, including variants such as IMG, Img, etc.
This code:
images = soup.findAll('img')
Will only look for img tags case-sensitively. How can I solve this without adding a new line for every possible spelling (and maybe forgetting some)?
Please note that the question linked above is not about tag names but about their attributes.
BeautifulSoup is not case-sensitive per se; just give it a try. With an HTML parser, tag names are converted to lowercase, so 'img' also matches IMG and Img. If something is missing from your result, there is probably another issue. If you ever need case-sensitive matching, you can force it by using the XML parser.
Note: In new code, avoid the old findAll() syntax and use find_all() instead. For more, take a minute to check the docs.
Example
from bs4 import BeautifulSoup
html = '''
<img src="" alt="lower">
<IMG src="" alt="upper">
<iMG src="" alt="mixed">
'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
soup.find_all('img')
Output
[<img alt="lower" src=""/>,
<img alt="upper" src=""/>,
<img alt="mixed" src=""/>]
Related
<div class="ELEMENT1">
<div class="ELEMENT2">
<div class="ELEMENT3">valeur1</div>
<div class="ELEMENT4">
<svg class="ELEMENT5 ">
<a href="ELEMENT6" target="ELEMENT7" class="ELEMENT8">
<div>TEXT</div>
Hello to all,
My request is the following:
From the piece of code above, I want to create a loop that extracts TEXT
if and only if the div class is ELEMENT4 AND the svg class is ELEMENT5 (because there are other, different ones).
Thank you for your help,
eddy
You'll need to import urllib.request (urllib2 in Python 2) or some other library that lets you fetch a URL's HTML. You also need to import Beautiful Soup. Fetch the URL, store the parsed page in a variable, then reformat the output in whatever way serves your needs.
For example:
from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen("the_url")
content = BeautifulSoup(page.read().decode("utf-8"), "html.parser")  # decode the data as UTF-8
divs = content.find_all("div")  # finds all div elements in the document
Then you could use a regular expression to find the actual text inside the element.
Good luck on your assignment!
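A minimal sketch of the loop the question asks for, without regular expressions. The posted snippet is truncated, so the closing tags, the assumption that the svg sits directly inside the ELEMENT4 div next to the link, and the second block with class OTHER are all made up here for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical completion of the truncated snippet: the closing tags and the
# second block (svg class "OTHER") are assumptions added for illustration.
html = """
<div class="ELEMENT4">
  <svg class="ELEMENT5 "></svg>
  <a href="ELEMENT6" target="ELEMENT7" class="ELEMENT8"><div>TEXT</div></a>
</div>
<div class="ELEMENT4">
  <svg class="OTHER"></svg>
  <a href="ELEMENT6" target="ELEMENT7" class="ELEMENT8"><div>SKIPPED</div></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

texts = []
for block in soup.find_all("div", class_="ELEMENT4"):
    # Skip blocks that do not contain an svg with class ELEMENT5
    if block.find("svg", class_="ELEMENT5") is None:
        continue
    link = block.find("a")
    if link is not None and link.div is not None:
        texts.append(link.div.get_text(strip=True))

print(texts)
```

If the real page nests the tags differently, the two find() calls inside the loop are the part to adjust.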
I tried to find image tags in an HTML page like this:
<img src="../img/gifts/img1.jpg">
<img src="../img/gifts/img1.jpg">
etc....
but when I use this code from Web Scraping with Python, 2nd edition, by Ryan Mitchell:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
# Raw string so the backslashes reach the regex engine unchanged
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
the list of tags I receive looks like this:
[<img src="../img/gifts/img1.jpg"/>,
<img src="../img/gifts/img2.jpg"/>,
<img src="../img/gifts/img3.jpg"/>,
<img src="../img/gifts/img4.jpg"/>,
<img src="../img/gifts/img6.jpg"/>]
I see that there is an additional '/' character at the end of each tag. Can someone explain this to me?
Thanks so much
In HTML, tags that have no end tag (void elements such as img and br) can be closed with />. This is optional in HTML but mandatory in XHTML, and some consider it good practice. BeautifulSoup adds it automatically when it writes the parsed tree back out as text; the slash is not part of the original markup or of the src attribute.
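A small sketch to illustrate this, assuming bs4 with the built-in html.parser: the input tag has no trailing slash, the printed tag does, and the src value itself is unchanged.

```python
from bs4 import BeautifulSoup

# Input markup without a trailing slash
tag = BeautifulSoup('<img src="../img/gifts/img1.jpg">', "html.parser").img

rendered = str(tag)   # how BeautifulSoup writes the tag back out
print(rendered)       # the printed form ends in "/>"
print(tag["src"])     # the attribute value has no extra slash
```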
Hi all,
I have a huge HTML file which contains tags like these:
<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l onmousedown="return clk(this.href,'','','','6','','0CDEQFjACOAM')">
I need to extract all the URLs from this page in Python.
In a loop:
Find occurrences of <h3 class="r"> one by one.
Extract the URL.
http://xrayoptics.by.ru/database/misc/goog2text.py I need to rewrite this script to extract all the links found on Google.
How can I achieve that?
Thanks.
from bs4 import BeautifulSoup
html = """<html>
...
<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l
onmousedown="return clk(this.href,'','','','6','','0CDEQFjACOAM')">
text</a>
</h3>
...
<h3>Don't find me!</h3>
<h3 class="r"><a>Don't find me!</a></h3>
<h3 class="r"><a class="l">Don't error on missing href!</a></h3>
...
</html>
"""
soup = BeautifulSoup(html, "html.parser")
# Only h3 tags with class "r" are searched
for h3 in soup.find_all("h3", {"class": "r"}):
    # Only links with class "l" that actually have an href
    for a in h3.find_all("a", {"class": "l", "href": True}):
        print(a["href"])
I'd use XPath; there is a separate question about which package is appropriate for that in Python.
You can use a regular expression (RegEx) for that.
This RegEx will catch all URLs beginning with http and surrounded by quotes ("):
http([^\"]+)
And this is how it's done in Python:
import re
myRegEx = re.compile(r'http([^"]+)')
myResults = myRegEx.search('<source>')  # search() returns only the first match, or None
Replace '<source>' with the variable storing the source code you want to search for URLs.
myResults.start() and myResults.end() now contain the starting and ending position of the first URL. Use the myResults.group() function to get the string that matched the RegEx. To collect every URL rather than just the first, use myRegEx.finditer() instead of search().
If anything isn't clear yet, just ask.
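A minimal sketch of collecting every match with the same pattern, using only the standard library. The source string is a shortened stand-in for the real page, and the example.org link is an assumption added to show a second match:

```python
import re

# Stand-in for the page source; the second link is a made-up example
source = '''<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l>text</a>
</h3>
<a href="https://example.org/page">other</a>'''

url_pattern = re.compile(r'http[^"]+')

# finditer() yields one match object per URL instead of only the first
urls = [match.group() for match in url_pattern.finditer(source)]

for url in urls:
    print(url)
```

Note that a pattern like this also picks up URLs outside href attributes, so for anything beyond a quick extraction an HTML parser is usually the safer choice.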
<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>
I want to scrape the description:
My Name is Alis I am a python learner...
I tried a lot of things but could not figure out the best way. Can you give me a general solution for this?
from bs4 import BeautifulSoup
soup = BeautifulSoup("Your sample html here", "html.parser")
soup.td.div('div')[2].contents[-1]
This will return the string you are looking for (the unicode string, with any applicable whitespace, it should be noted).
This works by parsing the html, grabbing the first td tag and its contents, grabbing any div tags within the first div tag, selecting the 3rd item in the list (list index 2), and grabbing the last of its contents.
In BeautifulSoup, there are A LOT of ways to do this, so this answer probably hasn't taught you much and I genuinely recommend you read the tutorial that David suggested.
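One of those other ways, as a hedged sketch: instead of counting divs, anchor on the "My Description" label and take the first non-empty text that follows it. This assumes bs4 with html.parser and that the label text is stable; the HTML below is a trimmed copy of the sample.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the sample HTML from the question
html = """<td>
<div>
<div class="class-a">My name is alis</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
</div>
</td>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the label, then walk its following siblings until real text appears
label = soup.find("b", string="My Description")
description = ""
for sibling in label.next_siblings:
    if isinstance(sibling, NavigableString) and sibling.strip():
        description = sibling.strip()
        break

print(description)
```

This is less sensitive to extra wrapper divs than indexing by position, but it breaks if the label text changes.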
Have you tried reading the examples provided in the documentation? The quick start is located here http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick Start
Edit:
To find the div with class "class-a",
you would load your HTML via
from bs4 import BeautifulSoup
soup = BeautifulSoup("My html here", "html.parser")
myDiv = soup.find("div", {"class": "class-a"})
Also remember that you can do most of this in the Python console, using dir() along with help() to walk through what you're trying to do. It might make life easier to try IPython or Python's IDLE, which have very friendly consoles for beginners.
I'm trying to parse a bit of HTML and I'd like to extract the link that matches a particular pattern. I'm using the find method with a regular expression but it doesn't get me the correct link. Here's my snippet. Could someone tell me what I'm doing wrong?
from BeautifulSoup import BeautifulSoup
import re
html = """
<div class="entry">
<a target="_blank" href="http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/">RT</a>
<a target="_blank" href="http://www.imdb.com/video/imdb/vi2496267289/">Trailer</a> –
<a target="_blank" href="http://www.imdb.com/title/tt1196141/">IMDB</a> –
</div>
"""
soup = BeautifulSoup(html)
print soup.find('a', href = re.compile(r".*title/tt.*"))['href']
I should be getting the second link but BS always returns the first link. The href of the first link doesn't even match my regex so why does it return it?
Thanks.
find returns only the first matching <a> tag. If you want every match, use findAll.
Can't answer your question, but anyway your (originally) posted code has an import typo. Change
import BeautifulSoup
to
from BeautifulSoup import BeautifulSoup
Then, your output (using beautifulsoup version 3.1.0.1) will be:
http://www.imdb.com/title/tt1196141/
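For comparison, a sketch of the same lookup with the current bs4 package and Python 3 (an assumption; the thread itself uses BeautifulSoup 3). bs4 applies a compiled pattern with search(), so the leading and trailing .* are not needed:

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="entry">
<a target="_blank" href="http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/">RT</a>
<a target="_blank" href="http://www.imdb.com/video/imdb/vi2496267289/">Trailer</a> –
<a target="_blank" href="http://www.imdb.com/title/tt1196141/">IMDB</a> –
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first <a> whose href matches the pattern
title_link = soup.find("a", href=re.compile(r"title/tt"))
print(title_link["href"])

# find_all() returns every matching <a>, here both imdb.com links
imdb_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"imdb\.com"))]
print(imdb_links)
```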