Why do I get an additional '/' character when using the BeautifulSoup find_all function? - python

I tried to find image tags in an HTML page like this:
<img src="../img/gifts/img1.jpg">
<img src="../img/gifts/img1.jpg">
etc....
but when I use this code from Web Scraping with Python, 2nd edition, by Ryan Mitchell:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
the list of tags I received looks like this:
[<img src="../img/gifts/img1.jpg"/>,
<img src="../img/gifts/img2.jpg"/>,
<img src="../img/gifts/img3.jpg"/>,
<img src="../img/gifts/img4.jpg"/>,
<img src="../img/gifts/img6.jpg"/>]
I saw that there is an additional '/' character at the end of each tag. Can someone explain this for me?
Thanks so much

In HTML, tags which don't have an end tag (void elements such as <img>) can be closed with />. This is optional in most HTML versions but mandatory in XHTML, and it is considered good practice. BeautifulSoup adds the slash automatically when it serializes the parsed DOM.
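You can see this by round-tripping a single tag; a quick sketch:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<img src="../img/gifts/img1.jpg">', 'html.parser')
print(soup.img)  # prints: <img src="../img/gifts/img1.jpg"/>
The parsed tree is the same either way; only the serialization adds the trailing slash.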

Related

beautifulsoup Case Insensitive?

I was reading: Is it possible for BeautifulSoup to work in a case-insensitive manner?
But it's not what I actually need; I'm looking for all img tags in a webpage, including IMG, Img, etc.
This code:
images = soup.findAll('img')
will only look for img tags case-sensitively, so how can I solve this problem without adding a new line for every single possibility (and maybe forgetting some)?
Please note that the question linked above is about the tag's properties, not the tag name itself.
BeautifulSoup is not case sensitive per se; just give it a try. If some information is missing from your result, there is probably another issue. If you ever need case-sensitive parsing, you can force it by using the xml parser.
Note: in newer code, avoid the old syntax findAll() and use find_all() instead. For more, take a minute to check the docs.
Example
from bs4 import BeautifulSoup
html = '''
<img src="" alt="lower">
<IMG src="" alt="upper">
<iMG src="" alt="mixed">
'''
soup = BeautifulSoup(html, 'html.parser')
soup.find_all('img')
Output
[<img alt="lower" src=""/>,
<img alt="upper" src=""/>,
<img alt="mixed" src=""/>]

With BeautifulSoup, extract text from a div inside an a href in a loop

<div class="ELEMENT1">
<div class="ELEMENT2">
<div class="ELEMENT3">valeur1</div>
<div class="ELEMENT4">
<svg class="ELEMENT5 ">
<a href="ELEMENT6» target="ELEMENT7" class="ELEMENT8">
<div>TEXT</div
Hello to all,
My request is the following:
from the piece of code above, I want to create a loop that allows me
to extract TEXT if and only if the div class is ELEMENT4 AND the svg class is ELEMENT5 (because there are other, different ones).
Thank you for your help,
eddy
You'll need to import urllib2 or some other library that lets you fetch a URL's HTML. Then you need to import Beautiful Soup as well. Scrape the URL and store the result in a variable, then reformat the output in any way that serves your needs.
For example:
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("the_url")
content = BeautifulSoup(page.read().decode("utf-8"))  # decode the data (UTF-8)
divs = content.find_all("div")  # finds all div elements in the document
Then you could use a regexp to find the actual text inside the element.
Good luck on your assignment!
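As a concrete starting point, here is a minimal sketch of the conditional extraction the question asks for, assuming the class names from the snippet (ELEMENT4, ELEMENT5) are the literal values:
from bs4 import BeautifulSoup

html = '''
<div class="ELEMENT4">
<svg class="ELEMENT5"></svg>
<a href="#" class="ELEMENT8"><div>TEXT</div></a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div', class_='ELEMENT4'):
    if div.find('svg', class_='ELEMENT5'):  # keep only divs that contain the svg
        link = div.find('a')
        if link:
            print(link.get_text(strip=True))  # -> TEXT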

What is the best way of getting images located after a specific tag in HTML markup in python/beautiful soup

I'm trying to find images located past the H1 tag. The markup can be any article in an online magazine (example).
That means I can't rely on specific containers etc.
My initial idea was to find the character positions of the H1 tag and of the found images. That would let me determine their position relative to the H1 tag. I can't find a way to get the character position of a found element with Beautiful Soup, unless I'm missing something.
Whatever approach is used to parse the HTML, it has to work with ill-formed syntax.
Example:
<html>
<p>some text</p>
<img src="#" alt="I don't care about this image"/>
<h1>This is the title</h1>
<img src="#" alt="This is the first image I want to get"/>
<p>some more content</p>
<img src="#" alt="This is the secod image I want to get"/>
</html>
Parsing the above HTML would return a list with the 2 images located below the H1 tag.
UPDATE: I completely rewrote my question to better explain the problem.
To answer my own question: the solution to get all images after the H1 tag is:
soup = BeautifulSoup(html_contents, 'html5lib') # parse html markup
soup_h1 = soup.find('h1') # find H1 tag
soup_imgs = soup_h1.find_all_next('img') # returns a list of img objects
Thanks to everyone for help.
lxml might be an easy fit for this. The following grabs all img tags but only prints the ones preceded by an h1 tag, processing them in the order they appear in the DOM.
from lxml import etree
from StringIO import StringIO
html = """
<body>
<h1>a</h1>
<img src="afterh1-1"/>
<h2>b</h2>
<img src="afterh2"/>
<h1>a</h1>
<img src="afterh1-2"/>
</body>
"""
f = StringIO(html)
tree = etree.parse(f)
for i in tree.xpath('//img'):
    if i.getprevious().tag.lower() == "h1":
        print "Match: %s - %s" % (i.get('src'), i.getprevious().tag)
Output:
Match: afterh1-1 - h1
Match: afterh1-2 - h1
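The sibling check can also be pushed into the XPath expression itself; a sketch of the same selection as one query:
# Select every <img> whose nearest preceding sibling element is an <h1>:
for i in tree.xpath('//img[preceding-sibling::*[1][self::h1]]'):
    print "Match: %s" % i.get('src')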
Here's the beautifulsoup version that yields the same output
from bs4 import BeautifulSoup
html = """
<body>
<h1>a</h1>
<img src="afterh1-1"/>
<h2>b</h2>
<img src="afterh2"/>
<h1>a</h1>
<img src="afterh1-2"/>
</body>
"""
soup = BeautifulSoup(html)
for i in soup.find_all('img'):
    if i.previous_sibling.previous_sibling.name == "h1":
        print "Match: %s - %s" % (i.get('src'), i.previous_sibling.previous_sibling.name)

Parsing HTML using Python

I'm looking for an HTML Parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects.
If I have a document of the form:
<html>
<head>Heading</head>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
</div>
</body>
</html>
then it should give me a way to access the nested tags via the name or id of the HTML tag so that I can basically ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.
If you've used Firefox's "Inspect element" feature (view HTML) you would know that it gives you all the tags in a nice nested manner like a tree.
I'd prefer a built-in module but that might be asking a little too much.
I went through a lot of questions on Stack Overflow and a few blogs on the internet, and most of them suggest BeautifulSoup, lxml, or HTMLParser, but few of them detail the functionality; they simply end up as a debate over which one is faster/more efficient.
So that I can ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

html = ...  # the HTML code you've written above
parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class':'container'}).text)
I guess you don't need performance descriptions - just read how BeautifulSoup works. Look at its official documentation.
I guess what you're looking for is pyquery:
pyquery: a jquery-like library for python.
An example of what you want may be like:
from pyquery import PyQuery
html = ...  # your HTML code
pq = PyQuery(html)
tag = pq('div#id') # or tag = pq('div.class')
print tag.text()
And it uses the same selectors as Firefox's or Chrome's inspect element. For example:
The inspected element selector is 'div#mw-head.noprint'. So in pyquery, you just need to pass this selector:
pq('div#mw-head.noprint')
Here you can read more about different HTML parsers in Python and their performance. Even though the article is a bit dated, it still gives you a good overview.
Python HTML parser performance
I'd recommend BeautifulSoup even though it isn't built in, just because it's so easy to work with for those kinds of tasks. E.g.:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://www.google.com/')
soup = BeautifulSoup(page)
x = soup.body.find('div', attrs={'class' : 'container'}).text
Compared to the other parser libraries lxml is extremely fast:
http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/
http://www.ianbicking.org/blog/2008/03/python-html-parser-performance.html
And with cssselect it’s quite easy to use for scraping HTML pages too:
from lxml.html import parse
doc = parse('http://www.google.com').getroot()
for link in doc.cssselect('a'):
    print '%s: %s' % (link.text_content(), link.get('href'))
lxml.html Documentation
I recommend lxml for parsing HTML. See "Parsing HTML" (on the lxml site).
In my experience, Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser but rather a very good string analyzer.
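For completeness, a minimal lxml.html sketch against the question's snippet:
from lxml import html

doc = html.fromstring("""
<html>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
</div>
</body>
</html>
""")
container = doc.find_class('container')[0]  # find_class matches by CSS class
print(container.text_content())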
I recommend using the justext library:
https://github.com/miso-belica/jusText
Usage:
Python2:
import requests
import justext
response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print paragraph.text
Python3:
import requests
import justext
response = requests.get("http://bbc.com/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print(paragraph.text)
I would use EHP
https://github.com/iogf/ehp
Here it is:
from ehp import *
doc = '''<html>
<head>Heading</head>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
</div>
</body>
</html>
'''
html = Html()
dom = html.feed(doc)
for ind in dom.find('div', ('class', 'container')):
    print ind.text()
Output:
Something here
Something else

Unable to get correct link in BeautifulSoup

I'm trying to parse a bit of HTML and I'd like to extract the link that matches a particular pattern. I'm using the find method with a regular expression but it doesn't get me the correct link. Here's my snippet. Could someone tell me what I'm doing wrong?
from BeautifulSoup import BeautifulSoup
import re
html = """
<div class="entry">
<a target="_blank" href="http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/">RT</a>
<a target="_blank" href="http://www.imdb.com/video/imdb/vi2496267289/">Trailer</a> –
<a target="_blank" href="http://www.imdb.com/title/tt1196141/">IMDB</a> –
</div>
"""
soup = BeautifulSoup(html)
print soup.find('a', href = re.compile(r".*title/tt.*"))['href']
I should be getting the second link, but BS always returns the first link. The href of the first link doesn't even match my regex, so why does it return it?
Thanks.
find only returns the first <a> tag. You want findAll.
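For what it's worth, with the modern bs4 package the original find() call does return the matching link; a quick sketch:
from bs4 import BeautifulSoup  # bs4, not the old BeautifulSoup 3 the question used
import re

html = '''
<div class="entry">
<a target="_blank" href="http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/">RT</a>
<a target="_blank" href="http://www.imdb.com/title/tt1196141/">IMDB</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a', href=re.compile(r'title/tt'))  # first tag whose href matches
print(link['href'])  # -> http://www.imdb.com/title/tt1196141/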
Can't answer your question, but anyway your (originally) posted code has an import typo. Change
import BeautifulSoup
to
from BeautifulSoup import BeautifulSoup
Then, your output (using beautifulsoup version 3.1.0.1) will be:
http://www.imdb.com/title/tt1196141/
