Python and BeautifulSoup, not finding 'a' - python

Here's a piece of HTML code (from delicious):
<h4>
<a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anonymous Referers & Anti-Bot Protection</a>
<span class="saverem">
<em class="bookmark-actions">
<strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Generate%20Secure%20Links%20with%20Anonymous%20Referers%20%26%20Anti-Bot%20Protection&jump=%2Fdux&key=fFS4QzJW2lBf4gAtcrbuekRQfTY-&original_user=dux&copyuser=dux&copytags=web+apps+url+security+generator+shortener+anonymous+links">SAVE</a></strong>
</em>
</span>
</h4>
I'm trying to find all the links where class="inlinesave action". Here's the code:
sock = urllib2.urlopen('http://delicious.com/theuser')
html = sock.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a', attrs={'class':'inlinesave action'})
print len(tags)
But it doesn't find anything!
Any thoughts?
Thanks

If you want to look for an anchor with exactly those two classes you'd, have to use a regexp, I think:
tags = soup.findAll('a', attrs={'class': re.compile(r'\binlinesave\b.*\baction\b')})
Keep in mind that this regexp won't work if the ordering of the class names is reversed (class="action inlinesave").
The following statement should work for all cases (even though it looks ugly imo.):
soup.findAll('a',
attrs={'class':
re.compile(r'\baction\b.*\binlinesave\b|\binlinesave\b.*\baction\b')
})

Python string methods
html=open("file").read()
for item in html.split("<strong>"):
if "class" in item and "inlinesave action" in item:
url_with_junk = item.split('href="')[1]
m = url_with_junk.index('">')
print url_with_junk[:m]

May be that issue is fixed in verion 3.1.0, I could parse yours,
>>> html="""<h4>
... <a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anony
... <span class="saverem">
... <em class="bookmark-actions">
... <strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Gen
... </em>
... </span>
... </h4>"""
>>>
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> tags = soup.findAll('a', attrs={'class':'inlinesave action'})
>>> print len(tags)
1
>>> tags
[<a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Generate%20Secure%
>>>
I have tried with BeautifulSoup 2.1.1 also, its does not work at all.

You might make some forward progress using pyparsing:
from pyparsing import makeHTMLTags, withAttribute
htmlsrc="""<h4>... etc."""
atag = makeHTMLTags("a")[0]
atag.setParseAction(withAttribute(("class","inlinesave action")))
for result in atag.searchString(htmlsrc):
print result.href
Gives (long result output snipped at '...'):
/save?url=http%3A%2F%2Fimfy.us%2F&title=Genera...+anonymous+links

Related

How to loop thru a div class to get access to the li class within?

Im scraping a page and found that with my xpath and regex methods i cant seem to get to a set of values that are within a div class
I have tried the method stated here on this page
How to get all the li tag within div tag
and then the current logic shown below that is within my file
#PRODUCT ATTRIBUTES (STYLE, SKU, BRAND) need to figure out how to loop thru a class and pull out the 2 list tags
prodattr = re.compile(r'<div class=\"pdp-desc-attr spec-prod-attr\">([^<]+)</div>', re.IGNORECASE)
prodattrmatches = re.findall(prodattr, html)
for m in prodattrmatches:
m = re.compile(r'<li class=\"last last-item\">([^<]+)</li>', re.IGNORECASE)
stymatches = re.findall(m, html)
#STYLE
sty = re.compile(r'<li class=\"last last-item\">([^<]+)</li>', re.IGNORECASE)
stymatches = re.findall(sty, html)
#BRAND
brd = re.compile(r'<li class=\"first first-item\">([^<]+)</li>', re.IGNORECASE)
brdmatches = re.findall(brd, html)
The above is the current code that is NOT working.. everything comes back empty. For the purpose of my testing im merely writing the data, if any, out to the print command so i can see it on the console..
itmDetails2 = dets['sku'] +","+ dets['description']+","+ dets['price']+","+ dets['brand']
and within the console this is what i get this, which is what i expect and the generic messages are just place holders until i get this logic figured out.
SKUE GOES HERE,adidas Women's Essentials Tricot Track Jacket,34.97, BRAND GOES HERE
<div class="pdp-desc-attr spec-prod-attr">
<ul class="prod-attr-list">
<li class="first first-item">Brand: adidas</li>
<li>Country of Origin: Imported</li>
<li class="last last-item">Style: F18AAW400D</li>
</ul>
</div>
Do not use Regex to parse HTML
There are better and safer ways to do this.
Take a look in this code using Parsel and BeautifulSoup to extract the li tags of your sample code:
from parsel import Selector
from bs4 import BeautifulSoup
html = ('<div class="pdp-desc-attr spec-prod-attr">'
'<ul class="prod-attr-list">'
'<li class="first first-item">Brand: adidas</li>'
'<li>Country of Origin: Imported</li>'
'<li class="last last-item">Style: F18AAW400D</li>'
'</ul>'
'</div>')
# Using parsel
sel = Selector(text=html)
for li in sel.xpath('//li'):
print(li.xpath('./text()').get())
# Using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for li in soup.find_all('li'):
print(li.text)
Output:
Brand: adidas
Country of Origin: Imported
Style: F18AAW400D
Brand: adidas
Country of Origin: Imported
Style: F18AAW400D
I would use an html parser and look for the class of the ul. Using bs4 4.7.1
from bs4 import BeautifulSoup as bs
html = '''
<div class="pdp-desc-attr spec-prod-attr">
<ul class="prod-attr-list">
<li class="first first-item">Brand: adidas</li>
<li>Country of Origin: Imported</li>
<li class="last last-item">Style: F18AAW400D</li>
</ul>
</div>
'''
soup = bs(html, 'lxml')
for item in soup.select('.prod-attr-list:has(> li)'):
print([sub_item.text for sub_item in item.select('li')])

Python 3 get child elements (lxml)

I am using lxml with html:
from lxml import html
import requests
How would I check if any of an element's children have the class = "nearby"
my code (essentially):
url = "www.example.com"
Page = requests.get(url)
Tree = html.fromstring(Page.content)
resultList = Tree.xpath('//p[#class="result-info"]')
i=len(resultList)-1 #to go though the list backwards
while i>0:
if (resultList[i].HasChildWithClass("nearby")):
print('This result has a child with the class "nearby"')
How would I replace "HasChildWithClass()" to make it actually work?
Here's an example tree:
...
<p class="result-info">
<span class="result-meta">
<span class="nearby">
... #this SHOULD print something
</span>
</span>
</p>
<p class="result-info">
<span class="result-meta">
<span class="FAR-AWAY">
... # this should NOT print anything
</span>
</span>
</p>
...
I tried to understand why you use lxml to find the element. However BeautifulSoup and re may be a better choice.
lxml = """
<p class="result-info">
<span class="result-meta">
<span class="nearby">
... #this SHOULD print something
</span>
</span>
</p>
<p class="result-info">
<span class="result-meta">
<span class="FAR-AWAY">
... # this should NOT print anything
</span>
</span>
</p>
"""
But i done what you want.
from lxml import html
Tree = html.fromstring(lxml)
resultList = Tree.xpath('//p[#class="result-info"]')
i = len(resultList) - 1 #to go though the list backwards
for result in resultList:
for e in result.iter():
if e.attrib.get("class") == "nearby":
print(e.text)
Try to use bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(lxml,"lxml")
result = soup.find_all("span", class_="nearby")
print(result[0].text)
Here is an experiment I did.
Take r = resultList[0] in python shell and type:
>>> dir(r)
['__bool__', '__class__', ..., 'find_class', ...
Now this find_class method is highly suspicious. If you check its help doc:
>>> help(r.find_class)
you'll confirm the guess. Indeed,
>>> r.find_class('nearby')
[<Element span at 0x109788ea8>]
For the other tag s = resultList[1] in the example xml code you gave,
>>> s.find_class('nearby')
[]
Now it's clear how to tell whether a 'nearby' child exists or not.
Cheers!

Extract content with BeautifulSoup and Python

I'm trying to scrap a forum but I can't deal with the comments, because the users use emoticons, and bold font, and cite previous messages, and and and...
For example, here's one of the comments that I have a problem with:
<div class="content">
<blockquote>
<div>
<cite>User write:</cite>
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
</div>
</blockquote>
<br/>
THIS IS THE COMMENT THAT I NEED!
</div>
I searching for help for the last 4 days and I couldn't find anything, so I decided to ask here.
This is the code that I'm using:
def make_soup(url):
html = urlopen(url).read()
return BeautifulSoup(html, "lxml")
def get_messages(url):
soup = make_soup(url)
msg = soup.find("div", {"class" : "content"})
# I get in msg the hole message, exactly as I wrote previously
print msg
# Here I get:
# 1. <blockquote> ... </blockquote>
# 2. <br/>
# 3. THIS IS THE COMMENT THAT I NEED!
for item in msg.children:
print item
I'm looking for a way to deal with messages in a general way, no matter how they are. Sometimes they put emoticons between the text and I need to remove them and get the hole message (in this situation, bsp will put each part of the message (first part, emoticon, second part) in different items).
Thanks in advance!
Use decompose http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose
Decompose extracts tags that you don't want. In your case:
soup.blockquote.decompose()
or all unwanted tags:
for tag in ['blockquote', 'img', ... ]:
soup.find(tag).decompose()
Your example:
>>> from bs4 import BeautifulSoup
>>> html = """<div class="content">
... <blockquote>
... <div>
... <cite>User write:</cite>
... I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
... </div>
... </blockquote>
... <br/>
... THIS IS THE COMMENT THAT I NEED!
... </div>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('blockquote').decompose()
>>> soup.find("div", {"class" : "content"}).text.strip()
u'THIS IS THE COMMENT THAT I NEED!'
Update
Sometimes all you have is a tag starting point but you are actually interested in the content before or after that starting point. You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:
>>> html = """<div>No<blockquote>No</blockquote>Yes.<em>Yes!</em>Yes?</div>No!"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> elm = soup.blockquote.next_sibling
>>> txt = ""
>>> while elm:
... txt += elm.string
... elm = elm.next_sibling
...
>>> print(txt)
u'Yes.Yes!Yes?'
BeautifulSoup has a get_text method. Maybe this is what you want.
From their documentation:
markup = '\nI linked to <i>example.com</i>\n'
soup = BeautifulSoup(markup)
soup.get_text()
u'\nI linked to example.com\n'
soup.i.get_text()
u'example.com'
If the text you want is never within any additional tags, as in your example, you can use extract() to get rid of all the tags and their contents:
html = '<div class="content">\
<blockquote>\
<div>\
<cite>User write:</cite>\
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">\
</div>\
</blockquote>\
<br/>\
THIS IS THE COMMENT THAT I NEED!\
</div>'
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='content')
tags = div.findAll(recursive=False)
for tag in tags:
tag.extract()
text = div.get_text(strip=True)
print(text)
This gives:
THIS IS THE COMMENT THAT I NEED!
To deal with emoticons, you'll have to do something more complicated. You'll probably have to define a list of emoticons to recognize yourself, and then parse the text to look for them.

Beautiful Soup and searching in results

These are my first steps with python, please bear with me.
Basically I want to parse a Table of Contents from a single Dokuwiki page with Beautiful Soup. The TOC looks like this:
<div id="dw__toc">
<h3 class="toggle">Table of Contents</h3>
<div>
<ul class="toc">
<li class="level1"><div class="li">#</div>
<ul class="toc">
<li class="level2"><div class="li">One</div></li>
<li class="level2"><div class="li">Two</div></li>
<li class="level2"><div class="li">Three</div></li>
I would like to be able to search in the content of the a-tags and if a result is found return its content and also return the href-link. So if I search for "one" the result should be
One
#link1
What I have done so far:
#!/usr/bin/python2
from BeautifulSoup import BeautifulSoup
import urllib2
#Grab and open URL, create BeatifulSoup object
url = "http://www.somewiki.at/wiki/doku.php"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
#Grab Table of Contents
grab_toc = soup.find('div', {"id":"dw__toc"})
#Look for all divs with class: li
ftext = grab_toc.findAll('div', {"class":"li"})
#Look for links
links = grab_toc.findAll('a',href=True)
#Iterate
for everytext in ftext:
text = ''.join(everytext.findAll(text=True))
data = text.strip()
print data
for everylink in links:
print everylink['href']
This prints out the data I want but I'm kind of lost to rewrite it to be able to search within the result and only return the searchterm. Tried something like
if data == 'searchtearm':
print data
break
else:
print 'Nothing found'
But this is kind of a weak search. Is there a nicer way to do this? In my example the Beatiful Soup resultset is changed into a list. Is it better to search in the result set in the first place, if so then how to do this?
Instead of searching through the links one-by-one, have BeautifulSoup search for you, using a regular expression:
import re
matching_link = grab_toc.find('a', text=re.compile('one', re.IGNORECASE))
This would find the first a link in the table of contents with the 3 characters one in the text somewhere. Then just print the link and text:
print matching_link.string
print matching_link['href']
Short demo based on your sample:
>>> from bs4 import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup('''\
... <div id="dw__toc">
... <h3 class="toggle">Table of Contents</h3>
... <div>
...
... <ul class="toc">
... <li class="level1"><div class="li">#</div>
... <ul class="toc">
... <li class="level2"><div class="li">One</div></li>
... <li class="level2"><div class="li">Two</div></li>
... <li class="level2"><div class="li">Three</div></li>
... </ul></ul>''')
>>> matching_link = soup.find('a', text=re.compile('one', re.IGNORECASE))
>>> print matching_link.string
One
>>> print matching_link['href']
#link1
In BeautifulSoup version 3, the above .find() call returns the contained NavigableString object instead. To get back to the parent a element, use the .parent attribute:
matching_link = grab_toc.find('a', text=re.compile('one', re.IGNORECASE)).parent
print matching_link.string
print matching_link['href']

Matching HTML subtree with BeautifulSoup

I'm trying to match something like this with beautifulsoup.
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
In a regexp, it would look something like this. I want to capture the url.
<a href="\.(.*)">
<b>.*</b>
</a>
How do I go about doing something like this with BeautifulSoup? I need to use the b tags inside of the 'a' tags I want, since that's the only thing that differentiates these 'a's from any other link on the page. It seems like I can only write regexps to match the tag name or specific attributes?
If you just want to get the href from all a tags which contain one b tag:
>>> from BeautifulSoup import BeautifulSoup
>>> html = """
... <html><head><title>Title</title></head><body>
... <b>first</b>
... <a><b>no-href</b></a>
... <div><b>second</b></div>
... <div><b>third</b></div>
... no-bold-tag
... <b>text</b><p>other-stuff</p>
... </body></html>
... ... """
>>> soup = BeautifulSoup(html)
>>> [a['href'] for a in soup('a', href=True) if a.b and len(a) == 1]
[u'first/index.php', u'second/index.php', u'third/index.php']
This can be done quite elegantly using an XPath expression if you don't mind using lxml.
import lxml.html as lh
html = '''
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
<a href="./Another/URL.php">
<b>foo</b>
<p>bar</p>
</a>
'''
tree = lh.fromstring(html)
for link in tree.xpath('a[count(b) = 1 and count(*) = 1]'):
print lh.tostring(link)
Result:
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
Or if you wanted to use a method more similar to #ekhumoro's with lxml you could do:
[a for a in tree.xpath('a[#href]') if a.find('b') != None and len(a) == 1]

Categories