I am using lxml with html:
from lxml import html
import requests
How would I check if any of an element's children have the class "nearby"?
My code (essentially):
url = "www.example.com"
Page = requests.get(url)
Tree = html.fromstring(Page.content)
resultList = Tree.xpath('//p[#class="result-info"]')
i=len(resultList)-1 #to go though the list backwards
while i>0:
if (resultList[i].HasChildWithClass("nearby")):
print('This result has a child with the class "nearby"')
How would I replace "HasChildWithClass()" to make it actually work?
Here's an example tree:
...
<p class="result-info">
<span class="result-meta">
<span class="nearby">
... #this SHOULD print something
</span>
</span>
</p>
<p class="result-info">
<span class="result-meta">
<span class="FAR-AWAY">
... # this should NOT print anything
</span>
</span>
</p>
...
I tried to understand why you use lxml to find the element; however, BeautifulSoup may be a better choice.
html_source = """
<p class="result-info">
<span class="result-meta">
<span class="nearby">
... #this SHOULD print something
</span>
</span>
</p>
<p class="result-info">
<span class="result-meta">
<span class="FAR-AWAY">
... # this should NOT print anything
</span>
</span>
</p>
"""
But here is how to do what you want:
from lxml import html
Tree = html.fromstring(html_source)
resultList = Tree.xpath('//p[@class="result-info"]')
for result in resultList:
    for e in result.iter():
        if e.attrib.get("class") == "nearby":
            print(e.text)
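Alternatively, the descendant test can be pushed into the XPath expression itself. A minimal sketch; note that @class="nearby" is an exact string comparison, so it would miss an element carrying several classes at once:
# Select only the result-info paragraphs that have some descendant
# whose class attribute is exactly "nearby".
nearby_results = Tree.xpath('//p[@class="result-info"][.//*[@class="nearby"]]')
for result in nearby_results:
    print('This result has a child with the class "nearby"')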
Or try bs4:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_source, "lxml")
result = soup.find_all("span", class_="nearby")
print(result[0].text)
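Note that class_="nearby" matches any element whose class attribute contains "nearby" among its classes, unlike the exact string comparison e.attrib.get("class") == "nearby" in the lxml loop above.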
Here is an experiment I did.
Take r = resultList[0] in a Python shell and type:
>>> dir(r)
['__bool__', '__class__', ..., 'find_class', ...
Now this find_class method is highly suspicious. If you check its help doc:
>>> help(r.find_class)
you'll confirm the guess. Indeed,
>>> r.find_class('nearby')
[<Element span at 0x109788ea8>]
For the other element, s = resultList[1] in the example HTML you gave,
>>> s.find_class('nearby')
[]
Now it's clear how to tell whether a 'nearby' child exists or not.
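Putting it together with your original loop, a minimal sketch: find_class('nearby') returns the matching descendants, and an empty list when there are none.
for result in resultList:
    if result.find_class('nearby'):  # non-empty list: a 'nearby' descendant exists
        print('This result has a child with the class "nearby"')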
Cheers!
Related
I have a document which contains several div.inventory siblings.
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
$1.23
</span>
</div>
I would like to iterate over them to print the item number and link of the item.
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
How do I parse these two values after selecting the div.inventory element?
import requests
from bs4 import BeautifulSoup
htmlSource = requests.get(url).text
soup = BeautifulSoup(htmlSource)
matches = soup.select('div.inventory')
for match in matches:
    # prints 123
    # prints http://linktoitem
Also - what is the difference between the select function and find* functions?
You can find both items using find(), relying on the class attributes:
soup = BeautifulSoup(data)
for inventory in soup.select('div.inventory'):
    number = inventory.find('span', class_='item-number').text
    link = inventory.find('span', class_='cost').a.get('href')
    print number, link
Example:
from bs4 import BeautifulSoup
data = """
<body>
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
$1.23
</span>
</div>
<div class="inventory">
<span class="item-number">456</span>
<span class="cost">
$1.23
</span>
</div>
<div class="inventory">
<span class="item-number">789</span>
<span class="cost">
$1.23
</span>
</div>
</body>
"""
soup = BeautifulSoup(data)
for inventory in soup.select('div.inventory'):
number = inventory.find('span', class_='item-number').text
link = inventory.find('span', class_='cost').a.get('href')
print number, link
Prints:
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
Note the use of select() - this method allows you to use CSS selectors for searching over the page. Also note the class_ argument - the underscore is important, since class is a reserved keyword in Python.
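For instance, on this document the following two calls match the same elements; select() takes a CSS selector string, while find_all() takes the tag name and attribute filters as separate arguments:
soup.select('div.inventory span.item-number')   # CSS selector syntax
soup.find_all('span', class_='item-number')     # keyword-argument syntax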
My code is like this:
import urllib2
from lxml import etree
response = urllib2.urlopen("file:///C:/data20140801.html")
page = response.read()
tree = etree.HTML(page)
data = tree.xpath("//p/span/text()")
The HTML page could have this structure:
<span style="font-size:10.0pt">Something</span>
The HTML page could also have this structure:
<p class="Normal">
<span style="font-size:10.0pt">Some</span>
<span style="font-size:10.0pt">thing<span>
</p>
Using the same code for both, I want to get "Something".
The XPath expression returns a list of values:
>>> from lxml.html import etree
>>> tree = etree.HTML('''\
... <p class="Normal">
... <span style="font-size:10.0pt">Some</span>
... <span style="font-size:10.0pt">thing<span>
... </p>
... ''')
>>> tree.xpath("//p/span/text()")
['Some', 'thing']
Use ''.join() to combine the two strings into one:
>>> ''.join(tree.xpath("//p/span/text()"))
'Something'
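The same expression also covers the single-span structure, assuming that span also sits inside a <p>, as the //p/span path requires, since joining a one-element list just yields that element:
>>> tree = etree.HTML('<p class="Normal"><span style="font-size:10.0pt">Something</span></p>')
>>> ''.join(tree.xpath("//p/span/text()"))
'Something'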
I'm trying to parse the content between PRE tags using Python, with this code:
s = br.open(base_url + str(string))
u = br.geturl()
seq = br.open(u)
blat = BeautifulSoup(seq)
for res in blat.find('pre').findChildren():
    seq = res.string
    print seq
from the following HTML source code:
<PRE><TT>
<span style='color:#22CCEE;'>T</span><span style='color:#3300FF;'>AAAAGATGA</span> <span style='color:#3300FF;'>AGTTTCTATC</span> <span style='color:#3300FF;'>ATCCAAA</span>aa<span style='color:#3300FF;'>A</span> <span style='color:#3300FF;'>TGGGCTACAG</span> <span style='color:#3300FF;'>AAAC</span><span style='color:#22CCEE;'>C</span></TT></PRE>
<HR ALIGN="CENTER"><H4><A NAME=genomic></A>Genomic chr17 (reverse strand):</H4>
<PRE><TT>
tacatttttc tctaactgca aacataatgt tttcccttgt attttacaga 41256278
tgcaaacagc tataattttg caaaaaagga aaataactct cctgaacatc 41256228
<A NAME=1></A><span style='color:#22CCEE;'>T</span><span style='color:#3300FF;'>AAAAGATGA</span> <span style='color:#3300FF;'>AGTTTCTATC</span> <span style='color:#3300FF;'>ATCCAAA</span>gt<span style='color:#3300FF;'>A</span> <span style='color:#3300FF;'>TGGGCTACAG</span> <span style='color:#3300FF;'>AAAC</span><span style='color:#22CCEE;'>C</span>gtgcc 41256178
aaaagacttc tacagagtga acccgaaaat ccttccttgg taaaaccatt 41256128
tgttttcttc ttcttcttct tcttcttttc tttttttttt ctttt</TT></PRE>
<HR ALIGN="CENTER"><H4><A NAME=ali></A>Side by Side Alignment</H4>
<PRE><TT>
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41256227 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41256183
</TT></PRE>
It gives me the elements of the first PRE tag, when I want to parse the last one. I'd appreciate any suggestions on how to achieve this.
I'd like the output to be like:
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41256227 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41256183
whereas my current output is
T
AAAAGATGA
AGTTTCTATC
ATCCAAA
A
TGGGCTACAG
AAAC
C
You can use find_all() and get the last result:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('../index.html'), 'html5lib')
pre = soup.find_all('pre')[-1]
print pre.text.strip()
where index.html contains the html you provided.
It prints:
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41256227 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41256183
Another option would be to rely on the previous h4 tag to get the appropriate pre:
h4 = soup.select('h4 > a[name="ali"]')[0].parent
print h4.find_next_sibling('pre').text.strip()
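The h4-based approach keys off the "Side by Side Alignment" heading rather than the position of the pre block, so it would keep working even if more PRE blocks were added to the page.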
These are my first steps with python, please bear with me.
Basically I want to parse a Table of Contents from a single Dokuwiki page with Beautiful Soup. The TOC looks like this:
<div id="dw__toc">
<h3 class="toggle">Table of Contents</h3>
<div>
<ul class="toc">
<li class="level1"><div class="li">#</div>
<ul class="toc">
<li class="level2"><div class="li">One</div></li>
<li class="level2"><div class="li">Two</div></li>
<li class="level2"><div class="li">Three</div></li>
I would like to be able to search the content of the a-tags and, if a result is found, return its content and also its href link. So if I search for "one" the result should be
One
#link1
What I have done so far:
#!/usr/bin/python2
from BeautifulSoup import BeautifulSoup
import urllib2
#Grab and open URL, create BeautifulSoup object
url = "http://www.somewiki.at/wiki/doku.php"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
#Grab Table of Contents
grab_toc = soup.find('div', {"id":"dw__toc"})
#Look for all divs with class: li
ftext = grab_toc.findAll('div', {"class":"li"})
#Look for links
links = grab_toc.findAll('a',href=True)
#Iterate
for everytext in ftext:
    text = ''.join(everytext.findAll(text=True))
    data = text.strip()
    print data
for everylink in links:
    print everylink['href']
This prints out the data I want, but I'm at a loss as to how to rewrite it so that I can search within the result and only return the search term. I tried something like:
if data == 'searchterm':
    print data
    break
else:
    print 'Nothing found'
But this is a pretty weak search. Is there a nicer way to do this? In my example the Beautiful Soup result set is converted into a list. Is it better to search within the result set in the first place, and if so, how?
Instead of searching through the links one-by-one, have BeautifulSoup search for you, using a regular expression:
import re
matching_link = grab_toc.find('a', text=re.compile('one', re.IGNORECASE))
This finds the first a link in the table of contents whose text contains the three characters one somewhere. Then just print the link and its text:
print matching_link.string
print matching_link['href']
Short demo based on your sample:
>>> from bs4 import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup('''\
... <div id="dw__toc">
... <h3 class="toggle">Table of Contents</h3>
... <div>
...
... <ul class="toc">
... <li class="level1"><div class="li">#</div>
... <ul class="toc">
... <li class="level2"><div class="li">One</div></li>
... <li class="level2"><div class="li">Two</div></li>
... <li class="level2"><div class="li">Three</div></li>
... </ul></ul>''')
>>> matching_link = soup.find('a', text=re.compile('one', re.IGNORECASE))
>>> print matching_link.string
One
>>> print matching_link['href']
#link1
In BeautifulSoup version 3, the above .find() call returns the contained NavigableString object instead. To get back to the parent a element, use the .parent attribute:
matching_link = grab_toc.find('a', text=re.compile('one', re.IGNORECASE)).parent
print matching_link.string
print matching_link['href']
Here's a piece of HTML code (from delicious):
<h4>
<a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anonymous Referers & Anti-Bot Protection</a>
<span class="saverem">
<em class="bookmark-actions">
<strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Generate%20Secure%20Links%20with%20Anonymous%20Referers%20%26%20Anti-Bot%20Protection&jump=%2Fdux&key=fFS4QzJW2lBf4gAtcrbuekRQfTY-&original_user=dux©user=dux©tags=web+apps+url+security+generator+shortener+anonymous+links">SAVE</a></strong>
</em>
</span>
</h4>
I'm trying to find all the links where class="inlinesave action". Here's the code:
import urllib2
from BeautifulSoup import BeautifulSoup
sock = urllib2.urlopen('http://delicious.com/theuser')
html = sock.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a', attrs={'class':'inlinesave action'})
print len(tags)
But it doesn't find anything!
Any thoughts?
Thanks
If you want to look for an anchor with exactly those two classes, you'd have to use a regexp, I think:
import re
tags = soup.findAll('a', attrs={'class': re.compile(r'\binlinesave\b.*\baction\b')})
Keep in mind that this regexp won't work if the ordering of the class names is reversed (class="action inlinesave").
The following statement should work for all cases (even though it looks ugly imo.):
soup.findAll('a',
attrs={'class':
re.compile(r'\baction\b.*\binlinesave\b|\binlinesave\b.*\baction\b')
})
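A regex-free alternative that is insensitive to class order is to pass a callable as the attribute filter. A minimal sketch, assuming BeautifulSoup 3, which hands the raw class attribute string (or None, if absent) to the callable:
def has_both_classes(css_class):
    # split the class attribute into individual class names
    classes = (css_class or '').split()
    return 'inlinesave' in classes and 'action' in classes

tags = soup.findAll('a', attrs={'class': has_both_classes})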
Python string methods:
html = open("file").read()
for item in html.split("<strong>"):
    if "class" in item and "inlinesave action" in item:
        url_with_junk = item.split('href="')[1]
        m = url_with_junk.index('">')
        print url_with_junk[:m]
Maybe that issue is fixed in version 3.1.0; I could parse yours:
>>> html="""<h4>
... <a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anony
... <span class="saverem">
... <em class="bookmark-actions">
... <strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Gen
... </em>
... </span>
... </h4>"""
>>>
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> tags = soup.findAll('a', attrs={'class':'inlinesave action'})
>>> print len(tags)
1
>>> tags
[<a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Generate%20Secure%
>>>
I have also tried with BeautifulSoup 2.1.1; it does not work at all.
You might make some forward progress using pyparsing:
from pyparsing import makeHTMLTags, withAttribute
htmlsrc = """<h4>... etc."""
atag = makeHTMLTags("a")[0]
atag.setParseAction(withAttribute(("class", "inlinesave action")))
for result in atag.searchString(htmlsrc):
    print result.href
Gives (long result output snipped at '...'):
/save?url=http%3A%2F%2Fimfy.us%2F&title=Genera...+anonymous+links