So I have HTML like this:
...
<ul class="myclass">
<li>blah</li>
<li>blah2</li>
</ul>
...
I want to get the texts "blah" and "blah2" from the ul with the class name "myclass".
So I tried to use innerhtml(), but for some reason it doesn't work with lxml.
I'm using Python 3.
I would try:
doc.xpath('.//ul[@class = "myclass"]/li/text()')
# out: ["blah","blah2"]
edit:
what if there was an <a> in the <li>? For example, how would I get "link" and "text" from <li>text <a href="http://example.com">link</a></li>?
link = doc.xpath('.//ul[@class = "myclass"]/li/a/@href')
txt = doc.xpath('.//ul[@class = "myclass"]/li/a/text()')
If you want, you can combine those; and if we take @larsmans' example, you can use '//' to get the whole text, because I believe lxml doesn't support the string() method inside a path expression.
doc.xpath('.//ul[@class="myclass"]/li[a]//text() | .//ul[@class="myclass"]/li/a/@href')
# out: ['I contain a ', 'http://example.com', 'link', '.']
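If each href needs to stay paired with its surrounding text, one option (a sketch, not part of the original answer; it assumes doc was parsed from html like the snippet below) is to loop over the <li> elements and run relative queries on each:

for li in doc.xpath('.//ul[@class="myclass"]/li[a]'):
    # relative queries keep each href together with its own text
    href = li.xpath('a/@href')[0]
    text = ''.join(li.xpath('.//text()'))
    print(href, text)
# http://example.com I contain a link.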
Also, you can use the text_content() method:
html = """
<html>
<ul class="myclass">
<li>I contain a <a href="http://example.com">link</a>.</li>
<li>blah</li>
<li>blah2</li>
</ul>
</html>
"""
import lxml.html as lh
doc = lh.fromstring(html)
for elem in doc.xpath('.//ul[@class="myclass"]/li'):
    print(elem.text_content())
prints:
#I contain a link.
#blah
#blah2
Related
I'm parsing a page that has structure like this:
<pre class="asdf">content a</pre>
<pre class="asdf">content b</pre>
And I'm using the following XPath to get the content:
"//pre[@class='asdf']/text()"
# returns
content a
content b
It works well, except if there are any elements nested inside the <pre> tag, where it doesn't concatenate them:
<pre class="asdf">content <a href="http://stackoverflow.com"</a>a</a></pre>
<pre class="asdf">content b</pre>
# returns
content
content b
If I use this XPath, I get the output that follows.
"//pre[#class='asdf']//text()"
content
a
content b
I don't want either of those. I want to get all text inside a <pre>, even if it has children. I don't care if the tags are stripped or not- but I want it concatenated together.
How do I do this? I'm using lxml.html's xpath in Python 2, but I don't think that matters. This answer to another question makes me think that maybe child:: has something to do with the answer.
Here's some code that reproduces it.
from lxml import html
tree = html.fromstring("""
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>
""")
for row in tree.xpath("//*[@class='asdf']/text()"):
    print("row: ", row)
.text_content() is what you should use:
.text_content():
Returns the text content of the element, including the text content of its children, with no markup.
for row in tree.xpath("//*[@class='asdf']"):
    print("row: ", row.text_content())
Demo:
>>> from lxml import html
>>>
>>> tree = html.fromstring("""
... <pre class="asdf">content a</pre>
... <pre class="asdf">content b</pre>
... """)
>>> for row in tree.xpath("//*[@class='asdf']"):
... print("row: ", row.text_content())
...
('row: ', 'content a')
('row: ', 'content b')
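As a side note (a variation, not from the original answer): while string() can't be used as a step inside a longer path expression, lxml does let you evaluate it on its own, relative to each matched element, which gives the same per-element concatenation:

for row in tree.xpath("//*[@class='asdf']"):
    # string() returns the string value of the element:
    # all descendant text concatenated in document order
    print("row: ", row.xpath("string()"))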
The title says it all. Is there any way to search the entire DOM for a specific text, for instance the word CAPTCHA?
You can use find and specify the text argument:
With text you can search for strings instead of tags. As with name and
the keyword arguments, you can pass in a string, a regular expression,
a list, a function, or the value True.
>>> from bs4 import BeautifulSoup
>>> data = """
... <div>test1</div>
... <div class="myclass1">test2</div>
... <div class="myclass2">CAPTCHA</div>
... <div class="myclass3">test3</div>"""
>>> soup = BeautifulSoup(data)
>>> soup.find(text='CAPTCHA').parent
<div class="myclass2">CAPTCHA</div>
If CAPTCHA is just a part of a text, you can pass a lambda function into text and check if CAPTCHA is inside the tag text:
>>> data = """
... <div>test1</div>
... <div class="myclass1">test2</div>
... <div class="myclass2">Here CAPTCHA is a part of a sentence</div>
... <div class="myclass3">test3</div>"""
>>> soup = BeautifulSoup(data)
>>> soup.find(text=lambda x: 'CAPTCHA' in x).parent
<div class="myclass2">Here CAPTCHA is a part of a sentence</div>
Or, the same can be achieved if you pass a regular expression into text:
>>> import re
>>> soup.find(text=re.compile('CAPTCHA')).parent
<div class="myclass2">Here CAPTCHA is a part of a sentence</div>
I am trying to parse all elements under a div using BeautifulSoup. The issue is that I don't know all the elements underneath the div prior to parsing. For example, a div can have text data in paragraph form and in bullet format, along with some href elements. Each URL that I open can have different elements underneath the specific div class that I am looking at:
example:
url a can have following:
<div class='content'>
<p> Hello I have a link </p>
<li> I have a bullet point
<a href="foo.com">foo</a>
</div>
but url b
can have
<div class='content'>
<p> I only have paragraph </p>
</div>
I started as doing something like this:
content = souping_page.body.find('div', attrs={'class': 'content'})
but how to go beyond this is a little confusing. I was hoping to create one string from all the parsed data as an end result.
At the end I want the following strings to be obtained from each example:
Example 1: Final Output
parse_data = Hello I have a link I have a bullet point
parse_links = foo.com
Example 2: Final Output
parse_data = I only have paragraph
You can get just the text of an element with element.get_text():
>>> from bs4 import BeautifulSoup
>>> sample1 = BeautifulSoup('''\
... <div class='content'>
... <p> Hello I have a link </p>
...
... <li> I have a bullet point
...
... <a href="foo.com">foo</a>
... </div>
... ''').find('div')
>>> sample2 = BeautifulSoup('''\
... <div class='content'>
... <p> I only have paragraph </p>
...
... </div>
... ''').find('div')
>>> sample1.get_text()
u'\n Hello I have a link \n I have a bullet point\n\nfoo\n'
>>> sample2.get_text()
u'\n I only have paragraph \n'
or you can strip it down a little using element.stripped_strings:
>>> ' '.join(sample1.stripped_strings)
u'Hello I have a link I have a bullet point foo'
>>> ' '.join(sample2.stripped_strings)
u'I only have paragraph'
To get all links, look for all a elements with href attributes and gather these in a list:
>>> [a['href'] for a in sample1.find_all('a', href=True)]
['foo.com']
>>> [a['href'] for a in sample2.find_all('a', href=True)]
[]
The href=True argument limits the search to <a> tags that have a href attribute defined.
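Putting those two snippets together reproduces the output format asked for in the question (parse_data and parse_links are the question's own names):

>>> parse_data = ' '.join(sample1.stripped_strings)
>>> parse_links = [a['href'] for a in sample1.find_all('a', href=True)]
>>> parse_data
u'Hello I have a link I have a bullet point foo'
>>> parse_links
['foo.com']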
Per the Beautiful Soup docs, to iterate over the children of a tag use either .contents to get them as a list or .children (a generator).
for child in title_tag.children:
print(child)
So, in your case, you grab the .text of each tag and concatenate it together, as sketched below. I'm not clear on whether you want the link location or simply the label; if the former, refer to this SO question.
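A rough sketch of that approach (hypothetical code, not from the docs; souping_page is the variable from the question, and only top-level links are collected here):

from bs4 import Tag

content = souping_page.body.find('div', attrs={'class': 'content'})
pieces, links = [], []
for child in content.children:
    if isinstance(child, Tag):
        pieces.append(child.get_text(' ', strip=True))
        # collect the link target if this child is itself an <a>
        if child.name == 'a' and child.has_attr('href'):
            links.append(child['href'])
    else:
        # NavigableString: bare text sitting between tags
        pieces.append(child.strip())
parse_data = ' '.join(p for p in pieces if p)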
These are my first steps with Python, so please bear with me.
Basically I want to parse a Table of Contents from a single Dokuwiki page with Beautiful Soup. The TOC looks like this:
<div id="dw__toc">
<h3 class="toggle">Table of Contents</h3>
<div>
<ul class="toc">
<li class="level1"><div class="li">#</div>
<ul class="toc">
<li class="level2"><div class="li">One</div></li>
<li class="level2"><div class="li">Two</div></li>
<li class="level2"><div class="li">Three</div></li>
I would like to be able to search the content of the a-tags and, if a result is found, return its content and also the href link. So if I search for "one" the result should be
One
#link1
What I have done so far:
#!/usr/bin/python2
from BeautifulSoup import BeautifulSoup
import urllib2
#Grab and open URL, create BeautifulSoup object
url = "http://www.somewiki.at/wiki/doku.php"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
#Grab Table of Contents
grab_toc = soup.find('div', {"id":"dw__toc"})
#Look for all divs with class: li
ftext = grab_toc.findAll('div', {"class":"li"})
#Look for links
links = grab_toc.findAll('a',href=True)
#Iterate
for everytext in ftext:
    text = ''.join(everytext.findAll(text=True))
    data = text.strip()
    print data

for everylink in links:
    print everylink['href']
This prints out the data I want, but I'm kind of lost on how to rewrite it so I can search within the result and return only the search term. I tried something like
if data == 'searchterm':
    print data
    break
else:
    print 'Nothing found'
But this is kind of a weak search. Is there a nicer way to do this? In my example the Beautiful Soup result set is turned into a list. Is it better to search the result set in the first place, and if so, how?
Instead of searching through the links one-by-one, have BeautifulSoup search for you, using a regular expression:
import re
matching_link = grab_toc.find('a', text=re.compile('one', re.IGNORECASE))
This would find the first a link in the table of contents with the 3 characters one in the text somewhere. Then just print the link and text:
print matching_link.string
print matching_link['href']
Short demo based on your sample:
>>> from bs4 import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup('''\
... <div id="dw__toc">
... <h3 class="toggle">Table of Contents</h3>
... <div>
...
... <ul class="toc">
... <li class="level1"><div class="li">#</div>
... <ul class="toc">
... <li class="level2"><div class="li">One</div></li>
... <li class="level2"><div class="li">Two</div></li>
... <li class="level2"><div class="li">Three</div></li>
... </ul></ul>''')
>>> matching_link = soup.find('a', text=re.compile('one', re.IGNORECASE))
>>> print matching_link.string
One
>>> print matching_link['href']
#link1
In BeautifulSoup version 3, the above .find() call returns the contained NavigableString object instead. To get back to the parent a element, use the .parent attribute:
matching_link = grab_toc.find('a', text=re.compile('one', re.IGNORECASE)).parent
print matching_link.string
print matching_link['href']
I'm trying to match something like this with BeautifulSoup.
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
In a regexp, it would look something like this. I want to capture the URL.
<a href="\.(.*)">
<b>.*</b>
</a>
How do I go about doing something like this with BeautifulSoup? I need to use the b tags inside of the 'a' tags I want, since that's the only thing that differentiates these 'a's from any other link on the page. It seems like I can only write regexps to match the tag name or specific attributes?
If you just want to get the href from all a tags which contain one b tag:
>>> from BeautifulSoup import BeautifulSoup
>>> html = """
... <html><head><title>Title</title></head><body>
... <b>first</b>
... <a><b>no-href</b></a>
... <div><b>second</b></div>
... <div><b>third</b></div>
... no-bold-tag
... <b>text</b><p>other-stuff</p>
... </body></html>
... ... """
>>> soup = BeautifulSoup(html)
>>> [a['href'] for a in soup('a', href=True) if a.b and len(a) == 1]
[u'first/index.php', u'second/index.php', u'third/index.php']
This can be done quite elegantly using an XPath expression if you don't mind using lxml.
import lxml.html as lh
html = '''
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
<a href="./Another/URL.php">
<b>foo</b>
<p>bar</p>
</a>
'''
tree = lh.fromstring(html)
for link in tree.xpath('a[count(b) = 1 and count(*) = 1]'):
    print lh.tostring(link)
Result:
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
Or if you wanted to use a method more similar to @ekhumoro's with lxml you could do:
[a for a in tree.xpath('a[@href]') if a.find('b') is not None and len(a) == 1]
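As a quick sanity check (reusing the tree defined above), both versions pick out only the first <a>, since the second one contains a <p> as well:

matches = [a for a in tree.xpath('a[@href]') if a.find('b') is not None and len(a) == 1]
print [a.get('href') for a in matches]
# ['./SlimLineUSB3/SlimLine1BayUSB3.php']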