lxml doesn't get all text in element if text has <br />? - python

I am using lxml to parse a web document. I want to get all the text in a <p> element, so I use the following code:
from lxml import etree
page = etree.HTML("<html><p>test1 <br /> test2</p></html>")
print(page.xpath("//p")[0].text)  # this prints only "test1", not "test1 <br/> test2"
The problem is that I want all the text in <p>, which is test1 <br /> test2 in the example, but lxml only gives me test1.
How can I get all text in <p> element?

Several other possible ways:
p = page.xpath("//p")[0]
print(etree.tostring(p, method="text"))
or use the XPath string() function (note that XPath position indexes start from 1, not 0):
page.xpath("string(//p[1])")

Maybe like this
from lxml import etree
pag = etree.HTML("<html><p>test1 <br /> test2</p></html>")
# get all texts
print(pag.xpath("//p/text()"))
['test1 ', ' test2']
# concatenate
print("".join(pag.xpath("//p/text()")))
test1 test2
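As background to the answers above: in the element-tree model, text that follows a child element is stored on that child's .tail, not in the parent's .text, which is why .text stops at the <br />. The sketch below illustrates this with the standard library's xml.etree.ElementTree, used here as a stand-in because it shares the .text / .tail / itertext() model with lxml.etree; it assumes well-formed markup, which real web pages often are not.

```python
# Stand-in sketch: xml.etree.ElementTree (standard library) uses the same
# .text / .tail model as lxml.etree, so it can illustrate the behaviour.
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>test1 <br /> test2</p>")
br = p[0]

first_chunk = p.text               # text before the first child element
after_br = br.tail                 # text after <br />, stored on the <br /> element
all_text = "".join(p.itertext())   # every text node, in document order

print(repr(first_chunk), repr(after_br), repr(all_text))
```

Note that the joined string keeps both spaces around the <br />, so some whitespace normalisation may be wanted afterwards.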

Related

How can I improve my solution for getting unknown text between tags?

I'm very new to Python and BeautifulSoup and trying to up my game. Let's say this is my HTML:
<div class="container">
<h4>Title 1</h4>
Text I want is here
<br /> # random break tags inserted throughout
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
<ul> # More HTML that I do not want</ul>
</div> # End container div
My expected output is the text between the two H4 tags:
Text I want is here
More text I want here
yet more text I want
But I don't know in advance what this text will say or how much of it there will be. There might be only one line, or there might be several paragraphs. It is not tagged with anything: no p tags, no id, nothing. The only thing I know about it is that it will appear between those two H4 tags.
At the moment, what I'm doing is working backward from the second H4 tag by using .previous_siblings to get everything up to the container div.
text = soup.find('div', class_ = 'container').find_next('h4', text = 'Title 2')
text = text.previous_siblings
text_list = []
for line in text:
    text_list.append(line)
text_list.reverse()
full_text = ' '.join([str(line) for line in text_list])
text = full_text.strip().replace('<h4>Title 1</h4>', '').replace('<br />', '')
This gives me the content I want, but it also gives me a lot more that I don't want, plus it gives it to me backwards, which is why I need to use reverse(). Then I end up having to strip out a lot of stuff using replace().
What I don't like about this is that since my end result is a list, I'm finding it hard to clean up the output. I can't use get_text() on a list. In my real-life version of this I have about ten instances of replace() and it's still not getting rid of everything.
Is there a more elegant way for me to get the desired output?
You can filter the previous siblings for NavigableStrings.
For example:
from bs4 import NavigableString
text = soup.find('div', class_ = 'container').find_next('h4', text = 'Title 2')
text = text.previous_siblings
text_list = [t for t in text if type(t) == NavigableString]
text_list will look like:
>>> text_list
[u'\nyet more text I want\n', u'\nMore text I want here\n', u'\n', u'\nText I want is here\n', u'\n']
You can also filter out \n's:
text_list = [t for t in text if type(t) == NavigableString and t != '\n']
Another solution: use .find_next_siblings() with text=True (which finds only NavigableString nodes in the tree). Then, on each iteration, check whether the previous <h4> is still the first one:
from bs4 import BeautifulSoup
txt = '''<div class="container">
<h4>Title 1</h4>
Text I want is here
<br />
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
out = []
first_h4 = soup.find('h4')
for t in first_h4.find_next_siblings(text=True):
    if t.find_previous('h4') != first_h4:
        break
    elif t.strip():
        out.append(t.strip())
print(out)
Prints:
['Text I want is here', 'More text I want here', 'yet more text I want']
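The same "walk the siblings until the next header" idea can also be sketched on an element tree, where the loose text after each element sits in that element's .tail. This is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (same .text / .tail model), it requires well-formed markup, and the helper name text_between_headers is made up for this example.

```python
import xml.etree.ElementTree as ET

markup = """<div class="container">
<h4>Title 1</h4>
Text I want is here
<br />
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
</div>"""

def text_between_headers(container, start_title):
    """Collect non-empty text chunks after the <h4> titled start_title, up to the next <h4>."""
    chunks = []
    collecting = False
    for child in container:
        if child.tag == "h4":
            if collecting:
                break  # reached the next header: stop
            collecting = (child.text or "").strip() == start_title
        if collecting:
            # Loose text after an element lives in that element's .tail
            tail = (child.tail or "").strip()
            if tail:
                chunks.append(tail)
    return chunks

div = ET.fromstring(markup)
lines = text_between_headers(div, "Title 1")
print(lines)
```

This only looks at tails of direct children, so text wrapped in inline elements such as <b> between the headers would need extra handling.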

python - lxml - How to tag words in texts gracefully?

I need to tag certain words with lxml. Take this as an example,
<span>Please BOLD me, <br /> BOLD me too</span>
I need to find every occurrence of certain words ('BOLD' here) and wrap each one in a tag. The result should be:
<span>Please <b>BOLD</b> me, <br /> <b>BOLD</b> me too</span>
It must use lxml; this is not just a regular expression problem, because some computation is needed before tagging. Something more like this:
s = '<span>Please BOLD me, <br /> BOLD me too</span>'
from lxml import etree
et = etree.fromstring(s)
for e in et.iter():
    # .text and .tail can be None, so check before searching
    if e.text and 'BOLD' in e.text:
        pass  # **tag it**
    if e.tail and 'BOLD' in e.tail:
        pass  # **tag it**
I guess I need to create an element bold = etree.Element('b'); bold.text = 'BOLD'
The problem is I don't know how to insert the above element bold gracefully.
You have to manually create a <b> element and .insert() it in place. Put the remaining text in the tail of the created element:
import lxml.html
from lxml.html import builder as E
text = '''
<html>
<body>
<span>Please BOLD me</span>
</body>
</html>
'''
doc = lxml.html.fromstring(text)
for span in doc.xpath('//span'):
    # search for the word "BOLD" in the span text:
    pre, sep, pos = span.text.partition('BOLD')
    if sep:
        span.text = pre
        bold = E.B(sep)  # create element
        bold.tail = pos
        span.insert(0, bold)
print(lxml.html.tostring(doc, pretty_print=True))
The results:
<html>
<body>
<span>Please <b>BOLD</b> me</span>
</body>
</html>
If you find it in a tail, then you have to insert the new element in the parent, just after the element you found:
parent = element.getparent()
parent.insert(parent.index(element) + 1, bold)
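Putting the two cases together (a match in the parent's text and a match in a child's tail) might look like the sketch below. It is written against the standard library's xml.etree.ElementTree as a stand-in for lxml, so it looks up the child's position with list(parent).index(child) rather than getparent(); the helper name bold_word is hypothetical, and it only wraps the first occurrence in each text node.

```python
import xml.etree.ElementTree as ET

def bold_word(parent, word):
    """Wrap the first occurrence of word in parent.text and in each child's tail with <b>."""
    # Handle tails first, last child to first, over a snapshot of the children,
    # because new <b> elements are inserted while we walk.
    for child in reversed(list(parent)):
        pre, sep, post = (child.tail or "").partition(word)
        if sep:
            bold = ET.Element("b")
            bold.text = sep
            bold.tail = post          # remaining text moves to the new element's tail
            child.tail = pre
            parent.insert(list(parent).index(child) + 1, bold)
    # Then handle the text that precedes the first child.
    pre, sep, post = (parent.text or "").partition(word)
    if sep:
        bold = ET.Element("b")
        bold.text = sep
        bold.tail = post
        parent.text = pre
        parent.insert(0, bold)

span = ET.fromstring("<span>Please BOLD me, <br /> BOLD me too</span>")
bold_word(span, "BOLD")
result = ET.tostring(span, encoding="unicode")
print(result)
```

To tag repeated occurrences inside one text node, the same partition step could be applied again to the tail of each newly inserted <b>.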

Getting multiple blocks of text between tags

This is my HTML:
<div class="left_panel">
<h4>Header1</h4>
block of text that I want.
<br />
<br />
another block of text that I want.
<br />
<br />
still more text that I want.
<br />
<br />
<p> </p>
<h4>Header2</h4>
The number of blocks of text is variable. Header1 is consistent; Header2 is not.
I'm successfully extracting the first block of text using the following code:
def get_summary(soup):
    raw = soup.find('div', {"class": "left_panel"})
    for h4 in raw.findAllNext('h4'):
        following = h4.nextSibling
        return following
However, I need all of the items sitting between the two h4 tags. I was hoping that using h4.nextSiblings would solve this, but for some reason that returns the following error:
TypeError: 'NoneType' object is not callable
I've been trying variations on this answer: Find next siblings until a certain one using beautifulsoup but the absence of a leading tag is confusing me.
Find the first header and iterate over .next_siblings until you hit another header:
from bs4 import BeautifulSoup
data = """
<div class="left_panel">
<h4>Header1</h4>
block of text that I want.
<br />
<br />
another block of text that I want.
<br />
<br />
still more text that I want.
<br />
<br />
<p> </p>
<h4>Header2</h4>
</div>
"""
soup = BeautifulSoup(data)
header1 = soup.find('h4', text='Header1')
for item in header1.next_siblings:
    if getattr(item, 'name') == 'h4' and item.text == 'Header2':
        break
    print(item)
Update (collecting texts between two h4 tags):
texts = []
for item in header1.next_siblings:
    if getattr(item, 'name') == 'h4' and item.text == 'Header2':
        break
    try:
        texts.append(item.text)
    except AttributeError:
        texts.append(item)
print(''.join(texts))
I don't understand why you are passing soup as an argument but not using it.
If you use the correct soup instance you shouldn't get that error. findAllNext('h4') returns <h4>Header1</h4> and <h4>Header2</h4>; applying nextSibling to each returns its text sibling, which are
block of text that I want.
and
')
in your case.

Get bulletted list in lxml

So I have a html like this:
...
<ul class="myclass">
<li>blah</li>
<li>blah2</li>
</ul>
...
I want to get the texts "blah" and "blah2" from the ul with the class name "myclass"
So I tried to use innerhtml(), but for some reason it doesn't work with lxml.
I'm using Python 3.
I would try:
doc.xpath('.//ul[@class = "myclass"]/li/text()')
# out: ["blah","blah2"]
edit:
What if there was an <a> in the <li>? For example, how would I get "link" and "text" from <li><a href="link">text</a></li>?
link = doc.xpath('.//ul[@class = "myclass"]/li/a/@href')
txt = doc.xpath('.//ul[@class = "myclass"]/li/a/text()')
If you want, you can combine those, and if we take @larsmans' example, you can use '//' to get the whole text, because I believe that lxml doesn't support the string() method in an expression.
doc.xpath('.//ul[@class="myclass"]/li[a]//text() | .//ul[@class="myclass"]/li/a/@href')
# out: ['I contain a ', 'http://example.com', 'link', '.']
Also, you can use the text_content() method:
html=\
"""
<html>
<ul class="myclass">
<li>I contain a <a href="http://example.com">link</a>.</li>
<li>blah</li>
<li>blah2</li>
</ul>
</html>
"""
import lxml.html as lh
doc=lh.fromstring(html)
for elem in doc.xpath('.//ul[@class="myclass"]/li'):
    print(elem.text_content())
prints:
#I contain a link.
#blah
#blah2
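For comparison, here is a rough standard-library sketch of the same extraction. It assumes well-formed markup and uses xml.etree.ElementTree in place of lxml.html; "".join(li.itertext()) plays the role that text_content() plays in lxml.html, and the sample markup (including the http://example.com link) is taken from the example output shown earlier in this answer.

```python
import xml.etree.ElementTree as ET

markup = """<html>
<ul class="myclass">
<li>I contain a <a href="http://example.com">link</a>.</li>
<li>blah</li>
<li>blah2</li>
</ul>
</html>"""

root = ET.fromstring(markup)

# ElementTree's limited XPath subset supports the [@attr='value'] predicate
items = root.findall(".//ul[@class='myclass']/li")

# Join every text node under each <li>, including text inside nested <a>
texts = ["".join(li.itertext()) for li in items]

# Attribute values are read with .get() instead of an /@href step
hrefs = [a.get("href") for a in root.findall(".//ul[@class='myclass']/li/a")]

print(texts)
print(hrefs)
```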

How to split the tags from html tree

This is my html tree
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
</h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
<br />
Get 10X Rewards On Shopping -
Save Over 5% On Fuel
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>
From this HTML I need to extract the lines before each <br> tag:
line1 : Get the IndianOil Citibank Card. Apply Now!
line2 : Get 10X Rewards On Shopping - Save Over 5% On Fuel
How can this be done in Python?
I think you just asked for the line before each <br/>.
The following code will do it for the sample you've provided, by stripping out the <b> and <a> tags and printing the .tail of each element that has a following-sibling <br/>.
from lxml import etree
doc = etree.HTML("""
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
</h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
<br />
Get 10X Rewards On Shopping -
Save Over 5% On Fuel
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>""")
etree.strip_tags(doc,'a','b')
for element in doc.xpath('//*[following-sibling::*[name()="br"]]'):
    print(repr(element.tail.strip()))
Yields:
'Get the IndianOil Citibank Card. Apply Now!'
'Get 10X Rewards On Shopping -\n Save Over 5% On Fuel'
As with all parsing of HTML you need to make some assumptions about the format of the HTML. If we can assume that the previous line is everything before the <br> tag up to a block level tag, or another <br> then we can do the following...
from BeautifulSoup import BeautifulSoup
doc = """
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
</h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
<br />
Get 10X Rewards On Shopping -
Save Over 5% On Fuel
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>
"""
soup = BeautifulSoup(doc)
Now we have parsed the HTML, next we define the list of tags we don't want to treat as part of the line. There are other block tags really, but this does for this HTML.
block_tags = ["div", "p", "h1", "h2", "h3", "h4", "h5", "h6", "br"]
We cycle through each <br> tag, stepping back through its siblings until we either have no more or we hit a block-level tag. Each time we loop, we add the node to the front of our line. NavigableStrings don't have name attributes, but we want to include them, hence the two-part test in the while loop.
for node in soup.findAll("br"):
    line = ""
    sibling = node.previousSibling
    while sibling is not None and (not hasattr(sibling, "name") or sibling.name not in block_tags):
        line = unicode(sibling) + line
        sibling = sibling.previousSibling
    print(line)
Solution without relying on <br> tags:
import lxml.html
html = "..."
tree = lxml.html.fromstring(html)
line1 = ''.join(tree.xpath('//li[@class="taf"]/text() | b/text()')[:3]).strip()
line2 = ' - '.join(tree.xpath('//li[@class="taf"]//a[not(@id)]/text()'))
I don't know whether you want to use lxml or Beautiful Soup, but for lxml using XPath, here is an example:
import lxml
from lxml import etree
import urllib2
response = urllib2.urlopen('your url here')
html = response.read()
imdb = etree.HTML(html)
titles = imdb.xpath('/html/body/li/a/text()')  # XPath for the "line 2" data (use Firebug to find it)
The xpath I used is for your given html snippet. It may change in the original context.
You can also give cssselect in lxml a try.
import lxml.html
import urllib
data = urllib.urlopen('your url').read()
doc = lxml.html.fromstring(data)
elements = doc.cssselect('your csspath here')  # CSS path (found with the Firebug extension)
for element in elements:
    print(element.text_content())
