I'm trying to capture some text from a webpage (whose URL is passed when running the script), but it's buried in a paragraph tag with no other attributes. I can collect the contents of every paragraph tag, but I want to remove from the tree any elements that contain any of a list of keywords.
I get the following error:
tree.remove(elem)
TypeError: Argument 'element' has incorrect type (expected lxml.etree._Element, got _ElementStringResult)
I understand that what I am getting back when I try to iterate through the tree is the wrong type, but how do I get the element instead?
Sample Code:
#!/usr/bin/python
import sys
import requests
from lxml import html

url = sys.argv[1]
page = requests.get(url)
tree = html.fromstring(page.content)
terms = ['keyword1', 'keyword2', 'keyword3', 'keyword4', 'keyword5', 'keyword6', 'keyword7']
paragraphs = tree.xpath('//p/text()')
for elem in paragraphs:
    if any(term in elem for term in terms):
        tree.remove(elem)  # raises the TypeError above
In your code, elem is an _ElementStringResult, which has a getparent() instance method. Its parent is the Element object of one of the <p> nodes.
The parent has a remove method which can be used to remove it from the tree:
element.getparent().remove(element)
I do not believe there is a more direct way and I don't have a good answer to why there isn't a removeself method.
Using the example html:
content = '''
<root>
<p> nothing1 </p>
<p> keyword1 </p>
<p> nothing2 </p>
<p> nothing3 </p>
<p> keyword4 </p>
</root>
'''
You can see this in action in your code with:
from lxml import html

tree = html.fromstring(content)
terms = ['keyword1', 'keyword2', 'keyword3', 'keyword4', 'keyword5', 'keyword6', 'keyword7']
paragraphs = tree.xpath('//p/text()')
for elem in paragraphs:
    if any(term in elem for term in terms):
        actual_element = elem.getparent()
        actual_element.getparent().remove(actual_element)

for child in tree.getchildren():
    print('<{tag}>{text}</{tag}>'.format(tag=child.tag, text=child.text))
# Output:
# <p> nothing1 </p>
# <p> nothing2 </p>
# <p> nothing3 </p>
From the comments, it seems like this code isn't working for you. If so, you might need to provide more information about the structure of your html.
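As an alternative sketch (not the approach above, and untested against the real page): you can sidestep the _ElementStringResult issue entirely by selecting the <p> elements themselves rather than their text() nodes, and using text_content() for the keyword check:

```python
from lxml import html

content = '''
<root>
<p> nothing1 </p>
<p> keyword1 </p>
<p> nothing2 </p>
</root>
'''
terms = ['keyword1', 'keyword2']

tree = html.fromstring(content)
# xpath('//p') yields _Element objects (not string results), so each one
# has a parent to remove it from; text_content() gathers nested text too.
for p in tree.xpath('//p'):
    if any(term in p.text_content() for term in terms):
        p.getparent().remove(p)

remaining = [p.text_content().strip() for p in tree.xpath('//p')]
print(remaining)
```

This prints `['nothing1', 'nothing2']` for the sample content above.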
Related
I can't figure out how to use xpath to get formatted text from a tag which contains many <span> tags and <p> tags.
It's in this form:
<span>This</span>
<span> is</span>
<span> main</span>
<span> text</span>
<p><span>First</span>
<span>p-tag</span>
</p>
<p> second p tag ....
So there are two types of tag: <p> means that the text inside the tag starts on a new line, and the text itself is divided into many substrings in <span> tags.
The problem is that some of the text is not inside a <p> at all.
From the snippet above, I would like to get (for example in list):
['This is main text','First p-tag','seco....]
This works, but it only gets the text inside <p> tags:
def get_popis_url(url):
    root = get_root(url)
    ps = root.xpath('//div[@class="description"]/p')
    for p in ps:
        text = p.xpath('string()').replace(' ', ' ').strip()
        print text
So the result for the html snippet above is:
First p-tag
second p tag
Do you have any idea?
I'm not sure which Python library you are using (xml.etree?). But concerning the XPath, try './/div[@class="description"]//*'. That will select all subelements, beginning at the div.
Using xml.etree it would look like this (assuming the HTML source code is given as a string):
def get_popis_url(html_source):
    import xml.etree.ElementTree as ET
    root = ET.fromstring(html_source)
    ps = root.findall('.//div[@class="description"]//*')
    for p in ps:
        text = p.text
        if text:
            print text.replace(' ', ' ').strip()
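If lxml is an option, here is a hedged sketch of another way to reconstruct the lines: treat the loose <span>s as the first line and each <p> as a new line. The structure and the "description" class name are taken from the question's snippet:

```python
from lxml import html

# Structure copied from the question's snippet; "description" is the
# class name used there.
snippet = '''
<div class="description">
  <span>This</span>
  <span> is</span>
  <span> main</span>
  <span> text</span>
  <p><span>First</span> <span>p-tag</span></p>
  <p> second p tag </p>
</div>
'''

div = html.fromstring(snippet)
lines = []
# The loose <span>s (direct children, not inside a <p>) form the first line.
loose = ''.join(s.text_content() for s in div.xpath('span'))
if loose.strip():
    lines.append(' '.join(loose.split()))
# Every <p> then starts a new line.
for p in div.xpath('p'):
    lines.append(' '.join(p.text_content().split()))
print(lines)
```

For the snippet above this prints `['This is main text', 'First p-tag', 'second p tag']`.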
I'm trying to get a text from one tag using lxml etree.
<div class="litem__type">
<div>
Robbp
</div>
<div>Estimation</div>
+487 (0)639 14485653
•
<a href="mailto:herbrich@gmail.com">
Email Address
</a>
•
<a class="external" href="http://www.google.com">
Homepage
</a>
</div>
The problem is that I can't locate it, because snippets of this kind differ a lot: sometimes the first and second div are not there at all. As you can see, the telephone number is not in its own div.
I suppose it would be possible to extract the telephone number with BeautifulSoup's contents, but I'm trying to use the lxml module's xpath.
Do you have any ideas? (The email is sometimes missing, too.)
EDIT: The best idea is probably to use a regex, but I don't know how to tell it to extract just the text between two <div></div> tags.
You should avoid using regex to parse XML/HTML wherever possible because it is not as efficient as using element trees.
The text after element A's closing tag, but before element B's opening tag, is called element A's tail text. To select this tail text using lxml etree you could do the following:
content = '''
<div class="litem__type">
<div>Robbp</div>
<div>Estimation</div>
+487 (0)639 14485653
Email Address
<a class="external" href="http://www.google.com">Homepage</a>
</div>'''
from lxml import etree
tree = etree.XML(content)
phone_number = tree.xpath('div[2]')[0].tail.strip()
print(phone_number)
Output
'+487 (0)639 14485653'
The strip() function is used here to remove whitespace on either side of the tail text.
You can also iterate over the div tags and get the text after each one:
from lxml import etree

tree = etree.parse("filename.xml")
items = tree.xpath('//div')
for node in items:
    # you can check here if it is a phone number
    print node.tail
I'm trying to parse some HTML with lxml and Python. I want to remove section tags. lxml seems to be capable of removing all other tags I specify but not section tags.
e.g.
test_html = '<section> <header> Test header </header> <p> Test text </p> </section>'
to_parse_html = etree.fromstring(test_html)
etree.strip_tags(to_parse_html,'header')
etree.tostring(to_parse_html)
'<section> Test header <p> Test text </p> </section>'
etree.strip_tags(to_parse_html,'p')
etree.tostring(to_parse_html)
'<section> Test header Test text </section>'
etree.strip_tags(to_parse_html,'section')
etree.tostring(to_parse_html)
'<section> Test header Test text </section>'
Why is this the case?
Why is this the case?
It isn't. The documentation says the following:
Note that this will not delete the element (or ElementTree root
element) that you passed even if it matches. It will only treat its
descendants.
So:
>>> tree = etree.fromstring('<section> outer <section> inner </section> </section>')
>>> etree.strip_tags(tree, 'section')
>>> etree.tostring(tree)
'<section> outer inner </section>'
The behavior that you see has nothing to do with the <section> tag, but with the fact that it happens to be the outermost tag of your snippet. The actual answer to your question is thus "because it's implemented that way".
As for removing the outermost tag: is it possible to change the code that creates the <section>...</section> so that it doesn't emit it? If not, an ElementDepthFirstIterator might do the trick:
>>> tree = etree.fromstring('<section> outer <section> inner </section> </section>')
>>> for val in etree.ElementDepthFirstIterator(tree, tag=None, inclusive=False):
... print(etree.tostring(val))
b'<section> inner </section> '
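Another workaround (a sketch, not something from the lxml docs): if you can't change the code that produces the fragment, wrap it in a throwaway parent element so that <section> is no longer the root; strip_tags() will then remove it:

```python
from lxml import html, etree

fragment = '<section> outer <section> inner </section> </section>'

# Wrap the fragment in a throwaway <div> so <section> is no longer the
# root element that strip_tags() refuses to touch.
wrapper = html.fromstring('<div>%s</div>' % fragment)
etree.strip_tags(wrapper, 'section')
print(html.tostring(wrapper))
```

The result keeps the text but no <section> tags. lxml.html elements also offer drop_tag(), which merges one element's content into its parent; it likewise requires the element to have a parent, so the wrapper trick applies there as well.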
I try to extract "THIS IS MY TEXT" from the following HTML:
<html>
<body>
<table>
<td class="MYCLASS">
<!-- a comment -->
<a href="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</br>
</td>
</table>
</body>
</html>
I tried it this way:
soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class': 'MYCLASS'}):
    print hit.text
But I get all the text between all nested Tags plus the comment.
Can anyone help me to just get "THIS IS MY TEXT" out of this?
Learn more about how to navigate through the parse tree in BeautifulSoup. The parse tree has Tags and NavigableStrings (such as THIS IS MY TEXT). An example:
from BeautifulSoup import BeautifulSoup

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
# <head>
# <title>
# Page title
# </title>
# </head>
# <body>
# <p id="firstpara" align="center">
# This is paragraph
# <b>
# one
# </b>
# .
# </p>
# <p id="secondpara" align="blah">
# This is paragraph
# <b>
# two
# </b>
# .
# </p>
# </body>
# </html>
To move down the parse tree you have contents and string.
contents is an ordered list of the Tag and NavigableString objects
contained within a page element
if a tag has only one child node, and that child node is a string,
the child node is made available as tag.string, as well as
tag.contents[0]
For the document above, that means you can get:
soup.b.string
# u'one'
soup.b.contents[0]
# u'one'
For several child nodes, you can, for instance, do:
pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']
so here you can index into contents to get the item you want.
You can also iterate over a Tag as a shortcut. For instance:
for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
Use .children instead:
from bs4 import NavigableString, Comment

print ''.join(unicode(child) for child in hit.children
              if isinstance(child, NavigableString) and not isinstance(child, Comment))
Yes, this is a bit of a dance.
Output:
>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
... print ''.join(unicode(child) for child in hit.children
... if isinstance(child, NavigableString) and not isinstance(child, Comment))
...
THIS IS MY TEXT
You can use .contents:
>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
... print hit.contents[6].strip()
...
THIS IS MY TEXT
with your own soup object:
soup.p.next_sibling.strip()
you grab the <p> directly with soup.p *(this hinges on it being the first <p> in the parse tree)
then use next_sibling on the tag object that soup.p returns since the desired text is nested at the same level of the parse tree as the <p>
.strip() is just a Python str method to remove leading and trailing whitespace
*otherwise just find the element using your choice of filter(s)
in the interpreter this looks something like:
In [4]: soup.p
Out[4]: <p>something</p>
In [5]: type(soup.p)
Out[5]: bs4.element.Tag
In [6]: soup.p.next_sibling
Out[6]: u'\n THIS IS MY TEXT\n '
In [7]: type(soup.p.next_sibling)
Out[7]: bs4.element.NavigableString
In [8]: soup.p.next_sibling.strip()
Out[8]: u'THIS IS MY TEXT'
In [9]: type(soup.p.next_sibling.strip())
Out[9]: unicode
Short answer: soup.findAll('p')[0].next
Real answer: You need an invariant reference point from which you can get to your target.
You mention in your comment to Haidro's answer that the text you want is not always in the same place. Find a sense in which it is in the same place relative to some element. Then figure out how to make BeautifulSoup navigate the parse tree following that invariant path.
For example, in the HTML you provide in the original post, the target string appears immediately after the first paragraph element, and that paragraph is not empty. Since findAll('p') finds paragraph elements, soup.findAll('p')[0] will be the first paragraph element.
You could in this case use soup.find('p') but soup.findAll('p')[n] is more general since maybe your actual scenario needs the 5th paragraph or something like that.
The next field attribute will be the next parsed element in the tree, including children. So soup.findAll('p')[0].next contains the text of the paragraph, and soup.findAll('p')[0].next.next will return your target in the HTML provided.
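For what it's worth, in BeautifulSoup 4 the .next attribute is available as .next_element. A runnable sketch of the same idea against a trimmed-down version of the question's HTML:

```python
from bs4 import BeautifulSoup

html_doc = '''
<td class="MYCLASS">
  <a href="xy">Text</a>
  <p>something</p>
  THIS IS MY TEXT
  <p>something else</p>
</td>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# .next_element walks the parse tree in document order, text nodes included:
# the first .next_element of the <p> is its own text ("something"); the one
# after that is the target string that follows </p>.
first_p = soup.find_all('p')[0]
target = first_p.next_element.next_element.strip()
print(target)  # THIS IS MY TEXT
```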
soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class': 'MYCLASS'}):
    hit = hit.text.strip()
    print hit
This will print: THIS IS MY TEXT
Try this..
The BeautifulSoup documentation provides an example about removing objects from a document using the extract method. In the following example the aim is to remove all comments from the document:
Removing Elements
Once you have a reference to an element, you can rip it out of the
tree with the extract method. This code removes all the comments
from a document:
from BeautifulSoup import BeautifulSoup, Comment

soup = BeautifulSoup("""1<!--The loneliest number-->
<a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
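The same technique still works under BeautifulSoup 4 (a sketch, assuming the bs4 package), where findAll is spelled find_all and the text filter argument is now string:

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("""1<!--The loneliest number-->
<a>2<!--Can be as bad as one--><b>3""", 'html.parser')
# find_all(string=...) matches NavigableStrings; Comment is a subclass,
# so the lambda picks out only the comment nodes.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()
print(soup)
```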
I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.
I would like to know the most sensible way in the library to do the equivalent of JavaScript's innerHTML - that is, to retrieve or set the complete contents of a tag.
<body>
<h1>A title</h1>
<p>Some text</p>
</body>
The innerHTML is therefore:
<h1>A title</h1>
<p>Some text</p>
I can do it using hacks (converting to string/regexes etc) but I'm assuming that there is a correct way to do this using the library which I am missing due to unfamiliarity. Thanks for any help.
EDIT: Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:
from lxml import html
from cStringIO import StringIO

t = html.parse(StringIO(
    """<body>
    <h1>A title</h1>
    <p>Some text</p>
    Untagged text
    <p>
    Unclosed p tag
    </body>"""))
root = t.getroot()
body = root.body
print (body.text or '') + ''.join([html.tostring(child) for child in body.iterdescendants()])
Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.
Sorry for bringing this up again, but I've been looking for a solution and yours contains a bug:
<body>This text is ignored
<h1>Title</h1><p>Some text</p></body>
Text directly under the root element is ignored. I ended up doing this:
(body.text or '') + \
    ''.join([html.tostring(child) for child in body.iterchildren()])
You can get the children of an ElementTree node using the getchildren() or iterdescendants() methods of the root node:
>>> from lxml import etree
>>> from cStringIO import StringIO
>>> t = etree.parse(StringIO("""<body>
... <h1>A title</h1>
... <p>Some text</p>
... </body>"""))
>>> root = t.getroot()
>>> for child in root.iterdescendants():
... print etree.tostring(child)
...
<h1>A title</h1>
<p>Some text</p>
This can be shorthanded as follows:
print ''.join([etree.tostring(child) for child in root.iterdescendants()])
import lxml.etree as ET

body = t.xpath("//body")
for tag in body:
    h = html.fromstring(ET.tostring(tag[0])).xpath("//h1")
    p = html.fromstring(ET.tostring(tag[1])).xpath("//p")
    htext = h[0].text_content()
    ptext = p[0].text_content()
You can also use .get('href') on a tag and .attrib for its attributes.
Here the child indexes are hardcoded, but you could also determine them dynamically.
Here is a Python 3 version:
from xml.sax import saxutils
from lxml import html
def inner_html(tree):
    """Return inner HTML of an lxml element."""
    return (saxutils.escape(tree.text) if tree.text else '') + \
        ''.join([html.tostring(child, encoding=str) for child in tree.iterchildren()])
Note that this includes escaping of the initial text as recommended by andreymal -- this is needed to avoid tag injection if you're working with sanitized HTML!
I find none of the answers satisfying, some are even in Python 2. So I add a one-liner solution that produces innerHTML-like output and works with Python 3:
from lxml import etree, html
# generate some HTML element node
node = html.fromstring("""<container>
Some random text <b>bold <i>italic</i> yeah</b> no yeah
<!-- comment blah blah --> <img src='gaga.png' />
</container>""")
# compute inner HTML of element
innerHTML = "".join([
    str(c) if type(c) == etree._ElementUnicodeResult
    else html.tostring(c, with_tail=False).decode()
    for c in node.xpath("node()")
]).strip()
The result will be:
'Some random text <b>bold <i>italic</i> yeah</b> no yeah\n<!-- comment blah blah --> <img src="gaga.png">'
What it does: The xpath delivers all node children (text, elements, comments). The list comprehension produces a list of the text contents of the text nodes and HTML content of element nodes. Those are then joined into a single string. If you want to get rid of comments, use *|text() instead of node() for xpath.