How to modify the content of an XML node using Python?

I have a simple problem: I want to update/modify the content of an XML node using Python. I am using Python 3.6.
I want to write a Python script that will change the content of the status node to "On" and the directoryName node to "Users/
<main>
<status>off</status>
<directoryName>nothing</directoryName>
</main>

I found the answer myself. I had forgotten to call write() at the end:
from xml.etree import ElementTree

xmlHandler = "System_Settings/System_controller.xml"
xmlDom = ElementTree.parse(xmlHandler)
xmlDom.find("status").text = "on"
print(xmlDom.find("status").text)
xmlDom.write(xmlHandler)  # without this call the change is never saved to disk
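The same pattern should work for the directoryName node as well. Since the question text is cut off, the value below is only a hypothetical placeholder:
xmlDom.find("directoryName").text = "Users/some_dir"  # hypothetical placeholder value
xmlDom.write(xmlHandler)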

If you can afford to install an additional package, take a look at BeautifulSoup. It makes parsing HTML and XML quite simple.
import bs4
xml = """
<main>
<status>off</status>
<directoryName>nothing</directoryName>
</main>"""
soup = bs4.BeautifulSoup(xml, "xml")
soup.status.string = "on"
print(soup.prettify())
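Note that BeautifulSoup only modifies the in-memory tree; to persist the change you have to serialize the soup back to the file yourself. A minimal sketch, reusing the file path from the answer above:
with open("System_Settings/System_controller.xml", "w") as f:
    f.write(str(soup))  # str(soup) serializes the modified XML document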

Using the lxml library (which BeautifulSoup can also use as its parser):
from lxml import etree
node = etree.XML("""
<main>
<status>off</status>
<directoryName>nothing</directoryName>
</main>""")
status = "On"
status_node = node.xpath("/main/status")[0]
status_node.text = status
Then with print(etree.tounicode(node)), you get:
<main>
<status>On</status>
<directoryName>nothing</directoryName>
</main>
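lxml likewise only changes the in-memory tree; to save the result, the element can be wrapped in an ElementTree and written out. A minimal sketch, reusing the file path from the first answer:
etree.ElementTree(node).write("System_Settings/System_controller.xml")  # write the modified tree back to disk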

Related

How do I use lxml and Python to traverse the <body> of an HTML document along with its children

I would like to take an HTML document and traverse the <body> part of the document with its children. I see lots of examples that get a subtree via XPath or tag name, but this doesn't seem to give the children.
import lxml
from lxml import html, etree
html3 = "<html><head><title>test<body><h1>page title</h3><p>some text</p>"
root = lxml.html.fromstring(html3)
tree = etree.ElementTree(root)
for el in root.iter():
    # do something
    print(el.text, tree.getpath(el))
This will output
None /html
None /html/head
test /html/head/title
None /html/body
page title /html/body/h1
some text /html/body/p
I would like only
page title /html/body/h1
some text /html/body/p
Any help gratefully received.
I had a similar difficulty; then I figured out that each etree node has an iterator with which you can traverse its children.
For instance, root here will give you the body, and using it you can iterate over each element of the body:
from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse('yourdocument.html', parser)
root = tree.xpath('/html/body')[0]  # note: a trailing slash here would be invalid XPath
for i in root.getiterator():
    print(i.tag, i.text)
It seems that your HTML code is invalid. I wrote a little program with BeautifulSoup that you can perhaps adapt for your purpose:
from bs4 import BeautifulSoup
html3 = "<html><head><title>test</title></head><body><h1>page title</h1><p>some text</p></body></html>"
soup = BeautifulSoup(html3, "html5lib")
body = soup.find('body')
for item in body.findChildren():
    print(item)
Output
<h1>page title</h1>
<p>some text</p>
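Neither answer prints the paths the asker wanted; if you need exactly the desired output (paths, but only for elements inside <body>), the two ideas can be combined. A minimal sketch of my own, not taken from either answer:
from lxml import html, etree
html3 = "<html><head><title>test<body><h1>page title</h3><p>some text</p>"
root = html.fromstring(html3)
tree = etree.ElementTree(root)
body = root.find('body')
for el in body.iter():
    if el is not body:  # skip the <body> element itself
        print(el.text, tree.getpath(el))
This prints only the two lines the asker listed.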

lxml etree HTML parser changes order of nodes (<center> inside <p>)

I'm currently facing an issue where I can't explain the etree behaviour. The following code demonstrates the issue: I want to parse an HTML string as illustrated below, change the attribute of an element, and re-print the HTML when done.
from lxml import etree
from io import StringIO, BytesIO
string = "<p><center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center></p>"
parser = etree.HTMLParser()
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html").decode())
I get this output:
<html><body>
<p></p>
<center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center>
</body></html>
As you can see (let's ignore the <html> and <body> tags etree adds), the order of the nodes has been changed! The <p> tag that used to wrap the <center> tag now loses its content, and that content gets added after the </p> tag closes. Eh?
When I omit the <center> tag, all of a sudden the parsing is done right:
from lxml import etree
from io import StringIO, BytesIO
string = "<p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p>"
parser = etree.HTMLParser()
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html").decode())
With correct output:
<html><body><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></body></html>
Am I doing something wrong here? I have to use the HTML parser because I get a lot of parsing errors when not using it. I also can't change the order of the <p> and <center> tags, as I read them this way.
<center> is a block level element.
<p> cannot legally contain block level elements.
Therefore the parser closes the <p> when it encounters <center>.
Use valid HTML - or an XML parser, which does not care about HTML rules (but in exchange can't deal with some of the HTML specifics, like most named entities such as &nbsp;, or unclosed/self-closing tags).
Centering content has been done with CSS for ages now; there is no reason to use <center> anymore (and, in fact, it's deprecated). But it still works, and if you insist on using it, switch the nesting:
<center><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></center>
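To convince yourself, you can run the swapped nesting through the same parser; the structure should now survive intact. A minimal sketch:
from lxml import etree
string = "<center><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></center>"
test = etree.fromstring(string, etree.HTMLParser())
print(etree.tostring(test, pretty_print=True, method="html").decode())
# the <center> still wraps the <p>; nothing is moved around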

parse json in html page with python [duplicate]

I am downloading HTML pages that have data defined in them in the following way:
... <script type= "text/javascript"> window.blog.data = {"activity":{"type":"read"}}; </script> ...
I would like to extract the JSON object defined in 'window.blog.data'.
Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing.)
Thanks
Edit:
Would it be possible and more correct to do this with a python headless browser (e.g., Ghost.py)?
BeautifulSoup is an HTML parser; you also need a JavaScript parser here. By the way, some JavaScript object literals are not valid JSON (though in your example the literal is also a valid JSON object).
In simple cases you could:
1. extract the <script>'s text using an HTML parser
2. assume that window.blog... is a single line, or that there is no ';' inside the object, and extract the JavaScript object literal using simple string manipulation or a regex
3. assume that the string is valid JSON and parse it using the json module
Example:
#!/usr/bin/env python
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""
import json
import re
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
soup = BeautifulSoup(html, "html.parser")
script = soup.find('script', text=re.compile(r'window\.blog\.data'))
json_text = re.search(r'^\s*window\.blog\.data\s*=\s*({.*?})\s*;\s*$',
                      script.string, flags=re.DOTALL | re.MULTILINE).group(1)
data = json.loads(json_text)
assert data['activity']['type'] == 'read'
If the assumptions are incorrect then the code fails.
To relax the second assumption, a JavaScript parser could be used instead of a regex, e.g., slimit (suggested by @approximatenumber):
from slimit import ast # $ pip install slimit
from slimit.parser import Parser as JavascriptParser
from slimit.visitors import nodevisitor
soup = BeautifulSoup(html, 'html.parser')
tree = JavascriptParser().parse(soup.script.string)
obj = next(node.right for node in nodevisitor.visit(tree)
           if (isinstance(node, ast.Assign) and
               node.left.to_ecma() == 'window.blog.data'))
# HACK: easy way to parse the javascript object literal
data = json.loads(obj.to_ecma()) # NOTE: json format may be slightly different
assert data['activity']['type'] == 'read'
There is no need to treat the object literal (obj) as a JSON object. To get the necessary info, obj can be visited recursively like other ast nodes. That would make it possible to support arbitrary JavaScript code (as long as slimit can parse it).
Something like this may work:
import re
HTML = """
<html>
<head>
...
<script type= "text/javascript">
window.blog.data = {"activity":
{"type":"read"}
};
...
</script>
</head>
<body>
...
</body>
</html>
"""
JSON_RE = re.compile(r'window\.blog\.data = ({.*?});', re.DOTALL)
matches = JSON_RE.search(HTML)
print(matches.group(1))
I had a similar issue and ended up using selenium with phantomjs. It's a little hacky and I couldn't quite figure out the correct wait until method, but the implicit wait seems to work fine so far for me.
from selenium import webdriver
import json
import re
url = "http..."
driver = webdriver.PhantomJS(service_args=['--load-images=no'])
driver.set_window_size(1120, 550)
driver.get(url)
driver.implicitly_wait(1)
script_text = re.search(r'window\.blog\.data\s*=.*<\/script>', driver.page_source).group(0)
# split text based on first equal sign and remove trailing script tag and semicolon
json_text = script_text.split('=', 1)[1].rstrip('</script>').strip().rstrip(';').strip()  # caution: rstrip removes a set of characters, not a literal suffix; it happens to work here
# only care about first piece of json
json_text = json_text.split("};")[0] + "}"
data = json.loads(json_text)
driver.quit()
A fast and easy way is a regex of the form ('here put exactly the start (.*?) and the end here'). That's all!
import re
import json
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""
Then simply:
re.search('{"activity":{"type":"(.*?)"', html).group(1)
or for full json
jsondata = re.search('window.blog.data = (.*?);', html).group(1)
jsondata = json.loads(jsondata)
print(jsondata["activity"])
#output {'type': 'read'}

HTML Parsing issue with BeautifulSoup Library

I am working with the BS library for HTML parsing. My task is to remove everything between the head tags. So if I have <head> A lot of Crap! </head>, then the result should be <head></head>. This is the code for it:
raw_html = "entire_web_document_as_string"
soup = BeautifulSoup(raw_html)
head = soup.head
head.unwrap()
print(head)
And this works fine. But I want these changes to take place in the raw_html string that contains the entire HTML document. How do I reflect these commands in the original string and not only in the head string? Can you share a code snippet for doing it?
You're basically asking how to export a string of HTML from BS's soup object.
You can do it this way:
# Python 2.7
modified_raw_html = unicode(soup)
# Python3
modified_raw_html = str(soup)
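A minimal end-to-end sketch, assuming the goal from the question (an emptied <head>); note that clear() empties a tag's contents, while unwrap() removes the tag itself and keeps its contents:
from bs4 import BeautifulSoup
raw_html = "<html><head><title>A lot of Crap!</title></head><body>text</body></html>"
soup = BeautifulSoup(raw_html, "html.parser")
soup.head.clear()              # empty the <head> tag in the parsed tree
modified_raw_html = str(soup)  # re-export the whole document as a string
print(modified_raw_html)       # <html><head></head><body>text</body></html>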

Parsing HTML using Python

I'm looking for an HTML Parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects.
If I have a document of the form:
<html>
<head>Heading</head>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
</div>
</body>
</html>
then it should give me a way to access the nested tags via the name or id of the HTML tag so that I can basically ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.
If you've used Firefox's "Inspect element" feature (view HTML) you would know that it gives you all the tags in a nice nested manner like a tree.
I'd prefer a built-in module but that might be asking a little too much.
I went through a lot of questions on Stack Overflow and a few blogs on the internet, and most of them suggest BeautifulSoup or lxml or HTMLParser, but few of these detail the functionality and simply end as a debate over which one is faster/more efficient.
So that I can ask it to get me the content/text in the div tag with class='container' contained within the body tag, Or something similar.
try:
    from bs4 import BeautifulSoup
except ImportError:
    from BeautifulSoup import BeautifulSoup  # fall back to the older BeautifulSoup 3
html = "..."  # the HTML code you've written above
parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)
You don't need performance descriptions I guess - just read how BeautifulSoup works. Look at its official documentation.
I guess what you're looking for is pyquery:
pyquery: a jquery-like library for python.
An example of what you want may be like:
from pyquery import PyQuery
html = "..."  # your HTML code
pq = PyQuery(html)
tag = pq('div#id')  # or tag = pq('div.class')
print(tag.text())
And it uses the same selectors as Firefox's or Chrome's inspect element. For example, if the inspected element's selector is 'div#mw-head.noprint', then in pyquery you just need to pass this selector:
pq('div#mw-head.noprint')
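Applied to the HTML from the question, a minimal sketch might look like this:
from pyquery import PyQuery
html = """<html><body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
</div></body></html>"""
pq = PyQuery(html)
print(pq('div.container').text())  # Something here Something else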
Here you can read more about different HTML parsers in Python and their performance. Even though the article is a bit dated it still gives you a good overview.
Python HTML parser performance
I'd recommend BeautifulSoup even though it isn't built in. Just because it's so easy to work with for those kinds of tasks. Eg:
import urllib.request  # Python 3; the original answer used urllib2 on Python 2
from bs4 import BeautifulSoup
page = urllib.request.urlopen('http://www.google.com/')
soup = BeautifulSoup(page, 'html.parser')
x = soup.body.find('div', attrs={'class': 'container'}).text
Compared to the other parser libraries lxml is extremely fast:
http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/
http://www.ianbicking.org/blog/2008/03/python-html-parser-performance.html
And with cssselect it’s quite easy to use for scraping HTML pages too:
from lxml.html import parse
doc = parse('http://www.google.com').getroot()
for link in doc.cssselect('a'):
    print('%s: %s' % (link.text_content(), link.get('href')))
lxml.html Documentation
I recommend lxml for parsing HTML. See "Parsing HTML" (on the lxml site).
In my experience Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser, rather a very good string analyzer.
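As a small illustration (my own sketch, not from the answer), the document from the question can be handled like this with lxml.html:
from lxml import html
doc = html.fromstring("""<html><head>Heading</head>
<body attr1='val1'><div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
</div></body></html>""")
# find_class returns every element carrying the given CSS class
print(doc.find_class('container')[0].text_content())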
I recommend using justext library:
https://github.com/miso-belica/jusText
Usage:
Python2:
import requests
import justext
response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print paragraph.text
Python3:
import requests
import justext
response = requests.get("http://bbc.com/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print(paragraph.text)
I would use EHP
https://github.com/iogf/ehp
Here it is:
from ehp import *
doc = '''<html>
<head>Heading</head>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
</div>
</body>
</html>
'''
html = Html()
dom = html.feed(doc)
for ind in dom.find('div', ('class', 'container')):
    print(ind.text())
Output:
Something here
Something else
