I am currently having issues with parsing data in my Python class and was wondering if anyone is able to provide a solution to my problem. Here are the instructions for the assignment I'm doing:
XML is the basis for many interfaces and web services. Consequently, reading and manipulating XML data is a common task in software development.
Description
An online plant distributor has recently experienced a shortage in its supply of Anemone plants, so the price has increased by 20%. Their plant catalog is maintained in an XML file and they need a Python utility to find the plant by name, read the current price, change it by the specified percentage, and update the file. Writing this utility is your assignment.
Using Python’s ElementTree XML API, write a Python program to perform the tasks below. Note that your program’s execution syntax must be as follows:
python xmlparse.py plant_catalog.xml plantName percentChange
Using ElementTree, read in this assignment's XML file, plant_catalog.xml, specified by a command-line parameter as shown above.
Find the plant by the name passed in as an argument on the command line (plantName above).
Once found, read the current price and adjust it by the command line argument percentChange. Note that this value could be anything in the range of -90 < percentChange < 100.
For example, if you run your script as follows:
python xmlparse.py plant_catalog.xml "Greek Valerian" -20
with the original XML containing:
<PLANT>
<COMMON>Greek Valerian</COMMON>
<BOTANICAL>Polemonium caeruleum</BOTANICAL>
<ZONE>Annual</ZONE>
<LIGHT>Shade</LIGHT>
<PRICE>4.36</PRICE>
<AVAILABILITY>071499</AVAILABILITY>
</PLANT>
The resulting file should contain:
<PLANT>
<COMMON>Greek Valerian</COMMON>
<BOTANICAL>Polemonium caeruleum</BOTANICAL>
<ZONE>Annual</ZONE>
<LIGHT>Shade</LIGHT>
<PRICE>3.48</PRICE>
<AVAILABILITY>071499</AVAILABILITY>
</PLANT>
Note: You may reduce the precision of the calculation if you wish but it isn’t required.
Hints
Since XML is just a text file, you could write the code to read all the data and then decode the XML information. However, I certainly don't recommend this approach. Instead, let Python do it for you! Using Python's ElementTree module, parse the file into an "in-memory" representation of the XML data. Once parsed, the root (or starting place) is as simple as requesting it from the tree. Once you have the root, you can call methods to find what you are looking for and modify them appropriately. You'll want to "findall" the plants and, "for" each plant "in" the result, you'll want to compare the name with the name passed on the command line. If you find a match, you'll apply the percentage change and save the result back to the tree.
When you are done with the search you will "write" the tree back to a file. I suggest using a different file name, or you will have to re-download the original with each run.
One note of caution: be sure to read about XML in the Distributed Systems text. From doing so and reviewing the data file, you will note that there are no attributes in the XML file. Consequently, you do not need to use attribute methods when you attempt this assignment.
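Since the hint contrasts element text with attribute methods, here is a tiny illustration of the difference; the currency attribute is invented purely for this example, as the real catalog has no attributes:

import xml.etree.ElementTree as ET

elem = ET.fromstring('<PRICE currency="USD">4.36</PRICE>')
print(elem.text)             # '4.36' : element text, all this assignment needs
print(elem.get('currency'))  # 'USD'  : attribute access, not needed here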
The following code snippet will give you a good starting point:
# Calling arguments: plant_catalog.xml plantName percentChange
import xml.etree.ElementTree as ET
import sys
# input parameters
searchName = sys.argv[2]
percent = float(sys.argv[3])
# parse XML data file
tree = ET.parse(sys.argv[1])
root = tree.getroot()
Now here is my code:
import xml
import xml.etree.ElementTree as ET
import sys

searchName = sys.argv[2]
percent = float(sys.argv[3])
tree = ET.parse(sys.argv[1])
root = tree.getroot()

def main():
    with open("plant_catalog.xml", "r") as file:
        data = file.read()
    for plant in root.findall("PLANT"):
        name = plant.find("COMMON").text
        if name == searchName:
            original_price = float(plant.find("PRICE").text)
            with open("plant_catalog - output.xml", "wb") as file:
                file.write(percent)

def change_plant_price(plantName, newPrice):
    root = ET.fromstring(xml)
    plant = root.find(".//*[COMMON='{}']".format(plantName))
    plant.find('PRICE').text = str(newPrice)
    ET.dump(root)

if __name__ == "__main__":
    main()
The problem with my code is that when I run it, I get an error on the file.write(percent) line saying it needs a bytes-like object instead of a float. I'm not sure what's wrong with the code, but if anyone is able to provide a solution I would greatly appreciate it.
I think what you want is not file.write(percent) but tree.write(file): you want to write the modified tree back to a file, not write the percentage value to it.
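To make that concrete, here is a minimal sketch of the whole flow (the output file name is arbitrary, and the two-decimal formatting is one way to use the assignment's note that reduced precision is allowed):

import xml.etree.ElementTree as ET
import sys

searchName = sys.argv[2]
percent = float(sys.argv[3])

tree = ET.parse(sys.argv[1])
root = tree.getroot()

for plant in root.findall("PLANT"):
    if plant.find("COMMON").text == searchName:
        price = plant.find("PRICE")
        # apply the percentage change to the current price
        price.text = "%.2f" % (float(price.text) * (1 + percent / 100))

# write the whole tree back out (the tree, not a raw float)
tree.write("plant_catalog_output.xml")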
Related
I'm working on a project that is currently having me parse some xml files for data and write that edited data into a specific xml format. I'm also quite new to working with xml files, so bear with me and my lack of vocabulary.
My issue comes in when I am trying to write the new xml file. The trouble starts with how the root of the file is written, as the output file prints it like this:
<root />
when I need it to look like this: <root> </root>
The code that is responsible for writing the output file is:
root = ET.Element('root')
tree = ET.ElementTree()
tree._setroot(root)
for items in dataList:
    temp = ET.Element('element')
    temp.set('attr_1', 'val')
    temp.set('attr_2', 'val')
    temp.set('attr_3', 'val')
    temp.set('attr_4', 'val' + Val)
    temp.set('attr_5', 'val' + items)
    tree.getroot().append(temp)
tree.write('filename', encoding='unicode')
The code inside the for loop isn't super important to the question, since an understanding of the root object will help with getting the rest of this to work, but I included it to show the whole picture. However, to cover my bases, the code within the for loop is supposed to create Elements with 5 attributes each, all as direct children of the root.
I've been stuck on this file writing for a while, and this was the only code format that I could get to successfully write a new file in the first place. When it comes to file writing and creating an ElementTree object from scratch, I find myself circling around the documentation without finding anything new/helpful. I have also sifted through several sources and websites which all helped me get to this current point, but I'm not able to find anything to help move me forward.
Thank you for reading this and helping me out!
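One thing worth checking here: ElementTree serializes an empty element in self-closing form by default. On Python 3.4+, write accepts a short_empty_elements keyword that forces explicit start and end tags; a minimal sketch, assuming that version:

import xml.etree.ElementTree as ET

root = ET.Element('root')
tree = ET.ElementTree(root)  # the public constructor; no need for the private _setroot
# short_empty_elements=False writes an empty element as <root></root>, not <root />
tree.write('filename', encoding='unicode', short_empty_elements=False)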
There are many ways to read XML, both all-at-once (DOM) and one-bit-at-a-time (SAX). I have used SAX or lxml to iteratively read large XML files (e.g. the Wikipedia dump, which is 6.5 GB compressed).
However, after doing some iterative processing (in Python, using ElementTree) of that XML file, I want to write out the (new) XML data to another file.
Are there any libraries to iteratively write out XML data? I could create the XML tree and then write it out, but that is not possible without oodles of RAM. Is there any way to write the XML tree to a file iteratively? One bit at a time?
I know I could generate the XML myself with print "<%s>" % tag_name, etc., but that seems a bit... hacky.
Fredrik Lundh's elementtree.SimpleXMLWriter will let you write out XML incrementally. Here's the demo code embedded in the module:
from elementtree.SimpleXMLWriter import XMLWriter
import sys
w = XMLWriter(sys.stdout)
html = w.start("html")
w.start("head")
w.element("title", "my document")
w.element("meta", name="generator", value="my application 1.0")
w.end()
w.start("body")
w.element("h1", "this is a heading")
w.element("p", "this is a paragraph")
w.start("p")
w.data("this is ")
w.element("b", "bold")
w.data(" and ")
w.element("i", "italic")
w.data(".")
w.end("p")
w.close(html)
With lxml, you can use etree.Element to create new nodes and etree.tostring to write out the XML representation. See, for example, Listing 6 ("Serialize an element's children") from Liza Daly's article "High-performance XML parsing in Python with lxml".
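A minimal sketch of that pattern (file and element names invented for illustration): write the root's start and end tags by hand and serialize each child with tostring as you go, so only one element is in memory at a time.

from lxml import etree

with open('out.xml', 'wb') as f:
    f.write(b'<records>\n')  # root start tag, written by hand
    for i in range(1000000):  # stand-in for an iterative data source
        record = etree.Element('record', id=str(i))
        f.write(etree.tostring(record) + b'\n')  # serialize one child, then discard it
    f.write(b'</records>\n')  # root end tag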
If you're reading in XML dialect1 and have to write XML dialect2, wouldn't it be a good idea to write down the conversion process using XSLT? You may not even need any source code that way.
If you don't find anything else, what I'd prefer here is to inherit from ElementTree and create an "iterativeElementTree", adding a "file" attribute to it. I'd subclass the nodes to have a "start_tag_committed" attribute and a "commit" method. Upon being called, this "commit" method would call the render method for a subtree, starting from the farthest parent where "start_tag_committed" is false. With the string in hand, I'd manually strip the closing tags for the parents of the current node. There is also a need to handle the previously opened but not yet closed parent siblings.
Then, I'd remove the "committed" node from the in-memory model.
You will need to annotate each node with its parent as well, as ElementTree does not do that.
(Write me if there are no better answers and you get stuck there; I could implement this.)
I am being driven crazy by some oddly formed XML and would be grateful for some pointers:
The documents are defined like this:
<sphinx:document id="18059090929806848187">
<url>http://www.some-website.com</url>
<page_number>104</page_number>
<size>7865</size>
</sphinx:document>
Now, I need to read lots of these files (500m+, all gz compressed) and grab the text values from a few of the contained tags.
sample code:
from lxml import objectify, etree
import gzip
with open('file_list', 'rb') as file_list:
    for file in file_list:
        in_xml = gzip.open(file.strip('\n'))
        xml2 = etree.iterparse(in_xml)
        for action, elem in xml2:
            if elem.tag == "page_number":
                print elem.text + str(file)
The first elem.text value is returned, but only for the first file in the list, and it is quickly followed by the error:
lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20
Please excuse my ignorance but xml really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other more intelligent manner?
Thanks
Your input file is not well-formed XML. I assume that it is a snippet from a larger XML document.
Your choices are:
Reconstruct the larger document. How you do this is specific to your application. You may have to consult with the people that created the file you are parsing.
Parse the file in spite of its errors. To do that, use the recover keyword from lxml.etree.iterparse:
xml2 = etree.iterparse(in_xml, recover=True)
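Dropped into the loop from the question, that looks roughly like this (a sketch reusing the question's file_list layout):

from lxml import etree
import gzip

with open('file_list') as file_list:
    for file in file_list:
        in_xml = gzip.open(file.strip('\n'))
        # recover=True lets the parser continue past the undefined sphinx: prefix
        xml2 = etree.iterparse(in_xml, recover=True)
        for action, elem in xml2:
            if elem.tag == "page_number":
                print(elem.text + str(file))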
I am trying to parse a few thousand html files and dump the variables into a csv file (excel spreadsheet). I've come up against several roadblocks; the first one was (thankfully) solved here a few days ago. The (hopefully) final roadblock is this: I cannot get it to properly parse the file using xpath. Below is a brief explanation, the python code, and an example of the html code.
The trouble starts here:
for node in tree.iter():
    name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
    if category == '/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
        category = node.text
It runs, but does not parse. I do not get any traceback errors.
I think I am misunderstanding the logic of parsing with ElementTree.
There are several headers that are the same, which makes it difficult to find a unique id/header. Here is an example of the html:
<span class="s1">Business: Give Back to the Community and Save Money
on Equipment, Technology, Promotional Products, and Market<span
class="Apple-converted-space"> </span></span>
For which the xpath is:
/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span
I would like to scrape the text from this span (among others) and put it in the excel spreadsheet.
You can see an example of a similar page HERE
At any rate, because many spans/headers are not uniquely identified, I think I should use xpath. However, I have yet to figure out how to successfully use xpath commands with ElementTree. In searching the documentation, the answer to this question (as well as the logic) eludes me. I have read up on http://lxml.de/parsing.html as well as on this site and have yet to find something that works.
So far, the code iterates through all the files (in dropbox) nicely. It also creates the csv file and creates the headers (though not in separate columns, only as one line separated by semicolons, but that should be easy to fix).
In sum, I would like it to parse the text from different lines in each file (webpage) and dump it into the excel file.
Any input would be greatly appreciated.
The python code:
import xml.etree.ElementTree as ET
import csv, codecs, os
from cStringIO import StringIO
# Note: you need to download and install this..
import unicodecsv
import lxml.html
# TODO: make into command line params (instead of constant)
CSV_FILE='output.csv'
HTML_PATH='/Users/C/data/Folder_NS'
f = open(CSV_FILE, 'wb')
w = unicodecsv.writer(f, encoding='utf-8', delimiter=';')
w.writerow(['file', 'category', 'about', 'title', 'subtitle', 'date', 'bodyarticle'])
# redundant declarations:
category=''
about=''
title=''
subtitle=''
date=''
bodyarticle=''
print "headers created"
allFiles = os.listdir(HTML_PATH)
#with open(CSV_FILE, 'wb') as csvfile:
print "all defined"
for file in allFiles:
    #print allFiles
    if '.html' in file:
        print "in html loop"
        tree = lxml.html.parse(HTML_PATH + "/" + file)
        print '===================='
        print 'Parsing file: ' + file
        print '===================='
        for node in tree.iter():
            name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
            if category == '/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
                print 'Category:'
                category = node.text
f.close()
Edit, 14 June 2015 (most recent change): I have just changed this section
for node in tree.iter():
    name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
    if category == '/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
        print 'Category:'
        category = node.text
to this:
for node in tree.iter():
    row = dict.fromkeys(cols)
    Category_name = tree.xpath('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
    row['category'] = Category_name[0].text_content().encode('utf-8')
It still runs, but does not parse.
Try the following code:
from lxml import etree
import requests
from StringIO import StringIO
data = requests.get('http://www.usprwire.com/Detailed/Banking_Finance_Investment/Confused.com_reveals_that_Life_Insurance_is_more_than_a_form_of_future_protection_284764.shtml').content
parser = etree.HTMLParser()
root = etree.parse(StringIO(data), parser)
category = root.xpath('//table/td/font/text()')
print category[0]
It uses the requests library to download the html code of the page. You can choose whatever method fits your needs. The important part is the xpath, which searches for any <table> followed by <td> followed by <font>; it returns a list with two elements, where the first contains the text and the second is blank characters.
Running it yields just the sentence you are looking for:
Banking, Finance & Investment: Confused.com reveals that Life Insurance is more than a form of future protection
I am trying to build an offline wiktionary using the wikimedia dump files (.xml.bz2) in Python. I started with this article as the guide. It involves a number of languages; I wanted to combine all the steps into a single Python project. I have found almost all the libraries required for the process. The only hump now is to effectively split the large .xml.bz2 file into a number of smaller files for quicker parsing during search operations.
I know that the bz2 library exists in Python, but it provides only compress and decompress operations. I need something that could do what bz2recover does from the command line, which splits large files into a number of smaller chunks.
One more important point: the splitting shouldn't split the page contents, which start with <page> and end with </page> in the xml document that has been compressed.
Is there a library already available that could handle this situation, or does the code have to be written from scratch? (Any outline/pseudocode would be greatly helpful.)
Note: I would like to make the resulting package cross-platform compatible, hence I can't use OS-specific commands.
In the end, I wrote a Python script myself:
import os
import bz2

def split_xml(filename):
    ''' The function gets the filename of the wiktionary.xml.bz2 file as input and creates
    smaller chunks of it in the directory "chunks"
    '''
    # Check for and create the chunks directory
    if not os.path.exists("chunks"):
        os.mkdir("chunks")
    # Counters
    pagecount = 0
    filecount = 1
    # Open the chunk file in write mode
    chunkname = lambda filecount: os.path.join("chunks", "chunk-" + str(filecount) + ".xml.bz2")
    chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    # Read line by line
    bzfile = bz2.BZ2File(filename)
    for line in bzfile:
        chunkfile.write(line)
        # the </page> tag marks the end of a wiki page
        if '</page>' in line:
            pagecount += 1
            if pagecount > 1999:
                #print chunkname(filecount) # For debugging
                chunkfile.close()
                pagecount = 0  # RESET pagecount
                filecount += 1  # increment the file name
                chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    try:
        chunkfile.close()
    except:
        print 'Files already closed'

if __name__ == '__main__':
    # When the script is run directly
    split_xml('wiki-files/tawiktionary-20110518-pages-articles.xml.bz2')
Well, if you have a command-line tool that offers the functionality you are after, you can always wrap it in a call using the subprocess module.
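For instance, a minimal sketch (assuming bz2recover is installed; it extracts each compressed block of the input into its own rec*.bz2 file):

import subprocess

# delegate the splitting to the external tool; raises CalledProcessError on failure
subprocess.check_call(['bz2recover', 'tawiktionary-20110518-pages-articles.xml.bz2'])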
The method you are referencing is quite a dirty hack :)
I wrote an offline Wikipedia tool and just SAX-parsed the dump completely. The throughput is usable if you just pipe the uncompressed XML into stdin from a proper bzip2 decompressor, especially if it's only the wiktionary.
As a simple way of testing, I just compressed every page, wrote it into one big file, and saved the offset and length in a cdb (a small key-value store). This may be a valid solution for you.
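A rough sketch of that big-file-plus-index idea, using a plain dict where the answer used a cdb store; the pages list is a stand-in for the parsed (title, text) stream:

import bz2

pages = [('apple', 'wikitext for apple...')]  # stand-in for the parsed (title, text) stream

index = {}  # title -> (offset, length); the answer used a cdb store instead
with open('pages.dat', 'wb') as store:
    for title, text in pages:
        blob = bz2.compress(text.encode('utf-8'))
        index[title] = (store.tell(), len(blob))
        store.write(blob)

# later: fetch one page without reading the rest of the file
offset, length = index['apple']
with open('pages.dat', 'rb') as store:
    store.seek(offset)
    print(bz2.decompress(store.read(length)).decode('utf-8'))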
Keep in mind that the MediaWiki markup is the most horrible piece of sh*t I've come across in a long time, but in the case of the wiktionary it might be feasible to handle.