I am being driven crazy by some oddly formed xml and would be grateful for some pointers:
The documents are defined like this:
<sphinx:document id="18059090929806848187">
<url>http://www.some-website.com</url>
<page_number>104</page_number>
<size>7865</size>
</sphinx:document>
Now, I need to read lots (500m+ of these files which are all gz compresed) and grab the text values form a few of the contained tags.
sample code:
from lxml import objectify, etree
import gzip
with open ('file_list','rb') as file_list:
for file in file_list:
in_xml = gzip.open(file.strip('\n'))
xml2 = etree.iterparse(in_xml)
for action, elem in xml2:
if elem.tag == "page_number":
print elem.text + str(file)
the first value elem.text is returned but only for the first file in the list and quickly followed by the error:
lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20
Please excuse my ignorance but xml really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other more intelligent manner?
Thanks
Your input file is not well formed XML. I assume that it is a snippet from a larger XML document.
Your choices are:
Reconstruct the larger document. How you do this is specific to your application. You may have to consult with the people that created the file you are parsing.
Parse the file in spite of its errors. To do that, use the recover keyword from lxml.etree.iterparse:
xml2 =etree.iterparse(in_xml, recover=True)
Related
I am currently having issues with parsing data in my Python class and was wondering if anyone was able to provide a solution to my problem. Here are the instructions for the assignment I'm doing:
XML is the basis for many interfaces and web services. Consequently, reading and manipulating XML data is a common task in software development.
Description
An online plant distributor has recently experience a shortage in its supply of Anemone plants such that the price has increased by 20%. Their plant catalog is maintained in an XML file and they need a Python utility to find the plant by name, read the current price, change it by the specified percentage, and update the file. Writing this utility is your assignment.
Using Python’s ElementTree XML API, write a Python program to perform the following tasks below. Note that your program’s execution syntax must be as follows:
python xmlparse.py plant_catalog.xml plantName percentChange
Using ElementTree, read in this assignments XML file plant_catalog.xml specified by a command line parameter as shown above.
Find the plant by the name passed in as an argument on the command line (plantName above).
Once found, read the current price and adjust it by the command line argument percentChange. Note that this value could be anything in the range of -90 < percentChange < 100.
For example, if you run your script as follows:
python plant_catalog.xml "Greek Valerian" -20
with the original XML containing:
<PLANT>
<COMMON>Greek Valerian</COMMON>
<BOTANICAL>Polemonium caeruleum</BOTANICAL>
<ZONE>Annual</ZONE>
<LIGHT>Shade</LIGHT>
<PRICE>4.36</PRICE>
<AVAILABILITY>071499</AVAILABILITY>
</PLANT>
The resulting file should contain:
<PLANT>
<COMMON>Greek Valerian</COMMON>
<BOTANICAL>Polemonium caeruleum</BOTANICAL>
<ZONE>Annual</ZONE>
<LIGHT>Shade</LIGHT>
<PRICE>3.48</PRICE>
<AVAILABILITY>071499</AVAILABILITY>
</PLANT>
Note: You may reduce the precision of the calculation if you wish but it isn’t required.
Hints
Since XML is just a text file, you could write the code to read all the data and the decode the XML information. However, I certainly don’t recommend this approach. Instead, let Python do it for you! Using Python’s ElementTree module, parse the file into an “in-memory” representation of the XML data. Once parsed, the root (or starting place) is as simple as requesting it from the tree. Once you have the root, you can call methods to find what you are looking for and modify them appropriately. You'll want to "findall" the plants and, "for" each plant "in" the result, you'll want to compare the name with the name passed on the command line. If you find a match you'll apply the percentage change, save the result back to the tree.
When you are done with the search you will "write" the tree back to a file. I suggest using a different file name or you will be having to re-download the original with each run.
One note of caution, be sure to read about XML in the Distributed Systems text. From doing so and reviewing the data file you will not that there are no attributes in the XML file. Consequently, you do not need to use attribute methods when you attempt this assignment.
The following code snippet will give you a good starting point:
# Calling arguments: plant_catalog.xml plantName percentChange
import xml.etree.ElementTree as ET
import sys
# input parameters
searchName = sys.argv[2]
percent = float(sys.argv[3])
# parse XML data file
tree = ET.parse(sys.argv[1])
root = tree.getroot()
Now here is my code:
import xml
import xml.etree.ElementTree as ET
import sys
searchName = sys.argv[2]
percent = float(sys.argv[3])
tree = ET.parse(sys.argv[1])
root = tree.getroot()
def main():
with open("plant_catalog.xml", "r") as file:
data = file.read()
for plant in root.findall("PLANT"):
name = plant.find("COMMON").text
if name == searchName:
original_price = float(plant.find("PRICE").text)
with open("plant_catalog - output.xml", "wb") as file:
file.write(percent)
def change_plant_price(plantName, newPrice):
root = ET.fromstring(xml)
plant = root.find(".//*[COMMON='{}']".format(plantName))
plant.find('PRICE').text = str(newPrice)
ET.dump(root)
if __name__ == "__main__":
main()
The problem with my code is that when I write the code, I get an error in the file. Write(percent) line and shows it needs a byte-like object instead of a float. I'm not sure what's wrong with the code but if anyone is able to provide a solution I would greatly appreciate it.
I think what you want to be doing is not file.write(percent) but instead tree.write(file). You want to write the tree to a file
I am trying to split large xml file into smaller ones, first I started off beautifulsoup:
from bs4 import BeautifulSoup
import os
# Core settings
rootdir = r'C:\Users\XX\Documents\Grant Data\2010_xml'
extension = ".xml"
to_save = r'C:\Users\XX\Documents\all_patents_as_xml'
index = 0
for root, dirs, files in os.walk(rootdir):
for file in files:
if file.endswith(extension):
print(file)
file_name = os.path.join(root,file)
with open(file_name) as f:
data = f.read()
texts = data.split('?xml version="1.0" encoding="UTF-8"?')
for text in texts:
index += 1
filename = to_save + "\\"+ str(index) + ".txt"
with open(filename, 'w') as f:
f.write(text)
However, I got a memory error. Then I switched to xml etree:
from xml.etree import ElementTree as ET
import re
file_name = r'C:\Users\XX\Documents\Grant Data\2010_xml\2010cat_xml.xml'
with open(file_name) as f:
xml = f.read()
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")
parser = ET.iterparse(tree)
to_save = r'C:\Users\Yilmaz\Documents\all_patents_as_xml'
index = 0
for event, element in parser:
# element is a whole element
if element.tag == '?xml version="1.0" encoding="UTF-8"?':
index += 1
filename = to_save + "\\"+ str(index) + ".txt"
with open(filename, 'w') as f:
f.write(ET.tostring(element))
# do something with this element
# then clean up
element.clear()
and I get the following error:
OverflowError: size does not fit in an int
I am using windows operating system, I know in Linux you can split the xmls from consule but in my case I don't know what to do.
If your XML can not be loaded because of memory limits, you should consider using SAX.
With SAX you will read "small bites" of the document, do what ever you want to do with them (Example: Save every N elements to a new file).
Python SAX example 1.
Python SAX example 2.
There are major issues with your question and your attempts at solving it:
You mention using Beautiful Soup. However, while you import Beautiful Soup in your code, you don't actually do anything with it.
The code you show that uses xml.etree is grossly incorrect. At the line parser = ET.iterparse(tree), tree is an XML tree already parsed with ET.fromstring, but the argument to iterparse must either be a file name or a file object. An XML tree is neither of those. So that attempt is dead on arrival.
But more importantly, it looks like what you are trying to process is a file which contains a bunch of concatenated XML files. In your xml.etree attempt you have this test:
element.tag == '?xml version="1.0" encoding="UTF-8"?'
The only intent I can imagine for this test is that you think that xml.etree will somehow interpret <?xml version="1.0" encoding="UTF-8"?> as an XML element which has a name of '?xml version="1.0" encoding="UTF-8"?'. However, the structure <?xml version="1.0" encoding="UTF-8"?> is not an XML element, it is an XML declaration.
And since your code seems to be attempting to split every time an XML declaration is encountered, it seems that your input is a file that contains multiple XML declarations. This file is not valid XML. The XML specification allows the XML declaration to appear once, and only once at the beginning of an XML file. (Don't confuse the XML declaration with a processing instruction. They look similar because they are both delimited by <? and ?>, but the XML declaration is not a processing instruction.) If you use an XML parser on your input file, and this parser conforms to the XML specification, then it has to reject your file as being not XML because XML does not allow XML declarations to appear at random positions in documents.
Where does that leave you? If all XML declarations present in your source document are the same, there's a relatively easy way to make your document parsable by an XML parser. (The attempts you made suggest that they are all the same since you do not use a regular expressions to match different forms of the XML declaration (e.g. one that would specify the standalone parameter).) You can just remove all XML declarations from your source document, wrap it in a new root element, and parse that with xml.etree. (This assumes that the individual XML documents that were concatenated to make up your source document were all individually well-formed. If they weren't then this won't work.)
Note, however, that the string <?xml version="1.0" encoding="UTF-8"?> can appear in an XML document in contexts where this string is not actually an XML declaration. Here is a well-formed XML document that would throw off an algorithm that just looks for a string that looks like an XML declaration:
<?xml version = "1.0" encoding = "UTF-8"?>
<a>
<![CDATA[
<?xml version = "1.0" encoding = "UTF-8"?>
]]>
<?q <?xml version = "1.0" encoding = "UTF-8"?> ?>
<!-- <?xml version = "1.0" encoding = "UTF-8"?> -->
</a>
If you know how your source file was created, you may already be able to know for sure that you don't have any of the cases above. Otherwise, you may want to examine your source to make sure none of the above happens.
Once you take care of this, then using a strategy based on ET.iterparse, or SAX should work.
I'm new to parsing in XML and am stuck with my code regarding finding all titles (title tags) in an XML. This is what I came up with, but it is returning just an empty list, while there should be titles in there.
import bz2
from xml.etree import ElementTree as etree
def parse_xml(filename):
with bz2.BZ2File(filename) as f:
doc = etree.parse(f)
titles = doc.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')
print titles[:10]
Can someone tell me why this is not working properly? Just to be clear; I need to find all text inside title tags stored in a list, taken from an XML wrapped in a bz2 file (as far as I read the best way is without unzipping).
So, my problem is I'm trying to do something a little un-orthodox. I have a complicated set of XSD files. However I don't want to use these XSD files to verify an XML file; I want to parse these XSDs as XML and interrogate them just as I would a normal XML file. This is possible because XSDs are valid XML. I am using lxml with Python3.
The problem I'm having is with the statement:
<xs:include schemaLocation="sdm-extension.xsd"/>
If I instruct lxml to create an XSD for verifying like this:
schema = etree.XMLSchema(schema_root)
this dependency will be resolved (the file exists in the same directory as the one I've just loaded). HOWEVER, I am treating these as XML so, correctly, lxml just treats this as a normal element with an attribute and does not follow it.
Is there an easy or correct way to extend lxml so that I may have the same or similar behaviour as, say
<xi:include href="metadata.xml" parse="xml" xpointer="title"/>
I could, of course, create a separate xml file manually that includes all the dependencies in the XSD schema. That is perhaps a solution?
So it seems like one option is to use the xi:xinclude method and create a separate xml file that includes all the XSDs I want to parse. Something along the lines of:
<fullxsd>
<xi:include href="./xsd-cdisc-sdm-1.0.0/sdm1-0-0.xsd" parse="xml"/>
<xi:include href="./xsd-cdisc-sdm-1.0.0/sdm-ns-structure.xsd" parse="xml"/>
</fullxsd>
Then use some lxml along the lines of
def combine(xsd_file):
with open(xsd_file, 'rb') as f_xsd:
parser = etree.XMLParser(recover=True, encoding='utf-8',remove_comments=True, remove_blank_text=True)
xsd_source = f_xsd.read()
root = etree.fromstring(xsd_source, parser)
incl = etree.XInclude()
incl(root)
print(etree.tostring(root, pretty_print=True))
Its not ideal but it seems the proper way. I've looked at custom URI parsers in the lxml but that would mean actually altering the XSDs which seems messier.
Try this:
def validate_xml(schema_file, xml_file):
xsd_doc = etree.parse(schema_file)
xsd = etree.XMLSchema(xsd_doc)
xml = etree.parse(xml_file)
return xsd.validate(xml)
when using Python's stock XML tools such as xml.dom.minidom for XML writing, a file would always start off like
<?xml version="1.0"?>
[...]
While this is perfectly legal XML code, and it's even recommended to use the header, I'd like to get rid of it as one of the programs I'm working with has problems here.
I can't seem to find the appropriate option in xml.dom.minidom, so I wondered if there are other packages which do allow to neglect the header.
Cheers,
Nico
Unfortunately minidom does not give you the option to omit the XML Declaration.
But you can always serialise the document content yourself by calling toxml() on the document's root element instead of the document. Then you won't get an XML Declaration:
xml= document.documentElement.toxml('utf-8')
...but then you also wouldn't get anything else outside the root element, such as the DOCTYPE, or any comments or processing instructions. If you need them, serialise each child of the document object one by one:
xml= '\n'.join(node.toxml('utf-8') for node in document.childNodes)
I wondered if there are other packages which do allow to neglect the header.
DOM Level 3 LS defines an xml-declaration config parameter you can use to suppress it. The only Python implementation I know of is pxdom, which is thorough on standards support, but not at all fast.
If you want to use minidom and maintain 'prettiness', how about this as a quick/hacky fix:
xml_without_declaration.py:
import xml.dom.minidom as xml
doc = xml.Document()
declaration = doc.toxml()
a = doc.createElement("A")
doc.appendChild(a)
b = doc.createElement("B")
a.appendChild(b)
xml = doc.toprettyxml()[len(declaration):]
print xml
The header is print in Document. If you print the node directly, it won't print the header.
root = doc.childNodes[0]
root.toprettyxml(encoding="utf-8")
Just replace the first line with blank:
import xml.dom.minidom as MD
<XML String>.replace(MD.Document().toxml()+'\n', '')
If you're set on using minidom, just scan back in the file and remove the first line after writing all the XML you need.
You might be able to use a custom file-like object which removes the first tag, e.g:
class RemoveFirstLine:
def __init__(self, f):
self.f = f
self.xmlTagFound = False
def __getattr__(self, attr):
return getattr(self, self.f)
def write(self, s):
if not self.xmlTagFound:
x = 0 # just to be safe
for x, c in enumerate(s):
if c == '>':
self.xmlTagFound = True
break
self.f.write(s[x+1:])
else:
self.f.write(s)
...
f = RemoveFirstLine(open('path', 'wb'))
Node.writexml(f, encoding='UTF-8')
or something similar. This has the advantage the file doesn't have to be totally rewritten if the XML files are fairly large.
Purists may not like to hear this, but I have found using an XML parser to generate XML to be overkill. Just generate it directly as strings. This also lets you generate files larger than you can keep in memory, which you can't do with DOM. Reading XML is another story.
Use string replace
from xml.dom import minidom
mydoc = minidom.parse('filename.xml')
with open(newfile, "w" ) as fs:
fs.write(mydoc.toxml().replace('?xml version="1.0" ?>', ''))
fs.close()
That's it ;)