Shelve raises out of overflow pages - python

I have a lot of data that I read from an XML file, but only part of it will be saved in the database later.
The XML is structured like this:
<element id="1" other_attrib="">
<element id="2" other_attrib="">
...
<other_element>
<elem id="1">
<elem id="100">
</other_element>
<other_element>
<elem id...>
</other_element>
All of the element tags precede the other_element tags (this XML is third-party; I can't restructure it).
I have to read all of the elements first because they contain additional attributes, but only save those that are referenced by other_element tags.
So I opened a shelve with writeback=False and use lxml.etree.iterparse() to parse the XML and save elements on the go. After many elements have been added (I don't know the exact number, but it goes into the hundreds of thousands) I receive the following error:
HASH: Out of overflow pages. Increase page size
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  ...
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shelve.py", line 133, in __setitem__
    self.dict[key] = f.getvalue()
Basically what I do is for every element iterparse returns:
if elem.tag == 'element':
    # Store only the attributes needed later, keyed by the element id
    shlv[elem.attrib['id']] = {'attrib1': elem.attrib['other_attrib'], 'attrib2': elem.attrib['attrib2']}
elif elem.tag == 'other_element':
    # Iterate through this tag's children and look up their references in shlv
    for ref in elem:
        save_in_database(shlv[ref.attrib['id']])
What can I change so that shelve handles more data? Or should I use something else to store that data?
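One option for the "something else" is a small SQLite-backed store from the standard library. The message "HASH: Out of overflow pages" appears to come from the dbm hash backend underneath shelve rather than from shelve itself, so swapping the backend is a plausible way around it. The following is only a minimal sketch: the SqliteStore class, its kv table, and the JSON serialization are assumptions made for illustration, not part of the original code.

```python
import json
import sqlite3


class SqliteStore:
    """Minimal dict-like key/value store backed by SQLite (hypothetical helper)."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to keep the data on disk
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)"
        )

    def __setitem__(self, key, value):
        # Values are serialized as JSON; this assumes plain dicts of strings
        self.conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])

    def close(self):
        self.conn.commit()
        self.conn.close()


store = SqliteStore()
store["1"] = {"attrib1": "a", "attrib2": "b"}
looked_up = store["1"]
```

Because the class implements the same item access as shelve, the parsing loop would only need shlv replaced by a SqliteStore instance. Committing once at the end (or every few thousand inserts) keeps the write cost low.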


Including XHTML content when creating ReqIf XML document using pyXB

A bit of background: in the scope of a requirements management plugin for Sphinx, I'm looking into ways to export ReqIF XML content. I've found pyreqif, but found that it isn't complete enough to suit our needs at the moment.
I decided to take a look at the Reqif bindings generated by pyXB instead, with the idea that pyXB can do all the grunt work of converting things to and from XML and I just have to worry about adding some convenience functions/classes.
The project can be found here: https://github.com/bavovanachte/reqif_pyxb_tryout
So far it's going great: I've managed to create instances of all the objects and they tie together nicely into an xml document. The only thing I'm having trouble with is the creation of XHTML content. Ideally I'd want to take existing html content and insert that into the tree.
The naive approach of doing that caused the XML-unsafe characters to be escaped, so that didn't work.
These are some of my attempts:
Attempt 1: Passing the xml as a string to the XHTML_CONTENT constructor
xml_string = '''
<div>
XY Block Adapter shall translate the Communication to TMN-Block in a bidirectional manner and support all functionalities of a TMN-Block.<br/>
</div>'''
att_value_xhtml = ATTRIBUTE_VALUE_XHTML(definition=text_attribute, THE_VALUE=XHTML_CONTENT(div=xml_string))
Result: Escaped XML content:
<div>
XY Block Adapter shall translate the Communication to TMN-Block in a bidirectional manner and support all functionalities of a TMN-Block.<br/>
</div></ns2:div>
Attempt 2: Passing the XML as a string to the XHTML_CONTENT constructor with the _from_xml flag set
xml_string = '''
<div>
XY Block Adapter shall translate the Communication to TMN-Block in a bidirectional manner and support all functionalities of a TMN-Block.<br/>
</div>'''
att_value_xhtml = ATTRIBUTE_VALUE_XHTML(definition=text_attribute, THE_VALUE=XHTML_CONTENT(xml_string, _from_xml=True))
Result: pyXB exception:
Traceback (most recent call last):
  File "examples/export_test.py", line 105, in <module>
    att_value_xhtml = ATTRIBUTE_VALUE_XHTML(definition=text_attribute, THE_VALUE=XHTML_CONTENT(xml_string, _from_xml=True))
  File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2127, in __init__
    self.extend(args, _from_xml=from_xml, _location=location)
  File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2612, in extend
    [ self.append(_v, **kw) for _v in value_list ]
  File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2612, in <listcomp>
    [ self.append(_v, **kw) for _v in value_list ]
  File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2588, in append
    raise pyxb.MixedContentError(self, value, location)
pyxb.exceptions_.MixedContentError: Invalid non-element content
Attempt 3: Passing the XML as a string to the xhtml_div_type constructor with the _from_xml flag set, then assigning the resulting instance to the div member
xml_string = '''
<div>
XY Block Adapter shall translate the Communication to TMN-Block in a bidirectional manner and support all functionalities of a TMN-Block.<br/>
</div>'''
att_value_xhtml = ATTRIBUTE_VALUE_XHTML(definition=text_attribute, THE_VALUE=XHTML_CONTENT(div=xhtml_div_type(xml_string, _from_xml=True)))
Result: Escaped XML content:
<div>
XY Block Adapter shall translate the Communication to TMN-Block in a bidirectional manner and support all functionalities of a TMN-Block.<br/>
</div></ns2:div>
Attempt 4: Converting the string to a DOM first and using that in the constructor
xml_string = '''
<div>
XY Block Adapter shall translate the Communication to TMN-Block in a bidirectional manner and support all functionalities of a TMN-Block.<br/>
</div>'''
dom_content = xml.dom.minidom.parseString('<myxml>Some data<empty/> some more data</myxml>')
att_value_xhtml = ATTRIBUTE_VALUE_XHTML(definition=text_attribute, THE_VALUE=XHTML_CONTENT(div=xhtml_div_type(_dom_node=dom_content)))
Result: pyXB exception:
Unable to convert DOM node empty at [UNAVAILABLE] to binding
Traceback (most recent call last):
  File "examples/export_test.py", line 130, in <module>
    att_value_xhtml = ATTRIBUTE_VALUE_XHTML(definition=text_attribute, THE_VALUE=XHTML_CONTENT(div=xhtml_div_type(_dom_node=dom_content)))
  File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2133, in __init__
    self.extend(dom_node.childNodes[:], fallback_namespace)
  File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2612, in extend
    [ self.append(_v, **kw) for _v in value_list ]
  File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2612, in <listcomp>
    [ self.append(_v, **kw) for _v in value_list ]
  File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2567, in append
    raise pyxb.UnrecognizedContentError(self, self.__automatonConfiguration, value, location)
pyxb.exceptions_.UnrecognizedContentError: Invalid content empty (expect {http://www.w3.org/1999/xhtml}h1 or {http://www.w3.org/1999/xhtml}h2 or ...
What would be the correct way of handling the xhtml content?

Python XML Iterparse halt on text

I am new to python, using 3.x, and am running into an issue with an XML file that I'm testing/learning on. When I look at the raw file (which is ASCII encoded btw), the issue (I'm pretty sure) is that there's a U+00A0 code in there.
The XML is as follows:
<?xml version="1.0" encoding="utf-8"?>
<XMLSetData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.clientsite.com/subdir/r2.4/v1">
<FileCreationDate>2018-05-05T11:35:44.1043858-05:00</FileCreationDate>
<XMLSetDataList>
<DataIDNumber>99345346</DataIDNumber>
<DataName>RSRS TVL5697 ULLĀ  Georgetown</DataName>
</XMLSetDataList>
</XMLSetData>
Using Notepad++, it shows me that the text has "xA0 " instead of " " (two spaces) between ULL and Georgetown. So when I do the code below:
import xml.etree.ElementTree as ET
events = ("end", "start-ns", "end-ns")
for event, elem in ET.iterparse(xml_file, events=events):
    if event == "end":
        eltag = elem.tag
        eltext = elem.text
        print(eltag, eltext)
It gives me an error stating:
  File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1222, in iterator
    yield from pullparser.read_events()
  File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1297, in read_events
    raise event
  File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1269, in feed
    self._parser.feed(data)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 6, column 30
How do I fix this / get around it? If I remove the xA0 part, it parses fine, but obviously something like this may come up again, and I'd like to programmatically handle it.
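One way to handle this programmatically is to decode the raw bytes yourself before parsing. A lone 0xA0 byte is not valid UTF-8, so the parser rejects a file that declares encoding="utf-8" but contains it. The sketch below tries UTF-8 first and falls back to Latin-1, then re-encodes as UTF-8 so the bytes match the declaration. The iterparse_lenient name and the Latin-1 fallback are assumptions for illustration; the real source encoding of the file may differ.

```python
import io
import xml.etree.ElementTree as ET


def iterparse_lenient(raw_bytes, fallback_encoding="latin-1"):
    """Return an iterparse iterator, re-decoding input that is not valid UTF-8."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: stray bytes such as 0xA0 come from a single-byte encoding
        text = raw_bytes.decode(fallback_encoding)
    # Re-encode so the bytes agree with the encoding="utf-8" declaration
    clean = io.BytesIO(text.encode("utf-8"))
    return ET.iterparse(clean, events=("end",))


# Small stand-in for the file contents; 0xA0 sits between "ULL" and the space
sample = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b"<DataName>RSRS TVL5697 ULL\xa0 Georgetown</DataName>"
)
texts = [elem.text for _, elem in iterparse_lenient(sample)]
```

For a file on disk, the bytes would come from open(xml_file, "rb").read(). This loads the whole file into memory, which is fine for small files but gives up the streaming benefit of iterparse for very large ones.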

Error with Python and XML

I'm getting an error when trying to grab a value from my XML. I get "Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
Here is my code:
import requests
import lxml.etree
from requests.auth import HTTPBasicAuth
r = requests.get("https://somelinkhere/folder/?parameter=abc", auth=HTTPBasicAuth('username', 'password'))
print(r.text)
root = lxml.etree.fromstring(r.text)
textelem = root.find("opensearch:totalResults")
print(textelem.text)
I'm getting this error:
Traceback (most recent call last):
  File "tickets2.py", line 8, in <module>
    root = lxml.etree.fromstring(r.text)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:82934)
  File "src/lxml/parser.pxi", line 1814, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:124471)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
Here is what the XML looks like; I'm trying to grab the value in the last line.
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:apple-wallpapers="http://www.apple.com/ilife/wallpapers" xmlns:g-custom="http://base.google.com/cns/1.0" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:cc="http://web.resource.org/cc/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:g-core="http://base.google.com/ns/1.0">
<title>Feed from some link here</title>
<link rel="self" href="https://somelinkhere/folder/?parameter=abc" />
<link rel="first" href="https://somelinkhere/folder/?parameter=abc" />
<id>https://somelinkhere/folder/?parameter=abc</id>
<updated>2018-03-06T17:48:09Z</updated>
<dc:creator>company.com</dc:creator>
<dc:date>2018-03-06T17:48:09Z</dc:date>
<opensearch:totalResults>4</opensearch:totalResults>
I have tried various changes from links like https://twigstechtips.blogspot.com/2013/06/python-lxml-strings-with-encoding.html and http://makble.com/how-to-parse-xml-with-python-and-lxml but I keep running into the same error.
Instead of r.text, which guesses at the text encoding and decodes it, try using r.content which accesses the response body as bytes. (See http://docs.python-requests.org/en/latest/user/quickstart/#response-content.)
You could also use r.raw. See parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml) for more info.
Once that issue is fixed, you'll have the issue of the namespace. The element you're trying to find (opensearch:totalResults) has the prefix opensearch which is bound to the uri http://a9.com/-/spec/opensearch/1.1/.
You can find the element by combining the namespace uri and the local name (Clark notation):
{http://a9.com/-/spec/opensearch/1.1/}totalResults
See http://lxml.de/tutorial.html#namespaces for more info.
Here's an example with both changes implemented:
opensearch_ns = "{http://a9.com/-/spec/opensearch/1.1/}"
root = lxml.etree.fromstring(r.content)
textelem = root.find(opensearch_ns + "totalResults")
print(textelem.text)
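As an alternative to building the Clark-notation string by hand, find() also accepts a prefix-to-URI mapping, which lets the path keep the prefixed form. The sketch below uses the standard library's xml.etree.ElementTree on a cut-down copy of the feed so that it runs without a network request; lxml's find() takes the same namespaces argument. The feed bytes and the ns name are stand-ins for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the response body (bytes, as r.content would give)
feed = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<feed xmlns="http://www.w3.org/2005/Atom" '
    b'xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">'
    b"<opensearch:totalResults>4</opensearch:totalResults>"
    b"</feed>"
)

# Map the prefix used in the path to the namespace URI
ns = {"opensearch": "http://a9.com/-/spec/opensearch/1.1/"}

root = ET.fromstring(feed)
total = root.find("opensearch:totalResults", ns)
total_text = total.text
```

The prefix in the mapping only has to match the prefix used in the path, not the one used in the document, so the lookup keeps working if the feed later binds the same URI to a different prefix.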

Python XML Parsing Child Tag

I am trying to get the contents of a sub tag using lxml. The XML file I am parsing is valid, but for some reason when I try to parse the child element it seems to think I have invalid XML. I have seen from other posts that this error is usually generated when there isn't a closing tag, but the XML parses fine in a browser. Any ideas why this is happening?
Contents of XML file (test.xml):
<?xml version="1.0" encoding="UTF-8"?>
<Group id="RHEL-07-010010">
<title>SRG-OS-000257-GPOS-00098</title>
<description><GroupDescription></GroupDescription> </description>
<Rule id="RHEL-07-010010_rule" severity="high" weight="10.0">
<version>RHEL-07-010010</version>
<title>The file permissions, ownership, and group membership of system files and commands must match the vendor values.</title>
<description><VulnDiscussion>Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.
Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278- GPOS-00108</VulnDiscussion><FalsePositives>< /FalsePositives><FalseNegatives>< /FalseNegatives><Documentable>false< /Documentable><Mitigations>< /Mitigations><SecurityOverrideGuidance>< /SecurityOverrideGuidance><PotentialImpacts>< /PotentialImpacts><ThirdPartyTools>< /ThirdPartyTools><MitigationControl>< /MitigationControl><Responsibility>< /Responsibility><IAControls></IAControls></description>
<ident system="http://iase.disa.mil/cci">CCI-001494</ident>
<ident system="http://iase.disa.mil/cci">CCI-001496</ident>
<fixtext fixref="F-RHEL-07-010010_fix">Run the following command to determine which package owns the file:
# rpm -qf <filename>
Reset the permissions of files within a package with the following command:
#rpm --setperms <packagename>
Reset the user and group ownership of files within a package with the following command:
#rpm --setugids <packagename></fixtext>
<fix id="F-RHEL-07-010010_fix" />
<check system="C-RHEL-07-010010_chk">
<check-content-ref name="M" href="VMS_XCCDF_Benchmark_SRG.xml" />
<check-content>Verify the file permissions, ownership, and group membership of system files and commands match the vendor values.
Check the file permissions, ownership, and group membership of system files and commands with the following command:
# rpm -Va | grep '^.M'
If there is any output from the command, this is a finding.</check-content>
</check>
</Rule>
</Group>
I am trying to get the contents of the VulnDiscussion tag. I can get the contents of the parent tag, description, like this:
from lxml import etree as ET
xml = ET.parse("test.xml")
for description in xml.xpath('//description/text()'):
    print(description)
This produces the following output:
<GroupDescription></GroupDescription>
<VulnDiscussion>Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.
Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278-GPOS-00108</VulnDiscussion> <FalsePositives></FalsePositives><FalseNegatives> </FalseNegatives><Documentable>false</Documentable><Mitigations></Mitigations> <SecurityOverrideGuidance></SecurityOverrideGuidance><PotentialImpacts> </PotentialImpacts><ThirdPartyTools></ThirdPartyTools><MitigationControl> </MitigationControl><Responsibility></Responsibility><IAControls></IAControls>
So far so good, now I try and extract the contents of VulnDiscussion with this code:
for description in xml.xpath('//description/text()'):
    vulnDiscussion = next(iter(ET.XML(description).xpath('//VulnDiscussion/text()')), None)
    print(vulnDiscussion)
and get the following error :
    vulnDiscussion = next(iter(ET.XML(description).xpath('//VulnDiscussion/text()')), None)
  File "src/lxml/lxml.etree.pyx", line 3192, in lxml.etree.XML (src/lxml/lxml.etree.c:78763)
  File "src/lxml/parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:118341)
  File "src/lxml/parser.pxi", line 1736, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:117021)
  File "src/lxml/parser.pxi", line 1102, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:111265)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105109)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106817)
  File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105671)
  File "<string>", line 3
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 3, column 79
An XML document can have only one root element, but each text string returned by xml.xpath('//description/text()') contains several top-level elements. Wrap them in a single element so the string parses as a document with one root.
Also note that the text in the original XML has a space before the slash in most closing tags (for example < /FalsePositives>), which you should remove.
from lxml import etree as ET
xml = ET.parse("test.xml")
for description in xml.xpath('//description/text()'):
    # Add a root tag and remove the space before the slash in closing tags
    x = ET.XML('<Testroot>' + description.replace('< /', '</') + '</Testroot>')
    vulnDiscussion = next(iter(x.xpath('//VulnDiscussion/text()')), None)
    if vulnDiscussion:
        print(vulnDiscussion)
Output
Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.
Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278- GPOS-00108

Parsing HTML tag with ":" with lxml

I am new to Python and I'm trying to parse an HTML page with lxml. I want to get the text from <p> tags, but inside one of them I have a strange tag like this:
<p style="margin-left:0px;padding:0 0 0 0;float:left;">
<g:plusone size="medium">
</g:plusone>
</p>
How can I ignore this tag inside <p>? I want to cut all tags with ":" from any HTML page, because other lxml functions don't work properly with tags like this.
parser=etree.HTMLParser()
tree = etree.parse('problemtags.html',parser)
root=tree.getroot()
text = [ b.text for b in root.iterfind(".//p")]
I expect to get the text inside the <p> tags. But when I run this, it fails on a fragment like the one above and reports: "b'Tag g:plusone invalid'". All I need is to ignore all incorrect tags like this. I don't know exactly how many tags like this I will have in the future, but I think the problem really is the ":", because when I use ".tag" to get the name, it is just "plusone", not "g:plusone".
Here is a way I found to clean up the html:
from lxml import etree
from StringIO import StringIO
s = '''<p style="margin-left:0px;padding:0 0 0 0;float:left;">
<g:plusone size="medium">
</g:plusone>
</p>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(s), parser)
result = etree.tostring(tree.getroot(),pretty_print=True,method="html")
print result
This prints
<html><body><p style="margin-left:0px;padding:0 0 0 0;float:left;">
<plusone size="medium">
</plusone>
</p></body></html>
To get an etree.Element reference (an etree._Element instance) from an etree._ElementTree, just call getroot():
root = tree.getroot()
print type(root) # prints lxml.etree._Element
According to the _Element class documentation, lxml.etree._Element is the class of element instances; in other words, it's what results from calling etree.Element, for example:
el = etree.Element("an_etree.Element_reference")
print type(el) # prints lxml.etree._Element
The g: is a namespace prefix. The actual tag name is only plusone. So, lxml is correct in only returning plusone as the tag name. See a summary of namespaces here.
As I understand it, lxml's HTML Parser is not namespace aware. However, the XML Parser is. Presumably, given that this HTML document contains XML, it is most likely actually an XHTML document (if not, then it is probably an invalid HTML document and you cannot expect lxml to parse it correctly). Therefore, you need to run it through the XML Parser rather than HTML Parser. lxml's namespace API is explained in their tutorial.
However, with the fragment you provided the parser returns this:
>>> d = etree.fromstring('''<p style="margin-left:0px;padding:0 0 0 0;float:left;">
... <g:plusone size="medium">
... </g:plusone>
... </p>''')
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)
  File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102470)
  File "parser.pxi", line 1674, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:101299)
  File "parser.pxi", line 1074, in lxml.etree._BaseParser._parseDoc (src\lxml\lxml.etree.c:96481)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:92476)
  File "parser.pxi", line 622, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:91772)
lxml.etree.XMLSyntaxError: Namespace prefix g on plusone is not defined, line 2, column 23
Note that it complains that the "Namespace prefix g on plusone is not defined." Presumably, the namespace prefix is defined elsewhere in your document. As I don't know what it is, I'll just make something up and define it on the plusone tag in your fragment:
>>> d = etree.fromstring('''<p style="margin-left:0px;padding:0 0 0 0;float:left;">
... <g:plusone xmlns:g="something" size="medium">
... </g:plusone>
... </p>''')
>>> d
<Element p at 0x2563cd8>
>>> d.tag
'p'
>>> d[0]
<Element {something}plusone at 0x2563940>
>>> d[0].tag
'{something}plusone'
Notice that the g: prefix was replaced with the actual namespace ({something} in this case, because I set it with xmlns:g="something"). Usually the namespace would actually be a URI, so you may find that your tag looks something like this: {http://where.it/is/from.xml}plusone
Nevertheless, I find working with namespaces rather bothersome when they are not necessary. You may actually find it easier to use the HTML parser, which ignores the namespaces. Now that you know that the tag is named plusone, not g:plusone, you may be able to get on with your work using just the HTML parser.
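If you do end up on the XML parser but only care about the local tag names, a small helper can drop the {namespace} part. This is a minimal sketch using the standard library's xml.etree.ElementTree, which reports namespaced tags in the same {uri}name form; the local_name helper and the made-up "something" namespace are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag without its '{namespace}' part, if it has one."""
    return tag.rsplit("}", 1)[-1]


# Same fragment as above, with a made-up namespace bound to the g: prefix
fragment = (
    '<p style="margin-left:0px;padding:0 0 0 0;float:left;">'
    '<g:plusone xmlns:g="something" size="medium"></g:plusone>'
    "</p>"
)
root = ET.fromstring(fragment)
names = [local_name(child.tag) for child in root]
```

The helper leaves tags without a namespace unchanged, so it can be applied to every element in a mixed document.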
