Get line number of xml error in python libxml2 - python

I am using the libxml2 Python bindings to parse my XML and validate it against an XSD file.
When I get a schema validation error, I am unable to find the line number of the error in the XML file.
Is there a way?
For better understanding, I have pasted my code below:
ctxt = libxml2.schemaNewParserCtxt("xsd1.xsd")
schema = ctxt.schemaParse()
del ctxt
validationCtxt = schema.schemaNewValidCtxt()
doc = libxml2.parseFile("myxml.xml")
iErrA1 = validationCtxt.schemaValidateDoc(doc)
#get the line number if there is error
Update:
I tried libxml2.lineNumbersDefault(1), which presumably enables the library to collect line numbers.
Then I registered a callback through validationCtxt.setValidityErrorHandler(ErrorHandler, ErrorHandler).
I get the msg part, but no line number is present.
More info: I had seen a line number member in xmlError, but it is unclear to me how I would get this xmlError object back when there is an error.
There is also a global function lastError which returns an xmlError object, but it always returns None, although there is an error.
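For illustration, a sketch of one possible workaround (it assumes switching to the lxml bindings is an option): lxml wraps the same libxml2 validator but exposes an error_log whose entries carry line numbers:
from lxml import etree

schema = etree.XMLSchema(etree.parse("xsd1.xsd"))
doc = etree.parse("myxml.xml")
if not schema.validate(doc):
    for error in schema.error_log:
        # each log entry carries .line, .column and .message attributes
        print("line %s: %s" % (error.line, error.message))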


RuntimeError: dictionary keys changed during iteration

I am trying to use the following script, but there have been some errors that I've needed to fix. I was able to get it running, but for most of the data it tries to process, the following error arises:
C:\Users\Alexa\OneDrive\Skrivbord\Database Samples\LIDC-IDRI-0001\1.3.6.1.4.1.14519.5.2.1.6279.6001.175012972118199124641098335511\1.3.6.1.4.1.14519.5.2.1.6279.6001.141365756818074696859567662357\068.xml
Traceback (most recent call last):
  File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidc_data_to_nifti.py", line 370, in <module>
    parse_xml_file(xml_file)
  File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidc_data_to_nifti.py", line 311, in parse_xml_file
    root=xmlHelper.create_xml_tree(file)
  File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidcXmlHelper.py", line 23, in create_xml_tree
    for at in el.attrib.keys(): # strip namespaces of attributes too
RuntimeError: dictionary keys changed during iteration
This corresponds to the following code:
def create_xml_tree(filepath):
    """
    Method to ignore the namespaces if ElementTree is used.
    Necessary because ElementTree, by default, extends
    tag names by the namespace, but the namespaces used in the
    LIDC-IDRI dataset are not consistent.
    Solution based on https://stackoverflow.com/questions/13412496/python-elementtree-module-how-to-ignore-the-namespace-of-xml-files-to-locate-ma
    instead of ET.fromstring(xml)
    """
    it = ET.iterparse(filepath)
    for _, el in it:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
        for at in el.attrib.keys():  # strip namespaces of attributes too
            if '}' in at:
                newat = at.split('}', 1)[1]
                el.attrib[newat] = el.attrib[at]
                del el.attrib[at]
    return it.root
I am not at all familiar with reading XML files in Python, and this problem has had me stuck for the last two days. I tried reading threads both here and on other forums, but they did not give me significant insight. From what I understand, the problem arises because we are trying to modify the same object we are iterating over, which is not allowed? I tried to fix this by making a copy of the file and having that change depending on what is read from the original file, but I did not get it working properly.
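For reference, a minimal sketch of the usual fix (not from the original script): take a snapshot of the keys before mutating the dict, since Python 3 raises this RuntimeError when a dict changes while one of its live views is being iterated:
for at in list(el.attrib.keys()):  # iterate over a copied list, not the live view
    if '}' in at:
        el.attrib[at.split('}', 1)[1]] = el.attrib.pop(at)
With the list() copy in place, deleting and inserting keys inside the loop no longer touches the object being iterated.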

loading Behave test results in JSON file in environment.py's after_all() throws error

I'm trying to send my Behave test results to an API Endpoint. I set the output file to be a new JSON file, run my test, and then in the Behave after_all() send the JSON result via the requests package.
I'm running my Behave test like so:
args = ['--outfile=/home/user/nathan/results/behave4.json',
        '--format=json.pretty']
from behave.__main__ import main as behave_main
behave_main(args)
In my environment.py's after_all(), I have:
import json
import requests

def after_all(context):
    data = json.load(open('/home/user/myself/results/behave.json', 'r'))  # this line causes the error
    sendoff = {}
    sendoff['results'] = data
    r = requests.post(MyAPIEndpoint, json=sendoff)
I'm getting the following error when running my Behave test:
HOOK-ERROR in after_all: ValueError: Expecting object: line 124 column 1 (char 3796)
ABORTED: By user.
The reported error is here in my JSON file:
[
    {
        ...
    }    <-- line 124, column 1
]
However, behave.json is written out after the run, and according to JSONLint it is valid JSON. I don't know the exact details of after_all(), but I think the issue is that the JSON file isn't done being written by the time I try to open it in after_all(). If I run json.load() a second time on the behave.json file after the run has finished writing it, it runs without error and I am able to view my JSON at the endpoint.
Any better explanation as to why this is happening? Any solution or change in logic to get past this?
Yes, it seems as though the file is still in the process of being written when I try to access it in after_all(). I put a small delay before opening the file in my code, then manually viewed the behave.json file and saw that there was no closing ] after the last }.
That explains it. I will create a new question to find out how to get around this, or whether a change in logic is required.
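For illustration, one sketch of a way around it (assuming the results only need to be posted once per run, and that the JSON formatter has flushed and closed its output by the time behave_main() returns): move the upload out of the after_all() hook and do it after the runner finishes:
import json
import requests
from behave.__main__ import main as behave_main

args = ['--outfile=/home/user/nathan/results/behave4.json',
        '--format=json.pretty']
behave_main(args)  # hooks and formatters have all finished by the time this returns

with open('/home/user/nathan/results/behave4.json', 'r') as f:
    sendoff = {'results': json.load(f)}
requests.post(MyAPIEndpoint, json=sendoff)  # MyAPIEndpoint as in the original hook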

Google search issue in Python

I have implemented a program in Python which performs a Google search and captures the top ten links from the search results. I am using the 'pygoogle' library for the search. The first two or three times I ran my program, it got proper hits and the entire project worked fine, but after a certain number of links had been downloaded it started giving the following error. (gui_two.py is my program name.)
Exception in Tkinter callback
Traceback (most recent call last):
File "/usr/lib/python2.7/lib-tk/Tkinter.py", line 1413, in __call__
return self.func(*args)
File "gui_two.py", line 113, in action
result = uc.utilcorpus(self.fn1,"")
File "/home/ci/Desktop/work/corpus/corpus.py", line 125, in utilcorpus
for url in g1.get_urls(): #this is key sentence based search loop
File "/home/ci/Desktop/work/corpus/pygoogle.py", line 132, in get_urls
for result in data['responseData']['results']:
TypeError: 'NoneType' object has no attribute '__getitem__'
I know this is one of the most familiar errors in Python, but I am not able to do anything about it since it happens inside a library. I wonder whether my program is spamming Google, whether I need the custom Google Search API, or whether there is some other reason. Please give me precise information on how to perform the search without any issue. I will be so grateful for your help.
Thanks.
Edited: Actually my code is very large; here is the small piece of code where the problem arises.
g1 = pygoogle(query)
g1.pages = 1
for url in g1.get_urls():  # error is in this line
    print "URL : ", url
It may work if you simply copy it into a plain .py file, but if you execute it many times, the program gives the error above.
Here's the culprit code from pygoogle.py (from http://pygoogle.googlecode.com/svn/trunk/pygoogle.py)
def get_urls(self):
    """Returns list of result URLs"""
    results = []
    search_results = self.__search__()
    if not search_results:
        self.logger.info('No results returned')
        return results
    for data in search_results:
        if data and data.has_key('responseData') and data['responseData']['results']:
            for result in data['responseData']['results']:
                if result:
                    results.append(urllib.unquote(result['unescapedUrl']))
    return results
Unlike every other place where data['responseData']['results'] is used, they're not both being checked for existence using has_key().
I suspect that your responseData is missing results, hence the for loop fails.
Since you have the source, you can edit this yourself.
Also make an issue for the project - very similar to this one in fact.
I fixed the issue by modifying the source code of the pygoogle.py library. The bug in this code is that it never checks whether an element holds data or is None. The modified code is:
def get_urls(self):
    """Returns list of result URLs"""
    results = []
    for data in self.__search__():
        # following two lines are added to fix the issue
        if data['responseData'] is None or data['responseData']['results'] is None:
            break
        for result in data['responseData']['results']:
            if result:
                results.append(urllib.unquote(result['unescapedUrl']))
    return results
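A slightly more defensive variant of that check (a sketch, not from the original fix) uses dict.get with fallbacks so a missing key and a None value are handled the same way, and skips the bad page rather than aborting the whole loop:
for data in self.__search__():
    # treat a missing 'responseData' key and a None value identically
    response_data = (data or {}).get('responseData') or {}
    for result in response_data.get('results') or []:
        if result:
            results.append(urllib.unquote(result['unescapedUrl']))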

Parsing an XML file in python for emailing purposes

I am writing code in Python that can not only read an XML file but also send the results of that parsing as an email. For now I am having trouble just trying to read the XML file I have. I made a simple Python script that I thought would at least read the file, which I could then try to email from within Python, but I am getting a SyntaxError on line 4. (The root tag is 'log'.)
Anyway, here is the code I have written so far:
import xml.etree.cElementTree as etree
tree = etree.parse('C:/opidea.xml')
response = tree.getroot()
log = response.find('log').text
logentry = response.find('logentry').text
author = response.find('author').text
date = response.find('date').text
msg = [i.text for i in response.find('msg')]
The XML file has this type of formatting:
<log>
  <logentry
      revision="12345">
    <author>glv</author>
    <date>2012-08-09T13:16:24.488462Z</date>
    <paths>
      <path
          action="M"
          kind="file">/trunk/build.xml</path>
    </paths>
    <msg>BUG_NUMBER:N/A
FEATURE_AFFECTED:N/A
OVERVIEW:Example</msg>
  </logentry>
</log>
I want to be able to send an email of this XML file, but for now I am just trying to get the Python code to read it.
response.find('log') won't find anything, because:
find(self, path, namespaces=None)
Finds the first matching subelement, by tag name or path.
In your case log is not a subelement, but rather the root element itself. You can get its text directly, though: response.text. But in your example the log element doesn't have any text in it anyway.
EDIT: Sorry, that quote from the docs actually comes from the lxml.etree documentation rather than xml.etree.
I'm not sure about the reason, but all the other calls to find also return None (you can verify this by printing response.find('date') and so on). With lxml you can use xpath instead:
author = response.xpath('//author')[0].text
msg = [i.text for i in response.xpath('//msg')]
In any case, your use of find is not correct for msg, because find always returns a single element, not a list of them.
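For comparison, a sketch of how the same fields could be pulled out with plain xml.etree by searching relative to the root instead of for the root tag itself (element names taken from the sample XML above):
import xml.etree.cElementTree as etree

tree = etree.parse('C:/opidea.xml')
response = tree.getroot()              # this element *is* <log>
logentry = response.find('logentry')   # <logentry> is a child of <log>, so find works
author = logentry.find('author').text
date = logentry.find('date').text
msg = logentry.find('msg').text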

lxml etree.parse MemoryAllocation Error

I'm using lxml's etree.parse to parse a somewhat huge XML file (around 65 MB - 300 MB).
When I run my standalone Python script containing the function below, I get a memory allocation failure:
Error:
Memory allocation failed : xmlSAX2Characters, line 5350155, column 16
Partial function code:
def getID():
    try:
        from lxml import etree
        xml = etree.parse(<xml_file>)  # here is where the failure occurs
        for element in xml.iter():
            ...
        result = <formed by concatenating element texts>
        return result
    except Exception, ex:
        <handle exception>
The weird thing is that when I typed the same function into IDLE and tested the same XML file, I did not encounter any memory allocation error.
Any ideas on this issue? Thanks in advance.
I would parse the document using the iterative parser instead, calling .clear() on any element you are done with; that way you avoid having to load the whole document in memory in one go.
You can limit the iterative parser to only those tags you are interested in. If you only want to parse <person> tags, tell your parser so:
for _, element in etree.iterparse(input, tag='person'):
    # process your person data
    element.clear()
By clearing the element in the loop, you free it from memory.
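As a fuller sketch (the 'person' tag and the extra clean-up of already-processed siblings are illustrative, not from the answer above), the same idea applied to building a result incrementally:
from lxml import etree

def getID(xml_file):
    parts = []
    for _, element in etree.iterparse(xml_file, tag='person'):
        if element.text:
            parts.append(element.text)
        element.clear()  # free this element's children
        # also drop references to earlier siblings kept alive via the root
        while element.getprevious() is not None:
            del element.getparent()[0]
    return ''.join(parts)
The sibling clean-up matters because iterparse still builds a tree as it goes; clearing each element alone leaves empty placeholder nodes attached to the root.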
