RuntimeError: dictionary keys changed during iteration - python

I am trying to use the following script. There were some errors that I needed to fix, and I was able to get it running, but for most of the data it tries to process the following error arises:
C:\Users\Alexa\OneDrive\Skrivbord\Database Samples\LIDC-IDRI-0001\1.3.6.1.4.1.14519.5.2.1.6279.6001.175012972118199124641098335511\1.3.6.1.4.1.14519.5.2.1.6279.6001.141365756818074696859567662357\068.xml
Traceback (most recent call last):
File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidc_data_to_nifti.py", line 370, in <module>
parse_xml_file(xml_file)
File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidc_data_to_nifti.py", line 311, in parse_xml_file
root=xmlHelper.create_xml_tree(file)
File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidcXmlHelper.py", line 23, in create_xml_tree
for at in el.attrib.keys(): # strip namespaces of attributes too
RuntimeError: dictionary keys changed during iteration
This corresponds to the following code:
def create_xml_tree(filepath):
    """
    Method to ignore the namespaces if ElementTree is used.
    Necessary because ElementTree, by default, extends
    tag names by the namespace, but the namespaces used in the
    LIDC-IDRI dataset are not consistent.
    Solution based on https://stackoverflow.com/questions/13412496/python-elementtree-module-how-to-ignore-the-namespace-of-xml-files-to-locate-ma
    instead of ET.fromstring(xml)
    """
    it = ET.iterparse(filepath)
    for _, el in it:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
        for at in el.attrib.keys():  # strip namespaces of attributes too
            if '}' in at:
                newat = at.split('}', 1)[1]
                el.attrib[newat] = el.attrib[at]
                del el.attrib[at]
    return it.root
I am not at all familiar with xml file reading in python and this problem has gotten me stuck for the last two days now. I tried reading up on threads both here and on other forums but it did not give me significant insight. From what I understand the problem arises because we're trying to manipulate the same object we are reading from, which is not allowed? I tried to fix this by making a copy of the file and then having that change depending on what is read in the original file but I did not get it working properly.
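The usual fix for this RuntimeError in Python 3 is to iterate over a snapshot of the attribute keys (for example list(el.attrib.keys())) so that el.attrib itself can be modified inside the loop. A minimal sketch of create_xml_tree with only that inner loop changed:

import xml.etree.ElementTree as ET

def create_xml_tree(filepath):
    it = ET.iterparse(filepath)
    for _, el in it:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]  # strip namespace from the tag
        # iterate over a snapshot of the keys so the dict can be modified safely
        for at in list(el.attrib.keys()):
            if '}' in at:
                el.attrib[at.split('}', 1)[1]] = el.attrib.pop(at)
    return it.root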

Related

Using Python, NLTK, to analyse German text

I am a beginner in Python and currently trying to use NLTK to analyze German text (extract German nouns and their frequencies from German text) by following this tutorial: https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/
There are several issues that I faced during the process and I am not able to solve them.
When I follow the website to execute the code below:
import random
tagged_sents = list(corp.tagged_sents())
random.shuffle(tagged_sents)
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]
and it comes out with this
Traceback (most recent call last):
File "test2.py", line 7, in <module>
tagged_sents = list(corp.tagged_sents())
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\conll.py", line 130, in tagged_sents
return LazyMap(get_tagged_words, self._grids(fileids))
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\conll.py", line 215, in _grids
return concat(
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\util.py", line 433, in concat
raise ValueError("concat() expects at least one object!")
ValueError: concat() expects at least one object!
Then I tried to fix it by following this solution https://teamtreehouse.com/community/randomshuffle-crashes-when-passed-a-range-somenums-randomshufflerange5250
and altered
tagged_sents = list(corp.tagged_sents())
to
tagged_sents = list(range(5,250))
The ValueError no longer appears, but I don't know what (5,250) means, even though I have read the explanation.
Then I continued to execute the following step:
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
tagger = ClassifierBasedGermanTagger(train=train_sents)
And it shows
Traceback (most recent call last):
File "test1.py", line 90, in <module>
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
ModuleNotFoundError: No module named 'ClassifierBasedGermanTagger'
I have already downloaded ClassifierBasedGermanTagger.py and init.py and put them in the folder that is linked to VS Code. I don't know if that is correct; the passage says:
'Using his Python class ClassifierBasedGermanTagger (which you can download from the github page) we can create a tagger and train it with the data from the TIGER corpus:'
Please help me to fix these problems, thanks!
First of all, welcome to Stack Overflow! Before posting a question, please make sure that you have done your own research; most of the time that alone solves the problem.
Secondly, range(start, end) is a very basic Python function that gives you the numbers from start up to (but not including) end, and I don't think using it the way you did is going to solve the problem. I would suggest using print to see what kind of data is being populated in corp and start debugging from there. Maybe corp is just empty and that's why you don't get any tagged_sents.
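For illustration, range(5, 250) simply stands for the integers 5 through 249, which has nothing to do with your corpus:

>>> list(range(5, 250))[:5]
[5, 6, 7, 8, 9]
>>> len(range(5, 250))
245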
For the import part, it is not clear to me where you put ClassifierBasedGermanTagger.py, but wherever it is, it is not visible to your code. You can try putting your code (test2.py) and ClassifierBasedGermanTagger.py in the same directory. Read the link below for more details on how to properly import modules in Python.
https://docs.python.org/3/reference/import.html
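As a concrete starting point, and assuming corp was created with ConllCorpusReader as in the tutorial (the TIGER file name below is the one the tutorial uses and is only a placeholder here), a quick sanity check would be:

from nltk.corpus.reader import ConllCorpusReader

corp = ConllCorpusReader('.', 'tiger_release_aug07.corrected.16012013.conll09',
                         ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                         encoding='utf-8')

print(corp.fileids())  # an empty list means the corpus file was not found under '.'
if corp.fileids():
    print(len(list(corp.tagged_sents())))  # safe to call once a file is found

For the import, one directory layout that matches the from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger line is to keep test2.py next to the downloaded folder (the init.py you mention should be named __init__.py):

test2.py
ClassifierBasedGermanTagger/
    __init__.py
    ClassifierBasedGermanTagger.py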

lxml.etree.XMLSyntaxError: No declaration for attribute

I am trying to use dblp data set to convert the xml file to csv file. Right now, I am using iterparse() to parse the xml file.
Here is my code:
from lxml import etree

def iterpar():
    f = open('dblp.xml', 'rb')
    context = etree.iterparse(f, dtd_validation=True, events=("start", "end"))
    context = iter(context)
    event, root = next(context)
    for event, ele in context:
        print event
        print ele
However, when I tried to print out something to see what it is, an error was reported:
Traceback (most recent call last):
File "C:\dblp\Data\XML2csv", line 34, in <module>
iterpar()
File "C:\dblp\Data\XML2csv", line 29, in iterpar
for event, ele in context:
File "iterparse.pxi", line 208, in lxml.etree.iterparse.__next__ (src\lxml\lxml.etree.c:131498)
lxml.etree.XMLSyntaxError: No declaration for attribute mdate of element article, line 4, column 19
I guess this might result from a failed DTD validation, but all attributes are declared in the dtd file. I also tried to google for an explanation of the problem but didn't find a good one.
Can anybody explain it and tell me how to fix it? Thank you very much.
-----------update---------
I think I do need the dtd_validation. Otherwise it will report:
lxml.etree.XMLSyntaxError: Entity 'ouml' not defined, line 47, column 25
Entities like 'ouml' and 'uuml' occur in the xml file and are defined in the dtd file. Setting dtd_validation to false prevents the "No declaration" error, but then this one occurs instead.
Without seeing your XML or DTD, it's hard to say. It sounds like your XML is violating the DTD because it defines an 'mdate' attribute not listed in the DTD for a specific element. You definitely need the DTD because it defines at least one special character in your XML, so removing the DTD is out of the question.
Is it possible for you to add the 'mdate' attribute into the DTD so that the parser will accept your XML?
<!ATTLIST element-name attribute-name attribute-type #IMPLIED>
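For the specific error above, that would mean declaring mdate on the article element, along these lines (CDATA is an assumption; the real dblp.dtd may use a different attribute type):

<!ATTLIST article mdate CDATA #IMPLIED>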
Your xml file needs to have something similar to this:
<!DOCTYPE dblp SYSTEM "dblp.dtd">
If you start fixing Entity 'ouml', you can break the previous dependency.
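Concretely, with that declaration in place, the top of dblp.xml would look roughly like this (attribute values are placeholders):

<?xml version="1.0"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
  <article mdate="...">
    ...

With the DOCTYPE pointing at dblp.dtd, the parser can resolve entities such as &ouml; and also knows which attributes are declared.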

Google search issue in Python

I have implemented a program in Python which performs a Google search and captures the top ten links from the search results. I am using the 'pygoogle' library for the search. The first two or three times I run my program, it gets proper hits and the entire project works fine. But afterward, after a certain number of links have been downloaded, it gives the following error. (gui_two.py is my program name)
Exception in Tkinter callback
Traceback (most recent call last):
File "/usr/lib/python2.7/lib-tk/Tkinter.py", line 1413, in __call__
return self.func(*args)
File "gui_two.py", line 113, in action
result = uc.utilcorpus(self.fn1,"")
File "/home/ci/Desktop/work/corpus/corpus.py", line 125, in utilcorpus
for url in g1.get_urls(): #this is key sentence based search loop
File "/home/ci/Desktop/work/corpus/pygoogle.py", line 132, in get_urls
for result in data['responseData']['results']:
TypeError: 'NoneType' object has no attribute '__getitem__'
I know this is a very familiar error in Python, but I am not able to do anything about it since it occurs inside a library. I wonder whether my program is spamming Google, whether I need a custom Google Search API, or whether there is some other reason. Please give me precise information on how to perform the search without this issue. I will be grateful for your help.
Thanks.
Edited: Actually my code is very large; here is the small piece of code where the problem arises.
g1 = pygoogle(query)
g1.pages = 1
for url in g1.get_urls(): #error is in this line
print "URL : ",url
It may work if you simply copy it into a simple .py file, but if you execute it many times, the program gives an error.
Here's the culprit code from pygoogle.py (from http://pygoogle.googlecode.com/svn/trunk/pygoogle.py)
def get_urls(self):
    """Returns list of result URLs"""
    results = []
    search_results = self.__search__()
    if not search_results:
        self.logger.info('No results returned')
        return results
    for data in search_results:
        if data and data.has_key('responseData') and data['responseData']['results']:
            for result in data['responseData']['results']:
                if result:
                    results.append(urllib.unquote(result['unescapedUrl']))
    return results
Unlike every other place where data['responseData']['results'] is used, they're not both being checked for existence using has_key().
I suspect that your responseData is missing results, hence the for loop fails.
Since you have the source, you can edit this yourself.
Also make an issue for the project - very similar to this one in fact.
I fixed the issue by modifying the source code of the pygoogle.py library. The bug in this code is that it does not check whether an element actually contains data or is None. The modified code is:
def get_urls(self):
    """Returns list of result URLs"""
    results = []
    for data in self.__search__():
        # following two lines are added to fix the issue
        if data['responseData'] == None or data['responseData']['results'] == None:
            break
        for result in data['responseData']['results']:
            if result:
                results.append(urllib.unquote(result['unescapedUrl']))
    return results

Get line number of xml error in python libxml2

I am using libxml2 python to parse my xml and validate it against an xsd file.
I was unable to find out the line number of the error in the xml file when I get a schema validation error.
Is there a way?
For better understanding, I have pasted my code below:
ctxt = libxml2.schemaNewParserCtxt("xsd1.xsd")
schema = ctxt.schemaParse()
del ctxt
validationCtxt = schema.schemaNewValidCtxt()
doc = libxml2.parseFile("myxml.xml")
iErrA1 = validationCtxt.schemaValidateDoc(doc)
#get the line number if there is error
Update:
I tried libxml2.lineNumbersDefault(1), which probably enables the library to collect line numbers.
Then I register a callback through validationCtxt.setValidityErrorHandler(ErrorHandler, ErrorHandler).
I get the msg part, but no line number is present.
More info: I had seen a line number member in xmlError, but it's unclear to me how I will get this xmlError type object back when there is an error.
Also, there is a global function lastError which returns an xmlError type object, but it always returns None although there is an error.
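For reference, a minimal sketch of what the update above describes; the handler signature is an assumption (the schema validity handlers are, as far as I know, called back with the message string and a user argument, but check the setValidityErrorHandler docstring in your libxml2 version):

import libxml2

libxml2.lineNumbersDefault(1)  # ask libxml2 to keep line number information

def ErrorHandler(msg, arg):
    # msg is the validation message string; as noted above, it does not
    # seem to carry the line number itself
    print("schema error: %s" % msg)

ctxt = libxml2.schemaNewParserCtxt("xsd1.xsd")
schema = ctxt.schemaParse()
del ctxt

validationCtxt = schema.schemaNewValidCtxt()
validationCtxt.setValidityErrorHandler(ErrorHandler, ErrorHandler)
doc = libxml2.parseFile("myxml.xml")
ret = validationCtxt.schemaValidateDoc(doc)  # non-zero means validation failed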

Can't open file: "NameError: name <filename> is not defined"

I am creating a program to read a FASTA file and split it at some specific characters such as '>'. But I'm facing a problem.
The program part is:
>>> def read_FASTA_strings(seq_fasta):
...     with open(seq_fasta.txt) as file:
...         return file.read().split('>')
The error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'seq_fasta' is not defined
How to get rid of this problem?
You need to specify the file name as a string literal:
open('seq_fasta.txt')
You need to quote the filename: open('seq_fasta.txt').
Besides that, you might want to choose a name other than file, as using this name shadows a built-in name.
Your program is seeing seq_fasta.txt as an object label, similar to how you would use math.pi after importing the math module.
This is not going to work because seq_fasta.txt doesn't actually point to anything, hence your error. What you need to do is either put quotes around it, 'seq_fasta.txt', or create a string object containing that name and pass that variable to the open function. Because of the .txt, it thinks seq_fasta (in the function header) and seq_fasta.txt (in the function body) are two different labels.
Next, you shouldn't use file, as it is a built-in name in Python and you could end up with some tricky bugs and a bad habit.
def read_FASTA_strings(somefile):
    with open(somefile) as textf:
        return textf.read().split('>')
and then to use it
lines = read_FASTA_strings("seq_fasta.txt")
