How to support recursive include when parsing xml - python

I'm defining an XML schema of my own which supports the additional tag "insert_file", which when reached should insert the named text file at that point in the stream and then continue the parsing:
Here is an example:
my.xml:
<xml>
Something
<insert_file name="foo.html"/>
or another
</xml>
I'm using the xml.sax parser as follows:
class HtmlHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
parser = xml.sax.make_parser()
parser.setContentHandler(HtmlHandler())
parser.parse(StringIO(html))
The question is how do I insert the included contents directly into the parsing stream? Of course I could recursively build up the non-interpolated text by repeatedly inserting included text, but that means that I have to parse the xml multiple times.
I tried replacing StringIO(html) with my own stream that allows inserting contents mid stream, but it doesn't work because the sax parser reads the stream buffered.
Update:
I did find a solution that is hackish at best. It is built on the following stream class:
class InsertReader():
    """A reader class that supports the concept of pushing another
    reader in the middle of the use of a first reader. This may
    be used for supporting insertion commands."""
    def __init__(self):
        self.reader_stack = []
    def push(self,reader):
        self.reader_stack += [reader]
    def pop(self):
        self.reader_stack.pop()
    def __iter__(self):
        return self
    def read(self,n=-1):
        """Read from the top most stack element. Never transcends elements.
        Should it?
        The code below is a hack. It feeds only a single token back to
        the reader.
        """
        while len(self.reader_stack)>0:
            # Return a single token
            ret_text = StringIO()
            state = 0
            while 1:
                c = self.reader_stack[-1].read(1)
                if c=='':
                    break
                ret_text.write(c)
                if c=='>':
                    break
            ret_text = ret_text.getvalue()
            if ret_text == '':
                self.reader_stack.pop()
                continue
            return ret_text
        return ''
    def next(self):
        while len(self.reader_stack)>0:
            try:
                v = self.reader_stack[-1].next()
            except StopIteration:
                self.reader_stack.pop()
                continue
            return v
        raise StopIteration
This class creates a stream that limits the number of characters returned to the caller. Even if the XML parser calls read(16386), the class only returns text up to the next '>' character. Since the '>' character also marks the end of a tag, we have the opportunity to inject our recursive include into the stream at this point.
What is hackish about this solution is the following:
Reading one character at a time from a stream is slow.
This has an implicit assumption about how the sax stream class is reading text.
This solves the problem for me, but I'm still interested in a more beautiful solution.

Have you considered using xinclude? The lxml library has builtin support for it.
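As a rough sketch of what XInclude processing looks like, here is a version that uses the standard library's xml.etree.ElementInclude rather than lxml. It assumes you are willing to replace the custom insert_file tag with the standard xi:include element. The FILES dictionary and the loader function are hypothetical stand-ins for reading the included files from disk.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory "file system" so the example is self-contained.
FILES = {"foo.html": "included text"}

def loader(href, parse, encoding=None):
    """Return the included resource: a string for parse="text", an Element otherwise."""
    if parse == "text":
        return FILES[href]
    return ET.fromstring(FILES[href])

doc = ET.fromstring(
    '<xml xmlns:xi="http://www.w3.org/2001/XInclude">'
    'Something <xi:include href="foo.html" parse="text"/> or another'
    '</xml>'
)

# Replaces every xi:include element in place with the loaded content.
ElementInclude.include(doc, loader)

text = "".join(doc.itertext())
print(text)
```

Note that this works on a fully parsed tree, so it trades the streaming behaviour of SAX for simplicity.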

Related

Find all occurrences of bytestrings in a python code snippet

I'm trying to parse python snippets, some of which contain bytestrings.
for example:
"""
from gzip import decompress as __;_=exec;_(__(b'\x1f\x8b\x08\x00\xcbYmc\x02\xff\xbd7i\xb3\xdaJv\xdf\xdf\xaf /I\xf9\xbar\xc6%\x81#\x92k\x9c)\x16I,b\x95Xm\x87\x92Z-$\xd0\x86\x16\x10LM~{N\x03\xd7\xc6\xd7\x9e%\xa9\xa9PE/\xa7\xcf\xbeuk\xd3\xacm\xdd"\x94\x1b\'\xa5\xda\x04"H\x17\xae\xe3t\xf4\xcdn\x03\xa9/&T>\x13\xdbu\g=\x9f\x13~\x11\xf6\x9b\xd7\x15~\xb2\xe7\xbc\xe6\xc2K\xb8\x18\x03\xfd|[\x7f\xe8\xb8I;\xf0\xf1\x93\xec\x83\x8eo15\x8dC\xfc\xc6I\xf1\xfd\xf5r\x8f\xeb\x0f\xd7\xc53#\xa8<_\xb2Py\xbe\xe1\xde\xff\x0fk&\x93\xa8V\x18\x00\x00'))
x = b"\x1f\x8b\x08"
y = "hello world"
"""
Is there a regex pattern I can use to correctly find those strings?
I have tried implementing a regex query myself, like so:
bytestrings= re.findall(r'b"(.+?)"', text) + re.findall(r"b\'(.+?)'", text)
I was expecting to receive an array
[b'\x1f\x8b\x08\x00\xcbYmc\x02\xff\xbd7i\xb3\xdaJv\xdf\xdf\xaf /I\xf9\xbar\xc6%\x81#\x92k\x9c)\x16I,b\x95Xm\x87\x92Z-$\xd0\x86\x16\x10LM~{N\x03\xd7\xc6\xd7\x9e%\xa9\xa9PE/\xa7\xcf\xbeuk\xd3\xacm\xdd"\x94\x1b\'\xa5\xda\x04"H\x17\xae\xe3t\xf4\xcdn\x03\xa9/&T>\x13\xdbu\g=\x9f\x13~\x11\xf6\x9b\xd7\x15~\xb2\xe7\xbc\xe6\xc2K\xb8\x18\x03\xfd|[\x7f\xe8\xb8I;\xf0\xf1\x93\xec\x83\x8eo15\x8dC\xfc\xc6I\xf1\xfd\xf5r\x8f\xeb\x0f\xd7\xc53#\xa8<_\xb2Py\xbe\xe1\xde\xff\x0fk&\x93\xa8V\x18\x00\x00', b"\x1f\x8b\x08"]
instead it returns an empty array.
This isn't a job for regular expressions, but for a Python parser.
import ast
code = """
...
"""
tree = ast.parse(code)
Now you can walk the tree looking for values of type ast.Constant whose value attributes have type bytes. Do this by defining a subclass of ast.NodeVisitor and overriding its visit_Constant method. This method will be called on each node of type ast.Constant in the tree, letting you examine the value. Here, we simply add appropriate values to a global list.
bytes_literals = []
class BytesLiteralCollector(ast.NodeVisitor):
    def visit_Constant(self, node):
        if isinstance(node.value, bytes):
            bytes_literals.append(node.value)
BytesLiteralCollector().visit(tree)
The documentation for NodeVisitor is not great. Aside from the two documented methods visit and generic_visit, I believe you can define visit_* where * can be any of the node types defined in the abstract grammar presented at the start of the documentation.
You can use print(ast.dump(ast.parse(code), indent=4)) to get a more-or-less readable representation of the tree that your visitor will walk.
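Putting the pieces together, a minimal self-contained sketch might look like this. It keeps the results on the visitor instance instead of in a global list, which is a design choice of this example, and the sample snippet is made up for illustration. It assumes Python 3.8 or later, where literals are represented as ast.Constant.

```python
import ast

# A raw string, so the backslashes reach the parser as written in the snippet.
code = r'''
x = b"\x1f\x8b\x08"
y = "hello world"
z = len(b'abc')
'''

class BytesLiteralCollector(ast.NodeVisitor):
    """Collects the value of every bytes literal found in the tree."""
    def __init__(self):
        self.found = []

    def visit_Constant(self, node):
        if isinstance(node.value, bytes):
            self.found.append(node.value)

collector = BytesLiteralCollector()
collector.visit(ast.parse(code))
print(collector.found)
```

The str literal "hello world" is skipped because its value is not of type bytes.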

Parser that preserves comments and recover from error

I'm working on a GUI editor for a proprietary config format. Basically the editor will parse the config file, display the object properties so that users can edit them from the GUI, and then write the objects back to the file.
I've got the parse - edit - write part done, except for:
The parsed data structure only includes object property information, so comments and whitespace are lost on write
If there is any syntax error, the rest of the file is skipped
How would you address these issues? What is the usual approach to this problem? I'm using Python and the Parsec module https://pythonhosted.org/parsec/documentation.html, however any help and general direction is appreciated.
I've also tried Pylens (https://pythonhosted.org/pylens/), which is really close to what I need except it can not skip syntax errors.
You asked about typical approaches to this problem. Here are two projects which tackle similar challenges to the one you describe:
sketch-n-sketch: "Direct manipulation" interface for vector images, where you can either edit the image-describing source language, or edit the image it represents directly and see those changes reflected in the source code. Check out the video presentation, it's super cool.
Boomerang: Using lenses to "focus" on the abstract meaning of some concrete syntax, alter that abstract model, and then reflect those changes in the original source.
Both projects have yielded several papers describing the approaches their authors took. As far as I can tell, the Lens approach is popular, where parsing and printing become the get and put functions of a Lens which takes some source code and focuses on the abstract concept which that code describes.
Eventually I ran out of research time and had to settle for a rather manual skipping approach. Basically, each time the parser fails we advance the cursor by one character and try again. Any parts skipped by this process, whether whitespace, comment or syntax error, are dumped into a Text structure. The code is quite reusable, except that you have to incorporate it into every place where results are repeated and the original parser may fail.
Here's the code, in case it helps anyone. It is written for Parsy.
class Text(object):
    '''Structure to contain all the parts that the parser does not understand.
    A better name would be Whitespace
    '''
    def __init__(self, text=''):
        self.text = text
    def __repr__(self):
        return "Text(text='{}')".format(self.text)
    def __eq__(self, other):
        return self.text.strip() == getattr(other, 'text', '').strip()
def many_skip_error(parser, skip=lambda t, i: i + 1, until=None):
    '''Repeat the original `parser`, aggregate result into `values`
    and error in `Text`.
    '''
    #Parser
    def _parser(stream, index):
        values, result = [], None
        while index < len(stream):
            result = parser(stream, index)
            # Original parser success
            if result.status:
                values.append(result.value)
                index = result.index
            # Check for end condition, effectively `manyTill` in Parsec
            elif until is not None and until(stream, index).status:
                break
            # Aggregate skipped text into last `Text` value, or create a new one
            else:
                if len(values) > 0 and isinstance(values[-1], Text):
                    values[-1].text += stream[index]
                else:
                    values.append(Text(stream[index]))
                index = skip(stream, index)
        return Result.success(index, values).aggregate(result)
    return _parser
# Example usage
skip_error_parser = many_skip_error(original_parser)
On another note, I guess the real issue here is that I'm using a parser combinator library instead of a proper two-stage parsing process. In traditional parsing, the tokenizer handles retaining/skipping any whitespace/comment/syntax error, effectively treating them all as whitespace that is invisible to the parser.
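To illustrate the two-stage idea, here is a minimal sketch of a lossless tokenizer written with the standard re module. The token names and the tiny key = value syntax are hypothetical, since the real config format is not shown. The point is only that comments, whitespace and unrecognised characters become tokens of their own, so the file can be written back unchanged while the parser looks at the significant tokens only.

```python
import re

# Hypothetical token set for a "name = number" config format.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>\#[^\n]*)      # comment up to the end of the line
  | (?P<SPACE>\s+)             # any run of whitespace, including newlines
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<EQUALS>=)
  | (?P<NUMBER>\d+)
  | (?P<ERROR>.)               # anything else: kept, but flagged as an error
""", re.VERBOSE)

def tokenize(source):
    """Split source into (kind, text) pairs that cover every character."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]

source = "a = 1  # first\n@@ b = 2\n"
tokens = tokenize(source)

# The parser only sees significant tokens ...
significant = [t for t in tokens if t[0] not in ("COMMENT", "SPACE", "ERROR")]
# ... while the writer can reproduce the original text exactly.
round_trip = "".join(text for _, text in tokens)

print(significant)
print(round_trip == source)
```

An editor built this way would change the text of individual tokens and then join all tokens again, which keeps comments and layout intact.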

How to perform a query on Big XML file using Python?

I have a 7 GB XML file containing all transactions in one company, and I want to filter only the records of the last year (2015).
The structure of the file is:
<Customer>
    <Name>A</Name>
    <Year>2015</Year>
</Customer>
I also have its DTD file.
I do not know how to filter such data into a text file.
Is there any tutorial or library I can use for this?
Welcome!
As your data is large, I assume you've already decided that you won't be able to load the whole lot into memory, which is what a DOM-style (document object model) parser would do. You've also tagged your question 'SAX' (the simple API for XML), which further implies you know that you need a streaming approach.
Two approaches come to mind:
Using grep
Sometimes with XML it can be useful to use plain text processing tools. grep would allow you to filter through your XML document as plain text and find the occurrences of 2015:
$ grep -B 2 -A 1 "<Year>2015</Year>"
The -B and -A options instruct grep to print some lines of context around the match.
However, this approach will only work if your XML is also precisely structured plain text, which there's absolutely no need for it (as XML) to be. That is, your XML could have any combination of whitespace (or none at all) and still be semantically identical, but the grep approach depends on the exact whitespace arrangement.
SAX
So a more reliable non-memory approach would be to use SAX. SAX implementations are conceptually quite simple, but a little tedious to write. Essentially, you have to override a class which provides methods that are called when certain 'events' occur in the source XML document. In the xml.sax.handler module in the standard library, this class is ContentHandler. These methods include:
startElement
endElement
characters
Your overridden methods then determine how to handle those events. In a typical implementation of startElement(name, attrs) you might test the name argument to determine what the tag name of the element is. You might then maintain a stack of elements you have entered. When endElement(name) occurs, you might then pop the top element off that stack, and possibly do some processing on the completed element. The characters(content) happens when character data is encountered in the source document. In this method you might consider building up a string of the character data which can then be processed when you encounter an endElement.
So for your specific task, something like this may work:
from xml.sax import parse
from xml.sax.handler import ContentHandler
class filter2015(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.elements = [] # stack of elements
        self.char_data = u'' # string buffer
        self.current_customer = u'' # name of customer
        self.current_year = u''
    def startElement(self, name, attrs):
        if name == u'Name':
            self.elements.append(u'Name')
        if name == u'Year':
            self.elements.append(u'Year')
    def characters(self, chars):
        if len(self.elements) > 0 and self.elements[-1] in [u'Name', u'Year']:
            self.char_data += chars
    def endElement(self, name):
        if len(self.elements) > 0:
            self.elements.pop()
        if name == u'Name':
            self.current_customer = self.char_data
            self.char_data = u''
        if name == u'Year':
            self.current_year = self.char_data
            self.char_data = u''
        if name == u'Customer':
            # wait to check the year until the Customer is closed
            if self.current_year == u'2015':
                print('Found:', self.current_customer)
            # clear the buffers now that the Customer is finished
            self.current_year = u''
            self.current_customer = u''
            self.char_data = u''
source = open('test.xml')
parse(source, filter2015())
Check out this question. It will let you interact with it as a generator:
python: is there an XML parser implemented as a generator?
You want to use a generator so that you don't load the entire doc into memory first.
Specifically:
import xml.etree.ElementTree as ET
for event, element in ET.iterparse('huge.xml'):
    if event == 'end' and element.tag == 'ticket':
        #process ticket...
        element.clear() # free the element's content once it has been handled
Source: http://enginerds.craftsy.com/blog/2014/04/parsing-large-xml-files-in-python-without-a-billion-gigs-of-ram.html
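Adapted to the Customer/Name/Year structure from the question, a sketch could look like the following. The enclosing Customers root element and the helper name customers_for_year are assumptions, as the question does not show the top level of the file. An in-memory sample stands in for the 7 GB file.

```python
import io
import xml.etree.ElementTree as ET

def customers_for_year(source, year):
    """Yield the Name of every Customer whose Year equals `year`."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Customer":
            if elem.findtext("Year") == year:
                yield elem.findtext("Name")
            # Drop the children of the finished Customer to keep memory flat.
            elem.clear()

# Small stand-in for the real file; a filename works as `source` too.
sample = io.BytesIO(
    b"<Customers>"
    b"<Customer><Name>A</Name><Year>2015</Year></Customer>"
    b"<Customer><Name>B</Name><Year>2014</Year></Customer>"
    b"<Customer><Name>C</Name><Year>2015</Year></Customer>"
    b"</Customers>"
)

names = list(customers_for_year(sample, "2015"))
print(names)
```

For the real file you would pass the path instead of the BytesIO object and write each name to the output text file as it is yielded, rather than collecting a list.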

Create classes that grab youtube queries and display information using Python

I want to use the urllib module to send HTTP requests and grab data. I can get the data by using the urlopen() function, but not really sure how to incorporate it into classes. I really need help with the query class to move forward. From the query I need to pull
• Top Rated
• Top Favorites
• Most Viewed
• Most Recent
• Most Discussed
My issue is, I can't parse the XML document to retrieve this data. I also don't know how to use classes to do it.
Here is what I have so far:
import urllib #this allows the program to send HTTP requests and to read the responses.
class Query:
    '''performs the actual HTTP requests and initial parsing to build the Video-
    objects from the response. It will also calculate the following information
    based on the video and user results. '''
    def __init__(self, feed_id, max_results):
        '''Takes as input the type of query (feed_id) and the maximum number of
        results (max_results) that the query should obtain. The correct HTTP
        request must be constructed and submitted. The results are converted
        into Video objects, which are stored within this class.
        '''
        self.feed = feed_id
        self.max = max_results
        top_rated = urllib.urlopen("http://gdata.youtube.com/feeds/api/standardfeeds/top_rated")
        results_str = top_rated.read()
        splittedlist = results_str.split('<entry')
        top_rated.close()
    def __str__(self):
        ''' prints out the information on each video and Youtube user. '''
        pass
class Video:
    pass
class User:
    pass
#main function: This handles all the user inputs and stuff.
def main():
    useinput = raw_input('''Welcome to the YouTube text-based query application.
You can select a popular feed to perform a query on and view statistical
information about the related videos and users.
1) today
2) this week
3) this month
4) since youtube started
Please select a time(or 'Q' to quit):''')
    secondinput = raw_input("\n1) Top Rated\n2) Top Favorited\n3) Most Viewed\n4) Most Recent\n5) Most Discussed\n\nPlease select a feed (or 'Q' to quit):")
    thirdinput = raw_input("Enter the maximum number of results to obtain:")
main()
toplist = []
top_rated = urllib.urlopen("http://gdata.youtube.com/feeds/api/standardfeeds/top_rated")
result_str = top_rated.read()
top_rated.close()
splittedlist = result_str.split('<entry')
results_str = top_rated.read()
x=splittedlist[1].find('title')#find the title index
splittedlist[1][x: x+75]#string around the title (/ marks the end of the title)
w=splittedlist[1][x: x+75].find(">")#gives you the start index
z=splittedlist[1][x: x+75].find("<")#gives you the end index
titles = splittedlist[1][x: x+75][w+1:z]#gives you the title!!!!
toplist.append(titles)
print toplist
I assume that your challenge is parsing XML.
results_str = top_rated.read()
splittedlist = results_str.split('<entry')
And I see you are using string functions to parse XML. Such functions based on finite automata (regular languages) are NOT suited for parsing context-free languages such as XML. Expect it to break very easily.
For more reasons, please refer to "RegEx match open tags except XHTML self-contained tags".
Solution: consider using an XML parser like ElementTree. It comes with Python and allows you to browse the XML tree pythonically. http://effbot.org/zone/element-index.htm
You may come up with code like:
import xml.etree.ElementTree as ET
..
results_str = top_rated.read()
root = ET.fromstring(results_str)
for node in root:
    print(node)
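To make that a bit more concrete, here is a small sketch that pulls the titles out of an Atom feed. The sample_feed string is a made-up stand-in for the text returned by read(); the only assumption about the real response is that it is an Atom document whose entry and title elements live in the Atom namespace.

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced tags as "{namespace-uri}localname".
ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up sample standing in for the HTTP response body.
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First video</title></entry>
  <entry><title>Second video</title></entry>
</feed>"""

root = ET.fromstring(sample_feed)
titles = [entry.findtext(ATOM + "title") for entry in root.iter(ATOM + "entry")]
print(titles)
```

Compared with split('<entry') and find('title'), this keeps working when attributes, whitespace or element order change.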
I also don't know how to use classes to do it.
Don't be in a rush to create classes :-)
In the above example, you are importing a module, not importing a class and instantiating/initializing it, like you do for Java. Python has powerful primitive types (dictionaries, lists) and considers modules as objects: so (IMO) you can go easy on classes.
You use classes to organize stuff, not because your teacher has indoctrinated you "classes are good. Lets have lots of them".
Basically you want to use the Query class to communicate with the API.
def __init__(self, feed_id, max_results, time):
    qs = "http://gdata.youtube.com/feeds/api/standardfeeds/"+feed_id+"?max-results="+str(max_results)+"&time=" + time
    self.feed_id = feed_id
    self.max_results = max_results
    wo = urllib.urlopen(qs)
    result_str = wo.read()
    wo.close()

Python trying to write and read class from file but something went horribly wrong

Considering this is only for my homework I don't expect much help but I just can't figure this out and honestly I can't get my head around what's going wrong. Usually I have an idea where the problem is but now I just don't get it.
Long story short: I'm trying to create a valid looking telephone number within a class and then loading it onto an array or list then later on save all of them as string into a folder. When I start the program again I want it to read the file and re-create my class and load it back into the list. (Basically a very simple repository).
Problem is even though I evaluate the stored phone number in the exact same way I validate it as input data ... I get an error which makes no sense.
Another small problem is the fact that when I re-use the data for some reason it creates white spaces in the file which in turn messes my program up badly.
Here I validate phone numbers:
def validateTel(call_ID):
    if isinstance (call_ID, str) == True:
        call_ID = call_ID.replace (" ", "")
        if (len (call_ID) != 10):
            print ("Telephone numbers are 10 digits long")
            return False
        for item in call_ID:
            try:
                int(item)
            except:
                print ("Telephone numbers should contain non-negative digits")
                return False
            else:
                if (int(item) < 0):
                    print ("Digits are non-negative")
After this I use it and other non-relevant (to this discussion) data to create an object (class instance) and move them to a list.
Inside my class I have a load from string and a load to string. What they do is take everything from my class object so I can write it to a file using "+" as a separator so I can use string.split("+") and write it to a file. This works nicely, but when I read it ... well it's not working.
def load_data():
    f = open ("data.txt", "r")
    ch = f.read()
    contact = agenda.contact () # class object
    if ch in (""," ","None"," None"):
        f.close()
        return [] # if the file is empty or has None in some way I pass an empty stack
    else:
        stack = [] # the list where I load all my class objects
        f.seek(0,0)
        for line in f:
            contact.loadFromString(line) # explained below
            stack.append(deepcopy(contact))
        f.close()
        return stack
In loadFromString(line) all I do is validate the line (see if the data inside it at least looks OK).
Now here is the place where I validate the string I just read from the file:
def validateString (load_string):
    string = string.split("+")
    if len (string) != 4:
        print ("System error in loading from file: Program skipping segment of corrupt data")
        return False
    if string[0] == "" or string[0] == " " or string[0] == None or string[0] == "None" or string[0] == " None":
        print ("System error in loading from file: Name field cannot be empty")
    try:
        int(string[1])
    except:
        print("System error in loading from file: ID is not integer")
        return False
    if (validateTel(str(string[2])) == False):
        print ("System error in loading from file: Call ID (telephone number)")
        return False
    return True
Small recap:
I try to load the data from file using loadFromString(). The only relevant thing that does is it tries to validate my data with validateString(string) in there the only thing that messes me up is the validateTel. But my input data gets validated in the same way my stored data does. They are perfectly identical but it gives a "System error" BUT to give such an error it should have also gave an error in the validate sub-program but it doesn't.
I hope this is enough info because my program is kinda big (for me any way) however the bug should be here somewhere.
I thank anyone brave enough to sift through this mess.
EDIT:
The class is very simple, it looks like this:
class contact:
    def __init__ (self, name = None, ID = None, tel = None, address = None):
        self.__name = name
        self.__id = ID
        self.__tel = tel
        self.__address = address
After this I have a series of setters and getters (to modify contacts and to return parts of the abstract data)
Here I also have my loadFromString and loadToString but those work just fine (except maybe they cause a small jump after each line (an empty line) which then kills my program, but that I can deal with)
My problem is somewhere in the validate or a way the repository interacts with it. The point is that even if it gives an error in the loading of the data, first the validate should print an error ... but it doesn't -_-
You said I just can't figure this out and honestly I can't get my head around what's going wrong. I think this is a great quote which sums up a large part of programming and software development in general -- dealing with crazy, weird problems and spending a lot of time trying to wrap your head around them.
Figuring out how to turn ridiculously complicated problems into small, manageable problems is the hardest part of programming, but also arguably the most important and valuable.
Here's some general advice which I think might help you:
use meaningful names for functions and variables (validateString doesn't tell me anything about what the function does; string tells me nothing about the meaning of its contents)
break down problems into small, well-defined pieces
specify your data -- what is a phone number? 10 positive digits, no spaces, no punctuation?
document/comment the input/output from functions if it's not obvious
Specific suggestions:
validateTel could probably be replaced with a simple regular expression match
try using json for serialization
if you're using json, then it's easy to use lists. I would strongly recommend this over using + as a separator -- that looks highly questionable to me
Example: using a regex
import re
def validateTel(call_ID):
    phoneNumberRegex = re.compile(r"^\d{10}$") # match a string of 10 digits
    return phoneNumberRegex.match(call_ID) is not None # True or False
Example: using json
import json
phoneNumber1, phoneNumber2, phoneNumber3 = ... whatever ...
mylist = [phoneNumber1, phoneNumber2, phoneNumber3]
print(json.dumps(mylist))
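As a fuller sketch of the json suggestion, here is a round trip for a list of contacts. The field names mirror the constructor arguments of the contact class (name, ID, tel, address); the sample values are made up. Note that the address contains a "+", which would break the split("+") format but is harmless here.

```python
import json

# Made-up sample data; each contact is a plain dict so json can handle it.
contacts = [
    {"name": "Ann", "id": 1, "tel": "0123456789", "address": "1 Main St + Apt 2"},
    {"name": "Bob", "id": 2, "tel": "9876543210", "address": "2 Side Rd"},
]

# This string is what you would write to data.txt ...
text = json.dumps(contacts)

# ... and this is what you would do after reading it back.
restored = json.loads(text)

print(restored == contacts)
```

Because json keeps the types, the id comes back as an int and the tel as a str, so there is no need to re-validate the structure field by field.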
For starters, don't name your variables after existing names. string is not a reserved keyword, but it is the name of a standard-library module. Note also that the function's parameter is load_string while the body splits string, so the names do not match. Instead of calling it string, call it telNumber or s or my_string.
def validateString (my_string):
    working_string= my_string.split("+")
    if len (working_string) != 4:
        print ("System error in loading from file: Program skipping segment of corrupt data")
        return False
The next line I don't really get. What is this if chain for? Accounting for bad data or something? It is probably better to check for good data; bad data can come in infinite variety.
    if working_string[0] == "" or working_string[0] == " " or working_string[0] == None or working_string[0] == "None" or working_string[0] == " None":
        print ("System error in loading from file: Name field cannot be empty")
    try:
        int(working_string[1])
    except:
        print("System error in loading from file: ID is not integer")
        return False
    if (validateTel(str(working_string[2])) == False):
        print ("System error in loading from file: Call ID (telephone number)")
        return False
    return True
Also, to give you a hint - you may want to look into regular expressions.
Wow, there are many problems here that may be connected to your issue. As I commented, I suspect your problem is with turning the telnumber object into a string.
f is a file object; it won't be equal to anything. If you want to check whether the file exists, just wrap the open call in try/except, like:
try:
    f = open ('data.txt','r') #also could call c=f.read() and check if c equals to none.. not really needed because you can cover an empty file in the next part iterating over f
except:
    return
for line in f:
    all sorts of stuff
return stack
Don't use string as a variable name, and checking for negative numbers is very strange. Is this part of the homework? And why check by converting to int? This could also break your code, since the rest is a string.
All that said, I still suspect your main problem is the way you are turning the object into string data. It will not remain an instance of your class unless you use json/pickle/something else to stringify it. An object instance isn't just a str.
And another thing: keep it simple. Python is (also) about elegant and simple coding, and you are trying to throw brute force with everything you know at a simple problem. Focus, relax and rewrite the program.
I don't follow all the logic.
For the moment, I can tell you that you should correct the code of load_data() as follows:
def load_data():
    f = open ("data.txt", "r")
    ch = f.read()
    contact = agenda.contact () # class object
    if ch in (""," ","None"," None"):
        f.close()
        return [] # if the file is empty or has None in some way I pass an empty stack
    else:
        stack = [] # the list where I load all my class objects
        f.seek(0,0)
        for line in f:
            contact.loadFromString(line) # explained below
            stack.append(deepcopy(contact))
        f.close()
        return stack
I don't see how the file object f could ever have the value None or be a string, so I think you want to test the content of the file -> f.read()
But then the file's pointer is at the end and must be moved back to the start -> seek(0,0)
I will add further remarks as I understand more of the problem.
edit
To test whether a file is empty or not
import os.path
if os.path.isfile(filepath) and os.path.getsize(filepath):
    .........
If the file with path filepath doesn't exist, getsize() raises an error. So the preliminary test os.path.isfile() is necessary to prevent the second condition from being evaluated.
But if your file can contain the strings "None" or " None" (really?!), getsize() will return 4 or 5 in this case.
You should avoid working with files containing this kind of useless data.
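A small sketch of that check wrapped in a function, exercised on temporary files. The function name has_content and the file names are made up for the example.

```python
import os
import tempfile

def has_content(filepath):
    """Return True only if filepath is an existing, non-empty regular file."""
    # isfile() is tested first, so getsize() is never called on a missing path.
    return os.path.isfile(filepath) and os.path.getsize(filepath) > 0

with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.txt")
    full = os.path.join(tmp, "data.txt")
    missing = os.path.join(tmp, "missing.txt")

    open(empty, "w").close()                      # zero-byte file
    with open(full, "w") as f:
        f.write("Ann+1+0123456789+Main St\n")     # one stored contact line

    results = (has_content(empty), has_content(full), has_content(missing))

print(results)
```

With such a helper, load_data() can return an empty list early and only open the file when there is something to read.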
edit
In validateTel(), after the instruction if isinstance (call_ID, str) == True: you are sure that call_ID is a string. The iteration for item in call_ID: then produces items that are each ONE character long, so it's useless to test if (int(item) < 0): because it will never be true. There could be a - sign in the string call_ID, but you won't detect it with this last condition.
In fact, as you test each character of call_ID, it is enough to test whether it is one of the digits 0,1,2,3,4,5,6,7,8,9. If the sign - is in call_ID, it will be detected as not being a digit.
To test whether all the characters in call_ID are digits, there's an easy way provided by Python: all()
def validateTel(call_ID):
    if isinstance(call_ID, str):
        call_ID = call_ID.replace (" ", "")
        if len(call_ID) != 10:
            print ("A telephone number must be 10 digits long")
            return False
        elif not all(c in '0123456789' for c in call_ID):
            print ("Every character in a telephone number must be a digit")
            return False
        else:
            return True
    else:
        print ("call_ID must be a string")
        return False
If one of the characters c in call_ID isn't a digit, c in '0123456789' is False and all() stops examining the remaining characters and returns False; otherwise, it returns True.
