How to perform a query on Big XML file using Python? - python

I have a 7 GB XML file containing all the transactions in one company, and I want to filter out only the records from last year (2015).
The structure of the file is:
<Customer>
<Name>A</Name>
<Year>2015</Year>
</Customer>
I also have its DTD file.
I do not know how I can filter such data into a text file.
Is there any tutorial or library that could be used for this?
Welcome!

As your data is large, I assume you've already decided that you won't be able to load the whole lot into memory, which is what a DOM-style (Document Object Model) parser would do. And you've actually tagged your question 'SAX' (the Simple API for XML), which further implies you know that you need an approach that doesn't hold the whole document in memory.
Two approaches come to mind:
Using grep
Sometimes with XML it can be useful to use plain text processing tools. grep would allow you to filter through your XML document as plain text and find the occurrences of 2015:
$ grep -B 2 -A 1 "<Year>2015</Year>"
The -B and -A options instruct grep to print some lines of context around the match.
However, this approach will only work if your XML is also precisely structured plain text, which there is absolutely no need for it (as XML) to be. That is, your XML could use any arrangement of whitespace (or none at all) and still be semantically identical, but the grep approach depends on the exact whitespace layout.
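For example, the following record is semantically identical to the sample in the question, yet the single-line grep pattern above would not find it:
<Customer>
    <Name>A</Name>
    <Year>
        2015
    </Year>
</Customer>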
SAX
So a more reliable non-memory approach would be to use SAX. SAX implementations are conceptually quite simple, but a little tedious to write. Essentially, you have to override a class which provides methods that are called when certain 'events' occur in the source XML document. In the xml.sax.handler module in the standard library, this class is ContentHandler. These methods include:
startElement
endElement
characters
Your overridden methods then determine how to handle those events. In a typical implementation of startElement(name, attrs) you might test the name argument to determine the tag name of the element. You might then maintain a stack of elements you have entered. When endElement(name) occurs, you might pop the top element off that stack, and possibly do some processing on the completed element. characters(content) is called when character data is encountered in the source document; in this method you might build up a string of the character data which can then be processed when you encounter an endElement.
So for your specific task, something like this may work:
from xml.sax import parse
from xml.sax.handler import ContentHandler

class filter2015(ContentHandler):
    def __init__(self):
        self.elements = []           # stack of elements
        self.char_data = u''         # string buffer
        self.current_customer = u''  # name of customer
        self.current_year = u''

    def startElement(self, name, attrs):
        if name == u'Name':
            self.elements.append(u'Name')
        if name == u'Year':
            self.elements.append(u'Year')

    def characters(self, chars):
        if len(self.elements) > 0 and self.elements[-1] in [u'Name', u'Year']:
            self.char_data += chars

    def endElement(self, name):
        if len(self.elements) > 0:
            self.elements.pop()
        if name == u'Name':
            self.current_customer = self.char_data
            self.char_data = ''
        if name == u'Year':
            self.current_year = self.char_data
            self.char_data = ''
        if name == 'Customer':
            # wait to check the year until the Customer is closed
            if self.current_year == u'2015':
                print 'Found:', self.current_customer
            # clear the buffers now that the Customer is finished
            self.current_year = u''
            self.current_customer = u''
            self.char_data = u''

source = open('test.xml')
parse(source, filter2015())
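If you would rather collect the matches in a text file than print them, one sketch building on the class above (the output filename is just an example) is to subclass the handler and write each match before the buffers are cleared:
class filter2015_to_file(filter2015):
    """Same handler, but appends matching customer names to a text file."""
    def __init__(self, outfile):
        filter2015.__init__(self)
        self.outfile = outfile

    def endElement(self, name):
        # write the match before the parent method clears the buffers
        if name == 'Customer' and self.current_year == u'2015':
            self.outfile.write(self.current_customer.encode('utf-8') + '\n')
        filter2015.endElement(self, name)

with open('customers_2015.txt', 'w') as out:
    parse(open('test.xml'), filter2015_to_file(out))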

Check out this question; the approach there lets you work through the document as a generator:
python: is there an XML parser implemented as a generator?
You want to use a generator so that you don't load the entire doc into memory first.
Specifically:
import xml.etree.cElementTree as ET

for event, element in ET.iterparse('huge.xml'):
    if event == 'end' and element.tag == 'ticket':
        pass  # process ticket...
Source: http://enginerds.craftsy.com/blog/2014/04/parsing-large-xml-files-in-python-without-a-billion-gigs-of-ram.html
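Adapted to the structure in the question (a sketch; the input and output filenames are just examples), and calling element.clear() on each finished Customer so that the tree iterparse builds up doesn't grow to the size of the whole file:
import xml.etree.cElementTree as ET

with open('customers_2015.txt', 'w') as out:
    for event, element in ET.iterparse('test.xml'):
        if event == 'end' and element.tag == 'Customer':
            if element.findtext('Year') == '2015':
                out.write(element.findtext('Name').encode('utf-8') + '\n')
            # free the memory held by this record before moving on
            element.clear()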

Related

Parser that preserves comments and recovers from errors

I'm working on a GUI editor for a proprietary config format. Basically the editor will parse the config file, display the object properties so that users can edit them from the GUI, and then write the objects back to the file.
I've got the parse - edit - write part done, except for:
The parsed data structure only includes object property information, so comments and whitespace are lost on write
If there is any syntax error, the rest of the file is skipped
How would you address these issues? What is the usual approach to this problem? I'm using Python and the Parsec module (https://pythonhosted.org/parsec/documentation.html), however any help and general direction is appreciated.
I've also tried Pylens (https://pythonhosted.org/pylens/), which is really close to what I need except it cannot skip syntax errors.
You asked about typical approaches to this problem. Here are two projects which tackle similar challenges to the one you describe:
sketch-n-sketch: "Direct manipulation" interface for vector images, where you can either edit the image-describing source language, or edit the image it represents directly and see those changes reflected in the source code. Check out the video presentation, it's super cool.
Boomerang: Using lenses to "focus" on the abstract meaning of some concrete syntax, alter that abstract model, and then reflect those changes in the original source.
Both projects have yielded several papers describing the approaches their authors took. As far as I can tell, the Lens approach is popular, where parsing and printing become the get and put functions of a Lens which takes some source code and focuses on the abstract concept which that code describes.
Eventually I ran out of research time and had to settle for a rather manual form of skipping. Basically, each time the parser fails we try to advance the cursor one character and repeat. Any part skipped by this process, regardless of whether it is whitespace, a comment or a syntax error, is dumped into a Text structure. The code is quite reusable, except that you have to incorporate it at every place where results may repeat and the original parser may fail.
Here's the code, in case it helps anyone. It is written for Parsy.
from parsy import Parser, Result

class Text(object):
    '''Structure to contain all the parts that the parser does not understand.
    A better name would be Whitespace.
    '''
    def __init__(self, text=''):
        self.text = text

    def __repr__(self):
        return "Text(text='{}')".format(self.text)

    def __eq__(self, other):
        return self.text.strip() == getattr(other, 'text', '').strip()

def many_skip_error(parser, skip=lambda t, i: i + 1, until=None):
    '''Repeat the original `parser`, aggregating results into `values`
    and errors into `Text`.
    '''
    @Parser
    def _parser(stream, index):
        values, result = [], None
        while index < len(stream):
            result = parser(stream, index)
            # Original parser succeeded
            if result.status:
                values.append(result.value)
                index = result.index
            # Check for end condition, effectively `manyTill` in Parsec
            elif until is not None and until(stream, index).status:
                break
            # Aggregate skipped text into the last `Text` value, or create a new one
            else:
                if len(values) > 0 and isinstance(values[-1], Text):
                    values[-1].text += stream[index]
                else:
                    values.append(Text(stream[index]))
                index = skip(stream, index)
        return Result.success(index, values).aggregate(result)
    return _parser

# Example usage
skip_error_parser = many_skip_error(original_parser)
On another note, I guess the real issue here is that I'm using a parser combinator library instead of a proper two-stage parsing process. In traditional parsing, the tokenizer would handle retaining or skipping any whitespace, comment or syntax error, making them all effectively whitespace and invisible to the parser.
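As a rough illustration of that two-stage idea (a sketch only; the #-comment syntax and punctuation set are made-up stand-ins for the real config format), a tokenizer can emit comments, whitespace and unrecognised characters as 'trivia' tokens that the parser ignores but the writer replays verbatim:
import re

TOKEN_RE = re.compile(r'''
      (?P<trivia>\s+|\#[^\n]*)       # whitespace and #-comments are kept as trivia
    | (?P<word>[A-Za-z_][\w-]*)      # identifiers / property names
    | (?P<punct>[=\{\}\[\],])        # punctuation the parser actually consumes
    | (?P<error>.)                   # anything unrecognised also survives as trivia
''', re.VERBOSE)

def tokenize(text):
    """Yield (kind, value) pairs; trivia and errors survive so they can be written back."""
    for match in TOKEN_RE.finditer(text):
        yield (match.lastgroup, match.group())
The parser then looks only at the word and punct tokens, while the writer interleaves the trivia back in when serializing, so comments and formatting round-trip unchanged.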

Create classes that grab youtube queries and display information using Python

I want to use the urllib module to send HTTP requests and grab data. I can get the data by using the urlopen() function, but I'm not really sure how to incorporate it into classes. I really need help with the Query class to move forward. From the query I need to pull:
• Top Rated
• Top Favorites
• Most Viewed
• Most Recent
• Most Discussed
My issue is, I can't parse the XML document to retrieve this data. I also don't know how to use classes to do it.
Here is what I have so far:
import urllib  # this allows the program to send HTTP requests and to read the responses

class Query:
    '''Performs the actual HTTP requests and initial parsing to build the Video
    objects from the response. It will also calculate the following information
    based on the video and user results.'''

    def __init__(self, feed_id, max_results):
        '''Takes as input the type of query (feed_id) and the maximum number of
        results (max_results) that the query should obtain. The correct HTTP
        request must be constructed and submitted. The results are converted
        into Video objects, which are stored within this class.
        '''
        self.feed = feed_id
        self.max = max_results
        top_rated = urllib.urlopen("http://gdata.youtube.com/feeds/api/standardfeeds/top_rated")
        results_str = top_rated.read()
        splittedlist = results_str.split('<entry')
        top_rated.close()

    def __str__(self):
        '''Prints out the information on each video and YouTube user.'''
        pass

class Video:
    pass

class User:
    pass

# main function: this handles all the user inputs and stuff.
def main():
    useinput = raw_input('''Welcome to the YouTube text-based query application.
You can select a popular feed to perform a query on and view statistical
information about the related videos and users.
1) today
2) this week
3) this month
4) since youtube started
Please select a time (or 'Q' to quit):''')
    secondinput = raw_input("\n1) Top Rated\n2) Top Favorited\n3) Most Viewed\n4) Most Recent\n5) Most Discussed\n\nPlease select a feed (or 'Q' to quit):")
    thirdinput = raw_input("Enter the maximum number of results to obtain:")

main()

toplist = []
top_rated = urllib.urlopen("http://gdata.youtube.com/feeds/api/standardfeeds/top_rated")
result_str = top_rated.read()
top_rated.close()
splittedlist = result_str.split('<entry')
x = splittedlist[1].find('title')         # find the title index
splittedlist[1][x: x+75]                  # string around the title (/ marks the end of the title)
w = splittedlist[1][x: x+75].find(">")    # gives you the start index
z = splittedlist[1][x: x+75].find("<")    # gives you the end index
titles = splittedlist[1][x: x+75][w+1:z]  # gives you the title!!!!
toplist.append(titles)
print toplist
I assume that your challenge is parsing XML.
results_str = top_rated.read()
splittedlist = results_str.split('<entry')
And I see you are using string functions to parse XML. Such functions, based on finite automata (regular languages), are NOT suited to parsing context-free languages such as XML. Expect this to break very easily.
For more reasons, please refer to RegEx match open tags except XHTML self-contained tags.
Solution: consider using an XML parser like ElementTree. It comes with Python and allows you to browse the XML tree pythonically. http://effbot.org/zone/element-index.htm
You may come up with code like:
import xml.etree.ElementTree as ET
...
results_str = top_rated.read()
root = ET.fromstring(results_str)
for node in root:
    print node
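To pull out the entry titles specifically (continuing from root above; a sketch that assumes the GData feed is an Atom document, so the tags live in the Atom namespace):
ATOM = '{http://www.w3.org/2005/Atom}'

titles = [entry.findtext(ATOM + 'title')
          for entry in root.findall(ATOM + 'entry')]
print titles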
I also don't know how to use classes to do it.
Don't be in a rush to create classes :-)
In the above example, you are importing a module, not importing a class and instantiating/initializing it as you would in Java. Python has powerful primitive types (dictionaries, lists) and treats modules as objects, so (IMO) you can go easy on classes.
You use classes to organize stuff, not because your teacher has indoctrinated you with "classes are good, let's have lots of them".
Basically you want to use the Query class to communicate with the API.
def __init__(self, feed_id, max_results, time):
    qs = ("http://gdata.youtube.com/feeds/api/standardfeeds/" + feed_id +
          "?max-results=" + str(max_results) + "&time=" + time)
    self.feed_id = feed_id
    self.max_results = max_results
    wo = urllib.urlopen(qs)
    result_str = wo.read()
    wo.close()
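A hypothetical bit of glue for main(), mapping the menu choices onto feed and time names (these particular strings are from memory of the long-retired GData v2 API, so treat them as assumptions):
# Inside main(), after reading the three inputs:
feeds = {'1': 'top_rated', '2': 'top_favorites', '3': 'most_viewed',
         '4': 'most_recent', '5': 'most_discussed'}
times = {'1': 'today', '2': 'this_week', '3': 'this_month', '4': 'all_time'}

query = Query(feeds[secondinput], int(thirdinput), times[useinput])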

Why is this Python method leaking memory?

This method iterates over a list of terms from the database, checks whether each term occurs in the text passed as an argument, and if one does, replaces it with a link to the search page with the term as a parameter.
The number of terms is high (about 100000), so the process is pretty slow, but this is OK since it is performed as a cron job. However, it causes the script's memory consumption to skyrocket and I can't find why:
class SearchedTerm(models.Model):

    [...]

    @classmethod
    def add_search_links_to_text(cls, string, count=3, queryset=None):
        """
        Take a list of all researched terms and search for them in the
        text. If they exist, turn them into links to the search
        page.

        This process is limited to `count` replacements maximum.

        WARNING: because the sites got different URL schemas, we don't
        provide direct links, but we inject the {% url %} tag
        so it must be rendered before display. You can use the `eval`
        tag from `libs` for this. Since they got different namespaces as
        well, we enter a generic 'namespace' and delegate to the
        template to change it to the proper one as well.

        If you have a batch process to do, you can pass a queryset
        that will be used instead of getting all searched terms at
        each call.
        """
        found = 0

        terms = queryset or cls.on_site.all()

        # to avoid duplicate searched terms being replaced twice,
        # keep a list of already linkified content;
        # also add the words we are going to insert with the link so they
        # won't match in case of multiple passes
        processed = set((u'video', u'streaming', u'title',
                         u'search', u'namespace', u'href', u'title',
                         u'url'))

        for term in terms:

            text = term.text.lower()

            # no small words, and make a
            # quick check to avoid all the rest of the matching
            if len(text) < 3 or text not in string:
                continue

            if found and cls._is_processed(text, processed):
                continue

            # match the search word with accents, for any case;
            # ensure this is not part of a word by including
            # two 'non-letter' characters on both ends of the word
            pattern = re.compile(ur'([^\w]|^)(%s)([^\w]|$)' % text,
                                 re.UNICODE|re.IGNORECASE)

            if re.search(pattern, string):
                found += 1

                # create the link string and
                # replace the word in the description;
                # use back references (\1, \2, etc.) to preserve the original
                # formatting;
                # use raw unicode strings (ur"string" notation) to avoid
                # problems with accents and escaping
                query = '-'.join(term.text.split())
                url = ur'{%% url namespace:static-search "%s" %%}' % query
                replace_with = ur'\1<a title="\2 video streaming" href="%s">\2</a>\3' % url

                string = re.sub(pattern, replace_with, string)

                processed.add(text)

                if found >= count:
                    break

        return string
You'll probably want this code as well:
class SearchedTerm(models.Model):

    [...]

    @classmethod
    def _is_processed(cls, text, processed):
        """
        Check if the text is part of an already processed string.

        We don't use `in` on the set, but `in` on each string of the set,
        to avoid substring matching that would destroy the tags.

        This is mainly a utility function so you probably won't use
        it directly.
        """
        if text in processed:
            return True

        return any(((text in string) for string in processed))
I really have only two objects holding references that could be the suspects here: terms and processed. But I can't see any reason for them not to be garbage collected.
EDIT:
I think I should say that this method is called inside a Django model method itself. I don't know if it's relevant, but here is the code:
class Video(models.Model):

    [...]

    def update_html_description(self, links=3, queryset=None):
        """
        Take a list of all researched terms and search for them in the
        description. If they exist, turn them into links to the search
        engine. Put the result into `html_description`.

        This uses `add_search_links_to_text` and therefore has the same
        limitations.

        It DOESN'T call save().
        """
        queryset = queryset or SearchedTerm.objects.filter(sites__in=self.sites.all())
        text = self.description or self.title
        self.html_description = SearchedTerm.add_search_links_to_text(text,
                                                                      links,
                                                                      queryset)
I can imagine that the automatic Python regex caching eats up some memory. But it should do so only once, whereas the memory consumption goes up at every call of update_html_description.
The problem is not just that it consumes a lot of memory; the problem is that it does not release it: every call takes about 3% of the RAM, eventually filling it up and crashing the script with 'cannot allocate memory'.
The whole queryset is loaded into memory as soon as you iterate over it; that is what eats up your memory. You want to fetch the results in chunks if the result set is that large: it may mean more hits on the database, but it will also mean a lot less memory consumption.
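Two common ways of doing that (a sketch, not the poster's exact code; process() stands in for whatever work is done per term):
# Option 1: let the database cursor stream rows instead of caching the whole result
for term in SearchedTerm.on_site.all().iterator():
    process(term)

# Option 2: fetch the queryset in fixed-size slices
CHUNK = 1000
qs = SearchedTerm.on_site.all().order_by('pk')
for start in range(0, qs.count(), CHUNK):
    for term in qs[start:start + CHUNK]:
        process(term)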
I was completely unable to find the cause of the problem, but for now I'm bypassing it by isolating the infamous snippet in a separate script (called with subprocess) that contains this method call. The memory still goes up, but of course it goes back to normal after the Python process dies.
Talk about dirty.
But that's all I got for now.
Make sure that you aren't running in DEBUG mode: with DEBUG = True, Django keeps a record of every database query it runs (django.db.connection.queries), which grows without bound in a long-running job like this one.
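If the job has to run with DEBUG on for some reason, one mitigation (a sketch) is to flush that query log periodically from within the cron job:
from django import db

# call this every so often inside the long-running loop
db.reset_queries()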
I think I should say that this method is called inside a Django model method itself.
@classmethod
Why? Why is this "class level"?
Why aren't these ordinary methods that can have ordinary scope rules and -- in the normal course of events -- get garbage collected?
In other words (in the form of an answer):
Get rid of @classmethod.

How to support recursive include when parsing xml

I'm defining an XML schema of my own which supports an additional tag, "insert_file", which when reached should insert the named text file at that point in the stream and then continue parsing:
Here is an example:
my.xml:
<xml>
Something
<insert_file name="foo.html"/>
or another
</xml>
I'm using xmlreader as follows:
import xml.sax
from StringIO import StringIO

class HtmlHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)

parser = xml.sax.make_parser()
parser.setContentHandler(HtmlHandler())
parser.parse(StringIO(html))
The question is: how do I insert the included contents directly into the parsing stream? Of course I could recursively build up the fully interpolated text by repeatedly inserting the included files, but that means I would have to parse the XML multiple times.
I tried replacing StringIO(html) with my own stream that allows inserting contents mid-stream, but it doesn't work because the SAX parser reads the stream in buffered chunks.
Update:
I did find a solution that is hackish at best. It is built on the following stream class:
class InsertReader():
    """A reader class that supports the concept of pushing another
    reader in the middle of the use of a first reader. This may
    be used for supporting insertion commands."""

    def __init__(self):
        self.reader_stack = []

    def push(self, reader):
        self.reader_stack += [reader]

    def pop(self):
        self.reader_stack.pop()

    def __iter__(self):
        return self

    def read(self, n=-1):
        """Read from the topmost stack element. Never transcends elements.
        Should it?

        The code below is a hack. It feeds only a single token back to
        the reader.
        """
        while len(self.reader_stack) > 0:
            # Return a single token
            ret_text = StringIO()
            state = 0
            while 1:
                c = self.reader_stack[-1].read(1)
                if c == '':
                    break
                ret_text.write(c)
                if c == '>':
                    break
            ret_text = ret_text.getvalue()
            if ret_text == '':
                self.reader_stack.pop()
                continue
            return ret_text
        return ''

    def next(self):
        while len(self.reader_stack) > 0:
            try:
                v = self.reader_stack[-1].next()
            except StopIteration:
                self.reader_stack.pop()
                continue
            return v
        raise StopIteration
This class creates a stream structure that restricts the number of characters returned to the user of the stream. That is, even if the XML parser calls read(16386), the class will only return bytes up to the next '>' character. Since the '>' character also signifies the end of a tag, we have the opportunity to inject our recursive include into the stream at this point.
What is hackish about this solution is the following:
Reading one character at a time from a stream is slow.
This has an implicit assumption about how the sax stream class is reading text.
This solves the problem for me, but I'm still interested in a more beautiful solution.
Have you considered using XInclude? The lxml library has built-in support for it.
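A minimal sketch of what that could look like, assuming you are willing to switch the custom insert_file tag to the standard XInclude syntax:
from lxml import etree

# my.xml rewritten with the standard XInclude namespace:
# <xml xmlns:xi="http://www.w3.org/2001/XInclude">
#   Something
#   <xi:include href="foo.html" parse="text"/>
#   or another
# </xml>

tree = etree.parse('my.xml')
tree.xinclude()  # resolves the includes in place
print(etree.tostring(tree))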

Python xml.dom.minidom removeChild whitespace problem

I'm trying to read an XML file into Python, pull certain elements out of it and then write the result back to an XML file (so basically it's the original XML file without several elements). When I use .removeChild(source) it removes the individual elements I want to remove but leaves whitespace in their stead, making the file very unreadable. I know I can still parse the file with all of the whitespace, but there are times when I need to manually alter the values of certain elements' attributes and the whitespace makes it difficult (and annoying) to do this. I can certainly remove the whitespace by hand, but if I have dozens of these XML files that's not really feasible.
Is there a way to do .removeChild and have it remove the whitespace as well?
Here's what my code looks like:
dom = parse(filename)
main = dom.childNodes[0]
sources = main.getElementsByTagName("source")

for source in sources:
    name = source.getAttribute("name")
    spatialModel = source.getElementsByTagName("spatialModel")
    val1 = float(spatialModel[0].getElementsByTagName("parameter")[0].getAttribute("value"))
    val2 = float(spatialModel[0].getElementsByTagName("parameter")[1].getAttribute("value"))
    if angsep(val1, val2, X, Y) >= ROI:
        main.removeChild(source)
    else:
        print name, val1, val2, angsep(val1, val2, X, Y)

f = open(outfile, "w")
f.write("<?xml version=\"1.0\" ?>\n")
f.write(dom.saveXML(main))
f.close()
Thanks much for the help.
If you have PyXML installed you can use xml.dom.ext.PrettyPrint()
I couldn't figure out how to do this using xml.dom.minidom, so I just wrote a quick snippet to read the output file back in, drop all blank lines and write the result to a new file:
import re

w = open('src_model.xml', 'w')
empty = re.compile('^$')
for line in open(xmlfile).readlines():
    if empty.match(line):
        continue
    else:
        w.write(line)
w.close()
This works well enough for me :)
…for people searching:
This funny snippet
skey = lambda x: getattr(x, "tagName", None)
mainnode.childNodes = sorted(
    [n for n in mainnode.childNodes if n.nodeType != n.TEXT_NODE],
    cmp=lambda x, y: cmp(skey(y), skey(x)))
removes all text nodes (and also reverse-sorts the remaining children by tag name).
I.e. you can (recursively) do tr.childNodes = [recurseclean(n) for n in tr.childNodes if n.nodeType != n.TEXT_NODE] to remove all text nodes.
Or you might want to do something like … if n.nodeType != n.TEXT_NODE or not re.match(r'^[:whitespace:]*$', n.data, re.MULTILINE) (didn't try that one myself) if you need to keep text nodes that contain some data. Or something more complex to leave text inside specific tags.
After that tree.toprettyxml(…) will return well-formatted XML text.
I know that this question is quite dated, but since it took me a while to figure out the different approaches to the problem, here are my solutions:
The best way I found is indeed using lxml:
from lxml import etree

root = etree.fromstring(data)
# for tag in root.iter('tag') doesn't cope with namespaces...
for tag in root.xpath('//*[local-name() = "tag"]'):
    tag.getparent().remove(tag)
data = etree.tostring(root, encoding='utf-8', pretty_print=True)
With minidom it's a bit more convoluted, due to the fact that every node is accompanied by a trailing whitespace node:
import xml.dom.minidom

dom = xml.dom.minidom.parseString(data)
for tag in dom.getElementsByTagName('tag'):
    if tag.nextSibling \
            and tag.nextSibling.nodeType == tag.TEXT_NODE \
            and tag.nextSibling.data.isspace():
        tag.parentNode.removeChild(tag.nextSibling)
    tag.parentNode.removeChild(tag)
data = dom.documentElement.toxml(encoding='utf-8')
