How does the SAX parser process characters? - python

I wrote a little piece of code to parse an XML file and want to print its character data, but each piece of character data seems to invoke the characters() callback function three times.
code:
def characters(self, chrs):
    if self.flag == 1:
        self.outfile.write(chrs + '\n')
xml file:
<e1>9308</e1>
<e2>865</e2>
and the output is like below, with many blank lines:

9308

865

I think it should be like:
9308
865
Why are there blank lines? I also read the doc info:
characters(self, content)
Receive notification of character data.
The Parser will call this method to report each chunk of
character data. SAX parsers may return all contiguous
character data in a single chunk, or they may split it into
several chunks; however, all of the characters in any single
event must come from the same external entity so that the
Locator provides useful information.
So SAX may process one contiguous area of character data as several fragments, and call the callback several times?

The example XML you posted is obviously not the full XML, because that would be malformed (and the SAX parser would tell you that instead of producing your output). So I'll assume that there's more to the XML than you showed us.
You need to be aware that all whitespace between XML elements is character data. So if you have something like this:
<foo>
  <bar>123</bar>
</foo>
Then you have at least 3 text nodes: one containing "\n  " (i.e. one newline, two space characters), one containing "123" and last but not least another one with "\n" (i.e. just a newline).
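For illustration, here is a minimal sketch (not your handler, and the wrapping <root> element is my assumption) that prints every chunk the parser reports, using repr() so the whitespace-only chunks become visible:
import xml.sax

class EchoHandler(xml.sax.ContentHandler):
    def characters(self, chrs):
        # every chunk is reported here, including whitespace-only ones
        print("characters called with %r" % chrs)

xml.sax.parseString(b"<root>\n  <e1>9308</e1>\n  <e2>865</e2>\n</root>", EchoHandler())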

Using self.outfile.write(chrs+'\n') you don't have a chance of seeing exactly what is happening.
Try self.outfile.write("Chrs: %r\n" % chrs)
Look up the built-in function repr() ... "%r" % foo produces the same as repr(foo); both constructs are very useful in error messages and when debugging.
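For example, assuming a chunk consisting of one newline and two spaces (which prints as an empty-looking line with your original code), the %r form makes it obvious:
>>> chrs = "\n  "
>>> print("Chrs: %r" % chrs)
Chrs: '\n  '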

So SAX may process one contiguous area of character data as several fragments, and call the callback several times?
This is obviously happening in your case - any doubt?
But your problem description is incomplete, since you did not mention exactly which parser you are using.

Related

How to read HTML file without any limit using python?

So I have an HTML file that consists of 4,574 words and 57,718 characters.
But recently, when I read it using the .read() command, it seems to hit a limit and only shows 3,004 words and 39,248 characters when I export it.
How can I read it and export it fully without any limitation?
This is my python script:
from IPython.display import FileLink, HTML
title = "Download HTML file"
filename = "data.html"
payload = open("./dendo_plot(2).html").read()
payload = payload.replace('"', '&quot;')
html = '<a download="{filename}" href="data:text/html;charset=utf-8,'+payload+'" target="_blank">{title}</a>'
print(payload)
HTML(html)
This is what I mean. Left (source file), right (exported file): you can see there is a gap between the two files.
I don't think there's a problem here, I think you are simply misinterpreting a variation in a metric between your input and output.
When you call read() on an opened file with no arguments, it reads the whole content of the file (until EOF) and puts it into memory:
To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string [...]. size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory.
From the official Python tutorial
So technically Python might be unable to read the whole file because it is too big to fit in your memory, but I strongly doubt that's what happening here.
I believe the difference in the number of characters and words you see between your input and output is because your data is changed when processed.
Look at: payload = payload.replace('"', '&quot;'). From an HTML validation point of view, both &quot; and " are the same and displayed the same (which is why you can switch them), but from a Python point of view, they are different and have different lengths:
>>> len('"')
1
>>> len('&quot;')
6
So just with this line you get a variation in your input and output.
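A quick way to see the inflation (the payload here is a made-up snippet, not your file):
payload = '<p class="x">hi</p>'
replaced = payload.replace('"', '&quot;')
print(len(payload), len(replaced))  # prints 19 29; each quote adds 5 characters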
That being said, I don't think it is very relevant to use the number of characters and words to check if two pieces of HTML are the same. Take the following example:
>>> first_html = """<div>
...  <p>Hello there</p>
... </div>"""
>>> len(first_html)
32
>>> second_html = "<div><p>Hello there</p></div>"
>>> len(second_html)
29
You would agree that both HTML snippets display the same thing, but they don't have the same number of characters. The HTML specification is quite tolerant about spaces, tabs and newlines, which is why both previous examples are treated as equivalent by an HTML parser.
About the number of words, one simple question (well, not that simple to answer ^^'): what qualifies as a word in HTML? Is it only the text displayed? Do the HTML tags count as well? If so, what about their attributes?
So to sum up, I don't think you have a real problem here, only a difference that is a problem from a certain point of view, but not from another one.

Simple parser, but not a calculator

I am trying to write a very simple parser. I read similar questions here on SO and on the Internet, but all I could find was limited to "arithmetic like" things.
I have a very simple DSL, for example:
ELEMENT TYPE<TYPE> elemName {
    TYPE<TYPE> memberName;
}
Where the <TYPE> part is optional and valid only for some types.
Following what I read, I tried to write a recursive descent parser in Python, but there are a few things that I can't seem to understand:
How do I look for tokens that are longer than 1 char?
How do I break up the text in the different parts? For example, after a TYPE I can have a whitespace or a < or a whitespace followed by a <. How do I address that?
Short answer
All your questions boil down to the fact that you are not tokenizing your string before parsing it.
Long answer
The process of parsing is actually split in two distinct parts: lexing and parsing.
Lexing
What seems to be missing in the way you think about parsing is called tokenizing or lexing. It is the process of converting a string into a stream of tokens, i.e. words. That is what you are looking for when asking "How do I break up the text in the different parts?"
You can do it by yourself by checking your string against a list of regexps using re, or you can use some well-known library such as PLY. Although, if you are using Python 3, I will be biased toward a lexing-parsing library that I wrote, which is ComPyl.
So, proceeding with ComPyl, the syntax you are looking for seems to be the following:
from compyl.lexer import Lexer

rules = [
    (r'\s+', None),
    (r'\w+', 'ID'),
    (r'< *\w+ *>', 'TYPE'),  # Will match your <TYPE> token with inner whitespace
    (r'{', 'L_BRACKET'),
    (r'}', 'R_BRACKET'),
]

lexer = Lexer(rules=rules, line_rule='\n')
# See the ComPyl doc to figure out how to proceed from here
Notice that the first rule, (r'\s+', None), is actually what solves your issue about whitespace. It basically tells the lexer to match any run of whitespace characters and ignore it. Of course, if you do not want to use a lexing tool, you can simply add a similar rule in your own re implementation, as in the sketch below.
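If you would rather stay with the standard library, here is a rough sketch of the same idea built directly on re (the token names mirror the rules above; the SEMICOLON rule is my own guess about your DSL):
import re

TOKEN_SPEC = [
    ('SKIP',      r'\s+'),        # whitespace: matched but not emitted
    ('TYPE',      r'< *\w+ *>'),  # <TYPE>, allowing inner spaces
    ('ID',        r'\w+'),
    ('L_BRACKET', r'{'),
    ('R_BRACKET', r'}'),
    ('SEMICOLON', r';'),
]
MASTER_RE = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def tokenize(text):
    for match in MASTER_RE.finditer(text):
        if match.lastgroup != 'SKIP':
            yield match.lastgroup, match.group()

source = "ELEMENT TYPE<TYPE> elemName {\n    TYPE<TYPE> memberName;\n}"
print(list(tokenize(source)))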
Parsing
You seem to want to write your own LL(1) parser, so I will be brief on that part. Just know that there exist a lot of tools that can do that for you (the PLY and ComPyl libraries offer LR(1) parsers, which are more powerful but harder to hand-write; see the difference between LL(1) and LR(1) here).
Simply notice that now that you know how to tokenize your string, the issue of How do I look for tokens that are longer than 1 char? has been solved. You are now parsing, not a stream of characters, but a stream of tokens that encapsulate the matched words.
Olivier's answer regarding lexing/tokenizing and then parsing is helpful.
However, for relatively simple cases, some parsing tools are able to handle your kind of requirements without needing a separate tokenizing step. parsy is one of those. You build up parsers from smaller building blocks - there is good documentation to help.
An example of a parser done with parsy for your kind of grammar is here: http://parsy.readthedocs.io/en/latest/howto/other_examples.html#proto-file-parser .
It is significantly more complex than yours, but shows what is possible. Where whitespace is allowed (but not required), it uses the lexeme utility (defined at the top) to consume optional whitespace.
You may need to tighten up your understanding of where whitespace is necessary and where it is optional, and what kind of whitespace you really mean.

Why is this Python script with regular expressions that slow?

The job is to read in a very large XML file line by line and store what has been already read in a string. When the string contains a full record between tags 'player' and '/player', all the values of xml tags within this record should be written to a text file as a tab separated line and the record removed from the already read chunk.
At the end of the process the unremoved part ( remainder ) should be printed, to check if all records have been properly processed and nothing remained unprocessed.
I have already this code in Perl and it runs swiftly, but I want to switch to Python.
The Python script I currently have is extremely slow.
Is Python that slow, or do I do something wrong with using the regular expressions?
import re

fh = open("players_list_xml.xml")
outf = open("players.txt", "w")
x = ""
cnt = 0
while cnt < 10000:
    line = fh.readline().rstrip()
    x += line
    mo = re.search(r"<player>(.*)</player>", x)
    while mo:
        cnt = cnt + 1
        if (cnt % 1000) == 0:
            print("processing", cnt)
        x = re.sub(re.escape(mo.group()), "", x)
        print("\t".join(re.findall(r"<[a-z]+>([^<]+)<[^>]+>", mo.group(1))), file=outf)
        mo = re.search(r"<player>(.*)</player>", x)

print("remainder", x)
outf.close()
fh.close()
Your regex is slow because of "backtracking" as you are using a "greedy" expression (this answer provides a simple Python example). Also, as mentioned in a comment, you should be using an XML parser to parse XML. Regex has never been very good for XML (or HTML).
In an attempt to explain why your specific expression is slow...
Let's assume you have three <player>...</player> elements in your XML. Your regex would start by matching the first opening <player> tag (that part is fine). Then (because you are using a greedy match) it would skip to the end of the document and start working backwards (backtracking) until it matched the last closing </player> tag. With a poorly written regex, it would stop there (all three elements would be in one match, with all the non-player elements between them as well). However, that match would obviously be wrong, so you make a few changes. Then the new regex would continue where the previous one left off, continuing to backtrack until it found the first closing </player> tag. Then it would continue to backtrack until it determined there were no additional </player> tags between the opening tag and the most recently found closing tag. Then it would repeat that process for the second set of tags and again for the third.
All that backtracking takes a lot of time. And that is for a relatively small file. In a comment you mention your files contain "more than half a million records". Ouch! I can't imagine how long that would take. And you're actually matching all elements, not just "player" elements. Then you are running a second regex against each element to check whether they are player elements. I would never expect this to be fast.
To avoid all that backtracking, you can use a "nongreedy" or "lazy" regex. For example (greatly simplified from your code):
r"<player>(.*?)</player>"
Note that the ? indicates that the previous pattern (.*) is nongreedy. In this instance, after finding the first opening <player> tag, it would then continue to move forward through the document (not jumping to the end) until it found the first closing </player> tag, and then it would be satisfied that the pattern had matched and move on to find the second occurrence (but only by searching within the document after the end of the first occurrence).
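A tiny demonstration of the difference (on a made-up two-player string, not your data):
import re

text = "<player>A</player><player>B</player>"
print(re.findall(r"<player>(.*)</player>", text))   # greedy: ['A</player><player>B']
print(re.findall(r"<player>(.*?)</player>", text))  # lazy:   ['A', 'B']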
Naturally, the nongreedy expression will be much faster. In my experience, nongreedy is almost always what you want when doing * or + matches (except for the rare cases when you don't).
That said, as stated previously, an XML parser is much better suited to parsing XML. In fact, many XML parsers offer some sort of streaming API which allows you to feed the document in pieces in order to avoid loading the entire document into memory at once (regex does not offer this advantage). I'd start with lxml and then move to some of the builtin parsers if the C dependency doesn't work for you.
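As a rough sketch of what that streaming approach might look like with lxml (the file names follow your script; the assumption that each child element of <player> holds one text field is mine):
from lxml import etree

with open("players.txt", "w") as outf:
    # iterparse fires an "end" event for each <player> element as it is parsed,
    # so the whole document never has to be held in memory at once
    for event, elem in etree.iterparse("players_list_xml.xml", tag="player"):
        fields = [child.text or "" for child in elem]
        print("\t".join(fields), file=outf)
        elem.clear()  # release the element once its data has been written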
With XML parser:
import xml.parsers.expat

cnt = 0
state = "idle"
current_key = ""
current_value = ""
fields = []

def start_element(name, attrs):
    global state
    global current_key
    global current_value
    global fields
    if name == "player":
        state = "player"
    elif state == "player":
        current_key = name

def end_element(name):
    global state
    global current_key
    global current_value
    global fields
    global cnt
    if state == "player":
        if name == "player":
            state = "idle"
            line = "\t".join(fields)
            print(line, file=outf)
            fields = []
            cnt += 1
            if (cnt % 10000) == 0:
                print(cnt, "players processed")
        else:
            fields.append(current_value)
            current_key = ""
            current_value = ""

def char_data(data):
    global state
    global current_key
    global current_value
    if state == "player" and not current_key == "":
        current_value = data

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data

fh = open("players_list_xml.xml")
outf = open("players.txt", "w")

line = True
while (cnt < 1000000) and line:
    line = fh.readline().rstrip()
    p.Parse(line)

outf.close()
fh.close()
This is quite an amount of code.
At least this produces a 29MB text file from the original XML, whose size seems right.
The speed is reasonable, though this is a simplistic version; more processing is needed on the records.
At the end of the day it seems that a Perl script using only regexes works at the speed of a dedicated XML parser, which is remarkable.
The correct answer as everyone else has said is to use an XML parser to parse XML.
The answer to your question about why it's so much slower than your Perl version is that, for some reason, Python's regular expressions are simply slow, much slower than Perl's at handling the same expression. I often find that code that uses regexps is more than twice as fast in Perl.

Python regex to find characters unsupported by XML 1.0 returns no results

I'm writing a Python 3.2 script to find characters in a Unicode XML-formatted text file which aren't valid in XML 1.0. The file itself isn't XML 1.0, so it could easily contain characters supported in 1.1 and later, but the application which uses it can only handle characters valid in XML 1.0 so I need to find them.
XML 1.0 doesn't support any characters in the range \u0001-\u0020, except for \u0009, \u000A, \u000D, and \u0020. Above that, \u0021-\uD7FF and \u010000-\u10FFFF are also supported ranges, but nothing else. In my Python code, I define that regex pattern this way:
re.compile("[^\u0009\u000A\u000D\u0020\u0021-\uD7FF\uE000-\uFFFD\u010000-\u10FFFF]")
However, the code below isn't finding a known bad character in my sample file (\u0007, the 'bell' character). Unfortunately I can't provide a sample line (proprietary data).
I think the problem is in one of two places: Either a bad regex pattern, or how I'm opening the file and reading in lines—i.e. an encoding problem. I could be wrong, of course.
Here's the relevant code snippet.
processChunkFile() takes three parameters: chunkfile is an absolute path to a file (a 'chunk' of 500,000 lines of the original file, in this case) which may or may not contain a bad character. outputfile is an absolute path to an optional, pre-existing file to write output to. verbose is a boolean flag to enable more verbose command-line output. The rest of the code is just getting command-line arguments (using argparse) and breaking the single large file up into smaller files. (The original file's typically larger than 4GB, hence the need to 'chunk' it.)
def processChunkFile(chunkfile, outputfile, verbose):
    """
    Processes a given chunk file, looking for XML 1.0 chars.
    Outputs any line containing such a character.
    """
    badlines = []
    if verbose:
        print("Processing file {0}".format(os.path.basename(chunkfile)))
    # open given chunk file and read it as a list of lines
    with open(chunkfile, 'r') as chunk:
        chunklines = chunk.readlines()
    # check to see if a line contains a bad character;
    # if so, add it to the badlines list
    for line in chunklines:
        if badCharacterCheck(line, verbose) == True:
            badlines.append(line)
    # output to file if required
    if outputfile is not None:
        with open(outputfile.encode(), 'a') as outfile:
            for badline in badlines:
                outfile.write(str(badline) + '\n')
    # return list of bad lines
    return badlines

def badCharacterCheck(line, verbose):
    """
    Use regular expressions to seek characters in a line
    which aren't supported in XML 1.0.
    """
    invalidCharacters = re.compile("[^\u0009\u000A\u000D\u0020\u0021-\uD7FF\uE000-\uFFFD\u010000-\u10FFFF]")
    matches = re.search(invalidCharacters, line)
    if matches:
        if verbose:
            print(line)
            print("FOUND: " + matches.groups())
        return True
    return False
\u010000
Python \u escapes are four digits only, so that's U+0100 followed by two U+0030 Digit Zero characters. Use the capital-U escape with eight digits for characters outside the BMP:
\U00010000-\U0010FFFF
Note that this, and your expression in general, won't work on 'narrow builds' of Python, where strings are based on UTF-16 code units and characters outside the BMP are handled as two surrogate code units. (Narrow builds were the default for Windows. Thankfully they go away with Python 3.3.)
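On Python 3.3+ (or any wide build), a corrected version of the pattern might look like this sketch:
import re

invalid_xml10 = re.compile(
    "[^\u0009\u000A\u000D\u0020\u0021-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

print(bool(invalid_xml10.search("ok \u0007 bell")))  # True: U+0007 is not allowed in XML 1.0
print(bool(invalid_xml10.search("plain text")))      # False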
it could easily contain characters supported in 1.1 and later
(Although XML 1.1 can only contain those characters when they're encoded as numeric character references &#...;, so the file itself may still not be well-formed.)
open(chunkfile, 'r')
Are you sure the chunkfile is encoded in locale.getpreferredencoding?
The original file's typically larger than 4GB, hence the need to 'chunk' it.
Ugh, monster XML is painful. But with sensible streaming APIs (and filesystems!) it should still be possible to handle. For example here, you could process each line one at a time using for line in chunk: instead of reading all of the chunk at once using readlines().
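For example, the reading loop from processChunkFile could become something like this sketch (same names as in the question):
with open(chunkfile, 'r') as chunk:
    for line in chunk:  # streams one line at a time instead of readlines()
        if badCharacterCheck(line, verbose):
            badlines.append(line)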
re.search(invalidCharacters, line)
As invalidCharacters is already a compiled pattern object, you can just use invalidCharacters.search(...).
Having said all that, it still matches U+0007 Bell for me.
The fastest way to remove words, characters, strings or anything between two known tags or two known characters in a string is a direct approach with re, as shown below.
var = re.sub('<script>', '<!--', var)
var = re.sub('</script>', '-->', var)
#And finally
var = re.sub('<!--.*?-->', '', var)
It removes everything in between and works faster and cleaner than Beautiful Soup for this kind of job. When using Pythonic methods with regular expressions, keep in mind that Python's regex behavior has not changed much from the regular expressions used at lower levels, so why iterate many times when a single pass can find it all as one chunk in one iteration? Do the same with individual characters as well:
var = re.sub('\[', '<!--', var)
var = re.sub('\]', '-->', var)
#And finally
var = re.sub('<!--.*?-->', '', var)  # wipes it all out from between, along with the markers
And you do not need Beautiful Soup. You can also scrape data this way if you understand how it works.

IronPython, C# and XML - break indentation?

We're using IronPython with C#, and I get different results in the console and in our application.
This code runs fine on IronPython Console:
str=[]
a = 1
b = 0
c = 1
if a==1:
    str.append('One')
if b==1:
    str.append('Two')
if c==1:
    str.append('Three')
out=','.join(str)
print out
But the same code returns an error on our application:
unexpected token 'if'
I suspect that the problem is in my newlines, because the string containing the Python code is passed through XML (XML>C#>Python):
<Set key="PythonCode" value="ipy:str=[]
a = 1
b = 0
c = 1
if a==1:
str.append('One')
if b==1:
str.append('Two')
if c==1:
str.append('Three')
out=','.join(str)"/>
Other commands return the expected results; my problem is with indented commands (conditions, loops).
As I don't have access to the C# code, I look for a way to write one-liners, or any other way not to be dependent on indentation or newlines.
I tried this:
<Set key="PythonCode" value="ipy:str=[];
a = 1;
b = 0;
c = 1;
if a==1: str.append('One');
if b==1: str.append('Two');
if c==1: str.append('Three');
out=','.join(str);"/>
But I get the same error again, because there should be a blank line after each condition.
Any ideas?
Nonsignificant whitespace in XML is not preserved:
http://www.w3.org/TR/1998/REC-xml-19980210#AVNormalize
Before the value of an attribute is passed to the application or checked for validity, the XML processor must normalize it as follows:
(...)
a whitespace character (#x20, #xD, #xA, #x9) is processed by appending #x20 to the normalized value, except that only a single #x20 is appended for a "#xD#xA" sequence that is part of an external parsed entity or the literal entity value of an internal parsed entity
(...)
If the declared value is not CDATA, then the XML processor must further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.
If you want to transmit text with significant whitespace within XML tags, you need to enclose it inside a CDATA section:
<Set key="PythonCode"><![CDATA[
YOUR CODE HERE
]]></Set>
As far as I know, you cannot use a CDATA section inside an attribute string, so you will have to change that part of your XML format to enclose the code in tags instead.
Another workaround would be to tell your XML exporter as well as your XML importer to preserve nonsignificant whitespace.
For C#, how to do this depends on which method you use to parse XML (XDocument, XmlDocument, ...); see for example:
http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.preservewhitespace(v=vs.71).aspx
http://msdn.microsoft.com/en-us/library/bb387014.aspx
http://msdn.microsoft.com/en-us/library/bb387103.aspx
But using CDATA is definitely the better solution.
What you definitely should not do is use Whython – Python For People Who Hate Whitespace.
It seems like the code that comes out of the XML has no line breaks.
If so, you have little hope of running Python code.
I have no idea how to make XML behave differently. Maybe there's something you can embed in the text which would translate to a newline (maybe \n or <br>).
The if statement can't work without newlines, even in the single-line format. This is because a single line can't have : twice.
For this program, you could replace the if statements with and:
a==1 and str.append('One')
This way your code can be a one-liner.
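For instance, the whole snippet could be collapsed like this (my own rough rewrite, untested against your application):
str = []; a = 1; b = 0; c = 1
a == 1 and str.append('One'); b == 1 and str.append('Two'); c == 1 and str.append('Three')
out = ','.join(str)
print(out)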
But if you try to take this further, you'll find it very hard to program this way.
