The job is to read in a very large XML file line by line and store what has been already read in a string. When the string contains a full record between tags 'player' and '/player', all the values of xml tags within this record should be written to a text file as a tab separated line and the record removed from the already read chunk.
At the end of the process the unremoved part ( remainder ) should be printed, to check if all records have been properly processed and nothing remained unprocessed.
I have already this code in Perl and it runs swiftly, but I want to switch to Python.
The Python script I currently have is extremely slow.
Is Python that slow, or do I do something wrong with using the regular expressions?
import re
fh=open("players_list_xml.xml")
outf=open("players.txt","w")
x=""
cnt=0
while(cnt<10000):
line=fh.readline().rstrip()
x+=line
mo=re.search(r"<player>(.*)</player>",x)
while(mo):
cnt=cnt+1
if((cnt%1000)==0):
print("processing",cnt)
x=re.sub(re.escape(mo.group()),"",x)
print("\t".join(re.findall(r"<[a-z]+>([^<]+)<[^>]+>",mo.group(1))),file=outf)
mo=re.search(r"<player>(.*)</player>",x)
print("remainder",x)
outf.close()
fh.close()
Your regex is slow because of "backtracking" as you are using a "greedy" expression (this answer provides a simple Python example). Also, as mentioned in a comment, you should be using an XML parser to parse XML. Regex has never been very good for XML (or HTML).
In an attempt to explain why your specific expression is slow...
Lets assume you have three <player>...</player> elements in your XML. Your regex would start by matching the first opening <player> tag (that part is fine). Then (because you are using a greedy match) it would skip to the end of the document and start working backwards (backtracking) until it matched the last closing </player> tag. With a poorly written regex, it would stop there (all three elements would be in one match with all non player elements between them as well). However, that match would obviously be wrong so you make a few changes. Then the new regex would continue were the previously left off by continuing to backtrack until it found the first closing </player> tag. Then it would continue to backtrack until it determined there were no additional </player> tags between the opening tag and the most recently found closing tag. Then it would repeat that process for the second set of tags and again for the third. All that backtracking takes a lot of time. And that is for a relatively small file. In a comment you mention your files contain "more than half a million records". Ouch! I can't image how long that would take. And you're actually matching all elements, not just "player" elements. Then you are running a second regex against each element to check whether they are player elements. I would never expect this to be fast.
To avoid all that backtracking, you can use a "nongreedy" or "lazy" regex. For example (greatly simplified form your code):
r"<player>(.*?)</player>"
Note that the ? indicates that the previous pattern (.*) is nongreedy. In this instance, After finding the first opening <player> tag, it would then continue to move forward through the document (not jumping to the end) until it found the first closing </player> tag and then it would be satisfied that the pattern had matched and move on to find the second occurrence (but only by searching within the document after the end of the first occurrence).
Naturally, the nongreedy expression will be much faster. In my experience, nongreedy is almost always what you want when doing * or + matches (except for the rare cases when you don't).
That said, as stated previously, an XML parser is much more suited to parsing XML. In fact, many XML parsers offer some sort of steaming API which allows you to feed the document in in pieces in order to avoid loading the entire document into memory at once (regex does not offer this advantage). I'd start with lxml and then move to some of the builtin parsers if the C dependency doesn't work for you.
With XML parser:
import xml.parsers.expat
cnt=0
state="idle"
current_key=""
current_value=""
fields=[]
def start_element(name, attrs):
global state
global current_key
global current_value
global fields
if name=="player":
state="player"
elif state=="player":
current_key=name
def end_element(name):
global state
global current_key
global current_value
global fields
global cnt
if state=="player":
if name=="player":
state="idle"
line="\t".join(fields)
print(line,file=outf)
fields=[]
cnt+=1
if((cnt%10000)==0):
print(cnt,"players processed")
else:
fields.append(current_value)
current_key=""
current_value=""
def char_data(data):
global state
global current_key
global current_value
if state=="player" and not current_key=="":
current_value=data
p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data
fh=open("players_list_xml.xml")
outf=open("players.txt","w")
line=True
while((cnt<1000000) and line):
line=fh.readline().rstrip()
p.Parse(line)
outf.close()
fh.close()
This is quite an amount of code.
At least this produces a 29MB text file from the original XML, which size seems right.
The speed is reasonable, though this is a simplistic version, more processing is needed on the records.
In the end of the day it seems that a Perl script with only regexes is working at the speed of a dedicated XML parser, which is remarkable.
The correct answer as everyone else has said is to use an XML parser to parse XML.
The answer to your question about why it's so much slower than your perl version is that for some reason python's regular expressions are just slow, much slower than perl's to handle the same expression. I often find that code that uses regexps is more than twice as fast in perl.
Related
I wrote a script in Python for custom HTML page that finds a word within a string/line and highlights just that word with use of following tags where instance is the word that is searched for.
<b><font color=\"red\">"+instance+"</font></b>
With the following result:
I need to find a word (case insensitive) let's say "port" within a string that can be port, Port, SUPPORT, Support, support etc, which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However my strings often contain 2 or more instances in single line, and I need to append
<b><font color=\"red\">"+instance+"</font></b> to each of those instances, without changing cases.
Problem with my approach, is that I am attempting to itterate over each of instances found with findall (exact match),
while multiple same matches can also be found within the string.
for instance in find_all_instances:
second_pattern = re.compile(instance)
string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking, I would be able to avoid this if I was able to find out exact part of the string that the pattern.sub substitutes at the moment of doing it,
however I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone have a way I could use to insert <b><font color="red">instance</font></b> without replacing instance for all matches(case insensitive), then I would be grateful.
Maybe I'm misinterpretting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
Okay so two ways I did quickly! The second loop is definitely the way to go. It uses re.sub (as someone else commented too). It replaces with the lowercase search term bear in mind.
import re
FILE = open("testing.txt","r")
word="port"
#THIS LOOP IS CASE SENSITIVE
for line in FILE:
newline=line.replace(word,"<b><font color=\"red\">"+word+"</font></b>")
print newline
#THIS LOOP IS INCASESENSITIVE
for line in FILE:
pattern=re.compile(word,re.IGNORECASE)
newline = pattern.sub("<b><font color=\"red\">"+word+"</font></b>",line)
print newline
I have a file I need to parse. The parsing is built incrementally, such that on each iteration the expressions becomes more case specific.
The code segment which overloads the system looks roughly like this:
for item in ret:
pat = r'a\sstyle=".+class="VEAPI_Pushpin"\sid="msftve(.+?)".+>%s<'%item[1]
r=re.compile(pat, re.DOTALL)
match = r.findall(f)
The file is a rather large HTML file (parsed from bing maps), and each answer must match its exact id.
Before appying this change the workflow was very good. Is there anything I can do to avoid this? Or to optimize the code?
My only guess is that you are getting too many matches and running out of memory. Though this doesn't seem very reasonable, it might be the case. Try using finditer instead of findall to get one match at a time without creating a monster list of matches. If that doesn't fix your problem, you might have stumbled on a more serious bug in the re module.
I wrote a little code to parse a XML file, and want to print it's characters, but each character seems invoke characters() callback function three times.
code:
def characters(self,chrs):
if self.flag==1:
self.outfile.write(chrs+'\n')
xml file:
<e1>9308</e1>
<e2>865</e2>
and the output is like below, many blank lines.
9308
865
I think it should like:
9308
865
Why there are space line? and I read the doc info:
characters(self, content)
Receive notification of character data.
The Parser will call this method to report each chunk of
character data. SAX parsers may return all contiguous
character data in a single chunk, or they may split it into
several chunks; however, all of the characters in any single
event must come from the same external entity so that the
Locator provides useful information.
so SAX will process one character area as several fragments? and callback several times?
The example XML you posted is obviously not the full XML, because that would be malformed (and the SAX parser would tell you that instead of producing your output). So I'll assume that there's more to the XML than you showed us.
You need to be aware that every whitespace between any XML elements is character data. So if you have something like that:
<foo>
<bar>123</bar>
</foo>
Then you have at least 3 text nodes: one containing "\n " (i.e. one newline, two space characters), one containing "123" and last but not least another one with "\n" (i.e. just a newline).
Using self.outfile.write(chrs+'\n') you don't have a chance of seeing exactly what is happening.
Try self.outfile.write("Chrs: %r\n" % chrs)
Look up the built-in function repr() ... "%r" % foo produces the same as repr(foo); both constructs are very useful in error messages and when debugging.
so SAX will process one character area as several fragments? and callback several times?
This is obviously happening in your case - any doubt?
But your problem description is poor since you did not mention which parser you are exactly using.
I'm fairly new to Python, and am writing a series of script to convert between some proprietary markup formats. I'm iterating line by line over files and then basically doing a large number (100-200) of substitutions that basically fall into 4 categories:
line = line.replace("-","<EMDASH>") # Replace single character with tag
line = line.replace("<\\#>","#") # tag with single character
line = line.replace("<\\n>","") # remove tag
line = line.replace("\xe1","•") # replace non-ascii character with entity
the str.replace() function seems to be pretty efficient (fairly low in the numbers when I examine profiling output), but is there a better way to do this? I've seen the re.sub() method with a function as an argument, but am unsure if this would be better? I guess it depends on what kind of optimizations Python does internally. Thought I would ask for some advice before creating a large dict that might not be very helpful!
Additionally I do some parsing of tags (that look somewhat like HTML, but are not HTML). I identify tags like this:
m = re.findall('(<[^>]+>)',line)
And then do ~100 search/replaces (mostly removing matches) within the matched tags as well, e.g.:
m = re.findall('(<[^>]+>)',line)
for tag in m:
tag_new = re.sub("\*t\([^\)]*\)","",tag)
tag_new = re.sub("\*p\([^\)]*\)","",tag_new)
# do many more searches...
if tag != tag_new:
line = line.replace(tag,tag_new,1) # potentially problematic
Any thoughts of efficiency here?
Thanks!
str.replace() is more efficient if you're going to do basic search and replaces, and re.sub is (obviously) more efficient if you need complex pattern matching (because otherwise you'd have to use str.replace several times).
I'd recommend you use a combination of both. If you have several patterns that all get replaced by one thing, use re.sub. If you just have some cases where you just need to replace one specific tag with another, use str.replace.
You can also improve efficiency by using larger strings (call re.sub once instead of once for each line). Increases memory use, but shouldn't be a problem unless the file is HUGE, but also improves execution time.
If you don't actually need the regex and are just doing literal replacing, string.replace() will almost certainly be faster. But even so, your bottleneck here will be file input/output, not string manipulation.
The best solution though would probably be to use cStringIO
Depending on the ratio of relevant-to-not-relevant portions of the text you're operating on (and whether or not the parts each substitution operates on overlap), it might be more efficient to try to break down the input into tokens and work on each token individually.
Since each replace() in your current implementation has to examine the entire input string, that can be slow. If you instead broke down that stream into something like...
[<normal text>, <tag>, <tag>, <normal text>, <tag>, <normal text>]
# from an original "<normal text><tag><tag><normal text><tag><normal text>"
...then you could simply look to see if a given token is a tag, and replace it in the list (and then ''.join() at the end).
You can pass a function object to re.sub instead of a substitution string, it takes the match object and returns the substitution, so for example
>>> r = re.compile(r'<(\w+)>|(-)')
>>> r.sub(lambda m: '(%s)' % (m.group(1) if m.group(1) else 'emdash'), '<atag>-<anothertag>')
'(atag)(emdash)(anothertag)'
Of course you can use a more complex function object, this lambda is just an example.
Using a single regex that does all the substitution should be slightly faster than iterating the string many times, but if a lot of substitutions are perfomed the overhead of calling the function object that computes the substitution may be significant.
This question is similar to "How to concisely cascade through multiple regex statements in Python" except instead of matching one regular expression and doing something I need to make sure I do not match a bunch of regular expressions, and if no matches are found (aka I have valid data) then do something. I have found one way to do it but am thinking there must be a better way, especially if I end up with many regular expressions.
Basically I am filtering URL's for bad stuff ("", \\", etc.) that occurs when I yank what looks like a valid URL out of an HTML document but it turns out to be part of a JavaScript (and thus needs to be evaluated, and thus the escaping characters). I can't use Beautiful soup to process these pages since they are far to mangled (actually I use BeautifulSoup, then fall back to my ugly but workable parser).
So far I have found the following works relatively well: I compile a dict or regular expressions outside the main loop (so I only have to compile it once, but benefit from the speed increase every time I use it), I then loop a URL through this dict, if there is a match then the URL is bad, if not the url is good:
regex_bad_url = {"1" : re.compile('\"\"'),
"2" : re.compile('\\\"')}
Followed by:
url_state = "good"
for key, pattern in regex_bad_url_components.items():
match = re.search(pattern, url)
if (match):
url_state = "bad"
if (url_state == "good"):
# do stuff here ...
Now the obvious thought is to use regex "or" ("|"), i.e.:
re.compile('(\"\"|\\\")')
Which reduces the number of compares and whatnot, but makes it much harder to trouble shoot (with one expression per compare I can easily add a print statement like:
print "URL: ", url, " matched by key ", key
So is there someway to get the best of both worlds (i.e. minimal number of compares) yet still be able to print out which regex is matching the URL, or do I simply need to bite the bullet and have my slower but easier to troubleshoot code when debugging and then squoosh all the regex's together into one line for production? (which means one more step of programming and code maintenance and possible problems).
Update:
Good answer by Dave Webb, so the actual code for this would look like:
match = re.search(r'(?P<double_quotes>\"\")|(?P<slash_quote>\\\")', fullurl)
if (match == None):
# do stuff here ...
else:
#optional for debugging
print "url matched by", match.lastgroup
"Squoosh" all the regexes into one line but put each in a named group using (?P<name>...) then use MatchOjbect.lastgroup to find which matched.