I have a file I need to parse. The parsing is built up incrementally, so that on each iteration the expressions become more case-specific.
The code segment which overloads the system looks roughly like this:
for item in ret:
    pat = r'a\sstyle=".+class="VEAPI_Pushpin"\sid="msftve(.+?)".+>%s<' % item[1]
    r = re.compile(pat, re.DOTALL)
    match = r.findall(f)
The file is a rather large HTML file (parsed from bing maps), and each answer must match its exact id.
Before applying this change the workflow was very good. Is there anything I can do to avoid this, or to optimize the code?
My only guess is that you are getting too many matches and running out of memory. Though this doesn't seem very reasonable, it might be the case. Try using finditer instead of findall to get one match at a time without creating a monster list of matches. If that doesn't fix your problem, you might have stumbled on a more serious bug in the re module.
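For example, a minimal sketch of the finditer variant (process() here is just a placeholder for whatever you do with each id; f is assumed to hold the file contents as in your snippet):

import re

for item in ret:
    pat = r'a\sstyle=".+class="VEAPI_Pushpin"\sid="msftve(.+?)".+>%s<' % item[1]
    r = re.compile(pat, re.DOTALL)
    for m in r.finditer(f):   # yields one match object at a time
        process(m.group(1))   # placeholder for your own handling of the id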
Related
I'm doing a lot of file processing where I look for one of several substrings in each line. So I have code equivalent to this:
with open(file) as infile:
    for line in infile:
        for key in MY_SUBSTRINGS:
            if key in line:
                print(key, line)
MY_SUBSTRINGS is a list of 6-20 substrings. Substrings vary in length 10-30 chars and may contain spaces.
I'd really like to find a much faster way of doing this. The files have many hundreds of thousands of lines in them and lines are typically around 150 chars. The user has to wait 30 seconds to a minute while a file processes. The loop above is not the only thing taking time, but it's taking quite a lot of it. I'm doing various other processing on a line-by-line basis, so it's not appropriate to search the whole file at once.
I've tried the regex and ahocorasick answers from here but they both come out slower in my tests:
Fastest way to check whether a string is a substring in a list of strings
Any suggestions for faster methods?
I'm not quite sure of the best way to share example datasets. A logcat off an Android phone would be an example. One that's at least 200k lines long.
Then search for 10 strings like:
(NL80211_CMD_TRIGGER_SCAN) received for
Trying to associate with
Request to deauthenticate
interface state UNINITIALIZED->ENABLED
I tried regexes like this:
match_str = "|".join(MY_SUBSTRINGS)
regex = re.compile(match_str)

with open(file) as infile:
    for line in infile:
        match = regex.search(line)
        if match:
            print(match.group(0))
I would build a regular expression to search through the file.
Make sure that you're not running each of the search terms in loops when you use regex.
If all of your search terms are combined into one regexp, it would look something like this:
import re
line = 'fsjdk abc def abc jkl'
re.findall(r'(abc|def)', line)
https://docs.python.org/3/library/re.html
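For instance, a rough sketch of building a single compiled pattern from your substring list (re.escape matters here because some of your terms contain parentheses; the list and filename below are just placeholders):

import re

MY_SUBSTRINGS = [
    "(NL80211_CMD_TRIGGER_SCAN) received for",
    "Trying to associate with",
]

# Escape each literal so characters like '(' are not treated as regex syntax,
# then compile one alternation once, outside the per-line loop.
pattern = re.compile("|".join(re.escape(s) for s in MY_SUBSTRINGS))

with open("logcat.txt") as infile:
    for line in infile:
        match = pattern.search(line)
        if match:
            print(match.group(0), line)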
If you need it to run faster still, consider processing the file concurrently. This is a much broader topic, but one method that might work is to first take a look at your problem and consider where the bottleneck might be.
If the issue is that your loop is starved for disk throughput on the read, you could first run through the file, split it into chunks, and then hand those chunks to workers that process the data like a queue.
We'd definitely need some more detail on your problem to understand exactly what kind of issue you're looking to solve, and there are people here who would love to dig into a challenge.
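As a very rough sketch of that chunking idea (the file name, search terms, and chunk size are made up; I've used processes rather than threads because CPython's GIL limits CPU-bound threading, and whether this helps at all depends on where your bottleneck really is):

import re
from concurrent.futures import ProcessPoolExecutor

MY_SUBSTRINGS = ["Trying to associate with", "Request to deauthenticate"]  # placeholder terms
PATTERN = re.compile("|".join(re.escape(s) for s in MY_SUBSTRINGS))

def scan_chunk(lines):
    # Return the lines in this chunk that contain any of the search terms.
    return [line for line in lines if PATTERN.search(line)]

def chunks(iterable, size=50000):
    # Group an iterable of lines into lists of at most `size` lines.
    buf = []
    for item in iterable:
        buf.append(item)
        if len(buf) >= size:
            yield buf
            buf = []
    if buf:
        yield buf

if __name__ == "__main__":
    with open("logcat.txt") as infile, ProcessPoolExecutor() as pool:
        for hits in pool.map(scan_chunk, chunks(infile)):
            for line in hits:
                print(line, end="")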
I wrote a script in Python for a custom HTML page that finds a word within a string/line and highlights just that word, using the following tags, where instance is the word being searched for.
<b><font color=\"red\">"+instance+"</font></b>
I need to find a word (case insensitive) let's say "port" within a string that can be port, Port, SUPPORT, Support, support etc, which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However my strings often contain two or more instances in a single line, and I need to wrap each of those instances in
<b><font color=\"red\">"+instance+"</font></b> without changing its case.
The problem with my approach is that I iterate over each instance found by findall (an exact match),
while multiple identical matches can also be found within the string.
for instance in find_all_instances:
    second_pattern = re.compile(instance)
    string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in the following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking I could avoid this if I could find out the exact part of the string that pattern.sub is substituting at the moment it does so,
but I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone has a way to insert <b><font color="red">instance</font></b> without replacing instance for all matches (case insensitive), I would be grateful.
Maybe I'm misinterpreting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
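In case the link goes stale, here is a sketch of what that re.sub approach might look like (the sample string is made up; passing a function as the replacement preserves whatever casing was actually matched):

import re

word = "port"
string_to_search = "The Port supports several ports."  # placeholder input

pattern = re.compile(re.escape(word), re.IGNORECASE)
# The lambda receives each match object, so m.group(0) is the text
# exactly as it appeared in the string (Port, SUPPORT, port, ...).
highlighted = pattern.sub(
    lambda m: '<b><font color="red">' + m.group(0) + '</font></b>',
    string_to_search)
print(highlighted)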
Okay, so here are two ways I did it quickly! The second loop is definitely the way to go. It uses re.sub (as someone else commented too). Bear in mind that it replaces each match with the lowercase search term.
import re

FILE = open("testing.txt", "r")
word = "port"

# THIS LOOP IS CASE SENSITIVE
for line in FILE:
    newline = line.replace(word, "<b><font color=\"red\">" + word + "</font></b>")
    print(newline)

# THIS LOOP IS CASE-INSENSITIVE
FILE.seek(0)  # rewind, since the first loop consumed the file
pattern = re.compile(word, re.IGNORECASE)
for line in FILE:
    newline = pattern.sub("<b><font color=\"red\">" + word + "</font></b>", line)
    print(newline)

FILE.close()
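If you want to keep the original casing of each match rather than the lowercase search term, one variation (not part of the answer above, just a sketch) is to substitute a backreference to the whole match:

newline = pattern.sub(r'<b><font color="red">\g<0></font></b>', line)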
The job is to read in a very large XML file line by line and store what has been already read in a string. When the string contains a full record between tags 'player' and '/player', all the values of xml tags within this record should be written to a text file as a tab separated line and the record removed from the already read chunk.
At the end of the process the unremoved part (the remainder) should be printed, to check that all records have been properly processed and nothing remains unprocessed.
I have already this code in Perl and it runs swiftly, but I want to switch to Python.
The Python script I currently have is extremely slow.
Is Python that slow, or do I do something wrong with using the regular expressions?
import re

fh = open("players_list_xml.xml")
outf = open("players.txt", "w")

x = ""
cnt = 0
while cnt < 10000:
    line = fh.readline().rstrip()
    x += line
    mo = re.search(r"<player>(.*)</player>", x)
    while mo:
        cnt = cnt + 1
        if (cnt % 1000) == 0:
            print("processing", cnt)
        x = re.sub(re.escape(mo.group()), "", x)
        print("\t".join(re.findall(r"<[a-z]+>([^<]+)<[^>]+>", mo.group(1))), file=outf)
        mo = re.search(r"<player>(.*)</player>", x)

print("remainder", x)
outf.close()
fh.close()
Your regex is slow because of "backtracking" as you are using a "greedy" expression (this answer provides a simple Python example). Also, as mentioned in a comment, you should be using an XML parser to parse XML. Regex has never been very good for XML (or HTML).
In an attempt to explain why your specific expression is slow...
Let's assume you have three <player>...</player> elements in your XML. Your regex would start by matching the first opening <player> tag (that part is fine). Then (because you are using a greedy match) it would skip to the end of the document and start working backwards (backtracking) until it matched the last closing </player> tag. With a poorly written regex, it would stop there (all three elements would be in one match, with all the non-player elements between them as well). However, that match would obviously be wrong, so you make a few changes. Then the new regex would continue where the previous one left off, backtracking until it found the first closing </player> tag. Then it would continue to backtrack until it determined there were no additional </player> tags between the opening tag and the most recently found closing tag. Then it would repeat that process for the second set of tags and again for the third.

All that backtracking takes a lot of time, and that is for a relatively small file. In a comment you mention your files contain "more than half a million records". Ouch! I can't imagine how long that would take. And you're actually matching all elements, not just "player" elements, then running a second regex against each element to check whether it is a player element. I would never expect this to be fast.
To avoid all that backtracking, you can use a "nongreedy" or "lazy" regex. For example (greatly simplified from your code):
r"<player>(.*?)</player>"
Note that the ? indicates that the previous pattern (.*) is nongreedy. In this instance, after finding the first opening <player> tag, it would continue to move forward through the document (not jumping to the end) until it found the first closing </player> tag, at which point it would be satisfied that the pattern had matched and move on to find the second occurrence (but only by searching the part of the document after the end of the first occurrence).
Naturally, the nongreedy expression will be much faster. In my experience, nongreedy is almost always what you want when doing * or + matches (except for the rare cases when you don't).
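A tiny demonstration of the difference on a made-up two-record input:

import re

text = "<player>Ann</player><player>Bob</player>"
print(re.findall(r"<player>(.*)</player>", text))   # greedy: ['Ann</player><player>Bob']
print(re.findall(r"<player>(.*?)</player>", text))  # nongreedy: ['Ann', 'Bob']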
That said, as stated previously, an XML parser is much better suited to parsing XML. In fact, many XML parsers offer some sort of streaming API which allows you to feed the document in piece by piece, in order to avoid loading the entire document into memory at once (regex does not offer this advantage). I'd start with lxml and then move to one of the built-in parsers if the C dependency doesn't work for you.
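For what it's worth, a hedged sketch of what the streaming approach might look like with lxml.etree.iterparse (the file names match your script, but the field handling is an assumption about your data, not tested against it):

from lxml import etree

with open("players.txt", "w") as outf:
    # iterparse streams the document and fires an event as each <player> element closes.
    for event, elem in etree.iterparse("players_list_xml.xml", tag="player"):
        fields = [child.text or "" for child in elem]
        print("\t".join(fields), file=outf)
        # Free the memory for the element we just processed.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]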
With an XML parser:
import xml.parsers.expat

cnt = 0
state = "idle"
current_key = ""
current_value = ""
fields = []

def start_element(name, attrs):
    global state
    global current_key
    global current_value
    global fields
    if name == "player":
        state = "player"
    elif state == "player":
        current_key = name

def end_element(name):
    global state
    global current_key
    global current_value
    global fields
    global cnt
    if state == "player":
        if name == "player":
            state = "idle"
            line = "\t".join(fields)
            print(line, file=outf)
            fields = []
            cnt += 1
            if (cnt % 10000) == 0:
                print(cnt, "players processed")
        else:
            fields.append(current_value)
            current_key = ""
            current_value = ""

def char_data(data):
    global state
    global current_key
    global current_value
    if state == "player" and current_key != "":
        current_value = data

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data

fh = open("players_list_xml.xml")
outf = open("players.txt", "w")

line = True
while (cnt < 1000000) and line:
    line = fh.readline().rstrip()
    p.Parse(line)

outf.close()
fh.close()
This is quite a lot of code.
At least it produces a 29MB text file from the original XML, which seems about the right size.
The speed is reasonable, though this is a simplistic version; more processing is needed on the records.
At the end of the day it seems that a Perl script using only regexes works at the speed of a dedicated XML parser, which is remarkable.
The correct answer as everyone else has said is to use an XML parser to parse XML.
The answer to your question about why it's so much slower than your Perl version is that, for some reason, Python's regular expressions are simply slower, much slower than Perl's at handling the same expression. I often find that code that uses regexps is more than twice as fast in Perl.
Solution: [a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+#[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* is a good choice.
I am using a regular expression like the below to match email addresses in a file:
email = re.search('(\w+-*[.|\w]*)*#(\w+[.])*\w+',line)
When used on a file like the following, my regular expression works well:
mlk407289715#163.com huofenggib wrong in get_gsid
mmmmmmmmmm776#163.com rouni816161 wrong in get_gsid
But when I use it on a file like below, my regular expression runs unacceptably slowly:
9b871484d3af90c89f375e3f3fb47c41e9ff22 mingyouv9gueishao#163.com
e9b845f2fd3b49d4de775cb87bcf29cc40b72529e mlb331055662#163.com
And when I use the regular expression from this website, it still runs very slowly.
I need a solution and want to know what's wrong.
That's a problem with backtracking. Read this article for more information.
You might want to split the line and work with the part containing an #:
import re

pattern = '(\w+-*[.|\w]*)*#(\w+[.])*\w+'
line = '9b871484d3af90c89f375e3f3fb47c41e9ff22 mingyouv9gueishao#163.com'

for element in line.split():
    if '#' in element:
        g = re.match(pattern, element)
        print(g.groups())
Generally when regular expressions are slow, it is due to catastrophic backtracking. This can happen in your regex because of the nested repetition in the following section:
(\w+-*[.|\w]*)*
If you can work on this section of the regex to remove the repetition from within the parentheses you should see a substantial speed increase.
However, you are probably better off just searching for an email regex and seeing how other people have approached this problem.
It's always a good idea to search StackOverflow to see if your question has already been discussed.
Using a regular expression to validate an email address
This one, from that discussion, looks like a good one to me:
[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+#[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
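A quick sanity check against the slow input from the question (the pattern is used verbatim, including the # separator that appears in your data):

import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+#[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*")

line = "e9b845f2fd3b49d4de775cb87bcf29cc40b72529e mlb331055662#163.com"
m = EMAIL_RE.search(line)
if m:
    print(m.group(0))  # no nested quantifiers, so no catastrophic backtracking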
I'm fairly new to Python, and am writing a series of scripts to convert between some proprietary markup formats. I'm iterating line by line over files and then basically doing a large number (100-200) of substitutions that fall into 4 categories:
line = line.replace("-","<EMDASH>") # Replace single character with tag
line = line.replace("<\\#>","#") # tag with single character
line = line.replace("<\\n>","") # remove tag
line = line.replace("\xe1","•") # replace non-ascii character with entity
The str.replace() function seems to be pretty efficient (it's fairly low in the numbers when I examine profiling output), but is there a better way to do this? I've seen the re.sub() method with a function as an argument, but am unsure whether this would be better. I guess it depends on what kind of optimizations Python does internally. Thought I would ask for some advice before creating a large dict that might not be very helpful!
Additionally I do some parsing of tags (that look somewhat like HTML, but are not HTML). I identify tags like this:
m = re.findall('(<[^>]+>)',line)
And then do ~100 search/replaces (mostly removing matches) within the matched tags as well, e.g.:
m = re.findall('(<[^>]+>)', line)
for tag in m:
    tag_new = re.sub("\*t\([^\)]*\)", "", tag)
    tag_new = re.sub("\*p\([^\)]*\)", "", tag_new)
    # do many more searches...
    if tag != tag_new:
        line = line.replace(tag, tag_new, 1)  # potentially problematic
Any thoughts of efficiency here?
Thanks!
str.replace() is more efficient if you're going to do basic search and replaces, and re.sub is (obviously) more efficient if you need complex pattern matching (because otherwise you'd have to use str.replace several times).
I'd recommend you use a combination of both. If you have several patterns that all get replaced by one thing, use re.sub. If you just have some cases where you just need to replace one specific tag with another, use str.replace.
You can also improve efficiency by using larger strings (call re.sub once instead of once for each line). Increases memory use, but shouldn't be a problem unless the file is HUGE, but also improves execution time.
If you don't actually need the regex and are just doing literal replacements, str.replace() will almost certainly be faster. But even so, your bottleneck here will be file input/output, not string manipulation.
The best solution though would probably be to use cStringIO
Depending on the ratio of relevant-to-not-relevant portions of the text you're operating on (and whether or not the parts each substitution operates on overlap), it might be more efficient to try to break down the input into tokens and work on each token individually.
Since each replace() in your current implementation has to examine the entire input string, that can be slow. If you instead broke down that stream into something like...
[<normal text>, <tag>, <tag>, <normal text>, <tag>, <normal text>]
# from an original "<normal text><tag><tag><normal text><tag><normal text>"
...then you could simply look to see if a given token is a tag, and replace it in the list (and then ''.join() at the end).
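A sketch of that tokenising idea, using the tag pattern from the question (process_tag is a stand-in for the many per-tag substitutions, not part of this answer):

import re

def process_tag(tag):
    # Stand-in for the ~100 per-tag substitutions from the question.
    tag = re.sub(r"\*t\([^\)]*\)", "", tag)
    tag = re.sub(r"\*p\([^\)]*\)", "", tag)
    return tag

def process_line(line):
    # re.split with a capturing group keeps the tag tokens in the result list.
    tokens = re.split(r"(<[^>]+>)", line)
    return "".join(process_tag(t) if t.startswith("<") else t for t in tokens)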
You can pass a function object to re.sub instead of a substitution string; it takes the match object and returns the substitution, so for example:
>>> r = re.compile(r'<(\w+)>|(-)')
>>> r.sub(lambda m: '(%s)' % (m.group(1) if m.group(1) else 'emdash'), '<atag>-<anothertag>')
'(atag)(emdash)(anothertag)'
Of course you can use a more complex function object, this lambda is just an example.
Using a single regex that does all the substitutions should be slightly faster than iterating over the string many times, but if a lot of substitutions are performed, the overhead of calling the function object that computes the substitution may be significant.
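For the purely literal replacements in the question, a hedged sketch of that single-pass approach (the mapping just reuses the four examples given above):

import re

REPLACEMENTS = {
    "-": "<EMDASH>",
    "<\\#>": "#",
    "<\\n>": "",
    "\xe1": "•",
}

# Longest keys first, so a longer literal wins when it overlaps a shorter one.
pattern = re.compile("|".join(re.escape(k) for k in
                              sorted(REPLACEMENTS, key=len, reverse=True)))

def replace_all(line):
    # Each match is one of the literal keys, so a dict lookup gives its replacement.
    return pattern.sub(lambda m: REPLACEMENTS[m.group(0)], line)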