Efficient way to do a large number of search/replaces in Python? - python

I'm fairly new to Python, and am writing a series of scripts to convert between some proprietary markup formats. I'm iterating line by line over files and then doing a large number (100-200) of substitutions that basically fall into 4 categories:
line = line.replace("-","<EMDASH>") # Replace single character with tag
line = line.replace("<\\#>","#") # tag with single character
line = line.replace("<\\n>","") # remove tag
line = line.replace("\xe1","•") # replace non-ascii character with entity
The str.replace() method seems to be pretty efficient (fairly low in the numbers when I examine profiling output), but is there a better way to do this? I've seen re.sub() used with a function as an argument, but am unsure whether that would be better; I guess it depends on what kind of optimizations Python does internally. Thought I would ask for some advice before creating a large dict that might not be very helpful!
Additionally I do some parsing of tags (that look somewhat like HTML, but are not HTML). I identify tags like this:
m = re.findall('(<[^>]+>)',line)
And then do ~100 search/replaces (mostly removing matches) within the matched tags as well, e.g.:
m = re.findall('(<[^>]+>)', line)
for tag in m:
    tag_new = re.sub(r"\*t\([^\)]*\)", "", tag)
    tag_new = re.sub(r"\*p\([^\)]*\)", "", tag_new)
    # do many more searches...
    if tag != tag_new:
        line = line.replace(tag, tag_new, 1) # potentially problematic
Any thoughts of efficiency here?
Thanks!

str.replace() is more efficient if you're doing basic literal search-and-replaces, and re.sub is (obviously) more efficient if you need complex pattern matching (because otherwise you'd have to use str.replace several times).
I'd recommend you use a combination of both. If you have several patterns that all get replaced by one thing, use re.sub. If you just have some cases where you just need to replace one specific tag with another, use str.replace.
You can also improve efficiency by working on larger strings (call re.sub once on the whole file instead of once per line). This increases memory use, which shouldn't be a problem unless the file is huge, and improves execution time.
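If many of the substitutions are plain literals, one way to combine them into a single pass is to build one regex from a dict and let a function argument pick the replacement. A sketch; the mapping below reuses a few substitutions from the question for illustration, not the asker's real table:

```python
import re

# Illustrative mapping of literal search strings to replacements.
replacements = {
    "-": "<EMDASH>",
    "<\\#>": "#",
    "<\\n>": "",
    "\xe1": "•",
}

# Build one alternation of the escaped literals; longer keys first so
# overlapping literals are matched greedily.
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(replacements, key=len, reverse=True))
)

def replace_all(line):
    # The function argument receives each match object and looks up
    # its replacement in the dict.
    return pattern.sub(lambda m: replacements[m.group(0)], line)

print(replace_all("a-b<\\#>c<\\n>"))  # a<EMDASH>b#c
```

This makes a single pass over each line regardless of how many substitutions are in the table, at the cost of one Python function call per match.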

If you don't actually need the regex and are just doing literal replacements, str.replace() will almost certainly be faster. But even so, your bottleneck here will be file input/output, not string manipulation.
The best solution, though, would probably be to use cStringIO.

Depending on the ratio of relevant-to-not-relevant portions of the text you're operating on (and whether or not the parts each substitution operates on overlap), it might be more efficient to try to break down the input into tokens and work on each token individually.
Since each replace() in your current implementation has to examine the entire input string, that can be slow. If you instead broke down that stream into something like...
[<normal text>, <tag>, <tag>, <normal text>, <tag>, <normal text>]
# from an original "<normal text><tag><tag><normal text><tag><normal text>"
...then you could simply look to see if a given token is a tag, and replace it in the list (and then ''.join() at the end).
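That token-based approach might look like the following sketch, which uses re.split with a capturing group so the tags are kept in the token list (the tag-cleaning rules are stand-ins based on the question, not the asker's full set):

```python
import re

# Capturing group makes re.split keep the tags in the result list.
TAG = re.compile(r'(<[^>]+>)')

def clean_tag(tag):
    # Illustrative per-tag substitutions; substitute your real rules here.
    tag = re.sub(r'\*t\([^)]*\)', '', tag)
    tag = re.sub(r'\*p\([^)]*\)', '', tag)
    return tag

def process(line):
    # Split into alternating non-tag and tag tokens, transform only the
    # tags, and rejoin at the end.
    tokens = TAG.split(line)
    return ''.join(clean_tag(t) if TAG.fullmatch(t) else t for t in tokens)

print(process('text<foo*t(x)>more<bar*p(y)>'))  # text<foo>more<bar>
```

Each substitution now only scans the (usually short) tag tokens instead of the whole line, and the risky `line.replace(tag, tag_new, 1)` step disappears entirely.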

You can pass a function object to re.sub instead of a substitution string; it takes the match object and returns the substitution. For example:
>>> r = re.compile(r'<(\w+)>|(-)')
>>> r.sub(lambda m: '(%s)' % (m.group(1) if m.group(1) else 'emdash'), '<atag>-<anothertag>')
'(atag)(emdash)(anothertag)'
Of course you can use a more complex function object, this lambda is just an example.
Using a single regex that does all the substitutions should be slightly faster than iterating over the string many times, but if a lot of substitutions are performed, the overhead of calling the function object that computes the substitution may be significant.

Related

best way to extract data using re.compile

I need to extract (a lot of) info from different text files.
I wonder if there is a shorter and more efficient way than the following:
First part: (N lines long)
N1 = re.compile(r'')
N2 = re.compile(r'')
.
Nn = re.compile(r'')
Second part: (2N lines long)
with open(filename) as f:
    for line in f:
        if N1.match(line):
            var1 = N1.match(line).group(x).strip()
        elif N2.match(line):
            var2 = N2.match(line).group(x).strip()
        elif Nn.match(line):
            varn = Nn.match(line).group(x).strip()
Do you recommend having the re.compile variables (part 1) separate from part 2? What do you use in these cases? Perhaps a function taking the regex as an argument and calling it every time?
In my case N is 30, meaning I have 90 lines just for feeding a dictionary, with very little or no logic at all.
I’m going to attempt to answer this without really knowing what you are actually doing there. So this answer might help you, or it might not.
First of all, what re.compile does is pre-compile a regular expression, so you can use it later and do not have to compile it every time you use it. This is primarily useful when you have a regular expression that is used multiple times throughout your program. But if the expression is only used a few times, then there is not really that much of a benefit to compiling it up front.
So you should ask yourself, how often the code runs that attempts to match all those expressions. Is it just once during the script execution? Then you can make your code simpler by inlining the expressions. Since you’re running the matches for each line in a file, pre-compiling likely makes sense here.
But just because you pre-compiled the expression, that does not mean that you should be sloppy and match the same expression too often. Look at this code:
if N1.match(line):
    var1 = N1.match(line).group(x).strip()
Assuming there is a match, this will run N1.match() twice. That’s an overhead you should avoid since matching expressions can be relatively expensive (depending on the expression), even if the expression is already pre-compiled.
Instead, just match it once, and then reuse the result:
n1_match = N1.match(line)
if n1_match:
    var1 = n1_match.group(x).strip()
Looking at your code, your regular expressions also appear to be mutually exclusive—or at least you only ever use the first match and skip the remaining ones. In that case, you should make sure that you order your checks so that the most common checks are done first. That way, you avoid running too many expressions that won't match anyway. Also, try to order them so that more complex expressions run less often.
Finally, you are collecting the match result in separate variables varN. At this point, I’m questioning what exactly you are doing there, since after all your if checks, you do not have a clear way of figuring out what the result was and which variable to use. At this point, it might make more sense to just collect it in a single variable, or to move specific logic within the condition bodies. But it’s difficult to tell with the amount of information you gave.
As mentioned in the re module documentation, the regexes you pass to re methods are cached; depending on the number of expressions you have, caching them yourself might not be useful.
That being said, you should make a list of your regexes, so that a simple for loop would allow you to test all your patterns.
regexes = [re.compile(p) for p in ['', '', '', '', ...]]
vars = [''] * len(regexes)
with open(filename) as f:
    for line in f:
        for i, regex in enumerate(regexes):
            match = regex.match(line)
            if match:
                vars[i] = match.group(x).strip()
                break # break here if you only want the first match for any given line
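Building on that loop, one way to avoid thirty separate varN variables is to pair each compiled pattern with a label and collect results in a dict. A sketch with made-up labels and patterns:

```python
import re

# Illustrative (label, pattern) pairs; the real ones come from your files.
patterns = [
    ("date",   re.compile(r'Date:\s*(\S+)')),
    ("author", re.compile(r'Author:\s*(.+)')),
]

def scan(lines):
    results = {}
    for line in lines:
        for name, regex in patterns:
            m = regex.match(line)  # match once, reuse the result
            if m:
                results[name] = m.group(1).strip()
                break  # first matching pattern wins for this line
    return results

print(scan(["Date: 2021-01-01", "Author: Ada Lovelace "]))
```

The list order doubles as the priority order recommended above, and afterwards you know exactly which patterns matched by which keys are present in the dict.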

Pyparsing delimited list only returns first element

Here is my code :
from pyparsing import Word, alphanums, delimitedList

l = "1.3E-2 2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser, delim='\t ')
print(grammar.parseString(l))
It returns :
['1.3E-2']
Obviously, I want both values, not just one. Any idea what is going on?
As @dawg explains, delimitedList is intended for cases where you have an expression with separating non-whitespace delimiters, typically commas. Pyparsing implicitly skips over whitespace, so in the pyparsing world, what you are really seeing is not a delimitedList but OneOrMore(realnumber). Also, parseString internally calls str.expandtabs on the provided input string unless you use the parseWithTabs=True argument. Expanding tabs to spaces helps preserve columnar alignment of data when it is in tabular form, and when I originally wrote pyparsing, this was a prevalent use case.
If you have control over this data, then you might want to use a different delimiter than <TAB>, perhaps commas or semicolons. If you are stuck with this format, but determined to use pyparsing, then use OneOrMore.
As you move forward, you will also want to be more precise about the expressions you define and the variable names that you use. The name "parser" is not very informative, and the pattern of Word(alphanums+'+-.') will match a lot of things besides valid real values in scientific notation. I understand if you are just trying to get anything working, this is a reasonable first cut, and you can come back and tune it once you get something going. If in fact you are going to be parsing real numbers, here is an expression that might be useful:
realnum = Regex(r'[+-]?\d+\.\d*([eE][+-]?\d+)?').setParseAction(lambda t: float(t[0]))
Then you can define your grammar as "OneOrMore(realnum)", which is also a lot more self-explanatory. And the parse action will convert your strings to floats at parse time, which will save you a step later when actually working with the parsed values.
Good luck!
Works if you switch to raw strings:
l = r"1.3E-2\t2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser, delim=r'\t')
print(grammar.parseString(l))
Prints:
['1.3E-2', '2.5E+1']
In general, delimitedList works with something like PDPDP, where P is the parse target and D is the delimiter or delimiting sequence.
You have delim='\t '. That is specifically a delimiter of one tab followed by one space; it is not either tab or space.

Regular expression dictionary in python

Is it possible to implement a dictionary with keys as regular expressions and actions (with parameters) as values?
for e.g.
key = "actionname 1 2", value = "method(1, 2)"
key = "differentaction par1 par2", value = "appropriate_method(par1, par2)"
The user types in the key, and I need to execute the matching method with the parameters provided as part of the user input.
It would be great if we could achieve the lookup in O(1) time; even if that's not possible, I am at least looking for solutions to this problem.
I will have a few hundred regular expressions (say 300) and matching parameterized actions to execute.
I can write a loop to achieve this, but is there any elegant way to do it without a for loop?
Related question: Hashtable/dictionary/map lookup with regular expressions
Yes, it's perfectly possible:
import re

dict = {}
dict[re.compile('actionname (\d+) (\d+)')] = method
dict[re.compile('differentaction (\w+) (\w+)')] = appropriate_method

def execute_method_for(str):
    # Match each regex on the string
    matches = (
        (regex.match(str), f) for regex, f in dict.iteritems()
    )
    # Filter out empty matches, and extract groups
    matches = (
        (match.groups(), f) for match, f in matches if match is not None
    )
    # Apply all the functions
    for args, f in matches:
        f(*args)
Of course, the values of your dictionary can be python functions.
Your matching function can try to match your string to each key and execute appropriate function if there is a match. This will be linear in time in the best case, but I don't think you can get anything better if you want to use regular expressions.
But looking at your example data, I think you should reconsider whether you need regular expressions at all. Perhaps you can just parse your input string into, e.g., <procedure-name> <parameter>+ and then look up the appropriate procedure by its name (a simple string); that can be O(1).
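For instance, a minimal sketch of that dispatch-by-name idea (the command names and handlers below are invented for illustration):

```python
# Hypothetical handlers standing in for the asker's real actions.
def add(a, b):
    return int(a) + int(b)

def greet(name):
    return "hello " + name

# Plain-string keys, so lookup is a single hash probe.
dispatch = {"add": add, "greet": greet}

def execute(user_input):
    # Split into <procedure-name> <parameter>+ ...
    name, *params = user_input.split()
    # ... then O(1) dict lookup instead of a loop over 300 regexes.
    return dispatch[name](*params)

print(execute("add 1 2"))      # 3
print(execute("greet world"))  # hello world
```

Regexes would then only be needed for commands whose parameters genuinely require pattern validation.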
Unfortunately this is not possible. You will need to iterate over the regular expressions in order to find out if they match. The lookup in the dictionary will be O(1) though (but that doesn't solve your problem).
IMHO, you are asking the WRONG QUESTION.
You ask if there's an elegant way to do this. Answer: the most elegant way is the most OBVIOUS way. Code will be read 10x to 20x as often as it's modified. Therefore, if you write something 'elegant' that's hard to read and quickly understand, you've just sabotaged the next person who has to modify it.
BETTER CODE:
Another answer here reads like this:
matches = ( (regex.match(str), f) for regex, f in dict.iteritems() )
This is roughly equivalent to:
# IMHO the 'regex' var should probably be named 'pattern', since its type is <sre.SRE_Pattern>
for pattern, func in dictname.items():
    if pattern.match(str):
        func()
But this explicit loop is hugely easier to read and understand at a glance.
I apologize (a little) if you're one of those people who is offended by code that is even slightly more wordy than you think it could be. My criterion, and Guido's as mentioned in PEP 8, is that the clearest code is the best code.

Doing multiple, successive regex replacements in Python. Inefficient?

First off: my code works. It just runs slowly, and I'm wondering if I'm missing something that would make it more efficient. I'm parsing PDFs with Python (and yes, I know that this should be avoided if at all possible).
My problem is that I have to do several rather complex regex substitutions—and when I say substitution, I really mean deletion. I have ordered them so the ones that strip out the most data run first, so that later expressions don't need to analyze too much text, but that's all I can think of to speed things up.
I'm pretty new to python and regexes, so it's very conceivable this could be done better.
Thanks for reading.
regexPagePattern = r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"
regexCleanPattern = r"(\(continued\))?((II)\d\.\d{1,2}|\d\.\d{1,2}(II)|\d\.\d{1,2})"
regexStartPattern = r".*(II)(\s)?(INDEX OF CHARTS AFFECTED)"
regexEndPattern = r"(II.)\d{1,5}\((P|T)\).*"
contentRaw = re.sub(regexStartPattern,"",contentRaw)
contentRaw = re.sub(regexEndPattern,"",contentRaw)
contentRaw = re.sub(regexPagePattern,"",contentRaw)
contentRaw = re.sub(regexCleanPattern,"",contentRaw)
I'm not sure if you do this inside of a loop. If not, the following does not apply.
If you use a pattern multiple times you should compile it using re.compile( ... ). This way the pattern is only compiled once. The speed increase should be huge. Minimal example:
>>> a="a b c d e f"
>>> re.sub(' ', '-', a)
'a-b-c-d-e-f'
>>> p=re.compile(' ')
>>> re.sub(p, '-', a)
'a-b-c-d-e-f'
Another idea: use re.split( ... ) instead of re.sub and operate on the array of resulting fragments of your data. I'm not entirely sure how it is implemented, but I think re.sub creates text fragments and merges them into one string at the end, which is expensive. After the last step you can join the array using "".join(fragments). Obviously, this method will not work if your patterns overlap somewhere.
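A sketch of that split-and-join idea, using a simplified stand-in pattern (since every substitution here is a deletion, joining the fragments with the empty string is equivalent to replacing each match with ""):

```python
import re

# Simplified stand-in for one of the deletion patterns in the question.
page_pattern = re.compile(r"Wk\d{1,2}")

def delete_matches(text):
    # Splitting on the pattern and rejoining the fragments is equivalent
    # to replacing every match with the empty string.
    fragments = page_pattern.split(text)
    return "".join(fragments)

print(delete_matches("Wk1 alpha Wk22 beta"))
```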
It would be interesting to get timing information for your program before and after your changes.
Regexes are always the last choice when trying to decode strings. So if you see another possibility to solve your problem, use that.
That said, you could use re.compile to precompile your regex patterns:
regexPagePattern = re.compile(r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})")
regexPagePattern.sub("",contentRaw)
That should speed things up a bit (a pretty nice bit ;) )

regular expression help with converting exp1^exp2 to pow(exp1, exp2)

I am converting some Matlab code to C. Currently I have some lines that raise to powers using ^, which is rather easy to handle with something along the lines of \(?(\w*)\)?\^\(?(\w*)\)?
This works fine for converting (glambda)^(galpha) using Python's sub routine: pattern.sub(r'pow(\g<1>,\g<2>)', '(glambda)^(galpha)')
My problem comes with nested parenthesis
So I have a string like:
glambdastar^(1-(1-gphi)*galpha)*(glambdaq)^(-(1-gphi)*galpha);
And I can not figure out how to convert that line to:
pow(glambdastar,(1-(1-gphi)*galpha))*pow(glambdaq,-(1-gphi)*galpha));
Unfortunately, regular expressions aren't the right tool for handling nested structures. There are some regular expressions engines (such as .NET) which have some support for recursion, but most — including the Python engine — do not, and can only handle as many levels of nesting as you build into the expression (which gets ugly fast).
What you really need for this is a simple parser. For example, iterate over the string counting parentheses and storing their locations in a list. When you find a ^ character, put the most recently closed parenthesis group into a "left" variable, then watch the group formed by the next opening parenthesis. When it closes, use it as the "right" value and print the pow(left, right) expression.
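A sketch of that kind of mini-parser, simplified to the case where both operands of ^ are parenthesized groups (the asker's glambdastar^(...) case, with a bare left operand, would need an extra rule):

```python
def read_group(expr, i):
    # expr[i] must be '('; return the group's contents (possibly containing
    # nested parentheses) and the index just past the matching ')'.
    depth = 0
    start = i + 1
    while i < len(expr):
        if expr[i] == '(':
            depth += 1
        elif expr[i] == ')':
            depth -= 1
            if depth == 0:
                return expr[start:i], i + 1
        i += 1
    raise ValueError('unbalanced parentheses')

def convert(expr):
    out = []
    i = 0
    while i < len(expr):
        if expr[i] == '(':
            group, i = read_group(expr, i)
            if i < len(expr) and expr[i] == '^':
                # (left)^(right) becomes pow(left,right)
                right, i = read_group(expr, i + 1)
                out.append('pow(%s,%s)' % (group, right))
            else:
                out.append('(%s)' % group)
        else:
            out.append(expr[i])
            i += 1
    return ''.join(out)

print(convert('(glambda)^(galpha)*(x)'))  # pow(glambda,galpha)*(x)
```

Because read_group counts nesting depth, a right operand like (1-(1-gphi)*galpha) is captured whole, which is exactly what the regex approach cannot do.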
I think you can use recursion here.
Once you figure out the Left and Right parts, pass each of those to your function again.
The base case would be that no ^ operator is found, so you will not need to add the pow() function to your result string.
The function will return a string with all the correct pow()'s in place.
I'll come up with an example of this if you want.
Nested parenthesis cannot be described by a regexp and require a full parser (able to understand a grammar, which is something more powerful than a regexp). I do not think there is a solution.
See recent discussion function-parser-with-regex-in-python (one of many similar discussions). Then follow the suggestion to pyparsing.
An alternative would be to iterate until all ^ have been exhausted, no?
Ruby code:
# assuming str contains the string of data with the expressions you wish to convert
while str.include?('^')
  str.gsub!(/(\w+)\^(\w+)/, 'pow(\1,\2)')
end
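The same iterate-until-exhausted idea in Python—a hedged sketch; like the Ruby version it only rewrites bare \w+ operands, so parenthesized operands still need a real parser. A fixed-point check makes it terminate even when some ^ characters can never be rewritten:

```python
import re

POW = re.compile(r'(\w+)\^(\w+)')

def convert_pows(expr):
    # Repeatedly rewrite a^b pairs until the string stops changing.
    while True:
        new_expr = POW.sub(r'pow(\1,\2)', expr)
        if new_expr == expr:  # fixed point: stop even if some '^' survive
            break
        expr = new_expr
    return expr

print(convert_pows("a^b * c^d"))  # pow(a,b) * pow(c,d)
```

Looping on the string content alone (as the Ruby snippet does with include?('^')) would spin forever on inputs like (glambda)^(galpha), where the pattern never matches; comparing before and after avoids that.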
