Python: Effective replacing of substring

I have code like this:
def escape_query(query):
    special_chars = ['\\','+','-','&&','||','!','(',')','{','}','[',']',
                     '^','"','~','*','?',':']
    for character in special_chars:
        query = query.replace(character, '\\%s' % character)
    return query
This function should escape every occurrence of each substring in special_chars (note && and ||) with a backslash.
I think my approach is pretty ugly, and I keep wondering whether there is a better way to do this. Answers should be limited to the standard library.

Using reduce:
from functools import reduce  # reduce is a builtin on Python 2; Python 3 needs this import
def escape_query(query):
    special_chars = ['\\','+','-','&&','||','!','(',')','{','}','[',']',
                     '^','"','~','*','?',':']
    return reduce(lambda q, c: q.replace(c, '\\%s' % c), special_chars, query)
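A quick check of the replace-in-a-loop approach, as a rough sketch. One detail that is easy to miss: the backslash has to stay first in the list, otherwise the backslashes inserted for earlier entries get escaped a second time. The names SPECIAL_CHARS and escape_query_wrong_order are hypothetical and exist only for the comparison.

```python
from functools import reduce  # builtin on Python 2, import needed on Python 3

SPECIAL_CHARS = ['\\', '+', '-', '&&', '||', '!', '(', ')', '{', '}', '[', ']',
                 '^', '"', '~', '*', '?', ':']

def escape_query(query):
    # Backslash is handled first, so later insertions are left alone.
    return reduce(lambda q, c: q.replace(c, '\\' + c), SPECIAL_CHARS, query)

def escape_query_wrong_order(query):
    # Hypothetical variant: backslash moved to the end of the list,
    # so it re-escapes the backslashes added by the earlier replacements.
    reordered = SPECIAL_CHARS[1:] + SPECIAL_CHARS[:1]
    return reduce(lambda q, c: q.replace(c, '\\' + c), reordered, query)

print(escape_query('a+b'))              # a\+b
print(escape_query_wrong_order('a+b'))  # a\\+b
```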

The following code works on exactly the same principle as steveha's.
But I think it meets your requirement of clarity and maintainability, since the special chars are still listed in the same list as yours.
import re
special_chars = ['\\','+','-','&&','||','!','(',')','{','}','[',']',
                 '^','"','~','*','?',':']
escaped_special_chars = map(re.escape, special_chars)
# 'X'.join('()') wraps X in parentheses, giving a single capturing group
special_chars_pattern = '|'.join(escaped_special_chars).join('()')
def escape_query(query, reg=re.compile(special_chars_pattern)):
    return reg.sub(r'\\\1', query)
With this code:
when the function definition is executed, the default value (the regex object returned by re.compile(special_chars_pattern)) is created and bound to the parameter reg.
This happens only once, at the moment the def statement is executed, which is normally when the module is loaded; it is not repeated on each call.
That means that each later call to the function skips this creation and assignment: the regex object already exists and is permanently stored in the function's tuple of defaults (func_defaults in Python 2, __defaults__ in Python 3).
That's useful if the function is called many times, because Python does not have to look the regex up as a global on each call, and the caller does not have to pass it in as an argument.
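To see the "evaluated once" behaviour directly, here is a minimal sketch. The helper compile_once and the function escape_plus are hypothetical names invented for this illustration; they only count how often the pattern gets compiled.

```python
import re

compile_calls = []

def compile_once(pattern):
    # Hypothetical helper: record every compilation so it can be counted.
    compile_calls.append(pattern)
    return re.compile(pattern)

def escape_plus(query, reg=compile_once(r'(\+)')):
    # reg was bound when the def statement ran, not on each call.
    return reg.sub(r'\\\1', query)

escape_plus('a+b')
escape_plus('c+d')
print(len(compile_calls))  # 1
```

The compiled object can also be inspected afterwards through escape_plus.__defaults__ (func_defaults on Python 2).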

If I understand your requirements correctly, some of the special "chars" are two-character strings (specifically: "&&" and "||"). The best way to do such an odd collection is with a regular expression. You can use a character class to match anything that is one character long, then use vertical bars to separate some alternative patterns, and these can be multi-character. The trickiest part is the backslash-escaping of chars; for example, to match "||" you need to put r'\|\|' because the vertical bar is special in a regular expression. In a character class, backslash is special and so are '-' and ']'. The code:
import re
_s_pat = r'([\\+\-!(){}[\]^"~*?:]|&&|\|\|)'
_pat = re.compile(_s_pat)
def escape_query(query):
    return re.sub(_pat, r'\\\1', query)
I suspect the above is the fastest solution to your problem possible in Python, because it pushes the work down to the regular expression machinery, which is written in C.
If you don't like the regular expression, you can make it easier to look at by using the verbose format, and compile using the re.VERBOSE flag. Then you can sprawl the regular expression across multiple lines, and put comments after any parts you find confusing.
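As a sketch of what the verbose form might look like (same pattern as above, only laid out over several lines; the comments are illustrative):

```python
import re

_pat = re.compile(r'''
    (                           # one capturing group around everything
        [\\+\-!(){}[\]^"~*?:]   # any single special character
      | &&                      # or the two-character operator &&
      | \|\|                    # or the two-character operator ||
    )
''', re.VERBOSE)

def escape_query(query):
    return _pat.sub(r'\\\1', query)

print(escape_query('a && (b || c)'))  # a \&& \(b \|| c\)
```

With re.VERBOSE, whitespace outside character classes is ignored, so a literal space in the pattern would need to be escaped or put in a class.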
Or, you can build your list of special characters, just like you already did, and run it through this function which will automatically compile a regular expression pattern that matches any alternative in the list. I made sure it will match nothing if the list is empty.
import re
def make_pattern(lst_alternatives):
    if lst_alternatives:
        temp = '|'.join(re.escape(s) for s in lst_alternatives)
        s_pat = '(' + temp + ')'
    else:
        s_pat = '(?!)' # empty negative lookahead: never matches ('$^' can still match an empty string)
    return re.compile(s_pat)
By the way, I recommend you put the string and the pre-compiled pattern outside the function, as I showed above. In your code, Python will run code on each function invocation to build the list and bind it to the name special_chars.
If you want to not put anything but the function into the namespace, here's a way to do it without any run-time overhead:
import re
def escape_query(query):
    return re.sub(escape_query.pat, r'\\\1', query)
escape_query.pat = re.compile(r'([\\+\-!(){}[\]^"~*?:]|&&|\|\|)')
The above uses the function's name to look up the attribute, which won't work if you rebind the function's name later. There is a discussion of this and a good solution here: how can python function access its own attributes?
(Note: The above paragraph replaces some stuff including a question that was discussed in the discussion comments below.)
Actually, upon further thought, I think this is cleaner and more Pythonic:
import re
_pat = re.compile(r'([\\+\-!(){}[\]^"~*?:]|&&|\|\|)')
def escape_query(query, pat=_pat):
    return re.sub(pat, r'\\\1', query)
del _pat # not required but you can do it
At the time escape_query() is defined, the object bound to the name _pat will be bound to a name inside the function's name space (that name is pat). Then you can use del to unbind the name _pat if you like. This nicely encapsulates the pattern inside the function, does not depend at all on the function's name, and allows you to pass in an alternate pattern if you wish.
P.S. If your special characters were always a single character long, I would use the code below:
_special = set(['[', ']', '\\', '+']) # add other characters as desired, but only single chars
def escape_query(query):
    return ''.join('\\' + ch if (ch in _special) else ch for ch in query)

Not sure if this is any better, but it works and is probably faster.
def escape_query(query):
    special_chars = ['\\','+','-','&&','||','!','(',')','{','}','[',']', '^','"','~','*','?',':']
    # first pass: escape the single-character entries
    query = "".join(map(lambda x: "\\%s" % x if x in special_chars else x, query))
    # second pass: escape the multi-character entries (&& and ||)
    for sc in filter(lambda x: len(x) > 1, special_chars):
        query = query.replace(sc, "\\%s" % sc)
    return query

Related

Python- find substring and then replace all characters within it

Let's say I have this string :
<div>Object</div><img src=#/><p> In order to be successful...</p>
I want to substitute every letter between < and > with a #.
So, after some operation, I want my string to look like:
<###>Object<####><##########><#> In order to be successful...<##>
Notice that every character between the two symbols was replaced with # (including whitespace).
This is the closest I could get:
r = re.sub('<.*?>', '<#>', string)
The problem with my code is that all characters between < and > are replaced by a single #, whereas I would like every individual character to be replaced by a #.
I tried a mixture of various back references, but to no avail. Could someone point me in the right direction?

What about...:
def hashes(mo):
    replacing = mo.group(1)
    return '<{}>'.format('#' * len(replacing))
and then
r = re.sub(r'<(.*?)>', hashes, string)
The ability to use a function as the second argument to re.sub gives you huge flexibility in building up your substitutions (and, as usual, a named def results in much more readable code than any cramped lambda -- you can use meaningful names, normal layouts, etc, etc).

The re.sub function can be called with a function as the replacement, rather than a new string. Each time the pattern is matched, the function will be called with a match object, just like you'd get using re.search or re.finditer.
So try this:
re.sub(r'<(.*?)>', lambda m: "<{}>".format("#" * len(m.group(1))), string)
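Putting the pieces together on the string from the question, as a small self-contained check (the variable name html is just a placeholder):

```python
import re

def hashes(mo):
    # mo.group(1) is the text between < and >; emit one # per character.
    return '<{}>'.format('#' * len(mo.group(1)))

html = '<div>Object</div><img src=#/><p> In order to be successful...</p>'
print(re.sub(r'<(.*?)>', hashes, html))
# <###>Object<####><##########><#> In order to be successful...<##>
```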

Function call syntax oddity

I was experimenting with something on the Python console and I noticed that it doesn't matter how many spaces you have between the function name and the (): Python still manages to call the function.
>>> def foo():
...     print 'hello'
...
>>> foo ()
hello
>>> foo         ()
hello
How is that possible? Shouldn't that raise some sort of exception?

From the Lexical Analysis documentation on whitespace between tokens:
Except at the beginning of a logical line or in string literals, the whitespace characters space, tab and formfeed can be used interchangeably to separate tokens. Whitespace is needed between two tokens only if their concatenation could otherwise be interpreted as a different token (e.g., ab is one token, but a b is two tokens).
Inverting the last sentence, whitespace is allowed between any two tokens as long as they should not instead be interpreted as one token without the whitespace. There is no limit on how much whitespace is used.
Earlier sections define what constitutes a logical line; the above applies only within a logical line. The following is legal too:
result = (foo
          ())
because the logical line is extended across newlines by the parentheses.
The call expression is a separate series of tokens from what precedes; foo is just a name to look up in the global namespace, you could have looked up the object from a dictionary, it could have been returned from another call, etc. As such, the () part is two separate tokens and any amount of whitespace in and around these is allowed.
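One way to see the tokens for yourself is the standard tokenize module. This is a sketch that assumes Python 3 (the console session in the question is Python 2); token_strings is a hypothetical helper name.

```python
import io
import tokenize

def token_strings(source):
    # Return the non-blank token strings the tokenizer produces for source.
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.string.strip()]

print(token_strings('foo()\n'))          # ['foo', '(', ')']
print(token_strings('foo        ()\n'))  # ['foo', '(', ')']
```

Both spellings produce the same three tokens, which is why the parser treats them identically.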

You should understand that
foo()
in Python is composed of two parts: foo and ().
The first one is a name that in your case is found in the globals() dictionary and the value associated to it is a function object.
An open parenthesis following an expression means that a call operation should be made. Consider for example:
def foo():
    print "Hello"
def bar():
    return foo
bar()() # Will print "Hello"
So the key point to understand is that () can be applied to whatever expression precedes it... for example mylist[i]() will get the i-th element of mylist and call it passing no arguments.
The syntax also allows optional spaces between an expression and the ( character and there's nothing strange about it. Note that also you can for example write p . x to mean p.x.

Python takes many cues from the C programming language. This is certainly one of them. Consider the fact that the following compiles:
int add(int a, int b)
{
    return a + b;
}
int main(int argc, char **argv)
{
    return add    (10, 11);
}
There can be an arbitrary number of spaces between the function name and the arguments to it.
This also holds true in other languages. Consider Ruby, for example,
def add(a, b)
  a + b
end
add (10, 11)
Sure you get a warning, but it still works and although I don't know how, I'm sure that warning could easily be suppressed.

Perfectly fine. Whitespace aside, anything placed between a symbol's name and the parentheses would change the name of the symbol (in this case, the function being called).
This is also common in various C-based languages, where padding spaces between the end of a function name and the parameter list can be done without issue.

Migrating from Python to Racket (regular expression libraries and the "Racket Way")

I'm attempting to learn Racket, and in the process am attempting to rewrite a Python filter. I have the following pair of functions in my code:
def dlv(text):
    """
    Returns True if the given text corresponds to the output of DLV
    and False otherwise.
    """
    return text.startswith("DLV") or \
           text.startswith("{") or \
           text.startswith("Best model")
def answer_sets(text):
    """
    Returns a list comprised of all of the answer sets in the given text.
    """
    if dlv(text):
        # In the case where we are processing the output of DLV, each
        # answer set is a comma-delimited sequence of literals enclosed
        # in {}
        regex = re.compile(r'\{(.*?)\}', re.MULTILINE)
    else:
        # Otherwise we assume that the answer sets were generated by
        # one of the Potassco solvers. In this case, each answer set
        # is presented as a comma-delimited sequence of literals,
        # terminated by a period, and prefixed by a string of the form
        # "Answer: #" where "#" denotes the number of the answer set.
        regex = re.compile(r'Answer: \d+\n(.*)', re.MULTILINE)
    return regex.findall(text)
From what I can tell the implementation of the first function in Racket would be something along the following lines:
(define (dlv-input? text)
  (regexp-match? #rx"^DLV|^{|^Best model" text))
Which appears to work correctly. Working on the implementation of the second function, I currently have come up with the following (to start with):
(define (answer-sets text)
  (cond
    [(dlv-input? text) (regexp-match* #rx"{(.*?)}" text)]))
This is not correct, as regexp-match* gives a list of the strings which match the regular expression, including the curly braces. Does anyone know of how to get the same behavior as in the Python implementation? Also, any suggestions on how to make the regular expressions "better" would be much appreciated.

You are very close. You simply need to add #:match-select cadr to your regexp-match* call:
(regexp-match* #rx"{(.*?)}" text #:match-select cadr)
By default, #:match-select has the value car, which returns the whole matched string. cadr selects the first group, caddr selects the second group, etc. See the regexp-match* documentation for more details.

Python: splitting a function and arguments

Here are some simple function calls in python:
foo(arg1, arg2, arg3)
func1()
Assume it is a valid function call.
Suppose I read these lines while parsing a file.
What is the cleanest way to separate the function name and the args into a list with two elements, the first a string for the function name, and the second a string for the arguments?
Desired results:
["foo", "arg1, arg2, arg3"]
["func1", ""]
I'm currently using string searches to find the first instance of "(" from the left side and the first instance of ")" from the right side, and then slicing the string at those indices, but I don't like how I am approaching the problem.

I'm currently doing something similar using regular expressions. Adapting my code to your case, the following works with the examples you provide.
import re
def explode(s):
    pattern = r'(\w[\w\d_]*)\((.*)\)$'
    match = re.match(pattern, s)
    if match:
        return list(match.groups())
    else:
        return []

If you're parsing a Python file in Python, consider using Python's parser: ast (specifically the ast.parse() call).
That said, your current approach isn't terrible (though it will break on function calls that span multiple lines). There are few completely correct approaches short of the aforementioned full parser - for instance, you could count matching parens, so that a((b,c)) would return the correct value even if there was a line break in the middle - but then that code would probably do the wrong thing when faced with a((b, "c)")), and so on.
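A minimal sketch of the ast route, assuming Python 3.8 or later (ast.get_source_segment was added in 3.8). The function name split_call is hypothetical, and the sketch does not handle a parenthesized callee such as (foo)(x).

```python
import ast

def split_call(line):
    # Parse the text as a single expression and require it to be a call.
    src = line.strip()
    node = ast.parse(src, mode='eval').body
    if not isinstance(node, ast.Call):
        raise ValueError('not a function call: %r' % line)
    name = ast.get_source_segment(src, node.func)  # e.g. "foo"
    call = ast.get_source_segment(src, node)       # e.g. "foo(arg1, arg2)"
    rest = call[len(name):].strip()                # e.g. "(arg1, arg2)"
    return [name, rest[1:-1].strip()]

print(split_call('foo(arg1, arg2, arg3)'))  # ['foo', 'arg1, arg2, arg3']
print(split_call('func1()'))                # ['func1', '']
print(split_call('a((b, "c)"))'))           # ['a', '(b, "c)")']
```

Because the parser understands string literals and nesting, the last case comes out right without any paren counting.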

Regular expression further checking

I am working on a regular expression that checks whether a string contains a function call.
My regular expression for checking that is as follows:
regex=r' \w+[\ ]*\(.*?\)*'
It successfully checks whether the string contains a function or not.
But it also grabs normal text that contains a bracketed value, such as "test (meaning of test)".
So I have to check further: if there is a space between the function name and the brackets, it should not be treated as a match. So I added another check as follows:
regex2=r'\s'
It works and can differentiate between "test()" and "test ()".
But now I have to handle another condition: if there is no space after the brackets (e.g. test()abcd), it should not be treated as a function. The regular expression should only match when the string looks like "test() abcd".
I tried several different regular expressions, but unfortunately none of them work.
One thing to mention here: when a match is found, the matched string is inserted into a list, and the second step checks only that portion of the string. Example:
String : This is a python function test()abcd
At first it checks the string for a function and finds a match with the function test(),
then it sends only "test()" to check whether there is a gap between "test" and "()".
In this last step I have to find out whether there is a gap between "test()" and "abcd". If there is a gap, it should be reported as a function match; otherwise it should be treated as a normal portion of the string.
How should I write the regular expression for such a case?
The regular expression should match in the following cases:
1.test() abc
2.test(as) abc
3.test()
and should not treat the string as a function in these cases:
1.test (a)abc
2.test ()abc

(\w+\([^)]*\))(\s+|$)
Basically, you make sure it ends with either whitespace or the end of the line.
BTW the kiki tool is very useful for debugging Python re: http://code.google.com/p/kiki-re/
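A quick way to check the pattern against the cases from the question, as a sketch (func_re and the extra case test()abcd are additions for illustration):

```python
import re

func_re = re.compile(r'(\w+\([^)]*\))(\s+|$)')

cases = ['test() abc', 'test(as) abc', 'test()',    # should match
         'test (a)abc', 'test ()abc', 'test()abcd']  # should not match

results = dict((text, bool(func_re.search(text))) for text in cases)
for text in cases:
    print('%-14s %s' % (text, results[text]))
```

The first three strings are reported as matches and the last three are not, which agrees with the requirements listed in the question.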

regex=r'\w+\([\w,]*\)(?:\s+|$)'

I have solved the problem. At first I just checked for strings that have "()" using the regular expression:
regex = r' \w+[\ ]*\(.*?\)\w*'
Then, to check both the space between the function name and the brackets and the gap after the brackets, I used the following function with a regular expression:
def testFunction(self, func):
    a = " "
    func = str(func).strip()
    if a in func:
        return False
    else:
        a = re.findall(r'\w+[\ ]*', func)
        j = len(a)
        if j <= 1:
            return True
        else:
            return False
So it can now differentiate between "test() abc" and "test()abc".
Thanks
