Regular expression further checking - python

I am working with a regular expression that would check a string,Is it a function or not.
My regular expression for checking that as follows:
regex=r' \w+[\ ]*\(.*?\)*'
It succefully checks whether the string contains a function or not.
But it grabs normal string which contains firs barcket value,such as "test (meaning of test)".
So I have to check further that if there is a space between function name and that brackets that will not be caught as match.So I did another checking as follows:
regex2=r'\s'
It work successfully and can differentiate between "test()" and "test ()".
But now I have to maintain another condition that,if there is no space after the brackets(eg. test()abcd),it will not catch it as a function.The regular expression should only treat as match when it will be like "test() abcd".
But I tried using different regular expression ,unfortunately those are not working.
Her one thing to mention the checking string is inserted in to a list at when it finds a match and in second step it only check the portion of the string.Example:
String : This is a python function test()abcd
At first it will check the string for function and when find matches with function test()
then send only "test()" for whether there is a gap between "test" and "()".
In this last step I have to find is there any gap between "test()" and "abcd".If there is gap it will not show match as function otherwise as a normal portion of string.
How should I write the regular expression for such case?
The regular expression will have to show in following cases:
1.test() abc
2.test(as) abc
3.test()
will not treat as a function if:
1.test (a)abc
2.test ()abc

(\w+\([^)]*\))(\s+|$)
Bascially you make sure it ends with either spaces or end of line.
BTW the kiki tool is very useful for debugging Python re: http://code.google.com/p/kiki-re/

regex=r'\w+\([\w,]+\)(?:\s+|$)'

I have solved the problem at first I just chexked for the string that have "()"using the regular expression:
regex = r' \w+[\ ]*\(.*?\)\w*'
Then for checking the both the space between function name and brackets,also the gap after the brackets,I used following function with regular expression:
def testFunction(self, func):
a=" "
func=str(func).strip()
if a in func:
return False
else:
a = re.findall(r'\w+[\ ]*', func)
j = len(a)
if j<=1:
return True
else:
return False
So it can now differentiate between "test() abc" and "test()abc".
Thanks

Related

Python- find substring and then replace all characters within it

Let's say I have this string :
<div>Object</div><img src=#/><p> In order to be successful...</p>
I want to substitute every letter between < and > with a #.
So, after some operation, I want my string to look like:
<###>Object<####><##########><#> In order to be successful...<##>
Notice that every character between the two symbols were replaced with # ( including whitespace).
This is the closest I could get:
r = re.sub('<.*?>', '<#>', string)
The problem with my code is that all characters between < and > are replaced by a single #, whereas I would like every individual character to be replaced by a #.
I tried a mixture of various back references, but to no avail. Could someone point me in the right direction?
What about...:
def hashes(mo):
replacing = mo.group(1)
return '<{}>'.format('#' * len(replacing))
and then
r = re.sub(r'<(.*?)>', hashes, string)
The ability to use a function as the second argument to re.sub gives you huge flexibility in building up your substitutions (and, as usual, a named def results in much more readable code than any cramped lambda -- you can use meaningful names, normal layouts, etc, etc).
The re.sub function can be called with a function as the replacement, rather than a new string. Each time the pattern is matched, the function will be called with a match object, just like you'd get using re.search or re.finditer.
So try this:
re.sub(r'<(.*?)>', lambda m: "<{}>".format("#" * len(m.group(1))), string)

Django: Bad group name

I faced an error on "bad group name".
Here is the code:
for qitem in q['display']:
if qitem['type'] == 1:
for keyword in keywordTags.split('|'):
p = re.compile('^' + keyword + '$')
newstring=''
for word in qitem['value'].split():
if word[-1:] == ',':
word = word[0:len(word)-1]
newstring += (p.sub('<b>'+word+'</b>', word) + ', ')
else:
newstring += (p.sub('<b>'+word+'</b>', word) + ' ')
qitem['value']=newstring
And here's the error:
error at /result/1/
bad group name
Request Method: GET
Django Version: 1.4.1
Exception Type: error
Exception Value: bad group name
Exception Location: C:\Python27\lib\re.py in _compile_repl, line 257
Python Executable: C:\Python27\python.exe
Python Version: 2.7.3 Python
Path: ['D:\ExamPapers', 'C:\Windows\SYSTEM32\python27.zip',
'C:\Python27\DLLs', 'C:\Python27\lib',
'C:\Python27\lib\plat-win', 'C:\Python27\lib\lib-tk',
'C:\Python27', 'C:\Python27\lib\site-packages']
Server time: Sun,3 Mar 2013 15:31:05 +0800
Traceback Switch to copy-and-paste view
C:\Python27\lib\site-packages\django\core\handlers\base.py in get_response
response = callback(request, *callback_args, **callback_kwargs) ... ▶ Local vars ?
D:\ExamPapers\views.py in result
newstring += (p.sub(''+word+'', word) + ' ') ... ▶ Local vars
In summary, the error is at:
newstring += (p.sub('<b>'+word+'</b>', word) + ' ')
So you're trying to highlight in bold an occurrence of a set of keywords. Right now this code is broken in quite a lot of ways. You're using the re module right now to match the keywords but you're also breaking the keywords and the strings down into individual words, you don't need to do both and the interaction between these two different approaches to the solving the problem are what is causing you issues.
You can use regular expressions to match multiple possible strings at the same time, that's what they're good for! So instead of "^keyword$" to match just "keyword" you could use "^keyword|hello$" to match either "keyword" or "hello". You also use the ^ and $ characters which only match the beginning or end of the entire string, but what you probably wanted originally was to match the beginning or end of words, for this you can use \b like this r"\b(keyword|hello)\b". Note that in the last example I added a r character before the string, this stands for "raw" and turns off pythons usual handling of back slash characters which conflicts with regular expressions, it's good practice to always use the r before the string when the string contains a regular expression. I also used brackets to group together the words.
The regular expression sub method allows you to substitute things matched by a regular expression with another string. It also allow you to make "back references" in the replacing string that include parts of original string that matched. The parts that it includes are called "groups" and are indicated with brackets in the original regular expression, in the example above there is only one set of brackets and these are the first so they're indicated by the back reference \1. The cause of the actual error message you asked about is that your replacement string contained what looked like a backref but there weren't any groups in your regular expression.
Using that you do something like this:
keywordMatcher = re.compile(r"\b(keyword|hello)\b")
value = keywordMatcher.sub(r"<b>\1</b>", value)
Another thing that isn't directly related to what you're asking but is incredibly important is that you are taking source plain text strings (I assume) and making them into HTML, this gives a lot of chance for script injection vulnerabilities which if you don't take the time to understand and avoid will allow bad guys to hack the applications you build (they can do this in an automated way, so even if you think your app will be too small for anyone to notice it can still get hacked and used for all sorts of bad things, don't let this happen!). The basic rule is that it's ok to convert text to HTML but you need to "escape" it first, this is very simple:
from django.utils import html
html_safe = html.escape(my_text)
All this does is convert characters like < to < which the browser will show as < but won't interpret as the beginning of a tag. So if a bad guy types <script> into one of your forms and it gets processed by your code it will display it as <script> and not execute it as a script.
Likewise, if you use an text in a regular expression that you don't intend to have special regular expression characters then you must escape that too! You can do this using re.escape:
import re
my_regexp = re.compile(r"\b%s\b" % (re.escape(my_word),))
Ok, so now we've got that out of the way here is a method you could use to do what you wanted:
value = "this is my super duper testing thingy"
keywords = "super|my|test"
from django.utils import html
import re
# first we must split up the keywords
keywords = keywords.split("|")
# Next we must make each keyword safe for use in a regular expression,
# this is similar to the HTML escaping we discussed above but not to
# be confused with it.
keywords = [re.escape(k) for k in keywords]
# Now we reform the keywordTags string, but this time we know each keyword is regexp-safe
keywords = "|".join(keywords)
# Finally we create a regular expression that matches *any* of the keywords
keywordMatcher = re.compile(r'\b(%s)\b' % (keywords,))
# We are going to make the value into HTML (by adding <b> tags) so must first escape it
value = html.escape(value)
# We can then apply the regular expression to the value. We use a "back reference" `\0` to say
# that each keyword found should be replace with itself wrapped in a <b> tag
value = keywordMatcher.sub(r"<b>\1</b>", value)
print value
I urge you to take the time to understand what this does, otherwise you're just going to get yourself into a mess! It's always easier to just cut and paste and move on but this leads to crappy broken code and worse of all means you yourself don't improve and don't learn. All great coders started of as beginner coders who took the time to understand things :)

Migrating from Python to Racket (regular expression libraries and the "Racket Way")

I'm attempting to learn Racket, and in the process am attempting to rewrite a Python filter. I have the following pair of functions in my code:
def dlv(text):
"""
Returns True if the given text corresponds to the output of DLV
and False otherwise.
"""
return text.startswith("DLV") or \
text.startswith("{") or \
text.startswith("Best model")
def answer_sets(text):
"""
Returns a list comprised of all of the answer sets in the given text.
"""
if dlv(text):
# In the case where we are processing the output of DLV, each
# answer set is a comma-delimited sequence of literals enclosed
# in {}
regex = re.compile(r'\{(.*?)\}', re.MULTILINE)
else:
# Otherwise we assume that the answer sets were generated by
# one of the Potassco solvers. In this case, each answer set
# is presented as a comma-delimited sequence of literals,
# terminated by a period, and prefixed by a string of the form
# "Answer: #" where "#" denotes the number of the answer set.
regex = re.compile(r'Answer: \d+\n(.*)', re.MULTILINE)
return regex.findall(text)
From what I can tell the implementation of the first function in Racket would be something along the following lines:
(define (dlv-input? text)
(regexp-match? #rx"^DLV|^{|^Best model" text))
Which appears to work correctly. Working on the implementation of the second function, I currently have come up with the following (to start with):
(define (answer-sets text)
(cond
[(dlv-input? text) (regexp-match* #rx"{(.*?)}" text)]))
This is not correct, as regexp-match* gives a list of the strings which match the regular expression, including the curly braces. Does anyone know of how to get the same behavior as in the Python implementation? Also, any suggestions on how to make the regular expressions "better" would be much appreciated.
You are very close. You simply need to add #:match-select cadr to your regexp-match call:
(regexp-match* #rx"{(.*?)}" text #:match-select cadr)
By default, #:match-select has value of car, which returns the whole matched string. cadr selects the first group, caddr selects the second group, etc. See the regexp-match* documentation for more details.

Python: splitting a function and arguments

Here are some simple function calls in python:
foo(arg1, arg2, arg3)
func1()
Assume it is a valid function call.
Suppose I read these lines while parsing a file.
What is the cleanest way to separate the function name and the args into a list with two elements, the first a string for the function name, and the second a string for the arguments?
Desired results:
["foo", "arg, arg2, arg3"]
["func1", ""]
I'm currently using string searches to find the first instance of "(" from the left side and the first instance of ")" from the right side and just splicing the string with those given indices, but I don't like how I am approaching the problem.
I'm currently doing something similar using regular expressions. Adapting my code to your case, the following works with the examples you provide.
import re
def explode(s):
pattern = r'(\w[\w\d_]*)\((.*)\)$'
match = re.match(pattern, s)
if match:
return list(match.groups())
else:
return []
If you're parsing a Python file in Python, consider using Python's parser: ast (specifically the ast.parse() call).
That said, your current approach isn't terrible (though it will break on function calls that spam multiple lines). There are few completely correct approaches short of the aforementioned full parser - for instance, you could count matching parens, so that a((b,c)) would return the correct value even if there was a line break in the middle - but then that code would probably do the wrong thing when faced with a((b, "c)")), and so on.

Python: Effective replacing of substring

I have code like this:
def escape_query(query):
special_chars = ['\\','+','-','&&','||','!','(',')','{','}','[',']',
'^','"','~','*','?',':']
for character in special_chars:
query = query.replace(character, '\\%s' % character)
return query
This function should escape all occurrences of every substring (Notice && and ||) in special_characters with backslash.
I think, that my approach is pretty ugly and I couldn't stop wondering if there aren't any better ways to do this. Answers should be limited to standart library.
Using reduce:
def escape_query(query):
special_chars = ['\\','+','-','&&','||','!','(',')','{','}','[',']',
'^','"','~','*','?',':']
return reduce(lambda q, c: q.replace(c, '\\%s' % c), special_chars, query)
The following code has exactly the same principle than the steveha's one.
But I think it fulfills your requirement of clarity and maintainability since the special chars are still listed in the same list as yours.
special_chars = ['\\','+','-','&&','||','!','(',')','{','}','[',']',
'^','"','~','*','?',':']
escaped_special_chars = map(re.escape, special_chars)
special_chars_pattern = '|'.join(escaped_special_chars).join('()')
def escape_query(query, reg = re.compile(special_chars_pattern) ):
return reg.sub(r'\\\1',query)
With this code:
when the function definition is executed, an object is created with a value (the regex re.compile(special_chars_pattern) ) received as default argument, and the name reg is assigned to this object and defined as a parameter for the function.
This happens only one time, at the moment when the function definition is executed, which is performed only one time at compilation time.
That means that during the execution of the compiled code that takes place after the compilation, each time a call to the function will be done, this creation and assignement won't be done again: the regex object already exists and is permanantly registered and avalaible in the tuple func_defaults that is definitive attribute of the function.
That's interesting if several calls to the function are done during execution, because Python has not to search for the regex outside if it was defined outside or to reassign it to parameter reg if it was passed as simple argument.
If I understand your requirements correctly, some of the special "chars" are two-character strings (specifically: "&&" and "||"). The best way to do such an odd collection is with a regular expression. You can use a character class to match anything that is one character long, then use vertical bars to separate some alternative patterns, and these can be multi-character. The trickiest part is the backslash-escaping of chars; for example, to match "||" you need to put r'\|\|' because the vertical bar is special in a regular expression. In a character class, backslash is special and so are '-' and ']'. The code:
import re
_s_pat = r'([\\+\-!(){}[\]^"~*?:]|&&|\|\|)'
_pat = re.compile(_s_pat)
def escape_query(query):
return re.sub(_pat, r'\\\1', query)
I suspect the above is the fastest solution to your problem possible in Python, because it pushes the work down to the regular expression machinery, which is written in C.
If you don't like the regular expression, you can make it easier to look at by using the verbose format, and compile using the re.VERBOSE flag. Then you can sprawl the regular expression across multiple lines, and put comments after any parts you find confusing.
Or, you can build your list of special characters, just like you already did, and run it through this function which will automatically compile a regular expression pattern that matches any alternative in the list. I made sure it will match nothing if the list is empty.
import re
def make_pattern(lst_alternatives):
if lst_alternatives:
temp = '|'.join(re.escape(s) for s in lst_alternatives)
s_pat = '(' + temp + ')'
else:
s_pat = '$^' # a pattern that will never match anything
return re.compile(s_pat)
By the way, I recommend you put the string and the pre-compiled pattern outside the function, as I showed above. In your code, Python will run code on each function invocation to build the list and bind it to the name special_chars.
If you want to not put anything but the function into the namespace, here's a way to do it without any run-time overhead:
import re
def escape_query(query):
return re.sub(escape_query.pat, r'\\\1', query)
escape_query.pat = re.compile(r'([\\+\-!(){}[\]^"~*?:]|&&|\|\|)')
The above uses the function's name to look up the attribute, which won't work if you rebind the function's name later. There is a discussion of this and a good solution here: how can python function access its own attributes?
(Note: The above paragraph replaces some stuff including a question that was discussed in the discussion comments below.)
Actually, upon further thought, I think this is cleaner and more Pythonic:
import re
_pat = re.compile(r'([\\+\-!(){}[\]^"~*?:]|&&|\|\|)')
def escape_query(query, pat=_pat):
return re.sub(pat, r'\\\1', query)
del(_pat) # not required but you can do it
At the time escape_query() is compiled, the object bound to the name _pat will be bound to a name inside the function's name space (that name is pat). Then you can call del() to unbind the name _pat if you like. This nicely encapsulates the pattern inside the function, does not depend at all on the function's name, and allows you to pass in an alternate pattern if you wish.
P.S. If your special characters were always a single character long, I would use the code below:
_special = set(['[', ']', '\\', '+']) # add other characters as desired, but only single chars
def escape_query(query):
return ''.join('\\' + ch if (ch in _special) else ch for ch in query)
Not sure if this is any better but it works and probably faster.
def escape_query(query):
special_chars = ['\\','+','-','&&','||','!','(',')','{','}','[',']', '^','"','~','*','?',':']
query = "".join(map(lambda x: "\\%s" % x if x in special_chars else x, query))
for sc in filter(lambda x: len(x) > 1, special_chars):
query = query.replace(sc, "\%s" % sc)
return query

Categories