I'm attempting to learn Racket, and in the process am attempting to rewrite a Python filter. I have the following pair of functions in my code:
def dlv(text):
"""
Returns True if the given text corresponds to the output of DLV
and False otherwise.
"""
return text.startswith("DLV") or \
text.startswith("{") or \
text.startswith("Best model")
def answer_sets(text):
"""
Returns a list comprised of all of the answer sets in the given text.
"""
if dlv(text):
# In the case where we are processing the output of DLV, each
# answer set is a comma-delimited sequence of literals enclosed
# in {}
regex = re.compile(r'\{(.*?)\}', re.MULTILINE)
else:
# Otherwise we assume that the answer sets were generated by
# one of the Potassco solvers. In this case, each answer set
# is presented as a comma-delimited sequence of literals,
# terminated by a period, and prefixed by a string of the form
# "Answer: #" where "#" denotes the number of the answer set.
regex = re.compile(r'Answer: \d+\n(.*)', re.MULTILINE)
return regex.findall(text)
From what I can tell the implementation of the first function in Racket would be something along the following lines:
(define (dlv-input? text)
(regexp-match? #rx"^DLV|^{|^Best model" text))
Which appears to work correctly. Working on the implementation of the second function, I currently have come up with the following (to start with):
(define (answer-sets text)
(cond
[(dlv-input? text) (regexp-match* #rx"{(.*?)}" text)]))
This is not correct, as regexp-match* gives a list of the strings which match the regular expression, including the curly braces. Does anyone know of how to get the same behavior as in the Python implementation? Also, any suggestions on how to make the regular expressions "better" would be much appreciated.
You are very close. You simply need to add #:match-select cadr to your regexp-match call:
(regexp-match* #rx"{(.*?)}" text #:match-select cadr)
By default, #:match-select has value of car, which returns the whole matched string. cadr selects the first group, caddr selects the second group, etc. See the regexp-match* documentation for more details.
Related
Newb programmer here working on my first project. I've searched this site and the python documentation, and either I'm not seeing the answer, or I'm not using the right terminology. I've read the regex and if sections, specifically, and followed links around to other parts that seemed relevant.
import re
keyphrase = '##' + '' + '##'
print(keyphrase) #output is ####
j = input('> ') ###whatever##
if keyphrase in j:
print('yay')
else:
print('you still haven\'t figured it out...')
k = j.replace('#', '')
print(k) #whatever
This is for a little reddit bot project. I want the bot to be called like ##whatever## and then be able to do things with the word(s) in between the ##'s. I've set up the above code to test if Python was reading it but I keep getting my "you still haven't figured it out..." quip.
I tried adding the REGEX \W in the middle of keyphrase, to no avail. Also weird combinations of \$\$ and quotes
So, my question, is how do I put a placeholder in keyphrase for user input?
For instance, if a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.
You could use the following regex r'##(.*?)##' to capture everything inside of the key phrase you've chosen.
Sample Output:
>>> import re
>>> f = lambda s: re.match(r'##(.*?)##', s).group(1)
>>> f("##whatever##")
whatever
>>> f = lambda s: re.findall(r'##(.*?)##', s)
>>> f("a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.")
['comment', 'this', 'I can grab']
How does it work? (1) We state the string constant head and tail for the capture group 1 between the brackets (). Great, almost there! (2) We then match any character .*? with greedy search enforced so that we capture the whole string.
Suggested Readings:
Introduction to Regex in Python - Jee Gikera
Something like this should work:
import re
keyphrase_regex = re.compile(r'##(.*)##')
user_input = input('> ')
keyphrase_match = keyphrase_regex.search(user_input)
# `search` returns `None` if regex didn't match anywhere in the string
keyphrase_content = keyphrase_match.group(1) if keyphrase_match else None
if keyphrase_content:
keyphrase_content = keyphrase_match.group(1)
print('yay! You submitted "', keyphrase_content, '" to the bot!')
else:
# Bonus tip: Use double quotes to make a string containing apostrophe
# without using a backslash escape
print("you still haven't figured it out...")
# Use `keyphrase_content` for whatever down here
Regular expressions are kind of hard to wrap your head around, because they work differently than most programming constructs. It's a language to describe patterns.
Regex One is a fantastic beginners guide.
Regex101 is an online sandbox that allows you to type a regular expression and some sample strings, then see what matches (and why) in real time
The regex ##(.*)## basically means "search through the string until you find two '#' signs. Right after those, start capturing zero-or-more of any character. If you find another '#', stop capturing characters. If that '#' is followed by another one, stop looking at the string, return successfully, and hold onto the entire match (from first '#' to last '#'). Also, hold onto the captured characters in case the programmer asks you for just them.
EDIT: Props to #ospahiu for bringing up the ? lazy quantifier. A final solution, combining our approaches, would look like this:
# whatever_bot.py
import re
# Technically, Python >2.5 will compile and cache regexes automatically.
# For tiny projects, it shouldn't make a difference. I think it's better style, though.
# "Explicit is better than implicit"
keyphrase_regex = re.compile(r'##(.*?)##')
def parse_keyphrases(input):
return keyphrase_regex.find_all(input)
Lambdas are cool. I prefer them for one-off things, but the code above is something I'd rather put in a module. Personal preference.
You could even make the regex substitutable, using the '##' one by default
# whatever_bot.py
import re
keyphrase_double_at_sign = re.compile(r'##(.*?)##')
def parse_keyphrases(input, keyphrase_regex=keyphrase_double_at_sign):
return keyphrase_regex.find_all(input)
You could even go bonkers and write a function that generates a keyphrase regex from an arbitrary "tag" pattern! I'll leave that as an exercise for the reader ;) Just remember: Several characters have special regex meanings, like '*' and '?', so if you want to match that literal character, you'd need to escape them (e.g. '\?').
If you want to grab the content between the "#", then try this:
j = input("> ")
"".join(j.split("#"))
You're not getting any of the info between the #'s in your example because you're effectively looking for '####' in whatever input you give it. Unless you happen to put 4 #'s in a row, that RE will never match.
What you want to do instead is something like
re.match('##\W+##', j)
which will look for 2 leading ##s, then any number greater than 1 alphanumeric characters (\W+), then 2 trailing ##s. From there, your strip code looks fine and you should be able to grab it.
I need to get (parse) from a device its whole output.
My solution was: 1) Determine how the last line of its output looks like
2) Use the code below to read the output until the last line (which is a way around of saying - read the whole output)
last_line = "text of the last line"
read_until(last_line)
3) Technical detail: make it to a return value of the get_output() as means of passing it further to a parse_result() function.
The problem is: The last line might take various forms and only its rough format is known. For example it might say: {"diag":"hdd_id", "status":"0"}. However, both "diag" and "status" might take other values than "hdd_id" and "0".
What can I do to make the "text of the last line" more universal so that the read_until() stops for every value of "diag" and "status"? (given that the output always includes words "diag" and "status")
What I tried: Using regular expressions. Defining last_line = re('"status":"."}') making use of the fact that . in regular expression means any value. What I get though is TypeError: 'module' object is not callable.
It also wouldn't make much sense to convert that regular expression to a string by str(re('"status":"."}')) since, as far as I understand regular expressions, it wouldn't mean any particular string (due to .).
You should read (again) the re chapter from the Python standard library manual.
The correct usage is:
import re
...
eof = re.compile(r'\s*\{\s*"diag":.*,\s*"status":.*\}') # compile the regex
...
The above expression uses \s* to allow for optional white spaces in the line. You can remove them if you know that they cannot occur.
You can then use it with the telnetlib Python module, but with expect instead of read_until, because the latter searches for a string and not a regex:
index, match, text = tn.expect([eof])
Here, index will the the index of the matched regex (here 0), match the match object, and text the full text including the last line
--SOLVED--
I solved my issue by enabling multiline mode, and now the characters ^ and $ work perfectly for identifying the beginning and end of each string
--EDIT--
My code:
import re
import test_regex
def regex_content(text_content, regex_dictionary):
#text_content = text_content.lower()
regex_matches = []
# Search sanitized text (markup removed) for DLP theme keywords
for key,value in regex_dictionary.items():
# Get confiiguration settings
min_matches = value.get('min_matches',1)
risk = value.get('risk',1)
enabled = value.get('enabled',False)
regex_str = value.get('regex','')
# Fast compute True/False hit for each DLP theme word
if enabled:
print "Searching for key : %s" % (key)
my_regex = re.compile(value.get('regex'))
hits = my_regex.findall(text_content)
if len(hits) > 0:
regex_matches.append((key, risk, len(hits), hits))
# Return array of results (key, risk, number of hits, regex matches)
return regex_matches
def main():
#print defaults.test_regex.dlp_regex
text_content = ""
for line in open('testData.txt'):
text_content+=line
for match in regex_content(text_content, test_regex.dlp_regex):
print "\nFound %s : %s" % (match[0], match[3])
print "\n"
if __name__ == '__main__':
main()
and it is using the regex found here:
'Large number of US Zip Codes' : { 'regex' : "\b\d{5}(?:-\d{1,4})?\b"},
When I precede my regex with the 'r' flag, I can find the zip codes I'm looking for, but as well as every other 5 digit number in my document I am searching through. From my understanding this is because it ignored the \b characters. Without the r flag though, it cannot find any zip codes. It works perfectly fine in regexr, but not in my code. I haven't had any luck making \b characters work, nor ^ and $ for identifying the beginnings and ends of the strings I'm searching for. What is it that I am misunderstanding about these special characters?
--Original post--
I am writing a regex for identifying zip codes (and only zip codes), so to avoid false positives I am trying to include a boundary on my regex, using both of the following:
\b\d{5}\b|\b\d{5}-\b\d{1,4}\b
using the online regex debugger Regexr, my code should correctly catch 5 digit zip codes, such as 34332. However, I have two problems:
1. This regex is not working in my actual code for finding any zip codes, but it does work when I don't have the boundary (\b) characters. The exact code I'm trying to extract with my regex is:
Zip:
----
98839-0111
34332
2. I don't see why my regex can't correctly identify 98839-0111 in Regexr. I tried doing the super-primitive approach of
\b\d{5}\b|98839-0111
and even that couldn't identify 98839-0111. Does anyone know what could be going on?
Note: I have also tried using ^ and $ for the boundaries of my regex, but this also doesn't find the regex's, not even in Regexr.
EDIT: After removing the first part of my regex, leaving only
98839-0111
It can now correctly identify it. I guess this means that once a string is pulled out by one of my regex's, it can no longer be found by any subsequent regexs? Why is this?
It is because of the alternative list: the first part was matched, and the engine stopped checking.
Try this regex
98839-0111|\b\d{5}\b
And you'll get a match.
Or, to be more generic in your case:
\b(?:\d{5}-\d{4}|\d{5})\b
will match both, and more (actually, functionally the same as \b\d{5}(?:-\d{4})?\b). See demo.
Your pattern is evaluated for each position in the string from the left to the right, so if the left branch of your pattern succeeds, the second branch isn't tested at all.
I suggest you to use this pattern that solves the problem:
\b\d{5}(?:-\d{1,4})?\b
You can use this regex:
\b(\d{5}-\d{1,4}|\d{5})\b
Working demo
Let's say I have this string :
<div>Object</div><img src=#/><p> In order to be successful...</p>
I want to substitute every letter between < and > with a #.
So, after some operation, I want my string to look like:
<###>Object<####><##########><#> In order to be successful...<##>
Notice that every character between the two symbols were replaced with # ( including whitespace).
This is the closest I could get:
r = re.sub('<.*?>', '<#>', string)
The problem with my code is that all characters between < and > are replaced by a single #, whereas I would like every individual character to be replaced by a #.
I tried a mixture of various back references, but to no avail. Could someone point me in the right direction?
What about...:
def hashes(mo):
replacing = mo.group(1)
return '<{}>'.format('#' * len(replacing))
and then
r = re.sub(r'<(.*?)>', hashes, string)
The ability to use a function as the second argument to re.sub gives you huge flexibility in building up your substitutions (and, as usual, a named def results in much more readable code than any cramped lambda -- you can use meaningful names, normal layouts, etc, etc).
The re.sub function can be called with a function as the replacement, rather than a new string. Each time the pattern is matched, the function will be called with a match object, just like you'd get using re.search or re.finditer.
So try this:
re.sub(r'<(.*?)>', lambda m: "<{}>".format("#" * len(m.group(1))), string)
I am working with a regular expression that would check a string,Is it a function or not.
My regular expression for checking that as follows:
regex=r' \w+[\ ]*\(.*?\)*'
It succefully checks whether the string contains a function or not.
But it grabs normal string which contains firs barcket value,such as "test (meaning of test)".
So I have to check further that if there is a space between function name and that brackets that will not be caught as match.So I did another checking as follows:
regex2=r'\s'
It work successfully and can differentiate between "test()" and "test ()".
But now I have to maintain another condition that,if there is no space after the brackets(eg. test()abcd),it will not catch it as a function.The regular expression should only treat as match when it will be like "test() abcd".
But I tried using different regular expression ,unfortunately those are not working.
Her one thing to mention the checking string is inserted in to a list at when it finds a match and in second step it only check the portion of the string.Example:
String : This is a python function test()abcd
At first it will check the string for function and when find matches with function test()
then send only "test()" for whether there is a gap between "test" and "()".
In this last step I have to find is there any gap between "test()" and "abcd".If there is gap it will not show match as function otherwise as a normal portion of string.
How should I write the regular expression for such case?
The regular expression will have to show in following cases:
1.test() abc
2.test(as) abc
3.test()
will not treat as a function if:
1.test (a)abc
2.test ()abc
(\w+\([^)]*\))(\s+|$)
Bascially you make sure it ends with either spaces or end of line.
BTW the kiki tool is very useful for debugging Python re: http://code.google.com/p/kiki-re/
regex=r'\w+\([\w,]+\)(?:\s+|$)'
I have solved the problem at first I just chexked for the string that have "()"using the regular expression:
regex = r' \w+[\ ]*\(.*?\)\w*'
Then for checking the both the space between function name and brackets,also the gap after the brackets,I used following function with regular expression:
def testFunction(self, func):
a=" "
func=str(func).strip()
if a in func:
return False
else:
a = re.findall(r'\w+[\ ]*', func)
j = len(a)
if j<=1:
return True
else:
return False
So it can now differentiate between "test() abc" and "test()abc".
Thanks