Using user input as regex search expression - python

I am working on a personal project that is designed to open a file specified by the user, then to take in user input and use that input as a regular expression to search the file with. The purpose of this is to gain a deeper understanding of how regular expressions work, and how to incorporate them into programs.
My problem lies in that all input the user gives me is formatted as a string. So (correct me if I'm wrong), an input of [a-z]+ will result in the search expression "[a-z]+". This is a problem if I want r"[a-z]+" as my search expression, as putting that in as user input will give me "r"[a-z]+"" (again, correct me if I'm wrong). This will obviously not work with regex. How do I format the input so that an input of r"[a-z]+" remains r"[a-z]+"?
This is the code section in question. The textFile in the function arguments is imported from another section of the program, and is used in the regex search:
def new_search_regex(textFile):
    """Query for input, then performs RegEx() with user's input"""
    global totalSearches
    global allSearchResults
    # ask user for regular expression to be searched
    expression = raw_input("Please enter the Regular Expression to be searched: ")
    # perform initial regex search
    foundRegex = re.search(expression, textFile)
    # if Regex search successful
    if foundRegex is not None:
        # Do complete regex search
        foundRegex = re.findall(expression, textFile)
        # Print result
        print "Result: " + str(foundRegex)
        # Increment global total
        totalSearches += 1
        # create object for result, store in global array
        reg_object = Reg_Search(totalSearches, expression, foundRegex)
        allSearchResults.append(reg_object)
        print "Your search number for this search is " + str(totalSearches)  # Inform user of storage location
    # if Regex search unsuccessful
    else:
        print "Search did not have any results."
    return
Note: At the end I create an object for the result, and store it in a global array.
This is also assuming for now that the user is competently entering non-system destroying regex's. I will soon start adding in error checking though, such as using .escape on the user input. How will this affect my situation? Will it wreak havoc with the user including " in the input?

The r"..." syntax is only useful to prevent the Python compiler from interpreting escape sequences (\n being converted to a newline character, for example). Once parsed by the compiler, it is just a regular string.
When you read input from the user with raw_input, the compiler does not perform any escape sequence interpretation. You don't have to do anything; the string is already correctly interpreted.
You can test this yourself like this:
>>> x = r"[a-z]+\n"
>>> y = raw_input("")
[a-z]+\n
>>> x == y
True
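The same idea can be sketched in Python 3, with the error checking the question mentions. The helper name find_all_matches is hypothetical and not part of the original program. The text typed by the user goes to re.compile unchanged, and re.error is caught so a malformed pattern does not crash the program.

```python
import re

def find_all_matches(pattern_text, text):
    """Return all matches of a user-typed pattern, or None if it is invalid."""
    try:
        # pattern_text is used exactly as typed; no r"..." prefix is needed.
        regex = re.compile(pattern_text)
    except re.error:
        # The user typed something that is not a valid regular expression.
        return None
    return regex.findall(text)
```

Note that calling re.escape on the input would turn it into a literal-text search, which defeats the purpose when the user is supposed to type a regular expression.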

Straight from the Python documentation for re (http://docs.python.org/2/library/re.html):
import re
m = re.search(regexp_as_string, payload)
m.group(0)  # the entire text of the first match (m is None if nothing matched)

Related

How can I validate user input with regex with option to enter multiple entries using comma's?

I'm trying to create a python script that prompts the user for the slot location of a hard drive in a server. I would like it to match the pattern n:n:n where 'n' is a single digit number. They should also have the option of entering multiple slots using commas.
So far I have the following but it only works with a single entry? I've commented out the '.split()' because I would get an error:
Traceback (most recent call last):
  File "v2hdorders.py", line 22, in <module>
    if not re.match("\d:\d:\d", c):
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 141, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or buffer
The code:
import re
while True:
    c = raw_input("What is the slot position of the hard drive(s)? (e.g. 0.0.0 - Use commas if more than one drive): ")#.split()
    if not re.match("\d:\d:\d:", c):
        print ("Please enter in n:n:n format")
    else:
        d = raw_input("What is the disk size (Specify in GB or TB)? ")
        break
I would like the user to be able to enter one or more slot entries while having the program check that the format is n:n:n
There is some confusion between your text and regex. You should know that the 'dot' character is special in regex syntax. So if you want a.b.c, you have to use a backslash in your regex for the '.' character: r'a\.b\.c' or whatever. (Or go with colon, ':', which seems fine.)
You can get the result you want by matching a single match, followed by zero-or-more occurrences of comma+match. You should make a habit of using raw-strings (r'' or r"" or ...) to write regular expressions, because they let you avoid extra backslashes:
re.match(r'match(,match)*', c)
The pattern above will match "match" or "match,match" or "match,match,match", etc.
So, since your desired match appears to be either \d\.\d\.\d or perhaps \d:\d:\d, we can insert that instead:
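re.match(r'\d:\d:\d(,\d:\d:\d)*$', c)
(Note: I have not allowed for spaces, which you should do. The trailing $ makes sure nothing unexpected follows the last slot, since re.match only anchors the start of the string.)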
Once you check for a match, I'll suggest using re.findall to iterate through the possibilities. It solves the whole problem of "how do I know if there's one or more of these matches?" for you!
for slot_pos in re.findall(r'\d:\d:\d', c):
    print("Slot pos:", slot_pos)
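Putting the two steps together, here is a minimal Python 3 sketch. The names SLOT, SLOT_LIST and parse_slots are made up for illustration, and optional spaces around the commas are an assumption about what input should be accepted. It validates the whole line first and only then extracts the individual slots.

```python
import re

SLOT = r"\d:\d:\d"
# One slot, then zero or more ",slot" repeats; spaces around commas allowed.
SLOT_LIST = re.compile(r"\s*{0}(?:\s*,\s*{0})*\s*".format(SLOT))

def parse_slots(text):
    """Return the list of slots, or None if the input is not well-formed."""
    # fullmatch requires the entire input to fit the pattern.
    if SLOT_LIST.fullmatch(text) is None:
        return None
    return re.findall(SLOT, text)
```

A caller can keep prompting until parse_slots returns a list instead of None.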
You could use re.search to find the pattern. It returns a match object, or None if nothing matched, so you can test the result directly to see whether the input fits your pattern.
import re
while True:
    c = input("What is the slot position of the hard drive(s)? (e.g. 0:0:0 - Use commas if more than one drive): ")#.split()
    match = re.search(r"\d:\d:\d", c)
    print(match)
    if not match:
        print("Please enter in n:n:n format")
    else:
        d = input("What is the disk size (Specify in GB or TB)? ")
        break
I won't go into the split with commas functionality, because I think you may want to reconsider how you're going about that. If you delimit with commas here, you'll need to make sure the storage capacities also maintain their relationship with the drive number.
Worth mentioning: you had an error in your regex as well. You were asking for \d:\d:\d:, which would match something like 0:1:3:. Since you're working with Dell servers, you're not going to have that trailing colon, so I changed it to \d:\d:\d.

Python 3.6 Identifying a string and if X in Y

Newb programmer here working on my first project. I've searched this site and the python documentation, and either I'm not seeing the answer, or I'm not using the right terminology. I've read the regex and if sections, specifically, and followed links around to other parts that seemed relevant.
import re
keyphrase = '##' + '' + '##'
print(keyphrase)  # output is ####
j = input('> ')  # ##whatever##
if keyphrase in j:
    print('yay')
else:
    print('you still haven\'t figured it out...')
k = j.replace('#', '')
print(k)  # whatever
This is for a little reddit bot project. I want the bot to be called like ##whatever## and then be able to do things with the word(s) in between the ##'s. I've set up the above code to test if Python was reading it but I keep getting my "you still haven't figured it out..." quip.
I tried adding the REGEX \W in the middle of keyphrase, to no avail. Also weird combinations of \$\$ and quotes
So, my question, is how do I put a placeholder in keyphrase for user input?
For instance, if a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.
You could use the following regex r'##(.*?)##' to capture everything inside of the key phrase you've chosen.
Sample Output:
>>> import re
>>> f = lambda s: re.match(r'##(.*?)##', s).group(1)
>>> f("##whatever##")
'whatever'
>>> f = lambda s: re.findall(r'##(.*?)##', s)
>>> f("a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.")
['comment', 'this', 'I can grab']
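How does it work? (1) We state the constant head and tail strings (##) on either side of capture group 1, which is the part between the brackets (). Great, almost there! (2) Inside the group we match any characters with .*?, where the trailing ? makes the quantifier lazy (non-greedy), so each capture stops at the nearest closing ## instead of running on to the last one.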
Suggested Readings:
Introduction to Regex in Python - Jee Gikera
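To make the difference between the lazy and greedy forms concrete, here is a small comparison on a shortened version of the sample sentence (the variable names are only for illustration):

```python
import re

text = "a ##comment## does ##this##"

# Lazy: each capture stops at the nearest closing ##.
lazy = re.findall(r"##(.*?)##", text)

# Greedy: the single capture runs from the first ## to the last ##.
greedy = re.findall(r"##(.*)##", text)

print(lazy)    # ['comment', 'this']
print(greedy)  # ['comment## does ##this']
```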
Something like this should work:
import re
keyphrase_regex = re.compile(r'##(.*)##')
user_input = input('> ')
keyphrase_match = keyphrase_regex.search(user_input)
# `search` returns `None` if regex didn't match anywhere in the string
keyphrase_content = keyphrase_match.group(1) if keyphrase_match else None
if keyphrase_content:
    print('yay! You submitted "', keyphrase_content, '" to the bot!')
else:
    # Bonus tip: Use double quotes to make a string containing apostrophe
    # without using a backslash escape
    print("you still haven't figured it out...")
# Use `keyphrase_content` for whatever down here
Regular expressions are kind of hard to wrap your head around, because they work differently than most programming constructs. It's a language to describe patterns.
Regex One is a fantastic beginner's guide.
Regex101 is an online sandbox that allows you to type a regular expression and some sample strings, then see what matches (and why) in real time.
The regex ##(.*)## basically means "search through the string until you find two '#' signs. Right after those, start capturing zero-or-more of any character, as many as possible, as long as two more '#' signs still follow. Then return successfully and hold onto the entire match (from first '#' to last '#'). Also, hold onto the captured characters in case the programmer asks you for just them." Because .* is greedy, a line with several tags produces one capture that runs to the last ## in the string.
EDIT: Props to @ospahiu for bringing up the ? lazy quantifier. A final solution, combining our approaches, would look like this:
# whatever_bot.py
import re
# Technically, Python >2.5 will compile and cache regexes automatically.
# For tiny projects, it shouldn't make a difference. I think it's better style, though.
# "Explicit is better than implicit"
keyphrase_regex = re.compile(r'##(.*?)##')
def parse_keyphrases(text):
    # compiled patterns have `findall` (there is no `find_all` method)
    return keyphrase_regex.findall(text)
Lambdas are cool. I prefer them for one-off things, but the code above is something I'd rather put in a module. Personal preference.
You could even make the regex substitutable, using the '##' one by default:
# whatever_bot.py
import re
keyphrase_double_hash = re.compile(r'##(.*?)##')
def parse_keyphrases(text, keyphrase_regex=keyphrase_double_hash):
    return keyphrase_regex.findall(text)
You could even go bonkers and write a function that generates a keyphrase regex from an arbitrary "tag" pattern! I'll leave that as an exercise for the reader ;) Just remember: Several characters have special regex meanings, like '*' and '?', so if you want to match that literal character, you'd need to escape them (e.g. '\?').
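One possible take on that exercise, as a sketch rather than the answer's own solution: the function name make_keyphrase_regex is hypothetical, and it assumes the same tag opens and closes a keyphrase. re.escape takes care of characters that have a special regex meaning.

```python
import re

def make_keyphrase_regex(tag):
    """Build a regex that captures the text between two copies of tag."""
    escaped = re.escape(tag)  # neutralise characters such as * and ?
    # Lazy capture so that each keyphrase ends at the nearest closing tag.
    return re.compile(escaped + r"(.*?)" + escaped)

def parse_keyphrases(text, keyphrase_regex=make_keyphrase_regex("##")):
    """Return every keyphrase found in text, using '##' tags by default."""
    return keyphrase_regex.findall(text)
```

For example, parse_keyphrases("x **bold** y", make_keyphrase_regex("**")) treats ** as the tag even though * is normally a quantifier.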
If you want to grab the content between the "#", then try this:
j = input("> ")
"".join(j.split("#"))
You're not getting any of the info between the #'s in your example because you're effectively looking for '####' in whatever input you give it. Unless you happen to put 4 #'s in a row, that RE will never match.
What you want to do instead is something like
re.match(r'##\w+##', j)
which will look for two leading #s, then one or more word characters (\w+; the uppercase \W would match non-word characters instead), then two trailing #s. From there, your strip code looks fine and you should be able to grab it.

How can I check if the users input is a number?

I'm trying to create a function to check if the user inputs a number. If the user inputs a number my program should output an error message, if the users enters a string of letters, my program should proceed with program. How can I do this?
I've come up with this so far:
# Checks user input
def CheckInput():
    while True:
        try:
            city = input("Enter name of city: ")
            return city
        except ValueError:
            print("letters only no numbers")
This function doesn't seem to work. Please help.
You are looking to filter out any responses that include digits in the string. The answers given will do that using a regular expression.
If that's all you want, job done. But you will also accept city names like Ad€×¢® or john@example.com.
Depending on how choosy you want to be, and whether you're just looking to fix this code snippet or to learn the technique that the answers gave you so that you can solve the next problem (where you want to reject anything that is not a dollar amount, say), you could try writing a regular expression. This lets you define the characters that you want to match against. You could write a simple one to test if the input string contains a character that is not a letter: [^a-zA-Z] (the ^ inside [ ] means any character that is not in the class listed). If that RE matches, you can then reject the string.
Then consider whether the strict rule of "letters only" is good enough. Have you replaced one flawed rule (no digits allowed) with another? What about 'L.A.' as a city name? Or 'Los Angeles'? Maybe you need to allow for spaces and periods. What about hyphens? Try [^a-zA-Z .\-], which now includes a space, period and hyphen. The backslash tells the RE engine to treat that hyphen literally, unlike the one in "a-z".
Details about writing a regex here: http://docs.python.org/3/howto/regex.html#regex-howto
Details about using the Re module in Python here: http://docs.python.org/3/library/re.html#module-re
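A minimal sketch of that approach follows. The names INVALID_CHAR, is_valid_city and ask_city are made up for illustration, and rejecting empty input is an added assumption. The read parameter exists only so the prompt can be replaced in tests.

```python
import re

# Anything other than ASCII letters, space, period or hyphen is rejected.
INVALID_CHAR = re.compile(r"[^a-zA-Z .\-]")

def is_valid_city(name):
    """True if name is non-blank and contains no disallowed character."""
    return bool(name.strip()) and INVALID_CHAR.search(name) is None

def ask_city(read=input):
    """Keep prompting until the user enters an acceptable city name."""
    while True:
        city = read("Enter name of city: ")
        if is_valid_city(city):
            return city
        print("Letters, spaces, periods and hyphens only")
```

This character class is ASCII-only, so a name with accented letters would be rejected; whether that is acceptable depends on the data you expect.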
import re
def CheckInput():
    city = input('Enter name of city: ')
    if re.search(r'\d', city):
        raise Exception('Invalid input')
You wouldn't be type checking because in Python 3 all text inputs are strings. This checks for a decimal digit in the input using a regular expression and raises an exception if one is found.
def CheckInput():
    val = input("Enter name of city:")
    try:
        int(val)
    except ValueError:
        # not a whole number, so accept it
        return val
    else:
        print("No numbers please")
Edit: I saw mention that no number should be present in the input at all. This version checks for digits at any place in the input:
import re
def CheckInput():
    val = input("Enter name of city:")
    if re.search(r'\d', val) is not None:
        print("No numbers please")
    else:
        return val
You can use the type(variable_name) function to retrieve the type.

Django: Bad group name

I faced an error on "bad group name".
Here is the code:
for qitem in q['display']:
    if qitem['type'] == 1:
        for keyword in keywordTags.split('|'):
            p = re.compile('^' + keyword + '$')
            newstring=''
            for word in qitem['value'].split():
                if word[-1:] == ',':
                    word = word[0:len(word)-1]
                    newstring += (p.sub('<b>'+word+'</b>', word) + ', ')
                else:
                    newstring += (p.sub('<b>'+word+'</b>', word) + ' ')
            qitem['value']=newstring
And here's the error:
error at /result/1/
bad group name
Request Method: GET
Django Version: 1.4.1
Exception Type: error
Exception Value: bad group name
Exception Location: C:\Python27\lib\re.py in _compile_repl, line 257
Python Executable: C:\Python27\python.exe
Python Version: 2.7.3
Python Path: ['D:\ExamPapers', 'C:\Windows\SYSTEM32\python27.zip',
'C:\Python27\DLLs', 'C:\Python27\lib',
'C:\Python27\lib\plat-win', 'C:\Python27\lib\lib-tk',
'C:\Python27', 'C:\Python27\lib\site-packages']
Server time: Sun, 3 Mar 2013 15:31:05 +0800
Traceback:
C:\Python27\lib\site-packages\django\core\handlers\base.py in get_response
    response = callback(request, *callback_args, **callback_kwargs)
D:\ExamPapers\views.py in result
    newstring += (p.sub('<b>'+word+'</b>', word) + ' ')
In summary, the error is at:
newstring += (p.sub('<b>'+word+'</b>', word) + ' ')
So you're trying to highlight in bold the occurrences of a set of keywords. Right now this code is broken in quite a lot of ways. You're using the re module to match the keywords, but you're also breaking the keywords and the strings down into individual words. You don't need to do both, and the interaction between these two approaches to solving the problem is what is causing your issues.
You can use regular expressions to match multiple possible strings at the same time; that's what they're good for! So instead of "^keyword$" to match just "keyword", you could use "^(keyword|hello)$" to match either "keyword" or "hello" (without the brackets, "^keyword|hello$" would mean "keyword" at the start or "hello" at the end). You also use the ^ and $ characters, which only match the beginning or end of the entire string, but what you probably wanted originally was to match the beginning or end of words. For this you can use \b, like this: r"\b(keyword|hello)\b". Note that in the last example I added an r character before the string. This stands for "raw" and turns off Python's usual handling of backslash characters, which conflicts with regular expressions; it's good practice to always use the r before the string when the string contains a regular expression. I also used brackets to group together the words.
The regular expression sub method allows you to substitute things matched by a regular expression with another string. It also allows you to make "back references" in the replacement string that include parts of the original string that matched. The parts that it includes are called "groups" and are indicated with brackets in the original regular expression. In the example above there is only one set of brackets, and these are the first, so they're referred to by the back reference \1. The cause of the actual error message you asked about is that your replacement string (which you build from word) contained what looked like a back reference, but it didn't refer to any group in your regular expression.
Using that, you can do something like this:
keywordMatcher = re.compile(r"\b(keyword|hello)\b")
value = keywordMatcher.sub(r"<b>\1</b>", value)
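As a side note on the error itself: the failure can be reproduced in isolation, because re treats backslashes in a replacement string as template syntax. The sketch below is written for Python 3, where the exact error message differs from the "bad group name" of Python 2.7, and the names template_fails and bold_keyword are hypothetical. Passing a function as the replacement avoids template parsing altogether.

```python
import re

def template_fails(replacement):
    """True if re rejects replacement when used as a substitution template."""
    try:
        re.sub("x", replacement, "x")
    except re.error:
        return True
    return False

def bold_keyword(text, keyword):
    """Wrap whole-word occurrences of keyword in <b> tags."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    # A callable replacement's return value is inserted verbatim,
    # with no backslash processing.
    return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", text)
```

A replacement such as r"<b>\gamma</b>" makes template_fails return True, while bold_keyword never builds a template from the matched text.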
Another thing that isn't directly related to what you're asking, but is incredibly important, is that you are taking plain text source strings (I assume) and making them into HTML. This opens the door to script injection vulnerabilities, which, if you don't take the time to understand and avoid them, will allow bad guys to hack the applications you build. They can do this in an automated way, so even if you think your app is too small for anyone to notice, it can still get hacked and used for all sorts of bad things. Don't let this happen! The basic rule is that it's OK to convert text to HTML, but you need to "escape" it first. This is very simple:
from django.utils import html
html_safe = html.escape(my_text)
All this does is convert characters like < to &lt;, which the browser will show as < but won't interpret as the beginning of a tag. So if a bad guy types <script> into one of your forms and it gets processed by your code, the browser will display it as <script> and not execute it as a script.
Likewise, if you use any text in a regular expression that you don't intend to have special regular expression characters, then you must escape that too! You can do this using re.escape:
import re
my_regexp = re.compile(r"\b%s\b" % (re.escape(my_word),))
Ok, so now we've got that out of the way here is a method you could use to do what you wanted:
value = "this is my super duper testing thingy"
keywords = "super|my|test"
from django.utils import html
import re
# first we must split up the keywords
keywords = keywords.split("|")
# Next we must make each keyword safe for use in a regular expression,
# this is similar to the HTML escaping we discussed above but not to
# be confused with it.
keywords = [re.escape(k) for k in keywords]
# Now we reform the keywordTags string, but this time we know each keyword is regexp-safe
keywords = "|".join(keywords)
# Finally we create a regular expression that matches *any* of the keywords
keywordMatcher = re.compile(r'\b(%s)\b' % (keywords,))
# We are going to make the value into HTML (by adding <b> tags) so must first escape it
value = html.escape(value)
# We can then apply the regular expression to the value. We use a "back reference" `\1` to say
# that each keyword found should be replaced with itself wrapped in a <b> tag
value = keywordMatcher.sub(r"<b>\1</b>", value)
print value
I urge you to take the time to understand what this does, otherwise you're just going to get yourself into a mess! It's always easier to just cut and paste and move on, but this leads to crappy broken code and, worst of all, means you yourself don't improve and don't learn. All great coders started off as beginner coders who took the time to understand things :)

Migrating from Python to Racket (regular expression libraries and the "Racket Way")

I'm attempting to learn Racket, and in the process am attempting to rewrite a Python filter. I have the following pair of functions in my code:
def dlv(text):
    """
    Returns True if the given text corresponds to the output of DLV
    and False otherwise.
    """
    return text.startswith("DLV") or \
           text.startswith("{") or \
           text.startswith("Best model")
def answer_sets(text):
    """
    Returns a list comprised of all of the answer sets in the given text.
    """
    if dlv(text):
        # In the case where we are processing the output of DLV, each
        # answer set is a comma-delimited sequence of literals enclosed
        # in {}
        regex = re.compile(r'\{(.*?)\}', re.MULTILINE)
    else:
        # Otherwise we assume that the answer sets were generated by
        # one of the Potassco solvers. In this case, each answer set
        # is presented as a comma-delimited sequence of literals,
        # terminated by a period, and prefixed by a string of the form
        # "Answer: #" where "#" denotes the number of the answer set.
        regex = re.compile(r'Answer: \d+\n(.*)', re.MULTILINE)
    return regex.findall(text)
From what I can tell the implementation of the first function in Racket would be something along the following lines:
(define (dlv-input? text)
  (regexp-match? #rx"^DLV|^{|^Best model" text))
Which appears to work correctly. Working on the implementation of the second function, I currently have come up with the following (to start with):
(define (answer-sets text)
  (cond
    [(dlv-input? text) (regexp-match* #rx"{(.*?)}" text)]))
This is not correct, as regexp-match* gives a list of the strings which match the regular expression, including the curly braces. Does anyone know of how to get the same behavior as in the Python implementation? Also, any suggestions on how to make the regular expressions "better" would be much appreciated.
You are very close. You simply need to add #:match-select cadr to your regexp-match* call:
(regexp-match* #rx"{(.*?)}" text #:match-select cadr)
By default, #:match-select has value of car, which returns the whole matched string. cadr selects the first group, caddr selects the second group, etc. See the regexp-match* documentation for more details.
