I am doing a project in Python where I require a user to input text. If the text matches a format supported by the program, it will output a response that includes a user's key word (it is a simple chat bot). The format is stored in a text file as a user input format and an answer format.
For example, the text file looks like this, with user input on the left and output on the right:
my name is <-name> | Hi there, <-name>
So if the user writes my name is johnny, I want the program to know that johnny is the <-name> variable, and then to print the response Hi there, johnny.
Some prodding me in the right direction would be great! I have never used regular expressions before and I read an article on how to use them, but unfortunately it didn't really help me since it mainly went over how to match specific words.
Here's an example:
import re
io = [
    (r'my name is (?P<name>\w+)', 'Hi there, {name}'),
]
string = input('> ')
for regex, output in io:
    match = re.match(regex, string)
    if match:
        print(output.format(**match.groupdict()))
        break
I'll take you through it:
'my name is (?P<name>\w+)'
(?P<name>...) stores the following part (\w+) under the name name in the match object which we're going to use later on.
match = re.match(regex, string)
This looks for the regex in the given input. Note that re.match only matches at the beginning of the input; if you don't want that restriction, use re.search here instead.
If it matches:
output.format(**match.groupdict())
match.groupdict returns a dictionary of keys defined by (?P<name>...) and their associated matched values. ** passes those key/values to .format, in this case Python will translate it to output.format(name='matchedname').
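The difference between re.match and re.search mentioned above is easy to see on a small input. This is only a sketch; the sample sentence is made up for illustration:

```python
import re

pattern = r'my name is (?P<name>\w+)'
text = 'hello, my name is johnny'  # hypothetical input with leading words

# re.match only tries the pattern at position 0, so it fails here.
at_start = re.match(pattern, text)

# re.search scans forward until the pattern fits somewhere in the string.
anywhere = re.search(pattern, text)

print(at_start)                # None
print(anywhere.group('name'))  # johnny
```

If the chat bot should react to the phrase anywhere in the sentence, re.search is probably the better fit.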
To construct the io list from a file, do something like this:
io = []
with open('input.txt') as file_:
    for line in file_:
        # Drop the trailing newline, then split on the last ' | '.
        key, value = line.rstrip('\n').rsplit(' | ', 1)
        io.append((key, value))
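Note that the file in the question stores placeholders as <-name>, not as regex syntax, so each line needs a small translation step before it can be matched. Below is one possible sketch; the helper names (template_to_regex, template_to_format, respond) are made up for illustration, and it assumes every placeholder stands for a single word and that the answer text contains no literal curly braces:

```python
import re

# Matches placeholders such as <-name> and captures the bare name.
PLACEHOLDER = re.compile(r'<-(\w+)>')

def template_to_regex(template):
    """Turn 'my name is <-name>' into a regex with one named group per placeholder."""
    parts = []
    pos = 0
    for m in PLACEHOLDER.finditer(template):
        # Escape the literal text so characters like '?' or '.' are matched as-is.
        parts.append(re.escape(template[pos:m.start()]))
        parts.append(r'(?P<%s>\w+)' % m.group(1))
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return ''.join(parts)

def template_to_format(template):
    """Turn 'Hi there, <-name>' into the str.format template 'Hi there, {name}'."""
    return PLACEHOLDER.sub(r'{\1}', template)

def respond(line, user_text):
    """Return the answer for one 'input | output' line, or None if it does not match."""
    pattern, answer = line.rstrip('\n').rsplit(' | ', 1)
    match = re.match(template_to_regex(pattern), user_text)
    if match is None:
        return None
    return template_to_format(answer).format(**match.groupdict())

print(respond('my name is <-name> | Hi there, <-name>\n', 'my name is johnny'))
```

With the sample line from the question this prints Hi there, johnny.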
You are going to want to do a group match and then pull out the search groups.
First you would want to import re; re is the Python regex module.
Let's say that user_input is the variable holding the input string.
You then want to use the re.sub method to match your string and substitute something for it.
output = re.sub(input_regex, output_regex, user_input)
Now for the regex. First, put in the literal text you want to match:
input_regex = 'my name is '
If you want it to match explicitly from the start of the line, you should precede it with the caret:
input_regex = '^my name is '
You then want a group to match any string .+ (. is anything, + is 1 or more of the preceding item) until the end of the line '$'.
input_regex = '^my name is .+$'
Now you'll want to put that into a named group. Named groups take the form "(?P<name>regex)"; note that those angle brackets are literal.
input_regex = '^my name is (?P<name>.+)$'
You now have a regex that will match and give a match group named "name" with the user's name in it. The output string will need to reference the match group with "\g<name>":
output_regex = r'Hi there, \g<name>'
Putting it all together you can do it in a one liner (and the import):
import re
output = re.sub(r'^my name is (?P<name>.+)$', r'Hi there, \g<name>', user_input)
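One thing to watch with the re.sub one-liner: when the pattern does not match, re.sub returns the input unchanged, so the bot would simply echo the user. A possible way around that is re.subn, which also reports how many substitutions were made. This is a sketch only; the answer function is a made-up name:

```python
import re

PATTERN = r'^my name is (?P<name>.+)$'
REPLY = r'Hi there, \g<name>'

def answer(user_input):
    # re.subn returns (new_string, number_of_substitutions).
    new_text, count = re.subn(PATTERN, REPLY, user_input)
    # Zero substitutions means the input did not fit the format.
    return new_text if count else None

print(answer('my name is johnny'))  # Hi there, johnny
print(answer('what time is it'))    # None
```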
Asking for REGEXP inevitably leads to answers like the ones you're getting right now: demonstrations of basic REGEXP operations: how to split sentences, search for some term combination like 'my' + 'name' + 'is' within, etc.
In fact, you could learn all this from reading existing documentation and open source programs. REGEXP is not exactly easy. Still, you'll need to understand a bit on your own to really know what's going on if you want to change and extend your program. Don't just copy from the recipes here.
But you may even want to have something more comprehensive. Because you mentioned building a "chat bot", you may want to see how others are approaching that task, way beyond REGEXP.
So if the user writes 'my name is johnny', I want the program to know that 'johnny' is the '<-name>' variable, ...
From your question it's unclear how complex this program should become. What if the user types
'Johnny is my name.'
or
'Hey, my name is John X., but call me johnny.'
?
Take a look at the re module and pay attention to capturing groups.
For example, you can assume that the name will be a word, so it matches \w+. Then you have to construct a regular expression with a \w+ capturing group where the name should be (capturing groups are delimited by parentheses):
r'my name is (\w+)'
and then match it against the input (hint: look for match in the re module docs).
Once you get the match, you have to get the contents of capturing group (in this case at index 1, index 0 is reserved for the whole match) and use it to construct your response.
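Put together, the hints above might come out roughly like this. It is a sketch, not the only way to do it, and the greet function name is made up:

```python
import re

def greet(user_text):
    # The parentheses create capturing group 1 around the name.
    match = re.match(r'my name is (\w+)', user_text)
    if match is None:
        return None
    # group(0) would be the whole matched text; group(1) is just the name.
    return 'Hi there, ' + match.group(1)

print(greet('my name is johnny'))  # Hi there, johnny
```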
I'm working on a simple personal project that's required I learn to use regular expressions. I have successfully used findall() once before in my program:
def getStats():
    playername = input("Enter your OSRS name: ")
    try:
        with urllib.request.urlopen("https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=" + playername) as response:
            page = str(response.read())
            player.levels = re.findall(r',(\d\d),', page)
This worked fine and populated the list exactly as I wanted. I'm now trying to do something similar with a text file.
The text file contains a string, followed by a lot of digits, and then another string followed by a lot of digits, etc. I just want to populate a list with the text and ignore the digits, but I get no matches (the list is empty):
def getQuests():
    try:
        with open("quests.txt") as file:
            q = file.read()
            questList = re.findall(r',(\D\D),', q)
            print(questList)
Pythex link: https://pythex.org/?regex=%5CD%5CD&test_string=Desert%20Treasure%2C0%2C0%2C0%2C12%0AContact!%2C0%2C0%2C11%2C0%2C0%2C0%2C5%0ACook%27s%20Assistant%2C0%2C0%2C0%2C0%0AHorror%20from%20the%20Deep%2C0%2C0%2C13&ignorecase=0&multiline=0&dotall=0&verbose=0
I've gotten some help with the pattern and edited accordingly, but the list is still printing empty
def getQuests():
    try:
        with open("quests.txt") as file:
            q = file.read()
            questList = re.findall(r'^(\D+),', q)
Your pattern is incorrect. Firstly, in the demo you linked, the website is not very well designed and shows adjacent matches as one single match. \D\D matches exactly 2 non-digit characters. Also, you didn't include the commas you have in your pattern in the code. Anyway, here is a correct pattern:
^(\D+),
It matches the start of the line, then at least one non-digit character, then a comma. The first group contains the string you want to match.
Demo: https://regex101.com/r/pViF0h/2
In code:
import re
text = '''Desert Treasure,0,0,0,12
Contact!,0,0,11,0,0,0,5
Cook's Assistant,0,0,0,0
Horror from the Deep,0,0,13'''
print(re.findall(r'^(\D+),', text, re.M))
# ['Desert Treasure', 'Contact!', "Cook's Assistant", 'Horror from the Deep']
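The re.M flag in the call above matters: without it, ^ only matches at the very start of the whole string, so only the first line can produce a match. A small sketch with a shortened, made-up two-line sample:

```python
import re

text = 'Desert Treasure,0,0\nContact!,0,11'

# Without re.M, ^ matches only at the start of the string.
without_flag = re.findall(r'^(\D+),', text)

# With re.M (re.MULTILINE), ^ also matches right after each newline.
with_flag = re.findall(r'^(\D+),', text, re.M)

print(without_flag)  # ['Desert Treasure']
print(with_flag)     # ['Desert Treasure', 'Contact!']
```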
If the first entry is what you want no matter what, you can also use:
^(.+?),
Also, for these files, it is usually a much better idea to read it as a CSV and extract what you need that way.
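A minimal sketch of the CSV approach, assuming the quest name is always the first field. The data is inlined through io.StringIO here so the example runs on its own; with a real file you would pass the open file object to csv.reader instead:

```python
import csv
import io

text = """Desert Treasure,0,0,0,12
Contact!,0,0,11,0,0,0,5
Cook's Assistant,0,0,0,0
Horror from the Deep,0,0,13"""

# csv.reader yields one list of fields per line; keep only the first field.
# The "if row" guard skips blank lines.
quest_names = [row[0] for row in csv.reader(io.StringIO(text)) if row]

print(quest_names)
# ['Desert Treasure', 'Contact!', "Cook's Assistant", 'Horror from the Deep']
```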
Your TypeError solution is correct.
Without knowing what that webpage looks like, I do see one problem. In your working example, you use ',(\d\d),', but in the problem one you use ,(\D\D),. \d Matches any digit characters, but \D matches any non-digits.
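To make the \d versus \D point concrete, here is a small sketch on made-up sample strings. Besides swapping digits for non-digits, note that \D\D requires exactly two characters between the commas, which a quest name never is:

```python
import re

stats = 'Attack,75,Defence,60,'   # hypothetical hiscore-like text
quests = 'Desert Treasure,0,0,0,12'

# Two digits between commas: works on the stats text.
levels = re.findall(r',(\d\d),', stats)

# Exactly two non-digits between commas: nothing in the quest text fits.
names = re.findall(r',(\D\D),', quests)

print(levels)  # ['75', '60']
print(names)   # []
```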
In my code I only want to fetch a variable name from a C file when it is used in an if condition.
The following is the regex code snippet:
fieldMatch = re.findall(itemFieldList[i]+"=", codeline, re.IGNORECASE);
Here I can find the variable itemFieldList[i] in the file.
But when I try to add the if as shown below, nothing is extracted, even though the variable exists in an if condition in the C code.
fieldMatch = re.findall(("^(\w+)if+[(](\w+)("+itemFieldList[i]+")="), codeline, re.IGNORECASE|re.MULTILINE);
Can anyone suggest how to create a regex for the scenario described?
Sample Input :
IF(WORK.env_flow_ind=="R")
OR
IF( WORK.qa_flow_ind=="Q" OR WORK.env_flow_ind=="R")
here itemFieldList[i] = WORK.env_flow_ind
I don't have enough reputation to make this a comment, which it should be and I can't say that I fully understand the question. But to point out a few things:
If it's about adding variables to your regex, then you should be using string templates to make it more understandable for us and your future self.
"^{}".format(variable)
Doing that will allow you to create a dynamic regex that searches for what you want.
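One caveat when formatting a variable into a pattern: characters such as the dot in WORK.env_flow_ind are regex metacharacters. Wrapping the value in re.escape keeps them literal. A small sketch (the near-miss test string is made up):

```python
import re

variable = "WORK.env_flow_ind"

# Unescaped: the '.' in the variable name matches ANY character.
unescaped = re.compile("^{}".format(variable))

# Escaped: the '.' only matches a literal dot.
escaped = re.compile("^{}".format(re.escape(variable)))

near_miss = "WORKXenv_flow_ind"
print(bool(unescaped.match(near_miss)))  # True
print(bool(escaped.match(near_miss)))    # False
print(bool(escaped.match(variable)))     # True
```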
Secondly, I don't think that is your problem. I think that your regex is malformed. I don't know exactly what you are trying to search for, but I recommend reading the Python regex documentation and testing your regex on a resource like regex101 to make sure that you're capturing what you intend to. From what I can see, you are a bit confused about groups. When you put parentheses around a pattern, you are identifying it as a group. You were on the right track trying to exclude the parentheses from your search by surrounding them with square brackets, but it's simpler and cleaner to escape them.
If you are trying to capture this statement:
if(someCondition == fries)
and you want to extract the keyword fries, a pattern that works is:
(?=if\((?:[\w=\s])+(fries)\))
Since you want this to be dynamic you would replace the string fries with your string template, and you'll get code that ends up something like this:
p = re.compile(r"(?=if\((?:[\w=\s])+({})\))".format(search), re.IGNORECASE)
p.findall(string)
Regex101 does a better job of breaking down my regex than I ever will.
You can build the regex pattern as:
pattern = r"\bif\b\s*\(.*?\b" + re.escape(variablename) + r"\b"
This will look for the word “if” in lowercase, then optionally any spaces, then an opening parenthesis, then optionally any characters, and then your search term, its beginning and its end at word boundaries.
So if variablename is "WORK.env_flow_ind", then re.findall(pattern, textfile) will match the following lines:
if(blabla & WORK.env_flow_ind == "a")
if (WORK.env_flow_ind == "b")
if(WORK.env_flow_ind == "b")
if( WORK.env_flow_ind == "b")
and these won't match:
if (WORK.env_bla == "c")
if (WORK.env_flow_ind2 == "d")
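The sample input in the question writes IF in uppercase, so the pattern above would presumably need re.IGNORECASE to catch it. A sketch of how that could look; the if_uses_variable helper is a made-up name:

```python
import re

def if_uses_variable(code_line, variable_name):
    """Return True if variable_name appears inside an if(...) on this line."""
    pattern = r"\bif\b\s*\(.*?\b" + re.escape(variable_name) + r"\b"
    # IGNORECASE so that both 'if' and 'IF' are accepted.
    return re.search(pattern, code_line, re.IGNORECASE) is not None

name = "WORK.env_flow_ind"
print(if_uses_variable('IF(WORK.env_flow_ind=="R")', name))                          # True
print(if_uses_variable('IF( WORK.qa_flow_ind=="Q" OR WORK.env_flow_ind=="R")', name))  # True
print(if_uses_variable('IF(WORK.env_flow_ind2=="R")', name))                         # False
```

The trailing \b is what rejects WORK.env_flow_ind2, since there is no word boundary between "ind" and "2".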
I'd like to match the urls like this:
input:
x = "https://play.google.com/store/apps/details?id=com.alibaba.aliexpresshd&hl=en"
get_id(x)
output:
com.alibaba.aliexpresshd
What is the best way to do it with re in python?
def get_id(toParse):
return re.search('id=(WHAT TO WRITE HERE?)', toParse).groups()[0]
I found only the case with exactly one dot.
You could try:
r'\?id=([a-zA-Z\.]+)'
For your regex, like so:
def get_id(toParse):
    regex = r'\?id=([a-zA-Z\.]+)'
    x = re.findall(regex, toParse)[0]
    return x
Regex -
By adding r before the regex, we specify that it is a raw string, so we don't have to double every backslash in the pattern.
? holds special meaning in regex syntax, so to match a literal question mark, we precede it with a backslash: \?
id= matches the id= part of the URL
([a-zA-Z\.]+) is capture group 1 of the regex, which matches the id in the URL. Because the pattern has exactly one group, re.findall returns a list of that group's matches, so [0] gives the desired text.
Note - I have used re.findall for this because it returns a list whose element at index 0 is the extracted text.
I recommend you take a look at rexegg.com for a full list of regex syntax.
Actually, you do not need to put anything "special" there.
Since you know that the bundle id is between id= and &, you can just capture whatever is inside and have your result in capture group like this:
id=(.+)&
So the code would look like this:
def get_id(toParse):
    return re.search('id=(.+)&', toParse).groups()[0]
Note: in Python, .groups() returns only the captured groups, so index 0 is the first capture here. The equivalent call .group(1) uses index 1, because .group(0) is reserved for the full match.
This regex should easily get what you want, it gets everything between id= and either the following parameter (.*? being ungreedy), or the end of the string.
id=(.*?)(&|$)
If you only need the id itself, it will be in the first group.
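Here is a sketch of that last pattern wrapped in a function, next to a regex-free alternative using urllib.parse from the standard library (that alternative is my own suggestion, not something from the answers above). The function names are made up:

```python
import re
from urllib.parse import urlparse, parse_qs

def get_id_regex(url):
    # Lazy match up to the next '&' or the end of the string.
    match = re.search(r'id=(.*?)(&|$)', url)
    return match.group(1) if match else None

def get_id_parsed(url):
    # parse_qs maps each query parameter to a list of its values.
    values = parse_qs(urlparse(url).query).get('id')
    return values[0] if values else None

x = "https://play.google.com/store/apps/details?id=com.alibaba.aliexpresshd&hl=en"
print(get_id_regex(x))   # com.alibaba.aliexpresshd
print(get_id_parsed(x))  # com.alibaba.aliexpresshd
```

Both versions return None instead of raising when the URL has no id parameter.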
I am learning regular expressions in Python but couldn't find out what the numbering in .group() is based on.
Here is my code:
import re
string = 'suzi sabin joe brandon josh'
print(re.search(r'^.*\b(suzi|sabin|joe|brandon|josh)\b.*$', string).group(0))
# output : suzi sabin joe brandon josh
print(re.search(r'^.*\b(suzi|sabin|joe|brandon|josh)\b.*$', string).group(1))
# output : josh
I am wondering
Why is there only group(1) and not group(1-5)?
Why was josh classified into group(1)?
I am thankful for any advice.
When you call group(0), you get the whole matched text, which is the whole string, since your pattern matches from the beginning of the string to the end.
While the regex matches everything, it only captures one name (in group 1 because regex counts from 1 for historical reasons). Because the first .* is greedy (it tries to match as much text as possible), it gobbles up the earlier names, and the captured name is the last one, "josh" (and the last .* matches an empty string). The captured name is what you get when you call group(1).
If you want to separately capture each name, you'll need to do things differently. Probably something like this would work:
print(re.findall(r'\b(suzi|sabin|joe|brandon|josh)\b', string))
This will print the list ['suzi', 'sabin', 'joe', 'brandon', 'josh']. Each name will appear in the output in the same order it appears in the input string, which need not be the same order they were in the pattern. This might not do exactly what you want though, since it will skip over any text that isn't one of the names you're looking for (rather than failing to match anything).
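To see both effects side by side (greediness deciding which name lands in group 1, and the number of groups following the number of parentheses in the pattern), here is a small sketch using the same string:

```python
import re

string = 'suzi sabin joe brandon josh'
names = r'(suzi|sabin|joe|brandon|josh)'

# Greedy .* swallows as much as it can, so group 1 is the LAST name.
greedy = re.search(r'^.*\b' + names + r'\b.*$', string)
# Lazy .*? swallows as little as it can, so group 1 is the FIRST name.
lazy = re.search(r'^.*?\b' + names + r'\b.*$', string)

print(greedy.group(1))  # josh
print(lazy.group(1))    # suzi

# Two pairs of parentheses in the pattern give two numbered groups.
two = re.search(r'\b' + names + r' ' + names + r'\b', string)
print(two.groups())     # ('suzi', 'sabin')
```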
Given a replacement template such as r"\g<NAME>\w+", I would like to detect that a group named NAME must be used for replacements corresponding to a match.
Which regex matches the wrong use of \g<...> ?
For example, the following code finds any not escaped groups.
p = re.compile(r"(?:[^\\])\\g<(\w+)>")
for m in p.finditer(r"\g<NAME>\w+\\g<ESCAPED>"):
    print(m.group(1))
But there is a last problem to solve. How can I manage cases of \g<WRONGUSE\> and \g\<WRONGUSE> ?
As far as I am aware, the only restriction on named capture groups is that you can't put metacharacters in there, such as . \, etc...
Have you come across some kind of problem with named capture groups?
The regex you used, r"illegal|(\g<NAME>\w+)" is only illegal because you referred to a backreference without it being declared earlier in the regex string. If you want to make a named capture group, it is (?P<NAME>regex)
Like this:
>>> import re
>>> string = "sup bro"
>>> re.sub(r"(?P<greeting>sup) bro", r"\g<greeting> mate", string)
'sup mate'
If you wanted to do some kind of analysis on the actual regex string in use, I don't think there is anything inside the re module which can do this natively.
You would need to run another match on the string itself, so, you would put the regex into a string variable and then match something like \(\?P<(\w+)> which would give you the named capture group's name.
I hope that is what you are asking for... Let me know.
So, what you want is to get the string of the group name, right?
Maybe you can get it by doing this:
>>> regex = re.compile(r"illegal|(?P<group_name>\w+)")
>>> dict(regex.groupindex)
{'group_name': 1}
As you see, groupindex maps each group name to its position in the regex. Having that, it is easy to retrieve the string:
>>> # A list of the group names in your regex:
>>> list(regex.groupindex)
['group_name']
>>> # The string of your group name:
>>> list(regex.groupindex)[0]
'group_name'
Don't know if that is what you were looking for...
Use a negative lookahead?
\\g(?!<\w+>)
This searches for any \g not followed by <…>, thus a "wrong use".
Thanks to all the comments, I have this solution.
# Good uses.
p = re.compile(r"(?:[^\\])\\g<(\w+)>")
for m in p.finditer(r"</\g\<at__tribut1>\\g<notattribut>>"):
    print(m.group(1))
# Bad uses.
p = re.compile(r"(?:[^\\])\\g(?!<\w+>)")
if p.search(r"</\g\<at__tribut1>\\g<notattribut>>"):
    print("Wrong use !")
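One limitation of (?:[^\\]) is that it needs a character in front of the \g, so a \g<NAME> at the very start of the string is never found. A possible variation, swapping in a negative lookbehind (my own substitution, not part of the solution above), is sketched below. It assumes a backslash directly before \g always means "escaped" and does not try to handle longer runs of backslashes:

```python
import re

# \g<name> not preceded by a backslash: a correct, unescaped reference.
GOOD_USE = re.compile(r"(?<!\\)\\g<(\w+)>")

# \g not preceded by a backslash and not followed by <name>: a wrong use.
BAD_USE = re.compile(r"(?<!\\)\\g(?!<\w+>)")

print(GOOD_USE.findall(r"\g<NAME>\w+\\g<ESCAPED>"))  # ['NAME']

for template in (r"\g<WRONGUSE\>", r"\g\<WRONGUSE>", r"\g<NAME>"):
    status = "Wrong use !" if BAD_USE.search(template) else "ok"
    print(template, status)
```

The lookbehind consumes no characters, which is why the reference at position 0 is now picked up.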