Newb programmer here working on my first project. I've searched this site and the python documentation, and either I'm not seeing the answer, or I'm not using the right terminology. I've read the regex and if sections, specifically, and followed links around to other parts that seemed relevant.
import re
keyphrase = '##' + '' + '##'
print(keyphrase) #output is ####
j = input('> ') ###whatever##
if keyphrase in j:
print('yay')
else:
print('you still haven\'t figured it out...')
k = j.replace('#', '')
print(k) #whatever
This is for a little reddit bot project. I want the bot to be called like ##whatever## and then be able to do things with the word(s) in between the ##'s. I've set up the above code to test if Python was reading it but I keep getting my "you still haven't figured it out..." quip.
I tried adding the REGEX \W in the middle of keyphrase, to no avail. Also weird combinations of \$\$ and quotes
So, my question, is how do I put a placeholder in keyphrase for user input?
For instance, if a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.
You could use the following regex r'##(.*?)##' to capture everything inside of the key phrase you've chosen.
Sample Output:
>>> import re
>>> f = lambda s: re.match(r'##(.*?)##', s).group(1)
>>> f("##whatever##")
whatever
>>> f = lambda s: re.findall(r'##(.*?)##', s)
>>> f("a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.")
['comment', 'this', 'I can grab']
How does it work? (1) We state the string constant head and tail for the capture group 1 between the brackets (). Great, almost there! (2) We then match any character .*? with greedy search enforced so that we capture the whole string.
Suggested Readings:
Introduction to Regex in Python - Jee Gikera
Something like this should work:
import re
keyphrase_regex = re.compile(r'##(.*)##')
user_input = input('> ')
keyphrase_match = keyphrase_regex.search(user_input)
# `search` returns `None` if regex didn't match anywhere in the string
keyphrase_content = keyphrase_match.group(1) if keyphrase_match else None
if keyphrase_content:
keyphrase_content = keyphrase_match.group(1)
print('yay! You submitted "', keyphrase_content, '" to the bot!')
else:
# Bonus tip: Use double quotes to make a string containing apostrophe
# without using a backslash escape
print("you still haven't figured it out...")
# Use `keyphrase_content` for whatever down here
Regular expressions are kind of hard to wrap your head around, because they work differently than most programming constructs. It's a language to describe patterns.
Regex One is a fantastic beginners guide.
Regex101 is an online sandbox that allows you to type a regular expression and some sample strings, then see what matches (and why) in real time
The regex ##(.*)## basically means "search through the string until you find two '#' signs. Right after those, start capturing zero-or-more of any character. If you find another '#', stop capturing characters. If that '#' is followed by another one, stop looking at the string, return successfully, and hold onto the entire match (from first '#' to last '#'). Also, hold onto the captured characters in case the programmer asks you for just them.
EDIT: Props to #ospahiu for bringing up the ? lazy quantifier. A final solution, combining our approaches, would look like this:
# whatever_bot.py
import re
# Technically, Python >2.5 will compile and cache regexes automatically.
# For tiny projects, it shouldn't make a difference. I think it's better style, though.
# "Explicit is better than implicit"
keyphrase_regex = re.compile(r'##(.*?)##')
def parse_keyphrases(input):
return keyphrase_regex.find_all(input)
Lambdas are cool. I prefer them for one-off things, but the code above is something I'd rather put in a module. Personal preference.
You could even make the regex substitutable, using the '##' one by default
# whatever_bot.py
import re
keyphrase_double_at_sign = re.compile(r'##(.*?)##')
def parse_keyphrases(input, keyphrase_regex=keyphrase_double_at_sign):
return keyphrase_regex.find_all(input)
You could even go bonkers and write a function that generates a keyphrase regex from an arbitrary "tag" pattern! I'll leave that as an exercise for the reader ;) Just remember: Several characters have special regex meanings, like '*' and '?', so if you want to match that literal character, you'd need to escape them (e.g. '\?').
If you want to grab the content between the "#", then try this:
j = input("> ")
"".join(j.split("#"))
You're not getting any of the info between the #'s in your example because you're effectively looking for '####' in whatever input you give it. Unless you happen to put 4 #'s in a row, that RE will never match.
What you want to do instead is something like
re.match('##\W+##', j)
which will look for 2 leading ##s, then any number greater than 1 alphanumeric characters (\W+), then 2 trailing ##s. From there, your strip code looks fine and you should be able to grab it.
Related
So I was having a bit of trouble wording my question, but essentially I am working on a client application making commands to a IRC chat server that provide certain functionality. It was suggested we use regex to do the parsing of such commands. The first command that needs to be completed when a client is accepted by the server is the USER command which in general will look something like this:
"USER guest 0 * :Ronnie Reagan"
The parts are the USER, followed by the username which is 1 word and I believe can contain numbers, the mode which is a numeric value from 0-9 that indicates your current mode in the chat, the star is just unused extra stuff but it has to be there, and the last part is a colon with no space before the real name. Just as a note the manual doesn't say the real name has to be two separate names, just that it can contain spaces, so it can be any combination of letters and spaces even though its kind of weird.
This is what I came up with based on what I read about regex but have had some issues testing it.
"USER\s[a-zA-Z0-9]\s\d\s*\s:[a-zA-z\s]"
Here is the simple program I was using to test it based on some light tutorials I looked through
import re
userPattern = re.compile("USER\s[a-zA-Z0-9]\s\d\s*\s:[a-zA-z\s]")
while True:
regexTest = input()
isMatch = userPattern.match(regexTest)
if bool(isMatch) == True:
print("valid request")
else:
print("invalid request")
No matter the case I always get an invalid request and I've tried it in a few other ways too. I can't tell if its because something is wrong with my regex or my method of testing it.
There are some issues in your regex:
[a-zA-Z0-9] represents a single character, you want a plus sign at the end of it so that it matches 1 or more characters: [a-zA-Z0-9]+. Same thing about [a-zA-z\s].
* in regex is a special symbol, you need to escape it if you want to match an asterisk: \*
So this is the fixed version of your regex that should work:
USER\s[a-zA-Z0-9]+\s\d\s\*\s:[a-zA-z\s]+
But I think it could be simplified:
If you don't care about what goes after a colon then you can just use .+ there
Instead of [a-zA-Z0-9] you can just use \w (word matcher)
So I think this would work as well:
USER\s\w+\s\d\s\*\s:.+
Python beginner here. I'm stumped on part of this code for a bot I'm writing.
I am making a reddit bot using Praw to comb through posts and removed a specific set of characters (steam CD keys).
I made a test post here: https://www.reddit.com/r/pythonforengineers/comments/91m4l0/testing_my_reddit_scraping_bot/
This should have all the formats of keys.
Currently, my bot is able to find the post using a regex expression. I have these variables:
steamKey15 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w')
steamKey25 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.')
steamKey17 = (r'\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\s\w\w')
I am finding the text using this:
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
if submission.id not in steamKeyPostID:
if re.search(steamKey15, submission.selftext, re.IGNORECASE):
searchLogic()
saveSteamKey()
So this is just to show that the things I should be using in a filter function is a combination of steamKey15/25/17, and submission.selftext.
So here is the part where I am confused. I cant find a function that works, or is doing what I want. My goal is to remove all the text from submission.selftext(the body of the post) BUT the keys, which will eventually be saved in a .txt file.
Any advice on a good way to go around this? I've looked into re.sub and .translate but I don't understand how the parts fit together.
I am using Python 3.7 if it helps.
can't you just get the regexp results?
m = re.search(steamKey15, submission.selftext, re.IGNORECASE)
if m:
print(m.group(0))
Also note that a dot . means any char in a regexp. If you want to match only dots, you should use \.. You can probably write your regexp like this instead:
r'\w{5}[-.]\w{5}[-.]\w{5}'
This will match the key when separated by . or by -.
Note that this will also match anything that begin or end with a key, or has a key in the middle - that can cause you problems as your 15-char key regexp is contained in the 25-key one! To fix that use negative lookahead/negative lookbehind:
r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])'
that will only find the keys if there are no extraneous characters before and after them
Another hint is to use re.findall instead of re.search - some posts contain more than one steam key in the same post! findall will return all matches while search only returns the first one.
So a couple things first . means any character in regex. I think you know that, but just to be sure. Also \w\w\w\w\w can be replaced with \w{5} where this specifies 5 alphanumerics. I would use re.findall.
import re
steamKey15 = (r'(?:\w{5}.){2}\w{5}')
steamKey25 = (r'(?:\w{5}.){5}')
steamKey17 = (r'\w{15}\s\w\w')
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
if submission.id not in steamKeyPostID:
finds_15 = re.findall(steamKey15, submission.selftext)
finds_25 = re.findall(steamKey25, submission.selftext)
finds_17 = re.findall(steamKey17, submission.selftext)
I have a long string like this:
'[("He tended to be helpful, enthusiastic, and encouraging, even to studentsthat didn\'t have very much innate talent.\\n",), (\'Great instructor\\n\',), (\'He could always say something nice and was always helpful.\\n\',), (\'He knew what he was doing.\\n\',), (\'Likes art\\n\',), (\'He enjoys the classwork.\\n\',), (\'Good discussion of ideas\\n\',), (\'Open-minded\\n\',), (\'We learned stuff without having to take notes, we just applied it to what we were doing; made it an interesting and fun class.\\n\',), (\'Very kind, gave good insight on assignments\\n\',), (\' Really pushed me in what I can do; expanded how I thought about art, the materials used, and how it was visually.\\n\',)
and I want to remove all [, (, ", \, \n from this string at once. Somehow I can do it one by one, but always failed with '\n'. Is there any efficient way I can remove or translate all these characters or blank lines symbols?
Since my senectiecs are not long so I do not want to use dictionary methods like earlier questions.
Maybe you could use regex to find all the characters that you want to replace
s = s.strip()
r = re.compile("\[|\(|\)|\]|\\|\"|'|,")
s = re.sub(r, '', s)
print s.replace("\\n", "")
I have some problems with the "\n" but replacing after the regex is easy to remove too.
If string is correct python expression then you can use literal_eval from ast module to transform string to tuples and after that you can process every tuple.
from ast import literal_eval
' '.join(el[0].strip() for el in literal_eval(your_string))
If not then you can use this:
def get_part_string(your_string):
for part in re.findall(r'\((.+?)\)', your_string):
yield re.sub(r'[\"\'\\\\n]', '', part).strip(', ')
''.join(get_part_string(your_string))
--SOLVED--
I solved my issue by enabling multiline mode, and now the characters ^ and $ work perfectly for identifying the beginning and end of each string
--EDIT--
My code:
import re
import test_regex
def regex_content(text_content, regex_dictionary):
#text_content = text_content.lower()
regex_matches = []
# Search sanitized text (markup removed) for DLP theme keywords
for key,value in regex_dictionary.items():
# Get confiiguration settings
min_matches = value.get('min_matches',1)
risk = value.get('risk',1)
enabled = value.get('enabled',False)
regex_str = value.get('regex','')
# Fast compute True/False hit for each DLP theme word
if enabled:
print "Searching for key : %s" % (key)
my_regex = re.compile(value.get('regex'))
hits = my_regex.findall(text_content)
if len(hits) > 0:
regex_matches.append((key, risk, len(hits), hits))
# Return array of results (key, risk, number of hits, regex matches)
return regex_matches
def main():
#print defaults.test_regex.dlp_regex
text_content = ""
for line in open('testData.txt'):
text_content+=line
for match in regex_content(text_content, test_regex.dlp_regex):
print "\nFound %s : %s" % (match[0], match[3])
print "\n"
if __name__ == '__main__':
main()
and it is using the regex found here:
'Large number of US Zip Codes' : { 'regex' : "\b\d{5}(?:-\d{1,4})?\b"},
When I precede my regex with the 'r' flag, I can find the zip codes I'm looking for, but as well as every other 5 digit number in my document I am searching through. From my understanding this is because it ignored the \b characters. Without the r flag though, it cannot find any zip codes. It works perfectly fine in regexr, but not in my code. I haven't had any luck making \b characters work, nor ^ and $ for identifying the beginnings and ends of the strings I'm searching for. What is it that I am misunderstanding about these special characters?
--Original post--
I am writing a regex for identifying zip codes (and only zip codes), so to avoid false positives I am trying to include a boundary on my regex, using both of the following:
\b\d{5}\b|\b\d{5}-\b\d{1,4}\b
using the online regex debugger Regexr, my code should correctly catch 5 digit zip codes, such as 34332. However, I have two problems:
1. This regex is not working in my actual code for finding any zip codes, but it does work when I don't have the boundary (\b) characters. The exact code I'm trying to extract with my regex is:
Zip:
----
98839-0111
34332
2. I don't see why my regex can't correctly identify 98839-0111 in Regexr. I tried doing the super-primitive approach of
\b\d{5}\b|98839-0111
and even that couldn't identify 98839-0111. Does anyone know what could be going on?
Note: I have also tried using ^ and $ for the boundaries of my regex, but this also doesn't find the regex's, not even in Regexr.
EDIT: After removing the first part of my regex, leaving only
98839-0111
It can now correctly identify it. I guess this means that once a string is pulled out by one of my regex's, it can no longer be found by any subsequent regexs? Why is this?
It is because of the alternative list: the first part was matched, and the engine stopped checking.
Try this regex
98839-0111|\b\d{5}\b
And you'll get a match.
Or, to be more generic in your case:
\b(?:\d{5}-\d{4}|\d{5})\b
will match both, and more (actually, functionally the same as \b\d{5}(?:-\d{4})?\b). See demo.
Your pattern is evaluated for each position in the string from the left to the right, so if the left branch of your pattern succeeds, the second branch isn't tested at all.
I suggest you to use this pattern that solves the problem:
\b\d{5}(?:-\d{1,4})?\b
You can use this regex:
\b(\d{5}-\d{1,4}|\d{5})\b
Working demo
I faced an error on "bad group name".
Here is the code:
for qitem in q['display']:
if qitem['type'] == 1:
for keyword in keywordTags.split('|'):
p = re.compile('^' + keyword + '$')
newstring=''
for word in qitem['value'].split():
if word[-1:] == ',':
word = word[0:len(word)-1]
newstring += (p.sub('<b>'+word+'</b>', word) + ', ')
else:
newstring += (p.sub('<b>'+word+'</b>', word) + ' ')
qitem['value']=newstring
And here's the error:
error at /result/1/
bad group name
Request Method: GET
Django Version: 1.4.1
Exception Type: error
Exception Value: bad group name
Exception Location: C:\Python27\lib\re.py in _compile_repl, line 257
Python Executable: C:\Python27\python.exe
Python Version: 2.7.3 Python
Path: ['D:\ExamPapers', 'C:\Windows\SYSTEM32\python27.zip',
'C:\Python27\DLLs', 'C:\Python27\lib',
'C:\Python27\lib\plat-win', 'C:\Python27\lib\lib-tk',
'C:\Python27', 'C:\Python27\lib\site-packages']
Server time: Sun,3 Mar 2013 15:31:05 +0800
Traceback Switch to copy-and-paste view
C:\Python27\lib\site-packages\django\core\handlers\base.py in get_response
response = callback(request, *callback_args, **callback_kwargs) ... ▶ Local vars ?
D:\ExamPapers\views.py in result
newstring += (p.sub(''+word+'', word) + ' ') ... ▶ Local vars
In summary, the error is at:
newstring += (p.sub('<b>'+word+'</b>', word) + ' ')
So you're trying to highlight in bold an occurrence of a set of keywords. Right now this code is broken in quite a lot of ways. You're using the re module right now to match the keywords but you're also breaking the keywords and the strings down into individual words, you don't need to do both and the interaction between these two different approaches to the solving the problem are what is causing you issues.
You can use regular expressions to match multiple possible strings at the same time, that's what they're good for! So instead of "^keyword$" to match just "keyword" you could use "^keyword|hello$" to match either "keyword" or "hello". You also use the ^ and $ characters which only match the beginning or end of the entire string, but what you probably wanted originally was to match the beginning or end of words, for this you can use \b like this r"\b(keyword|hello)\b". Note that in the last example I added a r character before the string, this stands for "raw" and turns off pythons usual handling of back slash characters which conflicts with regular expressions, it's good practice to always use the r before the string when the string contains a regular expression. I also used brackets to group together the words.
The regular expression sub method allows you to substitute things matched by a regular expression with another string. It also allow you to make "back references" in the replacing string that include parts of original string that matched. The parts that it includes are called "groups" and are indicated with brackets in the original regular expression, in the example above there is only one set of brackets and these are the first so they're indicated by the back reference \1. The cause of the actual error message you asked about is that your replacement string contained what looked like a backref but there weren't any groups in your regular expression.
Using that you do something like this:
keywordMatcher = re.compile(r"\b(keyword|hello)\b")
value = keywordMatcher.sub(r"<b>\1</b>", value)
Another thing that isn't directly related to what you're asking but is incredibly important is that you are taking source plain text strings (I assume) and making them into HTML, this gives a lot of chance for script injection vulnerabilities which if you don't take the time to understand and avoid will allow bad guys to hack the applications you build (they can do this in an automated way, so even if you think your app will be too small for anyone to notice it can still get hacked and used for all sorts of bad things, don't let this happen!). The basic rule is that it's ok to convert text to HTML but you need to "escape" it first, this is very simple:
from django.utils import html
html_safe = html.escape(my_text)
All this does is convert characters like < to < which the browser will show as < but won't interpret as the beginning of a tag. So if a bad guy types <script> into one of your forms and it gets processed by your code it will display it as <script> and not execute it as a script.
Likewise, if you use an text in a regular expression that you don't intend to have special regular expression characters then you must escape that too! You can do this using re.escape:
import re
my_regexp = re.compile(r"\b%s\b" % (re.escape(my_word),))
Ok, so now we've got that out of the way here is a method you could use to do what you wanted:
value = "this is my super duper testing thingy"
keywords = "super|my|test"
from django.utils import html
import re
# first we must split up the keywords
keywords = keywords.split("|")
# Next we must make each keyword safe for use in a regular expression,
# this is similar to the HTML escaping we discussed above but not to
# be confused with it.
keywords = [re.escape(k) for k in keywords]
# Now we reform the keywordTags string, but this time we know each keyword is regexp-safe
keywords = "|".join(keywords)
# Finally we create a regular expression that matches *any* of the keywords
keywordMatcher = re.compile(r'\b(%s)\b' % (keywords,))
# We are going to make the value into HTML (by adding <b> tags) so must first escape it
value = html.escape(value)
# We can then apply the regular expression to the value. We use a "back reference" `\0` to say
# that each keyword found should be replace with itself wrapped in a <b> tag
value = keywordMatcher.sub(r"<b>\1</b>", value)
print value
I urge you to take the time to understand what this does, otherwise you're just going to get yourself into a mess! It's always easier to just cut and paste and move on but this leads to crappy broken code and worse of all means you yourself don't improve and don't learn. All great coders started of as beginner coders who took the time to understand things :)