using \b in regex

using \b in regex - python

--SOLVED--
I solved my issue by enabling multiline mode, and now the characters ^ and $ work perfectly for identifying the beginning and end of each string
--EDIT--
My code:
import re
import test_regex
def regex_content(text_content, regex_dictionary):
#text_content = text_content.lower()
regex_matches = []
# Search sanitized text (markup removed) for DLP theme keywords
for key,value in regex_dictionary.items():
# Get confiiguration settings
min_matches = value.get('min_matches',1)
risk = value.get('risk',1)
enabled = value.get('enabled',False)
regex_str = value.get('regex','')
# Fast compute True/False hit for each DLP theme word
if enabled:
print "Searching for key : %s" % (key)
my_regex = re.compile(value.get('regex'))
hits = my_regex.findall(text_content)
if len(hits) > 0:
regex_matches.append((key, risk, len(hits), hits))
# Return array of results (key, risk, number of hits, regex matches)
return regex_matches
def main():
#print defaults.test_regex.dlp_regex
text_content = ""
for line in open('testData.txt'):
text_content+=line
for match in regex_content(text_content, test_regex.dlp_regex):
print "\nFound %s : %s" % (match[0], match[3])
print "\n"
if __name__ == '__main__':
main()
and it is using the regex found here:
'Large number of US Zip Codes' : { 'regex' : "\b\d{5}(?:-\d{1,4})?\b"},
When I precede my regex with the 'r' flag, I can find the zip codes I'm looking for, but as well as every other 5 digit number in my document I am searching through. From my understanding this is because it ignored the \b characters. Without the r flag though, it cannot find any zip codes. It works perfectly fine in regexr, but not in my code. I haven't had any luck making \b characters work, nor ^ and $ for identifying the beginnings and ends of the strings I'm searching for. What is it that I am misunderstanding about these special characters?
--Original post--
I am writing a regex for identifying zip codes (and only zip codes), so to avoid false positives I am trying to include a boundary on my regex, using both of the following:
\b\d{5}\b|\b\d{5}-\b\d{1,4}\b
using the online regex debugger Regexr, my code should correctly catch 5 digit zip codes, such as 34332. However, I have two problems:
1. This regex is not working in my actual code for finding any zip codes, but it does work when I don't have the boundary (\b) characters. The exact code I'm trying to extract with my regex is:
Zip:
----
98839-0111
34332
2. I don't see why my regex can't correctly identify 98839-0111 in Regexr. I tried doing the super-primitive approach of
\b\d{5}\b|98839-0111
and even that couldn't identify 98839-0111. Does anyone know what could be going on?
Note: I have also tried using ^ and $ for the boundaries of my regex, but this also doesn't find the regex's, not even in Regexr.
EDIT: After removing the first part of my regex, leaving only
98839-0111
It can now correctly identify it. I guess this means that once a string is pulled out by one of my regex's, it can no longer be found by any subsequent regexs? Why is this?

It is because of the alternative list: the first part was matched, and the engine stopped checking.
Try this regex
98839-0111|\b\d{5}\b
And you'll get a match.
Or, to be more generic in your case:
\b(?:\d{5}-\d{4}|\d{5})\b
will match both, and more (actually, functionally the same as \b\d{5}(?:-\d{4})?\b). See demo.

Your pattern is evaluated for each position in the string from the left to the right, so if the left branch of your pattern succeeds, the second branch isn't tested at all.
I suggest you to use this pattern that solves the problem:
\b\d{5}(?:-\d{1,4})?\b

You can use this regex:
\b(\d{5}-\d{1,4}|\d{5})\b
Working demo

Related

Extract values in name=value lines with regex

I'm really sorry for asking because there are some questions like this around. But can't get the answer fixed to make problem.
This are the input lines (e.g. from a config file)
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values behind the = sign with Python 3.9. regex.
I tried this on regex101.
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as result; e.g. profile2.name=. But I want exactly the inverted opposite.
The expected results (what Pythons re.find_all() return) are
['share2', 'share8', 'shareSSH', 'share9']

Try pattern profile\d+\.name=(.*), look at Regex 101 example
import re
re.findall('profile\d+\.name=(.*)', txt)
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need regex, split should work absolutely fine:

Try removing the ? quantifier. It will make your capture group match an empty st
regex101

Why is re.findall behaving like this? (python regex)

I made a small program in pyhton that searches through a music website and collects music data. The music has a format of [artist] - [music name] [music file format]. At first I used re.search to find a certain artist (I used regex because there are some other characters and irregularities in the music info above, and the only indicator for finding the artist was the - following the artist).
Somehow it didn't work so I changed it to re.findall just in case but it still didn't work. since I'm a beginner at python I thought I sis something wrong so I wrote some test code to study what was wrong. And this is what I got.
when I changed the x string (which would be the music info) and ran re.findall again it gave me a different result(none). I 100% thought the result would be the same. why is this behaving like this? And could this be the reason why my original code's re.serach, re.findall wasn't working?
I've included the code just in case. (used selenium)
idx = 1
while True:
try:
hxp1 = "(//h3[#class='entry-title td-module-title']/a)[" + str(idx) + "]"
text = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.XPATH, hxp1)))
# info = eg) 'Michael Jackson - Beat it [FLAC, MP3, WAV]'
info = text.get_attribute('title') # get 'info' as string
# ARTIST = eg) 'Michael Jackson'
regex = ARTIST + ' - '
match = re.findall(regex, info) # or use re.search
# do something with 'match'...
idx += 1
except:
# do something...
break

It seems you need to make sure you match
any Unicode whitespaces (i.e. \s in Python 3.x, or (?u)\s in Python 2.x, see re documentation: "Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages).")
any Unicode hyphens (see Searching for all Unicode variation of hyphens in Python).
Combining all that into your regex:
Minami\s[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]\s
In your case, if you just need to support en-dash/em-dash/hyhen chars and any Unicode whitespace chars, you can use
Minami\s[-—–]\s

How to filter out specific strings from a string

Python beginner here. I'm stumped on part of this code for a bot I'm writing.
I am making a reddit bot using Praw to comb through posts and removed a specific set of characters (steam CD keys).
I made a test post here: https://www.reddit.com/r/pythonforengineers/comments/91m4l0/testing_my_reddit_scraping_bot/
This should have all the formats of keys.
Currently, my bot is able to find the post using a regex expression. I have these variables:
steamKey15 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w')
steamKey25 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.')
steamKey17 = (r'\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\s\w\w')
I am finding the text using this:
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
if submission.id not in steamKeyPostID:
if re.search(steamKey15, submission.selftext, re.IGNORECASE):
searchLogic()
saveSteamKey()
So this is just to show that the things I should be using in a filter function is a combination of steamKey15/25/17, and submission.selftext.
So here is the part where I am confused. I cant find a function that works, or is doing what I want. My goal is to remove all the text from submission.selftext(the body of the post) BUT the keys, which will eventually be saved in a .txt file.
Any advice on a good way to go around this? I've looked into re.sub and .translate but I don't understand how the parts fit together.
I am using Python 3.7 if it helps.

can't you just get the regexp results?
m = re.search(steamKey15, submission.selftext, re.IGNORECASE)
if m:
print(m.group(0))
Also note that a dot . means any char in a regexp. If you want to match only dots, you should use \.. You can probably write your regexp like this instead:
r'\w{5}[-.]\w{5}[-.]\w{5}'
This will match the key when separated by . or by -.
Note that this will also match anything that begin or end with a key, or has a key in the middle - that can cause you problems as your 15-char key regexp is contained in the 25-key one! To fix that use negative lookahead/negative lookbehind:
r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])'
that will only find the keys if there are no extraneous characters before and after them
Another hint is to use re.findall instead of re.search - some posts contain more than one steam key in the same post! findall will return all matches while search only returns the first one.

So a couple things first . means any character in regex. I think you know that, but just to be sure. Also \w\w\w\w\w can be replaced with \w{5} where this specifies 5 alphanumerics. I would use re.findall.
import re
steamKey15 = (r'(?:\w{5}.){2}\w{5}')
steamKey25 = (r'(?:\w{5}.){5}')
steamKey17 = (r'\w{15}\s\w\w')
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
if submission.id not in steamKeyPostID:
finds_15 = re.findall(steamKey15, submission.selftext)
finds_25 = re.findall(steamKey25, submission.selftext)
finds_17 = re.findall(steamKey17, submission.selftext)

Python 3.6 Identifying a string and if X in Y

Newb programmer here working on my first project. I've searched this site and the python documentation, and either I'm not seeing the answer, or I'm not using the right terminology. I've read the regex and if sections, specifically, and followed links around to other parts that seemed relevant.
import re
keyphrase = '##' + '' + '##'
print(keyphrase) #output is ####
j = input('> ') ###whatever##
if keyphrase in j:
print('yay')
else:
print('you still haven\'t figured it out...')
k = j.replace('#', '')
print(k) #whatever
This is for a little reddit bot project. I want the bot to be called like ##whatever## and then be able to do things with the word(s) in between the ##'s. I've set up the above code to test if Python was reading it but I keep getting my "you still haven't figured it out..." quip.
I tried adding the REGEX \W in the middle of keyphrase, to no avail. Also weird combinations of \$\$ and quotes
So, my question, is how do I put a placeholder in keyphrase for user input?
For instance, if a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.

You could use the following regex r'##(.*?)##' to capture everything inside of the key phrase you've chosen.
Sample Output:
>>> import re
>>> f = lambda s: re.match(r'##(.*?)##', s).group(1)
>>> f("##whatever##")
whatever
>>> f = lambda s: re.findall(r'##(.*?)##', s)
>>> f("a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.")
['comment', 'this', 'I can grab']
How does it work? (1) We state the string constant head and tail for the capture group 1 between the brackets (). Great, almost there! (2) We then match any character .*? with greedy search enforced so that we capture the whole string.
Suggested Readings:
Introduction to Regex in Python - Jee Gikera

Something like this should work:
import re
keyphrase_regex = re.compile(r'##(.*)##')
user_input = input('> ')
keyphrase_match = keyphrase_regex.search(user_input)
# `search` returns `None` if regex didn't match anywhere in the string
keyphrase_content = keyphrase_match.group(1) if keyphrase_match else None
if keyphrase_content:
keyphrase_content = keyphrase_match.group(1)
print('yay! You submitted "', keyphrase_content, '" to the bot!')
else:
# Bonus tip: Use double quotes to make a string containing apostrophe
# without using a backslash escape
print("you still haven't figured it out...")
# Use `keyphrase_content` for whatever down here
Regular expressions are kind of hard to wrap your head around, because they work differently than most programming constructs. It's a language to describe patterns.
Regex One is a fantastic beginners guide.
Regex101 is an online sandbox that allows you to type a regular expression and some sample strings, then see what matches (and why) in real time
The regex ##(.*)## basically means "search through the string until you find two '#' signs. Right after those, start capturing zero-or-more of any character. If you find another '#', stop capturing characters. If that '#' is followed by another one, stop looking at the string, return successfully, and hold onto the entire match (from first '#' to last '#'). Also, hold onto the captured characters in case the programmer asks you for just them.
EDIT: Props to #ospahiu for bringing up the ? lazy quantifier. A final solution, combining our approaches, would look like this:
# whatever_bot.py
import re
# Technically, Python >2.5 will compile and cache regexes automatically.
# For tiny projects, it shouldn't make a difference. I think it's better style, though.
# "Explicit is better than implicit"
keyphrase_regex = re.compile(r'##(.*?)##')
def parse_keyphrases(input):
return keyphrase_regex.find_all(input)
Lambdas are cool. I prefer them for one-off things, but the code above is something I'd rather put in a module. Personal preference.
You could even make the regex substitutable, using the '##' one by default
# whatever_bot.py
import re
keyphrase_double_at_sign = re.compile(r'##(.*?)##')
def parse_keyphrases(input, keyphrase_regex=keyphrase_double_at_sign):
return keyphrase_regex.find_all(input)
You could even go bonkers and write a function that generates a keyphrase regex from an arbitrary "tag" pattern! I'll leave that as an exercise for the reader ;) Just remember: Several characters have special regex meanings, like '*' and '?', so if you want to match that literal character, you'd need to escape them (e.g. '\?').

If you want to grab the content between the "#", then try this:
j = input("> ")
"".join(j.split("#"))

You're not getting any of the info between the #'s in your example because you're effectively looking for '####' in whatever input you give it. Unless you happen to put 4 #'s in a row, that RE will never match.
What you want to do instead is something like
re.match('##\W+##', j)
which will look for 2 leading ##s, then any number greater than 1 alphanumeric characters (\W+), then 2 trailing ##s. From there, your strip code looks fine and you should be able to grab it.

Python - Injecting html tags into strings based on regex match

I wrote a script in Python for custom HTML page that finds a word within a string/line and highlights just that word with use of following tags where instance is the word that is searched for.
<b><font color=\"red\">"+instance+"</font></b>
With the following result:
I need to find a word (case insensitive) let's say "port" within a string that can be port, Port, SUPPORT, Support, support etc, which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However my strings often contain 2 or more instances in single line, and I need to append
<b><font color=\"red\">"+instance+"</font></b> to each of those instances, without changing cases.
Problem with my approach, is that I am attempting to itterate over each of instances found with findall (exact match),
while multiple same matches can also be found within the string.
for instance in find_all_instances:
second_pattern = re.compile(instance)
string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking, I would be able to avoid this if I was able to find out exact part of the string that the pattern.sub substitutes at the moment of doing it,
however I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone have a way I could use to insert <b><font color="red">instance</font></b> without replacing instance for all matches(case insensitive), then I would be grateful.

Maybe I'm misinterpretting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs

Okay so two ways I did quickly! The second loop is definitely the way to go. It uses re.sub (as someone else commented too). It replaces with the lowercase search term bear in mind.
import re
FILE = open("testing.txt","r")
word="port"
#THIS LOOP IS CASE SENSITIVE
for line in FILE:
newline=line.replace(word,"<b><font color=\"red\">"+word+"</font></b>")
print newline
#THIS LOOP IS INCASESENSITIVE
for line in FILE:
pattern=re.compile(word,re.IGNORECASE)
newline = pattern.sub("<b><font color=\"red\">"+word+"</font></b>",line)
print newline

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

using \b in regex - python

Your pattern is evaluated for each position in the string from the left to the right, so if the left branch of your pattern succeeds, the second branch isn't tested at all. I suggest you to use this pattern that solves the problem: \b\d{5}(?:-\d{1,4})?\b

You can use this regex: \b(\d{5}-\d{1,4}|\d{5})\b Working demo

Related

Extract values in name=value lines with regex

Why is re.findall behaving like this? (python regex)

How to filter out specific strings from a string

Python 3.6 Identifying a string and if X in Y

Python - Injecting html tags into strings based on regex match

Categories

Resources