Exact search of a string that has parenthesis using regex - python

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following regex: (?:...), (941){1} and the make regex literal character \ like this \(941\) with no useful results. Maybe I am using them wrong.
Just wanted to know if it is possible to be done using regex. Though it might be useful for others too or a good share for future viewers.
Thanks!

Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet

If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.

Related

Regex search fail when input has line breaks [duplicate]

This question already has an answer here:
Why is Python Regex Wildcard only matching newLine
(1 answer)
Closed 1 year ago.
The following regular expression is not returning any match:
import re
regex = '.*match.*fail.*'
pattern = re.compile(regex)
text = '\ntestmatch\ntestfail'
match = pattern.search(text)
I managed to solve the problem by changing text to repr(text) or setting text as a raw string with r'\ntestmatch\ntestfail', but I'm not sure if these are the best approaches. What is the best way to solve this problem?
Using repr or raw string on a target string is a bad idea!
By doing that newline characters are treated as literal '\n'.
This is likely to cause unexpected behavior on other test cases.
The real problem is that . matches any character EXCEPT newline.
If you want to match everything, replace . with [\s\S].
This means "whitespace or not whitespace" = "anything".
Using other character groups like [\w\W] also works,
and it is more efficient for adding exception just for newline.
One more thing, it is a good practice to use raw string in pattern string(not match target).
This will eliminate the need to escape every characters that has special meaning in normal python strings.
You could add it as an or, but make sure you \ in the regex string, so regex actually gets the \n and not a actual newline.
Something like this:
regex = '.*match(.|\\n)*fail.*'
This would match anything from the last \n to match, then any mix or number of \n until testfail. You can change this how you want, but the idea is the same. Put what you want into a grouping, and then use | as an or.
On the left is what this regex pattern matched from your example.

Python regex match all sentences include either wordA or wordB [duplicate]

I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes character sets, () denotes logical groupings.
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?

Need a specific explanation of part of a regex code

I'm developing a calculator program in Python, and need to remove leading zeros from numbers so that calculations work as expected. For example, if the user enters "02+03" into the calculator, the result should return 5. In order to remove these leading zeroes in-front of digits, I asked a question on here and got the following answer.
self.answer = eval(re.sub(r"((?<=^)|(?<=[^\.\d]))0+(\d+)", r"\1\2", self.equation.get()))
I fully understand how the positive lookbehind to the beginning of the string and lookbehind to the non digit, non period character works. What I'm confused about is where in this regex code can I find the replacement for the matched patterns?
I found this online when researching regex expressions.
result = re.sub(pattern, repl, string, count=0, flags=0)
Where is the "repl" in the regex code above? If possible, could somebody please help to explain what the r"\1\2" is used for in this regex also?
Thanks for your help! :)
The "repl" part of the regex is this component:
r"\1\2"
In the "find" part of the regex, group capturing is taking place (ordinarily indicated by "()" characters around content, although this can be overridden by specific arguments).
In python regex, the syntax used to indicate a reference to a positional captured group (sometimes called a "backreference") is "\n" (where "n" is a digit refering to the position of the group in the "find" part of the regex).
So, this regex is returning a string in which the overall content is being replaced specifically by parts of the input string matched by numbered groups.
Note: I don't believe the "\1" part of the "repl" is actually required. I think:
r"\2"
...would work just as well.
Further reading: https://www.regular-expressions.info/brackets.html
Firstly, repl includes what you are about to replace.
To understand \1\2 you need to know what capture grouping is.
Check this video out for basics of Group capturing.
Here , since your regex splits every match it finds into groups which are 1,2... so on. This is so because of the parenthesis () you have placed in the regex.
$1 , $2 or \1,\2 can be used to refer to them.
In this case: The regex is replacing all numbers after the leading 0 (which is caught by group 2) with itself.
Note: \1 is not necessary. works fine without it.
See example:
>>> import re
>>> s='awd232frr2cr23'
>>> re.sub('\d',' ',s)
'awd frr cr '
>>>
Explanation:
As it is, '\d' is for integer so removes them and replaces with repl (in this case ' ').

Regex End of Line and Specific Chracters

So I'm writing a Python program that reads lines of serial data, and compares them to a dictionary of line codes to figure out which specific lines are being transmitted. I am attempting to use a Regular Expression in order to filter out the extra garbage line serial read string has on it, but I'm having a bit of an issue.
Every single code in my dictionary looks like this: T12F8B0A22**F8. The asterisks are the two alpha numeric pieces that differentiate each string code.
This is what I have so far as my regex: '/^T12F8B0A22[A-Z0-9]{2}F8$/'
I am getting a few errors with this however. My first error, is that there are some characters are the end of the string I still need to get rid of, which is odd because I thought $/ denoted the end of the line in regex. However when I run my code through the debugger I notice that after running through the following code:
#regexString contains the serial read line data
regexString = re.sub('/^T12F8B0A22[A-Z0-9]{2}F8$/', '', regexString)
My string looks something like this: 'T12F8B0A2200F8\\r'
I need to get rid of the \\r.
If for some reason I can't get rid of this with regex, how in python do you send specific string character through an argument? In this case I suppose it would be length - 3?
Your problem is threefold:
1) your string contains extra \r (Carriage Return character) before \n (New Line character); this is common in Windows and in network communication protocols; it is probably best to remove any trailing whitespace from your string:
regexString = regexString.rstrip()
2) as mentioned by Wiktor Stribiżew, your regexp is unnecessarily surrounded with / characters - some languages, like Perl, define regexp as a string delimited by / characters, but Python is not one of them;
3) your instruction using re.sub is actually replacing the matching part of regexString with an empty string - I believe this is the exact opposite of what you want (you want to keep the match and remove everything else, right?); that's why fixing the regexp makes things "even worse".
To summarize, I think you should use this instead of your current code:
m = re.match('T12F8B0A22[A-Z0-9]{2}F8', regexString)
regexString = m.group(0)
There are several ways to get rid of the "\r", but first a little analysis of your code :
1. the special charakter for the end is just '$' not '$\' in python.
2. re.sub will substitute the matched pattern with a string ( '' in your case) wich would substitute the string you want to get with an empty string and you are left with the //r
possible solutions:
use simple replace:
regexString.replace('\\r','')
if you want to stick to regex the approach is the same
pattern = '\\\\r'
match = re.sub(pattern, '',regexString)
2.2 if you want the acces the different groubs use re.search
match = re.search('(^T12F8B0A22[A-Z0-9]{2}F8)(.*)',regexString)
match.group(1) # will give you the T12...
match.groupe(2) # gives you the \\r
Just match what you want to find. Couple of examples:
import re
data = '''lots of
otherT12F8B0A2212F8garbage
T12F8B0A2234F8around
T12F8B0A22ABF8the
stringsT12F8B0A22CDF8
'''
print(re.findall('T12F8B0A22..F8',data))
['T12F8B0A2212F8', 'T12F8B0A2234F8', 'T12F8B0A22ABF8', 'T12F8B0A22CDF8']
m = re.search('T12F8B0A22..F8',data)
if m:
print(m.group(0))
T12F8B0A2212F8

Match everything expect a specific string

I am using Python 2.7 and have a question with regards to regular expressions. My string would be something like this...
"SecurityGroup:Pub HDP SG"
"SecurityGroup:Group-Name"
"SecurityGroup:TestName"
My regular expression looks something like below
[^S^e^c^r^i^t^y^G^r^o^u^p^:].*
The above seems to work but I have the feeling it is not very efficient and also if the string has the word "group" in it, that will fail as well...
What I am looking for is the output should find anything after the colon (:). I also thought I can do something like using group 2 as my match... but the problem with that is, if there are spaces in the name then I won't be able to get the correct name.
(SecurityGroup):(\w{1,})
Why not just do
security_string.split(':')[1]
To grab the second part of the String after the colon?
You could use lookbehind:
pattern = re.compile(r"(?<=SecurityGroup:)(.*)")
matches = re.findall(pattern, your_string)
Breaking it down:
(?<= # positive lookbehind. Matches things preceded by the following group
SecurityGroup: # pattern you want your matches preceded by
) # end positive lookbehind
( # start matching group
.* # any number of characters
) # end matching group
When tested on the string "something something SecurityGroup:stuff and stuff" it returns matches = ['stuff and stuff'].
Edit:
As mentioned in a comment, pattern = re.compile(r"SecurityGroup:(.*)") accomplishes the same thing. In this case you are matching the string "SecurityGroup:" followed by anything, but only returning the stuff that follows. This is probably more clear than my original example using lookbehind.
Maybe this:
([^:"]+[^\s](?="))
Regex live here.

Categories