Python regular expression issue


I'm trying to use the re module so that it returns a run of characters up to the point where a particular string follows an individual character. The re documentation seems to indicate that I can use (?!...) to accomplish this. The example that I'm currently wrestling with:
str_to_search = 'abababsonab, etc'
first = re.search(r'(ab)+(?!son)', str_to_search)
second = re.search(r'.+(?!son)', str_to_search)
first.group() is 'abab', which is what I'm aiming for. However, second.group() returns the entire str_to_search string, despite the fact that I'm trying to make it stop at 'ababa', as the subsequent 'b' is immediately followed by 'son'. Where am I going wrong?

It's not the simplest thing, but you can capture a repeating sequence of "a character not followed by 'son'". This repeated expression should be in a non-capturing group, (?: ... ), so it doesn't mess with your match results. (You'd end up with an extra match group)
Try this:
import re
str_to_search = 'abababsonab, etc'
second = re.search(r'(?:.(?!son))+', str_to_search)
print(second.group())
Output:
ababa
See it here: http://ideone.com/6DhLgN
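To see why the original pattern runs to the end of the string, it may help to put the two patterns side by side. This is only a minimal sketch; the names greedy and tempered are illustrative and not from the question:

```python
import re

str_to_search = 'abababsonab, etc'

# .+ first consumes the whole string; the lookahead is then checked only once,
# at the very end, where nothing follows, so it succeeds trivially.
greedy = re.search(r'.+(?!son)', str_to_search)

# Here the lookahead is checked after every single character, so matching
# stops just before the character that is followed by 'son'.
tempered = re.search(r'(?:.(?!son))+', str_to_search)

print(greedy.group())    # abababsonab, etc
print(tempered.group())  # ababa
```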

This should work:
second = re.search(r'(.(?!son))+', str_to_search)
#output: 'ababa'

Not sure what you are trying to do.
Check out str.partition.
'.+?' is the minimal (non-greedy) matcher; a plain '.+' is greedy and grabs everything it can.
Read the docs for group(...) and groups(...), especially about passing a group number.

Related

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following regexes: (?:...), (941){1}, and escaping the parentheses with \ to make them literal, like this: \(941\), with no useful results. Maybe I am using them wrong.
Just wanted to know if this can be done using regex. I thought it might be useful for others too, or a good share for future viewers.
Thanks!

Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet

If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
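This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next. Note that the character class contains \s, which also matches \n, so a single match can run across several lines; remove \s from the class if each match should stay on one line.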
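The effect of re.escape on the parentheses can be checked directly. A small sketch using the string from the question; the variable names as_group, as_literal and words are made up for illustration, and the last pattern is just one possible way to pick the word that follows (941):

```python
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'

# Unescaped, the parentheses form a capturing group around 941.
as_group = re.findall(num, s)

# Escaped, they are matched as literal characters.
as_literal = re.findall(re.escape(num), s)

# Capture the word on the line after (941), skipping lines that start with a digit.
words = re.findall(re.escape(num) + r'\n(?!\d)(\w+)', s)

print(as_group)    # ['941', '941']
print(as_literal)  # ['(941)', '(941)']
print(words)       # ['Rivet']
```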

How to match digits only after a particular string, stop matching if non-digit is found - Python 2.7

I have a huge string like this: dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2
I want to get the number after ludocid, only consecutive numbers.
I have tried this regex (ludocid).*(?=\d+\d+) and many more but no luck.

You can try ludocid=(\d+):
s = "dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2"
import re
re.findall(r"ludocid=(\d+)", s)
# ['15878284988193842600']

You can use this regex:
ludocid\D*(\d+)
This will match literal ludocid followed by 0 or more non-digits and then it will match 1 or more digits in captured group #1
Code:
>>> s = 'dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2'
>>> print re.search(r'ludocid\D*(\d+)', s).group(1)
15878284988193842600

It looks like you just threw a bunch of regex bits together... Let's work through that.
First, this is the correct regex: ludocid.(\d+)
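(You would want to use it with re.search instead of re.match, by the way. re.match only succeeds if the pattern matches starting at the very beginning of the string, whereas re.search scans the string for the first place where the pattern matches.)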
But let's look at yours and see what went wrong and how we can get to the correct regex.
(ludocid).*(?=\d+\d+)
Imagine a regex as a function. You pass it the right things, and it gives you the appropriate result. When you wrap things in parentheses, you're saying "Find this and give it back to me." You don't need the ludocid given back to you, I'm guessing... so remove those parentheses.
ludocid.*(?=\d+\d+)
Now you've got a .*. This is dangerous in regular expressions because it literally says "Grab as many of anything as you possibly can!" Often I use the non-greedy version (.*?), but in this case it looks like we're just expecting a single extra character there. If you know the literal character you can use that, but to be safe I'll leave it as ., which says "Grab any one character."
ludocid.(?=\d+\d+)
Now let's go inside the parentheses. You've got \d+\d+, which says "Find a sequence of one or more digits, and then find another sequence of one or more digits." This equates to "Find a sequence of two or more digits." I don't think this is what you wanted (it's not how you described the problem, anyway), so let's reduce that:
ludocid.(?=\d+)
Okay, great. Now... what is (?=...) for? It's called a lookahead assertion. It says "If you find this string, match things in front of it." The example given in the Python 2.7 documentation is:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
Essentially this means that your regex will never return the digits. Instead, it looks to see if digits exist, and then it returns things from the rest of the regex. Remove the lookahead assertion and we're there:
ludocid.(\d+)
When you use this with re.search, you'll get the group you want:
>>> s = "dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2"
>>> import re
>>> re.search(r"ludocid.(\d+)", s).group(1)
'15878284988193842600'
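The difference between re.match and re.search, and the fact that a lookahead does not return what it checks, can be seen in a short sketch. The variable names at_start, anywhere and peek are only illustrative:

```python
import re

s = "dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2"

# re.match only tries at position 0, so it finds nothing here.
at_start = re.match(r"ludocid.(\d+)", s)

# re.search scans forward until the pattern fits somewhere.
anywhere = re.search(r"ludocid.(\d+)", s)

# With a lookahead the digits are checked but neither consumed nor returned.
peek = re.search(r"ludocid.(?=\d+)", s)

print(at_start)           # None
print(anywhere.group(1))  # 15878284988193842600
print(peek.group())       # ludocid=
```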

To match only the digits that follow, stopping at the first non-numeric char, try a positive look behind:
(?<=ludocid=)(\d+)
So:
re.findall(r"(?<=ludocid=)(\d+)", s)
The positive look behind will look for what you want, and only match if it is preceded by the 'flag' string.
Note: the = sign inside the lookbehind is an ordinary literal character and does not need escaping. Escaping it, as in (?<=ludocid\=)(\d+), is harmless and matches the same text.
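A quick check of the lookbehind version, including whether escaping the = makes any difference. This is a minimal sketch reusing the sample string from the question; the names plain and escaped are illustrative:

```python
import re

s = "dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2"

# The lookbehind requires 'ludocid=' right before the digits without including it.
plain = re.findall(r"(?<=ludocid=)\d+", s)

# Same pattern with the = escaped; '=' has no special meaning, so nothing changes.
escaped = re.findall(r"(?<=ludocid\=)\d+", s)

print(plain)             # ['15878284988193842600']
print(plain == escaped)  # True
```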

Match everything except a specific string

I am using Python 2.7 and have a question with regards to regular expressions. My string would be something like this...
"SecurityGroup:Pub HDP SG"
"SecurityGroup:Group-Name"
"SecurityGroup:TestName"
My regular expression looks something like below
[^S^e^c^r^i^t^y^G^r^o^u^p^:].*
The above seems to work but I have the feeling it is not very efficient and also if the string has the word "group" in it, that will fail as well...
What I am looking for is the output should find anything after the colon (:). I also thought I can do something like using group 2 as my match... but the problem with that is, if there are spaces in the name then I won't be able to get the correct name.
(SecurityGroup):(\w{1,})

Why not just do
security_string.split(':')[1]
To grab the second part of the String after the colon?

You could use lookbehind:
pattern = re.compile(r"(?<=SecurityGroup:)(.*)")
matches = re.findall(pattern, your_string)
Breaking it down:
(?<= # positive lookbehind. Matches things preceded by the following group
SecurityGroup: # pattern you want your matches preceded by
) # end positive lookbehind
( # start matching group
.* # any number of characters
) # end matching group
When tested on the string "something something SecurityGroup:stuff and stuff" it returns matches = ['stuff and stuff'].
Edit:
As mentioned in a comment, pattern = re.compile(r"SecurityGroup:(.*)") accomplishes the same thing. In this case you are matching the string "SecurityGroup:" followed by anything, but only returning the stuff that follows. This is probably more clear than my original example using lookbehind.
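For comparison, here is a small sketch of the capturing-group pattern next to a non-regex alternative. It assumes the input lines do not carry the surrounding double quotes shown in the question; the names lines, names and names_no_regex are made up for illustration:

```python
import re

lines = [
    "SecurityGroup:Pub HDP SG",
    "SecurityGroup:Group-Name",
    "SecurityGroup:TestName",
]

# Capture everything after the literal prefix, spaces and hyphens included.
pattern = re.compile(r"SecurityGroup:(.*)")
names = [pattern.search(line).group(1) for line in lines]

# str.partition splits at the first colon only, so later colons stay in the name.
names_no_regex = [line.partition(":")[2] for line in lines]

print(names)  # ['Pub HDP SG', 'Group-Name', 'TestName']
```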

Maybe this:
([^:"]+[^\s](?="))

Nongreedy Regex with Repetition

I am using the following regex:
((FFD8FF).+?((FFD9)(?:(?!FFD8).)*))
I need to do the following with regex:
Find FFD8FF
Find the last FFD9 that comes before the next FFD8FF
Stop at the last FFD9 and not include any content after
What I've got does what I need except it finds and keeps any junk after the last FFD9. How can I get it to jump back to the last FFD9?
Here's the string that I'm searching with this expression:
asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9
Thanks a lot for your help.
More info:
I have a list of start and end values I need to search for (FFD8FF and FFD9 are just one pair). Because of this, I'm using re.compile to dynamically create the expression in a for loop that goes through the different values. I have the following code, but it is returning 0 matches:
regExp = re.compile("FD8FF(?:[^F]|F(?!FD8FF))*FFD9")
matchObj = re.findall(regExp, contents)
In the above code, I'm just trying to use the plain regex without even getting the values from the list (that would look like this):
regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1])
Any other ideas why there aren't any matches?
EDIT:
I figured out that I forgot to include flags. Flags are now included to ignore case and multiline. I now have
regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1],re.M|re.I)
Although now I'm getting a memory error. Is there any way to make this more efficient? I am using the expression to search hundreds of thousands of lines (using the findall expression above)

an easy way is to use this:
FFD8FF(?:[^F]|F(?!FD8FF))*FFD9
explanation:
FFD8FF
(?: # this group describe the allowed content between the "anchors"
[^F] # all that is not a "F"
| # OR
F(?!FD8FF) # a "F" not followed by "FD8FF"
)* # repeat (greedy)
FFD9 # until the last FFD9 before FFD8FF
Even if a greedy quantifier is used for the group, the regex engine will backtrack to find the last "FFD9" substring.
If you want to ensure that FFD8FF is present, you can add a lookahead at the end of the pattern:
FFD8FF(?:[^F]|F(?!FD8FF))*FFD9(?=.*?FFD8FF)
You can optimize this pattern by emulating an atomic group, which limits the backtracking and allows a quantifier to be used inside the group:
FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\1)*FFD9
This trick uses the fact that the content of a lookahead is naturally atomic once the closing parenthesis is reached. So if you put a capture group inside a lookahead, you only have to place the backreference after it to obtain an "atom" (an indivisible substring).
When the regex engine needs to backtrack, it will backtrack atom by atom instead of character by character, which is much faster.
If you need a capture group before this trick, don't forget to update the number of the backreference, examples:
(FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\2)*FFD9)
(FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)
working example:
>>> import re
>>> yourstr = 'asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9'
>>> p = re.compile(r'(FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)(?=.*?FFD8FF)')
>>> re.findall(p, yourstr)
[('FFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9', 'asdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdf', 'D9asdflasdflasdf')]
variant:
(FFD8FF((?:(?=(F(?!FD8FF)[^F]*|[^F]+))\3)*)FFD9)(?=.*?FFD8FF)
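Since the question builds the expression from a list of start and end values, the basic pattern above could be generated along these lines. This is only a sketch: build_pattern is a hypothetical helper, and it assumes the start marker has at least two characters. Note that the lookahead uses the start marker without its first character, matching F(?!FD8FF) above:

```python
import re

def build_pattern(start, end):
    """Hypothetical helper: start, then anything that does not begin a new start, then end."""
    first, rest = start[0], start[1:]
    body = ("(?:[^" + re.escape(first) + "]|"
            + re.escape(first) + "(?!" + re.escape(rest) + "))*")
    return re.compile(re.escape(start) + body + re.escape(end))

contents = ('asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9'
            'asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9')

# No capturing groups, so findall returns the full matches.
matches = build_pattern('FFD8FF', 'FFD9').findall(contents)
for m in matches:
    print(m)
```

Passing the markers through re.escape keeps the helper safe for pairs that contain regex metacharacters.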

Since you are not restricted to one regexp by your application's architecture, break it down into steps:
You want to break up the text in units that begin at each FFD8FF. Just use non-greedy search that ends just before the next FFD8FF: re.findall(r"FFD8FF.*?(?=FFD8FF)", contents). (This uses look-ahead, which is in my opinion overused; but it lets you save the final FFD8FF for the next string.)
You then want to trim each such string so that it ends at the last FFD9. Easiest way to do this is with greedy search: re.search(r"^.*FFD9", part). Like this:
for part in re.findall(r"FFD8FF.*?(?=FFD8FF)", contents):
    print(re.search(r"^.*FFD9", part).group(0))
Simple, maintainable and efficient.
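The two steps can also be wrapped in a function. The sketch below is one possible variant, not the code from this answer: extract_blocks is a hypothetical name, the lookahead additionally accepts the end of the input so that the final block is not dropped, and re.S is assumed so that blocks may span lines:

```python
import re

def extract_blocks(contents, start="FFD8FF", end="FFD9"):
    start_re, end_re = re.escape(start), re.escape(end)
    # Step 1: one chunk per start marker, up to the next start marker or end of input.
    chunk_re = re.compile(start_re + r".*?(?=" + start_re + r"|\Z)", re.S)
    # Step 2: greedy match, so each chunk is cut at its last end marker.
    trim_re = re.compile(r".*" + end_re, re.S)
    blocks = []
    for chunk in chunk_re.findall(contents):
        trimmed = trim_re.match(chunk)
        if trimmed:  # skip chunks that contain no end marker at all
            blocks.append(trimmed.group(0))
    return blocks

contents = ('asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9'
            'asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9')
blocks = extract_blocks(contents)
for block in blocks:
    print(block)
```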

This is how I would do it:
>>> re.search(r'((FFD8FF).+?(FFD9))(?:((?!FFD9).)+FFD8FF)', s).groups()
('FFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9',
'FFD8FF',
'FFD9',
'f')
The second part just searches for a string not containing FFD9 that ends with FFD8FF.
It includes your search components, so you can still substitute them in your regex. However for something rather complicated like this I would avoid regex.
btw, thanks for posting a regex question that is high-quality and not the usual spam.

Unexpected end of Pattern : Python Regex

When I use the following python regex to perform the functionality described below, I get the error Unexpected end of Pattern.
Regex:
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
Purpose of this regex:
INPUT:
CODE876
CODE223
matchjustCODE657
CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665
Should match:
CODE876
CODE223
CODE657
CODE697
and replace occurrences with
http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
Should Not match:
code876
testing1CODE888
testing2CODE776
example3CODE654
example2CODE098
http://replaced/CODE665
FINAL OUTPUT
http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665
EDIT and UPDATE 1
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
The error no longer occurs, but the regex does not match any of the patterns as needed. Is there a problem with the capturing groups or with the matching itself? When I compile this regex as such, I get no match for my input.
EDIT AND UPDATE 2
f=open("/Users/mymac/Desktop/regex.txt")
s=f.read()
s1 = re.sub(r'((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
r'\g<1>\g<2>', s)
print s1
INPUT
CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345
CODE234
CODE333
OUTPUT
CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345
CODE234
CODE333
Regex works for Raw input, but not for string input from a text file.
See Input 4 and 5 for more results http://ideone.com/3w1E3

Your main problem is the (?-i) thingy which is wishful thinking as far as Python 2.7 and 3.2 are concerned. For more details, see below.
import re
# modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
# (CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
# observation 1: as presented, pattern has a line break in the middle, just after (?-i)
# ob 2: rather hard to read, should use re.VERBOSE
# ob 3: not obvious whether it's a compile-time or run-time problem
# ob 4: (?i) should be at the very start of the pattern (see docs)
# ob 5: what on earth is (?-i) ... not in 2.7 docs, not in 3.2 docs
pattern = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)'
#### rx = re.compile(pattern)
# above line failed with "sre_constants.error: unexpected end of pattern"
# try without the (?-i)
pattern2 = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)'
rx = re.compile(pattern2)
# This works, now you need to work on observations 1 to 4,
# and rethink your CODE/code strategy
Looks like suggestions fall on deaf ears ... Here's the pattern in re.VERBOSE format:
pattern4 = r'''
    ^
    (?i)
    (
        (?:
            (?!http://)
            (?!testing[0-9])
            (?!example[0-9])
            . #### what is this for?
        )*?
    ) ##### end of capturing group 1
    (CODE[0-9]{3}) #### not in capturing group 1
    (?!</a>)
'''

Okay, it looks like the problem is the (?-i), which is surprising. The purpose of the inline-modifier syntax is to let you apply modifiers to selected portions of the regex. At least, that's how they work in most flavors. In Python it seems they always modify the whole regex, same as the external flags (re.I, re.M, etc.). The alternative (?i:xyz) syntax doesn't work either.
On a side note, I don't see any reason to use three separate lookaheads, as you did here:
(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?
Just OR them together:
(?:(?!http://|testing[0-9]|example[0-9]).)*?
EDIT: We seem to have moved from the question of why the regex throws exceptions, to the question of why it doesn't work. I'm not sure I understand your requirements, but the regex and replacement string below return the results you want.
s1 = re.sub(r'^((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
r'\g<1>\g<2>', s)
See it in action on ideone.com
Is that what you're after?
EDIT: We now know that the replacements are being done within a larger text, not on standalone strings. That makes the problem much more difficult, but we also know the full URLs (the ones that start with http://) only occur in already-existing anchor elements. That means we can split the regex into two alternatives: one to match complete <a>...</a> elements, and one to match the target strings.
(?s)(?:(<a\s+[^>]*>.*?</a>)|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))
The trick is to use a function instead of a static string for the replacement. Whenever the regex matches an anchor element, the function will find it in group(1) and return it unchanged. Otherwise, it uses group(2) and group(3) to build a new one.
here's another demo (I know that's horrible code, but I'm too tired right now to learn a more pythonic way.)
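The function-as-replacement idea described above might look roughly like this. It is a sketch under assumptions: link_code is a hypothetical name, and the anchor markup it produces (a link to http://productcode/ followed by the code) is a guess at the intended output format:

```python
import re

# Pattern from the answer: group 1 is a complete anchor element; groups 2 and 3
# are the optional word prefix and the product code.
PATTERN = re.compile(
    r'(?s)(?:(<a\s+[^>]*>.*?</a>)'
    r'|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))'
)

def link_code(match):
    # Existing anchor elements are returned untouched.
    if match.group(1) is not None:
        return match.group(1)
    prefix, code = match.group(2), match.group(3)
    # Assumed output format: wrap the code in a link to http://productcode/.
    return '{0}<a href="http://productcode/{1}">{1}</a>'.format(prefix, code)

text = '\n'.join([
    'CODE876',
    'matchjustCODE657',
    'code876',
    'testing1CODE888',
    '<a href="http://replaced/CODE665">CODE665</a>',
])
result = PATTERN.sub(link_code, text)
print(result)
```

Matching the anchor elements first means the codes inside them are consumed by group 1 and are never seen by the second alternative.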

The only problem I see is that you replace using the wrong capturing group.
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
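The pattern has two capturing groups: the first is the prefix group ((?:...)*?) and the second is (CODE[0-9]{3}). The replacement string \g<1> refers only to the first group.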
Here I made the first one also a non capturing group
^(?i)(?:(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)

For complex regexes, use the re.X flag to document what you're doing and to make sure the brackets match up correctly (i.e. by using indentation to indicate the current level of nesting).
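As an illustration of the re.X layout, here is a sketch of the pattern from this thread written in verbose form. It relies on the scoped inline-flag group (?i:...), which, as far as I know, is only accepted by Python 3.6 and later and is not available in the 2.7 and 3.2 versions discussed above. The names CODE_LINE, samples and matched are made up for the example:

```python
import re

CODE_LINE = re.compile(r'''
    ^
    (                       # group 1: whatever precedes the code
        (?:
            (?!(?i:http://|testing[0-9]|example[0-9]))   # forbidden prefixes, any case
            .
        )*?
    )
    (CODE[0-9]{3})          # group 2: upper-case CODE plus three digits
    (?!</a>)
''', re.VERBOSE)

samples = ['CODE876', 'matchjustCODE657', 'code876',
           'TESTING1CODE888', 'http://replaced/CODE665']
matched = {line: bool(CODE_LINE.match(line)) for line in samples}
for line in samples:
    print(line, matched[line])
```

Only the forbidden-prefix check is case-insensitive here; CODE itself stays case-sensitive, which is what the (?-i) in the question was trying to achieve.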
