Python re.findall non-greedy result - python

I'm trying to get only "Text3" part with the following code:
import re
stringtotest = "begin:Text1<wrong>Text2<wrong>Text3<right>Text4<wrong>"
right = re.findall("<wrong>(.+?)<right>",stringtotest)
>>> right
['Text2<wrong>Text3']
Why does Python give me Text2 as well? How do I tell it that I want only the part after the nearest "wrong"? Thank you.

The dot . matches any character, including <, so the non-greedy match can still extend across an inner <wrong> tag. You can use a negated character class to restrict the match:
<wrong>([^<]+?)<right>
If you want to get the middle section without the outer tags, use lookaheads and lookbehinds to assert the position of the tags:
(?<=<wrong>)([^<]+?)(?=<right>)

<wrong>((?:(?!<wrong>).)*)<right>
You can use a quantifier tempered by a negative lookahead. See demo.
https://regex101.com/r/8yUhDL/1
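As a quick check, the three patterns suggested above can be run side by side against the string from the question. This is only a sketch; the variable names negated, lookaround and tempered are made up for illustration.

```python
import re

stringtotest = "begin:Text1<wrong>Text2<wrong>Text3<right>Text4<wrong>"

# Negated character class: the captured text may not contain "<".
negated = re.findall(r"<wrong>([^<]+?)<right>", stringtotest)

# Lookarounds: the tags are asserted but are not part of the match.
lookaround = re.findall(r"(?<=<wrong>)[^<]+?(?=<right>)", stringtotest)

# Tempered dot: a character is consumed only if "<wrong>" does not start there.
tempered = re.findall(r"<wrong>((?:(?!<wrong>).)*)<right>", stringtotest)

print(negated, lookaround, tempered)
```

All three should yield only the text between the nearest <wrong> and <right>.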

Related

Extracting a word between two path separators that comes after a specific word

I have the following path stored as a python string 'C:\ABC\DEF\GHI\App\Module\feature\src' and I would like to extract the word Module that is located between words \App\ and \feature\ in the path name. Note that there are file separators '\' in between which ought not to be extracted, but only the string Module has to be extracted.
I had a few ideas on how to do it:
Write a RegEx that matches a string between \App\ and \feature\
Write a RegEx that matches a string after \App\ --> App\\[A-Za-z0-9]*\\, and then split that matched string in order to find the Module.
I think the 1st solution is better, but unfortunately it goes beyond my RegEx knowledge and I am not sure how to do it.
I would much appreciate any help.
Thank you in advance!
The regex you want is:
(?<=\\App\\).*?(?=\\feature\\)
Explanation of the regex:
(?<=behind)rest matches all instances of rest if there is behind immediately before it. It's called a positive lookbehind
rest(?=ahead) matches all instances of rest where there is ahead immediately after it. This is a positive lookahead.
\ is a reserved character in regex patterns, so to use it as part of the pattern itself, we have to escape it; hence \\
.* matches any character, zero or more times.
? specifies that the match is not greedy (so we are implicitly assuming here that \feature\ only shows up once after \App\).
The pattern also assumes that there are no \ characters between \App\ and \feature\; since .*? does not exclude them, any intermediate directories would be included in the match.
The full code would be something like:
import re

path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# re.escape doubles each backslash so that it matches literally
pattern = rf"(?<={re.escape(start)}).*?(?={re.escape(end)})"
print(pattern)  # (?<=\\App\\).*?(?=\\feature\\)
print(re.search(pattern, path)[0])  # Module
A link on regex lookarounds that may be helpful: https://www.regular-expressions.info/lookaround.html
You can also do this without a regex, using str.find, something like:
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# slice from just after start up to the last occurrence of end
print(path[path.find(start) + len(start):path.rfind(end)])
Output:
Module
You are looking for groups. With some small modifications you can extract only the part between App and feature.
(?:App\\\\)([A-Za-z0-9]*)(?:\\\\feature)
The parentheses ( ) define a capturing group, which you can get with match.group(1). Using (?:foo) defines a non-capturing group, i.e. one that is not included in your result. Try the expression here: https://regex101.com/r/24mkLO/1
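A minimal sketch of the capturing-group approach in Python, assuming the pattern is written as a raw string so that each \\ matches exactly one backslash. The non-capturing groups are dropped here because they are not needed for match.group(1); the names path and module are only illustrative.

```python
import re

path = r'C:\ABC\DEF\GHI\App\Module\feature\src'

# Raw string: each \\ in the pattern matches one literal backslash.
match = re.search(r'App\\([A-Za-z0-9]*)\\feature', path)

# group(1) is the text captured by the parentheses, without the separators.
module = match.group(1) if match else None
print(module)
```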

Match everything except a specific string

I am using Python 2.7 and have a question with regards to regular expressions. My string would be something like this...
"SecurityGroup:Pub HDP SG"
"SecurityGroup:Group-Name"
"SecurityGroup:TestName"
My regular expression looks something like below
[^S^e^c^r^i^t^y^G^r^o^u^p^:].*
The above seems to work but I have the feeling it is not very efficient and also if the string has the word "group" in it, that will fail as well...
What I want is for the match to contain everything after the colon (:). I also thought of using group 2 as my match, but the problem with that is that if there are spaces in the name, I won't get the correct name.
(SecurityGroup):(\w{1,})
Why not just do
security_string.split(':')[1]
To grab the second part of the String after the colon?
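A small sketch of the split approach. Passing maxsplit=1 is an addition not in the answer above; it keeps the name intact if it happens to contain a colon itself. The second sample string is hypothetical.

```python
security_string = "SecurityGroup:Pub HDP SG"

# Split on the first colon only; everything after it is the name.
name = security_string.split(':', 1)[1]

# Hypothetical name containing a colon, to show why maxsplit=1 matters.
name_with_colon = "SecurityGroup:Team:Blue".split(':', 1)[1]

print(name)
print(name_with_colon)
```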
You could use lookbehind:
pattern = re.compile(r"(?<=SecurityGroup:)(.*)")
matches = re.findall(pattern, your_string)
Breaking it down:
(?<= # positive lookbehind. Matches things preceded by the following group
SecurityGroup: # pattern you want your matches preceded by
) # end positive lookbehind
( # start matching group
.* # any number of characters
) # end matching group
When tested on the string "something something SecurityGroup:stuff and stuff" it returns matches = ['stuff and stuff'].
Edit:
As mentioned in a comment, pattern = re.compile(r"SecurityGroup:(.*)") accomplishes the same thing. In this case you are matching the string "SecurityGroup:" followed by anything, but only returning the stuff that follows. This is probably more clear than my original example using lookbehind.
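To compare the two variants, here is a short sketch that runs both over the sample strings from the question. The list and variable names are made up for illustration.

```python
import re

samples = [
    "SecurityGroup:Pub HDP SG",
    "SecurityGroup:Group-Name",
    "SecurityGroup:TestName",
]

# Lookbehind: the prefix is asserted, so the whole match is the name.
lookbehind = re.compile(r"(?<=SecurityGroup:).*")
# Capturing group: the prefix is matched, but only group(1) is kept.
capture = re.compile(r"SecurityGroup:(.*)")

names_lookbehind = [lookbehind.search(s).group(0) for s in samples]
names_capture = [capture.search(s).group(1) for s in samples]

print(names_lookbehind)
print(names_capture)
```

Both lists should be identical, and names containing spaces are kept whole.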
Maybe this:
([^:"]+[^\s](?="))
Regex live here.

How to get the rightmost match by regular expression?

I think this is a common problem. But I didn't find a satisfactory answer elsewhere.
Suppose I extract some links from a website. The links are like the following:
http://example.com/goto/http://example1.com/123.html
http://example1.com/456.html
http://example.com/yyy/goto/http://example2.com/789.html
http://example3.com/xxx.html
I want to use regular expression to convert them to their real links:
http://example1.com/123.html
http://example1.com/456.html
http://example2.com/789.html
http://example3.com/xxx.html
However, I can't do that because of the greedy feature of RE.
'http://.*$' will only match the whole sentence. Then I tried 'http://.*?$' but it didn't work either. Nor did re.findall. So is there any other way to do this?
Yes. I can do it by str.split or str.index. But I'm still curious about whether there is a RE solution for this.
You don't need a regex: you can use str.split() to split each link on //, then pick up the last part and prepend http:// to it:
>>> s="""http://example.com/goto/http://example1.com/123.html
... http://example1.com/456.html
... http://example.com/yyy/goto/http://example2.com/789.html
... http://example3.com/xxx.html"""
>>> ['http://'+link.split('//')[-1] for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
With a regex, you just need to replace everything between the first // and the last // (including the last one) with an empty string; since the first // must be kept, use a positive lookbehind:
>>> [re.sub(r'(?<=//)(.*)//','',link) for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
>>>
Use this pattern:
^(.*?[^/])(?=\/[^/]).*?([^/]+)$
and replace with $1/$2 (\1/\2 in Python's re.sub).
After reading the comment below, use this pattern to capture what you want:
(http://(?:[^h]|h(?!ttp:))*)$
or this pattern:
(http://(?:(?!http:).)*)$
or this pattern:
http://.*?(?=http://)
and replace with nothing.
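A sketch of how two of these patterns could be applied in Python to the links from the question, one link at a time. The names links, last_url, real_links and stripped are illustrative only.

```python
import re

links = [
    "http://example.com/goto/http://example1.com/123.html",
    "http://example1.com/456.html",
    "http://example.com/yyy/goto/http://example2.com/789.html",
    "http://example3.com/xxx.html",
]

# Capture an http:// that is not followed by another "http:" before the end.
last_url = re.compile(r"(http://(?:(?!http:).)*)$")
real_links = [last_url.search(link).group(1) for link in links]

# Alternative: delete any leading URL that is followed by another http://.
stripped = [re.sub(r"http://.*?(?=http://)", "", link) for link in links]

print(real_links)
print(stripped)
```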

python regex suffix matching

For a typical set of word suffixes (ize, fy, ly, able, etc.), I want to know whether a given word ends with any of them, and subsequently remove it. I know this can be done iteratively with word.endswith('ize'), for example, but I believe there is a neater regex way of doing it. I tried a positive lookahead with an ending marker $, but for some reason it didn't work:
pat='(?=ate|ize|ify|able)$'
word='terrorize'
re.findall(pat,word)
Little-known fact: endswith accepts a tuple of possibilities:
if word.endswith(('ate','ize','ify','able')):
    #...
Unfortunately, it doesn't indicate which string was found, so it doesn't help with removing the suffix.
What you are looking for is actually a non-capturing group, (?:...)
Check this out:
re.sub(r"(?:ate|ize|ify|able)$", "", "terrorize")
Have a look at this site, Regex.
There are tons of useful regex techniques. Hope you enjoy it.
BTW, the Python library documentation itself is a neat and wonderful tutorial.
I use help() a lot :)
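A lookahead is a zero-width assertion: like ^ and $, it anchors the match to a specific location but does not itself consume any characters. Your pattern therefore requires a position that is followed by a suffix and is also the end of the string, which can never be true.
You want to match these suffixes, but at the end of a word, so use the word-boundary anchor \b instead: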
r'(ate|ize|ify|able)\b'
then use re.sub() to replace those:
re.sub(r'(ate|ize|ify|able)\b', '', word)
which works just fine:
>>> word='terrorize'
>>> re.sub(r'(ate|ize|ify|able)\b', '', word)
'terror'
You need to adjust the parentheses; just change pat from:
(?=ate|ize|ify|able)$
to:
(?=(ate|ize|ify|able)$)
If you need to remove the suffixes later, you could use the pattern:
^(.*)(?=(ate|ize|ify|able)$)
Test in REPL:
>>> pat = '^(.*)(?=(ate|ize|ify|able)$)'
>>> word = 'terrorize'
>>> re.findall(pat, word)
[('terror', 'ize')]
If you are matching word by word, simply drop the lookahead; the $ anchor is sufficient.
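Putting the answers together, here is one possible sketch that both removes the suffix and reports which one was found, which endswith alone does not do. The helper name split_suffix and the module-level SUFFIX pattern are hypothetical, not from any of the answers.

```python
import re

# Capturing group anchored at the end of the string (no lookahead needed).
SUFFIX = re.compile(r"(ate|ize|ify|able)$")

def split_suffix(word):
    """Return (stem, suffix); suffix is None if no listed suffix matches."""
    m = SUFFIX.search(word)
    if m is None:
        return word, None
    # m.start() is where the suffix begins, so the stem is everything before it.
    return word[:m.start()], m.group(1)

print(split_suffix("terrorize"))
print(split_suffix("walk"))
```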

Python: RegEx assistance

I have a filename 10.10.10.17_super-micro-100-13.txt from which I need to extract everything between _ and .. E.g., in this case it would return super-micro-100-13
I need a Python regex to accomplish this. If I use
re.compile('\_(.*)\.'), I get _super-micro-100-13. which is not what I want. Can anyone shed some light on what the correct regex would be in this case?
Thanks,
Neel
If you decide you don't need to use regex, throwing together a few string methods is more readable.
file_name = "10.10.10.17_super-micro-100-13.txt"
print(file_name.split("_")[1].split(".")[0])
You can use a lookbehind and lookahead so that you are only actually matching the part that you want. Also note that you need to escape the . at the end to match a literal dot.
Here is the regex you could use:
regex = re.compile(r'(?<=_).*(?=\.)')
Alternatively, you can use your current regex and pull out the first capture group from your match:
regex = re.compile(r'_(.*)\.')
print(regex.search('10.10.10.17_super-micro-100-13.txt').group(1))
# super-micro-100-13
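One detail worth checking is greediness. The sketch below runs both patterns on the file name from the question, then on a hypothetical name (dotted_name, invented here) that has an extra dot after the underscore, to show where a greedy and a non-greedy quantifier would differ.

```python
import re

file_name = "10.10.10.17_super-micro-100-13.txt"

# Lookarounds: the whole match is the wanted text.
via_lookaround = re.search(r"(?<=_).*(?=\.)", file_name).group(0)
# Capturing group: group(0) would include "_" and ".", group(1) does not.
via_group = re.search(r"_(.*)\.", file_name).group(1)

# Hypothetical name with a dot inside the part after the underscore.
dotted_name = "10.10.10.17_a.b.txt"
greedy = re.search(r"_(.*)\.", dotted_name).group(1)   # runs to the last dot
lazy = re.search(r"_(.*?)\.", dotted_name).group(1)    # stops at the first dot

print(via_lookaround, via_group, greedy, lazy)
```

Which of the two is right depends on whether "everything between _ and ." means up to the first dot or up to the extension.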
Try this:
import re
name = '10.10.10.17_super-micro-100-13.txt'
regex = re.compile(r'.+_(.+)\.txt')
regex.match(name).group(1)
> 'super-micro-100-13'
I do think that regex is a bit overkill. You can use the "find" method as follows:
def extract_info(s):
    underscore = s.find('_')
    dot = s.find('.', underscore)  # only look for a dot after the underscore
    return s[underscore + 1:dot]  # skip the underscore itself
