python regex suffix matching

For a typical set of word suffixes (ize, fy, ly, able, etc.), I want to know whether a given word ends with any of them, and then remove the suffix. I know this can be done iteratively with word.endswith('ize'), for example, but I believe there is a neater regex way of doing it. I tried a positive lookahead with the end anchor $, but for some reason it didn't work:
import re
pat = '(?=ate|ize|ify|able)$'
word = 'terrorize'
re.findall(pat, word)

Little-known fact: endswith accepts a tuple of possibilities:
if word.endswith(('ate', 'ize', 'ify', 'able')):
    ...  # handle the matching word here
Unfortunately, it doesn't indicate which string was found, so it doesn't help with removing the suffix.
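One way around that limitation, sketched here rather than taken from the answer above, is to test the suffixes one at a time so that the matching one is known. The names strip_suffix and SUFFIXES are hypothetical, chosen only for illustration:

```python
# Hypothetical suffix list, copied from the question's examples.
SUFFIXES = ('ate', 'ize', 'ify', 'able')


def strip_suffix(word, suffixes=SUFFIXES):
    """Return (stem, suffix) for the first suffix that word ends with.

    If no suffix matches, the word is returned unchanged with None.
    """
    for suffix in suffixes:
        if word.endswith(suffix):
            # Slice off exactly the matched suffix.
            return word[:-len(suffix)], suffix
    return word, None


print(strip_suffix('terrorize'))
print(strip_suffix('terror'))
```

The suffixes are tried in tuple order, so if one suffix is itself the ending of another, the order of the tuple decides which one is reported.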

What you are looking for is actually a non-capturing group, (?:...).
Check this out:
re.sub(r"(?:ate|ize|ify|able)$", "", "terrorize")
Have a look at this site: Regex.
There are tons of useful regex techniques there. Hope you enjoy it.
BTW, the Python library itself is a neat and wonderful tutorial.
I use help() a lot :)

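A lookahead is a zero-width assertion: like the anchors ^ and $, it pins the match to a specific location but does not itself consume any text. In your pattern the lookahead requires a suffix to start at the current position, while $ requires that same position to be the end of the string, so the two can never both hold.
You want to match the suffixes themselves, but only at the end of a word, so use the word-boundary anchor \b instead: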
r'(ate|ize|ify|able)\b'
then use re.sub() to replace those:
re.sub(r'(ate|ize|ify|able)\b', '', word)
which works just fine:
>>> word='terrorize'
>>> re.sub(r'(ate|ize|ify|able)\b', '', word)
'terror'
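The difference between \b and $ only shows up when the input holds more than one word. The comparison below is a small sketch; the sample text and the two constant names are assumptions made for illustration:

```python
import re

# \b matches at the end of every word; $ matches only at the end of the string.
SUFFIX_AT_WORD_END = re.compile(r'(?:ate|ize|ify|able)\b')
SUFFIX_AT_STRING_END = re.compile(r'(?:ate|ize|ify|able)$')

text = 'terrorize and simplify'  # assumed sample input

stripped_every_word = SUFFIX_AT_WORD_END.sub('', text)
stripped_last_word = SUFFIX_AT_STRING_END.sub('', text)

print(stripped_every_word)  # terror and simpl
print(stripped_last_word)   # terrorize and simpl
```

For a single word with no surrounding text, both patterns give the same result.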

You need to adjust the parentheses. Just change pat from:
(?=ate|ize|ify|able)$
to:
(?=(ate|ize|ify|able)$)
If you need to remove the suffixes later, you could use the pattern:
^(.*)(?=(ate|ize|ify|able)$)
Test in REPL:
>>> pat = '^(.*)(?=(ate|ize|ify|able)$)'
>>> word = 'terrorize'
>>> re.findall(pat, word)
[('terror', 'ize')]
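The same pattern can be wrapped in a small helper that also copes with words that have no listed suffix. This is only a sketch; split_suffix is a hypothetical name, not something from the answer:

```python
import re

# Pattern from the answer above: group 1 is the stem, group 2 the suffix.
STEM_AND_SUFFIX = re.compile(r'^(.*)(?=(ate|ize|ify|able)$)')


def split_suffix(word):
    """Return (stem, suffix); suffix is '' when none of the endings match."""
    match = STEM_AND_SUFFIX.search(word)
    if match is None:
        return word, ''
    return match.group(1), match.group(2)


print(split_suffix('terrorize'))
print(split_suffix('terror'))
```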

If you are matching word by word, simply remove the lookahead; the $ anchor is sufficient.

Related

Looking at what comes after potential selection regex

When you want to select some text from HTML using regex and it is of importance what comes after the potential selection I would imagine that you'd have to do something like this:
selected = re.findall(r'<a (.*?) >About', text)
Obviously this does not work but what is the right way to do this?

You want to use a lookahead assertion. From the python docs:
(?=...) Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
In your case:
re.findall(r'<a (.*?)(?= *>About)', text)
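A quick way to see how this behaves is to run it on a page with more than one link. The sample text below is an assumption, and the [^>]*? variant is a suggested tweak rather than part of the answer: it stops the lazy group from running past the end of the first tag.

```python
import re

# Assumed sample input with two links; only the second is followed by "About".
text = '<a href="#home">Home</a> <a href="#about">About</a>'

# Pattern from the answer: .*? may cross tag boundaries to reach ">About".
loose = re.findall(r'<a (.*?)(?= *>About)', text)

# Variant (an assumption, not from the answer): [^>] keeps the capture
# inside a single opening tag.
strict = re.findall(r'<a ([^>]*?)(?= *>About)', text)

print(loose)
print(strict)
```

With this input the loose pattern captures everything from the first link through the second, while the strict one returns only the attributes of the link whose text is About.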

You should use the re module if you want to use regex in Python. re.findall is a good method for collecting all the matching texts into a list.
import re
print(re.findall(r'<(\w+)\s+(.*?)\s*>(.*?)</\1>', '<a href="#about">About</a>'))
This outputs:
[('a', 'href="#about"', 'About')]

Seems like re takes proper care of the prefix and the suffix:
>>> a = '<a href="#about">About</a>'
>>> re.findall(r'<a (.*?) >About', a)
[]
>>> re.findall(r'<a (.*?)>About', a)
['href="#about"']
>>> re.findall(r'<a (.*?)>Abo ut', a)
[]

how to replace multiple consecutive repeating characters into 1 character in python?

I have a string in Python and I want to collapse each run of consecutive repeated characters into a single character.
For example:
st = "UUUURRGGGEENNTTT"
print(st.replace(r'(\w){2,}',r'\1'))
But this command doesn't seem to work. Can anybody help me find what's wrong with it?
There is another way to solve this, but I want to understand why the command above fails and whether there is any way to correct it:
print(re.sub(r"([A-Z])\1+", r"\1", st))  # prints URGENT

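You need to use a regex, so you can do this:
import re
re.sub(r'[^\w\s]|(.)(?=\1)', '', 'UUURRRUU')
The result is 'URU': every character that is immediately followed by a copy of itself is removed, so each run collapses to a single character. The [^\w\s] alternative additionally strips punctuation.
Here is how a closely related regex, (.)(?=.*\1), breaks down:
(.) matches any character except a newline and captures it as group 1
(?=...) is a lookahead: it checks what follows without consuming it
.* matches zero or more characters (any character except a newline)
\1 matches the text captured by group 1, which is the U or R ...
All matches are then replaced with ''. Note that because of the .*, this variant removes a character whenever the same character appears anywhere later in the string, not only directly after it; for collapsing runs, use (.)(?=\1).
You can also read up on lookahead, and check this tool, which I use to work out my regexes. It describes every part of a pattern and you can learn a lot from it:
regexer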

The reason your code does not work is that str.replace does not support regex; it can only replace a literal substring with another string. You need the re module if you want to replace by matching a regex pattern.
Secondly, your regex pattern is also incorrect: (\w){2,} matches any run of two or more word characters (they don't have to be the same character), so it will not work. You need something like this:
import re
st = "UUUURRGGGEENNTTT"
print(re.sub(r'(\w)\1+', r'\1', st))
# URGENT
Now this will only match the same character 2 or more times.
An alternative solution is the unique_justseen recipe from the itertools documentation, which is built on groupby:
from itertools import groupby
from operator import itemgetter
st = "UUUURRGGGEENNTTT"
new ="".join(map(next, map(itemgetter(1), groupby(st))))
print(new)
# URGENT
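To make the difference between the two patterns concrete, here is a short side-by-side sketch. The variable names are illustrative, and the groupby one-liner is a simplified variant of the recipe above rather than the recipe itself:

```python
import re
from itertools import groupby

st = "UUUURRGGGEENNTTT"

# (\w){2,} matches the whole string as one run of word characters; a repeated
# group keeps only its last iteration, so \1 is the final 'T'.
any_run = re.sub(r'(\w){2,}', r'\1', st)

# (\w)\1+ needs the SAME character repeated, so each run collapses separately.
same_run = re.sub(r'(\w)\1+', r'\1', st)

# groupby yields one (key, group) pair per run; joining the keys drops repeats.
grouped = ''.join(key for key, _ in groupby(st))

print(any_run)   # T
print(same_run)  # URGENT
print(grouped)   # URGENT
```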

string.replace(s, old, new[, maxreplace]) only does substring replacement:
>>> '(\w){2,}'.replace(r'(\w){2,}',r'\1')
'\\1'
That's why it fails: replace cannot work with regular expressions, so there is no way to correct the first command.

Match everything except a specific string

I am using Python 2.7 and have a question with regards to regular expressions. My string would be something like this...
"SecurityGroup:Pub HDP SG"
"SecurityGroup:Group-Name"
"SecurityGroup:TestName"
My regular expression looks something like below
[^S^e^c^r^i^t^y^G^r^o^u^p^:].*
The above seems to work, but I have the feeling it is not very efficient; also, if the string has the word "group" in it, it will fail as well...
What I am looking for is output that contains everything after the colon (:). I also thought I could use group 2 of the pattern below as my match, but the problem is that if there are spaces in the name, I won't get the complete name.
(SecurityGroup):(\w{1,})

Why not just do
security_string.split(':')[1]
To grab the second part of the String after the colon?

You could use lookbehind:
pattern = re.compile(r"(?<=SecurityGroup:)(.*)")
matches = re.findall(pattern, your_string)
Breaking it down:
(?<= # positive lookbehind. Matches things preceded by the following group
SecurityGroup: # pattern you want your matches preceded by
) # end positive lookbehind
( # start matching group
.* # any number of characters
) # end matching group
When tested on the string "something something SecurityGroup:stuff and stuff" it returns matches = ['stuff and stuff'].
Edit:
As mentioned in a comment, pattern = re.compile(r"SecurityGroup:(.*)") accomplishes the same thing. In this case you are matching the string "SecurityGroup:" followed by anything, but only returning the stuff that follows. This is probably more clear than my original example using lookbehind.
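The approaches discussed in this thread can be compared on the sample strings from the question. This is a sketch for comparison only; the variable names are made up, and str.partition is swapped in as a close relative of the split(':') suggestion:

```python
import re

names = [
    "SecurityGroup:Pub HDP SG",
    "SecurityGroup:Group-Name",
    "SecurityGroup:TestName",
]

lookbehind = re.compile(r"(?<=SecurityGroup:).*")   # match only what follows
capture = re.compile(r"SecurityGroup:(.*)")         # match all, return group 1

via_lookbehind = [lookbehind.search(s).group(0) for s in names]
via_capture = [capture.search(s).group(1) for s in names]
# partition splits at the first colon only, so later colons stay in the name.
via_partition = [s.partition(':')[2] for s in names]

print(via_lookbehind)
print(via_capture)
print(via_partition)
```

All three give the same names here, spaces and hyphens included, because .* runs to the end of the line instead of stopping at the first non-word character the way \w{1,} does.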

Maybe this:
([^:"]+[^\s](?="))

How to get the rightest match by regular expression?

I think this is a common problem. But I didn't find a satisfactory answer elsewhere.
Suppose I extract some links from a website. The links are like the following:
http://example.com/goto/http://example1.com/123.html
http://example1.com/456.html
http://example.com/yyy/goto/http://example2.com/789.html
http://example3.com/xxx.html
I want to use regular expression to convert them to their real links:
http://example1.com/123.html
http://example1.com/456.html
http://example2.com/789.html
http://example3.com/xxx.html
However, I can't do that because of the greedy feature of RE.
'http://.*$' will only match the whole sentence. Then I tried 'http://.*?$' but it didn't work either. Nor did re.findall. So is there any other way to do this?
Yes. I can do it by str.split or str.index. But I'm still curious about whether there is a RE solution for this.

You don't need a regex: you can use str.split() to split each link on //, pick the last part, and prepend http:// to it:
>>> s="""http://example.com/goto/http://example1.com/123.html
... http://example1.com/456.html
... http://example.com/yyy/goto/http://example2.com/789.html
... http://example3.com/xxx.html"""
>>> ['http://' + link.split('//')[-1] for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
With a regex, you just need to replace everything between the first // and the last // with an empty string; since the first // has to be kept, use a positive lookbehind:
>>> [re.sub(r'(?<=//)(.*)//','',link) for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']

Use this pattern:
^(.*?[^/])(?=\/[^/]).*?([^/]+)$
and replace with $1/$2.
After reading the comment below, use this pattern to capture what you want:
(http://(?:[^h]|h(?!ttp:))*)$
or this pattern:
(http://(?:(?!http:).)*)$
or this pattern:
http://.*?(?=http://)
and replace with nothing.
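In Python, the negative-lookahead idea can be tried directly on the links from the question. This is a minimal sketch; LAST_URL is an illustrative name, and the rfind line is a non-regex cross-check rather than part of the answer:

```python
import re

# Match 'http://' followed by characters that never start another 'http://',
# all the way to the end of the string, i.e. the rightmost URL.
LAST_URL = re.compile(r'http://(?:(?!http://).)*$')

links = [
    'http://example.com/goto/http://example1.com/123.html',
    'http://example1.com/456.html',
    'http://example.com/yyy/goto/http://example2.com/789.html',
    'http://example3.com/xxx.html',
]

real_links = [LAST_URL.search(link).group(0) for link in links]

# Cross-check without regex: slice from the last occurrence of 'http://'.
real_links_rfind = [link[link.rfind('http://'):] for link in links]

for url in real_links:
    print(url)
```

The regex engine still scans from the left, but the first candidate fails at $ because it cannot cross the second http://, so the search moves on until it reaches the last one.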

Repeat a substring in a nonconsecutive position

How shall we write a RegEx that captures repeating a substring in a nonconsecutive position?
For example, in aaabcaaa, aaa repeats with bc in between.
\1 can only be used in replacement not in the match pattern, right? Can we write (.*)bc\1?

The Regex can be (.+)bc\1
>>> s = "aaabcaaa"
>>> re.search(r'(.+)bc\1',s).group(1)
'aaa'
To address your doubt, let me quote from the Regex HOWTO:
Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise.
The official docs also include a program to solve your problem (slightly changed)
>>> p = re.compile(r'(\b\w+)bc\1')
>>> p.search(s).group(1)
'aaa'
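To check that an entire string has the shape X + 'bc' + X, the same backreference can be combined with re.fullmatch. The helper below is a sketch with a made-up name, repeated_part:

```python
import re

# \1 must repeat exactly what group 1 captured before the literal 'bc'.
REPEATED_AROUND_BC = re.compile(r'(.+)bc\1')


def repeated_part(s):
    """Return X if s is exactly X + 'bc' + X, otherwise None."""
    match = REPEATED_AROUND_BC.fullmatch(s)
    return match.group(1) if match else None


print(repeated_part('aaabcaaa'))
print(repeated_part('xybcxy'))
print(repeated_part('aaabcaab'))
```

fullmatch is used here because re.search would still find the shorter substring 'aabcaa' inside 'aaabcaab' and report a match.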

Yes, you can use \1 in the match. I guess you haven't tried before asking?
