Looking at what comes after potential selection regex - python

When you select text from HTML with a regex and it matters what comes after the potential selection, I imagine you would have to do something like this:
selected = re.findall(r'<a (.*?) >About', text)
Obviously this does not work, but what is the right way to do it?

You want to use a lookahead assertion. From the Python docs:
(?=...) Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
In your case:
re.findall(r'<a (.*?)(?= *>About)', text)
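A minimal sketch of what the lookahead does, assuming a made-up sample string (the `text` value below is not from the question). The lookahead checks that `>About` follows, but leaves it out of the match:

```python
import re

# Hypothetical sample input; note the space before '>' as in the question.
text = '<p>Intro</p><a href="#about" >About</a>'

# The lookahead (?= *>About) must succeed after the group, but consumes nothing.
attrs = re.findall(r'<a (.*?)(?= *>About)', text)
print(attrs)  # ['href="#about"']

# The overall match stops before the text that the lookahead inspected.
whole = re.search(r'<a (.*?)(?= *>About)', text).group(0)
print(whole)  # <a href="#about"

# The example from the docs: 'Isaac ' matches only when 'Asimov' follows.
names = re.findall(r'Isaac (?=Asimov)', 'Isaac Asimov, Isaac Newton')
print(names)  # ['Isaac ']
```

Because `.*?` can run across tag boundaries, this sketch is only safe on input with a single candidate anchor; for real HTML an HTML parser is usually the more robust choice.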

You should use the re module if you want to use regular expressions in Python. re.findall is a good choice for collecting all the matching texts into a list.
import re
print(re.findall(r'<(\w+)\s+(.*?)\s*>(.*?)</\1>', '<a href="#about">About</a>'))
This outputs:
[('a', 'href="#about"', 'About')]

Seems like re takes proper care of the prefix and the suffix:
a = '<a href="#about">About</a>'
re.findall(r'<a (.*?) >About', a)
[]
re.findall(r'<a (.*?)>About', a)
['href="#about"']
re.findall(r'<a (.*?)>Abo ut', a)
[]

Related

How to search for and replace a term within another search term

I have a url I get from parsing a swagger's api.json file in Python.
The URL looks something like this and I want to replace the dashes with underscores, but only inside the curly brackets.
10.147.48.10:8285/pet-store-account/{pet-owner}/version/{pet-type-id}/pet-details-and-name
So, {pet-owner} will become {pet_owner}, but pet-store-account will remain the same.
I am looking for a regular expression that will let me perform a non-greedy search and then do a search-and-replace on each of the first search's matches.
A Python re approach is what I am looking for, but I would also appreciate a Vim one-liner.
The expected final result is:
10.147.48.10:8285/pet-store-account/{pet_owner}/version/{pet_type_id}/pet-details-and-name
Provided that all '{...}' blocks are well formed, you can use a lookahead (trailing context) to decide whether a given dash is inside a block: it only has to be followed by '...}', where '.' is any character other than '{'.
exp = re.compile(r'(?=[^{]*})-')
...
substituted_url = re.sub(exp,'_',url_string)
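For reference, a runnable sketch of the same idea applied to the URL from the question. The lookahead is written after the dash here, which is an equivalent formulation; the variable names follow the snippet above.

```python
import re

url_string = ("10.147.48.10:8285/pet-store-account/{pet-owner}"
              "/version/{pet-type-id}/pet-details-and-name")

# A dash counts as "inside braces" if a '}' follows with no '{' in between.
exp = re.compile(r'-(?=[^{]*})')

substituted_url = exp.sub('_', url_string)
print(substituted_url)
# 10.147.48.10:8285/pet-store-account/{pet_owner}/version/{pet_type_id}/pet-details-and-name
```

This relies on the braces being balanced; a stray '}' would make dashes outside any block look like they are inside one.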
Using lookahead and lookbehind in Vim:
s/\({[^}]*\)\@<=-\([^{]*}\)\@=/_/g
The pattern has three parts:
\({[^}]*\)\@<= matches, but does not consume, an opening brace followed by anything except a closing brace, immediately behind the next part.
- matches a hyphen.
\([^{]*}\)\@= matches, but does not consume, anything except an opening brace, followed by a closing brace, immediately ahead of the previous part.
The same technique cannot be reproduced exactly with Python's re module, because it only allows fixed-width lookbehinds.
Result:
Before
outside-braces{inside-braces}out-again{in-again}out-once-more{in-once-more}
After
outside-braces{inside_braces}out-again{in_again}out-once-more{in_once_more}
Because it checks for braces in the right place both before and after the hyphen, this solution (unlike others which use only lookahead assertions) behaves sensibly in the face of unmatched braces:
Before
b-c{d-e{f-g}h-i
b-c{d-e}f-g}h-i
b-c{d-e}f-g{h-i
b-c}d-e{f-g}h-i
After
b-c{d-e{f_g}h-i
b-c{d_e}f-g}h-i
b-c{d_e}f-g{h-i
b-c}d-e{f_g}h-i
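The two claims above (no variable-width lookbehind in Python's re, and lookahead-only patterns misbehaving on unmatched braces) can be checked with a short sketch. The helper names `fix_lookahead_only` and `fix_brace_pairs` are made up for illustration; the second one is a substitute technique (a callback on whole `{...}` pairs), not a translation of the Vim pattern.

```python
import re

# A variable-width lookbehind, as used in the Vim pattern, is rejected by re.
try:
    re.compile(r'(?<={[^}]*)-(?=[^{]*})')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)
print(lookbehind_error)

def fix_lookahead_only(s):
    # Only checks what follows the dash.
    return re.sub(r'-(?=[^{]*})', '_', s)

def fix_brace_pairs(s):
    # Matches a complete {...} pair, then rewrites the dashes inside it.
    return re.sub(r'{[^{}]*}', lambda m: m.group(0).replace('-', '_'), s)

sample = 'b-c}d-e{f-g}h-i'
print(fix_lookahead_only(sample))  # b_c}d-e{f_g}h-i  (dash before the stray '}' is changed)
print(fix_brace_pairs(sample))     # b-c}d-e{f_g}h-i
```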
Use a two-step approach:
import re
url = "10.147.48.10:8285/pet-store-account/{pet-owner}/version/{pet-type-id}/pet-details-and-name"
rx = re.compile(r'{[^{}]+}')
def replacer(match):
    return match.group(0).replace('-', '_')
url = rx.sub(replacer, url)
print(url)
Which yields
10.147.48.10:8285/pet-store-account/{pet_owner}/version/{pet_type_id}/pet-details-and-name
This looks for pairs of { and } and replaces every - with _ inside it.
There may be solutions with just one line but this one is likely to be understood in a couple of months as well.
Edit: For one-line-gurus:
url = re.sub(r'{[^{}]+}',
             lambda x: x.group(0).replace('-', '_'),
             url)
Solution in Vim:
%s/\({.*\)\@<=-\(.*}\)\@=/_/g
Explanation of matched pattern:
\({.*\)\@<=-\(.*}\)\@=
\({.*\)\@<= Forces the match to have a {.* behind
- Specifies a dash (-) as the match
\(.*}\)\@= Forces the match to have a .*} ahead
Use a Python lookahead to replace only the dashes enclosed within curly brackets {}:
Description:
(?=...):
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
Solution
a = "10.147.48.10:8285/pet-store-account/**{pet-owner}**/version/**{pet-type-id}**/pet-details-and-name"
import re
re.sub(r"(?=[^{]*})-", "_", a)
Output:
'10.147.48.10:8285/pet-store-account/**{pet_owner}**/version/**{pet_type_id}**/pet-details-and-name'
Another way to do this in Vim is to use a sub-replace-expression:
:%s/{\zs[^}]*\ze}/\=substitute(submatch(0),'-','_','g')/g
\zs and \ze restrict the match to the text between the { and } characters. A replacement of the form \={expr} evaluates {expr} for each substitution. Here it calls Vimscript's substitute({text}, {pat}, {replace}, {flags}) function on the entire match, submatch(0), to convert each - to _.
For more help see:
:h sub-replace-expression
:h /\zs
:h submatch()
:h substitute()

how to replace multiple consecutive repeating characters into 1 character in python?

I have a string in Python and I want to collapse each run of consecutive repeated characters into a single character.
For example:
st = "UUUURRGGGEENNTTT"
print(st.replace(r'(\w){2,}',r'\1'))
But this command doesn't seem to work. Can anybody help me find what's wrong with it?
There is another way to solve this, but I want to understand why the command above fails and whether there is any way to correct it:
print(re.sub(r"([A-Z])\1+", r"\1", st))  # prints URGENT
You need to use a regex, so you can do this:
import re
re.sub(r'[^\w\s]|(.)(?=\1)', '', 'UUURRRUU')
The result is 'URU'.
How the (.)(?=\1) part of the regex works:
(.) matches and captures any character except a newline.
(?=\1) is a lookahead: it succeeds only if the captured character (here U or R) is immediately followed by the same character, without consuming it.
The [^\w\s] alternative additionally removes any character that is neither a word character nor whitespace.
All matches are then replaced with '', which leaves only the last character of each run.
You can also check this:
lookahead
Also check this tool, which I use to work out my regexes; it describes everything and you can learn a lot from it:
regexer
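A small check of what the `(.)(?=\1)` lookahead pattern does, as a sketch (the helper name `squeeze` is made up for illustration):

```python
import re

def squeeze(s):
    # Delete every character that is immediately followed by a copy of itself,
    # so only the last character of each run survives.
    return re.sub(r'(.)(?=\1)', '', s)

print(squeeze('UUUURRGGGEENNTTT'))  # URGENT
print(squeeze('UUURRRUU'))          # URU
```

Note that runs are collapsed independently, so a letter that appears in two separate runs is kept twice.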
Your code does not work because str.replace does not support regex; it can only replace one literal substring with another string. You need the re module if you want to replace by matching a regex pattern.
Secondly, your regex pattern is also incorrect: (\w){2,} matches any run of 2 or more word characters (they don't have to be the same character), so it will not work. You need something like this:
import re
st = "UUUURRGGGEENNTTT"
print(re.sub(r'(\w)\1+', r'\1', st))
# URGENT
Now this will only match the same character 2 or more times.
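To make the difference between the two patterns concrete, here is a short sketch using the string from the question (the names `wrong` and `right` are just labels for illustration):

```python
import re

st = "UUUURRGGGEENNTTT"

# (\w){2,} matches the whole run of word characters in one go; group 1 only
# remembers the last repetition, so the entire string collapses to 'T'.
wrong = re.sub(r'(\w){2,}', r'\1', st)
print(wrong)  # T

# (\w)\1+ requires the *same* captured character to repeat.
right = re.sub(r'(\w)\1+', r'\1', st)
print(right)  # URGENT
```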
An alternative, “unique” solution is the unique_justseen recipe from the itertools documentation:
from itertools import groupby
from operator import itemgetter
st = "UUUURRGGGEENNTTT"
new ="".join(map(next, map(itemgetter(1), groupby(st))))
print(new)
# URGENT
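The recipe works because groupby yields one (key, group) pair per run of equal consecutive items. A slightly more explicit sketch of the same idea, keeping only the keys:

```python
from itertools import groupby

st = "UUUURRGGGEENNTTT"

# Each run of equal characters becomes one (key, group) pair.
runs = [(key, len(list(group))) for key, group in groupby(st)]
print(runs)  # [('U', 4), ('R', 2), ('G', 3), ('E', 2), ('N', 2), ('T', 3)]

# Joining the keys keeps one character per run.
new = "".join(key for key, _ in groupby(st))
print(new)  # URGENT
```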
string.replace(s, old, new[, maxreplace]) only does substring replacement:
>>> '(\w){2,}'.replace(r'(\w){2,}',r'\1')
'\\1'
That is why it fails: str.replace cannot work with regular expressions, so there is no way to correct the first command without switching to the re module.

Regular Expression: Include Text After (...) Group

I am learning about regular expressions. I need to match things in a parenthesis group followed by some pattern that I define. When I try this with regular expressions (in Python), it only returns the part in parentheses that it matched, but not the pattern which follows it. An example should clarify:
import re
s = "texttoignore_ABCABC12345_moretexttoignore"
re.findall("(ABC)+\d+", s)
When I speak of the parenthesis group, in the example above this is the "(ABC)+" part. What I intend is for it to look for one or more repetitions of the pattern in parentheses (in this case "ABC"), then the pattern after.
The problem is this: it does not return the pattern after. (In this example, it would return 'ABC', but I would want 'ABCABC12345' or 'ABC12345' or better yet '12345')
How can you include the part after the parentheses in the return value? Is this something about regular expressions or is it specific to this Python method?
Thanks!
John
The "problem" here is the rather specific behavior of re.findall:
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
You have a few options here. Either make your group non-capturing:
>>> re.findall(r"(?:ABC)+\d+", s)
['ABCABC12345']
or use re.finditer:
>>> [m.group(0) for m in re.finditer(r"(ABC)+\d+", s)]
['ABCABC12345']
If you only want to find the pattern once, then @Jkdc's approach from the comments works fine.
>>> re.search(r"(ABC)+\d+", s).group()
'ABCABC12345'
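A compact sketch of how the grouping changes what re.findall returns, using the string from the question. The last two patterns are additions for illustration; they show one way to get only the digits, which the question mentions as the preferred result.

```python
import re

s = "texttoignore_ABCABC12345_moretexttoignore"

# One capturing group: findall returns the group, not the whole match.
only_group = re.findall(r"(ABC)+\d+", s)
print(only_group)  # ['ABC']

# No capturing groups: findall returns the whole match.
whole = re.findall(r"(?:ABC)+\d+", s)
print(whole)  # ['ABCABC12345']

# Capture just the digits that follow the repeated prefix.
digits = re.findall(r"(?:ABC)+(\d+)", s)
print(digits)  # ['12345']

# Two capturing groups: findall returns a list of tuples.
both = re.findall(r"((?:ABC)+)(\d+)", s)
print(both)  # [('ABCABC', '12345')]
```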

How to get the rightest match by regular expression?

I think this is a common problem. But I didn't find a satisfactory answer elsewhere.
Suppose I extract some links from a website. The links are like the following:
http://example.com/goto/http://example1.com/123.html
http://example1.com/456.html
http://example.com/yyy/goto/http://example2.com/789.html
http://example3.com/xxx.html
I want to use regular expression to convert them to their real links:
http://example1.com/123.html
http://example1.com/456.html
http://example2.com/789.html
http://example3.com/xxx.html
However, I can't do that because of the greedy feature of RE.
'http://.*$' will only match the whole sentence. Then I tried 'http://.*?$' but it didn't work either. Nor did re.findall. So is there any other way to do this?
Yes. I can do it by str.split or str.index. But I'm still curious about whether there is a RE solution for this.
You don't need a regex: you can split each link on // with str.split(), pick the last part, and prepend http:// to it:
>>> s="""http://example.com/goto/http://example1.com/123.html
... http://example1.com/456.html
... http://example.com/yyy/goto/http://example2.com/789.html
... http://example3.com/xxx.html"""
>>> ['http://'+link.split('//')[-1] for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
With a regex, you just need to replace everything between the first // and the last // (including the last //) with an empty string. Because the first // must be kept, anchor the match with a positive lookbehind:
>>> [re.sub(r'(?<=//)(.*)//','',link) for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
Use this pattern:
^(.*?[^/])(?=\/[^/]).*?([^/]+)$
and replace with $1/$2.
After reading the comment below: use this pattern to capture what you want:
(http://(?:[^h]|h(?!ttp:))*)$
or this pattern:
(http://(?:(?!http:).)*)$
or match with this pattern:
http://.*?(?=http://)
and replace with nothing.
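As a sketch of how a negative-lookahead pattern of this kind can be used from Python, assuming every link contains at least one http:// (the name `last_url` is made up for illustration):

```python
import re

links = [
    "http://example.com/goto/http://example1.com/123.html",
    "http://example1.com/456.html",
    "http://example.com/yyy/goto/http://example2.com/789.html",
    "http://example3.com/xxx.html",
]

# Match 'http://' followed by characters that never start another 'http://',
# all the way to the end of the string: this can only be the rightmost URL.
last_url = re.compile(r'http://(?:(?!http://).)*$')

real_links = [last_url.search(link).group(0) for link in links]
for link in real_links:
    print(link)
```

The `(?!http://).` step consumes one character at a time and refuses to step onto the start of another URL, which is what defeats the greediness problem described in the question.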

python regex suffix matching

For a typical set of word suffixes (ize, fy, ly, able, etc.), I want to know whether a given word ends with any of them, and subsequently remove the suffix. I know this can be done iteratively with word.endswith('ize'), for example, but I believe there is a neater regex way of doing it. I tried a positive lookahead with an end marker $, but for some reason it didn't work:
pat='(?=ate|ize|ify|able)$'
word='terrorize'
re.findall(pat,word)
Little-known fact: endswith accepts a tuple of possibilities:
if word.endswith(('ate','ize','ify','able')):
    #...
Unfortunately, it doesn't indicate which string was found, so it doesn't help with removing the suffix.
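If you want to stay with endswith, one possible workaround is to loop over the candidates so that you know which one matched. This is only a sketch; the helper name `strip_suffix` and the constant `SUFFIXES` are made up for illustration.

```python
SUFFIXES = ('ate', 'ize', 'ify', 'able')

def strip_suffix(word, suffixes=SUFFIXES):
    """Return (stem, suffix) for the first suffix that matches, else (word, None)."""
    for suffix in suffixes:
        if word.endswith(suffix):
            # Slice off exactly the matched suffix.
            return word[:-len(suffix)], suffix
    return word, None

print(strip_suffix('terrorize'))  # ('terror', 'ize')
print(strip_suffix('readable'))   # ('read', 'able')
print(strip_suffix('cat'))        # ('cat', None)
```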
What you are looking for is actually a non-capturing group, (?:...).
Check this out:
re.sub(r"(?:ate|ize|ify|able)$", "", "terrorize")
Have a look at this site: Regex.
There are tons of useful regex techniques there. Hope you enjoy it.
BTW, the Python library documentation itself is a neat and wonderful tutorial.
I use help() a lot :)
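A lookahead is a zero-width assertion: like the anchors ^ and $, it pins the match to a specific location but does not itself match any text. Your pattern therefore asks for a position that is followed by a suffix and is also the end of the string, which can never be true.
You want to match the suffixes themselves, but only at the end of a word, so use the word-boundary anchor \b instead: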
r'(ate|ize|ify|able)\b'
then use re.sub() to replace those:
re.sub(r'(ate|ize|ify|able)\b', '', word)
which works just fine:
>>> word='terrorize'
>>> re.sub(r'(ate|ize|ify|able)\b', '', word)
'terror'
You need to adjust the parentheses; just change pat from:
(?=ate|ize|ify|able)$
to:
(?=(ate|ize|ify|able)$)
If you need remove the suffixes later, you could use the pattern:
^(.*)(?=(ate|ize|ify|able)$)
Test in REPL:
>>> pat = '^(.*)(?=(ate|ize|ify|able)$)'
>>> word = 'terrorize'
>>> re.findall(pat, word)
[('terror', 'ize')]
If you are matching word by word, you can simply drop the lookahead; the $ anchor is sufficient.
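Putting the variants side by side, as a sketch using the word from the question (the name `suffix_re` and the two-word sample string are made up for illustration):

```python
import re

word = 'terrorize'

# Original attempt: the position must be followed by a suffix AND be the end
# of the string at the same time, so nothing can ever match.
original = re.findall(r'(?=ate|ize|ify|able)$', word)
print(original)  # []

# Consuming the suffix and anchoring with $ works for a single word.
suffix_re = re.compile(r'(?:ate|ize|ify|able)$')
stem = suffix_re.sub('', word)
found = suffix_re.search(word).group(0)
print(stem, found)  # terror ize

# \b instead of $ strips the suffix from every word in a longer string.
text = re.sub(r'(?:ate|ize|ify|able)\b', '', 'terrorize and simplify')
print(text)  # terror and simpl
```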
