Return recurring regex matches with Python

I have a string:
SomeTextSomeTextASomeThingBSomeTextSomeTextASomeThingElseBSomeText
I want the strings SomeThing and SomeThingElse returned, because they are bracketed by A and B. Assume SomeText does not contain any A ... B occurrences.
Any hint would be highly appreciated.
Here's what I tried, but it doesn't work:
import re
string = 'SomeTextSomeTextASomeThingBSomeTextSomeTextASomeThingElseBSomeText'
regex='(A.*B)'
I guess the regex is not correct, and I also don't know how to access the matches. Is it match or finditer or…?

Try using re.findall:
>>> print(re.findall(r'A(.*?)B', string))
['SomeThing', 'SomeThingElse']
Note the question mark. Without it the matching is done greedily - it will consume as many characters as possible.
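A minimal sketch of the difference, using the string from the question (the names lazy, greedy and spans are only illustrative). It also shows re.finditer, which yields match objects and may be handy when the positions of the matches are needed:

```python
import re

string = 'SomeTextSomeTextASomeThingBSomeTextSomeTextASomeThingElseBSomeText'

# Non-greedy: each match stops at the first B that follows an A.
lazy = re.findall(r'A(.*?)B', string)
print(lazy)    # ['SomeThing', 'SomeThingElse']

# Greedy: .* runs to the last B in the string, so there is one long match.
greedy = re.findall(r'A(.*)B', string)
print(greedy)  # ['SomeThingBSomeTextSomeTextASomeThingElse']

# finditer returns match objects, giving access to positions as well as text.
spans = [(m.start(1), m.group(1)) for m in re.finditer(r'A(.*?)B', string)]
print(spans)
```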

Related

how to replace multiple consecutive repeating characters into 1 character in python?

I have a string in Python and I want to replace each run of consecutive repeated characters with a single character.
For example:
st = "UUUURRGGGEENNTTT"
print(st.replace(r'(\w){2,}',r'\1'))
But this command doesn't seem to work. Can anybody help me find what's wrong with it?
There is another way to solve this, but I want to understand why the command above fails and whether there is any way to correct it:
print(re.sub(r"([A-Z])\1+", r"\1", st))  # prints URGENT
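You need to use a regex, for example:
import re
re.sub(r'[^\w\s]|(.)(?=\1)', '', 'UUURRRUU')
The result is URU: each run of repeated characters collapses to a single character. The first alternative, [^\w\s], additionally deletes punctuation.
How the (.)(?=\1) part works:
(.) means: match any character except a newline and capture it as group 1
(?=...) means: a lookahead, which checks what follows without consuming it
\1 means: match the text captured by group 1, which is the U or R ...
So a character matches only when the same character follows it, and every match is replaced with ''.
A related pattern, (.)(?=.*\1), adds .* (any characters, zero or more times) inside the lookahead, so it removes a character whenever the same character occurs anywhere later in the string, not only immediately after it.
You can also read up on lookahead, and try a regex tester such as regexer, which describes every part of a pattern and is a good way to learn.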
Your code does not work because str.replace does not support regex; it can only replace one substring with another string. You need the re module if you want to replace by matching a regex pattern.
Secondly, your regex pattern is also incorrect: (\w){2,} matches any run of 2 or more word characters (they don't have to be the same character), so it will not work. You will need to do something like this:
import re
st = "UUUURRGGGEENNTTT"
print(re.sub(r'(\w)\1+', r'\1', st))
# URGENT
Now this will only match the same character 2 or more times.
An alternative that avoids regex is the unique_justseen recipe from the itertools documentation:
from itertools import groupby
from operator import itemgetter
st = "UUUURRGGGEENNTTT"
new ="".join(map(next, map(itemgetter(1), groupby(st))))
print(new)
# URGENT
string.replace(s, old, new[, maxreplace]) only does substring replacement:
>>> '(\w){2,}'.replace(r'(\w){2,}',r'\1')
'\\1'
That's why it fails: str.replace cannot work with regular expressions, so there is no way to correct the first command without switching to re.
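The points made in these answers can be checked side by side. This is only an illustrative sketch; the names literal, collapsed and any_run are not from the original posts:

```python
import re

st = "UUUURRGGGEENNTTT"

# str.replace looks for the literal text '(\w){2,}', which is not in st,
# so the string comes back unchanged.
literal = st.replace(r'(\w){2,}', r'\1')
print(literal)    # UUUURRGGGEENNTTT

# The backreference \1 forces the repeated character to be the same one.
collapsed = re.sub(r'(\w)\1+', r'\1', st)
print(collapsed)  # URGENT

# (\w){2,} matches the whole string as one run of word characters;
# the group keeps only its last repetition, so a single T remains.
any_run = re.sub(r'(\w){2,}', r'\1', st)
print(any_run)    # T
```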

Regex check if backslash before every symbols using python

I ran into a problem when trying to check whether an input regex is correct or not.
I'd like to check that there is a backslash before every symbol, but I don't know how to implement this in Python.
For example:
number: 123456789. (return False)
phone\:111111 (return True)
I tried to use (?!) and (?=) in Python, but it didn't work.
Update:
I'd like to match the following strings:
\~, \!, \#, \$, \%, \^, \&, \*, \(, \), \{, \}, \[, \], \:, \;, \", \', \>, \<, \?
Thank you very much.
import re
if re.search(r"\\\W", r"phone\:111111") is not None:
    print("OK")
Does it work?
Reading between the lines a bit, it sounds like you are trying to pass a string to a regex and you want to make sure it has no special characters in it that are unescaped.
Python's re module has an inbuilt re.escape() function for this.
Example:
>>> import re
>>> print(re.escape("phone:111111"))
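phone\:111111
(On Python 3.7 and later this prints phone:111111, because re.escape no longer escapes characters such as : that have no special meaning in a regex.)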
Check that the entire string is composed of single characters or pairs of backslash+symbol:
import re
def has_backslash_before_every_symbol(s):
    # Triple-quoted raw string, because the pattern contains both " and '.
    pattern = r'''^(\\[~!#$%^&*(){}\[\]:;"'><?]|[^~!#$%^&*(){}\[\]:;"'><?])*$'''
    return re.match(pattern, s) is not None
Python regex reference: https://docs.python.org/3/library/re.html
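If it is also useful to know where the unescaped symbols are, one possible variation is sketched below. It assumes the symbol list from the question's update; the names SYMBOLS, TOKEN and unescaped_symbols are hypothetical and not part of the answers above:

```python
import re

# Symbol list taken from the question's update (assumed to be complete).
SYMBOLS = r"""~!#$%^&*(){}\[\]:;"'><?"""

# Scan left to right: either a backslash plus the character it escapes,
# or a bare symbol. Only the bare symbols are reported.
TOKEN = re.compile(r"\\.|[" + SYMBOLS + "]", re.DOTALL)

def unescaped_symbols(s):
    """Return (index, symbol) pairs for symbols without a leading backslash."""
    return [(m.start(), m.group())
            for m in TOKEN.finditer(s)
            if not m.group().startswith("\\")]

print(unescaped_symbols("number: 123456789."))  # [(6, ':')]
print(unescaped_symbols(r"phone\:111111"))      # []
```

An empty list corresponds to the True case in the question, and a non-empty list to the False case.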

How to get the rightest match by regular expression?

I think this is a common problem. But I didn't find a satisfactory answer elsewhere.
Suppose I extract some links from a website. The links are like the following:
http://example.com/goto/http://example1.com/123.html
http://example1.com/456.html
http://example.com/yyy/goto/http://example2.com/789.html
http://example3.com/xxx.html
I want to use regular expression to convert them to their real links:
http://example1.com/123.html
http://example1.com/456.html
http://example2.com/789.html
http://example3.com/xxx.html
However, I can't do that because of the greedy feature of RE.
'http://.*$' will only match the whole sentence. Then I tried 'http://.*?$' but it didn't work either. Nor did re.findall. So is there any other way to do this?
Yes. I can do it by str.split or str.index. But I'm still curious about whether there is a RE solution for this.
You don't need a regex. You can use str.split() to split each link on //, pick the last part, and prepend http:// to it:
>>> s="""http://example.com/goto/http://example1.com/123.html
... http://example1.com/456.html
... http://example.com/yyy/goto/http://example2.com/789.html
... http://example3.com/xxx.html"""
>>> ['http://'+link.split('//')[-1] for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
With a regex, you just need to replace everything between the first // and the last // (including the last //) with an empty string. Because the first // has to stay, use a positive look-behind:
>>> import re
>>> [re.sub(r'(?<=//)(.*)//','',link) for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
Use this pattern:
^(.*?[^/])(?=\/[^/]).*?([^/]+)$
and replace with $1/$2.
After reading the comment below, use this pattern to capture what you want:
(http://(?:[^h]|h(?!ttp:))*)$
or this pattern:
(http://(?:(?!http:).)*)$
or this pattern:
http://.*?(?=http://)
and replace with nothing.
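As a rough Python sketch of the pattern (http://(?:(?!http:).)*)$, assuming every link ends with the URL of interest (the names RIGHTMOST and rightmost_url are hypothetical):

```python
import re

# Tempered dot: consume a character only if "http:" does not start there,
# so the only match that can reach the end of the string is the last URL.
RIGHTMOST = re.compile(r'(http://(?:(?!http:).)*)$')

def rightmost_url(link):
    """Return the last http:// URL in link, or link itself if none is found."""
    m = RIGHTMOST.search(link)
    return m.group(1) if m else link

links = [
    'http://example.com/goto/http://example1.com/123.html',
    'http://example1.com/456.html',
    'http://example.com/yyy/goto/http://example2.com/789.html',
    'http://example3.com/xxx.html',
]
for link in links:
    print(rightmost_url(link))
```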

How to find a specific character in a string and put it at the end of the string

I have this string:
'Is?"they'
I want to find the question mark (?) in the string, and put it at the end of the string. The output should look like this:
'Is"they?'
I am using the following regular expression in python 2.7. I don't know why my regex is not working.
import re
regs = re.sub('(\w*)(\?)(\w*)', '\\1\\3\\2', 'Is?"they')
print regs
Is?"they # this is the output of my regex.
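Your regex does not move the question mark because " is not in the \w character class, so the third group matches nothing and the replacement leaves the string unchanged. You would need to allow " in that group, with something like:
regs = re.sub('(\w*)(\?)(["\w]*)', '\\1\\3\\2', 'Is?"they')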
The " character is not matched by \w. Hence, it would probably be best to just use a .:
>>> import re
>>> re.sub("(.*)(\?)(.*)", r'\1\3\2', 'Is?"they')
'Is"they?'
. matches any character in a regex (except newlines).
Also, you'll notice that I used a raw string for the second argument of re.sub. Doing so is cleaner than having all those backslashes.
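The same idea can be wrapped in a small helper. This is only a sketch under the assumption that the first occurrence of the character should be moved; the name move_to_end is hypothetical:

```python
import re

def move_to_end(text, char):
    """Move the first occurrence of char to the end of text."""
    # re.escape lets the helper work for characters such as ? that are
    # special in a regex; DOTALL lets . match newlines as well.
    pattern = "(.*?)" + re.escape(char) + "(.*)"
    return re.sub(pattern,
                  lambda m: m.group(1) + m.group(2) + char,
                  text, count=1, flags=re.DOTALL)

print(move_to_end('Is?"they', '?'))  # Is"they?
```

If the character does not occur, the pattern does not match and the text is returned unchanged.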

Regex pattern for illegal regex groups `\g<...>`

In the regex r"\g<NAME>\w+", I would like to detect that a group named NAME must be used for the replacements corresponding to a match.
Which regex matches wrong uses of \g<...>?
For example, the following code finds any groups that are not escaped.
import re
p = re.compile(r"(?:[^\\])\\g<(\w+)>")
for m in p.finditer(r"\g<NAME>\w+\\g<ESCAPED>"):
    print(m.group(1))
But there is one last problem to solve. How can I handle cases such as \g<WRONGUSE\> and \g\<WRONGUSE>?
As far as I am aware, the only restriction on named capture groups is that you can't put metacharacters in the name, such as . or \.
Have you come across some kind of problem with named capture groups?
The regex you used, r"illegal|(\g<NAME>\w+)" is only illegal because you referred to a backreference without it being declared earlier in the regex string. If you want to make a named capture group, it is (?P<NAME>regex)
Like this:
>>> import re
>>> string = "sup bro"
>>> re.sub(r"(?P<greeting>sup) bro", r"\g<greeting> mate", string)
'sup mate'
If you wanted to do some kind of analysis on the actual regex string in use, I don't think there is anything inside the re module which can do this natively.
You would need to run another match on the regex string itself: put the regex into a string variable and then match something like \(\?P<(.*?)> against it, which would give you the named capture group's name.
I hope that is what you are asking for... Let me know.
So, what you want is to get the string of the group name, right?
Maybe you can get it by doing this:
>>> import re
>>> regex = re.compile(r"illegal|(?P<group_name>\w+)")
>>> dict(regex.groupindex)
{'group_name': 1}
As you see, groupindex maps the group names to their positions in the regex. Having that, it is easy to retrieve the string:
>>> # A list of the group names in your regex:
... list(regex.groupindex)
['group_name']
>>> # The string of your group name:
... list(regex.groupindex)[0]
'group_name'
Don't know if that is what you were looking for...
Use a negative lookahead?
\\g(?!<\w+>)
This searches for any \g not followed by <…>, thus a "wrong use".
Thanks to all the comments, I have this solution.
import re
# Good uses.
p = re.compile(r"(?:[^\\])\\g<(\w+)>")
for m in p.finditer(r"</\g\<at__tribut1>\\g<notattribut>>"):
    print(m.group(1))
# Bad uses.
p = re.compile(r"(?:[^\\])\\g(?!<\w+>)")
if p.search(r"</\g\<at__tribut1>\\g<notattribut>>"):
    print("Wrong use !")
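One caveat with (?:[^\\]) is that it requires a character before the backslash, so a \g<...> at the very start of the string is not found. A possible variation, shown here only as a sketch, swaps in a negative lookbehind (?<!\\) instead; the names GOOD and BAD are illustrative, and a doubled backslash followed by a real group reference is not handled:

```python
import re

# (?<!\\) means "not preceded by a backslash" and also succeeds at position 0.
GOOD = re.compile(r"(?<!\\)\\g<(\w+)>")
BAD = re.compile(r"(?<!\\)\\g(?!<\w+>)")

# Well-formed, unescaped group references.
good_names = GOOD.findall(r"\g<NAME>\w+\\g<ESCAPED>")
print(good_names)  # ['NAME']

# Malformed references from the question.
for template in (r"\g<WRONGUSE\>", r"\g\<WRONGUSE>"):
    if BAD.search(template):
        print("Wrong use !", template)
```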
