regex in python - how to understand this ip lable without parentheses - python

I have this code to check if a string is a valid IPv4 address:
import re
def is_ip4(IP):
label = "([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"
pattern = re.compile("(" + label + "\.){3}" + label + "$")
if pattern.match(IP):
print("matched!")
else:
print("No!")
it works fine. but if I remove the parentheses from the label, as this
import re
def is_ip4(IP):
label = "[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]"
pattern = re.compile("(" + label + "\.){3}" + label + "$")
if pattern.match(IP):
print("matched!")
else:
print("No!")
it show valid ip for "2090.1.11.0", "20.1.11.0", but not for "2.1.11.0". I'm actually a bit confused for the cases with vs without parentheses. Can someone explain this for me? thanks

The reason you need the parentheses is because of the two-step process you're using. By itself, the parentheses don't do anything (other than capturing in a group). But you're also doing this:
pattern = re.compile("(" + label + "\.){3}" + label + "$")
The label regex is copied twice, first for three repetitions followed by a period. That copy is fine (almost), because in the statement, it is enclosed in parentheses once more. However, the second copy is outside any parentheses, so you end up with a regex like (simplified):
pattern == '(a|ab|abc\.){3}a|ab|abc$'
This matches if either (a|ab|abc\.){3}a matches, or ab or abc. With parentheses, it would be like:
pattern == '((a|ab|abc)\.){3}(a|ab|abc)$'
So, although the parentheses appear superfluous, they are not for two reasons. They are keeping the period separate from the last option abc and they are keeping the final choices together and apart from the first bit.
However, you shouldn't be doing this in the first place. Just use:
from ipaddress import ip_address
def is_ip4(ip):
try:
ip_address(ip)
return True
except ValueError:
return False
No installation required, it's a standard library.
The reason you get a match for '2090.1.11.0' is because matching it to this:
'([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]$'
Comes down to matching it to this:
'([0-9]){3}[0-9]'
Since, [0-9] is the first option in the 'or' expression in parentheses, repeated three times and the second [0-9] is the first option in the 'or' expression after the {3}.
Note that the $ you put in to ensure the entire string was matches is lumped in with the final 'or' option, so that doesn't do anything here.
Try running the below and note the identical first match:
import re
print(re.findall('([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]$', '2090.1.11.0'))
print(re.findall('([0-9]){3}[0-9]', '2090.1.11.0'))
(ignore the second match on the first line, not as relevant)

Related

wildcard match & replace and/or multiple string wildcard matching

I have two very related questions:
I want to match a string pattern with a wildcard (i.e. containing one or more '*' or '?')
and then form a replacement string with a second wildcard pattern. There the placeholders should refer to the same matched substring
(As for instance in the DOS copy command)
Example: pattern='*.txt' and replacement-pattern='*.doc':
I want aaa.txt --> aaa.doc and xx.txt.txt --> xx.txt.doc
Ideally it would work with multiple, arbitrarily placed wildcards: e.g., pattern='*.*' and replacement-pattern='XX*.*'.
Of course one needs to apply some constraints (e.g. greedy strategy). Otherwise patterns such as X*X*X are not unique for string XXXXXX.
or, alternatively, form a multi-match. That is I have one or more wildcard patterns each with the same number of wildcard characters. Each pattern is matched to one string but the wildcard characters should refer to the same matching text.
Example: pattern1='*.txt' and pattern2='*-suffix.txt
Should match the pair string1='XX.txt' and string2='XX-suffix.txt' but not
string1='XX.txt' and string2='YY-suffix.txt'
In contrast to the first this is a more well defined problem as it avoids the ambiguity problem but is perhaps quite similar.
I am sure there are algorithms for these tasks, however, I am unable to find anything useful.
The Python library has fnmatch but this is does not support what I want to do.
There are many ways to do this, but I came up with the following, which should work for your first question. Based on your examples I’m assuming you don’t want to match whitespace.
This function turns the first passed pattern into a regex and the passed replacement pattern into a string suitable for the re.sub function.
import re
def replaceWildcards(string, pattern, replacementPattern):
splitPattern = re.split(r'([*?])', pattern)
splitReplacement = re.split(r'([*?])', replacementPattern)
if (len(splitPattern) != len(splitReplacement)):
raise ValueError("Provided pattern wildcards do not match")
reg = ""
sub = ""
for idx, (regexPiece, replacementPiece) in enumerate(zip(splitPattern, splitReplacement)):
if regexPiece in ["*", "?"]:
if replacementPiece != regexPiece:
raise ValueError("Provided pattern wildcards do not match")
reg += f"(\\S{regexPiece if regexPiece == '*' else ''})" # Match anything but whitespace
sub += f"\\{idx + 1}" # Regex matches start at 1, not 0
else:
reg += f"({re.escape(regexPiece)})"
sub += f"{replacementPiece}"
return re.sub(reg, sub, string)
Sample output:
replaceWildcards("aaa.txt xx.txt.txt aaa.bat", "*.txt", "*.doc")
# 'aaa.doc xx.txt.doc aaa.bat'
replaceWildcards("aaa10.txt a1.txt aaa23.bat", "a??.txt", "b??.doc")
# 'aab10.doc a1.txt aaa23.bat'
replaceWildcards("aaa10.txt a1-suffix.txt aaa23.bat", "a*-suffix.txt", "b*-suffix.doc")
# 'aaa10.txt b1-suffix.doc aaa23.bat'
replaceWildcards("prefix-2aaa10-suffix.txt a1-suffix.txt", "prefix-*a*-suffix.txt", "prefix-*b*-suffix.doc")
# 'prefix-2aab10-suffix.doc a1-suffix.txt
Note f-strings require Python >=3.6.

Regex to exclude words followed by space

I tried a lot of solutions but can't get this Regex to work.
The string-
"Flow Control None"
I want to exclude "Flow Control" plus the blank space, and only return whatever is on the right.
Since you have tagged your question with #python and #regex, I'll outline a simple solution to your problem using these tools. Furthermore, the other two answers don't really tackle the exact problem of matching "whatever is on the right" of your "Flow Control " prefix.
First, start by importing the re builtin module (read the docs).
import re
Define the pattern you want to match. Here, we're matching "whatever is on the right" ((?P<suffix>.+)$) of ^Flow Control .
pattern = re.compile(r"^Flow Control (?P<suffix>.+)$")
Grab the match for a given string (e.g. "Flow Control None")
suffix = pattern.search("Flow Control None").group("suffix")
print(suffix) # Out: None
Hopefully, this complete working example will also help you
import re
def get_suffix(text: str):
pattern = re.compile(r"^Flow Control (?P<suffix>.+)$")
matches = pattern.search(text)
return matches.group("suffix") if matches else None
examples = [
"Flow Control None",
"Flow Control None None",
"Flow Control None",
"Flow Control ",
]
for example in examples:
suffix = get_suffix(text=example)
if suffix:
print(f"Matched: {repr(suffix)}")
else:
print(f"No matches for: {repr(example)}")
Use split like so:
my_str = 'Flow Control None'
out_str = my_str.split()[-1]
# 'None'
Or use re.findall:
import re
out_str = re.findall(r'^.*\s(\S+)$', my_str)[0]
If you really want a purely regex solution try this: (?<= )[a-zA-Z]*$
The (?<= ) matches a single ' ' but doesn't include it in the match. [a-zA-Z]* matches anything from a to z or A to Z any number of times. $ matches the end of the line.
You could also try replacing the * with a + if you want to ensure that your match has at least one letter (* will produce a 0-length match if your string ends in a space, + will match nothing).
But it may be clearer to do something like
data = "Flow Control None"
split = data.split(' ')
split[len(split) - 1] # returns "None"
EDIT data.split(' ')[-1] also returns "None"
or
data[data.rfind(' ') + 1:] # returns "None"
that don't involve regexes at all.

Change pa$$word to pa\$\$word in Python

I have a string pa$$word. I want to change this string to pa\$\$word. This must be changed to 2 or more such characters only and not for pa$word. The replacement must happen n number of times where n is the number of "$" symbols. For example, pa$$$$word becomes pa\$\$\$\$word and pa$$$word becomes pa\$\$\$word.
How can I do it?
import re
def replacer(matchobj):
mat = matchobj.group()
return "".join(item for items in zip("\\" * len(mat), mat) for item in items)
print re.sub(r"((\$)\2+)", replacer, "pa$$$$word")
# pa\$\$\$\$word
print re.sub(r"((\$)\2+)", replacer, "pa$$$word")
# pa\$\$\$word
print re.sub(r"((\$)\2+)", replacer, "pa$$word")
# pa\$\$word
print re.sub(r"((\$)\2+)", replacer, "pa$word")
# pa$word
((\$)\2+) - We create two capturing groups here. First one is, the entire match as it is, which can be referred later as \1. The second capturing group is a nested one, which captures the string \$ and referred as \2. So, we first match $ once and make sure that it exists more than once, continuously by \2+.
So, when we find a string like that, we call replacer function with the matched string and the captured groups. In the replacer function, we get the entire matched string with matchobj.group() and then we simply interleave that matched string with \.
I believe the regex you're after is:
[$]{2,}
which will match 2 or more of the character $
this should help
import re
result = re.sub("\$", "\\$", yourString)
or you can try
str.replace("\$", "\\$")

Regular expression in python to capture multiple forms of badly formatted addresses

I have been tweaking a regular expression over several days to try to capture, with a single definition, several cases of inconsistent format in the address field of a database.
I am new to Python and regular expressions, and have gotten great feedback here is stackoverflow, and with my new knowledge, I built a RegEx that is getting close to the final result, but still can't spot the problem.
import re
r1 = r"([\w\s+]+),?\s*\(?([\w\s+\\/]+)\)?\s*\(?([\w\s+\\/]+)\)?"
match1 = re.match(r1, 'caracas, venezuela')
match2 = re.match(r1, 'caracas (venezuela)')
match3 = re.match(r1, 'caracas, (venezuela) (df)')
group1 = match1.groups()
group2 = match2.groups()
group3 = match3.groups()
print group1
print group2
print group3
This thing should return 'caracas, venezuela' for groups 1 and 2, and 'caracas, venezuela, df' for group 3, instead, it returns:
('caracas', 'venezuel' 'a')
('caracas ', 'venezuel' 'a')
('caracas', 'venezuela', 'df')
The only perfect match is group 3. The other 2 are isolating the 'a' at the end, and the 2nd one has an extra space at the end of 'caracas '.
Thanks in advance for any insight.
Cheers!
Regular expressions might be overkill... what exactly is your problem statement? What do you need to capture?
Some things I caught (in order of appearance in your regex; sometimes it helps to read it out, left-to-right, English-style):
([\w\s+]+)
This says, "capture one or more (letter or one or more spaces)"
Do you really want to capture the spaces at the end of the city name? Also, you don't need (indeed, shouldn't have) the 1-or-more symbol + inside your brackets [ ], since your regex will already be matching one or more of them based on the outer +. I'd rewrite this part like this:
([\w\s]*\w)
Which will match eagerly up to the last alphanumeric character ("zero or more (letter or space) followed by a letter"). This does assume you have at least one character, but is better than your assumption that a single space would work as well.
Next you have:
,?\s*\(?
which looks okay to me except that it doesn't guarantee that you'll see either a comma or an open paren anymore. What about:
(?:,\s*\(|,\s*|\s*\()
which says, "non-capturingly match either (a comma with maybe some spaces and then an open paren) OR (a comma with maybe some spaces) OR (maybe some spaces and then an open paren)". This enforces that you must have either a comma or a paren or both.
Next you have the capturing expression, very similar to the first:
([\w\s+\\/]+)
Again, you don't want the spaces (or slashes in this case) at the end of the city name, and you don't want the + inside the [ ]:
([\w\s\\/]*\w)
The next expression is probably where you're getting your venezuel a problem; let's take a look:
\)?\s*\(?([\w\s+\\/]+)\)?
This is a rather long one, so let's break it down:
\)?\s*\(?
says to "maybe match a close paren, and then maybe some spaces, and then maybe an open paren". This is okay I guess, let's move on to the real problem:
([\w\s+\\/]+)
This capturing group MUST match at least one character. If the matcher sees "venezuela" at the end of your address, it will eagerly match the characters venezuel and then need to satisfy this final expression with what it has left, a. Try instead:
\)?\s*
Followed by making your entire final expression optional, and the outer expression non-capturing:
(?:\(?([\w\s+\\/]+)\)?)?
The final expression would be:
([\w\s]*\w)(?:,\s*\(|,\s*|\s*\()([\w\s\\/]*\w)\)?\s*(?:\(?([\w\s+\\/]+)\)?)?
Edit: fixed a problem that made the final group capture twice, once with the parens, once without. Now it should only capture the text inside the parens.
Testing it on your examples:
>>> re.match(r, 'caracas, venezuela').groups()
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas (venezuela)').groups()
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas, (venezuela) (df)').groups()
('caracas', 'venezuela', 'df')
Could you not just find all the words in the text?
E.g.:
>>> import re
>>> samples = ['caracas, venezuela','caracas (venezuela)','caracas, (venezuela) (df)']
>>>
>>> def find_words(text):
... return re.findall('\w+',text)
...
>>> for sample in samples:
... print find_words(sample)
...
['caracas', 'venezuela']
['caracas', 'venezuela']
['caracas', 'venezuela', 'df']

Matching optional numbers in regex

This one is probably a simple one, but I could not find an example that's simple enough to understand (sorry, I'm new with RegEx).
I'm writing some Python code to search for any string that matches any of the following examples:
float[20]
float[7532]
float[]
So this is what I have so far:
import re
p = re.compile('float\[[0-9]+\]')
print p.match("float[20]")
print p.match("float[7532]")
print p.match("float[]")
The code works great for the first and second scenarios, but not the third (no numbers between brackets). What's the best way to add that condition?
Thanks a lot!
p = re.compile('float\[[0-9]*\]')
putting a * after the character class means 0 or matches of the character class.
Try
float\[\d*\]
\d is a shortcut for [0-9].
The asterisk matches 0..n (any number) of characters of the character class.
The + operator requires at least one instance of whatever it's applying to, which your third option doesn't have. You want the * operator which is 0 or more. So:
p = re.compile('float\[[0-9]*\]')
Try:
import re
p = re.compile('float\[[0-9]*\]')
print p.match("float[20]")
print p.match("float[7532]")
print p.match("float[]")
+ is for one or more elements and * is used for zero or more element.

Categories