I need to check a string for any IPv4 address or one of the following CIDR blocks: /16 or /24.
So, 192.168.0.1 should match. 192.168.0.0/16 should match. 192.168.0.0/17 should NOT match.
I'm using the following regex:
re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}(/(16|24))?')
This matches all IP addresses but also strings like 192.168.0.0/aaaa
Now, if I change the regex (remove ? at end):
re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}(/(16|24))')
It matches CIDR blocks /16 or /24 but no longer matches plain IP addresses (e.g., 192.168.0.1).
Isn't '?' supposed to check a group for optional occurrence? What am I doing wrong?
Note: I know the IP address regex itself is not perfect, but I'm more interested in getting help on the issue described.
This should work:
^([0-9]{1,3}\.){3}[0-9]{1,3}($|/(16|24))$
After the address, it requires either $ (end of line) or a / followed by 16 or 24.
As you said, ? marks a group as optional, which means the engine will include the group in the match if it can. When it cannot, as in 192.168.0.0/aaaa, the group simply matches nothing, and because your pattern has no end anchor, the rest of the pattern still matches the leading 192.168.0.0.
That is why the above regex is better suited to your needs. You only get a match if the string ends with /16 or /24, or ends directly after the address, e.g. 192.168.0.1.
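The difference between the two patterns can be seen in a short sketch. The variable names unanchored and anchored are made up for illustration; the patterns themselves are the ones from the question and from this answer.

```python
import re

# Pattern from the question: optional suffix, no end anchor.
unanchored = re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}(/(16|24))?')
# Pattern from this answer: the suffix must be end-of-line or /16 or /24.
anchored = re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}($|/(16|24))$')

samples = ['192.168.0.1', '192.168.0.0/16', '192.168.0.0/17', '192.168.0.0/aaaa']

for s in samples:
    # bool(...) turns the match object (or None) into True/False.
    print(s, bool(unanchored.match(s)), bool(anchored.match(s)))

# The unanchored pattern "matches" the bad input only because it stops
# after the address and ignores the rest of the string.
partial = unanchored.match('192.168.0.0/aaaa').group(0)
```

Here partial holds only the address part, which shows that the optional group was skipped rather than matched.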
Accurate Match
Matches 0.0.0.0 through 255.255.255.255. If a CIDR block is specified, it matches only if the CIDR is 16 or 24:
^ # Start string
(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\. # A in A.B.C.D
(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\. # B in A.B.C.D
(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\. # C in A.B.C.D
(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)($|/(16|24))? # D in A.B.C.D and /16 or /24
$ # End string
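Because the pattern above is written with whitespace and comments, it has to be compiled in verbose mode. A minimal sketch of how it might be used from Python; the names IP_OR_CIDR and is_match are assumptions for illustration only.

```python
import re

# Same pattern as above; re.VERBOSE lets the regex carry
# whitespace and "#" comments.
IP_OR_CIDR = re.compile(r"""
    ^                                               # Start string
    (25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.             # A in A.B.C.D
    (25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.             # B in A.B.C.D
    (25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.             # C in A.B.C.D
    (25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)($|/(16|24))?  # D and optional /16 or /24
    $                                               # End string
""", re.VERBOSE)


def is_match(text):
    """Return True if text is an IPv4 address, optionally followed by /16 or /24."""
    return IP_OR_CIDR.match(text) is not None


for candidate in ["192.168.0.1", "192.168.0.0/24", "192.168.0.0/17", "256.0.0.1"]:
    print(candidate, is_match(candidate))
```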
Is there some reason you feel compelled to approach this with a single regex? Is it really a nail(*)? Is there some reason why you can't install and use the Python IPAddr module and use it to parse and manipulate your IP addresses? I guess you could then do something like:
#!/usr/bin/env python
import ipaddr
...
mynet = ipaddr.IPv4Network('192.168.0.0/16')
try:
    other = ipaddr.IPv4Network(other_network_string)
    nm = other.netmask
except ipaddr.AddressValueError:
    # Not a parsable IPv4 network string
    other = None
    nm = None
...
if nm and nm == mynet.netmask:
    be_happy()
In other words there's a package where someone has done all the heavy lifting of parsing and manipulating IP Address strings. How much of that do you really want to redo for your code? How much time do you want to spend testing your new code and finding the same sorts of bugs that the creators of this package have probably found and fixed?
If I sound like I'm hammering on the point a bit ... it's because this approach seems entirely too similar to attempts to parse HTML (or XML) using regexes rather than using existing, tested, robust parsers which have already been written.
(*) If the only tool at hand is a hammer, every problem looks like a nail.
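On Python 3 the standard library ships an ipaddress module with a similar interface, so the same idea works without installing anything. A minimal sketch, assuming Python 3; the function name is_ip_or_allowed_cidr and its allowed_prefixes parameter are made up for illustration and are not part of any library.

```python
import ipaddress


def is_ip_or_allowed_cidr(text, allowed_prefixes=(16, 24)):
    """Accept a bare IPv4 address, or a network whose prefix length is allowed."""
    try:
        if "/" not in text:
            # Plain address: parsing alone is the validity check.
            ipaddress.IPv4Address(text)
            return True
        # strict=False tolerates host bits being set, e.g. 192.168.0.1/24.
        network = ipaddress.IPv4Network(text, strict=False)
    except ValueError:
        # AddressValueError and NetmaskValueError both derive from ValueError.
        return False
    return network.prefixlen in allowed_prefixes


for candidate in ["192.168.0.1", "192.168.0.0/16", "192.168.0.0/17", "192.168.0.0/aaaa"]:
    print(candidate, is_ip_or_allowed_cidr(candidate))
```

Note that IPv4Network also accepts netmask notation such as 192.168.0.0/255.255.0.0, so this check is slightly more permissive than the regexes above.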
The semantics of '?' is a bit more complex (just a bit). You can imagine it like a synonym of the adverb "possibly".
It works this way: IF there's a substring matching my pattern THEN go on with the matching process. I "highlighted" IF and THEN because the semantics of the implication says that, in case the premise is not satisfied, the whole sentence is still true.
Therefore, let's now apply this principle to your case. You put a '?' on a suffix. Let's assume that the former part matches and, now, let's deal with the suffix: if there's a suffix that matches your pattern, the whole string will match. If the suffix doesn't match, there's no problem: the block marked with '?' is "optional" (remember the "possibly" semantics or, equivalently, the implication semantics), therefore the string still matches.
Therefore, putting a '?' block at the very end of your pattern, with nothing anchoring the end of the string, is not very useful: the pattern will still match whether or not there's a matching suffix. An optional block does real work only when something has to match after it, such as more pattern text or an end anchor like $.
Good afternoon. I receive a response after opening an SSH connection to a server, and I need to remove a sizable substring from that response. I would like to know whether there is a method that does this in just a few lines. The replace method is not practical for me because the text to be replaced is large.
I know the beginning and the end of the block that I want to remove, so those would be the limits of the replacement.
buff_string:
I must remove everything that is highlighted
...... code.....
shell.close()  # Close the channel
ssh.close()  # Close the SSH client
# print(val_count)
# print(buff_config.count('*B:P79COL01#'))
print(buff_string)
I looked at "How to remove substring from a string in python?" for guidance, but the answers there are about removing short substrings, which is the opposite of my case.
You can find the substring that starts your sequence by using index, then use the regular list slice operator ([<start>:<end>]) to extract your substring:
data = 'foobar*B:P79COL01#barfoo'
start = '*B:P79COL01#'
print(data[data.index(start):])
-> "*B:P79COL01#barfoo"
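Since both the start and the end of the unwanted block are known, the same find-and-slice idea can cut the block out instead of extracting it. A minimal sketch; the helper name remove_between and the sample markers are hypothetical, not taken from the question.

```python
def remove_between(text, start, end):
    """Remove the first block that runs from `start` through `end`, markers included."""
    begin = text.find(start)
    if begin == -1:
        return text  # start marker not present: nothing to remove
    stop = text.find(end, begin + len(start))
    if stop == -1:
        return text  # end marker not present after the start marker
    # Keep everything before the start marker and after the end marker.
    return text[:begin] + text[stop + len(end):]


# Hypothetical markers, for illustration only.
cleaned = remove_between('header<<a very long block>>footer', '<<', '>>')
print(cleaned)
```

To keep the markers themselves, slice at begin + len(start) and at stop instead.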
I have working Python 2.7 code that extracts IPs from the routing table. It only extracts IPs in x.x.x.x/xx format. However, I have an issue excluding some lines in the route table.
For example, this line:
D 10.50.80.0/24 [90/3072] via 10.10.10.1, 3w6d, Vlan10
In this line all I care about is 10.50.80.0/24. Since this is the only IP with /24 notation, I can grab just that and have the regex ignore the ones without a / (e.g., 10.10.10.1). But in the table, we have the two anomalies below:
10.10.60.0/16 is variably subnetted, 58 subnets, 4 masks
C 10.10.140.0/24 is directly connected, Vlan240
I would like to capture the IP on the second line (10.10.140.0/24) but not the one on the first line (10.10.60.0/16). The program extracts IPs and checks whether a subnet is present in the table. 10.10.60.0/16 is a problem because that line does not say 10.10.60.0/16 is in the table, only that this subnet is variably subnetted.
Currently my tool captures this IP and marks the whole 10.10.60.0/16 range as present in the table, which is not true. I tried some regex edits but was not happy with them. I do not want to accidentally skip any subnet, especially the second line, which looks similar to the first. It is very important to capture all correct subnets.
Can someone suggest the best regex edit to accomplish this? It should only skip lines of the form: x.x.x.x/xx is variably subnetted, x subnets, x masks
Here is my current code:
match = re.findall(r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\/(?:[\d]{1,3})', text)
Thanks
Damon
If I understood your question correctly, you want your existing regex to skip any IP/subnet that is followed by 'is variably subnetted'. To do that, you can use this regex:
(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\/(?:[\d]{1,3})\b(?! is variably)
I've added \b(?! is variably) at the end of your regex
\b at the end indicates a word boundary
(?! is variably) is a negative lookahead, which makes sure that the text ' is variably' isn't present right after the IP/subnet.
Demo: https://regex101.com/r/jTu8cj/1
Matches:
D 10.50.80.0/24 [90/3072] via 10.10.10.1, 3w6d, Vlan10
C 10.10.140.0/24 is directly connected, Vlan240
Doesn't match:
10.10.60.0/16 is variably subnetted, 58 subnets, 4 masks
255.255.255.1
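Plugged into Python, the regex above can be tried against the sample lines from the question. A minimal sketch; the names SUBNET_RE and route_table are made up for illustration.

```python
import re

# The regex from this answer: x.x.x.x/xx not followed by " is variably".
SUBNET_RE = re.compile(
    r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})'
    r'\/(?:[\d]{1,3})\b(?! is variably)'
)

# Sample lines taken from the question.
route_table = """\
D 10.50.80.0/24 [90/3072] via 10.10.10.1, 3w6d, Vlan10
10.10.60.0/16 is variably subnetted, 58 subnets, 4 masks
C 10.10.140.0/24 is directly connected, Vlan240
"""

subnets = SUBNET_RE.findall(route_table)
print(subnets)
```

The \b matters here: without it the engine could backtrack to /1 and still report a match on the "variably subnetted" line.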
As input I have a series of long strings, which may or may not have the pattern(s) I'm looking for. The strings that have the pattern(s) will have an identifier(s) somewhere in the string, but not necessarily directly preceding the pattern(s). Currently I'm using this logic to find what I'm looking for:
droid_name = re.compile("(r2-d2|c-3po)")
location = re.compile("pattern_of_numbered_sectors_where_theyre_located")
find_droid = re.findall(location, string) if re.search(droid_name, string) else not_the_droids_youre_looking_for
r2-d2 and c-3po won't be the same length.
Can I combine this logic into a single regex? Thanks!
EDIT:
I'm looking for a one-line solution because I have a number of different types of information that I want to extract from various strings, so I'm using a dictionary with the regexes. So, something like this:
regexes = {
    'droid location': re.compile("droid_location_pattern"),
    'jedi name': re.compile("jedi_name_pattern"),
    'tatooine phone number': re.compile("tatooine_phone_pattern"),
}
def analyze(some_string):
    # .items() yields (label, compiled regex) pairs
    for key, regex in regexes.items():
        data = re.findall(regex, some_string)
        if data:
            for data_item in data:
                send_to_mysql(label=key, info=data_item)
EDIT:
Some sample strings are below.
Valid numbers will have the pattern: 9XXXX, which may also be written as 9XXX-X
I don't want to match the number 92222:
[Darth Vader]: Hey babe, I'm chilling in the Death Star. Where are you?
[Padme Amidala]: At the Galactic Senate, can't talk.
[Darth Vader]: Netflix and chill?
[Padme Amidala]: Call me later on my burner phone, the number is: 92222.
Here, I want to match the number 97777, because the string contains r2-d2:
[communique yoda:palpatine] spotted luke skywalker i have.
[communique yoda:palpatine] with the droid he is. r2-d2 we must kill.
[communique yoda:palpatine] location 97777 you must go.
Another possible match because the string contains c-3po:
root#palpatine$ at-at start --target c-3po --location 9777-7
AT-AT startup sequence...
[Error] fuel reserves low, aborting startup. Goodbye.
Don't want to match:
https://members.princessleiapics.com?username=stormtrooper&password=96969
Well, this highly depends on your actual strings. Assuming that c-3po or r2-d2 will always be before the desired location number (am I correct here?) you could use for both your examples the following regex:
(?:c-3po|r2-d2)(?=.*\b(9\d\d\d-?\d)\b)
# looks for c-3po or r2-d2 literally
# start a positive lookahead
# which consumes every character zero or unlimited times
# looks for a word boundary
# and captures a five digit number with or without a dash
# looks for a word boundary afterwards and close the lookahead
Be aware that this only works in DOTALL mode (aka the dot matches newline characters as well). See a working demo on regex101 here (copy and paste your other strings to confirm the examples are working).
Additional thoughts: It might be better, though, to check whether the strings c-3po or r2-d2 occur in the chunks using normal Python string functions and, if so, try to match the desired location number with the following regex:
\b(9\d\d\d-?\d)\b
# same as above without the lookahead
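The two-step approach could look roughly like this. A minimal sketch, assuming the droid names always appear in lowercase as in the samples; the names DROIDS, LOCATION_RE and find_droid_locations are made up for illustration.

```python
import re

DROIDS = ("r2-d2", "c-3po")
# Five-digit number starting with 9, optionally written as 9XXX-X.
LOCATION_RE = re.compile(r"\b(9\d\d\d-?\d)\b")


def find_droid_locations(text):
    """Return location numbers only if one of the droid names occurs in text."""
    if not any(droid in text for droid in DROIDS):
        return []  # not the droids you're looking for
    return LOCATION_RE.findall(text)


sample = (
    "[communique yoda:palpatine] with the droid he is. r2-d2 we must kill.\n"
    "[communique yoda:palpatine] location 97777 you must go."
)
print(find_droid_locations(sample))
```

Unlike the lookahead version, this does not depend on the droid name appearing before the number, and it needs no DOTALL flag.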
I want to replace consecutive repeated symbols with just one, such as:
this is a dog???
to
this is a dog?
I'm using
str = re.sub(r"([^\s\w])(\s*\1)+", r"\1", str)
However, I noticed that this might also replace symbols in URLs that appear in my text,
like http://example.com/this--is-a-page.html
Can someone give me some advice how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
    r"""(?ix) # case-insensitive, verbose regex
    # Either match a URL
    # (protocol optional (if so, URL needs to start with www or ftp))
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    # or
    |
    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
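One way around the unmatched-group complaint is to pass a function as the replacement, so that only the group that actually took part in the match is returned. A minimal sketch, assuming a current Python 3, where (?:\s*(?P=rpt))+ compiles without error; the names PATTERN and collapse_repeats are made up, and the URL character class is assumed to include @. (Since Python 3.5, re.sub also substitutes an empty string for unmatched groups, so a template replacement may work there as well.)

```python
import re

PATTERN = re.compile(
    r"""
    # Either match a URL, so it is consumed whole and left untouched
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
        [-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    |
    # or match a repeated non-word character, optionally separated by whitespace
    (?P<rpt>[^\s\w])(?:\s*(?P=rpt))+
    """,
    re.IGNORECASE | re.VERBOSE,
)


def collapse_repeats(text):
    """Collapse runs of the same symbol to one, leaving URLs as they are."""
    def repl(match):
        # Exactly one of the two named groups participated in the match.
        if match.group("URL") is not None:
            return match.group("URL")
        return match.group("rpt")
    return PATTERN.sub(repl, text)


print(collapse_repeats("this is a dog???"))
print(collapse_repeats("see http://example.com/this--is-a-page.html now!!!"))
```

The caveats above still apply: URLs that do not start with a protocol, www. or ftp. are not recognised and may still be altered.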
I am using Python and would like a simple API or regex to check a domain name's validity. By validity I mean syntactic validity, not whether the domain name actually exists on the Internet.
Any domain name is (syntactically) valid if it's a dot-separated list of identifiers, each no longer than 63 characters, and made up of letters, digits and dashes (no underscores).
So:
r'[a-zA-Z\d-]{,63}(\.[a-zA-Z\d-]{,63})*'
would be a start. Of course, these days some non-Ascii characters may be allowed (a very recent development) which changes the parameters a lot -- do you need to deal with that?
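The rule stated above can also be checked label by label, which avoids one long pattern. A minimal sketch for ASCII names only; the names LABEL_RE and is_valid_domain are made up, and rejecting empty labels (which the {,63} quantifier above would allow) is an assumption of this sketch.

```python
import re

# One label: letters, digits and dashes, 1 to 63 characters, no underscores.
LABEL_RE = re.compile(r"[a-zA-Z0-9-]{1,63}")


def is_valid_domain(name):
    """Check that name is a dot-separated list of syntactically valid labels."""
    labels = name.split(".")
    # fullmatch requires the whole label to fit the pattern.
    return all(LABEL_RE.fullmatch(label) is not None for label in labels)


for candidate in ["example.com", "exa_mple.com", "example..com"]:
    print(candidate, is_valid_domain(candidate))
```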
r'^(?=.{4,255}$)([a-zA-Z0-9](?:[a-zA-Z0-9-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z0-9]{2,6}$'
The lookahead makes sure the name has a minimum of 4 (a.in) and a maximum of 255 characters.
One or more labels (separated by periods), each 1 to 63 characters long, starting and ending with an alphanumeric character and containing alphanumeric characters and hyphens in the middle.
Followed by a top-level domain name (whose maximum length is 6, for museum).
Note that while you can do something with regular expressions, the most reliable way to test for valid domain names is to actually try to resolve the name (with socket.getaddrinfo):
from socket import getaddrinfo
result = getaddrinfo("www.google.com", None)
print(result[0][4])
Note that technically this can leave you open to DoS (if someone submits thousands of invalid domain names, it can take a while to resolve invalid names) but you could simply rate-limit someone who tries this.
The advantage of this is that it'll catch "hotmail.con" as invalid (instead of "hotmail.com", say) whereas a regex would say "hotmail.con" is valid.
I've been using this:
(r'(\.|\/)(([A-Za-z\d]+|[A-Za-z\d][-])+[A-Za-z\d]+){1,63}\.([A-Za-z]{2,3}\.[A-Za-z]{2}|[A-Za-z]{2,6})')
to ensure it follows either after dot (www.) or / (http://) and the dash occurs only inside the name and to match suffixes such as gov.uk too.
The other answers are all pretty outdated relative to the spec at this point. I believe the regex below matches the current spec correctly:
r'^(?=.{1,253}$)(?!.*\.\..*)(?!\..*)([a-zA-Z0-9-]{,63}\.){,127}[a-zA-Z0-9-]{1,63}$'