Regex to include and exclude certain IPs - python

I have working Python 2.7 code that extracts IPs from the routing table. It only extracts IPs in x.x.x.x/xx format. However, I have an issue excluding some lines in the route table.
For example, this line:
D 10.50.80.0/24 [90/3072] via 10.10.10.1, 3w6d, Vlan10
In this line all I care about is 10.50.80.0/24. Since this is the only IP with / notation, I can grab just that one and have the regex ignore the ones without / (e.g., 10.10.10.1). But in the table, we have the two anomalies below:
10.10.60.0/16 is variably subnetted, 58 subnets, 4 masks
C 10.10.140.0/24 is directly connected, Vlan240
I would like to capture the IP on the second line (10.10.140.0/24) but not the one on the first line (10.10.60.0/16). The program extracts IPs and checks whether a given subnet is in the table. 10.10.60.0/16 is a problem because that line does not say that 10.10.60.0/16 is in the table, only that this network is variably subnetted.
Currently my tool captures this IP and marks the whole 10.10.60.0/16 range as being in the table, which is not true. I tried some regex edits but was not really happy with them. I do not want to accidentally skip any subnet, especially one like the second line, which looks similar to the first. It is very important to capture all correct subnets.
Can someone suggest the best regex edit to accomplish this? It should only skip lines of the form: x.x.x.x/xx is variably subnetted, x subnets, x masks
Here is my current code:
match = re.findall(r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\/(?:[\d]{1,3})', text)
Thanks
Damon

If I understood your question correctly, you want your existing regex to skip any IP/subnet that is followed by 'is variably subnetted'. To do that, you can use this regex:
(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\/(?:[\d]{1,3})\b(?! is variably)
I've added \b(?! is variably) at the end of your regex
\b at the end indicates a word boundary
(?! is variably) is a negative lookahead, (?!...), which makes sure that the text ' is variably' isn't present immediately after the IP/subnet.
Demo: https://regex101.com/r/jTu8cj/1
Matches:
D 10.50.80.0/24 [90/3072] via 10.10.10.1, 3w6d, Vlan10
C 10.10.140.0/24 is directly connected, Vlan240
Doesn't match:
10.10.60.0/16 is variably subnetted, 58 subnets, 4 masks
255.255.255.1
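For reference, here is a minimal Python sketch of how this pattern could be dropped into the re.findall call from the question. The names ROUTE_RE and extract_subnets and the sample text are illustrative assumptions, not part of the original tool.

```python
import re

# The question's pattern with \b(?! is variably) appended, as described above.
ROUTE_RE = re.compile(
    r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})'
    r'\/(?:[\d]{1,3})\b(?! is variably)'
)

def extract_subnets(text):
    """Return every x.x.x.x/xx entry that is not followed by ' is variably'."""
    return ROUTE_RE.findall(text)

sample = """\
D 10.50.80.0/24 [90/3072] via 10.10.10.1, 3w6d, Vlan10
10.10.60.0/16 is variably subnetted, 58 subnets, 4 masks
C 10.10.140.0/24 is directly connected, Vlan240
"""

print(extract_subnets(sample))  # ['10.50.80.0/24', '10.10.140.0/24']
```

The \b matters here: without it, the engine could backtrack to the shorter match 10.10.60.0/1, after which the lookahead no longer sees ' is variably' and the unwanted line would slip through.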

Related

How to select an entire entity around a regex without splitting the string first?

My project (unrelated to this question, just context) is a ML classifier, I'm trying to improve it and have found that when I stripped URLS from the text given to it, some of the URLS have been broken by spaces. For example:
https:// twitter.com/username/sta tus/ID
After I remove links that are not broken, I am left with things like www website com. I removed those with the following regular expression in Python:
tweet = re.sub('(www|http).*?(org |net |edu |com |be |tt |me |ms )','',tweet);
I've put a space after every one of them because this happens after the regular strip and text processing (so only working with parts of a URL separated by spaces) and theoretically we should only pick up the remainders of a broken link... not something like
http website strangeTLD .... communication
It's not perfect but it works, however I just thought that I might try to preemptively remove URLS from twitter only, since I know that the spaces that break the regular URL strip will always be in the same places, hoping this improves my classifier accuracy? This will get rid of the string of characters that occurs after a link... specifically pictures, which is a lot of my data.
Specifically, is there a way to select the entity surrounding/after:
pic.twitter.com/
or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...
http.*?twitter.com/*?/sta tus/
Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.
Yes, what you are talking about is called Positive Lookbehind and works using (?<=...), where the ellipsis should be replaced by what you want to skip.
E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use
(?<=https:\/\/twitter\.com\/username\/).*
and you will get status/ID, like you can see with this live demo.
In this case I escaped the slashes / with backslashes, which is needed wherever / is used as the pattern delimiter, as in many online testers (Python's re module also accepts them unescaped); I also used the Kleene star operator, i.e. the asterisk, to match any number of occurrences of . (any character), just like you did.
What a positive lookbehind does is require some mandatory text before the current position of your cursor; in other words, it puts the cursor right after the expression you feed it (if that text exists).
Of course this is not enough in your case, since username won't be a fixed string but a variable one. This is an additional complication, since lookbehinds in Python's re module do not work with variable-length patterns.
So you can just skip https://twitter.com/
(?<=https:\/\/twitter\.com\/).*
And then, via Python, create a substring
currentText = "username/status/ID"
result = currentText.split("/",1)[1] # returns status/ID
Test it in this demo (click "Execute"); a simple explanation of how this works is in the answer to this question (in short, you just split the string at the first slash character).
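Put together, the two steps above might look like the following sketch. The variable names are illustrative, and the capture-group line at the end is an alternative that is not part of the answer; it is shown only because it sidesteps the fixed-width lookbehind restriction.

```python
import re

url = "https://twitter.com/username/status/ID"

# Step 1: a fixed-width lookbehind skips the constant prefix.
rest = re.search(r'(?<=https://twitter\.com/).*', url).group(0)

# Step 2: drop the variable-length username by splitting at the first slash.
after_username = rest.split("/", 1)[1]

# Alternative (assumption, not from the answer): a capture group
# handles the variable username without any lookbehind.
alt = re.search(r'https://twitter\.com/[^/]+/(.*)', url).group(1)

print(rest)            # username/status/ID
print(after_username)  # status/ID
print(alt)             # status/ID
```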
As a sidenote, blanks/spaces aren't allowed in URLs and if necessary are usually encoded as %20 or + (see e.g. this answer). In other words, every URL you got can be safely stripped of spaces before processing, so... why didn't they do it?

Can I combine these two regexes into a single regex? (Find `that` in `string` if `this` is anywhere in `string`)

As input I have a series of long strings, which may or may not have the pattern(s) I'm looking for. The strings that have the pattern(s) will have an identifier(s) somewhere in the string, but not necessarily directly preceding the pattern(s). Currently I'm using this logic to find what I'm looking for:
droid_name = re.compile("(r2-d2|c-3po)")
location = re.compile("pattern_of_numbered_sectors_where_theyre_located")
find_droid = re.findall(location, string) if re.search(droid_name, string) else not_the_droids_youre_looking_for
r2-d2 and c-3po won't be the same length.
Can I combine this logic into a single regex? Thanks!
EDIT:
I'm looking for a one-line solution because I have a number of different types of information that I want to extract from various strings, so I'm using a dictionary with the regexes. So, something like this:
regexes = {
    'droid location': re.compile("droid_location_pattern"),
    'jedi name': re.compile("jedi_name_pattern"),
    'tatooine phone number': re.compile("tatooine_phone_pattern"),
}
def analyze(some_string):
    for key, regex in regexes.items():
        data = re.findall(regex, some_string)
        if data:
            for data_item in data:
                send_to_mysql(label=key, info=data_item)
EDIT:
Some sample strings are below.
Valid numbers will have the pattern: 9XXXX, which may also be written as 9XXX-X
I don't want to match the number 92222:
[Darth Vader]: Hey babe, I'm chilling in the Death Star. Where are you?
[Padme Amidala]: At the Galactic Senate, can't talk.
[Darth Vader]: Netflix and chill?
[Padme Amidala]: Call me later on my burner phone, the number is: 92222.
Here, I want to match the number 97777, because the string contains r2-d2:
[communique yoda:palpatine] spotted luke skywalker i have.
[communique yoda:palpatine] with the droid he is. r2-d2 we must kill.
[communique yoda:palpatine] location 97777 you must go.
Another possible match because the string contains c-3po:
root#palpatine$ at-at start --target c-3po --location 9777-7
AT-AT startup sequence...
[Error] fuel reserves low, aborting startup. Goodbye.
Don't want to match:
https://members.princessleiapics.com?username=stormtrooper&password=96969
Well, this highly depends on your actual strings. Assuming that c-3po or r2-d2 will always be before the desired location number (am I correct here?) you could use for both your examples the following regex:
(?:c-3po|r2-d2)(?=.*\b(9\d\d\d-?\d)\b)
# looks for c-3po or r2-d2 literally
# start a positive lookahead
# which consumes every character zero or unlimited times
# looks for a word boundary
# and captures a five digit number with or without a dash
# looks for a word boundary afterwards and close the lookahead
Be aware that this only works in DOTALL mode (aka the dot matches newline characters as well). See a working demo on regex101 here (copy and paste your other strings to confirm the examples are working).
Additional thoughts: It might be better, though, to check whether the strings c-3po or r2-d2 occur in the chunks using normal Python string functions and, if so, try to match the desired location number with the following regex:
\b(9\d\d\d-?\d)\b
# same as above without the lookahead
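Both variants can be tried side by side in Python. This is only a sketch: COMBINED, LOCATION and find_locations are made-up names, and the sample strings are shortened versions of the ones in the question.

```python
import re

# Single-regex variant: droid name, then a lookahead that captures the number.
# re.DOTALL lets '.' cross line breaks, which the multi-line samples need.
COMBINED = re.compile(r'(?:c-3po|r2-d2)(?=.*\b(9\d\d\d-?\d)\b)', re.DOTALL)

# Two-step variant: plain substring test first, then the simpler regex.
LOCATION = re.compile(r'\b(9\d\d\d-?\d)\b')

def find_locations(chunk):
    """Return location numbers only if a droid name occurs anywhere in chunk."""
    if 'c-3po' in chunk or 'r2-d2' in chunk:
        return LOCATION.findall(chunk)
    return []

yoda = ("[communique yoda:palpatine] spotted luke skywalker i have.\n"
        "[communique yoda:palpatine] with the droid he is. r2-d2 we must kill.\n"
        "[communique yoda:palpatine] location 97777 you must go.")
padme = "[Padme Amidala]: Call me later on my burner phone, the number is: 92222."

print(COMBINED.findall(yoda), COMBINED.findall(padme))  # ['97777'] []
print(find_locations(yoda), find_locations(padme))      # ['97777'] []
```

The single-regex form still assumes the droid name comes before the number; the two-step form does not depend on their order.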

IPAddress or CIDR block matching regex

I need to check a string for any IPv4 address or one of following CIDR blocks: /16 or /24.
So, 192.168.0.1 should match. 192.168.0.0/16 should match. 192.168.0.0/17 should NOT match
I'm using following regex:
re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}(/(16|24))?')
This matches all IP addresses but also strings like 192.168.0.0/aaaa
Now, if I change the regex (remove ? at end):
re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}(/(16|24))')
It matches CIDR blocks /16 or /24 but not the IP Addresses(eg, 192.168.0.1) anymore.
Isn't '?' supposed to check a group for optional occurrence? What am I doing wrong?
Note: I know the IP address regex itself is not perfect, but I'm more interested in getting help on the issue described.
This should work:
^([0-9]{1,3}\.){3}[0-9]{1,3}($|/(16|24))$
It checks for $ (line end) or / and 16 or 24.
As you said, ? marks a group as optional, which means the engine will include it in the match if it can. In a case like 192.168.0.0/aaaa it cannot, but because the group is optional the rest of the pattern still matches.
That is why the above regex is better suited to your needs: you only get a match if the string ends with /16, /24, or directly after the address (e.g. 192.168.0.1).
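A quick way to check this pattern against the examples from the question is sketched below; IP_OR_CIDR and is_ip_or_cidr are illustrative names.

```python
import re

# Pattern from the answer: end of line, or /16 or /24 followed by end of line.
IP_OR_CIDR = re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}($|/(16|24))$')

def is_ip_or_cidr(value):
    """True for a bare dotted quad or one with a /16 or /24 suffix."""
    return IP_OR_CIDR.match(value) is not None

for candidate in ('192.168.0.1', '192.168.0.0/16',
                  '192.168.0.0/17', '192.168.0.0/aaaa'):
    print(candidate, is_ip_or_cidr(candidate))
```

The first two candidates are accepted and the last two are rejected.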
Accurate Match
Matches 0.0.0.0 through 255.255.255.255. If CIDR block specified, then matches only if the CIDR is 16 or 24. In action:
^ # Start string
(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\. # A in A.B.C.D
(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\. # B in A.B.C.D
(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\. # C in A.B.C.D
(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)($|/(16|24))? # D in A.B.C.D and /16 or /24
$ # End string
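In Python, a commented pattern like this can be compiled as written by passing re.VERBOSE, which ignores the whitespace and the # comments. The sketch below assumes that layout; the D octet and the CIDR suffix are split onto separate lines for readability, and STRICT_IP_OR_CIDR is a made-up name.

```python
import re

STRICT_IP_OR_CIDR = re.compile(r'''
    ^                                       # Start string
    (25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.     # A in A.B.C.D
    (25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.     # B in A.B.C.D
    (25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.     # C in A.B.C.D
    (25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)       # D in A.B.C.D
    ($|/(16|24))?                           # optional /16 or /24
    $                                       # End string
''', re.VERBOSE)

def is_strict_ip_or_cidr(value):
    """True only if every octet is 0-255 and any suffix is /16 or /24."""
    return STRICT_IP_OR_CIDR.match(value) is not None

print(is_strict_ip_or_cidr('255.255.255.255'))  # True
print(is_strict_ip_or_cidr('256.1.1.1'))        # False
```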
Is there some reason you feel compelled to approach this with a single regex? Is it really a nail(*)? Is there some reason why you can't install and use the Python IPAddr module and use it to parse and manipulate your IP addresses? I guess you could then do something like:
#!/usr/bin/env python
import ipaddr
...
mynet = ipaddr.IPv4Network('192.168.0.0/16')
try:
    other = ipaddr.IPv4Network(other_network_string)
    nm = other.netmask
except ipaddr.AddressValueError:
    other = None
    nm = None
...
if nm and nm == mynet.netmask:
    be_happy()
In other words there's a package where someone has done all the heavy lifting of parsing and manipulating IP Address strings. How much of that do you really want to redo for your code? How much time do you want to spend testing your new code and finding the same sorts of bugs that the creators of this package have probably found and fixed?
If I sound like I'm hammering on the point a bit ... it's because this approach seems entirely too similar to attempts to parse HTML (or XML) using regexes rather than using existing, tested, robust parsers which have already been written.
(*) If the only tool at hand is a hammer, every problem looks like a nail.
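The same idea can be sketched with the ipaddress module from the Python 3 standard library instead of the third-party ipaddr package used above. This is a swapped-in library, not what the answer shows, and the function name and the allowed_prefixes parameter are assumptions made for illustration.

```python
import ipaddress

def is_ip_or_allowed_cidr(value, allowed_prefixes=(16, 24)):
    """True for a bare IPv4 address, or a network whose prefix is allowed."""
    try:
        if '/' in value:
            # strict=False tolerates host bits, e.g. 192.168.0.1/16
            network = ipaddress.IPv4Network(value, strict=False)
            return network.prefixlen in allowed_prefixes
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        # AddressValueError and NetmaskValueError both subclass ValueError
        return False

for candidate in ('192.168.0.1', '192.168.0.0/16',
                  '192.168.0.0/17', '192.168.0.0/aaaa'):
    print(candidate, is_ip_or_allowed_cidr(candidate))
```

One difference from the regex approach: IPv4Network also accepts a dotted netmask such as 192.168.0.0/255.255.0.0, so this check is slightly more permissive about notation.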
The semantics of '?' is a bit more complex (just a bit). You can imagine it like a synonym of the adverb "possibly".
It works this way: IF there's a substring matching my pattern THEN go on with the matching process. I "highlighted" IF and THEN because the semantics of the implication says that, in case the premise is not satisfied, the whole sentence is still true.
Therefore, let's now apply this principle to your case. You put a '?' on a suffix. Let's assume that the former part matches and, now, let's deal with the suffix: if there's a suffix that matches your pattern, the whole string will match. If the suffix doesn't match, there's no problem: the block marked with '?' is "optional" (remember the "possibly" semantics or, equivalently, the implication semantics), therefore the string still matches.
Therefore, putting a '?' block in the last part of your pattern is not very useful, because the string will still match, whether or not there's a matching suffix. Optional blocks are useful only in the middle of a string, indeed.
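This behaviour is easy to observe directly. The sketch below uses the pattern from the question and, as a comparison, the same pattern with a trailing $ added (that anchored variant is an illustration, not something proposed in this answer).

```python
import re

unanchored = re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}(/(16|24))?')
anchored = re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}(/(16|24))?$')

m = unanchored.match('192.168.0.0/aaaa')
matched_text = m.group(0)  # only the address part was consumed
suffix_group = m.group(2)  # the optional group did not participate

print(matched_text)                        # 192.168.0.0
print(suffix_group)                        # None
print(anchored.match('192.168.0.0/aaaa'))  # None
```

Once something must follow the optional group, here the end of the string, the leftover /aaaa can no longer be ignored and the match fails.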

Get address out of a paragraph with regex

Alright, this one's a bit of a pain. I'm doing some scraping with Python, trying to get an address out of a few lines of poorly tagged HTML. Here's a sample of the format:
256-555-5555<br/>
1234 Fake Ave S<br/>
Gotham (Lower Ward)<br/>
I'd like to retrieve only 1234 Fake Ave S, Gotham. Any ideas? I've been doing regex's all night and now my brain is mush...
Edit:
More detail about the possible scenarios for how the data will arrive: sometimes the first line will be there, sometimes not. All of the addresses I have seen have Ave, Way, or St in them, although I would prefer not to use that as a factor in the selection, as I am not certain they will always be that way. The second and third lines are always there.
What I had in mind was something that
Selects everything on 2nd to last line (so, second line if there are three lines, first line if just two when there isn't a phone number).
Selects everything on last line that isn't in parentheses.
Combine the 2nd to last line and last line, adding a ", " in between the two.
I'm using Scrapy to acquire the HTML code. The address is all in the same div, I want to use regex to further break the data up into appropriate sections. Now how to do that is what I'm unable to figure out.
Edit2:
As per Ofir's comment, I should mention that I have already made expressions to isolate the phone number and parentheses section.
Phone (or possible email or website):
((1[-. ])?[0-9]{3}[-. ])?\(?([0-9]{3}[-. ][A?([0-9]{4})|([\w\.-]+#[\w\.-]+)|(www.+)|([\w\.-]*(?:com|net|org|us))
parentheses:
\((.*?)\)
I'm not sure how to use those to construct a everything-but-these statement.
It is possible that in your case it is easier to focus on what you don't want:
html tags (<br>)
phone numbers
everything in parenthesis
Each of which can be matched easily with simple regular expressions, making it easy to construct one to match the rest (presumably - the address)
This attempts to isolate the last two lines out of the string:
>>> import re
>>> s="""256-555-5555<br/>
... 1234 Fake Ave S<br/>
... Gotham (Lower Ward)<br/>
... """
>>> m = re.search(r'((?!</br>).*)<br/>\n((?!</br>).*)<br/>$', s)
>>> print m.group(1)
1234 Fake Ave S
Trimming the parentheses is probably best left to a separate line of code, rather than complicating the regular expression further.
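As a rough sketch of the whole procedure (last two lines, parentheses trimmed separately, joined with ", "), something like the following could work. It splits on the <br/> tags rather than using one large regex, and extract_address is a hypothetical helper name; it assumes the fragment looks like the sample in the question.

```python
import re

def extract_address(fragment):
    """Join the last two <br/>-separated lines, dropping any parenthesised part."""
    lines = [part.strip() for part in re.split(r'<br\s*/?>', fragment)]
    lines = [line for line in lines if line]
    if len(lines) < 2:
        return None  # not enough lines to form "street, city"
    street, city = lines[-2], lines[-1]
    # Trim "(...)" in a separate step, as suggested above.
    city = re.sub(r'\s*\(.*?\)', '', city).strip()
    return street + ', ' + city

s = """256-555-5555<br/>
1234 Fake Ave S<br/>
Gotham (Lower Ward)<br/>
"""
print(extract_address(s))  # 1234 Fake Ave S, Gotham
```

Because only the last two lines are used, the result is the same whether or not the phone line is present.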
As far as I understand your problem, I think you are going about solving it the wrong way.
Regexes are not a magical tool that could extract pertinent data from a pulp and jumble of undifferentiated elements of text. It is a tool that can only extract data from a text having variable parts but also a minimum of stable structure acting as anchors relatively to which the variable parts can be localized.
In your treatment, it seems to me that you first isolated this part containing possible phone number followed by address on 1/2 lines. But doing so, you lost information: what is before and what is after is anchoring information, you shouldn't try to find something in the remaining section obtained after having eliminated this information.
Moreover, I presume that you don't want only to catch a phone number and an address: you may want to extract other pieces of information lying before and after this section. With a good shaped regex, you could capture all the pieces in one shot.
So, please, give more of the text, with enough characters before and enough characters after the limited section allowing to write a correct and easier regex strategy to catch all the data you want. triplee has already asked you that, and you didn't, why ?

Check for a valid domain name in a string?

I am using Python and would like a simple API or regex to check a domain name's validity. By validity I mean syntactic validity, not whether the domain name actually exists on the Internet.
Any domain name is (syntactically) valid if it's a dot-separated list of identifiers, each no longer than 63 characters, and made up of letters, digits and dashes (no underscores).
So:
r'[a-zA-Z\d-]{,63}(\.[a-zA-Z\d-]{,63})*'
would be a start. Of course, these days some non-Ascii characters may be allowed (a very recent development) which changes the parameters a lot -- do you need to deal with that?
r'^(?=.{4,255}$)([a-zA-Z0-9]([a-zA-Z0-9-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z0-9]{2,6}$'
The lookahead makes sure that the name has a minimum of 4 (a.in) and a maximum of 255 characters.
One or more labels (separated by periods) of length 1 to 63, starting and ending with alphanumeric characters, and containing alphanumeric characters and hyphens in the middle.
Followed by a top-level domain name (whose maximum length is 6, for museum).
Note that while you can do something with regular expressions, the most reliable way to test for valid domain names is to actually try to resolve the name (with socket.getaddrinfo):
from socket import getaddrinfo
result = getaddrinfo("www.google.com", None)
print result[0][4]
Note that technically this can leave you open to DoS (if someone submits thousands of invalid domain names, it can take a while to resolve invalid names) but you could simply rate-limit someone who tries this.
The advantage of this is that it'll catch "hotmail.con" as invalid (instead of "hotmail.com", say) whereas a regex would say "hotmail.con" is valid.
I've been using this:
(r'(\.|\/)(([A-Za-z\d]+|[A-Za-z\d][-])+[A-Za-z\d]+){1,63}\.([A-Za-z]{2,3}\.[A-Za-z]{2}|[A-Za-z]{2,6})')
to ensure it follows either after dot (www.) or / (http://) and the dash occurs only inside the name and to match suffixes such as gov.uk too.
The answers are all pretty outdated with the spec at this point. I believe the below will match the current spec correctly:
r'^(?=.{1,253}$)(?!.*\.\..*)(?!\..*)([a-zA-Z0-9-]{,63}\.){,127}[a-zA-Z0-9-]{1,63}$'
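To see what this last pattern accepts and rejects, it can be wrapped in a small helper. This is only a sketch; DOMAIN_RE and is_valid_domain are made-up names.

```python
import re

# The pattern from the answer above, split across two lines for readability.
DOMAIN_RE = re.compile(
    r'^(?=.{1,253}$)(?!.*\.\..*)(?!\..*)'
    r'([a-zA-Z0-9-]{,63}\.){,127}[a-zA-Z0-9-]{1,63}$'
)

def is_valid_domain(name):
    """Syntactic check only; does not test whether the name resolves."""
    return DOMAIN_RE.match(name) is not None

for candidate in ('example.com', 'www.example.co.uk', 'a..b',
                  '.example.com', 'exa_mple.com', 'a' * 64 + '.com'):
    print(candidate, is_valid_domain(candidate))
```

The first two candidates pass and the rest fail. Note that this pattern does not reject labels that begin or end with a hyphen, so it is somewhat more lenient than the earlier one on that point.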
