Check for a valid domain name in a string? - python

I am using Python and would like a simple API or regex to check a domain name's validity. By validity I mean syntactic validity, not whether the domain name actually exists on the Internet.

Any domain name is (syntactically) valid if it's a dot-separated list of identifiers, each no longer than 63 characters, and made up of letters, digits and dashes (no underscores).
So:
r'[a-zA-Z\d-]{,63}(\.[a-zA-Z\d-]{,63})*'
would be a start. Of course, these days some non-ASCII characters may be allowed (a very recent development), which changes the parameters a lot. Do you need to deal with that?

r'^(?=.{4,255}$)([a-zA-Z0-9]([a-zA-Z0-9-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z0-9]{2,6}$'
The lookahead makes sure that the name has a minimum of 4 (a.in) and a maximum of 255 characters.
One or more labels (separated by periods) of length 1 to 63, starting and ending with alphanumeric characters, and containing alphanumeric characters and hyphens in the middle. The inner group is optional so that single-character labels such as the a in a.in are accepted.
Followed by a top-level domain name (whose maximum length is 6, for museum).
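If a single regex gets hard to read, the same rules can be checked label by label. The following is only a sketch of that idea: the function name is_valid_hostname is made up for illustration, and the overall limit of 253 characters is an assumption taken from the later answer rather than from this one.

```python
import re

# One label: 1 to 63 characters, alphanumeric at both ends,
# alphanumerics or hyphens in between.
LABEL_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_hostname(name):
    """Syntactic check only; nothing is resolved (hypothetical helper)."""
    # Assumed overall length limit of 253 characters.
    if not 1 <= len(name) <= 253:
        return False
    # Every dot-separated label must match the label rule in full.
    return all(LABEL_RE.fullmatch(label) for label in name.split("."))


print(is_valid_hostname("www.google.com"))   # True
print(is_valid_hostname("under_score.com"))  # False
```

Splitting first keeps the per-label rule and the total-length rule separate, which makes each one easier to adjust.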

Note that while you can do something with regular expressions, the most reliable way to test for valid domain names is to actually try to resolve the name (with socket.getaddrinfo):
from socket import getaddrinfo
result = getaddrinfo("www.google.com", None)
print(result[0][4])
Note that technically this can leave you open to DoS (if someone submits thousands of invalid domain names, it can take a while to resolve invalid names) but you could simply rate-limit someone who tries this.
The advantage of this is that it'll catch "hotmail.con" as invalid (instead of "hotmail.com", say) whereas a regex would say "hotmail.con" is valid.

I've been using this:
(r'(\.|\/)(([A-Za-z\d]+|[A-Za-z\d][-])+[A-Za-z\d]+){1,63}\.([A-Za-z]{2,3}\.[A-Za-z]{2}|[A-Za-z]{2,6})')
It ensures that the name follows either a dot (www.) or a slash (http://), that dashes occur only inside the name, and that suffixes such as gov.uk match too.

The other answers are all fairly out of date with respect to the spec at this point. I believe the regex below matches the current spec correctly:
r'^(?=.{1,253}$)(?!.*\.\..*)(?!\..*)([a-zA-Z0-9-]{,63}\.){,127}[a-zA-Z0-9-]{1,63}$'

Related

Python regular expression for Windows file path

The problem, and it may not be easily solved with a regex, is that I want to be able to extract a Windows file path from an arbitrary string. The closest that I have been able to come (I've tried a bunch of others) is using the following regex:
[a-zA-Z]:\\([a-zA-Z0-9() ]*\\)*\w*.*\w*
Which picks up the start of the file and is designed to look at patterns (after the initial drive letter) of strings followed by a backslash and ending with a file name, optional dot, and optional extension.
The difficulty is what happens next. Since the maximum path length is 260 characters, I only need to count 260 characters beyond the start. But since spaces (and other characters) are allowed in file names, I would need to make sure that there are no additional backslashes that could indicate that the prior characters are the name of a folder and that what follows isn't the file name itself.
I am pretty certain that there isn't a perfect solution (the perfect being the enemy of the good), but I wondered if anyone could suggest a "best possible" solution?
Here's the expression I got, based on yours, that allows me to get the path on Windows : [a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*).* . An example of it being used is available here : https://regex101.com/r/SXUlVX/1
First, I changed the capture group from ([a-zA-Z0-9() ]*\\)* to ((?:[a-zA-Z0-9() ]*\\)*).
Your original expression captures each XXX\ segment one after another, so the group only retains the last segment it matched.
Mine repeats a non-capturing group, (?:[a-zA-Z0-9() ]*\\)*, and wraps the whole repetition in a single capture group. This captures the concatenation XXX\YYYY\ZZZ\ in one piece and, as such, gives me the full path.
The second change I made is related to the file name : I just match whatever follows with .* (the preceding group is greedy, so it has already consumed as many XXX\ segments as it can). This allows me to take care of strange file names.
Another regex that would work would be : [a-zA-Z]:\\((?:.*?\\)*).* as shown in this example : https://regex101.com/r/SXUlVX/2
This time, I used .*?\\ to match the XXX\ parts of the path.
.*? will match in a non-greedy way : thus, .*?\\ will match the bare minimum of text followed by a back-slash.
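To make the greedy versus non-greedy difference concrete, here is a small Python sketch. The sample path is invented for illustration.

```python
import re

# Hypothetical sample; a raw string keeps the backslashes literal.
path = r"Users\Public\file.txt"

# Greedy: .* grabs as much as possible, then backs off to the LAST backslash.
greedy = re.match(r".*\\", path).group(0)

# Non-greedy: .*? grabs as little as possible, stopping at the FIRST backslash.
lazy = re.match(r".*?\\", path).group(0)

print(greedy)  # Users\Public\
print(lazy)    # Users\
```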
Do not hesitate to ask if you have any questions regarding the expressions.
I'd also encourage you to try to see how well your expression works using : https://regex101.com . This also has a list of the different tokens you can use in your regex.
Edit : As my previous answer did not work (though I'll need to spend some time to find out exactly why), I looked for another way to do what you want. And I managed to do so using string splitting and joining.
The command is "\\".join(TARGETSTRING.split("\\")[1:-1]).
How does this work : I split the original string into a list of substrings on the backslash. I then remove the first and last parts ([1:-1] keeps everything from the 2nd element to the one before the last) and join the resulting list back into a string.
This works, whether the value given is a path or the full address of a file.
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred is a file path
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred\ is a directory path
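As a quick illustration of the split/join command on a full file address (the sample path is assembled from the example above, with an assumed C: drive), with pathlib.PureWindowsPath from the standard library shown next to it as an alternative I would consider. Note that PureWindowsPath keeps the drive letter, while the split/join version drops it.

```python
from pathlib import PureWindowsPath

# Assumed sample input: drive letter plus the path from the example above.
target = r"C:\Program Files (x86)\Adobe\Acrobat Distiller\acrbd.exe"

# Split on backslashes, drop the drive (first) and file name (last), re-join.
folders = "\\".join(target.split("\\")[1:-1])

# Alternative: let pathlib parse the Windows path; this works on any OS.
parent = str(PureWindowsPath(target).parent)

print(folders)  # Program Files (x86)\Adobe\Acrobat Distiller
print(parent)   # C:\Program Files (x86)\Adobe\Acrobat Distiller
```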

How to select an entire entity around a regex without splitting the string first?

My project (unrelated to this question, just context) is an ML classifier. I'm trying to improve it and have found that when I stripped URLs from the text given to it, some of the URLs had been broken by spaces. For example:
https:// twitter.com/username/sta tus/ID
After I remove links that are not broken, I am left with things like www website com. I removed those with the following regular expression in Python:
tweet = re.sub('(www|http).*?(org |net |edu |com |be |tt |me |ms )','',tweet);
I've put a space after every one of them because this happens after the regular strip and text processing (so only working with parts of a URL separated by spaces) and theoretically we should only pick up the remainders of a broken link... not something like
http website strangeTLD .... communication
It's not perfect but it works, however I just thought that I might try to preemptively remove URLS from twitter only, since I know that the spaces that break the regular URL strip will always be in the same places, hoping this improves my classifier accuracy? This will get rid of the string of characters that occurs after a link... specifically pictures, which is a lot of my data.
Specifically, is there a way to select the entity surrounding/after:
pic.twitter.com/
or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...
http.*?twitter.com/*?/sta tus/
Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.
Yes, what you are talking about is called Positive Lookbehind and works using (?<=...), where the ellipsis should be replaced by what you want to skip.
E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use
(?<=https:\/\/twitter\.com\/username\/).*
and you will get status/ID, like you can see with this live demo.
In this case I escaped the slashes / with backslashes, which some regex flavors and testers require; Python's re module does not need it, although the escapes are harmless there. I also used the Kleene star operator, i.e. the asterisk, to match any number of occurrences of . (any character), just like you did.
A positive lookbehind specifies some mandatory text that must appear before the current position of your cursor; in other words, it puts the cursor after the expression you feed it (if the said text exists).
Of course this is not enough in your case, since username won't be a fixed string but a variable one. This is a complication, since lookbehinds do not work with variable-length patterns (at least in Python's re module).
So you can just skip https://twitter.com/
(?<=https:\/\/twitter\.com\/).*
And then, via Python, create a substring
currentText = "username/status/ID"
result = currentText.split("/",1)[1] # returns status/ID
Test it in this demo (click "Execute"); a simple explanation of how this works is in the answer to this question (in short, you just split the string at the first slash character).
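Put together in Python, the two steps might look like the sketch below. The URL is the placeholder from the example above, not real data.

```python
import re

# Placeholder URL from the example above.
url = "https://twitter.com/username/status/ID"

# Step 1: the fixed-width lookbehind skips the host part.
after_host = re.search(r"(?<=https://twitter\.com/).*", url).group(0)

# Step 2: split once at the first slash to drop the variable-length username.
result = after_host.split("/", 1)[1]

print(after_host)  # username/status/ID
print(result)      # status/ID
```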
As a sidenote, blanks/spaces aren't allowed in URLs and if necessary are usually encoded as %20 or + (see e.g. this answer). In other words, every URL you got can be safely stripped of spaces before processing, so... why didn't they do it?

regex does not match only upper case letters, despite being instructed to do so

I'm making a script to crawl through a web page and find all upper case names, equalling a number (ex. DUP_NB_FUNC=8). The part where my regular expression has to match only upper case letters however, does not seem to be working properly.
value = re.findall(r"[A-Z0-9_]*(?==\d).{2,}", input)
|tc_apb_conf_00.v:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=4
|:-:DUP_NB_FUNC=5
|tc_apb_conf_01.v:-:DUP_NB_FUNC=8
Desired output should look something like the above. However, I am getting:
|tc_apb_conf_00.v:-:=1" name="viewport"/>
|:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=4
|:-:DUP_NB_FUNC=5
|tc_apb_conf_01.v:-:DUP_NB_FUNC=8
Based on the input I can see it's finding a match starting at =1. However, I don't understand why, as I've put only A-Z in the regex range. I'd really appreciate a bit of assistance and clearing up.
This should help:
[A-Z0-9_]+(?==\d).{2,}
or
\b[A-Z0-9_]*(?==\d).{2,}\b
But anyway, your regex is quite odd; according to your requirement above I suggest this:
[A-Z0-9_]+=\d+
Instead of using
(?==\d).{2,}: two or more of any character, where the lookahead makes sure that the first two characters are = and a digit respectively,
you can just use
=\d+
Try this.
value = re.findall(r"[A-Z0-9_]+(?==\d).{2,}", input)
You want the case sensitive match to match at least once, which means you want the + quantifier, not the * quantifier, that matches between zero and unlimited times.
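A small sketch of the difference, using a made-up two-line input that imitates the viewport tag from the question:

```python
import re

# Made-up sample: an HTML attribute line followed by one real assignment.
html = 'content="width=1" name="viewport"/>\nDUP_NB_FUNC=8'

# With *, the character class may match zero characters, so a match can
# start directly at "=1" even though no upper-case name precedes it.
star = re.findall(r"[A-Z0-9_]*(?==\d).{2,}", html)

# With +, at least one upper-case letter, digit or underscore is required.
plus = re.findall(r"[A-Z0-9_]+(?==\d).{2,}", html)

print(star)  # ['=1" name="viewport"/>', 'DUP_NB_FUNC=8']
print(plus)  # ['DUP_NB_FUNC=8']
```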
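I suggest you compile your pattern once and then check each line of your input against it:
import re
# The hyphen goes last in the class so it is a literal, not a range.
value = re.compile(r"[A-Z0-9_:.-]+=\d+")
for i in tlist:
    jee = value.match(i)
    if jee is not None:
        print(i)
Here tlist contains your input lines.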

IPAddress or CIDR block matching regex

I need to check a string for any IPv4 address or one of following CIDR blocks: /16 or /24.
So, 192.168.0.1 should match. 192.168.0.0/16 should match. 192.168.0.0/17 should NOT match
I'm using following regex:
re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}(/(16|24))?')
This matches all IP addresses but also strings like 192.168.0.0/aaaa
Now, if I change the regex (remove ? at end):
re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}(/(16|24))')
It matches CIDR blocks /16 or /24 but not the IP Addresses(eg, 192.168.0.1) anymore.
Isn't '?' supposed to check a group for optional occurrence? What am I doing wrong?
Note: I know the IP address regex itself is not perfect, but I'm more interested in getting help on the issue described.
This should work:
^([0-9]{1,3}\.){3}[0-9]{1,3}($|/(16|24))$
It checks for $ (line end) or / and 16 or 24.
As you said, ? marks a group as optional, which means the engine will include it in the match if possible. With 192.168.0.0/aaaa it cannot, but because the group is optional and nothing anchors the end of the pattern, the preceding part (192.168.0.0) still matches.
That is why the above regex is more suited for your needs. This way you will only get a match if it ends either with /24, /16 or end of line eg. 192.168.0.1.
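The behaviour is easy to reproduce in Python. This sketch compares the pattern from the question with the anchored one from this answer; the variable names are only for illustration.

```python
import re

# Pattern from the question: no anchor after the optional group.
loose = re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}(/(16|24))?')
# Pattern from this answer: the match must end at $ or at /16 or /24.
strict = re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}($|/(16|24))$')

# The loose pattern matches only the address prefix and ignores "/aaaa".
prefix = loose.match('192.168.0.0/aaaa').group(0)
print(prefix)  # 192.168.0.0

samples = ['192.168.0.1', '192.168.0.0/16', '192.168.0.0/17', '192.168.0.0/aaaa']
results = [strict.match(s) is not None for s in samples]
print(results)  # [True, True, False, False]
```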
Accurate Match
Matches 0.0.0.0 through 255.255.255.255. If CIDR block specified, then matches only if the CIDR is 16 or 24. In action:
^ # Start string
(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\. # A in A.B.C.D
(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\. # B in A.B.C.D
(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\. # C in A.B.C.D
(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)($|/(16|24))? # D in A.B.C.D and /16 or /24
$ # End string
Is there some reason you feel compelled to approach this with a single regex? Is it really a nail(*)? Is there some reason why you can't install and use the Python IPAddr module and use it to parse and manipulate your IP addresses? I guess you could then do something like:
#!/usr/bin/env python
import ipaddr
...
mynet = ipaddr.IPv4Network('192.168.0.0/16')
try:
    other = ipaddr.IPv4Network(other_network_string)
    nm = other.netmask
except ipaddr.AddressValueError:
    other = None
    nm = None
...
if nm and nm == mynet.netmask:
    be_happy()
In other words there's a package where someone has done all the heavy lifting of parsing and manipulating IP Address strings. How much of that do you really want to redo for your code? How much time do you want to spend testing your new code and finding the same sorts of bugs that the creators of this package have probably found and fixed?
If I sound like I'm hammering on the point a bit ... it's because this approach seems entirely too similar to attempts to parse HTML (or XML) using regexes rather than using existing, tested, robust parsers which have already been written.
(*) If the only tool at hand is a hammer, every problem looks like a nail.
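On Python 3 the standard library's ipaddress module covers similar ground, so nothing needs to be installed. The following is a rough sketch rather than a drop-in solution: the helper name is_ip_or_allowed_cidr and the ALLOWED_PREFIXES constant are invented for this example.

```python
import ipaddress

# Hypothetical constant: the CIDR prefix lengths the question accepts.
ALLOWED_PREFIXES = (16, 24)


def is_ip_or_allowed_cidr(text):
    """True for a plain IPv4 address or a /16 or /24 network (sketch)."""
    try:
        if "/" not in text:
            ipaddress.IPv4Address(text)  # raises ValueError if malformed
            return True
        # strict=False tolerates host bits being set, e.g. 192.168.0.1/16
        network = ipaddress.IPv4Network(text, strict=False)
    except ValueError:
        return False
    return network.prefixlen in ALLOWED_PREFIXES


print(is_ip_or_allowed_cidr("192.168.0.1"))     # True
print(is_ip_or_allowed_cidr("192.168.0.0/16"))  # True
print(is_ip_or_allowed_cidr("192.168.0.0/17"))  # False
```

One design note: the module also accepts a dotted netmask after the slash (such as /255.255.0.0), so this check is slightly more permissive than the regex.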
The semantics of '?' is a bit more complex (just a bit). You can imagine it like a synonym of the adverb "possibly".
It works this way: IF there's a substring matching my pattern THEN go on with the matching process. I "highlighted" IF and THEN because the semantics of the implication says that, in case the premise is not satisfied, the whole sentence is still true.
Therefore, let's now apply this principle to your case. You put a '?' on a suffix. Let's assume that the former part matches and, now, let's deal with the suffix: if there's a suffix that matches your pattern, the whole string will match. If the suffix doesn't match, there's no problem: the block marked with '?' is "optional" (remember the "possibly" semantics or, equivalently, the implication semantics), therefore the string still matches.
Therefore, putting a '?' block at the very end of your pattern does not constrain anything on its own, because the string will still match whether or not there's a matching suffix. An optional block only restricts the match when something is required after it: either further pattern text or an end-of-string anchor such as $.

Create (sane/safe) app bundle identifier from any (unsafe) string

I want to create a sane/safe app bundle name (i.e. somewhat readable, no "strange" characters, etc.) from some random Unicode string (which might contain just anything).
(It doesn't matter to me whether the function is Cocoa, ObjC, Python, etc.)
(This is related to the filename question and the bundle name question but the bundle identifier is much more restrictive. I think it cannot even contain spaces and I also would want to strip out the dots and put my own prefix.)
I think Xcode also has some function to do that automatically from the app name. Maybe there is some standard function in Cocoa to do that.
Bundle identifiers are meant to be in reverse URL form (guaranteeing global uniqueness):
com.apple.xcode, for example
So really you need a domain name, then you can invent whatever scheme you like below that.
Given this, and some knowledge of the characters in your input, you can either scan through your input composing a new string with only the bits you want, or use methods like stringByReplacingOccurrencesOfString: withString: and, if you like, lowercaseString.
The permitted characters in bundle identifiers are named in the Property List Documentation as:
The bundle ID string must be a uniform type identifier (UTI) that contains only alphanumeric (A-Z,a-z,0-9), hyphen (-), and period (.) characters. The string should also be in reverse-DNS format.
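Since Python was mentioned as acceptable, here is one possible sketch of the "compose a new string with only the bits you want" approach. The function name make_bundle_id, the com.example prefix and the app fallback are all placeholders invented for this example; replacing dots with hyphens follows the wish in the question, not a rule from the documentation.

```python
import re
import unicodedata


def make_bundle_id(name, prefix="com.example"):
    """Build a bundle identifier from an arbitrary string (hypothetical helper)."""
    # Decompose accented characters, then drop everything that is not ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse each run of disallowed characters (dots included) into a hyphen.
    cleaned = re.sub(r"[^A-Za-z0-9-]+", "-", ascii_name).strip("-").lower()
    # Placeholder fallback when nothing usable survives.
    if not cleaned:
        cleaned = "app"
    return f"{prefix}.{cleaned}"


print(make_bundle_id("My Café App!"))  # com.example.my-cafe-app
```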
