Regex for URL without path - python

I know there are many solutions, articles and libraries for this case, but I couldn't find one that matches mine. I'm trying to write a regex to extract a URL (which represents the website) from a text (a person's email signature), and it has to handle several cases:
Could contain http(s):// , or not
Could contain www. , or not
Could have multiple TLD such as "test.com.cn"
Here are some examples:
www.test.com
https://test.com.cn
http://www.test.com.cn
test.com
test.com.cn
I've come up with the following regex:
(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?$
But there are two main problems with this, because the signature can contain an email address:
1. It (wrongly) captures the TLDs of emails like this one: name.surname#test2.com
2. It doesn't capture URLs in the middle of a line, and if I remove the $ sign at the end, it captures the name.surname part of the last example
For (1) I tried using a negative lookbehind, adding (?<!#) to the beginning. The problem is that it now captures est2.com instead of not matching at all.

I think you could use \b (boundary) instead of $ (and at the beginning as well) and exclude # in negative lookbehind and lookahead:
(?<!#|\.|-)\b(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?\b(?!#|\.|-)
Edit: exclude the dot (and all non-alphanumeric characters likely to occur in a URL or email address) in your lookarounds to avoid matching name.middlename in name.middlename.surname#test2.com or com.cn in name.surname#test2.com.cn. See this answer for the list of characters
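For anyone who wants to try the pattern above in Python, here is a minimal sketch. The signature text and the helper name find_urls are made up for illustration, and "#" is kept as the address separator only because the examples in this thread write it that way (put whatever separator your real data uses in the lookarounds). It uses finditer and group(0) because findall would return tuples of the optional sub-groups instead of the full match.

```python
import re

# Pattern from the answer above. The lookarounds reject a candidate that is
# glued to "#", "." or "-" on either side (i.e. part of an email address).
URL_PATTERN = re.compile(
    r"(?<!#|\.|-)\b(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?\b(?!#|\.|-)"
)


def find_urls(text):
    """Return the full text of every match, ignoring the optional sub-groups."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


# Hypothetical signature, invented for this example.
signature = (
    "John Doe\n"
    "name.surname#test2.com\n"
    "Visit https://test.com.cn today\n"
    "www.test.com"
)
print(find_urls(signature))
```

Note that a URL directly followed by a full stop at the end of a sentence is rejected by the same lookahead, so this is a starting point rather than a complete solution.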

Related

Python regex conditional, don't match if

Sorry for the somewhat unhelpful title, I'm having a really hard time explaining this issue.
I have a list of unique identifiers that can appear in a number of different ways and I'm trying to use regex to normalize them so I can compare across several databases. Here are some examples of them:
AB1201
AB-1201
AB1201-T
AB-12-01L1
AB1201-TER
AB1201 Transit
I've written a line of code that pulls out all hyphens and spaces, and then used this regex:
([a-zA-Z]{2}[\d]{4})(L\d|Transit|T$)?
This works exactly as expected, returning a list looking like this:
AB1201
AB1201
AB1201T
AB1201L1
AB1201
AB1201T
The issue is, I have one identifier that looks like this: AB1201-02. I need this to be raised as an exception, and not included as a match.
Any ideas? I'm happy to provide more clarification if necessary. Thanks!
You can exclude matching the following hyphen and a digit (?!-\d) using a negative lookahead.
If it should start at the beginning of the string, you could use an anchor ^
Note that you could write [\d] as \d
^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?
The pattern will look like
^ Start of string
( Capture group 1
[a-zA-Z]{2}\d{4} Match 2 times a-zA-Z and 4 digits
) Close group
(?!-\d) Negative lookahead, assert what is directly to the right is not - and a digit
(L\d|Transit|T$)? Optional capture group 2
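A quick way to check this in Python is sketched below. It assumes the pattern is applied to the raw identifier, before hyphens are stripped, because the lookahead needs the hyphen to still be there. The helper name base_identifier is just for illustration.

```python
import re

# Pattern from the answer above, applied to the raw (unstripped) identifier.
ID_PATTERN = re.compile(r"^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?")


def base_identifier(raw):
    """Return the two-letter/four-digit prefix, or None if the ID is rejected."""
    m = ID_PATTERN.match(raw)
    return m.group(1) if m else None


for raw in ["AB1201", "AB1201-T", "AB1201 Transit", "AB1201-02"]:
    print(raw, "->", base_identifier(raw))
```

Identifiers such as AB-1201 do not match this pattern in raw form, so in practice you would probably run this check for the -digit case first and then do the hyphen stripping as before.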
Try this regular expression
^([a-zA-Z]{2}[\d]{4})(?!-\d)(L\d|Transit|T|-[A-Z]{3})?$
I have added the (?!...) Negative Lookahead to avoid matching with the -02.
(?!...) Negative Lookahead: Starting at the current position in the expression, ensures that the given pattern will not match. Does not consume characters.

Trying to find the regex for this particular case? Also can I parse this without creating groups?

text to capture looks like this..
Policy Number ABCD000012345 other text follows in same line....
My regex looks like this
regex value='(?i)(?:[P|p]olicy\s[N|n]o[|:|;|,][\n\r\s\t]*[\na-z\sA-Z:,;\r\d\t]*[S|s]e\s*[H|h]abla\s*[^\n]*[\n\s\r\t]*|(?i)[P|p]olicy[\s\n\t\r]*[N|n]umber[\s\n\r\t]*)(?P<policy_number>[^\n]*)'
This particular case matches the second alternative of the |. However, it also captures everything after the policy number. What stopping condition would make it grab just the number? I know something is wrong but can't find a way out.
(?i)[P|p]olicy[\s\n\t\r]*[N|n]umber[\s\n\r\t]*)
current output
ABCD000012345othertextfollowsinsameline....
expected output
ABCD000012345
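You can use a simpler regex that matches from the beginning of the line, such as "[Pp]olicy\s*[Nn]umber\s*\b([A-Z]{4}\d+)\b.*", and rely on the word boundary \b to stop at the end of the number. (Inside a character class | is a literal pipe, so [Pp] is what [P|p] was meant to be.)
import re

# Capture the first alphanumeric token that follows "Policy Number"
pattern = re.compile(r"[Pp]olicy\s*[Nn]umber\s*\b([A-Z0-9]+)\b.*")
line = "Policy Number ABCD000012345 other text follows in same line...."
matches = pattern.match(line)
id_res = matches.group(1)
print(id_res) # ABCD000012345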
And if there's always 2 words before you can use (?:\w+\s+){2}\b([A-Z0-9]+)\b.*
Also \s is for [\r\n\t\f\v ] so no need to repeat them, your [\n\r\s\t] is just \s
You don't need to list both upper- and lower-case p and n, since you've already specified case-insensitive matching.
Also \s already covers \n, \t and \r.
(?i)policy\s+number\s+([A-Z]{4}\d+)\b
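Here is a small sketch of how that pattern could be used from Python. The helper name policy_number is only for illustration, and it returns None when the line has no policy number instead of raising.

```python
import re

# Case-insensitive pattern from the answer above; the trailing \b stops the
# capture at the end of the number, so the rest of the line is left out.
POLICY_PATTERN = re.compile(r"(?i)policy\s+number\s+([A-Z]{4}\d+)\b")


def policy_number(line):
    """Return the policy number found in the line, or None if there is none."""
    m = POLICY_PATTERN.search(line)
    return m.group(1) if m else None


line = "Policy Number ABCD000012345 other text follows in same line...."
print(policy_number(line))  # ABCD000012345
```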
Another Solution:
^[\s\w]+\b([A-Z]{4}\d+)\b
I like this better, in case your text changes from policy number

regular expression - partially match

My aim is to find matches in a text where not all of the matches are always present.
I am trying to collect the phone number, the e-mail and the website of venues from a web site. Only some venues have all three pieces of information available; most have only one or two. I tried to write some code, but it works only if all three pieces of information are available. Could someone tell me what is wrong?
import re

# text is assumed to hold the HTML source of the page being scraped
grouped = re.compile(r'col-right[\s\S]*?' +
                     r'Tel[\s\S]*?([0-9]{0,4}-?[0-9]{3,7}-?[0-9]{0,4}-?[0-9]{0,4})' +
                     r'[\s\S]*?href="http://([\w\W]*?)"' +
                     r'[\s\S]*?href="mailto:([\s\S]*?)">[\s\S]*?</div>')
for match in grouped.finditer(text):
    print(match.group(1))  # phone number
    print(match.group(2))  # website
    print(match.group(3))  # e-mail
Also, the digits in the phone numbers are separated by "-", but sometimes there is a space between the "-" and the next set of digits. How can I allow for this occasional space in the pattern?
Your logic is good, but it needs a little work.
First of all, you need the phone number. Write a regex for it and put it in a group: (regex)*. The group is marked with ( ), and * means that it may be present 0 or more times.
Write the next regex and add it to another group, (emailRegex)*, and then the third group, (website)*.
Instead of * you could also use ?, meaning once or not at all (as far as I can see, you already used ?).
Now, putting it all together, simply join them with any characters in between:
(group1)?.*(emailRegex)?.*(website)*
group1 matches the phone number, followed by any characters, then the email, followed by any characters, then the website. If one of them is missing, that is no problem at all.
Email regex example: (probably not the most complete one)
([a-zA-Z_]+[a-zA-Z_.0-9-]*#[a-zA-Z0-9]+\.[a-z]+)?
This works like this: the email should start with a letter or an underscore _, followed by any mix of lower/upper case letters, digits, underscores, dots (.) or hyphens, then #, then letters or digits, then a dot (notice that I used \. to escape the special any-character notation), and finally at least one letter.
It works for email#mail.com.
Because the entire regex is in parentheses, it is a group, and the ? after it means it should appear once or not at all. Between groups, you add .* meaning that any characters can appear between the phone number/email/address.
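A different way to get the same effect, sketched here as an alternative rather than as the method described above, is to search for each field on its own inside a venue block. A missing field then simply comes back as None, whereas optional groups chained with .* can come back empty even when the data is there. The HTML snippets, the field patterns and the helper name extract_contacts are all assumptions made for this example. The -\s? in the phone pattern allows the occasional space after a hyphen.

```python
import re

# One pattern per field; each is searched independently of the others.
PHONE = re.compile(r'Tel\D*?(\d{2,4}(?:-\s?\d{2,7})+)')
WEBSITE = re.compile(r'href="http://([^"]+)"')
EMAIL = re.compile(r'href="mailto:([^"]+)"')


def extract_contacts(block):
    """Return phone, website and email from one venue block (None if absent)."""
    def first(pattern):
        m = pattern.search(block)
        return m.group(1) if m else None

    return {"phone": first(PHONE), "website": first(WEBSITE), "email": first(EMAIL)}


# Hypothetical venue blocks, invented for illustration.
full = ('<div class="col-right">Tel: 03- 1234-5678 '
        '<a href="http://venue.example">site</a> '
        '<a href="mailto:info@venue.example">mail</a></div>')
phone_only = '<div class="col-right">Tel: 06-123-4567</div>'

print(extract_contacts(full))
print(extract_contacts(phone_only))
```

To use this on a whole page you would first split the page into one block per venue (for example on the col-right div) and then call the function on each block.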

Python regex: Matching a URL

I have some confusion regarding the pattern matching in the following expression. I tried to look up online but couldn't find an understandable solution:
imgurUrlPattern = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
What exactly are the parentheses doing ? I understood up until the first asterisk , but I can't figure out what is happening after that.
Regular expressions can be represented as graphs to understand their operation. A parallel connection between nodes indicates that a part is optional, a serial connection indicates that it is mandatory, and a loop indicates repetition over the same node.
(http://i.imgur.com/(.*))(\?.*)?
So this starts with an imgur URL http://i.imgur.com/(.*), which is mandatory, followed by any characters until an optional '?' is encountered, and then any characters after the '?'. Notice that the '?' has been escaped to remove its special meaning.
(http://i.imgur.com/(.*))(\?.*)?
The first capturing group (http://i.imgur.com/(.*)) means that the string should start with http://i.imgur.com/ followed by any number of characters (.*) (this is a poor regex, you shouldn't do it this way). (.*) is also the second capturing group.
The third capturing group (\?.*) means that this part of the string must start with ? and then contain any number of any characters, as above.
The last ? means that the last capturing group is optional.
EDIT:
These groups can then be used as:
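import re

p = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
# 'abc123?x=1' is a made-up path, used only for illustration
m = p.match('http://i.imgur.com/abc123?x=1')
m.group(0)  # the whole match
m.group(2)  # whatever the inner (.*) captured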
To improve the regex, you must limit the engine to what characters you need, like:
(http://i.imgur.com/([A-Za-z0-9\-]+))(\?[^/]+)?
[A-Za-z0-9\-]+ limits the path to alphanumeric characters and hyphens
[^/] exclude /
The (.*) means any character repeated any number of times; the (\?.*)? is meant to match the query string of a URL, for example (an imgur search for "cat"):
http://imgur.com/search?q=cat
http://imgur.com/search is matched by the (http://i.imgur.com/(.*)) section of the regex (the search part is specifically matched by the (.*)). The ?q=cat is matched by the (\?.*)? of the regex. In the regex the ? at the end means optional, so there might or might not be a query string. There is no query string in the URL http://www.imgur.com. The parentheses are used for grouping. We want to group (http://i.imgur.com/(.*)) as one thing because it matches the URL, and there is another group within it that matches the page you are requesting (this is (.*)). We want to group (\?.*)? because it matches the query string.
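One thing worth checking when you try this in Python: because (.*) is greedy, it also swallows the ? and everything after it, so the optional third group ends up empty. The sketch below compares the original pattern with a variant that stops the second group at the first ?. The variant and the sample URL are my own, for illustration only.

```python
import re

# Original pattern: the greedy (.*) consumes the query string as well.
greedy = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')

# Illustrative variant: [^?]* stops at the first "?", leaving the query
# string for the third group.
stop_at_query = re.compile(r'(http://i\.imgur\.com/([^?]*))(\?.*)?')

url = 'http://i.imgur.com/search?q=cat'  # made-up URL for the example

print(greedy.match(url).groups())
# ('http://i.imgur.com/search?q=cat', 'search?q=cat', None)
print(stop_at_query.match(url).groups())
# ('http://i.imgur.com/search', 'search', '?q=cat')
```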

using regular expression to find the urls that do not contain specific word in domain part

I want a regular expression that grabs URLs that do not contain a specific word in their domain name, regardless of whether that word appears in the query string or in subdirectories of the domain. It also doesn't matter how the URL starts, for example with http, ftp, https, or none of them. I found this expression, ^((?!foo).)*$, but I don't know how to change it to fit these conditions.
These are the accepted url for the word "foo":
whatever.whatever.whatever/foo/pic
whatever.whatever.whatever?sdfd="foo"
and these are not accepted:
whatever.whateverfoo.whatever
whatever.foowhatever.whatever
whatever.foo.whatever.whatever
whatever.whatever.foo.whatever
Try this (explanation):
^(?:(?!foo).)*?[\/\?]
What this means is basically:
match anything not containing foo
until a slash or question mark is encountered
The precise syntax may vary depending on your programming language/editor. The explanation link shows the PHP example. The regex elements I've used are pretty common, so it should work for you. If not, let me know.
This regex can only be matched against a single URL at a time. So if you are trying this in regex101, don't enter all URLs at once.
Update: Example in Java (now using turner instead of foo):
import java.util.regex.Pattern;

Pattern p = Pattern.compile("^(?:(?!turner).)*?[\\/\\?].*");
System.out.println(p.matcher(
    "i.cdn.turner.com/cnn/.e/img/3.0/1px.gif").matches());
System.out.println(p.matcher(
    "www.facebook.com/plugins/like.php?href=http%3A%2F%2F"
    + "www.facebook.com%2Fturnerkjljl").matches());
Output:
false
true
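If you are working in Python, a rough sketch of the same idea is below. It is not a port of the single regex above: it first strips an optional scheme and then cuts the URL at the first / or ?, because with a scheme present the first slash belongs to "http://" and not to the path. The function name domain_lacks_word and the scheme pattern are assumptions for this example.

```python
import re

# Optional scheme such as http://, https:// or ftp:// at the start of the URL.
SCHEME = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.-]*://')


def domain_lacks_word(url, word):
    """Return True if word does not occur in the domain part of url."""
    rest = SCHEME.sub('', url, count=1)
    # Everything before the first "/" or "?" is treated as the domain.
    domain = re.split(r'[/?]', rest, maxsplit=1)[0]
    return word not in domain


accepted = [
    'whatever.whatever.whatever/foo/pic',
    'whatever.whatever.whatever?sdfd="foo"',
]
rejected = [
    'whatever.whateverfoo.whatever',
    'whatever.foowhatever.whatever',
    'whatever.foo.whatever.whatever',
    'whatever.whatever.foo.whatever',
]
print([domain_lacks_word(u, 'foo') for u in accepted])  # [True, True]
print([domain_lacks_word(u, 'foo') for u in rejected])  # [False, False, False, False]
```

Unlike the single regex, this also accepts a URL with no path or query at all, as long as the word is not in the domain.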
Here is your regex in java
"^[^/?]+(?<!foo)"
Explanation: from the beginning, it searches for characters that are neither / nor ?. The moment it finds either of those two characters, the pattern looks backward with a negative match for foo. If foo is found it returns false, otherwise true. This is in Java; the regex will vary from language to language.
In grep (Unix or a shell script) you have to take the negation of the following regex match
"^[^/?]+foo"
Here's a regex that will match the cases that you want to reject
(?:.+://){0,1}(?<subdomain>[^.]+\.){0,1}(?<domain>[^.]*whatever[^.]*\.)(?<top>[^.]+).*
(?: ) is a non-capturing group
(?<groupName> ) is a named group (useful for testing, in regexhero you can see what is being captured by the group)
{0,1} means 0 or 1
. means any character except new line
[^.] means any character except "."
* means 0 or more
+ means 1 or more, for example, .+ means 1 or many "any characters"
\. escapes the special character .
See http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
you can try it here: http://regexhero.net/tester/
