How to match emails with specific rules - python

How do I achieve the following with a regex:
Match if string doesn't start with a certain character
Match if there are no two ","'s or any other characters
Match if the string has double ", even if they are not adjacent
Using Python.
Currently I am attempting to match email addresses with these rules included. The current pattern I have is
pattern = '^([A-Z0-9._-\"]|\"[!\,;]\"){1-127}+#[^-][A-Z0-9.-]{3-256}+\.[A-Z]{2,4}[^-]$'
But I am confused with how to implement these rules.
Being more specific:
I want a pattern that matches an email address consisting of 2 parts (name, domain).
The name part should be no longer than 128 characters and should come before #. It should consist of a-z0-9 characters and also ., _, -, ". The name can't have two adjacent dots.
If the name has a " then it should be paired with another ". The name can have !;, characters if they are between paired ".
The domain name should be no longer than 256 and no shorter than 3 characters, and its parts should be separated by a dot. The domain name can't begin or end with -.
This information is given to help you understand what I want, the main question is about three rules I stated in the top. I will gladly appreciate it if you tell me how to achieve them.

I am confused about your question. Your title says comma-separated list, but then you talk about email addresses. There is no official standard regex for emails, but this widely circulated one approximates the RFC 5322 syntax:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
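For the three rules at the top of the question, one option is to express each rule as its own small pattern and combine the results, rather than folding everything into a single expression. The sketch below is only an illustration: the names `NOT_LEADING_DASH`, `NO_DOUBLE_DOT`, `PAIRED_QUOTES` and `passes_rules` are made up here, "-" is assumed to be the forbidden first character, and the second rule is read as "no two adjacent dots".

```python
import re

# Hypothetical helper patterns, one per rule (names are illustrative only).
NOT_LEADING_DASH = re.compile(r'^(?!-)')                 # must not start with "-"
NO_DOUBLE_DOT = re.compile(r'^(?!.*\.\.)')               # no ".." anywhere in the string
PAIRED_QUOTES = re.compile(r'^[^"]*(?:"[^"]*"[^"]*)*$')  # every " has a partner, adjacent or not


def passes_rules(s: str) -> bool:
    """Return True only if the string satisfies all three rules."""
    return all(p.search(s) for p in (NOT_LEADING_DASH, NO_DOUBLE_DOT, PAIRED_QUOTES))


print(passes_rules('a.b"c,d"'))  # quotes are paired, no "..", no leading "-"
print(passes_rules('a"b'))       # a single unpaired quote
```

Keeping the rules separate makes it easier to see which one rejected a given input; the lookaheads can also be concatenated at the start of a larger pattern if a single regex is required.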

Related

Regex for URL without path

I know there are many solutions, articles and libraries for this case, but I couldn't find one to match mine. I'm trying to write a regex to extract a URL (which represents the website) from a text (the signature of a person in an email), and it has multiple cases:
Could contain http(s):// , or not
Could contain www. , or not
Could have multiple TLD such as "test.com.cn"
Here are some examples:
www.test.com
https://test.com.cn
http://www.test.com.cn
test.com
test.com.cn
I've come up with the following regex:
(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?$
But there are two main problems with this, because the signature can contain an email address:
1. It (wrongly) captures the TLDs of emails like this one: name.surname#test2.com
2. It doesn't capture URLs in the middle of a line, and if I remove the $ sign at the end, it captures the name.surname part of the last example
For (1) I tried using a negative lookbehind, adding (?<!#) to the beginning; the problem is that now it captures est2.com instead of not matching at all.
I think you could use \b (boundary) instead of $ (and at the beginning as well) and exclude # in negative lookbehind and lookahead:
(?<!#|\.|-)\b(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?\b(?!#|\.|-)
Edit: exclude the dot (and all non alphanumeric characters likely to occur in an URL/email address) in your lookarounds to avoid matching name.middlename in name.middlename.surname#test2.com or com.cn in name.surname#test2.com.cn. See this answer for the list of characters
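To see how the lookarounds behave, here is a small sketch that runs the pattern above over a signature-like string. The helper name `find_urls` and the sample text are invented for the illustration; `finditer` with `group(0)` is used because the pattern contains capturing groups, so `findall` would return tuples instead of whole matches.

```python
import re

# The pattern from the answer above, unchanged.
URL = re.compile(
    r'(?<!#|\.|-)\b(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?\b(?!#|\.|-)'
)


def find_urls(text: str) -> list:
    """Return every full match of URL in text (hypothetical helper)."""
    return [m.group(0) for m in URL.finditer(text)]


signature = 'Contact name.surname#test2.com or visit https://test.com.cn today'
print(find_urls(signature))  # ['https://test.com.cn']
```

Note that the trailing `(?!#|\.|-)` also rejects a URL that is directly followed by a sentence-ending dot, so this is a starting point rather than a complete solution.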

Regex - If not match then match this - Python

I apologise for the amount of text, but I cannot wrap my head around this and I would like to make my problem clear.
I am currently attempting to create a regex expression to find the end of a website/email link in order to then process the rest of the address. I have decided to look for the ending of the address (eg. '.com', '.org', '.net'); however, I am having difficulty in two areas when dealing with this. (I have chosen this method as it is the best fit for the current project)
Firstly I am trying to get around accidentally hindering users typing words with these keywords within them (eg. '"org"anisation', 'try this "or g"o to'). How I have tackled this is, as an example, the regex:
org(?!\w) - To skip the match if there are letters directly after the keyword.
The secondary problem is finding extra parts of an address (eg. 'www.website."org".uk') which would not be matched. To combat this, as an example, I have used the regex:
org((\W*|\.|dot)\w\w) - In an attempt to find the first two letters after the keyword, as most extensions are only two letters.
The Main Problem:
In order to prevent both of the above situations I have used the regex akin to:
org(.|dot)\w\w|(?!\w)
However, I am not well versed enough in regex to find a solution, and I understand that this would not produce correct results. I know there is a form of 'If this then that' within regex, but I just can't seem to understand the online documentation I have found on the subject.
If possible would someone be able to explain how I may go about creating a system to say:
IF: NOT org(\w)
ELSE IF: org(.|dot)
THEN: MATCH org(.|dot)\w\w
ELSE: MATCH org
I would really appreciate any help on the matter, this has been on my mind for a while now. I would just like to see it through, but I just do not possess the required knowledge.
Edit:
Test cases the Regex would need to pass (Specifically for the 'org' regex for these examples):
(I have marked matches in square brackets '[ ]', and I have marked possible matches to be disregarded with '< >' )
"Hello, please come and check out my website: www.website.[org]"
"I have just uploaded a new game at games.[org.uk]"
"If you would like quote please email me at email#email.[org.ru]"
"I have just made a new <org>anisation website at website.[org], please get in contact at name.name#email.[org.us]"
"For more info check info.[org] <or g>o to info.[org.uk]"
I hope this allows for a better insight to what the Regex needs to do.
The following regex:
(?i)(?<=\.)org(?:\.[a-z]{2})?\b
should do the work for you.
demo:
https://regex101.com/r/8F9qbQ/2/
explanations:
(?i) makes the match case-insensitive (.ORG or .org)
(?<=\.) requires a . immediately before org, to avoid matches when org is actually part of a word.
org to match ORG or org
(?:...)? non-capturing group that can appear 0 or 1 times
\.[a-z]{2} dot followed by exactly 2 letters
\b word boundary constraint
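As a quick check, the pattern can be run against the test sentences from the question. The helper name `find_org_suffixes` is made up for this sketch; `findall` returns whole matches here because the pattern has no capturing groups.

```python
import re

# The pattern from the answer above, unchanged.
ORG = re.compile(r'(?i)(?<=\.)org(?:\.[a-z]{2})?\b')


def find_org_suffixes(text: str) -> list:
    """Return all '.org'-style endings found in text (hypothetical helper)."""
    return ORG.findall(text)


print(find_org_suffixes('I have just uploaded a new game at games.org.uk'))
# ['org.uk']
print(find_org_suffixes('For more info check info.org or go to info.org.uk'))
# ['org', 'org.uk']
```

The word "organisation" and the phrase "or go" are skipped because neither is preceded by a dot.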
There are other, simpler ways to catch any website, but assuming that you need exactly the behavior IF: NOT org(\w) ELSE IF: org(.|dot) THEN: MATCH org(.|dot)\w\w ELSE: MATCH org, then you can use:
org(?!\w)(\.\w\w)?
It will match:
"org.uk" of www.domain.org.uk
"org" of www.domain.org
But it will not match www.domain.orgzz or orgzz.
Explanation:
The org(?!\w) part matches org only when it is not followed by a word character. It matches the org in "org" and in "org." but not in "orgzz".
Then, once org has matched, the optional group (\.\w\w)? tries to match a dot and two more word characters. The ? quantifier means zero or one occurrence, so it matches .uk when present but does not require it.
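A minimal sketch of the behavior described above, using `re.search`; the helper name `find_suffix` is invented for the example.

```python
import re

# The pattern from the answer above, unchanged.
ORG_SUFFIX = re.compile(r'org(?!\w)(\.\w\w)?')


def find_suffix(text: str):
    """Return the first match of ORG_SUFFIX in text, or None (hypothetical helper)."""
    m = ORG_SUFFIX.search(text)
    return m.group(0) if m else None


print(find_suffix('www.domain.org.uk'))  # org.uk
print(find_suffix('www.domain.org'))     # org
print(find_suffix('www.domain.orgzz'))   # None
```

One thing to keep in mind: this pattern only looks at what follows org, not at what precedes it, so a word that merely ends in "org" (for example "cyborg") also matches. Adding a lookbehind such as (?<=\.) is one way to tighten it.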
I made a little regex that captures a website as long as it starts with 'www.' followed by some characters, another '.', and more characters.
import re
# Matches any website of the form www.something.something
matcher = re.compile(r'(www\.\S*\.\S*)')
string = 'the sky is very blue www.harvard.edu.co see nothing else triggers it, www, org'
match = matcher.search(string).group(1)
print(match)
# output:
# www.harvard.edu.co
Now you can tighten this up as needed to avoid false positives.

Underscore character matched all the email address regex python

I have my email address format like username#domain.extension
The username starts with an English alphabetical character, and any subsequent characters consist of one or more of the following: alphanumeric characters, -, . , and _.
The domain and extension contain only English alphabetical characters.
The extension is 1,2 or 3 characters in length.
I have used the below regex to validate my email address:
[a-zA-Z]+\s<\b[a-z0-9._-]+#[a-zA-Z]+\.[A-Za-z]{1,3}\b>
Email addresses:
this <is#valid.com>
this <is_it#valid.com>
this <_is#notvalid.com>
this <.is#notvalid.com>
this <-is#notvalid.com>
It matched email addresses 1, 2, and 3, while 4 and 5 were rejected because they have . and - at the start of the username. So why is the 3rd email, with an underscore at the start of the username, accepted? The username can't start with ., -, or _ as per the rules above.
Correct ans:
1,2 email should only match
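Your character class after <\b accepts _, and because _ is a word character, \b matches between < and _; hence any email address starting with _ is also treated as valid. (\b fails before . and -, which is why addresses 4 and 5 are rejected.)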
You can use this regex to only allow an alphabet as starting letter of your email:
[a-zA-Z]+\s<\b[A-Za-z][a-zA-Z0-9._-]*#[a-zA-Z]+\.[A-Za-z]{1,3}\b>
or you can make use of \w:
[a-zA-Z]+\s<\b[a-zA-Z][\w.-]*#[a-zA-Z]+\.[A-Za-z]{1,3}\b>
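To compare the two patterns, here is a short sketch that runs both the original and the corrected regex over the five sample lines from the question. The variable names (`ORIGINAL`, `FIXED`, `lines`) are chosen for the example only.

```python
import re

# Original pattern from the question and the corrected one from the answer.
ORIGINAL = re.compile(r'[a-zA-Z]+\s<\b[a-z0-9._-]+#[a-zA-Z]+\.[A-Za-z]{1,3}\b>')
FIXED = re.compile(r'[a-zA-Z]+\s<\b[A-Za-z][a-zA-Z0-9._-]*#[a-zA-Z]+\.[A-Za-z]{1,3}\b>')

lines = [
    'this <is#valid.com>',
    'this <is_it#valid.com>',
    'this <_is#notvalid.com>',
    'this <.is#notvalid.com>',
    'this <-is#notvalid.com>',
]

# fullmatch requires the whole line to fit the pattern.
accepted_by_original = [s for s in lines if ORIGINAL.fullmatch(s)]
accepted_by_fixed = [s for s in lines if FIXED.fullmatch(s)]

print(accepted_by_original)  # the first three lines, including '<_is#...>'
print(accepted_by_fixed)     # only the first two lines
```

Requiring an explicit `[A-Za-z]` as the first character removes the dependence on `\b`, which treats `_` as a word character.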
Newbie to regex here. This is my attempt at your problem:
([a-zA-Z]+[ ][<][a-zA-Z]+[a-zA-Z._-]+[#][a-zA-Z]+\.[A-Za-z]{1,3})[>]

How to check two dots placed together and quotation marks in email?

I've got a task to solve:
Write a function in Python that checks whether an e-mail complies with these rules:
e-mail consists of the name and domain parts, and the "#" mark is between them;
the domain part is between 3 and 256 symbols, is a set of non-empty strings, consisting of a-z 0-9_- symbols separated by dot;
each component of the domain part can't begin or end with "-" symbol;
the name part (before #) is no more than 128 symbols, consists of a-z0-9"._-;
in the name part, we can't write two dots going together "..";
if we have double quotes in the name part (") , they should have a pair ("blabla");
we also can write "!,:" symbols in the name part, but only between double quotes.
I wrote a small regular expression step by step, up to the 4th point:
((?!-)[A-Z0-9"\.\-_]{1,128}(?<!-)#(?!-)[A-Z0-9\-_.]{3,256}(?<!-))
but I am stuck on the 5th and 6th.
How can I implement these conditions in my regexp? I tried to add
|(?:\.(?!\.))
at the end, but it doesn't work.
Do not try to do this in a regexp. This is an example of an email validator written as a regex in Perl; to this day that monstrosity haunts my dreams.
Use a proper parser. You should try looking at the source of the validate_email library and change it to serve your purposes. This might also be a good source to use as a base.
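As a rough illustration of the "don't do it all in one regex" advice, the rules from the task can be checked one at a time in plain Python. This is only a sketch and not taken from the validate_email library: the function name `is_valid_email` and the constants are invented, exactly one "#" is assumed, only lowercase a-z is accepted as the task states, and a domain without any dot is allowed because the task does not clearly forbid it.

```python
import re

# Illustrative constants (names are assumptions, not from any library).
DOMAIN_LABEL = re.compile(r'[a-z0-9_-]+')
NAME_PLAIN = set('abcdefghijklmnopqrstuvwxyz0123456789._-')
QUOTED_ONLY = set('!,:')


def is_valid_email(address: str) -> bool:
    # Name and domain are separated by "#" (assumption: exactly one "#").
    if address.count('#') != 1:
        return False
    name, domain = address.split('#')

    # Domain: 3 to 256 symbols, non-empty dot-separated parts of a-z 0-9 _ -,
    # and no part may begin or end with "-".
    if not 3 <= len(domain) <= 256:
        return False
    for label in domain.split('.'):
        if not DOMAIN_LABEL.fullmatch(label):
            return False
        if label.startswith('-') or label.endswith('-'):
            return False

    # Name: at most 128 symbols (assumed non-empty), no "..".
    if not 1 <= len(name) <= 128 or '..' in name:
        return False

    # Quotes must pair up; "!,:" are allowed only between a pair of quotes.
    in_quotes = False
    for ch in name:
        if ch == '"':
            in_quotes = not in_quotes
        elif ch in QUOTED_ONLY:
            if not in_quotes:
                return False
        elif ch not in NAME_PLAIN:
            return False
    return not in_quotes


print(is_valid_email('user.name#example.com'))  # True
print(is_valid_email('us..er#example.com'))     # False
```

Each rule maps to a few lines, so it is easy to adjust one of them (for example, to accept uppercase letters) without touching the others.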

regular expression - partially match

My aim is to find matches in a text where not always all matches are present.
I am trying to collect the phone number, the e-mail, and the website of venues from a web site. Only some venues have all three pieces of information available; most have only one or two. I tried to write some code, but it works only if all three are present. Could someone tell me what is wrong?
import re

# `text` holds the HTML of the scraped page.
grouped = re.compile(
    r'col-right[\s\S]*?'
    r'Tel[\s\S]*?([0-9]{0,4}-?[0-9]{3,7}-?[0-9]{0,4}-?[0-9]{0,4})'
    r'[\s\S]*?href="http://([\w\W]*?)"'
    r'[\s\S]*?href="mailto:([\s\S]*?)">[\s\S]*?</div>'
)
for match in grouped.finditer(text):
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))
Also, the digits in the phone numbers are separated by "-", but sometimes there is a space between the "-" and the next set of digits. How can I allow for this space being only occasionally present?
Your logic is good, but it needs a little work.
First of all, you need the phone number. Write a regex for it and put it in a group: (regex)*. The group is marked with ( ) and * means that it may be present 0 or more times.
Write the next regex and put it in another group, (emailRegex)*, and the third in (website)*.
Instead of * you could also use ?, which means once or not at all (as I can see, you already used ?).
Now, putting it all together, simply join them with any characters in between:
(group1)?.*(emailRegex)?.*(website)*
group1 matches the phone number, followed by any characters, then the email, followed by any characters, then the website. If one of them is missing, the pattern can still match.
Email regex example (probably not the most complete one):
([a-zA-Z_]+[a-zA-Z0-9_.-]*#[a-zA-Z0-9]+\.[a-z]+)?
It works like this: the email should start with a letter or an underscore _, followed by any number of letters, digits, underscores, dots, or hyphens, then #, then letters or digits, then a dot (notice that I used \. to escape the special any-character notation), and finally at least one letter.
It works for email#mail.com.
The fact that I put the entire regex in parentheses means it is a group, and the trailing ? means it should appear once or not at all. Between groups, you add .* meaning that any characters can sit between the phone number, email, and website.
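A different way to handle missing fields, sketched below, is to cut the page into venue blocks first and then search each block for every field separately, so an absent field simply yields None. This swaps the single combined pattern for several small ones. All names here (`VENUE`, `PHONE`, `WEBSITE`, `EMAIL`, `extract_contacts`, `extract_all`) and the sample markup are assumptions made for the illustration; the `-\s?` part allows the occasional space after a "-" in phone numbers.

```python
import re

# Assumed patterns; adjust them to the real markup of the page.
VENUE = re.compile(r'col-right[\s\S]*?</div>')             # one block per venue
PHONE = re.compile(r'Tel[^0-9]*([0-9]+(?:-\s?[0-9]+)*)')   # digit groups joined by "-" or "- "
WEBSITE = re.compile(r'href="http://([^"]*)"')
EMAIL = re.compile(r'href="mailto:([^"]*)"')


def extract_contacts(block: str) -> dict:
    """Search one venue block for each field; missing fields become None."""
    result = {}
    for key, pattern in (('phone', PHONE), ('website', WEBSITE), ('email', EMAIL)):
        m = pattern.search(block)
        result[key] = m.group(1) if m else None
    return result


def extract_all(text: str) -> list:
    """Apply extract_contacts to every venue block found in text."""
    return [extract_contacts(block) for block in VENUE.findall(text)]


sample = (
    '<div class="col-right">Tel: 03-1234- 5678 '
    '<a href="mailto:info#venue.example">mail</a></div>'
    '<div class="col-right"><a href="http://venue.example">site</a></div>'
)
for venue in extract_all(sample):
    print(venue)
```

For anything beyond a quick script, an HTML parser is usually a more robust way to locate the blocks than a regex over raw markup.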
