Underscore character matched all the email address regex python - python

My email addresses have the format username#domain.extension
The username starts with an English alphabetical character, and any subsequent characters consist of one or more of the following: alphanumeric characters, -, . , and _.
The domain and extension contain only English alphabetical characters.
The extension is 1,2 or 3 characters in length.
I have used the below regex to validate my email address:
[a-zA-Z]+\s<\b[a-z0-9._-]+#[a-zA-Z]+\.[A-Za-z]{1,3}\b>
Email addresses:
this <is#valid.com>
this <is_it#valid.com>
this <_is#notvalid.com>
this <.is#notvalid.com>
this <-is#notvalid.com>
It matched email addresses 1, 2, and 3, while 4 and 5 were rejected because they have . and - at the start of the username. So why is the 3rd address, which has an underscore at the start of the username, accepted? Per the rules above, the username can't start with ., -, or _.
Expected result:
Only addresses 1 and 2 should match.

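Your character class after <\b accepts _, and because _ is a word character, \b still finds a word boundary between < and _. Hence any email address starting with _ is also accepted. (. and - are not word characters, so \b fails in front of them, which is why addresses 4 and 5 are rejected.)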
You can use this regex to only allow an alphabet as starting letter of your email:
[a-zA-Z]+\s<\b[A-Za-z][a-zA-Z0-9._-]*#[a-zA-Z]+\.[A-Za-z]{1,3}\b>
or you can make use of \w:
[a-zA-Z]+\s<\b[a-zA-Z][\w.-]*#[a-zA-Z]+\.[A-Za-z]{1,3}\b>
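A quick way to check the explanation above is to run both patterns over the five sample lines from the question. This is only a sketch: the helper name `matches` is made up for illustration, and `#` is kept as the separator exactly as it appears in the samples.

```python
import re

# Pattern from the question and the corrected pattern from the answer.
# "#" is the separator used in the question's samples.
ORIGINAL = re.compile(r"[a-zA-Z]+\s<\b[a-z0-9._-]+#[a-zA-Z]+\.[A-Za-z]{1,3}\b>")
FIXED = re.compile(r"[a-zA-Z]+\s<\b[A-Za-z][a-zA-Z0-9._-]*#[a-zA-Z]+\.[A-Za-z]{1,3}\b>")

SAMPLES = [
    "this <is#valid.com>",
    "this <is_it#valid.com>",
    "this <_is#notvalid.com>",
    "this <.is#notvalid.com>",
    "this <-is#notvalid.com>",
]


def matches(pattern, lines):
    """Return one boolean per line: does the whole line match the pattern?"""
    return [pattern.fullmatch(line) is not None for line in lines]


original_results = matches(ORIGINAL, SAMPLES)
fixed_results = matches(FIXED, SAMPLES)

for line, before, after in zip(SAMPLES, original_results, fixed_results):
    print(f"{line}: original={before}, fixed={after}")
```

With the original pattern the third line is accepted because `\b` succeeds between `<` and `_`; requiring a letter as the first username character rejects it.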

I'm new to regex, but here is my attempt at your problem:
([a-zA-Z]+[ ][<][a-zA-Z]+[a-zA-Z._-]+[#][a-zA-Z]+\.[A-Za-z]{1,3})[>]

Related

Regex pattern only catches first occurrence of US address

I have a transcribed text in which a customer and an agent talk to each other. I want to match addresses. I have a regex pattern:
\d+ (.)?(dr|drive|circle|highway|way|street|st|road|rd|boulevard|blvd|parkway|avenue|ave\b|court|ct|cove\b|crossing|estate|junction|loop|park|\bpike\b|ridge|square|terrace|trail|turnpike|village) .*? \d{4,6}
It catches the first address but not the second one. How can I catch all addresses instead of only the first occurrence? The second address is appended to the end of the sample text: 15620 e glenwood... I provide my pattern and sample text below:
Regex 101
I added ( \w+ )? near the beginning of your original regex to include the case where the user says the name of the road.
This should do the trick:
\d+ (.)?( \w+ )?(dr|drive|circle|highway|way|street|st|road|rd|boulevard|blvd|parkway|avenue|ave\b|court|ct|cove\b|crossing|estate|junction|loop|park|\bpike\b|ridge|square|terrace|trail|turnpike|village).*?\d{4,6}
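If the text is scanned from Python, the choice of function also matters: `re.search` stops at the first hit, while `re.finditer` walks through every non-overlapping match. The sketch below uses the pattern from this answer on a made-up transcript line; the sample text and the names `STREET_TYPES`, `ADDRESS`, and `transcript` are assumptions for illustration, not the asker's data.

```python
import re

# Alternation of street types, copied from the pattern above.
STREET_TYPES = (
    r"dr|drive|circle|highway|way|street|st|road|rd|boulevard|blvd|parkway|"
    r"avenue|ave\b|court|ct|cove\b|crossing|estate|junction|loop|park|"
    r"\bpike\b|ridge|square|terrace|trail|turnpike|village"
)
ADDRESS = re.compile(r"\d+ (.)?( \w+ )?(" + STREET_TYPES + r").*?\d{4,6}")

# Hypothetical transcript fragment containing two addresses.
transcript = (
    "first one is 15620 e glenwood dr tulsa 74101 "
    "and the second is 200 n main st dallas 75201"
)

# re.search returns only the first match.
first_only = ADDRESS.search(transcript).group(0)

# re.finditer yields every match; group(0) is the full matched text.
all_addresses = [m.group(0) for m in ADDRESS.finditer(transcript)]

print(first_only)
print(all_addresses)
```

`re.findall` would also visit every match, but because the pattern contains capture groups it returns tuples of the groups rather than the full address text, so `finditer` with `group(0)` is the simpler choice here.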

validate email with points using regex

I want to validate more than 40k emails from a CSV file. The problem is that some entries in this file are blank spaces or contain only the value <blank>. I removed many rows from my dataframe using df.dropna(), but there are still rows with blank spaces. Now I want to validate these emails with a regular expression, using Python and the re library.
Here is my code:
import re
import pandas as pd

series = pd.Series(['test.123#gmail.com',
                    'two.dots.m12#gmail.com',
                    'test.test2.c#gmail.com.es',
                    'sam_alc12#congreso.gob.pe',
                    'hellowolrd.com',
                    '<blank>'])
regex = r'^[a-z0-9]+[\._]?[a-z0-9]+[#]\w+[.]\w{2,3}$'
for email in series:
    if re.search(regex, email):
        print("{}: Valid Email".format(email))
    else:
        print("{} : Invalid Email".format(email))
This was the output:
test.123#gmail.com: Valid Email
two.dots.m12#gmail.com : Invalid Email
test.test2.c#gmail.com.es : Invalid Email
sam_alc12#congreso.gob.pe : Invalid Email
hellowolrd.com : Invalid Email
<blank> : Invalid Email
However, there were 3 incorrect validations, for these emails:
two.dots.m12#gmail.com
test.test2.c#gmail.com.es
sam_alc12#congreso.gob.pe
All of them are valid emails. The current regex can't validate an email with more than two dots before the # or after it.
I tried many modifications to the current regex, but none of them worked.
I also tried email-validator, but it takes a lot of time because it verifies that the address really exists.
For your given examples, the issue is that the pattern matches an optional . or _ only a single time.
Instead, you can optionally repeat a group that starts with either one of them. This matches them multiple times but does not match consecutive characters such as .. or __
You don't have to escape the . in the character class, and the # does not have to be in square brackets.
^[a-z0-9]+(?:[._][a-z0-9]+)*#(?:\w+\.)+\w{2,3}$
^ Start of string
[a-z0-9]+ Match 1+ times any of the listed
(?:[._][a-z0-9]+)* Optionally repeat matching either . or _ and 1+ one of the listed
# Match literally
(?:\w+\.)+ Repeat 1+ times matching 1+ word chars and .
\w{2,3} match 2-3 word chars
$ End of string
Note that this pattern accepts only a limited set of email addresses: the local part is restricted to a-z, 0-9, . and _, and the domain labels are restricted to \w characters.
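To see the pattern at work, here is a small sketch that runs it over the sample values from the question. It uses a plain list instead of a pandas Series so that it has no third-party dependency; the names `EMAIL`, `emails`, and `results` are illustrative only, and `#` is kept as the separator used in the samples.

```python
import re

# Pattern from the answer above; "#" is the separator used in the samples.
EMAIL = re.compile(r"^[a-z0-9]+(?:[._][a-z0-9]+)*#(?:\w+\.)+\w{2,3}$")

emails = [
    "test.123#gmail.com",
    "two.dots.m12#gmail.com",
    "test.test2.c#gmail.com.es",
    "sam_alc12#congreso.gob.pe",
    "hellowolrd.com",
    "<blank>",
    "   ",  # whitespace-only entry, as described in the question
]

# Map each value to True (valid) or False (invalid).
results = {email: EMAIL.search(email) is not None for email in emails}

for email, ok in results.items():
    print("{!r}: {}".format(email, "Valid Email" if ok else "Invalid Email"))
```

The three addresses that the original regex rejected are now accepted, while the entry without a `#`, the `<blank>` placeholder, and the whitespace-only value are still rejected.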

Regex - extract word inside < > brackets

I am trying to extract an email address from a string like
John Smith <jsmith#email.com>
I just need the email address in the < > brackets.
Here is what I have tried so far, but I'm not very good with regex and it doesn't seem to be working, can anyone help?
import re
sender = str(message.sender)
p = re.search(r"\<(\w+)\>", sender)
logging.info(p.group(1))
You can try this:
import re
s = "John Smith <jsmith#email.com>"
email = re.findall('<(.*?)>', s)[0]
Output:
'jsmith#email.com'
Or, a more email-specific solution:
email = re.findall('(?<=\<)\w+#[a-zA-Z]+\.[a-z]+(?=\>)', s)[0]
Output:
'jsmith#email.com'
Currently your regex is : "\<(\w+)\>"
You do not actually need to escape the <>, so it becomes: "<(\w+)>"
\w matches letters, numbers and the underscore '_'. An e-mail address contains other characters as well, such as '.' and the separator, so \w+ alone cannot match it.
You have two options: Either just accept anything inside the <> with a regex like "<(.*)>" or actually parse an e-mail address.
A simple regex for that would be "<\S+#\S+>" (non-whitespace characters followed by # followed by non-whitespace characters).
Restricting ourselves to the more commonly used characters, we can write: "<[a-zA-Z0-9+_.-]+#[a-zA-Z0-9.-]+>". This still permits certain illegal e-mail addresses because I have kept it fairly simple.
Use a negative character set:
import re
s = "John Smith <jsmith#email.com>"
email = re.findall('<([^>]+)>', s)[0]
[^>]+ matches one or more characters that are not a > character, so it captures everything inside the angle brackets.
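The answers above index `findall(...)[0]` directly, which raises an IndexError when the string contains no brackets at all. A slightly more defensive sketch, assuming a hypothetical helper named `extract_bracketed`, could look like this:

```python
import re


def extract_bracketed(text):
    """Return the text inside the first <...> pair, or None if there is none."""
    match = re.search(r"<([^>]+)>", text)
    return match.group(1) if match else None


print(extract_bracketed("John Smith <jsmith#email.com>"))
print(extract_bracketed("no brackets here"))
```

Returning None lets the caller decide how to handle senders that come without a bracketed address.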

REGEX extracting specific part non greedy

I'm new to Python 2.7. Using regular expressions, I'm trying to extract from a text file just the emails from input lines. I am using the non-greedy method as the emails are repeated 2 times in the same line. Here is my code:
import re
f_hand = open('mail.txt')
for line in f_hand:
    line.rstrip()
    if re.findall('\S+#\S+?',line): print re.findall('\S+#\S+?',line)
However, this is what I'm getting instead of just the email address:
['href="mailto:secretary#abc-mediaent.com">sercetary#a']
What shall I use in re.findall to get just the email out?
If you parse a simple file with anchors for email addresses and always the same syntax (like double quotes to enclose attributes), you can use:
for line in f_hand:
    print re.findall(r'href="mailto:([^"#]+#[^"]+)">\1</a>', line)
(re.findall returns only the capture group. \1 stands for the content of the first capture group.)
If the file is a more complicated HTML file, use a parser, extract the links, and filter them. Alternatively, use XPath, something like: substring-after(//a/@href[starts-with(., "mailto:")], "mailto:")
\S means not a space. " and > are not spaces.
You should use mailto:([^#]+#[^"]+) as the regex (quoted form: 'mailto:([^#]+#[^"]+)'). This will put the email address in the first capture group.
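The difference between the two patterns can be reproduced on a single line of input. The sketch below is written in Python 3 syntax (the question uses Python 2.7), and the sample line is an assumption modeled on the output shown in the question, with `#` as the separator.

```python
import re

# Hypothetical input line modeled on the question's output.
line = '<a href="mailto:secretary#abc-mediaent.com">secretary#abc-mediaent.com</a>'

# The question's pattern: \S+ is greedy and also consumes quotes and ">",
# so it runs on to the last "#" in the unbroken run of non-space characters.
greedy_result = re.findall(r'\S+#\S+?', line)

# The suggested pattern: anchor on "mailto:" and stop at the closing quote.
mailto_result = re.findall(r'mailto:([^#]+#[^"]+)', line)

print(greedy_result)
print(mailto_result)
```

The lazy `\S+?` only limits what follows the `#`; it does nothing about the greedy `\S+` in front of it, which is why the attribute text ends up in the first result.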
Try this:
re.findall(r'mailto:(\S+#\S+?\.\S+)"', line)
It should give you something like
['secretary#abc-mediaent.com']
\S accepts many characters that aren't valid in an e-mail address. Try a regular expression of
[a-zA-Z0-9_.-]+#[a-zA-Z0-9_.-]+\.[a-zA-Z0-9_.-]+
(presuming you are not trying to support Unicode -- it seems that you aren't since your input is a "text file").
This will require a "." in the server portion of the e-mail address, and your match will stop on the first character that is not valid within the e-mail address.
This is the format of an email address - https://www.rfc-editor.org/rfc/rfc5322#section-3.4.1.
Keeping that in mind the regex that you need is - r"([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)". (This works without having to depend on the text surrounding an email address.)
The following lines of code -
html_str = r'<a href="mailto:sachin.gokhale#indiacast.com">sachin.gokhale#indiacast.com</a>'
email_regex = r"([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"
print re.findall(email_regex, html_str)
yields -
['sachin.gokhale#indiacast.com', 'sachin.gokhale#indiacast.com']
P.S. - I got the regex for email addresses by googling for "email address regex" and clicking on the first site - http://emailregex.com/

How to match emails with specific rules

How do I achieve the following with a regex:
Match if string doesn't start with a certain character
Match if there are no two ","'s or any other characters
Match if the string has double ", even if they are not adjacent
Using Python.
Currently I am attempting to match email addresses with these rules included. The current pattern I have is
pattern = '^([A-Z0-9._-\"]|\"[!\,;]\"){1-127}+#[^-][A-Z0-9.-]{3-256}+\.[A-Z]{2,4}[^-]$'
But I am confused with how to implement these rules.
Being more specific:
I want a pattern that matches an email adress consisting of 2 parts (name, domain).
The name part should be no longer than 128 characters and should come before the #. It should consist of a-z0-9 characters and also ., _, -, ". The name can't have two adjacent dots.
If the name has a " then it should be paired with another ". The name can have !;, characters if they are between paired ".
The domain name should be no longer than 256 and no shorter than 3 characters, and its parts should be separated by a dot. The domain name can't begin or end with -.
This information is given to help you understand what I want, the main question is about three rules I stated in the top. I will gladly appreciate it if you tell me how to achieve them.
I am confused about your question. Your title says comma separated list but then you talk about email addresses. There is no single official regex for emails, but a widely circulated pattern modeled on RFC 5322 is:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
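The three rules from the question can each be expressed with a small building block: a negative lookahead `(?!-)` for "doesn't start with a certain character", a negative lookahead `(?!.*\.\.)` for "no two adjacent dots", and an alternation that only admits `"` as part of a complete quoted run for "quotes must be paired". The sketch below combines them under one reading of the stated rules. The names `NAME_RE`, `DOMAIN_RE`, and `is_valid` are made up for illustration, the set of characters allowed inside quotes is an assumption, and `#` stands in for the separator as elsewhere on this page.

```python
import re

# Name part: 1-128 characters, no two adjacent dots, and every double quote
# must open a quoted run that is closed again. "!", "," and ";" are only
# allowed inside such a quoted run (assumed reading of the rules).
NAME_RE = re.compile(
    r'(?!.*\.\.)'                # rule: no two adjacent dots anywhere
    r'(?=.{1,128}$)'             # length limit for the name part
    r'(?:[a-z0-9._-]'            # one ordinary character, or
    r'|"[a-z0-9._!,;-]*")+'      # a complete "..." run (quotes are paired)
)

# Domain part: 3-256 characters, dot-separated labels,
# must not begin or end with "-".
DOMAIN_RE = re.compile(
    r'(?!-)'                     # rule: must not start with "-"
    r'(?=.{3,256}$)'             # length limits for the domain part
    r'[a-z0-9-]+(?:\.[a-z0-9-]+)+'
    r'(?<!-)'                    # must not end with "-"
)


def is_valid(address):
    """Check an address of the form name#domain against the rules above."""
    name, _, domain = address.rpartition("#")
    return (NAME_RE.fullmatch(name) is not None
            and DOMAIN_RE.fullmatch(domain) is not None)


for candidate in ['john.doe#example.com', 'jo..hn#example.com',
                  '"a,b"x#example.com', 'a"b#example.com',
                  'john#-example.com', 'john#example.com-']:
    print(candidate, is_valid(candidate))
```

Validating the name and the domain separately keeps each pattern short, and because a quote can only be consumed as part of a full `"..."` run, an unpaired quote makes the match fail without any explicit counting.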
