Email Validation using Python Regular Expression - python

I have 9 email patterns. I expect:
myname#domainemail.com
my.name#domainemail.com
my.name1#domainemail.com
my_name.1#domainemail.com
are valid emails.
and
my-name#domainemail.com
my.name.1#domainemail.com
domainname.1#domainemail.com
1myname#domainemail.com
1.myname#domainemail.com
are not valid emails.
Then I wrote this regex:
regex = r"(^[a-zA-Z_]+[\.]?[a-z0-9]+)#([\w.]+\.[\w.]+)$"
But the email domainname.1#domainemail.com is still accepted as valid.
How can I fix the pattern so that this email is rejected and all nine emails match my expectations?

For the example data, you could either match an optional part with underscores, where a dot followed by letters or digits is allowed before the #,
or match a part with a dot and a char a-z before the #:
^[a-zA-Z]+(?:(?:_[a-zA-Z0-9]+)+\.[A-Za-z0-9]+|\.[a-zA-Z][a-zA-Z0-9]*)?#(?:[a-zA-Z0-9]+\.)*[a-zA-Z0-9]{2,}$
Explanation
^ Start of string
[a-zA-Z]+ Match 1+ times a char a-z
(?: Non capture group
(?:_[a-zA-Z0-9]+)+ Repeat 1+ times an underscore followed by a char a-z or digit 0-9
\.[A-Za-z0-9]+ Match a dot and 1+ chars a-z or digit 0-9
| Or
\.[a-zA-Z][a-zA-Z0-9]* Match a dot and a single char a-z, then 0+ chars a-z or digits
)? Close group and make it optional
# Match literally
(?:[a-zA-Z0-9]+\.)* Repeat 0+ times a-z0-9 followed by a dot
[a-zA-Z0-9]{2,} Match a-z0-9 2 or more times
$ End of string
Regex demo
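As a quick check, the pattern above can be run against the nine addresses from the question. This is only a minimal sketch: the helper name is_valid is made up for illustration, and the # separator is kept exactly as it appears in this thread.

```python
import re

# Pattern copied from the answer above; '#' is the separator used in this thread.
EMAIL_RE = re.compile(
    r"^[a-zA-Z]+(?:(?:_[a-zA-Z0-9]+)+\.[A-Za-z0-9]+|\.[a-zA-Z][a-zA-Z0-9]*)?"
    r"#(?:[a-zA-Z0-9]+\.)*[a-zA-Z0-9]{2,}$"
)


def is_valid(address):
    """Hypothetical helper: True when the whole address matches the pattern."""
    return EMAIL_RE.match(address) is not None


expected_valid = [
    "myname#domainemail.com",
    "my.name#domainemail.com",
    "my.name1#domainemail.com",
    "my_name.1#domainemail.com",
]
expected_invalid = [
    "my-name#domainemail.com",
    "my.name.1#domainemail.com",
    "domainname.1#domainemail.com",
    "1myname#domainemail.com",
    "1.myname#domainemail.com",
]

for address in expected_valid + expected_invalid:
    print(address, "->", "valid" if is_valid(address) else "not valid")
```

The second alternative requires a letter right after the dot, which is why domainname.1#domainemail.com is rejected while my.name1#domainemail.com is accepted.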

Use the following regex pattern with gmi flags:
^[a-z]+(?:(?:\.[a-z]+)+\d*|(?:_[a-z]+)+(?:\.\d+)?)?#(?!.*\.\.)[^\W_][a-z\d.]+[a-z\d]{2}$
https://regex101.com/r/xoVprE/4
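The gmi flags are regex101 notation. In Python, a rough equivalent (an assumption about how the answer is meant to be used) is re.IGNORECASE plus re.MULTILINE for i and m, with findall standing in for g. The sketch below feeds all nine addresses in as one multi-line string:

```python
import re

# Same pattern as above, compiled with the Python counterparts of the i and m flags.
FLAGGED_RE = re.compile(
    r"^[a-z]+(?:(?:\.[a-z]+)+\d*|(?:_[a-z]+)+(?:\.\d+)?)?"
    r"#(?!.*\.\.)[^\W_][a-z\d.]+[a-z\d]{2}$",
    re.IGNORECASE | re.MULTILINE,
)

sample = "\n".join([
    "myname#domainemail.com",
    "my.name#domainemail.com",
    "my.name1#domainemail.com",
    "my_name.1#domainemail.com",
    "my-name#domainemail.com",
    "my.name.1#domainemail.com",
    "domainname.1#domainemail.com",
    "1myname#domainemail.com",
    "1.myname#domainemail.com",
])

# findall plays the role of the g flag: it returns every line that matches.
accepted = FLAGGED_RE.findall(sample)
print(accepted)
```

Because the pattern has no capturing groups, findall returns the full matched lines.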


Regex python ignore word followed by given character

I have the regex (?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9-_]+)(?!\w).
Given the string #first#nope #second#Hello #my-friend, email# whats.up#example.com #friend, what can I do to exclude the strings #first and #second, since they are not whole words on their own?
In other words, exclude them because they are followed by #.
You can use
(?<![a-zA-Z0-9_.-])#(?=([A-Za-z]+[A-Za-z0-9_-]*))\1(?![#\w])
(?a)(?<![\w.-])#(?=([A-Za-z][\w-]*))\1(?![#\w])
See the regex demo. Details:
(?<![a-zA-Z0-9_.-]) - a negative lookbehind that matches a location that is not immediately preceded with ASCII digits, letters, _, . and -
# - a # char
(?=([A-Za-z]+[A-Za-z0-9_-]*)) - a positive lookahead with a capturing group inside that captures one or more ASCII letters and then zero or more ASCII letters, digits, - or _ chars
\1 - the Group 1 value (backreferences are atomic, no backtracking is allowed through them)
(?![#\w]) - a negative lookahead that fails the match if there is a word char (letter, digit or _) or a # char immediately to the right of the current location.
Note that I put the hyphens at the end of the character classes, which is best practice.
The (?a)(?<![\w.-])#(?=([A-Za-z][\w-]*))\1(?![#\w]) alternative uses shorthand character classes and the (?a) inline modifier (the equivalent of re.ASCII / re.A), which makes \w match only ASCII chars (as in the original version). Remove (?a) if you plan to match any Unicode digits/letters.
Another option is to assert a whitespace boundary to the left, and assert no word char or # sign to the right.
(?<!\S)#([A-Za-z]+[\w-]+)(?![#\w])
The pattern matches:
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left
# Match literally
([A-Za-z]+[\w-]+) Capture group1, match 1+ chars A-Za-z and then 1+ word chars or -
(?![#\w]) Negative lookahead, assert not # or word char to the right
Regex demo
Or match a non word boundary \B before the # instead of a lookbehind.
\B#([A-Za-z]+[\w-]+)(?![#\w])
Regex demo
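To compare the three suggestions side by side, here is a small sketch that runs each pattern over the sample string from the question. The variable names are illustrative only; findall returns the content of capture group 1 in every case.

```python
import re

text = "#first#nope #second#Hello #my-friend, email# whats.up#example.com #friend"

# Lookahead-plus-backreference variant: the captured name cannot be shortened
# by backtracking once the lookahead has matched.
atomic = re.compile(r"(?<![a-zA-Z0-9_.-])#(?=([A-Za-z]+[A-Za-z0-9_-]*))\1(?![#\w])")

# Whitespace boundary on the left.
whitespace = re.compile(r"(?<!\S)#([A-Za-z]+[\w-]+)(?![#\w])")

# Non-word boundary \B before the #.
non_word = re.compile(r"\B#([A-Za-z]+[\w-]+)(?![#\w])")

results = {
    "atomic": atomic.findall(text),
    "whitespace": whitespace.findall(text),
    "non_word": non_word.findall(text),
}
for name, found in results.items():
    print(name, found)
```

On this input all three skip #first and #second, because each is directly followed by another #.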

Test for comma delimited string, ignoring any encountered periods, say from real numbers?

The following works for a simple comma delimited string that has no periods, but it breaks when the string contains periods, for example in real numbers.
pattern = re.compile(r"^(\w+)(,\s*\w+)*$")
How can I modify or change the above to ignore periods? But still validate the given string is comma delimited?
A sample test string is "23,HIGH,1.0,LOW,1.0,HIGH,1.0,LOW,1.0".
\w matches "word" characters: letters, digits and _. It doesn't match a dot. If you want to match dots as well, use [\w.] instead of \w:
pattern = re.compile(r"^([\w.]+)(,\s*[\w.]+)*$")
You might also want to add -, if you expect negative numbers. To put - in a character class, you either have to backslash escape it or make sure it's either the first or last character in the class:
[-.\w]
[\w.-]
[\w\-.]
If a value containing a dot can only be a number, and matching values made of dots alone is not desired, you can use an alternation to match either word characters or a number.
^(?:[+-]?\d*\.?\d+|\w+)(?:,(?:[+-]?\d*\.?\d+|\w+))*$
Explanation
^ Start of string
(?: Non capture group
[+-]?\d*\.?\d+ Match an optional + or -, then optional digits, optional dot and 1+ digits
| Or
\w+ Match 1+ word characters
) Close non capture group
(?: Non capture group
, Match the comma
(?:[+-]?\d*\.?\d+|\w+) The same pattern as in the first part
)* Close non capture group and repeat 0+ times, so that a single value without commas also matches
$ End of string
Regex demo
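The difference between the original pattern, the [\w.] version and the alternation version shows up on a few inputs. The sketch below uses the sample string from the question, plus two extra strings ("1..0,HIGH" and "-1.5,LOW") invented here purely for illustration:

```python
import re

original = re.compile(r"^(\w+)(,\s*\w+)*$")          # breaks on dots
loose = re.compile(r"^([\w.]+)(,\s*[\w.]+)*$")       # allows dots anywhere
strict = re.compile(                                  # number or word per field
    r"^(?:[+-]?\d*\.?\d+|\w+)(?:,(?:[+-]?\d*\.?\d+|\w+))*$"
)

sample = "23,HIGH,1.0,LOW,1.0,HIGH,1.0,LOW,1.0"

checks = {
    "original on sample": original.match(sample) is not None,
    "loose on sample": loose.match(sample) is not None,
    "strict on sample": strict.match(sample) is not None,
    # Illustrative inputs, not from the question:
    "loose on 1..0,HIGH": loose.match("1..0,HIGH") is not None,
    "strict on 1..0,HIGH": strict.match("1..0,HIGH") is not None,
    "strict on -1.5,LOW": strict.match("-1.5,LOW") is not None,
}
for label, ok in checks.items():
    print(label, "->", ok)
```

The loose pattern accepts the malformed "1..0" field, which is the case the alternation is meant to rule out.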

Missing something in the regex?

I'm trying to use this regex
art\..*[A-Z].*\s
to extract the text in bold here
some text bla art. 100 of Important_text other text bla
Basically, I would like to extract all the text that follow this pattern:
*art.* *number* *whatever* *first word that starts in uppercase*
But it's not working as expected. Any suggestion?
With your shown samples, please try the following.
\bart\..*?\d+.*?[A-Z]\w*
Online demo for above regex
Explanation:
\b ## Word boundary.
art\. ## Match the word art followed by a literal dot.
.*?\d+ ## Lazily match up to the first run of 1 or more digits.
.*?[A-Z]\w* ## Lazily match up to the first capital letter, followed by 0 or more word characters.
You can match art. then match until the first digits and then match until the first occurrence of an uppercase char.
\bart\.\D*\d+[^A-Z]*[A-Z]\S*
The pattern matches
\bart\. Match art. preceded by a word boundary
\D*\d+ Match 0+ times a non digit, followed by 1+ digits
[^A-Z]* Match 0+ times any char except A-Z
[A-Z]\S* Match a char A-Z followed by optional non whitespace chars.
Regex demo
If the word has to start with A-Z you can assert a whitespace boundary to the left using (?<!\S) before matching an uppercase char A-Z.
\bart\.\D*\d+[^A-Z]*(?<!\S)[A-Z]\S*
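A short sketch applying the three patterns from the answers above to the sample sentence from the question; the variable names are illustrative only.

```python
import re

sentence = "some text bla art. 100 of Important_text other text bla"

patterns = {
    "lazy": r"\bart\..*?\d+.*?[A-Z]\w*",
    "negated": r"\bart\.\D*\d+[^A-Z]*[A-Z]\S*",
    "negated_ws": r"\bart\.\D*\d+[^A-Z]*(?<!\S)[A-Z]\S*",
}

extracted = {}
for name, pattern in patterns.items():
    match = re.search(pattern, sentence)
    # Store the matched text, or None when the pattern does not match.
    extracted[name] = match.group(0) if match else None
    print(name, "->", extracted[name])
```

The negated character classes (\D, [^A-Z]) reach the same result as the lazy quantifiers here without stepping forward one character at a time.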

Python regex: removing all special characters and numbers NOT attached to words

I am trying to remove all special characters and numbers in Python, except numbers that are directly attached to words.
I have succeeded in removing all special characters and numbers, whether or not they are attached to words. How can I do it in such a way that numbers attached to words are not matched?
Here's what I did:
import regex as re
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())
I get as output
win backdoor guid DNS lookup h lla
But I want to get:
win32 backdoor guid DNS lookup h0lla
demo: https://regex101.com/r/x4HrGo/1
To match alphanumeric strings or only letter words you may use the following pattern with re:
import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())
See the regex demo.
Details
(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]* - either 1+ letters followed with a digit, or 1+ digits followed with a letter, and then 0+ letters/digits
| - or
[^\W\d_]+ - any 1+ Unicode letters
NOTE It is equivalent to \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, that matches any 1+ alphanumeric character chunks with at least 1 letter in them.
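A minimal sketch running both patterns on the string from the question. It uses the standard re module and the lowercased input, so the tokens come back in lowercase; agreement on this one sample does not by itself prove the two patterns are equivalent in general.

```python
import re

text = "win32 backdoor guid:64664646 DNS-lookup h0lla"

# Alternation-based pattern from this answer.
tokens_alt = re.findall(
    r"(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+", text.lower()
)

# Shorter pattern mentioned in the note: optional digits, one letter,
# then any further letters or digits.
tokens_short = re.findall(r"\d*[^\W\d_][^\W_]*", text.lower())

print(tokens_alt)
print(tokens_short)
```

The standalone number 64664646 is skipped by both patterns because no letter is attached to it.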
You could give a try to \b\d*[^\W\d_][^\W_]*\b
Decomposition:
\b # word boundary
\d* # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]* # zero or more alphanumeric characters
\b # word boundary
For beginners:
[^\W] is a typical double-negated construct: it matches any character that is not a non-word character, i.e. any alphanumeric character or _ (\W is the negation of \w, which matches any alphanumeric character plus _; the common ASCII equivalent is [a-zA-Z0-9_]).
This proves useful for composing:
Any alphanumeric character = [^\W_] matches any character which is not non-[alphanumeric or _] and is not _.
Any alphabetic character = [^\W\d_] matches any character which is not non-[alphanumeric or _] and is not digit (\d) and is not _.
Some further reading here.
Edit:
When _ is also considered a word delimiter, just skip the word boundaries, which toggle on that character, and use \d*[^\W\d_][^\W_]*.
Default greediness of star operator will ensure all relevant characters are actually matched.
Demo.
Try this RegEx instead:
([A-Za-z]+(\d)*[A-Za-z]*)
You can expand it from here, for example by flipping the * and + on the first and last sets to capture strings like "win32" and "01ex" equally.

Regular Expression for a string contains if characters all in capital python

I'm extracting the paragraph that follows text like "OBSERVATION #1" or "OBSERVATION #2" in the output of a library like PyPDF2.
However, the extraction can introduce errors, so the heading may look like "OBSERVA'TION #2", and I have to avoid matching things like "Suite #300". So the rule is: "IF THERE IS A CHARACTER, IT MUST BE A CAPITAL".
Currently the Python code snippet looks like this:
inspection_observation = pdfFile.getPage(z).extractText()
if 'OBSERVATION' in inspection_observation:
    for finding in re.findall(r"[OBSERVATION] #\d+(.*?) OBSERVA'TION #\d?", inspection_observation, re.DOTALL):
        # print(inspection_observation)
        print(finding)
Please advise on an appropriate regular expression for this case.
If the word must contain a capital and may also contain a ', you could use a character class listing the allowed characters, together with a positive lookahead.
Then you can capture the content between those capital words and use a positive lookahead to check if what follows is another capital word followed by # and 1+ digits or the end of the string. This regex makes use of re.DOTALL where the dot matches a newline.
(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))
Explanation
(?=[A-Z']*[A-Z]) Positive lookahead, assert that what follows contains at least one char A-Z, where a ' can occur before it
[A-Z']+\s+#\d+ Match 1+ times A-Z or ', 1+ whitespace chars, # and 1+ digits
( Capture group
.*? Match any char, as few times as possible
(?= Positive lookahead, assert that what follows is
[A-Z']*[A-Z][A-Z']* An uppercase char A-Z where A-Z or ' can occur before and after
\s+#\d+ 1+ whitespace chars, # and 1+ digits
|$ Or the end of the string
) Close lookahead
) Close capture group
Regex demo
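To see the pattern in action without a PDF, the sketch below applies it to a short made-up string (the text in extracted_text is invented for illustration and is not real PyPDF2 output). It assumes the headings look like the ones described in the question.

```python
import re

# Invented sample standing in for the text extracted from a PDF page.
extracted_text = (
    "OBSERVATION #1 The floor was wet. Suite #300 was closed. "
    "OBSERVA'TION #2 No labels present."
)

OBSERVATION_RE = re.compile(
    r"(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))",
    re.DOTALL,  # let the dot cross line breaks inside a paragraph
)

# findall returns capture group 1: the text between two headings.
findings = [body.strip() for body in OBSERVATION_RE.findall(extracted_text)]
for finding in findings:
    print(finding)
```

"Suite #300" does not end the first paragraph because the word before # contains lowercase letters, so the lookahead for an all-capital heading fails there.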
