I want to validate more than 40k emails from a CSV file. The problem is that some entries in this file contain blank spaces or consist only of the value <blank>. I removed many rows from my dataframe using df.dropna(), but there are still rows with blank spaces. Now I want to validate these emails with a regular expression, using Python and the re library.
Here is my code:
import re
import pandas as pd
series = pd.Series(['test.123#gmail.com',
                    'two.dots.m12#gmail.com',
                    'test.test2.c#gmail.com.es',
                    'sam_alc12#congreso.gob.pe',
                    'hellowolrd.com',
                    '<blank>'])
regex = r'^[a-z0-9]+[\._]?[a-z0-9]+[#]\w+[.]\w{2,3}$'
for email in series:
    if re.search(regex, email):
        print("{}: Valid Email".format(email))
    else:
        print("{} : Invalid Email".format(email))
This was the output:
test.123#gmail.com: Valid Email
two.dots.m12#gmail.com : Invalid Email
test.test2.c#gmail.com.es : Invalid Email
sam_alc12#congreso.gob.pe : Invalid Email
hellowolrd.com : Invalid Email
<blank> : Invalid Email
However, there were 3 incorrect validations, for these emails:
two.dots.m12#gmail.com
test.test2.c#gmail.com.es
sam_alc12#congreso.gob.pe
All of them are valid emails. The current regex can't validate an email with more than one dot before the # or after the #.
I tried many modifications to the current regex, but nothing worked.
I also used email-validator, but it takes a lot of time because it verifies that each address is a real email.
For your given examples, the issue is that the pattern matches an optional . or _ only a single time.
Instead, you can optionally repeat a group that matches either one of them followed by 1+ alphanumerics. That matches the separator multiple times, but does not match consecutive .. or __
You don't have to escape the . in the character class, and the # does not have to be in square brackets.
^[a-z0-9]+(?:[._][a-z0-9]+)*#(?:\w+\.)+\w{2,3}$
^ Start of string
[a-z0-9]+ Match 1+ times any of the listed
(?:[._][a-z0-9]+)* Optionally repeat matching either . or _ and 1+ one of the listed
# Match literally
(?:\w+\.)+ Repeat 1+ times matching 1+ word chars and .
\w{2,3} match 2-3 word chars
$ End of string
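Note that this pattern accepts only a limited set of email addresses: the part before # may contain only lowercase letters, digits, . and _, and the part after # may contain only word characters (\w) separated by dots.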
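As a quick check, the pattern can be run against the sample addresses from the question. This is only a minimal sketch: the helper name is_valid and the results dict are made up for illustration, and # is used as the separator exactly as in the sample data.

```python
import re

# Pattern from the answer; '#' is the separator used in the sample data.
EMAIL_RE = re.compile(r'^[a-z0-9]+(?:[._][a-z0-9]+)*#(?:\w+\.)+\w{2,3}$')

def is_valid(email):
    """Hypothetical helper: True if the string matches the pattern."""
    # Non-string values (e.g. NaN left over from the CSV) count as invalid.
    return isinstance(email, str) and EMAIL_RE.search(email) is not None

samples = ['test.123#gmail.com',
           'two.dots.m12#gmail.com',
           'test.test2.c#gmail.com.es',
           'sam_alc12#congreso.gob.pe',
           'hellowolrd.com',
           '<blank>',
           '   ']

results = {email: is_valid(email) for email in samples}
for email, ok in results.items():
    print("{!r}: {}".format(email, "Valid Email" if ok else "Invalid Email"))
```

With a pandas Series, something like series[series.apply(is_valid)] should keep only the rows that pass, without an explicit loop.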
I am interested in extracting some information from some PDF files that look like this. I only need the information from page 2 onward, which looks like this:
(U) country: On [date] [text]. (text in brackets)
This means an entry always starts with a number, a dot and a country, and finishes with brackets; the bracketed text may continue onto the next line.
My implementation in python is the following:
First, use pdfminer's extract_text function to get the whole text.
Then run re.findall on the whole text using the regex ^\d{1,2}\. \(u\) \w+.\w*.\w*:.* on \d{1,2} \w+.*$ with the re.MULTILINE option.
I have noticed that this extracts the first line of every paragraph I am interested in, but I cannot find a way to grab everything up to the end of the paragraph, which is the bracketed text (.*).
I was wondering if anyone can help with this. I was hoping to match it with a single regex. Otherwise I might try splitting the text by line and iterating through each one.
Thanks in advance.
You could update the pattern using a negated character class that matches up to the first occurrence of :, and then match at least the word on after it.
To match all the following lines, you can match a newline and use a negative lookahead to assert that the next line does not consist of only spaces followed by a newline.
Using a case insensitive match:
^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*
The pattern matches:
^ Start of the line (with re.MULTILINE)
\d{1,2}\.\s\(u\)\s Match 1-2 digits, . a whitespace char, (u) and a whitespace char
[^:\n]*: Match any char except : or a newline, then match :
.*?\son\s Match the first occurrence of on between whitespace chars
\d{1,2}\s Match 1-2 digits and a whitespace char
.* Match the rest of the line
(?: Non capture group
\n(?![^\S\r\n]*\n).* Match a newline, and assert not only spaces followed by a newline
)* Close non capture group and optionally repeat
For example:
import re
# extracted_text is the string returned by pdfminer's extract_text
pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"
print(re.findall(pattern, extracted_text, re.M | re.I))
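To see how the lookahead stops at the blank line between paragraphs, here is a small self-contained sketch. The sample text (country names, dates, sources) is invented for illustration and only imitates the layout described in the question; real pdfminer output may contain different whitespace.

```python
import re

# Invented sample imitating the described layout; not real extracted text.
sample_text = (
    "Header line\n"
    "\n"
    "1. (U) Freedonia: On 3 May officials met\n"
    "to discuss trade. (Source A;\n"
    "Source B)\n"
    "\n"
    "2. (U) Sylvania: On 12 June talks resumed. (Source C)\n"
    "\n"
    "Footer\n"
)

pattern = re.compile(
    r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*"  # first line of an entry
    r"(?:\n(?![^\S\r\n]*\n).*)*",                       # following non-blank lines
    re.M | re.I,
)

entries = pattern.findall(sample_text)
for entry in entries:
    print(repr(entry))
```

Each match runs from the numbered line up to, but not including, the first blank line, so a bracketed part that wraps onto the next line stays in the same match.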
I am trying to parse UIDs from URLs. However, regex is not something I am good at, so I am looking for some help.
Example Input:
https://example.com/d/iazs9fEil/somethingelse?foo=bar
Example Output:
iazs9fEil
What I've tried so far is
([/d/]+[\d\x])\w+
This somewhat works, but it returns the match with the /d/ prefix, so the output is /d/iazs9fEil.
How to change the regex to not contain the /d/ prefix?
EDIT:
I've tried this regex, ([^/d/]+[\d\x])\w+, which outputs the correct string, iazs9fEil, but also returns the rest of the URL, here somethingelse?foo=bar.
In short, you may use
import re
match = re.search(r'/d/(\w+)', your_string)  # Look for a match
if match:                                    # Check if there is a match first
    print(match.group(1))                    # Now, get the Group 1 value
NOTE
/ is not a special metacharacter, so there is no need to escape it in Python string patterns
([/d/]+[\d\x])\w+ matches and captures into Group 1 one or more slashes or d letters (see [/d/]+, a positive character class) and then a digit or x (here, Python raises an error, sre_constants.error: incomplete escape \x; it does not parse \x as a plain x), and then matches 1+ word chars. Because you put /d/ into a character class, it stopped matching a char sequence: [/d/]+ matches slashes and d letters in any order and amount, and that text ends up in Group 1.
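The difference between a character class and a literal sequence is easy to see on the example URL. A minimal sketch; the variable names are made up for illustration:

```python
import re

url = "https://example.com/d/iazs9fEil/somethingelse?foo=bar"

# [/d/]+ is a set: it matches any run made of '/' and 'd' characters,
# wherever such a run occurs in the string.
class_runs = re.findall(r'[/d/]+', url)
print(class_runs)

# /d/ outside brackets is a literal sequence; the group captures what follows.
match = re.search(r'/d/(\w+)', url)
uid = match.group(1) if match else None
print(uid)
```

The first call also picks up the // after the scheme and the slash after the UID, which is why the character class version cannot isolate the /d/ segment.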
Try (?<=/d/)[^/]+
Explanation:
(?<=/d/) - positive lookbehind, asserts that what precedes is /d/
[^/]+ - match one or more characters other than /, so it matches everything until /
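A short sketch of the lookbehind variant follows. The helper name extract_uid and the second and third URLs are assumptions added to show the edge cases; only the first URL comes from the question.

```python
import re

# Lookbehind: the match itself starts right after "/d/" and excludes the prefix.
UID_RE = re.compile(r'(?<=/d/)[^/]+')

def extract_uid(url):
    """Hypothetical helper: return the text after /d/ up to the next slash."""
    m = UID_RE.search(url)
    return m.group(0) if m else None

with_path = extract_uid("https://example.com/d/iazs9fEil/somethingelse?foo=bar")
no_path = extract_uid("https://example.com/d/iazs9fEil?foo=bar")
no_prefix = extract_uid("https://example.com/x/iazs9fEil")

print(with_path)
print(no_path)
print(no_prefix)
```

Because [^/]+ stops only at a slash, the second URL yields the query string as well. If the UID can be followed directly by ? this pattern may need a narrower class, for example \w+ as in the capturing-group answer.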
You could use a capturing group:
https?://.*?/d/([^/\s]+)
My email addresses have the format username#domain.extension
The username starts with an English alphabetical character, and any subsequent characters consist of one or more of the following: alphanumeric characters, -, . , and _.
The domain and extension contain only English alphabetical characters.
The extension is 1,2 or 3 characters in length.
I have used the below regex to validate my email address:
[a-zA-Z]+\s<\b[a-z0-9._-]+#[a-zA-Z]+\.[A-Za-z]{1,3}\b>
Email addresses:
this <is#valid.com>
this <is_it#valid.com>
this <_is#notvalid.com>
this <.is#notvalid.com>
this <-is#notvalid.com>
It matched email addresses 1, 2 and 3, while 4 and 5 were rejected because they have . and - at the start of the username. So why is the 3rd email, with an underscore at the start of the username, accepted? As per the rules above, I can't have ., - or _ at the start of the username.
Correct answer:
Only emails 1 and 2 should match.
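Your character class after <\b accepts _, and since _ is a word character the \b after < matches in front of it, hence any email address starting with _ is also treated as valid. Addresses starting with . or - are rejected only because \b cannot match between < and another non-word character.
You can use this regex to allow only a letter as the first character of the username: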
[a-zA-Z]+\s<\b[A-Za-z][a-zA-Z0-9._-]*#[a-zA-Z]+\.[A-Za-z]{1,3}\b>
or you can make use of \w:
[a-zA-Z]+\s<\b[a-zA-Z][\w.-]*#[a-zA-Z]+\.[A-Za-z]{1,3}\b>
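To compare the behaviour, here is a small sketch that runs the original pattern and the first corrected pattern over the five sample lines from the question. The variable names are made up for illustration.

```python
import re

lines = ["this <is#valid.com>",
         "this <is_it#valid.com>",
         "this <_is#notvalid.com>",
         "this <.is#notvalid.com>",
         "this <-is#notvalid.com>"]

# Pattern from the question: the username may start with any allowed character.
original = re.compile(r'[a-zA-Z]+\s<\b[a-z0-9._-]+#[a-zA-Z]+\.[A-Za-z]{1,3}\b>')
# Corrected pattern: the username must start with a letter.
fixed = re.compile(r'[a-zA-Z]+\s<\b[A-Za-z][a-zA-Z0-9._-]*#[a-zA-Z]+\.[A-Za-z]{1,3}\b>')

accepted_original = [s for s in lines if original.search(s)]
accepted_fixed = [s for s in lines if fixed.search(s)]

print(accepted_original)
print(accepted_fixed)
```

The original pattern lets the address starting with _ through, while the corrected one keeps only the first two lines.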
Newbie to regex. Here is my try at your problem:
([a-zA-Z]+[ ][<][a-zA-Z]+[a-zA-Z._-]+[#][a-zA-Z]+\.[A-Za-z]{1,3})[>]
I am a total regex beginner. I want to create a regular expression that strictly allows the word delete followed by a pair of parentheses containing any kind of characters, e.g. (http://www.waynesworld1.com).
Put together, it should accept the following: delete(http://www.waynesworld123.com).
Let me emphasize that the regex should strictly accept delete() and shouldn't accept elete(). As long as the user types in delete(), anything is acceptable within the parentheses (example: delete(12!#Ww) would be fine).
How can I craft this regex in Python? So far all I have is /delete/ for my regex.
Here you go:
^delete\(.*\)$
^ assert position at start of the string
delete matches the characters delete literally (case sensitive)
\( matches the character ( literally
.* matches any character (except newline)
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\) matches the character ) literally
$ assert position at end of the string
Here is some Python test code:
import re
txt = ["delete(http://www.waynesworld123.com)",
       "delete(12!#Ww)",
       "elete(test)",
       "delete[test]",
       "test"]
# re.DOTALL lets . also match newlines inside the parentheses
pattern = re.compile(r'^delete\(.*\)$', re.DOTALL)
for line in txt:
    if pattern.search(line):
        print('PASS', line)
    else:
        print('FAIL', line)
My aim is to find matches in a text where not always all matches are present.
I am trying to collect the phone number, the e-mail and the website of venues from a web site. Only some venues have all three pieces of information available; most have only one or two. I tried to write some code, but it works only if all 3 are available. Could someone tell me what is wrong?
import re
# text holds the HTML source of the page
grouped = re.compile(r'col-right[\s\S]*?' +
                     r'Tel[\s\S]*?([0-9]{0,4}-?[0-9]{3,7}-?[0-9]{0,4}-?[0-9]{0,4})' +
                     r'[\s\S]*?href="http://([\w\W]*?)"' +
                     r'[\s\S]*?href="mailto:([\s\S]*?)">[\s\S]*?</div>')
for match in re.finditer(grouped, text):
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))
Also, the digits in the phone numbers are separated by "-", but sometimes there is a space between the "-" and the next set of digits. How can I allow for this space, which is only occasionally present?
Your logic is good, but it needs a little work.
First of all, you need the phone number. Write a regex for it and put it in a group: (regex)*. The group is marked with ( ), and * means that it may be present 0 or more times.
Write the next regex and put it in another group, (emailRegex)*, and the website in a third group, (website)*.
Instead of * you could also use ?, which means once or not at all (as I can see, you already used ?).
Now, putting it all together, simply join them with any characters in between:
(group1)?.*(emailRegex)?.*(website)*
group1 matches the phone number, followed by any characters, then the email, followed by any characters, then the website. If one of them is missing, the pattern can still match.
Email regex example (probably not the most complete one):
([a-zA-Z_]+[a-zA-Z0-9_.]*#[a-zA-Z0-9]+\.[a-z]+)?
This works as follows: the email should start with a letter or an underscore _, followed by any number of lower/upper case letters, digits, underscores or dots (.), followed by #, one or more letters or digits, a dot (notice that I used \. to escape the special any-character notation), and finally at least one lowercase letter.
It works for email#mail.com.
The fact that I put the entire regex in parentheses means it is a group, and the ? after it means it should appear once or not at all. Between groups, you add .* meaning that in between the phone number/email/address there can be any characters.
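The question also asks about the occasional space after the - in phone numbers. One possible way to handle that together with the missing fields, sketched here as an alternative to a single combined pattern, is to cut the page into venue blocks first and then search each block for every field separately, so that a missing field simply yields None. The sample markup, the helper first_group and the exact patterns are assumptions made for illustration; the real page structure may differ.

```python
import re

# Invented sample markup; '#' stands where the thread's addresses use it.
text = """
<div class="col-right">Tel: 06-1234567
<a href="http://venue-one.example">site</a>
<a href="mailto:info#venue-one.example">mail</a></div>
<div class="col-right">Tel: 020- 7654321</div>
<div class="col-right"><a href="mailto:hello#venue-three.example">mail</a></div>
"""

block_re = re.compile(r'col-right[\s\S]*?</div>')        # one venue block
phone_re = re.compile(r'Tel\D*?(\d+(?:\s?-\s?\d+)*)')    # optional space around '-'
site_re = re.compile(r'href="http://([^"]*)"')
mail_re = re.compile(r'href="mailto:([^"]*)"')

def first_group(pattern, block):
    """Hypothetical helper: group 1 of the first match, or None if absent."""
    m = pattern.search(block)
    return m.group(1) if m else None

venues = [
    {"phone": first_group(phone_re, block),
     "website": first_group(site_re, block),
     "email": first_group(mail_re, block)}
    for block in block_re.findall(text)
]

for venue in venues:
    print(venue)
```

Searching per block keeps a missing website in one venue from pulling in the link of the next venue, which is what can happen when lazy [\s\S]*? spans are chained across the whole page.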