Python email regex doesn't work - python

I am trying to get all email addresses from a text file using a regular expression in Python, but it always returns NoneType when it is supposed to return the email. For example:
content = 'My email is lehai#gmail.com'
#Compare with suitable regex
emailRegex = re.compile(r'(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)')
mo = emailRegex.search(content)
print(mo.group())
I suspect the problem lies in the regex but could not figure out why.

The ^ and $ anchors force the pattern to match the whole string, but content has other words (and spaces) before the address. Remove the ^ and $ to match anywhere:
([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)
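A minimal Python 3 sketch of the difference, reusing the sample string from the question (the variable names are only illustrative). Since search() returns None when nothing matches, it also checks the result before calling .group():

```python
import re

# The text uses '#' as the address separator, so the patterns do too.
content = 'My email is lehai#gmail.com'

anchored = re.compile(r'^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$')
unanchored = re.compile(r'[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

# None: the whole string would have to be an address.
anchored_match = anchored.search(content)
# Finds the address inside the sentence.
unanchored_match = unanchored.search(content)

# Guard against None before calling .group().
found = unanchored_match.group() if unanchored_match else None
print(anchored_match, found)
```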

Try this one as a regex, though I am not at all sure it will work for you:
([^#|\s]+#[^#]+.[^#|\s]+)

Your regular expression doesn't match the pattern.
I normally call the regex search like this:
mo = re.search(regex, searchstring)
So in your case I would try
content = 'My email is lehai#gmail.com'
#Compare with suitable regex
emailRegex = re.compile(r'gmail')
mo = re.search(emailRegex, content)
print(mo.group())
You can test your regex here: https://regex101.com/
This will work:
([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)

Related

Regex - extract word inside < > brackets

I am trying to extract an email address from a string like
John Smith <jsmith#email.com>
I just need the email address in the < > brackets.
Here is what I have tried so far, but I'm not very good with regex and it doesn't seem to be working, can anyone help?
import re
sender = str(message.sender)
p = re.search(r"\<(\w+)\>", sender)
logging.info(p.group(1))
You can try this:
import re
s = "John Smith <jsmith#email.com>"
email = re.findall('<(.*?)>', s)[0]
Output:
'jsmith#email.com'
Or, a more email-specific solution:
email = re.findall('(?<=\<)\w+#[a-zA-Z]+\.[a-z]+(?=\>)', s)[0]
Output:
'jsmith#email.com'
Currently your regex is : "\<(\w+)\>"
You do not actually need to escape the <>, so it becomes: "<(\w+)>"
\w matches letters, digits and the underscore '_'. An e-mail address can contain other characters as well.
You have two options: either accept anything inside the <> with a regex like "<(.*)>", or actually parse an e-mail address.
A simple regex for that would be "<\S+#\S+>" (non-whitespace characters, followed by #, followed by non-whitespace characters).
Restricting ourselves to the more commonly used characters, we can write: "<[a-zA-Z0-9+_.-]+#[a-zA-Z0-9.-]+>". This still permits certain illegal e-mail addresses because I have kept it fairly simple.
Use a negative character set:
import re
s = "John Smith <jsmith#email.com>"
email = re.findall('<([^>]+)>', s)[0]
That matches one or more characters that are not a > character, so everything that's inside the angle brackets.
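As a rough Python 3 sketch of the negated-character-class approach, wrapped in a helper that returns None instead of raising when there are no brackets (the function name address_in_brackets is made up for this example):

```python
import re

def address_in_brackets(sender):
    """Return the text between the first '<' and the next '>', or None."""
    # [^>]+ means "one or more characters that are not '>'".
    match = re.search(r'<([^>]+)>', sender)
    return match.group(1) if match else None

print(address_in_brackets("John Smith <jsmith#email.com>"))
print(address_in_brackets("John Smith"))
```

Using re.search with a None check avoids the IndexError that findall(...)[0] raises on input without brackets.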

REGEX extracting specific part non greedy

I'm new to Python 2.7. Using regular expressions, I'm trying to extract from a text file just the emails from input lines. I am using the non-greedy method as the emails are repeated 2 times in the same line. Here is my code:
import re
f_hand = open('mail.txt')
for line in f_hand:
    line.rstrip()
    if re.findall('\S+#\S+?',line): print re.findall('\S+#\S+?',line)
However, this is what I'm getting instead of just the email address:
['href="mailto:secretary#abc-mediaent.com">sercetary#a']
What shall I use in re.findall to get just the email out?
If you parse a simple file with anchors for email addresses and always the same syntax (like double quotes to enclose attributes), you can use:
for line in f_hand:
    print re.findall(r'href="mailto:([^"#]+#[^"]+)">\1</a>', line)
(re.findall returns only the capture group. \1 stands for the content of the first capture group.)
If the file is a more complicated HTML file, use a parser, extract the links and filter them. Alternatively, use XPath, something like: substring-after(//a/#href[starts-with(., "mailto:")], "mailto:")
\S means not a space. " and > are not spaces.
You should use mailto:([^#]+#[^"]+) as the regex (quoted form: 'mailto:([^#]+#[^"]+)'). This will put the email address in the first capture group.
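To illustrate why \S+ over-matches and how the capture group narrows the result, here is a small Python 3 sketch. The sample line is an assumption, shaped like the output quoted in the question:

```python
import re

# Hypothetical input line; the real file contents are not shown in the question.
line = '<a href="mailto:secretary#abc-mediaent.com">secretary#abc-mediaent.com</a>'

# \S also matches '"' and '>', so the greedy \S+ runs up to the last '#',
# and the lazy \S+? then takes a single character.
over_match = re.findall(r'\S+#\S+?', line)

# [^#]+ stops at the separator; [^"]+ stops at the attribute's closing quote.
addresses = re.findall(r'mailto:([^#]+#[^"]+)', line)

print(over_match)
print(addresses)
```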
Try this:
re.findall(r'mailto:(\S+#\S+?\.\S+)"', line)
It should give you something like
['secretary#abc-mediaent.com']
\S accepts many characters that aren't valid in an e-mail address. Try a regular expression of
[a-zA-Z0-9-_.]+#[a-zA-Z0-9-_.]+\.[a-zA-Z0-9-_.]+
(presuming you are not trying to support Unicode -- it seems that you aren't since your input is a "text file").
This will require a "." in the server portion of the e-mail address, and your match will stop on the first character that is not valid within the e-mail address.
This is the format of an email address - https://www.rfc-editor.org/rfc/rfc5322#section-3.4.1.
Keeping that in mind the regex that you need is - r"([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)". (This works without having to depend on the text surrounding an email address.)
The following lines of code -
html_str = r'<a href="mailto:sachin.gokhale#indiacast.com">sachin.gokhale#indiacast.com</a>'
email_regex = r"([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"
print re.findall(email_regex, html_str)
yields -
['sachin.gokhale#indiacast.com', 'sachin.gokhale#indiacast.com']
P.S. - I got the regex for email addresses by googling for "email address regex" and clicking on the first site - http://emailregex.com/
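For scanning many lines, one possible Python 3 sketch compiles the pattern once and drops repeated addresses. The helper name extract_addresses and the sample lines are assumptions; in practice the lines would come from the opened file:

```python
import re

# Same character classes as the regex above, with '#' as the separator.
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

def extract_addresses(lines):
    """Collect each distinct address once, in order of first appearance."""
    seen = []
    for line in lines:
        for address in EMAIL_RE.findall(line):
            if address not in seen:
                seen.append(address)
    return seen

# Hypothetical stand-in for the lines of mail.txt.
sample_lines = [
    '<a href="mailto:secretary#abc-mediaent.com">secretary#abc-mediaent.com</a>\n',
    'no address on this line\n',
]
print(extract_addresses(sample_lines))
```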

Capturing string in Python

So, I'm trying to capture this big string in Python but it is failing me. The regex I wrote works fine in regexr: http://regexr.com/3cmdc
But trying to use it in Python to capture the text returns None. This is the code:
pattern = "var initialData = (.*?);\\n"
match = re.search(pattern, source).group(1)
What am I missing ?
You need to set the appropriate flags:
re.search(pattern, source, re.MULTILINE | re.DOTALL).group(1)
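A short Python 3 sketch of what the DOTALL flag changes. The source string here is invented for illustration, on the assumption that the captured value spans several lines. By default '.' does not match a newline, so the lazy group cannot reach the closing ';':

```python
import re

# Hypothetical page source: the assigned value contains newlines.
source = 'var initialData = {\n  "id": 1\n};\nvar other = 2;\n'

# In a raw string, \n is passed to the regex engine, which reads it as a newline.
pattern = r"var initialData = (.*?);\n"

# None: '.' stops at the first newline.
without_flag = re.search(pattern, source)
# With DOTALL, '.' may cross newlines.
with_flag = re.search(pattern, source, re.DOTALL)

data = with_flag.group(1) if with_flag else None
print(without_flag, repr(data))
```

Checking the match object before calling .group(1) avoids an AttributeError when the pattern is not found.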
Use Python's raw string notation:
pattern = r"var initialData = (.*?);\\n"
match = re.search(pattern, source).group(1)
More information

Regex for escaping path separator in url

I have a URL pattern: "somepath/email/". I don't want to write a regex for matching an email address; instead, I want anything that isn't a path separator to match email.
Please suggest a regex for this. I am using Python and the URL is for a Django application, so any library function would also be helpful, but I would prefer a regex.
The regex [^/\\]+ is a negated character class with a + quantifier; it matches one or more characters that are neither / nor \
Code sample:
match = re.search(r"[^/\\]+", subject)
if match:
    result = match.group()
else:
    result = ""
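As a sketch of how that character class could sit inside the full path pattern, using only the re module in Python 3. The named group "email" and the helper match_email_segment are assumptions for illustration, not taken from the question's Django project:

```python
import re

# Hypothetical route: "somepath/", then one segment without separators, then "/".
route = re.compile(r"^somepath/(?P<email>[^/\\]+)/$")

def match_email_segment(path):
    """Return the segment after 'somepath/', or None if the path has another shape."""
    m = route.match(path)
    return m.group("email") if m else None

# Every run of non-separator characters in the path.
print(re.findall(r"[^/\\]+", "somepath/email/"))
print(match_email_segment("somepath/lehai#gmail.com/"))
# An extra segment means no match.
print(match_email_segment("somepath/a/b/"))
```

The pattern is written as a raw string so that \\ reaches the regex engine as an escaped backslash.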

Python regular expression string groupings

I'm trying to match either # or the string at, like for name#email and nameatemail. I imagine it's something like
regex = '#|at'
or
regex = '#|(at)'
but I just can't find the right syntax.
I suggest you use Kodos to test your regular expressions (it also provides you with Python code for your regex). And this for regular expression info.
For your issue, both regexes work correctly:
match = re.search("#|at", subject)
if match:
    result = match.group()
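A small Python 3 sketch of the alternation, plus one case where grouping does matter: once the alternation is part of a longer pattern, | would otherwise split the whole pattern in two. The surrounding (\w+?) and (\w+) parts are assumptions added for illustration:

```python
import re

separator = re.compile(r"#|at")
for subject in ("name#email", "nameatemail"):
    match = separator.search(subject)
    print(subject, "->", match.group() if match else None)

# (?:...) limits the scope of | without creating a capture group.
# The lazy \w+? stops at the first "at" rather than swallowing it.
full = re.compile(r"(\w+?)(?:#|at)(\w+)")
print(full.fullmatch("name#email").groups())
print(full.fullmatch("nameatemail").groups())
```

Note that a name which itself contains "at" would be split at the first occurrence, so this sketch only suits simple inputs like the two in the question.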
