Regex - extract word inside < > brackets - python

I am trying to extract an email address from a string like
John Smith <jsmith#email.com>
I just need the email address in the < > brackets.
Here is what I have tried so far, but I'm not very good with regex and it doesn't seem to be working. Can anyone help?
import re
sender = str(message.sender)
p = re.search(r"\<(\w+)\>", sender)
logging.info(p.group(1))

You can try this:
import re
s = "John Smith <jsmith#email.com>"
email = re.findall('<(.*?)>', s)[0]
Output:
'jsmith#email.com'
Or, a more email-specific solution:
email = re.findall(r'(?<=<)\w+#[a-zA-Z]+\.[a-z]+(?=>)', s)[0]
Output:
'jsmith#email.com'

Currently your regex is: "\<(\w+)\>"
You do not actually need to escape the < and >, so it becomes: "<(\w+)>"
\w matches letters, digits and the underscore '_'. An e-mail address contains other characters as well (here, '#' and '.'), so the match fails.
You have two options: either accept anything inside the <> with a regex like "<(.*)>", or actually parse an e-mail address.
A simple regex for that would be "<\S+#\S+>" (non-whitespace characters, followed by #, followed by non-whitespace characters).
Restricting ourselves to the more commonly used characters, we can write: "<[a-zA-Z0-9+_.-]+#[a-zA-Z0-9.-]+>". This still permits certain illegal e-mail addresses because I have kept it fairly simple.
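As a rough illustration of the options above, the sketch below runs each pattern against the sample string from the question. The capture groups in the last two patterns are an addition made here so that the address comes back without the brackets, and the variable names are made up for the example.

```python
import re

s = "John Smith <jsmith#email.com>"

# \w+ cannot cross '#' or '.', so the pattern from the question finds nothing.
word_only = re.search(r"<(\w+)>", s)

# Option 1: accept anything between the brackets.
anything = re.search(r"<(.*)>", s)

# Option 2a: non-whitespace on both sides of the separator.
simple = re.search(r"<(\S+#\S+)>", s)

# Option 2b: restricted to the more commonly used characters.
restricted = re.search(r"<([a-zA-Z0-9+_.-]+#[a-zA-Z0-9.-]+)>", s)

print(word_only)            # None
print(anything.group(1))    # jsmith#email.com
print(simple.group(1))      # jsmith#email.com
print(restricted.group(1))  # jsmith#email.com
```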

Use a negated character set:
import re
s = "John Smith <jsmith#email.com>"
email = re.findall('<([^>]+)>', s)[0]
That matches one or more characters that are not a > character, so everything that's inside the angle brackets.
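One case where the negated set seems to matter is a line with more than one bracketed address. The sketch below compares it with a greedy .* on such a line; the second name and address are invented for the illustration.

```python
import re

# Hypothetical input with two bracketed addresses on one line.
s = "John Smith <jsmith#email.com>, Jane Doe <jdoe#email.com>"

# Greedy .* runs to the last '>' on the line, swallowing everything in between.
greedy = re.findall(r"<(.*)>", s)

# [^>]+ stops at the first '>', so each address is captured separately.
negated = re.findall(r"<([^>]+)>", s)

print(greedy)   # ['jsmith#email.com>, Jane Doe <jdoe#email.com']
print(negated)  # ['jsmith#email.com', 'jdoe#email.com']
```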

Related

Underscore character matched all the email address regex python

I have my email addresses in the format username#domain.extension
The username starts with an English alphabetical character, and any subsequent characters consist of one or more of the following: alphanumeric characters, -, . , and _.
The domain and extension contain only English alphabetical characters.
The extension is 1,2 or 3 characters in length.
I have used the below regex to validate my email address:
[a-zA-Z]+\s<\b[a-z0-9._-]+#[a-zA-Z]+\.[A-Za-z]{1,3}\b>
Email addresses:
this <is#valid.com>
this <is_it#valid.com>
this <_is#notvalid.com>
this <.is#notvalid.com>
this <-is#notvalid.com>
It matched email addresses 1, 2 and 3, while 4 and 5 were rejected because they have . and - at the start of the username. So why is the 3rd email, which has an underscore at the start of the username, accepted? Per the rules above, the username cannot start with ., - or _.
Expected result:
Only emails 1 and 2 should match.
Your character class after <\b accepts _, and \b is satisfied between < and _ because _ counts as a word character. Hence any email address starting with _ is also treated as valid, whereas . and - are not word characters, so \b fails in front of them.
You can use this regex to allow only a letter as the first character of your email:
[a-zA-Z]+\s<\b[A-Za-z][a-zA-Z0-9._-]*#[a-zA-Z]+\.[A-Za-z]{1,3}\b>
or you can make use of \w:
[a-zA-Z]+\s<\b[a-zA-Z][\w.-]*#[a-zA-Z]+\.[A-Za-z]{1,3}\b>
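To see the difference, here is a small sketch that runs the five sample lines from the question through the original pattern and through the first corrected pattern above. The list and variable names are only for illustration.

```python
import re

samples = [
    "this <is#valid.com>",
    "this <is_it#valid.com>",
    "this <_is#notvalid.com>",
    "this <.is#notvalid.com>",
    "this <-is#notvalid.com>",
]

# Pattern from the question: '_' passes both \b and the character class.
original = re.compile(r"[a-zA-Z]+\s<\b[a-z0-9._-]+#[a-zA-Z]+\.[A-Za-z]{1,3}\b>")

# Corrected pattern: the first character of the username must be a letter.
corrected = re.compile(
    r"[a-zA-Z]+\s<\b[A-Za-z][a-zA-Z0-9._-]*#[a-zA-Z]+\.[A-Za-z]{1,3}\b>"
)

original_hits = [s for s in samples if original.search(s)]
corrected_hits = [s for s in samples if corrected.search(s)]

print(original_hits)   # the first three samples
print(corrected_hits)  # only the first two samples
```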
I'm a newbie to regex, but here is my attempt at your problem:
([a-zA-Z]+[ ][<][a-zA-Z]+[a-zA-Z._-]+[#][a-zA-Z]+\.[A-Za-z]{1,3})[>]

REGEX extracting specific part non greedy

I'm new to Python 2.7. Using regular expressions, I'm trying to extract from a text file just the emails from input lines. I am using the non-greedy method as the emails are repeated 2 times in the same line. Here is my code:
import re
f_hand = open('mail.txt')
for line in f_hand:
    line.rstrip()
    if re.findall('\S+#\S+?',line): print re.findall('\S+#\S+?',line)
However, this is what I'm getting instead of just the email address:
['href="mailto:secretary#abc-mediaent.com">sercetary#a']
What shall I use in re.findall to get just the email out?
If you parse a simple file with anchors for email addresses and always the same syntax (like double quotes to enclose attributes), you can use:
for line in f_hand:
    print re.findall(r'href="mailto:([^"#]+#[^"]+)">\1</a>', line)
(re.findall returns only the capture group. \1 stands for the content of the first capture group.)
If the file is a more complicated HTML file, use a parser, extract the links and filter them. Alternatively, use XPath, something like: substring-after(//a/@href[starts-with(., "mailto:")], "mailto:")
\S means not a space. " and > are not spaces.
You should use mailto:([^#]+#[^"]+) as the regex (quoted form: 'mailto:([^#]+#[^"]+)'). This will put the email address in the first capture group.
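A minimal sketch of that suggestion follows. The input line is a guess at what a line of the file looks like (the question only shows the match, not the raw line), so treat it as hypothetical.

```python
import re

# Hypothetical line from mail.txt; the real file content is not shown in the question.
line = '<a href="mailto:secretary#abc-mediaent.com">secretary#abc-mediaent.com</a>'

# \S+ is greedy and also matches '"' and '>', so it runs across both copies
# of the address and stops one character after the second '#'.
too_much = re.findall(r'\S+#\S+?', line)

# Anchoring on 'mailto:' and stopping at the closing quote captures one clean address.
emails = re.findall(r'mailto:([^#]+#[^"]+)', line)

print(too_much)  # ['href="mailto:secretary#abc-mediaent.com">secretary#a']
print(emails)    # ['secretary#abc-mediaent.com']
```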
Try this:
re.findall(r'mailto:(\S+#\S+?\.\S+)"', line)
It should give you something like
['secretary#abc-mediaent.com']
\S accepts many characters that aren't valid in an e-mail address. Try a regular expression of
[a-zA-Z0-9-_.]+#[a-zA-Z0-9-_.]+\.[a-zA-Z0-9-_.]+
(presuming you are not trying to support Unicode -- it seems that you aren't since your input is a "text file").
This will require a "." in the server portion of the e-mail address, and your match will stop on the first character that is not valid within the e-mail address.
This is the format of an email address - https://www.rfc-editor.org/rfc/rfc5322#section-3.4.1.
Keeping that in mind the regex that you need is - r"([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)". (This works without having to depend on the text surrounding an email address.)
The following lines of code -
html_str = r'<a href="mailto:sachin.gokhale#indiacast.com">sachin.gokhale#indiacast.com</a>'
email_regex = r"([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"
print re.findall(email_regex, html_str)
yields -
['sachin.gokhale#indiacast.com', 'sachin.gokhale#indiacast.com']
P.S. - I got the regex for email addresses by googling for "email address regex" and clicking on the first site - http://emailregex.com/

Python email regex doesn't work

I am trying to get all email addresses from a text file using a regular expression in Python, but the search always returns None when it is supposed to return the email. For example:
content = 'My email is lehai#gmail.com'
#Compare with suitable regex
emailRegex = re.compile(r'(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)')
mo = emailRegex.search(content)
print(mo.group())
I suspect the problem lies in the regex but could not figure out why.
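Because ^ and $ anchor the pattern to the start and end of the string, it can only match when content is nothing but the address; the other words and spaces in content make it fail. Remove the ^ and $ to match anywhere: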
([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)
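A short sketch of the difference, using the string from the question. The names anchored and unanchored are chosen here only for the illustration.

```python
import re

content = 'My email is lehai#gmail.com'

# With ^ and $ the whole string has to be an address, so search() returns None.
anchored = re.compile(r'(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)')

# Without the anchors the pattern may match anywhere inside the string.
unanchored = re.compile(r'([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')

mo = anchored.search(content)
found = unanchored.findall(content)

# Guard against None before calling group() to avoid an AttributeError.
print(mo.group() if mo else 'no match')  # no match
print(found)                             # ['lehai#gmail.com']
```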
Try this one as a regex, though I am not at all sure whether it will work for you:
([^#|\s]+#[^#]+.[^#|\s]+)
Your regular expression doesn't match the pattern.
I normally call the regex search like this:
mo = re.search(regex, searchstring)
So in your case I would try
content = 'My email is lehai#gmail.com'
#Compare with suitable regex
emailRegex = re.compile(r'gmail')
mo = re.search(emailRegex, content)
print(mo.group())
You can test your regex here: https://regex101.com/
This will work:
([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)

Replace characters using re.sub - keep one character

I'm trying to repair broken email records in a table. There are emails, for example: 'google#google.comyahoo#yahoo.com' but there can be a single email like 'google#google.com'. The best way to make this correct is in my opinion to use re.sub. But there is a little problem. If there is a record:
email = 'google#google.comyahoo#yahoo.com'
I can't simply do replace('.com','.com, ') because it affects both '.com' substrings. So I want to use re.sub('.com\w', '.com, \w',email), which replaces only those '.com' substrings that aren't at the end of the record. The problem is that I want to keep the character matched by \w.
print re.sub('.com\w', '.com, \w',email)
>>> google#google.com, \wahoo#yahoo.com
instead of
>>> google#google.com, yahoo#yahoo.com
Can anybody give me advice on how to make it work? (I want to separate emails by a comma and a space.)
Use a capturing group and backreference the group inside of the replacement call:
>>> import re
>>> email = 'google#google.comyahoo#yahoo.com'
>>> re.sub(r'\.com(\w)', '.com, \\1', email)
'google#google.com, yahoo#yahoo.com'
Backreferences recall what was matched by a capturing group. A backreference is specified as a backslash (\) followed by a digit indicating the number of the group to be recalled.
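The same idea can be wrapped in a small helper, sketched below. The function name separate is made up for the example; a raw replacement string is used so that \1 does not need a doubled backslash.

```python
import re

def separate(record):
    # (\w) captures the character that follows '.com'; \1 puts it back
    # after the inserted comma and space.
    return re.sub(r"\.com(\w)", r".com, \1", record)

print(separate("google#google.comyahoo#yahoo.com"))  # google#google.com, yahoo#yahoo.com
print(separate("google#google.com"))                 # google#google.com
```

A record holding a single address is returned unchanged, because its '.com' is not followed by a word character.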
x="google#google.comyahoo#yahoo.com"
print re.sub(r"(?<=\.com)(?=\w)",", ",x)
Output: google#google.com, yahoo#yahoo.com
Use lookarounds. See the demo:
https://regex101.com/r/sJ9gM7/48
Lookarounds don't consume any of the string; they are just assertions. When you use them, you don't need to put the consumed text back into the replacement, as the answer above does.
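Because the lookaround match is zero-width, the same call also handles records with more than two addresses glued together. A sketch, with an invented three-address record:

```python
import re

# Hypothetical record with three concatenated addresses.
record = "a#a.comb#b.comc#c.com"

# The pattern matches the empty position that has '.com' behind it and a word
# character ahead of it; only that position is replaced, no text is consumed.
fixed = re.sub(r"(?<=\.com)(?=\w)", ", ", record)

print(fixed)  # a#a.com, b#b.com, c#c.com
```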

How to check in Python3.2 if a string has any numbers using regex

I need to write a script where one can type their name (via input) and the script has to check if the name has the right format (no numbers, no Caps Lock, starting with a capital letter). This is what I have so far:
import re
def inputName():
    name = input("Enter your name: ")
    if re.search('^[A-Z]{1}\w[a-z]+',name):
        print("ok")
    else:
        print('not ok')
inputName()
I've also tried [^\d] and \D and [^0-9] but it still doesn't work correctly. When I enter "A8hkjh" it returns "not ok", but when I type "Ahj8k" it returns "ok", even though there is a digit in the string.
How can I make the script check the whole string?
\w matches letters as well as digits and underscores, so drop it. Also, don't forget to anchor the regex to the end of the string, otherwise it will succeed on a partial match. For example, on "Ab1", the substring "Ab" is matched by ^[A-Z][a-z]+ if you don't use the $ anchor:
re.search('^[A-Z][a-z]+$',name)
should fix this.
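A quick sketch comparing the pattern from the question with the anchored one. The helper names are invented for the example.

```python
import re

def original_check(name):
    # Pattern from the question: no end anchor, and \w lets a digit through.
    return re.search(r'^[A-Z]{1}\w[a-z]+', name) is not None

def is_valid_name(name):
    # One capital letter, then only lowercase letters up to the end of the string.
    return re.search(r'^[A-Z][a-z]+$', name) is not None

print(original_check("Ahj8k"))  # True: "Ahj" matches and the rest is ignored
print(is_valid_name("Ahj8k"))   # False
print(is_valid_name("A8hkjh"))  # False
print(is_valid_name("Alice"))   # True
```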
First off, if you want to check the whole string, you need to anchor to the beginning of the string AND the end. You have an anchor to the beginning with ^, but you should anchor to the end with $. Your regex ^[A-Z]{1}\w[a-z]+ is only checking the first few characters of the string which is why Ahj8k returns valid. It is also very naive in that it will not handle names like "McDonald", "Smith-Jones", "Jackson Jr." or "de Icaza". Were I to write a regex for a name, I'd keep it simple and permissive: ^[A-Za-z\.\s\-]+$.
import re
username = "john"
result = re.match(r'(?!.*[\dA-Z])', username)
if result is not None:
    print("{} passed validation".format(username))
The negative lookahead (?!.*[\dA-Z]) succeeds only if no digit or capital letter appears anywhere in the string, so the match is None for any name that contains one.
The mistake you made is that \w also matches digits.
Try regex101; it's a pretty cool website for testing your regexes.
import re
def inputName():
    name = input('Enter your name: ')
    if re.search("^[A-Z][a-z]+$", name):
        print('ok')
    else:
        print("not ok")
This code works. For example:
$>Enter your name: Ahj8k
not ok
