Capturing emails with regex in Python

I will be gathering scattered emails from a larger CSV file. I am just now learning regex. I am trying to extract the emails from this example sentence. However, emails is populating with only the # symbol and the letter immediately before that. Can you help me see what's going wrong?
import re
String = "'Jessica's email is jessica#gmail.com, and Daniel's email is daniel123#gmail.com. Edward's is edwardfountain#gmail.com, and his grandfather, Oscar's, is odawg#gmail.com.'"
emails = re.findall(r'.[#]', String)
names = re.findall(r'[A-Z][a-z]*',String)
print(emails)
print(names)

Your email regex cannot work as written: r'.[#]' matches exactly one character (any character) followed by #, so each result is just the letter before the # plus the # itself.
I would try a different approach: match each sentence fragment and extract (name, email) pairs, based on the following empirical assumptions (if your text changes too much, the logic will break):
every name is followed by 's, and " is " appears somewhere after it (the non-greedy .*? matches whatever lies in between)
\w matches any alphanumeric character (or underscore), and the domain contains exactly one dot (otherwise the pattern would also swallow the period that ends the sentence)
Code:
import re
String = "'Jessica's email is jessica#gmail.com, and Daniel's email is daniel123#gmail.com. Edward's is edwardfountain#gmail.com, and his grandfather, Oscar's, is odawg#gmail.com.'"
print(re.findall(r"(\w+)'s.*? is (\w+#\w+\.\w+)", String))
result:
[('Jessica', 'jessica#gmail.com'), ('Daniel', 'daniel123#gmail.com'), ('Edward', 'edwardfountain#gmail.com'), ('Oscar', 'odawg#gmail.com')]
Passing that list of pairs to dict() would even give you a name => address dictionary:
{'Oscar': 'odawg#gmail.com', 'Jessica': 'jessica#gmail.com', 'Daniel': 'daniel123#gmail.com', 'Edward': 'edwardfountain#gmail.com'}
The general case needs more characters in the classes (this list may not be exhaustive):
String = "'Jessica's email is jessica_123#gmail.com, and Daniel's email is daniel-123#gmail.com. Edward's is edward.fountain#gmail.com, and his grandfather, Oscar's, is odawg#gmail.com.'"
print(re.findall(r"(\w+)'s.*? is ([\w\-.]+#[\w\-.]+\.[\w\-]+)", String))
result:
[('Jessica', 'jessica_123#gmail.com'), ('Daniel', 'daniel-123#gmail.com'), ('Edward', 'edward.fountain#gmail.com'), ('Oscar', 'odawg#gmail.com')]
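The dict conversion mentioned above can be sketched as follows. This is a minimal example that reuses the first pattern from this answer on a shortened sample sentence; the names text, pairs and by_name are only illustrative.

```python
import re

text = ("Jessica's email is jessica#gmail.com, and Daniel's email is "
        "daniel123#gmail.com. Edward's is edwardfountain#gmail.com.")

# Each match is a (name, address) tuple because the pattern has two groups.
pairs = re.findall(r"(\w+)'s.*? is (\w+#\w+\.\w+)", text)

# dict() accepts an iterable of 2-tuples; if a name repeats, the last address wins.
by_name = dict(pairs)
print(by_name)
```

If the same name can appear more than once, the plain dict() call silently drops the earlier addresses, so a dict of lists may be a better fit in that case.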

1. Emails
In [1382]: re.findall(r'\S+#\w+\.\w+', text)
Out[1382]:
['jessica#gmail.com',
'daniel123#gmail.com',
'edwardfountain#gmail.com',
'odawg#gmail.com']
How it works: Every email here has the form xxx#xxx.xxx: a run of characters on either side of the #, and a single . in the domain. \S matches any character that is not whitespace, and + means one or more of them. \w+\.\w+ then matches a domain that contains exactly one dot.
2. Names
In [1375]: re.findall(r'[A-Z][\S]+(?=\')', text)
Out[1375]: ['Jessica', 'Daniel', 'Edward', 'Oscar']
How it works: [A-Z] anchors the match on a word that starts with an uppercase letter. The (?=\') is a lookahead. As you can see, all names follow the pattern Name's, and we want everything before the apostrophe, so the lookahead checks for the apostrophe without capturing it.
Now, if you want to map names to emails by capturing them together with one larger regex, you can. Jean-François Fabre's answer is a good start. But I recommend getting the basics down pat first.

You need to find anchors, patterns to match. An improved pattern could be:
import re
String = "'Jessica's email is jessica#gmail.com, and Daniel's email is daniel123#gmail.com. Edward's is edwardfountain#gmail.com, and his grandfather, Oscar's, is odawg#gmail.com.'"
emails = re.findall(r'[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', String)
names = re.findall(r'[A-Z][a-z]*', String)
print(emails)
print(names)
\w+ alone does not match '-', which is allowed in email addresses.
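One thing to watch with the pattern above: its last character class also accepts '.', so an address that ends a sentence comes back with the sentence's period attached. The sketch below shows the effect on a shortened sample and one possible cleanup (stripping trailing dots and hyphens); the cleanup step is my assumption, not part of the answer.

```python
import re

text = ("Daniel's email is daniel123#gmail.com. Edward's is "
        "edwardfountain#gmail.com, and Oscar's is odawg#gmail.com.")

pattern = r'[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

# Raw matches: addresses followed by a period keep that period.
raw = re.findall(pattern, text)

# Assumed cleanup: a real domain does not end in '.' or '-', so trim them.
cleaned = [m.rstrip('.-') for m in raw]

print(raw)
print(cleaned)
```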

This happens because your pattern has no repetition operator, so . matches only a single character. The code below uses +, which lets the preceding character or sub-pattern repeat one or more times.
import re
s = '''Jessica's email is jessica#gmail.com, and Daniel's email is daniel123#gmail.com. Edward's is edwardfountain#gmail.com, and his grandfather, Oscar's, is odawg#gmail.com.'''
p = r'[a-z0-9]+#[a-z]+\.[a-z]+'
ans = re.findall(p, s)
print(ans)
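Since the question mentions emails scattered through a larger CSV file, here is a rough sketch of applying the same pattern cell by cell. The CSV content is invented for illustration and io.StringIO stands in for a real file; with an actual file you would pass an open file object to csv.reader instead.

```python
import csv
import io
import re

EMAIL = re.compile(r'[a-z0-9]+#[a-z]+\.[a-z]+')

# Hypothetical CSV data; replace with open('yourfile.csv', newline='') for a real file.
sample = io.StringIO(
    'id,notes\n'
    '1,"Jessica\'s email is jessica#gmail.com, and Daniel\'s is daniel123#gmail.com."\n'
    '2,no address here\n'
    '3,"Oscar\'s is odawg#gmail.com."\n'
)

found = []
for row in csv.reader(sample):
    for cell in row:
        # findall returns every non-overlapping match in the cell.
        found.extend(EMAIL.findall(cell))

print(found)
```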

Related

Python : Extract mails from the string of filenames

I want to get the mail from the filenames. Here is a set of examples of filenames :
string1 = "benoit.m.fontaine#outlook.fr_2022-05-11T11_59_58+00_00.pdf"
string2 = "jeane_benrand#toto.pt_test.pdf"
string3 = "rosy.gray#amazon.co.uk-fdsdfsd-saf.pdf"
I would like to split each filename into two parts: the first containing the email and the second containing the rest. For string2 it should give:
['jeane_benrand#toto.pt', '_test.pdf']
I tried this regex, but it does not work for the second and third strings.
email = re.search(r"[a-z0-9\.\-+_]+#[a-z0-9\.\-+_]+\.[a-z]+", string)
Thank you for your help
Given the samples you provided, you can do something like this:
import re
strings = ["benoit.m.fontaine#outlook.fr_2022-05-11T11_59_58+00_00.pdf",
"jeane_benrand#toto.pt_test.pdf",
"rosy.gray#amazon.co.uk-fdsdfsd-saf.pdf"]
pattern = r'([^#]+#[\.A-Za-z]+)(.*)'
[re.findall(pattern, string)[0] for string in strings]
Output:
[('benoit.m.fontaine#outlook.fr', '_2022-05-11T11_59_58+00_00.pdf'),
('jeane_benrand#toto.pt', '_test.pdf'),
('rosy.gray#amazon.co.uk', '-fdsdfsd-saf.pdf')]
Mail pattern explanation ([^#]+#[\.A-Za-z]+):
[^#]+: any combination of characters except #
#: at
[\.A-Za-z]+: any combination of letters and dots
Rest pattern explanation (.*)
(.*): any combination of characters
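The same pattern can be wrapped in a small helper that returns the two parts as a list, as the question asks. This is only a sketch: split_filename is a hypothetical name, and it relies on the assumption (true for the three samples) that the character right after the domain is neither a letter nor a dot.

```python
import re

# Same pattern as above; the backslash before '.' is not needed inside a class.
PATTERN = re.compile(r'([^#]+#[.A-Za-z]+)(.*)')

def split_filename(filename):
    """Return [email, rest], or None when the filename contains no '#'."""
    m = PATTERN.fullmatch(filename)
    if m is None:
        return None
    return [m.group(1), m.group(2)]

print(split_filename("jeane_benrand#toto.pt_test.pdf"))
print(split_filename("rosy.gray#amazon.co.uk-fdsdfsd-saf.pdf"))
print(split_filename("no_mail_here.pdf"))
```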

Python REGEX search returns None as answer

I'm getting None from this code. When I search for either the email or the phone alone, the code works, but using both returns None. Please help!
import re
string = 'Love, Kenneth, kenneth+challenge#teamtreehouse.com, 555-555-5555, #kennethlove Chalkley, Andrew, andrew#teamtreehouse.co.uk, 555-555-5556, #chalkers McFarland, Dave, dave.mcfarland#teamtreehouse.com, 555-555-5557, #davemcfarland Kesten, Joy, joy#teamtreehouse.com, 555-555-5558, #joykesten'
contacts = re.search(r'''
^(?P<email>[-\w\d.+]+#[-\w\d.]+) # Email
(?P<phone>\d{3}-\d{3}-\d{4})$ # Phone
''', string, re.X|re.M)
print(contacts.groupdict)
Perhaps you want:
(?P<email>[-\w\d.+]+#[-\w\d.]+), (?P<phone>\d{3}-\d{3}-\d{4})
This matches the parts:
kenneth+challenge#teamtreehouse.com, 555-555-5555
andrew#teamtreehouse.co.uk, 555-555-5556
dave.mcfarland#teamtreehouse.com, 555-555-5557
joy#teamtreehouse.com, 555-555-5558
You are using ^ and $ to anchor the match to the start and end of the string (the input contains no line breaks, so re.M changes nothing here), but your regexp is designed to match only a substring.
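A minimal sketch of the unanchored pattern in use, on a shortened version of the input. It uses re.finditer to collect every contact and calls groupdict() (with parentheses) on each match. The pattern is written on one line without re.X, because in verbose mode an unescaped # outside a character class starts a comment; the names contact_re and contacts are illustrative.

```python
import re

string = ('Love, Kenneth, kenneth+challenge#teamtreehouse.com, 555-555-5555, #kennethlove '
          'Chalkley, Andrew, andrew#teamtreehouse.co.uk, 555-555-5556, #chalkers')

contact_re = re.compile(
    r'(?P<email>[-\w\d.+]+#[-\w\d.]+), (?P<phone>\d{3}-\d{3}-\d{4})'
)

# One dict per match, keyed by the named groups.
contacts = [m.groupdict() for m in contact_re.finditer(string)]

for c in contacts:
    print(c['email'], c['phone'])
```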

Python regex to match multiple times

I'm trying to match a pattern against strings that could have multiple instances of the pattern. I need every instance separately. re.findall() should do it but I don't know what I'm doing wrong.
pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')
I need 'http://url.com/123', 'http://url.com/456' and the two numbers 123 & 456 to be different elements of the match list.
I have also tried '/review: ((http://url.com/(\d+)\s?)+)/' as the pattern, but no luck.
Use this. You need to place 'review' outside the capturing group to achieve the desired result.
pattern = re.compile(r'(?:review: )?(http://url.com/(\d+))\s?', re.IGNORECASE)
This gives output
>>> match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')
>>> match
[('http://url.com/123', '123'), ('http://url.com/456', '456')]
You've got extra /'s in the regex. In python the pattern should just be a string. e.g. instead of this:
pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
It should be:
pattern = re.compile('review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
Also typically in python you'd actually use a "raw" string like this:
pattern = re.compile(r'review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
The extra r on the front of the string saves you from having to do lots of backslash escaping etc.
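One caveat with the corrected pattern: a capturing group that sits under + keeps only its last repetition, so findall still reports a single URL. The sketch below contrasts that with a pattern that matches one URL at a time; the variable names are illustrative.

```python
import re

msg = 'this is the message. review: http://url.com/123 http://url.com/456'

# The group is repeated by +, so only the final repetition is kept.
repeated = re.compile(r'review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
last_only = repeated.findall(msg)

# Matching a single URL per match lets findall return each one.
per_url = re.compile(r'(http://url.com/(\d+))', re.IGNORECASE)
all_urls = per_url.findall(msg)

print(last_only)
print(all_urls)
```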
Use a two-step approach: First get everything from "review:" to EOL, then tokenize that.
import re
msg = 'this is the message. review: http://url.com/123 http://url.com/456'
review_pattern = re.compile(r'.*review: (.*)$')
urls = review_pattern.findall(msg)[0]
url_pattern = re.compile(r"(http://url.com/(\d+))")
url_pattern.findall(urls)

Python - trying to capture the middle of a line, regex or split

I have a text file with some names and emails and other stuff. I want to capture email addresses.
I don't know whether this is a split or regex problem.
Here are some sample lines:
[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81
[name]mark hilly [email]mark.hilly#hotmail.com [dob]02.11.80
[name]gill silly [email]gill.silly#hotmail.com [dob]03.12.79
I want to be able to do a loop that prints all the email addresses.
Thanks.
I'd use a regex:
import re
data = '''[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81
[name]mark hilly [email]mark.hilly#hotmail.com [dob]02.11.80
[name]gill silly [email]gill.silly#hotmail.com [dob]03.12.79'''
group_matcher = re.compile(r'\[(.*?)\]([^\[]+)')
for line in data.split('\n'):
    o = dict(group_matcher.findall(line))
    print(o['email'])
\[ is literally [.
(.*?) is a non-greedy capturing group. It "expands" to capture the text.
\] is literally ]
( is the beginning of a capturing group.
[^\[] matches anything but a [.
+ repeats the last pattern one or more times.
) closes the capturing group.
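To make the breakdown above concrete, here is what the pattern yields for a single sample line. This is a small sketch: the strip() call is my addition, since each value captured by [^\[]+ keeps the space that precedes the next tag.

```python
import re

line = '[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81'
group_matcher = re.compile(r'\[(.*?)\]([^\[]+)')

# findall returns one (tag, value) tuple per field on the line.
pairs = group_matcher.findall(line)

# Build a dict and trim the trailing space left on each value.
fields = {key: value.strip() for key, value in pairs}

print(pairs)
print(fields['email'])
```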
for line in lines:
    print(line.split("]")[2].split(" ")[0])
You can split on substrings, not just single characters, for example with str.partition:
email = line.partition('[email]')[-1].partition('[')[0].rstrip()
This has an advantage over the simple split solutions that it will work on fields that can have spaces in the value, on lines that have things in a different order (even if they have [email] as the last field), etc.
To generalize it:
def get_field(line, field):
    return line.partition('[{}]'.format(field))[-1].partition('[')[0].rstrip()
However, I think it's still more complicated than the regex solution. Plus, it can only search for exactly one field at a time, instead of all fields at once (without making it even more complicated). To get two fields, you'll end up parsing each line twice, like this:
for line in data.splitlines():
    print('''{} "babysat" Dan O'Brien on {}'''.format(get_field(line, 'name'),
                                                     get_field(line, 'dob')))
(I may have misinterpreted the DOB field, of course.)
You can split by space and then search for the element that starts with [email]:
line = '[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81'
items = line.split()
for item in items:
    if item.startswith('[email]'):
        print(item.replace('[email]', '', 1))
Say you have a file with these lines.
import re
with open("logfile", "r") as f:
    data = f.read()
for line in data.split("\n"):
    match = re.search(r"email\](?P<id>.*)\[dob", line)
    if match:
        # either store or print the emails as you like
        print(match.group('id').strip())
That's all. Try it (print is called as a function here, so the code runs on Python 3).
The output from your sample data:
bill.billy#hotmail.com
mark.hilly#hotmail.com
gill.silly#hotmail.com

Matching alternative regexps in Python

I'm using Python to parse a file in search for e-mail addresses, but I can't figure out what the syntax for alternative regexps should be. Here's the code:
addresses = []
pattern = '(\w+)#(\w+\.com)|(\w+)#(it.\w+\.com)'
for line in file:
    matches = re.findall(pattern,line)
    for m in matches:
        address = '%s#%s' % m
        addresses.append(address)
So I want to find addresses that look like john#company.com or john#it.company.com, but the above code doesn't work because either the first two groups are empty or the last two groups are empty. What is the correct solution? I need to use groups to store the user name (before #) and server name (after #) separately.
EDIT: Matching email addresses is only an example. What I'm trying to find out is how to match different regexps that have only one thing in common: they match two groups.
(\w+)#((?:it\.)?\w+\.com)
You want to capture the part after the # whether it's example.com or it.example.com, so you put both options inside the same capture group. But since they share a similar format, you can condense (it\.\w+\.com|\w+\.com) to just ((it\.)?\w+\.com)
The (?: ) makes that pair of parentheses a non-capturing group, so it won't appear among your captured groups. There will be one group for the first (\w+), and one group for the whole ((?:it\.)?\w+\.com) after the #. That's two groups total, plus the default group 0 for the full match.
EDIT: To answer your new question, all you have to do is follow the grouping I used, but stop before you condense it.
If your test cases are:
1) example#abcdef
2) example#123456
You could write your regex as such: (\w+)#([a-zA-Z]+|\d+), which would always have the part before the # in group 1, and the part after in group 2. Notice that there are only two pairs of parens, and the |("or") operator appears inside of the second parens group.
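Both patterns from this answer can be checked with re.findall, which returns exactly two groups per match, so the '%s#%s' % m formatting from the question works unchanged. A minimal sketch with made-up input lines:

```python
import re

# Optional non-capturing prefix: the server name is always group 2.
pattern = r'(\w+)#((?:it\.)?\w+\.com)'
line = 'contact john#company.com or john#it.company.com'
matches = re.findall(pattern, line)
addresses = ['%s#%s' % m for m in matches]

# Alternation inside the second group: still only two groups per match.
alt_pattern = r'(\w+)#([a-zA-Z]+|\d+)'
alt_matches = re.findall(alt_pattern, 'example#abcdef example#123456')

print(matches)
print(addresses)
print(alt_matches)
```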
I once found a well-written email regex here. It was built for extracting a wide range of valid email addresses from a generic string, so it should also be able to do what you're looking for.
Example:
>>> email_regex = re.compile("""((([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~]+|"([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~(),:;<>#\[\]\.]|\\[ \\"])*")\.)*([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~]+|"([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~(),:;<>#\[\]\.]|\\[ \\"])*"))#((([a-zA-Z0-9]([a-zA-Z0-9]*(\-[a-zA-Z0-9]*)*)?\.)*[a-zA-Z]+|\[((0?\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\.){3}(0?\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\]|\[[Ii][Pp][vV]6(:[0-9a-fA-F]{0,4}){6}\]))""")
>>>
>>> m = email_regex.search('john#it.company.com')
>>> m.group(0)
'john#it.company.com'
>>> m.group(1)
'john'
>>> m.group(7)
'it.company.com'
>>>
>>> n = email_regex.search('john#company.com')
>>> n.group(0)
'john#company.com'
>>> n.group(1)
'john'
>>> n.group(7)
'company.com'
