Python: Extract emails from filename strings

I want to extract the email address from each filename. Here are some example filenames:
string1 = "benoit.m.fontaine#outlook.fr_2022-05-11T11_59_58+00_00.pdf"
string2 = "jeane_benrand#toto.pt_test.pdf"
string3 = "rosy.gray#amazon.co.uk-fdsdfsd-saf.pdf"
I would like to split each filename into two parts: the first contains the email address and the second contains the rest. For string2, the result should be:
['jeane_benrand#toto.pt', '_test.pdf']
I tried this regex, but it does not work for the second and third strings:
email = re.search(r"[a-z0-9\.\-+_]+#[a-z0-9\.\-+_]+\.[a-z]+", string)
Thank you for your help

Given the samples you provided, you can do something like this:
import re
strings = ["benoit.m.fontaine#outlook.fr_2022-05-11T11_59_58+00_00.pdf",
"jeane_benrand#toto.pt_test.pdf",
"rosy.gray#amazon.co.uk-fdsdfsd-saf.pdf"]
pattern = r'([^#]+#[\.A-Za-z]+)(.*)'
[re.findall(pattern, string)[0] for string in strings]
Output:
[('benoit.m.fontaine#outlook.fr', '_2022-05-11T11_59_58+00_00.pdf'),
('jeane_benrand#toto.pt', '_test.pdf'),
('rosy.gray#amazon.co.uk', '-fdsdfsd-saf.pdf')]
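Mail pattern explanation ([^#]+#[\.A-Za-z]+):
[^#]+: one or more characters other than #
#: the literal # separator
[\.A-Za-z]+: one or more letters or dots, so the match stops at the first digit, underscore, or hyphen after the domain
Rest pattern explanation (.*):
(.*): zero or more of any character, i.e. whatever remains of the filename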
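The two-group pattern above can be wrapped in a small helper that returns the list shape asked for in the question. This is a minimal sketch: the name split_filename is hypothetical, and the domain part keeps the same letters-and-dots assumption, so a domain containing a digit or hyphen would be cut short.

```python
import re

# Same pattern as in the answer: group 1 is the address, group 2 is the rest.
PATTERN = re.compile(r'([^#]+#[.A-Za-z]+)(.*)')

def split_filename(filename):
    """Return [email, rest], or None when the filename does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return [match.group(1), match.group(2)]

print(split_filename("jeane_benrand#toto.pt_test.pdf"))
# ['jeane_benrand#toto.pt', '_test.pdf']
```

Returning None for a non-matching filename avoids the IndexError that re.findall(...)[0] would raise on an empty result.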

Related

Regex in Python: splitting a string into a specific sequence

I have a string that may look like
CITS/CPU/0218/2305CITS/VDU/0218/2305CITS/KEY/0218/2305
or
CITS/CPU/0218/2305CITS/VDU/0218/2305 CITS/KEY/0218/2305
or
CITS/CPU/0218/2305 CITS/VDU/0218/2305 CITS/KEY/0218/2305
or
CITS/CPU/0218/2305
I was trying to come up with a regex that would match against a sequence like CITS/CPU/0218/2305 so that I can split any string into a list that matches this case only.
Essentially I just need to extract the */*/*/* part into a list from incoming strings
My code
product_code = "CITS/CPU/0218/2305CITS/VDU/0218/2305 CITS/KEY/0218/2305"
(re.split(r'^((?:[a-z][a-z]+))(.)((?:[a-z][a-z]+))((?:[a-z][a-z]+))(.)(\\d+)(.)(\\d+)$', product_code))
Any suggestions?
Try using re.findall here:
import re
inp = "CITS/CPU/0218/2305CITS/VDU/0218/2305CITS/KEY/0218/2305"
matches = re.findall(r'[A-Z]+/[A-Z]+/[0-9]+/[0-9]+', inp)
print(matches)
This prints:
['CITS/CPU/0218/2305', 'CITS/VDU/0218/2305', 'CITS/KEY/0218/2305']
If you only want the first match, index into the list:
print(matches[0])
This prints:
CITS/CPU/0218/2305
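As a quick check, the same pattern can be run over all four input shapes from the question. This is only a sketch; the helper name extract_codes is an assumption made for illustration.

```python
import re

# Letters/letters/digits/digits, as in the answer above.
CODE = re.compile(r'[A-Z]+/[A-Z]+/[0-9]+/[0-9]+')

def extract_codes(text):
    """Return every product code found in text, in order of appearance."""
    return CODE.findall(text)

samples = [
    "CITS/CPU/0218/2305CITS/VDU/0218/2305CITS/KEY/0218/2305",
    "CITS/CPU/0218/2305CITS/VDU/0218/2305 CITS/KEY/0218/2305",
    "CITS/CPU/0218/2305 CITS/VDU/0218/2305 CITS/KEY/0218/2305",
    "CITS/CPU/0218/2305",
]
for sample in samples:
    print(extract_codes(sample))
```

Because the last field is digits only and the next code starts with a letter, the codes are separated correctly whether or not a space sits between them.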

How to start at a specific letter and end when it hits a digit?

I have some sample strings:
s = 'neg(able-23, never-21) s2-1/3'
i = 'amod(Market-8, magical-5) s1'
I can detect whether the string ends with something like 's1' or 's3' using:
word = re.search(r's\d$', s)
But if I want to know whether the string contains 's2-1/3', it won't work.
Is there a regex that works for both the 's#' and 's#+' cases?
Thanks!
You can allow the characters "-" and "/" to be captured as well, in addition to just digits. It's hard to tell the exact pattern you're going for here, but something like this would capture "s2-1/3" from your example:
import re
s = "neg(able-23, never-21) s2-1/3"
word = re.search(r"s\d[-/\d]*$", s)
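To see that this one pattern covers both shapes, it can be applied to both sample strings. A small sketch; find_tag is a hypothetical helper name.

```python
import re

# 's', one digit, then any run of digits, '-' or '/' up to the end of the string.
TAG = re.compile(r"s\d[-/\d]*$")

def find_tag(text):
    """Return the trailing s-tag, or None if the string does not end with one."""
    match = TAG.search(text)
    return match.group() if match else None

print(find_tag('neg(able-23, never-21) s2-1/3'))  # s2-1/3
print(find_tag('amod(Market-8, magical-5) s1'))   # s1
```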
My guess is that you want to extract the parts with an expression such as:
(s\d+)-?(.*)$
or:
(s\d+)-?([0-9]+)?\/?([0-9]+)?$
Test
import re
expression = r"(s\d+)-?(.*)$"
string = """
neg(able-23, never-21) s211-12/31
neg(able-23, never-21) s2-1/3
amod(Market-8, magical-5) s1
"""
print(re.findall(expression, string, re.M))
Output
[('s211', '12/31'), ('s2', '1/3'), ('s1', '')]

Capturing emails with regex in Python

I will be gathering scattered emails from a larger CSV file. I am just now learning regex. I am trying to extract the emails from this example sentence. However, emails is populating with only the # symbol and the letter immediately before that. Can you help me see what's going wrong?
import re
String = "'Jessica's email is jessica#gmail.com, and Daniel's email is daniel123#gmail.com. Edward's is edwardfountain#gmail.com, and his grandfather, Oscar's, is odawg#gmail.com.'"
emails = re.findall(r'.[#]', String)
names = re.findall(r'[A-Z][a-z]*',String)
print(emails)
print(names)
Your email regex cannot work: emails = re.findall(r'.[#]', String) matches any single character followed by #.
I would try a different approach: match the sentences and extract (name, email) pairs under the following empirical assumptions (if your text changes too much, the logic will break):
every name is followed by 's, and " is " appears somewhere before the address (a non-greedy .*? matches whatever lies in between)
\w matches any alphanumeric character or underscore, and the domain allows only one dot (otherwise the match would swallow the sentence's final dot)
code:
import re
String = "'Jessica's email is jessica#gmail.com, and Daniel's email is daniel123#gmail.com. Edward's is edwardfountain#gmail.com, and his grandfather, Oscar's, is odawg#gmail.com.'"
print(re.findall(r"(\w+)'s.*? is (\w+#\w+\.\w+)", String))
result:
[('Jessica', 'jessica#gmail.com'), ('Daniel', 'daniel123#gmail.com'), ('Edward', 'edwardfountain#gmail.com'), ('Oscar', 'odawg#gmail.com')]
Passing the result to dict() would even give you a name => address mapping:
{'Oscar': 'odawg#gmail.com', 'Jessica': 'jessica#gmail.com', 'Daniel': 'daniel123#gmail.com', 'Edward': 'edwardfountain#gmail.com'}
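The conversion can be sketched as follows; the variable names are illustrative. Since findall returns a list of (name, address) tuples, dict() accepts it directly.

```python
import re

text = ("'Jessica's email is jessica#gmail.com, and Daniel's email is "
        "daniel123#gmail.com. Edward's is edwardfountain#gmail.com, and his "
        "grandfather, Oscar's, is odawg#gmail.com.'")

# Each match is a (name, address) tuple, so the list maps straight to a dict.
pairs = re.findall(r"(\w+)'s.*? is (\w+#\w+\.\w+)", text)
address_by_name = dict(pairs)

print(address_by_name['Daniel'])  # daniel123#gmail.com
```

If a name appeared twice in the text, the later address would overwrite the earlier one in the dictionary.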
The general case needs more characters in the classes (this list may not be exhaustive):
String = "'Jessica's email is jessica_123#gmail.com, and Daniel's email is daniel-123#gmail.com. Edward's is edward.fountain#gmail.com, and his grandfather, Oscar's, is odawg#gmail.com.'"
print(re.findall(r"(\w+)'s.*? is ([\w\-.]+#[\w\-.]+\.[\w\-]+)", String))
result:
[('Jessica', 'jessica_123#gmail.com'), ('Daniel', 'daniel-123#gmail.com'), ('Edward', 'edward.fountain#gmail.com'), ('Oscar', 'odawg#gmail.com')]
1. Emails
In [1382]: re.findall(r'\S+#\w+\.\w+', text)
Out[1382]:
['jessica#gmail.com',
'daniel123#gmail.com',
'edwardfountain#gmail.com',
'odawg#gmail.com']
How it works: every email has the shape xxx#xxx.xxx, that is, a run of characters on either side of # and a single . in the domain. \S matches any non-whitespace character, and + means one or more of them. \w+\.\w+ matches two runs of word characters separated by exactly one dot.
2. Names
In [1375]: re.findall('[A-Z][\S]+(?=\')', text)
Out[1375]: ['Jessica', 'Daniel', 'Edward', 'Oscar']
How it works: any word starting with an uppercase letter. The (?=\') is a lookahead. As you can see, all names follow the pattern Name's. We want everything before the apostrophe, hence the lookahead, which is not captured.
Now, if you want to map names to emails by capturing them together with one massive regex, you can. Jean-François Fabre's answer is a good start. But I recommend getting the basics down pat first.
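The two snippets above assume a variable text holding the sentence. A self-contained version, with the sample sentence assigned to text as an assumption, might look like this:

```python
import re

text = ("Jessica's email is jessica#gmail.com, and Daniel's email is "
        "daniel123#gmail.com. Edward's is edwardfountain#gmail.com, and his "
        "grandfather, Oscar's, is odawg#gmail.com.")

# Non-whitespace run, '#', then a domain containing exactly one dot.
emails = re.findall(r'\S+#\w+\.\w+', text)

# Capitalised word followed by an apostrophe; the lookahead is not consumed.
names = re.findall(r"[A-Z]\S+(?=')", text)

print(emails)
print(names)
```

The lookahead only checks that an apostrophe follows, so the apostrophe itself never ends up in the result.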
You need to find anchors, patterns to match. An improved pattern could be:
import re
String = "'Jessica's email is jessica#gmail.com, and Daniel's email is daniel123#gmail.com. Edward's is edwardfountain#gmail.com, and his grandfather, Oscar's, is odawg#gmail.com.'"
emails = re.findall(r'[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', String)
names = re.findall(r'[A-Z][a-z]*', String)
print(emails)
print(names)
\w+ does not match '-', which is allowed in email addresses.
This is because you are not using the repeat operator. The code below uses the + operator, which means the character or sub-pattern just before it can repeat one or more times.
import re
s = '''Jessica's email is jessica#gmail.com, and Daniel's email is daniel123#gmail.com. Edward's is edwardfountain#gmail.com, and his grandfather, Oscar's, is odawg#gmail.com.'''
p = r'[a-z0-9]+#[a-z]+\.[a-z]+'
ans = re.findall(p, s)
print(ans)
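The difference the repeat operator makes can be seen by running the question's pattern and the repeated one side by side. A short sketch that reuses the sample sentence:

```python
import re

s = ("Jessica's email is jessica#gmail.com, and Daniel's email is "
     "daniel123#gmail.com. Edward's is edwardfountain#gmail.com, and his "
     "grandfather, Oscar's, is odawg#gmail.com.")

# '.' without a quantifier matches exactly one character before '#'.
without_repeat = re.findall(r'.[#]', s)
# '+' lets each character class match one or more characters.
with_repeat = re.findall(r'[a-z0-9]+#[a-z]+\.[a-z]+', s)

print(without_repeat)  # ['a#', '3#', 'n#', 'g#']
print(with_repeat)
```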

Regex pattern to match the string

What regex pattern matches a string that starts with abc-def-xyz and ends with anything?
Update
Since you only want to match host names that begin with abc-def you can simply use str.startswith():
hosts = ['abc-def.1.desktop.rul.com',
'abc-def.2.desktop.rul.com',
'abc-def.3.desktop.rul.com',
'abc-def.4.desktop.rul.com',
'abc-def.44.desktop.rul.com',
'abc-def.100.desktop.rul.com',
'qwe-rty.100.desktop.rul.com',
'z.100.desktop.rul.com',
'192.168.1.10',
'abc-def.100abc.desktop.rul.com']
filtered_hosts = [host for host in hosts if host.startswith('abc-def')]
print(filtered_hosts)
Output
['abc-def.1.desktop.rul.com', 'abc-def.2.desktop.rul.com', 'abc-def.3.desktop.rul.com', 'abc-def.4.desktop.rul.com', 'abc-def.44.desktop.rul.com', 'abc-def.100.desktop.rul.com', 'abc-def.100abc.desktop.rul.com']
Original regex solution follows.
Let's say that your data is a list of host names such as these:
hosts = ['abc-def.1.desktop.rul.com',
'abc-def.2.desktop.rul.com',
'abc-def.3.desktop.rul.com',
'abc-def.4.desktop.rul.com',
'abc-def.44.desktop.rul.com',
'abc-def.100.desktop.rul.com',
'qwe-rty.100.desktop.rul.com',
'z.100.desktop.rul.com',
'192.168.1.10',
'abc-def.100abc.desktop.rul.com']
import re
pattern = re.compile(r'abc-def\.\d+\.')
filtered_hosts = [host for host in hosts if pattern.match(host)]
print(filtered_hosts)
Output
['abc-def.1.desktop.rul.com', 'abc-def.2.desktop.rul.com', 'abc-def.3.desktop.rul.com', 'abc-def.4.desktop.rul.com', 'abc-def.44.desktop.rul.com', 'abc-def.100.desktop.rul.com']
The pattern matches any host name that starts with abc-def. followed by one or more digits and then a dot; pattern.match() anchors the search at the start of the string.
If you wanted to match a more generic pattern such as any sequence of 3 lowercase letters followed by a - and then another 3 lowercase letters, you could do this:
pattern = re.compile(r'[a-z]{3}-[a-z]{3}\.\d+\.')
Now the output also includes 'qwe-rty.100.desktop.rul.com'.
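Running the generic pattern over the same host list confirms this. The sketch below uses Python 3 print syntax and repeats the list so that it runs on its own.

```python
import re

hosts = ['abc-def.1.desktop.rul.com',
         'abc-def.2.desktop.rul.com',
         'abc-def.3.desktop.rul.com',
         'abc-def.4.desktop.rul.com',
         'abc-def.44.desktop.rul.com',
         'abc-def.100.desktop.rul.com',
         'qwe-rty.100.desktop.rul.com',
         'z.100.desktop.rul.com',
         '192.168.1.10',
         'abc-def.100abc.desktop.rul.com']

# Three lowercase letters, '-', three lowercase letters, '.', digits, '.'
pattern = re.compile(r'[a-z]{3}-[a-z]{3}\.\d+\.')

# match() only succeeds when the pattern matches at the start of the host name.
filtered_hosts = [host for host in hosts if pattern.match(host)]
print(filtered_hosts)
```

'abc-def.100abc.desktop.rul.com' is still excluded, because the digits must be followed directly by a dot.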

Python regex to match multiple times

I'm trying to match a pattern against strings that could have multiple instances of the pattern. I need every instance separately. re.findall() should do it but I don't know what I'm doing wrong.
pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')
I need 'http://url.com/123', 'http://url.com/456' and the two numbers 123 and 456 to be separate elements of the match list.
I have also tried '/review: ((http://url.com/(\d+)\s?)+)/' as the pattern, but no luck.
Use this. You need to place 'review' outside the capturing group to achieve the desired result.
pattern = re.compile(r'(?:review: )?(http://url.com/(\d+))\s?', re.IGNORECASE)
This gives output
>>> match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')
>>> match
[('http://url.com/123', '123'), ('http://url.com/456', '456')]
You've got extra /'s in the regex. In Python the pattern should just be a string, e.g. instead of this:
pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
It should be:
pattern = re.compile('review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
Also, in Python you'd typically use a "raw" string like this:
pattern = re.compile(r'review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
The extra r on the front of the string saves you from having to do lots of backslash escaping etc.
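One caveat worth checking: with the slashes removed the pattern does match, but a capturing group that repeats keeps only its last iteration, so findall still returns just the final URL. A small sketch comparing it with a non-repeated group:

```python
import re

msg = 'this is the message. review: http://url.com/123 http://url.com/456'

# The outer group repeats, so only the last repetition is reported.
repeated = re.findall(r'review: (http://url.com/(\d+)\s?)+', msg, re.IGNORECASE)
print(repeated)  # [('http://url.com/456', '456')]

# Without the repetition, findall produces one tuple per URL.
separate = re.findall(r'(http://url.com/(\d+))', msg)
print(separate)  # [('http://url.com/123', '123'), ('http://url.com/456', '456')]
```

The second form drops the "review: " prefix check, so it would also pick up matching URLs that appear elsewhere in the message.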
Use a two-step approach: First get everything from "review:" to EOL, then tokenize that.
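import re
msg = 'this is the message. review: http://url.com/123 http://url.com/456'
# Step 1: capture everything after "review: " up to the end of the line.
review_pattern = re.compile(r'.*review: (.*)$')
urls = review_pattern.findall(msg)[0]
# Step 2: pull each URL and its number out of that remainder.
url_pattern = re.compile(r"(http://url.com/(\d+))")
print(url_pattern.findall(urls))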
