finding email address in a web page using regular expression - python

I'm a beginner-level student of Python. Here is the code I have to find instances of email addresses from a web page.
page = urllib.request.urlopen("http://website/category")
reg_ex = re.compile(r'[-a-z0-9._]+#([-a-z0-9]+)(\.[-a-z0-9]+)+', re.IGNORECASE
m = reg_ex.search_all(page)
m.group()
When I ran it, the Python module said that there is an invalid syntax and it is on the line:
m = reg_ex.search_all(page)
Would anyone tell me why it is invalid?

Consider an alternative:
## Suppose we have a text with many email addresses
str = 'purple alice#google.com, blah monkey bob#abc.com blah dishwasher'
## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+#[\w\.-]+', str)
## ['alice#google.com', 'bob#abc.com']
for email in emails:
    # do something with each found email string
    print(email)
Source: https://developers.google.com/edu/python/regular-expressions

Besides, reg_ex has no search_all method. And you should pass in page.read().
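Putting those two points together, a minimal sketch might look like the following. It assumes Python 3, keeps the `#` separator exactly as the patterns on this page write it, and assumes the page is UTF-8 encoded; `fetch_text` and `extract_emails` are names made up for this example.

```python
import re
import urllib.request

# Pattern from the answer above; '#' is the separator as written on this page.
EMAIL_RE = re.compile(r'[\w.-]+#[\w.-]+')

def fetch_text(url):
    """Download a page and decode it (UTF-8 is an assumption)."""
    with urllib.request.urlopen(url) as page:
        # read() returns bytes in Python 3, so decode before matching
        return page.read().decode('utf-8', errors='replace')

def extract_emails(text):
    """Return every substring of text that matches EMAIL_RE."""
    return EMAIL_RE.findall(text)

# Hypothetical usage (needs network access):
# print(extract_emails(fetch_text("http://website/category")))
print(extract_emails('purple alice#google.com, blah monkey bob#abc.com blah'))
```

Keeping the download separate from the matching makes the regex easy to try on a plain string first.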

You are missing the closing ) on this line; it should read:
reg_ex = re.compile(r'[a-z0-9._]+#([-a-z0-9]+)(\.[-a-z0-9]+)+', re.IGNORECASE)
Plus, your regex rejects some legitimate addresses (a + in the local part, for example); try this instead:
"[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
FYI, validating email using regex is not that trivial, see these threads:
Python check for valid email address?
Using a regular expression to validate an email address

There is no .search_all method in the re module.
Maybe the one you are looking for is .findall.
You can try:
re.findall(r"(\w(?:[-.+]?\w+)+\#(?:[a-zA-Z0-9](?:[-+]?\w+)*\.)+[a-zA-Z]{2,})", text)
I assume text is the text to search; in your case it should be text = page.read().
Or you can compile the regex first:
r = re.compile(r"(\w(?:[-.+]?\w+)+\#(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
results = r.findall(text)
Note:
.findall returns a list of the matched strings.
If you need to iterate over match objects instead, you can use .finditer
(reusing the pattern from the example above):
r = re.compile(r"(\w(?:[-.+]?\w+)+\#(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
for email_match in r.finditer(text):
    email_addr = email_match.group()  # or anything else you need from the match object
Now the problem is what Regex you have to use :)

Change r'[-a-z0-9._]+#([-a-z0-9]+)(\.[-a-z0-9]+)+' to r'[aA-zZ0-9._]+#([aA-zZ0-9]+)(\.[aA-zZ0-9]+)+'. The - character before a-z is the cause
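Before applying this change, it may be worth checking how the two character classes behave in Python's re module. The sketch below is only a quick check: a hyphen placed first inside [...] is taken literally, while the range A-z also spans the ASCII punctuation that sits between Z and a.

```python
import re

# A hyphen placed first inside [...] is a literal hyphen, not a range.
leading_hyphen = re.compile(r'[-a-z0-9._]+', re.IGNORECASE)
print(leading_hyphen.fullmatch('first-last.name') is not None)

# [aA-zZ] is 'a', the range 'A'-'z', and 'Z'; that range also covers
# the ASCII characters between 'Z' and 'a': [ \ ] ^ _ `
wide_range = re.compile(r'[aA-zZ]')
extras = [ch for ch in '[\\]^_`' if wide_range.fullmatch(ch)]
print(extras)
```

If only letters are wanted, [a-zA-Z] (or [a-z] with re.IGNORECASE) avoids the extra characters.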

Related

writing flexible regex expressions

I am trying to write a program that reads text from screenshots and then identifies various PII in it. Using pytesseract to read in the text, I am trying to write regexes for URLs, email IDs, etc. Here is an example of a function that takes a string and returns True for email IDs and False otherwise:
def email_regex(text):
    pattern = compile(r"\A[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")
    return bool(pattern.match(text))
This function works well for all email IDs in a proper format (abc#xyz.dd), but since the input to the function is text read in by pytesseract, the text is not guaranteed to be properly formatted. My function returns False for abc#xyzdd. I'm running into the same issues with my URL regex, domain name regex, etc. Is there a way to make my regexes more robust to pytesseract's read errors?
I have tried following the accepted solution to this answer, but that leads to the regex functions returning True for random words as well. Any help to resolve this would be greatly appreciated.
EDIT: Here are my URL and domain regexes, where I'm running into the same problem as with my email regex. Any help with these would be very useful.
pattern = compile(r'\b(((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,86}[a-zA-Z0-9]))\.(([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,73}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25})))|((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,162}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25}))))\b', re.IGNORECASE)
return pattern.match(text)
def url_regex(text):
    pattern = compile(r'(http|https://)?:(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F])+)', re.IGNORECASE)
    return pattern.match(text)
Perhaps add some flags, such as IGNORECASE, and DOTALL for newlines (combine them with |, since compile takes a single flags argument):
# Match email ID:
my_pattern = compile(r"^[a-z0-9]+[\._]?[a-z0-9]+[#]\w+[.]?\w{2,3}$", re.I | re.S)
Match URLs:
https://gist.github.com/gruber/8891611
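As a rough check of the email pattern suggested above, the sketch below runs it against the two inputs from the question plus an ordinary word. The `[.]?` is what makes the dot optional, which lets the OCR-damaged abc#xyzdd through; the trade-off is that the pattern accepts more strings overall. `looks_like_email` is a hypothetical helper name, and `#` is kept as the separator the way this page writes it.

```python
import re

# Flags are combined with | and passed as a single argument.
email_pattern = re.compile(
    r"^[a-z0-9]+[\._]?[a-z0-9]+[#]\w+[.]?\w{2,3}$",
    re.I | re.S,
)

def looks_like_email(text):
    """True when the whole string matches the lenient pattern."""
    return bool(email_pattern.match(text))

for sample in ("abc#xyz.dd", "abc#xyzdd", "dishwasher"):
    print(sample, looks_like_email(sample))
```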

Unable to parse a link from some content

I'm trying to parse a link out of some content using regex. I already have it working, but I had to use the replace() function and the word this as a marker. The thing is, this may not always be present there. So I'm looking for a solution that gives the same output without those two things.
import re
content = """
widgetEvCall('handlers.onMenuClicked', event, this, 'http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf')
"""
link = re.findall(r'this,\s*([^)]*)',content.strip())[0].replace("'","")
print(link)
Output:
http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf
How can I get the link using pure regex?
You may extract all characters between the single quotes that follow this, and any whitespace:
import re
content = """
widgetEvCall('handlers.onMenuClicked', event, this, 'http://w...content-available-to-author-only...n.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf')
"""
link = ''
m = re.search(r"this,\s*'([^']*)'", content)
if m:
    link = m.group(1)
print(link)
# => http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf

Search for email address with the pattern [at]/(at) in python

I am developing a web scraper. The main thing I am retrieving is email addresses from the HTML source. I am using the following code:
r = re.compile(r"\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
emailAddresses = r.findall(html)
On a few websites the email address is written as abcd[at]gmail.com or abcd(at)gmail.com. I need a generic regex that retrieves email addresses in any of the three formats: abcd[at]gmail.com, abcd(at)gmail.com, and abcd#gmail.com. I tried the following code, but didn't get the expected result. Can anyone help me?
r = re.compile(r"\b[A-Z0-9._%+-]+[#|(at)|[at]][A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
emailAddresses = r.findall(html)
Solution: Replace # by (#|\(at\)|\[at\]) as such:
r = re.compile(r"\b[A-Z0-9._%+-]+(#|\(at\)|\[at\])[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
emailAddresses = r.findall(html)
Explanation: In your attempt you wrote [one|two|three], which does not work. […] matches a single character from a set ([a-z] is the same as [abcd…xyz]). You must use (one|two|three) instead. [1]
Also, you are trying to match () and [], which are special characters in regex. To match them literally, escape them with a preceding \. The same goes for . ? + * and so on.
Suggestion: You can also match [dot] and (dot) the same way if you wish.
Just remember that there are a ton of ways to obfuscate email addresses out there, including some you might not be aware of.
Also, validating email addresses (and so trying to catch them with regex) can be very tricky:
The actual official REGEX is (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\]).
(EDIT: Source: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html Looks like it could be even worse than the above REGEX!!)
[1] Beware that (…) captures its content; if you do not want that content captured, use (?:…) instead.
r = re.compile(r"\b[A-Z0-9._%+-]+(?:#|[(\[]at[\])])[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
                                 ^^^^^^^^^^^^^^^^^^
emailAddresses = r.findall(html)
See demo.
https://regex101.com/r/nD5jY4/5#python
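The difference between the capturing and non-capturing forms matters for findall in particular. The sketch below is a small check on a made-up input string (the `html` value is an assumption for illustration): when a pattern contains exactly one capturing group, findall returns only what that group matched, not the whole address.

```python
import re

html = "abcd[at]gmail.com abcd(at)gmail.com abcd#gmail.com"

capturing = re.compile(
    r"\b[A-Z0-9._%+-]+(#|\(at\)|\[at\])[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
non_capturing = re.compile(
    r"\b[A-Z0-9._%+-]+(?:#|\(at\)|\[at\])[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)

# With one capturing group, findall returns only what the group matched.
separators = capturing.findall(html)
# With a non-capturing group, findall returns the whole match.
addresses = non_capturing.findall(html)
print(separators)
print(addresses)
```

If the capturing form is kept, finditer with match.group(0) is another way to get the full address.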

python regexp parse

I have a hash:
{'login': u'myemail (myemail#gmail.com)'}
I need to extract only the email, myemail#gmail.com.
What regexp should I write?
No regex is needed. Use string manipulation instead. This splits the value on whitespace, then strips the () from the second item ([1]) of the returned list.
yourhash = {'login': u'myemail (myemail#gmail.com)'}
email = yourhash['login'].split()[1].strip("()")
print(email)
# myemail#gmail.com
If you really need a regular expression solution (versus the excellent string split options also posted) this will do it for you:
>>> import re
>>> re.match(r'.*\((.*)\)', 'myemail (myemail#gmail.com)').group(1)
'myemail#gmail.com'
>>>
Use string methods instead:
my_dict['login'].split('(')[1].strip(')')
There are many patterns for matching emails. A good resource can be found here.
For example,
^[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}$

Find email domain in address with regular expressions

I know I'm an idiot, but I can't pull the domain out of this email address:
'blahblah#gmail.com'
My desired output:
'#gmail.com'
My current output:
.
(it's just a period character)
Here's my code:
import re
test_string = 'blahblah#gmail.com'
domain = re.search('#*?\.', test_string)
print domain.group()
Here's what I think my regular expression says ('#*?.', test_string):
' # begin to define the pattern I'm looking for (also tell python this is a string)
# # find all patterns beginning with the at symbol ("#")
* # find all characters after ampersand
? # find the last character before the period
\ # breakout (don't use the next character as a wild card, use it as a string character)
. # find the "." character
' # end definition of the pattern I'm looking for (also tell python this is a string)
, test string # run the preceding search on the variable "test_string," i.e., 'blahblah#gmail.com'
I'm basing this off the definitions here:
http://docs.activestate.com/komodo/4.4/regex-intro.html
Also, I searched but other answers were a bit too difficult for me to get my head around.
Help is much appreciated, as usual. Thanks.
My stuff if it matters:
Windows 7 Pro (64 bit)
Python 2.6 (64 bit)
PS. StackOverflow question: My posts don't include new lines unless I hit "return" twice in between them. For example (these are all on a different line when I'm posting):
# - find all patterns beginning with the at symbol ("#")
* - find all characters after ampersand
? - find the last character before the period
\ - breakout (don't use the next character as a wild card, use it as a string character)
. - find the "." character
, test string - run the preceding search on the variable "test_string," i.e., 'blahblah#gmail.com'
That's why I got a blank line b/w every line above. What am I doing wrong? Thx.
Here's something I think might help
import re
s = 'My name is Conrad, and blahblah#gmail.com is my email.'
domain = re.search("#[\w.]+", s)
print domain.group()
outputs
#gmail.com
How the regex works:
# - scan until you see this character
[\w.] - a set of characters to match: \w is any alphanumeric character (or underscore), and the period . adds a literal period to that set.
+ - one or more characters from the previous set.
Because this regex is matching the period character and every alphanumeric after an #, it'll match email domains even in the middle of sentences.
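To see why the pattern in the question returned only a period, it can help to run the variants side by side. This is a minimal sketch in Python 3 syntax, reusing `test_string` from the question: `*` and `?` are quantifiers applied to the `#` in front of them, not wildcards of their own.

```python
import re

test_string = 'blahblah#gmail.com'

# '#*?' means "zero or more '#' characters, as few as possible",
# so the pattern can match a lone '.' with no '#' in front of it.
print(re.search(r'#*?\.', test_string).group())

# '.' (any character) has to sit between '#' and the quantifier
# for the match to run from '#' up to the first literal period.
print(re.search(r'#.*?\.', test_string).group())

# Matching word characters and periods after '#' gives the full domain.
print(re.search(r'#[\w.]+', test_string).group())
```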
Ok, so why not use split? (or partition )
"#"+'blahblah#gmail.com'.split("#")[-1]
Or you can use other string methods like find
>>> s="bal#gmail.com"
>>> s[ s.find("#") : ]
'#gmail.com'
>>>
And if you are going to extract email addresses from some other text:
with open("file") as f:
    for line in f:
        # check each word, not the list of words, for the separator
        for word in line.split():
            if "#" in word:
                print("#" + word.split("#")[-1])
Using regular expressions:
>>> re.search('#.*', test_string).group()
'#gmail.com'
A different way:
>>> '#' + test_string.split('#')[1]
'#gmail.com'
You can try using urllib
from urllib import parse
email = 'myemail#mydomain.com'
domain = parse.splituser(email)[1]
Output will be
'mydomain.com'
Just wanted to point out that chrisaycock's method would also match invalid email addresses of the form
herp#
To make sure you only match a possibly valid email with a domain, you need to alter it slightly:
Using regular expressions:
>>> re.search('#.+', test_string).group()
'#gmail.com'
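The effect of swapping `*` for `+` can be checked directly. In this small sketch, `domain_part` is a hypothetical helper that returns None instead of raising when nothing matches:

```python
import re

def domain_part(text, pattern):
    """Return the matched text, or None when the pattern does not match."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(domain_part('herp#', r'#.*'))   # '#' alone still matches
print(domain_part('herp#', r'#.+'))   # no match: at least one character is required
print(domain_part('blahblah#gmail.com', r'#.+'))
```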
Using the regular expression below, you can extract any domain ending in .com or .in. Note that the alternatives go in a group, (?:in|com), not in square brackets, which would only define a set of single characters.
import re
s = 'my first email is user1#gmail.com second email is user2#yahoo.in and third email is user3#outlook.com'
print(re.findall(r'#\S+\.(?:in|com)\b', s))
output
['#gmail.com', '#yahoo.in', '#outlook.com']
Here is another method using the index function:
email_addr = 'blahblah#gmail.com'
# Find the location of # sign
index = email_addr.index("#")
# extract the domain portion starting from the index
email_domain = email_addr[index:]
print(email_domain)
#------------------
# Output:
#gmail.com
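One thing to keep in mind with index is that it raises ValueError when the separator is missing. A possible alternative, sketched here with the partition method mentioned in an earlier answer, returns None in that case; `email_domain` and its `sep` parameter are names chosen for this example.

```python
def email_domain(addr, sep="#"):
    """Return sep plus everything after the first sep, or None if sep is absent."""
    _, found, rest = addr.partition(sep)
    # partition returns an empty separator when sep is not in the string
    return found + rest if found else None

print(email_domain('blahblah#gmail.com'))
print(email_domain('no-separator-here'))
```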
