Find email domain in address with regular expressions - python

I know I'm an idiot, but I can't pull the domain out of this email address:
'blahblah#gmail.com'
My desired output:
'#gmail.com'
My current output:
.
(it's just a period character)
Here's my code:
import re
test_string = 'blahblah#gmail.com'
domain = re.search('#*?\.', test_string)
print domain.group()
Here's what I think my regular expression says ('#*?.', test_string):
' # begin to define the pattern I'm looking for (also tell python this is a string)
# # find all patterns beginning with the at symbol ("#")
* # find all characters after ampersand
? # find the last character before the period
\ # breakout (don't use the next character as a wild card, use it as a literal string character)
. # find the "." character
' # end definition of the pattern I'm looking for (also tell python this is a string)
, test string # run the preceding search on the variable "test_string," i.e., 'blahblah#gmail.com'
I'm basing this off the definitions here:
http://docs.activestate.com/komodo/4.4/regex-intro.html
Also, I searched but other answers were a bit too difficult for me to get my head around.
Help is much appreciated, as usual. Thanks.
My stuff if it matters:
Windows 7 Pro (64 bit)
Python 2.6 (64 bit)
PS. StackOverflow question: My posts don't include new lines unless I hit "return" twice in between them. For example (these are all on a different line when I'm posting):
# - find all patterns beginning with the at symbol ("#")
* - find all characters after ampersand
? - find the last character before the period
\ - breakout (don't use the next character as a wild card, use it as a literal string character)
. - find the "." character
, test string - run the preceding search on the variable "test_string," i.e., 'blahblah#gmail.com'
That's why I got a blank line b/w every line above. What am I doing wrong? Thx.

Here's something I think might help
import re
s = 'My name is Conrad, and blahblah#gmail.com is my email.'
domain = re.search("#[\w.]+", s)
print domain.group()
outputs
#gmail.com
How the regex works:
# - scan till you see this character
[\w.] - a set of characters to potentially match: \w is any alphanumeric character or underscore, and the trailing period . adds a literal dot to that set.
+ - one or more of the previous set.
Because this regex is matching the period character and every alphanumeric after an #, it'll match email domains even in the middle of sentences.
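For what it's worth, here is a small sketch (Python 3 print syntax, so adjust for 2.6) of why the pattern in the question returns only a period, next to the pattern from this answer. The helper name find_domain is made up for illustration, and the None check is an addition for inputs with no marker in them.

```python
import re

test_string = 'blahblah#gmail.com'

# '#*?' means "zero or more '#' characters, as few as possible".
# It does not mean "'#' followed by anything", so the first place the
# whole pattern can succeed is at the bare '.' with zero '#' before it.
original = re.search(r'#*?\.', test_string)
print(original.group())

def find_domain(text):
    """Hypothetical helper: return '#' plus the following word characters and dots, or None."""
    match = re.search(r'#[\w.]+', text)
    return match.group() if match else None

print(find_domain(test_string))
print(find_domain('no marker here'))
```

The first print shows the single period, the second shows #gmail.com, and the third shows None because re.search returns None when nothing matches.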

Ok, so why not use split? (or partition )
"#"+'blahblah#gmail.com'.split("#")[-1]
Or you can use other string methods like find
>>> s="bal#gmail.com"
>>> s[ s.find("#") : ]
'#gmail.com'
>>>
and if you are going to extract out email addresses from some other text
f = open("file")
for line in f:
    # check each word on the line, not the list of words as a whole
    for word in line.split():
        if "#" in word:
            print "#" + word.split("#")[-1]
f.close()

Using regular expressions:
>>> re.search('#.*', test_string).group()
'#gmail.com'
A different way:
>>> '#' + test_string.split('#')[1]
'#gmail.com'

You can try using urllib
from urllib import parse
email = 'myemail#mydomain.com'
domain = parse.splituser(email)[1]
Output will be
'mydomain.com'

Just wanted to point out that chrisaycock's method would match invalid email addresses of the form
herp#
To make sure you're only matching a possibly valid email with a non-empty domain, you need to alter it slightly.
Using regular expressions:
>>> re.search('#.+', test_string).group()
'#gmail.com'

Using the below regular expression you can extract any domain like .com or .in.
import re
s = 'my first email is user1#gmail.com second email is user2#yahoo.in and third email is user3#outlook.com'
print(re.findall(r'#\S+\.(?:in|com)', s))
output
['#gmail.com', '#yahoo.in', '#outlook.com']

Here is another method using the index function:
email_addr = 'blahblah#gmail.com'
# Find the location of # sign
index = email_addr.index("#")
# extract the domain portion starting from the index
email_domain = email_addr[index:]
print(email_domain)
#------------------
# Output:
#gmail.com

Related

How to find the match URL from a HTML page using RegEx Python

I'm trying to match the following URL by its query string from an HTML page in Python but have not been able to solve it. I'm a newbie in Python.
<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>
I want to match the above URL with &user_id=[any_digit_from_0_to_99]& and print this URL on the screen.
A URL without this &user_id=[any_digit_from_0_to_99]& should not match.
Here's my horrible, incomplete regex:
https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"
I know this code has a lot wrong with it, but it somehow managed to match the above URL up to the " double quote.
Complete code would look like this:
import re
reg = re.compile(r'https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"')
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Output:
$ python reg.py
http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E"
It shows the " at the end of the URL, and I know this is not good regex code. I want a better version of the code above.
A few remarks can be made on your regexp:
/ is not a special re character, there's no need to escape it
Was the limit of 30 characters on the domain set on purpose? Otherwise, you can just select as many characters as you want with .*
Do you know that the string you're working with contains a valid URL? If not, there are some things you can do, like ensuring the domain is at least 4 characters long, contains a period which is not the last character, etc.
The [0-9][0-9] part will also match stuff like 04, which is not strictly speaking a number between 0 and 99, and it will not match single-digit values such as 7
Taking this into account, you can design this simpler regex:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Using this regex on your example will print 'http://example.com/?query_id=9&user_id=49&', without the " at the end. If you want the full URL, then you can look for the /> symbol:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&.*/>")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()[:-2]
print(result)
Note the [:-2], which is used to remove the /> symbol. In that case, this code will print http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838", which still includes the closing quote and the id attribute.
Note also that these regexps use the wildcard character (.). Depending on whether you are sure that the strings you're working with contain only valid URLs, you may want to change this. For instance, a domain name can only contain ASCII characters. You may want to look at the \w special sequence with the ASCII flag of the re module.
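As a possible alternative to trimming with [:-2], here is a sketch that stops at the closing quote of the href attribute by using [^"]* instead of .* (this is a swapped-in pattern, not the one from the answer above). It assumes the URL sits inside double quotes, and it uses [1-9]?[0-9] so that 0 to 99 match but 04 does not. The names html, url_re and url are only for illustration.

```python
import re

html = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'

# [^"]* matches any run of characters except a double quote,
# so the match cannot run past the end of the href value.
url_re = re.compile(r'https?://[^"]*&user_id=[1-9]?[0-9]&[^"]*')

match = url_re.search(html)
url = match.group() if match else None
print(url)
```

For the sample tag this prints the URL up to and including the token_id value, with no trailing quote.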

Python 3.6 Identifying a string and if X in Y

Newb programmer here working on my first project. I've searched this site and the python documentation, and either I'm not seeing the answer, or I'm not using the right terminology. I've read the regex and if sections, specifically, and followed links around to other parts that seemed relevant.
import re
keyphrase = '##' + '' + '##'
print(keyphrase) #output is ####
j = input('> ') ###whatever##
if keyphrase in j:
    print('yay')
else:
    print('you still haven\'t figured it out...')
k = j.replace('#', '')
print(k) #whatever
This is for a little reddit bot project. I want the bot to be called like ##whatever## and then be able to do things with the word(s) in between the ##'s. I've set up the above code to test if Python was reading it but I keep getting my "you still haven't figured it out..." quip.
I tried adding the REGEX \W in the middle of keyphrase, to no avail. Also weird combinations of \$\$ and quotes
So, my question, is how do I put a placeholder in keyphrase for user input?
For instance, if a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.
You could use the following regex r'##(.*?)##' to capture everything inside of the key phrase you've chosen.
Sample Output:
>>> import re
>>> f = lambda s: re.match(r'##(.*?)##', s).group(1)
>>> f("##whatever##")
'whatever'
>>> f = lambda s: re.findall(r'##(.*?)##', s)
>>> f("a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.")
['comment', 'this', 'I can grab']
How does it work? (1) We state the string constant head and tail (##) on either side of capture group 1, which sits between the parentheses (). Great, almost there! (2) Inside the group we match any character with .*?, where the ? makes the quantifier lazy, so each capture stops at the nearest closing ## instead of running on to the last one.
Suggested Readings:
Introduction to Regex in Python - Jee Gikera
Something like this should work:
import re
keyphrase_regex = re.compile(r'##(.*)##')
user_input = input('> ')
keyphrase_match = keyphrase_regex.search(user_input)
# `search` returns `None` if regex didn't match anywhere in the string
keyphrase_content = keyphrase_match.group(1) if keyphrase_match else None
if keyphrase_content:
    print('yay! You submitted "{}" to the bot!'.format(keyphrase_content))
else:
    # Bonus tip: Use double quotes to make a string containing apostrophe
    # without using a backslash escape
    print("you still haven't figured it out...")
# Use `keyphrase_content` for whatever down here
Regular expressions are kind of hard to wrap your head around, because they work differently than most programming constructs. It's a language to describe patterns.
Regex One is a fantastic beginners guide.
Regex101 is an online sandbox that allows you to type a regular expression and some sample strings, then see what matches (and why) in real time
The regex ##(.*)## basically means "search through the string until you find two '#' signs. Right after those, start capturing zero-or-more of any character. Because .* is greedy, capturing runs as far as it can and then backs up to the last '##' in the string. Stop there, return successfully, and hold onto the entire match (from first '#' to last '#'). Also, hold onto the captured characters in case the programmer asks you for just them."
EDIT: Props to #ospahiu for bringing up the ? lazy quantifier. A final solution, combining our approaches, would look like this:
# whatever_bot.py
import re
# Technically, Python >2.5 will compile and cache regexes automatically.
# For tiny projects, it shouldn't make a difference. I think it's better style, though.
# "Explicit is better than implicit"
keyphrase_regex = re.compile(r'##(.*?)##')
def parse_keyphrases(text):
    return keyphrase_regex.findall(text)
Lambdas are cool. I prefer them for one-off things, but the code above is something I'd rather put in a module. Personal preference.
You could even make the regex substitutable, using the '##' one by default
# whatever_bot.py
import re
keyphrase_double_at_sign = re.compile(r'##(.*?)##')
def parse_keyphrases(text, keyphrase_regex=keyphrase_double_at_sign):
    return keyphrase_regex.findall(text)
You could even go bonkers and write a function that generates a keyphrase regex from an arbitrary "tag" pattern! I'll leave that as an exercise for the reader ;) Just remember: Several characters have special regex meanings, like '*' and '?', so if you want to match that literal character, you'd need to escape them (e.g. '\?').
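For anyone who wants a head start on that exercise, one possible sketch is below. It leans on re.escape from the standard library so that special characters in the tag are matched literally. The function name make_keyphrase_regex and the '**' tag are made up for the example.

```python
import re

def make_keyphrase_regex(tag):
    """Hypothetical helper: build a regex that lazily captures text between two copies of `tag`."""
    escaped = re.escape(tag)  # e.g. '**' becomes '\*\*'
    return re.compile(escaped + r'(.*?)' + escaped)

def parse_keyphrases(text, keyphrase_regex):
    # findall returns only the captured group when the pattern has exactly one
    return keyphrase_regex.findall(text)

star_regex = make_keyphrase_regex('**')
hash_regex = make_keyphrase_regex('##')

print(parse_keyphrases('say **hello** and **bye**', star_regex))
print(parse_keyphrases('a ##comment## like ##this##', hash_regex))
```

Escaping the tag up front means callers never have to remember which characters are special in a regex.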
If you want to grab the content between the "#", then try this:
j = input("> ")
"".join(j.split("#"))
You're not getting any of the info between the #'s in your example because you're effectively looking for '####' in whatever input you give it. Unless you happen to put 4 #'s in a row, that test will never succeed.
What you want to do instead is something like
re.match(r'##\w+##', j)
which will look for 2 leading #s, then one or more alphanumeric characters (\w+), then 2 trailing #s. From there, your replace code looks fine and you should be able to grab it.

In my date time value I want to use regex to strip out the slash and colon from time and replace it with underscore

I am using Python and Webdriver for my automated tests. In my scenario, on the Admin page of our website, I click the Add project button and enter a project name.
Project Name I enter is in the format of LADEMO_IE_05/20/1515:11:38
It is a date and time at the end.
What I would like to do is use a regex to find the / and : characters
and replace them with an underscore _
I have worked out the regex expression:
[0-9]{2}[/][0-9]{2}[/][0-9]{4}:[0-9]{2}[:][0-9]{2}
This finds 2 digits then / followed by 2 digits then / and so on.
I would like to replace / and : with _.
Can I do this in Python using import re? I need some help with the syntax please.
My method which returns the date is:
def get_datetime_now(self):
    dateTime_now = datetime.datetime.now().strftime("%x%X")
    print dateTime_now #prints e.g. 05/20/1515:11:38
    return dateTime_now
My code snippet for entering the project name into the text field is:
project_name_textfield.send_keys('LADEMO_IE_' + self.get_datetime_now())
The Output is e.g.
LADEMO_IE_05/20/1515:11:38
I would like the Output to be:
LADEMO_IE_05_20_1515_11_38
Just format the datetime using strftime() into the desired format:
>>> datetime.datetime.now().strftime("%m_%d_%y%H_%M_%S")
'05_20_1517_20_16'
Another simple option is just using string replace :
s = "your time string"
s = s.replace("/", "_").replace(":", "_")
Two ways:
i) use strftime with the format:
strftime("%m_%d_%y_%H_%M_%S")
ii) simply use replace() method of strings to replace '/' and ':' to '_'
Basically, you want to replace every unwanted character with an underscore. To do it, instead of using a regex, you could simply use the str.replace method. For example:
out_string = in_string.replace('/', '_').replace(':', '_')
In this example, the first replace returns a string with all the slashes replaced, and the second call replaces the colons. I think it's the simplest way to replace one or two characters. But if you want your program to be able to evolve, I advise using re.sub, as follows:
# first we compile the regex, for speed's sake
# this regex matches every one of the bad characters, and it's modular: just add another alternative if needed
bad_characters = re.compile(r'/|:')
# your code
# replacement
out_string = re.sub(bad_characters, '_', in_string)
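Put together with the project name from the question, the re.sub approach might look like the sketch below (Python 3 print syntax). It uses the character class [/:] as an equivalent of /|:, and the function name sanitize_project_name is just an illustration, not something from the original code.

```python
import re

# One character class covering every character we want to replace.
bad_characters = re.compile(r'[/:]')

def sanitize_project_name(name):
    """Hypothetical helper: replace each '/' or ':' in `name` with an underscore."""
    return bad_characters.sub('_', name)

project_name = sanitize_project_name('LADEMO_IE_05/20/1515:11:38')
print(project_name)
```

This prints LADEMO_IE_05_20_1515_11_38, which is the output asked for in the question.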

using \b in regex

--SOLVED--
I solved my issue by enabling multiline mode, and now the characters ^ and $ work perfectly for identifying the beginning and end of each string
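A minimal sketch of what multiline mode changes, assuming the sample text from the original post and Python 3 print syntax (the variable names are only for illustration): without re.MULTILINE, ^ and $ anchor to the whole string, and with it they anchor to each line.

```python
import re

text = "Zip:\n----\n98839-0111\n34332\n"

# A line that is exactly a 5-digit zip, optionally followed by -dddd
zip_line = r'^\d{5}(?:-\d{4})?$'

without_flag = re.findall(zip_line, text)
with_flag = re.findall(zip_line, text, re.MULTILINE)

print(without_flag)
print(with_flag)
```

The first list is empty because the string as a whole does not start with a digit. The second list contains both zip codes.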
--EDIT--
My code:
import re
import test_regex
def regex_content(text_content, regex_dictionary):
    #text_content = text_content.lower()
    regex_matches = []
    # Search sanitized text (markup removed) for DLP theme keywords
    for key,value in regex_dictionary.items():
        # Get configuration settings
        min_matches = value.get('min_matches',1)
        risk = value.get('risk',1)
        enabled = value.get('enabled',False)
        regex_str = value.get('regex','')
        # Fast compute True/False hit for each DLP theme word
        if enabled:
            print "Searching for key : %s" % (key)
            my_regex = re.compile(value.get('regex'))
            hits = my_regex.findall(text_content)
            if len(hits) > 0:
                regex_matches.append((key, risk, len(hits), hits))
    # Return array of results (key, risk, number of hits, regex matches)
    return regex_matches
def main():
    #print defaults.test_regex.dlp_regex
    text_content = ""
    for line in open('testData.txt'):
        text_content+=line
    for match in regex_content(text_content, test_regex.dlp_regex):
        print "\nFound %s : %s" % (match[0], match[3])
    print "\n"
if __name__ == '__main__':
    main()
and it is using the regex found here:
'Large number of US Zip Codes' : { 'regex' : "\b\d{5}(?:-\d{1,4})?\b"},
When I precede my regex with the 'r' flag, I can find the zip codes I'm looking for, but also every other 5 digit number in the document I am searching through. From my understanding this is because it ignored the \b characters. Without the r flag though, it cannot find any zip codes. It works perfectly fine in regexr, but not in my code. I haven't had any luck making \b characters work, nor ^ and $ for identifying the beginnings and ends of the strings I'm searching for. What is it that I am misunderstanding about these special characters?
--Original post--
I am writing a regex for identifying zip codes (and only zip codes), so to avoid false positives I am trying to include a boundary on my regex, using both of the following:
\b\d{5}\b|\b\d{5}-\b\d{1,4}\b
using the online regex debugger Regexr, my code should correctly catch 5 digit zip codes, such as 34332. However, I have two problems:
1. This regex is not working in my actual code for finding any zip codes, but it does work when I don't have the boundary (\b) characters. The exact code I'm trying to extract with my regex is:
Zip:
----
98839-0111
34332
2. I don't see why my regex can't correctly identify 98839-0111 in Regexr. I tried doing the super-primitive approach of
\b\d{5}\b|98839-0111
and even that couldn't identify 98839-0111. Does anyone know what could be going on?
Note: I have also tried using ^ and $ for the boundaries of my regex, but this also doesn't find the regex's, not even in Regexr.
EDIT: After removing the first part of my regex, leaving only
98839-0111
It can now correctly identify it. I guess this means that once a string is pulled out by one of my regex's, it can no longer be found by any subsequent regexs? Why is this?
It is because of the alternative list: the first part was matched, and the engine stopped checking.
Try this regex
98839-0111|\b\d{5}\b
And you'll get a match.
Or, to be more generic in your case:
\b(?:\d{5}-\d{4}|\d{5})\b
will match both, and more (actually, functionally the same as \b\d{5}(?:-\d{4})?\b).
Your pattern is evaluated for each position in the string from the left to the right, so if the left branch of your pattern succeeds, the second branch isn't tested at all.
I suggest using this pattern, which solves the problem:
\b\d{5}(?:-\d{1,4})?\b
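When trying this pattern in Python, it probably matters whether the string is raw. In a normal string literal "\b" is the backspace character, so the regex engine never sees a word boundary. Here is a sketch in Python 3 syntax, with made-up sample text and variable names, that shows the difference:

```python
import re

text = "Zip:\n----\n98839-0111\n34332\nOrder 1234567\n"

# Non-raw string: "\b" is turned into a backspace character (\x08)
# before the regex engine ever sees the pattern.
plain = "\b\\d{5}(?:-\\d{1,4})?\b"
# Raw string: the backslashes reach the regex engine, so \b is a word boundary.
raw = r"\b\d{5}(?:-\d{1,4})?\b"

plain_hits = re.findall(plain, text)
raw_hits = re.findall(raw, text)

print(plain_hits)
print(raw_hits)
```

The non-raw version finds nothing, since the text contains no backspace characters. The raw version finds both zip codes and skips the 7-digit number. It will still match any other standalone 5-digit number, because \b only checks for a word boundary, not that the number is a zip code.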
You can use this regex:
\b(\d{5}-\d{1,4}|\d{5})\b

Python regex to extract substring at start and end of string

I am looking for a regex that will extract everything up to the first . (period) in a string, and everything including and after the last . (period)
For example:
my_file.10.4.5.6.csv
myfile2.56.3.9.txt
Ideally the regex when run against these strings would return:
my_file.csv
myfile2.txt
The numeric stamp in the file will be different each time the script is run, so I am looking essentially to exclude it.
The following prints out the string up to the first . (period)
print re.search("^[^.]*", data_file).group(0)
I am having trouble, though, getting it to also return the last period and the string after it.
Sorry just to update this based upon feedback and comments below:
This does need to be a regex. The regex will be passed into the program from a configuration file. The user will not have access to the source code as it will be packaged.
The user may need to change the regex based upon some arbitrary criteria, so they will need to update the config file, rather than edit the application and re-build the package.
Thanks
You don’t need a regular expression!
parts = data_file.split(".")
print parts[0] + "." + parts[-1]
Instead of regular expressions, I would suggest using str.split. For example:
>>> data_file = 'my_file.10.4.5.6.csv'
>>> parts = data_file.split('.')
>>> print parts[0] + '.' + parts[-1]
my_file.csv
However if you insist on regular expressions, here is one approach:
>>> print re.sub(r'\..*\.', '.', data_file)
my_file.csv
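Since the question says the pattern has to come from a config file, another option might be a regex with two capture groups that the program joins together. This is only a sketch in Python 3 syntax: the pattern string stands in for whatever the config file would supply, and the name strip_stamp is made up.

```python
import re

# Group 1: everything before the first period.
# Group 2: the last period and everything after it.
# In the real program this string would be read from the config file.
config_pattern = r'^([^.]*)\..*(\.[^.]*)$'

def strip_stamp(name, pattern=config_pattern):
    """Hypothetical helper: join the captured groups, or return `name` unchanged if there is no match."""
    match = re.match(pattern, name)
    return ''.join(match.groups()) if match else name

print(strip_stamp('my_file.10.4.5.6.csv'))
print(strip_stamp('myfile2.56.3.9.txt'))
```

Joining whatever groups the pattern captures keeps the program generic, so a user can change the pattern in the config file without touching the code.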
You don't need a regex.
tokens = expanded_name.split('.')
compressed_name = '.'.join((tokens[0], tokens[-1]))
If you are concerned about performance, you could use a length limit and rsplit() to only chop up the string as much as you need.
compressed_name = expanded_name.split('.', 1)[0] + '.' + expanded_name.rsplit('.', 1)[1]
Do you need a regex here?
>>> address = "my_file.10.4.5.6.csv"
>>> split_by_periods = address.split(".")
>>> "{}.{}".format(split_by_periods[0], split_by_periods[-1])
'my_file.csv'
