Regex to extract top level domain from email address

From email address like
xxx#site.co.uk
xxx#site.uk
xxx#site.me.uk
I want to write a regex that returns 'uk' in all of these cases.
I have tried
'+#([^.]+)\..+'
which gives only the domain name. I have also tried
'[^/.]+$'
but it gives an error.

The regex to extract what you are asking for is:
\.([^.\n\s]*)$ with the /gm modifiers
Explanation:
\. matches the character . literally
1st capturing group ([^.\n\s]*)
[^.\n\s]* matches any character not present in the list below
Quantifier *: between zero and unlimited times, as many times as possible, giving back as needed [greedy]
. the literal character .
\n matches a line-feed (newline) character (ASCII 10)
\s matches any whitespace character [\r\n\t\f ]
$ asserts position at the end of a line
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
g modifier: global. Returns all matches
For your input example, in Python it will be:
import re
# The sample addresses from the question, one per line
data = "xxx#site.co.uk\nxxx#site.uk\nxxx#site.me.uk"
# re.M makes $ match at the end of every line; findall plays the role of the g modifier
m = re.compile(r'\.([^.\n\s]*)$', re.M)
f = m.findall(data)
print(f)
Output:
['uk', 'uk', 'uk']
Hope this helps.

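As myemail#com is a valid address (the domain does not have to contain a dot), you can use:
#.*?([^.]+)$
The lazy .*? matters here: a greedy .* would swallow almost the whole domain and leave only the last character for the capturing group.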

You don't need regex. This would always give you 'uk' in your examples:
>>> url = 'foo#site.co.uk'
>>> url.split('.')[-1]
'uk'

Wouldn't a simple .*\.(\w+) work?
You can add more validation for the "#" to the regular expression if needed.
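As a minimal sketch, the regex and split approaches from the answers above can be compared side by side. The helper names are hypothetical, and # is kept as the separator exactly as it is written in the question:

```python
import re

# Hypothetical helpers; '#' is the separator used in the question's examples.
TLD_RE = re.compile(r'#(?:.*\.)?([^.\s]+)$')

def tld_by_regex(address):
    """Return the last dot-separated label after '#', or None if nothing matches."""
    match = TLD_RE.search(address)
    return match.group(1) if match else None

def tld_by_split(address):
    """Same idea without a regex: keep what follows '#', then what follows the last dot."""
    domain = address.rpartition('#')[2]
    return domain.rpartition('.')[2]

for address in ['xxx#site.co.uk', 'xxx#site.uk', 'xxx#site.me.uk', 'myemail#com']:
    print(address, tld_by_regex(address), tld_by_split(address))
```

Both helpers agree on the four sample addresses. They differ on input without a # separator, where the regex version returns None and the split version still returns the last label.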

Related

How to limit list of string is pattern with regex?

I composed a regex pattern and tried to validate multiple strings with it. The pattern looks fine according to the regex documentation, but for some reason some invalid strings are not rejected. Can anyone point out my mistake?
test use case
This is the test use case for one input string:
import re
usr_pat = r"^\$\w+_src_username_\w+$"
u_name = '$ini_src_username_cdc_char4ec_pits'
m = re.match(usr_pat, u_name, re.M)
if m:
    print("Valid username:", m.group())
else:
    print("ERROR: Invalid user_name:\n", u_name)
I expect this to report an error, because the input string must start with a $ sign, then one string \w+, then _, then src, then _, then username, then _, and must end with exactly one string \w+. That is how I composed my pattern, but for some reason the input is not parsed the way I expect. Did I miss something here?
desired output
this is valid and invalid input:
valid:
$ini_src_usrname_ajkc2e
$ini_src_password_ajkc2e
$ini_src_conn_url_ajkc2e
invalid:
$ini_src_usrname_ajkc2e_chan4
$ini_src_password_ajkc2e_tst1
$ini_smi_src_conn_url_ajkc2e_tst2
ini_smi_src_conn_url_ajkc2e_tst2
$ini_src_usrname_ajkc2e_chan4_jpn3
According to the regex documentation, r"^\$\w+_src_username_\w+$" should capture the logic that I want, but it does not work for all my test cases. What did I miss here? Thanks.

The \w character class also matches underscores and numbers:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
(https://docs.python.org/3/library/re.html#regular-expression-syntax).
So the final \w+ matches the entirety of cdc_char4ec_pits.
I think you are looking for [a-zA-Z0-9], which will not match underscores:
usr_pat = r"^\$[a-zA-Z0-9]+_src_username_[a-zA-Z0-9]+$"
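A short sketch of how the corrected pattern behaves. The sample names are made up to follow the question's naming scheme, and is_valid is a hypothetical helper:

```python
import re

# Pattern from the answer above: alphanumeric segments that cannot contain '_'
usr_pat = re.compile(r"^\$[a-zA-Z0-9]+_src_username_[a-zA-Z0-9]+$")

# Made-up sample names that follow the question's naming scheme
candidates = [
    "$ini_src_username_ajkc2e",            # one trailing segment
    "$ini_src_username_cdc_char4ec_pits",  # extra '_' segments at the end
    "ini_src_username_ajkc2e",             # missing leading '$'
]

def is_valid(name):
    """True when the whole name fits the pattern."""
    return usr_pat.match(name) is not None

for name in candidates:
    print(name, "->", "valid" if is_valid(name) else "invalid")
```

Only the first name is accepted. The second is rejected because the trailing segment may no longer contain underscores, and the third because the leading $ is missing.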

\w+
First: \w matches one of the following:
1- one letter from a to z, or from A to Z
OR
2- one digit from 0 to 9
OR
3- an underscore (_)
Second: the plus (+) sign after \w means that the previous token is matched between one and unlimited times.
So if my regex pattern is: r"^\$\w+$"
it would match the string: '$ini_src_username_cdc_char4ec_pits'
1- ^\$ matches the dollar sign $ at the beginning of the string.
2- \w+ first matches the character i of the word ini and, because of the + sign, continues with the character n and the second i. The underscore after ini is matched as well, because \w matches an underscore, not just a digit or a letter. In the same way src, the underscore after it, and username are matched, and so on until the whole string has been matched.
If by "string" you mean letters and digits only, such as "bla123", "123455" or "BLAbla", then you can use something like [a-zA-Z0-9]+ instead of \w+.
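The difference between the two character classes can be checked directly. This is a small sketch; the list of sample characters is arbitrary:

```python
import re

# Keep only the characters that \w accepts on its own
samples = ['a', 'Z', '7', '_', '-', '$', ' ']
word_chars = [ch for ch in samples if re.fullmatch(r'\w', ch)]
print(word_chars)

# \w+ runs straight through underscores; [a-zA-Z0-9]+ stops at the first one
tail = 'cdc_char4ec_pits'
with_underscores = re.match(r'\w+', tail).group()
alnum_only = re.match(r'[a-zA-Z0-9]+', tail).group()
print(with_underscores)
print(alnum_only)
```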

How to match numeric characters with no white space following

I need to match lines in text document where the line starts with numbers and the numbers are followed by nothing.... I want to include numbers that have '.' and ',' separating them.
Currently, I have:
import re
p = re.compile(r'\$?\s?[0-9]+')
# letter holds the lines of the text document
for i, line in enumerate(letter):
    m = p.match(line)
    if m is not None:
        print(m)
        print(line)
Which gives me this:
"15,704" and "416" -> this is good, I want this
but also this:
"$40 million...." -> I do not want to match this line or any line where the numbers are followed by words.
I've tried:
p = re.compile(r'\$?\s?[0-9]+[ \t\n\r\f\v]')
But it doesn't work. One reason is that it turns out there is no white space after the numbers I'm trying to match.
Appreciate any tips or tricks.

If you want to match the whole string with a regex, you have 2 choices:
Either call re.fullmatch(pattern, string) (note full in the function name). It tries to match the whole string and nothing less.
Or put a $ anchor at the end of your regex and call re.match(pattern, string). It tries to find a match from the start of the string.
Actually, you could also add ^ at the start of the regex and call re.search(pattern, string), but it would be a very strange combination.
I also have a remark about how you specified your conditions, which may be incomplete: you gave e.g. the $40 million string and stated that the only reason to reject it is the space and letters after $40.
So actually you should have written that you want to match a string:
Possibly starting with $.
After the $ there can be a space (maybe, I'm not sure).
Then there can be a sequence of digits, dots or commas.
And nothing more.
One more remark concerning Python literals: apparently you have forgotten to prefix the pattern with r. If you use an r-string literal, you do not have to double the backslashes inside.
So I think the most natural solution is to call the function devoted to matching the whole string (i.e. fullmatch), without adding start / end anchors, and the whole script can be:
import re
pat = re.compile(r'(?:\$\s?)?[\d,.]+')
lines = ["416", "15,704", "$40 million"]
for line in lines:
    if pat.fullmatch(line):
        print(line)
Details concerning the regex:
(?: - A non-capturing group.
\$ - Consisting of a $ char.
\s? - And an optional space.
)? - End of the non-capturing group; the ? states that the whole group is optional.
[\d,.]+ - A sequence of digits, commas and dots (note that between [ and ] the dot represents itself, so no backslash is needed).
If you would like to reject strings like 2...5 or 3.,44 (no consecutive dots or commas allowed), change the last part of the above regex to:
[\d]+(?:[,.][\d]+)*
Details:
[\d]+ - A sequence of digits.
(?: - A non-capturing group.
[,.] - Either a comma or a dot (single).
[\d]+ - Another sequence of digits.
)* - End of the non-capturing group; it may occur zero or more times.
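Putting the stricter variant together with fullmatch could look like the sketch below. The helper name is_plain_number is hypothetical; the first three sample lines come from the question and the rest are made up:

```python
import re

# Optional '$' (with optional space), then digit groups separated by single ',' or '.'
strict_pat = re.compile(r'(?:\$\s?)?\d+(?:[,.]\d+)*')

def is_plain_number(line):
    """True when the whole line is a number and nothing else."""
    return strict_pat.fullmatch(line) is not None

for line in ["416", "15,704", "$40 million", "2...5", "3.,44", "$ 40"]:
    print(repr(line), is_plain_number(line))
```

Lines read from a file usually still carry their trailing newline, which makes fullmatch fail, so it is probably worth calling line.strip() before the check.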

With a little modification to your code:
import re
letter = ["15,704", "$40 million"]
p = re.compile(r'^\d{1,3}([\.,]\d{3})*$')  # Digit groups of three separated by commas or points
for i, line in enumerate(letter):
    m = p.match(line)
    if m:
        print(line)
Output:
15,704

You could use the following regex:
import re
pattern = re.compile(r'^[0-9,.]+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^[0-9,.]+\s*$ matches any run of digits, commas and dots, followed by zero or more whitespace characters. If you want to match only numbers with at most one , or . use the following pattern: '^\d+[,.]?\d+\s*$', code:
import re
pattern = re.compile(r'^\d+[,.]?\d+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^\d+[,.]?\d+\s*$ matches everything that starts with a group of digits (\d+), followed by an optional , or . ([,.]?), followed by another group of digits, with optional trailing whitespace (\s*). Note that it requires at least two digits in total, so a single-digit line such as 5 is not matched.
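One edge case is worth checking: because the pattern above contains two \d+ groups, it needs at least two digits. The sketch below compares it with an assumed variant (not from the answer) in which the separator and the second digit group are optional together:

```python
import re

# Pattern from the answer above: the two \d+ groups need at least two digits in total
two_groups = re.compile(r'^\d+[,.]?\d+\s*$')
# Assumed variant: the separator and the second group are optional together
optional_tail = re.compile(r'^\d+(?:[,.]\d+)?\s*$')

for line in ["5", "416", "15,704", "1,2,3", "$40 million...."]:
    print(repr(line), bool(two_groups.match(line)), bool(optional_tail.match(line)))
```

The two patterns agree on every sample except "5", which only the variant accepts.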

Regex to check if it is exactly one single word

I am basically trying to match a string pattern (wildcard match).
Please look at this carefully:
* (star) means exactly one word.
This is not a regex pattern, it is a convention.
So, if there are patterns like:
*.key - '.key' is preceded by exactly one word (a word containing no dots)
*.key.* - '.key.' is preceded and followed by exactly one word containing no dots
key.* - 'key.' is followed by exactly one word
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key".
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in the pattern, I have to replace it with a regex that means exactly one word (a-zA-Z0-9_). Can anyone please help me do this in Python?

To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS (sre_parse is an internal, undocumented module that is deprecated since Python 3.11).
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
import sre_parse
def convert_to_regexp(pattern):
    # Escape every special character except '*', which has its own meaning here
    special_characters = set(sre_parse.SPECIAL_CHARS)
    special_characters.remove('*')
    safe_pattern = ''.join(['\\' + c if c in special_characters else c for c in pattern])
    # '*' stands for exactly one word
    return safe_pattern.replace('*', '\\w+')
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend looking for the output pattern with re.search or re.findall, you might want to wrap the re pattern between \b boundary characters.

The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
Plus, your samples indicate you are going to match whole strings, which in turn means your pattern should match from the beginning (^) to the end ($) of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth.
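The conversion rules above can be sketched with the documented re.escape function. The helper names are hypothetical, and re.fullmatch is used here in place of explicit ^ and $ anchors:

```python
import re

def wildcard_to_regex(pattern):
    """Translate the '*' convention into a compiled regex (hypothetical helper)."""
    # Escape the literal parts, then join them with \w+ where each '*' was
    literal_parts = [re.escape(part) for part in pattern.split('*')]
    return re.compile(r'\w+'.join(literal_parts))

def wildcard_match(pattern, text):
    """True when the whole text fits the wildcard pattern."""
    # fullmatch plays the role of the ^ and $ anchors
    return wildcard_to_regex(pattern).fullmatch(text) is not None

print(wildcard_match('*.key', 'door.key'))
print(wildcard_match('*.key', 'brown.door.key'))
print(wildcard_match('*.key.*', 'brown.key.door'))
print(wildcard_match('*.key.*', 'brown.iron.key.door'))
```

Splitting on * before escaping means the star never has to be un-escaped afterwards, and any other special character in the pattern, such as parentheses, is treated literally.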

^ anchors the match at the start of the string.
$ anchors the match at the end of the string.
\s means a whitespace character.
\S means a non-whitespace character.
+ means one or more repetitions of the preceding token.
Now, you want to match just a single word, meaning a string that consists only of non-whitespace characters from start to end. So, the required regular expression is:
^\S+$

You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key$, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods," would allow whitespace characters (which of course aren't periods), so using \w+, "any number of 'word characters'" like the other answers is better.

Split string at capital letter but only if no whitespace

Set-up
I've got a string of names which need to be separated into a list.
Following this answer, I have,
import re
string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
re.findall('[A-Z][a-z]*', string)
where the last line gives me,
['Kreuzberg', 'Lichtenberg', 'Neuk', 'Prenzlauer', 'Berg']
Problems
1) Whitespace is ignored
'Prenzlauer Berg' is actually one name, but the code splits it according to the 'split-at-capital-letter' rule.
How do I make sure it does not split at a capital letter if the preceding character is whitespace?
2) Special characters are not handled well
The code cannot handle 'ö'. How do I include such 'German' characters?
I.e. I want to obtain,
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']

You can use positive and negative lookbehinds and just list the umlauts explicitly:
>>> string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
>>> re.findall(r'(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*', string)
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
(?<!\s)...: matches ... that is not preceded by \s
(?<=\s)...: matches ... that is preceded by \s
(?:...): non-capturing group so as to not mess with the findall results

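This works
import re
string = "KreuzbergLichtenbergNeuköllnPrenzlauer Berg"
# [a-ü] is a code point range from a to ü, so it covers ä, ö, ß and ü (and other characters in between)
pattern = r"[A-Z][a-ü]+\s[A-Z][a-ü]+|[A-Z][a-ü]+"
re.findall(pattern, string)
#>>>['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']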
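As a regex-free alternative to the answers above, the splitting rule can also be written as a plain loop. This is only a sketch with a hypothetical function name; it relies on str.isupper being Unicode-aware, so Ä, Ö and Ü do not have to be listed:

```python
def split_names(text):
    """Split before every capital letter that is not preceded by whitespace."""
    if not text:
        return []
    names = []
    start = 0
    for i in range(1, len(text)):
        # A capital letter starts a new name unless it follows whitespace
        if text[i].isupper() and not text[i - 1].isspace():
            names.append(text[start:i])
            start = i
    names.append(text[start:])
    return names

print(split_names('KreuzbergLichtenbergNeuköllnPrenzlauer Berg'))
```

For the string from the question this yields the four names, with 'Prenzlauer Berg' kept together.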

Using Python Regex to find a phrase between 2 tags

I've got a string, and I want to use a regex to find the characters enclosed between two known patterns: "Cp_6%3A", then some characters, then either "&" and potentially more characters, or no & and just the end of the string.
My code looks like this:
def extract_id_from_ref(ref):
    id = re.search("Cp\_6\%3A(.*?)(\& | $)", ref)
    print(id)
But this isn't producing anything. Any ideas?
Thanks in advance

Note that (\& | $) matches either the & char and a space after it, or a space and end of string (the spaces are meaningful here!).
Use a negated character class [^&]* (zero or more chars other than &) to simplify the regex (no need for an alternation group or lazy dot matching pattern) and then access .group(1):
def extract_id_from_ref(ref):
    m = re.search(r"Cp_6%3A([^&]*)", ref)
    if m:
        print(m.group(1))
Note that neither _ nor % is a special regex metacharacter, so they do not have to be escaped.
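A self-contained sketch of the negated character class approach, returning the value instead of printing it. The reference strings are made up for illustration:

```python
import re

def extract_id_from_ref(ref):
    """Return the text after 'Cp_6%3A' up to the next '&' or the end of the string, else None."""
    m = re.search(r"Cp_6%3A([^&]*)", ref)
    return m.group(1) if m else None

# Made-up reference strings for illustration
print(extract_id_from_ref("ref=Cp_6%3Aabc123&page=2"))
print(extract_id_from_ref("ref=Cp_6%3Axyz789"))
print(extract_id_from_ref("ref=other"))
```

Returning None when the marker is missing lets the caller decide how to handle a reference without an id.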

The problem is that spaces in a regex pattern are also taken into account. Furthermore, in order to put a backslash into the string, you either have to write \\ (two backslashes) or use a raw string.
So you should write:
r"Cp_6\%3A(.*?)(?:\&|$)"
If you then match with:
def extract_id_from_ref(ref):
    id = re.search(r"Cp_6\%3A(.*?)(?:\&|$)", ref)
    print(id)
it should work. The text between the two patterns is then available as id.group(1).
