Regular expression for phone number check does not work - python

I'm trying to use a regular expression for checking phone numbers.
Below is the code I'm using:
phnum ='1-234-567-8901'
pattern = re.search('^\+?\d{0,3}\s?\(?\d{3}\)?[-.\s]?d{3}[-.\s]?d{4}$',phnum,re.IGNORECASE)
print(pattern)
Even for simple numbers it does not seem to work. Can anyone tell me where I'm going wrong?

Here's a potential solution. I'm not great at regex, so I may be missing something.
import re
phone_pattern = re.compile(r"^(\+?\d{0,2}-)?(\d{3})-(\d{3})-(\d{4})$")
phone_numbers = ["123-345-6134",
"1-234-567-8910",
"+01-235-235-2356",
"123-123-123-123",
"1-asd-512-1232",
"a-125-125-1255",
"234-6721"]
for num in phone_numbers:
print(phone_pattern.findall(num))
Output:
[('', '123', '345', '6134')]
[('1-', '234', '567', '8910')]
[('+01-', '235', '235', '2356')]
[]
[]
[]
[]

The immediate problem is that you are missing the \ before the last two d's. Furthermore, the first \s does not match a dash.
I would also strongly encourage r'...' raw strings for all regexes, to keep Python's string parser from evaluating some backslash sequences before they reach the regex engine.
phnum ='1-234-567-8901'
pattern = re.search(
    r'^\+?\d{0,3}[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$',
    phnum, re.IGNORECASE)
print(pattern)
Demo: https://ideone.com/EYtTKZ
More fundamentally, perhaps you should only accept a closing parenthesis if there is an opening parenthesis before it, etc. A common approach is to normalize number sequences before attempting to use them as phone numbers, but it's a bit of a chicken and egg problem. (You don't want to get false positives on large numbers or IP addresses, for example.)
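As a rough illustration of that idea (my own sketch, not a vetted phone-number validator), an alternation such as (?:\(\d{3}\)|\d{3}) accepts the area code with or without parentheses, but rejects a lone opening parenthesis:
import re
pair_pattern = re.compile(r'^\+?\d{0,3}[-.\s]?(?:\(\d{3}\)|\d{3})[-.\s]?\d{3}[-.\s]?\d{4}$')
for candidate in ['1-234-567-8901', '+1 (234) 567 8901', '1-(234-567-8901']:
    print(candidate, bool(pair_pattern.match(candidate)))
# 1-234-567-8901 True
# +1 (234) 567 8901 True
# 1-(234-567-8901 False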

Related

How to extract unknown number of different parts from string with Python regex?

Does anyone know a smart way to extract an unknown number of different parts from a string with Python regex?
I know this question is probably too general to answer clearly, so please let's have a look at the example:
S = "name.surname#sub1.sub2.sub3"
As a result I would like to get the local part and each subdomain separately. Please note that in this sample email address we have three subdomains, but I would like to find a regular expression that can capture any number of them, so please do not rely on that number.
To avoid straying from the point, let's additionally assume that only alphanumeric characters (hence \w), dots and one # are allowed in email addresses.
I tried to solve it myself and found this way:
L = re.findall(r"([\w.]+)(?=#)|(\w+)", S)
for i in L:
    if i[0] == '': print i[1],
    else: print i[0],
# output: name.surname sub1 sub2 sub3
But it doesn't look nice to me. Does anyone know a way to achieve this with one regex and without any loop?
Of course, we can easily do it without regular expressions:
L = S.split('#')
localPart = L[0] # name.surname
subdomains = str(L[1]).split('.') # ['sub1', 'sub2', 'sub3']
But I am interested in how to figure it out with regexes.
[EDIT]
Uff, finally I figured this out, here is the nice solution:
S = "name.surname#sub1.sub2.sub3"
print re.split(r"#|\.(?!.*#)", S) # ['name.surname', 'sub1', 'sub2', 'sub3']
S = "name.surname.nick#sub1.sub2.sub3.sub4"
print re.split(r"#|\.(?!.*#)", S) # ['name.surname.nick', 'sub1', 'sub2', 'sub3', 'sub4']
Perfect output.
If I am understanding your request correctly, you want to find each section in your sample email address, without the periods. What you are missing in your sample regex snippet is re.compile. For example:
import re
s = "name.surname#sub1.sub2.sub3"
r = "\w+"
r2 = re.compile(r)
re.findall(r2,s)
This looks for the r2 regex object in the string s and outputs ['name', 'surname', 'sub1', 'sub2', 'sub3'].
Basically you can use the fact that when there's a capture group in the pattern, re.findall returns only the content of that capture group rather than the whole match:
>>> re.findall(r'(?:^[^#]*#|\.)([^.]*)', s)
['sub1', 'sub2', 'sub3']
Obviously the email format can be more complicated than your example string.

How do I strip patterns or words from the end of the string backwards?

I have a string like this:
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar><foo>
I would like to strip the first 3 opening and the last 3 closing tags from the string. I do not know the tag names in advance.
I can strip the first 3 tags with re.sub(r'<[^<>]+>', '', in_str, 3). How do I strip the closing tags? What should remain is:
<v1>aaa<b>bbb</b>ccc</v1>
I know I could maybe 'do it right', but I actually do not wish to do xml or html parsing for my purpose, which is to help me visualize the xml representation of some classes.
Instead, I realized that this problem is interesting. It seems I cannot simply search backwards with regex, i.e. right to left, because that seems unsupported:
If you mean, find the right-most match of several (similar to the
rfind method of a string) then no, it is not directly supported. You
could use re.findall() and choose the last match but if the matches can
overlap this may not give the correct result.
But .rstrip is not good with words, and won't do patterns either.
I looked at Strip HTML from strings in Python but I only wish to strip up to 3 tags.
What approach could be used here? Should I reverse the string (ugly in itself and because of the '<>'s)? Do tokenization (why not parse, then)? Or create static closing tags based on the left-to-right match?
Which strategy to follow to strip the patterns from the end of the string?
The simplest would be to use old-fashioned string splitting and limit the split:
in_str.split('>', 3)[-1].rsplit('<', 3)[0]
Demo:
>>> in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar><foo>'
>>> in_str.split('>', 3)[-1].rsplit('<', 3)[0]
'<v1>aaa<b>bbb</b>ccc</v1>'
str.split() and str.rsplit() with a limit will split the string from the start or the end up to the limit times, letting you select the remainder unsplit.
You've already got practically the whole solution. re can't go backwards, but you can:
in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
in_str = re.sub(r'<[^<>]+>', '', in_str, 3)
in_str = in_str[::-1]
print in_str
in_str = re.sub(r'>[^<>]+/<', '', in_str, 3)
in_str = in_str[::-1]
print in_str
<v1>aaa<b>bbb</b>ccc</v1>
Note the reversed regex for the reversed string, but then it goes back-to-front.
Of course, as mentioned, this is way easier with a proper parser:
in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
from lxml.html import etree
ix = etree.fromstring(in_str)
print etree.tostring(ix[0][0][0])
<v1>aaa<b>bbb</b>ccc</v1>
I would look into regular expressions and use one such pattern to do a split:
http://docs.python.org/3/library/re.html?highlight=regex#re.regex.split
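For what it's worth, here is one way that could look (my sketch, not the answerer's code): splitting on the tag pattern with a capture group keeps the tags themselves as list items, so the outer three on each side can simply be dropped.
import re
in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
# The capture group makes re.split keep the tags in the result list.
parts = [p for p in re.split(r'(<[^<>]+>)', in_str) if p]
print(''.join(parts[3:-3]))  # <v1>aaa<b>bbb</b>ccc</v1>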
Sorry, can't comment, but will give it as an answer.
in_str.split('>', 3)[-1].rsplit('<', 3)[0] will work for the given example
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>, but not for
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo><another>test</another>.
You should just be aware of this.
To solve the counter-example I provided, you will have to track the state (or count) of tags and verify that you match the correct pairs.
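A rough sketch of that tag-tracking idea (mine, and only lightly tested): peel off an outer pair only when the opening and closing tag names actually match.
import re
def strip_outer_tags(s, depth=3):
    # Tokenize into tags and text, then peel off up to `depth` outer pairs,
    # but only while the first and last tokens form a matching open/close pair.
    tokens = [t for t in re.split(r'(<[^<>]+>)', s) if t]
    for _ in range(depth):
        if (len(tokens) >= 2
                and tokens[0].startswith('<') and not tokens[0].startswith('</')
                and tokens[-1].startswith('</')
                and tokens[0][1:-1] == tokens[-1][2:-1]):
            tokens = tokens[1:-1]
        else:
            break
    return ''.join(tokens)
print(strip_outer_tags('<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'))
# <v1>aaa<b>bbb</b>ccc</v1>
print(strip_outer_tags('<foo>x</foo><another>test</another>'))
# <foo>x</foo><another>test</another>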

Python Regular Expression - right-to-left

I am trying to use regular expressions in Python to match the frame number component of an image file in a sequence of images. I want to come up with a solution that covers a number of different naming conventions. If I put it into words, I am trying to match the last instance of one or more numbers between two dots (e.g. .0100.). Below is an example of how my current logic falls down:
import os
import re
def sub_frame_number_for_frame_token(path, token='#'):
    folder = os.path.dirname(path)
    name = os.path.basename(path)
    pattern = r'\.(\d+)\.'
    matches = list(re.finditer(pattern, name) or [])
    if not matches:
        return path
    # Get last match.
    match = matches[-1]
    frame_token = token * len(match.group(1))
    start, end = match.span()
    apetail_name = '%s.%s.%s' % (name[:start], frame_token, name[end:])
    return os.path.join(folder, apetail_name)
# Success
eg1 = 'xx01_010_animation.0100.exr'
eg1 = sub_frame_number_for_frame_token(eg1) # result: xx01_010_animation.####.exr
# Failure
eg2 = 'xx01_010_animation.123.0100.exr'
eg2 = sub_frame_number_for_frame_token(eg2) # result: xx01_010_animation.###.0100.exr
I realise there are other ways in which I can solve this issue (I have already implemented a solution where I split the path at the dots and take the last item that is a number), but I am taking this opportunity to learn something about regular expressions. It appears the regular expression creates the groups from left-to-right and cannot use characters in the pattern more than once. Firstly, is there any way to search the string from right-to-left? Secondly, why doesn't the pattern find two matches in eg2 (123 and 0100)?
Cheers
finditer will return an iterator "over all non-overlapping matches in the string".
In your example, the last . of the first match will "consume" the first . of the second. Basically, after making the first match, the remaining string of your eg2 example is 0100.exr, which doesn't match.
To avoid this, you can use a lookahead assertion (?=), which doesn't consume the first match:
>>> pattern = re.compile(r'\.(\d+)(?=\.)')
>>> pattern.findall(eg1)
['0100']
>>> pattern.findall(eg2)
['123', '0100']
>>> eg3 = 'xx01_010_animation.123.0100.500.9000.1234.exr'
>>> pattern.findall(eg3)
['123', '0100', '500', '9000', '1234']
# and "right to left"
>>> pattern.findall(eg3)[::-1]
['1234', '9000', '500', '0100', '123']
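If you want to plug that back into the original function, one possible adaptation (my sketch, not part of the answer above) is to take the span of the digits group only, since the trailing dot is no longer consumed by the lookahead:
import os
import re
def sub_frame_number_for_frame_token(path, token='#'):
    folder = os.path.dirname(path)
    name = os.path.basename(path)
    # Lookahead so adjacent candidates like .123.0100. both match.
    matches = list(re.finditer(r'\.(\d+)(?=\.)', name))
    if not matches:
        return path
    match = matches[-1]          # last frame-number candidate
    start, end = match.span(1)   # span of the digits only
    new_name = name[:start] + token * len(match.group(1)) + name[end:]
    return os.path.join(folder, new_name)
print(sub_frame_number_for_frame_token('xx01_010_animation.123.0100.exr'))
# xx01_010_animation.123.####.exr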
My solution uses a very simple hackish way of fixing it. It reverses the string path in the beginning of your function and reverses the return value at the end of it. It basically uses regular expressions to search the backwards version of your given strings. Hackish, but it works. I used the syntax shown in this question to reverse the string.
import os
import re
def sub_frame_number_for_frame_token(path, token='#'):
    path = path[::-1]
    folder = os.path.dirname(path)
    name = os.path.basename(path)
    pattern = r'\.(\d+)\.'
    matches = list(re.finditer(pattern, name) or [])
    if not matches:
        return path[::-1]  # undo the reversal before returning
    # Get last match.
    match = matches[-1]
    frame_token = token * len(match.group(1))
    start, end = match.span()
    apetail_name = '%s.%s.%s' % (name[:start], frame_token, name[end:])
    return os.path.join(folder, apetail_name)[::-1]
# Success
eg1 = 'xx01_010_animation.0100.exr'
eg1 = sub_frame_number_for_frame_token(eg1) # result: xx01_010_animation.####.exr
# Previously a failure, now fixed
eg2 = 'xx01_010_animation.123.0100.exr'
eg2 = sub_frame_number_for_frame_token(eg2) # result: xx01_010_animation.123.####.exr
print(eg1)
print(eg2)
I believe the problem is that finditer returns only non-overlapping matches. Because both '.' characters are part of the regular expression, it doesn't consider the second dot as a possible start of another match. You can probably use the lookahead construct (?=...) to match the second dot without consuming it, i.e. (?=\.).
Because of the way regular expressions work, I don't think there is an easy way to search right-to-left (though I suppose you could reverse the string and write the pattern backwards...).
If all you care about is the last \.(\d+)\., then anchor your pattern from the end of the string and do a simple re.search():
\.(\d+)\.(?:.*?)$
where (?:.*?) is non-capturing and non-greedy, so it will consume as few characters as possible between your real target and the end of the string, and those characters will not show up in matches.
(Caveat 1: I have not tested this. Caveat 2: That is one ugly regex, so add a comment explaining what it's doing.)
UPDATE: Actually I guess you could just do ^.*(\.\d+\.) and let the implicitly greedy .* match as much as possible (including matches that occur earlier in the string) while still matching your group. That makes for a simpler regex, but I think it makes your intentions less clear.
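A quick check of that greedy idea (my test, not the answerer's; note the + on the digits):
import re
name = 'xx01_010_animation.123.0100.exr'
m = re.search(r'^.*\.(\d+)\.', name)
print(m.group(1))  # 0100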

Checking and removing extra symbols

I'm interested in removing extra symbols from strings in Python.
What would be the most efficient and Pythonic way to do that? Is there some grammar module?
My first idea would be to locate the most nested text and walk out to the left and the right, counting the opening and closing symbols. Then I would remove the extra occurrences of whichever symbol has too high a count.
An example would be this string
text = "(This (is an example)"
You can clearly see that the first parenthesis is not balanced by another one. So I want to delete it.
text = "This (is an example)"
The solution has to be independent of the position of the parentheses.
Another example could be:
text = "(This (is another example) )) (to) explain) the question"
That would become:
text = "(This (is another example) ) (to) explain the question"
Had to break this into an answer for formatting. Check Python's regular expression module.
If I'm understanding what you are asking, look at re.sub. You can use a regular expression to find the character you'd like to remove, and replace them with an empty string.
Suppose we want to remove all instances of '.', '&', and '*'.
>>> import re
>>> s = "abc&def.ghi**jkl&"
>>> re.sub('[\.\&\*]', '', s)
'abcdefghijkl'
If the pattern to be matched is larger, you can use re.compile and pass that as the first argument to sub.
>>> r = re.compile('[\.\&\*]')
>>> re.sub(r, '', s)
'abcdefghijkl'
Hope this helps.
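The re.sub approach removes every occurrence of a symbol; for the balance-aware removal described in the question, a rough sketch (mine, not from the answer above) of the counting idea could look like this: pair each ')' with the most recent unmatched '(' and drop whatever never gets paired.
def remove_unbalanced(text):
    keep = [True] * len(text)
    stack = []  # indices of '(' still waiting for a ')'
    for i, ch in enumerate(text):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            if stack:
                stack.pop()
            else:
                keep[i] = False  # ')' with no matching '('
    for i in stack:              # '(' that never got closed
        keep[i] = False
    return ''.join(ch for i, ch in enumerate(text) if keep[i])
print(remove_unbalanced("(This (is an example)"))
# This (is an example)
print(remove_unbalanced("(This (is another example) )) (to) explain) the question"))
# (This (is another example) ) (to) explain the question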

How to match the following regex python?

How to match the following with regex?
string1 = '1.0) The Ugly Duckling (TUD) (10 Dollars)'
string2 = '1.0) Little 1 Red Riding Hood (9.50 Dollars)'
I am trying the following:
groupsofmatches = re.match('(?P<booknumber>.*)\)([ \t]+)?(?P<item>.*)(\(.*\))?\(.*?((\d+)?(\.\d+)?).*([ \t]+)?Dollars(\))?', string1)
The issue is that when I apply it to string2 it works fine, but when I apply the expression to string1 I am unable to get the m.group(name) values because of the "(TUD)" part. I want to use a single expression that works for both strings.
I expect:
booknumber = 1.0
item = The Ugly Duckling (TUD)
Your problem is that .* matches greedily, and it may be consuming too much of the string. Printing all of the match groups will make this more obvious:
import re
string1 = '1.0) The Ugly Duckling (TUD) (10 Dollars)'
string2 = '1.0) Little 1 Red Riding Hood (9.50 Dollars)'
result = re.match(r'(.*?)\)([ \t]+)?(?P<item>.*)\(.*?(?P<dollaramount>(\d+)?(\.\d+)?).*([ \t]+)?Dollars(\))?', string1)
print repr(result.groups())
print result.group('item')
print result.group('dollaramount')
Changing them to *? makes the match minimal (non-greedy).
This can be expensive in some RE engines, so you can also write e.g. \([^)]*\) to match a whole parenthesised group. If you're not processing a lot of text it probably doesn't matter.
btw, you should really use raw strings (ie r'something') for regexps, to avoid surprising backslash behaviour, and to give the reader a clue.
I see you had this group (\(.*?\))? which presumably was cutting out the (TUD), but if you actually want that in the title, just remove it.
You could impose some heavier restrictions on your repeated characters:
groupsofmatches = re.match('([^)]*)\)[ \t]*(?P<item>.*)\([^)]*?(?P<dollaramount>(?:\d+)?(?:\.\d+)?)[^)]*\)$', string1)
This will make sure that the numbers are taken from the last set of parentheses.
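A quick check of that pattern (my test; note that the item group keeps a trailing space, which you may want to strip):
import re
string1 = '1.0) The Ugly Duckling (TUD) (10 Dollars)'
m = re.match(r'([^)]*)\)[ \t]*(?P<item>.*)\([^)]*?(?P<dollaramount>(?:\d+)?(?:\.\d+)?)[^)]*\)$', string1)
print(m.group(1), m.group('item').strip(), m.group('dollaramount'))
# 1.0 The Ugly Duckling (TUD) 10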
I would write it as:
num, name, value = re.match(r'(.+?)\) (.*?) \(([\d.]+) Dollars\)', s2).groups()
This is how I would do it:
(?P<booknumber>\d+(?:\.\d+)?)\)\s+(?P<item>.*?)\s+\(\d+(?:\.\d+)?\s+Dollars\)
I suggest you use the regex pattern
(?P<booknumber>[^)]*)\)\s+(?P<item>.*\S)\s+\((?!.*\()(?P<amount>\S+)\s+Dollars?\)
