Regex working wrong, matching unexpected things - python

I have this regex:
[\(\+\[]?[0-9]([\-\)\.\/-\]]?\s?\(?[0-9\s\)]){8,20}?
It must match only phone numbers, but instead it also matches things like:
[95.86.22.137]
95.86.22.137
(192.168.1.94)
274.1363525390625px;">
2014-8-720:32:45
Can someone help me correct this regex please ?

If you really want to do this right, I would start over and first establish the pattern you intend to match, which is not evident by your regex and wasn't in your question - only what you didn't want. You need to look at it as "what do I want?", not "what am I trying to exclude?". Do it as "what do I want?" and what you require will eliminate all those nasty other possibilities.
You must first decide what you will accept as a "valid phone number". Remember that even within the NANP (North American Numbering Plan), there are a few different formats that go like:
XXX-XXX-XXXX or
XXX XXX-XXXX or
1-XXX-XXX-XXXX or
(XXX) XXX-XXXX or
+(XXX) XXX-XXXX or
1 (XXX) XXX-XXXX
And all of those are valid numbers, so you'd have to decide the format you'll accept. And then there are different formats throughout the rest of the world of varying lengths from 9 (Portugal) to 13 (South Korea) digits, including the country and international code. So you have to decide:
Will you be accepting only NANP numbers, or other numbers outside of that standard?
Will you accept "+" or make them write the international code? Will the code for the countries your user will be inputting have a set number of digits, if you require the code? If they enter the code, will your regex be able to handle (if acceptable) or red-flag (if not acceptable) that?
Will you enforce parenthesis around the area code, simply allow them (make parenthesis optional), or outright refuse them?
And on that last one, note that different countries have parenthesis in different places in their numbers, i.e. Mexico has 2 digit area codes (and is not in NANP, btw).
And remember, each time you make these kinds of decisions that require a character somewhere, you are negating other possible, valid phone numbers, unless you allowed the other valid characters in that slot, too. That is why there is no one-size fits all solution to your problem. For that reason, many will tell you to just strip out "+", "(", ")", "-", and then count the digits. But this fails if you consider "1" in a NANP number to be required, but someone doesn't include it (because it's generally optional within the NANP), or when different countries have different numbers of digits in their numbers -- even within their own country, like New Zealand.
There is this so-called comprehensive guide: A comprehensive regex for phone number validation
But I found it horribly lacking in addressing things like how to make a person enter "+" vs. "1" and a space (for NANP numbers), how to enforce parenthesis, hyphens, etc. It gives you the regexs rather than explaining how to get you there. Hence my "blog", here, for an answer.
The following is my strict NANP regex I use that will accept:
+(XXX) XXX-XXXX
1 (XXX) XXX-XXXX
(XXX) XXX-XXXX
It requires parenthesis and the hyphen, which for NANP numbers, I believe, gives a lot of good flexibility while still conforming to a standard. I do not deal in international (outside NANP) numbers, fortunately:
/^(\+|1\s)?[(][2-9]\d{2}[)][\s][2-9]\d{2}-\d{4}$/
/^ = Match at the start of word; basically just indicates the start of the expression
(\+|1\s)? group
Parenthesis are used to offset the group, and say that any of the characters inside are optional, via the ? at the end, and allow the "or" condition inside (see pipe character)
\+ = Escaped "+", to allow matching on the plus sign (must be escaped as it is a key character in regex, using the backslash)
| = Pipe character, to say it should match either what is to the left or to the right, within the group
1\s = Requires the number 1 and a space character. Space is not made required by [] - this has not worked for me, though I've seen other posts that seem to indicate it does. It is \s.
[(] = This is how you indicate that the open parenthesis is required.
[2-9]\d{2} group
[2-9] = This is to make the expression match on a digit 2-9. This is because in NANP, 0 and 1 are invalid numbers in the start of an area code (first set of 3 numbers) or in the phone exchange (second set of 3 numbers).
\d{2} = This says to allow 2 digits, from 0-9. This is shorthand for [0-9][0-9].
For a three-digit group from 000-999, you would simply say: \d{3}
[)] = This is how you indicate that the closed parenthesis is required.
[\s] = This will require a space.
[2-9]\d{2}-\d{4} group
This first part, before the hyphen, is the same as before.
Setting the hyphen will require it, here. If you put -?, it would be optional.
\d{4} = This says to allow 4 digits, from 0-9. This is shorthand for [0-9][0-9][0-9][0-9]
$/ = Says to match to the end of the word; basically just indicates the end of the expression.
Hopefully this will help you to build your expression.

Related

How to check if the whole input string (real numbers separated by a space) matches a regex in Python?

I have an input string consisting of a sequence of real numbers separated by a single space. It is also acceptable for the string to contain only one real number (no spaces). My goal is to check whether the string structure matches the following (in this order):
optional (0/1): minus (-)
1/more digits
optional (1+): a period and 1/more digits
optional (0+): a group consisting of a space and the first group (the first three bullet points)
It should describe the string completely. If not, it should print an error message and exit.
My current regular expression is ^(-?\d+(\.?\d)*)( \1)*$ which I thought would be okay, but even the first group doesn't match all the real numbers individually. And I need it to check the string from the beginning to the end, including the spaces.
My code for this function looks like this:
import re
def structure_check(string):
structure = r"^(-?\d+(\.?\d)*)( \1)*$"
if re.match(structure,string):
return("OK")
else:
print("Input error")
exit()
It should accept strings like: 15 35 -45 8 -2.3 4564.18 56 etc., but it doesn't correspond to changes in the input (doesn't match) at all. It shouldn't match if there is too many spaces, incorrectly placed . or -, or if there are other characters than digits, periods, dashes (-) and spaces.
I could also do this with just the first group while iterating over a list created by splitting the input string by space, but I would prefer to check it according to my main goal, since I wouldn't have to split the input in the validation function and also to save some more code lines by checking the input alltogether (eg. for excess spaces, or unsupported characters, which I'd have to otherwise check separately).
Sorry if I missed any answered questions, I couldn't find any appropriate for my problem in Python. If you know about any, feel free to link them, please. And thank you, I am a beginner and started learning regex for a project just about yesterday.
You can use:
^((?:[+-]?\d+(?:[.]\d+)?)(?:[ \t]|$))*$
Demo and explantation
I added + to the optional sign. If you only want to match with no sign or -, just remove that from the optional character class.
You could also use an unrolled version to prevent matching a space at the end.
^-?\d+(?:\.\d+)?(?: -?\d+(?:\.\d+)?)*$
Regex demo
The backreference \1 will match exactly what is matched in group 1 and for your pattern will match for example 123 123 123
If you want to repeat the group, you could recurse the first group using the PyPi regex module and (?1)
^(-?\d+(?:\.\d+)?)(?: (?1))*$
See a Python example
Problem is in your regexp, to be specific, in ( \1)* part.
This, described, means: space and string that was matched in group 1 zero or more times
Thus, your regexp will match for the following, for example:
15 15 15
-5.3 -5.3 -5.3 -5.3
And so on.
To fix the regexp, I would replace the group reference with the actual group, like so:
^(-?\d+(\.?\d)*)( -?\d+(\.?\d)*)*$
I would also point out that this regexp allows the numbers to have multiple decimal dots, (e.g. 1.2.3 passes) however I'm not sure if that's intended or not.
In JavaScript you can use the method .test of regex. The regex should work in python.
let ok = /^(([+\-]?\d+(\.\d+)?)( |$))+$/.test("15 35 -45 8 -2.3 4564.18 56");
console.log(ok);
Explanation: (.\d+)? You must make the whole group optional. The number can be followed by a space or the end of a string ( |$). The pattern is repeated throughout the string so I wrapped the entire expression in a group. Insert ^ at the beginning of the regex and $ at the end of the regex to force the regex to check the string completely.

How to make regex that matches a number with commas for every three digits?

I am a beginner in Python and in regular expressions and now I try to deal with one exercise, that sound like that:
How would you write a regex that matches a number with commas for
every three digits? It must match the following:
'42'
'1,234'
'6,368,745'
but not the following:
'12,34,567' (which has only two digits between the commas)
'1234' (which lacks commas)
I thought it would be easy, but I've already spent several hours and still don't have write answer. And even the answer, that was in book with this exercise, doesn't work at all (the pattern in the book is ^\d{1,3}(,\d{3})*$)
Thank you in advance!
The answer in your book seems correct for me. It works on the test cases you have given also.
(^\d{1,3}(,\d{3})*$)
The '^' symbol tells to search for integers at the start of the line. d{1,3} tells that there should be at least one integer but not more than 3 so ;
1234,123
will not work.
(,\d{3})*$
This expression tells that there should be one comma followed by three integers at the end of the line as many as there are.
Maybe the answer you are looking for is this:
(^\d+(,\d{3})*$)
Which matches a number with commas for every three digits without limiting the number being larger than 3 digits long before the comma.
You can go with this (which is a slightly improved version of what the book specifies):
^\d{1,3}(?:,\d{3})*$
Demo on Regex101
I got it to work by putting the stuff between the carrot and the dollar in parentheses like so: re.compile(r'^(\d{1,3}(,\d{3})*)$')
but I find this regex pretty useless, because you can't use it to find these numbers in a document because the string has to begin and end with the exact phrase.
#This program is to validate the regular expression for this scenerio.
#Any properly formattes number (w/Commas) will match.
#Parsing through a document for this regex is beyond my capability at this time.
print('Type a number with commas')
sentence = input()
import re
pattern = re.compile(r'\d{1,3}(,\d{3})*')
matches = pattern.match(sentence)
if matches.group(0) != sentence:
#Checks to see if the input value
#does NOT match the pattern.
print ('Does Not Match the Regular Expression!')
else:
print(matches.group(0)+ ' matches the pattern.')
#If the values match it will state verification.
The Simple answer is :
^\d{1,2}(,\d{3})*$
^\d{1,2} - should start with a number and matches 1 or 2 digits.
(,\d{3})*$ - once ',' is passed it requires 3 digits.
Works for all the scenarios in the book.
test your scenarios on https://pythex.org/
I also went down the rabbit hole trying to write a regex that is a solution to the question in the book. The question in the book does not assume that each line is such a number, that is, there might be multiple such numbers in the same line and there might some kind of quotation marks around the number (similar to the question text). On the other hand, the solution provided in the book makes those assumptions: (^\d{1,3}(,\d{3})*$)
I tried to use the question text as input and ended up with the following pattern, which is way too complicated:
r'''(
(?:(?<=\s)|(?<=[\'"])|(?<=^))
\d{1,3}
(?:,\d{3})*
(?:(?=\s)|(?=[\'"])|(?=$))
)'''
(?:(?<=\s)|(?<=[\'"])|(?<=^)) is a non-capturing group that allows
the number to start after \s characters, ', ", or the start of the text.
(?:,\d{3})* is a non-capturing group to avoid capturing, for example, 123 in 12,123.
(?:(?=\s)|(?=[\'"])|(?=$)) is a non-capturing group that allows
the number to end before \s characters, ', ", or the end of the text (no newline case).
Obviously you could extend the list of allowed characters around the number.

Python regex examples

I suck at Python regex and would love to see some solved examples to help me gain understanding. I am experimenting using http://pyregex.com/ which is great but need some 'good' examples to get me started.
I try to create a set of rules like so:
rules = [('name', r'[a-z]+'),
('operator', r'[+-*\]']
which I have found but not confident enough to create my own regexes for cases like the ones listed below:
match only the = or += or *= characters
match the + character (i.e the operator as seen above) separately from the ++ characters
match any one word after a specific keyword (e.g. int) and any number of space(s) and/or tabs. [edited - initially had followed which was wrong]
For 1. I have tried [\+=|=], for 2. I know the order in the rules is important and for 3. I am completely lost with the [] and on how I can generalize that case to work not just for int, but for float as well.
Any code examples will be greatly appreciated since I am only just starting with Python and coding!
match only the = or += or *= characters
r'[+*]?='
The [+*]?= consists of an optional atom, a character class [+*] that matches either a + or a *, ? - one or zero times, and a literal = symbol. Why not r'\+=|\*=|='? Not only the optional character class solution is shorter, but also it is more efficient: when you use alternation, you always have more redundant backtracking involved. You also need to be attentive to place the alternatives in a correct order, so that the longest appears first (although that does not always guarantee that the longest will match (depends on the branch subpatterns), or the order does not matter if there are anchors on both sides of the alternation group).
match the + character (i.e the operator as seen above) separately from the ++ characters
r'(?<!\+)\+(?!\+)'
This pattern matches a literal + (as it is escaped) and only in case it is neither preceded with another plus (see the negative lookbehind (?<!\+)) nor followed with another plus (see the positive lookahead (?!\+)). The lookarounds are non-consuming, i.e. the regex index remains right before a plus when it checks for a plus in front of it, and after the plus when it checks for a plus after it. The characters (or start/end of string positions) are not returned as part of the match (that is why they are called zero-width, non-capturing patterns).
match any one word after a specific keyword (e.g. int) and any number of space(s) and/or tabs.
r'\bint\b(?=\s+\w+\s+)'
If you read the explanation above, you will recognize another zero-width assertion here: (?=\s+\w+\s+) is a positive lookahead that checks if a whole word int (as \b matches word boundary positions) is followed with 1+ whitespaces, then 1+ word characters, and then again 1+ whitespaces.
The examples provided in the documentation and in the previous answers should get you started in the right path. An additional consideration, since you said you are new to programming and Python, is that regular expressions are an intermediate to advanced topic (depending what you want to do with it) and should be tackled once you have a better grasp of good programming practices and Python's fundamentals.
In any case more information and examples can be found at:
Python Regular Expressions module.

regular expression - partially match

My aim is to find matches in a text where not always all matches are present.
I am trying to collect the phone number, the E-mail and the website of venues from a web site. Only some venues have all three information available but most of them only one or two of them. I tried to write a code. However, it works only if all 3 information are available. Could someone help me what is wrong?
grouped = re.compile('col-right[\s\S]*?' +
'Tel[\s\S]*?([0-9]{0,4}-?[0-9]{3,7}-?[0-9]{0,4}-?[0-9]{0,4})' +
'[\s\S]*?href="http://([\w\W]*?)"' +
'[\s\S]*?href="mailto:([\s\S]*?)">[\s\S]*?</div>')
for match in re.finditer(grouped, text):
print (match.group(1))
print (match.group(2))
print (match.group(3))
Also the digits in the phone numbers are divided with "-" but sometimes there is a space between the "-" and the next set of digits. How can I include that in the code that this space is only occasionally present?
Your logic is good, but it needs a little work.
First of all, you need the phone number. Write a regex for it, and add it to a group: (regex)* the group is marked with (``) and * means that it has to be present 0 or more times.
Write the next regex, add it to another group (emailRegex)* and the third group (website)*.
Instead of * you could also use the ?, once or none at all (as I can see, you used ?.
Now, putting all together, simply mix them with any character in between them
(group1)?.*(emailRegex)?.*(website)*
grup1 matches phone number, followed by any character, email, followed by any character, website. And if one of them is missing, there is no problem at all.
Email regex example: (probably not the most complete one)
([a-zA-Z_]+[a-zA-Z_.-0-9]*#[a-zA-Z0-9]\.[a-z]+])?
This works like this: the email should start with a letter or an underscore _ and it should be followed by lower/upper case, numbers, underscore or a dot ( .) followed by # and letters followed by a dot (notice that I used \. to escape the special any character notation and in the end you add a mix of at least a letter.
works for email#mail.com.
The fact that I put the entire regex in brackets means it is a group and it should appear once or none at all (hence the ?). Between groups, you add .* meaning that in between the phone number/email/address can be any characters.

regex- capturing text between matches

In the following text, I try to match a number followed by ")" and number followed by a period. I am trying to retrieve the text between the matches.
Example:
"1) there is a dsfsdfsd and 2) there is another one and 3) yet another
case"
so I am trying to output: ["there is a dsfsdfsd and", "there is another one and", yet another case"]
I've used this regex: (?:\d)|\d.)
Adding a .* at the end matches the entire string, I only want it to match the words between
also in this string:
"we will give 4. there needs to be another option and 6.99 USD is a
bit amount"
I want to only match the 4. and not the 6.99
Any pointers will be appreciated. Thank you. r
tldr
Regular expressions are tricky beasts and you should avoid them if at all possible.
If you can't avoid them, then make sure you have lots of test cases for all the edge cases that can occur.
Build up your regular expression slowly and systematically, testing your assumptions at every step.
If this code will go intro production, then please write unit tests that explain the thinking process to the poor soul who has to maintain it one day
The long version
Regular expressions are finicky. Your best approach may be to solve the problem a different way.
For example, your language might have a library function that allows you to split up strings using a regular expression to define what comes between the numbers. That will let you get away with writing a simpler regex to match the numbers and brackets/dots.
If you still decide to use regular expressions, then you need to be very structured about how you build up your regular expressions. It's extremely easy to miss edge cases.
So let's break this down piece by piece...
Set up a test environment for quickly experimenting with your regex.
There are lots of options here, depending on your programming language and OS. Ones I sometimes use are:
a Powershell window for testing .Net regexes (NB: the cli gives you a history of past attempts, so you can go back a few steps if you mess things up too badly)
a Python console for testing Python regexes (which are slightly different to .Net regexes in their syntax for named capture groups).
an html page with JavaScript to test the regex
an online or desktop regex tool (I still use the ancient Regular Expression Workbench from Eric Gunnerson, but I'm sure there are better alternatives these days)
Since you didn't specify a language or regex version, I'll assume .Net regular expressions
Create a single test string for testing a wider variety of options.
Your goal is to include as many edge cases as you can think of. Here's what I would use: "ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10."
Note that I've added a few extra cases you didn't mention:
empty strings between two round bracket numbers: "4)" and "5)"
white space string between two round bracket numbers: "5)" and "6)"
empty strings between a round bracket number and a dotted number: "6)" and "10."
empty string after the dotted number "10." at the end of the string
random text and empty space, which should be ignored, before the first number
I'm going to make a few assumptions here, which you will need to vary based on your actual requirements:
You DO want to capture the white space after the dot or round bracket.
You DO want to capture the white space before the next dotted number or round bracket number.
You might have numbers that go beyond 9, so I've included "10" in the test cases.
You want to capture empty strings at the end e.g. after the "10."
NOTES:
Thinking through this test case forces you to be more rigorous about your requirements.
It will also help you be more efficient while you are manually testing your regular expression.
HOWEVER, this is assuming you aren't following a TDD approach. If you are, then you should probably do things a little differently... create unit tests for each scenario separately and get the regex working incrementally.
This test string doesn't cover all cases. For example, there are no newline or tab characters in the test string. Also it can't test for an empty string following a round bracket number at the very end.
First get a regex working that just captures the round brackets and dotted brackets.
Don't worry about the $6.99 edge case yet.
Drop the "(?:" non-capturing group syntax from your regex for now: "\d)|\d."
This doesn't even parse, because you have an unescaped round bracket.
The revised string is "\d\)|\d.", which parses, but which also matches "99" which you probably weren't expecting. That's because you forgot to escape the "."
The revised string is "\d\)|\d\.". This no longer matches "99", but it now matches "0." at the end instead of "10.". That's because it assumes that numbers will be single digit only.
The following string seems to work: "\d+\)|\d+\."
Time to deal with that pesky "$6.99" now...
Modify the regex so that it doesn't capture a floating point number.
You need to use a negative look ahead pattern to prevent a digit being after the decimal point.
Result: "\d+\)|\d+\.(?!\d)"
Count how many matches this produces. You're going to use this number for checking later results.
Hint: Save the regex pattern somewhere. You want to be able to go back to it any time you mess up your regex pattern beyond repair.
If you found a string splitting function, then you should use it now and avoid the complexity that follows. [I've included an example of this at the end.]
Simple is better, but I'm going to continue with the longer solution in the interests of showing an approach to staying in control of regex'es that start getting horribly complicated
Decide how to exclude that pattern
You used the non-capture group pattern in your question i.e. "(?:"
That approach can work. But it's a bit cumbersome, because you need to have a capturing group after it that you will look for instead.
It would be much nicer if your entire pattern matched what you are looking for.
So wrap the number pattern inside a zero-width positive look behind pattern (if your language supports it) i.e. "(?<=".
This checks for the pattern, but doesn't include it in what gets captured.
So now your regex looks like this: "(?<=\d+\)|\d+\.(?!\d))"
Test it!
It might seem silly to test this on its own - all the matches are empty strings.
Do it anyway. You want to sanity check every step of the way.
Make sure that it still produces the same number of matches as in step 4.
Decide how to match the text in between the numbers.
You rightly mention that ".*" will match the entire string, not just the parts in between.
There's a neat trick that allows you to reuse the pattern from step 5 to get the text in between.
Start by just matching the next character
The trick is that you want to match any character unless it's the start of the next number
That sounds like a negative look ahead pattern again: "(?!"
Let X be the pattern you saved in step 4. Matching a single character will look like this: "(?!X)."
You want to match lots of those characters. So put that pattern into a non-capturing group and repeat it: "(?:(?!X).)*"
This assumes you want to capture empty text.
If you're not, then change the "*" to a "+".
Hint: This is such a common pattern that you will want to reuse it in future pasting in different patterns in place of X
I used a non-capturing group instead of a normal group so that you can also embed this pattern in regexes where you do care about the capturing groups
Resulting pattern: "(?:(?!\d+\)|\d+\.(?!\d)).)*"
I suggest testing this pattern on its own to see what it does
Now put parts 5 and 7 together: "(?<=\d+\)|\d+\.(?!\d))(?:(?!\d+\)|\d+\.(?!\d)).)*"
Test it!
Unit tests!
If this is going into production, then please write lots of unit tests that will explain each step of this thought process
Have pity on the poor soul who has to maintain your regex in future!
By rights that person should be you
I suggest putting a note in your calendar to return to this code in 6 months' time and make sure you can still understand it from the unit tests alone!
Refactor
In six months' time, if you can't understand the code any more, use your newfound insight (and incentive) to solve the problem without using regular expressions (or only very simple ones)
Addendum
As an example of using a string splitting function to get away with a simpler regex, here's a solution in Powershell:
$string = 'ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10.'
$pattern = [regex] '\d+\)|\d+\.(?!\d)'
$string -split $pattern | select-object -skip 1
Judging by the task you have, it might be easier to match the delimiters and use re.split (as also pointed out by bobblebubble in the comments).
I dsuggest a mere
\d+[.)]\B\s*
See it in action (demo)
It matches 1 or more digits, then a . or a ), then it makes sure there is no word letter (digit, letter or underscore) after it and then matches zero or more whitespace.
Python demo:
import re
rx = r'\d+[.)]\B\s*'
test_str = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case\n\"we will give 4. there needs to be another option and 6.99 USD is a bit amount"
print([x for x in re.split(rx,test_str) if x])
Try the following regex with the g modifier:
([A-Za-z\s\-_]+|\d(?!(\)|\.)\D)|\.\d)
Example: https://regex101.com/r/kB1xI0/3
[A-Za-z\s\-_]+ automatically matches all alphabetical characters + whitespace
\d(?!(\)|\.)\D) match any numeric sequence of digits not followed by a closing parenthesis ) or decimal value (.99)
\.\d match any period followed by numeric digit.
I used this pattern:
(?<=\d.\s)(.*?)(?=\d.\s)
demo
This looks for the contents between any digit, any character, then a space.
Edit: Updated pattern to handle the currency issue and line ends better:
This is with flag 'g'
(?<=[0-9].\s)(.*?)(?=\s[0-9].\s|\n|\r)
Demo 2
import re
s = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case"
s1 = "we will give 4. there needs to be another option and 6.99 USD is a bit amount"
regex = re.compile("\d\)\s.*?|\s\d\.\D.*?")
print ([x for x in regex.split(s) if x])
print regex.split(s1)
Output:
['there is a dsfsdfsd and ', 'there is another one and ', 'yet another case']
['we will give', 'there needs to be another option and 6.99 USD is a bit amount']

Categories