Python Regular expression search specific string beside number - python

I need help here.
I have a list and string.
Things I want to do is to find all the numbers from the string and also match the words from the list in the string that are beside numbers.
str = 'Lily goes to school everyday at 9:00. Her House is near to her school.
Lily's address - Flat No. 203, 14th street lol lane, opp to yuta mall,
washington. Her school name is kids International.'
list = ['school', 'international', 'house', 'flat no']
I wrote a regex which can pull numbers
x = re.findall('([0-9]+[\S]+[0-9]+|[0-9]+)' , str,re.I|re.M)
Output I want:
Numbers - ['9:00', '203', '14th']
Flat No.203 (because flat no is beside 203)
14 is also beside string but I dont want it because it is not contained in list.
But How can I write regex to make second condition satisfy. that is to search
whether flat no is beside 203 or not in same regex.

There you go:
(\d{1,2}:\d{1,2})|(?:No\. (\d+))|(\d+\w{2})
Demo on Regex101.com can be found here
What does it do and how does it work?
I use two pipes (|) to gather different number "types" you want:
First alteration ((\d{1,2}:\d{1,2}) - captures time using 1-2 digits followed by a colon and another set of 1-2 digits (probably you could go for 2 digits only).
Second alteration (?:No\. (\d+)) - gives you the number prefixed with literal "No. " (note the space at the end), and then captures following number, no matter how long (at least one digit)
The third and the last part (\d+\w{2}) - simply captures any number of digits (again, at least one) followed by two word characters. You could further improve this part of the regex to match only st, nd, and th suffixes, but I will leave this up to you.
Also to get rid of further unneeded matches you could use lookarounds, but again - I'll leave this up to you to implement.
General note - rather than using one regex to rule... erm - match them all, you should focus on creating many simple regexes. Not only will this improve legibility, but also maintainability of the regexes. This also allows you to search for timestamps, building numbers and positional numerals separately, easily allowing you to split this information to specific variables.

Related

Split a string and capture all instances in python regex

Newbie here, I have been trying to learn regex for some time but sometimes I feel I can't understand how regex is handling strings. Because in planning phase I seem to work it out, but in implementation it doesn't work as I expect it.
Here is my little problem: I have strings that contains one or more names (team names). The problem is that if the string contains more than one, there is no separator. All names are joint directly.
Some examples :
------------String -----------------Contains----------Names to be extracted
'RangersIslandersDevils' --> 3 names ->>> [Rangers, Islanders, Devils]
'49ersRaiders' -------------> 2 names ->>> [49ers, Raiders]
'Avalanche'----------------> 1 name ->>> [Avalanche]
'Red Wings'---------------> 1 name ->>> [Red Wings]
I want to capture each name in each string and use them in a loop later on. But I can't seem to implement the pattern I imagine for it.
The pattern implementation in my head for the strings are like this:
Start scanning the text which is expected to start with a capital
letter or number
If you see a literal 's' followed by a capital letter (like ...s[A-Z]..) capture the text until "s" (including s)
Repeat step two until you no more see (....s[A-Z]..) pattern. And capture the rest of the string as the last name.
Optionally, Write all names in a list
Well I tried in vain some code in which the step two captures only one instance and step 3 normally gives another.
re.findall('([A-Z0-9].*s)*([A-Z].*)+', 'RangersIslandersMolsDevil')
That returns only two names:
[('RangersIslandersMols', 'Devil')]
whereas I want four:
[Rangers, Islanders, Mols, Devil]
([A-Z0-9].*s)* will capture as many of any character as it can, so that's causing 'RangersIslandersMols' to get stuck together as one match.
It sounds like the boundary between team names is defined as a lowercase letter (not necessarily an 's', as in 'Avalanche') followed immediately by an uppercase letter or number, so our pattern should look for:
uppercase letters or numbers, followed by
lowercase letters
Because a team name can have multiple words, we'll also look for a space followed by the same pattern as above, for any possible number of words.
Try this pattern:
>>> pattern = r'[A-Z0-9]+[a-z]+(?: [A-Z0-9]+[a-z]+)*'
>>> findall(pattern, "RangersIslandersDevils49ersWashginton Football TeamAvalancheWarriors")
['Rangers', 'Islanders', 'Devils', '49ers', 'Washginton Football Team', 'Avalanche', 'Warriors']

How to check if the whole input string (real numbers separated by a space) matches a regex in Python?

I have an input string consisting of a sequence of real numbers separated by a single space. It is also acceptable for the string to contain only one real number (no spaces). My goal is to check whether the string structure matches the following (in this order):
optional (0/1): minus (-)
1/more digits
optional (1+): a period and 1/more digits
optional (0+): a group consisting of a space and the first group (the first three bullet points)
It should describe the string completely. If not, it should print an error message and exit.
My current regular expression is ^(-?\d+(\.?\d)*)( \1)*$ which I thought would be okay, but even the first group doesn't match all the real numbers individually. And I need it to check the string from the beginning to the end, including the spaces.
My code for this function looks like this:
import re
def structure_check(string):
structure = r"^(-?\d+(\.?\d)*)( \1)*$"
if re.match(structure,string):
return("OK")
else:
print("Input error")
exit()
It should accept strings like: 15 35 -45 8 -2.3 4564.18 56 etc., but it doesn't correspond to changes in the input (doesn't match) at all. It shouldn't match if there is too many spaces, incorrectly placed . or -, or if there are other characters than digits, periods, dashes (-) and spaces.
I could also do this with just the first group while iterating over a list created by splitting the input string by space, but I would prefer to check it according to my main goal, since I wouldn't have to split the input in the validation function and also to save some more code lines by checking the input alltogether (eg. for excess spaces, or unsupported characters, which I'd have to otherwise check separately).
Sorry if I missed any answered questions, I couldn't find any appropriate for my problem in Python. If you know about any, feel free to link them, please. And thank you, I am a beginner and started learning regex for a project just about yesterday.
You can use:
^((?:[+-]?\d+(?:[.]\d+)?)(?:[ \t]|$))*$
Demo and explantation
I added + to the optional sign. If you only want to match with no sign or -, just remove that from the optional character class.
You could also use an unrolled version to prevent matching a space at the end.
^-?\d+(?:\.\d+)?(?: -?\d+(?:\.\d+)?)*$
Regex demo
The backreference \1 will match exactly what is matched in group 1 and for your pattern will match for example 123 123 123
If you want to repeat the group, you could recurse the first group using the PyPi regex module and (?1)
^(-?\d+(?:\.\d+)?)(?: (?1))*$
See a Python example
Problem is in your regexp, to be specific, in ( \1)* part.
This, described, means: space and string that was matched in group 1 zero or more times
Thus, your regexp will match for the following, for example:
15 15 15
-5.3 -5.3 -5.3 -5.3
And so on.
To fix the regexp, I would replace the group reference with the actual group, like so:
^(-?\d+(\.?\d)*)( -?\d+(\.?\d)*)*$
I would also point out that this regexp allows the numbers to have multiple decimal dots, (e.g. 1.2.3 passes) however I'm not sure if that's intended or not.
In JavaScript you can use the method .test of regex. The regex should work in python.
let ok = /^(([+\-]?\d+(\.\d+)?)( |$))+$/.test("15 35 -45 8 -2.3 4564.18 56");
console.log(ok);
Explanation: (.\d+)? You must make the whole group optional. The number can be followed by a space or the end of a string ( |$). The pattern is repeated throughout the string so I wrapped the entire expression in a group. Insert ^ at the beginning of the regex and $ at the end of the regex to force the regex to check the string completely.

How to make regex that matches a number with commas for every three digits?

I am a beginner in Python and in regular expressions and now I try to deal with one exercise, that sound like that:
How would you write a regex that matches a number with commas for
every three digits? It must match the following:
'42'
'1,234'
'6,368,745'
but not the following:
'12,34,567' (which has only two digits between the commas)
'1234' (which lacks commas)
I thought it would be easy, but I've already spent several hours and still don't have write answer. And even the answer, that was in book with this exercise, doesn't work at all (the pattern in the book is ^\d{1,3}(,\d{3})*$)
Thank you in advance!
The answer in your book seems correct for me. It works on the test cases you have given also.
(^\d{1,3}(,\d{3})*$)
The '^' symbol tells to search for integers at the start of the line. d{1,3} tells that there should be at least one integer but not more than 3 so ;
1234,123
will not work.
(,\d{3})*$
This expression tells that there should be one comma followed by three integers at the end of the line as many as there are.
Maybe the answer you are looking for is this:
(^\d+(,\d{3})*$)
Which matches a number with commas for every three digits without limiting the number being larger than 3 digits long before the comma.
You can go with this (which is a slightly improved version of what the book specifies):
^\d{1,3}(?:,\d{3})*$
Demo on Regex101
I got it to work by putting the stuff between the carrot and the dollar in parentheses like so: re.compile(r'^(\d{1,3}(,\d{3})*)$')
but I find this regex pretty useless, because you can't use it to find these numbers in a document because the string has to begin and end with the exact phrase.
#This program is to validate the regular expression for this scenerio.
#Any properly formattes number (w/Commas) will match.
#Parsing through a document for this regex is beyond my capability at this time.
print('Type a number with commas')
sentence = input()
import re
pattern = re.compile(r'\d{1,3}(,\d{3})*')
matches = pattern.match(sentence)
if matches.group(0) != sentence:
#Checks to see if the input value
#does NOT match the pattern.
print ('Does Not Match the Regular Expression!')
else:
print(matches.group(0)+ ' matches the pattern.')
#If the values match it will state verification.
The Simple answer is :
^\d{1,2}(,\d{3})*$
^\d{1,2} - should start with a number and matches 1 or 2 digits.
(,\d{3})*$ - once ',' is passed it requires 3 digits.
Works for all the scenarios in the book.
test your scenarios on https://pythex.org/
I also went down the rabbit hole trying to write a regex that is a solution to the question in the book. The question in the book does not assume that each line is such a number, that is, there might be multiple such numbers in the same line and there might some kind of quotation marks around the number (similar to the question text). On the other hand, the solution provided in the book makes those assumptions: (^\d{1,3}(,\d{3})*$)
I tried to use the question text as input and ended up with the following pattern, which is way too complicated:
r'''(
(?:(?<=\s)|(?<=[\'"])|(?<=^))
\d{1,3}
(?:,\d{3})*
(?:(?=\s)|(?=[\'"])|(?=$))
)'''
(?:(?<=\s)|(?<=[\'"])|(?<=^)) is a non-capturing group that allows
the number to start after \s characters, ', ", or the start of the text.
(?:,\d{3})* is a non-capturing group to avoid capturing, for example, 123 in 12,123.
(?:(?=\s)|(?=[\'"])|(?=$)) is a non-capturing group that allows
the number to end before \s characters, ', ", or the end of the text (no newline case).
Obviously you could extend the list of allowed characters around the number.

regex- capturing text between matches

In the following text, I try to match a number followed by ")" and number followed by a period. I am trying to retrieve the text between the matches.
Example:
"1) there is a dsfsdfsd and 2) there is another one and 3) yet another
case"
so I am trying to output: ["there is a dsfsdfsd and", "there is another one and", yet another case"]
I've used this regex: (?:\d)|\d.)
Adding a .* at the end matches the entire string, I only want it to match the words between
also in this string:
"we will give 4. there needs to be another option and 6.99 USD is a
bit amount"
I want to only match the 4. and not the 6.99
Any pointers will be appreciated. Thank you. r
tldr
Regular expressions are tricky beasts and you should avoid them if at all possible.
If you can't avoid them, then make sure you have lots of test cases for all the edge cases that can occur.
Build up your regular expression slowly and systematically, testing your assumptions at every step.
If this code will go intro production, then please write unit tests that explain the thinking process to the poor soul who has to maintain it one day
The long version
Regular expressions are finicky. Your best approach may be to solve the problem a different way.
For example, your language might have a library function that allows you to split up strings using a regular expression to define what comes between the numbers. That will let you get away with writing a simpler regex to match the numbers and brackets/dots.
If you still decide to use regular expressions, then you need to be very structured about how you build up your regular expressions. It's extremely easy to miss edge cases.
So let's break this down piece by piece...
Set up a test environment for quickly experimenting with your regex.
There are lots of options here, depending on your programming language and OS. Ones I sometimes use are:
a Powershell window for testing .Net regexes (NB: the cli gives you a history of past attempts, so you can go back a few steps if you mess things up too badly)
a Python console for testing Python regexes (which are slightly different to .Net regexes in their syntax for named capture groups).
an html page with JavaScript to test the regex
an online or desktop regex tool (I still use the ancient Regular Expression Workbench from Eric Gunnerson, but I'm sure there are better alternatives these days)
Since you didn't specify a language or regex version, I'll assume .Net regular expressions
Create a single test string for testing a wider variety of options.
Your goal is to include as many edge cases as you can think of. Here's what I would use: "ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10."
Note that I've added a few extra cases you didn't mention:
empty strings between two round bracket numbers: "4)" and "5)"
white space string between two round bracket numbers: "5)" and "6)"
empty strings between a round bracket number and a dotted number: "6)" and "10."
empty string after the dotted number "10." at the end of the string
random text and empty space, which should be ignored, before the first number
I'm going to make a few assumptions here, which you will need to vary based on your actual requirements:
You DO want to capture the white space after the dot or round bracket.
You DO want to capture the white space before the next dotted number or round bracket number.
You might have numbers that go beyond 9, so I've included "10" in the test cases.
You want to capture empty strings at the end e.g. after the "10."
NOTES:
Thinking through this test case forces you to be more rigorous about your requirements.
It will also help you be more efficient while you are manually testing your regular expression.
HOWEVER, this is assuming you aren't following a TDD approach. If you are, then you should probably do things a little differently... create unit tests for each scenario separately and get the regex working incrementally.
This test string doesn't cover all cases. For example, there are no newline or tab characters in the test string. Also it can't test for an empty string following a round bracket number at the very end.
First get a regex working that just captures the round brackets and dotted brackets.
Don't worry about the $6.99 edge case yet.
Drop the "(?:" non-capturing group syntax from your regex for now: "\d)|\d."
This doesn't even parse, because you have an unescaped round bracket.
The revised string is "\d\)|\d.", which parses, but which also matches "99" which you probably weren't expecting. That's because you forgot to escape the "."
The revised string is "\d\)|\d\.". This no longer matches "99", but it now matches "0." at the end instead of "10.". That's because it assumes that numbers will be single digit only.
The following string seems to work: "\d+\)|\d+\."
Time to deal with that pesky "$6.99" now...
Modify the regex so that it doesn't capture a floating point number.
You need to use a negative look ahead pattern to prevent a digit being after the decimal point.
Result: "\d+\)|\d+\.(?!\d)"
Count how many matches this produces. You're going to use this number for checking later results.
Hint: Save the regex pattern somewhere. You want to be able to go back to it any time you mess up your regex pattern beyond repair.
If you found a string splitting function, then you should use it now and avoid the complexity that follows. [I've included an example of this at the end.]
Simple is better, but I'm going to continue with the longer solution in the interests of showing an approach to staying in control of regex'es that start getting horribly complicated
Decide how to exclude that pattern
You used the non-capture group pattern in your question i.e. "(?:"
That approach can work. But it's a bit cumbersome, because you need to have a capturing group after it that you will look for instead.
It would be much nicer if your entire pattern matched what you are looking for.
So wrap the number pattern inside a zero-width positive look behind pattern (if your language supports it) i.e. "(?<=".
This checks for the pattern, but doesn't include it in what gets captured.
So now your regex looks like this: "(?<=\d+\)|\d+\.(?!\d))"
Test it!
It might seem silly to test this on its own - all the matches are empty strings.
Do it anyway. You want to sanity check every step of the way.
Make sure that it still produces the same number of matches as in step 4.
Decide how to match the text in between the numbers.
You rightly mention that ".*" will match the entire string, not just the parts in between.
There's a neat trick that allows you to reuse the pattern from step 5 to get the text in between.
Start by just matching the next character
The trick is that you want to match any character unless it's the start of the next number
That sounds like a negative look ahead pattern again: "(?!"
Let X be the pattern you saved in step 4. Matching a single character will look like this: "(?!X)."
You want to match lots of those characters. So put that pattern into a non-capturing group and repeat it: "(?:(?!X).)*"
This assumes you want to capture empty text.
If you're not, then change the "*" to a "+".
Hint: This is such a common pattern that you will want to reuse it in future pasting in different patterns in place of X
I used a non-capturing group instead of a normal group so that you can also embed this pattern in regexes where you do care about the capturing groups
Resulting pattern: "(?:(?!\d+\)|\d+\.(?!\d)).)*"
I suggest testing this pattern on its own to see what it does
Now put parts 5 and 7 together: "(?<=\d+\)|\d+\.(?!\d))(?:(?!\d+\)|\d+\.(?!\d)).)*"
Test it!
Unit tests!
If this is going into production, then please write lots of unit tests that will explain each step of this thought process
Have pity on the poor soul who has to maintain your regex in future!
By rights that person should be you
I suggest putting a note in your calendar to return to this code in 6 months' time and make sure you can still understand it from the unit tests alone!
Refactor
In six months' time, if you can't understand the code any more, use your newfound insight (and incentive) to solve the problem without using regular expressions (or only very simple ones)
Addendum
As an example of using a string splitting function to get away with a simpler regex, here's a solution in Powershell:
$string = 'ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10.'
$pattern = [regex] '\d+\)|\d+\.(?!\d)'
$string -split $pattern | select-object -skip 1
Judging by the task you have, it might be easier to match the delimiters and use re.split (as also pointed out by bobblebubble in the comments).
I dsuggest a mere
\d+[.)]\B\s*
See it in action (demo)
It matches 1 or more digits, then a . or a ), then it makes sure there is no word letter (digit, letter or underscore) after it and then matches zero or more whitespace.
Python demo:
import re
rx = r'\d+[.)]\B\s*'
test_str = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case\n\"we will give 4. there needs to be another option and 6.99 USD is a bit amount"
print([x for x in re.split(rx,test_str) if x])
Try the following regex with the g modifier:
([A-Za-z\s\-_]+|\d(?!(\)|\.)\D)|\.\d)
Example: https://regex101.com/r/kB1xI0/3
[A-Za-z\s\-_]+ automatically matches all alphabetical characters + whitespace
\d(?!(\)|\.)\D) match any numeric sequence of digits not followed by a closing parenthesis ) or decimal value (.99)
\.\d match any period followed by numeric digit.
I used this pattern:
(?<=\d.\s)(.*?)(?=\d.\s)
demo
This looks for the contents between any digit, any character, then a space.
Edit: Updated pattern to handle the currency issue and line ends better:
This is with flag 'g'
(?<=[0-9].\s)(.*?)(?=\s[0-9].\s|\n|\r)
Demo 2
import re
s = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case"
s1 = "we will give 4. there needs to be another option and 6.99 USD is a bit amount"
regex = re.compile("\d\)\s.*?|\s\d\.\D.*?")
print ([x for x in regex.split(s) if x])
print regex.split(s1)
Output:
['there is a dsfsdfsd and ', 'there is another one and ', 'yet another case']
['we will give', 'there needs to be another option and 6.99 USD is a bit amount']

Regex working wrong, matching unexpected things

I have this regex:
[\(\+\[]?[0-9]([\-\)\.\/-\]]?\s?\(?[0-9\s\)]){8,20}?
It must match only phone numbers, but instead it also matches things like:
[95.86.22.137]
95.86.22.137
(192.168.1.94)
274.1363525390625px;">
2014-8-720:32:45
Can someone help me correct this regex please ?
If you really want to do this right, I would start over and first establish the pattern you intend to match, which is not evident by your regex and wasn't in your question - only what you didn't want. You need to look at it as "what do I want?", not "what am I trying to exclude?". Do it as "what do I want?" and what you require will eliminate all those nasty other possibilities.
You must first decide what you will accept as a "valid phone number". Remember that even within the NANP (North American Numbering Plan), there are a few different formats that go like:
XXX-XXX-XXXX or
XXX XXX-XXXX or
1-XXX-XXX-XXXX or
(XXX) XXX-XXXX or
+(XXX) XXX-XXXX or
1 (XXX) XXX-XXXX
And all of those are valid numbers, so you'd have to decide the format you'll accept. And then there are different formats throughout the rest of the world of varying lengths from 9 (Portugal) to 13 (South Korea) digits, including the country and international code. So you have to decide:
Will you be accepting only NANP numbers, or other numbers outside of that standard?
Will you accept "+" or make them write the international code? Will the code for the countries your user will be inputting have a set number of digits, if you require the code? If they enter the code, will your regex be able to handle (if acceptable) or red-flag (if not acceptable) that?
Will you enforce parenthesis around the area code, simply allow them (make parenthesis optional), or outright refuse them?
And on that last one, note that different countries have parenthesis in different places in their numbers, i.e. Mexico has 2 digit area codes (and is not in NANP, btw).
And remember, each time you make these kinds of decisions that require a character somewhere, you are negating other possible, valid phone numbers, unless you allowed the other valid characters in that slot, too. That is why there is no one-size fits all solution to your problem. For that reason, many will tell you to just strip out "+", "(", ")", "-", and then count the digits. But this fails if you consider "1" in a NANP number to be required, but someone doesn't include it (because it's generally optional within the NANP), or when different countries have different numbers of digits in their numbers -- even within their own country, like New Zealand.
There is this so-called comprehensive guide: A comprehensive regex for phone number validation
But I found it horribly lacking in addressing things like how to make a person enter "+" vs. "1" and a space (for NANP numbers), how to enforce parenthesis, hyphens, etc. It gives you the regexs rather than explaining how to get you there. Hence my "blog", here, for an answer.
The following is my strict NANP regex I use that will accept:
+(XXX) XXX-XXXX
1 (XXX) XXX-XXXX
(XXX) XXX-XXXX
It requires parenthesis and the hyphen, which for NANP numbers, I believe, gives a lot of good flexibility while still conforming to a standard. I do not deal in international (outside NANP) numbers, fortunately:
/^(\+|1\s)?[(][2-9]\d{2}[)][\s][2-9]\d{2}-\d{4}$/
/^ = Match at the start of word; basically just indicates the start of the expression
(\+|1\s)? group
Parenthesis are used to offset the group, and say that any of the characters inside are optional, via the ? at the end, and allow the "or" condition inside (see pipe character)
\+ = Escaped "+", to allow matching on the plus sign (must be escaped as it is a key character in regex, using the backslash)
| = Pipe character, to say it should match either what is to the left or to the right, within the group
1\s = Requires the number 1 and a space character. Space is not made required by [] - this has not worked for me, though I've seen other posts that seem to indicate it does. It is \s.
[(] = This is how you indicate that the open parenthesis is required.
[2-9]\d{2} group
[2-9] = This is to make the expression match on a digit 2-9. This is because in NANP, 0 and 1 are invalid numbers in the start of an area code (first set of 3 numbers) or in the phone exchange (second set of 3 numbers).
\d{2} = This says to allow 2 digits, from 0-9. This is shorthand for [0-9][0-9].
For a three-digit group from 000-999, you would simply say: \d{3}
[)] = This is how you indicate that the closed parenthesis is required.
[\s] = This will require a space.
[2-9]\d{2}-\d{4} group
This first part, before the hyphen, is the same as before.
Setting the hyphen will require it, here. If you put -?, it would be optional.
\d{4} = This says to allow 4 digits, from 0-9. This is shorthand for [0-9][0-9][0-9][0-9]
$/ = Says to match to the end of the word; basically just indicates the end of the expression.
Hopefully this will help you to build your expression.

Categories