I need help completing a regex pattern. I need a pattern that matches a range of numbers including the unit.
Examples:
The car drives 50,5 - 80 km/10min on the road.
The car drives 50,5 - 80 km / 10min on the road.
The car drives 40,5-80 km/h on the road.
The car drives 30-50 km/h on the road.
The car drives 40 - 60.8 km/ h on the road.
The car drives 40.90-60,8 km/h on the road.
I need to match the entire ranges. It would also be good to simplify (?:km/10min|km / 10min|km/h|km/ h) so that the variants do not have to be listed multiple times, while still taking the blanks into account.
([,.\d]+)\s*(?:km/10min|km / 10min|km/h|km/ h)
https://regex101.com/r/Ey792V/1
Currently, unfortunately, only the first number is matched. Thanks in advance for the help.
You could make the pattern a bit more specific and optionally match whitespace characters instead of hard-coding every possible spacing variation:
\b\d+(?:[.,]\d+)?(?:\s*-\s*\d+(?:[.,]\d+)?)?\s*km\s*/\s*(?:h|10min)\b
Explanation
\b A word boundary
\d+(?:[.,]\d+)? Match 1+ digits with an optional decimal part
(?: Non capture group
\s*-\s* Match - between optional whitespace chars
\d+(?:[.,]\d+)? Match 1+ digits with an optional decimal part
)? Close the non capture group and make it optional
\s*km\s*/\s* Match km/ surrounded with optional whitespace chars to match different variations
(?:h|10min) Match either h or 10min (Or use \d+min to match 1+ digits)
\b A word boundary
See a regex demo.
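As a quick check, here is a minimal Python sketch that runs the pattern above against the six example sentences from the question. The names RANGE_WITH_UNIT and examples are only illustrative.

```python
import re

# Pattern from the answer above, unchanged.
RANGE_WITH_UNIT = re.compile(
    r'\b\d+(?:[.,]\d+)?(?:\s*-\s*\d+(?:[.,]\d+)?)?\s*km\s*/\s*(?:h|10min)\b'
)

# The six example sentences from the question.
examples = [
    'The car drives 50,5 - 80 km/10min on the road.',
    'The car drives 50,5 - 80 km / 10min on the road.',
    'The car drives 40,5-80 km/h on the road.',
    'The car drives 30-50 km/h on the road.',
    'The car drives 40 - 60.8 km/ h on the road.',
    'The car drives 40.90-60,8 km/h on the road.',
]

# Collect the full matched text (range plus unit) for each sentence.
matches = [RANGE_WITH_UNIT.search(s).group(0) for s in examples]
for m in matches:
    print(m)
```

Because the second number is inside an optional group, this pattern also accepts a single value such as "30 km/h"; drop the trailing ? of that group if only ranges should match.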
Your question is not entirely clear as you framed it in terms of examples. To be precise you need to state the question in words, then use the examples for illustration. To take one example, the question does not make clear whether
"The car drives 40,5- 80 km /h on the road."
is to be matched.
Expressing a question in words is not always easy, but it is a skill that you need to acquire in order to write clear code specifications. A by-product is that it makes the code easier to write, as that amounts to merely translating the words into code.
Let's give it a try.
Match a string composed of six successive substrings:
One or more digits that are not preceded by a comma or period, optionally followed by a comma, hyphen or period, which, if present, is followed by one or more digits.
A hyphen, optionally preceded and/or followed by a space.
One or more digits, optionally followed by a comma or period, the comma or period, if present, being followed by one or more digits.
The literal " km".
A forward slash, optionally preceded and/or followed by a space.
The literal "h" or one or more digits followed by "min", followed by a word boundary.
I cannot be sure that this is what you want but you should be able to easily modify these requirements as necessary.
Now let's translate these requirements into a regular expression.
1. One or more digits that are not preceded by a comma or period, optionally followed by a comma, hyphen or period, which, if present, is followed by one or more digits.
(?<![,.])\d+(?:[,.-]\d+)?
(?<![,.]) is a negative lookbehind. It is needed to avoid matching, for example, the indicated part of the following string.
"The car drives 1,500.5 - 80 km/10min on the road."
^^^^^^^^^^^^^^^
2. A hyphen, optionally preceded and/or followed by a space.
?- ?
(The first question mark is preceded by a space.)
3. One or more digits, optionally followed by a comma or period, the comma or period, if present, being followed by one or more digits.
\d+(?:[,.]\d+)?
4. The literal " km".
 km
(The k is preceded by a space.)
5. A forward slash, optionally preceded and/or followed by a space.
?\/ ?
(The first question mark is preceded by a space.)
6. The literal "h" or one or more digits followed by "min", followed by a word boundary.
(?:h|\d+min)\b
Now we can simply join these pieces to form the regular expression.
(?<![,.])\d+(?:[,.-]\d+)? ?- ?\d+(?:[,.]\d+)? km ?\/ ?(?:h|\d+min)\b
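To see the joined expression at work, here is a small Python sketch, with the lookbehind from step 1 and the leading space of " km" from step 4 included. The sentences are the ones from the question; the names RANGE, single and partial are only illustrative.

```python
import re

# The six pieces joined in order, including the lookbehind and the space before "km".
RANGE = re.compile(
    r'(?<![,.])\d+(?:[,.-]\d+)? ?- ?\d+(?:[,.]\d+)? km ?\/ ?(?:h|\d+min)\b'
)

examples = [
    'The car drives 50,5 - 80 km/10min on the road.',
    'The car drives 50,5 - 80 km / 10min on the road.',
    'The car drives 40,5-80 km/h on the road.',
    'The car drives 30-50 km/h on the road.',
    'The car drives 40 - 60.8 km/ h on the road.',
    'The car drives 40.90-60,8 km/h on the road.',
]
matches = [RANGE.search(s).group(0) for s in examples]

# The hyphen is mandatory in these requirements, so a single value is rejected.
single = RANGE.search('The car drives 30 km/h on the road.')

# The lookbehind inspects only one character, so a match can still begin
# in the middle of "1,500.5".
partial = RANGE.search('The car drives 1,500.5 - 80 km/10min on the road.').group(0)
print(matches, single, partial)
```

One caveat this exposes: for the "1,500.5" string the search still returns "00.5 - 80 km/10min", because the digit before "00" is neither a comma nor a period. Widening the lookbehind to (?<![,.\d]) is one possible way to tighten that, though it goes beyond the requirements as stated.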
\d.+(?:h\b|min\b|s\b)
Would also work.
I am trying to write a regex expression in python that can match the following lines - I am just able to match the very first number by doing something like this
re.compile(r'\d.\d{14}\s+')
but could not do the rest. I also tried [^-\d] to catch the negative sign, but that does not seem to work.
Any help? Thanks!
First, let's start by looking at the numbers. You've already got a decent expression for finding a single number (\d.\d{14}\s+), but there are a couple of things wrong with it.
In regex, . indicates any single character. This means that your expression will accept any character after the first digit.
It's not taking into account the possibility that there could be a negative sign at the beginning.
Both of these problems are really easy to fix. The first can be fixed by simply escaping the period (\.). The second can be fixed by adding the negative sign to the pattern and giving it a quantifier. In this case, the ? quantifier will be the best option because it matches between 0 and 1 times. All this means is that it won't care if the symbol is there, but if it is it will match it. After these 2 changes, the pattern looks like this: -?\d\.\d{14}\s+.
Next, we need to tell it to match more than once. This can be done very easily by putting the pattern in a group and applying a quantifier to said group. Now the question is which quantifier should be used. In your example, there are only 3 numbers before the single character at the end of the line. You can match this pattern exactly 3 times by using the {3} quantifier. If you know there will be at least 1 but don't know how many in total there will be, you can use the + quantifier. For this example I will be using the {3} quantifier just so it's more specific to your question. After adding this, the pattern will look something like this: (-?\d\.\d{14}\s+){3}
Now all that's left is to match the character at the end. You can use \S to match any single non-whitespace character. You can add a quantifier to it, but again, for the purposes of your question, I won't, since there's only a single character. The final expression would look like (-?\d\.\d{14}\s+){3}\S.
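Here is a minimal sketch of the final expression in Python. The question's actual input lines are not shown, so the sample line below is a made-up stand-in: three numbers with exactly 14 decimals followed by one trailing character. The names LINE and NUMBER are only illustrative.

```python
import re

# Final expression from above: three numbers, then one non-whitespace character.
LINE = re.compile(r'(-?\d\.\d{14}\s+){3}\S')
# The single-number part on its own, handy for pulling out every value.
NUMBER = re.compile(r'-?\d\.\d{14}')

# Hypothetical sample line (assumption): built with 14 decimals per number.
line = ' '.join(f'{x:.14f}' for x in (0.5, -1.25, 3.0)) + ' A'

is_match = LINE.fullmatch(line) is not None
numbers = NUMBER.findall(line)
print(is_match, numbers)
```

A repeated capturing group only remembers its last repetition, which is why the sketch uses a separate findall call to collect all three numbers.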
I am trying to create a regex that can cover most of the possibilities of a text pattern. The format I'm trying to find is the numeric value of a listing that can come with different monetary values.
The inputs I can encounter, together with the result I expect for each, look like the following:
$ 8
expected result: 8
$ 12.548
expected result: 12.548
$ -8
expected result: -8
$ -6.098
expected result: -6.098
$ -59
expected result: -59
$ 778
expected result: 778
$ 73
expected result: 73
It is important to note that only one record will come at a time, but the result can come with any of the formats previously shown. Also within the pattern the $ sign will always come.
I need to have a regular expression that can find all the numerical values, however the one that is complicating me the most is the pattern with the negative number.
The expression I have, brings me only the positive values:
(\d+(\.\d+)?(?=$|))
As information I use Python 3.7 and I use the re.findall function to search for those records
Any ideas how to incorporate the negative numbers? Would I have to do it with a conditional?
Your existing regex for positive numbers is already good. You can modify it slightly so that it also supports negative numbers, as follows:
-?\d+(?:\.\d+)?
If you want to match only the number without any other characters following it, you can use:
-?\d+(?:\.\d+)?$
Details:
-? to match - literally but make it optional by ?
\d+ match the integer part of one or more digits
(?: to make a non-capturing group for the optional fractional part
\.\d+ to match a decimal point followed by one or more digits
)? end of the optional fractional part
$ this is the anchor that asserts the end of the line, so that if any characters follow the number, the number will not be matched.
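A short Python sketch of the first pattern with re.findall, using the sample inputs from the question. The names NUMBER, samples and results are only illustrative.

```python
import re

# Optional minus sign, integer part, optional fractional part.
NUMBER = re.compile(r'-?\d+(?:\.\d+)?')

# Sample inputs taken from the question.
samples = ['$ 8', '$ 12.548', '$ -8', '$ -6.098', '$ -59', '$ 778', '$ 73']

# The pattern has no capturing group, so findall returns the whole matches.
results = [NUMBER.findall(s) for s in samples]
for s, r in zip(samples, results):
    print(s, '->', r)
```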
To match the literal $, you have to escape it: \$, else it will try to match the end of line.
To match an integer, you can use \d to match a digit, and ask to match one or more: \d+
Matching the fractional part is trickier: you want to match the point, and then some digits after it: \.\d+. You need to escape the ., else it will match any character.
But you also want to match this whole thing zero or one times, using ?. An obvious way to do that would be (\.\d+)?, but that would be a capturing group, and you likely want to capture the entire number, not the fractional part alone. So you use a non-capturing group: (?:\.\d+)?
You also don't want to allow any other characters after the number, so you want to match the end-of-line, the $.
All together now:
\$ (\d+(?:\.\d+)?)$
To understand better how things work in it, I'd recommend a tool like https://regex101.com/.
Ah, yes, the optional minus; I bet you can cope with that without my help now.
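For reference, one way the optional minus could be added is sketched below. This is my own completion of the exercise, not something the answer spells out, and the helper name amount is hypothetical.

```python
import re

# The pattern from above with an optional minus inside the capturing group (assumption).
AMOUNT = re.compile(r'\$ (-?\d+(?:\.\d+)?)$')

def amount(text):
    """Return the captured number as a string, or None if the text does not fit."""
    m = AMOUNT.search(text)
    return m.group(1) if m else None

print(amount('$ -6.098'))
print(amount('$ 73'))
print(amount('$ 73 USD'))  # trailing text, so the $ anchor rejects it
```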
To match all the examples that you provided, you can try this instead:
[-\d]
This matches any single character that is a digit or the - sign and ignores everything else. Note that it matches one character at a time and does not include the decimal point.
I want to expand my previous answer, which is very short because it only considers the example inputs the OP wants to match. If the OP wants to match a number that follows one or more $ symbols and may contain "." or "," in between, things are a little more complicated. The OP was close to the right idea with the lookahead (?=$), presumably meant as "the number must be next to the $ symbol", leaving aside the "|" in the full pattern:
(\d+(\.\d+)?(?=$|))
An unescaped $ is the end-of-string anchor rather than a literal dollar sign, but the empty alternative after "|" makes the lookahead always succeed, so the pattern still matches every group of digits with a "." in between. That is not the answer we want, though.
Since we want our condition to start with "$" to distinguish it from any random number that is not a part of the money we care about, we start with:
(?<=\$)
the full syntax:
(?<=\$)(-?[0-9]+\.[0-9]+)
for a pattern that only contains ".", and:
(?<=\$)(-?[0-9]+\.[0-9]+)|(-?[0-9]+\,[0-9]+)
for a pattern that accepts either "." or "," as the separator, but not both within the same number. For example, with the following string:
" $-99.000 $10.0000 $$$999,000 $-99,000ppjujj okeer134124- "
the code should give you:
[('-99.000', ''), ('10.0000', ''), ('', '999,000'), ('', '-99,000')]
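The output above can be reproduced with the following sketch. The variable names are only illustrative, and the last check is my own addition to show how the alternation groups.

```python
import re

# Two alternatives: "." separated (after a $) or "," separated.
pattern = re.compile(r'(?<=\$)(-?[0-9]+\.[0-9]+)|(-?[0-9]+\,[0-9]+)')

text = ' $-99.000 $10.0000 $$$999,000 $-99,000ppjujj okeer134124- '

# With two capturing groups, findall returns one tuple per match.
pairs = pattern.findall(text)
# Collapse each tuple to whichever group actually matched.
amounts = [dot or comma for dot, comma in pairs]
print(pairs)
print(amounts)

# The lookbehind belongs to the first alternative only, so a comma number
# without a preceding $ is still found.
no_dollar = pattern.findall('7,5')
print(no_dollar)
```

If the $ should be required for both forms, wrapping the alternatives in one group after the lookbehind, for example (?<=\$)(-?[0-9]+[.,][0-9]+), is a possible variation. Note also that the lookbehind expects the number directly after the $, so inputs written as "$ 8" with a space would need that space allowed for.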
I hope this is helpful.
I am a beginner in Python and in regular expressions, and I am now trying to solve an exercise that goes like this:
How would you write a regex that matches a number with commas for
every three digits? It must match the following:
'42'
'1,234'
'6,368,745'
but not the following:
'12,34,567' (which has only two digits between the commas)
'1234' (which lacks commas)
I thought it would be easy, but I've already spent several hours and still don't have the right answer. Even the answer that was in the book with this exercise doesn't work at all (the pattern in the book is ^\d{1,3}(,\d{3})*$)
Thank you in advance!
The answer in your book seems correct to me. It also works on the test cases you have given.
(^\d{1,3}(,\d{3})*$)
The ^ symbol anchors the match at the start of the line. \d{1,3} says that there should be at least one digit but no more than three, so
1234,123
will not work.
(,\d{3})*$
This expression says that a comma followed by exactly three digits may repeat any number of times (including zero) up to the end of the line.
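A minimal Python check of the book's pattern against the five cases from the exercise; the names BOOK, cases and verdicts are only illustrative.

```python
import re

# The pattern from the book.
BOOK = re.compile(r'^\d{1,3}(,\d{3})*$')

cases = ['42', '1,234', '6,368,745', '12,34,567', '1234']

# True where the whole string is a correctly comma-grouped number.
verdicts = {c: bool(BOOK.search(c)) for c in cases}
print(verdicts)

# findall returns the capturing group rather than the whole match,
# and a repeated group keeps only its last repetition.
group_only = BOOK.findall('6,368,745')
print(group_only)
```

The last lines hint at one possible source of the "doesn't work" impression: with a capturing group in the pattern, re.findall reports only the group (here the last ",745"), not the full number. Using search or fullmatch and reading group(0), or making the group non-capturing, avoids that.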
Maybe the answer you are looking for is this:
(^\d+(,\d{3})*$)
This matches a number with commas every three digits without limiting the part before the first comma to three digits. Note that it therefore also accepts '1234', which the exercise says must not match.
You can go with this (which is a slightly improved version of what the book specifies):
^\d{1,3}(?:,\d{3})*$
I got it to work by putting the stuff between the caret and the dollar in parentheses like so: re.compile(r'^(\d{1,3}(,\d{3})*)$')
but I find this regex pretty useless, because you can't use it to find these numbers in a document because the string has to begin and end with the exact phrase.
# This program validates the regular expression for this scenario.
# Any properly formatted number (with commas) will match.
# Parsing through a document for this regex is beyond my capability at this time.
import re

print('Type a number with commas')
sentence = input()
pattern = re.compile(r'\d{1,3}(,\d{3})*')
matches = pattern.match(sentence)
# match() returns None when nothing matches at the start of the input.
if matches is None or matches.group(0) != sentence:
    # The input value does NOT match the pattern in full.
    print('Does Not Match the Regular Expression!')
else:
    # The whole input matched, so state verification.
    print(matches.group(0) + ' matches the pattern.')
The simple answer is:
^\d{1,2}(,\d{3})*$
^\d{1,2} - should start with a number and matches 1 or 2 digits.
(,\d{3})*$ - once ',' is passed it requires 3 digits.
This works for all the scenarios in the book, but it rejects numbers whose first group has three digits, such as 123,456; use \d{1,3} to allow those.
Test your scenarios on https://pythex.org/
I also went down the rabbit hole trying to write a regex that is a solution to the question in the book. The question in the book does not assume that each line is such a number; that is, there might be multiple such numbers in the same line and there might be some kind of quotation marks around the number (similar to the question text). On the other hand, the solution provided in the book makes those assumptions: (^\d{1,3}(,\d{3})*$)
I tried to use the question text as input and ended up with the following pattern, which is way too complicated:
r'''(
(?:(?<=\s)|(?<=[\'"])|(?<=^))
\d{1,3}
(?:,\d{3})*
(?:(?=\s)|(?=[\'"])|(?=$))
)'''
(?:(?<=\s)|(?<=[\'"])|(?<=^)) is a non-capturing group that allows
the number to start after \s characters, ', ", or the start of the text.
(?:,\d{3})* is a non-capturing group to avoid capturing, for example, 123 in 12,123.
(?:(?=\s)|(?=[\'"])|(?=$)) is a non-capturing group that allows
the number to end before \s characters, ', ", or the end of the text (no newline case).
Obviously you could extend the list of allowed characters around the number.
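Since the pattern is spread over several lines, it has to be compiled with re.VERBOSE. The sketch below does that and runs it on a sentence modelled on the exercise text; the sentence and the name NUMBER are my own illustration.

```python
import re

# The multi-line pattern from above; re.VERBOSE ignores the layout whitespace
# and allows the trailing comments.
NUMBER = re.compile(r'''(
    (?:(?<=\s)|(?<=[\'"])|(?<=^))   # preceded by whitespace, a quote, or the start
    \d{1,3}                          # one to three leading digits
    (?:,\d{3})*                      # any number of comma-plus-three-digit groups
    (?:(?=\s)|(?=[\'"])|(?=$))      # followed by whitespace, a quote, or the end
)''', re.VERBOSE)

# Hypothetical input modelled on the exercise wording.
text = "It must match '42', '1,234' and '6,368,745' but not '12,34,567' or '1234'."

# One capturing group, so findall returns the numbers themselves.
found = NUMBER.findall(text)
print(found)
```

Numbers followed directly by punctuation such as "." or "," are not found with this list of allowed neighbours, which is where extending the list comes in.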
My aim is to find matches in a text where not always all matches are present.
I am trying to collect the phone number, the e-mail address and the website of venues from a web site. Only some venues have all three pieces of information available; most of them have only one or two. I tried to write some code, but it works only if all three pieces of information are available. Could someone help me find what is wrong?
import re

# Raw strings keep backslashes such as \s intact for the regex engine.
grouped = re.compile(r'col-right[\s\S]*?' +
                     r'Tel[\s\S]*?([0-9]{0,4}-?[0-9]{3,7}-?[0-9]{0,4}-?[0-9]{0,4})' +
                     r'[\s\S]*?href="http://([\w\W]*?)"' +
                     r'[\s\S]*?href="mailto:([\s\S]*?)">[\s\S]*?</div>')
for match in re.finditer(grouped, text):
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))
Also, the digits in the phone numbers are separated by "-", but sometimes there is a space between the "-" and the next set of digits. How can I write the pattern so that this space is allowed but not required?
Your logic is good, but it needs a little work.
First of all, you need the phone number. Write a regex for it and put it in a group: (regex)*. The group is marked with () and * means that it has to be present 0 or more times.
Write the next regex and add it to another group, (emailRegex)*, and then a third group, (website)*.
Instead of * you could also use ?, meaning once or not at all (as I can see, you used ?).
Now, putting all together, simply mix them with any character in between them
(group1)?.*(emailRegex)?.*(website)*
group1 matches the phone number, followed by any characters, then the email, followed by any characters, then the website. If one of them is missing, there is no problem at all.
Email regex example: (probably not the most complete one)
([a-zA-Z_]+[a-zA-Z0-9_.-]*#[a-zA-Z0-9]+\.[a-z]+)?
This works like this: the email should start with a letter or an underscore _, and it may be followed by lower/upper case letters, numbers, underscores, dots (.) or hyphens, followed by # and letters or digits followed by a dot (notice that I used \. to escape the special any-character notation), and at the end you add one or more letters.
It works for email#mail.com.
The fact that I put the entire regex in brackets means it is a group and it should appear once or none at all (hence the ?). Between groups, you add .* meaning that in between the phone number/email/address can be any characters.
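A different technique from the one described above, offered only as a hedged sketch: optional groups separated by .* can succeed while capturing nothing (the first group may match empty and .* then swallows the rest), so it may be more robust to search each venue block for every field separately. The venue snippets, the patterns and the helper names below are all hypothetical, since the real page markup is not shown in the question. The phone pattern allows an optional space after each hyphen with -\s?.

```python
import re

# Hypothetical venue blocks; the real markup is an unknown here.
blocks = [
    'Tel: 06-1234-567 <a href="http://example.com">web</a> '
    '<a href="mailto:info@example.com">mail</a>',
    'Tel: 06- 7654321',
]

# Digit groups separated by "-", each hyphen optionally followed by a space.
PHONE = re.compile(r'Tel\D*(\d+(?:-\s?\d+)*)')
WEBSITE = re.compile(r'href="http://([^"]*)"')
EMAIL = re.compile(r'href="mailto:([^"]*)"')

def first_group(pattern, text):
    """Return group 1 of the first match, or None when the field is absent."""
    m = pattern.search(text)
    return m.group(1) if m else None

def extract(block):
    """Look for each field on its own so a missing one does not affect the others."""
    return {
        'phone': first_group(PHONE, block),
        'website': first_group(WEBSITE, block),
        'email': first_group(EMAIL, block),
    }

results = [extract(b) for b in blocks]
for r in results:
    print(r)
```

Splitting the page into per-venue blocks first (for example on the col-right marker from the question) is assumed to have happened before this step.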