Get numerical values from a behavior pattern with Regex


I am trying to create a regex that covers most of the variations of a text pattern. What I want to extract is the numeric value of a listing, which can come as different monetary amounts.
The inputs I can receive, and the results I expect, look like the following:
$ 8
expected result: 8
$ 12.548
expected result: 12.548
$ -8
expected result: -8
$ -6.098
expected result: -6.098
$ -59
expected result: -59
$ 778
expected result: 778
$ 73
expected result: 73
It is important to note that only one record will come at a time, but it can arrive in any of the formats shown above. The $ sign will always be present in the pattern.
I need a regular expression that can find all of these numerical values; the case that is giving me the most trouble is the negative number.
The expression I have returns only the positive values:
(\d+(\.\d+)?(?=$|))
For reference, I use Python 3.7 and the re.findall function to search for those records.
Any ideas on how to incorporate the negative numbers? Would I have to do it with a conditional?

Your existing regex for positive numbers is already good. You can modify it slightly so that it also supports negative numbers, as follows:
-?\d+(?:\.\d+)?
If you want to match only the number without any other characters following it, you can use:
-?\d+(?:\.\d+)?$
Details:
-? matches - literally, made optional by ?
\d+ matches the integer part: one or more digits
(?: opens a non-capturing group for the optional fractional part
\.\d+ matches a decimal point followed by one or more digits
)? closes the group and makes the fractional part optional
$ is the anchor that asserts the end of the line, so that if any characters follow the number, the pattern will not match.
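As a rough illustration of the first pattern above, the sketch below runs re.findall over the sample inputs from the question. The names NUMBER, samples and results are only illustrative and do not come from the original post.

```python
import re

# Optional minus, integer part, optional fractional part.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

# Sample inputs copied from the question.
samples = ["$ 8", "$ 12.548", "$ -8", "$ -6.098", "$ -59", "$ 778", "$ 73"]

# findall returns a list of matches; each sample holds exactly one number.
results = [NUMBER.findall(text) for text in samples]
for text, found in zip(samples, results):
    print(f"{text!r} -> {found}")
```

Because the pattern has no capturing group, findall returns the whole matched number as a string; convert with float() or int() if a numeric value is needed.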

To match the literal $, you have to escape it: \$, else it will try to match the end of line.
To match an integer, you can use \d to match a digit, and ask to match one or more: \d+
Matching the fractional part is trickier: you want to match the point, and then some digits after it: \.\d+. You need to escape the ., else it will match any character.
But you also want to match this whole thing zero or one times, using ?. An obvious way to do that would be (\.\d+)?, but that would be a capturing group, and you likely want to capture the entire number, not the fractional part alone. So you use a non-capturing group: (?:\.\d+)?
You also don't want to allow any other characters after the number, so you want to match the end-of-line, the $.
All together now:
\$ (\d+(?:\.\d+)?)$
To understand better how things work in it, I'd recommend a tool like https://regex101.com/.
Ah, yes, the optional minus; I bet you can cope with that without my help now.
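One possible way to add the optional minus to the anchored pattern is sketched below. It assumes exactly one space between the $ and the number, as in the question's samples; the helper name extract_amount is hypothetical.

```python
import re

# Assumption: exactly one space separates the "$" from the number.
# The capturing group holds the whole number, including an optional minus.
PRICE = re.compile(r"\$ (-?\d+(?:\.\d+)?)$")

def extract_amount(text):
    """Return the captured number as a string, or None if the text does not fit."""
    match = PRICE.search(text)
    return match.group(1) if match else None

print(extract_amount("$ -6.098"))  # -6.098
print(extract_amount("$ 778"))     # 778
print(extract_amount("$ 12abc"))   # None, because characters follow the number
```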

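To match all the examples that you provided, you can try this instead:
[-\d]
This matches any single character that is a digit or a - sign and ignores everything else. Note that it matches one character at a time, so the decimal point is dropped and the pieces have to be joined afterwards.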

I want to expand on my previous answer, which was very short because I considered only the example syntax that the OP wants to match. If the OP wants to match a number that follows one or more $ symbols and may contain a "." or ",", it is a little more complicated. The OP is close to the right idea in using a lookahead, (?=$), presumably meant as the condition "the number must be next to the symbol $". However, an unescaped $ is the end-of-line anchor, and the "|" in the full solution adds an empty alternative:
(\d+(\.\d+)?(?=$|))
so the lookahead always succeeds and the expression matches every group of digits with an optional "." part, whether or not a $ is nearby. That is not the answer we want.
Since we want our condition to start with "$" to distinguish it from any random number that is not a part of the money we care about, we start with:
(?<=\$)
the full syntax:
(?<=\$)(-?[0-9]+\.[0-9]+)
for numbers that contain only ".", and:
(?<=\$)(-?[0-9]+\.[0-9]+)|(-?[0-9]+\,[0-9]+)
for numbers that contain either "." or "," (but not both inside the same number). For example, with the following string:
" $-99.000 $10.0000 $$$999,000 $-99,000ppjujj okeer134124- "
the code should give you:
[('-99.000', ''), ('10.0000', ''), ('', '999,000'), ('', '-99,000')]
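The result above can be reproduced with a short sketch like the following. The variable names are illustrative; the pattern and the test string are taken from the answer. Since each tuple has one empty slot, a small extra step (not part of the original answer) collapses the tuples into plain strings.

```python
import re

# Pattern from the answer. Alternation has the lowest precedence, so the
# lookbehind (?<=\$) only guards the first alternative (the "." form).
pattern = r"(?<=\$)(-?[0-9]+\.[0-9]+)|(-?[0-9]+\,[0-9]+)"
text = " $-99.000 $10.0000 $$$999,000 $-99,000ppjujj okeer134124- "

# With two capturing groups, findall returns one 2-tuple per match.
pairs = re.findall(pattern, text)
print(pairs)

# Keep whichever group actually matched in each tuple.
amounts = [dot or comma for dot, comma in pairs]
print(amounts)
```

If the "," form should also be required to follow a $, one option is to wrap both alternatives in a single group after the lookbehind.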
I hope this is helpful.

Related

regex to match coordinates

I am trying to write a regex expression in python that can match the following lines - I am just able to match the very first number by doing something like this
re.compile(r'\d.\d{14}\s+')
but could not do rest. Also tried doing [^-\d] to catch the negative sign - does not seem working.
Any help? Thanks!

First, let's start by looking at the numbers. You've already got a decent expression for finding a single number (\d.\d{14}\s+), but there are a couple of things wrong with it.
In regex, . indicates any single character. This means that your expression will accept any character after the first digit.
It's not taking into account the possibility that there could be a negative sign at the beginning.
Both of these problems are really easy to fix. The first can be fixed by simply escaping the period (\.). The second can be fixed by adding the negative sign to the pattern and giving it a quantifier. In this case, the ? quantifier will be the best option because it matches between 0 and 1 times. All this means is that it won't care if the symbol is there, but if it is it will match it. After these 2 changes, the pattern looks like this: -?\d\.\d{14}\s+.
Next, we need to tell it to match more than once. This can be done very easily by putting the pattern in a group and applying a quantifier to said group. Now the question is which quantifier should be used. In your example, there are only 3 numbers before the single character at the end of the line. You can match this pattern exactly 3 times by using the {3} quantifier. If you know there will be at least 1 but don't know how many in total there will be, you can use the + quantifier. For this example I will be using the {3} quantifier just so it's more specific to your question. After adding this, the pattern will look something like this: (-?\d\.\d{14}\s+){3}
Now all that's left is to match the character at the end. You can use \S to match any single non-whitespace character. You can add a quantifier to it, but again, for the purposes of your question, I won't since there's only a single character. The final expression would look like (-?\d\.\d{14}\s+){3}\S.
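A minimal sketch of the final expression in use follows. The sample line is hypothetical, since the original data is not shown in the question; it simply has three numbers with 14 decimals each and one trailing character.

```python
import re

# Three numbers of the form d.dddddddddddddd, each followed by whitespace,
# then one non-whitespace character at the end.
LINE = re.compile(r"(-?\d\.\d{14}\s+){3}\S")

# Hypothetical input line (not taken from the question).
line = "0.12345678901234 -1.23456789012345 2.34567890123456 A"

is_valid = LINE.fullmatch(line) is not None
print(is_valid)

# A repeated group only remembers its last repetition, so use findall
# with the single-number pattern to pull out every number.
numbers = re.findall(r"-?\d\.\d{14}", line)
print(numbers)
```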

How to check if the whole input string (real numbers separated by a space) matches a regex in Python?

I have an input string consisting of a sequence of real numbers separated by a single space. It is also acceptable for the string to contain only one real number (no spaces). My goal is to check whether the string structure matches the following (in this order):
optional (0/1): minus (-)
1/more digits
optional (0/1): a period and 1/more digits
optional (0+): a group consisting of a space and the first group (the first three bullet points)
It should describe the string completely. If not, it should print an error message and exit.
My current regular expression is ^(-?\d+(\.?\d)*)( \1)*$ which I thought would be okay, but even the first group doesn't match all the real numbers individually. And I need it to check the string from the beginning to the end, including the spaces.
My code for this function looks like this:
import re

def structure_check(string):
    structure = r"^(-?\d+(\.?\d)*)( \1)*$"
    if re.match(structure, string):
        return "OK"
    else:
        print("Input error")
        exit()
It should accept strings like: 15 35 -45 8 -2.3 4564.18 56 etc., but it doesn't match inputs like these at all. It shouldn't match if there are too many spaces, an incorrectly placed . or -, or characters other than digits, periods, dashes (-) and spaces.
I could also do this with just the first group while iterating over a list created by splitting the input string by space, but I would prefer to check it according to my main goal. That way I wouldn't have to split the input in the validation function, and I would save some code lines by checking the input all together (e.g. for excess spaces or unsupported characters, which I'd otherwise have to check separately).
Sorry if I missed any answered questions, I couldn't find any appropriate for my problem in Python. If you know about any, feel free to link them, please. And thank you, I am a beginner and started learning regex for a project just about yesterday.

You can use:
^((?:[+-]?\d+(?:[.]\d+)?)(?:[ \t]|$))*$
I added + to the optional sign. If you only want to match with no sign or -, just remove that from the optional character class.

You could also use an unrolled version to prevent matching a space at the end.
^-?\d+(?:\.\d+)?(?: -?\d+(?:\.\d+)?)*$
The backreference \1 matches exactly the text that group 1 matched, so your pattern only accepts repeated copies of the same number, for example 123 123 123.
If you want to repeat the group, you could recurse the first group using the PyPi regex module and (?1)
^(-?\d+(?:\.\d+)?)(?: (?1))*$
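As a sketch of how the unrolled pattern above could be used for validation with only the standard re module, consider the following. The function name structure_ok is hypothetical, and re.fullmatch is used here instead of the ^ and $ anchors.

```python
import re

# Unrolled pattern: one number, then zero or more " number" repetitions.
STRUCTURE = re.compile(r"-?\d+(?:\.\d+)?(?: -?\d+(?:\.\d+)?)*")

def structure_ok(string):
    """Return True if the whole string is numbers separated by single spaces."""
    # fullmatch requires the entire string to fit the pattern.
    return STRUCTURE.fullmatch(string) is not None

print(structure_ok("15 35 -45 8 -2.3 4564.18 56"))  # True
print(structure_ok("15  35"))                       # False: two spaces
print(structure_ok("1.2.3"))                        # False: two periods
print(structure_ok("15 "))                          # False: trailing space
```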

The problem is in your regexp, specifically in the ( \1)* part.
Spelled out, it means: a space followed by the exact string that was matched in group 1, zero or more times.
Thus, your regexp will match for the following, for example:
15 15 15
-5.3 -5.3 -5.3 -5.3
And so on.
To fix the regexp, I would replace the group reference with the actual group, like so:
^(-?\d+(\.?\d)*)( -?\d+(\.?\d)*)*$
I would also point out that this regexp allows numbers to have multiple decimal dots (e.g. 1.2.3 passes); I'm not sure whether that's intended.

In JavaScript you can use the .test method of a regex. The same regex should work in Python.
let ok = /^(([+\-]?\d+(\.\d+)?)( |$))+$/.test("15 35 -45 8 -2.3 4564.18 56");
console.log(ok);
Explanation: in (\.\d+)? the whole fractional group is optional. Each number must be followed by a space or the end of the string, ( |$). The pattern repeats throughout the string, so I wrapped the entire expression in a group with +. The ^ at the beginning and the $ at the end force the regex to check the string completely.

regex by using "." to take one character [duplicate]

I want a regular expression to match a string that may or may not start with plus symbol and then contain any number of digits.
Those should be matched
+35423452354554
or
3423564564

This should work
\+?\d+
Matches an optional + followed by one or more digits.
EDIT:
In response to the OP's request for clarification: 3423kk55 is matched because its first part (3423) matches. To match only a whole string, use this instead:
^\+?\d+$

It'll look something like this:
\+?\d+
The \+ means a literal plus sign, the ? means that the preceding group (the plus sign) can appear 0 or 1 times, \d indicates a digit character, and the final + requires that the preceding group (the digit) appears one or more times.
EDIT: When using regular expressions, bear in mind that there's a difference between find and matches (in Java at least, though most regex implementations have similar methods). find will find the substring somewhere in the owning string, and matches will try to match the entire string against the pattern, failing if there are extra characters before or after. Ensure you're using the right method, and remember that you can add a ^ to force the beginning of the line and a $ to force the end of the line (making the entire thing look like ^\+?\d+$).
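The same distinction exists in Python's re module, where re.search plays the role of find and re.fullmatch the role of matches. A small sketch, with illustrative variable names:

```python
import re

pattern = r"\+?\d+"

# search finds the pattern anywhere, so the leading digits are enough.
partial = re.search(pattern, "3423kk55")
print(partial.group())  # 3423

# fullmatch requires the entire string to fit, like ^\+?\d+$.
rejected = re.fullmatch(pattern, "3423kk55")
print(rejected)  # None

accepted = re.fullmatch(pattern, "+35423452354554")
print(accepted.group())  # +35423452354554
```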

Simple ^\+?\d+$
Start of line, then 1 or 0 plus signs, followed by at least 1 digit, then end of line.

A Perl regular expression for it could be: \+?\d+

replacement pattern in python(re.sub())

The question
Can someone please explain the process of the following re.sub() to me.
I am thinking the process is as following:
look for a "." then look for a digit then look for another digit that is between 1 and 9. Now I am lost. What is the question mark for? What does the \d* do? Why do we need to use raw string regex in this case?

If you want to understand the process, I can explain it simply. I don't know whether this regular expression does what you want or not.
First, . is a special character in regex that means any character. Here, however, we want a literal dot. In regex this is done with the escape character \, like so: \.. So . means any character and \. means a dot.
The \d represents any digit and acts like [0-9].
When you use [1-9], you specify the digits from 1 to 9, which means that zero is excluded.
The asterisk * means zero or more of the preceding item, unlike +, which means one or more. So \d* means any run of consecutive digits from [0-9], or none.
The ? means one of the preceding item or none. So [1-9]? means: match one digit between 1 and 9 if it is there.
The parentheses () form a capturing group, so the text they match can be referred to later, for example in the replacement string.
If you want to know more about regular expression, here is an awesome cheat sheet.
NOTE:
I think the regex you have written in the question is not correct. I think it should be (\d*\.\d\d[1-9]?) to obtain the same result. I will explain this regular expression using the number 3.141500012. \d*\. matches any digits found before the dot plus the dot itself, which here is 3.. After that, \d\d matches the two digits after the dot, which are 14. Finally, [1-9]? matches one digit between 1 and 9 if present, which is the 1 in our example.
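The walkthrough above can be checked with a short sketch. The original re.sub() call is not shown in the question, so this only applies the suggested pattern with re.search; the variable names and the second sample number are illustrative.

```python
import re

# Pattern suggested in the note above.
pattern = r"(\d*\.\d\d[1-9]?)"

first = re.search(pattern, "3.141500012").group(1)
print(first)   # 3.141  (3. then 14 then the optional non-zero digit 1)

# When the third decimal is 0, [1-9]? matches nothing and is skipped.
second = re.search(pattern, "2.50000").group(1)
print(second)  # 2.50
```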

Need a specific explanation of part of a regex code

I'm developing a calculator program in Python, and need to remove leading zeros from numbers so that calculations work as expected. For example, if the user enters "02+03" into the calculator, the result should return 5. In order to remove these leading zeroes in-front of digits, I asked a question on here and got the following answer.
self.answer = eval(re.sub(r"((?<=^)|(?<=[^\.\d]))0+(\d+)", r"\1\2", self.equation.get()))
I fully understand how the positive lookbehind to the beginning of the string and lookbehind to the non digit, non period character works. What I'm confused about is where in this regex code can I find the replacement for the matched patterns?
I found this online when researching regex expressions.
result = re.sub(pattern, repl, string, count=0, flags=0)
Where is the "repl" in the regex code above? If possible, could somebody please help to explain what the r"\1\2" is used for in this regex also?
Thanks for your help! :)

The "repl" part of the regex is this component:
r"\1\2"
In the "find" part of the regex, group capturing is taking place (ordinarily indicated by "()" characters around content, although this can be overridden by specific arguments).
In Python regex, the syntax used to refer to a captured group by position (sometimes called a "backreference") is "\n" (where "n" is a digit referring to the position of the group in the "find" part of the regex).
So, this regex is returning a string in which the overall content is being replaced specifically by parts of the input string matched by numbered groups.
Note: I don't believe the "\1" part of the "repl" is actually required. I think:
r"\2"
...would work just as well.
Further reading: https://www.regular-expressions.info/brackets.html
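To see the replacement string at work, here is a small sketch that applies the question's pattern without the eval() call. The helper name strip_leading_zeros and the second input are hypothetical; the pattern and both replacement strings come from the posts above.

```python
import re

# Group 1 is the (zero-width) lookbehind alternative, group 2 the digits
# that remain after the leading zeros.
PATTERN = r"((?<=^)|(?<=[^\.\d]))0+(\d+)"

def strip_leading_zeros(expression, repl=r"\1\2"):
    """Replace each match (zeros + digits) with the text of the named groups."""
    return re.sub(PATTERN, repl, expression)

print(strip_leading_zeros("02+03"))         # 2+3
print(strip_leading_zeros("100+005*0.05"))  # 100+5*0.05

# Group 1 only ever captures an empty string, so r"\2" alone gives the same result.
print(strip_leading_zeros("02+03", repl=r"\2"))  # 2+3
```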

Firstly, repl is the replacement string: the text that each match will be replaced with.
To understand \1\2 you need to know what group capturing is.
Check this video out for basics of Group capturing.
Here, your regex splits every match it finds into groups numbered 1, 2 and so on, because of the parentheses () you have placed in the regex.
In Python's replacement string they are referred to as \1, \2; some other tools use $1, $2.
In this case: the regex replaces each match (the leading zeros plus the digits after them) with just the digits after the leading zeros, which are captured by group 2.
Note: \1 is not necessary. It works fine without it.

See example:
>>> import re
>>> s = 'awd232frr2cr23'
>>> re.sub(r'\d', ' ', s)
'awd   frr cr  '
Explanation:
\d matches any single digit, so re.sub replaces each digit with repl (in this case ' ').
