Regular expression match numbers in Python - python

I have a list of numbers from the interval (0;1]. For example:
0.235
0.4
1.00
0.533
1
I need to append some new numbers to the list. To check that the new numbers are valid, I need to write a regex.
First I wrote a simple regex: [0|1\.]{2}\d+, but it ignores one condition: if the integer part is 1, the fractional part may contain only zeros (zero or more of them).
So I tried to use lookahead assertions to emulate an if-else condition: (?([0\.]{2})\d+|[0]+), but it isn't working. Where is my mistake? How can I check that none of the numbers is greater than 1?

Better than regex is to try to convert the string to a float and check whether it is in the range:
def convert(s):
    f = float(s)  # raises ValueError if s is not a valid float literal
    if not 0. < f <= 1.:
        raise ValueError()
    return f
This function returns a float in the interval (0, 1], or raises a ValueError (if the string is invalid or the float is outside that interval).

To expand on my comment above:
The regex you want should be:
"1 maybe followed by only 0's" OR "0 followed by a dot then some more numbers, which aren't all zeroes"
Breaking it down like this makes it easier to write.
For the first part "1 maybe followed by only 0's":
^1(\.0+)?$
This is fairly straightforward: "1" followed by (\.0+) zero or one times, where (\.0+) is a literal "." followed by one or more "0"s.
And for the second part
^0\.(?!0+$)\d+$
This is a bit trickier. It is "0." followed by a lookahead "(?!0+$)". What this means is that if "0+$" (= "0" one or more times before the end of the string) is found it won't match. After that check you have "\d+$", which is digits, one or more times.
Combining these with an or you get:
^1(\.0+)?$|^0\.(?!0+$)\d+$
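As a quick check, the combined pattern can be exercised from Python. This is only a sketch: the helper name is_valid_number and the sample strings are my own and do not come from the question.

```python
import re

# Combined pattern from above: "1" optionally followed by ".0...0",
# or "0." followed by digits that are not all zeroes.
PATTERN = re.compile(r'^1(\.0+)?$|^0\.(?!0+$)\d+$')

def is_valid_number(s):
    """Return True if s matches the (0, 1] pattern above."""
    return PATTERN.match(s) is not None

# Hypothetical sample inputs, valid ones first.
for s in ['0.235', '0.4', '1.00', '1', '0.000', '1.5', '2']:
    print(s, is_valid_number(s))
```

Note that the pattern is strict about the format: a string such as .5 is rejected because it has no leading 0.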

Related

Regex returns negative integers as positive [duplicate]

I am scraping a web page and extracting some values, of which I need only the numeric part. For example, if the string says "-14.32 kcal/mole", I want to get the float -14.32
To do this I am applying the following code:
import re
number_string = '-9.2 kcal/mole'
number = re.search(r"[-+]?\d*\.\d+|\d+", number_string).group()
print(number)
Output: -9.2
Whenever the number in number_string is a float it works fine. But when the number is a negative integer, I get the positive value of that number.
For example,
import re
number_string = '-4 kcal/mole'
number = re.search(r"[-+]?\d*\.\d+|\d+", number_string).group()
print(number)
Output: 4 (instead of -4)
| is the lowest-priority operator. Your pattern looks for an optionally signed float
[-+]?\d*\.\d+
or an unsigned integer
\d+
You need to parenthesize the expression for matching the absolute value to make the sign apply to either:
[-+]?(?:\d*\.\d+|\d+)
or make the fractional part optional.
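[-+]?\d*(?:\.\d+)?
Note that this second variant can also match the empty string, so re.search returns the number only when it sits at the very start of the string.
In both cases, I've used non-capturing groups to avoid changing the semantics of the subsequent call to the group method.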
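To make the precedence issue concrete, here is a small sketch that runs the original pattern and the parenthesized one on the string from the question. The variable names original and fixed are my own.

```python
import re

number_string = '-4 kcal/mole'

# Original pattern: the sign belongs only to the float alternative,
# so an integer is matched by the unsigned \d+ branch.
original = re.search(r"[-+]?\d*\.\d+|\d+", number_string).group()

# Grouped pattern: the sign applies to both alternatives.
fixed = re.search(r"[-+]?(?:\d*\.\d+|\d+)", number_string).group()

print(original)      # 4
print(fixed)         # -4
print(float(fixed))  # -4.0
```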
I would use something like this:
[+-]?(?:\d*\.)?\d+
[+-]? - optional positive or negative sign
(?:\d*\.)? - optional leading digits followed by decimal
\d+ - required digits
https://regex101.com/r/WKPQ4h/1
Since you are scraping web content this regex will simply find all numbers.
You will probably wish to target specific units of measurement:
[+-]?(?:\d*\.)?\d+(?= (?:kcal/mole|butterflies))
https://regex101.com/r/FM5ZXJ/1
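A short sketch of how the unit-targeted pattern might be used from Python. The sample text and the names UNIT_NUMBER and values are made up for illustration; only the regex itself comes from this answer.

```python
import re

# The lookahead requires a space and one of the listed units after the number,
# but the unit itself is not part of the match.
UNIT_NUMBER = re.compile(r'[+-]?(?:\d*\.)?\d+(?= (?:kcal/mole|butterflies))')

# Hypothetical scraped text: "300 K" has no listed unit and is skipped.
text = 'Energy -14.32 kcal/mole at 300 K, seen by 7 butterflies'

values = [float(m) for m in UNIT_NUMBER.findall(text)]
print(values)  # [-14.32, 7.0]
```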
Your regular expression is set up to search for either [-+]?\d*\.\d+ or \d+, which is why this is happening. You can change your regular expression to something like [-+]?\d*\.\d+|[-+]?\d+ and that should give you the expected result.

Reg Ex for specific number in string

I'd like to match numbers (int and real) in a string, but not if they are part of an identifier; e.g., I'd like to match 5.5 or 42, but not x5. Strings are roughly of the form "x5*1.1+42*y=40".
So far, I came up with
([0-9]*[.])?[0-9]+[^.*+=<>]
This correctly ignores x0, but it also ignores 0 and 0.5 (12.45, however, works). Changing the + to * leads to incorrect matches.
It would be very nice if someone could point out my error.
Thanks!
This is actually not simple. Float literals are more complex than you assumed, being able to contain an e or E for exponential format. Also, you can have prefixed signs (+ or -) for the number and/or the exponent. All in all it can be done like this:
re.findall(r'(?:(?<![a-zA-Z_0-9])|[+-]\s*)[\d.]+(?:[eE][+-]?\d+)?',
           'x5*1.1+42*y=40+a123-3.14e-2')
This returns:
['1.1', '+42', '40', '-3.14e-2']
You should consider, though, whether an input like 4-3 should lead to ['4', '3'] or ['4', '-3']. If the input were 4+-3, the '-3' would clearly be preferable. But distinguishing these cases isn't easy, and you should consider using a proper formula parser for them.
Maybe the standard module ast can help you. The expression must be a valid Python expression in this case, so something like a+b=40 isn't allowed, because the left side of the equals sign is not a proper lvalue. But for valid Python expressions you could use ast like this:
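import ast

def find_all_numbers(e):
    # Walk binary operations recursively, left operand first
    if isinstance(e, ast.BinOp):
        yield from find_all_numbers(e.left)
        yield from find_all_numbers(e.right)
    # ast.Constant replaces the deprecated ast.Num (Python 3.8+)
    elif isinstance(e, ast.Constant) and isinstance(e.value, (int, float)):
        yield e.value

list(find_all_numbers(ast.parse('x5*1.1+42*y-40').body[0].value))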
Returns:
[1.1, 42, 40]
You could do it with something like
\b\d*(\.\d+)?\b
It matches any number of digits (\d*) followed by an optional decimal part ((\.\d+)?). The \b matches word boundaries, i.e. the location between a word character and a non word character. And since both digits and (english) letters are word characters, it won't match the 5 in a sequence like x5.
See this regex101 example.
The main reason your attempt fails is that it ends with [^.*+=<>], which requires the number (or rather the match) to end with a character other than ., *, +, =, < or >. When the input ends with a single digit, as in 0 and 0.5, that digit gets eaten by the [0-9]+, there is nothing left for [^.*+=<>] to match, and thus the match fails. In the case of 12.45 it first matches 12.4 and then the [^.*+=<>] matches the 5.
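The behaviour described above can be reproduced with a few lines of Python. This is just a sketch; the helper first_match is my own name, and the pattern is the one from the question.

```python
import re

# Pattern from the question: the trailing negated class must consume
# one extra character after the digits.
pattern = re.compile(r'([0-9]*[.])?[0-9]+[^.*+=<>]')

def first_match(s):
    """Return the text of the first match in s, or None if there is none."""
    m = pattern.search(s)
    return m.group(0) if m else None

print(first_match('0.5'))    # None: nothing is left for the trailing class
print(first_match('0'))      # None
print(first_match('12.45'))  # 12.45: the trailing class consumes the final 5
```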
Do something like ((?<![a-zA-Z_])\d+(\.\d+)?)
It is using negative lookbehind in order not to select anything having [a-zA-Z_] prior to it.
Check it out here in Regex101.
As for your regex ([0-9]*[.])?[0-9]+[^.*+=<>]: use [0-9]+ instead of [0-9]* if you want to capture 0.5 but not .05. Also, you could add ? after the [^.*+=<>] part to make the trailing character optional. For example, 1.1 is not captured because ([0-9]*[.])?[0-9]+ is satisfied, but the [^.*+=<>] that follows it is not.

Replacing all numeric value to formatted string

What I am trying to do is:
Find out all the numeric values in a string.
import re
input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
numbers = re.finditer(r'[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?', input_string)
for number in numbers:
    print("{} start > {}, end > {}".format(number.group(), number.start(0), number.end(0)))
'''Output'''
>>100 start > 12, end > 15
>>79.80 start > 18, end > 23
And then I want to replace every integer and float value with a certain format:
INT_(number of digits) and FLT_(number of decimal places)
e.g. 100 -> INT_3 // 79.80 -> FLT_2
Thus, the expected output string is like this:
"高露潔光感白輕悅薄荷牙膏INT_3 FLT_2"
But the string replace method in Python is kind of weird and can't achieve what I want.
So I am trying to slice and concatenate substrings:
string[:number.start(0)] + "INT_%s"%len(number.group()) +.....
which looks stupid and most importantly I still can't make it work.
Can anyone give me some advice on this problem?
Use re.sub and a callback method inside where you can perform various manipulations on the match:
import re

def repl(match):
    chunks = match.group(1).split(".")
    if len(chunks) == 2:
        return "FLT_{}".format(len(chunks[1]))
    else:
        return "INT_{}".format(len(chunks[0]))

input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
result = re.sub(r'[-+]?([0-9]*\.?[0-9]+)(?:[eE][-+]?[0-9]+)?', repl, input_string)
print(result)
See the Python demo
Details:
The regex now has a capturing group around the number part (([0-9]*\.?[0-9]+)); this group is analyzed inside the repl method.
Inside the repl method, the contents of Group 1 are split on . to see whether we have a float/double. If so, we return the length of the fractional part; otherwise, the length of the integer.
You need to group the parts of your regex, possibly like this:
import re

def repl(m):
    if m.group(1) is None:  # int
        return "INT_%i" % len(m.group(2))
    else:  # float
        return "FLT_%i" % len(m.group(2))

input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
numbers = re.sub(r'[-+]?([0-9]*\.)?([0-9]+)([eE][-+]?[0-9]+)?', repl, input_string)
print(numbers)
group 0 is the whole string that was matched (can be passed to float or int)
group 1 is any digits before the . plus the . itself if it exists, else it is None
group 2 is all digits after the . if it exists, else it is just all the digits
group 3 is the exponential part if it exists, else None
You can get a python-number from it with
def parse(m):
    s = m.group(0)
    # if there is a dot or an exponential part it must be a float
    if m.group(1) is not None or m.group(3) is not None:
        return float(s)
    else:
        return int(s)
You are probably looking for something like the code below (of course there are other ways to do it). This one just starts from what you were doing and shows how it can be done.
import re

input_string = u"高露潔光感白輕悅薄荷牙膏100 79.80"
numbers = re.finditer(r'[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?', input_string)
s = input_string
# Iterate over the matches in reverse so earlier positions stay valid
for m in list(numbers)[::-1]:
    num = m.group(0)
    if '.' in num:
        s = "%sFLT_%s%s" % (s[:m.start(0)], str(len(num) - num.index('.') - 1), s[m.end(0):])
    else:
        s = "%sINT_%s%s" % (s[:m.start(0)], str(len(num)), s[m.end(0):])
print(s)
This may look a bit complicated because there are really several simple problems to solve.
For instance, your initial regex finds both ints and floats, but you wish to apply totally different replacements afterward. This would be much more straightforward if you were doing only one thing at a time. But as parts of floats may look like an int, doing everything at once may not be such a bad idea; you just have to understand that this will require a secondary check to discriminate between the two cases.
Another, more fundamental issue is that you can't really replace anything in a Python string. Python strings are immutable objects, hence you have to make a copy. This is fine anyway, because the format change may need insertion or removal of characters and an in-place replacement wouldn't be efficient.
The last difficulty to take into account is that replacements must be made backward, because if you change the beginning of the string, the match positions would also change and the next replacement wouldn't land in the right place. If we do it backward, all is fine.
Of course I agree that using re.sub() is much simpler.

Why does this regex not match the second binary gap?

Trying a solution for the problem listed here in python, I thought I'd try a nice little regex to capture the maximum "binary gap" (chains of zeroes in the binary representation of a number).
The function I wrote for the problem is below:
import re

def solution(N):
    max_gap = 0
    binary_N = format(N, 'b')
    # Each capture is a run of zeroes enclosed by a 1 on both sides
    gaps = re.findall(r'1(0+)1', binary_N)
    for element in gaps:
        if len(element) > max_gap:
            max_gap = len(element)
    return max_gap
And it works pretty well. However... for some reason, it does not match the second set of zeroes in 10000010000000001 (binary representation of 66561). The 9 zeroes don't appear in the list of matches so it must be a problem with the regex - but I can't see where it is as it matches every other example given!
The same bit can't be included in two matches. Your regex matches a 1 followed by one or more 0s and ends with another 1. Once the first match has been found you are left with 0000000001 which doesn't start with a 1 so isn't matched by your regex.
As mentioned by @JoachimIsaksson, if you want to match both sets of 0s, you can use a lookahead so that the final 1 is checked but isn't included in the match: r'1(0+)(?=1)'.
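A minimal sketch of the function with the lookahead applied. The name binary_gap and the use of max with a default are my own choices, not part of the question.

```python
import re

def binary_gap(n):
    """Length of the longest run of zeroes enclosed by 1s in the binary form of n."""
    binary = format(n, 'b')
    # The lookahead checks the closing 1 without consuming it,
    # so that 1 can also open the next gap.
    gaps = re.findall(r'1(0+)(?=1)', binary)
    return max((len(g) for g in gaps), default=0)

print(binary_gap(66561))  # 9
```

With the original pattern r'1(0+)1' the same input yields 5, because the middle 1 is consumed by the first match.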

python: regular expressions, how to match a string of undefined length which has a structure and finishes with a specific group

I need to create a regexp to match strings like this: 999-123-222-...-22
The string may optionally end with &Ns=(any number). So valid strings for me are:
999-123-222-...-22
999-123-222-...-22&Ns=12
And the following is not valid:
999-123-222-...-22&N=1
I have been testing this for several hours already, but have not managed to solve it. I really need some help.
Not sure if you want to literally match 999-123-222-...-22 or if that can be any sequence of numbers/dashes. Here are two different regexes:
/^[\d-]+(&Ns=\d+)?$/
/^999-123-222-\.\.\.-22(&Ns=\d+)?$/
The key idea is the (&Ns=\d+)?$ part, which matches an optional &Ns=<digits>, and is anchored to the end of the string with $.
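In Python the surrounding slashes are dropped and the pattern is passed to the re module as a raw string. The sketch below uses the first, more general regex. The helper name is_valid and the sample strings (which leave out the elided "..." part) are assumptions of mine.

```python
import re

# Digits and dashes, then an optional "&Ns=<digits>" anchored at the end.
CODE = re.compile(r'^[\d-]+(&Ns=\d+)?$')

def is_valid(s):
    """Return True if s is digits/dashes with an optional &Ns=<digits> suffix."""
    return CODE.match(s) is not None

# Hypothetical sample inputs.
print(is_valid('999-123-222-22'))        # True
print(is_valid('999-123-222-22&Ns=12'))  # True
print(is_valid('999-123-222-22&N=1'))    # False
```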
If you just want to allow the strings 999-123-222-...-22 and 999-123-222-...-22&Ns=12, you had better use a string function.
If you want to allow any numbers between the dashes, you can use the regex:
^(\d+-){3}[.]{3}-\d+(&Ns=\d+)?$
If the numbers must be of only 3 digits and the last number of only 2 digits you can use:
^(\d{3}-){3}[.]{3}-\d{2}(&Ns=\d{2})?$
This looks like a phone number and extension information.
Why not make things simpler for yourself (and anyone who has to read this later) and split the input rather than use a complicated regex?
s = '999-123-222-...-22&Ns=12'
parts = s.split('&Ns=')  # splits on '&Ns=' and removes it
If the piece before the "&" is a phone number, you could do another split and get the area code etc. into separate fields, like so:
phone_parts = parts[0].split('-')  # breaks up the digit string and removes the '-'
area_code = phone_parts[0]
The portion found after the optional '&Ns=' can be checked to see if it is numeric with the string method isdigit, which returns True if all characters in the string are digits and there is at least one character, and False otherwise.
if len(parts) > 1:
    extra_digits_ok = parts[1].isdigit()
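Put together, the split-based steps above might look like the following sketch. The function name split_code, the None return for invalid input, and the extra check for a stray "&" are my own assumptions. The dash-separated fields themselves are not validated here.

```python
def split_code(s):
    """Return (fields, suffix) for s, or None if the suffix part is malformed."""
    parts = s.split('&Ns=')
    # Assumption: at most one "&Ns=" is allowed, and no other "&" before it.
    if len(parts) > 2 or '&' in parts[0]:
        return None
    suffix = parts[1] if len(parts) == 2 else None
    # isdigit() is False for the empty string, so "&Ns=" alone is rejected.
    if suffix is not None and not suffix.isdigit():
        return None
    fields = parts[0].split('-')
    return fields, suffix

print(split_code('999-123-222-22&Ns=12'))
print(split_code('999-123-222-22&N=1'))
```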
