I am scraping a web page and extracting some values, of which I need only the numeric part. For example, if the string says "-14.32 kcal/mole", I want to get the float -14.32.
To do this I am applying the following code:
import re
number_string = '-9.2 kcal/mole'
number = re.search(r"[-+]?\d*\.\d+|\d+", number_string).group()
print(number)
Output: -9.2
Whenever the number in number_string is a float, it works fine. But when the number is a negative integer, I get the positive value of that number.
For example,
import re
number_string = '-4 kcal/mole'
number = re.search(r"[-+]?\d*\.\d+|\d+", number_string).group()
print(number)
Output: 4 (instead of -4)
| is the lowest-priority operator. You are looking for an optionally signed float
[-+]?\d*\.\d+
or an unsigned integer
\d+
You need to parenthesize the expression for matching the absolute value to make the sign apply to either:
[-+]?(?:\d*\.\d+|\d+)
or make the fractional part optional.
[-+]?\d*(?:\.\d+)?
In both cases, I've used non-capture groups to avoid changing the semantics of the following call to the groups method.
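A quick check of the grouped pattern against both inputs from the question may help. This is only a sketch: the helper name `extract_number` is made up for illustration and is not part of the original code.

```python
import re

# The sign now applies to both alternatives because they are grouped together.
NUMBER_RE = re.compile(r"[-+]?(?:\d*\.\d+|\d+)")

def extract_number(text):
    """Return the first number in text as a float, or None if there is none."""
    match = NUMBER_RE.search(text)
    return float(match.group()) if match else None

print(extract_number("-9.2 kcal/mole"))  # -9.2
print(extract_number("-4 kcal/mole"))    # -4.0
```

Returning None instead of calling `.group()` directly avoids an AttributeError when the scraped string contains no number at all.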
I would use something like this:
[+-]?(?:\d*\.)?\d+
[+-]? - optional positive or negative sign
(?:\d*\.)? - optional leading digits followed by decimal
\d+ - required digits
https://regex101.com/r/WKPQ4h/1
Since you are scraping web content this regex will simply find all numbers.
You will probably wish to target specific units of measurement:
[+-]?(?:\d*\.)?\d+(?= (?:kcal/mole|butterflies))
https://regex101.com/r/FM5ZXJ/1
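As a rough illustration of the unit-targeted pattern in Python (the sample text is invented for this sketch):

```python
import re

# The lookahead requires a space plus a known unit, but keeps both out of the match.
UNIT_RE = re.compile(r"[+-]?(?:\d*\.)?\d+(?= (?:kcal/mole|butterflies))")

text = "pH 7 at 25 C gave -14.32 kcal/mole and 3 butterflies"
values = [float(m) for m in UNIT_RE.findall(text)]
print(values)  # [-14.32, 3.0]
```

The numbers 7 and 25 are skipped because they are not followed by one of the listed units.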
Your regular expression is set up to search for [-+]?\d*\.\d+ or \d+, which is why this happens. You can change your regular expression to something like [-+]?\d*\.\d+|[-+]?\d+, and that should give you the expected result.
I am wondering what a regular expression for testing the correct format of a number for German culture would look like.
In German, comma is used as decimal mark and dot is used to separate thousands.
Therefore:
1.000 equals to 1000
1,000 equals to 1
1.000,89 equals to 1000.89
1.000.123.456,89 equals to 1000123456.89
The real trick, it seems to me, is to make sure that there can be several dots, optionally followed by a comma separator.
This is the regex I would use:
^-?\d{1,3}(?:\.\d{3})*(?:,\d+)?$
Debuggex Demo
And this is a code example to interpret it as a valid floating point (notice the parseFloat() after the string replacements).
Edit: as mentioned in Severin Klug's answer, the code below assumes that the numbers are known to be in German format. Attempting to "detect" whether a string contains a German-format or US-format number is not trivial and is out of scope for this question. '1.234' is valid in both formats but with different actual values; without context it is impossible to know for sure which format was meant.
var numbers = ['1.000', '1,000', '1.000,89', '1.000.123.456,89'];
document.getElementById('out').value=numbers.map(function(str) {
  return parseFloat(str.replace(/\./g, '').replace(',', '.'));
}).join('\n');
<textarea id="out" rows="10" style="width:100%"></textarea>
I would have posted this as a comment, but I don't have enough reputation.
@funkwurm, your post https://stackoverflow.com/a/28361329/7329611 contains this JavaScript:
var numbers = ['1.000', '1,000', '1.000,89', '1.000.123.456,89', '1.2'];
numbers.map(function(str) {
  return parseFloat(str.replace(/\./g, '').replace(',', '.'));
}).join('\n');
which should convert German numbers to English/international ones, and it does so for every number with exactly three digits after a German thousands dot, like the numbers in your example array. BUT, and this is the critical error in the use case: it also deletes the dot from any string that does not have exactly three digits after it.
So if you insert a string like '1.2' it returns 12, and if you insert '1.23' it returns 123.
This is critical behaviour if anyone just takes the above code snippet and assumes it will convert any given number correctly into English format, because numbers that are already in English format will be corrupted! So be careful, please.
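One way to guard against this is to validate the string against the German-format regex before stripping the dots. The following is a minimal Python sketch of that idea; `parse_german_number` is a hypothetical helper name, and the input is assumed to be a single number with nothing around it.

```python
import re

# Groups of three digits separated by dots, then an optional comma and fraction.
GERMAN_NUMBER_RE = re.compile(r"-?\d{1,3}(?:\.\d{3})*(?:,\d+)?")

def parse_german_number(text):
    """Convert a German-formatted number to float; raise ValueError otherwise."""
    if not GERMAN_NUMBER_RE.fullmatch(text):
        raise ValueError(f"not a German-formatted number: {text!r}")
    # Safe to strip dots now: each one is known to be a thousands separator.
    return float(text.replace(".", "").replace(",", "."))

print(parse_german_number("1.000,89"))  # 1000.89
```

With this check, '1.2' and '1.23' raise ValueError instead of silently becoming 12 and 123. Note that the pattern also rejects '1234', since it expects thousands separators once there are more than three integer digits.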
This regex should work :
([0-9]{1,3}(?:\.[0-9]{3})*(?:\,[0-9]+)?)
A good regex would be something like this
// Verbatim string (@"...") so the backslashes reach the regex engine unchanged.
Regex regex = new Regex(@"-?\d{1,3}(?:\.\d{3})*(?:,\d+)?");
Match match = regex.Match(input);
Decimal result = Decimal.Zero;
if (match.Success)
    result = Decimal.Parse(match.Value, new CultureInfo("de-DE"));
The result is the German number as a parsed value.
Try this it will match your inputs:
^(\d+\.)*\d+(,\d+)?
This regex would work for positive numbers
/^[0-9]{0,3}(\.[0-9]{3})*(,[0-9]{0,2})?$/
Breakdown
[0-9]{0,3} - this section allows zero up to 3 numbers. empty value is valid, '1', '26', '789' are valid. '1589' is invalid
(\.[0-9]{3})* - this section allows zero or more dots... if there's a dot, there must be three digits after the dot. '2.589' is valid. '2.5896' and '2.45' are invalid
(,[0-9]{0,2})? - this section allows zero or 1 comma. there can be zero up to 2 digits after the comma. '25,', '25,5', '25,45' are valid. '25,456' and '25,45,8' are invalid
Hope this is helpful
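The examples from the breakdown can be checked mechanically. A small Python sketch (the function name `is_valid` is just for illustration):

```python
import re

PATTERN = re.compile(r"^[0-9]{0,3}(\.[0-9]{3})*(,[0-9]{0,2})?$")

def is_valid(text):
    """True if text matches the positive German-format pattern above."""
    return PATTERN.match(text) is not None

valid = ["", "1", "26", "789", "2.589", "25,", "25,5", "25,45"]
invalid = ["1589", "2.5896", "2.45", "25,456", "25,45,8"]

print(all(is_valid(s) for s in valid))        # True
print(any(is_valid(s) for s in invalid))      # False
```

Because every part of the pattern is optional, strings such as ',5' or '.123' are accepted too, which may or may not be what you want.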
I'd like to match numbers (int and real) in a string, but not if they are part of an identifier; e.g., I'd like to match 5.5 or 42, but not x5. Strings are roughly of the form "x5*1.1+42*y=40".
So far, I came up with
([0-9]*[.])?[0-9]+[^.*+=<>]
This correctly ignores x0, but also 0 or 0.5 (12.45, however, works). Changing the + to * leads to wrong matchings.
It would be very nice if someone could point out my error.
Thanks!
This is actually not simple. Float literals are more complex than you assumed, being able to contain an e or E for exponential format. Also, you can have prefixed signs (+ or -) for the number and/or the exponent. All in all it can be done like this:
re.findall(r'(?:(?<![a-zA-Z_0-9])|[+-]\s*)[\d.]+(?:[eE][+-]?\d+)?',
'x5*1.1+42*y=40+a123-3.14e-2')
This returns:
['1.1', '+42', '40', '-3.14e-2']
You should consider, though, whether a thing like 4-3 should lead to ['4', '3'] or ['4', '-3']. If the input was 4+-3, the '-3' would clearly be preferable. But distinguishing these cases isn't easy, and you should consider using a proper formula parser for them.
Maybe the standard module ast can help you. The expression must be a valid Python expression in this case, so a thing like a+b=40 isn't allowed because left of the equal sign is no proper lvalue. But for valid Python objects you could use ast like this:
import ast
def find_all_numbers(e):
    if isinstance(e, ast.BinOp):
        yield from find_all_numbers(e.left)
        yield from find_all_numbers(e.right)
    # ast.Constant replaces the deprecated ast.Num (removed in Python 3.12)
    elif isinstance(e, ast.Constant) and isinstance(e.value, (int, float)):
        yield e.value
list(find_all_numbers(ast.parse('x5*1.1+42*y-40').body[0].value))
Returns:
[1.1, 42, 40]
You could do it with something like
\b\d*(\.\d+)?\b
It matches any number of digits (\d*) followed by an optional decimal part ((\.\d+)?). The \b matches word boundaries, i.e. the location between a word character and a non word character. And since both digits and (english) letters are word characters, it won't match the 5 in a sequence like x5.
See this regex101 example.
The main reason your attempt fails is that it ends with [^.*+=<>], which requires the number (or rather the match) to end with a character other than ., *, =, +, < or >. When the number ends with a single digit, as in 0 and 0.5, that digit gets eaten by the [0-9]+, so there's nothing left to match the [^.*+=<>], and thus it fails. In the case of 12.45, it first matches 12.4 and then the [^.*+=<>] matches the 5.
Do something like ((?<![a-zA-Z_])\d+(\.\d+)?)
It is using negative lookbehind in order not to select anything having [a-zA-Z_] prior to it.
Check it out here in Regex101.
About your regex ([0-9]*[.])?[0-9]+[^.*+=<>]: use [0-9]+ instead of [0-9]* if you want to capture 0.5 but not .05. Also, the trailing [^.*+=<>] must consume one character; you could add ? after it to make that character optional. For example, 1.1 won't be captured when it is followed by an operator or by nothing at all, because ([0-9]*[.])?[0-9]+ is satisfied but the [^.*+=<>] that comes after it is not.
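A variation on the lookbehind idea, sketched in Python. Note that this is not the exact pattern from the answers above: it is an assumption of mine that digits and dots should also be excluded before the match (so the second 5 in x55 is not picked up), and that a lookahead should guard the end of the number.

```python
import re

# Not preceded and not followed by a word character or a dot.
NUMBER_RE = re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?![\w.])")

def find_numbers(expr):
    """Return all standalone int/real literals in expr as strings."""
    return NUMBER_RE.findall(expr)

print(find_numbers("x5*1.1+42*y=40"))  # ['1.1', '42', '40']
print(find_numbers("x55+0.5=0"))       # ['0.5', '0']
```

This sketch ignores signs and exponent notation; for those, the more elaborate pattern or a real parser is the safer route.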
I'm trying to extract only valid percentage information and eliminate any incorrect representation from a string using regular expression in python. The function should work like this,
For,
0-100% = TRUE
0.12% = TRUE
23.1245467% = TRUE
9999% = FALSE
8937.2435% = FALSE
7.% = FALSE
I have checked a few solutions in stack overflow which only extract 0-100%. I have tried the following solutions,
('(\s100|[123456789][0-9]|[0-9])(\.\d+)+%')
'(\s100|\s\d{1,2})(\.\d+)+%'
'(\s100|\s\d[0-99])(\.\d+)+%'
All of these work for every other case except 0-99% (gives FALSE) and 12411.23526% (gives TRUE). The reason for the leading space is that I want to extract only two-digit numbers.
Figured it out. The problem lay in the '+' in the expression '(\.\d+)+', which should have been '(\.\d+)*'. The first expression requires a decimal part for every two-digit percentage value, whereas the second makes it optional. My final version is given below.
'\s(100|(\d{1,2}(\.\d+)*))%'
You can replace \s with ^ for percentage values at the beginning of a string. Also, the versions in my question accepted decimal values for 100, which is an invalid percentage value.
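A small sketch of how this could be tried out in Python. It deviates slightly from the final version above, which is my own assumption about what is wanted: (?:^|\s) covers both the whitespace case and the start of the string, and the decimal part is made optional with ? rather than *.

```python
import re

# (?:^|\s) accepts either the start of the string or a whitespace character.
PERCENT_RE = re.compile(r"(?:^|\s)(100|\d{1,2}(?:\.\d+)?)%")

def find_percentages(text):
    """Return the numeric part of every 0-100 percentage found in text."""
    return PERCENT_RE.findall(text)

text = "5% up, then 12.5% and 100% but not 9999% or 7.% or 100.5%"
print(find_percentages(text))  # ['5', '12.5', '100']
```

Ranges such as 0-100% are not handled by this pattern.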
I would not rely on regex alone - it is not meant to filter ranges in the first place.
Better look for candidates in your string and analyze them programmatically afterwards, like so:
import re
string = """
some gibberish in here 0-100% = TRUE
some gibberish in here 0.12% = TRUE
some gibberish in here 23.1245467% = TRUE
some gibberish in here 9999% = FALSE
some gibberish in here 8937.2435% = FALSE
some gibberish in here 7.% = FALSE
"""
numbers = []
# look for -, a digit, a dot ending with a digit and a percentage sign
rx = r'[-\d.]+\d%'
# loop over the results
for match in re.finditer(rx, string):
    interval = match.group(0).split('-')
    for number in interval:
        if 0 <= float(number.strip('%')) <= 100:
            numbers.append(number)
print(numbers)
# ['0', '100%', '0.12%', '23.1245467%']
Considering all possibilities, the following regex works.
If you just ignore the ?: (i.e. the non-capturing group markers), the regex is not that intimidating.
Regex: ^(?:(?:\d{1,2}(?:\.\d+)?\-)?(?:(?:\d{1,2}(?:\.\d+)?)|100))%$
Explanation:
(?:\d{1,2}(?:\.\d+)?\-)? matches the lower limit, if there is one (as in 0-100%), with an optional decimal part.
(?:(?:\d{1,2}(?:\.\d+)?)|100) matches the upper limit, or the single number if there is only one: at most two digits with an optional decimal part, or exactly 100.
Regex101 Demo
Another version of the same regex for matching such occurrences within the string would be to remove the anchor ^ and $ and check for non-digits at the beginning.
Regex: (?<=\D|^)(?:(?:\d{1,2}(?:\.\d+)?\-)?(?:(?:\d{1,2}(?:\.\d+)?)|100))%
Regex101 Demo
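To check the anchored version against the six cases from the question, something like the following Python sketch could be used. It uses `re.fullmatch` in place of the ^ and $ anchors; `is_valid_percentage` is a name chosen here for illustration.

```python
import re

RANGE_RE = re.compile(
    r"(?:\d{1,2}(?:\.\d+)?-)?"      # optional lower limit followed by a hyphen
    r"(?:\d{1,2}(?:\.\d+)?|100)"    # upper limit, or the single value
    r"%"
)

def is_valid_percentage(text):
    """True if the whole string is a valid percentage or percentage range."""
    return RANGE_RE.fullmatch(text) is not None

samples = ["0-100%", "0.12%", "23.1245467%", "9999%", "8937.2435%", "7.%"]
results = [is_valid_percentage(s) for s in samples]
print(results)  # [True, True, True, False, False, False]
```

For searching inside a longer string, keep in mind that, as far as I know, Python's built-in re module only accepts fixed-width lookbehind, so (?<=\D|^) is rejected there; (?<!\d) expresses the same condition.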
I have a list of numbers from the interval (0;1]. For example:
0.235
0.4
1.00
0.533
1
I need to append some new numbers to the list. To check correctness of new numbers, I need to write regex.
First I wrote a simple regex: [0|1\.]{2}\d+, but it ignores one condition: if the integer part is 1, the fractional part may contain only zeros (zero or more of them).
So I tried to use lookahead assertions to emulate an if-else condition: (?([0\.]{2})\d+|[0]+), but it isn't working. Where is my mistake? How can I check that none of the numbers is greater than 1?
Better than regex is to try to convert the string to a float and check whether it is in the range:
def convert(s):
    f = float(s)
    if not 0. < f <= 1.:
        raise ValueError()
    return f
This method returns a float between 0 and 1 or it raises a ValueError (if invalid string or float not between 0 and 1)
So explaining my comment from above:
The regex you want should be:
"1 maybe followed by only 0's" OR "0 followed by a dot then some more numbers, which aren't all zeroes"
Breaking it down like this makes it easier to write.
For the first part "1 maybe followed by only 0's":
^1(\.0+)?$
This is fairly straightforward: "1" followed by (\.0+) zero or one times, where (\.0+) is "." followed by one or more "0"s.
And for the second part
^0\.(?!0+$)\d+$
This is a bit trickier. It is "0." followed by a lookahead "(?!0+$)". What this means is that if "0+$" (= "0" one or more times before the end of the string) is found it won't match. After that check you have "\d+$", which is digits, one or more times.
Combining these with an or you get:
^1(\.0+)?$|^0\.(?!0+$)\d+$
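A short Python sketch to try the combined pattern on the numbers from the question plus a few that should be rejected (`in_unit_interval` is a hypothetical helper name):

```python
import re

UNIT_INTERVAL_RE = re.compile(r"^1(\.0+)?$|^0\.(?!0+$)\d+$")

def in_unit_interval(text):
    """True if text is a decimal literal in the interval (0; 1]."""
    return UNIT_INTERVAL_RE.match(text) is not None

accepted = ["0.235", "0.4", "1.00", "0.533", "1"]
rejected = ["0", "0.0", "0.000", "1.01", "1.5", "2", ".5", "0."]

print([in_unit_interval(s) for s in accepted])  # all True
print([in_unit_interval(s) for s in rejected])  # all False
```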
I need to create a regexp to match strings like this 999-123-222-...-22
The string can end with &Ns=(any number) or without it. So valid strings for me are
999-123-222-...-22
999-123-222-...-22&Ns=12
999-123-222-...-22&Ns=12
And following are not valid:
999-123-222-...-22&N=1
I have been testing this for several hours already but did not manage to solve it. I really need some help.
Not sure if you want to literally match 999-123-222-...-22 or if that can be any sequence of numbers/dashes. Here are two different regexes:
/^[\d-]+(&Ns=\d+)?$/
/^999-123-222-\.\.\.-22(&Ns=\d+)?$/
The key idea is the (&Ns=\d+)?$ part, which matches an optional &Ns=<digits>, and is anchored to the end of the string with $.
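A quick Python check of the first (generic) regex. This sketch assumes that the "..." in the question stands for further digit groups rather than literal dots; the sample strings are made up accordingly.

```python
import re

# Digits and dashes, then an optional "&Ns=<digits>" anchored at the end.
ID_RE = re.compile(r"^[\d-]+(&Ns=\d+)?$")

def is_valid(text):
    """True if text is a digit/dash sequence with an optional &Ns=<digits> suffix."""
    return ID_RE.match(text) is not None

print(is_valid("999-123-222-22"))        # True
print(is_valid("999-123-222-22&Ns=12"))  # True
print(is_valid("999-123-222-22&N=1"))    # False
```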
If you just want to allow strings 999-123-222-...-22 and 999-123-222-...-22&Ns=12 you better use a string function.
If you want to allow any numbers between - you can use the regex:
^(\d+-){3}[.]{3}-\d+(&Ns=\d+)?$
If the numbers must be of only 3 digits and the last number of only 2 digits you can use:
^(\d{3}-){3}[.]{3}-\d{2}(&Ns=\d{2})?$
This looks like a phone number with extension information.
Why not make things simpler for yourself (and anyone who has to read this later) and split the input rather than use a complicated regex?
s = '999-123-222-...-22&Ns=12'
parts = s.split('&Ns=')  # splits on '&Ns=' and removes it
If the piece before the "&" is a phone number, you could do another split and get the area code etc into separate fields, like so:
phone_parts = parts[0].split('-') # breaks up the digit string and removes the '-'
area_code = phone_parts[0]
The portion found after the optional '&Ns=' can be checked to see if it is numeric with the string method isdigit, which returns True if all characters in the string are digits and there is at least one character, and False otherwise.
if len(parts) > 1:
    extra_digits_ok = parts[1].isdigit()