I'd like to match numbers (integer and real) in a string, but not if they are part of an identifier; e.g., I'd like to match 5.5 or 42, but not x5. Strings are roughly of the form "x5*1.1+42*y=40".
So far, I came up with
([0-9]*[.])?[0-9]+[^.*+=<>]
This correctly ignores x0, but it also fails to match 0 or 0.5 (12.45, however, works). Changing the + to * leads to wrong matches.
It would be very nice if someone could point out my error.
Thanks!
This is actually not simple. Float literals are more complex than you assumed, being able to contain an e or E for exponential format. Also, you can have prefixed signs (+ or -) for the number and/or the exponent. All in all it can be done like this:
re.findall(r'(?:(?<![a-zA-Z_0-9])|[+-]\s*)[\d.]+(?:[eE][+-]?\d+)?',
'x5*1.1+42*y=40+a123-3.14e-2')
This returns:
['1.1', '+42', '40', '-3.14e-2']
You should consider, though, whether something like 4-3 should lead to ['4', '3'] or ['4', '-3']. If the input were 4+-3, the '-3' would clearly be preferable. But distinguishing these cases isn't easy, and you should consider using a proper formula parser for them.
Maybe the standard module ast can help you. The input must be a valid Python expression in this case, so something like a+b=40 isn't allowed, because the left side of the equals sign is not a proper lvalue. But for valid Python expressions you could use ast like this:
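import ast

def find_all_numbers(e):
    # Walk binary operations recursively and yield every numeric literal.
    if isinstance(e, ast.BinOp):
        yield from find_all_numbers(e.left)
        yield from find_all_numbers(e.right)
    elif isinstance(e, ast.Constant) and isinstance(e.value, (int, float)):
        # ast.Num is deprecated since Python 3.8; ast.Constant replaces it.
        yield e.value

list(find_all_numbers(ast.parse('x5*1.1+42*y-40').body[0].value))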
Returns:
[1.1, 42, 40]
You could do it with something like
\b\d*(\.\d+)?\b
It matches any number of digits (\d*) followed by an optional decimal part ((\.\d+)?). The \b matches word boundaries, i.e. the location between a word character and a non-word character. And since both digits and (English) letters are word characters, it won't match the 5 in a sequence like x5.
See this regex101 example.
The main reason your attempt fails is that it ends with [^.*+=<>], which requires the number (or rather the match) to end with a character other than ., *, =, +, < or >. When the number ends with a single digit, as in 0 and 0.5, that digit gets eaten by the [0-9]+, there's nothing left for the [^.*+=<>] to match, and thus it fails. In the case of 12.45 it first matches 12.4 and then the [^.*+=<>] matches the 5.
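The failure described above is easy to reproduce. Here is a minimal Python sketch (assuming the standard re module; the variable names are mine) that runs the pattern from the question against the three inputs mentioned:

```python
import re

# The pattern from the question: the trailing character class must consume
# one more character after the digits.
original = re.compile(r"([0-9]*[.])?[0-9]+[^.*+=<>]")

for candidate in ["0", "0.5", "12.45"]:
    m = original.search(candidate)
    # Prints the matched text, or None when the pattern finds nothing.
    print(candidate, "->", m.group(0) if m else None)
```

Only 12.45 produces a match, because only there can the final 5 be given up by [0-9]+ and taken by the character class.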
Do something like ((?<![a-zA-Z_])\d+(\.\d+)?)
It is using negative lookbehind in order not to select anything having [a-zA-Z_] prior to it.
Check it out here in Regex101.
About your regex ([0-9]*[.])?[0-9]+[^.*+=<>]: use [0-9]+ instead of [0-9]* in the first group, so that .05 is not captured, only 0.5. Another thing is the [^.*+=<>] part: you could add ? after it so that the trailing character becomes optional. For example, 1.1 won't be captured, because ([0-9]*[.])?[0-9]+ is satisfied but the [^.*+=<>] that comes after it is not.
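One thing to watch with a lookbehind that lists only letters and the underscore: digits inside a longer identifier such as a123 can still be matched from the second digit onward, because the character in front of them is a digit. A small Python sketch (the sample string and variable names are my own, and the groups are made non-capturing so that findall returns whole matches) compares it with a variant that also excludes digits in the lookbehind:

```python
import re

text = "x5*1.1+42*y=40+a123"

# Lookbehind as suggested above: only letters and underscore are excluded.
letters_only = re.compile(r"(?<![a-zA-Z_])\d+(?:\.\d+)?")

# Assumed variant: digits are excluded too, so a match cannot start
# in the middle of an identifier's digit run.
letters_and_digits = re.compile(r"(?<![a-zA-Z_0-9])\d+(?:\.\d+)?")

print(letters_only.findall(text))        # picks up '23' from a123
print(letters_and_digits.findall(text))  # skips a123 entirely
```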
I am wondering what a regular expression for testing the correct format of a number in German culture would look like.
In German, the comma is used as the decimal mark and the dot is used to separate thousands.
Therefore:
1.000 equals to 1000
1,000 equals to 1
1.000,89 equals to 1000.89
1.000.123.456,89 equals to 1000123456.89
The real trick, it seems to me, is to make sure that there can be several dots, optionally followed by a comma separator.
This is the regex I would use:
^-?\d{1,3}(?:\.\d{3})*(?:,\d+)?$
Debuggex Demo
And this is a code example to interpret it as a valid floating point (notice the parseFloat() after the string replacements).
Edit: as mentioned in Severin Klug's answer, the code below assumes that the numbers are known to be in German format. Attempting to "detect" whether a string contains a German-format or US-format number is not trivial and is out of scope for this question. '1.234' is valid in both formats but with different values; without context it is impossible to know for sure which format was meant.
var numbers = ['1.000', '1,000', '1.000,89', '1.000.123.456,89'];
document.getElementById('out').value=numbers.map(function(str) {
return parseFloat(str.replace(/\./g, '').replace(',', '.'));
}).join('\n');
<textarea id="out" rows="10" style="width:100%"></textarea>
I would have posted this as a comment, but I don't have enough reputation.
@funkwurm, your post https://stackoverflow.com/a/28361329/7329611 contains this JavaScript:
var numbers = ['1.000', '1,000', '1.000,89', '1.000.123.456,89', '1.2'];
numbers.map(function(str) {
return parseFloat(str.replace(/\./g, '').replace(',', '.'));
}).join('\n');
which should convert German numbers to English/international ones. It does so for every number with exactly three digits after a German thousands dot, like the numbers you use in the example array. BUT, and this is the critical use-case error: it also deletes the dot from any string in which the dot is not followed by exactly three digits.
So if you insert a string like '1.2' it returns 12, and if you insert '1.23' it returns 123.
This is very dangerous behaviour if anyone just copies the above snippet and assumes it will convert any given number correctly into English format, because numbers that are already in English format will be corrupted! So be careful, please.
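One way to guard against that corruption is to validate the string against the German-format regex given earlier in this thread before stripping any dots. The following Python sketch illustrates the idea; parse_german_number is a hypothetical helper name, and it still assumes the input is meant to be German format (an ambiguous string like '1.000' is read as one thousand):

```python
import re

# Same pattern as the German-format regex earlier in this thread.
GERMAN_NUMBER = re.compile(r"-?\d{1,3}(?:\.\d{3})*(?:,\d+)?")

def parse_german_number(text):
    """Return the float value of a German-formatted number, or raise ValueError."""
    if GERMAN_NUMBER.fullmatch(text) is None:
        raise ValueError("not a German-formatted number: %r" % text)
    # Safe to drop the dots now: each one is followed by exactly three digits.
    return float(text.replace(".", "").replace(",", "."))

for s in ["1.000", "1,000", "1.000,89", "1.2"]:
    try:
        print(s, "->", parse_german_number(s))
    except ValueError as exc:
        print(s, "-> rejected:", exc)
```

With this check, '1.2' and '1.23' are rejected instead of silently turning into 12 and 123.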
This regex should work :
([0-9]{1,3}(?:\.[0-9]{3})*(?:\,[0-9]+)?)
A good regex would be something like this
Regex regex = new Regex(@"-?\d{1,3}(?:\.\d{3})*(?:,\d+)?");
Match match = regex.Match(input);
Decimal result = Decimal.Zero;
if (match.Success)
result = Decimal.Parse(match.Value, new CultureInfo("de-DE"));
The result is the German number as a parsed value.
Try this; it will match your inputs:
^(\d+\.)*\d+(,\d+)?
This regex would work for positive numbers:
/^[0-9]{0,3}(\.[0-9]{3})*(,[0-9]{0,2})?$/
Breakdown
[0-9]{0,3} - this section allows zero up to three digits. An empty value is valid; '1', '26', '789' are valid. '1589' is invalid.
(\.[0-9]{3})* - this section allows zero or more dots. If there's a dot, there must be exactly three digits after it. '2.589' is valid. '2.5896' and '2.45' are invalid.
(,[0-9]{0,2})? - this section allows zero or one comma. There can be zero up to two digits after the comma. '25,', '25,5', '25,45' are valid. '25,456' and '25,45,8' are invalid.
Hope this is helpful
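The valid and invalid examples from the breakdown can be checked directly. A short Python sketch (the helper name is_valid and the two lists are made up for illustration):

```python
import re

# The pattern from the answer above, without the JavaScript slashes.
PATTERN = re.compile(r"^[0-9]{0,3}(\.[0-9]{3})*(,[0-9]{0,2})?$")

def is_valid(text):
    return PATTERN.match(text) is not None

valid = ["", "1", "26", "789", "2.589", "25,", "25,5", "25,45"]
invalid = ["1589", "2.5896", "2.45", "25,456", "25,45,8"]

for s in valid + invalid:
    print(repr(s), is_valid(s))
```

Note that the empty string and a trailing comma are accepted, exactly as the breakdown says; tighten the quantifiers if that is not wanted.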
I'm trying to extract only valid percentage information from a string, and eliminate any incorrect representation, using a regular expression in Python. The function should work like this:
0-100% = TRUE
0.12% = TRUE
23.1245467% = TRUE
9999% = FALSE
8937.2435% = FALSE
7.% = FALSE
I have checked a few solutions on Stack Overflow, which only extract 0-100%. I have tried the following solutions:
('(\s100|[123456789][0-9]|[0-9])(\.\d+)+%')
'(\s100|\s\d{1,2})(\.\d+)+%'
'(\s100|\s\d[0-99])(\.\d+)+%'
All of these work for every other case except 0-99% (gives FALSE) and 12411.23526% (gives TRUE). The reason for the space is that I want to extract only numbers with at most two digits.
Figured it out. The problem lay in the '+' in the expression '(\.\d+)+', whereas it should have been '(\.\d+)*'. The first expression requires a decimal part for any two-digit percentage value, whereas the second doesn't. My final version is given below.
'\s(100|(\d{1,2}(\.\d+)*))%'
You can replace \s with ^ for percentage values at the beginning of a sentence. Also, the versions in my question accepted decimal values for 100, which are invalid percentage values.
I would not rely on regex alone - it is not meant to filter ranges in the first place.
Better look for candidates in your string and analyze them programmatically afterwards, like so:
import re

string = """
some gibberish in here 0-100% = TRUE
some gibberish in here 0.12% = TRUE
some gibberish in here 23.1245467% = TRUE
some gibberish in here 9999% = FALSE
some gibberish in here 8937.2435% = FALSE
some gibberish in here 7.% = FALSE
"""

numbers = []

# look for -, digits and dots, ending with a digit and a percentage sign
rx = r'[-\d.]+\d%'

# loop over the results
for match in re.finditer(rx, string):
    # a range such as 0-100% is split into its two bounds
    interval = match.group(0).split('-')
    for number in interval:
        if 0 <= float(number.strip('%')) <= 100:
            numbers.append(number)

print(numbers)
# ['0', '100%', '0.12%', '23.1245467%']
Considering all the possibilities, the following regex works.
If you ignore the ?: (which only marks a group as non-capturing), the regex is not that intimidating.
Regex: ^(?:(?:\d{1,2}(?:\.\d+)?\-)?(?:(?:\d{1,2}(?:\.\d+)?)|100))%$
Explanation:
(?:\d{1,2}(?:\.\d+)?\-)? matches the lower limit, if there is one (as in 0-100%), with an optional decimal part.
(?:(?:\d{1,2}(?:\.\d+)?)|100) matches the upper limit, or the only number if there is no range: either one or two digits with an optional decimal part, or exactly 100.
Regex101 Demo
Another version of the same regex, for matching such occurrences within a string, removes the anchors ^ and $ and checks for a non-digit (or the start of the string) before the match.
Regex: (?<=\D|^)(?:(?:\d{1,2}(?:\.\d+)?\-)?(?:(?:\d{1,2}(?:\.\d+)?)|100))%
Regex101 Demo
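Since the question is about Python, here is a hedged sketch of both variants with the re module. As far as I know, re only accepts fixed-width lookbehinds, so (?<=\D|^) is replaced with a negative lookbehind. The plain translation would be (?<!\d); the dot in (?<![\d.]) is my own addition, so that the fractional tail of an over-long number is not picked up as a percentage. The sample text is assembled from the values in the question.

```python
import re

# Anchored version from this answer: the whole string must be a percentage.
whole = re.compile(r"^(?:(?:\d{1,2}(?:\.\d+)?\-)?(?:(?:\d{1,2}(?:\.\d+)?)|100))%$")

# In-string version; (?<![\d.]) is an assumed substitute for (?<=\D|^).
inside = re.compile(r"(?<![\d.])(?:(?:\d{1,2}(?:\.\d+)?\-)?(?:(?:\d{1,2}(?:\.\d+)?)|100))%")

for s in ["0-100%", "0.12%", "23.1245467%", "9999%", "8937.2435%", "7.%"]:
    print(s, whole.match(s) is not None)

text = "a 0-100% b 0.12% c 23.1245467% d 9999% e 8937.2435% f 7.% g 12411.23526%"
# All groups are non-capturing, so findall returns the full matches.
print(inside.findall(text))
```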
I have a list of numbers from the interval (0;1]. For example:
0.235
0.4
1.00
0.533
1
I need to append some new numbers to the list. To check correctness of new numbers, I need to write regex.
First I wrote a simple regex: [0|1\.]{2}\d+, but it ignores one condition: if the integer part is 1, the fractional part must consist only of zeros (zero or more of them).
So I tried to use lookahead assertions to emulate an if-else condition: (?([0\.]{2})\d+|[0]+), but it isn't working. Where is my mistake? How can I check that none of the numbers is greater than 1?
Better than regex is to try to convert the string to a float and check whether it is in the range:
def convert(s):
    f = float(s)  # raises ValueError if s is not a valid number
    if not 0. < f <= 1.:
        raise ValueError()
    return f
This function returns a float in the interval (0, 1], or it raises a ValueError (if the string is invalid or the float is not in that interval).
So, explaining my comment from above:
The regex you want should be:
"1 maybe followed by only 0's" OR "0 followed by a dot then some more numbers, which aren't all zeroes"
Breaking it down like this makes it easier to write.
For the first part "1 maybe followed by only 0's":
^1(\.0+)?$
This is fairly straightforward: "1" followed by (\.0+) zero or one times, where (\.0+) is "." followed by one or more "0"s.
And for the second part
^0\.(?!0+$)\d+$
This is a bit trickier. It is "0." followed by a negative lookahead "(?!0+$)". This means that if "0+$" (one or more "0"s running up to the end of the string) is found at that point, the regex won't match. After that check you have "\d+$", which is one or more digits up to the end of the string.
Combining these with an "or" you get:
^1(\.0+)?$|^0\.(?!0+$)\d+$
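A quick Python check of the combined pattern against the numbers from the question plus a few out-of-range values (the helper name in_unit_interval and the sample lists are mine):

```python
import re

UNIT_INTERVAL = re.compile(r"^1(\.0+)?$|^0\.(?!0+$)\d+$")

def in_unit_interval(text):
    """True if text is a decimal literal in (0, 1] according to the regex."""
    return UNIT_INTERVAL.match(text) is not None

accepted = ["0.235", "0.4", "1.00", "0.533", "1"]
rejected = ["0", "0.0", "0.000", "1.01", "1.5", "2", ".5"]

for s in accepted + rejected:
    print(s, in_unit_interval(s))
```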
I would like to right justify strings containing Thai characters (Thai rendering doesn't work from left to right, but can go up and down as well).
For example, for the strings ไป (two characters, length 2) and ซื้อ (four characters, length 2) I want to have the following output (length 5):
...ไป
...ซื้อ
The naive
print 'ไป'.decode('utf-8').rjust(5)
print 'ซื้อ'.decode('utf-8').rjust(5)
however, respectively produce
...ไป
.ซื้อ
Any ideas how to get to the desired formatting?
EDIT:
Given a string of Thai characters tc, I want to determine how many [places/fields/positions/whatever you want to call it] the string uses. This is not the same as len(tc); len(tc) is usually larger than the number of places used. The second word gives len(tc) = 4, but has length 2 / uses 2 places / uses 2 positions.
Cause
Thai script contains normal characters (positive advance width) and non-spacing marks as well (zero advance width).
For example, in the word ซื้อ:
the first character is the initial consonant "SO SO",
then it has vowel mark SARA UUE,
then tone mark MAI THO,
and then the final pseudo-consonant O ANG
The problem is that characters 2 and 3 in the list above are zero-width ones.
In other words, they do not make the string "wider".
In yet other words, ซื้อ ("to buy") and ซอ ("fiddle") would have equal width of two character places (but string lengths of 4 and 2, respectively).
Solution
In order to calculate the "real" string length, one must skip zero-width characters.
Python-specific
The unicodedata module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 8.0.0.
The unicodedata.category(unichr) function returns one of the following General Category values:
"Lo" for normal character;
"Mn" for zero-width non-spacing marks;
The rest is obvious: simply filter out the latter ones when counting.
Further info:
Unicode data for Thai script (scroll till the first occurrence of "THAI CHARACTER")
I think what you mean to ask is, how to determine the 'true' # of characters in เรือ, ไป, ซื้อ etc. (which are 3,2 and 2, respectively)
Unfortunately, here's how Python interprets these characters:
ไป
>>> 'ไป'
'\xe0\xb9\x84\xe0\xb8\x9b'
>>> len('ไป')
6
>>> len('ไป'.decode('utf-8'))
2
ซื้อ
>>> 'ซื้อ'
'\xe0\xb8\x8b\xe0\xb8\xb7\xe0\xb9\x89\xe0\xb8\xad'
>>> len('ซื้อ')
12
>>> len('ซื้อ'.decode('utf-8'))
4
เรือ
>>> 'เรือ'
'\xe0\xb9\x80\xe0\xb8\xa3\xe0\xb8\xb7\xe0\xb8\xad'
>>> len('เรือ')
12
>>> len('เรือ'.decode('utf-8'))
4
There's no real correlation between the # of characters displayed and the # of actual (from Python's perspective) characters that make up the string.
I can't think of an obvious way to do this. However, I've found this library which might be of help to you. (You will also need to install some prerequisites.)
It looks like the rjust() function will not work for you, and you will need to count the number of cells in the string yourself. You can then insert the number of spaces required before the string to achieve justification.
You seem to know about Thai language. Sum the number of consonants, preceding vowels, following vowels and Thai punctuation. Don't count diacritics and above and below vowels.
Something like this (a rough Python sketch; the ranges are the code points U+0E31, U+0E34 to U+0E3A and U+0E47 to U+0E4E):
def count_cells(string):
    cells = 0
    for ch in string:
        cp = ord(ch)
        if cp == 0x0E31 or 0x0E34 <= cp <= 0x0E3A or 0x0E47 <= cp <= 0x0E4E:
            # above/below vowel or diacritic: takes no cell of its own
            continue
        # consonant, preceding or following vowel, or punctuation
        cells += 1
    return cells
Here's a function to compute the length of a Thai string (the number of characters arranged horizontally), based on bytebuster's answer:
import unicodedata

def get_thai_string_length(string):
    length = 0
    for c in string:
        # skip zero-width non-spacing marks
        if unicodedata.category(c) != 'Mn':
            length += 1
    return length

print(len('บอินทัช'))
print(get_thai_string_length('บอินทัช'))
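To get back to the original right-justification problem, the same counting idea can be used to compute the padding. This is only a sketch under the assumptions already made in this thread (Python 3 strings, and category 'Mn' as the sole test for zero-width characters); display_width and rjust_display are hypothetical helper names:

```python
import unicodedata

def display_width(text):
    """Count code points that are not non-spacing marks (category 'Mn')."""
    return sum(1 for ch in text if unicodedata.category(ch) != "Mn")

def rjust_display(text, width, fillchar=" "):
    """Right-justify text to `width` display cells rather than code points."""
    padding = max(0, width - display_width(text))
    return fillchar * padding + text

# Both words occupy two cells, so both get three fill characters.
for word in ["ไป", "ซื้อ"]:
    print(rjust_display(word, 5, "."))
```

How the result lines up on screen still depends on the terminal and font rendering the combining marks correctly.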
I need to create a regexp to match strings like this 999-123-222-...-22
The string may optionally end with &Ns=(any number). So valid strings for me are:
999-123-222-...-22
999-123-222-...-22&Ns=12
And following are not valid:
999-123-222-...-22&N=1
I have been testing this for several hours already but did not manage to solve it; I really need some help.
Not sure if you want to literally match 999-123-222-...-22 or if that can be any sequence of numbers/dashes. Here are two different regexes:
/^[\d-]+(&Ns=\d+)?$/
/^999-123-222-\.\.\.-22(&Ns=\d+)?$/
The key idea is the (&Ns=\d+)?$ part, which matches an optional &Ns=<digits>, and is anchored to the end of the string with $.
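As a quick illustration in Python, here is the first of the two patterns, under the assumption that the "..." in the question stands for more digit groups rather than literal dots (the sample strings are shortened accordingly and are my own):

```python
import re

# Digits and dashes, optionally followed by &Ns=<digits>, up to end of string.
pattern = re.compile(r"^[\d-]+(&Ns=\d+)?$")

samples = [
    "999-123-222-22",        # valid: no suffix
    "999-123-222-22&Ns=12",  # valid: optional &Ns=<digits>
    "999-123-222-22&N=1",    # invalid: wrong parameter name
    "999-123-222-22&Ns=",    # invalid: no digits after &Ns=
]

for s in samples:
    print(s, pattern.match(s) is not None)
```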
If you just want to allow the strings 999-123-222-...-22 and 999-123-222-...-22&Ns=12, you had better use a plain string comparison.
If you want to allow any numbers between - you can use the regex:
^(\d+-){3}[.]{3}-\d+(&Ns=\d+)?$
If the numbers must be of only 3 digits and the last number of only 2 digits you can use:
^(\d{3}-){3}[.]{3}-\d{2}(&Ns=\d{2})?$
This looks like a phone number and extension information.
Why not make things simpler for yourself (and anyone who has to read this later) and split the input rather than use a complicated regex?
s = '999-123-222-...-22&Ns=12'
parts = s.split('&Ns=') # splits on Ns and removes it
If the piece before the "&" is a phone number, you could do another split and get the area code etc into separate fields, like so:
phone_parts = parts[0].split('-') # breaks up the digit string and removes the '-'
area_code = phone_parts[0]
The portion found after the optional '&Ns=' can be checked to see if it is numeric with the string method isdigit, which returns True if all characters in the string are digits and there is at least one character, and False otherwise.
if len(parts) > 1:
    extra_digits_ok = parts[1].isdigit()