Consider the Python code:
import re
re.findall('[0-9]+', 'XYZ 102 1030')
which returns:
['102', '1030']
Can one write a regex that requires at least one occurrence of the digit 3, i.e. I am interested in '[0-9]+' where there is at least one 3? So the result of interest would be:
['1030']
More generally, how about at least n 3's?
And even more generally, how about at least n 3's and at least k 4's, etc?
Just try the regexp '\d*3\d*', which means "0 or more digits, followed by a 3, followed by 0 or more digits".
You can check it here
If you want "at least 'n' 3", use '\d*(3\d*){n}'.
At least one 3 in the string could be
\d*3\d*
https://regex101.com/r/yEbatk/4
If you are looking for (at least) 2 times the number 3 inside, you could use:
\d*3\d*3\d*
https://regex101.com/r/yEbatk/5
If you want it to be (at least) n times, you could use the {n} repeat quantifier (a special case of {min,max}); the surrounding \d* parts absorb any additional 3s:
\d*(3\d*){n}
https://regex101.com/r/yEbatk/7
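To tie this back to the Python call in the question, here is a minimal sketch of the same pattern used with re.findall. The helper name at_least_n is hypothetical, and the group is made non-capturing on purpose, since findall returns the group contents instead of the whole match when the pattern contains a capturing group.

```python
import re

def at_least_n(digit, n):
    # Hypothetical helper: digit runs containing `digit` at least n times.
    # (?:...) keeps the group non-capturing so findall returns whole matches.
    return re.compile(r'\d*(?:%s\d*){%d}' % (digit, n))

text = 'XYZ 102 1030 3033'
print(at_least_n('3', 1).findall(text))  # ['1030', '3033']
print(at_least_n('3', 2).findall(text))  # ['3033']
```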
For n occurrences of x, m occurrences of y, and so on, build on this general expression:
(?=(?:\d*x){n})(?=(?:\d*y){m})\b\d+\b
where the lookahead part (?=(?:\d*x){n}) is repeated for each desired n and x.
I chose to make the lookahead groups non-capturing by surrounding them with (?:..), although it makes it a bit less readable.
The counting part itself is just (\d*x){n}, and it needs a lookahead because with more than one set of numbers to look for, the digits may appear in any order.
The final \b\d+\b ensures you capture just digits, surrounded by 'not-word' characters, so it will skip any sequence containing letters but will work on something like abc-123-456.
Example: 2 3's and 2 4's, in XYZ 1023344a 1403403
(?=(?:\d*3){2})(?=(?:\d*4){2})\b\d+\b
will match 1403403 but not 1023344a.
See
https://regex101.com/r/QgYptp/3
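In Python, the lookahead expression can be assembled from a table of minimum counts. This is only a sketch: the function name build_pattern and the dict layout are assumptions for illustration, not part of the answer above.

```python
import re

def build_pattern(mins):
    # Hypothetical helper: mins maps a digit to its minimum count,
    # e.g. {'3': 2, '4': 2}. One lookahead is emitted per digit.
    lookaheads = ''.join(
        r'(?=(?:\d*%s){%d})' % (digit, count) for digit, count in mins.items()
    )
    return re.compile(lookaheads + r'\b\d+\b')

pattern = build_pattern({'3': 2, '4': 2})
print(pattern.findall('XYZ 1023344a 1403403'))  # ['1403403']
```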
Although you can use a regex for this, regexes get messy and hard to read when you're searching for more than a couple of different digits. Instead, you can use collections.Counter to count the number of occurrences of each character in a string:
from collections import Counter
# Must contain at least two 3s, three 4s, and one 7
mins = { '3': 2, '4': 3, '7': 1 }
input = '3444 33447 334447 foo334447 473443 2317349414'
tokens = input.split()
for token in tokens:
    # Skip tokens that aren't numbers
    if not token.isdigit():
        continue
    counter = Counter(token)
    for digit, min_count in mins.items():
        if counter[digit] < min_count:
            break
    else:
        # No minimum was violated, so the token qualifies
        print(token)
Output:
334447
473443
2317349414
I have a complex case where I can't get any further. The goal is to check a string via RegEx for the following conditions:
Exactly 12 letters
The letters W,S,I,O,B,A,R and H may only appear exactly once in the string
The letters T and E may only occur exactly 2 times in the string.
Important! The order must not matter
Example matches:
WSITTOBAEERH
HREEABOTTISW
WSITOTBAEREH
My first attempt:
results = re.match(r"^W{1}S{1}I{1}T{2}O{1}B{1}A{1}E{2}R{1}H{1}$", word)
The problem with this first attempt is that it only matches if the order of the letters in the RegEx has been followed. That violates condition 4
My second attempt:
results = re.match(r"^[W{1}S{1}I{1}T{2}O{1}B{1}A{1}E{2}R{1}H{1}]{12}$", word)
The problem with trial two: Now the order no longer matters, but the exact number of individual letters is ignored.
I can only do the basics of RegEx so far and can't get any further here. If anyone has an idea what a regular expression looks like that fits the four rules mentioned above, I would be very grateful.
One possibility, although I still think regex is inappropriate for this. Checks that all letters appear the desired amount and that it's 12 letters total (so there's no room left for any more/other letters):
import re
for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(re.fullmatch('(?=.*W)(?=.*S)(?=.*I)(?=.*O)'
                       '(?=.*B)(?=.*A)(?=.*R)(?=.*H)'
                       '(?=.*T.*T)(?=.*E.*E).{12}', s))
Another, checking that none other than T and E appear twice, that none appear thrice, and that we have only the desired letters, 12 total:
import re
for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(re.fullmatch(r'(?!.*([^TE]).*\1)'
                       r'(?!.*(.).*\1.*\1)'
                       r'[WSIOBARHTE]{12}', s))
A simpler way:
for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(sorted(s) == sorted('WSIOBARHTTEE'))
I can't find a solution to what is perhaps an easy problem: I want to match, with a simple regex, all possible permutations of 5 specified digits, without repetition, where all digits must be used. So, for this sequence:
12345
the valid permutation is:
54321
but
55555
is not valid.
However, if the provided digits contain the same digit more than once, the accepted permutations must contain that digit the same number of times, with each provided digit used exactly once. For example, if the provided number is:
55432
we see that 5 is provided 2 times, so it must be also present two times in each permutation, and some of the accepted answers would be:
32545
45523
but this is wrong:
55523
(not all original digits are used and 5 is repeated more than twice)
I came very close to solving this using:
(?:([43210])(?!.*\1)){5}
but unfortunately it doesn't work when the same digit is provided more than once (like 43211).
One way to solve this is to make a character class out of the search digits and build a regex to search for as many digits in that class as are in the search string. Then you can filter the regex results based on the sorted match string being the same as the sorted search string. For example:
import re
def find_perms(search, text):
    search = sorted(search)
    regex = re.compile(rf'\b[{"".join(search)}]{{{len(search)}}}\b')
    matches = [m for m in regex.findall(text) if sorted(m) == search]
    return matches
print(find_perms('54321', '12345 54321 55432'))
print(find_perms('23455', '12345 54321 55432'))
print(find_perms('24455', '12345 54321 55432'))
Output:
['12345', '54321']
['55432']
[]
Note I've included word boundaries (\b) in the regex so that (for example) 12345 won't match 654321. If you want to match substrings as well, just remove the word boundaries from the regex.
The mathematical term for this is a multiset. In Python, this is handled by the Counter data type. For example,
from collections import Counter
target = '55432'
candidate = '32545'
Counter(candidate) == Counter(target)
If you want to generate all of the multisets, here's one question dealing with that: How to generate all the permutations of a multiset?
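As a small sketch of how the Counter comparison behaves on the examples from the question (the function name is_multiset_permutation is made up for illustration):

```python
from collections import Counter

def is_multiset_permutation(candidate, target):
    # True when both strings contain exactly the same characters
    # with exactly the same multiplicities.
    return Counter(candidate) == Counter(target)

print(is_multiset_permutation('32545', '55432'))  # True
print(is_multiset_permutation('55523', '55432'))  # False: three 5s, no 4
```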
I'd like to match numbers (int and real) in a string, but not if they are part of an identifier; e.g., I'd like to match 5.5 or 42, but not x5. Strings are roughly of the form "x5*1.1+42*y=40".
So far, I came up with
([0-9]*[.])?[0-9]+[^.*+=<>]
This correctly ignores x0, but also 0 or 0.5 (12.45, however, works). Changing the + to * leads to wrong matchings.
It would be very nice if someone could point out my error.
Thanks!
This is actually not simple. Float literals are more complex than you assumed, being able to contain an e or E for exponential format. Also, you can have prefixed signs (+ or -) for the number and/or the exponent. All in all it can be done like this:
re.findall(r'(?:(?<![a-zA-Z_0-9])|[+-]\s*)[\d.]+(?:[eE][+-]?\d+)?',
           'x5*1.1+42*y=40+a123-3.14e-2')
This returns:
['1.1', '+42', '40', '-3.14e-2']
You should consider, though, whether something like 4-3 should lead to ['4', '3'] or ['4', '-3']. If the input was 4+-3, the '-3' would clearly be preferable. But distinguishing these cases isn't easy, and you should consider using a proper formula parser for this.
Maybe the standard module ast can help you. The expression must be a valid Python expression in this case, so something like a+b=40 isn't allowed, because the left side of the equals sign is not a valid assignment target. But for valid Python expressions you could use ast like this:
import ast
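def find_all_numbers(e):
    if isinstance(e, ast.BinOp):
        # Recurse into both operands of a binary operation
        yield from find_all_numbers(e.left)
        yield from find_all_numbers(e.right)
    elif isinstance(e, ast.Constant) and isinstance(e.value, (int, float)):
        # ast.Constant replaces the deprecated ast.Num (removed in Python 3.12)
        yield e.value

list(find_all_numbers(ast.parse('x5*1.1+42*y-40').body[0].value))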
Returns:
[1.1, 42, 40]
You could do it with something like
\b\d*(\.\d+)?\b
It matches any number of digits (\d*) followed by an optional decimal part ((\.\d+)?). The \b matches word boundaries, i.e. the location between a word character and a non-word character. And since both digits and (English) letters are word characters, it won't match the 5 in a sequence like x5. Note that because every part of the pattern is optional, it can also produce empty matches at word boundaries, which you may need to filter out.
See this regex101 example.
The main reason your attempt fails is that it ends with [^.*+=<>], which requires the match to end with a character other than ., *, +, =, < or >. When the number ends in a single digit, as in 0 and 0.5, that digit is consumed by [0-9]+, so there is nothing left for [^.*+=<>] to match, and the match fails. In the case of 12.45 it first matches 12.4, and then [^.*+=<>] matches the 5.
Do something like ((?<![a-zA-Z_])\d+(\.\d+)?)
It is using negative lookbehind in order not to select anything having [a-zA-Z_] prior to it.
Check it out here in Regex101.
About your regex ([0-9]*[.])?[0-9]+[^.*+=<>]: use [0-9]+ instead of [0-9]* inside the optional group, so that .05 is not captured, only forms like 0.5. Also, the trailing [^.*+=<>] requires one more character after the number; you could add ? after it so that the number may also end without such a character. For example, 1.1 won't be captured: ([0-9]*[.])?[0-9]+ is satisfied, but the [^.*+=<>] that must follow it is not.
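The behaviour described in these answers can be checked quickly in Python. This is a sketch under one assumption: the lookbehind below also excludes digits (unlike the [a-zA-Z_] version above), so that the tail of an identifier such as a123 is not picked up as a number.

```python
import re

# The pattern from the question: the trailing class needs one extra character.
original = re.compile(r'([0-9]*[.])?[0-9]+[^.*+=<>]')
print(original.search('0.5'))             # None
print(original.search('12.45').group(0))  # 12.45  (12.4 plus the 5)

# Lookbehind variant (assumption: digits added to the excluded set).
fixed = re.compile(r'(?<![A-Za-z_0-9])\d+(?:\.\d+)?')
print(fixed.findall('x5*1.1+42*y=40+a123'))  # ['1.1', '42', '40']
```

The decimal group is non-capturing so that findall returns the full number rather than just the fractional part.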
Let's say I have a number with a recurring pattern, i.e. there exists a string of digits that repeat themselves in order to make the number in question. For example, such a number might be 1234123412341234, created by repeating the digits 1234.
What I would like to do, is find the pattern that repeats itself to create the number. Therefore, given 1234123412341234, I would like to compute 1234 (and maybe 4, to indicate that 1234 is repeated 4 times to create 1234123412341234)
I know that I could do this:
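def findPattern(num):
    num = str(num)
    # Candidate patterns are the proper prefixes of num
    for i in range(1, len(num)):
        patt = num[:i]
        if len(num) % len(patt):
            continue  # pattern length must divide the total length
        if patt * (len(num) // len(patt)) == num:
            return patt, len(num) // len(patt)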
However, this seems a little too hacky. I figured I could use itertools.cycle to compare two cycles for equality, which doesn't really pan out:
In [25]: c1 = itertools.cycle(list(range(4)))
In [26]: c2 = itertools.cycle(list(range(4)))
In [27]: c1==c2
Out[27]: False
Is there a better way to compute this? (I'd be open to a regex, but I have no idea how to apply it there, which is why I didn't include it in my attempts)
EDIT:
I don't necessarily know that the number has a repeating pattern, so I have to return None if there isn't one.
Right now, I'm only concerned with detecting numbers/strings that are made up entirely of a repeating pattern. However, later on, I'll likely also be interested in finding patterns that start after a few characters:
magic_function(78961234123412341234)
would return 1234 as the pattern, 4 as the number of times it is repeated, and 4 as the first index in the input where the pattern first presents itself
(.+?)\1+
Try this. Grab the capture. See demo.
import re
p = re.compile(r'(.+?)\1+')
test_str = "1234123412341234"
re.findall(p, test_str)
Add anchors and flag Multiline if you want the regex to fail on 12341234123123, which should return None.
^(.+?)\1+$
See demo.
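A minimal sketch of how the anchored pattern could be wrapped to return the pattern and repeat count, or None when there is no repetition. The function names are hypothetical; the second function is one possible reading of the "pattern starts after a few characters" extension from the question, assuming the repetition must run to the end of the input.

```python
import re

def find_pattern(num):
    # Whole input must consist of a repeated unit; fullmatch acts as ^...$.
    s = str(num)
    m = re.fullmatch(r'(.+?)\1+', s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, len(s) // len(unit)

def find_trailing_pattern(num):
    # Leftmost position from which a unit repeats up to the end of the input.
    s = str(num)
    m = re.search(r'(.+?)\1+\Z', s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, (len(s) - m.start()) // len(unit), m.start()

print(find_pattern(1234123412341234))                 # ('1234', 4)
print(find_pattern(12341234123123))                   # None
print(find_trailing_pattern(78961234123412341234))    # ('1234', 4, 4)
```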
One way to find a recurring pattern and number of times repeated is to use this pattern:
(.+?)(?=\1+$|$)
with the g (global) option.
It returns the repeated pattern once per repetition, so the number of matches is the number of times the pattern is repeated:
Non-repeated inputs (failures) return only 1 match.
Repeated patterns return 2 or more matches.
Demo
I would like to right justify strings containing Thai characters (Thai rendering doesn't work from left to right, but can go up and down as well).
For example, for the strings ไป (two characters, length 2) and ซื้อ (four characters, length 2) I want to have the following output (length 5):
...ไป
...ซื้อ
The naive
print 'ไป'.decode('utf-8').rjust(5)
print 'ซื้อ'.decode('utf-8').rjust(5)
however, respectively produce
...ไป
.ซื้อ
Any ideas how to get to the desired formatting?
EDIT:
Given a string of Thai characters tc, I want to determine how many [places/fields/positions/whatever you want to call it] the string uses. This is not the same as len(tc); len(tc) is usually larger than the number of places used. The second word gives len(tc) = 4, but has length 2 / uses 2 places / uses 2 positions.
Cause
Thai script contains normal characters (positive advance width) and non-spacing marks as well (zero advance width).
For example, in the word ซื้อ:
the first character is the initial consonant "SO SO",
then it has vowel mark SARA UUE,
then tone mark MAI THO,
and then the final pseudo-consonant O ANG
The problem is that characters #2 and #3 in the list above are zero-width ones.
In other words, they do not make the string "wider".
In yet other words, ซื้อ ("to buy") and ซอ ("fiddle") would have equal width of two character places (but string lengths of 4 and 2, correspondingly).
Solution
In order to calculate the "real" string length, one must skip zero-width characters.
Python-specific
The unicodedata module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 8.0.0.
The unicodedata.category(unichr) function returns one of the following General Category values:
"Lo" for normal character;
"Mn" for zero-width non-spacing marks;
The rest is obvious, simply filter out the latter ones.
Further info:
Unicode data for Thai script (scroll till the first occurrence of "THAI CHARACTER")
I think what you mean to ask is, how to determine the 'true' # of characters in เรือ, ไป, ซื้อ etc. (which are 3,2 and 2, respectively)
Unfortunately, here's how Python interprets these characters:
ไป
>>> 'ไป'
'\xe0\xb9\x84\xe0\xb8\x9b'
>>> len('ไป')
6
>>> len('ไป'.decode('utf-8'))
2
ซื้อ
>>> 'ซื้อ'
'\xe0\xb8\x8b\xe0\xb8\xb7\xe0\xb9\x89\xe0\xb8\xad'
>>> len('ซื้อ')
12
>>> len('ซื้อ'.decode('utf-8'))
4
เรือ
>>> 'เรือ'
'\xe0\xb9\x80\xe0\xb8\xa3\xe0\xb8\xb7\xe0\xb8\xad'
>>> len('เรือ')
12
>>> len('เรือ'.decode('utf-8'))
4
There's no real correlation between the # of characters displayed and the # of actual (from Python's perspective) characters that make up the string.
I can't think of an obvious way to do this. However, I've found this library which might be of help to you. (You will also need to install some prerequisites.)
It looks like the rjust() function will not work for you and you will need to count the number of cells in the string yourself. You can then insert the number of spaces required before the string to achieve justification
You seem to know about Thai language. Sum the number of consonants, preceding vowels, following vowels and Thai punctuation. Don't count diacritics and above and below vowels.
Something like this (a sketch; string must be a Unicode string):
cells = 0
for ch in string:
    cp = ord(ch)
    if cp == 0x0E31 or 0x0E34 <= cp <= 0x0E3A or 0x0E47 <= cp <= 0x0E4E:
        # above/below vowel, tone mark or other diacritic: occupies no cell
        continue
    # consonant, preceding or following vowel, or punctuation
    cells += 1
Here's a function to compute the length of a thai string (the number of characters arranged horizontally), based on bytebuster's answer
import unicodedata
def get_thai_string_length(string):
    length = 0
    for c in string:
        # Non-spacing marks (category 'Mn') take up no horizontal space
        if unicodedata.category(c) != 'Mn':
            length += 1
    return length
print(len('บอินทัช'))
print(get_thai_string_length('บอินทัช'))
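Building on the same idea, right justification as asked in the question could be sketched as follows for Python 3 strings. The names display_width and rjust_display are hypothetical, and the sketch assumes that every character outside category 'Mn' occupies exactly one cell, which holds for the Thai examples here but not for all scripts.

```python
import unicodedata

def display_width(text):
    # Count characters that occupy a cell; non-spacing marks ('Mn') do not.
    return sum(1 for ch in text if unicodedata.category(ch) != 'Mn')

def rjust_display(text, width, fillchar=' '):
    # Pad on the left based on displayed width rather than len(text).
    padding = max(0, width - display_width(text))
    return fillchar * padding + text

print(rjust_display('ไป', 5, '.'))    # three dots, then the two-cell word
print(rjust_display('ซื้อ', 5, '.'))  # also three dots: the word displays in two cells
```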