Match regex permutations without repeating, but with a twist - python

I can't seem to find a solution to what is perhaps an easy problem: I want a simple regex that matches all possible permutations of 5 specified digits, without repetition, where all digits must be used. So, for this sequence:
12345
the valid permutation is:
54321
but
55555
is not valid.
However, if the provided digits contain a repeated digit, then (and only then) the accepted permutations must contain that digit the same number of times; each provided digit is still used exactly once. For example, if the provided number is:
55432
we see that 5 is provided 2 times, so it must be also present two times in each permutation, and some of the accepted answers would be:
32545
45523
but this is wrong:
55523
(not all original digits are used and 5 is repeated more than twice)
I came very close to solving this using:
(?:([43210])(?!.*\1)){5}
but unfortunately it doesn't work when the provided digits contain duplicates (like 43211).

One way to solve this is to make a character class out of the search digits and build a regex to search for as many digits in that class as are in the search string. Then you can filter the regex results based on the sorted match string being the same as the sorted search string. For example:
import re
def find_perms(search, text):
    search = sorted(search)
    regex = re.compile(rf'\b[{"".join(search)}]{{{len(search)}}}\b')
    matches = [m for m in regex.findall(text) if sorted(m) == search]
    return matches
print(find_perms('54321', '12345 54321 55432'))
print(find_perms('23455', '12345 54321 55432'))
print(find_perms('24455', '12345 54321 55432'))
Output:
['12345', '54321']
['55432']
[]
Note I've included word boundaries (\b) in the regex so that (for example) 12345 won't match 654321. If you want to match substrings as well, just remove the word boundaries from the regex.
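If a single regex is really required, one possible sketch is to generate it from the digits: one lookahead per distinct digit requiring at least as many occurrences as were provided, followed by a character class of fixed length. Because the total length is fixed, "at least n" for every digit forces "exactly n". The helper name build_perm_regex is hypothetical, and the code assumes the input consists of digits only (no escaping is done).

```python
import re
from collections import Counter

def build_perm_regex(digits):
    """Compile a pattern for the permutations of `digits` (assumed digits only)."""
    counts = Counter(digits)
    # One lookahead per distinct digit: "this digit occurs at least n times".
    lookaheads = ''.join(
        rf'(?=(?:.*{d}){{{n}}})' for d, n in sorted(counts.items())
    )
    # Only the provided digits are allowed, and the length is fixed,
    # so every digit ends up occurring exactly n times.
    allowed = ''.join(sorted(counts))
    return re.compile(rf'{lookaheads}[{allowed}]{{{len(digits)}}}')

pattern = build_perm_regex('55432')
for candidate in ('32545', '45523', '55523'):
    # fullmatch is used so the whole candidate must be consumed
    print(candidate, bool(pattern.fullmatch(candidate)))
```

The generated pattern grows with the number of distinct digits, so for anything beyond a quick check the sorted-string filter above is probably easier to maintain.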

The mathematical term for this is a multiset. In Python, a multiset can be represented with the collections.Counter data type. For example,
from collections import Counter
target = '55432'
candidate = '32545'
Counter(candidate) == Counter(target)
If you want to generate all of the multisets, here's one question dealing with that: How to generate all the permutations of a multiset?
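As a rough illustration of both ideas, the sketch below wraps the Counter comparison in a function and enumerates the distinct permutations of a multiset by brute force. The function names are hypothetical, and the brute-force enumeration is only an assumption that works for short inputs such as 5 digits; it is not the approach from the linked question.

```python
from collections import Counter
from itertools import permutations

def is_permutation(candidate, target):
    # Two strings are permutations of each other when their multisets are equal
    return Counter(candidate) == Counter(target)

def distinct_permutations(digits):
    # permutations() treats equal digits as different items,
    # so a set is used to drop the resulting duplicates
    return sorted({''.join(p) for p in permutations(digits)})

print(is_permutation('32545', '55432'))      # True
print(is_permutation('55523', '55432'))      # False
print(len(distinct_permutations('55432')))   # 60, i.e. 5! / 2!
```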

Related

Building a RegEx, 12 letters without order, fixed number of individual letters

I have a complex case where I can't get any further. The goal is to check a string via RegEx for the following conditions:
Exactly 12 letters
The letters W,S,I,O,B,A,R and H may only appear exactly once in the string
The letters T and E may only occur exactly 2 times in the string.
Important! The order must not matter
Example matches:
WSITTOBAEERH
HREEABOTTISW
WSITOTBAEREH
My first attempt:
results = re.match(r"^W{1}S{1}I{1}T{2}O{1}B{1}A{1}E{2}R{1}H{1}$", word)
The problem with this first attempt is that it only matches if the order of the letters in the RegEx has been followed. That violates condition 4
My second attempt:
results = re.match(r"^[W{1}S{1}I{1}T{2}O{1}B{1}A{1}E{2}R{1}H{1}]{12}$", word)
The problem with trial two: Now the order no longer matters, but the exact number of individual letters is ignored.
I can only do the basics of RegEx so far and can't get any further here. If anyone has an idea what a regular expression looks like that fits the four rules mentioned above, I would be very grateful.
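To see why the second attempt ignores the counts, it may help to note that inside square brackets the quantifiers lose their meaning: {, 1, 2 and } become ordinary members of the character class. A small check (the test words are made up for illustration):

```python
import re

second_attempt = r"^[W{1}S{1}I{1}T{2}O{1}B{1}A{1}E{2}R{1}H{1}]{12}$"

# A valid word, a word made of one repeated letter, and a "word" made only
# of the characters that were meant to be quantifiers.
for word in ("WSITTOBAEERH", "WWWWWWWWWWWW", "{1}{2}{1}{2}"):
    print(word, bool(re.match(second_attempt, word)))
```

All three strings are accepted, which shows that the class only restricts which characters may appear, not how often.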
One possibility, although I still think regex is inappropriate for this. Checks that all letters appear the desired amount and that it's 12 letters total (so there's no room left for any more/other letters):
import re
for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(re.fullmatch('(?=.*W)(?=.*S)(?=.*I)(?=.*O)'
                       '(?=.*B)(?=.*A)(?=.*R)(?=.*H)'
                       '(?=.*T.*T)(?=.*E.*E).{12}', s))
Another, checking that none other than T and E appear twice, that none appear thrice, and that we have only the desired letters, 12 total:
import re
for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(re.fullmatch(r'(?!.*([^TE]).*\1)'
                       r'(?!.*(.).*\1.*\1)'
                       r'[WSIOBARHTE]{12}', s))
A simpler way:
for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(sorted(s) == sorted('WSIOBARHTTEE'))

Finding the longest substring of repeating characters in a string

(this is the basis for this codeforces problem)
I try not to get help with Codeforces problems unless I'm really, really stuck, which happens to be now.
Your first mission is to find the password of the Martian database. To achieve this, your best secret agents have already discovered the following facts:
The password is a substring of a given string composed of a sequence of non-decreasing digits
The password is as long as possible
The password is always a palindrome
A palindrome is a string that reads the same backwards. racecar, bob, and noon are famous examples.
Given those facts, can you find all possible passwords of the database?
Input
The first line contains n, the length of the input string (1 ≤ n ≤ 10^5).
The next line contains a string of length n. Every character of this string is a digit.
The digits in the string are in non-decreasing order.
Output
On the first line, print the number of possible passwords, k.
On the next k lines, print the possible passwords in alphabetical order.
My observations are:
A palindrome in a non-decreasing string is simply a string of repeating characters (eg. "4444" or "11" )
For a character i, (index of the last instance of i) - (index of the first instance of i) + 1 = length of the run of i
Keeping track of the max password length and then filtering out every item that is shorter than the max password length guarantees that the passwords outputted are of max length
my solution based on these observations is:
n,s = [input() for i in range(2)]#input
maxlength = 0
results = []
for i in s:
    length = (s.rfind(i)-s.find(i))+1
    if int(i*(length)) not in results and length>=maxlength:
        results.append(int(i*(length)))
        maxlength = length
#filter everything lower than the max password length out
results = [i for i in results if len(str(i))>=maxlength]
#output
print(len(results))
for y in results:
    print(y)
Unfortunately, this solution is wrong: it fails on the 4th test case. I do not understand what is wrong with the code, so I cannot fix it. Can someone help with this?
Thanks for reading!
Your program will fail on:
4
0011
It will return just 11.
The problem is that the length of str(int('00')) is equal to 1.
You could fix it by removing the int and str calls from your program (i.e. saving the answers as strings instead of ints).
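A minimal sketch of that fix, keeping the structure of the original code but storing the candidates as strings so that leading zeros survive. The function name find_passwords is hypothetical, and reading the input is left out:

```python
def find_passwords(s):
    maxlength = 0
    results = []
    for ch in s:
        # Length of the run of ch (valid because the digits are non-decreasing)
        length = s.rfind(ch) - s.find(ch) + 1
        candidate = ch * length  # kept as a string, so '00' stays '00'
        if candidate not in results and length >= maxlength:
            results.append(candidate)
            maxlength = length
    # Keep only the candidates of maximal length
    return [r for r in results if len(r) == maxlength]

print(find_passwords('0011'))        # ['00', '11']
print(find_passwords('3445556788'))  # ['555']
```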
Peter de Rivaz seems to have identified the problem with your code. However, if you are interested in a different way to solve this problem, consider using a regular expression.
import sys
import re
next(sys.stdin) # length not needed in Python
s = next(sys.stdin)
repeats = r'(.)\1+'
for match in re.finditer(repeats, s):
    print(match.group())
The pattern (.)\1+ will find all substrings of repeated digits. Output for input
10
3445556788
would be:
44
555
88
If re.finditer() finds that there are no repeating digits then either the string is empty, or it consists of a sequence of increasing non-repeating digits. The first case is excluded since n must be greater than 0. For the second case the input is already sorted alphabetically, so just output the length and each digit.
Putting it together gives this code:
import sys
import re
next(sys.stdin) # length not needed in Python
s = next(sys.stdin).strip()
repeats = r'(.)\1+'
passwords = sorted((m.group() for m in re.finditer(repeats, s)),
                   key=len, reverse=True)
# p (not s) is used here so the input string is not shadowed
passwords = [p for p in passwords if len(p) == len(passwords[0])]
if len(passwords) == 0:
    passwords = list(s)
print(len(passwords))
print(*passwords, sep='\n')
Note that the matching substrings are extracted from the match object and then sorted by length descending. The code relies on the fact that digits in the input must not decrease so a second alphabetic sort of the candidate passwords is not required.

Searching for non uniform time mentions in a string

I'm having trouble with my python script
import re
text = 'asd;lkas;ldkasld12:00 AMalskjdadlakjasdasdas1:24 PMasldkjaskldjaslkdjd'
banana = re.findall ('\d\d:\d{2} \wM', text)
print (banana)
I'm trying to search for any mentions of time, but I can't find the strings if they are single digit in the text.
You are searching for exactly 2 digits with \d\d. You need to change it to:
'\d{1,2}:\d{2} \wM'
This will look for 1 or 2 digits. Also, I suppose that you want to match AM or PM with \wM; in that case you could use:
'\d{1,2}:\d{2} [AP]M'
date = re.findall(r"\d{1,2}:\d{2} [AP]M", text)
The {1,2} gives a lower and upper limit on the number of digits it should expect.
The [AP]M tells it to find either AM or PM, reducing the risk of false positives. (Inside a character class, | is a literal character, so [A|P] would also match |M.)
If you want some more information on what regex can do here is the documentation that helped me learn:
https://docs.python.org/2/library/re.html
I think this is what you are looking for:
banana = re.findall ('\d?\d:\d{2} \wM', text)
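For comparison, here is a small sketch that runs the original pattern and one of the suggested patterns on the text from the question. The last line uses a made-up input to show what a | inside a character class does.

```python
import re

text = 'asd;lkas;ldkasld12:00 AMalskjdadlakjasdasdas1:24 PMasldkjaskldjaslkdjd'

# Original pattern: requires exactly two digits before the colon
original = re.findall(r'\d\d:\d{2} \wM', text)
# Suggested pattern: one or two digits, then AM or PM
fixed = re.findall(r'\d{1,2}:\d{2} [AP]M', text)

print(original)  # ['12:00 AM']
print(fixed)     # ['12:00 AM', '1:24 PM']

# Inside [...] the | is a literal character, not alternation
print(re.findall(r'\d{1,2}:\d{2} [A|P]M', '7:30 |M'))  # ['7:30 |M']
```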

Regex - require a specific digit to be contained in \d+?

Consider the python code:
import re
re.findall('[0-9]+', 'XYZ 102 1030')
which returns:
['102', '1030']
Can one write a regex that requires at least one occurrence of the digit 3, i.e. I am interested in '[0-9]+' where there is at least one 3? So the result of interest would be:
['1030']
More generally, how about at least n 3's?
And even more generally, how about at least n 3's and at least k 4's, etc?
Just try the regexp '\d*3\d*', which means "0 or more digits, followed by a 3, followed by 0 or more digits".
If you want "at least 'n' 3", use '\d*(3\d*){n}'.
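In Python this could look like the sketch below. The helper name at_least_n is hypothetical. One detail worth noting: re.findall returns the contents of capturing groups when a pattern has any, so the group is written as non-capturing (?:...) here to get the whole number back.

```python
import re

def at_least_n(digit, n):
    # (?:...) instead of (...) so that findall returns whole matches
    return re.compile(rf'\d*(?:{digit}\d*){{{n}}}')

print(at_least_n('3', 1).findall('XYZ 102 1030'))       # ['1030']
print(at_least_n('3', 2).findall('XYZ 102 1030 3303'))  # ['3303']
```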
At least one 3 in the string could be
\d*3\d*
https://regex101.com/r/yEbatk/4
If you are looking for (at least) 2 times the number 3 inside, you could use:
\d*3\d*3\d*
https://regex101.com/r/yEbatk/5
If you want it to be (at least) n times you could use the {min,max} repeat option:
\d*(3\d*){n}
https://regex101.com/r/yEbatk/7
For n occurrences of x, m occurrences of y, and so on, build on this general expression:
(?=(?:\d*x){n})(?=(?:\d*y){m})\b\d+\b
where the lookahead part (?=(?:\d*x){n}) is repeated for each desired n and x.
I chose to make the lookahead groups non-capturing by surrounding them with (?:..), although it makes it a bit less readable.
The counting part itself is just (\d*x){n}, and it needs a lookahead because with more than one set of numbers to look for, the digits may appear in any order.
The final \b\d+\b ensures you capture just digits, surrounded by 'not-word' characters, so it will skip any sequence containing letters but will work on something like abc-123-456.
Example: 2 3's and 2 4's, in XYZ 1023344a 1403403
(?=(?:\d*3){2})(?=(?:\d*4){2})\b\d+\b
will match 1403403 but not 1023344a.
See
https://regex101.com/r/QgYptp/3
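The general expression can also be assembled programmatically. The sketch below builds one lookahead per digit from a dictionary of minimum counts; the function name build_min_count_regex and the dictionary layout are assumptions made for illustration.

```python
import re

def build_min_count_regex(mins):
    # One lookahead per digit: "(?=(?:\d*x){n})" means at least n occurrences of x
    lookaheads = ''.join(rf'(?=(?:\d*{d}){{{n}}})' for d, n in mins.items())
    # \b\d+\b then consumes a run of digits delimited by non-word characters
    return re.compile(lookaheads + r'\b\d+\b')

pattern = build_min_count_regex({'3': 2, '4': 2})
print(pattern.findall('XYZ 1023344a 1403403'))  # ['1403403']
```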
Although you can use a regex for this, regexes get messy and hard to read when you're searching for more than a couple of different digits. Instead, you can use collections.Counter to count the number of occurrences of each character in a string:
from collections import Counter
# Must contain at least two 3s, three 4s, and one 7
mins = { '3': 2, '4': 3, '7': 1 }
text = '3444 33447 334447 foo334447 473443 2317349414'  # renamed from input to avoid shadowing the built-in
tokens = text.split()
for token in tokens:
    # Skip tokens that aren't numbers
    if not token.isdigit():
        continue
    counter = Counter(token)
    for digit, min_count in mins.items():
        if counter[digit] < min_count:
            break
    else:
        # No break occurred, so every minimum count is satisfied
        print(token)
Output:
334447
473443
2317349414

Finding the recurring pattern

Let's say I have a number with a recurring pattern, i.e. there exists a string of digits that repeat themselves in order to make the number in question. For example, such a number might be 1234123412341234, created by repeating the digits 1234.
What I would like to do, is find the pattern that repeats itself to create the number. Therefore, given 1234123412341234, I would like to compute 1234 (and maybe 4, to indicate that 1234 is repeated 4 times to create 1234123412341234)
I know that I could do this:
def findPattern(num):
    num = str(num)
    # Start at 1: an empty prefix would cause a division by zero
    for i in range(1, len(num)):
        patt = num[:i]
        if len(num) % len(patt):
            continue
        if patt * (len(num) // len(patt)) == num:
            return patt, len(num) // len(patt)
    return None
However, this seems a little too hacky. I figured I could use itertools.cycle to compare two cycles for equality, which doesn't really pan out:
In [25]: c1 = itertools.cycle(list(range(4)))
In [26]: c2 = itertools.cycle(list(range(4)))
In [27]: c1==c2
Out[27]: False
Is there a better way to compute this? (I'd be open to a regex, but I have no idea how to apply it there, which is why I didn't include it in my attempts)
EDIT:
I don't necessarily know that the number has a repeating pattern, so I have to return None if there isn't one.
Right now, I'm only concerned with detecting numbers/strings that are made up entirely of a repeating pattern. However, later on, I'll likely also be interested in finding patterns that start after a few characters:
magic_function(78961234123412341234)
would return 1234 as the pattern, 4 as the number of times it is repeated, and 4 as the first index in the input where the pattern first presents itself
(.+?)\1+
Try this and grab the capture group.
import re
p = re.compile(r'(.+?)\1+')  # the ur'' prefix is Python 2 only; r'' works in Python 3
test_str = u"1234123412341234"
re.findall(p, test_str)
Add anchors and flag Multiline if you want the regex to fail on 12341234123123, which should return None.
^(.+?)\1+$
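A possible Python wrapper around these two patterns is sketched below. The function names are hypothetical. The first uses re.fullmatch, which has the same effect as the anchors, and returns None when the input is not made up entirely of a repeated unit. The second uses re.search and also reports where the repetition starts; note that it returns the first repetition found, which could be a short one such as 11 if the leading characters happen to repeat.

```python
import re

def find_pattern(num):
    """Return (unit, repeat count) if num consists entirely of a repeated unit."""
    s = str(num)
    m = re.fullmatch(r'(.+?)\1+', s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, len(s) // len(unit)

def find_pattern_anywhere(num):
    """Return (unit, repeat count, start index) of the first repetition found."""
    s = str(num)
    m = re.search(r'(.+?)\1+', s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, len(m.group(0)) // len(unit), m.start()

print(find_pattern(1234123412341234))               # ('1234', 4)
print(find_pattern(12341234123123))                 # None
print(find_pattern_anywhere(78961234123412341234))  # ('1234', 4, 4)
```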
One way to find a recurring pattern and number of times repeated is to use this pattern:
(.+?)(?=\1+$|$)
with the g (global) option.
It returns the repeated pattern once per repetition, so the number of matches is the number of times the pattern is repeated.
A non-repeating input yields only 1 match (the whole string).
A repeating input yields 2 or more matches.
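Python has no g flag; re.findall plays that role. A minimal sketch, assuming a non-empty input string (the function name count_repeats is hypothetical):

```python
import re

def count_repeats(s):
    # Each match is one copy of the unit; the lookahead requires that the rest
    # of the string is either more copies of the unit or nothing at all.
    matches = re.findall(r'(.+?)(?=\1+$|$)', s)
    return matches[0], len(matches)

print(count_repeats('1234123412341234'))  # ('1234', 4)
print(count_repeats('12341234123123'))    # ('12341234123123', 1)
```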
