Let's say I have a number with a recurring pattern, i.e. there exists a string of digits that repeat themselves in order to make the number in question. For example, such a number might be 1234123412341234, created by repeating the digits 1234.
What I would like to do, is find the pattern that repeats itself to create the number. Therefore, given 1234123412341234, I would like to compute 1234 (and maybe 4, to indicate that 1234 is repeated 4 times to create 1234123412341234)
I know that I could do this:
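def findPattern(num):
    num = str(num)
    # Try every proper prefix as the candidate repeating unit.
    for i in range(1, len(num)):
        patt = num[:i]
        # The unit length must divide the total length.
        if len(num) % len(patt):
            continue
        if patt * (len(num) // len(patt)) == num:
            return patt, len(num) // len(patt)
    return None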
However, this seems a little too hacky. I figured I could use itertools.cycle to compare two cycles for equality, which doesn't really pan out:
In [25]: c1 = itertools.cycle(list(range(4)))
In [26]: c2 = itertools.cycle(list(range(4)))
In [27]: c1==c2
Out[27]: False
Is there a better way to compute this? (I'd be open to a regex, but I have no idea how to apply it there, which is why I didn't include it in my attempts)
EDIT:
I don't necessarily know that the number has a repeating pattern, so I have to return None if there isn't one.
Right now, I'm only concerned with detecting numbers/strings that are made up entirely of a repeating pattern. However, later on, I'll likely also be interested in finding patterns that start after a few characters:
magic_function(78961234123412341234)
would return 1234 as the pattern, 4 as the number of times it is repeated, and 4 as the first index in the input where the pattern first presents itself
(.+?)\1+
Try this. Grab the capture. See demo.
import re
p = re.compile(r'(.+?)\1+')
test_str = u"1234123412341234"
re.findall(p, test_str)
Add anchors and flag Multiline if you want the regex to fail on 12341234123123, which should return None.
^(.+?)\1+$
See demo.
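As a rough sketch of how the anchored pattern could be wrapped to return the pattern and repeat count (or None), and how a trailing-anchor variant might handle a pattern that starts after a few characters. The function names find_pattern and find_trailing_pattern are made up for illustration, and the second function assumes the repetition runs to the end of the input:

```python
import re

def find_pattern(num):
    """Return (unit, repeat_count) if the whole input is a repeated unit, else None."""
    s = str(num)
    # fullmatch plays the role of the ^...$ anchors.
    m = re.fullmatch(r'(.+?)\1+', s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, len(s) // len(unit)

def find_trailing_pattern(num):
    """Return (unit, repeat_count, start_index) for a repetition that runs to the end."""
    s = str(num)
    # search tries each start position from the left; $ pins the repetition to the end.
    m = re.search(r'(.+?)\1+$', s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, (len(s) - m.start()) // len(unit), m.start()

print(find_pattern(1234123412341234))               # ('1234', 4)
print(find_pattern(12341234123123))                 # None
print(find_trailing_pattern(78961234123412341234))  # ('1234', 4, 4)
```

Because the group is lazy, the shortest repeating unit is reported, so 1111 gives ('1', 4) rather than ('11', 2).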
One way to find a recurring pattern and number of times repeated is to use this pattern:
(.+?)(?=\1+$|$)
with the g (global) option.
It returns the repeated pattern once per repetition, so the number of matches is the number of times it is repeated.
A string with no repeating pattern returns exactly one match (the whole string).
A repeated pattern returns two or more matches.
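In Python there is no g flag; re.findall is the closest equivalent. A minimal sketch, assuming a helper name (repeat_units) that is not part of the original answer:

```python
import re

def repeat_units(text):
    # findall returns the contents of the single capturing group for every match.
    return re.findall(r'(.+?)(?=\1+$|$)', text)

units = repeat_units('1234123412341234')
print(units[0], len(units))            # 1234 4
print(repeat_units('12341234123123'))  # ['12341234123123'] (one match: no repetition)
```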
I have a complex case where I can't get any further. The goal is to check a string via RegEx for the following conditions:
Exactly 12 letters
The letters W,S,I,O,B,A,R and H may only appear exactly once in the string
The letters T and E may only occur exactly 2 times in the string.
Important! The order must not matter
Example matches:
WSITTOBAEERH
HREEABOTTISW
WSITOTBAEREH
My first attempt:
results = re.match(r"^W{1}S{1}I{1}T{2}O{1}B{1}A{1}E{2}R{1}H{1}$", word)
The problem with this first attempt is that it only matches if the order of the letters in the RegEx has been followed. That violates condition 4
My second attempt:
results = re.match(r"^[W{1}S{1}I{1}T{2}O{1}B{1}A{1}E{2}R{1}H{1}]{12}$", word)
The problem with trial two: Now the order no longer matters, but the exact number of individual letters is ignored.
I can only do the basics of RegEx so far and can't get any further here. If anyone has an idea what a regular expression looks like that fits the four rules mentioned above, I would be very grateful.
One possibility, although I still think regex is inappropriate for this. Checks that all letters appear the desired amount and that it's 12 letters total (so there's no room left for any more/other letters):
import re
for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(re.fullmatch('(?=.*W)(?=.*S)(?=.*I)(?=.*O)'
                       '(?=.*B)(?=.*A)(?=.*R)(?=.*H)'
                       '(?=.*T.*T)(?=.*E.*E).{12}', s))
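If the letters and counts change, the lookahead pattern above could be generated rather than typed by hand. This is only a sketch: build_count_regex and the required dict are hypothetical names, and it relies on the same argument as above (each letter appears at least its required number of times, and the total length leaves no room for extras):

```python
import re

def build_count_regex(counts):
    # One lookahead per letter: '(?=.*T.*T)' demands at least two T's.
    lookaheads = ''.join(
        '(?=' + ('.*' + re.escape(letter)) * n + ')'
        for letter, n in counts.items()
    )
    # The total length equals the sum of the counts, so "at least" becomes "exactly".
    total = sum(counts.values())
    return re.compile(lookaheads + '.{%d}' % total)

required = dict.fromkeys('WSIOBARH', 1)
required.update({'T': 2, 'E': 2})
pattern = build_count_regex(required)

for s in ('WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH', 'WSITTOBAEERR'):
    print(s, bool(pattern.fullmatch(s)))
```

The last string has no H and two R's, so it is rejected.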
Another, checking that none other than T and E appear twice, that none appear thrice, and that we have only the desired letters, 12 total:
import re
for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    # The second lookahead must refer to its own group, which is group 2.
    print(re.fullmatch(r'(?!.*([^TE]).*\1)'
                       r'(?!.*(.).*\2.*\2)'
                       r'[WSIOBARHTE]{12}', s))
A simpler way:
for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(sorted(s) == sorted('WSIOBARHTTEE'))
It seems that I can't find a solution for this perhaps an easy problem: I want to be able to match with a simple regex all possible permutations of 5 specified digits, without repeating, where all digits must be used. So, for this sequence:
12345
the valid permutation is:
54321
but
55555
is not valid.
However, if the provided digits have the same number once or more, only in that case the accepted permutations will have those repeated digits, but each digit must be used only once. For example, if the provided number is:
55432
we see that 5 is provided 2 times, so it must be also present two times in each permutation, and some of the accepted answers would be:
32545
45523
but this is wrong:
55523
(not all original digits are used and 5 is repeated more than twice)
I came very close to solving this using:
(?:([43210])(?!.*\1)){5}
but unfortunately it doesn't work when the provided digits contain duplicates (like 43211).
One way to solve this is to make a character class out of the search digits and build a regex to search for as many digits in that class as are in the search string. Then you can filter the regex results based on the sorted match string being the same as the sorted search string. For example:
import re
def find_perms(search, text):
    search = sorted(search)
    regex = re.compile(rf'\b[{"".join(search)}]{{{len(search)}}}\b')
    matches = [m for m in regex.findall(text) if sorted(m) == search]
    return matches
print(find_perms('54321', '12345 54321 55432'))
print(find_perms('23455', '12345 54321 55432'))
print(find_perms('24455', '12345 54321 55432'))
Output:
['12345', '54321']
['55432']
[]
Note I've included word boundaries (\b) in the regex so that (for example) 12345 won't match 654321. If you want to match substrings as well, just remove the word boundaries from the regex.
The mathematical term for this is a multiset. In Python, this is handled by the Counter data type. For example,
from collections import Counter
target = '55432'
candidate = '32545'
Counter(candidate) == Counter(target)
If you want to generate all of the multisets, here's one question dealing with that: How to generate all the permutations of a multiset?
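The Counter comparison above could also be applied to every run of digits in a longer text. A small sketch, where find_multiset_matches is a made-up helper name and candidates are assumed to be whitespace-separated or otherwise delimited digit runs:

```python
import re
from collections import Counter

def find_multiset_matches(target, text):
    wanted = Counter(target)
    # Keep only the digit runs whose digit counts equal those of the target.
    return [tok for tok in re.findall(r'\d+', text) if Counter(tok) == wanted]

print(find_multiset_matches('55432', '32545 45523 55523 12345'))
# ['32545', '45523']
```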
I am facing a problem when trying to create a regex that should help find strings containing specific combinations of substrings.
For example, I am searching for the substring combination:
ab-ab-cd
1) "xxxabxxxxxxabxxxxcdxxx" -> should be a match
2) "xxxabxxxxabxxxxabxxxxcdxxxx" -> no match
3) "xxxabxxxxxxxxxxcdxxxx" -> no match
to make it even more complicated:
4) "xxxabxxxxxabxxxxcdxxxabxxx" -> should also be a match
My substring combinations could also be like this:
ab-cd
or
ab-ab-ab-cd
or
ab-cd-ab-cd
For all these (and more) examples I am looking for a systematic way to build the corresponding regexes, so that only strings in which the substrings occur in the right order and with the correct frequency are found as matches.
I got something like this for the "ab-ab-cd" substring search but it fails in cases like 4) of my examples.
p = re.compile("(?:(?!ab).)*ab.*?ab(?!.*ab).*cd",re.IGNORECASE)
In cases like 4) this one works, but it also matches strings like 2):
p = re.compile("(?:(?!ab).)*ab(?:(?!ab).)*ab((?!ab|cd)*).*cd", re.IGNORECASE)
Could you please point me to my mistake?
Thanks a lot!
EDIT:
Sorry to all, that my question was not clear enough. I tried to break my problem down into a more simple case, which might have been no good idea.
Here comes the detailed explanation of the problem:
I have a list of (protein) sequences and need to assign a specific type to each sequence on the basis of sequence patterns.
Therefore I created a dictionary with type-name as key and feature template (list of sequence features in a specific order) as value, e.g.:
type_a -> [A,A,B,C]
type_b -> [A,B,C]
type_c -> [A,B,A,B]
In another dict I have (simple) regex patterns for each feature, e.g.:
A -> [PHT]AG[QP]LI
B -> RS[TP]EV
C -> ...
D -> ...
Now, for each template (type_a, type_b, ...), I want to systematically build the concatenated regex pattern (i.e. for type_a, build a regex searching for A,A,B,C).
That would then result in another dict with types as keys and the complete regexes as values.
Now I want to go through each sequence in my list of sequences and map all complete regex templates against each sequence. In the best case, only one complete regex (type) should match the sequence.
Taking the example from above, having the following regex-templates:
cd
ab-cd
ab-ab-cd
ab-ab-ab-cd
ab-cd-ab-cd
ab-ab-cd-ab
"xxxabxxxxxxabxxxxcdxxx"
->this sequence should match the regex for the template "ab-ab-cd" and not any of the others
With the following regex I could perfectly look for ab-ab-cd.
p = re.compile("(?:(?!ab).)*ab.*?ab(?!.*ab).*cd",re.IGNORECASE)
If my tests were correct it would only match sequence 1) from above and not 2) or 3).
However, if I want to search for ab-ab-cd-ab, the negative look-ahead does not allow the last ab to be found. I found something like the following code to end the negative look-ahead after the second "ab" part. In my understanding, the negative look-ahead should stop at the "cd", so that the last "ab" can be matched again.
p = re.compile("(?:(?!ab).)*ab(?:(?!ab).)*ab((?!ab|cd)*).*cd", re.IGNORECASE)
It solves the problem with the last "ab" from ab-ab-cd-ab.
But somehow it now matches not only 2 times "ab" before the "cd" (sequence 1, ab-ab-cd) but also 3 (or more) times "ab" before the "cd" (sequence 2, ab-ab-ab-cd), which it should not.
I hope my problem is clearer now. Thanks a lot for all the answers; I will try the code tomorrow when I am back at work. Any further answers are highly appreciated, as are explanations of the regex code (I am pretty new to regex) and suggestions for which re functions (match, findall, ...) to use.
Thanks
You could use re.findall and post-process the result. Effectively, you want to find all instances of ab or cd and check whether your pattern (['ab', 'ab', 'cd']) is at the start of the list. The following:
import re
test1 = "xxxabxxxxxxabxxxxcdxxx"
test2 = "xxxabxxxxabxxxxabxxxxcdxxxx"
test3 = "xxxabxxxxxxxxxxcdxxxx"
test4 = "xxxabxxxxxabxxxxcdxxxabxxx"
for x in (test1, test2, test3, test4):
    matches = re.findall(r'(ab|cd)', x)
    print(matches[:3] == ['ab', 'ab', 'cd'])
prints
True
False
False
True
As required.
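The same findall-and-compare idea might be generalised to the dictionaries of templates and features described in the question. The sketch below is an assumption-laden illustration, not the asker's actual data: the names feature_sequence and matching_types, the prefix_only switch, and the toy features dict are all invented here. It also assumes feature names are valid Python identifiers and that the feature patterns do not overlap.

```python
import re

def feature_sequence(text, features):
    # One named group per feature; lastgroup tells which alternative matched.
    combined = re.compile(
        '|'.join(f'(?P<{name}>{pat})' for name, pat in features.items()),
        re.IGNORECASE,
    )
    return [m.lastgroup for m in combined.finditer(text)]

def matching_types(text, templates, features, prefix_only=False):
    found = feature_sequence(text, features)
    result = []
    for type_name, template in templates.items():
        # prefix_only=True mimics the matches[:3] comparison above.
        observed = found[:len(template)] if prefix_only else found
        if observed == template:
            result.append(type_name)
    return result

features = {'A': 'ab', 'B': 'cd'}
templates = {
    'type_1': ['A', 'B'],
    'type_2': ['A', 'A', 'B'],
    'type_3': ['A', 'A', 'A', 'B'],
    'type_4': ['A', 'A', 'B', 'A'],
}
print(matching_types('xxxabxxxxxxabxxxxcdxxx', templates, features))
# ['type_2']
print(matching_types('xxxabxxxxxabxxxxcdxxxabxxx', templates, features, prefix_only=True))
# ['type_2', 'type_4']
```

Comparing the whole list gives at most one type per sequence when the templates are distinct, which seems closer to the classification goal in the edit; comparing only a prefix reproduces the behaviour asked for in example 4.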
Why do you need the negative look ahead?
Why not use something as simple as that:
.*ab.*ab.*cd
Or if you need it to find a match from the beginning of the line, you can use:
^.*ab.*ab.*cd
Edit:
After your comment I understood what you need. Try this one:
^(?:(?!ab).)*ab(?:(?!ab).)*ab(?:(?!ab).)*cd
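A pattern of this shape could be assembled from a template list instead of being written by hand. The following is a sketch under stated assumptions: build_template_regex and its exact flag are hypothetical, the parts are treated as literal substrings (re.escape would have to be dropped for regex features), the input is a single line, and, unlike the pattern above, the gaps exclude every part of the template rather than only ab.

```python
import re

def build_template_regex(parts, exact=False):
    unique = list(dict.fromkeys(parts))  # distinct parts, original order kept
    alternatives = '|'.join(re.escape(p) for p in unique)
    # A gap is any run of characters in which no part starts.
    gap = f'(?:(?!{alternatives}).)*'
    body = gap + gap.join(re.escape(p) for p in parts)
    if exact:
        # Forbid further parts after the last one.
        body += gap + r'\Z'
    return re.compile(body, re.IGNORECASE)

tests = [
    'xxxabxxxxxxabxxxxcdxxx',
    'xxxabxxxxabxxxxabxxxxcdxxxx',
    'xxxabxxxxxxxxxxcdxxxx',
    'xxxabxxxxxabxxxxcdxxxabxxx',
]
pattern = build_template_regex(['ab', 'ab', 'cd'])
print([bool(pattern.match(t)) for t in tests])  # [True, False, False, True]

strict = build_template_regex(['ab', 'ab', 'cd'], exact=True)
print([bool(strict.match(t)) for t in tests])   # [True, False, False, False]
```

pattern.match anchors at the start of the string, which stands in for the leading ^.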
Consider the python code:
import re
re.findall('[0-9]+', 'XYZ 102 1030')
which returns:
['102', '1030']
Can one write a regex that requires at least one occurrence of the digit 3, i.e. I am interested in '[0-9]+' where there is at least one 3? So the result of interest would be:
['1030']
More generally, how about at least n 3's?
And even more generally, how about at least n 3's and at least k 4's, etc?
Just try the regexp '\d*3\d*', which means "0 or more digits, followed by a 3, followed by 0 or more digits".
You can check it here
If you want "at least 'n' 3", use '\d*(3\d*){n}'.
At least one 3 in the string could be
\d*3\d*
https://regex101.com/r/yEbatk/4
If you are looking for (at least) 2 times the number 3 inside, you could use:
\d*3\d*3\d*
https://regex101.com/r/yEbatk/5
If you want it to be (at least) n times you could use the {min,max} repeat option:
\d*(3\d*){n}
https://regex101.com/r/yEbatk/7
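One Python-specific caveat when using this with re.findall: if the pattern contains a capturing group, findall returns the group's contents rather than the whole match, so a non-capturing group (?:...) is the safer choice. A minimal sketch, where at_least is a made-up helper name:

```python
import re

def at_least(digit, n):
    # (?:...) keeps findall returning whole matches instead of the last group.
    return re.compile(rf'\d*(?:{digit}\d*){{{n}}}')

print(at_least('3', 1).findall('XYZ 102 1030'))  # ['1030']
print(at_least('3', 2).findall('1030 1330 33'))  # ['1330', '33']
```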
For n occurrences of x, m occurrences of y, and so on, build on this general expression:
(?=(?:\d*x){n})(?=(?:\d*y){m})\b\d+\b
where the lookahead part (?=(?:\d*x){n}) is repeated for each desired n and x.
I chose to make the lookahead groups non-capturing by surrounding them with (?:..), although it makes it a bit less readable.
The counting part itself is just (\d*x){n}, and it needs a lookahead because with more than one set of numbers to look for, the digits may appear in any order.
The final \b\d+\b ensures you capture just digits, surrounded by 'not-word' characters, so it will skip any sequence containing letters but will work on something like abc-123-456.
Example: 2 3's and 2 4's, in XYZ 1023344a 1403403
(?=(?:\d*3){2})(?=(?:\d*4){2})\b\d+\b
will match 1403403 but not 1023344a.
See
https://regex101.com/r/QgYptp/3
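In Python, the general expression could be generated from a dictionary of minimum counts. This is a sketch only; build_min_count_regex is a hypothetical name and the keys are assumed to be single digits:

```python
import re

def build_min_count_regex(mins):
    # One lookahead per digit: (?=(?:\d*3){2}) requires at least two 3s ahead.
    lookaheads = ''.join(
        rf'(?=(?:\d*{digit}){{{n}}})' for digit, n in mins.items()
    )
    return re.compile(lookaheads + r'\b\d+\b')

pattern = build_min_count_regex({'3': 2, '4': 2})
print(pattern.pattern)
print(pattern.findall('XYZ 1023344a 1403403'))  # ['1403403']
```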
Although you can use a regex for this, regexes get messy and hard to read when you're searching for more than a couple of different digits. Instead, you can use collections.Counter to count the number of occurrences of each character in a string:
from collections import Counter
# Must contain at least two 3s, three 4s, and one 7
mins = { '3': 2, '4': 3, '7': 1 }
input = '3444 33447 334447 foo334447 473443 2317349414'
tokens = input.split()
for token in tokens:
    # Skip tokens that aren't numbers
    if not token.isdigit():
        continue
    counter = Counter(token)
    for digit, min_count in mins.items():
        if counter[digit] < min_count:
            break
    else:
        print(token)
Output:
334447
473443
2317349414
I am new to regular expressions and I am trying to write a pattern of phone numbers, in order to identify them and be able to extract them. My doubt can be summarized to the following simple example:
I try first to identify whether in the string is there something like (+34) which should be optional:
prefixsrch = re.compile(r'(\(?\+34\)?)?')
that I test in the following string in the following way:
line0 = "(+34)"
print prefixsrch.findall(line0)
which yields the result:
['(+34)','']
My first question is: why does it find two occurrences of the pattern? I guess that this is related to the fact that the prefix is optional, but I do not completely understand it. Anyway, now for my main question.
If we do a similar thing searching for a pattern of 9 digits we get the same:
numsrch = re.compile(r'\d{9}')
line1 = "971756754"
print numsrch.findall(line1)
yields something like:
['971756754']
which is fine. Now what I want to do is identify a 9 digits number, preceded or not, by (+34). So to my understanding I should do something like:
phonesrch = re.compile(r'(\(?\+34\)?)?\d{9}')
If I test it in the following strings...
line0 = "(+34)971756754"
line1 = "971756754"
print phonesrch.findall(line0)
print phonesrch.findall(line1)
this is, to my surprise, what I get:
['(+34)']
['']
What I was expecting to get is ['(+34)971756754'] and ['971756754']. Does anybody have any insight into this? Thank you very much in advance.
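Your capturing group is wrong: when a pattern contains a capturing group, re.findall returns the group's contents instead of the whole match. Put the country code in a non-capturing group and the entire expression in the capturing group: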
>>> line0 = "(+34)971756754"
>>> line1 = "971756754"
>>> re.findall(r'((?:\(?\+34\)?)?\d{9})',line0)
['(+34)971756754']
>>> re.findall(r'((?:\(?\+34\)?)?\d{9})',line1)
['971756754']
My first question is: why does it find two occurrences of the pattern?
This is because ? means 0 or 1 repetitions, so the empty string is also a valid match. After (+34) is consumed, findall finds one more (empty) match at the end of the string.
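Two other ways to get the whole number, sketched here as possible alternatives: drop the capturing group entirely, or keep it and read group(0) from re.finditer. The names phone, with_prefix, and extract_phones are illustrative only.

```python
import re

# No capturing group at all: findall then returns the whole match.
phone = re.compile(r'(?:\(?\+34\)?)?\d{9}')

def extract_phones(text):
    return phone.findall(text)

print(extract_phones('(+34)971756754'))  # ['(+34)971756754']
print(extract_phones('971756754'))       # ['971756754']

# Keeping the group: finditer exposes both the whole match and the prefix.
with_prefix = re.compile(r'(\(?\+34\)?)?\d{9}')
for m in with_prefix.finditer('(+34)971756754 971756754'):
    # group(1) is None when the optional prefix is absent.
    print(m.group(0), m.group(1))
```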