Regular expression to match any character repeated exactly twice - python

I am trying to identify whether a supplied string has characters repeated exactly twice. The following is the regular expression that I am using:
([a-z])\1(?!\1)
However, when tested against the following strings, both of them match the pattern (even though I have used (?!\1)):
>>> re.findall(r'.*([a-z])\1(?!\1)', 'abcdeefg')
['e']
>>> re.findall(r'.*([a-z])\1(?!\1)', 'abcdeeefg')
['e']
What is wrong in the above pattern?

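I suspect that a Python regular expression alone will not meet your needs. Ensuring that a character is repeated exactly twice requires a negative lookbehind assertion that refers back to the captured group, and in older Python versions such assertions cannot contain group references (Python 3.5 added support for fixed-length group references in lookbehind).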
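To see why the lookahead alone is not enough, here is a minimal sketch using the second string from the question. The variable names are illustrative. The lookahead rejects a match that starts at the first e of 'eee', but the engine then simply retries one position to the right.

```python
import re

# The negative lookahead blocks a match starting at the first 'e' of 'eee',
# but nothing prevents a match that starts at the second 'e'.
m = re.search(r'([a-z])\1(?!\1)', 'abcdeeefg')
span = m.span()      # position of the match in the string
matched = m.group()  # the text that was matched
print(span, matched)
```

The match covers the last two e characters, which are followed by f, so the pattern reports success even though the run has length three.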
The easiest approach is to instead look for all repetitions and simply check their length.
import itertools
import re

def double_repeats(txt):
    # find possible repeats
    candidates = set(re.findall(r'([a-z])\1', txt))
    # now find the actual repeated blocks
    repeats = itertools.chain(*[re.findall(r"({0}{0}+)".format(c), txt) for c in candidates])
    # return just the blocks of length 2
    return [x for x in repeats if len(x) == 2]
Then:
>>> double_repeats("abbbcbbdddeffgggg")
['ff', 'bb']
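The same idea (find every repetition, then check its length) can also be sketched in a single pass. This is only an illustration, assuming lowercase letters as in the question; the name double_repeats_ordered is made up for this example.

```python
import re

def double_repeats_ordered(txt):
    # ([a-z])\1+ matches a whole run of two or more identical letters,
    # so each run is seen exactly once and its length can be checked.
    return [m.group(0)
            for m in re.finditer(r'([a-z])\1+', txt)
            if len(m.group(0)) == 2]

print(double_repeats_ordered("abbbcbbdddeffgggg"))
```

Because no set is involved, the blocks come back in the order in which they appear in the string.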

You could use a regex alternate operator trick.
>>> def guess(s):
...     out = re.findall(r'([a-z])\1{2,}|([a-z])\2', s)
...     # the second group is non-empty only for a run of exactly two characters
...     return any(double for _, double in out)
...
>>> k = ['abcdeefg', 'abcdeeefg']
>>> [guess(i) for i in k]
[True, False]
>>>
([a-z])\1{2,} matches any run of three or more identical characters.
| is the alternation (OR) operator.
([a-z])\2 matches a run of exactly two identical characters in what remains, since every longer run has already been consumed by the first alternative.
or
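>>> def guess(s):
...     out = re.findall(r'([a-z])\1{2,}|([a-z])\2', s)
...     # keep only the matches produced by the second alternative
...     doubles = [double for _, double in out if double]
...     return doubles[0] if doubles else False
...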
>>> k = '23413e4abcee'
>>> k.count(guess(k)) == 2
False
>>> k = '234134abcee'
>>> k.count(guess(k)) == 2
True
>>>
If you want to get output like the other answers, then here you go:
>>> def guess(s):
...     out = re.findall(r'([a-z])\1{2,}|([a-z])\2', s)
...     return [y + y for x, y in out if y]
...
>>> guess("abbbcbbdddeffgggg")
['bb', 'ff']
>>>
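To see what the alternation returns before any filtering, here is a small sketch that prints the raw findall tuples for the same input. The variable names raw and doubles are illustrative.

```python
import re

# Each tuple is (group 1, group 2). Runs of three or more fill group 1;
# runs of exactly two fill group 2; the other slot is an empty string.
raw = re.findall(r'([a-z])\1{2,}|([a-z])\2', 'abbbcbbdddeffgggg')
print(raw)

# Keeping only the non-empty second group leaves the exact doubles.
doubles = [second * 2 for _, second in raw if second]
print(doubles)
```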

I find it the best way to do that.

Related

Get certain items based on their formatting

I have a list of values: some are numeric only, others are made up of words, and others are a mix of the two.
I would like to select only the items that consist of a number, a single letter, and a number.
To illustrate, this is my list of values:
l = ['980X2350', 'DO_UN_HPL_Glas_Links', 'DO_UN_HPL_Glas_Rechts',
'930x2115', 'DO_UN_HPL_Links', 'DO_UN_HPL_Rechts', '830X2115',
'Deuropening', 'BF_32_Tourniquets_dubbeledeur_Aluminium']
I'd like to just get back:
['980X2350', '930x2115', '830X2115']
There is no need to import re for such a trivial matter.
Here is an approach that can be more efficient than the regex-based one:
allowed = '0123456789x'
def filter_str(lst):
    output = []
    for s in lst:
        c = s.lower().strip()
        # only digits and a single 'x' are allowed
        if all(i in allowed for i in c) and c.count('x') == 1:
            output.append(s)
    return output
If the strings must contain two numeric fields:
allowed = '0123456789x'
def filter_str(lst):
    output = []
    for s in lst:
        c = s.lower().strip()
        n = len(c) - 1
        # the 'x' must not be the first or the last character
        if all(i in allowed for i in c) and c.count('x') == 1 and c.index('x') not in (0, n):
            output.append(s)
    return output
The all function short-circuits (i.e. it stops checking as soon as a falsy value is encountered), and Python's logical operators also short-circuit: for the and operator, the right-hand operand is not evaluated if the left-hand operand is falsy. So although my code looks longer than the regex-based one, it can execute faster because it stops at the first disallowed character. A regex match also gives up at the first character that cannot match, so measure both on your own data before relying on the difference.
Assuming a list of strings as input, you can use a regex and a list comprehension:
l = ['980X2350', 'DO_UN_HPL_Glas_Links', 'DO_UN_HPL_Glas_Rechts',
'930x2115', 'DO_UN_HPL_Links', 'DO_UN_HPL_Rechts', '830X2115',
'Deuropening', 'BF_32_Tourniquets_dubbeledeur_Aluminium']
import re
regex = re.compile(r'\d+x\d+', flags=re.I)
# fullmatch requires the whole string to fit the pattern, not just a prefix
out = [s for s in l if regex.fullmatch(s.strip())]
output:
['980X2350', '930x2115', '830X2115']
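One detail worth checking with this kind of pattern is anchoring. The sketch below uses made-up sample strings (not from the question) to show that re.match only anchors at the start of the string, while fullmatch requires the entire string to fit.

```python
import re

pattern = re.compile(r'\d+x\d+', flags=re.I)

# Hypothetical inputs: the second one has trailing text after the size.
samples = ['980X2350', '930x2115_extra', 'x2115']

# match() succeeds as soon as a prefix of the string fits the pattern.
prefix_matches = [s for s in samples if pattern.match(s)]
# fullmatch() succeeds only if the whole string fits the pattern.
full_matches = [s for s in samples if pattern.fullmatch(s)]

print(prefix_matches)
print(full_matches)
```

Which of the two is appropriate depends on whether trailing text should disqualify an item.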
Assuming a list of strings:
you can keep a counter of the letters encountered in each item; if the count is exactly 1 and the item also contains at least one digit, append the item to your output list:
a = ['980X2350', 'DO_UN_HPL_Glas_Links', 'DO_UN_HPL_Glas_Rechts', '930x2115', 'DO_UN_HPL_Links',
'DO_UN_HPL_Rechts', '830X2115', 'Deuropening' ]
alphabet = 'abcdefghijklmnopqrstuvwxyz'
alphabet += alphabet.upper()
numeric = '0123456789'
output = []
for item in a:
    alphabet_count = 0
    numeric_flag = False  # reset for every item
    for char in item:
        if char in alphabet:
            alphabet_count += 1
        if char in numeric:
            numeric_flag = True
    if alphabet_count == 1 and numeric_flag:
        output.append(item)
print(output)
# ['980X2350', '930x2115', '830X2115']

Is there a more efficient way than this method?

The job is to decompress a string.
For example:
if the string is 'a3b4c2', decompress it to 'aaabbbbcc'.
The code I tried is:
a = 'a3b4c2'
list1 = [i for i in a]
listNum = list(map(int,list(filter(lambda x:x.isdigit(),list1))))
listChar = list(filter(lambda x: not x.isdigit(),list1))
b = ''
for i in range(len(listNum)):
    b += listChar[i]*listNum[i]
print(b)
I think it is a pretty simple problem, but my code seems clumsy. Is there any other way to do it?
import re
b = ''.join(c * int(n) for c, n in re.findall(r'(\w)(\d+)', a))
The regex will match each letter with the following number (accommodating multi-digit numbers) and return them in groups:
>>> re.findall(r'(\w)(\d+)', a)
[('a', '3'), ('b', '4'), ('c', '2')]
Then you just need to iterate over them…
for c, n in ...
# c = 'a'
# n = '3'
# ...
…and multiply them…
c * int(n)
…and simply do that in a generator expression…
c * int(n) for c, n in re.findall(r'(\w)(\d+)', a)
…and ''.join all the resulting small strings together.
For fun, here's a version that even allows standalone letters without numbers:
a = 'a3bc4d2e'
b = ''.join(c * int(n or 1) for c, n in re.findall(r'(\w)(\d+)?', a))
# aaabccccdde
Just another way, zip + slicing:
>>> value = 'a3b4c2'
>>>
>>> "".join(x*int(y) for x, y in zip(value[0::2], value[1::2]))
'aaabbbbcc'
>>>
You can use a generator expression for a one-line solution:
text = 'a3b4c2'
result = ''.join(text[i] * int(text[i+1]) for i in range(0, len(text), 2))
Output:
>>> result
'aaabbbbcc'
The * operator repeats a string an integer number of times.
The join method concatenates the resulting substrings into the full string.
You might do it with regular expressions (the re module), using grouping and passing a function as the second argument of re.sub:
import re
a = 'a3b4c2'
def decompress(x):
    return x.group(1) * int(x.group(2))
output = re.sub(r'(\D)(\d+)', decompress, a)
print(output)  # aaabbbbcc
Explanation: I look in the string for a single non-digit (\D) followed by one or more digits (\d+). For every match, the former is put into the 1st group and the latter into the 2nd group, hence the parentheses in the pattern. Every match is then replaced by the content of the 1st group (a string) repeated as many times as the value of the 2nd group. Note that I used int to get that value, as multiplying directly would fail (you cannot multiply a string by a string).
Iterate over the string pairwise using zip, to get the char c and the count n as separate elements, and then repeat the char c n times:
>>> text = 'a3b4c2'
>>> s = iter(text)
>>> ''.join(c*int(n) for c,n in zip(s, s))
'aaabbbbcc'
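The answers above differ in one respect that is easy to miss: the pairwise and slicing versions assume every count is a single digit, while the regex versions read the whole number. A minimal sketch of the difference, with illustrative function names:

```python
import re

def decompress_regex(s):
    # one non-digit followed by its full (possibly multi-digit) count
    return ''.join(c * int(n) for c, n in re.findall(r'(\D)(\d+)', s))

def decompress_pairs(s):
    # assumes strictly alternating single characters and single digits
    return ''.join(c * int(n) for c, n in zip(s[0::2], s[1::2]))

# Both agree on the input from the question.
print(decompress_regex('a3b4c2') == decompress_pairs('a3b4c2'))

# With a two-digit count the pairing falls out of step.
print(len(decompress_regex('a12b2')))
try:
    decompress_pairs('a12b2')
except ValueError as exc:
    print('pairwise version failed:', exc)
```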

Count character repeats in Python

I'm writing a Python program and I need some way to count the number of times an X or a stretch of Xs occurs in a string. So for example if the input is aaaXXXbbbXXXcccXdddXXXXXeXf then the output should be 5, since there are 5 stretches of X in the string.
In Perl I would have done this as follows.
my $count =()= $str =~ m/X+/g;
I'm familiar with the re.search command in Python, but I'm unaware of how to count the number of results, and I'm unsure whether this is the most efficient way to approach my problem in Python.
My highest priority is readability/clarity; efficiency is secondary.
You can use itertools.groupby for this:
>>> s = "aaaXXXbbbXXXcccXdddXXXXXeXf"
>>> import itertools
>>> sum(e == 'X' for e, g in itertools.groupby(s))
5
This groups the elements in the iterable -- if no key-function is given, it just groups equal elements. Then, you just use sum to count the elements where the key is 'X'.
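To make the grouping visible, here is a small sketch that lists each run with its length before counting. The names runs and x_runs are illustrative.

```python
from itertools import groupby

s = "aaaXXXbbbXXXcccXdddXXXXXeXf"

# groupby yields one (key, group) pair per run of equal adjacent characters.
runs = [(key, len(list(group))) for key, group in groupby(s)]
print(runs)

# Counting the runs whose key is 'X' gives the number of stretches.
x_runs = sum(1 for key, _ in runs if key == 'X')
print(x_runs)
```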
Or how about regular expressions:
>>> import re
>>> len(re.findall("X+", s))
5
This should work:
prev = None
count = 0
for letter in string:
    if letter == 'X' and prev != 'X':
        count += 1
    prev = letter

How to find number of 'xx' pairs in a barcode recursively (python)

I am using Python.
I'm having trouble with this recursion problem: I am trying to find how many pairs of identical adjacent characters a string contains. For example, 'xx' would return 1, and 'xxx' would also return 1 because the pairs are not allowed to overlap. 'aabbb' would return 2.
I am completely stuck. I thought of breaking the word up into length 2 strings and recursing through the string like that, but then cases like 'aaa' would result in incorrect output.
Thanks.
Not sure why you want to do this recursively. If you wish to avoid regex, you can still just scan the string from left to right, for example using itertools.groupby:
>>> from itertools import groupby
>>> s = 'aabbb'
>>> sum(sum(1 for i in g)//2 for k,g in groupby(s))
2
>>> s = 'yyourr ssstringg'
>>> sum(sum(1 for i in g)//2 for k,g in groupby(s))
4
sum(1 for i in g) is used to find the length of the group. If the groups are not very long, you can use len(list(g)) instead.
You can use regex for that:
import re
s = 'yyourr ssstringg'
print(len(re.findall(r'(\w)\1', s)))
Output:
4
This also takes care of your "overlaps-not-allowed" problem as you can see in the above example it prints 4 and not 5.
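The non-overlapping behaviour comes from findall resuming its search after the end of each match. A short sketch on the examples from the question, with count_pairs as an illustrative name:

```python
import re

def count_pairs(s):
    # Each match consumes two characters, so the next search starts
    # after them and a character can never belong to two pairs.
    return len(re.findall(r'(\w)\1', s))

for sample in ('xx', 'xxx', 'aabbb', 'xxxx'):
    print(sample, count_pairs(sample))
```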
For a recursion approach, you can do it as:
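st = 'yyourr ssstringg'
def get_double(s):
    if len(s) < 2:
        return 0
    # stop one short of the end so that s[i+1] is always valid
    for i in range(len(s) - 1):
        if s[i] == s[i+1]:
            return 1 + get_double(s[i+2:])
    return 0  # no pair found in the rest of the string
>>> print(get_double(st))
4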
And without a for loop:
st = 'yyourr sstringg'
def get_double(s):
    if len(s) < 2:
        return 0
    elif s[0] == s[1]:
        return 1 + get_double(s[2:])
    else:
        return get_double(s[1:])
>>> print(get_double(st))
4
I would evaluate it in twos.
For example, "sskkkj" would be looked at as two sets of two-character strings:
"ss", "kk", "kj" # from index 0
"sk", "kk" # offset by 1
Look at the two sets at the same time and add only one to the count if either has a pair.

Split string at nth occurrence of a given character

Is there a Pythonic way to split a string after the nth occurrence of a given delimiter?
Given a string:
'20_231_myString_234'
It should be split into (with the delimiter being '_', after its second occurrence):
['20_231', 'myString_234']
Or is the only way to accomplish this to count, split and join?
>>> text = '20_231_myString_234'
>>> n = 2
>>> groups = text.split('_')
>>> '_'.join(groups[:n]), '_'.join(groups[n:])
('20_231', 'myString_234')
This seems like the most readable way; the alternative is a regex.
Using re to get a regex of the form ^((?:[^_]*_){n-1}[^_]*)_(.*) where n is a variable:
import re
n = 2
s = '20_231_myString_234'
m = re.match(r'^((?:[^_]*_){%d}[^_]*)_(.*)' % (n-1), s)
if m:
    print(m.groups())
or have a nice function:
import re
def nthofchar(s, c, n):
    # c is assumed to be a single character that is not a regex metacharacter
    regex = r'^((?:[^%c]*%c){%d}[^%c]*)%c(.*)' % (c, c, n-1, c, c)
    l = ()
    m = re.match(regex, s)
    if m:
        l = m.groups()
    return l
s = '20_231_myString_234'
print(nthofchar(s, '_', 2))
Or without regexes, using iterative find:
def nth_split(s, delim, n):
    p, c = -1, 0
    while c < n:
        # index raises ValueError if there are fewer than n delimiters
        p = s.index(delim, p + 1)
        c += 1
    return s[:p], s[p + 1:]
s1, s2 = nth_split('20_231_myString_234', '_', 2)
print(s1, ":", s2)
I like this solution because it works without any complicated regex and can easily be adapted to another "nth" or delimiter.
import re
string = "20_231_myString_234"
occur = 2 # the occurrence at which you want to split
indices = [x.start() for x in re.finditer("_", string)]
part1 = string[0:indices[occur-1]]
part2 = string[indices[occur-1]+1:]
print (part1, ' ', part2)
I thought I would contribute my two cents. The second parameter to split() limits the number of splits performed:
def split_at(s, delim, n):
    r = s.split(delim, n)[n]
    return s[:-len(r)-len(delim)], r
On my machine, the two good answers by @perreal, iterative find and regular expressions, actually measure 1.4 and 1.6 times slower (respectively) than this method.
It's worth noting that it can become even quicker if you don't need the initial bit. Then the code becomes:
def remove_head_parts(s, delim, n):
    return s.split(delim, n)[n]
Not so sure about the naming, I admit, but it does the job. Somewhat surprisingly, it is 2 times faster than iterative find and 3 times faster than regular expressions.
I put up my testing script online. You are welcome to review and comment.
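The original testing script is not reproduced here. As a rough stand-in, the following sketch shows one way such a comparison could be set up with timeit. The function names are illustrative, the numbers it prints depend on the machine and Python version, and it is not the author's script.

```python
import timeit

def split_join(s, delim, n):
    # split everywhere, then glue the two halves back together
    groups = s.split(delim)
    return delim.join(groups[:n]), delim.join(groups[n:])

def split_maxsplit(s, delim, n):
    # let split stop after n splits and slice the head off the original
    r = s.split(delim, n)[n]
    return s[:-len(r) - len(delim)], r

def split_find(s, delim, n):
    # walk to the nth delimiter with repeated index calls
    p = -1
    for _ in range(n):
        p = s.index(delim, p + 1)
    return s[:p], s[p + len(delim):]

sample = '20_231_myString_234'
candidates = (split_join, split_maxsplit, split_find)

# Check that all variants agree before timing them.
results = {f.__name__: f(sample, '_', 2) for f in candidates}

for f in candidates:
    seconds = timeit.timeit(lambda: f(sample, '_', 2), number=100_000)
    print(f'{f.__name__}: {seconds:.3f}s')
```

Short strings like this one mostly measure call overhead, so longer inputs may rank the approaches differently.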
>>> import re
>>> text = '20_231_myString_234'
>>> occurrence = [m.start() for m in re.finditer('_', text)]  # positions of every '_'
>>> occurrence
[2, 6, 15]
>>> result = [text[:occurrence[1]], text[occurrence[1]+1:]]  # [text[:6], text[7:]]
>>> result
['20_231', 'myString_234']
It depends on the pattern of your split. If the first two elements are always numbers, for example, you may build a regular expression and use the re module, which can split your string as well.
I had a larger string to split at every nth occurrence of a separator, and ended up with the following code:
# Split every 6 spaces
n = 6
sep = ' '
n_split_groups = []
groups = err_str.split(sep)
while len(groups):
    n_split_groups.append(sep.join(groups[:n]))
    groups = groups[n:]
print(n_split_groups)
Thanks @perreal!
In function form of @AllBlackt's solution:
def split_nth(s, sep, n):
    n_split_groups = []
    groups = s.split(sep)
    while len(groups):
        n_split_groups.append(sep.join(groups[:n]))
        groups = groups[n:]
    return n_split_groups
s = "aaaaa bbbbb ccccc ddddd eeeeeee ffffffff"
print (split_nth(s, " ", 2))
['aaaaa bbbbb', 'ccccc ddddd', 'eeeeeee ffffffff']
As @Yuval has noted in his answer, and @jamylak commented in his answer, the split and rsplit methods accept a second (optional) parameter maxsplit to avoid making splits beyond what is necessary. Thus, I find the better solution (both for readability and performance) is this:
text = '20_231_myString_234'
first_part = text.rsplit('_', 2)[0] # Gives '20_231'
second_part = text.split('_', 2)[2] # Gives 'myString_234'
This is not only simple, but also avoids the performance hit of regex solutions and of solutions that use join to undo unnecessary splits. Note that rsplit counts from the right, so the first line relies on the string containing exactly three delimiters.
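For strings where the number of delimiters after the nth one is not known, a sketch that relies only on split with maxsplit may be safer. The name split_at_nth and the choice to raise ValueError are assumptions made for this example.

```python
def split_at_nth(s, delim, n):
    # At most n splits are made, so the last piece keeps the untouched tail.
    parts = s.split(delim, n)
    if len(parts) <= n:
        raise ValueError('fewer than %d occurrences of %r' % (n, delim))
    # Rejoin only the first n short pieces to rebuild the head.
    return delim.join(parts[:n]), parts[n]

print(split_at_nth('20_231_myString_234', '_', 2))
print(split_at_nth('a_b_c_d_e', '_', 2))

# Counting from the right gives a different head for the longer string.
print('a_b_c_d_e'.rsplit('_', 2)[0])
```

The join here only touches the first n pieces, so the cost of rebuilding the head stays small even for long strings.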
