is there a more efficient way than this method? - python

The job is to decompress a string.
For example:
if a string is 'a3b4c2', then decompress it as 'aaabbbbcc'.
The code I tried previously is:
a = 'a3b4c2'
list1 = [i for i in a]
listNum = list(map(int,list(filter(lambda x:x.isdigit(),list1))))
listChar = list(filter(lambda x: not x.isdigit(),list1))
b = ''
for i in range(len(listNum)):
    b += listChar[i]*listNum[i]
print(b)
I think it is a pretty simple problem, but my code seems clumsy. Is there another way to do it?

import re
b = ''.join(c * int(n) for c, n in re.findall(r'(\w)(\d+)', a))
The regex will match each letter with the following number (accommodating multi-digit numbers) and return them in groups:
>>> re.findall(r'(\w)(\d+)', a)
[('a', '3'), ('b', '4'), ('c', '2')]
Then you just need to iterate over them…
for c, n in ...
# c = 'a'
# n = '3'
# ...
…and multiply them…
c * int(n)
…and simply do that in a generator expression…
c * int(n) for c, n in re.findall(r'(\w)(\d+)', a)
…and ''.join all the resulting small strings together.
For fun, here's a version that even allows standalone letters without numbers:
a = 'a3bc4d2e'
b = ''.join(c * int(n or 1) for c, n in re.findall(r'(\w)(\d+)?', a))
# aaabccccdde

Just another way, using zip + slicing (this assumes every count is a single digit):
>>> value = 'a3b4c2'
>>> "".join(x*int(y) for x, y in zip(value[0::2], value[1::2]))
'aaabbbbcc'
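To make the single-digit limitation of the slicing approach concrete, here is a minimal sketch. The helper names expand_by_slicing and expand_by_regex are made up for this illustration; the regex variant is the pattern-based idea from the other answers.

```python
import re

def expand_by_slicing(value):
    # Pairs each character at an even index with the character after it,
    # so it only works when every count is exactly one digit long.
    return "".join(x * int(y) for x, y in zip(value[0::2], value[1::2]))

def expand_by_regex(value):
    # \D matches one non-digit, \d+ matches a count of any length.
    return "".join(c * int(n) for c, n in re.findall(r"(\D)(\d+)", value))

print(expand_by_slicing("a3b4c2"))  # aaabbbbcc
print(expand_by_regex("a3b4c2"))    # aaabbbbcc

try:
    expand_by_slicing("a12b1")
except ValueError as exc:
    # With a two-digit count the pairs get misaligned and int('b') fails.
    print("slicing fails on multi-digit counts:", exc)

print(expand_by_regex("a12b1"))     # twelve 'a's followed by one 'b'
```

So slicing is fine for inputs like the one in the question, but a pattern-based approach is the safer choice if counts can exceed 9.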

You can use a generator expression for a one-line solution (this assumes every count is a single digit):
text = 'a3b4c2'
result = ''.join(text[i] * int(text[i+1]) for i in range(0, len(text), 2))
Output:
>>> result
'aaabbbbcc'
The * operator repeats a string a given number of times.
The join method then concatenates the resulting substrings into the full string.

You could do it with regular expressions (the re module), using groups and passing a function as the second argument to re.sub, as follows:
import re
a = 'a3b4c2'
def decompress(x):
    return x.group(1)*int(x.group(2))
output = re.sub(r'(\D)(\d+)', decompress, a)
print(output) # aaabbbbcc
Explanation: I look in the string for a single non-digit (\D) followed by one or more digits (\d+). For every match, the former is put into the 1st group and the latter into the 2nd group, hence the parentheses in the pattern. Every match is then replaced by the content of the 1st group (a string) repeated as many times as the value of the 2nd group. Note that I used int to get that value, because multiplying directly would fail (you cannot multiply a string by a string).

Iterate over the string pairwise using zip to get the character c and the count n as separate elements, then repeat c n times (single-digit counts only):
>>> text = 'a3b4c2'
>>> s = iter(text)
>>> ''.join(c*int(n) for c,n in zip(s, s))
'aaabbbbcc'
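In case the zip(s, s) trick looks like magic, here is a small sketch of what it produces. The helper name pairs is just for illustration.

```python
def pairs(text):
    # Both arguments to zip are the *same* iterator, so building each
    # tuple consumes two consecutive characters from it.
    it = iter(text)
    return list(zip(it, it))

print(pairs("a3b4c2"))  # [('a', '3'), ('b', '4'), ('c', '2')]
print(pairs("a3b"))     # [('a', '3')]  (a trailing unpaired character is dropped)
```

Note that an odd-length input is silently truncated rather than reported as an error.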

Related

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all sequences which contain at least one repetition of each letter.
For example, from the sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iterating over each letter).
I have no clue how to do that in Python 2.7. I tried regular expressions, but they did not find every variant.
How can I achieve that?
You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'
def get_shortest(s):
    l, b = len(s), set('ATCG')
    options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
    return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]
print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. Maybe not as fast and nice as chrisz's answer, but maybe a little simpler to read and understand for beginners. Note that each result is stored as a list of characters.
DNA='AAGTCCTAG'
toSave=[]
for i in range(len(DNA)):
    letters=['A','G','T','C']
    j=i
    seq=[]
    while len(letters)>0 and j<(len(DNA)):
        seq.append(DNA[j])
        try:
            letters.remove(DNA[j])
        except ValueError:
            # This letter was already seen in the current window
            pass
        j+=1
    if len(letters)==0:
        toSave.append(seq)
print(toSave)
Since the substring you are looking for may be of almost any length, a FIFO queue (a sliding window) seems to work. Append one letter at a time and check whether there is at least one of each letter. If so, yield the window. Then remove letters from the front and keep checking until the window is no longer valid.
def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count,chars)):
            yield("".join(cur_str))
            cur_str.pop(0)
seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
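For long sequences, list.pop(0) and repeated count calls get expensive. Here is a sketch of the same sliding-window idea using collections.deque and collections.Counter instead; the function name find_windows and the alphabet parameter are my own additions, not part of the answer above.

```python
from collections import Counter, deque

def find_windows(seq, alphabet="ACGT"):
    window = deque()    # current candidate substring
    counts = Counter()  # how often each letter occurs in the window
    for ch in seq:
        window.append(ch)
        counts[ch] += 1
        # While every required letter is present, report the window
        # and shrink it from the left.
        while all(counts[letter] > 0 for letter in alphabet):
            yield "".join(window)
            counts[window.popleft()] -= 1

print(list(find_windows("AAGTCCTAG")))
# ['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
```

deque.popleft() runs in constant time, and the counter avoids rescanning the window on every check.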
I really wanted to create a short answer for this, so this is what I came up with!
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        print(x)
        s,c = s[1:],len(d)
It works as follows:
c is set to the length of the string of characters that must all be present (d = ACGT).
The while loop examines ever-longer prefixes of s for as long as c does not exceed the length of s.
It does so by increasing c by 1 on each iteration.
If every character of d (ACGT) exists in the prefix, we print it, reset c to its initial value and drop the first character of s.
The loop continues until s is shorter than d.
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead:
s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        r.append(x)
        s,c = s[1:],len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to flag the items that contain a run of at least n_repeats identical consecutive characters.
from itertools import groupby
import numpy as np
def find_repeats(input_list, n_repeats):
    flagged_items = []
    for item in input_list:
        # Create itertools.groupby object
        groups = groupby(str(item))
        # Create list of tuples: (character, number of repeats)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        # Extract just number of repeats
        char_lens = np.array([x[1] for x in result])
        # Append to flagged items
        if any(char_lens >= n_repeats):
            flagged_items.append(item)
    # Return flagged items
    return flagged_items
#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2) # Returns ['aatcg', 'ctagg']

Swapping two characters in a string and store the generated strings in a list in Python

I want to swap every pair of adjacent characters in a string, one pair at a time, and store the results in a list (to check later whether each string exists in the dictionary).
I have seen some code that swaps all the characters at once, but that is not what I'm looking for.
For example:
var = 'abcde'
Expected output:
['bacde','acbde','abdce','abced']
How can I do this in Python?
You can use the list comprehension below to achieve this:
>>> var = 'abcde'
# var[i:i+2][::-1] reverses the two-character substring
>>> [var[:i]+var[i:i+2][::-1]+var[i+2:] for i in range(len(var)-1)]
['bacde', 'acbde', 'abdce', 'abced']
Assuming the final entry from your expected output list is a typo, and that it should be 'abced' to keep the pattern going, then here is one way (unsure yet if it generalizes correctly based on your use case):
In [5]: x
Out[5]: 'abcde'
In [6]: [x[:i] + x[i+1] + x[i] + x[i+2:] for i in range(len(x)-1)]
Out[6]: ['bacde', 'acbde', 'abdce', 'abced']
A generator function will not use too much memory for longer strings:
def swap_pairs(s):
    for i in range(len(s) - 1):
        yield s[:i] + s[i + 1] + s[i] + s[i + 2:]
>>> swap_pairs('abcde')
<generator object swap_pairs at 0x1034d0f68>
>>> list(swap_pairs('abcde'))
['bacde', 'acbde', 'abdce', 'abced']
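Since the goal in the question is to check the swapped strings against a dictionary, the generator can feed a membership test directly. This is only a sketch: known_words is a tiny made-up word set standing in for a real dictionary, and swapped_words is a hypothetical helper name.

```python
def swap_pairs(s):
    # Yield s with each pair of adjacent characters swapped, one pair at a time.
    for i in range(len(s) - 1):
        yield s[:i] + s[i + 1] + s[i] + s[i + 2:]

# Hypothetical word list; in practice this would be loaded from a file.
known_words = {"from", "form", "trial", "trail"}

def swapped_words(word, vocabulary):
    # Keep only the swapped variants that exist in the vocabulary.
    return [candidate for candidate in swap_pairs(word) if candidate in vocabulary]

print(swapped_words("form", known_words))   # ['from']
print(swapped_words("trial", known_words))  # ['trail']
```

Using a set for the vocabulary keeps each lookup at roughly constant time, so the whole check is linear in the number of swapped variants.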
Here is a re approach:
import re
x = 'abcde'
[re.sub(f'(.)(.)(?=.{{{i}}}$)', "\\2\\1", x) for i in reversed(range(len(x)-1))]
# ['bacde', 'acbde', 'abdce', 'abced']
And a variant that skips double characters:
x = 'abbde'
[s for s, i in (re.subn(f'(.)(?!\\1)(.)(?=.{{{i}}}$)', "\\2\\1", x) for i in reversed(range(len(x)-1))) if i]
# ['babde', 'abdbe', 'abbed']

Regular expression to match any character repeated exactly twice

I am trying to identify whether a supplied string has characters repeated exactly twice. The following is the regular expression that I am using:
([a-z])\1(?!\1)
However, when tested against the following strings, both match the pattern (even though I have used (?!\1)):
>>> re.findall(r'.*([a-z])\1(?!\1)', 'abcdeefg')
['e']
>>> re.findall(r'.*([a-z])\1(?!\1)', 'abcdeeefg')
['e']
What is wrong in the above pattern?
I suspect that a Python regular expression alone will not meet your needs. Ensuring that a character is repeated exactly twice requires a negative lookbehind assertion, and such assertions cannot contain group references.
The easiest approach is to instead look for all repetitions and simply check their length.
import itertools
import re
def double_repeats(txt):
    # find possible repeats
    candidates = set(re.findall(r'([a-z])\1', txt))
    # now find the actual repeated blocks
    repeats = itertools.chain(*[re.findall(r"({0}{0}+)".format(c), txt) for c in candidates])
    # return just the blocks of length 2
    return [x for x in repeats if len(x) == 2]
Then:
>>> double_repeats("abbbcbbdddeffgggg")
['ff', 'bb']
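If a regex is not a requirement, the "find all runs and check their length" idea can also be sketched with itertools.groupby. This is a different technique from the regex-based function above, and the name runs_of_length is made up for illustration; unlike the [a-z] pattern, it treats every character the same way.

```python
from itertools import groupby

def runs_of_length(txt, length=2):
    # groupby yields one group per run of identical consecutive characters.
    runs = ("".join(group) for _, group in groupby(txt))
    return [run for run in runs if len(run) == length]

print(runs_of_length("abbbcbbdddeffgggg"))  # ['bb', 'ff']
print(runs_of_length("abcdeeefg"))          # []
```

The results come back in the order they appear in the string, rather than in set order.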
You could use a regex alternate operator trick.
>>> def guess(s):
...     out = re.findall(r'([a-z])\1{2,}|([a-z])\2', s)
...     # True if any match came from the second alternative (exactly two in a row)
...     return any(pair for _, pair in out)
...
>>> k = ['abcdeefg', 'abcdeeefg']
>>> [guess(i) for i in k]
[True, False]
([a-z])\1{2,} matches any run of three or more identical characters.
| OR
([a-z])\2 matches a character repeated exactly twice in the remaining string, since every longer run of identical characters has already been consumed by the first alternative.
or
>>> def guess(s):
...     out = re.findall(r'([a-z])\1{2,}|([a-z])\2', s)
...     if out and out[0][1]:
...         return out[0][1]
...     return False
...
>>> k = '23413e4abcee'
>>> k.count(guess(k)) == 2
False
>>> k = '234134abcee'
>>> k.count(guess(k)) == 2
True
If you want to get output like the other answers, then here you go:
>>> def guess(s):
...     out = re.findall(r'([a-z])\1{2,}|([a-z])\2', s)
...     if out:
...         return [y+y for x,y in out if y]
...     return []
...
>>> guess("abbbcbbdddeffgggg")
['bb', 'ff']
I find it the best way to do that.
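To see why the functions above test the second element of each tuple, it helps to look at what re.findall returns for a pattern with two groups. A small sketch using the two strings from the question:

```python
import re

pattern = re.compile(r'([a-z])\1{2,}|([a-z])\2')

# Each result is a (group 1, group 2) tuple; the group that did not
# take part in the match comes back as an empty string.
print(pattern.findall("abcdeefg"))   # [('', 'e')]
print(pattern.findall("abcdeeefg"))  # [('e', '')]
```

A non-empty second element therefore means the match was a run of exactly two characters.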

Matching Strings in Python?

Using Python, how can I check whether 3 consecutive chars within a string (A) are also contained in another string (B)? Is there any built-in function in Python?
EXAMPLE:
A = FatRadio
B = fradio
Assuming that I have defined a threshold of 3, the python script should return true as there are three consecutive characters in B which are also included in A (note that this is the case for 4 and 5 consecutive characters as well).
How about this?
char_count = 3 # Or whatever you want
found = False
if len(A) >= char_count and len(B) >= char_count:
    for i in range(0, len(A) - char_count + 1):
        some_chars = A[i:i+char_count]
        if some_chars in B:
            found = True # Hurray!
            break
You can use the difflib module:
import difflib
def have_common_triplet(a, b):
    matcher = difflib.SequenceMatcher(None, a, b)
    return max(size for _,_,size in matcher.get_matching_blocks()) >= 3
Result:
>>> have_common_triplet("FatRadio", "fradio")
True
Note however that SequenceMatcher does much more than find the first common triplet, so it could take significantly more time than a naive approach. A simpler solution could be:
def have_common_group(a, b, size=3):
    # Every valid start index of a window of length `size`
    first_indices = range(len(a) - size + 1)
    second_indices = range(len(b) - size + 1)
    seqs = {b[i:i+size] for i in second_indices}
    return any(a[i:i+size] in seqs for i in first_indices)
Which should perform better, especially when the match is at the beginning of the string.
I don't know of any built-in function for this, so I guess the simplest implementation would be something like this:
a = 'abcdefgh'
b = 'foofoofooabcfoo'
for i in range(len(a) - 2):
    if a[i:i+3] in b:
        print('then true!')
Which could be shortened to:
search_results = [i for i in range(len(a) - 2) if a[i:i+3] in b]
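If you also want to know which runs are shared, not just whether one exists, a set intersection over all windows does it. This is a minimal sketch; common_runs is a made-up name, and the comparison is case-sensitive unless you lower-case the inputs first.

```python
def common_runs(a, b, size=3):
    # All windows of `size` consecutive characters in each string.
    windows_a = {a[i:i + size] for i in range(len(a) - size + 1)}
    windows_b = {b[i:i + size] for i in range(len(b) - size + 1)}
    return sorted(windows_a & windows_b)

print(common_runs("FatRadio", "fradio"))                  # ['adi', 'dio']
print(common_runs("FatRadio".lower(), "fradio", size=5))  # ['radio']
```

An empty list means the threshold is not met, so bool(common_runs(A, B)) gives the True/False answer asked for in the question.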

Split string at nth occurrence of a given character

Is there a Pythonic way to split a string after the nth occurrence of a given delimiter?
Given a string:
'20_231_myString_234'
It should be split into (with the delimiter being '_', after its second occurrence):
['20_231', 'myString_234']
Or is the only way to accomplish this to count, split and join?
>>> text = '20_231_myString_234'
>>> n = 2
>>> groups = text.split('_')
>>> '_'.join(groups[:n]), '_'.join(groups[n:])
('20_231', 'myString_234')
This seems like the most readable way; the alternative is a regex.
Using re to get a regex of the form ^((?:[^_]*_){n-1}[^_]*)_(.*) where n is a variable:
import re
n = 2
s = '20_231_myString_234'
m = re.match(r'^((?:[^_]*_){%d}[^_]*)_(.*)' % (n-1), s)
if m:
    print(m.groups())
or have a nice function:
import re
def nthofchar(s, c, n):
    # c must be a single character that is not special inside a regex character class
    regex = r'^((?:[^%c]*%c){%d}[^%c]*)%c(.*)' % (c, c, n-1, c, c)
    l = ()
    m = re.match(regex, s)
    if m:
        l = m.groups()
    return l
s = '20_231_myString_234'
print(nthofchar(s, '_', 2))
Or without regexes, using iterative find:
def nth_split(s, delim, n):
    p, c = -1, 0
    while c < n:
        p = s.index(delim, p + 1)
        c += 1
    return s[:p], s[p + 1:]
s1, s2 = nth_split('20_231_myString_234', '_', 2)
print(s1, ":", s2)
I like this solution because it works without any actual regex pattern (only the literal delimiter is passed to re.finditer) and can easily be adapted to another "nth" or delimiter.
import re
string = "20_231_myString_234"
occur = 2 # the occurrence at which you want to split
indices = [x.start() for x in re.finditer("_", string)]
part1 = string[0:indices[occur-1]]
part2 = string[indices[occur-1]+1:]
print(part1, ' ', part2)
I thought I would contribute my two cents. The second parameter to split() limits the number of splits that are performed:
def split_at(s, delim, n):
    r = s.split(delim, n)[n]
    return s[:-len(r)-len(delim)], r
On my machine, the two good answers by @perreal, iterative find and regular expressions, actually measure 1.4 and 1.6 times slower (respectively) than this method.
It's worth noting that it can become even quicker if you don't need the initial bit. Then the code becomes:
def remove_head_parts(s, delim, n):
    return s.split(delim, n)[n]
Not so sure about the naming, I admit, but it does the job. Somewhat surprisingly, it is 2 times faster than iterative find and 3 times faster than regular expressions.
I put up my testing script online. You are welcome to review and comment.
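A comparison along these lines can be sketched with timeit. The helpers nth_split and regex_split below are my own simplified stand-ins for the iterative-find and regex answers, the repetition count is arbitrary, and the timings will differ by machine and Python version, so no numbers are shown.

```python
import re
import timeit

def split_at(s, delim, n):
    # split with maxsplit=n; the last element is the untouched remainder
    r = s.split(delim, n)[n]
    return s[:-len(r) - len(delim)], r

def nth_split(s, delim, n):
    # iterative find: walk to the n-th occurrence of delim
    p = -1
    for _ in range(n):
        p = s.index(delim, p + 1)
    return s[:p], s[p + len(delim):]

# Pattern hard-coded for '_' and n = 2 to keep the sketch short.
SECOND_UNDERSCORE = re.compile(r'((?:[^_]*_){1}[^_]*)_(.*)')

def regex_split(s):
    m = SECOND_UNDERSCORE.match(s)
    return m.groups() if m else ()

sample = '20_231_myString_234'
timings = {
    'split_at': timeit.timeit(lambda: split_at(sample, '_', 2), number=20000),
    'nth_split': timeit.timeit(lambda: nth_split(sample, '_', 2), number=20000),
    'regex_split': timeit.timeit(lambda: regex_split(sample), number=20000),
}
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f'{name}: {seconds:.4f} s')
```

For short strings like this one, the differences are small in absolute terms, so readability may matter more than the ranking.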
>>> import re
>>> text = '20_231_myString_234'
>>> occurrences = [m.start() for m in re.finditer('_', text)]  # positions of each '_'
>>> occurrences
[2, 6, 15]
>>> result = [text[:occurrences[1]], text[occurrences[1]+1:]]  # [text[:6], text[7:]]
>>> result
['20_231', 'myString_234']
It depends on the pattern of your split. If the first two elements are always numbers, for example, you could build a regular expression and use the re module, which can split your string as well.
I had a larger string to split at every nth occurrence of the separator and ended up with the following code:
# Split at every 6th space (err_str is the long input string)
n = 6
sep = ' '
n_split_groups = []
groups = err_str.split(sep)
while len(groups):
    n_split_groups.append(sep.join(groups[:n]))
    groups = groups[n:]
print(n_split_groups)
Thanks @perreal!
In function form of @AllBlackt's solution:
def split_nth(s, sep, n):
    n_split_groups = []
    groups = s.split(sep)
    while len(groups):
        n_split_groups.append(sep.join(groups[:n]))
        groups = groups[n:]
    return n_split_groups
s = "aaaaa bbbbb ccccc ddddd eeeeeee ffffffff"
print(split_nth(s, " ", 2))
['aaaaa bbbbb', 'ccccc ddddd', 'eeeeeee ffffffff']
As @Yuval has noted in his answer, and @jamylak commented in his answer, the split and rsplit methods accept a second (optional) parameter maxsplit to avoid making splits beyond what is necessary. Thus, I find the better solution (both for readability and performance) is this:
text = '20_231_myString_234'
first_part = text.rsplit('_', 2)[0] # Gives '20_231' (only because this string has exactly three delimiters)
second_part = text.split('_', 2)[2] # Gives 'myString_234'
This is not only simple, but also avoids performance hits of regex solutions and other solutions using join to undo unnecessary splits.
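One caveat: the rsplit call counts delimiters from the right, so it only yields the first two fields when the string has exactly three delimiters. Here is a sketch of a maxsplit-based variant that works for any number of delimiters; split_at_nth is a made-up name, and it does use one small join over the first n parts.

```python
def split_at_nth(text, delim, n):
    # maxsplit=n stops after n splits, so parts[n] is the untouched remainder.
    parts = text.split(delim, n)
    if len(parts) <= n:
        raise ValueError("delimiter occurs fewer than n times")
    return delim.join(parts[:n]), parts[n]

print(split_at_nth('20_231_myString_234', '_', 2))  # ('20_231', 'myString_234')
print(split_at_nth('a_b_c_d_e', '_', 2))            # ('a_b', 'c_d_e')
print('a_b_c_d_e'.rsplit('_', 2)[0])                # a_b_c  (not the first two fields)
```

The join only touches the n leading pieces, so the remainder of the string is never split and re-joined.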
