How to compress by removing duplicates in Python?

I have strings containing blocks of the same character, e.g. '1254,,,,,,,,,,,,,,,,982'. What I'm aiming to do is replace that with something along the lines of '1254(,16)982' so that the original string can be reconstructed. If anyone could point me in the right direction, that would be greatly appreciated.

You're looking for run-length encoding: here is a Python implementation based loosely on this one.
import itertools

def runlength_enc(s):
    '''Return a run-length encoded version of the string'''
    # (character, run length) for every run of identical characters
    enc = ((x, sum(1 for _ in gp)) for x, gp in itertools.groupby(s))
    # keep runs of length 1 as plain characters
    removed_1s = [((c, n) if n > 1 else c) for c, n in enc]
    # merge neighbouring plain characters into a single string
    joined = [["".join(g)] if n == 1 else list(g)
              for n, g in itertools.groupby(removed_1s, key=len)]
    return list(itertools.chain(*joined))

def runlength_decode(enc):
    # test the type, not the length: a two-character literal such as '12' also has len 2
    return "".join((c[0] * c[1] if isinstance(c, tuple) else c) for c in enc)
For your example:
print(runlength_enc("1254,,,,,,,,,,,,,,,,982"))
# ['1254', (',', 16), '982']
print(runlength_decode(runlength_enc("1254,,,,,,,,,,,,,,,,982")))
# 1254,,,,,,,,,,,,,,,,982
(Note that this will be efficient only if there are very long runs in your string).

If you don't care about the exact compressed form, you may want to look at zlib.compress and zlib.decompress. zlib is a standard Python library that can compress a single string and will probably get better compression than a self-implemented compression algorithm.
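As a rough illustration of the zlib route, here is a minimal round-trip sketch. It assumes Python 3, where zlib works on bytes, so the string is encoded first; the variable names are only for illustration.

```python
import zlib

text = "1254" + "," * 16 + "982"

# zlib operates on bytes, so encode the string before compressing
packed = zlib.compress(text.encode("utf-8"))

# decompress and decode to get the original string back
restored = zlib.decompress(packed).decode("utf-8")

print(len(text), len(packed))
print(restored == text)
```

For very short inputs the fixed overhead of the zlib format can outweigh the savings, so it is worth comparing the two lengths on your real data.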

Using regular expressions:
import re
s = '1254,,,,,,,,,,,,,,,,982'
c = re.sub(r'(.)\1+', lambda m: '(%s%d)' % (m.group(1), len(m.group(0))), s)
print(c)  # 1254(,16)982
Using itertools:
import itertools
c = ''
for ch, g in itertools.groupby(s):
    k = len(list(g))
    c += ch if k == 1 else '(%s%d)' % (ch, k)
print(c)  # 1254(,16)982
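Since the question wants the original string to be recoverable, here is a possible decoder for the '(,16)' form. This is only a sketch: decode_runs is a made-up name, and it assumes the original text never contains something that already looks like a marker (an opening parenthesis, one character, digits, a closing parenthesis), otherwise decoding is ambiguous.

```python
import re

def decode_runs(c):
    # '(,16)' -> ',' repeated 16 times; everything else is left untouched
    # Assumption: the uncompressed text contains no literal '(x<digits>)' sequences
    return re.sub(r'\((.)(\d+)\)',
                  lambda m: m.group(1) * int(m.group(2)),
                  c)

print(decode_runs('1254(,16)982'))
```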

Related

print all options of 1s and 0s in a binary sequence with missing places

I have the following sequence with missing 1s and 0s:
"01010xx101xxx001"
And I need to print all possible sequences instead of the missing places marked by "x"
How can I do it if I might also need to change the number of "x"s?
You can do the following, using itertools.product:
from itertools import product

def combs(string):
    for p in map(iter, product("01", repeat=string.count("x"))):
        yield "".join(c if c in "01" else next(p) for c in string)
        # maybe more consistent:
        # yield "".join(next(p) if c == "x" else c for c in string)
        # or shortest, with some operator trickery
        # yield "".join(c!="x" and c or next(p) for c in string)

for c in combs("01010xx101xxx001"):
    print(c)
0101000101000001
0101000101001001
0101000101010001
0101000101011001
0101000101100001
0101000101101001
# ...
Some documentation on the utils used here:
iter
map
next
str.join
str.count
itertools.product
generators
There's already a pretty good answer, and I suggest you study it.
Here's another solution that I think walks you a bit more through each step:
l = '01010xx101xxx001'
# count x's
count = len([c for c in l if c == 'x'])
# There will be 2 ^ {count} combinations, so iterate from 0 to 2 ^ {count} - 1
for i in range(2 ** count):
    # Get the binary representation of i
    b = str(bin(i)).replace('0b', '')
    # Pad with zeros so we get {count} digits ('1' becomes '00001')
    b_p = '0' * (count - len(b)) + b
    # Replace x's one at a time
    out = l
    for digit in b_p:
        out = out.replace('x', digit, 1)
    print(out)
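The same counting idea can be wrapped in a generator so it works for any number of x's. This is just a sketch under the assumption that the pattern only contains '0', '1' and 'x'; fill_missing is a hypothetical name, not something from the answers above.

```python
def fill_missing(pattern):
    count = pattern.count("x")
    for i in range(2 ** count):
        # binary digits of i, left-padded with zeros to one digit per 'x'
        bits = iter(format(i, "b").zfill(count))
        # take the next digit for every 'x', keep every other character
        yield "".join(next(bits) if c == "x" else c for c in pattern)

for seq in fill_missing("1x0x"):
    print(seq)
# 1000
# 1001
# 1100
# 1101
```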

Getting all possible combinations of a regular expression string

I have a regular expression: ATG(C|G|A)(C|T)GA
The above regular expression could take any form, with only OR (|) special characters at any position in the string and any number of letters within the brackets.
I want to match all combinations of this string in a list:
ATGCCGA
ATGCTGA
ATGGCGA
ATGGTGA
ATGACGA
ATGATGA
I am unable to find any python library that could do this.
You could take the cartesian product of the dynamic parts of the string using itertools.product then join with the other static parts of the string.
>>> from itertools import product
>>> [f'ATG{i}{j}GA' for i,j in product('CGA', 'CT')]
['ATGCCGA', 'ATGCTGA', 'ATGGCGA', 'ATGGTGA', 'ATGACGA', 'ATGATGA']
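The one-liner above hard-codes the two groups. For an arbitrary pattern of this shape, the static and bracketed parts could be pulled out with re.findall first. The following is a sketch that assumes groups are never nested and that | only appears inside brackets; expand is a name made up for this example.

```python
import re
from itertools import product

def expand(pattern):
    # Each match is either a bracketed group (first slot) or static text (second slot)
    parts = re.findall(r'\(([^()]*)\)|([^()]+)', pattern)
    # Static text becomes a single choice, a group becomes its alternatives
    choices = [[static] if static else group.split('|')
               for group, static in parts]
    return [''.join(combo) for combo in product(*choices)]

print(expand('ATG(C|G|A)(C|T)GA'))
# ['ATGCCGA', 'ATGCTGA', 'ATGGCGA', 'ATGGTGA', 'ATGACGA', 'ATGATGA']
```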
You can use recursion:
import collections
s = 'ATG(C|G|A)(C|T)GA'
def combos(d):
    r, k = [], None
    while d:
        if (c:=d.popleft()) not in '|()':
            k = (k if k else '')+c
        elif c == '|':
            if k:
                r.append(k)
            k = None
        elif c == '(':
            r = [v+(k or '')+i for i in combos(d) for v in (r if r else [''])]
            k = None
        else:
            if k:
                r.append(k)
            k = None
            break
    yield from ([i+(k or '') for i in r] if r else [k])
print(list(combos(collections.deque(list(s)))))
Output:
['ATGCCGA', 'ATGGCGA', 'ATGACGA', 'ATGCTGA', 'ATGGTGA', 'ATGATGA']

Remove N consecutive repeated characters in a string

I am trying to solve a problem where the user inputs a string, say str = "aaabbcc", and an integer n = 2.
The function is supposed to remove the characters that appear 'n' times in a row from the str and output only "aaa".
I tried a couple of approaches and I'm not able to obtain the right output.
Are there any regular expression functions that I could use, or any recursive functions, or just plain old iteration?
Thanks in advance.
Using itertools.groupby
Ex:
from itertools import groupby
s = "aaabbcc"
n = 2
result = ""
for k, v in groupby(s):
    value = list(v)
    if len(value) != n:
        result += "".join(value)
print(result)
Output:
aaa
You can use itertools.groupby:
>>> s = "aaabbccddddddddddeeeee"
>>> from itertools import groupby
>>> n = 3
>>> groups = (list(values) for _, values in groupby(s))
>>> "".join("".join(v) for v in groups if len(v) < n)
'bbcc'
from collections import Counter
counts = Counter(string)
string = "".join(c for c in string if counts[c] != 2)
Edit: Wait, sorry, I missed "consecutive". This will remove characters that occur exactly two times in the whole string (fitting your example, but not the general case).
A consecutive filter is a bit more complex, but doable: just find the consecutive runs first, then filter out the ones which have length two.
runs = [[string[0], 0]]
for c in string:
    if c == runs[-1][0]:
        runs[-1][1] += 1
    else:
        runs.append([c, 1])
string = "".join(c*length for c,length in runs if length != 2)
Edit2: As the other answers correctly point out, the first part of this is done natively by groupby. It yields each character together with an iterator over its run, so the run still has to be counted:
from itertools import groupby
runs = ((c, len(list(group))) for c, group in groupby(string))
string = "".join(c*length for c, length in runs if length != 2)
In [15]: some_string = 'aaabbcc'
In [16]: n = 2
In [17]: final_string = ''
In [18]: for k, v in Counter(some_string).items():
    ...:     if v != n:
    ...:         final_string += k * v
    ...:
In [19]: final_string
Out[19]: 'aaa'
You'll need: from collections import Counter
from collections import defaultdict
def fun(string, n):
    dic = defaultdict(int)
    for i in string:
        dic[i] += 1
    check = []
    for i in dic:
        if dic[i] == n:
            check.append(i)
    for i in check:
        del dic[i]
    return dic
string = "aaabbcc"
n = 2
result = fun(string, n)
sol = ''
for i in result:
    sol += i*result[i]
print(sol)
Output:
aaa
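The question also asks whether a regular expression could do this. One possible sketch, assuming "remove" means dropping every run whose length is exactly n (remove_runs is a hypothetical helper name):

```python
import re

def remove_runs(s, n):
    # (.)\1* greedily matches one maximal run of a repeated character;
    # re.DOTALL lets '.' match newlines as well
    return re.sub(r'(.)\1*',
                  lambda m: '' if len(m.group(0)) == n else m.group(0),
                  s,
                  flags=re.DOTALL)

print(remove_runs("aaabbcc", 2))  # aaa
```

Unlike the Counter-based versions, this looks at consecutive runs only, so "aabaa" with n = 2 becomes "b".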

Printing alphabets advanced by n in Python

How can I write a Python program that takes some letters as input and prints each letter advanced by n (letter + n)? Example:
my_string = 'abc'
expected_output = 'cde' # n=2
One way I've thought is by using str.maketrans, and mapping the original input to (alphabets + n). Is there any other way?
PS: xyz should translate to abc
I've tried to write my own code as well for this, (apart from the infinitely better answers mentioned):
number = 2
prim = """abc! fgdf """
final = prim.lower()
for x in final:
    if(x =="y"):
        print("a", end="")
    elif(x=="z"):
        print("b", end="")
    else:
        conv = ord(x)
        x = conv+number
        print(chr(x),end="")
Any comments on how to not convert special chars? thanks
If you don't care about wrapping around, you can just do:
def shiftString(string, number):
    return "".join(map(lambda x: chr(ord(x)+number), string))
If you do want to wrap around (think Caesar cipher), you'll need to specify where the alphabet starts and how many symbols it has:
def shiftString(string, number, start=97, num_of_symbols=26):
    # only shift characters inside the alphabet; the upper bound is exclusive
    return "".join(map(lambda x: chr(((ord(x)+number-start) %
                                      num_of_symbols)+start)
                       if start <= ord(x) < start+num_of_symbols
                       else x, string))
That would, e.g., convert abcxyz, when given a shift of 2, into cdezab.
If you actually want to use it for "encryption", make sure to exclude non-alphabetic characters (like spaces etc.) from it.
edit: Shameless plug of my Vigenère tool in Python
edit2: Now only converts in its range.
How about something like
>>> my_string = "abc"
>>> n = 2
>>> "".join([ chr(ord(i) + n) for i in my_string])
'cde'
Note: As mentioned in the comments, the question is a bit vague about what to do when edge cases like xyz are encountered.
Edit: To take care of edge cases, you can write something like
>>> from string import ascii_lowercase
>>> lower = ascii_lowercase
>>> input = "xyz"
>>> "".join([ lower[(lower.index(i)+2)%26] for i in input ])
'zab'
>>> input = "abc"
>>> "".join([ lower[(lower.index(i)+2)%26] for i in input ])
'cde'
I've made the following change to the code:
number = 2
prim = """Special() ops() chars!!"""
final = prim.lower()
for x in final:
    if(x =="y"):
        print("a", end="")
    elif(x=="z"):
        print("b", end="")
    # 97..122 are 'a'..'z'; range() excludes its upper bound
    elif (ord(x) in range(97, 123)):
        conv = ord(x)
        x = conv+number
        print(chr(x),end="")
    else:
        print(x, end="")
Output: urgekcn() qru() ejctu!!
test_data = (('abz', 2), ('abc', 3), ('aek', 26), ('abcd', 25))

# translate every character
def shiftstr(s, k):
    if not (isinstance(s, str) and isinstance(k, int) and k >= 0):
        return s
    a = ord('a')
    return ''.join([chr(a+((ord(c)-a+k)%26)) for c in s])

for s, k in test_data:
    print(shiftstr(s, k))
print('----')

# translate at most 26 characters, rest look up dictionary at O(1)
def shiftstr(s, k):
    if not (isinstance(s, str) and isinstance(k, int) and k >= 0):
        return s
    a = ord('a')
    d = {}
    l = []
    for c in s:
        v = d.get(c)
        if v is None:
            v = chr(a+((ord(c)-a+k)%26))
            d[c] = v
        l.append(v)
    return ''.join(l)

for s, k in test_data:
    print(shiftstr(s, k))
Testing shiftstr_test.py (above code):
$ python3 shiftstr_test.py
cdb
def
aek
zabc
----
cdb
def
aek
zabc
It covers wrapping.
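The question mentions str.maketrans; for completeness, here is a minimal sketch of that approach. It assumes only lowercase ASCII letters should be shifted and everything else left alone; shift_letters is a name chosen for this example.

```python
from string import ascii_lowercase

def shift_letters(text, n):
    n %= 26  # wrap shifts larger than the alphabet
    # the alphabet rotated left by n positions
    shifted = ascii_lowercase[n:] + ascii_lowercase[:n]
    # map every lowercase letter to its shifted counterpart
    table = str.maketrans(ascii_lowercase, shifted)
    # characters not in the table (spaces, punctuation) pass through unchanged
    return text.translate(table)

print(shift_letters("abc", 2))         # cde
print(shift_letters("xyz", 2))         # zab
print(shift_letters("abc! fgdf ", 2))  # cde! hifh
```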

Compare strings in list in Python and output character until they are identical

How can I compare all strings in a list e.g:
"A-B-C-D-E-F-H-A",
"A-B-C-F-G-H-M-P",
And output until which character they are identical:
In the example above it would be:
Character 6
And output the most similar strings.
I tried with collections.Counter but that did not work.
You're trying to go character by character in the two strings in lockstep. This is a job for zip:
A = "A-B-C-D-E-F-H-A"
B = "A-B-C-F-G-H-M-P"
count = 0
for a, b in zip(A, B):
    if a == b:
        count += 1
    else:
        break
Or, if you prefer, "…as long as they are…" is a job for takewhile:
from itertools import takewhile
from operator import eq
def ilen(iterable): return sum(1 for _ in iterable)
count = ilen(takewhile(lambda ab: eq(*ab), zip(A, B)))
If you have a list of these strings, and you want to compare every string to every other string:
First, you turn the above code into a function. I'll do it with the itertools version, but you can do it with the other just as easily:
def shared_prefix(A, B):
    return ilen(takewhile(lambda ab: eq(*ab), zip(A, B)))
Now, for every string, you compare it to all the rest of the strings. There's an easy way to do it with combinations:
from itertools import combinations
counts = [shared_prefix(*pair) for pair in combinations(list_o_strings, 2)]
But if you don't understand that, you can write it as a nested loop. The only tricky part is what "the rest of the strings" means. You can't loop over all the strings in both the outer and inner loops, or you'll compare each pair of strings twice (once in each order), and compare each string to itself. So it has to mean "all the strings after the current one". Like this:
counts = []
for i, s1 in enumerate(list_o_strings):
    for s2 in list_o_strings[i+1:]:
        counts.append(shared_prefix(s1, s2))
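To also report the most similar strings, one option is to take the pair with the longest shared prefix. The sketch below is self-contained and assumes "most similar" means exactly that; most_similar and the sample list are made up for illustration. os.path.commonprefix is a standard-library helper that compares character by character and gives the prefix shared by all strings at once.

```python
import os.path
from itertools import combinations, takewhile

def shared_prefix(a, b):
    # number of leading positions where both strings agree
    return sum(1 for _ in takewhile(lambda ab: ab[0] == ab[1], zip(a, b)))

def most_similar(strings):
    # the pair whose shared prefix is longest (first such pair on ties)
    return max(combinations(strings, 2), key=lambda pair: shared_prefix(*pair))

strings = ["A-B-C-D-E-F-H-A", "A-B-C-F-G-H-M-P", "A-B-X"]
print(most_similar(strings))
print(shared_prefix(*most_similar(strings)))  # 6
print(os.path.commonprefix(strings))          # A-B-
```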
I think this code will solve your problem.
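listA = "A-B-C-D-E-F-H-A"
listB = "A-B-C-F-G-H-M-P"
newListA = listA.replace("-", "")
newListB = listB.replace("-", "")
# newListA = "ABCDEFHA"
# newListB = "ABCFGHMP"
i = 0
done = False
# stop at the first mismatch or at the end of the shorter string
while i < min(len(newListA), len(newListB)) and not done:
    if newListA[i] != newListB[i]:
        done = True
    i = i + 1
# i is the 1-based position of the first differing letter (dashes removed),
# or the length of the shorter string if no letter differs
print("Character: " + str(i))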
