How to find contiguous substrings from a string in Python

I have a string abccddde.
I need to find substrings like:
a, b, c, cc, d, dd, ddd, e
Substrings such as ab or cd are not valid.
I tried finding all the substrings of the string, but it's not efficient:
def get_all_substrings(input_string):
    length = len(input_string)
    return [input_string[i:j+1] for i in range(length) for j in range(i,length)]
This is outputting:
['a', 'ab', 'abc', 'abcc', 'abccd', 'abccdd', 'abccddd', 'abccddde', 'b', 'bc', 'bcc', 'bccd', 'bccdd', 'bccddd', 'bccddde', 'c', 'cc', 'ccd', 'ccdd', 'ccddd', 'ccddde', 'c', 'cd', 'cdd', 'cddd', 'cddde', 'd', 'dd', 'ddd', 'ddde', 'd', 'dd', 'dde', 'd', 'de', 'e']
This is the method I followed to find the substrings, but it returns every possible substring, which is what makes it inefficient.
Please help!

You can use itertools.groupby() for this:
from itertools import groupby
s = 'abccdddcce'
l1 = ["".join(g) for k, g in groupby(s)]
l2 = [a[:i+1] for a in l1 for i in range(len(a))]
print(l2)
Output:
['a', 'b', 'c', 'cc', 'd', 'dd', 'ddd', 'c', 'cc', 'e']
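For larger input data, replace the lists with generator expressions (parentheses instead of square brackets):
l1 = ("".join(g) for k, g in groupby(s))
l2 = (a[:i+1] for a in l1 for i in range(len(a)))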
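A minimal sketch of the generator variant mentioned above, assuming Python 3. The function name iter_run_prefixes is made up for illustration; it yields each prefix lazily instead of building intermediate lists.

```python
from itertools import groupby

def iter_run_prefixes(s):
    """Lazily yield every prefix of each run of identical characters."""
    for char, group in groupby(s):
        # Count the run without joining it into a string first.
        run_length = sum(1 for _ in group)
        for n in range(1, run_length + 1):
            yield char * n

print(list(iter_run_prefixes('abccdddcce')))
# ['a', 'b', 'c', 'cc', 'd', 'dd', 'ddd', 'c', 'cc', 'e']
```

Nothing is computed until the generator is consumed, so it can be fed straight into a for loop without holding all results in memory.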

itertools.groupby can tell you the number of consecutive chars. After that, for each group you have the char repeated up to that number.
from itertools import groupby
def substrings(s):
    for char, group in groupby(s):
        substr = ''
        for i in group:
            substr += i
            yield substr
for result in substrings('abccdddcce'):
    print(result)

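Here is one way using regex (assuming import re and s = 'abccddde'). Because the matches go through set(), the order within each run is not guaranteed: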
In [85]: [j for i in re.findall(r'((\w)(\2+)?)', s) for j in set(i) if j]
Out[85]: ['a', 'b', 'c', 'cc', 'ddd', 'dd', 'd', 'e']
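One limitation of the pattern above is that each match only captures three groups (the whole run, the single character, and the remainder), so a run of four or more characters would not produce every prefix. The following is a sketch of an order-preserving alternative, not the answerer's code; the helper name run_prefixes is hypothetical.

```python
import re

def run_prefixes(s):
    """Return the prefixes of each run of a repeated character, in order."""
    result = []
    # (.)\1* matches one character followed by any repeats of it.
    # Note: '.' does not match newlines unless re.DOTALL is passed.
    for match in re.finditer(r'(.)\1*', s):
        run = match.group(0)
        result.extend(run[:i] for i in range(1, len(run) + 1))
    return result

print(run_prefixes('abccddde'))
# ['a', 'b', 'c', 'cc', 'd', 'dd', 'ddd', 'e']
```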

from itertools import groupby
def runlength_compress(src):
    return ((k, sum(1 for _ in g)) for k,g in groupby(src))
def contiguous_substrings(src):
    return [c*(i+1) for c, count in runlength_compress(src) for i in range(count)]
print(contiguous_substrings('abccddde'))

The following will do what you want. I don't know if it's efficient compared to other solutions though.
def get_all_substrings(text):
    res = []
    prev = ''
    s = ''
    for c in text:
        if c == prev:
            s += c
        else:
            s = prev = c
        res.append(s)
    return res
# Output
>>> get_all_substrings('abccddde')
['a', 'b', 'c', 'cc', 'd', 'dd', 'ddd', 'e']
>>> get_all_substrings('abccdddec')
['a', 'b', 'c', 'cc', 'd', 'dd', 'ddd', 'e', 'c']
Timings
import timeit
import random
size = 100
values = 'abcde'
s = ''.join(random.choice(values) for _ in range(size))
print(s)
print(timeit.timeit("get_all_substrings(s)",
                    setup='from __main__ import s, get_all_substrings',
                    number=10000))
# Example for size 100 input
abbaaebacddbdedbdbbacadcdddabaeabacdcbeebbccaadebdcecadcecceececcacebacecbbccdedddddabaeeceeeccabdcc
0.16761969871275967

Related

Removing duplicate characters from a list in Python where the pattern repeats

I am monitoring a serial port that sends data that looks like this:
['','a','a','a','a','a','a','','b','b','b','b','b','b','b','b',
'','','c','c','c','c','c','c','','','','d','d','d','d','d','d','d','d',
'','','e','e','e','e','e','e','','','a','a','a','a','a','a',
'','','','b','b','b','b','b','b','b','b','b','','','c','c','c','c','c','c',
'','','','d','d','d','d','d','d','','','e','e','e','e','e','e',
'','','a','a','a','a','a','a','','b','b','b','b','b','b','b','b',
'','','c','c','c','c','c','c','','','','d','d','d','d','d','d','d','d',
'','','e','e','e','e','e','e','','','a','a','a','a','a','a',
'','','','b','b','b','b','b','b','b','b','b','','','c','c','c','c','c','c',
'','','','d','d','d','d','d','d','','','e','e','e','e','e','e','','']
I need to be able to convert this into:
['a','b','c','d','a','b','c','d','a','b','c','d','a','b','c','d']
So I'm removing duplicates and empty strings, but also retaining the number of times the pattern repeats itself.
I haven't been able to figure it out. Can someone help?
Here's a solution using a list comprehension and itertools.zip_longest: keep an element only if it's not an empty string, and not equal to the next element. You can use an iterator to skip the first element, to avoid the cost of slicing the list.
from itertools import zip_longest
def remove_consecutive_duplicates(lst):
    ahead = iter(lst)
    next(ahead)
    return [ x for x, y in zip_longest(lst, ahead) if x and x != y ]
Usage:
>>> remove_consecutive_duplicates([1, 1, 2, 2, 3, 1, 3, 3, 3, 2])
[1, 2, 3, 1, 3, 2]
>>> remove_consecutive_duplicates(my_list)
['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd',
'e', 'a', 'b', 'c', 'd', 'e']
I'm assuming either that there are no duplicates separated by empty strings (e.g. 'a', '', 'a'), or that you don't want to remove such duplicates. If this assumption is wrong, then you should filter out the empty strings first:
>>> example = ['a', '', 'a']
>>> remove_consecutive_duplicates([ x for x in example if x ])
['a']
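As written, next(ahead) raises StopIteration when the list is empty. A minimal sketch of a variant that tolerates empty input by giving next() a default; everything else follows the answer above. Note that any falsy item (such as 0 or None) is dropped along with the empty strings.

```python
from itertools import zip_longest

def remove_consecutive_duplicates(lst):
    """Keep truthy items that differ from the item that follows them."""
    ahead = iter(lst)
    # Skip the first element; the default avoids StopIteration on [].
    next(ahead, None)
    return [x for x, y in zip_longest(lst, ahead) if x and x != y]

print(remove_consecutive_duplicates([]))
# []
print(remove_consecutive_duplicates(['', 'a', 'a', '', 'b']))
# ['a', 'b']
```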
You can loop over the list and add the appropriate conditions. For the output you are expecting, you just need to check that the current element is non-empty and differs from the previous element.
current_sequence = ['','a','a','a','a','a','a','','b','b','b','b','b','b','b','b','','','c','c','c','c','c','c','','','','d','d','d','d','d','d','d','d','','','e','e','e','e','e','e','','','a','a','a','a','a','a','','','','b','b','b','b','b','b','b','b','b','','','c','c','c','c','c','c','','','','d','d','d','d','d','d','','','e','e','e','e','e','e','','','a','a','a','a','a','a','','b','b','b','b','b','b','b','b','','','c','c','c','c','c','c','','','','d','d','d','d','d','d','d','d','','','e','e','e','e','e','e','','','a','a','a','a','a','a','','','','b','b','b','b','b','b','b','b','b','','','c','c','c','c','c','c','','','','d','d','d','d','d','d','','','e','e','e','e','e','e','','']
sequence_list = []
for x in range(len(current_sequence)):
    if current_sequence[x]:
        # x == 0 has no previous element; index -1 would wrap around to the last item
        if x == 0 or current_sequence[x] != current_sequence[x-1]:
            sequence_list.append(current_sequence[x])
print(sequence_list)
You need something like this:
li = ['','a','a','a','a','a','a','','b','b','b','b','b','b','b','b','','','c','c','c','c','c','c','','','','d','d','d','d','d','d','d','d','','','e','e','e','e','e','e','','','a','a','a','a','a','a','','','','b','b','b','b','b','b','b','b','b','','','c','c','c','c','c','c','','','','d','d','d','d','d','d','','','e','e','e','e','e','e','','','a','a','a','a','a','a','','b','b','b','b','b','b','b','b','','','c','c','c','c','c','c','','','','d','d','d','d','d','d','d','d','','','e','e','e','e','e','e','','','a','a','a','a','a','a','','','','b','b','b','b','b','b','b','b','b','','','c','c','c','c','c','c','','','','d','d','d','d','d','d','','','e','e','e','e','e','e','','']
new_li = []
e_ = ''
for e in li:
    if len(e) > 0 and e_ != e:
        new_li.append(e)
        e_ = e
print(new_li)
Output
['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e']
You can use itertools.groupby. If your list is ll:
from itertools import groupby
ll = [i for i in ll if i]
out = []
for k, g in groupby(ll, key=lambda x: ord(x)):
    out.append(chr(k))
print(out)
#prints ['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', ...
from itertools import groupby
from operator import itemgetter
# data <- your data
a = [k for k, v in groupby(data) if k] # approach 1
b = list(filter(bool, map(itemgetter(0), groupby(data)))) # approach 2
assert a == b
print(a)
Result:
['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e']
Using set() you can remove the duplicates from the list, but note that a set also discards the order and the repetitions of the pattern:
data = ['','a','a','a','a','a','a','','b','b','b','b','b','b','b','b',
'','','c','c','c','c','c','c','','','','d','d','d','d','d','d','d','d',
'','','e','e','e','e','e','e','','','a','a','a','a','a','a',
'','','','b','b','b','b','b','b','b','b','b','','','c','c','c','c','c','c',
'','','','d','d','d','d','d','d','','','e','e','e','e','e','e',
'','','a','a','a','a','a','a','','b','b','b','b','b','b','b','b',
'','','c','c','c','c','c','c','','','','d','d','d','d','d','d','d','d',
'','','e','e','e','e','e','e','','','a','a','a','a','a','a',
'','','','b','b','b','b','b','b','b','b','b','','','c','c','c','c','c','c',
'','','','d','d','d','d','d','d','','','e','e','e','e','e','e','','']
print(set(data))
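A small illustration of why set() alone does not give the output asked for, using a shortened, made-up sample rather than the full serial data. The groupby line is the technique from the other answers, shown here only for contrast.

```python
from itertools import groupby

data = ['', 'a', 'a', '', 'b', 'b', '', 'a', 'a', '', 'b', '']

# set() keeps one copy of each value and has no defined order.
print(set(data) == {'', 'a', 'b'})
# True

# groupby() collapses only consecutive repeats, so the pattern survives.
no_blanks = [x for x in data if x]
print([key for key, _ in groupby(no_blanks)])
# ['a', 'b', 'a', 'b']
```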

Make a list based on conditions in python to create a unique list

I have two lists:
a= [0,0,0,1,1,1,3,3,3]
b= ['a','b','c','d','e','f','g','h','i']
output = [['a','b','c'],['d','e','f'],['g','h','i']]
a and b are lists of the same length.
I need an output list such that whenever the value in list a changes (from 0 to 1, or from 1 to 3), a new sublist is started in the output.
Can someone please help?
Use groupby:
from itertools import groupby
from operator import itemgetter
a = [0, 0, 0, 1, 1, 1, 3, 3, 3]
b = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
output = [list(map(itemgetter(1), group)) for _, group in groupby(zip(a, b), key=itemgetter(0))]
print(output)
Output
[['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
A simpler method without any imports, using a dictionary:
a= [0,0,0,1,1,1,3,3,3]
b= ['a','b','c','d','e','f','g','h','i']
d = {e: [] for e in set(a)} # Create a dictionary for each of a's unique key
[d[e].append(b[i]) for i, e in enumerate(a)] # put stuff into lists by index
lofl = list(d.values())
>>> lofl
[['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
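The dictionary approach groups by value, while groupby groups by consecutive run. The two agree on the example above but differ when a value in a comes back later. A sketch with made-up input to show the difference, assuming Python 3.7+ so that dictionaries keep insertion order:

```python
from itertools import groupby

a = [0, 0, 1, 1, 0, 0]
b = ['a', 'b', 'c', 'd', 'e', 'f']

# Dictionary approach: one bucket per distinct value of a.
buckets = {}
for key, item in zip(a, b):
    buckets.setdefault(key, []).append(item)
by_value = list(buckets.values())

# groupby approach: one group per run of equal values in a.
by_run = [[item for _, item in group]
          for _, group in groupby(zip(a, b), key=lambda pair: pair[0])]

print(by_value)
# [['a', 'b', 'e', 'f'], ['c', 'd']]
print(by_run)
# [['a', 'b'], ['c', 'd'], ['e', 'f']]
```

Which one is right depends on whether "a new list whenever the value changes" should also apply when an earlier value reappears.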
Using groupby, you could do:
from itertools import groupby
a= [0,0,0,1,1,1,3,3,3]
b= ['a','b','c','d','e','f','g','h','i']
iter_b = iter(b)
output = [[next(iter_b) for _ in group] for key, group in groupby(a)]
print(output)
# [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
groupby yields successive groups of identical values of a. For each group, we create a list containing as many of the next elements of b as there are values in the group.
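The same idea can be written with itertools.islice, which takes a fixed number of items from the shared iterator. This is a sketch under the assumption that b has at least as many items as a; the function name split_by_runs is made up for illustration.

```python
from itertools import groupby, islice

def split_by_runs(a, b):
    """Split b into sublists whose lengths match the runs of equal values in a."""
    iter_b = iter(b)
    result = []
    for _, group in groupby(a):
        run_length = sum(1 for _ in group)
        # islice takes up to run_length items from the shared iterator.
        result.append(list(islice(iter_b, run_length)))
    return result

a = [0, 0, 0, 1, 1, 1, 3, 3, 3]
b = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
print(split_by_runs(a, b))
# [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
```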
As you added the algorithm tag, I believe you want a solution without so much magic.
>>> def merge_lists(A, B):
...     output = []
...     sub_list = []
...     current = A[0]
...     for i in range(len(A)):
...         if A[i] == current:
...             sub_list.append(B[i])
...         else:
...             output.append(sub_list)
...             sub_list = []
...             sub_list.append(B[i])
...             current = A[i]
...     output.append(sub_list)
...     return output
...
>>> a= [0,0,0,1,1,1,3,3,3]
>>> b= ['a','b','c','d','e','f','g','h','i']
>>> merge_lists(a, b)
[['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]

How to slice lists in Python based on the length of its elements

How to slice a list based on the length of its elements?
For example, how do I turn
['A', 'E', 'LA', 'ELA']
into
['A','E'],['LA'],['ELA']
Using itertools.groupby
from itertools import groupby
l = ['A', 'E', 'LA', 'ELA']
[list(g) for _,g in groupby(l,len)]
#Output:
#[['A', 'E'], ['LA'], ['ELA']]
You can try this:
l = ['A', 'E', 'LA', 'ELA', 'B', 'CD']
maximum = max([len(i) for i in l])
minimum = min([len(i) for i in l])
l = list([i for i in l if len(i)==s] for s in range(minimum, maximum+1))
print(l)
Output:
[['A', 'E', 'B'], ['LA', 'CD'], ['ELA']]
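The two answers above behave differently when equal lengths are not adjacent: groupby starts a new group at every change, while the range-based version gathers all items of a length (and produces an empty list for any length that does not occur). A sketch that sorts by length before grouping gets the gathered result without empty lists; the variable name words is chosen here for illustration.

```python
from itertools import groupby

words = ['A', 'E', 'LA', 'ELA', 'B', 'CD']

# Consecutive grouping: a new group starts whenever the length changes.
print([list(g) for _, g in groupby(words, key=len)])
# [['A', 'E'], ['LA'], ['ELA'], ['B'], ['CD']]

# Sorting by length first gathers all items of the same length.
by_length = [list(g) for _, g in groupby(sorted(words, key=len), key=len)]
print(by_length)
# [['A', 'E', 'B'], ['LA', 'CD'], ['ELA']]
```

sorted() is stable, so items of equal length keep their original relative order.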

Merge lists in Python by placing every nth item from one list and others from another?

I have two lists, list1 and list2.
Here len(list2) << len(list1).
Now I want to merge both of the lists such that every nth element of final list is from list2 and the others from list1.
For example:
list1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
list2 = ['x', 'y']
n = 3
Now the final list should be:
['a', 'b', 'x', 'c', 'd', 'y', 'e', 'f', 'g', 'h']
What is the most Pythonic way to achieve this?
I want to add all elements of list2 to the final list, final list should include all elements from list1 and list2.
Making the larger list an iterator makes it easy to take multiple elements for each element of the smaller list:
list1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
list2 = ['x', 'y']
n = 3
iter1 = iter(list1)
res = []
for x in list2:
    res.extend([next(iter1) for _ in range(n - 1)])
    res.append(x)
res.extend(iter1)
>>> res
['a', 'b', 'x', 'c', 'd', 'y', 'e', 'f', 'g', 'h']
This avoids insert, which can be expensive for large lists because every element after the insertion point has to be shifted each time.
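The iterator approach above can be wrapped into a reusable generator. This is a sketch, not the answerer's code: the name merge_every_nth and the _MISSING sentinel are made up, and it assumes that when list1 runs out early the remaining items of list2 should still be emitted.

```python
_MISSING = object()  # sentinel so that None can be a legitimate list item

def merge_every_nth(list1, list2, n):
    """Yield items of list1 with an item of list2 in every nth position."""
    iter1 = iter(list1)
    for x in list2:
        # Take up to n - 1 items from list1 before each item of list2.
        for _ in range(n - 1):
            item = next(iter1, _MISSING)
            if item is _MISSING:
                break
            yield item
        yield x
    # Whatever is left of list1 goes at the end.
    yield from iter1

list1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
list2 = ['x', 'y']
print(list(merge_every_nth(list1, list2, 3)))
# ['a', 'b', 'x', 'c', 'd', 'y', 'e', 'f', 'g', 'h']
```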
To preserve the original list, you could try the following:
import copy
result = copy.deepcopy(list1)
index = n - 1
for elem in list2:
    result.insert(index, elem)
    index += n
>>> result
['a', 'b', 'x', 'c', 'd', 'y', 'e', 'f', 'g', 'h']
Using the itertools module and the supplementary more_itertools package, you can construct an iterable solution a couple different ways. First the imports:
import itertools as it, more_itertools as mt
This first one seems the cleanest, but it relies on more_itertools.chunked().
it.chain(*mt.roundrobin(mt.chunked(list1, n-1), list2))
This one uses only more_itertools.roundrobin(), whose implementation is taken from the itertools documentation, so if you don't have access to more_itertools you can just copy it yourself.
mt.roundrobin(*([iter(list1)]*(n-1) + [list2]))
Alternatively, this does nearly the same thing as the first sample without using any more_itertools-specific functions. Basically, grouper can replace chunked, but it will add Nones at the end in some cases, so I wrap it in it.takewhile to remove those. Naturally, if you are using this on lists which actually do contain None, it will stop once it reaches those elements, so be careful.
it.takewhile(lambda o: o is not None,
it.chain(*mt.roundrobin(mt.grouper(n-1, list1), list2))
)
I tested these on Python 3.4, but I believe these code samples should also work in Python 2.7.
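For readers without more_itertools, here is a simplified stand-in for roundrobin. It is a sketch written for this example, not the recipe from the itertools documentation verbatim, though it is intended to produce the same interleaving for the case shown.

```python
def roundrobin(*iterables):
    """Yield one item from each iterable in turn, dropping exhausted ones."""
    iterators = [iter(it) for it in iterables]
    while iterators:
        still_active = []
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                continue  # exhausted; leave it out of the next round
            still_active.append(iterator)
        iterators = still_active

list1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
list2 = ['x', 'y']
n = 3
# The same list1 iterator is repeated n - 1 times, so each round
# pulls n - 1 consecutive items from list1 and then one from list2.
merged = list(roundrobin(*([iter(list1)] * (n - 1) + [list2])))
print(merged)
# ['a', 'b', 'x', 'c', 'd', 'y', 'e', 'f', 'g', 'h']
```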
What about the below solution? However I don't have a better one...
>>> list1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
>>> list2 = ['x', 'y']
>>> n = 2
>>> for i in range(len(list2)):
...     list1.insert(n, list2[i])
...     n += 3
...
>>> list1
['a', 'b', 'x', 'c', 'd', 'y', 'e', 'f', 'g', 'h']
n is 2 because the index of third element in a list is 2, since it starts at 0.
list(list1[i-1-min((i-1)//n, len(list2))] if i % n or (i-1)//n >= len(list2) else list2[(i-1)//n] for i in range(1, len(list1)+len(list2)+1))
Definitely not pythonic, but I thought it might be fun to do it in a one-liner. More readable (really?) version:
list(
list1[i-1-min((i-1)//n, len(list2))]
if i % n or (i-1)//n >= len(list2)
else
list2[(i-1)//n]
for i in range(1, len(list1)+len(list2)+1)
)
Basically, some tinkering around with indexes and determining which list and which index to take next element from.
Yet another way, calculating the slice steps:
list1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
list2 = ['x', 'y']
n = 3
res = []
m = n - 1
start, end = 0, m
for x in list2:
    res.extend(list1[start:end])
    res.append(x)
    start, end = end, end + m
res.extend(list1[start:])
>>> res
['a', 'b', 'x', 'c', 'd', 'y', 'e', 'f', 'g', 'h']
list1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
list2 = ['x', 'y']
n = 3
new = list1[:]
for index, item in enumerate(list2):
    # Assign a one-element list so that multi-character strings are inserted whole
    new[n * (index + 1) - 1: n * (index + 1) - 1] = [item]
print(new)
I admire @David Z's use of more_itertools. Updates to the tools can simplify the solution:
import more_itertools as mit
n = 3
groups = mit.windowed(list1, n-1, step=n-1)
list(mit.flatten(mit.interleave_longest(groups, list2)))
# ['a', 'b', 'x', 'c', 'd', 'y', 'e', 'f', 'g', 'h']
Summary: list2 is being interleaved into groups from list1 and finally flattened into one list.
Notes
groups: n-1 size sliding windows, e.g. [('a', 'b'), ('c', 'd'), ('e', 'f'), ('g', 'h')]
interleave_longest is presently equivalent to roundrobin
None is the default fillvalue. Optionally remove with filter(None, ...)
Here is another solution: slice list1 at the correct index, then add the element of list2 into list1.
>>> list1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
>>> list2 = ['x', 'y']
>>> n = 3
>>> for i in range(len(list2)):
...     list1 = list1[:n*(i+1) - 1] + list(list2[i]) + list1[n*(i+1)-1:]
...
>>> list1
['a', 'b', 'x', 'c', 'd', 'y', 'e', 'f', 'g', 'h']

How to group sequences of identical characters in Python ('aabcdd' -> ['aa', 'b', 'c', 'dd'])?

I have a string 'aabbababacccssdd' from which I want to generate ['aa', 'bb', 'a', 'b', 'a', 'b', 'a', 'ccc', 'ss', 'dd']
Here's my present solution:
def get_pats(n):
    n = str(n) # to support integers
    a = len(n)
    p = []
    pat_start = 0
    prev = 0
    for b in range(0, a):
        if n[b] != n[prev]:
            p.append(n[pat_start:b])
            prev = b
            pat_start = b
    p.append(n[pat_start:b+1])
    return p
The solution works well enough, but I was wondering if there is a more elegant/Pythonic way to do this.
This is what itertools.groupby does for you:
text = 'aabbababacccssdd'
from itertools import groupby
print([''.join(g) for k, g in groupby(text)])
# ['aa', 'bb', 'a', 'b', 'a', 'b', 'a', 'ccc', 'ss', 'dd']
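As a sketch, the groupby one-liner can be dropped into the asker's get_pats function, keeping the str() conversion so that integers are still supported. The integer input below is a made-up example.

```python
from itertools import groupby

def get_pats(n):
    """Split str(n) into runs of identical characters."""
    return [''.join(group) for _, group in groupby(str(n))]

print(get_pats('aabbababacccssdd'))
# ['aa', 'bb', 'a', 'b', 'a', 'b', 'a', 'ccc', 'ss', 'dd']
print(get_pats(1122333))
# ['11', '22', '333']
```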
