Splitting a Python list into a list of overlapping chunks

This question is similar to Slicing a list into a list of sub-lists, but in my case I want to include the last element of each previous sub-list as the first element in the next sub-list. And I have to take into account that the last sub-list always has to have at least two elements.
For example:
list_ = ['a','b','c','d','e','f','g','h']
The result for a size 3 sub-list:
resultant_list = [['a','b','c'],['c','d','e'],['e','f','g'],['g','h']]

The list comprehension in the answer you linked is easily adapted to support overlapping chunks by simply shortening the "step" parameter passed to the range:
>>> list_ = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
>>> n = 3 # group size
>>> m = 1 # overlap size
>>> [list_[i:i+n] for i in range(0, len(list_), n-m)]
[['a', 'b', 'c'], ['c', 'd', 'e'], ['e', 'f', 'g'], ['g', 'h']]
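One caveat: with range(0, len(list_), n-m), some input lengths leave a final chunk that holds only the m overlapping elements (a seven-element list ends with ['g']), which violates the "at least two elements" requirement. A minimal sketch that stops the range m elements early, so every chunk contributes at least one new element. The function name overlapping_chunks is my own and does not come from the linked answer:

```python
def overlapping_chunks(seq, n, m):
    """Split seq into chunks of size n that overlap by m elements."""
    if not 0 <= m < n:
        raise ValueError("overlap must satisfy 0 <= m < n")
    # Stopping at len(seq) - m means the last chunk never consists
    # solely of elements already present in the previous chunk.
    return [seq[i:i + n] for i in range(0, len(seq) - m, n - m)]


print(overlapping_chunks(list("abcdefgh"), 3, 1))
# [['a', 'b', 'c'], ['c', 'd', 'e'], ['e', 'f', 'g'], ['g', 'h']]
print(overlapping_chunks(list("abcdefg"), 3, 1))
# [['a', 'b', 'c'], ['c', 'd', 'e'], ['e', 'f', 'g']]
```

Note that an input with m or fewer elements produces no chunks at all under this rule.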
Other visitors to this question might not have the luxury of working with an input list (sliceable, known length, finite). Here is a generator-based solution that works with arbitrary iterables:
from collections import deque
def chunks(iterable, chunk_size=3, overlap=0):
    # we'll use a deque to hold the values because it automatically
    # discards any extraneous elements if it grows too large
    if chunk_size < 1:
        raise Exception("chunk size too small")
    if overlap >= chunk_size:
        raise Exception("overlap too large")
    queue = deque(maxlen=chunk_size)
    it = iter(iterable)
    i = 0
    try:
        # start by filling the queue with the first group
        for i in range(chunk_size):
            queue.append(next(it))
        while True:
            yield tuple(queue)
            # after yielding a chunk, get enough elements for the next chunk
            for i in range(chunk_size - overlap):
                queue.append(next(it))
    except StopIteration:
        # if the iterator is exhausted, yield any remaining elements
        i += overlap
        if i > 0:
            yield tuple(queue)[-i:]
Note: I've since released this implementation in wimpy.util.chunks. If you don't mind adding the dependency, you can pip install wimpy and use from wimpy import chunks rather than copy-pasting the code.
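For readers who want the same idea in fewer lines, here is a sketch built on itertools.islice. It is my own variant, not the wimpy implementation, and the name overlapping_windows is hypothetical. Unlike the generator above, it stops as soon as no new element arrives, so it never emits a trailing chunk made only of overlap elements:

```python
from itertools import islice


def overlapping_windows(iterable, size=3, overlap=1):
    """Yield tuples of up to `size` items; consecutive tuples share `overlap` items."""
    if size < 1 or not 0 <= overlap < size:
        raise ValueError("need size >= 1 and 0 <= overlap < size")
    it = iter(iterable)
    window = tuple(islice(it, size))
    # A window is only worth yielding if it holds more than the overlap.
    while len(window) > overlap:
        yield window
        fresh = tuple(islice(it, size - overlap))
        if not fresh:
            return  # nothing new arrived, so stop instead of repeating the tail
        # Keep the last `overlap` items and append the newly read ones.
        window = window[size - overlap:] + fresh


print(list(overlapping_windows(iter("abcdefgh"), size=3, overlap=1)))
# [('a', 'b', 'c'), ('c', 'd', 'e'), ('e', 'f', 'g'), ('g', 'h')]
```

Because it only calls iter() and islice(), it works on generators and other one-shot iterables as well as on lists.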

more_itertools has a windowing tool for overlapping iterables.
Given
import more_itertools as mit
iterable = list("abcdefgh")
iterable
# ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
Code
windows = list(mit.windowed(iterable, n=3, step=2))
windows
# [('a', 'b', 'c'), ('c', 'd', 'e'), ('e', 'f', 'g'), ('g', 'h', None)]
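If required, you can drop the None fillvalue by filtering the windows. Compare against None explicitly, because filter(None, w) would also drop falsy items such as 0 or '':
[[x for x in w if x is not None] for w in windows]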
# [['a', 'b', 'c'], ['c', 'd', 'e'], ['e', 'f', 'g'], ['g', 'h']]
See also more_itertools docs for details on more_itertools.windowed
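When stripping the fill value, it matters how the filtering is written. The small standard-library sketch below (the sample tuple is made up for illustration) shows that filter(None, ...) removes every falsy item, whereas an explicit comparison removes only None:

```python
window = (0, 1, None)  # a window that legitimately contains a falsy value

# filter(None, ...) keeps only truthy items, so the 0 is lost too.
dropped_falsy = list(filter(None, window))

# Comparing against None removes just the fill value.
dropped_none = [x for x in window if x is not None]

print(dropped_falsy)  # [1]
print(dropped_none)   # [0, 1]
```

If None can itself be a real value in the data, passing a dedicated sentinel object as the fillvalue is a safer choice.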

Here's what I came up with:
l = [1, 2, 3, 4, 5, 6]
x = zip(l[:-1], l[1:])
for i in x:
    print(i)
(1, 2)
(2, 3)
(3, 4)
(4, 5)
(5, 6)
zip accepts any number of iterables; there is also itertools.zip_longest, which pads the shorter inputs instead of stopping early.
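The pairing trick extends to wider windows by zipping more shifted slices. This is a sketch with a helper name of my own (sliding_windows); note that these windows advance one element at a time, so they overlap by width - 1 rather than by a single element as in the question:

```python
def sliding_windows(seq, width):
    """Return every run of `width` consecutive items as a tuple."""
    # zip stops at the shortest slice, so incomplete windows are dropped.
    return list(zip(*(seq[i:] for i in range(width))))


print(sliding_windows([1, 2, 3, 4, 5, 6], 3))
# [(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)]
```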

Related

How to return a shuffled list considering mutually exclusive items?

Say I have a list of options and I want to pick a certain number randomly.
In my case, say the options are in a list ['a', 'b', 'c', 'd', 'e'] and I want my script to return 3 elements.
However, there is also the case of two options that cannot appear at the same time. That is, if option 'a' is picked randomly, then option 'b' cannot be picked. And the same applies the other way round.
So valid outputs are: ['a', 'c', 'd'] or ['c', 'd', 'b'], while things like ['a', 'b', 'c'] would not because they contain both 'a' and 'b'.
To fulfil these requirements, I am fetching 3 options plus another one to compensate a possible discard. Then, I keep a set() with the mutually exclusive condition and keep removing from it and check if both elements have been picked or not:
import random
mutually_exclusive = set({'a', 'b'})
options = ['a', 'b', 'c', 'd', 'e']
num_options_to_return = 3
shuffled_options = random.sample(options, num_options_to_return + 1)
elements_returned = 0
for item in shuffled_options:
    if elements_returned >= num_options_to_return:
        break
    if item in mutually_exclusive:
        mutually_exclusive.remove(item)
        if not mutually_exclusive:
            # if both elements have appeared, then the set is empty so we cannot return the current value
            continue
    print(item)
    elements_returned += 1
However, I may be overcoding and Python may have better ways to handle these requirements. Going through random's documentation I couldn't find ways to do this out of the box. Is there a better solution than my current one?

One way to do this is use itertools.combinations to create all of the possible results, filter out the invalid ones and make a random.choice from that:
>>> from itertools import combinations
>>> from random import choice
>>> def is_valid(t):
...     return 'a' not in t or 'b' not in t
...
>>> choice([
...     t
...     for t in combinations('abcde', 3)
...     if is_valid(t)
... ])
('c', 'd', 'e')
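The same enumerate-and-filter idea can be wrapped in a function that accepts several exclusive pairs. This is a sketch; pick_valid and its parameters are names I made up for illustration. It builds every combination up front, so it is only reasonable for small option lists:

```python
from itertools import combinations
from random import choice


def pick_valid(options, k, exclusive_pairs):
    """Choose one k-combination that contains no complete exclusive pair."""
    valid = [
        combo
        for combo in combinations(options, k)
        if not any(a in combo and b in combo for a, b in exclusive_pairs)
    ]
    if not valid:
        raise ValueError("no combination satisfies the constraints")
    return choice(valid)


print(pick_valid("abcde", 3, [("a", "b")]))
```

The returned tuple keeps the order of the input options; shuffle it afterwards if a random order is also needed.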

Maybe a bit naive, but you could generate samples until your condition is met:
import random
options = ['a', 'b', 'c', 'd', 'e']
num_options_to_return = 3
mutually_exclusive = {'a', 'b'}
while True:
    shuffled_options = random.sample(options, num_options_to_return)
    # accept the sample unless it contains both mutually exclusive items
    if not mutually_exclusive.issubset(shuffled_options):
        break
print(shuffled_options)

You can restructure your options so that the mutually exclusive items share a single slot; at most one of them can then be drawn:
import random
options = [('a', 'b'), 'c', 'd', 'e']
n_options = 3
selected_option = random.sample(options, n_options)
result = [item if not isinstance(item, tuple) else random.choice(item)
          for item in selected_option]
print(result)

I would implement it with sets:
import random
mutually_exclusive = {'a', 'b'}
options = ['a', 'b', 'c', 'd', 'e']
num_options_to_return = 3
while True:
    s = random.sample(options, num_options_to_return)
    print('Sample is', s)
    if not mutually_exclusive.issubset(s):
        break
    print('Discard!')
print('Final sample:', s)
Prints (for example):
Sample is ['a', 'b', 'd']
Discard!
Sample is ['b', 'a', 'd']
Discard!
Sample is ['e', 'a', 'c']
Final sample: ['e', 'a', 'c']
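The retry loop can be packaged as a reusable function. The sketch below is an assumption-laden variant rather than the answer's own code: the name sample_without_conflict and the max_tries cap are mine, added so that an impossible constraint raises an error instead of looping forever:

```python
import random


def sample_without_conflict(options, k, exclusive, max_tries=1000):
    """Draw k items at random, rejecting samples that contain all of `exclusive`."""
    exclusive = set(exclusive)
    for _ in range(max_tries):
        candidate = random.sample(options, k)
        # An empty exclusive set means there is nothing to reject.
        if not exclusive or not exclusive.issubset(candidate):
            return candidate
    raise RuntimeError("no valid sample found within max_tries")


print(sample_without_conflict(['a', 'b', 'c', 'd', 'e'], 3, {'a', 'b'}))
```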

I created the below function and I think it's worth sharing it too ;-)
import random
def random_picker(options, n, mutually_exclusives=None):
    if mutually_exclusives is None:
        return random.sample(options, n)
    elif any(len(pair) != 2 for pair in mutually_exclusives):
        raise ValueError('Length of each pair in mutually_exclusives must be 2')
    options = list(options)  # work on a copy so pop() does not empty the caller's list
    res = []
    while len(res) < n:
        item_index = random.randint(0, len(options) - 1)
        item = options[item_index]
        # skip the item if its mutually exclusive partner was already picked
        if any(item == exc and pair[1 - i] in res for pair in mutually_exclusives
               for i, exc in enumerate(pair)):
            continue
        res.append(options.pop(item_index))
    return res
Where:
options is the list of available options to pick from.
n is the number of items you want picked from options.
mutually_exclusives is an iterable of 2-tuples, each holding a pair of mutually exclusive items.
You can use it as follows:
>>> random_picker(['a', 'b', 'c', 'd', 'e'], 3)
['c', 'e', 'a']
>>> random_picker(['a', 'b', 'c', 'd', 'e'], 3, [('a', 'b')])
['d', 'b', 'e']
>>> random_picker(['a', 'b', 'c', 'd', 'e'], 3, [('a', 'b'), ('a', 'c')])
['e', 'd', 'a']

import random
l = [['a','b'], ['c'], ['d'], ['e']]
x = [random.choice(i) for i in random.sample(l,3)]
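Here l is a two-dimensional list, where the first level reflects an "and" relation and the second level an "or" relation: items from different inner lists may be picked together, but at most one item is picked from each inner list.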

Python while loop return Nth letter

I have a list of strings
X=['kmo','catlin','mept']
I was trying to write a loop that would return a list that contains lists of every Nth letter of each word:
[['k','c','m'], ['m','a','e'],['o','t','p']]
But all the methods I tried only returned a list of all the letters returned consecutively in one list:
['k','m','o','c','a','t','l','i'.....'t']
Here is one version of my code:
def letters(X):
    prefix=[]
    for i in X:
        j=0
        while j < len(i):
            while j < len(i):
                prefix.append(i[j])
                break
            j+=1
    return prefix
I know I'm looping within each word, but I'm not sure how to correct it.

It seems that the length of the resulting list is dictated by the length of the smallest string in the original list. If that is indeed the case, you could simply do it like this:
X = ['kmo','catlin','mept']
l = len(min(X, key=len))
res = [[x[i] for x in X] for i in range(l)]
which returns:
print(res) # -> [['k', 'c', 'm'], ['m', 'a', 'e'], ['o', 't', 'p']]
or the even simpler (kudos @JonClemens):
res = [list(el) for el in zip(*X)]
with the same result. Note that this works because zip stops iterating as soon as the shortest of its input iterables is exhausted.
If you want to fill the blanks, so to speak, itertools has got your back with its zip_longest function. See the itertools documentation for more information. The fillvalue can be anything you choose; here, '-' is used to demonstrate the use. An empty string '' might be a better option for production code.
from itertools import zip_longest
res = list(zip_longest(*X, fillvalue = '-'))
print(res) # -> [('k', 'c', 'm'), ('m', 'a', 'e'), ('o', 't', 'p'), ('-', 'l', 't'), ('-', 'i', '-'), ('-', 'n', '-')]

You can use zip.
output=list(zip(*X))
print(output)
*X unpacks all the elements of X into separate arguments.
After unpacking, I zip them all together. zip() returns a zip object, an iterator of tuples in which the first items of the passed iterables are paired together, then the second items, and so on. Finally, I wrap everything in a list using list().
output
[('k', 'c', 'm'), ('m', 'a', 'e'), ('o', 't', 'p')]
If you want output to be a list of lists, use map:
output=list(map(list,zip(*X)))
print(output)
output
[['k', 'c', 'm'], ['m', 'a', 'e'], ['o', 't', 'p']]
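To make the role of the star explicit, the short check below compares zip(*X) with the same call written out by hand. The variable names unpacked and explicit are mine:

```python
X = ['kmo', 'catlin', 'mept']

# The star spreads the list into separate positional arguments,
# so these two calls are equivalent.
unpacked = list(zip(*X))
explicit = list(zip('kmo', 'catlin', 'mept'))

print(unpacked == explicit)  # True
print(unpacked)
# [('k', 'c', 'm'), ('m', 'a', 'e'), ('o', 't', 'p')]
```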

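X=['kmo','catlin','mept']
y = []
j=0
# Note: the outer loop runs once per word, so j only reaches len(X) - 1;
# each entry of y is a string of the j-th letters, not a list.
for i in X:
    item =''
    for element in X :
        if (len(element) > j):
            item = item + element[j]
    y.append(item)
    j=j+1
print("X = [",X,"]")
print("Y = [",y,"]")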

Try this:
def letters(X):
    prefix=[]
    # First, zip the words together so each tuple holds the letters at one position
    zip_elements = zip(*X)
    for element in zip_elements:
        prefix.append(list(element))
    return prefix

Python array value unexpectedly changes after function call

I have a small function which uses one list to populate another. For some reason, the source list gets modified. I don't have a single line that manipulates the source list arr. I am probably missing the way Python deals with scope of variables, lists. My expected output is for the list arr to remain the same after the function call.
numTestRows = 5
m = 2
def getTestData():
    data['test'] = []
    size_c = len(arr)
    for i in range(numTestRows):
        data['test'].append(arr[i%size_c])
        for j in range(m):
            data['test'][i].append('xyz')
#just a 2x5 str matrix
arr = [['a', 'b', 'c', 'd', 'e'], ['f', 'g', 'h', 'i', 'j']]
print('Array before: ')
print( arr)
data = {}
getTestData()
print('Array after: ')
print( arr)
Output
Array before:
[['a', 'b', 'c', 'd', 'e'], ['f', 'g', 'h', 'i', 'j']]
Array after:
[['a', 'b', 'c', 'd', 'e', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'], ['f', 'g', 'h', 'i', 'j', 'xyz', 'xyz', 'xyz', 'xyz']]

You've mis-handled the references in your list of lists (not a matrix). Perhaps if we break this down a little more, you can see what's happening. Start your main program with the two char lists as separate variables:
left = ['a', 'b', 'c', 'd', 'e']
right = ['f', 'g', 'h', 'i', 'j']
arr = [left, right]
Now, look at what happens within your function at the critical lines. On this first iteration, size_c is 2, i is 0 ...
data['test'].append(arr[i%size_c])
This will append arr[0] to data['test'], which started as an empty list. Now for the critical part: arr[0] is not a new list; rather, it's a reference to the list we now know as left in the main program. There is only one copy of this list.
Now, when we get into the next loop, we hit the statement:
data['test'][i].append('xyz')
data['test'][i] is a reference to the same list as left ... and this explains the appending to the original list.
You can easily copy a list with the suffix [:], making a new slice of the entire list. For instance:
data['test'].append(arr[i%size_c][:])
... and this should solve your reference problem.
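The difference between a reference and a slice copy can be checked directly with the is operator. This is a small self-contained sketch; the names alias and duplicate are mine:

```python
left = ['a', 'b']
arr = [left]

alias = arr[0]          # another name for the same list object
duplicate = arr[0][:]   # a new list with the same items (a shallow copy)

alias.append('xyz')       # visible through left and arr as well
duplicate.append('zzz')   # affects only the copy

print(alias is left)      # True
print(duplicate is left)  # False
print(arr)                # [['a', 'b', 'xyz']]
print(duplicate)          # ['a', 'b', 'zzz']
```

The slice copies only the outer list. If the rows themselves contained nested lists, copy.deepcopy would be needed to separate them fully.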

Itertools Combinations No Repeats: Where rgb is equivalent to rbg etc

I'm trying to use itertools.combinations to return unique combinations. I've searched through several similar questions but have not been able to find an answer.
An example:
>>> import itertools
>>> e = ['r','g','b','g']
>>> list(itertools.combinations(e,3))
[('r', 'g', 'b'), ('r', 'g', 'g'), ('r', 'b', 'g'), ('g', 'b', 'g')]
For my purposes, (r,g,b) is identical to (r,b,g) and so I would want to return only (rgb),(rgg) and (gbg).
This is just an illustrative example and I would want to ignore all such 'duplicates'. The list e could contain up to 5 elements. Each individual element would be either r, g or b. Always looking for combinations of 3 elements from e.
To be concrete, the following are the only combinations I wish to call 'valid': (rrr), (ggg), (bbb), (rgb).
So perhaps the question boils down to how to treat any variation of (rgb) as equal to (rgb) and therefore ignore it.
Can I use itertools to achieve this, or do I need to write my own code to drop the 'duplicates' here? If there is no itertools solution, I can easily check whether each result is a variation of (rgb), but this feels a bit 'un-pythonic'.

You can use a set to discard duplicates.
In your case the count of each character is what identifies a duplicate, so you could use collections.Counter. To store the counts in a set you need to convert them to frozensets, though (because Counter isn't hashable):
>>> import itertools
>>> from collections import Counter
>>> e = ['r','g','b','g']
>>> result = []
>>> seen = set()
>>> for comb in itertools.combinations(e,3):
...     cnts = frozenset(Counter(comb).items())
...     if cnts in seen:
...         pass
...     else:
...         seen.add(cnts)
...         result.append(comb)
...
>>> result
[('r', 'g', 'b'), ('r', 'g', 'g'), ('g', 'b', 'g')]
If you want to convert them to strings use:
result.append(''.join(comb)) # instead of result.append(comb)
and it will give:
['rgb', 'rgg', 'gbg']
The approach is a variation of the unique_everseen recipe (itertools module documentation) - so it's probably "quite pythonic".
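When the elements can be ordered, as single-character strings can, a sorted tuple works as the duplicate key in place of the frozenset of counts. This is a sketch under that assumption; the function name unique_combinations is mine:

```python
from itertools import combinations


def unique_combinations(items, k):
    """Return k-combinations of items, ignoring the order inside each one."""
    seen = set()
    result = []
    for comb in combinations(items, k):
        key = tuple(sorted(comb))  # same multiset -> same key
        if key not in seen:
            seen.add(key)
            result.append(comb)
    return result


print(unique_combinations(['r', 'g', 'b', 'g'], 3))
# [('r', 'g', 'b'), ('r', 'g', 'g'), ('g', 'b', 'g')]
```

The Counter-based key needs only hashable elements, so it remains the more general choice when the items cannot be sorted.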

According to your definition of "valid outputs", you can directly build them like this:
from collections import Counter
# Your distinct values
values = ['r', 'g', 'b']
e = ['r','g','b','g', 'g']
count = Counter(e)
# Counter({'g': 3, 'r': 1, 'b': 1})
# If x appears at least 3 times, 'xxx' is a valid combination
combinations = [x*3 for x in values if count[x] >=3]
# If all values appear at least once, 'rgb' is a valid combination
if all([count[x]>=1 for x in values]):
    combinations.append('rgb')
print(combinations)
#['ggg', 'rgb']
This will be more efficient than creating all possible combinations and filtering the valid ones afterwards.

It is not completely clear what you want to return. It depends on what comes first when iterating. For example if gbr is found first, then rgb will be discarded as a duplicate:
import itertools
e = ['r','g','b','g']
s = set(e)
v = [s] * len(s)
solns = []
for c in itertools.product(*v):
    in_order = sorted(c)
    if in_order not in solns:
        solns.append(in_order)
print(solns)
This would give you something like the following (the exact order depends on how the set is iterated):
[['r', 'r', 'r'], ['b', 'r', 'r'], ['g', 'r', 'r'], ['b', 'b', 'r'], ['b', 'g', 'r'], ['g', 'g', 'r'], ['b', 'b', 'b'], ['b', 'b', 'g'], ['b', 'g', 'g'], ['g', 'g', 'g']]
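The same ten order-insensitive triples can be generated directly, without the membership test, using itertools.combinations_with_replacement. This is an alternative sketch rather than the code above. Like the product-based version, it ignores how often each value occurs in e, so it lists ['r', 'r', 'r'] even though e holds a single 'r':

```python
from itertools import combinations_with_replacement

e = ['r', 'g', 'b', 'g']
# Sorting the distinct values makes the output order reproducible.
solns = [list(c) for c in combinations_with_replacement(sorted(set(e)), 3)]

print(len(solns))           # 10
print(solns[0], solns[-1])  # ['b', 'b', 'b'] ['r', 'r', 'r']
```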

What's the most efficient way of identifying repeated pattern in array of objects using Python

I have two arrays of 5 objects
a = ['a', 'b', 'c', 'd', 'e', 'f', 'e', 'f']
b = ['a', 'b', 'd', 'f', 'e', 'f']
I would like to identify the repeated patterns of more than one object and their occurrences like
['a', 'b']: 2
['e', 'f']: 3
['f', 'e', 'f']: 2
The first sequence ['a', 'b'] appeared once in a and once in b, so total count 2. The 2nd sequence ['e', 'f'] appeared twice in a, once in b, so total 3. The 3rd sequence ['f', 'e', 'f'] appeared once in a, and once in b, so total 2.
Is there a good way to do this in Python?
Also the universe of objects is limited. Was wondering if there's an efficient solution that utilizes hash table?

If you only need this for two lists, the following should work, though I am not sure it is the most efficient solution.
A nice description of finding n-grams is given in this blog post.
The approach fixes a minimum sequence length and tries every length from that minimum up to the full length of each list.
We collect all the sequences of each list, combine them, and count how often each sequence occurs.
Finally, we return a dictionary of all the sequences that occur more than once.
def find_repeating(list_a, list_b):
    min_len = 2
    def find_ngrams(input_list, n):
        return zip(*[input_list[i:] for i in range(n)])
    seq_list_a = []
    for seq_len in range(min_len, len(list_a) + 1):
        seq_list_a += [val for val in find_ngrams(list_a, seq_len)]
    seq_list_b = []
    for seq_len in range(min_len, len(list_b) + 1):
        seq_list_b += [val for val in find_ngrams(list_b, seq_len)]
    all_sequences = seq_list_a + seq_list_b
    counter = {}
    for seq in all_sequences:
        counter[seq] = counter.get(seq, 0) + 1
    filtered_counter = {k: v for k, v in counter.items() if v > 1}
    return filtered_counter
Do let me know if you are unsure about anything.
>>> list_a = ['a', 'b', 'c', 'd', 'e', 'f', 'e', 'f']
>>> list_b = ['a', 'b', 'd', 'f', 'e', 'f']
>>> print(find_repeating(list_a, list_b))
{('f', 'e'): 2, ('e', 'f'): 3, ('f', 'e', 'f'): 2, ('a', 'b'): 2}

When you mentioned that you were looking for an efficient solution, my first thought was of the approaches to solving the longest common subsequence problem. But in your case, we actually do need to enumerate all common subsequences so that we can count them, so a dynamic programming solution will not do. Here's my solution. It's certainly shorter than SSSINISTER's solution (mostly because I use the collections.Counter class).
#!/usr/bin/env python3
def find_repeating(sequence_a, sequence_b, min_len=2):
    from collections import Counter
    # Find all subsequences
    subseq_a = [tuple(sequence_a[start:stop]) for start in range(len(sequence_a)-min_len+1)
                for stop in range(start+min_len,len(sequence_a)+1)]
    subseq_b = [tuple(sequence_b[start:stop]) for start in range(len(sequence_b)-min_len+1)
                for stop in range(start+min_len,len(sequence_b)+1)]
    # Find common subsequences
    common = set(tup for tup in subseq_a if tup in subseq_b)
    # Count common subsequences
    return Counter(tup for tup in (subseq_a + subseq_b) if tup in common)
Resulting in ...
>>> list_a = ['a', 'b', 'c', 'd', 'e', 'f', 'e', 'f']
>>> list_b = ['a', 'b', 'd', 'f', 'e', 'f']
>>> print(find_repeating(list_a, list_b))
Counter({('e', 'f'): 3, ('f', 'e'): 2, ('a', 'b'): 2, ('f', 'e', 'f'): 2})
The advantage to using collections.Counter is that not only do you not need to produce the actual code to iterate and count, you get access to all of the dict methods as well as a few specialized methods for using those counts.
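Since the question asks about hash tables: the test tup in subseq_b scans a list for every candidate, which gets slow on long inputs. A sketch that relies on hashing instead, assuming the elements are hashable, builds one Counter per list and intersects the keys. The names count_runs and find_repeating_hashed are mine:

```python
from collections import Counter


def count_runs(seq, min_len=2):
    """Count every contiguous run of at least min_len items in seq."""
    n = len(seq)
    return Counter(
        tuple(seq[start:stop])
        for start in range(n - min_len + 1)
        for stop in range(start + min_len, n + 1)
    )


def find_repeating_hashed(seq_a, seq_b, min_len=2):
    """Total occurrences of each run that appears in both sequences."""
    runs_a = count_runs(seq_a, min_len)
    runs_b = count_runs(seq_b, min_len)
    # Dictionary key views support set intersection, so no list scan is needed.
    return {run: runs_a[run] + runs_b[run] for run in runs_a.keys() & runs_b.keys()}


list_a = ['a', 'b', 'c', 'd', 'e', 'f', 'e', 'f']
list_b = ['a', 'b', 'd', 'f', 'e', 'f']
print(find_repeating_hashed(list_a, list_b))
```

The number of runs still grows quadratically with the input length, so this speeds up the lookups but not the enumeration itself.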
