Merge N lists by randomly picking elements at each index - python

I have a bajillion paired lists, each pair of equal size. I want to "merge" each by picking a random element from each index, but my current implementation is very slow - even when multiprocessing. (FWIW, my code does need to be threadable).
import random

def rand_merge(l1, l2):
    newl = []
    for i in range(len(l1)):
        q = random.choice([l1, l2])
        newl.append(q[i])
    return newl
Pretty basic, but running it on 20k lists of sizes ~5-25, it takes crazy long. I assume it's random.choice gumming up the works, but I've also tried other approaches, like pre-building a string of 0s and 1s to index into; no go.
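For reference, one variant in the spirit of those attempts avoids building a fresh [l1, l2] list and calling random.choice on every iteration; this is only a sketch, and rand_merge_flat is a made-up name:
import random

def rand_merge_flat(l1, l2):
    # one coin flip per index; no temporary two-element list, no random.choice
    return [a if random.random() < 0.5 else b for a, b in zip(l1, l2)]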
EDIT:
More clarity: It's a Genetic Algorithm designed to write sentences by matching up against a corpus. The lists in question are sentences split by word. The GA is "merging" winning fitness "parents" into children, each of which is a merging of the two parent sentences' "genes."
That means that the "lists" do need to match up, and can't pull from a larger list of lists (I don't think).
Here's some code...
from multiprocessing import Pool as ThreadPool
import random
def offspring(parents):
    child = []
    p1 = parents[0].split(' ')
    p2 = parents[1].split(' ')
    for i in range(min(len(p1), len(p2))):
        q = random.choice([p1, p2])
        child.append(q[i])
    child = ' '.join(child).strip()
    return child
def nextgen(l):  # l is two lists: the previous generation and the grammar seed
    oldgen = l[0][:pop]  # population's worth of previous generation
    gramsent = l[1]  # this is the grammar seed
    newgen = []
    newgen.append(tuple([oldgen[0][0], oldgen[0][0]]))  # keep the winner!
    for i in range(len(oldgen) - len(oldgen)//4):
        ind1 = oldgen[0][0]  # paired off against the winner - for larger pools, this is a random.sample/"tournament"
        ind2 = oldgen[i][0]
        newgen.append(tuple([ind1, ind2]))
    pool = ThreadPool(processes=8)
    newgen = pool.map(offspring, newgen)
    pool.close()
    pool.join()
The populations and generations can get into high numbers together, and each sentence runs through. Since posting the question, troubled that each generation was taking so long to roll by, I discovered (a head-scratcher for me) that the long processing times actually have almost nothing to do with the population size or the number of lists. It was taking ~15 seconds to mutate each generation; I upped the population from 50 to 50,000 and the time per generation only went from 15 seconds to about 17. So the slowness is apparently hiding elsewhere.
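Something like the following might help pin down where that time actually goes; it is only a sketch, it reuses the nextgen function and the l argument from the snippet above, and cProfile only sees the parent process (the pool's worker processes are not profiled):
import cProfile
import pstats

# profile a single generation step; 'l' is the same
# [previous_generation, grammar_seed] structure nextgen expects
cProfile.run('nextgen(l)', 'nextgen.prof')
pstats.Stats('nextgen.prof').sort_stats('cumulative').print_stats(10)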

Try merging all 20,000 lists at once, instead of two at a time.
from itertools import zip_longest
from functools import partial
import random

lists = [l1, l2, ...]
idxvals = map(partial(filter, None), zip_longest(*lists))
newl = [random.choice([*i]) for i in idxvals]
Since you want to pick a random element at each index, it makes sense to choose from all 20k lists at once instead of 2 at a time.
>>> lists = [[1, 2, 3], [10], [20, 30, 40, 5]]
zip_longest will zip to the longest list, filling missing values with None.
>>> list(zip_longest(*lists))
[(1, 10, 20), (2, None, 30), (3, None, 40), (None, None, 5)]
These Nones will need to be filtered out before the choose step; filter will help with that. (Note that filter(None, ...) also drops other falsy values such as 0 or empty strings; see the sketch at the end of this answer if that matters for your data.)
>>> f = partial(filter, None)
>>> list(map(list, map(f, zip_longest(*lists))))
[[1, 10, 20], [2, 30], [3, 40], [5]]
It should be clear what I'm trying to do. The ith index of the output contains those elements present at l[i], for every l in lists.
Now, iterate over idxvals and choose:
>>> idxvals = map(f, zip_longest(*lists))
>>> [random.choice([*i]) for i in idxvals]
[10, 30, 3, 5]
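One caveat, as a sketch using the same imports as above: filter(None, ...) also drops falsy values such as 0 or empty strings. If your lists can contain those, filtering explicitly on None keeps them:
idxvals = ([v for v in col if v is not None] for col in zip_longest(*lists))
newl = [random.choice(col) for col in idxvals]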

Related

Create random sequence of comparison pairs (x, y) so that subsequent x and y values are not repeated

I have the following list:
item_list = [1, 2, 3, 4, 5]
I want to compare each item in the list to the other items to generate comparison pairs, such that the same comparisons (x, y) and (y, x) are not repeated (i.e. I don't want both [1, 5] and [5, 1]). For the 5 items in the list, this would generate a total of 10 comparison pairs (n*(n-1)/2). I also want to randomize the order of the pairs such that both x- and y-values aren't the same as the adjacent x- and y-values.
For example, this is fine:
[1, 5]
[3, 2]
[5, 4]
[4, 2]
...
But not this:
[1, 5]
[1, 4] <-- here the x-value is the same as the previous x-value
[2, 4] <-- here the y-value is the same as the previous y-value
[5, 3]
...
I have only been able to come up with a method in which I manually create the pairs by zipping two lists together (example below), but this is obviously very time-consuming (and would be even more so if I wanted to increase the list of items to 10, which would generate 45 pairs). I also can't randomize the order each time, otherwise I could get repetitions of the same x- or y-values.
x_list = [1, 4, 1, 3, 1, 4, 1, 2, 5, 3]
y_list = [2, 5, 3, 5, 4, 2, 5, 3, 2, 4]
zip_list = zip(x_list, y_list)
paired_list = list(zip_list)
print(paired_list)
I hope that makes sense. I am very new to coding, so any help would be much appreciated!
Edit: For context, my experiment involves displaying two images next to each other on the screen. I have a total of 5 images (labeled 1-5), hence the 5-item list. For each image pair, the participant must select one of the two images, which is why I don't want the same image displayed at the same time (e.g. [1, 1]), and I don't need the same pair repeated (e.g. [1, 5] and [5, 1]). I also want to make sure that each time the screen displays a new pair of images, both images, in their respective positions on the screen, change. So it doesn't matter if an image repeats in the sequence, so as long as it changes position (e.g. [4, 3] followed by [5, 4] is ok).
carrvo's answer is good, but doesn't guarantee the requirement that each iteration-step causes the x-value to change and the y-value to change.
(I'm also not a fan of mutability, shuffling in place, but in some contexts it's more performant)
I haven't thought of an elegant, concise implementation, but I do see a slightly clever trick: Because each pair appears only once, we're already guaranteed to have either x or y change, so if we see a pair for which they don't both change, we can just swap them.
I haven't tested this.
from itertools import combinations
from random import sample  # not cryptographically secure

def human_random_pairs(items):
    n = len(items)
    random_pairs = sample(list(combinations(items, 2)),
                          n * (n - 1) // 2)
    def generator():
        old = random_pairs[0]
        yield old
        for new in random_pairs[1:]:
            collision = old[0] == new[0] or old[1] == new[1]  # or you can use any, a comprehension, and zip; your choice.
            old = tuple(reversed(new)) if collision else new
            yield old
    return tuple(generator())
This wraps the output in a tuple; you can use a list if you like, or depending on your usage you can probably unwrap the inner function and just yield directly from human_random_pairs, in which case it will "return" the iterable/generator.
Oh, actually we can use itertools.accumulate:
from itertools import accumulate, combinations, starmap
from operator import eq
from random import sample  # not cryptographically secure

def human_random_pairs(items):
    n = len(items)
    def maybe_flip_second(fst, snd):
        return tuple(reversed(snd)) if any(starmap(eq, zip(fst, snd))) else snd
    return tuple(  # this outer wrapper is optional
        accumulate(sample(list(combinations(items, 2)), n * (n - 1) // 2),  # len(combinations) = n! / r! / (n-r)!
                   maybe_flip_second)
    )
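A quick usage check might look like this (the ordering is random, so only the length and the no-repeat property are asserted; this is a sketch):
pairs = human_random_pairs([1, 2, 3, 4, 5])
assert len(pairs) == 10  # n*(n-1)/2 pairs for n = 5
# no adjacent pair repeats an x-value or a y-value
assert all(a[0] != b[0] and a[1] != b[1] for a, b in zip(pairs, pairs[1:]))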
I had to look up how to generate combinations and random because I have not used them so often, but you should be looking for something like the following:
from itertools import combinations
from random import shuffle
item_list = range(1, 6) # [1, 2, 3, 4, 5]
paired_list = list(combinations(item_list, 2))
shuffle(paired_list)
print(paired_list)
Thank you for the contributions! I'm posting the solution I ended up using below for anyone who might be interested. It uses carrvo's code for generating random comparisons and the pair-reversal idea from ShapeOfMatter. Overall it does not look very elegant and can likely be simplified, but it at least generates the desired output.
from itertools import combinations
import random

# Create image pair comparisons and randomize order
no_of_images = 5
image_list = range(1, no_of_images+1)
pairs_list = list(combinations(image_list, 2))
random.shuffle(pairs_list)
print(pairs_list)

# Create new comparisons sequence with no x- or y-value repeats, by reversing pairs that clash
trial_list = []
trial_list.append(pairs_list[0])  # append first image pair
binary_list = [0]  # check if preceding pairs have been reversed or not (0 = not reversed, 1 = reversed)

# For subsequent pairs, if x- or y-values are repeated, reverse the pair
for i in range(len(pairs_list)-1):
    # if previous pair was reversed, check against new pair
    if binary_list[i] == 1:
        if trial_list[i][0] == pairs_list[i+1][0] or trial_list[i][1] == pairs_list[i+1][1]:
            trial_list.append(tuple(reversed(pairs_list[i+1])))  # if x- or y-value repeats, reverse pair
            binary_list.append(1)  # flag reversal
        else:
            trial_list.append(pairs_list[i+1])
            binary_list.append(0)
    # if previous pair was not reversed, check against old pair
    elif binary_list[i] == 0:
        if pairs_list[i][0] == pairs_list[i+1][0] or pairs_list[i][1] == pairs_list[i+1][1]:
            trial_list.append(tuple(reversed(pairs_list[i+1])))  # if x- or y-value repeats, reverse pair
            binary_list.append(1)  # flag reversal
        else:
            trial_list.append(pairs_list[i+1])
            binary_list.append(0)

print(trial_list)

How to find sum of different combinations of all products in list (python)

I have the following two python lists of numbers:
list1 = [0, 2, 5, 10, 20, 50, 100]
list2 = [2, 4, 6, 8, 10, 20]
I can get the Cartesian product of the two (all the pairs) like below:
from itertools import product
combo = list(product(list1, list2))
I then have a class where I have created a method which is simply the plus or minus of the product (depending on other class variables) of a given element from the two lists, e.g.
class1 = class.method(x, y)
class1 = class.method(list1[1], list2[5])
class1 = class.method(2, 20)
class1 = 40
I want to get the sum of all instances I have of the class for all possible combinations of the products. These can be repeated (i.e. (2, 20) can be used more than once, and so on) so the amount of combinations will be very large.
My issue is that I am not sure how to loop through the whole combined list when my number of instances grows to a very large amount. The only thing I can think of so far is as follows which just results in blanks.
pos_combos = []
for i in combo:
    x, y = i
    if (class1.method(x, y)
            + class2.method(x, y)
            + class3.method(x, y)
            …
            + class98.method(x, y)
            + class99.method(x, y)
            + class100.method(x, y)) > 0:
        pos_combos.append(i)
print(pos_combos)
I can get the different combinations I want to use doing the below.
combo2 = list(product(combo, repeat=100))
But this is too long to even print, not to mention the issue of passing each element of this through the equation.
Does anyone have any ideas on how to do this? I'm thinking that it's maybe too big a task for a simple for loop/function and that some more sophisticated method may be needed (e.g. some form of ML).
Maybe I could keep the two lists separate and loop through each somehow?
If someone could even point me in right direction it would be greatly appreciated
Thanks.
EDIT: a minimum reproducible example would be as follows:
If every second instance were a minus product, then (given that tuples can be repeated) one example would be bet1 using (2, 20) and bet2 through bet100 all using (2, 5), so 99 times.
Then one entry in the pos_combos list would be the tuple of tuples
[((2, 20), ... (2, 5))]
as the sum of these is > 0 (it equals 40).
I want to get the list of all others which meet this criteria.
If you already have your list combo, then you should be able to do the following:
import itertools
sum(i * j for i, j in itertools.combinations(combo, 2))
If combo = [1, 2, 3], then the above code is doing the following:
(1 * 2) + (1 * 3) + (2 * 3)
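As a quick sanity check with a small list:
import itertools

combo = [1, 2, 3]
total = sum(i * j for i, j in itertools.combinations(combo, 2))
print(total)  # (1*2) + (1*3) + (2*3) = 11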

Fast sorting of large nested lists

I am looking to find out the likelihood of parameter combinations using Monte Carlo Simulation.
I've got 4 parameters and each can have about 250 values.
I have randomly generated 250,000 scenarios for each of those parameters using some probability distribution function.
I now want to find out which parameter combinations are the most likely to occur.
To achieve this I have started by filtering out any duplicates from my 250,000 randomly generated samples in order to reduce the length of the list.
I then iterated through this reduced list and checked how many times each scenario occurs in the original 250,000 long list.
I have a large list of 250,000 items which contains lists, as such :
a = [[1,2,5,8],[1,2,5,8],[3,4,5,6],[3,4,5,7],....,[3,4,5,7]]# len(a) is equal to 250,000
I want to find a fast and efficient way of having each list within my list only occurring once.
The end goal is to count the occurrences of each list within list a.
so far I've got:
'''Removing duplicates from list a and storing this as a new list temp'''
b_set = set(tuple(x) for x in a)
temp = [list(x) for x in b_set]
temp.sort(key=lambda x: a.index(x))

'''I then iterate through each of my possible lists (i.e. temp) and count how many times they occur in a'''
most_likely_dict = {}
for scenario in temp:
    freq = a.count(scenario)
    most_likely_dict[str(scenario)] = freq
at the moment it takes a good 15 minutes to perform ... Any suggestion on how to turn that into a few seconds would be greatly appreciated !!
You can take out the sorting part, as the final result is a dictionary which will be unordered in any case, then use a dict comprehension:
>>> a = [[1,2],[1,2],[3,4,5],[3,4,5], [3,4,5]]
>>> a_tupled = [tuple(i) for i in a]
>>> b_set = set(a_tupled)
>>> {repr(i): a_tupled.count(i) for i in b_set}
{'(1, 2)': 2, '(3, 4, 5)': 3}
calling list on your tuples will add more overhead, but you can if you want to
>>> {repr(list(i)): a_tupled.count(i) for i in b_set}
{'[3, 4, 5]': 3, '[1, 2]': 2}
Or just use a Counter:
>>> from collections import Counter
>>> Counter(tuple(i) for i in a)
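For the same a as above, that gives something like the following (Counter orders its repr by count):
Counter({(3, 4, 5): 3, (1, 2): 2})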
Or, as a plain dict comprehension with string keys:
{str(item): a.count(item) for item in a}
Input :
a = [[1,2,5,8],[1,2,5,8],[3,4,5,6],[3,4,5,7],[3,4,5,7]]
Output :
{'[3, 4, 5, 6]': 1, '[1, 2, 5, 8]': 2, '[3, 4, 5, 7]': 2}

Remove duplicates from one Python list, prune other lists based on it

I have a problem that's easy enough to do in an ugly way, but I'm wondering if there's a more Pythonic way of doing it.
Say I have three lists, A, B and C.
A = [1, 1, 2, 3, 4, 4, 5, 5, 3]
B = [1, 2, 3, 4, 5, 6, 7, 8, 9]
C = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# The actual data isn't important.
I need to remove all duplicates from list A, but when a duplicate entry is deleted, I would like the corresponding indexes removed from B and C:
A = [1, 2, 3, 4, 5]
B = [1, 3, 4, 5, 7]
C = [1, 3, 4, 5, 7]
This is easy enough to do with longer code by moving everything to new lists:
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A:
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])
But is there a more elegant and efficient (and less repetitive) way of doing this? This could get cumbersome if the number of lists grows, which it might.
Zip the three lists together, uniquify based on the first element, then unzip:
from operator import itemgetter
from more_itertools import unique_everseen
abc = zip(a, b, c)
abc_unique = unique_everseen(abc, key=itemgetter(0))
a, b, c = zip(*abc_unique)
This is a very common pattern. Whenever you want to do anything in lock step over a bunch of lists (or other iterables), you zip them together and loop over the result.
Also, if you go from 3 lists to 42 of them ("This could get cumbersome if the number of lists grows, which it might."), this is trivial to extend:
abc = zip(*list_of_lists)
abc_unique = unique_everseen(abc, key=itemgetter(0))
list_of_lists = zip(*abc_unique)
Once you get the hang of zip, the "uniquify" is the only hard part, so let me explain it.
Your existing code checks whether each element has been seen by searching for each one in new_A. Since new_A is a list, this means that if you have N elements, M of them unique, on average you're going to be doing M/2 comparisons for each of those N elements. Plug in some big numbers, and NM/2 gets pretty big—e.g., 1 million values, a half of them unique, and you're doing 250 billion comparisons.
To avoid that quadratic time, you use a set. A set can test an element for membership in constant, rather than linear, time. So, instead of 250 billion comparisons, that's 1 million hash lookups.
If you don't need to maintain order or decorate-process-undecorate the values, just copy the list to a set and you're done. If you need to decorate, you can use a dict instead of a set (with the key as the dict keys, and everything else hidden in the values). To preserve order, you could use an OrderedDict, but at that point it's easier to just use a list and a set side by side. For example, the smallest change to your code that works is:
new_A_set = set()
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A_set:
        new_A_set.add(A[i])
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])
But this can be generalized—and should be, especially if you're planning to expand from 3 lists to a whole lot of them.
The recipes in the itertools documentation include a function called unique_everseen that generalizes exactly what we want. You can copy and paste it into your code, write a simplified version yourself, or pip install more-itertools and use someone else's implementation (as I did above).
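For reference, a simplified version of that recipe (ignoring the unhashable-key fallback the full recipe handles) might look like this:
def unique_everseen(iterable, key=None):
    # yield elements in order, skipping any whose key has been seen before
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element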
PadraicCunningham asks:
how efficient is zip(*unique_everseen(zip(a, b, c), key=itemgetter(0)))?
If there are N elements, M unique, it's O(N) time and O(M) space.
In fact, it's effectively doing the same work as the 10-line version above. In both cases, the only work that's not obviously trivial inside the loop is key in seen and seen.add(key), and since both operations are amortized constant time for set, that means the whole thing is O(N) time. In practice, for N=1000000, M=100000 the two versions are about 278ms and 297ms (I forget which is which) compared to minutes for the quadratic version. You could probably micro-optimize that down to 250ms or so—but it's hard to imagine a case where you'd need that, but wouldn't benefit from running it in PyPy instead of CPython, or writing it in Cython or C, or numpy-izing it, or getting a faster computer, or parallelizing it.
As for space, the explicit version makes it pretty obvious. Like any conceivable non-mutating algorithm, we've got the three new_Foo lists around at the same time as the original lists, and we've also added new_A_set of the same size. Since all of those are length M, that's 4M space. We could cut that in half by doing one pass to get indices, then doing the same thing mu 無's answer does:
indices = [index for index, value in unique_everseen(enumerate(a), key=itemgetter(1))]
a = [a[index] for index in indices]
b = [b[index] for index in indices]
c = [c[index] for index in indices]
But there's no way to go lower than that; you have to have at least a set and a list of length M alive to uniquify a list of length N in linear time.
If you really need to save space, you can mutate all three lists in-place. But this is a lot more complicated, and a bit slower (although still linear*).
Also, it's worth noting another advantage of the zip version: it works on any iterables. You can feed it three lazy iterators, and it won't have to instantiate them eagerly. I don't think it's doable in 2M space, but it's not too hard in 3M:
indices, a = zip(*unique_everseen(enumerate(a), key=itemgetter(1)))
indices = set(indices)
b = [value for index, value in enumerate(b) if index in indices]
c = [value for index, value in enumerate(c) if index in indices]
* Note that just del c[i] will make it quadratic, because deleting from the middle of a list takes linear time. Fortunately, that linear time is a giant memmove that's orders of magnitude faster than the equivalent number of Python assignments, so if N isn't too big you can get away with it—in fact, at N=100000, M=10000 it's twice as fast as the immutable version… But if N might be too big, you have to instead replace each duplicate element with a sentinel, then loop over the list in a second pass so you can shift each element only once, which is instead 50% slower than the immutable version.
How about this - basically get a set of all unique elements of A, and then get their indices, and create a new list based on these indices.
new_A = list(set(A))
indices_to_copy = [A.index(element) for element in new_A]
new_B = [B[index] for index in indices_to_copy]
new_C = [C[index] for index in indices_to_copy]
You can write a function for the second statement, for reuse:
def get_new_list(original_list, indices):
    return [original_list[idx] for idx in indices]
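The two list-building lines above then become:
new_B = get_new_list(B, indices_to_copy)
new_C = get_new_list(C, indices_to_copy)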

Fix first element, shuffle the rest of a list/array

Is it possible to shuffle only a (continuous) part of a given list (or array in numpy)?
If this is not generally possible, how about the special case where the first element is fixed while the rest of the list/array need to be shuffled? For example, I have a list/array:
to_be_shuffled = [None, 'a', 'b', 'c', 'd', ...]
where the first element should always stay, while the rest are going to be shuffled repeatedly.
One possible way is to shuffle the whole list first and then check the first element; if it is not the special fixed element (e.g. None), swap its position with that of the special element (which would require a lookup).
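That workaround might look roughly like this (only a sketch of the approach just described; shuffle_then_swap is a made-up name):
import random

def shuffle_then_swap(lst, fixed):
    # shuffle everything, then swap the fixed element back to index 0
    # if the shuffle moved it (the lookup mentioned above)
    random.shuffle(lst)
    if lst[0] != fixed:
        i = lst.index(fixed)
        lst[0], lst[i] = lst[i], lst[0]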
Is there any better way for doing this?
Why not just
import random
rest = to_be_shuffled[1:]
random.shuffle(rest)
shuffled_lst = [to_be_shuffled[0]] + rest
numpy arrays don't copy data on (basic) slicing, so you can shuffle a view of the array in place:
numpy.random.shuffle(a[1:])
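For example (a small sketch; a[1:] is a view, so shuffling it rearranges a itself):
import numpy as np

a = np.array([99, 1, 2, 3, 4, 5])
np.random.shuffle(a[1:])  # shuffles a in place, leaving a[0] alone
print(a)                  # a[0] is still 99; the rest are permuted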
I thought it would be interesting and educational to try to implement a slightly more general approach than what you're asking for. Here I shuffle the indices to the original list (rather than the list itself), excluding the locked indices, and use that index-list to cherry pick elements from the original list. This is not an in-place solution, but implemented as a generator so you can lazily pick elements.
Feel free to edit if you can improve it.
import random

def partial_shuf(input_list, fixed_indices):
    """Given an input_list, yield elements from that list in random order
    except where elements' indices are in fixed_indices."""
    fixed_indices = sorted(set(i for i in fixed_indices if i < len(input_list)))
    i = 0
    for fixed in fixed_indices:
        aslice = list(range(i, fixed))  # shuffle needs a mutable sequence
        i = 1 + fixed
        random.shuffle(aslice)
        for j in aslice:
            yield input_list[j]
        yield input_list[fixed]
    aslice = list(range(i, len(input_list)))
    random.shuffle(aslice)
    for j in aslice:
        yield input_list[j]

print('\n'.join(' '.join((str(i), str(n)))
                for i, n in enumerate(partial_shuf(range(4, 36), [0, 4, 9, 17, 25, 40]))))
assert sorted(partial_shuf(range(4, 36), [0, 4, 9, 17, 25, 40])) == list(range(4, 36))
I took the shuffle function from the standard library random module (found in Lib\random.py) and modified it slightly so that it shuffles only a portion of the list specified by start and stop. It does this in place. Enjoy!
from random import randint

def shuffle(x, start=0, stop=None):
    if stop is None:
        stop = len(x)
    for i in reversed(range(start + 1, stop)):
        # pick an element in x[start: i+1] with which to exchange x[i]
        j = randint(start, i)
        x[i], x[j] = x[j], x[i]
For your purposes, calling this function with 1 as the start parameter should do the trick.
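For example (the shuffled order will vary):
lst = [None, 'a', 'b', 'c', 'd']
shuffle(lst, start=1)
print(lst)  # e.g. [None, 'c', 'a', 'd', 'b']; the first element stays put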
