Efficient way to look in list of lists? - python

I am continuously creating a randomly generated list, New_X, of size 10, based on 500 columns.
Each new list must be unique: my function NewList returns New_X only once it has confirmed that New_X has not already been created and appended to List_Of_Xs.
def NewList(Old_List):
    end = True
    while end == True:
        """ Here is code that generates my new sorted list, it is a combination of elements
        from Old_List and the other remaining columns,
        but the details aren't necessary for this question. """
        end = (List_Of_Xs == np.array([New_X])).all(axis=1).any()
    List_Of_Xs.append(New_X)
    return New_X
My question is, is the line end = (List_Of_Xs == np.array([New_X])).all(axis=1).any() an efficient way of looking in List_Of_Xs?
My List_Of_Xs can grow to a size of over 100,000 lists long, so I am unsure if this is inefficient or not.
Any help would be appreciated!

As I observed in a comment, the array comparison is potentially quite slow, especially as the list gets large. It has to create arrays each time, which consumes time.
Here's a set implementation.
Function to create a 10-element list:
def foo(N=10):
    return np.random.randint(0,10,N).tolist()
Function to generate lists and print the unique ones:
def foo1(m=10):
    Set_of_Xs = set()
    while len(Set_of_Xs)<m:
        NewX = foo(10)
        tx = tuple(NewX)
        if tx not in Set_of_Xs:
            print(NewX)
            Set_of_Xs.add(tx)
    return Set_of_Xs
Sample run. As written it doesn't show if there are duplicates.
In [214]: foo1(5)
[9, 4, 3, 0, 9, 4, 9, 5, 6, 3]
[1, 8, 0, 3, 0, 0, 4, 0, 0, 5]
[6, 7, 2, 0, 6, 9, 0, 7, 0, 8]
[9, 5, 6, 3, 3, 5, 6, 9, 6, 9]
[9, 2, 6, 0, 2, 7, 2, 0, 0, 4]
Out[214]:
{(1, 8, 0, 3, 0, 0, 4, 0, 0, 5),
(6, 7, 2, 0, 6, 9, 0, 7, 0, 8),
(9, 2, 6, 0, 2, 7, 2, 0, 0, 4),
(9, 4, 3, 0, 9, 4, 9, 5, 6, 3),
(9, 5, 6, 3, 3, 5, 6, 9, 6, 9)}

So let me get this straight since the code doesn't appear complete:
1. You have an old list that is constantly growing with each iteration
2. You calculate a list
3. You compare it against each of the lists in the old list to see if you should break the loop?
One option is to store the lists in a set instead of a List of List.
Comparing an element against all the elements of a list would be an O(n) operation each iteration. Using a set it should be O(1) avg... Although you may be getting O(n) every iteration until the last.
Other thoughts would be to calculate the md5 of each element and compare those so you're not comparing the full lists.
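A minimal sketch of the set-based idea, applied to a function shaped like the asker's NewList. The names generate_candidate, new_list and seen are hypothetical, and the generator is only a stand-in (a sorted sample of 10 out of 500 column indices), since the real generation code was not shown:

```python
import random

def generate_candidate(n_columns=500, size=10):
    # Hypothetical stand-in for the generation code the question omits:
    # a sorted sample of `size` distinct column indices.
    return sorted(random.sample(range(n_columns), size))

def new_list(seen):
    # Keep generating until the candidate has not been produced before.
    while True:
        candidate = generate_candidate()
        key = tuple(candidate)   # lists are unhashable, tuples are hashable
        if key not in seen:      # average O(1) membership test
            seen.add(key)
            return candidate

seen = set()
results = [new_list(seen) for _ in range(1000)]
```

Because the membership test does not scan the stored lists, the cost per check should stay roughly constant as the collection grows towards 100,000 entries.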

Related

Random sampling a large Cartesian product of iterables

I have multiple iterables and I need to create the Cartesian product of those iterables and then randomly sample from the resulting pool of tuples. The problem is that the total number of combinations of these iterables is somewhere around 1e19, so I can't possibly load all of this into memory.
What I thought was using itertools.product in combination with a random number generator to skip random number of items, then once I get to the randomly selected item, I perform my calculations and continue until I run out of the generator. The plan was to do something like:
from itertools import product
from random import randint
iterables = () # tuple of 18 iterables
versions = product(iterables)
def do_stuff():
    # do stuff
STEP_SIZE = int(1e6)
# start both counts from 0.
# First value to be taken is start + step
# after that increment start to be equal to count and repeat
start = 0
count = 0
while True:
    try:
        step = randint(1, 100) * STEP_SIZE
        for v in versions:
            # if the count is less than required skip values while incrementing count
            if count < start + step:
                versions.next()
                count += 1
            else:
                do_stuff(*v)
                start = count
    except StopIteration:
        break
Unfortunately, itertools.product objects don't have the next() method, so this doesn't work. What other way is there to go through this large number of tuples and either take a random sample or directly run calculations on the values?
Don't try to generate the Cartesian product. Sample from one iterable at a time to generate your result using random.choice(). The number of elements across all iterables is small, so you can store all the elements in memory directly.
Here's an example using 18 iterables with 10 elements each (as specified in the comment):
import random
iterables = [list(range(i, i + 10)) for i in range(0, 180, 10)]
result = [random.choice(iterable) for iterable in iterables]
print(result)
Which version of Python are you using? Somewhere along the way, .next() methods were deprecated in favor of a new next() built-in function. That works fine with all iterators. Here, for example, under the currently released 3.10.1:
>>> import itertools
>>> itp = itertools.product(range(5), repeat=6)
>>> next(itp)
(0, 0, 0, 0, 0, 0)
>>> next(itp)
(0, 0, 0, 0, 0, 1)
>>> next(itp)
(0, 0, 0, 0, 0, 2)
>>> next(itp)
(0, 0, 0, 0, 0, 3)
>>> for ignore in range(50):
...     ignore = next(itp)
...
>>> next(itp)
(0, 0, 0, 2, 0, 4)
Beyond that, you didn't show us the most important part of your code: how you build your product.
Without seeing that, I can only guess that it would be far more efficient to make a random choice from the first sequence passed to product(), then another from the second, and so on. Build a random element of the product from one component at a time.
Picking a random product tuple efficiently
Perhaps overkill, but this class shows an especially efficient way to do this. The .index() method maps an integer i to the i'th tuple (0-based) in the product. Then picking a random tuple from the product is simply applying .index() to a random integer in range(total number of elements in the product).
from math import prod
from random import randrange
class RanProduct:
    def __init__(self, iterables):
        self.its = list(map(list, iterables))
        self.n = prod(map(len, self.its))
    def index(self, i):
        if i not in range(self.n):
            raise ValueError(f"index {i} not in range({self.n})")
        result = []
        for it in reversed(self.its):
            i, r = divmod(i, len(it))
            result.append(it[r])
        return tuple(reversed(result))
    def pickran(self):
        return self.index(randrange(self.n))
and then
>>> r = RanProduct(["abc", range(2)])
>>> for i in range(6):
...     print(i, '->', r.index(i))
...
0 -> ('a', 0)
1 -> ('a', 1)
2 -> ('b', 0)
3 -> ('b', 1)
4 -> ('c', 0)
5 -> ('c', 1)
>>> r = RanProduct([range(10)] * 19)
>>> r.pickran()
(3, 5, 8, 8, 3, 6, 7, 6, 8, 6, 2, 0, 5, 6, 1, 0, 0, 8, 2)
>>> r.pickran()
(4, 5, 0, 5, 7, 1, 7, 2, 7, 4, 8, 4, 2, 0, 2, 9, 3, 6, 2)
>>> r.pickran()
(8, 7, 4, 1, 3, 0, 4, 6, 4, 3, 9, 8, 5, 8, 9, 9, 7, 1, 8)
>>> r.pickran()
(8, 6, 6, 0, 6, 7, 1, 3, 9, 5, 1, 4, 5, 8, 6, 8, 4, 9, 9)
>>> r.pickran()
(4, 9, 4, 7, 1, 5, 5, 1, 6, 7, 1, 8, 9, 0, 7, 9, 1, 7, 0)
>>> r.pickran()
(3, 0, 3, 9, 8, 6, 3, 0, 3, 0, 9, 9, 3, 5, 2, 3, 7, 8, 8)
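If the goal is several distinct tuples rather than one, the same index-decoding idea can be combined with a set of already-drawn indices. This is only a sketch: the function names nth_product and sample_product are made up for illustration, and the re-draw loop assumes k is much smaller than the size of the product.

```python
from math import prod
from random import randrange

def nth_product(pools, i):
    # Decode index i as a mixed-radix number, last pool varying fastest,
    # which matches the order itertools.product would produce.
    out = []
    for pool in reversed(pools):
        i, r = divmod(i, len(pool))
        out.append(pool[r])
    return tuple(reversed(out))

def sample_product(iterables, k):
    pools = [list(it) for it in iterables]
    total = prod(len(p) for p in pools)
    if k > total:
        raise ValueError("sample larger than the product")
    chosen = set()
    while len(chosen) < k:           # a duplicate index is simply re-drawn
        chosen.add(randrange(total))
    return [nth_product(pools, i) for i in chosen]

picks = sample_product([range(10)] * 19, 5)
```

Distinct indices map to distinct positions in the product, so no position is drawn twice, and only the pools themselves are ever held in memory.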

list of unique elements formed by concatenating permutations of the initial lists

I would like to combine several lists; each list should be preserved up to a permutation.
Here is an example:
I would like to combine these lists
[[0, 7], [2, 4], [0, 1, 7], [0, 1, 4, 7]]
The output I would like to obtain is e.g. this list
[2, 4, 0, 7, 1]
Or as Sembei Norimaki phrased the task:
the result must be a list of unique elements formed by concatenating permutations of the initial lists.
The solution is not unique, and a solution may not always exist.
Third time lucky. This is a bit cheesy - it checks every permutation of the source list elements to see which ones are valid:
from itertools import permutations
def check_sublist(sublist, candidate):
    # a permutation of sublist must exist within the candidate list
    sublist = set(sublist)
    # check each len(sublist) portion of candidate
    for i in range(1 + len(candidate) - len(sublist)):
        if sublist == set(candidate[i : i + len(sublist)]):
            return True
    return False
def check_list(input_list, candidate):
    for sublist in input_list:
        if not check_sublist(sublist, candidate):
            return False
    return True
def find_candidate(input_list):
    # flatten input_list and make set of unique values
    values = {x for sublist in input_list for x in sublist}
    for per in permutations(values):
        if check_list(input_list, per):
            print(per)
find_candidate([[0, 7], [2, 4], [0, 1, 7], [0, 1, 4, 7]])
# (0, 7, 1, 4, 2)
# (1, 0, 7, 4, 2)
# (1, 7, 0, 4, 2)
# (2, 4, 0, 7, 1)
# (2, 4, 1, 0, 7)
# (2, 4, 1, 7, 0)
# (2, 4, 7, 0, 1)
# (7, 0, 1, 4, 2)
You'd definitely do better applying a knowledge of graph theory and using a graphing library, but that's beyond my wheelhouse at present!

How to return the order statistic of a whole array?

I have searched the web but could not find a solution. If I have an array, let's say:
x=[17, 1, 2, 7, 8, 5, 27, 29]
I am searching for an easy way such that a vector of order statistics, i.e.
y=[6, 1, 2, 4, 5, 3, 7, 8]
is returned. Of course, it can also be indexed starting from zero (as is typical for Python). Additionally, if two or more entries have the same value, as in:
x=[17, 1, 2, 1, 8, 5, 27, 29]
it would be perfect to get a result like this:
y=[6, 2, 3, 2, 5, 4, 7, 8]
Basically, since I don't have LaTeX here, what I want for each entry is:
"the number of entries smaller than or equal to this number". So each entry equal to 1 has two numbers that are smaller than or equal to it, and therefore the desired entry is 2.
Use sorted:
s = sorted(x)
[s.index(i) + 1 for i in x]
Output:
[6, 1, 2, 4, 5, 3, 7, 8]
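Note that list indices start at 0, so the + 1 makes the result 1-based, which is slightly unconventional in Python; remember to subtract 1 again if you later use these values to index back into s. Also note that s.index returns the first position of a repeated value, so tied entries get the lower rank.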
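For the tie rule the question asks for (the number of entries smaller than or equal to each value), one possible alternative is bisect_right from the standard library on the sorted copy. This is a sketch of a different technique from the s.index answer above, and the function name order_statistics is made up for illustration:

```python
from bisect import bisect_right

def order_statistics(x):
    s = sorted(x)
    # bisect_right(s, v) is the number of elements of s that are <= v
    return [bisect_right(s, v) for v in x]

print(order_statistics([17, 1, 2, 7, 8, 5, 27, 29]))  # [6, 1, 2, 4, 5, 3, 7, 8]
print(order_statistics([17, 1, 2, 1, 8, 5, 27, 29]))  # [6, 2, 3, 2, 5, 4, 7, 8]
```

Each lookup is a binary search, so the whole computation is O(n log n) rather than the O(n**2) of repeated s.index calls.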

Separate lists based on indices

I have 2 lists:
data = [0, 1, 2, 3, 7, 8, 9, 10]
indices = [1, 1, 0, 0, 0, 2, 1, 0]
I want to append the data to a 2-D array given the indices which correspond to the 2-D array. meaning:
new_list = [[]]*len(set(indices))
where new_list should end up as follows:
new_list = [[2,3,7,10],[0,1,9],[8]]
I am using this code:
for i in range(len(set(indices))):
    for j in range(len(indices)):
        if indices[j] == i:
            new_list[i].append(data[j])
        else:
            pass
However, I get this:
new_list = [[2, 3, 7, 10, 0, 1, 9, 8], [2, 3, 7, 10, 0, 1, 9, 8], [2, 3, 7, 10, 0, 1, 9, 8]]
I am not sure what mistake I am making; any help is appreciated!
You can use a dict to map the values to their respective indices, and then use a range to output them in order, so that this will only cost O(n) in time complexity:
d = {}
for i, n in zip(indices, data):
    d.setdefault(i, []).append(n)
newlist = [d[i] for i in range(len(d))]
newlist becomes:
[[2, 3, 7, 10], [0, 1, 9], [8]]
You're iterating your indices completely for every value, which is wasteful. You're also multiplying a list of lists, which doesn't do what you expect (it makes a list of many references to the same underlying list). You want to pair up indices and values instead (so you do O(n) work, not O(n**2)), which is what zip was made for, and make your list of empty lists safely (a list of several independent lists):
data = [0, 1, 2, 3, 7, 8, 9, 10]
indices = [1, 1, 0, 0, 0, 2, 1, 0]
# Use max because you really care about the biggest index, not the number of unique indices
# A list comprehension producing [] each time produces a *new* list each time
new_list = [[] for _ in range(max(indices)+1)]
# Iterate each datum and matching index in parallel exactly once
for datum, idx in zip(data, indices):
    new_list[idx].append(datum)
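The aliasing behind the unexpected output can be seen in isolation with a small demonstration (the variable names shared and independent are just for illustration):

```python
shared = [[]] * 3                      # three references to ONE list object
shared[0].append("x")
print(shared)                          # [['x'], ['x'], ['x']]
print(shared[0] is shared[1])          # True

independent = [[] for _ in range(3)]   # three distinct list objects
independent[0].append("x")
print(independent)                     # [['x'], [], []]
print(independent[0] is independent[1])  # False
```

Appending through any one slot of the multiplied list changes what every slot shows, which is why all three sublists in the question ended up identical.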
To get at this, I zipped the data with its index:
>>> data = [0, 1, 2, 3, 7, 8, 9, 10]
>>> indices = [1, 1, 0, 0, 0, 2, 1, 0]
>>> buff = sorted(list(zip(indices,data)))
>>> print(buff)
[(0, 2), (0, 3), (0, 7), (0, 10), (1, 0), (1, 1), (1, 9), (2, 8)]
Then I used the set of unique indices as a way to determine if the data gets included in a new list. This is done with nested list comprehensions.
>>> new_list = list(list((b[1] for b in buff if b[0]==x)) for x in set(indices))
>>> print(new_list)
[[2, 3, 7, 10], [0, 1, 9], [8]]
I hope this helps.

python: sampling without replacement from a 2D grid

I need a sample, without replacement, from among all possible tuples of numbers from range(n). That is, I have a collection of (0,0), (0,1), ..., (0,n-1), (1,0), (1,1), ..., (1,n-1), ..., (n-1,0), (n-1,1), ..., (n-1,n-1), and I'm trying to get a sample of k of those elements. I am hoping to avoid explicitly building this collection.
I know random.sample(range(n), k) is simple and efficient if I needed a sample from a sequence of numbers rather than tuples of numbers.
Of course, I can explicitly build the list containing all possible (n * n = n^2) tuples, and then call random.sample. But that probably is not efficient if k is much smaller than n^2.
I am not sure if things work the same in Python 2 and 3 in terms of efficiency; I use Python 3.
Depending on how many of these you're selecting, it might be simplest to just keep track of what things you've already picked (via a set) and then re-pick until you get something that you haven't picked already.
The other option is to just use some simple math:
numbers_in_nxn = random.sample(range(n*n), k) # Use xrange in Python 2.x
tuples_in_nxn = [divmod(x,n) for x in numbers_in_nxn]
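Both suggestions can be sketched as small functions for Python 3. The function names sample_grid and sample_grid_rejection are made up for illustration, and the rejection variant assumes k is well below n*n so that re-picks stay rare:

```python
import random

def sample_grid(n, k):
    # Sample k distinct integers from 0..n*n-1 without building a list,
    # then decode each one as (row, col) = divmod(x, n).
    return [divmod(x, n) for x in random.sample(range(n * n), k)]

def sample_grid_rejection(n, k):
    # Alternative: draw pairs directly; the set silently absorbs duplicates,
    # so the loop keeps drawing until k distinct pairs are collected.
    if k > n * n:
        raise ValueError("k cannot exceed n*n")
    picked = set()
    while len(picked) < k:
        picked.add((random.randrange(n), random.randrange(n)))
    return list(picked)

pairs = sample_grid(1000, 50)
pairs_rejection = sample_grid_rejection(1000, 50)
```

In Python 3, range(n * n) is a lazy sequence, so random.sample can draw from it without materialising all n*n values.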
You say:
Of course, I can explicitly build the list containing all possible (n * n = n^2) tuples, and then call random.sample. But that probably is not efficient if k is much smaller than n^2.
Well, how about building the tuple after you have randomly picked one? That is, if you can build the tuples before you randomly choose which one to pick, you can do the picking first and the building later.
I don't understand exactly how your tuples are supposed to look, but here is an example. I realize your tuples are all of the same length, but this shows the principle:
Instead of doing this:
>>> import random
>>> all_sequences = [range(x) for x in range(10)]
>>> all_sequences
[[], [0], [0, 1], [0, 1, 2], [0, 1, 2, 3], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5, 6], [0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3, 4, 5, 6, 7, 8]]
>>> random.sample(all_sequences, 3)
[[0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5, 6, 7, 8]]
You would do this:
>>> import random
>>> selection = random.sample(range(10), 3)
>>> [range(a) for a in selection]  # builds only the three chosen sequences; output varies per run
Without trying (no python at hand):
random.shuffle(range(n))[:k]
See the comments: this does not work, because random.shuffle shuffles in place and returns None. Didn't sleep enough...
