Choosing python data structures to speed up algorithm implementation - python

So I'm given a large collection (roughly 200k) of lists. Each contains a subset of the numbers 0 through 27. I want to return two of the lists where the product of their lengths is greater than the product of the lengths of any other pair of lists. There's another condition, namely that the lists have no numbers in common.
There's an algorithm I found for this (can't remember the source, apologies for non-specificity of props) which exploits the fact that there are fewer total subsets of the numbers 0 through 27 than there are words in the dictionary.
The first thing I've done is loop through all the lists, find the unique set of integers each one contains, and index it as a number between 0 and 1<<28, as follows:
def index_lists(lists):
    index_hash = {}
    for raw_list in lists:
        length = len(raw_list)
        index = find_index(raw_list)
        if length > index_hash.get(index, {}).get("length", 0):
            index_hash[index] = {"list": raw_list, "length": length}
    return index_hash
This gives me the longest list and the length of that list for each subset that's actually contained in the collection of lists given. Naturally, not all subsets from 0 to (1<<28)-1 are necessarily included, since there's no guarantee the supplied collection has a list for each unique subset.
What I then want, for each subset 0 through 1<<28 (all of them this time), is the longest list that contains at most that subset. This is the part that is killing me. At a high level, it should, for each subset, first check to see if that subset is contained in the index_hash. It should then compare the length of that entry in the hash (if it exists there) to the lengths stored previously in the current hash for the current subset minus one number (this is an inner loop 28 strong, one per bit). The greatest of these is stored in this new hash for the current subset of the outer loop. The code right now looks like this:
def at_most_hash(index_hash):
    most_hash = {}
    for i in xrange(1 << 28):  # pretty sure this is a bad idea
        max_entry = index_hash.get(i)
        if max_entry:
            max_length = max_entry["length"]
            max_list = max_entry["list"]
        else:
            max_length = 0
            max_list = []
        for j in xrange(28):  # again, probably not great
            subset_index = i & ~(1 << j)  # gets us a pre-computed subset
            at_most_entry = most_hash.get(subset_index, {})
            at_most_length = at_most_entry.get("length", 0)
            if at_most_length > max_length:
                max_length = at_most_length
                max_list = at_most_entry["list"]
        most_hash[i] = {"length": max_length, "list": max_list}
    return most_hash
This loop obviously takes several forevers to complete. I feel that I'm new enough to python that my choice of how to iterate and what data structures to use may have been completely disastrous. Not to mention the prospective memory problems from attempting to fill the dictionary. Is there perhaps a better structure or package to use as data structures? Or a better way to set up the iteration? Or maybe I can do this more sparsely?
The next part of the algorithm just cycles through all the lists we were given and, for each one, takes the product of that list's subset's max_length and the complementary subset's max_length by looking them up in at_most_hash, keeping the max of those products.
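For concreteness, a rough sketch of what I mean by that final step (reusing find_index and the output of at_most_hash above) would be something like:

def best_disjoint_pair(lists, most_hash):
    # Pair each input list with the longest list that fits inside the
    # complementary subset, keeping the pair with the largest length product.
    full = (1 << 28) - 1
    best = (0, None, None)
    for raw_list in lists:
        i = find_index(raw_list)
        partner = most_hash.get(full ^ i, {"length": 0, "list": None})
        score = len(raw_list) * partner["length"]
        if score > best[0]:
            best = (score, raw_list, partner["list"])
    return best[1], best[2]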
Any suggestions here? I appreciate the patience for wading through my long-winded question and less than decent attempt at coding this up.
In theory, this is still a better approach than working with the collection of lists alone, since that approach is roughly O(200k^2) and this one is roughly O(28 * 2^28 + 200k), yet my implementation is holding me back.

Given that your indexes are just ints, you could save some time and space by using lists instead of dicts. I'd go further and bring in NumPy arrays. They offer compact storage representation and efficient operations that let you implicitly perform repetitive work in C, bypassing a ton of interpreter overhead.
Instead of index_hash, we start by building a NumPy array where index_array[i] is the length of the longest list whose set of elements is represented by i, or 0 if there is no such list:
import numpy

index_array = numpy.zeros(1 << 28, dtype=int)  # We could probably get away with dtype=int8.
for raw_list in lists:
    i = find_index(raw_list)
    index_array[i] = max(index_array[i], len(raw_list))
We then use NumPy operations to bubble up the lengths in C instead of interpreted Python. Things might get confusing from here:
for bit_index in xrange(28):
    index_array = index_array.reshape([1 << (28 - bit_index), 1 << bit_index])
    numpy.maximum(index_array[::2], index_array[1::2], out=index_array[1::2])
index_array = index_array.reshape([1 << 28])
Each reshape call takes a new view of the array where data in even-numbered rows corresponds to sets with the bit at bit_index clear, and data in odd-numbered rows corresponds to sets with the bit at bit_index set. The numpy.maximum call then performs the bubble-up operation for that bit. At the end, each cell index_array[i] of index_array represents the length of the longest list whose elements are a subset of set i.
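As a quick sanity check (not part of the original answer), the same bubble-up can be run on a tiny 3-bit universe, where the result is easy to verify by hand:

import numpy

n_bits = 3
a = numpy.zeros(1 << n_bits, dtype=int)
a[0b011] = 2  # pretend the longest list covering exactly {0, 1} has length 2
a[0b100] = 1  # and the longest list covering exactly {2} has length 1

for bit_index in xrange(n_bits):
    a = a.reshape([1 << (n_bits - bit_index), 1 << bit_index])
    numpy.maximum(a[::2], a[1::2], out=a[1::2])
a = a.reshape([1 << n_bits])

assert a[0b111] == 2  # {0, 1} fits inside {0, 1, 2}
assert a[0b110] == 1  # only {2} fits inside {1, 2}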
We then compute the products of lengths at complementary indices:
products = index_array * index_array[::-1]  # We'd probably have to adjust this part
                                            # if we picked dtype=int8 earlier.
find where the best product is:
best_product_index = products.argmax()
and the longest lists whose elements are subsets of the set represented by best_product_index and its complement are the lists we want.
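One way to recover those actual lists at that point (a sketch, not part of the original answer; it reuses find_index from the question and rescans lists):

def longest_list_within(mask):
    # longest input list whose bitset is contained in `mask`
    return max((l for l in lists if find_index(l) & mask == find_index(l)),
               key=len)

best_index = int(best_product_index)
complement_index = ((1 << 28) - 1) ^ best_index
best_pair = (longest_list_within(best_index), longest_list_within(complement_index))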

This is a bit too long for a comment so I will post it as an answer. One more direct way to index your subsets as integers is to use "bitsets" with each bit in the binary representation corresponding to one of the numbers.
For example, the set {0,2,3} would be represented by 2^0 + 2^2 + 2^3 = 13, and {4,5} would be represented by 2^4 + 2^5 = 48.
This would allow you to use simple lists instead of dictionaries and Python's generic hashing function.
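A minimal sketch of that encoding (it could serve as the find_index the question relies on, though the original find_index isn't shown):

def find_index(raw_list):
    # bit k of the result is set iff the number k appears in the list
    index = 0
    for n in raw_list:
        index |= 1 << n
    return index

find_index([0, 2, 3])  # -> 13
find_index([4, 5])     # -> 48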

Related

List of lists (or numpy array): extracting data via SQL-like methods?

The problem: given a (large) Python list-of-lists, or, semi-equivalently, a numpy array, extract information from the array in a SQL-like manner, i.e., as if the array were a database.
For example: given a 4-column by (large) N-row array, extract the rows where the first column is equal to X. In SQL this would be:
SELECT * FROM array WHERE col_1_id = X
In Python, however... ¯\_(ツ)_/¯
An attempt to make the issue clearer:
The array in question holds in each sublist/row [M, a^2, b^2, c^2], where M is the sum of the squares. The list contains millions of entries, and M ranges from ~100 to ~10^6.
The desire is to extract from this data only the rows for which at least 8 different rows have the same sum. Naively we can do this with a loop:
Output = []
for i in range(0, 10**6):
    newarray = []
    for row in array:
        if row[0] == i:
            newarray.append(row)
    if len(newarray) >= 8:
        Output.extend(newarray)
save(Output, 'outputfilename')
This output is a much shorter and more workable array. But my understanding is that this is incredibly inefficient (we're looping through a million-row array a million times; that's a trillion comparisons, which seems problematic).
Were this data in a database, I could grab it with:
SELECT * FROM array WHERE col_1 = i AND COUNT(i) >= 8
(depending on which SQL this might take a different form).
So far as I can tell, neither Python nor numpy has built-in functions that act like this. I don't expect the language to parse a SQL query, but there must be some tool within the language that approximates this function.
Numpy has a select method that doesn't actually select rows in this way, and some other methods that sound like they might make these operations possible but seem to do nothing of the sort. As mentioned below, the documentation is very thin on examples.
I have seen things somewhat like this done using collections.Counter(), but I'm not sure this specific desire can be done with it and am uncertain how to do it. The documentation is... thin on examples.
I'm aware of the fact that this may be an XY question, and have hence attempted to leave out the X except as examples of what I've tried. I am, however, in need of tools using Python (via SageMath/Jupyter). If there's a way of directly storing numpy/Python data in a database-like format and hitting it with SQL-like queries, that would be great too.
This might not be exactly what you are looking for, but I hope it can be helpful either way. :) I wrote a loop implementation that should be more efficient than the one you provided since we only loop through the column twice. We use a dictionary to keep track of the number of times a specific value in the first column occurs.
countDict = {}

# Count the number of times each sum occurs in the first column of the array
for row in array:
    if row[0] in countDict:
        # If the row's sum already exists in the dictionary, increment its count
        countDict[row[0]] += 1
    else:
        # Otherwise add the first count (1)
        countDict[row[0]] = 1

output = []  # Output to generate
# Loop through the first column of the array again
for row in array:
    # If the sum value occurred at least 8 times, add the row to the output list
    if countDict[row[0]] >= 8:
        output.append(row)
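For reference, the same two-pass idea can be written with collections.Counter, which the question mentions (a sketch, assuming array is a list or numpy array of rows whose first entry is the sum M):

from collections import Counter

counts = Counter(row[0] for row in array)               # first pass: count each sum
output = [row for row in array if counts[row[0]] >= 8]  # second pass: keep frequent sums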

Given a set t of tuples containing elements from the set S, what is the most efficient way to build another set whose members are not contained in t?

For example, suppose I had an (n,2) dimensional tensor t whose elements are all from the set S containing random integers. I want to build another tensor d with size (m,2) where individual elements in each tuple are from S, but the whole tuples do not occur in t.
E.g.
S = [0, 1, 2, 3, 7]
t = [[0, 1],
     [7, 3],
     [3, 1]]

d = some_algorithm(S, t)
# d = [[2, 1],
#      [3, 2],
#      [7, 4]]
What is the most efficient way to do this in python? Preferably with pytorch or numpy, but I can work around general solutions.
In my naive attempt, I just use
import numpy as np

d = np.random.choice(S, (m, 2))
non_dupes = [i not in t for i in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, rarely results in a (m,2) array). I feel like there has to be some fancy tensor thing I can do to achieve this, or maybe making a large hash map of the values in t so checking for membership in t is O(1), but this produces the same issue just with memory. Is there a more efficient way?
An approximate solution is also okay.
My naive attempt would be a base-transformation function to reduce the problem to an integer-set problem:
definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple-member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
Since S is a set, we can write each element of t as an M-digit number in base L, using the elements of S as digits.
for M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial: x mod L gives back the index of the least significant digit; take floor(x/L) and repeat until all indices are extracted, then look up the values in S and construct the tuple.
Since you can now represent t as a set of integers (read: hashtable), calculating the inverse set d becomes rather trivial:
Loop from 0 to L^M - 1 and ask your hashtable whether each number is in t; if not, it belongs to d.
If the size of S is too big, you can also just draw random numbers and check them against the hashtable to get a subset of the inverse of t.
does this help you?
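A minimal sketch of this base-L encoding for M = 2, using the S and t from the question (tuples so they can go in a set; an illustration of the idea above, not the original poster's code):

S = [0, 1, 2, 3, 7]
L = len(S)
I = {x: i for i, x in enumerate(S)}        # the index function I(x)

def f(pair):                               # tuple -> integer code
    return I[pair[1]] * L + I[pair[0]]

def f_inv(code):                           # integer code -> tuple
    return (S[code % L], S[(code // L) % L])

t_codes = {f(p) for p in [(0, 1), (7, 3), (3, 1)]}   # t as an integer set
d = [f_inv(code) for code in range(L ** 2) if code not in t_codes]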
If |t| + |d| << |S|^2, then the probability that a random tuple has already been chosen (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C < 1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means that, by redrawing elements until you get a new one, you need O((1/(1-C)) * |d|) draws on average to produce |d| new elements, which is O(|d|) if C is indeed a constant.
Checking whether an element has already been "seen" can be done in several ways:
Keeping hash sets of t and d. This requires extra space, but each lookup is constant O(1) time. You could also use a Bloom filter instead of storing the actual elements you have already seen; this makes some errors, claiming an element has been "seen" when it has not, but never the other way around, so the elements you do add to d will still be unique.
Sorting t in place and using binary search. This adds O(|t| log |t|) pre-processing and O(log |t|) per lookup, but requires no additional space (other than where you store d).
If, in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2)-time solution could be to run a Fisher-Yates shuffle on the available choices and pick the first |d| elements that do not appear in t.
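A sketch of the redraw-until-new approach with a hash set, as described above (it assumes, per the argument, that |t| + m is well below |S|^2 so the loop terminates quickly):

import numpy as np

def sample_not_in_t(S, t, m, seed=None):
    rng = np.random.default_rng(seed)
    seen = {tuple(row) for row in t}      # hash set of forbidden tuples
    out = []
    while len(out) < m:
        pair = tuple(rng.choice(S, 2))    # draw a random pair from S
        if pair not in seen:
            seen.add(pair)                # also avoids duplicates within d
            out.append(pair)
    return np.array(out)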

Invertible Cartesian Product Elements/Index Translation Function

I have a problem where I need to identify the elements found at an indexed position within the Cartesian product of a series of lists, but also the inverse, i.e. identify the indexed position from a unique combination of elements from those lists.
I've written the following code which performs the task reasonably well:
import numpy as np

def index_from_combination(meta_list_shape, index_combination):
    list_product = np.prod(meta_list_shape)
    m_factor = np.cumprod([[l] for e, l in enumerate([1] + meta_list_shape)])[0:len(meta_list_shape)]
    return np.sum((index_combination) * m_factor, axis=None)

def combination_at_index(meta_list_shape, index):
    il = len(meta_list_shape) - 1
    list_product = np.prod(meta_list_shape)
    assert index < list_product
    m_factor = np.cumprod([[l] for e, l in enumerate([1] + meta_list_shape)])[0:len(meta_list_shape)][::-1]
    idxl = []
    for e, m in enumerate(m_factor):
        if m <= index:
            idxl.append((index // m))
            index = (index % m)
        else:
            idxl.append(0)
    return idxl[::-1]
e.g.
index_from_combination([3,2],[2,1])
>> 5
combination_at_index([3,2],5)
>> [2,1]
Where [3,2] describes a series of two lists, one containing 3 elements and the other containing 2 elements. The combination [2,1] denotes the 3rd element (zero-indexed) from the 1st list together with the 2nd element (again zero-indexed) from the second list.
...if a little clunkily (and, to save space, one that ignores the actual contents of the lists, and instead works with indexes used elsewhere to fetch the contents from those lists - that's not important here though).
N.B. What is important is that my functions mirror one another such that:
F(a)==b and G(b)==a
i.e. they are the inverse of one another.
From the linked question, it turns out I can replace the second function with the one-liner:
list(itertools.product(['A','B','C'],['P','Q','R'],['X','Y']))[index]
Which will return the unique combination of values for a supplied index integer (though with some question-mark in my mind about how much of that list is instantiated in memory - but again, that's not necessarily important right now).
What I'm asking is, itertools appears to have been built with these types of problems in mind - is there an equally neat one-line inverse to the itertools.product function that, given a combination, e.g. ['A','Q','Y'] will return an integer describing that combination's position within the cartesian product, such that this integer, if fed into the itertools.product function will return the original combination?
Imagine those combinations as two-dimensional X-Y coordinates and use subscript to linear-index conversion and vice versa. Thus, use NumPy's built-ins np.ravel_multi_index for getting the linear index and np.unravel_index for the subscript indices, which become your index_from_combination and combination_at_index respectively.
It's a simple translation and doesn't generate any combination whatsoever, so should be a breeze.
Sample run to make things clearer -
In [861]: np.ravel_multi_index((2,1),(3,2))
Out[861]: 5
In [862]: np.unravel_index(5, (3,2))
Out[862]: (2, 1)
The math is simple enough to be implemented if you don't want the NumPy dependency for some reason -
def index_from_combination(a, b):
    return b[0]*a[1] + b[1]

def combination_at_index(a, b):
    d = b // a[1]
    r = b - a[1]*d
    return d, r
Sample run -
In [881]: index_from_combination([3,2],[2,1])
Out[881]: 5
In [882]: combination_at_index([3,2],5)
Out[882]: (2, 1)

4-sum algorithm in Python [duplicate]

This question already has answers here:
Quadratic algorithm for 4-SUM
(3 answers)
Closed 9 years ago.
I am trying to find whether a list has 4 elements that sum to 0 (and later find what those elements are). I'm trying to make a solution based off the even k algorithm described at https://cs.stackexchange.com/questions/2973/generalised-3sum-k-sum-problem.
I got this code in Python using combinations from the standard library:
from itertools import combinations

def foursum(arr):
    seen = {sum(subset) for subset in combinations(arr, 2)}
    return any(-x in seen for x in seen)
But this fails for input like [-1, 1, 2, 3]. It fails because it matches the sum (-1+1) with itself. I think this problem will get even worse when I want to find the elements because you can separate a set of 4 distinct items into 2 sets of 2 items in 6 ways: {1,4}+{-2,-3}, {1,-2}+{4,-3} etc etc.
How can I make an algorithm that correctly returns all solutions avoiding this problem?
EDIT: I should have added that I want to use as efficient algorithm as possible. O(len(arr)^4) is too slow for my task...
This works.
import itertools

def foursum(arr):
    seen = {}
    for i in xrange(len(arr)):
        for j in xrange(i+1, len(arr)):
            if arr[i]+arr[j] in seen: seen[arr[i]+arr[j]].add((i,j))
            else: seen[arr[i]+arr[j]] = {(i,j)}

    for key in seen:
        if -key in seen:
            for (i,j) in seen[key]:
                for (p,q) in seen[-key]:
                    if i != p and i != q and j != p and j != q:
                        return True
    return False
EDIT
This can be made more Pythonic I think; I don't know enough Python.
It is normal for the 4SUM problem to permit input elements to be used multiple times. For instance, given the input (2 3 1 0 -4 -1), valid solutions are (3 1 0 -4) and (0 0 0 0).
The basic algorithm is O(n^2): Use two nested loops, each running over all the items in the input, to form all sums of pairs, storing the sums and their components in some kind of dictionary (hash table, AVL tree). Then scan the pair-sums, reporting any quadruple for which the negative of the pair-sum is also present in the dictionary.
If you insist on not duplicating input elements, you can modify the algorithm slightly. When computing the two nested loops, start the second loop beyond the current index of the first loop, so no input elements are taken twice. Then, when scanning the dictionary, reject any quadruples that include duplicates.
I discuss this problem at my blog, where you will see solutions in multiple languages, including Python.
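For illustration, a sketch of the dictionary-of-pair-sums approach described above that also reports the elements of one valid quadruple, rejecting any quadruple that reuses an index (this is my rendering, not the blog's code):

from itertools import combinations

def foursum_elements(arr):
    pair_sums = {}                                  # sum -> list of index pairs
    for i, j in combinations(range(len(arr)), 2):
        pair_sums.setdefault(arr[i] + arr[j], []).append((i, j))
    for s, pairs in pair_sums.items():
        for (i, j) in pairs:
            for (p, q) in pair_sums.get(-s, []):
                if len({i, j, p, q}) == 4:          # no input element reused
                    return arr[i], arr[j], arr[p], arr[q]
    return None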
First note that the problem is O(n^4) in the worst case, since the output size might be O(n^4) (you are looking for all solutions, not just answering the decision problem).
Proof:
Take the example [-1]*(n/2) + [1]*(n/2). You need to "choose" two instances of -1 without repeats, which gives (n/2)*(n/2-1)/2 possibilities, and two instances of 1 without repeats, another (n/2)*(n/2-1)/2 possibilities. This totals (n/2)*(n/2-1)*(n/2)*(n/2-1)/4, which is in Theta(n^4).
Now that we understand we cannot achieve O(n^2 log n) in the worst case, we can move to the following algorithm (pseudo-code), which should scale closer to O(n^2 log n) for "good" cases (few identical sums) and degrade to O(n^4) in the worst case (as expected).
Pseudo-code:
subsets <- all subsets of size 2 of the indices (not the values!)
l <- empty list
for each s in subsets:
    # appending a triplet of (sum, idx1, idx2):
    l.append((arr[s[0]] + arr[s[1]], s[0], s[1]))
sort l by the first element (the sum) in each tuple
for each x in l:
    binary search l for -x[0]  # for the sum
    for each element y that satisfies the above:
        if x[1] != y[1] and x[2] != y[1] and x[1] != y[2] and x[2] != y[2]:
            yield arr[x[1]], arr[x[2]], arr[y[1]], arr[y[2]]
A Pythonic version of the above would probably be more elegant and readable, but I am not a Python expert, I am afraid.
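One possible Python rendering of that pseudo-code (my sketch under the same assumptions; like the pseudo-code, it may yield the same quadruple more than once):

from bisect import bisect_left, bisect_right
from itertools import combinations

def foursum_all(arr):
    l = sorted((arr[i] + arr[j], i, j)
               for i, j in combinations(range(len(arr)), 2))
    sums = [s for s, _, _ in l]                 # sorted pair sums for bisecting
    for s, i, j in l:
        lo, hi = bisect_left(sums, -s), bisect_right(sums, -s)
        for _, p, q in l[lo:hi]:                # all pairs summing to -s
            if len({i, j, p, q}) == 4:
                yield arr[i], arr[j], arr[p], arr[q]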
EDIT: Of course the algorithm must take at least as much time as the size of the solution set!
If the number of possible solutions is not 'large' compared to n, then
A suggested solution in O(N^3):
Find pair-wise sums of all elements and build a NxN matrix of the sums.
For each element in this matrix, build a struct that has sumValue, row and column as its fields.
Sort all these N^2 struct elements in a 1D array (in O(N^2 log N) time).
For each element x in this array, conduct a binary search for its partner y such that x + y = 0 (O(logn) per search).
Now if you find a partner y, check whether its row or column field matches that of the element x. If so, iterate sequentially in both directions (over the other entries with the same sum) until there are no more such y.
If you find some y's that do not have a common row or column with x, then increment the count (or print the solution).
This iteration can at most take 2N steps because the length of rows and columns is N.
Hence the total order of complexity for this algorithm shall be O(N^2 * N) = O(N^3)

Vectorize iteration over two large numpy arrays in parallel

I have two large arrays of type numpy.core.memmap.memmap, called data and new_data, with > 7 million float32 items.
I need to iterate over them both within the same loop which I'm currently doing like this.
for i in range(0, len(data)):
    if new_data[i] == 0: continue
    combo = (data[i], new_data[i])
    if not combo in new_values_map: new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
However this is unreasonably slow, so I gather that using numpy's vectorising functions is the way to go.
Is it possible to vectorize with the index, so that the vectorised array can compare its items to the corresponding items in the other array?
I thought of zipping the two arrays but I guess this would cause unreasonable overhead to prepare?
Is there some other way to optimise this operation?
For context: the goal is to effectively merge the two arrays such that each unique combination of corresponding values between the two arrays is represented by a different value in the resulting array, except zeros in the new_data array which are ignored. The arrays represent 3D bitmap images.
EDIT: available_values is a set of values that have not yet been used in data and persists across calls to this loop. new_values_map on the other hand is reset to an empty dictionary before each time this loop is used.
EDIT2: the data array only contains whole numbers, that is: it's initialised as zeros then with each usage of this loop with a different new_data it is populated with more values drawn from available_values which is initially a range of integers. new_data could theoretically be anything.
In answer to your question about vectorising, the answer is probably yes, though you need to clarify what available_values contains and how it's used, as that is the core of the vectorisation.
Your solution will probably look something like this...
indices = new_data != 0
data[indices] = available_values
In this case, if available_values can be considered as a set of values in which we allocate the first value to the first value in data in which new_data is not 0, that should work, as long as available_values is a numpy array.
Let's say new_data and data take values 0-255, then you can construct an available_values array with unique entries for every possible pair of values in new_data and data like the following:
available_data = numpy.array(xrange(0, 255*255)).reshape((255, 255))

indices = new_data != 0
data[indices] = available_data[data[indices], new_data[indices]]
Obviously, available_data can be whatever mapping you want. The above should be very quick whatever is in available_data (especially if you only construct available_data once).
Python gives you powerful tools for handling large arrays of data: generators and iterators.
Basically, they allow you to access your data as if it were a regular list, without fetching it all into memory at once, but accessing it piece by piece.
To access two large arrays at once, you can do:
from itertools import izip

for item_a, item_b in izip(data, new_data):
    # ... do your stuff here
izip creates an iterator that iterates over the elements of both arrays in lockstep, picking up pieces as you need them rather than all at once.
It seems that replacing the first two lines of the loop to produce:
for i in numpy.where(new_data != 0)[0]:
    combo = (data[i], new_data[i])
    if not combo in new_values_map: new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
has the desired effect.
So most of the time in the loop was spent skipping iterations upon encountering a zero in new_data. I don't really understand why that many null iterations were so expensive; maybe one day I will...
