I need to identify the elements found at a given index within the Cartesian product of a series of lists, and also the inverse: identify the index from a unique combination of elements drawn from those lists.
I've written the following code which performs the task reasonably well:
import numpy as np

def index_from_combination(meta_list_shape, index_combination ):
    list_product = np.prod(meta_list_shape)
    m_factor = np.cumprod([[l] for e,l in enumerate([1]+meta_list_shape)])[0:len(meta_list_shape)]
    return np.sum((index_combination)*m_factor,axis=None)

def combination_at_index(meta_list_shape, index ):
    il = len(meta_list_shape)-1
    list_product = np.prod(meta_list_shape)
    assert index < list_product
    m_factor = np.cumprod([[l] for e,l in enumerate([1]+meta_list_shape)])[0:len(meta_list_shape)][::-1]
    idxl = []
    for e,m in enumerate(m_factor):
        if m<=index:
            idxl.append((index//m))
            index = (index%m)
        else:
            idxl.append(0)
    return idxl[::-1]
e.g.
index_from_combination([3,2],[2,1])
>> 5
combination_at_index([3,2],5)
>> [2,1]
Where [3,2] describes a series of two lists, one containing 3 elements, and the other containing 2 elements. The combination [2,1] denotes the 3rd element (index 2, zero-indexed) from the 1st list, and the 2nd element (index 1, again zero-indexed) from the 2nd list.
...if a little clunkily (and, to save space, one that ignores the actual contents of the lists, and instead works with indexes used elsewhere to fetch the contents from those lists - that's not important here though).
N.B. What is important is that my functions mirror one another such that:
F(a)==b and G(b)==a
i.e. they are the inverse of one another.
From the linked question, it turns out I can replace the second function with the one-liner:
list(itertools.product(['A','B','C'],['P','Q','R'],['X','Y']))[index]
Which will return the unique combination of values for a supplied index integer (though with some question-mark in my mind about how much of that list is instantiated in memory - but again, that's not necessarily important right now).
What I'm asking is, itertools appears to have been built with these types of problems in mind - is there an equally neat one-line inverse to the itertools.product function that, given a combination, e.g. ['A','Q','Y'] will return an integer describing that combination's position within the cartesian product, such that this integer, if fed into the itertools.product function will return the original combination?
Imagine those combinations as two-dimensional X-Y coordinates and use subscript to linear-index conversion and vice versa. Thus, use NumPy's built-ins np.ravel_multi_index for getting the linear index and np.unravel_index for the subscript indices, which become your index_from_combination and combination_at_index respectively.
It's a simple translation and doesn't generate any combination whatsoever, so should be a breeze.
Sample run to make things clearer -
In [861]: np.ravel_multi_index((2,1),(3,2))
Out[861]: 5
In [862]: np.unravel_index(5, (3,2))
Out[862]: (2, 1)
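To go from actual elements (rather than indices) to a position, one option is to look up each element's index in its own list first. This is only a sketch, assuming the elements within each list are unique; the helper names position_of and combination_at are made up for illustration. The default C ordering of np.ravel_multi_index (last index varies fastest) matches the order in which itertools.product yields its tuples.

```python
import itertools
import numpy as np

lists = [['A', 'B', 'C'], ['P', 'Q', 'R'], ['X', 'Y']]

def position_of(combo, lists):
    # Hypothetical helper: element -> per-list index -> linear index.
    shape = tuple(len(l) for l in lists)
    idx = tuple(l.index(c) for l, c in zip(lists, combo))
    return int(np.ravel_multi_index(idx, shape))

def combination_at(position, lists):
    # Hypothetical helper: linear index -> per-list index -> element.
    shape = tuple(len(l) for l in lists)
    idx = np.unravel_index(position, shape)
    return [l[i] for l, i in zip(lists, idx)]

pos = position_of(['A', 'Q', 'Y'], lists)
print(pos, combination_at(pos, lists))
```

Neither helper builds the Cartesian product, so the cost depends on the number of lists, not on the number of combinations.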
The math is simple enough to implement yourself if you don't want the NumPy dependency for some reason (shown here for the two-list case) -
def index_from_combination(a, b):
    return b[0]*a[1] + b[1]

def combination_at_index(a, b):
    d = b//a[1]
    r = b - a[1]*d
    return d, r
Sample run -
In [881]: index_from_combination([3,2],[2,1])
Out[881]: 5
In [882]: combination_at_index([3,2],5)
Out[882]: (2, 1)
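The same arithmetic extends to any number of lists. The following is a sketch rather than a drop-in replacement: it reuses the function names from above, takes the list lengths as shape, and follows the row-major convention (last index varies fastest) used by np.ravel_multi_index.

```python
def index_from_combination(shape, combo):
    # Horner-style accumulation: multiply by each list length in turn.
    index = 0
    for size, i in zip(shape, combo):
        index = index * size + i
    return index

def combination_at_index(shape, index):
    # Peel off one "digit" per list, starting from the last list.
    combo = []
    for size in reversed(shape):
        index, i = divmod(index, size)
        combo.append(i)
    return tuple(reversed(combo))

print(index_from_combination([3, 2], [2, 1]))
print(combination_at_index([3, 2], 5))
```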
I have 2 arrays of a million elements (created from an image with the brightness of each pixel)
I need to get a number that is the sum of the products of the array elements at the same position. That is, A(1,1) * B(1,1) + A(1,2) * B(1,2)...
In the loop, python takes the value of the last variable from the loop (j1) and starts running through it, then adds 1 to the penultimate variable and runs through the last one again, and so on. How can I make it pair up only the elements at the same position?
res1, res2 - arrays (specifically - numpy.ndarray)
Perhaps there is a ready-made function for this, but I need to make it as open as possible, without a ready-made one.
sum = 0
for i in range(len(res1)):
    for j in range(len(res2[i])):
        for i1 in range(len(res2)):
            for j1 in range(len(res1[i1])):
                sum += res1[i][j]*res2[i1][j1]
In the first part of my answer I'll explain how to fix your code directly. Your code is almost correct but contains one big mistake in logic. In the second part of my answer I'll explain how to solve your problem using numpy. numpy is the standard python package to deal with arrays of numbers. If you're manipulating big arrays of numbers, there is no excuse not to use numpy.
Fixing your code
Your code uses 4 nested for-loops, with indices i and j to iterate on the first array, and indices i1 and j1 to iterate on the second array.
Thus you're multiplying every element res1[i][j] from the first array, with every element res2[i1][j1] from the second array. This is not what you want. You only want to multiply every element res1[i][j] from the first array with the corresponding element res2[i][j] from the second array: you should use the same indices for the first and the second array. Thus there should only be two nested for-loops.
s = 0
for i in range(len(res1)):
    for j in range(len(res1[i])):
        s += res1[i][j] * res2[i][j]
Note that I called the variable s instead of sum. This is because sum is the name of a builtin function in python. Shadowing the name of a builtin is heavily discouraged. Here is the list of builtins: https://docs.python.org/3/library/functions.html ; do not name a variable with a name from that list.
Now, in general, in python, we dislike using range(len(...)) in a for-loop. If you read the official tutorial and its section on for loops, it suggests that a for-loop can be used to iterate on elements directly, rather than on indices.
For instance, here is how to iterate on one array, to sum the elements on an array, without using range(len(...)) and without using indices:
# sum the elements in an array
s = 0
for row in res1:
    for x in row:
        s += x
Here row is a whole row, and x is an element. We don't refer to indices at all.
Useful tools for looping are the builtin functions zip and enumerate:
enumerate can be used if you need access both to the elements, and to their indices;
zip can be used to iterate on two arrays simultaneously.
I won't show an example with enumerate, but zip is exactly what you need since you want to iterate on two arrays:
s = 0
for row1, row2 in zip(res1, res2):
    for x, y in zip(row1, row2):
        s += x * y
You can also use builtin function sum to write this all without += and without the initial = 0:
s = sum(x * y for row1,row2 in zip(res1, res2) for x,y in zip(row1, row2))
Using numpy
As I mentioned in the introduction, numpy is a standard python package to deal with arrays of numbers. In general, operations on arrays using numpy are much, much faster than loops on arrays in core python. Plus, code using numpy is usually easier to read than code using core python only, because there are a lot of useful functions and convenient notations. For instance, here is a simple way to achieve what you want:
import numpy as np
# convert to numpy arrays
res1 = np.array(res1)
res2 = np.array(res2)
# multiply elements with corresponding elements, then sum
s = (res1 * res2).sum()
Relevant documentation:
sum: .sum() or np.sum();
pointwise multiplication: np.multiply() or *;
dot product: np.dot.
Solution 1:
import numpy as np
a,b = np.array(range(100)), np.array(range(100))
print((a * b).sum())
Solution 2 (more open, because of use of pd.DataFrame):
import pandas as pd
import numpy as np
a,b = np.array(range(100)), np.array(range(100))
df = pd.DataFrame(dict({'col1': a, 'col2': b}))
df['vect_product'] = df.col1 * df.col2
print(df['vect_product'].sum())
Two simple and fast options using numpy are: (A*B).sum() and np.dot(A.ravel(),B.ravel()). The first method sums all elements of the element-wise multiplication of A and B. np.sum() defaults to sum(axis=None), so we will get a single number. In the second method, you create a 1D view into the two matrices and then apply the dot-product method to get a single number.
import numpy as np
A = np.random.rand(1000,1000)
B = np.random.rand(1000,1000)
s = (A*B).sum() # method 1
s = np.dot(A.ravel(),B.ravel()) # method 2
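The second method should be extremely fast: for contiguous arrays such as these, ravel() returns a view into A and B rather than a copy, so there are no extra memory allocations. The first method, by contrast, allocates a temporary array for A*B before summing it.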
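As a quick sanity check, the sketch below compares the plain-Python loop with the NumPy variants on small random arrays (the sizes and the seed are arbitrary choices). It also includes np.vdot and np.einsum, which are not mentioned above but compute the same quantity for real-valued arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((200, 300))
B = rng.random((200, 300))

# Pure-Python reference: pair up elements at the same position.
s_loop = sum(x * y for row_a, row_b in zip(A, B) for x, y in zip(row_a, row_b))

s_mul = (A * B).sum()                    # element-wise product, then sum
s_dot = np.dot(A.ravel(), B.ravel())     # dot product of the flattened arrays
s_vdot = np.vdot(A, B)                   # flattens its arguments itself
s_einsum = np.einsum('ij,ij->', A, B)    # sum over both indices

print(np.allclose([s_mul, s_dot, s_vdot, s_einsum], s_loop))
print(np.shares_memory(A, A.ravel()))    # ravel() is a view for contiguous input
```

The results can differ in the last few digits because the summation order is not the same in each variant, which is why the comparison uses np.allclose rather than ==.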
For example, suppose I had an (n,2) dimensional tensor t whose elements are all from the set S containing random integers. I want to build another tensor d with size (m,2) where individual elements in each tuple are from S, but the whole tuples do not occur in t.
E.g.
S = [0,1,2,3,7]
t = [[0,1],
     [7,3],
     [3,1]]
d = some_algorithm(S,t)
# d = [[2,1],
#      [3,2],
#      [7,4]]
What is the most efficient way to do this in python? Preferably with pytorch or numpy, but I can work around general solutions.
In my naive attempt, I just use
d = np.random.choice(S,(m,2))
non_dupes = [i not in t for i in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, rarely results in a (m,2) array). I feel like there has to be some fancy tensor thing I can do to achieve this, or maybe making a large hash map of the values in t so checking for membership in t is O(1), but this produces the same issue just with memory. Is there a more efficient way?
An approximate solution is also okay.
my naive attempt would be a base-transformation function to reduce the problem to an integer set problem:
definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple-member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
since S is a set, we can write each element of t as an M-digit number in base L, using the indices of the elements of S as digits.
for M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial: x mod L gives back the index of the least significant digit; take floor(x/L) and repeat until all indices are extracted. then look up the values in S and construct the tuple.
since you can now represent t as an integer set (read: hash table), calculating the inverse set d becomes rather trivial:
loop from 0 to L^M - 1 and ask your hash table whether the number is in t; every number that is not in t decodes to an element of d
if S is too big, you can also just draw random numbers and check them against the hash table to get a subset of the inverse of t
does this help you?
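A small sketch of this encoding, using the example S and t from the question. The names index_of, f and f_inv mirror I, f and f^-1 above; with M digits in base L the codes run from 0 to L^M - 1, so enumerating them all is only practical when S is small.

```python
S = [0, 1, 2, 3, 7]
t = [(0, 1), (7, 3), (3, 1)]
L, M = len(S), 2
index_of = {value: i for i, value in enumerate(S)}   # I(x)

def f(x):
    # Encode a tuple as an integer; x[0] is the least significant digit.
    return sum(index_of[v] * L**n for n, v in enumerate(x))

def f_inv(code):
    # Decode by peeling off one base-L digit per tuple member.
    members = []
    for _ in range(M):
        code, digit = divmod(code, L)
        members.append(S[digit])
    return tuple(members)

t_codes = {f(x) for x in t}                          # t as an integer set
d = [f_inv(c) for c in range(L**M) if c not in t_codes]
print(len(d))
```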
If |t| + |d| << |S|^2 then the probability of some random tuple to be chosen again (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C<1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means that by redrawing each element until it is a new one, you need O((1/(1-C)) * |d|) draws in total (on average), which is O(|d|) if C is indeed constant.
Checking whether an element is already "seen" can be done in several ways:
Keeping hash sets of t and d. This requires extra space, but each lookup is constant O(1) time. You could also use a bloom filter instead of storing the actual elements you have already seen. This will make some errors, saying an element is already "seen" though it was not, but never the other way around, so you will still get all elements in d as unique.
Inplace sorting t, and using binary search. This adds O(|t|log|t|) pre-processing, and O(log|t|) for each lookup, but requires no additional space (other than where you store d).
If in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2) time solution could be to use Fisher Yates shuffle on the available choices, and choosing the first |d| elements that do not appear in t.
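The redraw-until-new idea with a hash set could look roughly like this. It is a sketch under two assumptions: the elements of S are unique, and every pair in t is drawn from S. The function name sample_unseen_pairs and its seed parameter are made up for illustration.

```python
import random

def sample_unseen_pairs(S, t, m, seed=None):
    # Hypothetical helper: draw m distinct pairs over S that do not occur in t.
    rng = random.Random(seed)
    seen = {tuple(pair) for pair in t}       # O(1) membership checks
    if len(seen) + m > len(S) ** 2:
        raise ValueError("not enough unseen pairs left")
    d = []
    while len(d) < m:
        candidate = (rng.choice(S), rng.choice(S))
        if candidate not in seen:            # otherwise just redraw
            seen.add(candidate)
            d.append(candidate)
    return d

S = [0, 1, 2, 3, 7]
t = [[0, 1], [7, 3], [3, 1]]
d = sample_unseen_pairs(S, t, 5, seed=0)
print(d)
```

Adding each accepted pair to seen is what keeps d free of duplicates, and it always returns exactly m pairs, unlike filtering a fixed-size random draw.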
Suppose I have two arrays of equal lengths:
a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]
Now I want to pick elements from these two arrays, in the sequence given, so that they form a new array of the same length as a and b by randomly selecting values between a and b, in the ratio of a:b = 4.68, i.e. for every 1 value picked from a, there should be 4.68 values picked from b in the resultant array.
So effectively the resultant array could be something like :
res = [0,1,1,0,1, 1(from a) ,0(from a),1,1,0,0,1,1,0, 0(from a),0,0]
res array has : first 5 values are from b ,6th & 7th from a ,8th-14th from b , 15th from a ,16th-17th from b
Overall ratio of values from a:b in the given res array example is a:b 4.67 ( from a = 3 ,from b = 14 )
Thus, between the two arrays, values have to be chosen at random, but the sequence needs to be maintained, i.e. I cannot take the 7th value from one array and the 3rd value from the other. If the value to be populated in the resultant array is the 3rd, then the choice is between the 3rd elements of both input arrays at random. Also, the overall ratio needs to be maintained as well.
Can you please help me develop an efficient Pythonic way of reaching this resultant solution? The solution need not produce the same values on every run.
Borrowing the a_count calculation from Barmar's answer (because it seems to work and I can't be bothered to reinvent it), this solution preserves the ordering of the values chosen from a and b:
from future_builtins import zip # Only on Python 2, to avoid temporary list of tuples
import random
# int() unnecessary on Python 3
a_count = int(round(1/(1 + 4.68) * len(a)))
# Use range on Python 3, xrange on Python 2, to avoid making actual list
a_indices = frozenset(random.sample(xrange(len(a)), a_count))
res = [aval if i in a_indices else bval for i, (aval, bval) in enumerate(zip(a, b))]
The basic idea here is that you determine how many a values you need, get a unique sample of the possible indices of that size, then iterate a and b in parallel, keeping the a value for the selected indices, and the b value for all others.
If you don't like the complexity of the list comprehension, you could use a different approach, copying b, then filling in the a values one by one:
res = b[:] # Copy b in its entirety
# Replace selected indices with a values
# No need to convert to frozenset for efficiency here, and it's clean
# enough to just iterate the sample directly without storing it
for i in random.sample(xrange(len(a)), a_count):
    res[i] = a[i]
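For reference, a Python 3 only version of the first approach might look like the sketch below, using the arrays from the question. The variable name ratio is introduced here for readability and is not part of the original code.

```python
import random

a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]
ratio = 4.68   # values taken from b for every value taken from a

# round() already returns an int on Python 3
a_count = round(len(a) / (1 + ratio))
# positions where the value comes from a; all other positions use b
a_indices = set(random.sample(range(len(a)), a_count))
res = [aval if i in a_indices else bval
       for i, (aval, bval) in enumerate(zip(a, b))]
print(a_count, res)
```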
I believe this should work. You specify how many you want from a (you can simply use your ratio to figure out that number), you randomly generate a 'mask' of numbers and choose from a or b based on the cutoff (notice that you only sort to figure out the cutoff, but you use the unsorted mask later)
import numpy as np
a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]
mask = np.random.random(len(a))
from_a = 3
# the from_a smallest mask values lie strictly below this cutoff
cutoff = np.sort(mask)[from_a]
res = []
for i in range(len(a)):
    if (mask[i]<cutoff):
        res.append(a[i])
    else:
        res.append(b[i])
As the title states, I want to create a function that'll take a multidimensional array A, and a number B, that ultimately returns the number in A that is the closest to B. If the number B is in A, then return it. If there's 2 numbers in A that are equally distant from B, choose the first one by counting from row to row.
This is the code I have so far:
import numpy as np
def g_C(A,B):
    A = np.asanyarray(A)
    assert A.ndim == 2 # to assert that A is a multidimensional array.
    get = (np.abs(A-B)).argmin()
    return (A[get])
However from my understanding, I think (np.abs(A-B)).argmin() really only effectively works for sorted arrays? I'm not allowed to sort the array in this problem; I have to take it at face value, examining row by row, and grabbing the first instance of the closest number to B.
So for example, g_C([[1,3,6,-8],[2,7,1,0],[4,5,2,8],[2,3,7,10]],9) should return 8
Also, I was given the hint that numpy.argmin would help, and I see that its purpose is to return the index of the first occurrence of the minimum, which makes sense in this problem, but I'm not sure how exactly to fit that into the code I have at the moment.
EDIT
The flat suggestion works perfectly fine. Thank you everyone.
I'm trying RagingRoosevelt's second suggestion, and I'm stuck.
def g_C(A,B):
    A = np.asanyarray(A)
    D = np.full_like(A, B) # created an array D with same qualities as array A, but just filled with values of B
    diffs = abs(D-A) # finding absolute value differences between D and A
    close = diffs.argmin(axis=1) # find argmin of 'diffs', row by row
    close = np.asanyarray(close) # converted the argmins of 'diff' into an array
    closer = close.argmin() # the final argmin ??
    return closer
I'm trying out this suggestion because I have another problem related to this where I have to extract the row whose sum is the closest number to B. And I figure this is good practice anyway.
Your existing code is fine except, by default, argmin returns an index to the flattened array. So you could do
return A.flat[abs(A - B).argmin()]
to get the right value from A.
EDIT: For your other problem - finding the row in a 2-dimensional array A whose sum is closest to B - you can do:
return A[abs(A.sum(axis=1) - B).argmin()]
In either case I don't see any need to create an array of B.
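Put together as complete functions, the two one-liners could be used as follows. The function names closest_value and closest_row_by_sum are only illustrative.

```python
import numpy as np

def closest_value(A, B):
    A = np.asanyarray(A)
    # argmin works on the flattened array, so index through A.flat
    return A.flat[np.abs(A - B).argmin()]

def closest_row_by_sum(A, B):
    A = np.asanyarray(A)
    # one sum per row, then the row whose sum is nearest to B
    return A[np.abs(A.sum(axis=1) - B).argmin()]

A = [[1,3,6,-8],[2,7,1,0],[4,5,2,8],[2,3,7,10]]
print(closest_value(A, 9))
print(closest_row_by_sum(A, 9))
```

Because argmin returns the first minimum and flattening goes row by row, ties are resolved in favour of the first occurrence in row order, with no sorting required.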
Your problem is the same as a find-min problem. The only difference is that you're looking for min(abs(A[i]-B)) instead. So, iterate over your array. As you do so, record the smallest absolute delta and the index at which it occurred. When you find a smaller delta, update the record and then keep searching. When you've made it all the way through, return whatever value was at the recorded index.
Since you're working with numpy arrays, another approach is that you could create an array of identical size as A but filled only with value B. Compute the difference between the arrays and then use argmin on each row. Assemble an array of all minimum values for each row and then do argmin again to pull out the smallest of the values.
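A sketch of the row-by-row variant, reusing the name g_C from the question. It relies on broadcasting (A - B) instead of building an array filled with B, and the key step is to argmin over the per-row minimum differences, not over the per-row argmin indices.

```python
import numpy as np

def g_C(A, B):
    A = np.asanyarray(A)
    diffs = np.abs(A - B)
    cols = diffs.argmin(axis=1)                  # best column within each row
    row_mins = diffs[np.arange(len(A)), cols]    # smallest difference per row
    row = row_mins.argmin()                      # first row holding the overall minimum
    return A[row, cols[row]]

print(g_C([[1,3,6,-8],[2,7,1,0],[4,5,2,8],[2,3,7,10]], 9))
```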
This will work for any 2-dimensional array with a nested for-loop, but I am not sure that this is what you want (as in it doesn't use numpy).
def g_C(A, B):
    n = A[0][0]  # best value so far (also covers the case where A[0][0] is the closest)
    m = abs(B - n)
    for r in A:
        for i in r:
            if abs(B - i) < m:
                m = abs(B - i)
                n = i
    return n
Nevertheless, it does work:
>>> g_C([[1,3,6,-8],[2,7,1,0],[4,5,2,8],[2,3,7,10]],9)
8
So I'm given a large collection (roughly 200k) of lists. Each contains a subset of the numbers 0 through 27. I want to return two of the lists where the product of their lengths is greater than the product of the lengths of any other pair of lists. There's another condition, namely that the lists have no numbers in common.
There's an algorithm I found for this (can't remember the source, apologies for non-specificity of props) which exploits the fact that there are fewer total subsets of the numbers 0 through 27 than there are words in the dictionary.
The first thing I've done is looped through all the lists, found the unique subset of integers that comprise it and indexed it as a number between 0 and 1<<28. As follows:
def index_lists(lists):
    index_hash = {}
    for raw_list in lists:
        length = len(raw_list)
        index = find_index(raw_list)  # must be computed before the lookup
        if length > index_hash.get(index,{}).get("length", 0):
            index_hash[index] = {"list": raw_list, "length": length}
    return index_hash
This gives me the longest list and the length of that list for each subset that's actually contained in the collection of lists given. Naturally, not all subsets from 0 to (1<<28)-1 are necessarily included, since there's no guarantee the supplied collection has a list containing each unique subset.
What I then want, for each subset 0 through 1<<28 (all of them this time) is the longest list that contains at most that subset. This is the part that is killing me. At a high level, it should, for each subset, first check to see if that subset is contained in the index_hash. It should then compare the length of that entry in the hash (if it exists there) to the lengths stored previously in the current hash for the current subset minus one number (this is an inner loop 27 strong). The greatest of these is stored in this new hash for the current subset of the outer loop. The code right now looks like this:
def at_most_hash(index_hash):
    most_hash = {}
    for i in xrange(1<<28): # pretty sure this is a bad idea
        max_entry = index_hash.get(i)
        if max_entry:
            max_length = max_entry["length"]
            max_list = max_entry["list"]
        else:
            max_length = 0
            max_list = []
        for j in xrange(28): # again, probably not great
            subset_index = i & ~(1<<j) # gets us a pre-computed subset
            at_most_entry = most_hash.get(subset_index, {})
            at_most_length = at_most_entry.get("length",0)
            if at_most_length > max_length:
                max_length = at_most_length
                max_list = at_most_entry["list"]
        most_hash[i] = {"length": max_length, "list": max_list}
    return most_hash
This loop obviously takes several forevers to complete. I feel that I'm new enough to python that my choice of how to iterate and what data structures to use may have been completely disastrous. Not to mention the prospective memory problems from attempting to fill the dictionary. Is there perhaps a better structure or package to use as data structures? Or a better way to set up the iteration? Or maybe I can do this more sparsely?
The next part of the algorithm just cycles through all the lists we were given and takes the product of the subset's max_length and complementary subset's max length by looking them up in at_most_hash, taking the max of those.
Any suggestions here? I appreciate the patience for wading through my long-winded question and less than decent attempt at coding this up.
In theory, this is still a better approach than working with the collection of lists alone, since that approach is roughly O(200k^2) and this one is roughly O(28*2^28 + 200k), yet my implementation is holding me back.
Given that your indexes are just ints, you could save some time and space by using lists instead of dicts. I'd go further and bring in NumPy arrays. They offer compact storage representation and efficient operations that let you implicitly perform repetitive work in C, bypassing a ton of interpreter overhead.
Instead of index_hash, we start by building a NumPy array where index_array[i] is the length of the longest list whose set of elements is represented by i, or 0 if there is no such list:
import numpy
index_array = numpy.zeros(1<<28, dtype=int) # We could probably get away with dtype=int8.
for raw_list in lists:
    i = find_index(raw_list)
    index_array[i] = max(index_array[i], len(raw_list))
We then use NumPy operations to bubble up the lengths in C instead of interpreted Python. Things might get confusing from here:
for bit_index in xrange(28):
    index_array = index_array.reshape([1<<(28-bit_index), 1<<bit_index])
    numpy.maximum(index_array[::2], index_array[1::2], out=index_array[1::2])
index_array = index_array.reshape([1<<28])
Each reshape call takes a new view of the array where data in even-numbered rows corresponds to sets with the bit at bit_index clear, and data in odd-numbered rows corresponds to sets with the bit at bit_index set. The numpy.maximum call then performs the bubble-up operation for that bit. At the end, each cell index_array[i] of index_array represents the length of the longest list whose elements are a subset of set i.
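To see the bubble-up at a size that is easy to inspect, here is a scaled-down sketch with a universe of 4 numbers instead of 28, written for Python 3. The example lists and the find_index encoding (bit k set when k occurs in the list) are assumptions, since the question does not show find_index.

```python
import numpy as np

N = 4  # numbers 0..3 instead of 0..27

def find_index(raw_list):
    # Assumed encoding: bit k is set when k occurs in the list.
    return sum(1 << k for k in set(raw_list))

lists = [[0, 0, 1], [2], [1, 3, 3, 3], [0, 2, 2]]

index_array = np.zeros(1 << N, dtype=int)
for raw_list in lists:
    i = find_index(raw_list)
    index_array[i] = max(index_array[i], len(raw_list))

for bit_index in range(N):
    # Even rows: bit clear. Odd rows: bit set. Push maxima into the odd rows.
    index_array = index_array.reshape([1 << (N - bit_index), 1 << bit_index])
    np.maximum(index_array[::2], index_array[1::2], out=index_array[1::2])
index_array = index_array.reshape([1 << N])

# Reversing the array pairs every set with its complement.
products = index_array * index_array[::-1]
best_product_index = int(products.argmax())
print(best_product_index, products.max())
```

Here the best pair is [0, 2, 2] (length 3) and [1, 3, 3, 3] (length 4), which share no numbers, giving a product of 12.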
We then compute the products of lengths at complementary indices:
products = index_array * index_array[::-1] # We'd probably have to adjust this part
# if we picked dtype=int8 earlier.
find where the best product is:
best_product_index = products.argmax()
and the longest lists whose elements are subsets of the set represented by best_product_index and its complement are the lists we want.
This is a bit too long for a comment so I will post it as an answer. One more direct way to index your subsets as integers is to use "bitsets" with each bit in the binary representation corresponding to one of the numbers.
For example, the set {0,2,3} would be represented by 2^0 + 2^2 + 2^3 = 13 and {4,5} would be represented by 2^4 + 2^5 = 48
This would allow you to use simple lists instead of dictionaries and Python's generic hashing function.
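A minimal sketch of such a bitset encoding; find_index is the name used in the question, while members is a hypothetical helper added here to show the reverse direction.

```python
def find_index(raw_list):
    index = 0
    for number in raw_list:
        index |= 1 << number      # duplicates just set the same bit again
    return index

def members(index):
    # Hypothetical inverse: list the numbers whose bits are set.
    return [k for k in range(index.bit_length()) if (index >> k) & 1]

print(find_index([0, 2, 3, 3]), find_index([4, 5]))
print(members(13))
# Two lists share no numbers exactly when their bitsets do not overlap.
print(find_index([0, 2, 3]) & find_index([4, 5]) == 0)
```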