So I'm trying to figure out how to group similar numbers into different lists. I tried looking at some sources like (Grouping / clustering numbers in Python),
but all of them rely on importing itertools and using itertools.groupby, which I don't want because I don't want to use built-in functions.
Here is my code so far.
def n_length_combo(lst, n):
    if n == 0:
        return [[]]
    l = []
    for i in range(0, len(lst)):
        m = lst[i]
        remLst = lst[i + 1:]
        for p in n_length_combo(remLst, n - 1):
            l.append([m] + p)
    return l

print(n_length_combo(lst=[1, 1, 76, 45, 45, 4, 5, 99, 105], n=3))
Edit: n: int represents the number of groups permitted from one single list, so if n is 3, the numbers will be grouped as (x,...), (x,...), (x,...). If n = 2, the numbers will be grouped as (x,...), (x,...).
However, my code prints out all possible combinations of n elements from the list, but it doesn't group the numbers together. So what I want is: for instance, if the input is
[10,12,45,47,91,98,99]
and if n = 2, the output would be
[10,12,45,47] [91,98,99]
and if n = 3, the output would be
[10,12] [45,47] [91,98,99]
What changes to my code should I make?
Assuming n is the number of groups/partitions you want:
import math

def partition(nums, n):
    partitions = [[] for _ in range(n)]
    min_, max_ = min(nums), max(nums)
    r = max_ - min_           # range of the numbers
    s = math.ceil(r / n)      # size of each bucket/partition
    for num in nums:
        p = (num - min_) // s
        partitions[p].append(num)
    return partitions
nums = [10,12,45,47,91,98,99]
print(partition(nums, 2))
print(partition(nums, 3))
prints:
[[10, 12, 45, 47], [91, 98, 99]]
[[10, 12], [45, 47], [91, 98, 99]]
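One edge case worth noting (not covered above): if the overall range divides evenly by the bucket size, the maximum value maps to index n, and if all numbers are equal the bucket size becomes 0. A defensively adjusted sketch of the same idea:

import math

def partition_safe(nums, n):
    partitions = [[] for _ in range(n)]
    min_, max_ = min(nums), max(nums)
    s = max(1, math.ceil((max_ - min_) / n))   # bucket size, never 0
    for num in nums:
        p = min((num - min_) // s, n - 1)      # clamp so the maximum lands in the last bucket
        partitions[p].append(num)
    return partitions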
You are trying to convert a 1-D array into a 2-D array. Forgive the badly named variables, but the general idea is as follows. First we work out how many rows the 2-D matrix needs, given the length of the 1-D list and the desired number of columns; if the division isn't exact, we add one extra row. Then we loop over the columns, and inside that we loop over the rows, mapping each position (r, c) of the 2-D array to an index into the 1-D list. If that index is out of bounds we store 0 (or None, or -1, or nothing at all); otherwise we copy the value from the 1-D list into the 2-D array. In practice we build a 1-D list inside the column loop and append it to lst2 once that loop finishes.
def transform2d(lst, cols):
    size = len(lst)
    rows = int(size / cols)
    if cols * rows < size:
        rows += 1
    lst2 = []
    for c in range(cols):
        a2 = []
        for r in range(rows):
            i = c * rows + r      # map (column, row) back to an index into the 1-D list
            if i < size:
                a2.append(lst[i])
            else:
                a2.append(0)      # default value for the padded positions
        lst2.append(a2)
    return lst2

i = [10, 12, 45, 47, 91, 98, 99]
r = transform2d(i, 2)
print(r)
r = transform2d(i, 3)
print(r)
The output is as you have specified, except for printing 0 for the extra (padded) elements in the 2-D array; this can be changed by just removing the else branch that adds them.
I wish to find combinations of the elements of n copies of a vector, where the vector is simply: np.arange(0,0.1,0.01). A combination is made by selecting a single element from each vector.
I need the combinations to meet the criteria that each element within a combination is non-zero and the sum of each combination equals 1. I have the below function which works well:
# cols is a vector of length n.
def feasibility_test(row_in, cols):
    if np.around(np.sum(row_in), decimals=2) == 1 and np.count_nonzero(row_in) >= len(cols):
        pass_test = True
    else:
        pass_test = False
    return pass_test
However, for n = 6 or above, meshgrid (code below) produces an array that overwhelms the computer's memory:
def generate_all_combos(range_in, cols):
    j = range_in
    if len(cols) == 4:
        new_array = np.array(np.meshgrid(j, j, j, j)).T.reshape(-1, len(cols))
    elif len(cols) == 5:
        new_array = np.array(np.meshgrid(j, j, j, j, j)).T.reshape(-1, len(cols))
    elif len(cols) == 6:
        new_array = np.array(np.meshgrid(j, j, j, j, j, j)).T.reshape(-1, len(cols))
    elif len(cols) == 7:
        new_array = np.array(np.meshgrid(j, j, j, j, j, j, j)).T.reshape(-1, len(cols))
    return new_array
The above code can be called with:
# Create range of values in parameter space of interest
underlying_range = [np.around(i, decimals=2) for i in np.arange(0,0.1,0.01)]
# Generate all possible combinations of col values
comb_array = generate_all_combos(underlying_range, cols)
# Check which combinations are feasible
feasible_combos_high_level = [row for row in comb_array if feasibility_test(row, cols)]
Is there a way to get an array of feasible combinations without producing the entire range of combinations (the majority of which do not meet the feasibility test)?
You can use a recursive generator function that will only produce combinations that meet your criteria up front. Also, by using a generator, you are not immediately storing all the values in memory, so you can access them later on-demand:
import numpy as np

def combos(d, max_l, total_s, valid_f, rt=0, c=[]):
    if rt == total_s and len(c) == max_l:
        yield np.array(c)
    elif len(c) < max_l and rt < total_s:
        for i in filter(valid_f, d):
            for j in range(1, int(total_s / i) + 1):
                if rt + (i * j) <= total_s and i not in c:
                    yield from combos(d, max_l, total_s, valid_f, rt=rt + (i * j), c=c + ([i] * j))

vals = np.arange(0, 0.1, 0.01)
result = combos(vals, len(cols), 1, lambda x: x > 0)  # passing len(cols)

# printing first 100 valid combinations
for _ in range(100):
    print(next(result))
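If you want to materialize a batch of feasible combinations rather than pulling them one at a time, the generator can be consumed lazily with islice; a minimal sketch, assuming four columns in place of len(cols):

from itertools import islice

# collect at most the first 1000 feasible combinations
feasible = list(islice(combos(vals, 4, 1, lambda x: x > 0), 1000))
print(len(feasible))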
How can I find the best "match" for a small matrix in a big matrix?
For example:
small=[[1,2,3],
[4,5,6],
[7,8,9]]
big=[[2,4,2,3,5],
[6,0,1,9,0],
[2,8,2,1,0],
[7,7,4,2,1]]
The match is defined as the sum of absolute differences between the numbers in the two matrices; for example, the match at position (1,1) is computed as if the number 5 from small were placed on the number 0 from big (i.e. the centre of the small matrix sits at coordinates (1,1) of the big matrix).
The match value at position (1,1) is:
m(1,1)=|2−1|+|4−2|+|2−3|+|6−4|+|0−5|+|1−6|+|2−7|+|8−8|+|2−9|=28
The goal is to find the lowest difference possible between those matrices.
The small matrix always has an odd number of rows and columns, so it's easy to find its centre.
You can iterate through the viable rows and columns and zip the slices of big with small to calculate the sum of differences, and use min to find the minimum among the differences:
from itertools import islice

min(
    (
        sum(
            sum(abs(x - y) for x, y in zip(a, b))
            for a, b in zip(
                (
                    islice(r, col, col + len(small[0]))
                    for r in islice(big, row, row + len(small))
                ),
                small
            )
        ),
        (row, col)
    )
    for row in range(len(big) - len(small) + 1)
    for col in range(len(big[0]) - len(small[0]) + 1)
)
or in one line:
min((sum(sum(abs(x - y) for x, y in zip(a, b)) for a, b in zip((islice(r, col, col + len(small[0])) for r in islice(big, row, row + len(small))), small)), (row, col)) for row in range(len(big) - len(small) + 1) for col in range(len(big[0]) - len(small[0]) + 1))
This returns: (24, (1, 0))
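To pull the best-matching window out of big once the result is known, a small follow-up sketch (the names here are illustrative, and the values are just the result of the expression above):

best_diff, (best_row, best_col) = 24, (1, 0)
window = [row[best_col:best_col + len(small[0])] for row in big[best_row:best_row + len(small)]]
print(window)  # [[6, 0, 1], [2, 8, 2], [7, 7, 4]]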
Done by hand:
small = [[1,2,3],
         [4,5,6],
         [7,8,9]]

big = [[2,4,2,3,5],
       [6,0,1,9,0],
       [2,8,2,1,0],
       [7,7,4,2,1]]

# collect all the sums
summs = []
# k and j are the offsets into big
for k in range(len(big) - len(small) + 1):
    # add inner list for one row
    summs.append([])
    for j in range(len(big[0]) - len(small[0]) + 1):
        s = 0
        for row in range(len(small)):
            for col in range(len(small[0])):
                s += abs(big[k + row][j + col] - small[row][col])
        # add to the inner list
        summs[-1].append(s)

print(summs)
Output:
[[28, 29, 38], [24, 31, 39]]
If you are just interested in the coords in the bigger one, store tuples of (rowoffset, coloffset, sum) and don't nest lists inside lists. That way you can use min() with a key:
summs = []
for k in range(len(big) - len(small) + 1):
    for j in range(len(big[0]) - len(small[0]) + 1):
        s = 0
        for row in range(len(small)):
            for col in range(len(small[0])):
                s += abs(big[k + row][j + col] - small[row][col])
        summs.append((k, j, s))  # row, col, sum

print("Min value for bigger matrix at ", min(summs, key=lambda x: x[2]))
Output:
Min value for bigger matrix at (1, 0, 24)
If there were ties, this would only return the one with the minimal row and column offset.
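If you do want every tying offset, a minimal sketch building on the summs list above:

best = min(s for _, _, s in summs)
print([(k, j) for k, j, s in summs if s == best])  # all (row, col) offsets that reach the minimum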
Another possible solution would be this, returning the minimum difference and the coordinates in the big matrix:
small = [[1,2,3],
         [4,5,6],
         [7,8,9]]

big = [[2,4,2,3,5],
       [6,0,1,9,0],
       [2,8,2,1,0],
       [7,7,4,2,1]]

def difference(small, matrix):
    l = len(small)
    return sum([abs(small[i][j] - matrix[i][j]) for i in range(l) for j in range(l)])

def getSubmatrices(big, smallLength):
    submatrices = []
    # number of valid column and row offsets for the small matrix inside big
    colSteps = len(big[0]) - smallLength + 1
    rowSteps = len(big) - smallLength + 1
    for i in range(colSteps):
        for j in range(rowSteps):
            tempMatrix = [big[j + k][i:i + smallLength] for k in range(smallLength)]
            submatrices.append([i + 1, j + 1, tempMatrix])
    return submatrices

def minDiff(small, big):
    submatrices = getSubmatrices(big, len(small))
    diffs = [(x, y, difference(small, submatrix)) for x, y, submatrix in submatrices]
    minDiff = min(diffs, key=lambda elem: elem[2])
    return minDiff

x, y, diff = minDiff(small, big)
print("Minimum difference: ", diff)
print("X = ", x)
print("Y = ", y)
Output:
Minimum difference: 24
X = 1
Y = 2
I would use numpy to help with this.
To start I would convert the arrays to numpy arrays
import numpy as np
small = np.array([[1,2,3], [4,5,6], [7,8,9]])
big = np.array([[2,4,2,3,5], [6,0,1,9,0], [2,8,2,1,0], [7,7,4,2,1]])
then I would initialize an array to store the results of the test (optional: a dictionary as well)
result_shape = np.array(big.shape) - np.array(small.shape) + 1
results = np.zeros((result_shape[0], result_shape[1]))
result_dict = {}
Then iterate over the positions in which the small matrix can be positioned over the large matrix and calculate the difference:
insert = np.zeros(big.shape)
for i in range(results.shape[0]):
    for j in range(results.shape[1]):
        insert[i:small.shape[0] + i, j:small.shape[1] + j] = small
        results[i, j] = np.sum(np.abs(big - insert)[i:3 + i, j:3 + j])
        # Optional dictionary
        result_dict['{}{}'.format(i, j)] = np.sum(np.abs(big - insert)[i:3 + i, j:3 + j])
Then you can print(results) and obtain:
[[ 28. 29. 38.]
[ 24. 31. 39.]]
and/or, because the position of the small matrix over the big matrix is stored in the keys of the dictionary, you can get the position where the difference is smallest by key manipulation:
pos_min = [int(i) for i in list(min(result_dict, key=result_dict.get))]
and if you print(pos_min), you obtain:
[1, 0]
Then, if you need the index for anything, you can iterate over it as required. Hope this helps!
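As an alternative to the dictionary, the same position can be read straight off the results array; a small sketch (not part of the original answer):

# row/column offset of the smallest entry in results
pos_min = np.unravel_index(np.argmin(results), results.shape)
print(pos_min)  # (1, 0) for the example data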
I'm implementing the Select Algorithm (a.k.a. Deterministic Select). I've got it working for small arrays/lists, but when my array size gets above 26 it breaks with the following error: "RuntimeError: maximum recursion depth exceeded". For arrays of size 25 and below there is no problem.
My ultimate goal is to have it run for arrays of size 500 and do many iterations. The iterations are not an issue. I have already researched Stack Overflow and have seen the article Python implementation of "median of medians" algorithm, among many others. I had a hunch that duplicates in my randomly generated array might be causing a problem, but that doesn't seem to be it.
Here's my code:
import math
import random

# Insertion Sort Khan Academy video: https://www.youtube.com/watch?v=6pyeMmJTefg&list=PL36E7A2B75028A3D6&index=22
def insertion_sort(A):  # Sorting it in place
    for index in range(1, len(A)):  # range is up to but not including len(A)
        value = A[index]
        i = index - 1  # index of the item that is directly to the left
        while i >= 0:
            if value < A[i]:
                A[i + 1] = A[i]
                A[i] = value
                i = i - 1
            else:
                break

timeslo = 0  # I think that this is a global variable

def partition(A, p):
    global timeslo
    hi = []  # hold things larger than our pivot
    lo = []  # "    "    smaller  "    "    "
    for x in A:  # walk through all the elements in the array A.
        if x < p:
            lo = lo + [x]
            timeslo = timeslo + 1  # keep track of the no. of comparisons
        else:
            hi = hi + [x]
    return lo, hi, timeslo

def get_chunks(Acopy, n):
    # Declare some empty lists to hold our chunks
    chunk = []
    chunks = []
    # Step through the array n elements at a time
    for x in range(0, len(Acopy), n):  # stepping by size n starting at the beginning of the array
        chunk = Acopy[x:x + n]  # Extract 5 elements
        # sort chunk and find its median
        insertion_sort(chunk)  # in-place sort of chunk of size 5
        # get the median (i.e. the middle element) and add it to the list
        mindex = (len(chunk) - 1) / 2  # pick the middle index each time
        chunks.append(chunk[mindex])
        # chunks.append(chunk)  # assuming subarrays are size 5 and we want the middle
        # index, which is 2 -- this caused some trouble because not all subarrays were size 5
    return chunks

def Select(A, k):
    if len(A) == 1:  # if the array is size 1 then just return the one and only element
        return A[0]
    elif len(A) <= 5:  # if length is 5 or less, sort it and return the kth smallest element
        insertion_sort(A)
        return A[k - 1]
    else:
        M = get_chunks(A, 5)  # this will give you the array of medians... don't sort it. WHY???
        m = len(M)  # m is the size of the array of medians M.
        x = Select(M, m / 2)  # m/2 is the same as len(A)/10 FYI
        lo, hi, timeslo = partition(A, x)
        rank = len(lo) + 1
        if rank == k:  # we're in the middle -- we're done
            return x, timeslo  # return the value of the kth smallest element
        elif k < rank:
            return Select(lo, k)  # ???????????????
        else:
            return Select(hi, k - rank)

################### TROUBLESHOOTING ################################
# Works with arrays of size 25 and 5000 iterations
# Doesn't work with   "           26 and 5000      "
#
# arrays of size 26 and 20 iterations breaks it ?????????????????

# A = []
Total = 0
n = input('What size of array of random #s do you want?: ')
N = input('number of iterations: ')
# n = 26
# N = 1
for x in range(0, N):
    A = random.sample(range(1, 1000), n)  # make an array or list of size n
    result = Select(A, 2)  # p is the median of the medians, 2 means the 3rd smallest element
    Total = Total + timeslo  # the total number of comparisons made

print("the result is"), result
print("timeslo = "), timeslo
print("# of comparisons = "), Total

# A = [7, 1, 3, 5, 9, 2, 83, 8, 4, 13, 17, 21, 16, 11, 77, 33, 55, 44, 66, 88, 111, 222]
# result = Select(A, 2)
# print("Result = "), result
Any help would be appreciated.
Change this line
return x, timeslo # return the value of the kth smallest element
into
return x # return the value of the kth smallest element
You can get timeslo by printing it at the end. Returning x together with timeslo is not correct, because the returned value is later used as the pivot p in partition(A, p) to split the array, and p should be the median number obtained from the previous statement x = Select(M, m/2).
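For clarity, here is a sketch of how the tail of Select reads with that one-line fix applied (everything else is unchanged from the question):

    else:
        M = get_chunks(A, 5)
        m = len(M)
        x = Select(M, m / 2)
        lo, hi, timeslo = partition(A, x)
        rank = len(lo) + 1
        if rank == k:
            return x  # just the value; read the comparison count from the global timeslo afterwards
        elif k < rank:
            return Select(lo, k)
        else:
            return Select(hi, k - rank)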
I have random 2d arrays which I make using
import numpy as np
from itertools import combinations
n = 50
A = np.random.randint(2, size=(n,n))
I would like to determine if the matrix has two pairs of rows which sum to the same row vector. I am looking for a fast method to do this. My current method just tries all possibilities.
for pair in combinations(combinations(range(n), 2), 2):
    if np.array_equal(A[pair[0][0]] + A[pair[0][1]], A[pair[1][0]] + A[pair[1][1]]):
        print "Pair found", pair
A method that worked for n = 100 would be really great.
Here is a pure numpy solution; no extensive timings, but I have to push n up to 500 before I can see my cursor blink once before it completes. It is memory intensive though, and will fail due to memory requirements for much larger n. Either way, my intuition is that the odds of finding such a pair of pairs decrease geometrically for larger n anyway.
import numpy as np

n = 100
A = np.random.randint(2, size=(n, n)).astype(np.int8)

def base3(a):
    """
    pack the last axis of an array in base 3
    40 base 3 numbers per uint64
    """
    S = np.array_split(a, a.shape[-1]//40+1, axis=-1)
    R = np.zeros(shape=a.shape[:-1]+(len(S),), dtype=np.uint64)
    for i in xrange(len(S)):
        s = S[i]
        r = R[..., i]
        for j in xrange(s.shape[-1]):
            r *= 3
            r += s[..., j]
    return R

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)
    return unique, count

def voidview(arr):
    """view the last axis of an array as a void object. can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])

def has_pairs_of_pairs(A):
    # optional; convert rows to base 3
    A = base3(A)
    # precompute sums over a lower triangular set of all combinations
    rowsums = sum(A[I] for I in np.tril_indices(n, -1))
    # count the number of times each row occurs by sorting
    # note that this is not quite O(n log n), since the cost of handling each row is also a function of n
    unique, count = unique_count(voidview(rowsums))
    # print if any pairs of pairs exist;
    # computing their indices is left as an exercise for the reader
    return np.any(count > 1)

from time import clock
t = clock()
for i in xrange(100):
    print has_pairs_of_pairs(A)
print clock()-t
Edit: included base-3 packing; now n=2000 is feasible, taking about 2gb of mem, and a few seconds of processing
Edit: added some timings; n=100 takes only 5ms per call on my i7m.
Based on the code in your question, and on the assumption that you're actually looking for pairs of pairs of rows that sum to the same row vector, you could do something like this:
def findMatchSets(A):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), 2))
    matchSets = [[i for i in pairs if B[0][i[0]] + B[0][i[1]] == z] for z in range(3)]
    for c in range(1, len(A[0])):
        matchSets = [[i for i in block if B[c][i[0]] + B[c][i[1]] == z] for z in range(3) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets
This basically stratifies the matrix into equivalence sets that sum to the same value after one column has been taken into account, then two columns, then three, and so on, until it either reaches the last column or there is no equivalence set left with more than one member (i.e. there is no such pair of pairs). This will work fine for 100x100 arrays, largely because the chances of two pairs of rows summing to the same row vector are vanishingly small when n is large (n*(n-1)/2 pairs of rows compared to 3^n possible vector sums).
UPDATE
Updated code to allow searching for pairs of n-size subsets of all rows, as requested. Default is n=2 as per the original question:
def findMatchSets(A, n=2):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), n))
    matchSets = [[i for i in pairs if sum([B[0][i[j]] for j in range(n)]) == z] for z in range(n + 1)]
    for c in range(1, len(A[0])):
        matchSets = [[i for i in block if sum([B[c][i[j]] for j in range(n)]) == z] for z in range(n + 1) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets
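A quick usage sketch (the array and variable names here are illustrative, not from the original answer; numpy and itertools.combinations are assumed to be imported as in the question):

A = np.random.randint(2, size=(100, 100))
blocks = findMatchSets(A)  # groups of row-index pairs whose column-wise sums all agree
print("pair of pairs exists:", bool(blocks))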
Here is a 'lazy' approach that scales up to n=10000, using 'only' 4 GB of memory and completing in about 10 s per call. Worst-case complexity is O(n^3), but for random data the expected performance is O(n^2). At first sight it seems like you'd need O(n^3) ops: each row combination needs to be produced and inspected at least once. But we need not look at the entire row; rather, we can apply an early-exit strategy to the comparison of row pairs once it is clear they are of no use to us, and for random data we typically reach this conclusion long before we have considered all the columns in a row.
import numpy as np

n = 10
# also works for non-square A
A = np.random.randint(2, size=(n*2, n)).astype(np.int8)
# force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]

def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray(np.rollaxis(a, -1))
    init = np.zeros(a.shape[1:], dtype)
    packing = int(np.dtype(dtype).itemsize * 8 / (float(base) / 2))
    for columns in np.array_split(a, (len(a)-1)//packing+1):
        yield reduce(
            lambda acc, inc: acc*base+inc,
            columns,
            init)

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)  # note; this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse

def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations
    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isnt strictly required; we can lazily consider the columns, giving an early exit opportunity
    all nicely vectorized of course
    """
    multiplicity, combinations = combinations_index.shape
    # list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)
    for packed_column in base_pack_lazy(A, base=multiplicity+1):  # loop over packed cols
        # compute rowsums only for a fixed number of columns at a time.
        # this is O(n^2) rather than O(n^3), and after considering the first column,
        # we can typically already exclude almost all rowpairs
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        # find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        # prune those pairs which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        # early exit; no pairs
        if len(active_combinations) == 0:
            return False
    return True

def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array([(i, j, k)
                    for i in xrange(n)
                    for j in xrange(n)
                    for k in xrange(n)
                    if i < j and j < k], dtype=np.uint16)
    idx = np.ascontiguousarray(idx.T)
    return has_identical_row_sums_lazy(A, idx)

def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n, -1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)

from time import clock
t = clock()
for i in xrange(10):
    print has_identical_double_row_sums(A)
    print has_identical_triple_row_sums(A)
print clock()-t
Extended to include the calculation over sums of triplets of rows, as you asked above. For n=100, this still takes only about 0.2s
Edit: some cleanup; edit2: some more cleanup
Your current code does not test for pairs of rows that sum to the same value.
Assuming that's actually what you want, it's best to stick to pure numpy. This generates the indices of all rows that have an equal sum.
import numpy as np

n = 100
A = np.random.randint(2, size=(n, n))

rowsum = A.sum(axis=1)
unique, inverse = np.unique(rowsum, return_inverse=True)
count = np.zeros_like(unique)
np.add.at(count, inverse, 1)

for p in unique[count > 1]:
    print p, np.nonzero(rowsum == p)[0]
If all you need to do is determine whether such a pair exists you can do:
exists_unique = np.unique(A.sum(axis=1)).size != A.shape[0]