Best way to ignore empty arrays in Python? - python

I am concatenating values from a range of cells into arrays, and if a cell is empty I need to skip it and continue with the next one.
What is the best approach?
This is what I have so far (I converted the arrays to strings because otherwise it didn't let me operate on empty cells).
def get_names_and_probs(row, col, col1, col2):
    prob_array = []
    name_array = str(df.iloc[row, col]).replace(" ", "").split(",")
    name_array1 = str(df.iloc[row, col1]).replace(" ", "").split(",")
    name_array2 = str(df.iloc[row, col2]).replace(" ", "").split(",")
    C = np.array([[i.value for i in j] for j in ws['O1':'Q1']]).ravel()
    C = list(C)
    print(C)
    for k in range(len(name_array)):
        prob_array.append(C[0])
    for k in range(len(name_array1)):
        prob_array.append(C[1])
    for k in range(len(name_array2)):
        prob_array.append(C[2])
    print(prob_array)
get_names_and_probs(0, 14, 15, 16)
In my Excel file the column headers are probabilities, and the cells contain values corresponding to those probabilities. It looks like this:
0 50 30 1
1 Oval, Round Irregular
2 Circumscribed
3 High density Equal density, Low density
4 Coarse or “popcorn-like”
For example, [1][50] contains both Oval and Round, [1][30] is empty, and [1][1] is Irregular.
What I'm trying to do is build an array from each row: if a cell contains more than one value, each value is appended to the array and associated with the probability in the header above it.
In this case the result should be name_arr = ['Oval', 'Round', 'Irregular'] with corresponding probabilities [50, 50, 1] (30 is ignored because its cell is empty).
Thanks in advance!
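A minimal sketch of one way to do this, assuming the sheet is read with pandas and empty cells come back as NaN (the DataFrame and column labels below are made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical data: one column per probability, NaN for the empty cell.
df = pd.DataFrame({50: ["Oval, Round"], 30: [np.nan], 1: ["Irregular"]})

def get_names_and_probs(row, cols):
    names, probs = [], []
    for col in cols:
        cell = df.loc[row, col]
        if pd.isna(cell) or str(cell).strip() == "":
            continue  # skip empty cells (and their probability) entirely
        values = [v.strip() for v in str(cell).split(",")]
        names.extend(values)
        probs.extend([col] * len(values))
    return names, probs

print(get_names_and_probs(0, [50, 30, 1]))
# (['Oval', 'Round', 'Irregular'], [50, 50, 1])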

Related

Recursive python function to make two arrays equal?

I'm attempting to write Python code to solve a transportation problem using the Least Cost method. I have a 2D numpy array that I am iterating through to find the minimum, perform calculations with that minimum, and then replace it with a 0 so that the loop stops when values matches constantarray, an array of the same shape containing only 0s. The values array contains distances from points in supply to points in demand. I'm currently using a while loop to do so, but the loop isn't running because values.all() != constantarray.all() evaluates to False.
I also need the process to repeat once the arrays have been edited to move onto the next lowest number in values.
constantarray = np.zeros((len(supply), len(demand)))  # create array of 0s
sandmoved = np.zeros((len(supply), len(demand)))      # used to store information needed for later
totalcost = 0
while values.all() != constantarray.all():  # iterate until `values` only contains 0s
    m = np.argmin(values, axis=0)[0]  # find coordinates of minimum value
    n = np.argmin(values, axis=1)[0]
    if supply[m] > abs(demand[m]):  # all demand numbers are negative
        supply[m] += demand[n]  # subtract demand from supply
        totalcost += abs(demand[n]) * values[m, n]
        sandmoved[m, n] = demand[n]  # add amount of 'sand' moved to an empty array
        values[m, 0:-1] = 0  # replace entire m row with 0s since demand has been filled
        demand[n] = 0  # replace demand value with 0
    elif supply[m] < abs(demand[n]):
        demand[n] += supply[m]  # combine positive supply with negative demand
        sandmoved[m, n] = supply[m]
        totalcost += supply[m] * values[m, n]
        values[:-1, n] = 0  # replace entire column with 0s since supply has been depleted
        supply[m] = 0
There is an additional if statement for when supply[m]==demand[n], but I feel that isn't necessary. I've already tried using nested for loops and many different syntax combinations for a while loop, but I just can't get it to work the way I want it to. Even when running the code block over and over by itself, m and n stay the same, and the function removes one value from values but doesn't add it to sandmoved. Any ideas are greatly appreciated!!
Well, here is an example from an old implementation of mine:
import numpy as np

values = np.array([[3, 1, 7, 4],
                   [2, 6, 5, 9],
                   [8, 3, 3, 2]])
demand = np.array([250, 350, 400, 200])
supply = np.array([300, 400, 500])
totCost = 0
MAX_VAL = 2 * np.max(values)  # choose MAX_VAL higher than all values

while np.any(values.ravel() < MAX_VAL):
    # find row and col indices of min
    m, n = np.unravel_index(np.argmin(values), values.shape)
    if supply[m] < demand[n]:
        totCost += supply[m] * values[m, n]
        demand[n] -= supply[m]
        values[m, :] = MAX_VAL  # set all row to MAX_VAL
    else:
        totCost += demand[n] * values[m, n]
        supply[m] -= demand[n]
        values[:, n] = MAX_VAL  # set all col to MAX_VAL
Solution:
print(totCost)
# 2850
Basically, start by choosing a MAX_VAL higher than all given values and a totCost = 0. Then follow the standard steps of the algorithm. Find row and column indices of the smallest cell, say m, n. Select the m-th supply or the n-th demand whichever is smaller, then add what you selected multiplied by values[m,n] to the totCost, and set all entries of the selected row or column to MAX_VAL to avoid it in the next iterations. Update the greater value by subtracting the selected one and repeat until all values are equal to MAX_VAL.
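For instance, with the data above the first iteration picks the minimum cost 1 at (m, n) = (0, 1); since supply[0] = 300 < demand[1] = 350, it adds 300 * 1 = 300 to totCost, reduces demand[1] to 50, and sets row 0 to MAX_VAL before repeating.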

Rectangular array with holes

I'm trying to create a rectangular grid with numbers in some cells (but not all of them), in a way such that it's easy to select a given row or column.
What I did so far is to create the list of the positions of the numbers in the grid and the list of the numbers contained in the grid, so that I can select the number at position (i,j) with numbers[positions.index([i,j])], but this is not very handy, especially if I need, for example, to find the minimum of the values in a given column.
Is there a way to create the grid so that, for example, I can select elements with grid[i][j] and columns with grid[:][j] or something similar? The programming language is Python.
You can use numpy for this. It lets you create an array, which can index a single value with array[i,j] or a full column with array[:,j].
I'm not entirely sure what you mean by holes, but numpy will require you to have a value in every spot in the array. The best you can do, I believe, is set the unused cells to a preset "empty" value.
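A minimal sketch of that idea, using np.nan as the "empty" value (the grid shape and values here are made up for illustration):

import numpy as np

grid = np.full((4, 5), np.nan)   # 4x5 grid, every cell starts out as a "hole"
grid[0, 1] = 7.0
grid[2, 1] = 3.0
grid[3, 4] = 9.0

print(grid[2, 1])             # single cell -> 3.0
print(grid[:, 1])             # full column -> [ 7. nan  3. nan]
print(np.nanmin(grid[:, 1]))  # minimum of column 1, ignoring the holes -> 3.0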
Store your grid as a 2D array (a matrix) and use list comprehensions.
first_column = [row[0] for row in grid]
second_column = [row[1] for row in grid]
If you're going to have a large proportion of "cells" that are unused, you could try using a dictionary with a tuple of coordinates as the key.
matrix = dict()
matrix[1,3] = 13
matrix[1,5] = 15
matrix[2,3] = 23
matrix[2,7] = 27
matrix[3,7] = 37
valuesInRow2 = [v for (r,c),v in matrix.items() if r==2]
# [23,27]
By subclassing dict to manage the indexing and overriding the relevant operators, you could get it to behave exactly the way you want:
class Sparse(dict):
    def __init__(self, rows=0, cols=0):
        super().__init__()
        self.rows = rows
        self.cols = cols

    def __indexToRanges(self, rowIndex, colIndex):
        scalar = isinstance(rowIndex, int) and isinstance(colIndex, int)
        if isinstance(rowIndex, slice):
            rowRange = range(*rowIndex.indices(self.rows))
        else:
            rowRange = range(rowIndex, rowIndex+1)
        if isinstance(colIndex, slice):
            colRange = range(*colIndex.indices(self.cols))
        else:
            colRange = range(colIndex, colIndex+1)
        return rowRange, colRange, scalar

    def __getitem__(self, indexes):
        row, col = indexes
        rowRange, colRange, scalar = self.__indexToRanges(row, col)
        if scalar:
            return super().__getitem__((row, col))
        return [v for (r, c), v in self.items() if r in rowRange and c in colRange]

    def __setitem__(self, index, value):
        row, col = index
        rowRange, colRange, scalar = self.__indexToRanges(row, col)
        if scalar:
            self.rows = max(self.rows, row+1)
            self.cols = max(self.cols, col+1)
            return super().__setitem__((row, col), value)
usage:
matrix = Sparse()
matrix[1,3] = 13
matrix[1,5] = 15
matrix[2,3] = 23
matrix[2,7] = 27
matrix[3,7] = 37
print("sum of column 3:", sum(matrix[:,3]) ) # 36
print("sum of row 2:", sum(matrix[2,:]) ) # 50
print("top left 4x4 values:", matrix[:4,:4] ) # [13, 23]

reordering cluster numbers for correct correspondence

I have a dataset that I clustered using two different clustering algorithms. The results are about the same, but the cluster numbers are permuted.
Now, for displaying the color-coded labels, I want the label ids to be the same for the same clusters.
How can I find the correct permutation between the two sets of label ids?
I can do this using brute force, but perhaps there is a better/faster method. I would greatly appreciate any help or pointers. If possible I am looking for a python function.
The most well-known algorithm for finding the optimal matching is the Hungarian method.
Because it cannot be explained in a few sentences, I have to refer you to a book of your choice, or to the Wikipedia article "Hungarian algorithm".
You can probably get good results (even perfect if the difference is indeed tiny) by simply picking the maximum of the correspondence matrix and then removing that row and column.
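A minimal sketch of the Hungarian-method matching, using scipy's linear_sum_assignment on the contingency matrix (this assumes both labelings use the labels 0..k-1; the example arrays are just for illustration):

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

a = np.array([2, 2, 3, 1, 0, 0, 0, 1, 2, 1, 2, 2, 0, 3, 3, 3, 3])  # reference labels
b = np.array([0, 0, 0, 1, 1, 3, 2, 3, 2, 2, 0, 0, 0, 2, 0, 3, 3])  # labels to remap

cont = contingency_matrix(a, b)                  # rows: labels of a, cols: labels of b
row_ind, col_ind = linear_sum_assignment(-cont)  # maximize the total overlap
mapping = dict(zip(col_ind, row_ind))            # label in b -> best-matching label in a
b_aligned = np.array([mapping[label] for label in b])
print(b_aligned)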
I have a function that works for me. But it may fail when the two cluster results are very inconsistent, which leads to duplicated max values in the contingency matrix. If your cluster results are about the same, it should work.
Here is my code:
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def align_cluster_index(ref_cluster, map_cluster):
    """
    Remap cluster indices in map_cluster according to ref_cluster.
    Both inputs must be numpy arrays and have the same number of unique cluster index values.
    Xin Niu Jan-15-2020
    """
    ref_values = np.unique(ref_cluster)
    map_values = np.unique(map_cluster)
    print(ref_values)
    print(map_values)
    num_values = ref_values.shape[0]
    if ref_values.shape[0] != map_values.shape[0]:
        print('error: both inputs must have same number of unique cluster index values.')
        return ()
    switched_col = set()
    while True:
        cont_mat = contingency_matrix(ref_cluster, map_cluster)
        print(cont_mat)
        # divide contingency_matrix by its row and col sums to avoid potential duplicated values:
        col_sum = np.matmul(np.ones((num_values, 1)), np.sum(cont_mat, axis=0).reshape(1, num_values))
        row_sum = np.matmul(np.sum(cont_mat, axis=1).reshape(num_values, 1), np.ones((1, num_values)))
        print(col_sum)
        print(row_sum)
        cont_mat = cont_mat / (col_sum + row_sum)
        print(cont_mat)
        # ignore columns that have been switched:
        cont_mat[:, list(switched_col)] = -1
        print(cont_mat)
        sort_0 = np.argsort(cont_mat, axis=0)
        sort_1 = np.argsort(cont_mat, axis=1)
        print('argsort contmat:')
        print(sort_0)
        print(sort_1)
        if np.array_equal(sort_1[:, -1], np.array(range(num_values))):
            break
        # switch values according to the max value in the contingency matrix:
        # get the position of max value:
        idx_max = np.unravel_index(np.argmax(cont_mat, axis=None), cont_mat.shape)
        print(cont_mat)
        print(idx_max)
        if (cont_mat[idx_max] > 0) and (idx_max[0] not in switched_col):
            cluster_tmp = map_cluster.copy()
            print('switch', map_values[idx_max[1]], 'and:', ref_values[idx_max[0]])
            map_cluster[cluster_tmp == map_values[idx_max[1]]] = ref_values[idx_max[0]]
            map_cluster[cluster_tmp == map_values[idx_max[0]]] = ref_values[idx_max[1]]
            switched_col.add(idx_max[0])
            print(switched_col)
        else:
            break
    print('final argsort contmat:')
    print(sort_0)
    print(sort_1)
    print('final cont_mat:')
    cont_mat = contingency_matrix(ref_cluster, map_cluster)
    col_sum = np.matmul(np.ones((num_values, 1)), np.sum(cont_mat, axis=0).reshape(1, num_values))
    row_sum = np.matmul(np.sum(cont_mat, axis=1).reshape(num_values, 1), np.ones((1, num_values)))
    cont_mat = cont_mat / (col_sum + row_sum)
    print(cont_mat)
    return map_cluster
And here is some test code:
ref_cluster = np.array([2,2,3,1,0,0,0,1,2,1,2,2,0,3,3,3,3])
map_cluster = np.array([0,0,0,1,1,3,2,3,2,2,0,0,0,2,0,3,3])
c = align_cluster_index(ref_cluster, map_cluster)
print(ref_cluster)
print(c)
>>>[2 2 3 1 0 0 0 1 2 1 2 2 0 3 3 3 3]
>>>[2 2 2 1 1 3 0 3 0 0 2 2 2 0 2 3 3]

Averaging out sections of a multiple row array in Python

I've got a 2-row array called C like this:
from numpy import *
A = array([1,2,3,4,5])
B = array([50,40,30,20,10])
C = vstack((A,B))
I want to take all the columns in C where the value in the first row falls between i and i+2, and average them. I can do this with just A no problem:
i = 0
A_avg = []
while i < 6:
    selection = A[logical_and(A >= i, A < i+2)]
    A_avg.append(mean(selection))
    i += 2
then A_avg is:
[1.0,2.5,4.5]
I want to carry out the same process with my two-row array C, but I want to take the average of each row separately, while doing it in a way that's dictated by the first row. For example, for C, I want to end up with a 2 x 3 array that looks like:
[[1.0,2.5,4.5],
[50,35,15]]
Where the first row is A averaged in blocks between i and i+2 as before, and the second row is B averaged in the same blocks as A, regardless of the values it has. So the first entry is unchanged, the next two get averaged together, and the next two get averaged together, for each row separately. Anyone know of a clever way to do this? Many thanks!
I hope this is not too clever. TIL boolean indexing does not broadcast, so I had to manually do the broadcasting. Let me know if anything is unclear.
import numpy as np
A = [1,2,3,4,5]
B = [50,40,30,20,10]
C = np.vstack((A,B)) # float so that I can use np.nan
i = np.arange(0, 6, 2)[:, None]
selections = np.logical_and(A >= i, A < i+2)[None]
D, selections = np.broadcast_arrays(C[:, None], selections)
D = D.astype(float) # allows use of nan, and makes a copy to prevent repeated behavior
D[~selections] = np.nan # exclude these elements from mean
D = np.nanmean(D, axis=-1)
Then,
>>> D
array([[ 1. , 2.5, 4.5],
[ 50. , 35. , 15. ]])
Another way, using np.histogram to bin your data. This may be faster for large arrays, but is only useful for few rows, since a hist must be done with different weights for each row:
bins = np.arange(0, 7, 2) # include the end
n = np.histogram(A, bins)[0] # number of columns in each bin
a_mean = np.histogram(A, bins, weights=A)[0]/n
b_mean = np.histogram(A, bins, weights=B)[0]/n
D = np.vstack([a_mean, b_mean])
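For the example arrays above, this should reproduce the same result as the broadcasting approach:

print(D)
# [[ 1.   2.5  4.5]
#  [50.  35.  15. ]]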

Find two disjoint pairs of pairs that sum to the same vector

This is a follow-up to Find two pairs of pairs that sum to the same value.
I have random 2d arrays which I make using
import numpy as np
from itertools import combinations
n = 50
A = np.random.randint(2, size=(m,n))
I would like to determine if the matrix has two disjoint pairs of pairs of columns which sum to the same column vector. I am looking for a fast method to do this. In the previous problem ((0,1), (0,2)) was acceptable as a pair of pairs of column indices but in this case it is not as 0 is in both pairs.
The accepted answer to the previous question is so cleverly optimised that I can't see how to make this simple-looking change, unfortunately. (I am interested in columns rather than rows in this question, but I can always just use A.transpose().)
Here is some code that tests all 4 by 4 arrays.
n = 4
nxn = np.arange(n*n).reshape(n, -1)
count = 0
for i in range(2**(n*n)):
    A = (i >> nxn) % 2
    p = 1
    for firstpair in combinations(range(n), 2):
        for secondpair in combinations(range(n), 2):
            if firstpair < secondpair and not set(firstpair) & set(secondpair):
                if np.array_equal(A[firstpair[0]] + A[firstpair[1]], A[secondpair[0]] + A[secondpair[1]]):
                    if p:
                        count += 1
                        p = 0
print(count)
This should output 3136.
Here is my solution, extended to do what I believe you want. The requirement isn't entirely clear, though; one may get an arbitrary number of row-pairs that sum to the same total, and there may exist unique subsets of rows within them that sum to the same value. For instance:
Given this set of row-pairs that sum to the same total
[[19 19 30 30]
 [11 16 11 16]]
There exists a unique subset of these rows that may still be counted as valid; but should it?
[[19 30]
 [16 11]]
Anyway, I hope those details are easy to deal with, given the code below.
import numpy as np

n = 20
# also works for non-square A
A = np.random.randint(2, size=(n*6, n)).astype(np.int8)
##A = np.array( [[0, 0, 0], [1, 1, 1], [1, 1 ,1]], np.uint8)
##A = np.zeros((6,6))
# force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]

def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray(np.rollaxis(a, -1))
    packing = int(np.dtype(dtype).itemsize * 8 / (float(base) / 2))
    for columns in np.array_split(a, (len(a)-1)//packing+1):
        R = np.zeros(a.shape[1:], dtype)
        for col in columns:
            R *= base
            R += col
        yield R

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), int)
    np.add.at(count, inverse, 1)  # note: this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse

def voidview(arr):
    """view the last axis of an array as a void object. can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])

def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations
    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isn't strictly required; we can lazily consider the columns, giving an early exit opportunity
    all nicely vectorized of course
    """
    multiplicity, combinations = combinations_index.shape
    # list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)
    # keep all packed columns; we might need them later
    columns = []
    for packed_column in base_pack_lazy(A, base=multiplicity+1):  # loop over packed cols
        columns.append(packed_column)
        # compute rowsums only for a fixed number of columns at a time.
        # this is O(n^2) rather than O(n^3), and after considering the first column,
        # we can typically already exclude almost all combinations
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        # find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        # prune those combinations which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        # early exit; no pairs
        if len(active_combinations) == 0:
            return False
    """
    we now have a small set of relevant combinations, but we have lost the details of their particulars
    to see which combinations of rows do sum to the same value, we do need to consider rows as a whole
    we can simply apply the same mechanism, but for all columns at the same time,
    but only for the selected subset of row combinations known to be relevant
    """
    # construct full packed matrix
    B = np.ascontiguousarray(np.vstack(columns).T)
    # perform all relevant sums, over all columns
    rowsums = sum(B[I[active_combinations]] for I in combinations_index)
    # find the unique rowsums, by viewing rows as a void object
    unique, count, inverse = unique_count(voidview(rowsums))
    # if not, we did something wrong in deciding on active combinations
    assert np.all(count > 1)
    # loop over all sets of rows that sum to an identical unique value
    for i in range(len(unique)):
        # set of indexes into combinations_index;
        # note that there may be more than two combinations that sum to the same value; we grab them all here
        combinations_group = active_combinations[inverse == i]
        # associated row-combinations
        # array of shape=(multiplicity, group_size)
        row_combinations = combinations_index[:, combinations_group]
        # if no duplicate rows involved, we have a match
        if len(np.unique(row_combinations[:, [0, -1]])) == multiplicity*2:
            print(row_combinations)
            return True
    # none of the identical rowsums met the uniqueness criteria
    return False

def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array([(i, j, k)
                    for i in range(n)
                    for j in range(n)
                    for k in range(n)
                    if i < j and j < k], dtype=np.uint16)
    idx = np.ascontiguousarray(idx.T)
    return has_identical_row_sums_lazy(A, idx)

def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n, -1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)

from time import perf_counter
t = perf_counter()
for i in range(1):
    ## print(has_identical_double_row_sums(A))
    print(has_identical_triple_row_sums(A))
print(perf_counter() - t)
Edit: code cleanup
