I'm trying to create a rectangular grid with numbers in some cells (but not all of them), in a way such that it's easy to select a given row or column.
What I did so far is to create the list of the positions of the numbers in the grid and the list of the numbers contained in the grid, so that I can select the number at position (i,j) with numbers[positions.index([i,j])], but this is not very handy, especially if, for example, I need to find the minimum of the values in a given column.
Is there a way to create the grid so that, for example, I can select elements with grid[i][j] and columns with grid[:][j] or something similar? The programming language is Python.
You can use numpy for this. It lets you create an array, which can index a single value with array[i,j] or a full column with array[:,j].
I'm not entirely sure what you mean by the empty cells, but numpy does require a value in every position of the array; the best you can do, I believe, is fill the unused cells with a preset "empty" value.
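For example, a sketch using np.nan as the "empty" marker and np.nanmin to ignore it when taking a column minimum:
import numpy as np

grid = np.full((4, 5), np.nan)   # 4x5 grid, every cell starts out "empty"
grid[0, 2] = 7.0
grid[3, 2] = 2.5
grid[1, 4] = 9.0

print(grid[0, 2])              # single element: 7.0
print(grid[:, 2])              # whole column 2: [7.  nan nan 2.5]
print(np.nanmin(grid[:, 2]))   # minimum of column 2, ignoring empty cells: 2.5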
Store your grid as a 2D array (a matrix) and use list comprehensions.
first_column = [row[0] for row in grid]
second_column = [row[1] for row in grid]
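For instance, empty cells can be stored as None (a sketch; the minimum of a column then has to skip the None entries):
grid = [
    [None, 5, None],
    [3, None, 8],
    [1, 4, None],
]

print(grid[1][2])                                       # element at row 1, column 2: 8
second_column = [row[1] for row in grid]                # [5, None, 4]
print(min(v for v in second_column if v is not None))   # 4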
If a large proportion of the "cells" are going to be unused, you could try a dictionary keyed by the (row, column) coordinates as a tuple.
matrix = dict()
matrix[1,3] = 13
matrix[1,5] = 15
matrix[2,3] = 23
matrix[2,7] = 27
matrix[3,7] = 37
valuesInRow2 = [v for (r,c),v in matrix.items() if r==2]
# [23,27]
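Selecting a column and taking its minimum works the same way, e.g.:
valuesInCol3 = [v for (r,c),v in matrix.items() if c==3]
# [13, 23]
print(min(valuesInCol3))  # 13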
By creating a subclass of dict to manage indexing and overriding operators, you could get it to behave exactly the way you want:
class Sparse(dict):
    def __init__(self, rows=0, cols=0):
        super().__init__()
        self.rows = rows
        self.cols = cols

    def __indexToRanges(self, rowIndex, colIndex):
        scalar = isinstance(rowIndex, int) and isinstance(colIndex, int)
        if isinstance(rowIndex, slice):
            rowRange = range(*rowIndex.indices(self.rows))
        else:
            rowRange = range(rowIndex, rowIndex+1)
        if isinstance(colIndex, slice):
            colRange = range(*colIndex.indices(self.cols))
        else:
            colRange = range(colIndex, colIndex+1)
        return rowRange, colRange, scalar

    def __getitem__(self, indexes):
        row, col = indexes
        rowRange, colRange, scalar = self.__indexToRanges(row, col)
        if scalar: return super().__getitem__((row, col))
        return [v for (r, c), v in self.items() if r in rowRange and c in colRange]

    def __setitem__(self, index, value):
        row, col = index
        rowRange, colRange, scalar = self.__indexToRanges(row, col)
        if scalar:
            self.rows = max(self.rows, row+1)
            self.cols = max(self.cols, col+1)
            return super().__setitem__((row, col), value)
usage:
matrix = Sparse()
matrix[1,3] = 13
matrix[1,5] = 15
matrix[2,3] = 23
matrix[2,7] = 27
matrix[3,7] = 37
print("sum of column 3:", sum(matrix[:,3]) ) # 36
print("sum of row 2:", sum(matrix[2,:]) ) # 50
print("top left 4x4 values:", matrix[:4,:4] ) # [13, 23]
I can create a normal matrix with numpy using
np.zeros([800, 200])
How can I create a matrix with a negative index - as in a 1600x200 matrix with row index from -800 to 800?
Not sure what you need it for but maybe you could use a dictionary instead.
a={i:0 for i in range(-800,801)}
With this you can call a[-800] to a[800].
For 2-D,
a={(i,j):0 for i in range(-800,801) for j in range(-100,101)}
This can be called with a[(-800,-100)] to a[(800,100)]
Not clear what is being asked. NumPy arrays already support access via negative indexing, which refers to positions relative to the end, e.g.:
import numpy as np
m = np.arange(3 * 4).reshape((3, 4))
print(m)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
print(m[-1, :])
# [ 8 9 10 11]
print(m[:, -1])
# [ 3 7 11]
If you need an array that is contiguous near the zero of your indices, one option would be to write a function that maps each index i to i + d // 2 (d being the size along the given axis), e.g.:
def idx2neg(indexing, shape):
    new_indexing = []
    for idx, dim in zip(indexing, shape):
        if isinstance(idx, int):
            new_indexing.append(idx + dim // 2)
        ...
        # not implemented for slices, boolean masks, etc.
    return tuple(new_indexing)
Note that the above function is not as flexible as what NumPy accepts; it is just meant to give an idea of how to proceed.
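Usage might then look like this (a sketch reusing the m defined above; only plain integer indices are handled):
print(m[idx2neg((-1, -2), m.shape)])
# 0   (the "centered" index (-1, -2) maps to the plain index (0, 0))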
Probably you are referring to the Fortran-like arbitrary indexing of arrays. This is not available in Python: it clashes with the Python convention that negative indexes count from the end (or right) of the array. Check the comments in this question.
I don't know why you would need that, but if you just need it for indexing, try the following function:
def getVal(matrix, i, k):
    return matrix[i + 800][k]
This function only shifts the first index, so you can pass a row index from -800 up to 799 for a 1600x200 matrix.
If you want to index relative to the matrix's number of rows, try the following function:
def getVal(matrix, i, k):
    return matrix[i + len(matrix) // 2][k]
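For example (a sketch; with 1600 rows, len(matrix) // 2 is 800, so both versions behave the same here):
matrix = [[0] * 200 for _ in range(1600)]
matrix[0][5] = 42               # underlying row 0 corresponds to index -800
print(getVal(matrix, -800, 5))  # 42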
Hope that helps!
If extending the dependency list is acceptable, this is fairly straightforward with a pandas DataFrame and a custom index.
You will need to change the syntax for accessing rows (and columns) slightly, but you keep the possibility of slicing multiple rows and columns.
This is specific to 2d numpy arrays:
import numpy as np
import pandas as pd
a = np.arange(1600*200).reshape([1600, 200])
df = pd.DataFrame(a, index=range(-800, 800))
Once you have such a dataframe you can access columns and rows (with a few syntax inconsistencies):
Access the 1st column: df[0]
Access the 1st row: df.loc[-800]
Access rows from 1st to 100th: df.loc[-800:-700] and df[-800: -700]
Access columns from 1st to 100th: df.loc[:, 0:100]
Access rows and columns: df.loc[-800:-700, 0:100]
Full documentation on pandas slicing and indexing can be found here.
You can use the np.arange function to generate an array of integers from -800 to 800, and then reshape this array into the desired shape using the reshape method.
Here's an example of how you could do this:
import numpy as np
# Create an array of integers from -800 to 800
indices = np.arange(-800, 801)
# Reshape the array into a 1600 x 200 matrix
matrix = indices.reshape(1600, 200)
This will create a 1600 x 200 matrix with row indices ranging from -800 to 800. You can then access elements of the matrix using these negative indices just like you would with positive indices.
For example, to access the element at row -1 and column 0, you could use the following code:
matrix[-1, 0]
You can create a matrix with a negative index using the following code:
import numpy as np
my_matrix = np.zeros((1600, 200))
my_matrix = np.pad(my_matrix, ((800, 800), (0, 0)), mode='constant', constant_values=0)
my_matrix = my_matrix[-800:800, :]
You can create a numpy.ndarray subclass. Take a look at the example below; it can create an array with a specific starting index.
import numpy as np
class CustomArray(np.ndarray):
    def __new__(cls, input_array, startIndex=None):
        obj = np.asarray(input_array)
        if startIndex is not None:
            if isinstance(startIndex, int):
                startIndex = (startIndex, )
            else:
                startIndex = tuple(startIndex)
            assert len(startIndex) == len(obj.shape)
        obj = obj.view(cls)
        obj.startIndex = startIndex
        return obj

    def __array_finalize__(self, obj):
        if obj is None: return
        self.startIndex = getattr(obj, 'startIndex', None)

    @staticmethod
    def adj_index(idx, adj):
        if isinstance(idx, tuple):
            if not isinstance(adj, tuple):
                adj = tuple([adj for i in range(len(idx))])
            idx = tuple([CustomArray.adj_index(idx_i, adj_i) for idx_i, adj_i in zip(idx, adj)])
        elif isinstance(idx, slice):
            if isinstance(adj, tuple):
                adj = adj[0]
            idx = slice(idx.start-adj if idx.start is not None else idx.start,
                        idx.stop-adj if idx.stop is not None else idx.stop,
                        idx.step)
        else:
            if isinstance(adj, tuple):
                adj = adj[0]
            idx = idx - adj
        return idx

    def __iter__(self):
        return np.asarray(self).__iter__()

    def __getitem__(self, idx):
        if self.startIndex is not None:
            idx = self.adj_index(idx, self.startIndex)
        return np.asarray(self).__getitem__(idx)

    def __setitem__(self, idx, val):
        if self.startIndex is not None:
            idx = self.adj_index(idx, self.startIndex)
        return np.asarray(self).__setitem__(idx, val)

    def __repr__(self):
        r = np.asarray(self).__repr__()
        if self.startIndex is not None:
            r += f'\n    StartIndex: {self.startIndex}'
        return r
Example
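A minimal usage sketch (assuming the class definition above):
arr = CustomArray(np.zeros((1600, 200)), startIndex=(-800, 0))
arr[-800, 0] = 1.0           # stored at the underlying position (0, 0)
print(arr[-800, 0])          # 1.0
print(arr[-799:-797, 0])     # slice bounds are shifted as well: [0. 0.]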
I am concatenating values from a range of cells into arrays, and if I have an empty value I need to ignore that array and continue with another.
What is the best approach?
This is what I have so far (I converted the arrays to str because it didn't let me operate on empty values in cells).
def get_names_and_probs(row, col, col1, col2):
    prob_array = []
    name_array = str(df.iloc[row, col]).replace(" ", "").split(",")
    name_array1 = str(df.iloc[row, col1]).replace(" ", "").split(",")
    name_array2 = str(df.iloc[row, col2]).replace(" ", "").split(",")
    C = np.array([[i.value for i in j] for j in ws['O1':'Q1']]).ravel()  # ws is the openpyxl worksheet
    C = list(C)
    print(C)
    for k in range(len(name_array)):
        prob_array.append(C[0])
    for k in range(len(name_array1)):
        prob_array.append(C[1])
    for k in range(len(name_array2)):
        prob_array.append(C[2])
    print(prob_array)
get_names_and_probs(0, 14, 15, 16)
In my Excel sheet I have columns whose headers are probabilities and cells containing the values corresponding to those probabilities. It looks like:
0 50 30 1
1 Oval, Round Irregular
2 Circumscribed
3 High density Equal density, Low density
4 Coarse or “popcorn-like”
For example, [1][50] contains both Oval and Round, [1][30] is empty, and [1][1] is Irregular.
What I'm trying to do is get an array from each row; if a cell contains more than one value, append each value to the array and then associate it with the probability above.
In this case, it should look like name_arr = ['Oval', 'Round', 'Irregular'] and corresponding probabilities should be [50, 50, 1] (ignore 30 because of empty cell).
Thanks in advance!
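One possible sketch, assuming the sheet is read into a pandas DataFrame df whose column labels are the probabilities (a hypothetical layout; empty cells come through as NaN):
import pandas as pd

def names_and_probs(df, row):
    names, probs = [], []
    for prob, cell in df.loc[row].items():
        if pd.isna(cell):        # empty cell: skip it and its probability
            continue
        for name in str(cell).replace(" ", "").split(","):
            names.append(name)
            probs.append(prob)
    return names, probs
For row 1 of the table above this would give ['Oval', 'Round', 'Irregular'] and [50, 50, 1].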
Question: How could I perform the following task more efficiently?
My problem is as follows. I have a (large) 3D data set of points in real physical space (x,y,z). It has been generated by a nested for loop that looks like this:
import numpy as np

# Generate given_dat with its ordering
x_samples = 2
y_samples = 3
z_samples = 4
given_dat = np.zeros(((x_samples*y_samples*z_samples), 3))
row_ind = 0
for z in range(z_samples):
    for y in range(y_samples):
        for x in range(x_samples):
            row = [x+.1, y+.2, z+.3]
            given_dat[row_ind, :] = row
            row_ind += 1
for row in given_dat:
    print(row)
For the sake of comparing it to another set of data, I want to reorder the given data into my desired order as follows (unorthodox, I know):
# Generate data with desired ordering
x_samples = 2
y_samples = 3
z_samples = 4
desired_dat = np.zeros(((x_samples*y_samples*z_samples), 3))
row_ind = 0
for z in range(z_samples):
    for x in range(x_samples):
        for y in range(y_samples):
            row = [x+.1, y+.2, z+.3]
            desired_dat[row_ind, :] = row
            row_ind += 1
for row in desired_dat:
    print(row)
I have written a function that does what I want, but it is horribly slow and inefficient:
def bad_method(x_samp, y_samp, z_samp, data):
    zs = np.unique(data[:,2])
    xs = np.unique(data[:,0])
    rowlist = []
    for z in zs:
        for x in xs:
            for row in data:
                if row[0] == x and row[2] == z:
                    rowlist.append(row)
    new_data = np.vstack(rowlist)
    return new_data
# Shows that my function does what I want
fix = bad_method(x_samples,y_samples,z_samples,given_dat)
print('Unreversed data')
print(given_dat)
print('Reversed Data')
print(fix)
# If it didn't work this will throw an exception
assert(np.array_equal(desired_dat,fix))
How could I improve my function so it is faster? My data sets usually have roughly 2 million rows. It must be possible to do this with some clever slicing/indexing which I'm sure will be faster but I'm having a hard time figuring out how. Thanks for any help!
You could reshape your array, swap the axes as necessary and reshape back again:
# (No need to copy if you don't want to keep the given_dat ordering)
data = np.copy(given_dat).reshape(( z_samples, y_samples, x_samples, 3))
# swap the "y" and "x" axes
data = np.swapaxes(data, 1,2)
# back to 2-D array
data = data.reshape((x_samples*y_samples*z_samples,3))
assert(np.array_equal(desired_dat,data))
I have a 2D numpy array with 3 columns. Columns 1 and 2 are a list of connections between IDs. Column 3 is the strength of that connection. I would like to transform this 3 column matrix into a weighted adjacency matrix (an N x N matrix where the cells represent the strength of connection between each pair of IDs).
I have already done this in my code below. matrix is the 3 column 2D array and t1 is the weighted adjacency matrix. My problem is this code is very slow because I am using nested for loops. I am familiar with the pandas function melt which does this, but I am not able to use pandas. Is there a faster implementation not using pandas?
import numpy as np
a = np.arange(2000)
np.random.shuffle(a)
b = np.arange(2000)
np.random.shuffle(b)
c = np.random.rand(2000,1)
matrix = np.column_stack((a,b,c))
#get unique value list of nm
flds = list(np.unique(matrix[:,0]))
flds.extend(list(np.unique(matrix[:,1])))
flds = np.asarray(flds)
flds = np.unique(flds)
#make lookup dict
lookup = dict(zip(np.arange(0,len(flds)), flds))
lookup_rev = dict(zip(flds, np.arange(0,len(flds))))
#make empty n by n matrix with unique lists
t1 = np.zeros([len(flds) , len(flds)])
#map values into the n by n matrix and make the rest 0
'''this takes a long time to run'''
#iterate through rows
for i in np.arange(0,len(lookup)):
    #iterate through columns
    for k in np.arange(0,len(lookup)):
        val = matrix[(matrix[:,0] == lookup[i]) & (matrix[:,1] == lookup[k])][:,2]
        if val:
            t1[i,k] = sum(val)
Assuming that I understood the question correctly and that val is a scalar, you could use a vectorized approach that involves initializing with zeros and then indexing, like so -
out = np.zeros((len(flds),len(flds)))
out[matrix[:,0].astype(int),matrix[:,1].astype(int)] = matrix[:,2]
Please note that by my observation it looks like you can avoid using lookup.
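If the same (row, column) pair can occur more than once and the weights should accumulate (as the sum(val) in the question suggests), np.add.at can be used instead of plain assignment, e.g. (a sketch):
out = np.zeros((len(flds), len(flds)))
np.add.at(out, (matrix[:,0].astype(int), matrix[:,1].astype(int)), matrix[:,2])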
You need to iterate your matrix only once:
import numpy as np
size = 2000
a = np.arange(size)
np.random.shuffle(a)
b = np.arange(size)
np.random.shuffle(b)
c = np.random.rand(size,1)
matrix = np.column_stack((a,b,c))
#get unique value list of nm
fields = np.unique(matrix[:,:2])
n = len(fields)
#make reverse lookup dict
lookup = dict(zip(fields, range(n)))
#make empty n by n matrix
t1 = np.zeros([n, n])
for src, dest, val in matrix:
    i = lookup[src]
    j = lookup[dest]
    t1[i, j] += val
The main acceleration you can get is by not iterating through each element of the NxN matrix, but instead iterating through your connection list, which is much smaller.
I tried to simplify your code a bit. It uses the list.index method, which can be slow, but it should still be faster than what you had.
import numpy as np
a = np.arange(2000)
np.random.shuffle(a)
b = np.arange(2000)
np.random.shuffle(b)
c = np.random.rand(2000,1)
matrix = np.column_stack((a,b,c))
lookup = np.unique(matrix[:,:2]).tolist() # You can call unique only once
t1 = np.zeros((len(lookup),len(lookup)))
for i,j,val in matrix:
    t1[lookup.index(i),lookup.index(j)] = val # Fill the matrix
This is a follow-up to Find two pairs of pairs that sum to the same value.
I have random 2d arrays which I make using
import numpy as np
from itertools import combinations
n = 50
A = np.random.randint(2, size=(m,n))
I would like to determine if the matrix has two disjoint pairs of pairs of columns which sum to the same column vector. I am looking for a fast method to do this. In the previous problem ((0,1), (0,2)) was acceptable as a pair of pairs of column indices but in this case it is not as 0 is in both pairs.
The accepted answer from the previous question is so cleverly optimised I can't see how to make this simple looking change unfortunately. (I am interested in columns rather than rows in this question but I can always just do A.transpose().)
Here is some code to show it testing all 4 by 4 arrays.
n = 4
nxn = np.arange(n*n).reshape(n, -1)
count = 0
for i in xrange(2**(n*n)):
    A = (i >> nxn) % 2
    p = 1
    for firstpair in combinations(range(n), 2):
        for secondpair in combinations(range(n), 2):
            if firstpair < secondpair and not set(firstpair) & set(secondpair):
                if (np.array_equal(A[firstpair[0]] + A[firstpair[1]], A[secondpair[0]] + A[secondpair[1]])):
                    if (p):
                        count += 1
                        p = 0
print count
This should output 3136.
Here is my solution, extended to do what I believe you want. It isn't entirely clear though; one may get an arbitrary number of row-pairs that sum to the same total; there may exist unique subsets of rows within them that sum to the same value. For instance:
Given this set of row-pairs that sum to the same total
[[19 19 30 30]
[11 16 11 16]]
There exists a unique subset of these rows that may still be counted as valid; but should it?
[[19 30]
[16 11]]
Anyway, I hope those details are easy to deal with, given the code below.
import numpy as np
n = 20
#also works for non-square A
A = np.random.randint(2, size=(n*6,n)).astype(np.int8)
##A = np.array( [[0, 0, 0], [1, 1, 1], [1, 1 ,1]], np.uint8)
##A = np.zeros((6,6))
#force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]
def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray(np.rollaxis(a, -1))
    packing = int(np.dtype(dtype).itemsize * 8 / (float(base) / 2))
    for columns in np.array_split(a, (len(a)-1)//packing+1):
        R = np.zeros(a.shape[1:], dtype)
        for col in columns:
            R *= base
            R += col
        yield R

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)   #note; this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse

def voidview(arr):
    """view the last axis of an array as a void object. can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])
def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations
    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isnt strictly required; we can lazily consider the columns, giving an early exit opportunity
    all nicely vectorized of course
    """
    multiplicity, combinations = combinations_index.shape
    #list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)
    #keep all packed columns; we might need them later
    columns = []
    for packed_column in base_pack_lazy(A, base=multiplicity+1):   #loop over packed cols
        columns.append(packed_column)
        #compute rowsums only for a fixed number of columns at a time.
        #this is O(n^2) rather than O(n^3), and after considering the first column,
        #we can typically already exclude almost all combinations
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        #find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        #prune those combinations which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        #early exit; no pairs
        if len(active_combinations)==0:
            return False

    """
    we now have a small set of relevant combinations, but we have lost the details of their particulars
    to see which combinations of rows does sum to the same value, we do need to consider rows as a whole
    we can simply apply the same mechanism, but for all columns at the same time,
    but only for the selected subset of row combinations known to be relevant
    """
    #construct full packed matrix
    B = np.ascontiguousarray(np.vstack(columns).T)
    #perform all relevant sums, over all columns
    rowsums = sum(B[I[active_combinations]] for I in combinations_index)
    #find the unique rowsums, by viewing rows as a void object
    unique, count, inverse = unique_count(voidview(rowsums))
    #if not, we did something wrong in deciding on active combinations
    assert(np.all(count>1))

    #loop over all sets of rows that sum to an identical unique value
    for i in xrange(len(unique)):
        #set of indexes into combinations_index;
        #note that there may be more than two combinations that sum to the same value; we grab them all here
        combinations_group = active_combinations[inverse==i]
        #associated row-combinations
        #array of shape=(mulitplicity,group_size)
        row_combinations = combinations_index[:,combinations_group]
        #if no duplicate rows involved, we have a match
        if len(np.unique(row_combinations[:,[0,-1]])) == multiplicity*2:
            print row_combinations
            return True
    #none of identical rowsums met uniqueness criteria
    return False
def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array([(i,j,k)
                    for i in xrange(n)
                    for j in xrange(n)
                    for k in xrange(n)
                    if i<j and j<k], dtype=np.uint16)
    idx = np.ascontiguousarray(idx.T)
    return has_identical_row_sums_lazy(A, idx)

def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n,-1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)

from time import clock
t = clock()
for i in xrange(1):
##    print has_identical_double_row_sums(A)
    print has_identical_triple_row_sums(A)
print clock()-t
Edit: code cleanup