Data structure for a diamond-shaped array in Python

I have two arrays that are related to each other via a mapping operation. I will call them S(fk,fq) and Z(fi,αj). The arguments are all sampling frequencies. The mapping rule is fairly straightforward:
fi = 0.5 · (fk - fq)
αj = fk + fq
S is the result of several FFTs and complex multiplications and is defined on a rectangular grid. However, Z is defined on a diamond-shaped grid and it is not clear to me how best to store this. The image below is an attempt at visualizing the operation for a simple example of a 4×4 array, but in general the dimensions are not equal and are much larger (maybe 64×16384, but this is user-selectable). Blue points are the resulting values of fi and αj and the text describes how these are related to fk, fq, and the discrete indices.
The diamond-shaped nature of Z means that in one "row" there will be "columns" that fall in between the "columns" of adjacent "rows". Another way to think of this is that fi can take on fractional index values!
Note that using zeros or NaNs to fill in the elements that don't exist in any given row has two drawbacks: 1) it inflates the size of what may already be a very large 2-D array, and 2) it does not really represent the true nature of Z (e.g. the array size will not really be correct).
Currently I am using a dictionary indexed on the actual values of αj to store the results:
import numpy as np
from collections import defaultdict
nrows = 64
ncolumns = 16384
fk = np.fft.fftfreq(nrows)
fq = np.fft.fftfreq(ncolumns)
# using random numbers here to simplify the example
# in practice S is the result of several FFTs and complex multiplications
S = np.random.random(size=(nrows,ncolumns)) + 1j*np.random.random(size=(nrows,ncolumns))
ret = defaultdict(lambda: {"fi":[],"Z":[]})
for k in range(-nrows//2, nrows//2):
    for q in range(-ncolumns//2, ncolumns//2):
        fi = 0.5*(fk[k] - fq[q])
        alphaj = fk[k] + fq[q]
        Z = S[k, q]
        ret[alphaj]["fi"].append(fi)
        ret[alphaj]["Z"].append(Z)
I still find this a bit cumbersome to work with and wonder if anyone has suggestions for a better approach. "Better" here means more computationally and memory efficient and/or easier to interact with and visualize using something like matplotlib.
Note: This is related to another question about how to get rid of those nasty for-loops. Since this is about storing the results I thought it would be better to create two separate questions.

You can still view it as a two-dimensional array, but represent it as an array of rows in which each row has a different number of items. For example, here's your 4x4 as a 2D array (each 0 here is a unique data item):
xxx0xxx
xx0x0xx
x0x0x0x
0x0x0x0
x0x0x0x
xx0x0xx
xxx0xxx
Its sparse representation would be:
[
[0],
[0,0],
[0,0,0],
[0,0,0,0],
[0,0,0],
[0,0],
[0]
]
With this representation you eliminate the empty space. There's a little math involved in converting from αj to a row index, and from fi to a column index (and vice versa), but that's tractable. You know the bounds and that items are evenly spaced out across each row, so it should be easy enough to do the translation.
Unless I'm missing something . . .
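To make the index translation concrete, here is a minimal sketch of that jagged layout, assuming one row of Z per anti-diagonal k + q of the original grid (the toy 4x4 S and the to_jagged helper are illustrative, not part of the original post):
import numpy as np
# toy grid standing in for S(fk, fq)
nrows, ncolumns = 4, 4
S = np.arange(nrows*ncolumns).reshape(nrows, ncolumns)
# one jagged row per anti-diagonal k + q = r
Z = []
for r in range(nrows + ncolumns - 1):
    k_lo = max(0, r - ncolumns + 1)
    k_hi = min(r, nrows - 1)
    Z.append([S[k, r - k] for k in range(k_lo, k_hi + 1)])
# translate a grid point (k, q) to its (row, column) in the jagged structure
def to_jagged(k, q):
    r = k + q
    return r, k - max(0, r - ncolumns + 1)
row, col = to_jagged(2, 1)
assert Z[row][col] == S[2, 1]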

It turns out that the answer to a related question on optimization effectively solved my problem of how to better store the data. The new code returns 2-D arrays for fi and αj, and these can be used to directly index S. So to get all values of S for αj = 0, for example, one can do
S[alphaj == 0]
I can use this pretty effectively and it seems like the quickest way to create a reasonable data structure.
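For reference, a minimal sketch of how such 2-D fi and αj arrays can be built by broadcasting the 1-D frequency vectors against each other (this construction is an assumption that mirrors the mapping rule above, not necessarily the exact code from the linked answer):
import numpy as np
nrows, ncolumns = 64, 16384
fk = np.fft.fftfreq(nrows)
fq = np.fft.fftfreq(ncolumns)
S = np.random.random((nrows, ncolumns)) + 1j*np.random.random((nrows, ncolumns))
# broadcast the 1-D frequency vectors to 2-D grids matching S
fi = 0.5*(fk[:, np.newaxis] - fq[np.newaxis, :])
alphaj = fk[:, np.newaxis] + fq[np.newaxis, :]
# all samples on the alphaj = 0 "row" of Z, and the fi values that go with them
mask = alphaj == 0
S_sel, fi_sel = S[mask], fi[mask]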

Related

Python - sum the intersection of rows and columns in a matrix

Let's suppose we have a matrix and a list of indexes:
adj_mat = np.array([[1,2,3],
                    [4,5,6],
                    [7,8,9]])
indexes = [0,2]
What I want is to sum the rows and columns corresponding to the sub matrix we get by the intersection of the rows and columns of the indexes list. In this case it would be:
sub_matrix = [[1,3],
              [7,9]]
result_rows = [4,16]
result_columns = [8,12]
However, I do this calculation quite a lot of times with the same original matrix and different index lists, so I am looking for an efficient solution that does not create the sub matrix on each iteration. My solution so far is (and for columns respectively):
def sum_rows(matrix, indexes):
    sum_r = [0]*len(indexes)
    for i in range(len(indexes)):
        for j in indexes:
            sum_r[i] += matrix.item(indexes[i], j)
    return sum_r
I'm looking for a more efficient algorithm as I remember there is a method which looks like this that sums all rows (or columns?) in the indexes:
matrix.sum(:, indexes)
matrix.sum(indexes, indexes)
I assume what I need is the second line, if it exists. I tried to google it, with or without numpy, but couldn't find the right syntax.
Is there a solution as I described here but I'm just using the wrong syntax? Or any other suggestions for improvement?
IIUC:
import numpy as np
adj_mat = np.array([[1,2,3],
                    [4,5,6],
                    [7,8,9]])
indexes = np.array([1, 3]) - 1
sub_matrix = adj_mat[np.ix_(indexes, indexes)]
result_rows, result_columns = sub_matrix.sum(axis=1), sub_matrix.sum(axis=0)
Result:
array([ 4, 16]) # result_rows
array([ 8, 12]) # result_columns
So assuming you made a mistake and you meant indexes = [0,2] and sub_matrix = [[1,3], [7,9]], then this should do what you want
def sum_sub(matrix, indices):
    """
    Returns the sum of each row and each column (as a tuple)
    of the sub matrix selected by indices
    """
    # note: np.ix_ uses fancy indexing, so sub_mat is a copy of the
    # selected elements rather than a view into matrix
    sub_mat = matrix[np.ix_(indices, indices)]
    return sub_mat.sum(axis=1), sub_mat.sum(axis=0)

sum_row, sum_col = sum_sub(np.arange(1,10).reshape((3,3)), [0,2])
The results of this are
sum_col # --> [ 8 12]
sum_row # --> [ 4 16]
Since the point of efficiency was brought up in the question, a little further analysis should probably be done.
First and foremost, the code looks like code to find a matrix inverse using the adjoint matrix. Unless that particular method is important to the project, the standard np.linalg.inv() is almost certainly going to be faster than anything we cook up here. Moreover, in many applications you can get away with solving a system of linear equations rather than finding an inverse and multiplying by it, cutting run times in half or more again.
Second, any discussion of efficient numpy code needs to address views as opposed to copies. Memory allocation, writing to memory, and memory deallocation are all extremely expensive operations when compared with standard floating point arithmetic. That's not to say that they're slow in absolute terms, but you can notice an order of magnitude or two of difference in speed between memory-efficient code and nearly anything else. That's the entire premise behind the fastest implementation of persistent homology calculations I know of, among other things.
All of the other answers (at the time of writing) create a copy of the data they're working with, explicitly storing that information in a new variable sub_matrix. It isn't possible to produce every fancy-indexed matrix without a copy, but oftentimes equivalent operations can be performed using only views.
For example, if this really is a set of computations on adjoint matrices so that your indexes variable consists of all but one of the available indices (in your example, all but the middle index), then instead of explicitly summing over all the intended indices, we can sum over all indices and subtract the one we don't care about. The effect is that all the intermediate matrices are views rather than copies, preventing the expensive memory allocations. On my machine, this is twice as fast for the tiny 3x3 example given and 10x as fast for 500x500 matrices.
bad_row = 1
bad_col = 1
result_rows = (np.sum(adj_mat, axis=1)-adj_mat[:,bad_col])[np.arange(adj_mat.shape[0])!=bad_row]
result_cols = (np.sum(adj_mat, axis=0)-adj_mat[bad_row,:])[np.arange(adj_mat.shape[1])!=bad_col]
Of course, it's even faster if you can use slices to represent whatever you're doing and you don't have to work around the problem with extra operations as I did, but the example you gave doesn't easily permit slices.

Combinations of features using Python NumPy

For an assignment I have to use different combinations of features belonging to some data, to evaluate a classification system. By features I mean measurements, e.g. height, weight, age, income. So for instance I want to see how well a classifier performs when given just the height and weight to work with, and then the height and age say. I not only want to be able to test what two features work best together, but also what 3 features work best together and would like to be able to generalise this to n features.
I've been attempting this using numpy's mgrid to create n-dimensional arrays, flattening them, and then making arrays that use the same elements from each array to create new ones. Tricky to explain, so here is some code and pseudo code:
import numpy as np
def test_feature_combos(data, combinations):
    dimensions = combinations.shape[0]
    grid = np.empty(dimensions)
    for i in xrange(dimensions):
        grid[i] = combinations[i].flatten()
    # The above throws a "setting an array element with a sequence" error,
    # which I understand, but it shows my approach.

    # **Pseudo code begin**
    # For each element of each element of this new array,
    # create a new array like so:
    #     [[1,1,2,2],[1,2,1,2]] ---> [[1,1],[1,2],[2,1],[2,2]]
    # Call this new array combo_indices
    # Then choose the columns (features) from the data in a loop using:
    #     new_data = data[:, combo_indices[j]]

combinations = np.mgrid[1:5,1:5]
test_feature_combos(data, combinations)
I concede that this approach means a lot of unnecessary combinations due to repeats, however I cannot even implement this so beggars can not be choosers.
Please can someone advise me on how I can either a) implement my approach or b) achieve this goal in a much more elegant way.
Thanks in advance, and let me know if any clarification needs to be made, this was tough to explain.
To generate all combinations of k elements drawn without replacement from a set of size n you can use itertools.combinations, e.g.:
idx = np.vstack(itertools.combinations(range(n), k))  # an (n choose k, k) array of indices
For the special case where k=2 it's often faster to use the indices of the upper triangle of an n x n matrix, e.g.:
idx = np.vstack(np.triu_indices(n, 1)).T
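As a small usage sketch (the data shape, feature count, and classifier step here are made-up placeholders), each row of idx then selects one subset of feature columns:
import itertools
import numpy as np
n_features, k = 4, 2
data = np.random.random((10, n_features))  # 10 samples, 4 features (illustrative)
idx = np.vstack(list(itertools.combinations(range(n_features), k)))
for combo in idx:
    subset = data[:, combo]  # samples restricted to this feature combination
    # ... train / evaluate the classifier on `subset` here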

Editing every value in a numpy matrix

I have a numpy matrix which I filled with data from a *.csv-file
csv = np.genfromtxt(file, skiprows=22)
matrix = np.matrix(csv)
This is a 64x64 matrix which looks like
print matrix
[[...,...,....]
[...,...,.....]
.....
]]
Now I need to take the logarithm math.log10() of every single value and save it into another 64x64 matrix.
How can I do this? I tried
matrix_lg = np.matrix(csv)
for i in range(0, len(matrix)):
    for j in range(0, len(matrix[0])):
        matrix_lg[i,j] = math.log10(matrix[i,j])
but this only edited the first array (meaning the first row) of my initial matrix.
It's my first time working with python and I start getting confused.
You can just do:
matrix_lg = numpy.log10(matrix)
And it will do it for you. It's also much faster to do it this vectorized way instead of looping over every entry in python. It will also handle domain errors more gracefully.
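As a quick illustration of that last point (the toy matrix here is made up), np.log10 flags invalid inputs with a warning and inf/nan values instead of raising mid-loop the way math.log10 would:
import numpy as np
m = np.array([[1.0, 10.0],
              [0.0, 100.0]])
print(np.log10(m))   # [[ 0.  1.] [-inf  2.]] plus a RuntimeWarning for the zero
# math.log10(0.0) would instead raise a ValueError and abort the loop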
FWIW though, the issue with your posted code is that len() for matrices doesn't work exactly the same as it does for nested lists. As suggested in the comments, you can just use matrix.shape to get the proper dims to iterate through:
matrix_lg = np.matrix(csv)
for i in range(0, matrix_lg.shape[0]):
    for j in range(0, matrix_lg.shape[1]):
        matrix_lg[i,j] = math.log10(matrix_lg[i,j])

Replace loop with broadcasting in numpy -> memory error

I have a 2D array (array1) with an arbitrary number of rows. Its first column contains strictly monotonically increasing numbers (but not linearly spaced), which represent a position in my system, while the second column gives me a value which represents the state of my system at and around the position in the first column.
Now I have a second array (array2); its range should usually be the same as for the first column of the first array, but this does not matter too much, as you will see below.
For every element in array2 I am now interested in:
1. Which argument in array1[:,0] has the closest value to the current element of array2?
2. What is the value (array1[:,1]) of that element?
As array2 will usually be longer than the number of rows in array1, it is perfectly fine if I get the same argument from array1 more than once. In fact this is what I expect.
The value from 2. is written in the second and third column, as you will see below.
My stripped-down code looks like this:
from numpy import arange, zeros, absolute, argmin, mod, newaxis, ones
ysize1 = 50
array1 = zeros((ysize1+1,2))
array1[:,0] = arange(ysize1+1)**2
# can be any strictly monotonic increasing array
array1[:,1] = mod(arange(ysize1+1),2)
# in my current case, but could also be something else
ysize2 = (ysize1)**2
array2 = zeros((ysize2+1,3))
array2[:,0] = arange(0,ysize2+1)
# is currently uniformly distributed over the whole range, but does not necessarily have to be
a = 0
for i, array2element in enumerate(array2[:,0]):
    a = argmin(absolute(array1[:,0]-array2element))
    array2[i,1] = array1[a,1]
It works, but takes quite a lot of time to process large arrays. I then tried to implement broadcasting, which seems to work with the following code:
indexarray = argmin(absolute(ones(array2[:,0].shape[0])[:,newaxis]*array1[:,0]-array2[:,0][:,newaxis]),1)
array2[:,2]=array1[indexarray,1] # just to compare the results
Unfortunately now I seem to run into a different problem: I get a memory error on the sizes of arrays I am using in the line of code with the broadcasting.
For small sizes it works, but for larger ones, where len(array2[:,0]) is something like 2**17 (and could be even larger) and len(array1[:,0]) is about 2**14, the intermediate array is bigger than the available memory. Is there an elegant way around that, or a way to speed up the loop?
I do not need to store the intermediate array(s), I am just interested in the result.
Thanks!
First let's simplify this line:
argmin(absolute(ones(array2[:,0].shape[0])[:,newaxis]*array1[:,0]-array2[:,0][:,newaxis]),1)
it should be:
a = array1[:, 0]
b = array2[:, 0]
argmin(abs(a - b[:, newaxis]), 1)
But even when simplified, you're creating two large temporary arrays. If a and b have sizes M and N, b - a and abs(...) each create a temporary array of size (M, N). Because you've said that a is monotonically increasing, you can avoid the issue altogether by using a binary search (sorted search), which is much faster anyway. Take a look at the answer I wrote to this question a while back. Using the function from that answer, try this:
closest = find_closest(array1[:, 0], array2[:, 0])
array2[:, 2] = array1[closest, 1]
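The referenced find_closest isn't reproduced here; a minimal sorted-search version along these lines would do the job (a sketch using np.searchsorted, not necessarily identical to the function in the linked answer):
import numpy as np

def find_closest(a, targets):
    # a must be a sorted 1-D array; return the index of the element of a
    # closest to each value in targets
    idx = np.searchsorted(a, targets)
    idx = np.clip(idx, 1, len(a) - 1)
    left, right = a[idx - 1], a[idx]
    idx -= targets - left < right - targets  # step back where the left neighbour is closer
    return idx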

Numpy: average over one dimension in "jagged" 3D array

Suppose I have an N*M*X-dimensional array "data", where N and M are fixed, but X is variable for each entry data[n][m].
(Edit: To clarify, I just used np.array() on the 3D python list which I used for reading in the data, so the numpy array is of dimensions N*M and its entries are variable-length lists)
I'd now like to compute the average over the X-dimension, so that I'm left with an N*M-dimensional array. Using np.average/mean with the axis argument doesn't work, so the way I'm doing it right now is just iterating over N and M and appending the manually computed average to a new list, but that just doesn't feel very Pythonic:
avgData = []
for n in data:
    temp = []
    for m in n:
        temp.append(np.average(m))
    avgData.append(temp)
Am I missing something obvious here? I'm trying to freshen up my python skills while I'm at it, so interesting/varied responses are more than welcome! :)
Thanks!
What about using np.vectorize:
do_avg = np.vectorize(np.average)
data_2d = do_avg(data)
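As a quick check (the toy jagged array below is made up for illustration; recent numpy versions require dtype=object for ragged input):
import numpy as np
data = np.empty((2, 2), dtype=object)
data[0, 0], data[0, 1] = [1, 2, 3], [0, 3, 2, 4]
data[1, 0], data[1, 1] = [0, 2], [1]
do_avg = np.vectorize(np.average)
print(do_avg(data))
# [[2.   2.25]
#  [1.   1.  ]]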
data = np.array([[1,2,3],[0,3,2,4],[0,2],[1]]).reshape(2,2)
avg=np.zeros(data.shape)
avg.flat=[np.average(x) for x in data.flat]
print avg
#array([[ 2. , 2.25],
# [ 1. , 1. ]])
This still iterates over the elements of data (nothing un-Pythonic about that). But since there's nothing special about the shape or axes of data, I'm just using data.flat. While your approach appends to a Python list, with numpy it is better to assign values to the elements of an existing array.
There are fast numeric methods to work with numpy arrays, but most (if not all) work with simple numeric dtypes. Here the array elements are objects (either lists or arrays), so numpy has to resort to the usual Python iteration and list operations.
For this small example, this solution is a bit faster than Zwicker's vectorize. For larger data the two solutions take about the same time.
