Combinations of features using Python NumPy

For an assignment I have to use different combinations of features belonging to some data to evaluate a classification system. By features I mean measurements, e.g. height, weight, age, income. So, for instance, I want to see how well a classifier performs when given just the height and weight to work with, and then the height and age, say. I not only want to be able to test which two features work best together, but also which three features work best together, and I would like to be able to generalise this to n features.
I've been attempting this using numpy's mgrid to create n-dimensional arrays, flattening them, and then making arrays that use the same elements from each array to create new ones. It's tricky to explain, so here is some code and pseudo code:
import numpy as np
def test_feature_combos(data, combinations):
    dimensions = combinations.shape[0]
    grid = np.empty(dimensions)
    for i in xrange(dimensions):
        grid[i] = combinations[i].flatten()
# The above line throws a "setting an array element with a sequence" error, which I understand, but it shows my approach.
**Pseudo code begin**
For each element of each element of this new array,
create a new array like so:
[[1,1,2,2],[1,2,1,2]] ---> [[1,1],[1,2],[2,1],[2,2]]
Call this new array combo_indices
Then choose the columns (features) from the data in a loop using:
new_data = data[:, combo_indices[j]]
combinations = np.mgrid[1:5,1:5]
test_feature_combos(data, combinations)
I concede that this approach produces a lot of unnecessary combinations due to repeats; however, I cannot even implement it, so beggars can't be choosers.
Could someone please advise me on how to either a) implement my approach, or b) achieve this goal in a much more elegant way?
Thanks in advance, and let me know if any clarification needs to be made, this was tough to explain.

To generate all combinations of k elements drawn without replacement from a set of size n you can use itertools.combinations, e.g.:
import itertools
idx = np.vstack(list(itertools.combinations(range(n), k)))  # a (C(n, k), k) array of indices
For the special case where k=2 it's often faster to use the indices of the upper triangle of an n x n matrix, e.g.:
idx = np.vstack(np.triu_indices(n, 1)).T
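Putting this back into the original feature-selection setting, a minimal sketch of how those index combinations might be used to slice out feature columns (the scoring step is only a placeholder comment, since it depends on the classifier used in the assignment):
import itertools
import numpy as np

def iter_feature_combos(data, k):
    """Yield (indices, column_subset) for every combination of k feature columns."""
    n_features = data.shape[1]
    for combo in itertools.combinations(range(n_features), k):
        yield combo, data[:, list(combo)]   # pick out just those feature columns

# toy data: 10 samples, 4 features (say height, weight, age, income)
data = np.random.random((10, 4))
for combo, subset in iter_feature_combos(data, 2):
    # this is where the classifier would be trained and evaluated on `subset`
    print(combo, subset.shape)
Unlike the mgrid approach, this never generates repeated or reordered combinations.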

Related

Faster way to fill 2d numpy array with these noise parameters? Currently looping over each element

Is there a faster way to populate a 2D numpy array using the same algorithm (pnoise3 with the same input arguments, notably i/scale and j/scale) seen here? self.world is the numpy array, and it is pretty large (2048, 1024) to be traversing like this.
for i in range(self.height):
    for j in range(self.width):
        self.world[i][j] = noise.pnoise3(i / self.noise['scale'],
                                         j / self.noise['scale'],
                                         SEED,
                                         octaves=self.noise['octaves'],
                                         persistence=self.noise['persistence'],
                                         lacunarity=self.noise['lacunarity'],
                                         repeatx=self.width,
                                         repeaty=self.height,
                                         base=0)
After learning about boolean indexing I was able to get rid of this nested for loop elsewhere in my program and was amazed at how much more efficient it was. Is there any room for improvement above?
I thought about doing something like self.world[self.world is not None] = noise.pnoise3(arg, arg, etc...) but that cannot accommodate the incrementing i and j values. And would setting it to a function output mean every value is the same anyway? I also thought about making a separate array and then combining them, but I still cannot figure out how to reproduce the incrementing i and j values in that scenario.
Also, as an aside, I used self.world[self.world is not None] as an example of a boolean index that would return true for everything but I imagine this is not the best way to do what I want. Is there an obvious alternative I am missing?
If pnoise is Perlin noise then there are numpy vectorized implementations.
Here is one.
As it is, I do not think you can do it faster. Numpy is fast when it can do the inner loop in C. That is the case for built-in numpy functions like np.sin.
Here you have a vector operation where the operation is a Python function.
However, it could be possible to re-implement the noise function so that it internally uses numpy's vectorized functions.
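For what it's worth, the indexing itself can be expressed with numpy even while the noise call stays a Python function; np.vectorize below is only a convenience wrapper (it still loops in Python, so don't expect a speedup), and the seed/octave/persistence/lacunarity values are placeholders for the question's SEED and self.noise settings:
import numpy as np
import noise

height, width, scale = 1024, 2048, 100.0
SEED = 0

# coordinate grids holding i/scale and j/scale for every (i, j) pair
jj, ii = np.meshgrid(np.arange(width) / scale, np.arange(height) / scale)

pnoise3_vec = np.vectorize(
    lambda x, y: noise.pnoise3(x, y, SEED, octaves=6, persistence=0.5,
                               lacunarity=2.0, repeatx=width, repeaty=height,
                               base=0)
)
world = pnoise3_vec(ii, jj)   # shape (height, width), same values as the nested loops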

More efficient way to access rows based on a list of indices in 2d numpy array?

So I have a 2D numpy array arr. It's a relatively big one: arr.shape = (2400, 60000)
What I'm currently doing is the following:
randomly (with replacement) select arr.shape[0] indices
access (row-wise) chosen indices of arr
calculating column-wise averages and selecting max value
I'm repeating this k times
It looks something like:
no_rows = arr.shape[0]
indicies = np.array(range(no_rows))
my_vals = []
for k in range(no_samples):
    random_idxs = np.random.choice(indicies, size=no_rows, replace=True)
    my_vals.append(
        arr[random_idxs].mean(axis=0).max()
    )
My problem is that it is very slow. With my arr size, it takes ~3 s for one loop iteration. As I want a sample that is bigger than 1k, my current solution is pretty bad (1k * ~3 s -> ~1 h). I've profiled it and the bottleneck is accessing the rows based on indices; "mean" and "max" work fast, and np.random.choice is also fine.
Do you see any room for improvement? A more efficient way of accessing the indexed rows, or maybe a faster approach that solves the problem without it?
What I tried so far:
numpy.take (slower)
numpy.ravel: something similar to:
random_idxs = np.random.choice(sample_idxs, size=sample_size, replace=True)
test = random_idxs.ravel()[arr.ravel()].reshape(arr.shape)
a similar approach to the current one, but without the loop: I created a 3D arr and accessed rows across the additional dimension in one go
Since advanced indexing generates a copy, the program allocates a huge amount of memory for arr[random_idxs].
So one of the simplest ways to improve efficiency is to do things batch-wise:
BATCH = 512
max(arr[random_idxs,i:i+BATCH].mean(axis=0).max() for i in range(0,arr.shape[1],BATCH))
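Put back into the original sampling loop, the batched version might look like this (a smaller random arr is used here so the sketch runs quickly; in the question arr would be the (2400, 60000) array):
import numpy as np

arr = np.random.random((2400, 6000))   # stand-in for the real data
no_samples = 100
no_rows = arr.shape[0]
BATCH = 512

my_vals = []
for _ in range(no_samples):
    random_idxs = np.random.choice(no_rows, size=no_rows, replace=True)
    # walk the columns in chunks so only a (no_rows, BATCH) copy exists at any time
    my_vals.append(
        max(arr[random_idxs, i:i + BATCH].mean(axis=0).max()
            for i in range(0, arr.shape[1], BATCH))
    )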
This is not a general solution to the problem, but it should make your specific case much faster. Basically, the column means arr.mean(axis=0) do not change between iterations, so why not take random samples from that array instead of re-indexing arr?
Something like:
col_means = arr.mean(axis=0)
my_vals = np.array([np.random.choice(col_means, size=len(col_means), replace=True) for i in range(no_samples)])
You may even be able to do my_vals = np.random.choice(col_means, size=(no_samples, len(col_means)), replace=True), but I'm not sure how, if at all, that would change your statistics.

Data structure for a diamond-shaped array in python

I have two arrays that are related to each other via a mapping operation. I will call them S(fk,fq) and Z(fi,αj). The arguments are all sampling frequencies. The mapping rule is fairly straightforward:
fi = 0.5 · (fk - fq)
αj = fk + fq
S is the result of several FFTs and complex multiplications and is defined on a rectangular grid. However, Z is defined on a diamond-shaped grid and it is not clear to me how best to store this. The image below is an attempt at visualizing the operation for a simple example of a 4×4 array, but in general the dimensions are not equal and are much larger (maybe 64×16384, but this is user-selectable). Blue points are the resulting values of fi and αj and the text describes how these are related to fk, fq, and the discrete indices.
The diamond-shaped nature of Z means that in one "row" there will be "columns" that fall in between the "columns" of adjacent "rows". Another way to think of this is that fi can take on fractional index values!
Note that using zeros or NaNs to fill in elements that don't exist in any given row has two drawbacks: 1) it inflates the size of what may already be a very large 2-D array, and 2) it does not really represent the true nature of Z (e.g. the array size will not really be correct).
Currently I am using a dictionary indexed on the actual values of αj to store the results:
import numpy as np
from collections import defaultdict
nrows = 64
ncolumns = 16384
fk = np.fft.fftfreq(nrows)
fq = np.fft.fftfreq(ncolumns)
# using random numbers here to simplify the example
# in practice S is the result of several FFTs and complex multiplications
S = np.random.random(size=(nrows,ncolumns)) + 1j*np.random.random(size=(nrows,ncolumns))
ret = defaultdict(lambda: {"fi":[],"Z":[]})
for k in range(-nrows//2, nrows//2):
    for q in range(-ncolumns//2, ncolumns//2):
        fi = 0.5*(fk[k] - fq[q])   # mapping rule given above
        alphaj = fk[k] + fq[q]
        Z = S[k, q]
        ret[alphaj]["fi"].append(fi)
        ret[alphaj]["Z"].append(Z)
I still find this a bit cumbersome to work with and wonder if anyone has suggestions for a better approach? "Better" here would be defined as more computationally and memory efficient and/or easier to interact with and visualize using something like matplotlib.
Note: This is related to another question about how to get rid of those nasty for-loops. Since this is about storing the results I thought it would be better to create two separate questions.
You can still view it as a straight two-dimensional array. But you can represent it as an array of rows, each row of which has a different number of items. For example, here's your 4x4 as a 2D array: (each 0 here is a unique data item)
xxx0xxx
xx0x0xx
x0x0x0x
0x0x0x0
x0x0x0x
xx0x0xx
xxx0xxx
Its sparse representation would be:
[
[0],
[0,0],
[0,0,0],
[0,0,0,0],
[0,0,0],
[0,0],
[0]
]
With this representation you eliminate the empty space. There's a little math involved in converting from αj to row and from fi to column (and vice-versa), but that's tractable. You know the bounds and that items are evenly spaced out across each row. So it should be easy enough to do the translation.
Unless I'm missing something . . .
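To make the row/column translation concrete, here is a small sketch that groups the elements of a rectangular S by anti-diagonal (constant k + q in index terms, i.e. constant αj up to the fftfreq ordering), which produces exactly the ragged rows drawn above; the function name is just illustrative:
import numpy as np

def to_diagonal_rows(S):
    """Group the elements of a 2-D array into one list per anti-diagonal (k + q constant)."""
    n, m = S.shape
    rows = [[] for _ in range(n + m - 1)]
    for k in range(n):
        for q in range(m):
            rows[k + q].append(S[k, q])
    return rows

# 4x4 example: the row lengths come out as 1, 2, 3, 4, 3, 2, 1, matching the diamond above
S = np.arange(16).reshape(4, 4)
print([len(r) for r in to_diagonal_rows(S)])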
It turns out that the answer to a related question on optimization effectively solved my problem of how to better store the data. The new code returns 2-D arrays for fi and αj, and these can be used to directly index S. So to get all values of S for αj = 0, for example, one can do
S[alphaj == 0]
I can use this pretty effectively and it seems like the quickest way to create a reasonable data structure.
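A minimal sketch of that vectorized construction, assuming the mapping rule stated in the question (fk varies along the rows of S, fq along the columns):
import numpy as np

nrows, ncolumns = 64, 16384
fk = np.fft.fftfreq(nrows)
fq = np.fft.fftfreq(ncolumns)
S = np.random.random((nrows, ncolumns)) + 1j*np.random.random((nrows, ncolumns))

# 2-D frequency arrays, one entry per element of S
FQ, FK = np.meshgrid(fq, fk)        # both have shape (nrows, ncolumns)
fi = 0.5*(FK - FQ)
alphaj = FK + FQ

# all values of S at alphaj == 0, as in the answer above
# (exact comparison is fine here: fftfreq of a power-of-two length gives exact binary fractions)
S_at_zero = S[alphaj == 0]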

defining a for loop for operations on each row of a NumPy array

I have a data set of length (L) which I named "data".
data=raw_data.iloc[:,0]
I randomly generated 2000 sample series from "data" and named the result "resamples", giving a NumPy matrix with 2000 rows and L columns (the length of "data").
resamples=[np.random.choice(data, size=len(data), replace=True) for i in range (2000)]
The code below shows two operations from scipy.stats applied to "data", which is a single array. Now I need to perform the same operations on each of those 2000 sample series by defining a for loop. The challenge is that two parameters (loc and scale) are calculated in the first step, and they must be used for each row to perform the second step. My knowledge falls short in defining such a for loop, so I was wondering if anyone could help me with this.
loc, scale=stats.gumbel_r.fit(data)
return_gumbel=stats.gumbel_r.ppf([0.9999,0.9995,0.999],loc=loc, scale=scale)
The description is a little unclear, but I think you just need:
import numpy as np
from scipy import stats

alist = []
for data in resamples:
    loc, scale = stats.gumbel_r.fit(data)
    return_gumbel = stats.gumbel_r.ppf([0.9999, 0.9995, 0.999], loc=loc, scale=scale)
    alist.append(return_gumbel)
arr = np.array(alist)
You could also create arr first, and assign return_gumbel to the respective rows, but the list append is about the same speed. The loop could also be written as a list comprehension.
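For reference, the list-comprehension form mentioned above might look like this (same scipy.stats calls, just wrapped in a small helper; resamples is the list built in the question):
import numpy as np
from scipy import stats

def fit_and_ppf(row):
    loc, scale = stats.gumbel_r.fit(row)
    return stats.gumbel_r.ppf([0.9999, 0.9995, 0.999], loc=loc, scale=scale)

arr = np.array([fit_and_ppf(row) for row in resamples])   # shape (2000, 3)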
There was talk of vectorizing, but given the complex nature of the calculation I doubt if that is feasible - at least not without digging into the details of those stats functions. In numpy vectorizing means writing a function such that it works with all rows of the array at once, performing the actions in compiled numpy code.

Compute outer product of arrays with arbitrary dimensions

I have two arrays A,B and want to take the outer product on their last dimension,
e.g.
result[:,i,j]=A[:,i]*B[:,j]
when A,B are 2-dimensional.
How can I do this if I don't know whether they will be 2 or 3 dimensional?
In my specific problem A,B are slices out of a bigger 3-dimensional array Z,
Sometimes this may be called with integer indices A=Z[:,1,:], B=Z[:,2,:] and other times
with slices A=Z[:,1:3,:],B=Z[:,4:6,:].
Since scipy "squeezes" singleton dimensions, I won't know what dimensions my inputs
will be.
The array-outer-product I'm trying to define should satisfy
array_outer_product( Y[a,b,:], Z[i,j,:] ) == scipy.outer( Y[a,b,:], Z[i,j,:] )
array_outer_product( Y[a:a+N,b,:], Z[i:i+N,j,:])[n,:,:] == scipy.outer( Y[a+n,b,:], Z[i+n,j,:] )
array_outer_product( Y[a:a+N,b:b+M,:], Z[i:i+N, j:j+M,:] )[n,m,:,:]==scipy.outer( Y[a+n,b+m,:] , Z[i+n,j+m,:] )
for any rank-3 arrays Y,Z and integers a,b,...i,j,k...n,N,...
The kind of problem I'm dealing with involves a 2-D spatial grid, with a vector-valued function at each grid point. I want to be able to calculate the covariance matrix (outer product) of these vectors, over regions defined by slices in the first two axes.
You may have some luck with einsum:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html
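For the last-axis outer product described in the question, a sketch of what the einsum call could look like: the ellipsis stands for however many leading dimensions the inputs happen to have, as long as they match.
import numpy as np

A = np.random.random((5, 3, 4))
B = np.random.random((5, 3, 6))

# outer product over the last axis, keeping all leading axes paired up
result = np.einsum('...i,...j->...ij', A, B)
print(result.shape)     # (5, 3, 4, 6)

# 2-D case: result[:, i, j] == A[:, i] * B[:, j], as requested above
A2 = np.random.random((7, 4))
B2 = np.random.random((7, 6))
r2 = np.einsum('...i,...j->...ij', A2, B2)
assert np.allclose(r2[:, 2, 3], A2[:, 2] * B2[:, 3])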
After discovering the use of ellipsis in numpy/scipy arrays
I ended up implementing it as a recursive function:
import scipy

def array_outer_product(A, B, result=None):
    '''Compute the outer-product in the final two dimensions of the given arrays.
    If the result array is provided, the results are written into it.
    '''
    assert A.shape[:-1] == B.shape[:-1]
    if result is None:
        result = scipy.zeros(A.shape + B.shape[-1:], dtype=A.dtype)
    if A.ndim == 1:
        result[:, :] = scipy.outer(A, B)
    else:
        for idx in xrange(A.shape[0]):
            array_outer_product(A[idx, ...], B[idx, ...], result[idx, ...])
    return result
Assuming I've understood you correctly, I encountered a similar issue in my research a couple weeks ago. I realized that the Kronecker product is simply an outer product which preserves dimensionality. Thus, you could do something like this:
import numpy as np
# Generate some data
a = np.random.random((3,2,4))
b = np.random.random((2,5))
# Now compute the Kronecker product
c = np.kron(a,b)
# Check the shape
np.prod(c.shape) == np.prod(a.shape)*np.prod(b.shape)
I'm not sure what shape you want at the end, but you could use array slicing in combination with np.rollaxis, np.reshape, np.ravel (etc.) to shuffle things around as you wish. I guess the downside of this is that it does some extra calculations. This may or may not matter, depending on your limitations.