I have several large NumPy arrays of dimensions 30x30x30. For each array, I need to traverse the indices, take the sum of each index triplet, and bin the elements by that sum. For example, consider this simple 2x2 array:
test = np.array([[2,3],[0,1]])
This array has the indices [0,0], [0,1], [1,0] and [1,1]. The routine would return the list [2,[3,0],1], because 2 in array test has index sum 0, 3 and 0 have index sum 1, and 1 has index sum 2. I know the brute-force method of iterating through the NumPy array and checking each sum would work, but it is far too inefficient for my actual case with large N (=30) and several arrays. Any input on using NumPy routines to accomplish this grouping would be appreciated. Thank you in advance.
Here is one way that should be reasonably fast, but not super-fast: 30x30x30 takes 20 ms on my machine.
import numpy as np
# make example
dims = 2,3,4
a = np.arange(np.prod(dims),0,-1).reshape(dims)
# create the index-sum array: idx[i,j,k] == i+j+k
idx = sum(np.ogrid[tuple(map(slice, dims))])
# order that sorts the flattened data by index sum
srt = idx.ravel().argsort(kind='stable')
# use that order to arrange the data, then split at each new index sum
asrt = a.ravel()[srt]
spltpts = idx.ravel().searchsorted(np.arange(1, np.sum(dims) - len(dims) + 1), sorter=srt)
out = np.split(asrt, spltpts)
# admire
out
# [array([24]), array([23, 20, 12]), array([22, 19, 16, 11, 8]), array([21, 18, 15, 10, 7, 4]), array([17, 14, 9, 6, 3]), array([13, 5, 2]), array([1])]
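Applied to the 2x2 example from the question, the same recipe reproduces the expected grouping (a quick sanity check, not extra machinery):
test = np.array([[2, 3], [0, 1]])
dims = test.shape
idx = sum(np.ogrid[tuple(map(slice, dims))])
srt = idx.ravel().argsort(kind='stable')
spltpts = idx.ravel().searchsorted(np.arange(1, np.sum(dims) - len(dims) + 1), sorter=srt)
np.split(test.ravel()[srt], spltpts)
# [array([2]), array([3, 0]), array([1])]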
You could procedurally create a list of index tuples and use that, but you may end up with a code constant that's too large to be efficient:
[(0,0),[(1,0),(0,1)],(1,1)],
So you need a function to generate these indices on the fly for an n-dimensional array.
For one dimension, it's a trivial count/increment:
[(0),(1),(2),...]
For the second, use the one-dimension strategy for the first dimension, then decrement the first and increment the second to fill in:
[(0...)...,(1...)...,(2...)...,...]
[[(0,0)],[(1,0),(0,1)],[(2,0),(1,1),(0,2)],[...],...]
Notice that some of these would be outside the example array; your generator would need to include a bounds check.
Then for three dimensions, give the first two dimensions the treatment above, but at the end, decrement the first dimension, increment the third, and repeat until done:
[[(0,0,0),...],[(1,0,0),(0,1,0),...],[(2,0,0),(1,1,0),(0,2,0),...],[...],...]
[[(0,0,0)],[(1,0,0),(0,1,0),(0,0,1)],[(2,0,0),(1,1,0),(0,2,0),(1,0,1),(0,1,1),(0,0,2)],...]
Again, you need bounds checks or cleverer start/end points to avoid stepping outside the array, but this general algorithm is how you'd generate the indices on the fly rather than having two large arrays compete for cache and I/O.
Generating the Python or NumPy equivalent is left as an exercise to the reader.
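For reference, here is a minimal sketch of such a generator. It folds the bounds check into itertools.product instead of the explicit decrement/increment walk described above, so it illustrates the grouping rather than the exact traversal order; indices_by_sum is a name made up for this sketch:
import numpy as np
from itertools import product

def indices_by_sum(shape):
    # one bucket per possible index sum, 0 .. sum(d - 1)
    max_sum = sum(d - 1 for d in shape)
    groups = [[] for _ in range(max_sum + 1)]
    # iterating over range(d) per axis stays in bounds by construction
    for tup in product(*(range(d) for d in shape)):
        groups[sum(tup)].append(tup)
    return groups

test = np.array([[2, 3], [0, 1]])
out = [[test[t] for t in g] for g in indices_by_sum(test.shape)]
# [[2], [3, 0], [1]]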
If I have a large array and a small array, for example
A = np.array([1,2,3])
B = np.array([3,4,5,6,7,8,2,1])
I can use np.intersect1d to get the common values,
but if I want to get the indices (in the large array B) of those common values, which for this example should be [0, 6, 7], is there any command to get them?
You can use np.in1d() to get a Boolean array that marks the places where items of A appear in B; then, using the np.where() or np.argwhere() function, you can get the indices of the True items:
In [8]: np.where(np.in1d(B, A))[0]
Out[8]: array([0, 6, 7])
Or, as mentioned in the comments, np.in1d(B, A).nonzero()[0]. However, which way you choose here depends largely on the rest of your program and where/how you want to use the results. In addition, you can run benchmarks on all the methods, on both short and large arrays, to see which one is more appropriate in which situation.
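As a quick sanity check, both forms give the same index array on the example above (a minimal sketch; benchmark on your own data, as noted):
import numpy as np

A = np.array([1, 2, 3])
B = np.array([3, 4, 5, 6, 7, 8, 2, 1])

mask = np.in1d(B, A)              # True where B's item appears in A
assert (np.where(mask)[0] == mask.nonzero()[0]).all()
print(mask.nonzero()[0])          # [0 6 7]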
I'm fairly new to Python/Numpy. What I have here is a standard array and I have a function which I have vectorized appropriately.
def f(i):
    return np.random.choice(2, 1, p=[0.7, 0.3]) * 9
f = np.vectorize(f)
Defining an example array:
array = np.array([[1,1,0],[0,1,0],[0,0,1]])
With the vectorized function, f, I would like to evaluate f on each cell on the array with a value of 0.
I am trying to leave for loops as a last resort. My arrays will eventually be larger than 100 by 100, so running each cell individually to look and evaluate f might take too long.
I have tried:
print f(array[array==0])
Unfortunately, this gives me a row array consisting of 5 elements (the zeroes in my original array).
Alternatively I have tried,
array[array==0] = f(1)
But as expected, this just turns every zero element of array into the same value: all 0's or all 9's.
What I'm looking for is somehow to give me my original array with the zero elements replaced individually. Ideally, 30% of my original zero elements will become 9 and the array structure is conserved.
Thanks
The reason your first try doesn't work is that the vectorized function handle, let's call it f_v to distinguish it from the original f, performs the operation for exactly 5 elements: the 5 elements returned by the boolean indexing operation array[array==0]. That returns 5 values; it doesn't set those 5 items to the returned values. Your analysis of why the 2nd form fails is spot-on.
If you wanted to solve it you could combine your second approach with adding the size option to np.random.choice:
array = np.array([[1,1,0],[0,1,0],[0,0,1]])
mask = array==0
array[mask] = np.random.choice([0, 9], size=mask.sum(), p=[0.7, 0.3])
# example output:
# array([[1, 1, 9],
#        [0, 1, 9],
#        [9, 0, 1]])
There was no need for np.vectorize: the size option takes care of that already.
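As a quick check that the size-based draw really gives roughly 30% nines (a sketch, not part of the original answer):
import numpy as np

draws = np.random.choice([0, 9], size=100000, p=[0.7, 0.3])
print((draws == 9).mean())   # ~0.3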
We can create multi-dimensional arrays in Python by using nested lists, such as:
A = [[1,2,3],
[2,1,3]]
etc.
In this case, it is simple: nRows = len(A) and nCols = len(A[0]). However, with deeper or irregular nesting it becomes complicated:
A = [[[1,1,[1,2,3,4]],2,[3,[2,[3,4]]]],
[2,1,3]]
etc.
These lists are legal in Python, and the number of dimensions is not known a priori.
In this case, how do I determine the number of dimensions and the number of elements in each dimension?
I'm looking for an algorithm, and if possible an implementation. I believe it is something similar to DFS. Any suggestions?
P.S.: I'm not looking for any existing packages, though I would like to know about them.
I believe I have solved the problem myself.
It is just a simple DFS.
For the example given above: A = [[[1,1,[1,2,3,4]],2,[3,[2,[3,4]]]],
[2,1,3]]
the answer is as follows:
[[3, 2, 2, 2, 3, 4], [3]]
The total number of dimensions is 7.
I guess I was overthinking... thanks anyway...!
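Since the answer doesn't show the code, here is a minimal DFS sketch of one way to record the structure. It is my own interpretation: it reports the length of every nested list in DFS order plus the maximum nesting depth, so its grouping differs from the output quoted above; dfs_shape is a name made up for this sketch:
def dfs_shape(node, depth=0):
    # leaves contribute no lengths, only the depth they sit at
    if not isinstance(node, list):
        return [], depth
    lengths = [len(node)]
    max_depth = depth + 1
    for child in node:
        child_lengths, child_depth = dfs_shape(child, depth + 1)
        lengths.extend(child_lengths)
        max_depth = max(max_depth, child_depth)
    return lengths, max_depth

A = [[[1, 1, [1, 2, 3, 4]], 2, [3, [2, [3, 4]]]],
     [2, 1, 3]]
print(dfs_shape(A))
# ([2, 3, 3, 4, 2, 2, 2, 3], 5)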
When I do:
import rpy2.robjects as R
exampleDict = {'column1':R.IntVector([1,2,3]), 'column2':R.FloatVector([1,2]), 'column3':R.FloatVector([1,2,3,4])}
R.DataFrame(exampleDict)
I get the error that the rows are not of the same lengths: "arguments imply differing number of rows: 2, 4, 3".
The way I solved this before was to loop through the lists before making them vectors, appending NA to every list shorter than the longest until they were all the same length.
Is there an easy way of making a dataframe with rpy2 with different column lengths?
edit: I tried
myparams = {'na.rm': True}
R.DataFrame(exampleDict, **myparams)
but R.DataFrame only accepts one argument.
As lgautier said, it was answered on the rpy mailing list: it can't be done. So I'll keep to my workaround of appending NA_Real and NA_Integer values to the vectors that are shorter than the longest vector before making a DataFrame.
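For completeness, a hedged sketch of that padding workaround (assuming rpy2's robjects namespace exposes NA_Real and NA_Integer, as 2.x versions do; the column data is taken from the question above):
import rpy2.robjects as R
from rpy2.robjects import NA_Integer, NA_Real

columns = {'column1': [1, 2, 3], 'column2': [1.0, 2.0], 'column3': [1.0, 2.0, 3.0, 4.0]}
n = max(len(v) for v in columns.values())

padded = {}
for name, values in columns.items():
    if all(isinstance(v, int) for v in values):
        padded[name] = R.IntVector(values + [NA_Integer] * (n - len(values)))
    else:
        padded[name] = R.FloatVector(values + [NA_Real] * (n - len(values)))

df = R.DataFrame(padded)   # all columns now have equal length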