Build numpy array with multiple custom index ranges without explicit loop - python

In Numpy, is there a pythonic way to create array3 with custom ranges from array1 and array2 without a loop? The straightforward solution of iterating over the ranges works but since my arrays run into millions of items, I am looking for a more efficient solution (maybe syntactic sugar too).
For ex.,
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
array3 = np.concatenate([np.arange(array1[i], array2[i]) for i in
                         np.arange(0,len(array1))])
print(array3)
result: [10,11,12,13,65,66,67,68,69,200,201,202,203].

Assuming the ranges do not overlap, you could build a mask which is nonzero where the index is between the ranges specified by array1 and array2 and then use np.flatnonzero to obtain an array of indices -- the desired array3:
import numpy as np
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
first, last = array1.min(), array2.max()
array3 = np.zeros(last-first+1, dtype='i1')
array3[array1-first] = 1
array3[array2-first] = -1
array3 = np.flatnonzero(array3.cumsum())+first
print(array3)
yields
[ 10 11 12 13 65 66 67 68 69 200 201 202 203]
For large len(array1), using_flatnonzero can be significantly faster than using_loop:
def using_flatnonzero(array1, array2):
    first, last = array1.min(), array2.max()
    array3 = np.zeros(last-first+1, dtype='i1')
    array3[array1-first] = 1
    array3[array2-first] = -1
    return np.flatnonzero(array3.cumsum())+first
def using_loop(array1, array2):
    return np.concatenate([np.arange(array1[i], array2[i]) for i in
                           np.arange(0,len(array1))])
array1, array2 = (np.random.choice(range(1, 11), size=10**4, replace=True)
                  .cumsum().reshape(2, -1, order='F'))
assert np.allclose(using_flatnonzero(array1, array2), using_loop(array1, array2))
In [260]: %timeit using_loop(array1, array2)
100 loops, best of 3: 9.36 ms per loop
In [261]: %timeit using_flatnonzero(array1, array2)
1000 loops, best of 3: 564 µs per loop
If the ranges overlap, then using_loop will return an array3 which contains duplicates. using_flatnonzero returns an array with no duplicates.
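One caveat worth spelling out: the mask is filled by plain assignment, so a start index that coincides with another range's end (adjacent ranges such as [10, 14) and [14, 18)), or two ranges sharing the same start, will overwrite each other's marks. Below is a minimal sketch of a more forgiving variant, assuming np.add.at is acceptable for accumulating the marks. The name ranges_union is hypothetical and not part of the original answer.

```python
import numpy as np

def ranges_union(starts, ends):
    """Sorted unique integers covered by the half-open ranges [starts[i], ends[i])."""
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    first, last = starts.min(), ends.max()
    # +1 where a range opens, -1 where one closes; np.add.at accumulates
    # repeated indices instead of overwriting them.
    delta = np.zeros(last - first + 1, dtype=np.int64)
    np.add.at(delta, starts - first, 1)
    np.add.at(delta, ends - first, -1)
    # The running sum counts how many ranges cover each position.
    return np.flatnonzero(delta.cumsum() > 0) + first

print(ranges_union([10, 14], [14, 18]))            # adjacent ranges
print(ranges_union([10, 11, 200], [14, 18, 204]))  # overlapping ranges
```

np.add.at is typically slower than plain fancy assignment, so this trades some speed for robustness; if the ranges are known to be disjoint and non-adjacent, the assignment version is enough.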
Explanation: Let's look at a small example with
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
The objective is to build an array which looks like goal, below. The 1's are located at index values [ 10, 11, 12, 13, 65, 66, 67, 68, 69, 200, 201, 202, 203] (i.e. array3):
In [306]: goal
Out[306]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=int8)
Once we have the goal array, array3 can be obtained with a call to np.flatnonzero:
In [307]: np.flatnonzero(goal)
Out[307]: array([ 10, 11, 12, 13, 65, 66, 67, 68, 69, 200, 201, 202, 203])
goal has the same length as array2.max():
In [308]: array2.max()
Out[308]: 204
In [309]: goal.shape
Out[309]: (204,)
So we can begin by allocating
goal = np.zeros(array2.max()+1, dtype='i1')
and then filling in 1's at the index locations given by array1 and -1's at the indices given by array2:
In [311]: goal[array1] = 1
In [312]: goal[array2] = -1
In [313]: goal
Out[313]:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, -1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
-1], dtype=int8)
Now applying cumsum (the cumulative sum) produces the desired goal array:
In [314]: goal = goal.cumsum(); goal
Out[314]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0])
In [315]: np.flatnonzero(goal)
Out[315]: array([ 10, 11, 12, 13, 65, 66, 67, 68, 69, 200, 201, 202, 203])
That's the main idea behind using_flatnonzero. The subtraction of first was simply to save a bit of memory.

Prospective Approach
I will go backwards on how to approach this problem.
Take the sample listed in the question. We have -
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
Now, look at the desired result -
result: [10,11,12,13,65,66,67,68,69,200,201,202,203]
Let's calculate the group lengths, as we would be needing those to explain the solution approach next.
In [58]: lens = array2 - array1
In [59]: lens
Out[59]: array([4, 5, 4])
The idea is to start from an array initialized with 1's which, when cumulatively summed across its entire length, gives us the desired result.
This cumulative summation would be the last step of our solution.
Why initialize with 1's? Because the result increases in steps of 1, except at specific places where we have shifts
corresponding to new groups coming in.
Now, since cumsum would be the last step, the step before it should give us something like -
array([ 10, 1, 1, 1, 52, 1, 1, 1, 1, 131, 1, 1, 1])
As discussed before, it's 1's, with [10,52,131] at specific places. That 10 comes from the first element in array1, but what about the rest?
The second one, 52, came in as 65-13 (looking at the result), where 13 is the last element of the group that started at 10 and ran for the length of
the first group, 4. So, if we do 65 - 10 - 4, we get 51; adding 1 to account for the exclusive stop boundary gives 52, which is the
desired shifting value. Similarly, we would get 131.
Thus, those shifting-values could be computed, like so -
In [62]: np.diff(array1) - lens[:-1]+1
Out[62]: array([ 52, 131])
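A quick way to sanity-check those shifting values: the array that cumsum must start from is simply the first difference of the desired result. A small sketch, assuming a NumPy version whose np.diff accepts the prepend argument (1.16 or later); the variable names are only for illustration.

```python
import numpy as np

result = np.array([10, 11, 12, 13, 65, 66, 67, 68, 69, 200, 201, 202, 203])

# Undo the cumulative sum: difference each element with its predecessor,
# treating the element "before" the first one as 0.
steps = np.diff(result, prepend=0)
print(steps)   # [ 10   1   1   1  52   1   1   1   1 131   1   1   1]

# Summing the steps back up recovers the result.
assert np.array_equal(steps.cumsum(), result)
```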
Next up, to get those shifting-places where such shifts occur, we can simply do cumulative summation on the group lengths -
In [65]: lens[:-1].cumsum()
Out[65]: array([4, 9])
For completeness, we need to prepend 0 to the array of shifting-places and array1[0] to the array of shifting-values.
So, we are set to present our approach in a step-by-step format!
Putting back the pieces
1] Get lengths of each group :
lens = array2 - array1
2] Get indices at which shifts occur and values to be put in 1's initialized array :
shift_idx = np.hstack((0,lens[:-1].cumsum()))
shift_vals = np.hstack((array1[0],np.diff(array1) - lens[:-1]+1))
3] Setup 1's initialized ID array for inserting those values at those indices listed in the step before :
id_arr = np.ones(lens.sum(),dtype=array1.dtype)
id_arr[shift_idx] = shift_vals
4] Finally do cumulative summation on the ID array :
output = id_arr.cumsum()
Listed in a function format, we would have -
def using_ones_cumsum(array1, array2):
    lens = array2 - array1
    shift_idx = np.hstack((0,lens[:-1].cumsum()))
    shift_vals = np.hstack((array1[0],np.diff(array1) - lens[:-1]+1))
    id_arr = np.ones(lens.sum(),dtype=array1.dtype)
    id_arr[shift_idx] = shift_vals
    return id_arr.cumsum()
And it works on overlapping ranges too!
In [67]: array1 = np.array([10, 11, 200])
...: array2 = np.array([14, 18, 204])
...:
In [68]: using_ones_cumsum(array1, array2)
Out[68]:
array([ 10, 11, 12, 13, 11, 12, 13, 14, 15, 16, 17, 200, 201,
202, 203])
Runtime test
Let's time the proposed approach against the other vectorized approach, @unutbu's flatnonzero based solution, which already proved to be much better than the loopy approach -
In [38]: array1, array2 = (np.random.choice(range(1, 11), size=10**4, replace=True)
...: .cumsum().reshape(2, -1, order='F'))
In [39]: %timeit using_flatnonzero(array1, array2)
1000 loops, best of 3: 889 µs per loop
In [40]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 235 µs per loop
Improvement!
Now, codewise NumPy doesn't like appending. So, those np.hstack calls could be avoided for a slightly improved version as listed below -
def get_ranges_arr(starts,ends):
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1],dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()
Let's time it against our original approach -
In [151]: array1,array2 = (np.random.choice(range(1, 11),size=10**4, replace=True)\
...: .cumsum().reshape(2, -1, order='F'))
In [152]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 276 µs per loop
In [153]: %timeit get_ranges_arr(array1, array2)
10000 loops, best of 3: 193 µs per loop
So, we have a 30% performance boost there!
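One thing to keep in mind: the ones-plus-cumsum versions write one shift per range into id_arr, so a zero-length range (start == end) makes two shifts land on the same index and one of them is lost. If empty ranges can occur, a variant built on np.repeat sidesteps that. This is only a sketch, and the name ranges_via_repeat is hypothetical rather than part of the answer above.

```python
import numpy as np

def ranges_via_repeat(starts, ends):
    """Concatenate arange(starts[i], ends[i]) for all i; empty ranges are allowed."""
    starts = np.asarray(starts)
    lens = np.asarray(ends) - starts
    # Index in the output at which each range begins: [0, l0, l0+l1, ...]
    range_begin = lens.cumsum() - lens
    # Position of every output element inside its own range: 0, 1, 2, ...
    within = np.arange(lens.sum()) - np.repeat(range_begin, lens)
    return np.repeat(starts, lens) + within

print(ranges_via_repeat([10, 65, 200], [14, 70, 204]))
print(ranges_via_repeat([10, 20, 30], [12, 20, 32]))   # middle range is empty
```

It does two np.repeat passes instead of one cumsum, so it may well be slower than get_ranges_arr; I have not timed it here.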

This is my approach combining vectorize and concatenate:
Implementation:
import numpy as np
array1, array2 = np.array([10, 65, 200]), np.array([14, 70, 204])
ranges = np.vectorize(lambda a, b: np.arange(a, b), otypes=[np.ndarray])
result = np.concatenate(ranges(array1, array2), axis=0)
print(result)
# [ 10 11 12 13 65 66 67 68 69 200 201 202 203]
Performance:
%timeit np.concatenate(ranges(array1, array2), axis=0)
100000 loops, best of 3: 13.9 µs per loop

Do you mean this?
In [440]: np.r_[10:14,65:70,200:204]
Out[440]: array([ 10, 11, 12, 13, 65, 66, 67, 68, 69, 200, 201, 202, 203])
or generalizing:
In [454]: np.r_[tuple([slice(i,j) for i,j in zip(array1,array2)])]
Out[454]: array([ 10, 11, 12, 13, 65, 66, 67, 68, 69, 200, 201, 202, 203])
Though this does involve a double loop, the explicit one to generate the slices and one inside r_ to convert the slices to arange.
for k in range(len(key)):
    scalar = False
    if isinstance(key[k], slice):
        step = key[k].step
        start = key[k].start
        ...
        newobj = _nx.arange(start, stop, step)
I mention this because it shows that numpy developers consider your kind of iteration normal.
I expect that @unutbu's clever, if somewhat obscure (I haven't figured out what it is doing yet), solution is your best chance of speed. cumsum is a good tool when you need to work with ranges that can vary in length. It probably gains most when working with many small ranges. I don't think it works with overlapping ranges.
================
np.vectorize uses np.frompyfunc. So this iteration can also be expressed with:
In [467]: f=np.frompyfunc(lambda x,y: np.arange(x,y), 2,1)
In [468]: f(array1,array2)
Out[468]:
array([array([10, 11, 12, 13]), array([65, 66, 67, 68, 69]),
array([200, 201, 202, 203])], dtype=object)
In [469]: timeit np.concatenate(f(array1,array2))
100000 loops, best of 3: 17 µs per loop
In [470]: timeit np.r_[tuple([slice(i,j) for i,j in zip(array1,array2)])]
10000 loops, best of 3: 65.7 µs per loop
With @Darius's vectorize solution:
In [474]: timeit result = np.concatenate(ranges(array1, array2), axis=0)
10000 loops, best of 3: 52 µs per loop
vectorize must be doing some extra work to allow more powerful use of broadcasting. Relative speeds may shift if array1 is much larger.
@unutbu's solution isn't special with this small array1.
In [478]: timeit using_flatnonzero(array1,array2)
10000 loops, best of 3: 57.3 µs per loop
The OP's solution, iterating without my r_ middleman, does well:
In [483]: timeit array3 = np.concatenate([np.arange(array1[i], array2[i]) for i in np.arange(0,len(array1))])
10000 loops, best of 3: 24.8 µs per loop
It's often the case that with a small number of loops, a list comprehension is faster than fancier numpy operations.
For @unutbu's larger test case, my timings are consistent with his, showing a 17x speedup.
===================
For the small sample arrays, @Divakar's solution is slower, but for the large ones it is 3x faster than @unutbu's. So it has more of a setup cost, but scales better.
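To reproduce this kind of small-versus-large comparison outside IPython, something along these lines should do. It is a rough sketch using the standard timeit module; make_ranges is a hypothetical helper that mimics the random test data used above, and the numbers it prints will depend on your machine and NumPy version.

```python
import timeit
import numpy as np

def using_loop(starts, ends):
    return np.concatenate([np.arange(s, e) for s, e in zip(starts, ends)])

def get_ranges_arr(starts, ends):
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1], dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()

def make_ranges(n_ranges, seed=0):
    # Strictly increasing edges; even positions are starts, odd positions are ends.
    rng = np.random.default_rng(seed)
    edges = rng.integers(1, 11, size=2 * n_ranges).cumsum()
    return edges[0::2], edges[1::2]

for n in (3, 5000):
    starts, ends = make_ranges(n)
    # Both functions must agree before their timings mean anything.
    assert np.array_equal(using_loop(starts, ends), get_ranges_arr(starts, ends))
    for fn in (using_loop, get_ranges_arr):
        best = min(timeit.repeat(lambda: fn(starts, ends), number=10, repeat=3))
        print(f"n={n:5d}  {fn.__name__:15s} {best / 10 * 1e6:9.1f} µs per call")
```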

Related

Converting an array to a list in Python

I have an array A. I want to identify all locations with element 1 and convert it to a list as shown in the expected output. But I am getting an error.
import numpy as np
A=np.array([0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
B=np.where(A==1)
B=B.tolist()
print(B)
The error is
in <module>
B=B.tolist()
AttributeError: 'tuple' object has no attribute 'tolist'
The expected output is
[1, 2, 5, 7, 10, 11]
np.where used with only the condition returns a tuple of arrays containing indices, one array for each dimension of the input. According to the docs, this is much like np.nonzero, which is the recommended approach over np.where. Since your array is one-dimensional, np.where returns a tuple with one element: the array containing the indices in your expected output. You can resolve your problem by indexing into the tuple, like np.where(A == 1)[0].tolist().
However, I recommend using np.flatnonzero instead, which avoids the hassle entirely:
import numpy as np
A = np.array([0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
B = np.flatnonzero(A).tolist()
B:
[1, 2, 5, 7, 10, 11]
PS: when all other elements are 0, you don't have to explicitly compare to 1 ;).
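To see why np.where hands back a tuple at all, it helps to try it on a 2-D array, where there is one index array per dimension. A small illustrative sketch (the array M is made up for the example):

```python
import numpy as np

M = np.array([[0, 1, 0],
              [1, 0, 1]])

rows, cols = np.where(M == 1)          # one index array per dimension
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)                           # [(0, 1), (1, 0), (1, 2)]

# flatnonzero works on the flattened array and returns a single index array.
print(np.flatnonzero(M).tolist())      # [1, 3, 5]
```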
import numpy as np
A = np.array([0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
indices = np.where(A == 1)[0]
B = indices.tolist()
print(B)
You should access the first element of this tuple with B[0] :
import numpy as np
A=np.array([0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
B=np.where(A==1)
B = B[0].tolist()
print(B) # [1, 2, 5, 7, 10, 11]

Creating a 'normal distribution' like range in numpy

I am trying to 'bin' an array into bins (similar to histogram). I have an input array input_array and a range bins = np.linspace(-200, 200, 200). The overall function looks something like this:
def bin(arr):
    bins = np.linspace(-100, 100, 200)
    return np.histogram(arr, bins=bins)[0]
So,
bin([64, 19, 120, 55, 56, 108, 16, 84, 120, 44, 104, 79, 116, 31, 44, 12, 35, 68])
would return:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
However, I want my bins to be more 'detailed' as I get close to 0... something similar to an ideal normal distribution. As a result, I could have more bins (i.e. short ranges) when I am close to 0, and as I move out towards the ends of the range, the bins get bigger. Is it possible?
More specifically, rather than having equally wide bins in a range, can I have an array of range where the bins towards the centre are smaller than towards the extremes?
I have already looked at answers like this and numpy.random.normal, but something is just not clicking right.
Use the inverse error function to generate the bins. You'll need to scale the bins to get the exact range you want.
This transform works because the inverse error function is flatter around zero than near +/- one.
import numpy as np
from scipy.special import erfinv
erfinv(np.linspace(-1, 1, 20))
# returns:
array([ -inf, -1.14541135, -0.8853822 , -0.70933273, -0.56893556,
-0.44805114, -0.3390617 , -0.23761485, -0.14085661, -0.0466774 ,
0.0466774 , 0.14085661, 0.23761485, 0.3390617 , 0.44805114,
0.56893556, 0.70933273, 0.8853822 , 1.14541135, inf])
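The scaling step could look roughly like this. Note the swap: to keep the sketch dependent on NumPy and the standard library only, it uses statistics.NormalDist().inv_cdf instead of erfinv. The two differ only by a rescaling (inv_cdf(p) equals sqrt(2) * erfinv(2p - 1)), which drops out once the edges are normalized. The function name center_dense_edges and the coverage parameter are assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist

def center_dense_edges(lo, hi, n_edges, coverage=0.99):
    """Bin edges spanning about [lo, hi], narrow near the middle and wide at the ends."""
    # Evenly spaced probabilities, kept away from 0 and 1 where the quantile is infinite.
    p = np.linspace((1 - coverage) / 2, (1 + coverage) / 2, n_edges)
    z = np.array([NormalDist().inv_cdf(q) for q in p])
    z /= np.abs(z).max()                      # normalize to [-1, 1]
    return (lo + hi) / 2 + (hi - lo) / 2 * z  # stretch to [lo, hi]

edges = center_dense_edges(-100, 100, 21)
data = [64, 19, 120, 55, 56, 108, 16, 84, 120, 44, 104, 79, 116, 31, 44, 12, 35, 68]
counts, _ = np.histogram(data, bins=edges)
print(np.round(edges, 1))
print(counts)
```

Values outside the outermost edges (here everything above 100) are simply not counted by np.histogram; widen lo and hi if they should be included.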

numpy: check for 1 every 6 element every row

I need to have something like this:
arr = array([[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]])
Where each row contains 36 elements, every 6 elements in a row represent a hidden row, and that hidden row needs exactly one 1 and 0 everywhere else. In other words, every group of 6 consecutive entries needs exactly one 1. This is my requirement for arr.
I have a table that's going to be used to compute a "fitness" value for each row. That is, I have a
table = np.array([10, 5, 4, 6, 5, 1, 6, 4, 9, 7, 3, 2, 1, 8, 3,
6, 4, 6, 5, 3, 7, 2, 1, 4, 3, 2, 5, 6, 8, 7, 7, 6, 4, 1, 3, 2])
table = table.T
and I'm going to multiply each row of arr with table. The result of that multiplication, a 1x1 matrix, will be stored as the "fitness" value of that corresponding row. UNLESS the row does not fit the requirement described above, which should return 0.
an example of what should be returned is
result = array([5,12,13,14,20,34])
I need a way to do this but I'm too new to numpy to know how to.
(I'm assuming you want what you've asked for in the first half.)
I believe better or more elegant solutions exist, but this is what I think can do the job.
np.all(arr[:,6] == 1) and np.all(arr[:, :6] == 0) and np.all(arr[:, 7:] == 0)
Alternatively, you can construct the array (with 0's and 1's) and then just compare with it, say using not_equal.
I'm also not 100% sure of your question, but I'll try to answer with the best of my knowledge.
Since you're saying your matrix has "hidden rows", to check whether it is well formed, the easiest way seems to be to just reshape it:
# First check, returns true if all elements are either 0 or 1
np.in1d(arr, [0,1]).all()
# Second check, provided the above was True, returns True if
# each "hidden row" has exactly one 1 and other 0.
(arr.reshape(6,6,6).sum(axis=2) == 1).all()
Both checks return "True" for your arr.
Now, my understanding is that for each "large" row of 36 elements, you want a scalar product with your "table" vector, unless that "large" row has an ill-formed "hidden small" row. In this case, I'd do something like:
# The following computes the result, not checking for integrity
results = arr.dot(table)
# Now remove the results that are not well formed.
# First, compute "large" rows where at least one "small" subrow
# fails the condition.
mask = (arr.reshape(6,6,6).sum(axis=2) != 1).any(axis=1)
# And set the corresponding answer to 0
results[mask] = 0
However, running this code against your data returns as answer
array([38, 31, 24, 24, 32, 20])
which is not what you mention; did I misunderstand your requirement, or was the example based on different data?

Splitting the multi relational graph using lil_matrix

I am storing a graph with two types of relationships in the sparse lil_matrix format. This is how I am doing it:
e=15
k= 2
X = [lil_matrix((e,e)) for i in range(k)]
#storing type 0 relation#
X[0][0,14] =1
X[0][0,8] =1
X[0][0,9] =1
X[0][0,10] =1
X[0][1,14] =1
X[0][1,6] =1
X[0][1,7] =1
X[0][2,8] =1
X[0][2,9] =1
X[0][2,10] =1
X[0][2,12] =1
X[0][3,6] =1
X[0][3,12] =1
X[0][3,11] =1
X[0][3,13] =1
X[0][4,11] =1
X[0][4,13] =1
X[0][5,13] =1
X[0][5,11] =1
X[0][5,10] =1
X[0][5,12] =1
#storing type 1 relation#
X[1][14,7] =1
X[1][14,6] =1
X[1][6,7] =1
X[1][6,8] =1
X[1][6,9] =1
X[1][10,9] =1
X[1][10,8] =1
X[1][10,11] =1
X[1][12,8] =1
X[1][12,10] =1
X[1][12,11] =1
X[1][12,13] =1
X[1][14,12] =1
X[1][11,9] =1
X[1][8,7] =1
X[1][8,9] =1
I would like to prune the network so that it contains only 50% of the nodes. This is my approach:
nodes_list = range(e)
total_nodes = len(nodes_list)
get_percentage_of_prune_nodes = int(total_nodes * 0.5)
new_nodes = sorted(random.sample(nodes_list,get_percentage_of_prune_nodes))
e_new= get_percentage_of_prune_nodes
k_new= 2
#Y is the pruned matrix#
Y = [lil_matrix((e_new,e_new)) for i in range(k_new)]
for i in range(e):
    for j in range(e):
        for rel in range(k_new):
            if i in new_nodes and j in new_nodes:
                if X[rel][i,j]==1:
                    Y[rel][new_nodes.index(i),new_nodes.index(j)] = 1
This is not very efficient when the original matrix (X) is huge. Is there a faster or smarter way to prune it?
Focusing on just one matrix:
In [318]: X=X[0].astype(int)
In [327]: X.A
Out[327]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
In [331]: new_nodes=sorted(random.sample(np.arange(e).tolist(),7))
In [332]: new_nodes
Out[332]: [0, 1, 2, 5, 8, 12, 13]
In [333]: Y=sparse.lil_matrix((7,7),dtype=int)
In [334]: for i in range(15):
...:     for j in range(e):
...:         if i in new_nodes and j in new_nodes:
...:             if X[i,j]:
...:                 Y[new_nodes.index(i),new_nodes.index(j)]=1
...:
In [335]: Y
Out[335]:
<7x7 sparse matrix of type '<class 'numpy.int32'>'
with 5 stored elements in LInked List format>
In [336]: Y.A
Out[336]:
array([[0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 0],
[0, 0, 0, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]])
This is the same as selecting rows and columns with new_nodes:
In [337]: X[np.ix_(new_nodes,new_nodes)]
Out[337]:
<7x7 sparse matrix of type '<class 'numpy.int32'>'
with 5 stored elements in LInked List format>
In [338]: _.A
Out[338]:
array([[0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 0],
[0, 0, 0, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]])
This indexing is faster with dense arrays:
In [341]: timeit X[np.ix_(new_nodes,new_nodes)]
188 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [342]: timeit X[np.ix_(new_nodes,new_nodes)].A
222 µs ± 6.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [343]: timeit X.A[np.ix_(new_nodes,new_nodes)]
62 µs ± 654 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The dense array approach may run into memory errors. But sparse indexing can also have memory problems.
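For anyone who wants to convince themselves that np.ix_ selects exactly what the nested loops build, here is a small dense-array sketch. The adjacency matrix uses a subset of the type 0 edges from the question, and the function names prune_with_loops and prune_with_ix are hypothetical. The same index expression is what gets applied to the lil_matrix above.

```python
import numpy as np

def prune_with_loops(adj, keep):
    """Reference version: copy the edges whose endpoints are both kept."""
    keep = sorted(keep)
    out = np.zeros((len(keep), len(keep)), dtype=adj.dtype)
    for i in range(adj.shape[0]):
        for j in range(adj.shape[1]):
            if i in keep and j in keep and adj[i, j]:
                out[keep.index(i), keep.index(j)] = adj[i, j]
    return out

def prune_with_ix(adj, keep):
    """Same result via an open mesh of row and column indices."""
    keep = sorted(keep)
    return adj[np.ix_(keep, keep)]

adj = np.zeros((15, 15), dtype=int)
for i, j in [(0, 8), (0, 14), (2, 8), (2, 12), (5, 12), (5, 13), (3, 6)]:
    adj[i, j] = 1

keep = [0, 1, 2, 5, 8, 12, 13]
assert np.array_equal(prune_with_loops(adj, keep), prune_with_ix(adj, keep))
print(prune_with_ix(adj, keep).sum())   # 5 edges survive the pruning
```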
Sparse matrix slicing memory error

Creating a multidimensional array (D>5) from a dictionary..?

I am trying to build a multidimensional array using vectors of different lengths to map out the 'process space' of a problem. I've started by storing values in keys of a dictionary:
d = {'width' : [1,2,3,5,3,5,3],
'height' : [1,2,3,5,5,3],
'length' : [1,3,3,7,8,0,0,7,2,3,6,3,2,3],
'composition' : [1,2,3,5,5,3],
'year' : [7,5,3,2,1,6,4,9,11],
'efficiency' : [1,1,2,3,5,8,13,21,34]}
Is it possible to use these keys to construct a multidimensional (6D) matrix of size
(7,6,14,6,9,9)? (That is, each dictionary key would be represented as a separate dimension of the final array)
EDIT:
I would like to use this matrix as a means of looking at a cross section of the data. For example, I would like to be able to say, "Here are all the efficiency values as a function of 'Length', given:
width = 4
height = 2
composition = 3
year = 7
I think you are naming the columns as dimensions.
Since you have indexes and data, use pandas DataFrames
from pandas import Series, DataFrame
d = {'width' : [1,2,3,5,3,5,3],
'height' : [1,2,3,5,5,3],
'length' : [1,3,3,7,8,0,0,7,2,3,6,3,2,3],
'composition' : [1,2,3,5,5,3],
'year' : [7,5,3,2,1,6,4,9,11],
'efficiency' : [1,1,2,3,5,8,13,21,34]}
Since there is missing data, you need an intermediate step before you can turn it into a DataFrame.
intermediate=dict()
for x in d:
    intermediate[x]=Series(d[x])
data=DataFrame(intermediate)
then you can query data using normal pandas syntax.
data[data.length>5]
composition efficiency height length width year
3 5 3 5 7 5 2
4 5 5 5 8 3 1
7 NaN 21 NaN 7 NaN 9
10 NaN NaN NaN 6 NaN NaN
Basic principle
The simplest and most efficient way would be using NumPy.
d = {'width' : [1,2,3,5,3,5,3],
'height' : [1,2,3,5,5,3],
'length' : [1,3,3,7,8,0,0,7,2,3,6,3,2,3],
'composition' : [1,2,3,5,5,3],
'year' : [7,5,3,2,1,6,4,9,11],
'efficiency' : [1,1,2,3,5,8,13,21,34]}
You need an order of your names:
names = ['width' ,'height', 'length' ,'composition', 'year','efficiency']
Import NumPy:
import numpy as np
Find the shape:
shape = tuple(len(d[name]) for name in names)
shape is:
(7, 6, 14, 6, 9, 9)
Create an array of zeros:
lookup = np.zeros(shape, dtype=np.uint16)
I use small unsigned integers (uint16) to save space. You can use a larger dtype if needed.
Now lookup can be used like this:
>>> lookup[0, 0, 0, 0, 0, 0]
0
>>> lookup[0, 0, 0, 0, 0, 0] = 12
>>> lookup[0, 0, 0, 0, 0, 0]
12
Lookup all values for efficiency:
>>> lookup[0, 0, 0, 0, 0, :]
array([12, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint16)
All values for year and efficiency:
>>> lookup[0, 0, 0, 0, :, :]
array([[12, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint16)
Useful class
For convenience, wrap into a class:
class Lookup(object):
    def __init__(self, dims, dtype=np.uint16):
        self.names = [item[0] for item in dims]
        self.shape = [item[1] for item in dims]
        self.repr = np.zeros(self.shape, dtype=dtype)
    def _make_loc(self, coords):
        # Recent NumPy versions need a tuple (not a list) to index several axes at once;
        # names missing from coords become full slices.
        return tuple(coords.get(name, slice(None)) for name in self.names)
    def get_value(self, coords):
        return self.repr[self._make_loc(coords)]
    def set_value(self, coords, value):
        self.repr[self._make_loc(coords)] = value
Specify the dimensions:
dims = [('width', 7),
('year', 9),
('composition', 6),
('height', 6),
('efficiency', 9),
('length', 14)]
Make an instance:
lookup = Lookup(dims)
Set a value:
coords1 = {'width': 3,
'height': 1,
'composition': 2,
'year': 6,
'length': 3}
lookup.set_value(coords1, 11)
Get a value back:
coords2 = {'width': 3,
'height': 1,
'composition': 2,
'year': 6}
lookup.get_value(coords2)
Gives you :
array([[ 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint16)
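The same name-to-axis translation can also be sketched as a plain function, which makes the key detail easier to see: the index passed to NumPy has to be a tuple, with slice(None) standing in for every axis left free. The helper name make_index is hypothetical, and the coordinates are positions along each axis, not the values stored in the dictionary.

```python
import numpy as np

names = ['width', 'height', 'length', 'composition', 'year', 'efficiency']
shape = (7, 6, 14, 6, 9, 9)
lookup = np.zeros(shape, dtype=np.uint16)

def make_index(coords, names=names):
    """Tuple index: fixed position for named axes, full slice for the rest."""
    return tuple(coords.get(name, slice(None)) for name in names)

# Fix five axes and leave 'efficiency' free: this writes 9 cells.
lookup[make_index({'width': 3, 'height': 1, 'composition': 2,
                   'year': 6, 'length': 3})] = 11

# Leave 'length' and 'efficiency' free: a 2-D cross section comes back.
section = lookup[make_index({'width': 3, 'height': 1,
                             'composition': 2, 'year': 6})]
print(section.shape)   # (14, 9): length along rows, efficiency along columns
```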
