numpy.partition() also seems to sort the elements within each row of the array.
I have been trying to do a simple sort of the array's rows based on the first element of each row.
import numpy as np
a = np.array([[5.2, 4.3], [200.2, 6.2], [1.4, 112.2]])
np.partition(a, (1,a.shape[1]-1), axis = 1)
Output:
array([[ 4.3, 5.2],
[ 6.2, 200.2],
[ 1.4, 112.2]])
I don't understand how np.partition() works here. Are there any resources that explain numpy.partition() in more detail?
Specifically, I want to modify the arguments of the method to generate the following output:
array([[ 1.4, 112.2],
[ 5.2, 4.3],
[ 200.2, 6.2]])
np.partition() ensures that values at particular indices are the same as they would be if the array were to be fully sorted (e.g. with np.sort). (The order of the values at the other indices is not guaranteed to be anything meaningful.)
The axis=1 argument means that this operation will be applied individually to each row.
Here, the indices you've passed are (1, a.shape[1]-1), which is equivalent to (1, 1) in this case. Repeating an index has no special meaning, so in each row the value in the second column (index 1) will be the same as it would be if the row were fully sorted.
Now, when the operation is applied, you see in the returned array that the higher values in the first and second rows have been moved to this second column. The third row was already in its sorted order and so is unchanged.
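A minimal check of that guarantee, reusing the array a from the question:
import numpy as np
a = np.array([[5.2, 4.3], [200.2, 6.2], [1.4, 112.2]])
part = np.partition(a, 1, axis=1)   # ask only for index 1 to be in its sorted place
full = np.sort(a, axis=1)           # fully sorted rows, for comparison
print(np.array_equal(part[:, 1], full[:, 1]))   # True: column 1 matches the full sort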
This is really all there is to the function: the NumPy documentation covers a few further details. If you're feeling particularly brave, you can find the source code implementing the introselect algorithm used by np.partition() in all its glory here.
If I understand correctly, you just want to sort the rows in your array according to the values in the first column. You can do this using np.argsort:
# get an array of indices that will sort the first column in ascending order
order = np.argsort(a[:, 0])
# index into the row dimension of a
a_sorted = a[order]
print(a_sorted)
# [[ 1.4 112.2]
# [ 5.2 4.3]
# [ 200.2 6.2]]
If you want a partial sort rather than a full sort, you could use np.argpartition in much the same way:
# a slightly larger example array in order to better illustrate what
# argpartition does
b = np.array([[ 5.2, 4.3],
[200.2, 6.2],
[ 3.6, 85.1],
[ 1.4, 112.2],
[ 12.8, 60.0],
[ 7.6, 23.4]])
# get a set of indices to reorder the rows of `b` such that, after reordering, the
# value in row 2 of the first column is in its final 'sorted' position, with all
# smaller values placed before it and all larger values after it
partial_order = np.argpartition(b[:, 0], 2)
# the first (2+1) elements in the first column are guaranteed to be smaller than
# the rest, but apart from that the order is arbitrary
print(b[partial_order])
# [[ 1.4 112.2]
# [ 3.6 85.1]
# [ 5.2 4.3]
# [ 200.2 6.2]
# [ 12.8 60. ]
# [ 7.6 23.4]]
I learned how axes are indexed for a numpy array from "how is axis indexed in numpy's array".
The article says that, for a 2-D array, axis=0 stands for the columns and axis=1 for the rows. That works when I use np.mean, which averages the values by column, but np.delete with axis=0 is different: it deletes elements by row.
import numpy as np
arr = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
'''
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
'''
np.mean(arr, 0)
'''
array([5., 6., 7., 8.])
'''
np.delete(arr,1,axis=0)
'''
array([[ 1, 2, 3, 4],
[ 9, 10, 11, 12]])
'''
Am I misunderstanding something here?
Why do np.mean and np.delete seem to operate on different axes when axis=0 is passed?
The accepted answer to the question you linked to actually says correctly that
Axis 0 is thus the first dimension (the "rows"), and axis 1 is the second dimension (the "columns")
which is what the code does and is the opposite of what you said.
This is likely the source of your confusion. As we can see from your own example:
np.delete(arr,1,axis=0)
'''
array([[ 1, 2, 3, 4],
[ 9, 10, 11, 12]])
'''
Row at index 1 is deleted, which is exactly what we want to happen.
This is a 2D example with rows and columns, but it is important to understand how shapes work in general, so that they also make sense in higher dimensions. Consider the following example:
[
[
[1, 2],
[3, 4]
],
[
[5, 6],
[7, 8],
],
[
[9, 10],
[11, 12],
]
]
Here, we have 3 grids, each itself is 2x2, so we have something of shape 3x2x2. This is why we have 12 elements in total. Now, how do we know that at axis=0 we have 3 elements? Because if you look at this as a simple array and not some fancy numpy object then len(arr) == 3. Then if you take any of the elements along that axis (any of the "grids" that is), we will see that their length is 2 or len(arr[0]) == 2. That is because each of the grids has 2 rows. Finally, to check how many items each row of each of these grids has, we just have to inspect any one of these rows. Let's look at the second row of the first grid for a change. We will see that: len(arr[0][1]) == 2.
Now, what does np.mean(a, axis=0) mean? It means we will go over each of the items along axis=0 and find their mean. If these items are simply numbers (if a=np.array([1,2,3])) that's easy because the average of 1,2,3 is just the sum of these numbers divided by their quantity.
So, what if we have vectors or grids? What is the average of [2,4,6] and [0,0,0]? The convention is that the average of these two lists is a list of the averages at each index. In other words it's:
[np.mean([2,0]), np.mean([4,0]), np.mean([6,0])]
which is trivially [1,2,3].
So, why does np.delete behave differently? Well, because the purpose of delete is to remove an element along some axis rather than to perform an aggregation over that axis. So in this particular case, we had 3 grids. So removing one of them will simply leave us with 2 grids. We could alternatively remove the second row of every grid (axis=1). That would leave us with 3 grids but each would have only 1 row instead of 2.
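A short sketch illustrating both behaviours on the 3x2x2 example above:
import numpy as np
arr = np.array([[[1, 2], [3, 4]],
                [[5, 6], [7, 8]],
                [[9, 10], [11, 12]]])   # shape (3, 2, 2)
# aggregation: element-wise average of the 3 grids, collapsing axis 0
print(np.mean(arr, axis=0))             # [[5. 6.] [7. 8.]], shape (2, 2)
# deletion: removes the whole grid at index 1 along axis 0
print(np.delete(arr, 1, axis=0).shape)  # (2, 2, 2)
# deletion along axis 1: removes the second row of every grid
print(np.delete(arr, 1, axis=1).shape)  # (3, 1, 2)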
Hopefully, this brings some clarity :)
Usually I like to think about the axis in numpy (or pandas) as an indicator of the axis "along which" computations are carried out.
In this sense, when you compute the mean along axis 0, that is, along the rows, you do it for each column. But if you delete along axis 0, it means you scroll along the rows to find the index you want to delete.
I think your confusion is possibly coming from the fact that in delete, the axis refers to the axis you are indexing along when finding the section to delete, while in mean, the axis refers to which axis you are averaging along.
In both cases, axis tells the function which axis to "move along" when performing its operation - for delete it moves along that axis when searching for what to delete, and for mean it moves along it when calculating the averages.
I have three numpy arrays. One is a plain numpy array, but the other two are object arrays whose elements are arrays of different lengths. I want to do a calculation in a for loop but do not know how to set it up. At the moment I can only do it for three plain numpy arrays, without any for loop.
My input data are:
aa= array([array([[1.37100000e+03, 4.00000000e+00, ..., 2.00000000e+00]]),
array([[6.25439286e+01, 2.68664193e+01, ..., 5.86345693e+01]]),
array([[5.25980126e+01, 1.99945789e+01, ..., 9.25987458e+01]])], dtype=object)
aa contains three arrays (in reality it contains many more), and each of them holds thousands of rows and 20 columns. The number of rows differs between the arrays, but the number of columns is always 20. I built it this way because my lists are jagged and cannot be turned into a single regular numpy array. The structure of my second data set is exactly the same as the first one (three arrays of lists), but it has its own values:
bb= array([array([[2000, 15478, ..., 956410]]),
array([[61478, 98572, ..., 7801561]]),
array([[98601, 20198, ..., 6021981]])], dtype=object)
The third data set:
cc= array([[3., 1.66666667, 1.5, ..., 1.66666667],
.....
[2., 98.33333333, 28.5, ..., 98.33333333]])
cc is a plain 2D numpy array with thousands of rows and 12 columns. It was built by merging three arrays of four columns each, so it can be viewed as three column blocks (cc[:,0:4], cc[:,4:8] and cc[:,8:12]).
In my first iteration, I want to compare the data of the first array of aa with the first column block of cc. I want to do some calculation on all rows of some columns of aa's first array (aa[:,17:20]) and cc[:,1:4]. The next iteration should again use all rows of the same columns (17:20), but from the second array of aa, together with cc[:,5:8]; the last one uses columns 17:20 of the third array of aa and cc[:,9:12]. For this step of my code I do not need the first column of each block of cc (columns 0, 4 and 8). In the next step, I use cc[:,0] in the first iteration to calculate something (u); cc[:,4] is for the second iteration and cc[:,8] for the last one. Then all the data stored in the corresponding array of bb should be used (the first array of bb in the first iteration, and so on). My code currently works for plain numpy aa, bb and cc. In the end, my loop should produce three new_elements arrays rather than one. So the first iteration should use all rows of some columns of aa's first array (aa[:,17:20]), cc[:,1:4], cc[:,0] and all the data of the first array of bb. This is the code I use for one single iteration with three plain numpy arrays (without any list components):
from scipy.spatial import distance
distances=distance.cdist(aa[:,17:20],cc[:,1:4]) # THIS LINE NEEDS ITERATION
min_d=np.argmin(distances, axis=1).reshape(-1,1)
z=np.array([])
for i in min_d:
    u=cc[i,0] # THIS LINE NEEDS ITERATION
    z=np.array([np.append(z,u)]).reshape(-1,1)
final=np.concatenate((aa,z), axis =1) # THIS LINE NEEDS ITERATION
new_vol=final[:,-1].reshape(-1,1)
new_elements=np.concatenate((bb,new_vol), axis =1) # THIS LINE NEEDS ITERATION
new_elements=np.delete(new_elements,[-1],1).astype(int)
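For reference, this is roughly the looped structure I am trying to reach (an untested sketch based on my description above; the 17:20 slice and the width-4 blocks of cc are as explained, and it assumes aa and bb hold the same number of inner arrays):
import numpy as np
from scipy.spatial import distance

results = []
for k in range(len(aa)):                   # one pass per array stored in aa / bb
    aa_k = aa[k]                           # k-th array inside aa
    bb_k = bb[k]                           # k-th array inside bb
    cc_k = cc[:, 4*k:4*(k+1)]              # k-th four-column block of cc
    distances = distance.cdist(aa_k[:, 17:20], cc_k[:, 1:4])
    min_d = np.argmin(distances, axis=1)
    z = cc_k[min_d, 0].reshape(-1, 1)      # first column of the block, picked per row
    final = np.concatenate((aa_k, z), axis=1)
    new_vol = final[:, -1].reshape(-1, 1)
    new_elements = np.concatenate((bb_k, new_vol), axis=1)
    new_elements = np.delete(new_elements, [-1], 1).astype(int)
    results.append(new_elements)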
Here is a simple data set that mimics my code, though my real data have the structure explained above.
aa= array([[72.70518085, 3.65801928, 51.02667157],
[60.20252523, 3.79943938, 40.83221737]])
bb= array([[2571, 4, 2],
[2572, 4, 2]])
cc= array([[ 3. , 25. , 7.5, 25. ],
[ 2. , 25. , 7.5, 75. ]])
The result for this simplified data should be:
new_elements= array([[2571, 4, 2],
[2572, 4, 2]])
Please note that for this simplified data set, aa[:,17:20] in my code should be replaced with aa. To repeat: in my real data, aa and bb have the other structure described above, but aa plays the same role as in this simplified case study.
Thanks in advance for devoting your time.
I am searching for an efficient solution to build a secondary in-memory index in Python using high-level, optimised mathematical packages such as numpy and arrow. I am excluding pandas for performance reasons.
Definition
"A secondary index contains an entry for each existing value of the attribute to be indexed. This entry can be seen as a key/value pair with the attribute value as key and as value a list of pointers to all records in the base table that have this value." - JV. D'Silva et al. (2017)
Let's take a simple example; we can scale this up later on to produce some benchmarks:
import numpy as np
pk = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='uint32')
val = np.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1], dtype='float32')
Interestingly, the pyarrow.Array.dictionary_encode method can transform the value array into a dictionary-encoded representation that is close to a secondary index (here val is first converted to a pyarrow array with pa.array):
import pyarrow as pa
pa.array(val).dictionary_encode()
Out[55]:
<pyarrow.lib.DictionaryArray object at 0x7ff430d8b4d0>
-- dictionary:
[
15.5,
3.75,
142.88,
nan,
7.2,
2.1
]
-- indices:
[
0,
1,
2,
2,
3,
3,
3,
4,
5
]
I have opened an issue here
So, the question is how fast you can build a secondary index in memory using Python data structures that hold values and indices efficiently. But this is only half the story, since the index will be useful only if it serves well both filtering queries (point, range) and transformations - reconstruction of a row, a column, or an association, a.k.a. hyperedge in TRIADB. And even this quick description does not cover how easy it will be to update this kind of index.
For many reasons, I have started investigating a possible PyArrow open-source solution. A sorted dictionary-encoded representation should generally meet the requirements of the problem with an excellent combination of smaller memory footprint and faster/flexible zero copy I/O processing.
Solution
I have searched both in the past and in the present for an open-source solution to this problem but I have not found one that satisfies my appetite. This time I decided to start building my own and discuss openly its implementation that also covers the null case, i.e. missing data scenario.
Do notice that secondary index is very close to adjacency list representation, a core element in my TRIADB project and that is the main reason behind searching for a solution.
Let's start with a one-liner using numpy (struct_type was not shown above; it is a structured dtype pairing each primary key with its value, matching the dtypes of pk and val):
struct_type = np.dtype([('pk', np.uint32), ('val', np.float32)])
idx = np.sort(np.array(list(zip(pk, val)), dtype=struct_type), order='val')
idx['val']
Out[68]:
array([ 2.1 , 3.75, 7.2 , 15.5 , 142.88, 142.88, nan, nan,
nan], dtype=float32)
idx['pk']
Out[69]: array([8, 1, 7, 0, 2, 3, 4, 5, 6], dtype=uint32)
Faster solution (less generic)
This is the special but perfectly valid case where pk has values in range(n).
idx_pk = np.argsort(val)
idx_pk
Out[91]: array([8, 1, 7, 0, 2, 3, 4, 5, 6])
idx_val = val[idx_pk]
idx_val
Out[93]: array([ 2.1 , 3.75, 7.2 , 15.5 , 142.88, 142.88, nan, nan, nan], dtype=float32)
There are a few more steps to get a secondary index representation according to the definition of JV. D'Silva et al.
Get rid of nan
Calculate unique values of secondary index
For each unique value, calculate the list of primary key indices of all rows of the table that contain that value
Unique Secondary Index with adjacency lists
def secondary_index_with_adjacency_list(arr):
    idx_pk = np.argsort(arr)
    idx_val = arr[idx_pk]
    cnt = np.count_nonzero(~np.isnan(idx_val))
    usec_ndx, split_ndx, cnt_arr = np.unique(idx_val[:cnt], return_index=True, return_counts=True)
    adj_list = np.split(idx_pk[:cnt], split_ndx)[1:]
    return usec_ndx, cnt_arr, adj_list
ndx, freq, adj = secondary_index_with_adjacency_list(val)
import pandas as pd  # used here only to display the result
pd.DataFrame({'val': ndx, 'freq': freq, 'adj': adj})
Out[11]:
val freq adj
0 2.10 1 [8]
1 3.75 1 [1]
2 7.20 1 [7]
3 15.50 1 [0]
4 142.88 2 [2, 3]
Discussion
In practice it is faster to use the secondary-index representation with repeated values than the one with lists of pointers to records of a table, but the second one has the interesting property of being closer to the hypergraph representation I am using in TRIADB.
The kind of secondary index described in this solution is more suitable for analysing and filtering big data sets that don't fit in memory but are stored on disk in a column-store format. In that case, for a specific set of columns, it is possible to reconstruct a subset of records in an in-memory (column-store) format and even present it as a hypergraph (stay tuned for the next release of TRIADB).
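As a rough sketch of how this index can serve point and range queries (assuming the ndx and adj structures returned by the function above), np.searchsorted on the sorted unique values does the lookup:
import numpy as np

def point_query(ndx, adj, value):
    # primary keys of all rows whose value equals `value` (empty array if absent)
    i = np.searchsorted(ndx, value)
    if i < len(ndx) and ndx[i] == value:
        return adj[i]
    return np.array([], dtype='uint32')

def range_query(ndx, adj, lo, hi):
    # primary keys of all rows whose value lies in the closed interval [lo, hi]
    left = np.searchsorted(ndx, lo, side='left')
    right = np.searchsorted(ndx, hi, side='right')
    if right > left:
        return np.concatenate(adj[left:right])
    return np.array([], dtype='uint32')

point_query(ndx, adj, 15.5)          # -> array([0])
range_query(ndx, adj, 3.0, 20.0)     # -> pks of 3.75, 7.2 and 15.5: array([1, 7, 0])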
I have an array that contains 2D arrays.
For each 2D array I want to sum across the columns (i.e. produce the row sums), with the result in column form.
I have a piece of code that does this, but I feel I am not utilising numpy optimally. What is the fastest way to do this?
My current code:
temp = [np.sum(l_i,axis=1).reshape(-1,1) for l_i in self.layer_inputs]
Sample Array:
array([
array([[ 0.48517904, -11.10809746],
[ 13.64104864, 5.77576326]]),
array([[16.74109924, -3.28535518],
[-4.00977275, -3.39593759],
[ 5.9048581 , -1.65258805],
[13.40762143, -1.61158724],
[ 9.8634849 , 8.02993728]]),
array([[-7.61920427, -3.2314264 ],
[-3.79142779, -2.44719713],
[32.42085005, 4.79376209],
[13.97676962, -1.19746096],
[45.60100807, -3.01680368]])
], dtype=object)
Sample Expected Result:
[array([[-10.62291842],
[ 19.41681191]]),
array([[13.45574406],
[-7.40571034],
[ 4.25227005],
[11.7960342 ],
[17.89342218]]),
array([[-10.85063067],
[ -6.23862492],
[ 37.21461214],
[ 12.77930867],
[ 42.58420439]]) ]
New answer
Given your stringent requirement for a list of arrays, there is no more computationally efficient solution.
Original answer
To leverage NumPy, don't work with a list of arrays: dtype=object is the hint you won't be able to use vectorised operations.
Instead, combine into one array, e.g. via np.vstack, and store split indices. If you need a list of arrays, use np.split as a final step. But this constant flipping between lists and a single array is expensive. Really, you should attempt to just store the splits and a single array, i.e. idx and data below.
# `A` is the object array of 2D arrays from the question
idx = np.array(list(map(len, A))).cumsum()[:-1]  # [2, 7]
data = np.vstack(A).sum(1)                       # row sums of the single stacked array
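If you really do need the list of column vectors at the end, a final np.split gets you back there (a small sketch reusing A and idx from above; keepdims keeps the column shape asked for in the question):
data = np.vstack(A).sum(1, keepdims=True)   # shape (12, 1) for the sample array
result = np.split(data, idx)                # only if the per-array list is really needed
[r.shape for r in result]                   # [(2, 1), (5, 1), (5, 1)]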
I am using Python, numpy and scikit-learn. I have data of keys and values that are stored in an SQL table. I retrieve this as a list of tuples returned as: [(id, value),...]. Each id appears only once in the list and the tuples appear sorted in order of ascending id. This process is completed a few times so that I have multiple lists of key: value pairs. Such that:
dataset = []
for sample in samples:
listOfTuplePairs = getDataFromSQL(sample) # get a [(id, value),...] list
dataset.append(listOfTuplePairs)
Keys may be duplicated across different samples, and each row may be of a different length. An example dataset might be:
dataset = [[(1, 0.13), (2, 2.05)],
[(2, 0.23), (4, 7.35), (5, 5.60)],
[(2, 0.61), (3, 4.45)]]
It can be seen that each row is a sample, and that some ids (in this case 2) appear in multiple samples.
Problem: I wish to construct a single (possibly sparse) numpy array suitable for processing with scikit-learn. The values relating to a specific key (id) for each sample should be aligned in the same 'column' (if that is the correct terminology) such that the matrix of the above example would look as follows:
ids = 1 2 3 4 5
------------------------------
dataset = [(0.13, 2.05, null, null, null),
(null, 0.23, null, 7.35, 5.60),
(null, 0.61, 4.45, null, null)]
As you can see, I also wish to strip the ids from the matrix (though I will need to retain a list of them so that I know what the values in the matrix relate to). Each initial list of key: value pairs may contain several thousand rows, and there may be several thousand samples, so the resulting matrix may be very large. Please provide answers that consider speed (within the limits of Python), memory efficiency and code clarity.
Many, many thanks in advance for any help.
Here's a NumPy-based approach that creates a sparse coo_matrix, with memory efficiency in focus -
import numpy as np
from scipy.sparse import coo_matrix
# Construct row IDs
lens = np.array([len(item) for item in dataset])
shifts_arr = np.zeros(lens.sum(),dtype=int)
shifts_arr[lens[:-1].cumsum()] = 1
row = shifts_arr.cumsum()
# Extract values from dataset into a NumPy array
arr = np.concatenate(dataset)
# Get the unique column IDs to be used for col-indexing into output array
col = np.unique(arr[:,0],return_inverse=True)[1]
# Determine the output shape
out_shp = (row.max()+1,col.max()+1)
# Finally create a sparse matrix from the row, col indices and the second column of arr (the values)
sp_out = coo_matrix((arr[:,1],(row,col)), shape=out_shp)
Please note that if the IDs are supposed to be column numbers in the output array, you could replace the use of np.unique that gives us such unique IDs with something like this -
col = (arr[:,0]-1).astype(int)
This should give us a good performance boost!
Sample run -
In [264]: dataset = [[(1, 0.13), (2, 2.05)],
...: [(2, 0.23), (4, 7.35), (5, 5.60)],
...: [(2, 0.61), (3, 4.45)]]
In [265]: sp_out.todense() # Using .todense() to show output
Out[265]:
matrix([[ 0.13, 2.05, 0. , 0. , 0. ],
[ 0. , 0.23, 0. , 7.35, 5.6 ],
[ 0. , 0.61, 4.45, 0. , 0. ]])
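If you also want to keep the list of ids behind the columns (as mentioned in the question), np.unique already provides it; a small sketch along the lines of the code above:
arr = np.concatenate(dataset)                        # (7, 2) array of [id, value] rows
ids, col = np.unique(arr[:,0], return_inverse=True)
ids                                                  # array([1., 2., 3., 4., 5.]) -> ids[j] is the id behind column j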
You can convert each element of the dataset to a dictionary and then use a pandas DataFrame, which returns a result close to the desired output. If a 2D numpy array is desired, we can use the as_matrix() method (or to_numpy() in recent pandas versions) to convert the data frame to a numpy array:
import pandas as pd
pd.DataFrame(dict(x) for x in dataset).as_matrix()
# array([[ 0.13, 2.05, nan, nan, nan],
# [ nan, 0.23, nan, 7.35, 5.6 ],
# [ nan, 0.61, 4.45, nan, nan]])
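The DataFrame also keeps the list of ids for you in its columns; a small variant using the newer to_numpy() accessor (sort_index is there only to guarantee the columns come out in id order):
df = pd.DataFrame(dict(x) for x in dataset).sort_index(axis=1)
ids = df.columns.tolist()   # the id behind each column
mat = df.to_numpy()         # same matrix as above, with NaN for missing values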