I am looking for data structures and an algorithm for a Python/numpy/numba/C-extension implementation to improve the performance over my current approach to solve the following reduction problem:
Input
I have a very large structured (Numpy) array of records in the format:
iarr = numpy.array(
[([entityId, subentityId], subentityValue),
...,
...
], dtype=[('e', '<2u4'), ('r', '<f4')])
There are m entities (order of millions) and n subentities (<20).
There are no duplicate entity/subentity combinations.
It is not known beforehand what m or n is.
The number of subentities varies from entity to entity but is predominantly 8 or 6 per entity.
The array is unordered.
Expected Output
I need to find the maximum or minimum subentityValue per entityId.
I don't need to retain the information from which subentityId the value came from.
The result should be an array of records like this:
oarr = numpy.array(
[(entityId, subentityValue),
...,
...
], dtype=[('e', '<u4'), ('r', '<f4')])
The result array does not need to be ordered.
The array is created either for max-values or for min-values, hence entityIds in the array are unique.
Equally well, the output could be a dictionary with entityIds as keys and max or min subentityValues as values.
Current Implementation (slow!)
My initial approach using Python, Numpy and Numba was (described here for finding the maximum subentityValue per entityId):
Initialise a dictionary (numba.typed.Dict) with keys being unique entityIds and an initial value that is guaranteed to be smaller than any subentityValue found in the array (for example -99999.9).
odict = numba.typed.Dict.empty(key_type=nb.int64, value_type=nb.float64) # types for compatibility with Python's dict
smallest_r = nb.float64(-99999.9)
for entity_id in np.unique(iarr['e'].astype(np.int64)):
    odict[entity_id] = smallest_r
Loop through the records in the input array and compare the value of dictionary[entityId] with the record's entityValue and
a) if dictionary[entityId] is larger than entityValue don't do anything,
b) if dictionary[entityId] is smaller than entityValue overwrite it with the entityValue.
for i in numba.prange(iarr['e'].shape[0]):
    if odict[iarr['e'][i]] < iarr['r'][i]:
        odict[iarr['e'][i]] = iarr['r'][i]
Return the dictionary odict as a result.
This works fine but is by far the biggest bottleneck in the system.
To improve performance I attempted to parallelize this (@numba.jit(..., parallel=True)), only to find out that numba's typed.Dict is not thread-safe and gives incorrect results in that case.
I am perfectly happy to ditch my solution completely in favour of something better (faster).
Any suggestions?
To group source rows (by the first element of e) and then compute
both min and max for each group, it is more convenient to use
Pandas instead of Numpy.
Start from necessary imports:
import numpy as np
import pandas as pd
For the test purpose, I created the source array as:
iarr = np.array([
([10, 1], 10.5), ([10, 1], 9.5), ([10, 1], 10.0),
([10, 2], 9.1), ([10, 2], 9.2), ([10, 2], 9.4),
([10, 3], 7.5), ([10, 3], 9.7), ([10, 3], 8.0),
([20, 2], 7.3), ([20, 2], 7.1), ([20, 2], 8.0),
([20, 3], 7.3), ([20, 3], 9.7), ([20, 3], 8.0)],
dtype=[('e', '<u4', (2,)), ('r', '<f4')])
The first step is to create a pandasonic Series with:
values from r column,
the index (actually a MultiIndex) created from e column.
The code to do it is:
s = pd.Series(iarr['r'], index=pd.MultiIndex.from_arrays(iarr['e'].T))
Then, to get the result, with both min and max, as a DataFrame, run:
result = s.groupby(level=0).agg(['min', 'max'])
The index (the leftmost, unnamed column) holds entityId and "actual"
columns contain both min and max.
The result is:
min max
10 7.5 10.5
20 7.1 9.7
If you need, you can convert it to a Numpy array:
oarr = np.rec.fromarrays(
    result.reset_index().values.T,
    names='entityId, min, max', formats='u4, f4, f4')
My code should run substantially faster than a plain Python
solution.
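If pulling in Pandas is not an option, the same grouping can be sketched in plain NumPy by sorting on entityId and reducing each run of equal ids with np.maximum.reduceat. This is only a sketch, assuming the array fits in memory and that one sort is acceptable; the small iarr below is made-up test data, not the asker's, and swapping np.maximum for np.minimum should give the per-entity minimum instead.

```python
import numpy as np

# Made-up test data in the question's input format: ([entityId, subentityId], value)
iarr = np.array(
    [([10, 1], 10.5), ([20, 2], 7.3), ([10, 2], 9.1),
     ([20, 3], 9.7), ([10, 3], 7.5)],
    dtype=[('e', '<u4', (2,)), ('r', '<f4')])

entity = iarr['e'][:, 0]                    # first column holds the entityId
order = np.argsort(entity, kind='stable')   # bring equal entityIds together
sorted_entity = entity[order]
sorted_r = iarr['r'][order]

# Index where each run of equal entityIds starts in the sorted arrays
starts = np.flatnonzero(np.r_[True, sorted_entity[1:] != sorted_entity[:-1]])

# One output record per entity, holding the maximum value of its run
oarr = np.empty(starts.size, dtype=[('e', '<u4'), ('r', '<f4')])
oarr['e'] = sorted_entity[starts]
oarr['r'] = np.maximum.reduceat(sorted_r, starts)
```

The sort dominates the cost, so this runs in O(N log N) without any Python-level loop or dictionary.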
Related
I have a dataframe where one column consists of tuples, i.e
df['A'].values = array([(1,2), (5,6), (11,12)])
Now I want to split this into two different columns. A working solution is
df['A1'] = df['A'].apply(lambda x: x[0])
But this is extremely slow. On my Dataframe it takes multiple minutes. So I would like to vectorize this, to something like
df['A1'] = df['A'][:,0]
With pandas, or using numpy or anything. But all of them give me an error similar to
"*** KeyError: 'key of type tuple not found and not a MultiIndex'"
Is there any vectorized way? This feels like a super simple question and task but I cannot find any working and properly vectorized function.
n: int = 2
df = pd.DataFrame(df["A"].apply(lambda x: (x[:n], x[n:])).tolist(), index=df.index)
You can also have a look at pandarallel.
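For the simple case in the question, where every tuple has the same length, a common approach is to hand the list of tuples to the DataFrame constructor, which avoids the per-row lambda. A minimal sketch, assuming two-element tuples; the column names A1 and A2 follow the question, the rest is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': [(1, 2), (5, 6), (11, 12)]})

# Build both new columns in one go from the list of tuples;
# reusing df.index keeps the rows aligned with the original frame.
parts = pd.DataFrame(df['A'].tolist(), index=df.index, columns=['A1', 'A2'])
df = df.join(parts)
```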
I'll do it in numpy and skip over the pandas bits.
You can get a decent speedup using np.fromiter together with either itertools.chain.from_iterable to extract everything in one go or operator.itemgetter for individual columns.
import operator as op
import itertools as it
a = [*zip(range(10000),range(10000,20000))]
A = np.empty(10000,object)
A[...] = a
A
# array([(0, 10000), (1, 10001), (2, 10002), ..., (9997, 19997),
# (9998, 19998), (9999, 19999)], dtype=object)
(*np.fromiter(it.chain.from_iterable(A),int,len(A[0])*A.size).reshape(A.size,-1).T,)
# (array([ 0, 1, 2, ..., 9997, 9998, 9999]), array([10000, 10001,
# 10002, ..., 19997, 19998, 19999]))
np.fromiter(map(op.itemgetter(0),A),int,A.size)
# array([ 0, 1, 2, ..., 9997, 9998, 9999])
I am searching for an efficient solution to build a secondary in-memory index in Python using a high-level optimised mathematical package such as numpy and arrow. I am excluding pandas for performance reasons.
Definition
"A secondary index contains an entry for each existing value of the attribute to be indexed. This entry can be seen as a key/value pair with the attribute value as key and as value a list of pointers to all records in the base table that have this value." - JV. D'Silva et al. (2017)
Let's take a simple example, we can scale this later on to produce some benchmarks:
import numpy as np
pk = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='uint32')
val = np.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1], dtype='float32')
Interestingly pyarrow.Array.dictionary_encode method can transform the value array into a dictionary encoded representation that is close to a secondary index.
val.dictionary_encode()
Out[55]:
<pyarrow.lib.DictionaryArray object at 0x7ff430d8b4d0>
-- dictionary:
[
15.5,
3.75,
142.88,
nan,
7.2,
2.1
]
-- indices:
[
0,
1,
2,
2,
3,
3,
3,
4,
5
]
I have opened an issue here
So, the question is about how fast you can build a secondary index in memory using Python data structures to hold efficiently values and indices. But this is half the story as the index will be useful if it serves well both filtering queries (point, range) and transformations - reconstruction of row, column and association a.k.a hyperedge in TRIADB. And even this quick description here does not cover how easy it will be to update this kind of index.
For many reasons, I have started investigating a possible PyArrow open-source solution. A sorted dictionary-encoded representation should generally meet the requirements of the problem with an excellent combination of smaller memory footprint and faster/flexible zero copy I/O processing.
Solution
I have searched both in the past and in the present for an open-source solution to this problem but I have not found one that satisfies my appetite. This time I decided to start building my own and discuss openly its implementation that also covers the null case, i.e. missing data scenario.
Do notice that secondary index is very close to adjacency list representation, a core element in my TRIADB project and that is the main reason behind searching for a solution.
Let's start with one line code using numpy
idx = np.sort(np.array(list(zip(pk, val)), dtype=struct_type), order='val')
idx['val']
Out[68]:
array([ 2.1 , 3.75, 7.2 , 15.5 , 142.88, 142.88, nan, nan,
nan], dtype=float32)
idx['pk']
Out[69]: array([8, 1, 7, 0, 2, 3, 4, 5, 6], dtype=uint32)
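The one-liner above relies on struct_type, which is not defined in the post. A self-contained sketch follows, assuming struct_type is a two-field dtype with fields named pk and val, and assuming zero-based primary keys (which is what the printed idx['pk'] suggests, rather than the 1..9 keys defined earlier):

```python
import numpy as np

# Assumption: zero-based primary keys, to match the output shown above
pk = np.arange(9, dtype='uint32')
val = np.array([15.5, 3.75, 142.88, 142.88, np.nan, np.nan, np.nan, 7.2, 2.1],
               dtype='float32')

# Assumed definition of struct_type (not given in the original post)
struct_type = np.dtype([('pk', 'uint32'), ('val', 'float32')])

# Fill the fields directly instead of zipping Python tuples
idx = np.empty(pk.size, dtype=struct_type)
idx['pk'] = pk
idx['val'] = val

# Sort records by value; NaN entries end up at the end
idx = np.sort(idx, order='val')
```

Assigning the fields directly avoids building a Python list of tuples, which matters once the arrays get large.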
Faster solution (less generic)
this is the special but perfectly valid case where pk has values in range(n)
idx_pk = np.argsort(val)
idx_pk
Out[91]: array([8, 1, 7, 0, 2, 3, 4, 5, 6])
idx_val = val[idx_pk]
idx_val
Out[93]: array([ 2.1 , 3.75, 7.2 , 15.5 , 142.88, 142.88, nan, nan, nan], dtype=float32)
There are a few more steps to get a secondary index representation according to the definition of JV. D'Silva et al.
Get rid of nan
Calculate unique values of secondary index
For each unique value calculate the list of primary key indices to all rows of the table that contain that value
Unique Secondary Index with adjacency lists
def secondary_index_with_adjacency_list(arr):
    idx_pk = np.argsort(arr)
    idx_val = arr[idx_pk]
    cnt = np.count_nonzero(~np.isnan(idx_val))
    usec_ndx, split_ndx, cnt_arr = np.unique(idx_val[:cnt], return_index=True, return_counts=True)
    adj_list = np.split(idx_pk[:cnt], split_ndx)[1:]
    return usec_ndx, cnt_arr, adj_list
ndx, freq, adj = secondary_index_with_adjacency_list(val)
pd.DataFrame({'val': ndx, 'freq': freq, 'adj': adj})
Out[11]:
val freq adj
0 2.10 1 [8]
1 3.75 1 [1]
2 7.20 1 [7]
3 15.50 1 [0]
4 142.88 2 [2, 3]
Discussion
In practice it is faster to use the representation of secondary index with repeated values than the one with lists of pointers to records of a table but the second one has the interesting property of being closer to a hypergraph representation that I am using in TRIADB.
The kind of secondary index described in this solution is more suitable for analysis, filtering of big data sets that don't fit in memory but stored on disk with a column-store format. In that case for a specific set of columns it is possible to reconstruct a subset of records in memory (column-store) format and even present it on a hypergraph (stay tuned for the next release of TRIADB)
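To illustrate how the sorted (repeated-values) representation can serve the point and range queries mentioned earlier, here is a small sketch using np.searchsorted. The helper range_query is a hypothetical name introduced only for this example, and row positions are assumed to be zero-based as in the faster solution:

```python
import numpy as np

val = np.array([15.5, 3.75, 142.88, 142.88, np.nan, np.nan, np.nan, 7.2, 2.1],
               dtype='float32')

idx_pk = np.argsort(val)       # row positions ordered by value, NaN last
idx_val = val[idx_pk]          # the values in sorted order
cnt = np.count_nonzero(~np.isnan(idx_val))   # number of non-NaN entries

def range_query(lo, hi):
    """Return row positions whose value v satisfies lo <= v <= hi."""
    left = np.searchsorted(idx_val[:cnt], lo, side='left')
    right = np.searchsorted(idx_val[:cnt], hi, side='right')
    return idx_pk[left:right]

rows_in_range = range_query(3.0, 16.0)    # range query
rows_equal = range_query(val[2], val[2])  # point query for the stored 142.88
```

Both queries cost two binary searches plus a slice, so they stay cheap even when the index is large.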
I am trying to create a numpy array with 2 columns and multiple rows. The first column is meant to represent input vector of size 3. The 2nd column is meant to represent output vector of size 2.
arr = np.array([
    [np.array([1,2,3]), np.array([1,0])],
    [np.array([4,5,6]), np.array([0,1])]
])
I was expecting: arr[:, 0].shape
to return (2, 3), but it returns (2, )
What is the proper way to arrange input and output vectors into a matrix using numpy?
If you are sure the elements in each column have the same size/length, you can select and then stack the result using numpy.row_stack:
np.row_stack(arr[:,0]).shape
# (2, 3)
np.row_stack(arr[:,1]).shape
# (2, 2)
So, the code
arr = np.array([
[np.array([1,2,3]), np.array([1,0])],
[np.array([4,5,6]), np.array([0,1])]
])
This creates an object array. Indexing the first column gives you back two rows with one object in each, which accounts for the shape. To get what you want you'd need to wrap it in something like
np.vstack(arr[:, 0])
Which creates an array out of the objects in the first column. This isn't very convenient, it would make more sense to me to store these in a dictionary, something like
io = {'in': np.array([[1,2,3],[4,5,6]]),
'out':np.array([[1,0], [0,1]])
}
A structured array gives you a bit of both. Creation is a bit tricky, for the example given,
arr = np.array([
    ((1,2,3), (1,0)),
    ((4,5,6), (0,1)) ],
    dtype=[('in', '3int64'), ('out', '2float64')])
Creates a structured array with fields in and out, consisting of 3 integers and 2 floats respectively. Rows can be accessed as usual,
In[73]: arr[0]
Out[74]: ([1, 2, 3], [ 1., 0.])
Or by the field name
In [73]: arr['in']
Out[73]:
array([[1, 2, 3],
[4, 5, 6]])
The numpy manual has many more details (https://docs.scipy.org/doc/numpy-1.13.0/user/basics.rec.html). I can't add any details as I've been intending to use them in a project for some time, but haven't.
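If the inputs and outputs already exist as two regular 2-D arrays, one way to sidestep the tricky nested-tuple syntax is to allocate the structured array first and assign each field. A sketch under that assumption; the names inputs and outputs are illustrative only:

```python
import numpy as np

inputs = np.array([[1, 2, 3], [4, 5, 6]])
outputs = np.array([[1, 0], [0, 1]], dtype=float)

# One record per row: a 3-element integer field and a 2-element float field
arr = np.empty(len(inputs), dtype=[('in', 'int64', (3,)), ('out', 'float64', (2,))])
arr['in'] = inputs
arr['out'] = outputs

in_shape = arr['in'].shape     # the whole input block as a regular 2-D array
out_shape = arr['out'].shape
```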
I have a dictionary of keys where each value should be a sparse vector of a huge size (~ 700000 elements, maybe more). How do I efficiently grow / build this data structure.
Right now my implementation works only for smaller sizes.
myvec = defaultdict(list)
for id in id_data:
    for item in item_data:
        if item in item_data[id]:
            myvec[id].append(item * 0.5)
        else:
            myvec[id].append(0)
The above code when used with huge files quickly eats up all the available memory. I tried removing the myvec[id].append(0) condition and store only non-zero values because the length of each myvec[id] list is constant. That worked on my huge test file with a decent memory consumption but I'd rather find a better way to do it.
I know that there are different types of sparse arrays/matrices for this purpose but I have no intuition which one is better. I tried to use lil_matrix from the scipy package instead of the myvec dict but it turned out to be much slower than the above code.
So the problem basically boils down to the following two questions:
Is it possible to create a sparse data structure on the fly in python?
How can one create such sparse data structure with decent speed?
Appending to a list (or lists) will always be faster than appending to a numpy.array or to a sparse matrix (which stores data in several numpy arrays). lil is supposed to be the fastest when you have to grow the matrix incrementally, but it will still be slower than working directly with lists.
Numpy arrays have a fixed size. So the np.append function actually creates a new array by concatenating the old with the new data.
Your example code would be more useful if you gave us some data, so we could cut, paste and run.
For simplicity let's define
data_dict=dict(one=[1,0,2,3,0,0,4,5,0,0,6])
Sparse matrices can be created directly from this with:
sparse.coo_matrix(data_dict['one'])
whose attributes are:
data: array([1, 2, 3, 4, 5, 6])
row: array([0, 0, 0, 0, 0, 0], dtype=int32)
col: array([ 0, 2, 3, 6, 7, 10], dtype=int32)
or
sparse.lil_matrix(data_dict['one'])
data: array([[1, 2, 3, 4, 5, 6]], dtype=object)
rows: array([[0, 2, 3, 6, 7, 10]], dtype=object)
The coo version times a lot faster.
The sparse matrix only saves the nonzero data, but it also has to save an index. There is also a dictionary format, which uses a tuple (row,col) as the key.
An example of incremental construction is:
llm = sparse.lil_matrix((1,11),dtype=int)
for i in range(11):
    llm[0,i]=data_dict['one'][i]
For this small case this incremental approach is faster.
I get even better speed by only adding the nonzero terms to the sparse matrix:
llm = sparse.lil_matrix((1,11),dtype=int)
for i in range(11):
    if data_dict['one'][i]!=0:
        llm[0,i]=data_dict['one'][i]
I can imagine adapting this to your default dict example. Instead of myvec[id].append(0), you keep a record of where you appended the item * 0.5 values (whether in a separate list, or via a lil_matrix). It would take some experimenting to adapt this idea to a default dictionary.
So basically the goal is to create 2 lists:
data = [1, 2, 3, 4, 5, 6]
cols = [ 0, 2, 3, 6, 7, 10]
Whether you create a sparse matrix from these or not depends on what else you need to do with the data.
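As a rough sketch of that idea extended to several keys: collect only the nonzero values together with their row and column positions in plain Python lists, then build the sparse matrix once at the end. The dictionary rows_dense and its contents are made up for illustration; in the real case the nonzero entries would be appended as they are discovered rather than read from dense lists.

```python
from scipy import sparse

# Made-up data: one dense row per key, only to have something to iterate over
rows_dense = {
    'one': [1, 0, 2, 3, 0, 0, 4, 5, 0, 0, 6],
    'two': [0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0],
}

data, row_idx, col_idx, keys = [], [], [], []
for r, (key, values) in enumerate(rows_dense.items()):
    keys.append(key)                 # remember which key owns matrix row r
    for c, v in enumerate(values):
        if v != 0:                   # store nonzero entries only
            data.append(v)
            row_idx.append(r)
            col_idx.append(c)

# Build once from the collected lists, then convert for fast row access
mat = sparse.coo_matrix((data, (row_idx, col_idx)), shape=(len(keys), 11)).tocsr()
```

List appends are cheap, and the single conversion at the end avoids growing a sparse matrix element by element.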
I want to slice a NumPy nxn array. I want to extract an arbitrary selection of m rows and columns of that array (i.e. without any pattern in the numbers of rows/columns), making it a new, mxm array. For this example let us say the array is 4x4 and I want to extract a 2x2 array from it.
Here is our array:
from numpy import *
x = range(16)
x = reshape(x,(4,4))
print(x)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
The line and columns to remove are the same. The easiest case is when I want to extract a 2x2 submatrix that is at the beginning or at the end, i.e. :
In [33]: x[0:2,0:2]
Out[33]:
array([[0, 1],
[4, 5]])
In [34]: x[2:,2:]
Out[34]:
array([[10, 11],
[14, 15]])
But what if I need to remove another mixture of rows/columns? What if I need to remove the first and third lines/rows, thus extracting the submatrix [[5,7],[13,15]]? There can be any composition of rows/lines. I read somewhere that I just need to index my array using arrays/lists of indices for both rows and columns, but that doesn't seem to work:
In [35]: x[[1,3],[1,3]]
Out[35]: array([ 5, 15])
I found one way, which is:
In [61]: x[[1,3]][:,[1,3]]
Out[61]:
array([[ 5, 7],
[13, 15]])
First issue with this is that it is hardly readable, although I can live with that. If someone has a better solution, I'd certainly like to hear it.
Other thing is I read on a forum that indexing arrays with arrays forces NumPy to make a copy of the desired array, thus when treating with large arrays this could become a problem. Why is that so / how does this mechanism work?
To answer this question, we have to look at how indexing a multidimensional array works in Numpy. Let's first say you have the array x from your question. The buffer assigned to x will contain 16 ascending integers from 0 to 15. If you access one element, say x[i,j], NumPy has to figure out the memory location of this element relative to the beginning of the buffer. This is done by calculating in effect i*x.shape[1]+j (and multiplying with the size of an int to get an actual memory offset).
If you extract a subarray by basic slicing like y = x[0:2,0:2], the resulting object will share the underlying buffer with x. But what happens if you access y[i,j]? NumPy can't use i*y.shape[1]+j to calculate the offset into the array, because the data belonging to y is not consecutive in memory.
NumPy solves this problem by introducing strides. When calculating the memory offset for accessing x[i,j], what is actually calculated is i*x.strides[0]+j*x.strides[1] (and this already includes the factor for the size of an int):
x.strides
(16, 4)
When y is extracted like above, NumPy does not create a new buffer, but it does create a new array object referencing the same buffer (otherwise y would just be equal to x.) The new array object will have a different shape than x and maybe a different starting offset into the buffer, but will share the strides with x (in this case at least):
y.shape
(2,2)
y.strides
(16, 4)
This way, computing the memory offset for y[i,j] will yield the correct result.
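The offset arithmetic described above can be checked by hand. A small sketch, assuming a 4-byte integer dtype (set explicitly here so the strides match the values quoted in this answer):

```python
import numpy as np

x = np.arange(16, dtype=np.int32).reshape(4, 4)
y = x[0:2, 0:2]                 # basic slicing: a view on x's buffer

i, j = 1, 1
# Byte offset of y[i, j] relative to the start of y's data
offset = i * y.strides[0] + j * y.strides[1]

flat = x.ravel()                # x is contiguous, so this is a flat view of the buffer
value = flat[offset // x.itemsize]   # y starts at the beginning of the buffer here

shares = np.shares_memory(x, y)
```

Here value comes out equal to y[1, 1], even though y has only two columns, because the row stride is still the 16 bytes inherited from x.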
But what should NumPy do for something like z=x[[1,3]]? The strides mechanism won't allow correct indexing if the original buffer is used for z. NumPy theoretically could add some more sophisticated mechanism than the strides, but this would make element access relatively expensive, somehow defying the whole idea of an array. In addition, a view wouldn't be a really lightweight object anymore.
This is covered in depth in the NumPy documentation on indexing.
Oh, and nearly forgot about your actual question: Here is how to make the indexing with multiple lists work as expected:
x[[[1],[3]],[1,3]]
This is because the index arrays are broadcast to a common shape.
Of course, for this particular example, you can also make do with basic slicing:
x[1::2, 1::2]
As Sven mentioned, x[[[0],[2]],[1,3]] will give back the 0 and 2 rows that match with the 1 and 3 columns while x[[0,2],[1,3]] will return the values x[0,1] and x[2,3] in an array.
There is a helpful function for doing the first example I gave, numpy.ix_. You can do the same thing as my first example with x[numpy.ix_([0,2],[1,3])]. This can save you from having to enter in all of those extra brackets.
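To make the difference concrete, here is a short sketch comparing the three index forms discussed so far on the 4x4 array from the question:

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

# Two flat index lists are paired element-wise: picks x[1,1] and x[3,3]
pairs = x[[1, 3], [1, 3]]

# A column-shaped row index broadcasts against the column index: 2x2 block
block = x[[[1], [3]], [1, 3]]

# np.ix_ builds those broadcastable index arrays for you
block_ix = x[np.ix_([1, 3], [1, 3])]
```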
I don't think that x[[1,3]][:,[1,3]] is hard to read. If you want to be clearer about your intent, you can do:
x[[1,3],:][:,[1,3]]
I am not an expert in slicing but typically, if you try to slice into an array and the values are continuous, you get back a view where the stride value is changed.
e.g. In your inputs 33 and 34, although you get a 2x2 array, the stride is 4. Thus, when you index the next row, the pointer moves to the correct position in memory.
Clearly, this mechanism doesn't carry over well to the case of an array of indices. Hence, numpy has to make a copy. After all, many other matrix math functions rely on size, stride and contiguous memory allocation.
If you want to skip every other row and every other column, then you can do it with basic slicing:
In [49]: x=np.arange(16).reshape((4,4))
In [50]: x[1:4:2,1:4:2]
Out[50]:
array([[ 5, 7],
[13, 15]])
This returns a view, not a copy of your array.
In [51]: y=x[1:4:2,1:4:2]
In [52]: y[0,0]=100
In [53]: x # <---- Notice x[1,1] has changed
Out[53]:
array([[ 0, 1, 2, 3],
[ 4, 100, 6, 7],
[ 8, 9, 10, 11],
[ 12, 13, 14, 15]])
while z=x[(1,3),:][:,(1,3)] uses advanced indexing and thus returns a copy:
In [58]: x=np.arange(16).reshape((4,4))
In [59]: z=x[(1,3),:][:,(1,3)]
In [60]: z
Out[60]:
array([[ 5, 7],
[13, 15]])
In [61]: z[0,0]=0
Note that x is unchanged:
In [62]: x
Out[62]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
If you wish to select arbitrary rows and columns, then you can't use basic slicing. You'll have to use advanced indexing, using something like x[rows,:][:,columns], where rows and columns are sequences. This of course is going to give you a copy, not a view, of your original array. This is as one should expect, since a numpy array uses contiguous memory (with constant strides), and there would be no way to generate a view with arbitrary rows and columns (since that would require non-constant strides).
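Whether a given expression produced a view or a copy can be checked directly with np.shares_memory. A minimal sketch on the same 4x4 array:

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

view = x[1:4:2, 1:4:2]                  # basic slicing with a constant step
copy = x[np.ix_([1, 3], [1, 3])]        # advanced indexing

view_shares = np.shares_memory(x, view)
copy_shares = np.shares_memory(x, copy)
same_values = np.array_equal(view, copy)
```

Both results hold the same values, but only the sliced one shares memory with x.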
With numpy, you can pass a slice for each component of the index - so, your x[0:2,0:2] example above works.
If you just want to evenly skip columns or rows, you can pass slices with three components
(i.e. start, stop, step).
Again, for your example above:
>>> x[1:4:2, 1:4:2]
array([[ 5, 7],
[13, 15]])
Which is basically: slice in the first dimension, with start at index 1, stop when index is equal or greater than 4, and add 2 to the index in each pass. The same for the second dimension. Again: this only works for constant steps.
The syntax you arrived at does something quite different internally: what x[[1,3]][:,[1,3]] actually does is create a new array including only rows 1 and 3 from the original array (done with the x[[1,3]] part), and then re-slice that - creating a third array - including only columns 1 and 3 of the previous array.
I have a similar question here: Writting in sub-ndarray of a ndarray in the most pythonian way. Python 2.
Following the solution from that post, the solution for your case looks like:
columns_to_keep = [1,3]
rows_to_keep = [1,3]
And using ix_:
x[np.ix_(rows_to_keep, columns_to_keep)]
Which is:
array([[ 5, 7],
[13, 15]])
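Since the linked question is about writing into a sub-array, it may be worth noting that the same np.ix_ index also works on the left-hand side of an assignment, in which case the original array is modified in place. A short sketch, with -1 as an arbitrary marker value:

```python
import numpy as np

x = np.arange(16).reshape(4, 4)
rows_to_keep = [1, 3]
columns_to_keep = [1, 3]

# Assignment through an advanced index writes into x itself
x[np.ix_(rows_to_keep, columns_to_keep)] = -1
```

This differs from x[rows_to_keep][:, columns_to_keep] = -1, where the first indexing step makes a copy and the assignment would land in that temporary instead of in x.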
I'm not sure how efficient this is, but you can use range() to slice along both axes
x=np.arange(16).reshape((4,4))
x[range(1,3), :][:,range(1,3)]