Looking up large sets of keys: dictionary vs. NumPy array - python

I have a very large (200k+) set of key/value pairs, from which I need to retrieve a very large number (sometimes all) of the values. The obvious way to do this is with a dictionary, such that
values = {lookup.get(key) for key in key_set}
This is getting very time consuming in my code, and I'm wondering if there's a faster way to implement this with a NumPy array. I've been experimenting with using an array with two columns and n rows, such that for any individual key:
value = lookup_array[lookup_array[:,0] == key, 1]
But I'm not sure how to scale this up to many keys without costly iteration. I've looked at:
values = lookup_array[np.in1d(lookup_array[:,0], key_set), 1]
but this also seems time consuming.
Is there any other way to do a massive lookup of nonconsecutive values quickly without iterating?

If certain special conditions apply, you can use NumPy indexing as a very fast alternative to dictionary lookups:
The keys must be integers.
You must have enough memory to create a NumPy array whose size is as big as the maximum key value you wish to look up (so that every key corresponds to a valid index into the array).
The idea is to use
lookup_array = np.empty((M,), dtype=values.dtype)
lookup_array[keys] = values
result = lookup_array[key_set]
instead of
result = {lookup_dict.get(key) for key in key_set}
For example,
import numpy as np
import pandas as pd
def using_dict(lookup_dict, key_set):
    return {lookup_dict.get(key) for key in key_set}

def using_array(lookup_array, key_set):
    return lookup_array[key_set]

def using_pandas(df, key_set):
    return df.loc[df['a'].isin(key_set)]
M = 10**6
N = 2*10**5
K = 10**4
keys = np.random.randint(M, size=(N,))
values = np.random.random((N,))
lookup_dict = dict(zip(keys, values))
lookup_array = np.empty((M,), dtype=values.dtype)
lookup_array[keys] = values
df = pd.DataFrame(np.column_stack([keys, values]), columns=list('ab'))
key_set = np.random.choice(keys, size=(K,))
And here is a timeit benchmark (using IPython) for the methods above:
In [25]: %timeit using_array(lookup_array, key_set)
10000 loops, best of 3: 22.4 µs per loop
In [26]: %timeit using_dict(lookup_dict, key_set)
100 loops, best of 3: 3.73 ms per loop
In [24]: %timeit using_pandas(df, key_set)
10 loops, best of 3: 38.9 ms per loop
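One caveat not covered by the benchmark: lookup_dict.get(key) returns None for absent keys, whereas the np.empty-based array returns whatever uninitialized data happens to sit in that slot. Here is a minimal sketch of one way to make missing keys detectable, reusing M, keys, values and key_set from the example above; the NaN sentinel is my own choice, not part of the original answer, and it assumes float values as in the example:
import numpy as np

# Pre-fill the lookup array with NaN so missing keys are visible afterwards.
lookup_array = np.full((M,), np.nan, dtype=values.dtype)
lookup_array[keys] = values

result = lookup_array[key_set]
missing = np.isnan(result)   # True wherever a key had no stored value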

Here's an approach with np.searchsorted -
row_idx = np.searchsorted(lookup_array[:,0], key_set)
values = lookup_array[row_idx,1]
This assumes that lookup_array has the keys sorted in its first column. If that's not the case, you can use the optional sorter argument with np.searchsorted.
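For the unsorted case, here is a minimal sketch of how the sorter argument could be used (the variable names are my own, and it assumes every key in key_set is actually present in lookup_array[:, 0]):
import numpy as np

# Keys in column 0 are unsorted: pass a sorter to searchsorted, then map the
# positions in sorted order back to the original row numbers.
sort_order = np.argsort(lookup_array[:, 0])
pos = np.searchsorted(lookup_array[:, 0], key_set, sorter=sort_order)
row_idx = sort_order[pos]
values = lookup_array[row_idx, 1]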

Loading a dictionary this large into memory is not ideal, and then there is the added overhead of the lookups themselves. If this is a data structure you use frequently, how about using a database engine? There are key/value databases if you don't like SQL; they are highly optimized for lookups.
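For example, here is a minimal sketch using the standard-library sqlite3 module as a simple on-disk key/value store (the file name, table layout, and batching are my own illustration, assuming integer keys and float values as in the earlier example):
import sqlite3

# A simple key/value table; the primary-key index keeps individual lookups fast.
conn = sqlite3.connect("lookup.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (key INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)",
                 ((int(k), float(v)) for k, v in zip(keys, values)))
conn.commit()

# Fetch a batch of keys in one query; split very large batches into chunks to
# stay under SQLite's limit on the number of bound parameters.
wanted = [int(k) for k in key_set]
placeholders = ",".join("?" * len(wanted))
found = dict(conn.execute(
    "SELECT key, value FROM kv WHERE key IN (%s)" % placeholders, wanted))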


Fast way to find indexes of nonzero entries for every row in a CSC matrix in Python

Here's the current implementation:
import numpy as np
from scipy.sparse import csr_matrix

def nonzero_indexes_by_row(input):
    return [
        np.nonzero(row)[1]
        for row in csr_matrix(input.T)
    ]
The matrix is very large (1.5M, 500K). Since I'm accessing rows, I have to convert CSC to CSR first. The result is a 2d list: each entry is a list of the indexes that are nonzero in the corresponding row of the original matrix.
The current process takes 20 minutes. Is there a faster way?
Sure. You're pretty close to having an ideal solution, but you're allocating some unnecessary arrays. Here's a faster way:
from scipy import sparse
import numpy as np
def my_impl(csc):
    csr = csc.tocsr()
    return np.split(csr.indices, csr.indptr[1:-1])

def your_impl(input):
    return [
        np.nonzero(row)[1]
        for row in sparse.csr_matrix(input)
    ]
## Results
# demo data
csc = sparse.random(15000, 5000, format="csc")
your_result = your_impl(csc)
my_result = my_impl(csc)
## Tests for correctness
# Same result
assert all(np.array_equal(x, y) for x, y in zip(your_result, my_result))
# Right number of rows
assert len(my_result) == csc.shape[0]
## Speed
%timeit my_impl(csc)
# 31 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit your_impl(csc)
# 1.49 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Side question, why are you transposing the matrix? Wouldn't you then be getting the non-zero entries of the columns? If that's what you want, you don't even need to convert to csr and can just run:
np.split(csc.indices, csc.indptr[1:-1])
An interesting alternative to your code is to convert your array to
the COOrdinate format and then read its row and col attributes:
def nonzero_indices_by_coo(input):
    cx = input.T.tocoo()
    res = [[] for i in range(cx.shape[0])]
    for i, j in zip(cx.row, cx.col):
        res[i].append(j)
    return res
It returns a list of plain Python lists instead of NumPy arrays, but this should not make any important difference.
I noticed that your code internally transposes the source array (the T operator), so I did the same in my code.
To compare execution speed, I created the following sparse array (2000 by 300):
import numpy as np
import scipy.sparse

r = 2000; c = 300
x = scipy.sparse.lil_matrix((r, c))
for _ in range(r):
    x[np.random.randint(0, r-1), np.random.randint(0, c-1)] = np.random.randint(1, 100)
and my code ran about 12 times faster than yours.
An even quicker solution (in another format)
Or maybe it would be better to generate a 2-D NumPy array with 2 rows:
the first row holding the row indices of consecutive non-zero elements,
the second row holding the column indices.
To generate such a result, you can use the following code:
def nonzero_indices_2d(input):
    cx = input.T.tocoo()
    return np.array([cx.row, cx.col])
which runs 4 times faster than my first solution.
Of course, other parts of your code would then need to be reworked to consume the indices in this different format; one possible regrouping helper is sketched below.
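For instance, a small helper for regrouping the 2-row output into per-row arrays of column indices (this function is my own sketch, not part of the answer above):
import numpy as np

def group_cols_by_row(idx_2d, n_rows):
    # idx_2d is the [row_indices, col_indices] array from nonzero_indices_2d.
    rows, cols = idx_2d
    order = np.argsort(rows, kind='stable')                    # group entries by row
    boundaries = np.searchsorted(rows[order], np.arange(1, n_rows))
    return np.split(cols[order], boundaries)                   # one array per row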
Sparse arrays also have their own nonzero method:
arr.nonzero()
generating a pair of index arrays (row indices and column indices). This method runs a few percent faster still.
So, assuming that the 2-D result format is acceptable (instead of a list of lists), you may not need any function of your own to get these indices.
Another detail to consider is whether transposition should be used at all (in any of the versions). That's your choice, but without transposition each version of the code would run a bit faster.
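For completeness, a minimal sketch of that built-in route, without the transposition (the demo matrix is my own placeholder):
import numpy as np
from scipy import sparse

csc = sparse.random(2000, 300, format='csc')
row_idx, col_idx = csc.nonzero()            # pair of index arrays
idx_2d = np.vstack([row_idx, col_idx])      # same 2-row layout as nonzero_indices_2d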

Generating a dictionary of namedtuples from pandas dataframe

I'm currently working with Python 3.6 and pandas to optimise a process. The final step of the process is to generate a dictionary of lists of namedtuples. I am trying to use groupby with apply to efficiently convert the pandas groupby result into those lists of namedtuples.
from collections import namedtuple
import pandas as pd
dfdata= [('cat01','t1', 50), ('cat01','t2',60) ,('cat01','t3',70),('cat02','t1', 60), ('cat02','t2', 80)]
df = pd.DataFrame.from_records(data=dfdata, columns=['cat','term','value'])
df.groupby('cat')['term','value'].apply(lambda x: x.term.tolist()).to_dict()
{'cat01': ['t1', 't2', 't3'], 'cat02': ['t1', 't2']}
I am failing to generate the data as a list of namedtuples as below. Any pointers on how I can tackle it?
mydata= namedtuple('entry', 'term val', verbose=False)
<some code>
{'cat01': [entry(term='t1', val=50), entry(term='t2', val=60), entry(term='t3', val=70)], 'cat02': [entry(term='t1', val=60), entry(term='t2', val=80)]}
Option 1
Using a list comprehension. First, define a function.
def f(x):
    return [mydata(i, j) for i, j in zip(x.term, x.value)]
Now, call f inside groupby.apply, and convert the result to a dictionary using to_dict.
df.groupby('cat').apply(f).to_dict()
{
    'cat01': [
        entry(term='t1', val=50),
        entry(term='t2', val=60),
        entry(term='t3', val=70)
    ],
    'cat02': [
        entry(term='t1', val=60),
        entry(term='t2', val=80)
    ]
}
Performance wise, this is pretty decent, even for large data.
df = pd.concat([df] * 100000, ignore_index=True)
%timeit df.groupby('cat').apply(f).to_dict()
1 loop, best of 3: 575 ms per loop
You can, however, improve performance by zipping the underlying lists rather than the Series (like this: zip(x['term'].tolist(), x['value'].tolist())), at which point it comes down to 532 ms per loop.
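For reference, a sketch of that faster variant (the name f_fast is just illustrative):
def f_fast(x):
    # Zip the underlying Python lists rather than the pandas Series objects.
    return [mydata(i, j) for i, j in zip(x['term'].tolist(), x['value'].tolist())]

result = df.groupby('cat').apply(f_fast).to_dict()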
Option 2
Here's a creative use of np.vectorize. First, create a "vectorised" version of mydata -
v = np.vectorize(mydata, otypes=[mydata])
df.groupby('cat').apply(lambda x: v(x.term, x.value).tolist()).to_dict()
Which returns the same result. Using the same setup as above,
%timeit df.groupby('cat').apply(lambda x: v(x.term, x.value).tolist()).to_dict()
1 loop, best of 3: 498 ms per loop
vectorize appears to be marginally faster than the loopy solution outlined above. Bear in mind, however, that timings vastly differ based on the structure and size of data, so I'd encourage you to run your own tests before drawing any conclusions.

Efficiently determining if large sorted numpy array has only unique values

I have a very large numpy array and I want to sort it and test if it is unique.
I'm aware of the function numpy.unique but it sorts the array another time to achieve it.
The reason I need the array sorted a priori is that the indices returned by argsort will be used to reorder another array.
I'm looking for a way to do both (argsort and unique test) without the need to sort the array again.
Example code:
import numpy as np
import numpy.random
# generating random arrays with 2 ^ 27 columns (it can grow even bigger!)
slices = np.random.random_integers(2 ** 32, size = 2 ** 27)
values = np.random.random_integers(2 ** 32, size = 2 ** 27)
# get an array of keys to sort slices AND values
# this operation takes a long time
sorted_slices = slices.argsort()
# sort both arrays
# it would be nice to make this operation in place
slices = slices[sorted_slices]
values = values[sorted_slices]
# test 'uniqueness'
# here, the np.unique function sorts the array again
if slices.shape[0] == np.unique(slices).shape[0]:
    print('it is unique!')
else:
    print('not unique!')
Both the arrays slices and values have 1 row and the same (huge) number of columns.
You can check whether there are two or more equal values next to each other (non-unique values in a sorted array) by comparing their difference to 0
numpy.any(numpy.diff(slices) == 0)
Be aware though that numpy will create two intermediate arrays: one with the difference values, one with boolean values.
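Wrapped up as a helper that directly answers the original "is it unique?" question (the function name is just for illustration):
import numpy as np

def is_sorted_unique(a):
    # True if the already-sorted 1-D array `a` contains no repeated values.
    return not np.any(np.diff(a) == 0)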
Here's an approach making use of slicing: instead of computing actual differences, we can simply compare each element against the previous one, like so -
~((slices[1:] == slices[:-1]).any())
Runtime test -
In [54]: slices = np.sort(np.random.randint(0,100000000,(10000000)))
# @Nils Werner's soln
In [55]: %timeit ~np.any(np.diff(slices) == 0)
100 loops, best of 3: 18.5 ms per loop
# @Marco's suggestion in comments
In [56]: %timeit np.diff(slices).all()
10 loops, best of 3: 20.6 ms per loop
# Proposed soln in this post
In [57]: %timeit ~((slices[1:] == slices[:-1]).any())
100 loops, best of 3: 6.12 ms per loop

Pandas replace/dictionary slowness

Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:
# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)
Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it's not vectorized, iterating 200 rows is only 200 iterations, so how can it be slow?
Here is a SSCCE demonstrating the issue:
import pandas as pd
import random
# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)
# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')
Running that command takes more than 1 second on my machine, which is thousands of times longer than I'd expect for fewer than 1000 operations.
It looks like replace has a bit of overhead, and explicitly telling the Series what to do via map yields the best performance:
series = series.map(lambda x: dictionary.get(x,x))
If you're sure that all keys are in your dictionary you can get a very slight performance boost by not creating a lambda, and directly supplying the dictionary.get function. Any keys that are not present will return NaN via this method, so beware:
series = series.map(dictionary.get)
You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:
series = series.map(dictionary)
Timings
Some timing comparisons using your example data:
%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop
%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop
%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop
%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
.replace can do incomplete substring matches, while .map requires complete values to be supplied in the dictionary (or it returns NaN). The fast but still generic solution (one that can handle substrings) is to first use .replace on a dict of all values actually present in the Series (obtained e.g. with .value_counts().index) and then apply the resulting mapping to all rows of the Series with .map. This combo can handle, for instance, special national character replacements (full substrings) on 1M-row columns in a quarter of a second, where .replace alone would take 15 seconds.
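A minimal sketch of that combo, reusing series and dictionary from the SSCCE above (the exact way the mapping is built is my own illustration of the idea):
# Run the (slow, substring-capable) .replace only on the unique values, then
# broadcast the resulting small mapping over the full Series with (fast) .map.
uniques = pd.Series(series.value_counts().index)
mapping = dict(zip(uniques, uniques.replace(dictionary)))
series = series.map(mapping)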
Thanks to @root: I ran the benchmark again and found different results on pandas v1.1.4.
I found series.map(dictionary) to be the fastest; it also returns NaN if a key is not present.

Approach to speed up pandas multilevel index selection?

I have a data frame and would like to work on a small partition at a time, for particular tuples of values of 'a', 'b', and 'c'.
df = pd.DataFrame({'a': np.random.randint(0, 10, 10000),
                   'b': np.random.randint(0, 10, 10000),
                   'c': np.random.randint(0, 10, 10000),
                   'value': np.random.randint(0, 100, 10000)})
so I chose to use pandas multiindex:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
However, the performance is not great.
%timeit dfi.ix[(2,1,7)] # 511 us
%timeit df[(df['a'].values == 2) &
           (df['b'].values == 1) & (df['c'].values == 7)]  # 247 us
I suspect there are some overheads somewhere. My program has ~1k tuples, so it takes 511 * 1000 = 0.5s for one run. How can I improve further?
update:
Hmm, I forgot to mention that the number of tuples is less than the total Cartesian product of the distinct values of 'a', 'b', and 'c' in df. Wouldn't groupby do an excessive amount of work on index combinations that don't exist in my tuples?
It's not clear what 'work on' means, but I would do this (the applied function can be almost anything):
In [33]: %timeit df.groupby(['a','b','c']).apply(lambda x: x.sum())
10 loops, best of 3: 83.6 ms per loop
Certain operations are cythonized, so they are very fast:
In [34]: %timeit df.groupby(['a','b','c']).sum()
100 loops, best of 3: 2.65 ms per loop
Doing a selection on a MultiIndex index by index is not efficient.
If you are operating on a very small subset of the total groups, then you might want to directly index into the multi-index; groupby wins if you are operating on a fraction (maybe 20%) of the groups or more. You might also want to investigate filter which you can use to pre-filter the groups based on some criteria.
As noted above, the Cartesian product of the group indexers is irrelevant. Only the actual groups will be iterated by groupby (think of a MultiIndex as a sparse representation of the total possible space).
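For example, a small sketch of pre-filtering groups with filter before doing further work (the criterion here is just a placeholder of my own):
# Keep only the groups that meet some condition, then operate on that subset.
subset = df.groupby(['a', 'b', 'c']).filter(lambda g: g['value'].mean() > 50)
result = subset.groupby(['a', 'b', 'c'])['value'].sum()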
How about:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
value = dfi["value"].values
value[dfi.index.get_loc((2, 1, 7))]
The result is an ndarray without an index.
