I have a data frame, and I would like to work on a small partition at a time, for particular tuples of values of 'a', 'b', 'c'.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 10, 10000),
                   'b': np.random.randint(0, 10, 10000),
                   'c': np.random.randint(0, 10, 10000),
                   'value': np.random.randint(0, 100, 10000)})
So I chose to use a pandas MultiIndex:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
However, the performance is not great.
%timeit dfi.ix[(2,1,7)]  # 511 µs
%timeit df[(df['a'].values == 2) &
           (df['b'].values == 1) &
           (df['c'].values == 7)]  # 247 µs
I suspect there is some overhead somewhere. My program has ~1k tuples, so one run takes 511 µs × 1000 ≈ 0.5 s. How can I improve further?
Update:
Hmm, I forgot to mention that the number of tuples is less than the total Cartesian product of distinct values of 'a', 'b', 'c' in df. Wouldn't groupby do an excessive amount of work on index combinations that don't exist among my tuples?
It's not clear what 'work on' means, but I would do this. The function passed to apply can be almost any function:
In [33]: %timeit df.groupby(['a','b','c']).apply(lambda x: x.sum())
10 loops, best of 3: 83.6 ms per loop
Certain operations are cythonized, so they are very fast:
In [34]: %timeit df.groupby(['a','b','c']).sum()
100 loops, best of 3: 2.65 ms per loop
Doing a selection on a MultiIndex one index at a time is not efficient.
If you are operating on a very small subset of the total groups, then you might want to index directly into the MultiIndex; groupby wins if you are operating on a fraction (maybe 20%) of the groups or more. You might also want to investigate filter, which you can use to pre-filter the groups based on some criterion.
As noted above, the Cartesian product of the group indexers is irrelevant. Only the groups that actually occur will be iterated by groupby (think of a MultiIndex as a sparse representation of the total possible space).
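For illustration, here is a minimal sketch of both strategies, reusing the df/dfi defined in the question; the tuple list and the filter threshold are hypothetical:
# groupby only visits the (a, b, c) combinations that actually occur:
sums = df.groupby(['a', 'b', 'c'])['value'].sum()
# For a small, known set of tuples, index directly into the sorted MultiIndex:
tuples = [(2, 1, 7), (0, 3, 5)]  # hypothetical tuples of interest
subsets = {t: dfi.loc[t] for t in tuples if t in dfi.index}
# filter keeps only the groups satisfying some predicate:
big_groups = df.groupby(['a', 'b', 'c']).filter(lambda g: g['value'].sum() > 500)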
How about:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
value = dfi["value"].values
value[dfi.index.get_loc((2, 1, 7))]
The result is an ndarray without an index.
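If you need this for the ~1k tuples mentioned in the question, a rough sketch of batching the lookups (the tuples list here is hypothetical) could reuse the sorted index and get_loc in a loop:
# Sketch: batch lookups against the sorted MultiIndex; `tuples` is hypothetical.
tuples = [(2, 1, 7), (3, 4, 5)]
results = {}
for t in tuples:
    try:
        results[t] = value[dfi.index.get_loc(t)]  # 'value' entries for this tuple
    except KeyError:
        pass  # tuple not present in the index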
Suppose I have a pandas Series that I'd like to function as a multimap (multiple values for each index key):
# intval -> data1
a = pd.Series(data=-np.arange(100000),
              index=np.random.randint(0, 50000, 100000))
I'd like to select (as quickly as possible) all the values from a where a's index matches another index b. (Like an inner join, or a merge, but for Series.)
a may have duplicates in its index.
b has no duplicates and is not necessarily a subset of a's index. To give pandas the best possible chance, let's assume b can also be provided as a sorted index object:
b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000))).sort_values()
So, we would have something like:
a            b        result (target)
3    0       3        3    0
3    1       7        8    3
4    2       8        ...
8    3       ...
9    4
...
I'm also only interested in getting the values of the result (index [3,8,...] not needed).
If a did not have duplicates, we would simply do:
a.reindex(b) # Cannot reindex a duplicate axis
Because & maintains the duplicates of a, we can't do:
d = a[a.index & b]
d = a.loc[a.index & b]  # same
d = a.get(a.index & b)  # same
print d.shape
So I think we need to do something like:
common = (a.index & b).unique()
a.loc[common]
... which is cumbersome, but also surprisingly slow. It's not building the list of items to select that's slow:
%timeit (a.index & b).unique()
# 100 loops, best of 3: 3.39 ms per loop
%timeit (a.index & b).unique().sort_values()
# 100 loops, best of 3: 4.19 ms per loop
... so it looks like it's really retrieving the values that's slow:
common = ((a.index & b).unique()).sort_values()
%timeit a.loc[common]
#10 loops, best of 3: 43.3 ms per loop
%timeit a.get(common)
#10 loops, best of 3: 42.1 ms per loop
... That's around 20 operations per second. Not exactly zippy! Why so slow?
Surely there must be a fast way to look up a set of values from a pandas DataFrame? I don't want to get an indexed object out -- really all I'm asking for is a merge over sorted indexes, or (slower) hashed int lookups. Either way, this should be an extremely fast operation -- not a 20-per-second operation on my 3 GHz machine.
Also:
Profiling a.loc[common] gives:
ncalls  tottime   percall   cumtime   percall    filename:lineno(function)
# All the time is spent here.
40      1.01      0.02525   1.018     0.02546    ~:0(<method 'get_indexer_non_unique'>)  indexing.py:1443(_has_valid_type)
...
# Seems to be called a lot.
1500    0.000582  3.88e-07  0.000832  5.547e-07  ~:0(<isinstance>)
PS: I posted a similar question previously about why Series.map is so slow: Why is pandas.series.map so shockingly slow? The reason there was lazy under-the-hood indexing; that doesn't seem to be what's happening here.
Update:
For similarly sized a and common, where a is unique:
% timeit a.loc[common]
1000 loops, best of 3: 760 µs per loop
... as @jpp points out, the non-unique index is likely to blame.
Repeated indices are guaranteed to slow down your dataframe indexing operations. You can amend your inputs to prove this to yourself:
a = pd.Series(data=-np.arange(100000), index=np.random.randint(0, 50000, 100000))  # repeated index
%timeit a.loc[common]  # 34.1 ms

a = pd.Series(data=-np.arange(100000), index=np.arange(100000))  # unique index
%timeit a.loc[common]  # 6.86 ms
As mentioned in this related question:
When the index is unique, pandas uses a hashtable to map key to value, O(1). When the index is non-unique but sorted, pandas uses binary search, O(log N). When the index is randomly ordered, pandas needs to check all the keys in the index, O(N).
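Given that, one workaround (a sketch under the assumptions of this question: a has a non-unique integer index, b is an Index) is to avoid .loc on the duplicated index entirely and filter with Index.isin, which hashes the labels once:
# Select the rows of a whose (non-unique) index label appears in b,
# without going through .loc on a duplicated index.
mask = a.index.isin(b)          # boolean array, one membership test per label
result_values = a.values[mask]  # just the values, no index attached
Whether this beats .loc on a given pandas version is worth benchmarking, but it returns exactly the values the question asks for.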
I have a very large numpy array and I want to sort it and test if it is unique.
I'm aware of the function numpy.unique, but it sorts the array again to do so.
The reason I need the array sorted beforehand is that the indices returned by argsort will be used to reorder another array.
I'm looking for a way to do both (argsort and unique test) without the need to sort the array again.
Example code:
import numpy as np
import numpy.random
# generating random arrays with 2 ^ 27 columns (it can grow even bigger!)
slices = np.random.random_integers(2 ** 32, size = 2 ** 27)
values = np.random.random_integers(2 ** 32, size = 2 ** 27)
# get an array of keys to sort slices AND values
# this operation takes a long time
sorted_slices = slices.argsort()
# sort both arrays
# it would be nice to make this operation in place
slices = slices[sorted_slices]
values = values[sorted_slices]
# test 'uniqueness'
# here, the np.unique function sorts the array again
if slices.shape[0] == np.unique(slices).shape[0]:
    print('it is unique!')
else:
    print('not unique!')
Both the arrays slices and values have 1 row and the same (huge) number of columns.
Thanks in advance.
You can check whether there are two or more equal values next to each other (non-unique values in a sorted array) by comparing their difference to 0
numpy.any(numpy.diff(slices) == 0)
Be aware though that numpy will create two intermediate arrays: one with the difference values, one with boolean values.
Here's an approach that uses slicing: instead of computing differences, we can simply compare each element against the previous one, like so:
~((slices[1:] == slices[:-1]).any())
Runtime test -
In [54]: slices = np.sort(np.random.randint(0,100000000,(10000000)))
# @Nils Werner's solution
In [55]: %timeit ~np.any(np.diff(slices) == 0)
100 loops, best of 3: 18.5 ms per loop
# @Marco's suggestion in the comments
In [56]: %timeit np.diff(slices).all()
10 loops, best of 3: 20.6 ms per loop
# Proposed solution in this post
In [57]: %timeit ~((slices[1:] == slices[:-1]).any())
100 loops, best of 3: 6.12 ms per loop
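Putting the pieces together, a small sketch of the overall flow (sort once via argsort, reorder both arrays with the same permutation, then test uniqueness with the adjacent-comparison trick):
import numpy as np

def sort_and_check_unique(slices, values):
    # Sort slices once, reorder values by the same permutation,
    # then test uniqueness without a second sort.
    order = slices.argsort()
    slices_sorted = slices[order]
    values_sorted = values[order]
    is_unique = not (slices_sorted[1:] == slices_sorted[:-1]).any()
    return slices_sorted, values_sorted, is_unique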
Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:
# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)
Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it's not vectorized, iterating 200 rows is only 200 iterations, so how can it be slow?
Here is a SSCCE demonstrating the issue:
import pandas as pd
import random
# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)
# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')
Running that command takes more than 1 second on my machine, which is thousands of times longer than expected for fewer than 1000 operations.
It looks like replace has a bit of overhead, and explicitly telling the Series what to do via map yields the best performance:
series = series.map(lambda x: dictionary.get(x,x))
If you're sure that all keys are in your dictionary you can get a very slight performance boost by not creating a lambda, and directly supplying the dictionary.get function. Any keys that are not present will return NaN via this method, so beware:
series = series.map(dictionary.get)
You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:
series = series.map(dictionary)
Timings
Some timing comparisons using your example data:
%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop
%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop
%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop
%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
.replace can do incomplete (substring) matches, while .map requires complete values to be supplied in the dictionary (otherwise it returns NaN). A fast but still general solution (one that can handle substrings) is to first run .replace on a dict of only the distinct values actually present (obtained e.g. with .value_counts().index), and then push every row of the Series through the resulting dictionary with .map. This combination can handle, for instance, replacing special national characters (full substrings) in 1M-row columns in a quarter of a second, where .replace alone would take 15 seconds.
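A rough sketch of that combination, assuming a string Series with far fewer distinct values than rows; the substring_map below is a hypothetical replacement table:
import pandas as pd

# Hypothetical substring replacements (e.g. special national characters).
substring_map = {'é': 'e', 'ß': 'ss'}

def fast_substring_replace(series, substring_map):
    # Distinct values actually present -- usually far fewer than the rows.
    uniques = series.value_counts().index
    # Run the slow, substring-aware .replace only on those distinct values...
    small = pd.Series(uniques, index=uniques)
    replaced = small.replace(substring_map, regex=True)
    # ...then map every row through the resulting value -> value dictionary.
    return series.map(replaced.to_dict())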
Thanks to @root: I did the benchmarking again and found different results on pandas v1.1.4.
series.map(dictionary) was the fastest; it also returns NaN if a key is not present.
I am trying to add two pandas Series together. The first Series is very large and has a MultiIndex. The index of the second series is a small subset of the index of the first.
import time
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((1000, 5000)), dtype=int).stack()
df1 = pd.DataFrame(df1, columns=['total'])
df2 = pd.concat([df1.iloc[50:55], df1.iloc[2000:2005]])  # df2 is a tiny subset of df1
Using the regular Series.add function takes about 9 seconds the first time, and 2 seconds on subsequent tries (maybe because pandas optimizes how the df is stored in memory?).
starttime = time.time()
df1.total.add(df2.total,fill_value=0).sum()
print "Method 1 took %f seconds" % (time.time() - starttime)
Manually iterating over rows takes about 2/3 as long as Series.add the first time, and about 1/100 as long as Series.add on subsequent tries.
starttime = time.time()
result = df1.total.copy()
for row_index, row in df2.iterrows():
    result[row_index] += row
print "Method 2 took %f seconds" % (time.time() - starttime)
The speed difference is particularly noticeable when (as here) the Index is a MultiIndex.
Why does Series.add not work well here? Any suggestions for speeding this up? Is there a more efficient alternative to iterating over each element of the Series?
Also, how do I sort or structure the data frame to improve the performance of either method? The second time either of these methods is run is appreciably faster. How do I get this performance on the first time? Sorting using sort_index helps only marginally.
You don't need a for loop:
df1.total[df2.index] += df2.total
As HYRY's answer shows, the more efficient thing to do in this situation is to look only at df2's small index. You can do this with the slightly more robust add method (which can fill NaNs):
df1.total[df2.index] = (df1.total[df2.index]).add(df2.total, fill_value=0)
Although the syntax here is not very DRY...
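As a sketch, one way to make the pattern a little less repetitive is to wrap it in a small helper (the name add_subset is made up):
def add_subset(target, updates):
    # Add the small Series `updates` into the large Series `target`,
    # touching only the labels that appear in updates.index.
    idx = updates.index
    target[idx] = target[idx].add(updates, fill_value=0)

add_subset(df1.total, df2.total)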
To compare some timeit information, we can see that add isn't drastically slower, and both are an enormous improvement on your naive for loop:
In [11]: %%timeit
result = df1.total.copy()
for row_index, row in df2.iterrows():
    result[row_index] += row
100 loops, best of 3: 17.9 ms per loop
In [12]: %timeit df1.total[df2.index] = (df1.total[df2.index]).add(df2.total, fill_value=0)
1000 loops, best of 3: 325 µs per loop
In [13]: %timeit df1.total[df2.index] += df2.total
1000 loops, best of 3: 283 µs per loop
It's an interesting question (and I may fill this in later) at what relative sizes this stops being a win, but certainly in this extreme case there is a huge win...
The thing to take away from this:
If you are writing a for loop (in python) to speed something up, you're doing it wrong! :)
I think your second method might be faster in this specific case because you're iterating through the smaller dataset (a small amount of work) and then accessing only a handful of components of the larger dataset (an efficient operation, thanks to the pandas developers).
However, with the .add method, pandas has to look at the entirety of both indices.
If df1 and df2 are the same length, your first method takes 54 ms, but the second method takes >2 minutes (on my machine, obviously, YMMV).
In my code I am currently iterating and creating three lists:
data, row, col
There is a high repetition of (row, col) pairs, and in my final sparse matrix M I would like the value of M[row, col] to be the sum of all the corresponding elements in data. From reading the documentation, the coo_matrix format seems perfect and for small examples it works just fine.
The problem I am having is that when I scale up my problem size, it looks like the intermediate lists data, row, col are using up all of my (8gb) of memory and the swap space and my script gets automatically killed.
So my question is:
Is there an appropriate format or an efficient way of incrementally building my summed matrix so I don't have to store the full intermediate lists / numpy arrays?
My program loops over a grid, creating local_data, local_row, local_col lists at each point, the elements of which are then appended to data, row, col, so being able to update the sparse matrix with lists as per the sparse matrix constructors would be the ideal case.
There are two things that may be killing you: the duplicates, or the overhead of a list over an array. In either case, probably the right thing to do is to let your lists grow only so big before dumping them into a coo_matrix and adding that to your running total. I took a couple of timings:
import numpy as np
import scipy.sparse as sps

rows = list(np.random.randint(100, size=(10000,)))
cols = list(np.random.randint(100, size=(10000,)))
values = list(np.random.rand(10000))
%timeit sps.coo_matrix((values, (rows, cols)))
100 loops, best of 3: 4.03 ms per loop
%timeit (sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000]))) +
         sps.coo_matrix((values[5000:], (rows[5000:], cols[5000:]))))
100 loops, best of 3: 5.24 ms per loop
%timeit sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000])))
100 loops, best of 3: 2.16 ms per loop
So there is about a 25% overhead in splitting the lists in two, converting each to a coo_matrix and then adding them together. And it doesn't seem to be as bad if you do more splits:
%timeit (sps.coo_matrix((values[:2500], (rows[:2500], cols[:2500]))) +
         sps.coo_matrix((values[2500:5000], (rows[2500:5000], cols[2500:5000]))) +
         sps.coo_matrix((values[5000:7500], (rows[5000:7500], cols[5000:7500]))) +
         sps.coo_matrix((values[7500:], (rows[7500:], cols[7500:]))))
100 loops, best of 3: 5.76 ms per loop
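A hedged sketch of that chunking strategy (the chunk size and the helper name are arbitrary): accumulate a bounded number of triplets, fold them into the running total, and clear the lists so memory stays flat. Duplicate (row, col) pairs are summed automatically when the coo_matrix is converted and added.
import scipy.sparse as sps

def build_summed_matrix(triplet_chunks, shape, chunk_size=1000000):
    # triplet_chunks yields (local_data, local_row, local_col) lists,
    # e.g. one triple of lists per grid point in the question's loop.
    total = sps.csr_matrix(shape)  # running total, stays compressed
    data, row, col = [], [], []
    for local_data, local_row, local_col in triplet_chunks:
        data.extend(local_data)
        row.extend(local_row)
        col.extend(local_col)
        if len(data) >= chunk_size:
            total = total + sps.coo_matrix((data, (row, col)), shape=shape)
            data, row, col = [], [], []  # free the intermediate lists
    if data:
        total = total + sps.coo_matrix((data, (row, col)), shape=shape)
    return total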