Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:
# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)
Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it's not vectorized, iterating 200 rows is only 200 iterations, so how can it be slow?
Here is a SSCCE demonstrating the issue:
import pandas as pd
import random
# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)
# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')
Running that command takes more than 1 second on my machine, which is thousands of times longer than I would expect for fewer than 1000 operations.
It looks like replace has a bit of overhead, and explicitly telling the Series what to do via map yields the best performance:
series = series.map(lambda x: dictionary.get(x,x))
If you're sure that all keys are in your dictionary you can get a very slight performance boost by not creating a lambda, and directly supplying the dictionary.get function. Any keys that are not present will return NaN via this method, so beware:
series = series.map(dictionary.get)
You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:
series = series.map(dictionary)
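For reference, here is a small sketch of how the three variants treat keys that are missing from the dictionary (the key 99999 is assumed to be absent):
import pandas as pd

s = pd.Series([1, 2, 99999])
d = {1: 'one', 2: 'two'}

print(s.map(lambda x: d.get(x, x)))   # 99999 falls through unchanged
print(s.map(d.get))                   # 99999 becomes None
print(s.map(d))                       # 99999 becomes NaN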
Timings
Some timing comparisons using your example data:
%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop
%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop
%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop
%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
.replace can do partial (substring) matches, while .map requires complete values to be supplied in the dictionary (or it returns NaN). A fast but still generic solution (one that can handle substrings) is to first run .replace on just the set of distinct values (obtained e.g. with .value_counts().index) and then .map the Series with the resulting per-value dictionary. This combination can, for instance, replace special national characters (as substrings) in a 1M-row column in about a quarter of a second, where .replace alone would take around 15 seconds.
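A minimal sketch of that combo, assuming the substitutions are substring replacements applied with regex=True (the column contents and the dict here are made up for illustration):
import pandas as pd

s = pd.Series(['café', 'naïve', 'café', 'résumé'] * 250_000)   # ~1M rows
subs = {'é': 'e', 'ï': 'i'}                       # substring replacements

uniques = pd.Series(s.value_counts().index)       # each distinct value once
fixed = uniques.replace(subs, regex=True)         # slow .replace, but only on the uniques
s = s.map(dict(zip(uniques, fixed)))              # fast per-row dictionary lookup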
Thanks to #root: I ran the benchmark again and found different results on pandas v1.1.4.
There, series.map(dictionary) was the fastest; note that it also returns NaN if a key is not present.
Related
I have written logic for changing data inside a pivoted table, and I can handle a single condition, but I also need an else branch. I tried .apply(), but it doesn't produce the dataset I am looking for.
df_s = sorted_data.groupby(["GH","HA","Tipo"]).first().reset_index()
df_s22 = df_s[df_s['Tipo'].eq('1')].assign(Tipo='2').rename(lambda x: x + .5)
I need an else condition above that assigns '1' otherwise.
The column "Tipo" is a string the way you handle it currently. Perhaps you could convert it into integers, usually easier to work with. Anyway, you have something with a column Tipo, that are strings, and they are either '1' or '2' (hard to tell if there are other values allowed, which affects the approaches you can take).
import numpy as np
import pandas as pd
df_s = pd.DataFrame({'Tipo': [str(i) for i in np.random.randint(1, 3, size=10)],
                     'other_data': [chr(i) for i in np.random.randint(65, 90, size=10)]})
Method 1
The most direct solution to your problem would be to define a function and apply it row wise, i.e. axis=1 (probably inefficient but does the job):
def fn(row):
    row.loc['Tipo'] = '2' if row.Tipo == '1' else '1'
    return row
df_s22 = df_s.apply(fn, axis=1)
timings: 2.57 ms ± 153 µs per loop
Method 2
Or apply directly to the column of interest
df_s22 = df_s.copy()
df_s22.loc[:,'Tipo'] = df_s22.loc[:,'Tipo'].apply(lambda x: '2' if x=='1' else '1')
timings: 862 µs ± 30.7 µs per loop
Method 3
You can also use the eval method:
df_s22 = df_s.copy()
df_s22.loc[:,'Tipo'] = df_s22.eval("Tipo=='1'").astype(int)+1
timings: 2.45 ms ± 97.3 µs per loop
Here I use the eval method to check whether the Tipo column is '1'. If it is, the comparison is True, which Python treats as 1 in arithmetic, so adding 1 to the check Tipo=='1' turns those rows into 1+1=2; rows that evaluate to False (i.e. 0) become 0+1=1. The output will have the Tipo column as integers, not strings any more.
Method 4
Using the assign method we can use similar checks
df_s22 = df_s.assign(Tipo=((df_s.Tipo == '1') + 1).astype(str))
timings: 783 µs ± 18.3 µs per loop
Tips and remarks:
In all cases you need to keep track of quotation marks: since Tipo holds strings, inside expressions wrapped in double quotes (") the inner quotes need to be single quotes (').
Also remember that you are creating a second DataFrame in memory (df_s22); if your dataset is large and you want to do complex operations on it, they might become slower if memory fills up. Consider just creating a new column, perhaps named Tipo22, in your original DataFrame df_s, as sketched below.
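A sketch of that new-column variant, reusing the element-wise apply from Method 2 (the name Tipo22 is just an example):
df_s['Tipo22'] = df_s['Tipo'].apply(lambda x: '2' if x == '1' else '1')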
Suppose I have a pandas Series that I'd like to use as a multimap (multiple values for each index key):
import numpy as np
import pandas as pd

# intval -> data1
a = pd.Series(data=-np.arange(100000),
              index=np.random.randint(0, 50000, 100000))
I'd like to select (as quickly as possible) all the values from a
where a's index matches another index b. (Like an inner join. Or a merge but for series).
a may have duplicates in its index.
b may not have duplicates and is not necessarily a subset of a's index. To give pandas the best possible chance, let's assume b can also be provided as a sorted index object:
b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000))).sort_values()
So, we would have something like:
target:
a        b      result
3  0     3      3  0
3  1     7      8  3
4  2     8      ...
8  3     ...
9  4
...
I'm also only interested in getting the values of the result (index [3,8,...] not needed).
If a did not have duplicates, we would simply do:
a.reindex(b) # Cannot reindex a duplicate axis
Because & maintains the duplicates of a, we can't do:
d = a[a.index & b]
d = a.loc[a.index & b]  # same
d = a.get(a.index & b)  # same
print(d.shape)
So I think we need to do something like:
common = (a.index & b).unique()
a.loc[common]
... which is cumbersome, but also surprisingly slow. It's not building the list of items to select that's slow:
%timeit (a.index & b).unique()
# 100 loops, best of 3: 3.39 ms per loop
%timeit (a.index & b).unique().sort_values()
# 100 loops, best of 3: 4.19 ms per loop
... so it looks like it's really retrieving the values that's slow:
common = ((a.index & b).unique()).sort_values()
%timeit a.loc[common]
#10 loops, best of 3: 43.3 ms per loop
%timeit a.get(common)
#10 loops, best of 3: 42.1 ms per loop
... That's around 20 operations per second. Not exactly zippy! Why so slow?
Surely there must be a fast way to look up a set of values from a pandas Series? I don't want to get an indexed object out -- really all I'm asking for is a merge over sorted indexes, or (slower) hashed int lookups. Either way, this should be an extremely fast operation -- not a 20-per-second operation on my 3 GHz machine.
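For reference, that duplicate-preserving selection can also be written as a vectorized membership test; a sketch, not taken from the post above:
d = a[a.index.isin(b)]   # keeps every row of a whose label appears in b, duplicates included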
Also:
Profiling a.loc[common] gives:
ncalls  tottime   percall    cumtime   percall    filename:lineno(function)
# All the time is spent here:
40      1.01      0.02525    1.018     0.02546    ~:0(<method 'get_indexer_non_unique'>)
...                                               indexing.py:1443(_has_valid_type)
# Seems to be called a lot:
1500    0.000582  3.88e-07   0.000832  5.547e-07  ~:0(<isinstance>)
PS. I posted a similar question previously about why Series.map is so slow:
Why is pandas.series.map so shockingly slow? The reason there was lazy under-the-hood indexing. That doesn't seem to be what's happening here.
Update:
For a similarly sized a and common where a is unique:
%timeit a.loc[common]
1000 loops, best of 3: 760 µs per loop
... as #jpp points out, the non-unique (duplicate) index is likely to blame.
Repeated indices are guaranteed to slow down your dataframe indexing operations. You can amend your inputs to prove this to yourself:
a = pd.Series(data=-np.arange(100000), index=np.random.randint(0, 50000, 100000))
%timeit a.loc[common] # 34.1 ms
a = pd.Series(data=-np.arange(100000), index=np.arange(100000))
%timeit a.loc[common] # 6.86 ms
As mentioned in this related question:
When index is unique, pandas use a hashtable to map key to value O(1).
When index is non-unique and sorted, pandas use binary search O(logN),
when index is random ordered pandas need to check all the keys in the
index O(N).
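A rough sketch of those three regimes (timings are machine-dependent and left to %timeit):
import numpy as np
import pandas as pd

a_dup    = pd.Series(-np.arange(100000), index=np.random.randint(0, 50000, 100000))
a_sorted = a_dup.sort_index()                        # non-unique but sorted
a_unique = pd.Series(-np.arange(100000), index=np.arange(100000))

keys = np.unique(np.random.choice(a_dup.index.values, 5000))   # labels present in all three

# In IPython:
# %timeit a_unique.loc[keys]   # unique index        -> hashtable lookup
# %timeit a_sorted.loc[keys]   # sorted duplicates   -> binary search
# %timeit a_dup.loc[keys]      # unsorted duplicates -> scan of the whole index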
When writing to a DataFrame in pandas, we have a couple of ways to do it, as provided by this answer and this answer.
We have the method of
df.set_value(r, c, some_value) and the method of
df.iloc[r][c] = some_value.
What is the difference? Which is faster? Is either a copy?
The difference is that set_value returns an object, while the assignment operator assigns the value into the existing DataFrame object.
After calling set_value you will potentially have two DataFrame objects (this does not necessarily mean you'll have two copies of the data, as DataFrame objects can "reference" one another), while the assignment operator changes data in the single DataFrame object.
It appears to be faster to use set_value, as it is probably optimized for that use case, while the assignment approach will generate intermediate slices of the data:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df=pd.DataFrame(np.random.rand(100,100))
In [4]: %timeit df[10][10]=7
The slowest run took 6.43 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 89.5 µs per loop
In [5]: %timeit df.set_value(10,10,11)
The slowest run took 10.89 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 3.94 µs per loop
the result of set_value may be a copy, but the documentation is not really clear (to me) on this:
Returns:
frame : DataFrame
If label pair is contained, will be reference to calling DataFrame, otherwise a new object
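On a pre-1.0 pandas (set_value has since been removed), the "reference to calling DataFrame" part of that note can be checked with an identity test; a sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 5))
ret = df.set_value(1, 1, 42)   # deprecated, removed in pandas >= 1.0
print(ret is df)               # True when the (row, col) label pair already exists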
I have a very large (200k+) set of key/value pairs, for which I need to retrieve a very large number (sometimes all) of the values.
values = {lookup.get(key) for key in key_set}
This is getting very time consuming in my code, and I'm wondering if there's a faster way to implement this with a NumPy array. I've been experimenting with using an array with two columns and n rows, such that for any individual key:
value = lookup_array[lookup_array[:,0] == key, 1]
But I'm not sure how to scale this up to many keys without costly iteration. I've looked at:
values = lookup_array[np.in1d(lookup_array[:,0], key_set), 1]
but this also seems time consuming.
Is there any other way to do a massive lookup of nonconsecutive values quickly without iterating?
If certain special conditions apply, you can use NumPy indexing as a very fast alternative to dictionary lookups.
The keys must be integers
You have enough memory to create a NumPy array whose size is as big as the maximum key value you wish to look up (so that all keys correspond to a valid index into the array).
The idea is to use
lookup_array = np.empty((M,), dtype=values.dtype)
lookup_array[keys] = values
result = lookup_array[key_set]
instead of
result = {lookup_dict.get(key) for key in key_set}
For example,
import numpy as np
import pandas as pd
def using_dict(lookup_dict, key_set):
    return {lookup_dict.get(key) for key in key_set}

def using_array(lookup_array, key_set):
    return lookup_array[key_set]

def using_pandas(df, key_set):
    return df.loc[df['a'].isin(key_set)]
M = 10**6
N = 2*10**5
K = 10**4
keys = np.random.randint(M, size=(N,))
values = np.random.random((N,))
lookup_dict = dict(zip(keys, values))
lookup_array = np.empty((M,), dtype=values.dtype)
lookup_array[keys] = values
df = pd.DataFrame(np.column_stack([keys, values]), columns=list('ab'))
key_set = np.random.choice(keys, size=(K,))
And here is a timeit benchmark (using IPython) for the methods above:
In [25]: %timeit using_array(lookup_array, key_set)
10000 loops, best of 3: 22.4 µs per loop
In [26]: %timeit using_dict(lookup_dict, key_set)
100 loops, best of 3: 3.73 ms per loop
In [24]: %timeit using_pandas(df, key_set)
10 loops, best of 3: 38.9 ms per loop
Here's an approach with np.searchsorted -
row_idx = np.searchsorted(lookup_array[:,0], key_set)
values = lookup_array[row_idx,1]
This assumes that lookup_array has the keys sorted in its first column. If that's not the case, you can use the optional sorter argument with np.searchsorted.
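A sketch of that unsorted case using the sorter argument, reusing the lookup_array/key_set names from the question above:
import numpy as np

keys = lookup_array[:, 0]
order = np.argsort(keys)                                       # permutation that sorts the keys
row_idx = order[np.searchsorted(keys, key_set, sorter=order)]  # positions in the original array
values = lookup_array[row_idx, 1]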
Loading a dictionary this huge into memory is not great, and then there is the added overhead of the lookups. If this is a data structure you use frequently, how about a database engine? There are key/value databases if you don't like SQL; they are highly optimized for lookups.
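A minimal sketch of that key/value-store idea using the standard-library sqlite3 module; the file name, table name, and sample data are made up for illustration:
import sqlite3

con = sqlite3.connect('lookup.db')
con.execute('CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v REAL)')
con.executemany('INSERT OR REPLACE INTO kv VALUES (?, ?)',
                [(1, 0.5), (2, 0.25), (3, 0.125)])
con.commit()

# Look up many keys in a single query instead of one round trip per key.
wanted = [1, 3]
placeholders = ','.join('?' * len(wanted))
rows = con.execute(f'SELECT k, v FROM kv WHERE k IN ({placeholders})', wanted).fetchall()
print(dict(rows))   # {1: 0.5, 3: 0.125}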
I have a DataFrame and would like to work on a small partition at a time, for particular tuples of values of 'a', 'b', 'c'.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 10, 10000),
                   'b': np.random.randint(0, 10, 10000),
                   'c': np.random.randint(0, 10, 10000),
                   'value': np.random.randint(0, 100, 10000)})
so I chose to use pandas multiindex:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
However, the performance is not great.
%timeit dfi.ix[(2,1,7)] # 511 us
%timeit df[(df['a'].values == 2) &
           (df['b'].values == 1) & (df['c'].values == 7)]  # 247 us
I suspect there are some overheads somewhere. My program has ~1k tuples, so it takes 511 * 1000 = 0.5s for one run. How can I improve further?
update:
Hmm, I forgot to mention that the number of tuples is smaller than the full Cartesian product of the distinct values of 'a', 'b', 'c' in df. Wouldn't groupby do an excessive amount of work on index combinations that don't exist in my tuples?
It's not clear what 'work on' means, but I would do this; the applied function can be almost any function:
In [33]: %timeit df.groupby(['a','b','c']).apply(lambda x: x.sum())
10 loops, best of 3: 83.6 ms per loop
Certain operations are cythonized and so are very fast:
In [34]: %timeit df.groupby(['a','b','c']).sum()
100 loops, best of 3: 2.65 ms per loop
Doing a selection on a multi-index one index at a time is not efficient.
If you are operating on a very small subset of the total groups, then you might want to directly index into the multi-index; groupby wins if you are operating on a fraction (maybe 20%) of the groups or more. You might also want to investigate filter which you can use to pre-filter the groups based on some criteria.
As noted above, the Cartesian product of the group indexers is irrelevant. Only the actual groups will be iterated by groupby (think of a MultiIndex as a sparse representation of the total possible space).
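A sketch of the "directly index into the multi-index" route for a small, known set of tuples (the tuples here are hypothetical, and combinations that never occur are skipped):
wanted = [(2, 1, 7), (0, 0, 3)]
dfi = df.set_index(['a', 'b', 'c']).sort_index()
present = [t for t in wanted if t in dfi.index]   # drop tuples with no matching rows
subset = dfi.loc[present]                         # all rows for each remaining tuple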
How about:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
value = dfi["value"].values
value[dfi.index.get_loc((2, 1, 7))]
The result is an ndarray without an index.
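And a sketch of looping the ~1k tuples with this approach (tuples is a hypothetical list of (a, b, c) keys; combinations absent from the index are skipped):
tuples = [(2, 1, 7), (3, 4, 5)]
results = {t: value[dfi.index.get_loc(t)] for t in tuples if t in dfi.index}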