I would like to have a quick index lookup when using a pandas dataframe. As noted here and there, I understand that I need to keep my index unique, otherwise all hope is lost.
I made sure that the index is sorted and unique:
df = df.sort_index()
assert df.index.is_unique
I measured the lookup speed:
%timeit df.at[tuple(q.values), 'column']
1000 loops, best of 3: 185 µs per loop
Then, when I moved the index to a separate python dictionary:
index = {}
for i in df.index:
    index[i] = np.random.randint(1000)
assert len(index) == len(df.index)
I got a huge speedup:
%timeit index[tuple(q.values)]
100000 loops, best of 3: 2.7 µs per loop
Why is it so? Am I doing something wrong? Is there a way to replicate python dict's speed (or something in <5x range) in a pandas index?
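For reference, a minimal sketch of the kind of workaround the dict timing suggests, assuming df, q and 'column' as above: build a plain dict from index key to integer position once, then look up against the underlying NumPy array.
# Minimal sketch (assumes df, q and 'column' as above): map each index key
# to its integer position once, then bypass pandas indexing on lookup.
positions = {key: pos for pos, key in enumerate(df.index)}
column_values = df['column'].values

value = column_values[positions[tuple(q.values)]]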
I have the following example:
import pandas as pd
import numpy as np
import time
def function(value,df):
    return len(df[(df['A']<value)])
df= pd.DataFrame(np.random.randint(0,100,size=(30000, 1)), columns=['A'])
start=time.time()
df['B']=pd.Series([len(df[df['A']<value]) for value in df['A']])
end=time.time()
print("list comprehension time:",end-start)
start=time.time()
df['B']=df['A'].apply(function,df=df)
end=time.time()
print("apply time:",end-start)
start=time.time()
series = []
for index, row in df.iterrows():
    series.append(len(df[df['A']<row['A']]))
df['B'] = series
end=time.time()
print("loop time:",end-start)
Output:
list comprehension time: 19.54859232902527
apply time: 23.598857402801514
loop time: 26.441001415252686
This example creates a new column by counting, for each row, how many rows have a smaller value in column A.
For this type of problem (creating a new column after comparing each row with all the other rows of the dataframe), I have tried apply, a list comprehension and a classic loop, but I find them slow.
Is there a faster way?
PS: A specialized solution for this particular example is not what interests me most; I would prefer a general solution for this type of problem.
Another example would be: for a dataframe with a column of strings, create a new column counting, for each row, how many strings in the dataframe begin with that string's first letter.
I usually use numpy broadcasting for this type of task:
%timeit df['B']=pd.Series([len(df[df['A']<value]) for value in df['A']])
1 loop, best of 3: 25.4 s per loop
%timeit df['B']=(df.A.values<df.A.values[:,None]).sum(1)
1 loop, best of 3: 1.74 s per loop
#df= pd.DataFrame(np.random.randint(0,100,size=(30000, 1)), columns=['A'])
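The same broadcasting idea carries over to the string example from the question. A minimal sketch with a hypothetical column 'S' (note that the pairwise comparison builds an n-by-n boolean matrix, so memory grows quadratically):
import pandas as pd

# Hypothetical string column 'S': for each row, count how many strings in
# the column start with that row's first letter (including the row itself).
df_s = pd.DataFrame({'S': ['apple', 'avocado', 'banana', 'cherry', 'apricot']})
first = df_s['S'].str[0].values
df_s['B'] = (first == first[:, None]).sum(1)
# df_s['B'] -> [3, 3, 1, 1, 3]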
Broadcasting, as in Wen's solution, is generally the fastest. In this case, it looks like rank does the job.
np.random.seed(1)
df= pd.DataFrame(np.random.randint(0,100,size=(30000, 1)), columns=['A'])
%timeit df.A.rank()-1
2.71 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
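One caveat: rank defaults to method='average', so ties produce fractional ranks and the result will not exactly equal the strict count of smaller values; rank(method='min') - 1 should, since the minimum rank of a value is one more than the number of strictly smaller values. A small check on a smaller frame:
import numpy as np
import pandas as pd

np.random.seed(1)
small = pd.DataFrame(np.random.randint(0, 100, size=(1000, 1)), columns=['A'])

# Count of strictly smaller values via broadcasting (Wen's approach)...
broadcast_count = (small.A.values < small.A.values[:, None]).sum(1)
# ...equals rank(method='min') - 1, even with many ties.
rank_count = (small.A.rank(method='min') - 1).astype(int).values
assert (broadcast_count == rank_count).all()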
Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:
# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)
Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it's not vectorized, iterating 200 rows is only 200 iterations, so how can it be slow?
Here is a SSCCE demonstrating the issue:
import pandas as pd
import random
# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)
# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')
Running that command takes more than 1 second on my machine, which is thousands of times longer than expected for fewer than 1000 operations.
It looks like replace has a bit of overhead, and explicitly telling the Series what to do via map yields the best performance:
series = series.map(lambda x: dictionary.get(x,x))
If you're sure that all keys are in your dictionary, you can get a very slight performance boost by not creating a lambda and supplying dictionary.get directly. Any keys that are not present will map to NaN with this method, so beware:
series = series.map(dictionary.get)
You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:
series = series.map(dictionary)
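Another option, if you want to keep unmatched values without a lambda, is to map with the dict and fill the gaps from the original Series; note that this inherits the overhead of passing the dict directly (see the timings below), so it is only a sketch:
# Sketch: keep original values wherever the key is missing, without a lambda.
series = series.map(dictionary).fillna(series)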
Timings
Some timing comparisons using your example data:
%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop
%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop
%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop
%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
.replace can do incomplete substring matches, while .map requires complete values to be supplied in the dictionary (or it returns NaN). A fast but still generic solution (one that can handle substrings) is to first run .replace over just the distinct values (obtained e.g. with .value_counts().index) to build a dict, and then run all rows of the Series through that dict with .map. This combo can handle, for instance, special national-character replacements (true substrings) on 1M-row columns in about a quarter of a second, where .replace alone would take 15.
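A sketch of that combo on a tiny made-up example (s and subs are hypothetical; subs maps substrings, treated as regex patterns, to replacements):
import pandas as pd

s = pd.Series(['café', 'naïve', 'café', 'crème'])
subs = {'é': 'e', 'ï': 'i', 'è': 'e'}

uniques = pd.Series(s.value_counts().index)      # distinct values only
translated = uniques.replace(subs, regex=True)   # slow .replace, but run only on the uniques
s = s.map(dict(zip(uniques, translated)))        # fast per-row dict lookup for the full column
print(s.tolist())                                # ['cafe', 'naive', 'cafe', 'creme']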
Thanks to @root: I ran the benchmark again and found different results on pandas v1.1.4.
I found series.map(dictionary) to be the fastest; it also returns NaN if a key is not present.
I'm trying to speed up the following code. 'db' is a dictionary of DataFrames. Is there a better/different way to structure things which would speed this up?
for date in dates: # 3,800 days
    for instrument in instruments: # 100 instruments
        s = instrument.ticker
        current_bar = db[s].ix[date]
        # (current_bar.xxx then gets used for difference calculations.)
Here are the results:
%timeit speedTest()
1 loops, best of 3: 1min per loop
This is for each individual call:
%timeit current_bar = db[s].ix[date]
10000 loops, best of 3: 154 µs per loop
Any help/suggestions would be appreciated.
Thanks
I don't think a dict of dataframes is a good idea. Try structuring all the dataframes as one: stack them vertically and use the key as an index (a level of a MultiIndex).
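A minimal sketch of that restructuring, assuming each DataFrame in db is indexed by date and keyed by ticker:
import pandas as pd

# Concatenate once, with the ticker as the outer level of a two-level MultiIndex.
combined = pd.concat(db, names=['ticker', 'date']).sort_index()

# A single indexed lookup then replaces db[s].ix[date]:
current_bar = combined.loc[(s, date)]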
I have a data frame, and I would like to work on a small partition at a time, for particular tuples of values of 'a', 'b', 'c'.
df = pd.DataFrame({'a': np.random.randint(0, 10, 10000),
                   'b': np.random.randint(0, 10, 10000),
                   'c': np.random.randint(0, 10, 10000),
                   'value': np.random.randint(0, 100, 10000)})
So I chose to use a pandas MultiIndex:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
However, the performance is not great.
%timeit dfi.ix[(2,1,7)] # 511 us
%timeit df[(df['a'].values == 2) & (df['b'].values == 1) & (df['c'].values == 7)] # 247 us
I suspect there is some overhead somewhere. My program has ~1k tuples, so it takes 511 µs × 1000 ≈ 0.5 s for one run. How can I improve further?
Update:
Hmm, I forgot to mention that the number of tuples is smaller than the full Cartesian product of the distinct values of 'a', 'b', 'c' in df. Wouldn't groupby do an excessive amount of work on index combinations that don't exist in my tuples?
It's not clear what 'work on' means, but I would do this.
This can be almost any function:
In [33]: %timeit df.groupby(['a','b','c']).apply(lambda x: x.sum())
10 loops, best of 3: 83.6 ms per loop
Certain operations are cythonized, so they are very fast:
In [34]: %timeit df.groupby(['a','b','c']).sum()
100 loops, best of 3: 2.65 ms per loop
Doing a selection on a multi-index one index at a time is not efficient.
If you are operating on a very small subset of the total groups, then you might want to directly index into the multi-index; groupby wins if you are operating on a fraction (maybe 20%) of the groups or more. You might also want to investigate filter which you can use to pre-filter the groups based on some criteria.
As noted above, the cartesian product of the groups indexers is irrelevant. Only the actual groups will be iterated by groupby (think of a MultiIndex as a sparse representation of the total possible space).
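For example, a small sketch of the filter idea (the len(g) > 10 criterion is an arbitrary illustration, not from the question):
# Pre-filter the groups by some criterion, then aggregate only what survives.
filtered = df.groupby(['a', 'b', 'c']).filter(lambda g: len(g) > 10)
result = filtered.groupby(['a', 'b', 'c']).sum()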
How about:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
value = dfi["value"].values
value[dfi.index.get_loc((2, 1, 7))]
The result is an ndarray without an index.
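If the same ~1k tuples are looked up many times, the get_loc calls can also be hoisted out of the hot path. A sketch, where wanted is a hypothetical list of (a, b, c) tuples of interest:
# Resolve each tuple to its position(s) once, then reuse plain NumPy indexing.
locs = {t: dfi.index.get_loc(t) for t in wanted}
value = dfi["value"].values

rows = value[locs[(2, 1, 7)]]   # assumes (2, 1, 7) is one of the wanted tuples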
I'm looking for the most efficient way to select multiple columns from a data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(4,8), columns = list('abcdefgh'))
I want to select only the following columns: a, c, e, f, g, which can be done using positional indexing:
df.ix[:,[0,2,4,5,6]]
For a large data frame with many columns, this seems an inefficient method, and I would much rather specify consecutive column indexes as a range if at all possible, but attempts such as the following both raise syntax errors:
df.ix[:,[0,2,4:6]]
or
df.ix[:,[0,2,[4:6]]]
As soon as you select non-adjacent columns, you will pay the price.
If your data is homogeneous, falling back to numpy gives you a notable improvement.
%timeit df[['a','c','e','f','g']]
%timeit df.values[:,[0,2,4,5,6]]
%timeit df.ix[:,[0,2,4,5,6]]
%timeit pd.DataFrame(df.values[:,[0,2,4,5,6]],columns=df.columns[[0,2,4,5,6]])
100 loops, best of 3: 2.67 ms per loop
10000 loops, best of 3: 58.7 µs per loop
1000 loops, best of 3: 1.81 ms per loop
1000 loops, best of 3: 568 µs per loop
I think you can use range:
print [0,2] + range(4,7)
[0, 2, 4, 5, 6]
print df.ix[:, [0,2] + range(4,7)]
a c e f g
0 0.278231 0.192650 0.653491 0.944689 0.663457
1 0.416367 0.477074 0.582187 0.730247 0.946496
2 0.396906 0.877941 0.774960 0.057290 0.556719
3 0.119685 0.211581 0.526096 0.213282 0.492261
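On Python 3, where print is a function and range is lazy, the same idea can be written with np.r_ and iloc; a small sketch:
import numpy as np

# np.r_ expands mixed scalars and slices into a single integer array,
# which iloc accepts for positional column selection.
print(np.r_[0, 2, 4:7])              # [0 2 4 5 6]
print(df.iloc[:, np.r_[0, 2, 4:7]])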
Pandas is relatively well thought out; the shortest way is the most efficient:
df[['a','c','e','f','g']]
You don't need ix, as it will do a search in your data, but for that you obviously need the names of the columns.