Faster pandas DatetimeIndex membership checking

I have a tight loop which, among other things, checks whether a given date (in the form of a pandas.Timestamp) is contained in a given unique pandas.DatetimeIndex (the application being checking whether a date is a custom business day).
As a minimal example, consider this bit:
import pandas as pd
dates = pd.date_range("2020", "2021")
index = dates.to_series().sample(frac=0.7).sort_index().index
for date in dates:
    if date in index:
        pass  # Do stuff...
(Note that simply iterating over index is not an option in the full application)
To my surprise, I found that the date in index bit takes up a significant part of the total runtime. Profiling furthermore shows that Pandas' membership check does a lot more than just a hash lookup, which is further confirmed by a small experiment comparing DatetimeIndex vs a plain python set:
%timeit [date in index for date in dates]
# 3.28 ms ± 81.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vs
index_set = set(index)
%timeit [date in index_set for date in dates]
# 341 µs ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that the difference is almost 10x! Why this difference and can I do anything to make it faster?
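One way to sidestep the per-call overhead, sketched here under the assumption that the set of valid dates does not change inside the loop: build the lookup structure once, either as a plain Python set (as in the experiment above) or as a boolean mask computed with a single vectorized isin call. The names index_set and is_member are illustrative, and the fixed random_state is only there to make the sample reproducible.

```python
import pandas as pd

dates = pd.date_range("2020", "2021")
# Deterministic 70% sample so the example is reproducible.
index = dates.to_series().sample(frac=0.7, random_state=0).sort_index().index

# Option A: hash-based lookup in a plain Python set, built once.
index_set = set(index)
hits_set = [date in index_set for date in dates]

# Option B: one vectorized membership test for all dates at once.
is_member = dates.isin(index)  # boolean NumPy array aligned with `dates`
hits_mask = is_member.tolist()

# Reference result: the original per-element check against the index.
hits_index = [date in index for date in dates]
```

Option B only helps when the dates to test are known up front; inside a loop that receives one date at a time, the set is probably the simpler drop-in replacement.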

Related

Compare with another column value

train.loc[:,'nd_mean_2021-04-15':'nd_mean_2021-08-27'] > train['q_5']
I get the warning "Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version. Do left, right = left.align(right, axis=1, copy=False) before e.g. left == right", and the output is a strange frame with a lot of extra columns. I expected each cell to be masked with True or False, so that I can calculate a sum in the next step.
Comparing each column separately works just fine:
train['nd_mean_2021-04-15'] > train['q_5']
But this is slow and makes for messy code.

I've tested your original solution, and two additional ways of performing this comparison you want to make.
To cut to the chase, the following option had the smallest execution time:
%%timeit
sliced_df = df.loc[:, 'nd_mean_2021-04-15':'nd_mean_2021-08-27']
comparison_df = pd.DataFrame({col: df['q_5'] for col in sliced_df.columns})
(sliced_df > comparison_df)
# 1.46 ms ± 610 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Drawback: it's a little bit messy and requires you to create 2 new objects (sliced_df and comparison_df).
Option 2: Using DataFrame.apply (slower but more readable)
The second option, although slower than both your original and the implementation above, is in my opinion the cleanest and easiest to read of them all. If you're not processing large amounts of data (I assume not, since you're using pandas rather than Dask or Spark, which are better suited to large volumes), then it's worth bringing to the discussion table:
%%timeit
df.loc[:, 'nd_mean_2021-04-15':'nd_mean_2021-08-27'].apply(lambda col: col > df['q_5'])
# 5.66 ms ± 897 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Original Solution
I've also tested the performance of your original implementation and here's what I got:
%%timeit
df.loc[:, 'nd_mean_2021-04-15':'nd_mean_2021-08-27'] > df['q_5']
# 2.02 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Side note: if the FutureWarning message is bothering you, you can always suppress it by adding the following code after your script's imports:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
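Rather than silencing the warning, another option (not among those timed in this answer) is the comparison method DataFrame.gt with axis=0, which matches the Series against the rows instead of the column labels and so needs no reindexing. A minimal sketch on a small made-up frame; the column names and values are illustrative only:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "nd_mean_2021-04-15": [1, 5, 9],
        "nd_mean_2021-04-22": [4, 2, 8],
        "q_5": [3, 3, 8],
    }
)

sliced = df.loc[:, "nd_mean_2021-04-15":"nd_mean_2021-04-22"]
# axis=0 compares every column of `sliced` against q_5 row by row.
mask = sliced.gt(df["q_5"], axis=0)
# Count, per row, how many columns exceed q_5.
counts = mask.sum(axis=1)
```

The result is a boolean frame with the same shape as the slice, which is what the question expected before summing.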
DataFrame Used for Testing
All of the above implementations used the same dataframe, that I created using the following code:
import pandas as pd
import numpy as np
columns = list(
    map(
        lambda value: f'nd_mean_{value}',
        pd.date_range('2021-04-15', '2021-08-27', freq='W').to_series().dt.strftime('%Y-%m-%d').to_list()
    )
)
df = pd.DataFrame(
    {col: np.random.randint(0, 100, 10) for col in [*columns, 'q_5']}
)

How to quickly subset many dataframes?

I have 180 DataFrame objects, each one has 3130 rows and it's about 300KB in memory.
The index is a DatetimeIndex, business days from 2000-01-03 to 2011-12-31:
from datetime import datetime
import pandas as pd
freq = pd.tseries.offsets.BDay()
index = pd.date_range(datetime(2000,1,3), datetime(2011,12,31), freq=freq)
df = pd.DataFrame(index=index)
df['A'] = 1000.0
df['B'] = 2000.0
df['C'] = 3000.0
df['D'] = 4000.0
df['E'] = 5000.0
df['F'] = True
df['G'] = 1.0
df['H'] = 100.0
I preprocess all the data taking advantage of numpy/pandas vectorization, then I have to loop through the dataframes day by day. To prevent 'look ahead bias' (using data from the future), I must make sure that each day I only return the subset of my dataframes up to that datapoint. For example, if the current datapoint I am processing is datetime(2010,5,15), I need the data from datetime(2000,1,3) to datetime(2010,5,15), and nothing more recent than datetime(2010,5,15) should be accessible. With this subset I'll make other computations that I can't vectorize because they are path dependent.
I modified my original loop like this:
def get_data(datapoint):
    return df.loc[:datapoint]
calendar = df.index
for datapoint in calendar:
    x = get_data(datapoint)
This kind of code is painfully slow. What is my best option to improve its speed?
If I do not try to prevent the look ahead bias my production code takes about 3 minutes to run but it is too risky. With code like this it takes 13 minutes and this is unacceptable.
A slightly faster option is using iloc instead of loc but it is still slow:
def get_data2(datapoint):
    idx = df.index.get_loc(datapoint)
    # + 1 so that the slice includes datapoint itself, as df.loc[:datapoint] does
    return df.iloc[:idx + 1]
%%timeit
for datapoint in calendar:
    x = get_data(datapoint)
371 ms ± 23.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
for datapoint in calendar:
    x = get_data2(datapoint)
327 ms ± 7.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The original code, which did not try to prevent look ahead bias, simply returned the whole DataFrame each time it was called. In this example it is about 100 times faster; in the real code it is about 4 times faster.
def get_data_no_check():
    return df
for datapoint in calendar:
    x = get_data_no_check()
2.87 ms ± 89.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
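Since the loop walks the calendar in order, the position of each datapoint is already known, so the label lookup can be skipped entirely. A minimal sketch, assuming the calendar is exactly df.index as in the question; the small frame is made up and the approach was not benchmarked here:

```python
import pandas as pd

index = pd.date_range("2000-01-03", "2000-01-14", freq="B")
df = pd.DataFrame({"A": range(len(index))}, index=index)

lengths = []
for pos, datapoint in enumerate(df.index):
    # Positional slice up to and including the current row: no look-ahead.
    x = df.iloc[: pos + 1]
    lengths.append(len(x))
```

Each iteration still builds a new DataFrame object, so this reduces the lookup cost but not the slicing overhead itself.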

See if this works for you:
datapoint_range = pd.date_range(datetime(2000,1,3), datetime.now(), freq=freq)
datapoint = datapoint_range[-1]
The logic: replace the end date with today so that the range cannot contain a future date, then take the last date of the range.
Then use your df.loc[:datapoint] to get the range you want.

I solved it like this: first I preprocess all my data in the DataFrame to take advantage of pandas vectorization, then I convert it into a dict of dicts and iterate over that, which prevents any possibility of 'look ahead bias'. Since the data are already preprocessed, I can avoid the DataFrame overhead. The increase in processing speed in production code left me speechless: down from more than 30 minutes to 40 seconds!
# Convert the DataFrame into a dict of dict
for s, data in self._data.items():
    self._data[s] = data.to_dict(orient='index')
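To illustrate how such a dict of dicts might then be consumed, here is a minimal sketch. The symbol "XYZ", the column "A" and the running total are made up for illustration; the real path-dependent computation is not shown in the answer.

```python
import pandas as pd

index = pd.date_range("2000-01-03", periods=5, freq="B")
data = {"XYZ": pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=index)}

# Convert each DataFrame into {timestamp: {column: value}}.
rows_by_symbol = {s: frame.to_dict(orient="index") for s, frame in data.items()}

running_total = 0.0
totals = []
for day in index:
    row = rows_by_symbol["XYZ"][day]  # plain dict lookup, only today's row
    running_total += row["A"]         # path-dependent state carried forward
    totals.append(running_total)
```

Because each step only ever reads the row for the current day, there is no way to reach data from a later date by accident.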

Python: Vectorize list lookup

I have sensor data like this:
{"Time":1541203508.45,"Tc":25.4,"Hp":33}
{"Time":1541203508.45,"Tc":25.2,"Hp":32}
{"Time":1541203508.45,"Tc":25.1,"Hp":31}
{"Time":1541203508.45,"Tc":25.2,"Hp":33}
I'm doing a lot of list lookups in a for loop like this:
output={}
for i,data in enumerate(sensor_data):
    output[i]={}
    output[i]['H']=['V_Dry','Dry','Normal','Humid','V_Humid','ERR'][sensor_data[i]['Hp']//20]
    #.... And so on for temp etc
Is there some way to vectorize this if I converted it to a numpy/pandas datatype? Like, if I split the sections into temp, humidity etc, is there a python method that would apply this 'mask' kind of thing on it?
Is map my only option to speed it up?

First attempt
I suggest you first convert your data into a numpy array:
import numpy as np
data = [{"Time":1541203508.45,"Tc":25.4,"Hp":33},
        {"Time":1541203508.45,"Tc":25.2,"Hp":32},
        {"Time":1541203508.45,"Tc":25.1,"Hp":31},
        {"Time":1541203508.45,"Tc":25.2,"Hp":33}]
np_data = np.asarray([list(element.values()) for element in data])
Now the third column is humidity in your example. Let's now define a map for that:
def convert_number_to_hr(value):
    hr_names = ['V_Dry','Dry','Normal','Humid','V_Humid','ERR']
    return hr_names[int(value//20)]
This uses your predefined names in steps of 20%. Now let's apply the map:
hr_humidity = map(convert_number_to_hr, np_data[:,2])
This is a lazy iterator (a map object). You can iterate through it or convert it to a list via list(hr_humidity).
This reports a speed of
%timeit hr_humidity = map(convert_number_to_hr, np_data[:,2])
515 ns ± 2.25 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
That figure only covers creating the lazy map object. If you force evaluation with list(..), the time grows to
%timeit hr_humidity = list(map(convert_number_to_hr, np_data[:,2]))
5.62 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
You can now use the same procedure for everything else in your dataset.
Second attempt
I tried to do this fully vectorized as you asked in your comment. I came up with:
def same_but_pure_numpy(arr):
    arr = arr.astype(int)//20
    hr_names = np.asarray(['V_Dry','Dry','Normal','Humid','V_Humid','ERR'])
    return hr_names[arr]
This reports a speed of
%timeit a = same_but_pure_numpy(np_data[:,2])
11.5 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
so for this four-row sample the map version seems to be faster.
Third attempt
EDIT: Okay I had my first try with pandas:
import pandas as pd
data = [{"Time":1541203508.45,"Tc":25.4,"Hp":33},
        {"Time":1541203508.45,"Tc":25.2,"Hp":32},
        {"Time":1541203508.45,"Tc":25.1,"Hp":31},
        {"Time":1541203508.45,"Tc":25.2,"Hp":33}]
df = pd.DataFrame(data)
def convert_number_to_hr(value):
    hr_names = ['V_Dry','Dry','Normal','Humid','V_Humid','ERR']
    return hr_names[int(value//20)]
The result is as expected, but it seems to take much longer:
%timeit new = df["Hp"].map(convert_number_to_hr)
110 µs ± 569 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
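With only four rows, the timings above are likely dominated by fixed per-call overhead, so they say little about larger inputs. For completeness, here is a sketch of a lookup done as a single array operation on the pandas column, with no per-row Python function call. The humidity values are made up, and clipping out-of-range readings to the last label ('ERR') is an assumption about the intended behaviour, not something stated in the question.

```python
import numpy as np
import pandas as pd

data = [
    {"Time": 1541203508.45, "Tc": 25.4, "Hp": 33},
    {"Time": 1541203508.45, "Tc": 25.2, "Hp": 52},
    {"Time": 1541203508.45, "Tc": 25.1, "Hp": 95},
    {"Time": 1541203508.45, "Tc": 25.2, "Hp": 130},
]
df = pd.DataFrame(data)

hr_names = np.array(["V_Dry", "Dry", "Normal", "Humid", "V_Humid", "ERR"])

# Integer bin per row in steps of 20; readings above the last bin are
# clipped so that they map to "ERR" instead of raising an IndexError.
bins = (df["Hp"] // 20).clip(upper=len(hr_names) - 1).to_numpy()

# One fancy-indexing operation labels every row at once.
df["H"] = hr_names[bins]
```

Whether this beats map in practice depends on the number of rows, so it is worth timing on data of realistic size.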

What's the fastest way to access a Pandas DataFrame?

I have a DataFrame df with 541 columns, and I need to save all unique pairs of its column names into the rows of a separate DataFrame, repeated 8 times each.
I thought I would create an empty DataFrame fp, double loop through df's column names, insert into every 8th row, and fill in the blanks with the last available value.
When I tried to do this, though, I was baffled by how long it takes. With 541 columns I only have to write 146,611 times, yet it takes well over 20 minutes. This seems egregious for just data access. Where is the problem and how can I solve it? It takes less time than that for Pandas to produce a correlation matrix with the columns, so I must be doing something wrong.
Here's a reproducible example of what I mean:
import numpy as np
import pandas as pd
fp = np.empty(shape = (146611, 10))
fp.fill(np.nan)
fp = pd.DataFrame(fp)
%timeit for idx in range(0, len(fp)): fp.iloc[idx, 0] = idx
# 1 loop, best of 3: 22.3 s per loop

Don't do iloc/loc/chained-indexing. Using the NumPy interface alone increases speed by ~180x. If you further remove element access, we can bump this to 180,000x.
fp = np.empty(shape = (146611, 10))
fp.fill(np.nan)
fp = pd.DataFrame(fp)
# this confirms how slow data access is on my computer
%timeit for idx in range(0, len(fp)): fp.iloc[idx, 0] = idx
1 loops, best of 3: 3min 9s per loop
# this accesses the underlying NumPy array, so you can directly set the data
%timeit for idx in range(0, len(fp)): fp.values[idx, 0] = idx
1 loops, best of 3: 1.19 s per loop
This is because there's extensive code in the Python layer behind this fancy indexing, taking ~10µs per loop. Pandas indexing should be used to retrieve entire subsets of data, on which you then perform vectorized operations. Individual element access is glacial: using Python dictionaries will give you a > 180 fold increase in performance.
Things get a lot better when you access columns or rows instead of individual elements: 3 orders of magnitude better.
# set all items in 1 go.
%timeit fp[0] = np.arange(146611)
1000 loops, best of 3: 814 µs per loop
Moral
Don't try to access individual elements via chained indexing, loc, or iloc. Generate a NumPy array in a single allocation, from a Python list (or a C-interface if performance is absolutely critical), and then perform operations on entire columns or dataframes.
Using NumPy arrays and performing operations directly on columns rather than individual elements, we got a whopping 180,000+ fold increase in performance. Not too shabby.
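Applied to the task in the question (all pairs of column names, each repeated a fixed number of times), the same idea might look like the sketch below. The four column names and repeats = 2 are stand-ins for the 541 columns and 8 repeats, and the output column names left and right are made up. The question's figure of 146,611 equals 541 * 542 / 2, which suggests self-pairs were included; itertools.combinations_with_replacement would cover that case.

```python
from itertools import combinations

import numpy as np
import pandas as pd

columns = ["a", "b", "c", "d"]  # stand-in for the 541 column names
repeats = 2                     # the question uses 8

# All unordered pairs of distinct names, as an array of shape (n_pairs, 2).
pairs = np.array(list(combinations(columns, 2)))

# Repeat every pair `repeats` times in a row and build the frame in one go.
fp = pd.DataFrame(np.repeat(pairs, repeats, axis=0), columns=["left", "right"])
```

No element is ever written individually: the whole frame is built from one array.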
Edit
Comments from @kushy suggest Pandas may have optimized indexing in certain cases since I originally wrote this answer. Always profile your own code; your mileage may vary.

Alexander's answer was the fastest for me as of 2020-01-06 when using .to_numpy() instead of .values. Tested in Jupyter Notebook on Windows 10. Pandas version = 0.24.2
import numpy as np
import pandas as pd
fp = np.empty(shape = (146611, 10))
fp.fill(np.nan)
fp = pd.DataFrame(fp)
pd.__version__ # '0.24.2'
def func1():
    # Asker badmax solution
    for idx in range(0, len(fp)):
        fp.iloc[idx, 0] = idx
def func2():
    # Alexander Huszagh solution 1
    for idx in range(0, len(fp)):
        fp.to_numpy()[idx, 0] = idx
def func3():
    # user4322543 answer to
    # https://stackoverflow.com/questions/34855859/is-there-a-way-in-pandas-to-use-previous-row-value-in-dataframe-apply-when-previ
    new = []
    for idx in range(0, len(fp)):
        new.append(idx)
    fp[0] = new
def func4():
    # Alexander Huszagh solution 2
    fp[0] = np.arange(146611)
%timeit func1
19.7 ns ± 1.08 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func2
19.1 ns ± 0.465 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func3
21.1 ns ± 3.26 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func4
24.7 ns ± 0.889 ns per loop (mean ± std. dev. of 7 runs, 50000000 loops each)
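Note that %timeit func1 references the function without calling it, so the figures of roughly 20 ns above reflect a name lookup rather than the loops themselves; in IPython the call would be %timeit func1(). A minimal sketch of timing the actual calls with the standard timeit module, using a smaller frame so that it runs quickly (the function names and the frame size are illustrative, and no timings are claimed here):

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the original 146611 rows so the element-wise loop stays quick.
fp = pd.DataFrame(np.full((1000, 10), np.nan))

def set_with_iloc():
    # Element-by-element assignment through the pandas indexer.
    for idx in range(len(fp)):
        fp.iloc[idx, 0] = idx

def set_whole_column():
    # Single vectorized assignment of the entire column.
    fp[0] = np.arange(len(fp))

# Pass the callable itself; timeit invokes it `number` times.
t_iloc = timeit.timeit(set_with_iloc, number=1)
t_column = timeit.timeit(set_whole_column, number=1)
print(f"iloc loop: {t_iloc:.4f} s, column assignment: {t_column:.6f} s")
```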

Pandas DataFrame performance

Pandas is really great, but I am really surprised by how inefficient it is to retrieve values from a Pandas.DataFrame. In the following toy example, even the DataFrame.iloc method is more than 100 times slower than a dictionary.
The question: Is the lesson here just that dictionaries are the better way to look up values? Yes, I get that that is precisely what they were made for. But I just wonder if there is something I am missing about DataFrame lookup performance.
I realize this question is more "musing" than "asking" but I will accept an answer that provides insight or perspective on this. Thanks.
import timeit
setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
dictionary = df.to_dict()
'''
f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]']
for func in f:
    print(func)
    print(min(timeit.Timer(func, setup).repeat(3, 100000)))
value = dictionary[5][5]
0.130625009537
value = df.loc[5, 5]
19.4681699276
value = df.iloc[5, 5]
17.2575249672

A dict is to a DataFrame as a bicycle is to a car.
You can pedal 10 feet on a bicycle faster than you can start a car, get it in gear, etc, etc. But if you need to go a mile, the car wins.
For certain small, targeted purposes, a dict may be faster.
And if that is all you need, then use a dict, for sure! But if you need/want the power and luxury of a DataFrame, then a dict is no substitute. It is meaningless to compare speed if the data structure does not first satisfy your needs.
Now for example -- to be more concrete -- a dict is good for accessing columns, but it is not so convenient for accessing rows.
import timeit
setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 1000]))
dictionary = df.to_dict()
'''
# f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]']
f = ['value = [val[5] for col,val in dictionary.items()]', 'value = df.loc[5]', 'value = df.iloc[5]']
for func in f:
    print(func)
    print(min(timeit.Timer(func, setup).repeat(3, 100000)))
yields
value = [val[5] for col,val in dictionary.items()]
25.5416321754
value = df.loc[5]
5.68071913719
value = df.iloc[5]
4.56006002426
So the dict of lists is 5 times slower at retrieving rows than df.iloc. The speed deficit becomes greater as the number of columns grows. (The number of columns is like the number of feet in the bicycle analogy. The longer the distance, the more convenient the car becomes...)
This is just one example of when a dict of lists would be less convenient/slower than a DataFrame.
Another example would be when you have a DatetimeIndex for the rows and wish to select all rows between certain dates. With a DataFrame you can use
df.loc['2000-1-1':'2000-3-31']
There is no easy analogue for that if you were to use a dict of lists. And the Python loops you would need to use to select the right rows would again be terribly slow compared to the DataFrame.
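To make the date-range selection concrete, here is a minimal sketch on a made-up daily frame (the column name x and the date range are illustrative). Label slicing on a DatetimeIndex includes both endpoints:

```python
import pandas as pd

index = pd.date_range("1999-12-30", "2000-04-02", freq="D")
df = pd.DataFrame({"x": range(len(index))}, index=index)

# Select every row from 1 January to 31 March 2000, inclusive.
first_quarter = df.loc["2000-01-01":"2000-03-31"]
```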

It seems the performance difference is much smaller now (0.21.1; I forgot which version of Pandas the original example used). Not only has the performance gap between dictionary access and .loc shrunk (from about 335 times to 126 times slower), but loc (iloc) is now less than twice as slow as at (iat).
In [1]: import numpy, pandas
   ...: df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
   ...: dictionary = df.to_dict()
   ...:
In [2]: %timeit value = dictionary[5][5]
85.5 ns ± 0.336 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [3]: %timeit value = df.loc[5, 5]
10.8 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: %timeit value = df.at[5, 5]
6.87 µs ± 64.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [5]: %timeit value = df.iloc[5, 5]
14.9 µs ± 114 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [6]: %timeit value = df.iat[5, 5]
9.89 µs ± 54.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: print(pandas.__version__)
0.21.1
---- Original answer below ----
+1 for using at or iat for scalar operations. Example benchmark:
In [1]: import numpy, pandas
...: df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
...: dictionary = df.to_dict()
In [2]: %timeit value = dictionary[5][5]
The slowest run took 34.06 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 310 ns per loop
In [4]: %timeit value = df.loc[5, 5]
10000 loops, best of 3: 104 µs per loop
In [5]: %timeit value = df.at[5, 5]
The slowest run took 6.59 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 9.26 µs per loop
In [6]: %timeit value = df.iloc[5, 5]
10000 loops, best of 3: 98.8 µs per loop
In [7]: %timeit value = df.iat[5, 5]
The slowest run took 6.67 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 9.58 µs per loop
It seems using at (iat) is about 10 times faster than loc (iloc).

I encountered the same problem. You can use at to improve performance.
"Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you’re asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures."
See the official reference at http://pandas.pydata.org/pandas-docs/stable/indexing.html, chapter "Fast scalar value getting and setting".
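A minimal sketch of the scalar accessors described in that quote, on the same 10 x 10 frame of zeros used in the question (the values written are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((10, 10)))

df.at[5, 5] = 1.5     # label-based scalar set
value = df.at[5, 5]   # label-based scalar get

df.iat[0, 9] = 2.5    # position-based scalar set
other = df.iat[0, 9]  # position-based scalar get
```

Here the labels and positions coincide because the frame uses the default integer index and columns; with a different index, at takes labels and iat takes integer positions.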

I observed a different phenomenon when accessing a DataFrame row.
Test this simple example on a DataFrame of about 10,000,000 rows.
The dictionary rocks.
import time
def testRow(go):
    go_dict = go.to_dict()
    times = 100000
    ot = time.time()
    for i in range(times):
        go.iloc[100,:]
    nt = time.time()
    print('for iloc {}'.format(nt-ot))
    ot = time.time()
    for i in range(times):
        go.loc[100,2]
    nt = time.time()
    print('for loc {}'.format(nt-ot))
    ot = time.time()
    for i in range(times):
        [val[100] for col,val in go_dict.items()]
    nt = time.time()
    print('for dict {}'.format(nt-ot))

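I think the fastest way of accessing a cell is
df.get_value(row,column)
df.set_value(row,column,value)
Both are faster (I think) than
df.iat[...]
df.at[...]
Note that get_value and set_value were deprecated in pandas 0.21 and removed in pandas 1.0, so on current versions at and iat are the way to go.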
