Given a pandas DataFrame with a sorted timestamp index,
I have a label and I need to find the index entry closest to that label.
The match must also be an earlier (smaller) timestamp, so the search should only consider timestamps before the label.
Here is my code:
import pandas as pd
import datetime
data = [i for i in range(100)]
dates = pd.date_range(start="01-01-2018", freq="min", periods=100)
dataframe = pd.DataFrame(data, dates)
label = "01-01-2018 00:10:01"
method = "pad"
tol = datetime.timedelta(seconds=60)
idx = dataframe.index.get_loc(key=label, method=method, tolerance=tol)
print("Closest idx:"+str(idx))
print("Closest date:"+str(dataframe.index[idx]))
The search is too slow. Is there a way to improve it?
To improve performance, I recommend transforming what you are searching over. Instead of using get_loc, convert your DatetimeIndex to Unix time and use np.searchsorted on the underlying NumPy array (as the name implies, this requires a sorted index).
get_loc:
(Your current approach)
label = "01-01-2018 00:10:01"
tol = datetime.timedelta(seconds=60)
idx = dataframe.index.get_loc(key=label, method="pad", tolerance=tol)
print(dataframe.iloc[idx])
0 10
Name: 2018-01-01 00:10:00, dtype: int64
And its timings:
%timeit dataframe.index.get_loc(key=label, method="pad", tolerance=tol)
2.03 ms ± 81.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.searchsorted:
import numpy as np

arr = dataframe.index.astype(int)//10**9  # nanoseconds -> Unix seconds
l = pd.to_datetime(label).timestamp()
idx = np.maximum(np.searchsorted(arr, l, side='left')-1, 0)
print(dataframe.iloc[idx])
0 10
Name: 2018-01-01 00:10:00, dtype: int64
And the timings:
%timeit np.maximum(np.searchsorted(arr, l, side='left')-1, 0)
56.6 µs ± 979 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(I did not include the setup costs, because the initial array creation is something you do once and then reuse for every query; but even including the setup costs, this method is faster):
%%timeit
arr = dataframe.index.astype(int)//10**9
l = pd.to_datetime(label).timestamp()
np.maximum(np.searchsorted(arr, l, side='left')-1, 0)
394 µs ± 3.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The above method does not enforce a tolerance of 60s, although this is trivial to check:
>>> np.abs(arr[idx]-l)<60
True
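Putting the pieces together, a reusable helper might look like the following sketch (the name closest_earlier is just illustrative), reusing the dataframe from the question and a Unix-time array precomputed once:
import numpy as np
import pandas as pd

def closest_earlier(arr, label, tol_seconds=60):
    """Position of the closest earlier timestamp, or None if it is outside the tolerance.
    `arr` is the index converted to Unix seconds (precomputed once)."""
    ts = pd.to_datetime(label).timestamp()
    # side='left' finds the last strictly smaller timestamp; use side='right'
    # if an exact match should also be accepted (true 'pad' semantics)
    idx = np.maximum(np.searchsorted(arr, ts, side='left') - 1, 0)
    return idx if abs(arr[idx] - ts) <= tol_seconds else None

arr = dataframe.index.astype(int) // 10**9   # do this once, then reuse for every query
print(closest_earlier(arr, "01-01-2018 00:10:01"))   # -> 10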
Related
Given a large DataFrame df, which is faster in general?
# combining the masks first
sub_df = df[(df["column1"] < 5) & (df["column2"] > 10)]
# applying the masks sequentially
sub_df = df[df["column1"] < 5]
sub_df = sub_df[sub_df["column2"] > 10]
The first approach selects from the DataFrame only once, which may be faster; however, the second selection in the second approach only has to consider a smaller DataFrame.
It depends on your dataset.
First, let's generate a DataFrame where almost all rows should be dropped by the first condition:
import numpy as np
import pandas as pd

n = 1_000_000
p = 0.0001
np.random.seed(0)
df = pd.DataFrame({'column1': np.random.choice([0, 6], size=n, p=[p, 1-p]),
                   'column2': np.random.choice([0, 20], size=n)})
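(The exact benchmark code is not shown; presumably the two variants from the question were timed, along these lines:)
%timeit df[(df["column1"] < 5) & (df["column2"] > 10)]

%%timeit
sub_df = df[df["column1"] < 5]
sub_df = sub_df[sub_df["column2"] > 10]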
And as expected:
# simultaneous conditions
5.69 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# successive slicing
2.99 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It is faster to first generate a small intermediate.
Now, let's change the probability to p = 0.9999. This means that the first condition will remove very few rows.
We could expect both solutions to run with a similar speed, but:
# simultaneous conditions
27.5 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# successive slicing
55.7 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now the overhead of creating the intermediate DataFrame is not negligible.
I'm working with a pandas DataFrame trying to clean up some data, and I want to apply multiple rules to a certain column. If the column value is greater than 500, I want to drop that row. If the column value is between 101 and 500, I want to replace the value with 100. If the column value is less than 101, keep the value as is.
I'm able to do it in two lines of code, but I was curious whether there's a cleaner, more efficient way to do this. I tried an if/elif/else and a lambda function, but I couldn't get either to run.
# This drops all rows that are greater than 500
df.drop(df[df.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df['Percent'] = df['Percent'].clip(upper = 100)
You can use .loc with a boolean mask instead of .drop() with an index, together with the fast NumPy function numpy.where(), to achieve better performance, as follows:
import numpy as np
df2 = df.loc[df['Percent'] <= 500]
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
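Note that, depending on your pandas version, the assignment into df2 (which pandas tracks as being derived from df) may emit a SettingWithCopyWarning; an explicit copy avoids it:
df2 = df.loc[df['Percent'] <= 500].copy()  # .copy() silences SettingWithCopyWarning
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])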
Performance Comparison:
Part 1: Original size dataframe
Old Codes:
%%timeit
df.drop(df[df.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df['Percent'] = df['Percent'].clip(upper = 100)
1.58 ms ± 56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
New Codes:
%%timeit
df2 = df.loc[df['Percent'] <= 500]
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
784 µs ± 8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Benchmarking result:
The new codes take 784 µs while the old codes take 1.58 ms:
Around 2× faster
Part 2: Large size dataframe
Let's use a dataframe 10000 times the original size:
df9 = pd.concat([df] * 10000, ignore_index=True)
Old Codes:
%%timeit
df9.drop(df9[df9.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df9['Percent'] = df9['Percent'].clip(upper = 100)
3.87 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New Codes:
%%timeit
df2 = df9.loc[df9['Percent'] <= 500]
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
1.96 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Benchmarking result:
The new codes take 1.96 ms while the old codes take 3.87 ms:
Also around 2× faster
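For integer-valued percentages, an equivalent formulation that reuses clip from the original code should give the same result (a quick sketch, not benchmarked here):
df2 = df.loc[df['Percent'] <= 500].copy()   # keep only rows with Percent <= 500
df2['Percent'] = df2['Percent'].clip(upper=100)   # cap the remaining values at 100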
Usecase and conclusions
Create a 2×2 table filled with integers, change the values of one specific row, and access several rows. For this I was planning to use a pandas DataFrame, but I was very disappointed with the performance. The conclusion shows an incompressible overhead for a pandas DataFrame when the data is a small table:
- pandas.loc = pandas.iloc = 115 µs
- pandas.iat = 5 µs (20 times faster, but only accesses one cell)
- numpy access = 0.5 µs (200 times faster, acceptable performance)
Am I using pandas DataFrame incorrectly? Is it meant only for massive tables of data? Given that my goal is a very simple MultiIndexation (type 1, type 2 and date), is there an existing data structure that offers performance similar to numpy arrays and is as easy to use as a pandas DataFrame?
The other option I am considering is to create my own class of MultiIndexed numpy arrays.
Code of the usecase
import pandas as pd
import numpy as np
from datetime import datetime

data = [[1, 2], [3, 4]]
array = np.array(data)
index = pd.DatetimeIndex([datetime(2000, 2, 2), datetime(2000, 2, 4)])
df = pd.DataFrame(data=data, index=index)
It takes around 0.1 ms with pandas using the loc function:
%timeit df.loc[datetime(2000,2,2)] = [12, 42]
115 µs ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
iloc has the same poor performance, although I would intuitively have thought it would be faster:
%timeit df.iloc[0] = [12, 42]
114 µs ± 3.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Testing iat, we are down to 5 µs, which is more acceptable
%timeit df.iat[0,0] = 42
5.03 µs ± 37.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Finally, testing the same behavior with a numpy array, we get excellent performance:
%timeit array[0,:] = [12, 42]
705 ns ± 4.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
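For what it's worth, here is a minimal sketch of the "own class" option mentioned above: a hypothetical wrapper that keeps the data in a plain numpy array and maps dates to row positions with a dict (the class name and interface are purely illustrative):
import numpy as np
from datetime import datetime

class TinyTable:
    """Hypothetical wrapper: numpy storage plus a date -> row-position lookup."""
    def __init__(self, data, dates):
        self._array = np.asarray(data)                    # plain numpy storage
        self._row = {d: i for i, d in enumerate(dates)}   # date -> row position

    def set_row(self, date, values):
        self._array[self._row[date], :] = values          # plain numpy row assignment

    def get_row(self, date):
        return self._array[self._row[date]]

table = TinyTable([[1, 2], [3, 4]], [datetime(2000, 2, 2), datetime(2000, 2, 4)])
table.set_row(datetime(2000, 2, 2), [12, 42])
print(table.get_row(datetime(2000, 2, 2)))                # [12 42]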
I often find myself looping over some long INPUT list (or dataframe, or dictionary). Per iteration I do some calculations on the input data, I then push the results into some OUTPUT data structure. Often the final output is a dataframe (since it is convenient to deal with).
Below are two methods that loop over a long list and aggregate some dummy results into a DataFrame. Approach 1 is very slow (~3 seconds per run). Approach 2 is much faster (~18 ms per run), but it is not ideal either, because it effectively "caches" data in a local file (and then relies on pandas to read that file back in very quickly). Ideally, we do everything in memory.
What approaches can people suggest to efficiently aggregate results? Bonus: And what if we don't know the exact size/length of our output structure (e.g. the actual output size may exceed the initial size estimate)? Any ideas appreciated.
import time
import pandas as pd
def run1(long_list):
    my_df = pd.DataFrame(columns=['A', 'B', 'C'])
    for el in long_list:
        my_df.loc[len(my_df)] = [el, el+1, 1/el]  # Dummy calculations; append one row per iteration
    return my_df
def run2(long_list):
    with open('my_file.csv', 'w') as f:
        f.write('A,B,C\n')
        for el in long_list:
            f.write(f'{el},{el+1},{1/el}\n')  # Dummy calculations
    return pd.read_csv('my_file.csv')
long_list = range(1, 2000)
%timeit df1 = run1(long_list) # 3 s ± 349 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = run2(long_list) # 18 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can do this by creating and then dropping a dummy input column and doing all of the calculations directly in pandas:
def func(long_list):
    my_df = pd.DataFrame(long_list, columns=['input'])
    my_df = my_df.assign(
        A=my_df.input,
        B=my_df.input + 1,
        C=1 / my_df.input)
    return my_df.drop('input', axis=1)
Comparing the times:
%timeit df1 = run1(long_list)
%timeit df2 = run2(long_list)
%timeit df3 = func(long_list)
3.81 s ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.54 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.19 ms ± 3.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pros:
All in memory
Really fast
Easy to read
Cons:
Probably not as fast as vectorized NumPy operations (see the sketch below)
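For comparison, here is a sketch of what such a vectorized version might look like (hypothetical func_np, assuming the input can be converted to a NumPy array up front):
import numpy as np
import pandas as pd

def func_np(long_list):
    arr = np.asarray(long_list, dtype=float)   # vectorize the input once
    # each column is computed as a whole-array operation
    return pd.DataFrame({'A': arr, 'B': arr + 1, 'C': 1 / arr})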
You can directly build a DataFrame from a list of lists:
def run3(long_list):
    return pd.DataFrame([[el, el+1, 1/el] for el in long_list],
                        columns=['A', 'B', 'C'])
It should be much faster than the first one, and still faster than the second one, because it does not use disk I/O. Because the intermediate list grows dynamically, it also covers the bonus case where the final output size is not known in advance.
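A closely related in-memory pattern (a sketch; run4 is just an illustrative name) collects each column in a plain Python list and builds the DataFrame once at the end. Like run3, the lists grow dynamically, so the final output size does not need to be known up front:
def run4(long_list):
    a, b, c = [], [], []
    for el in long_list:
        a.append(el)        # column A
        b.append(el + 1)    # column B
        c.append(1 / el)    # column C
    return pd.DataFrame({'A': a, 'B': b, 'C': c})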
Suppose I have a DataFrame such as:
df = pd.DataFrame(np.random.randn(10,5), columns = ['a','b','c','d','e'])
and I would like to retrieve the last value in column e. I could do:
df['e'].tail(1)
but this would return a Series that still carries the index 9. Ideally, I just want to obtain the value as a number that I can work with directly. I could also do:
np.array(df['e'].tail(1))
but this would then require me to access the 0th element of it before I can really work with it. Is there a more direct/easy way to do this?
You could try the iloc method of the DataFrame:
In [26]: df
Out[26]:
a b c d e
0 -1.079547 -0.722903 0.457495 -0.687271 -0.787058
1 1.326133 1.359255 -0.964076 -1.280502 1.460792
2 0.479599 -1.465210 -0.058247 -0.984733 -0.348068
3 -0.608238 -1.238068 -0.126889 0.572662 -1.489641
4 -1.533707 -0.218298 -0.877619 0.679370 0.485987
5 -0.864651 -0.180165 -0.528939 0.270885 1.313946
6 0.747612 -1.206509 0.616815 -1.758354 -0.158203
7 -2.309582 -0.739730 -0.004303 0.125640 -0.973230
8 1.735822 -0.750698 1.225104 0.431583 -1.483274
9 -0.374557 -1.132354 0.875028 0.032615 -1.131971
In [27]: df['e'].iloc[-1]
Out[27]: -1.1319705662711321
Or if you just want the scalar, you could use iat, which is faster. From the docs:
If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures
In [28]: df.e.iat[-1]
Out[28]: -1.1319705662711321
Benchmarking:
In [31]: %timeit df.e.iat[-1]
100000 loops, best of 3: 18 µs per loop
In [32]: %timeit df.e.iloc[-1]
10000 loops, best of 3: 24 µs per loop
Try
df['e'].iloc[[-1]]
(note that this returns a one-element Series rather than a scalar). Sometimes,
df['e'].iloc[-1]
doesn't work.
We can also access it by indexing into df.index and using at:
df.at[df.index[-1], 'e']
It is faster than iloc, but slower than access without the extra df.index[-1] lookup.
If we decide to assign a value to the last element in column "e", the above method is much faster than the other two options (9-11 times faster):
>>> %timeit df.at[df.index[-1], 'e'] = 1
11.5 µs ± 355 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit df['e'].iat[-1] = 1
107 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit df['e'].iloc[-1] = 1
127 µs ± 7.13 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)