I am just trying to calculate the percentage of one column against another column's total, but I am unsure how to do this in pandas so that the calculation gets added as a new column.
Let's say, for argument's sake, my data frame has two attributes:
Number of Green Marbles
Total Number of Marbles
Now, how would I calculate the percentage of the Number of Green Marbles out of the Total Number of Marbles in Pandas?
Obviously, I know that the calculation will be something like this:
(Number of Green Marbles / Total Number of Marbles) * 100
Thanks - any help is much appreciated!
By default, arithmetic operations on pandas dataframes are element-wise, so this is as simple as it can be:
>>> import pandas as pd
>>> d = pd.DataFrame()
>>> d['green'] = [3,5,10,12]
>>> d['total'] = [8,8,20,20]
>>> d
green total
0 3 8
1 5 8
2 10 20
3 12 20
>>> d['percent_green'] = d['green'] / d['total'] * 100
>>> d
green total percent_green
0 3 8 37.5
1 5 8 62.5
2 10 20 50.0
3 12 20 60.0
References:
pandas.DataFrame.div documentation;
Adding new column to existing dataframe in python pandas?
df['percentage columns'] = df['Number of Green Marbles'] / df['Total Number of Marbles'] * 100
Here is my comparison of the plain operator form vs the explicit .div() method (both are vectorized):
%timeit us_consum['Commercial_%ofUS'] = us_consum['Commercial_MWhrs']*100/us_consum['Total US consumption (MWhr)']
351 µs ± 22.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit us_consum['Commercial_%ofUS'] = (us_consum['Commercial_MWhrs'].div(us_consum['Total US consumption (MWhr)']))*100
337 µs ± 60.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
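For reference, a minimal sketch (with made-up column names) showing that the two forms are element-wise equivalent; .div() is mainly worth reaching for when you need extras such as fill_value:

import pandas as pd

df = pd.DataFrame({'part': [3, 5, 10], 'total': [8, 8, 20]})

# the operator form and the .div() form compute the same element-wise percentages
pct_op = df['part'] / df['total'] * 100
pct_div = df['part'].div(df['total']) * 100

assert pct_op.equals(pct_div)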
I have a df which contains categorical and numerical data:
df = {'Name': ['Tom', 'nick', 'krish', 'jack'],
      'Address': ['Oxford', 'Cambridge', 'Xianjiang', 'Wuhan'],
      'Age': [20, 21, 19, 18],
      'Weight': [50, 61, 69, 78]}
df = pd.DataFrame(df)
I need to randomly replace 50% of the values in each column with NaN.
How can I do that with the most efficient technique? I have a large number of rows and columns, and I'll be doing many repetitions.
Use apply with sample
df_final = df.apply(lambda x: x.sample(frac=0.5)).reindex(df.index)
Out[175]:
Name Address Age Weight
0 Tom NaN NaN 50.0
1 NaN NaN NaN 61.0
2 krish Xianjiang 19.0 NaN
3 NaN Wuhan 18.0 NaN
Improving the performance of the previous answers by roughly a factor of three, and mostly inspired by @jezrael's approach, I suggest using argpartition instead of argsort, since the full sort is wasted work: argpartition still returns a permutation of the row positions in each column, so comparing against df.shape[0] // 2 still masks exactly half of the rows per column:
df1 = df.mask(np.random.rand(*df.shape).argpartition(0, axis=0) >= df.shape[0] // 2)
print(df1)
Name Address Age Weight
0 NaN Oxford NaN 50.0
1 nick Cambridge 21.0 61.0
2 NaN NaN NaN NaN
3 jack NaN 18.0 NaN
Performance comparison
# Reusing the same comparison dataset
df = pd.concat([df] * 50000, ignore_index=True)
df = pd.concat([df] * 50, ignore_index=True, axis=1)
# @Andy's answer, using apply and sample
%timeit df.apply(lambda x: x.sample(frac=0.5)).reindex(df.index)
9.72 s ± 532 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# @jezrael's answer, based on mask, np.random and argsort
%timeit df.mask(np.random.rand(*df.shape).argsort(axis=0) >= df.shape[0] // 2)
8.23 s ± 732 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# This answer, based on mask, np.random and argpartition
%timeit df.mask(np.random.rand(*df.shape).argpartition(0, axis=0) >= df.shape[0] // 2)
2.54 s ± 98.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It can also be done by generating random numbers in the range of your row positions, looping over them, and using each number as the row index at which to replace a value with NaN.
For example, if you have 10 rows, set the random number generator's range to 0 to 9 and use each generated number as the row index to replace with NaN.
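A minimal sketch of that idea (the helper name and the use of DataFrame.mask are my own additions; the vectorized answers above will generally be faster): draw distinct random row positions for each column and mask those cells:

import numpy as np
import pandas as pd

def mask_random_cells(df, frac=0.5, seed=None):
    # return a copy of df with a random `frac` of each column's rows set to NaN
    rng = np.random.default_rng(seed)
    n = len(df)
    k = int(n * frac)
    mask = np.zeros(df.shape, dtype=bool)
    for j in range(df.shape[1]):
        rows = rng.choice(n, size=k, replace=False)  # distinct random row positions
        mask[rows, j] = True
    return df.mask(mask)  # masked cells become NaN

df = pd.DataFrame({'Name': ['Tom', 'nick', 'krish', 'jack'],
                   'Age': [20, 21, 19, 18]})
print(mask_random_cells(df, frac=0.5, seed=0))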
I was wondering if there is a possibility of calling idxmin and min at the same time (in the same call/loop).
Assuming the following dataframe:
id option_1 option_2 option_3 option_4
0 0 10.0 NaN NaN 110.0
1 1 NaN 20.0 200.0 NaN
2 2 NaN 300.0 30.0 NaN
3 3 400.0 NaN NaN 40.0
4 4 600.0 700.0 50.0 50.0
I would like to calculate the minimum value (min) and the column that contains it (idxmin) of the option_ series:
id option_1 option_2 option_3 option_4 min_column min_value
0 0 10.0 NaN NaN 110.0 option_1 10.0
1 1 NaN 20.0 200.0 NaN option_2 20.0
2 2 NaN 300.0 30.0 NaN option_3 30.0
3 3 400.0 NaN NaN 40.0 option_4 40.0
4 4 600.0 700.0 50.0 50.0 option_3 50.0
Obviously, I can call idxmin and min separately (one after the other, see the example below), but is there a way of making this more efficient without searching the matrix twice (once for the value and once for the index)?
An example calling min and idxmin
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': [0,1,2,3,4],
    'option_1': [10, np.nan, np.nan, 400, 600],
    'option_2': [np.nan, 20, 300, np.nan, 700],
    'option_3': [np.nan, 200, 30, np.nan, 50],
    'option_4': [110, np.nan, np.nan, 40, 50],
})
df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.filter(like='option').min(1)
(I expected this would be suboptimal as the search is performed twice.)
Transpose then agg
df.set_index('id').T.agg(['min', 'idxmin']).T
min idxmin
0 10 option_1
1 20 option_2
2 30 option_3
3 40 option_4
4 50 option_3
Numpy v1
d_ = df.set_index('id')
v = d_.values
pd.DataFrame(dict(
    Min=np.nanmin(v, axis=1),
    Idxmin=d_.columns[np.nanargmin(v, axis=1)]
), d_.index)
Idxmin Min
id
0 option_1 10.0
1 option_2 20.0
2 option_3 30.0
3 option_4 40.0
4 option_3 50.0
Numpy v2
col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
pd.DataFrame(dict(
    Min=np.nanmin(v, axis=1),
    IdxMin=options[np.nanargmin(v, axis=1)]
))
Full Simulation
Conclusion
The Numpy solutions are fastest.
Results (times are relative to the fastest method in each row; the row index is the number of rows)
10 columns
pir_agg_1 pir_agg_2 pir_agg_3 wen_agg_1 tot_agg_1 tot_agg_2
10 12.465358 1.272584 1.0 5.978435 2.168994 2.164858
30 26.538924 1.305721 1.0 5.331755 2.121342 2.193279
100 80.304708 1.277684 1.0 7.221127 2.215901 2.365835
300 230.009000 1.338177 1.0 5.869560 2.505447 2.576457
1000 661.432965 1.249847 1.0 8.931438 2.940030 3.002684
3000 1757.339186 1.349861 1.0 12.541915 4.656864 4.961188
10000 3342.701758 1.724972 1.0 15.287138 6.589233 6.782102
100 columns
pir_agg_1 pir_agg_2 pir_agg_3 wen_agg_1 tot_agg_1 tot_agg_2
10 8.008895 1.000000 1.977989 5.612195 1.727308 1.769866
30 18.798077 1.000000 1.855291 4.350982 1.618649 1.699162
100 56.725786 1.000000 1.877474 6.749006 1.780816 1.850991
300 132.306699 1.000000 1.535976 7.779359 1.707254 1.721859
1000 253.771648 1.000000 1.232238 12.224478 1.855549 1.639081
3000 346.999495 2.246106 1.000000 21.114310 1.893144 1.626650
10000 431.135940 2.095874 1.000000 32.588886 2.203617 1.793076
Functions
def pir_agg_1(df):
    return df.set_index('id').T.agg(['min', 'idxmin']).T

def pir_agg_2(df):
    d_ = df.set_index('id')
    v = d_.values
    return pd.DataFrame(dict(
        Min=np.nanmin(v, axis=1),
        IdxMin=d_.columns[np.nanargmin(v, axis=1)]
    ))

def pir_agg_3(df):
    col_mask = df.columns.str.startswith('option')
    options = df.columns[col_mask]
    v = np.column_stack([*map(df.get, options)])
    return pd.DataFrame(dict(
        Min=np.nanmin(v, axis=1),
        IdxMin=options[np.nanargmin(v, axis=1)]
    ))

def wen_agg_1(df):
    v = df.filter(like='option')
    d = v.stack().sort_values().groupby(level=0).head(1).reset_index(level=1)
    d.columns = ['IdxMin', 'Min']
    return d

def tot_agg_1(df):
    """I combined toto_tico's 2 filter calls into one"""
    d = df.filter(like='option')
    return df.assign(
        IdxMin=d.idxmin(1),
        Min=d.min(1)
    )

def tot_agg_2(df):
    d = df.filter(like='option')
    idxmin = d.idxmin(1)
    return df.assign(
        IdxMin=idxmin,
        Min=d.lookup(d.index, idxmin)
    )
Sim setup
from timeit import timeit

def sim_df(n, m):
    return pd.DataFrame(
        np.random.randint(m, size=(n, m))
    ).rename_axis('id').add_prefix('option').reset_index()

fs = 'pir_agg_1 pir_agg_2 pir_agg_3 wen_agg_1 tot_agg_1 tot_agg_2'.split()
ix = [10, 30, 100, 300, 1000, 3000, 10000]

res_small_col = pd.DataFrame(index=ix, columns=fs, dtype=float)
res_large_col = pd.DataFrame(index=ix, columns=fs, dtype=float)

for i in ix:
    df = sim_df(i, 10)
    for j in fs:
        stmt = f"{j}(df)"
        setp = f"from __main__ import {j}, df"
        res_small_col.at[i, j] = timeit(stmt, setp, number=10)

for i in ix:
    df = sim_df(i, 100)
    for j in fs:
        stmt = f"{j}(df)"
        setp = f"from __main__ import {j}, df"
        res_large_col.at[i, j] = timeit(stmt, setp, number=10)
Maybe using stack with groupby
v=df.filter(like='option')
v.stack().sort_values().groupby(level=[0]).head(1).reset_index(level=1)
Out[313]:
level_1 0
0 option_1 10.0
1 option_2 20.0
2 option_3 30.0
3 option_4 40.0
4 option_3 50.0
UPDATE 2:
The numpy solution by @piRSquared is the winner for what I would consider the most common cases. Here is his answer with a minimal modification to assign the columns to the original dataframe (which I did in all my tests, to be consistent with the example in the original question):
col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
df.assign(min_value = np.nanmin(v, axis=1),
          min_column = options[np.nanargmin(v, axis=1)])
Be careful if you have a lot of columns (more than 10000), since in these extreme cases the results can start to change significantly.
UPDATE 1:
According to my tests, calling min and idxmin separately is the fastest of all the proposed answers.
Although it is not done in a single call (see the direct answer below), you may be better off using DataFrame.lookup on the column of indexes (the min_column column), in order to avoid the second search for the values (min_value).
So, instead of traversing the entire matrix - which is O(n*m) - you would only traverse the resulting min_column series - which is O(n):
df = pd.DataFrame({
    'id': [0,1,2,3,4],
    'option_1': [10, np.nan, np.nan, 400, 600],
    'option_2': [np.nan, 20, 300, np.nan, 700],
    'option_3': [np.nan, 200, 30, np.nan, 50],
    'option_4': [110, np.nan, np.nan, 40, 50],
})
df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.lookup(df.index, df['min_column'])
Direct answer (not as efficient)
Since you asked about how to calculate the values "in the same call" (let's say because you simplified your example for the question), you can try a lambda expression:
def min_idxmin(x):
    _idx = x.idxmin()
    return _idx, x[_idx]

df['min_column'], df['min_value'] = zip(*df.filter(like='option').apply(
    lambda x: min_idxmin(x), axis=1))
To be clear, although the second search is removed here (replaced by a direct access via x[_idx]), this will most likely take much longer, because you are not exploiting the vectorized operations of pandas/numpy.
Bottom line: pandas/numpy vectorized operations are very fast.
Summary of the summary:
There doesn't seem to be any advantage in using df.lookup; calling min and idxmin separately is better than using lookup, which is surprising and deserves a question of its own.
Summary of the timings:
I tested a dataframe with 10000 rows and 10 columns (the option_ sequence from the initial example). Since I got a couple of unexpected results, I then also tested with 1000x1000 and 100x10000. According to the results:
Using numpy as @piRSquared suggested (test8) is the clear winner; it only starts performing worse when there are a lot of columns (the 100x10000 case, which hardly argues against its general use). test9 modifies it to use indexing in numpy instead of a second search, but it generally performs worse.
Calling min and idxmin separately was the best for the 10000x10 case, even better than DataFrame.lookup (although the DataFrame.lookup result performed better in the 100x10000 case). Although the shape of the data influences the results, I would argue that having 10000 columns is a bit unrealistic.
The solution provided by @Wen followed in performance, though it was not better than calling idxmin and min separately, or using DataFrame.lookup. I did an extra test (see test7()) because I felt that the additional operations (reset_index and zip) might be distorting the result. It was still worse than test1 and test2, even though it does not do the assignment (I couldn't figure out how to make the assignment using head(1)). @Wen, would you mind giving me a hand?
@Wen's solution underperforms when there are more columns (1000x1000 or 100x10000), which makes sense because sorting is slower than searching. In this case, the lambda expression that I suggested performs better.
Any other solution with a lambda expression, or that uses the transpose (T), falls behind. The lambda expression that I suggested took around 1 second, better than the ~11 s using the transpose suggested by @piRSquared and @RafaelC.
TimeIt results with 10000 rows x 10 columns (pandas 0.23.4):
Using the following dataframe of 10000 rows and 10 columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10000, 10)), columns=[f'option_{x}' for x in range(1,11)]).reset_index()
Computing the two columns separately (calling filter twice):
def test1():
    df['min_column'] = df.filter(like='option').idxmin(1)
    df['min_value'] = df.filter(like='option').min(1)
%timeit -n 100 test1()
13 ms ± 580 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Calling the lookup (it is slower for this case!):
def test2():
    df['min_column'] = df.filter(like='option').idxmin(1)
    df['min_value'] = df.lookup(df.index, df['min_column'])
%timeit -n 100 test2()
# 15.7 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using apply and min_idxmin(x):
def min_idxmin(x):
    _idx = x.idxmin()
    return _idx, x[_idx]

def test3():
    df['min_column'], df['min_value'] = zip(*df.filter(like='option').apply(
        lambda x: min_idxmin(x), axis=1))
%timeit -n 10 test3()
# 968 ms ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using agg['min', 'idxmin'] by @piRSquared:
def test4():
    df['min_value'], df['min_column'] = zip(*df.set_index('index').filter(like='option').T.agg(['min', 'idxmin']).T.values)
%timeit -n 1 test4()
# 11.2 s ± 850 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using agg['min', 'idxmin'] by @RafaelC:
def test5():
    df['min_value'], df['min_column'] = zip(*df.filter(like='option').agg(lambda x: x.agg(['min', 'idxmin']), axis=1).values)
%timeit -n 1 test5()
# 11.7 s ± 597 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sorting values by @Wen:
def test6():
    df['min_column'], df['min_value'] = zip(*df.filter(like='option').stack().sort_values().groupby(level=[0]).head(1).reset_index(level=1).values)
%timeit -n 100 test6()
# 33.6 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Sorting values by @Wen, modified by me to make the comparison fairer by removing the overhead of the assignment operation (I explained why in the summary at the beginning):
def test7():
    df.filter(like='option').stack().sort_values().groupby(level=[0]).head(1)
%timeit -n 100 test7()
# 25 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy:
def test8():
    col_mask = df.columns.str.startswith('option')
    options = df.columns[col_mask]
    v = np.column_stack([*map(df.get, options)])
    df.assign(min_value = np.nanmin(v, axis=1),
              min_column = options[np.nanargmin(v, axis=1)])
%timeit -n 100 test8()
# 2.76 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy but avoiding the search (indexing instead):
def test9():
    col_mask = df.columns.str.startswith('option')
    options = df.columns[col_mask]
    v = np.column_stack([*map(df.get, options)])
    idxmin = np.nanargmin(v, axis=1)
    # instead of searching for the minimum values again, the indexes are used
    df.assign(min_value = v[range(v.shape[0]), idxmin],
              min_column = options[idxmin])
%timeit -n 100 test9()
# 3.96 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
TimeIt results with 1000 rows x 1000 columns:
I performed more tests with a 1000x1000 shape:
df = pd.DataFrame(np.random.randint(0,100,size=(1000, 1000)), columns=[f'option_{x}' for x in range(1,1001)]).reset_index()
The results change:
test1 ~27.6ms
test2 ~29.4ms
test3 ~135ms
test4 ~1.18s
test5 ~1.29s
test6 ~287ms
test7 ~290ms
test8 ~25.7ms
test9 ~26.1ms
TimeIt results with 100 rows x 10000 columns:
I performed more tests with a 100x10000 shape:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 10000)), columns=[f'option_{x}' for x in range(1,10001)]).reset_index()
The results change:
test1 ~46.8ms
test2 ~25.6ms
test3 ~101ms
test4 ~289ms
test5 ~276ms
test6 ~349ms
test7 ~301ms
test8 ~121ms
test9 ~122ms
I have a pandas dataframe with 1 million rows. I want to replace the values in 900,000 rows of a column with another set of values. Is there a fast way to do this without a for loop (which takes me two days to complete)?
For example, look at this sample dataframe, where I have condensed 1 million rows to 8 rows:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['a'] = [-1,-3,-4,-4,-3, 4,5,6]
df['b'] = [23,45,67,89,0,-1, 2, 3]
L2 = [-1,-3,-4]
L5 = [9,10,11]
I want to replace the values where a is -1, -3, or -4 in a single shot if possible, or as fast as possible without a for loop.
The crucial part is that values in L5 have to be repeated as needed.
I have tried
df.loc[df.a < 0, 'a'] = L5
but this works only when len(L5) equals the number of selected rows.
Use map with a dictionary created from both lists by zip; then, for values that were not matched, fall back to the originals with fillna:
d = dict(zip(L2, L5))
print (d)
{-1: 9, -3: 10, -4: 11}
df['a'] = df['a'].map(d).fillna(df['a'])
print (df)
a b
0 9.0 23
1 10.0 45
2 11.0 67
3 11.0 89
4 10.0 0
5 4.0 -1
6 5.0 2
7 6.0 3
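One caveat: map followed by fillna upcasts the column to float (hence the 9.0, 10.0, ... in the output above), because the unmatched positions are temporarily NaN. If keeping the integer dtype matters, a final cast along these lines should restore it:

df['a'] = df['a'].map(d).fillna(df['a']).astype(int)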
Performance:
It depends on the number of values to replace and on the length of the lists:
If the length of the lists is 100:
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'a':np.random.randint(1000, size=N)})
L2 = np.arange(100)
L5 = np.arange(100) + 10
In [336]: %timeit df['d'] = np.select([df['a'] == i for i in L2], L5, df['a'])
180 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [337]: %timeit df['a'].map(dict(zip(L2, L5))).fillna(df['a'])
56.9 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If the length of the lists is small (e.g. 3):
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'a':np.random.randint(100, size=N)})
L2 = np.arange(3)
L5 = np.arange(3) + 10
In [339]: %timeit df['d'] = np.select([df['a'] == i for i in L2], L5, df['a'])
11.9 ms ± 40.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [340]: %timeit df['a'].map(dict(zip(L2, L5))).fillna(df['a'])
54 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can use np.select:
import numpy as np
condition = [df['a'] == i for i in L2]
df['a'] = np.select(condition, L5, df['a'])
and you get:
a b
0 9 23
1 10 45
2 11 67
3 11 89
4 10 0
5 4 -1
6 5 2
7 6 3
Timing: let's create a bigger dataframe from your df:
df_l = pd.concat([df]*10000)
print (df_l.shape)
(80000, 2)
Now some timeit:
# with map, by @jezrael
d = dict(zip(L2, L5))
%timeit df_l['a'].map(d).fillna(df_l['a'])
100 loops, best of 3: 7.71 ms per loop
# with np.select
condition = [df_l['a'] == i for i in L2]
%timeit np.select(condition, L5, df_l['a'])
1000 loops, best of 3: 350 µs per loop
I am trying to use the values within the current dataframe as the index, and the dataframe's index as the labels. For example:
Value1 Value2
0 0 1
1 2 4
2 NaN 3
This would result in
Labels
0 0
1 0
2 1
3 2
4 1
Currently I manage to do this using a loop to check and apply the necessary labels/values, but with millions of labels to mark, this process becomes extremely time consuming. Is there a smarter and quicker way to do this? Thanks in advance.
Use stack with DataFrame constructor:
s = df.stack()
df = pd.DataFrame(s.index.get_level_values(0).values,
                  columns=['Labels'],
                  index=s.values.astype(int)).sort_index()
print (df)
Labels
0 0
1 0
2 1
3 2
4 1
Detail:
print (df.stack())
0 Value1 0.0
Value2 1.0
1 Value1 2.0
Value2 4.0
2 Value2 3.0
dtype: float64
Came up with a really good one (thanks to the collective effort of the pandas community). This one should be fast.
It uses the power and flexibility of repeat and ravel to flatten your data.
s = pd.Series(df.index.repeat(2), index=df.values.ravel())
s[s.index.notnull()].sort_index()
0.0 0
1.0 0
2.0 1
3.0 2
4.0 1
dtype: int64
A subsequent conversion of the filtered result gives an integer index:
s = s[s.index.notnull()].sort_index()
s.index = s.index.astype(int)
A similar solution (slightly faster depending on your data), which also results in an integer index, performs the filtering before converting to a Series:
v = df.index.repeat(df.shape[1])
i = df.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
s
0 0
1 0
2 1
3 2
4 1
dtype: int64
Performance
df2 = pd.concat([df] * 10000, ignore_index=True)
# jezrael's solution
%%timeit
s = df2.stack()
pd.DataFrame(s.index.get_level_values(0).values,
             columns=['Labels'],
             index=s.values.astype(int)).sort_index()
4.57 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(df2.index.repeat(2), index=df2.values.ravel())
s[s.index.notnull()].sort_index()
3.12 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
v = df2.index.repeat(df2.shape[1])
i = df2.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
3.1 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)