I am just trying to calculate the percentage of one column against another column's total, but I am unsure how to do this in pandas so that the calculation gets added as a new column.
Let's say, for argument's sake, my data frame has two attributes:
Number of Green Marbles
Total Number of Marbles
Now, how would I calculate the percentage of the Number of Green Marbles out of the Total Number of Marbles in Pandas?
Obviously, I know that the calculation will be something like this:
(Number of Green Marbles / Total Number of Marbles) * 100
Thanks - any help is much appreciated!
By default, arithmetic operations on pandas dataframes are element-wise, so this is as simple as it can be:
>>> import pandas as pd
>>> d = pd.DataFrame()
>>> d['green'] = [3,5,10,12]
>>> d['total'] = [8,8,20,20]
>>> d
green total
0 3 8
1 5 8
2 10 20
3 12 20
>>> d['percent_green'] = d['green'] / d['total'] * 100
>>> d
green total percent_green
0 3 8 37.5
1 5 8 62.5
2 10 20 50.0
3 12 20 60.0
References:
pandas.DataFrame.div documentation;
Adding new column to existing dataframe in python pandas?
df['percentage columns'] = df['Number of Green Marbles'] / df['Total Number of Marbles'] * 100
Here is my comparison of the plain operator form vs the explicit .div() method (both are vectorized):
%timeit us_consum['Commercial_%ofUS'] = us_consum['Commercial_MWhrs']*100/us_consum['Total US consumption (MWhr)']
351 µs ± 22.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit us_consum['Commercial_%ofUS'] = (us_consum['Commercial_MWhrs'].div(us_consum['Total US consumption (MWhr)']))*100
337 µs ± 60.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
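For reference, a minimal sketch (with made-up column names) showing that the two forms are element-wise equivalent; .div() is mainly worth reaching for when you need extras such as fill_value:

import pandas as pd

df = pd.DataFrame({'part': [3, 5, 10], 'total': [8, 8, 20]})

# the operator form and the .div() form compute the same element-wise percentages
pct_op = df['part'] / df['total'] * 100
pct_div = df['part'].div(df['total']) * 100

assert pct_op.equals(pct_div)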
I have a df which contains categorical and numerical data:
df = {'Name': ['Tom', 'nick', 'krish', 'jack'],
      'Address': ['Oxford', 'Cambridge', 'Xianjiang', 'Wuhan'],
      'Age': [20, 21, 19, 18],
      'Weight': [50, 61, 69, 78]}
df = pd.DataFrame(df)
I need to randomly replace 50% of the values in each column with NaN.
How can I do that with the most efficient technique? I have a large number of rows and columns, and I'll be doing many repetitions.
Use apply with sample
df_final = df.apply(lambda x: x.sample(frac=0.5)).reindex(df.index)
Out[175]:
Name Address Age Weight
0 Tom NaN NaN 50.0
1 NaN NaN NaN 61.0
2 krish Xianjiang 19.0 NaN
3 NaN Wuhan 18.0 NaN
Improving the performance of the previous answers by roughly a factor of three, and mostly inspired by @jezrael's approach, I suggest using argpartition instead of argsort, since the full sort is wasted work: argpartition still returns a permutation of the row positions in each column, so comparing against df.shape[0] // 2 still masks exactly half of the rows per column:
df1 = df.mask(np.random.rand(*df.shape).argpartition(0, axis=0) >= df.shape[0] // 2)
print(df1)
Name Address Age Weight
0 NaN Oxford NaN 50.0
1 nick Cambridge 21.0 61.0
2 NaN NaN NaN NaN
3 jack NaN 18.0 NaN
Performance comparison
# Reusing the same comparison dataset
df = pd.concat([df] * 50000, ignore_index=True)
df = pd.concat([df] * 50, ignore_index=True, axis=1)
# @Andy's answer, using apply and sample
%timeit df.apply(lambda x: x.sample(frac=0.5)).reindex(df.index)
9.72 s ± 532 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# @jezrael's answer, based on mask, np.random and argsort
%timeit df.mask(np.random.rand(*df.shape).argsort(axis=0) >= df.shape[0] // 2)
8.23 s ± 732 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# This answer, based on mask, np.random and argpartition
%timeit df.mask(np.random.rand(*df.shape).argpartition(0, axis=0) >= df.shape[0] // 2)
2.54 s ± 98.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It can also be done by generating random numbers in the range of your row positions, looping over them, and using each number as the row index at which to replace a value with NaN.
For example, if you have 10 rows, set the random number generator's range to 0 to 9 and use each generated number as the row index to replace with NaN.
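A minimal sketch of that idea (the helper name and the use of DataFrame.mask are my own additions; the vectorized answers above will generally be faster): draw distinct random row positions for each column and mask those cells:

import numpy as np
import pandas as pd

def mask_random_cells(df, frac=0.5, seed=None):
    # return a copy of df with a random `frac` of each column's rows set to NaN
    rng = np.random.default_rng(seed)
    n = len(df)
    k = int(n * frac)
    mask = np.zeros(df.shape, dtype=bool)
    for j in range(df.shape[1]):
        rows = rng.choice(n, size=k, replace=False)  # distinct random row positions
        mask[rows, j] = True
    return df.mask(mask)  # masked cells become NaN

df = pd.DataFrame({'Name': ['Tom', 'nick', 'krish', 'jack'],
                   'Age': [20, 21, 19, 18]})
print(mask_random_cells(df, frac=0.5, seed=0))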
I was wondering if there is a possibility of calling idxmin and min at the same time (in the same call/loop).
Assuming the following dataframe:
id option_1 option_2 option_3 option_4
0 0 10.0 NaN NaN 110.0
1 1 NaN 20.0 200.0 NaN
2 2 NaN 300.0 30.0 NaN
3 3 400.0 NaN NaN 40.0
4 4 600.0 700.0 50.0 50.0
I would like to calculate the minimum value (min) and the column that contains it (idxmin) of the option_ series:
id option_1 option_2 option_3 option_4 min_column min_value
0 0 10.0 NaN NaN 110.0 option_1 10.0
1 1 NaN 20.0 200.0 NaN option_2 20.0
2 2 NaN 300.0 30.0 NaN option_3 30.0
3 3 400.0 NaN NaN 40.0 option_4 40.0
4 4 600.0 700.0 50.0 50.0 option_3 50.0
Obviously, I can call idxmin and min separately (one after the other, see the example below), but is there a way of making this more efficient without searching the matrix twice (once for the value and once for the index)?
An example calling min and idxmin
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': [0,1,2,3,4],
    'option_1': [10, np.nan, np.nan, 400, 600],
    'option_2': [np.nan, 20, 300, np.nan, 700],
    'option_3': [np.nan, 200, 30, np.nan, 50],
    'option_4': [110, np.nan, np.nan, 40, 50],
})
df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.filter(like='option').min(1)
(I expected this would be suboptimal as the search is performed twice.)
Transpose then agg
df.set_index('id').T.agg(['min', 'idxmin']).T
min idxmin
0 10 option_1
1 20 option_2
2 30 option_3
3 40 option_4
4 50 option_3
Numpy v1
d_ = df.set_index('id')
v = d_.values
pd.DataFrame(dict(
    Min=np.nanmin(v, axis=1),
    Idxmin=d_.columns[np.nanargmin(v, axis=1)]
), d_.index)
Idxmin Min
id
0 option_1 10.0
1 option_2 20.0
2 option_3 30.0
3 option_4 40.0
4 option_3 50.0
Numpy v2
col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
pd.DataFrame(dict(
    Min=np.nanmin(v, axis=1),
    IdxMin=options[np.nanargmin(v, axis=1)]
))
Full Simulation
Conclusion
The Numpy solutions are fastest.
Results (times are relative to the fastest method in each row; the row index is the number of rows)
10 columns
pir_agg_1 pir_agg_2 pir_agg_3 wen_agg_1 tot_agg_1 tot_agg_2
10 12.465358 1.272584 1.0 5.978435 2.168994 2.164858
30 26.538924 1.305721 1.0 5.331755 2.121342 2.193279
100 80.304708 1.277684 1.0 7.221127 2.215901 2.365835
300 230.009000 1.338177 1.0 5.869560 2.505447 2.576457
1000 661.432965 1.249847 1.0 8.931438 2.940030 3.002684
3000 1757.339186 1.349861 1.0 12.541915 4.656864 4.961188
10000 3342.701758 1.724972 1.0 15.287138 6.589233 6.782102
100 columns
pir_agg_1 pir_agg_2 pir_agg_3 wen_agg_1 tot_agg_1 tot_agg_2
10 8.008895 1.000000 1.977989 5.612195 1.727308 1.769866
30 18.798077 1.000000 1.855291 4.350982 1.618649 1.699162
100 56.725786 1.000000 1.877474 6.749006 1.780816 1.850991
300 132.306699 1.000000 1.535976 7.779359 1.707254 1.721859
1000 253.771648 1.000000 1.232238 12.224478 1.855549 1.639081
3000 346.999495 2.246106 1.000000 21.114310 1.893144 1.626650
10000 431.135940 2.095874 1.000000 32.588886 2.203617 1.793076
Functions
def pir_agg_1(df):
    return df.set_index('id').T.agg(['min', 'idxmin']).T

def pir_agg_2(df):
    d_ = df.set_index('id')
    v = d_.values
    return pd.DataFrame(dict(
        Min=np.nanmin(v, axis=1),
        IdxMin=d_.columns[np.nanargmin(v, axis=1)]
    ))

def pir_agg_3(df):
    col_mask = df.columns.str.startswith('option')
    options = df.columns[col_mask]
    v = np.column_stack([*map(df.get, options)])
    return pd.DataFrame(dict(
        Min=np.nanmin(v, axis=1),
        IdxMin=options[np.nanargmin(v, axis=1)]
    ))

def wen_agg_1(df):
    v = df.filter(like='option')
    d = v.stack().sort_values().groupby(level=0).head(1).reset_index(level=1)
    d.columns = ['IdxMin', 'Min']
    return d

def tot_agg_1(df):
    """I combined toto_tico's 2 filter calls into one"""
    d = df.filter(like='option')
    return df.assign(
        IdxMin=d.idxmin(1),
        Min=d.min(1)
    )

def tot_agg_2(df):
    d = df.filter(like='option')
    idxmin = d.idxmin(1)
    return df.assign(
        IdxMin=idxmin,
        Min=d.lookup(d.index, idxmin)
    )
Sim setup
from timeit import timeit

def sim_df(n, m):
    return pd.DataFrame(
        np.random.randint(m, size=(n, m))
    ).rename_axis('id').add_prefix('option').reset_index()

fs = 'pir_agg_1 pir_agg_2 pir_agg_3 wen_agg_1 tot_agg_1 tot_agg_2'.split()
ix = [10, 30, 100, 300, 1000, 3000, 10000]

res_small_col = pd.DataFrame(index=ix, columns=fs, dtype=float)
res_large_col = pd.DataFrame(index=ix, columns=fs, dtype=float)

for i in ix:
    df = sim_df(i, 10)
    for j in fs:
        stmt = f"{j}(df)"
        setp = f"from __main__ import {j}, df"
        res_small_col.at[i, j] = timeit(stmt, setp, number=10)

for i in ix:
    df = sim_df(i, 100)
    for j in fs:
        stmt = f"{j}(df)"
        setp = f"from __main__ import {j}, df"
        res_large_col.at[i, j] = timeit(stmt, setp, number=10)
Maybe using stack with groupby
v=df.filter(like='option')
v.stack().sort_values().groupby(level=[0]).head(1).reset_index(level=1)
Out[313]:
level_1 0
0 option_1 10.0
1 option_2 20.0
2 option_3 30.0
3 option_4 40.0
4 option_3 50.0
UPDATE 2:
The numpy solution by @piRSquared is the winner for what I would consider the most common cases. Here is his answer with a minimal modification to assign the columns to the original dataframe (which I did in all my tests, to be consistent with the example in the original question):
col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
df.assign(min_value = np.nanmin(v, axis=1),
          min_column = options[np.nanargmin(v, axis=1)])
Be careful if you have a lot of columns (more than 10000), since in these extreme cases the results can start to change significantly.
UPDATE 1:
According to my tests, calling min and idxmin separately is the fastest of all the proposed answers.
Although it is not done in a single call (see the direct answer below), you may be better off using DataFrame.lookup on the column of indexes (the min_column column), in order to avoid the second search for the values (min_value).
So, instead of traversing the entire matrix - which is O(n*m) - you would only traverse the resulting min_column series - which is O(n):
df = pd.DataFrame({
    'id': [0,1,2,3,4],
    'option_1': [10, np.nan, np.nan, 400, 600],
    'option_2': [np.nan, 20, 300, np.nan, 700],
    'option_3': [np.nan, 200, 30, np.nan, 50],
    'option_4': [110, np.nan, np.nan, 40, 50],
})
df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.lookup(df.index, df['min_column'])
Direct answer (not as efficient)
Since you asked about how to calculate the values "in the same call" (let's say because you simplified your example for the question), you can try a lambda expression:
def min_idxmin(x):
    _idx = x.idxmin()
    return _idx, x[_idx]

df['min_column'], df['min_value'] = zip(*df.filter(like='option').apply(
    lambda x: min_idxmin(x), axis=1))
To be clear, although the second search is removed here (replaced by a direct access via x[_idx]), this will most likely take much longer, because you are not exploiting the vectorized operations of pandas/numpy.
Bottom line: pandas/numpy vectorized operations are very fast.
Summary of the summary:
There doesn't seem to be any advantage in using df.lookup; calling min and idxmin separately is better than using lookup, which is surprising and deserves a question of its own.
Summary of the timings:
I tested a dataframe with 10000 rows and 10 columns (the option_ sequence from the initial example). Since I got a couple of unexpected results, I then also tested with 1000x1000 and 100x10000. According to the results:
Using numpy as @piRSquared suggested (test8) is the clear winner; it only starts performing worse when there are a lot of columns (the 100x10000 case, which hardly argues against its general use). test9 modifies it to use indexing in numpy instead of a second search, but it generally performs worse.
Calling min and idxmin separately was the best for the 10000x10 case, even better than DataFrame.lookup (although the DataFrame.lookup result performed better in the 100x10000 case). Although the shape of the data influences the results, I would argue that having 10000 columns is a bit unrealistic.
The solution provided by @Wen followed in performance, though it was not better than calling idxmin and min separately, or using DataFrame.lookup. I did an extra test (see test7()) because I felt that the additional operations (reset_index and zip) might be distorting the result. It was still worse than test1 and test2, even though it does not do the assignment (I couldn't figure out how to make the assignment using head(1)). @Wen, would you mind giving me a hand?
@Wen's solution underperforms when there are more columns (1000x1000 or 100x10000), which makes sense because sorting is slower than searching. In this case, the lambda expression that I suggested performs better.
Any other solution with a lambda expression, or that uses the transpose (T), falls behind. The lambda expression that I suggested took around 1 second, better than the ~11 s using the transpose suggested by @piRSquared and @RafaelC.
TimeIt results with 10000 rows x 10 columns (pandas 0.23.4):
Using the following dataframe of 10000 rows and 10 columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10000, 10)), columns=[f'option_{x}' for x in range(1,11)]).reset_index()
Computing the two columns separately (calling filter twice):
def test1():
    df['min_column'] = df.filter(like='option').idxmin(1)
    df['min_value'] = df.filter(like='option').min(1)
%timeit -n 100 test1()
13 ms ± 580 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Calling the lookup (it is slower for this case!):
def test2():
    df['min_column'] = df.filter(like='option').idxmin(1)
    df['min_value'] = df.lookup(df.index, df['min_column'])
%timeit -n 100 test2()
# 15.7 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using apply and min_idxmin(x):
def min_idxmin(x):
    _idx = x.idxmin()
    return _idx, x[_idx]

def test3():
    df['min_column'], df['min_value'] = zip(*df.filter(like='option').apply(
        lambda x: min_idxmin(x), axis=1))
%timeit -n 10 test3()
# 968 ms ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using agg['min', 'idxmin'] by @piRSquared:
def test4():
    df['min_value'], df['min_column'] = zip(*df.set_index('index').filter(like='option').T.agg(['min', 'idxmin']).T.values)
%timeit -n 1 test4()
# 11.2 s ± 850 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using agg['min', 'idxmin'] by @RafaelC:
def test5():
    df['min_value'], df['min_column'] = zip(*df.filter(like='option').agg(lambda x: x.agg(['min', 'idxmin']), axis=1).values)
%timeit -n 1 test5()
# 11.7 s ± 597 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sorting values by @Wen:
def test6():
    df['min_column'], df['min_value'] = zip(*df.filter(like='option').stack().sort_values().groupby(level=[0]).head(1).reset_index(level=1).values)
%timeit -n 100 test6()
# 33.6 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Sorting values by @Wen, modified by me to make the comparison fairer by removing the overhead of the assignment operation (I explained why in the summary at the beginning):
def test7():
    df.filter(like='option').stack().sort_values().groupby(level=[0]).head(1)
%timeit -n 100 test7()
# 25 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy:
def test8():
    col_mask = df.columns.str.startswith('option')
    options = df.columns[col_mask]
    v = np.column_stack([*map(df.get, options)])
    df.assign(min_value = np.nanmin(v, axis=1),
              min_column = options[np.nanargmin(v, axis=1)])
%timeit -n 100 test8()
# 2.76 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy but avoiding the search (indexing instead):
def test9():
    col_mask = df.columns.str.startswith('option')
    options = df.columns[col_mask]
    v = np.column_stack([*map(df.get, options)])
    idxmin = np.nanargmin(v, axis=1)
    # instead of searching for the minimum values again, the indexes are used
    df.assign(min_value = v[range(v.shape[0]), idxmin],
              min_column = options[idxmin])
%timeit -n 100 test9()
# 3.96 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
TimeIt results with 1000 rows x 1000 columns:
I performed more tests with a 1000x1000 shape:
df = pd.DataFrame(np.random.randint(0,100,size=(1000, 1000)), columns=[f'option_{x}' for x in range(1,1001)]).reset_index()
The results change:
test1 ~27.6ms
test2 ~29.4ms
test3 ~135ms
test4 ~1.18s
test5 ~1.29s
test6 ~287ms
test7 ~290ms
test8 ~25.7ms
test9 ~26.1ms
TimeIt results with 100 rows x 10000 columns:
I performed more tests with a 100x10000 shape:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 10000)), columns=[f'option_{x}' for x in range(1,10001)]).reset_index()
The results change:
test1 ~46.8ms
test2 ~25.6ms
test3 ~101ms
test4 ~289ms
test5 ~276ms
test6 ~349ms
test7 ~301ms
test8 ~121ms
test9 ~122ms
I have a pandas dataframe with 1 million rows. I want to replace the values in 900,000 rows of a column with another set of values. Is there a fast way to do this without a for loop (which takes me two days to complete)?
For example, look at this sample dataframe, where I have condensed 1 million rows to 8 rows:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['a'] = [-1,-3,-4,-4,-3, 4,5,6]
df['b'] = [23,45,67,89,0,-1, 2, 3]
L2 = [-1,-3,-4]
L5 = [9,10,11]
I want to replace the values where a is -1, -3, or -4 in a single shot if possible, or as fast as possible without a for loop.
The crucial part is that values in L5 have to be repeated as needed.
I have tried
df.loc[df.a < 0, 'a'] = L5
but this works only when len(L5) equals the number of selected rows.
Use map with a dictionary created from both lists by zip; then, for values that were not matched, fall back to the originals with fillna:
d = dict(zip(L2, L5))
print (d)
{-1: 9, -3: 10, -4: 11}
df['a'] = df['a'].map(d).fillna(df['a'])
print (df)
a b
0 9.0 23
1 10.0 45
2 11.0 67
3 11.0 89
4 10.0 0
5 4.0 -1
6 5.0 2
7 6.0 3
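One caveat: map followed by fillna upcasts the column to float (hence the 9.0, 10.0, ... in the output above), because the unmatched positions are temporarily NaN. If keeping the integer dtype matters, a final cast along these lines should restore it:

df['a'] = df['a'].map(d).fillna(df['a']).astype(int)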
Performance:
It depends on the number of values to replace and on the length of the lists:
If the length of the lists is 100:
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'a':np.random.randint(1000, size=N)})
L2 = np.arange(100)
L5 = np.arange(100) + 10
In [336]: %timeit df['d'] = np.select([df['a'] == i for i in L2], L5, df['a'])
180 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [337]: %timeit df['a'].map(dict(zip(L2, L5))).fillna(df['a'])
56.9 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If the length of the lists is small (e.g. 3):
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'a':np.random.randint(100, size=N)})
L2 = np.arange(3)
L5 = np.arange(3) + 10
In [339]: %timeit df['d'] = np.select([df['a'] == i for i in L2], L5, df['a'])
11.9 ms ± 40.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [340]: %timeit df['a'].map(dict(zip(L2, L5))).fillna(df['a'])
54 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can use np.select:
import numpy as np
condition = [df['a'] == i for i in L2]
df['a'] = np.select(condition, L5, df['a'])
and you get:
a b
0 9 23
1 10 45
2 11 67
3 11 89
4 10 0
5 4 -1
6 5 2
7 6 3
Timing: let's create a bigger dataframe from your df:
df_l = pd.concat([df]*10000)
print (df_l.shape)
(80000, 2)
Now some timeit:
# with map, by @jezrael
d = dict(zip(L2, L5))
%timeit df_l['a'].map(d).fillna(df_l['a'])
100 loops, best of 3: 7.71 ms per loop
# with np.select
condition = [df_l['a'] == i for i in L2]
%timeit np.select(condition, L5, df_l['a'])
1000 loops, best of 3: 350 µs per loop
I am trying to use the values within the current dataframe as the index, and the dataframe's index as the labels. For example:
Value1 Value2
0 0 1
1 2 4
2 NaN 3
This would result in
Labels
0 0
1 0
2 1
3 2
4 1
Currently I manage to do this using a loop to check and apply the necessary labels/values, but with millions of labels to mark, this process becomes extremely time consuming. Is there a smarter and quicker way to do this? Thanks in advance.
Use stack with DataFrame constructor:
s = df.stack()
df = pd.DataFrame(s.index.get_level_values(0).values,
                  columns=['Labels'],
                  index=s.values.astype(int)).sort_index()
print (df)
Labels
0 0
1 0
2 1
3 2
4 1
Detail:
print (df.stack())
0 Value1 0.0
Value2 1.0
1 Value1 2.0
Value2 4.0
2 Value2 3.0
dtype: float64
Came up with a really good one (thanks to the collective effort of the pandas community). This one should be fast.
It uses the power and flexibility of repeat and ravel to flatten your data.
s = pd.Series(df.index.repeat(2), index=df.values.ravel())
s[s.index.notnull()].sort_index()
0.0 0
1.0 0
2.0 1
3.0 2
4.0 1
dtype: int64
A subsequent conversion of the filtered result gives an integer index:
s = s[s.index.notnull()].sort_index()
s.index = s.index.astype(int)
A similar solution (slightly faster depending on your data), which also results in an integer index, performs the filtering before converting to a Series:
v = df.index.repeat(df.shape[1])
i = df.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
s
0 0
1 0
2 1
3 2
4 1
dtype: int64
Performance
df2 = pd.concat([df] * 10000, ignore_index=True)
# jezrael's solution
%%timeit
s = df2.stack()
pd.DataFrame(s.index.get_level_values(0).values,
             columns=['Labels'],
             index=s.values.astype(int)).sort_index()
4.57 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(df2.index.repeat(2), index=df2.values.ravel())
s[s.index.notnull()].sort_index()
3.12 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
v = df2.index.repeat(df2.shape[1])
i = df2.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
3.1 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)