Python Pandas: Find nth largest and count in case of tie

In a pandas DataFrame, I want to find the nth largest value (row-wise, not column-wise) and also find whether there are ties. I am interested in finding the top 3 largest values and ties. Note that there are more than 3 columns in the actual problem,
e.g. if the data frame looks like this:
A B C
0 1 2 2
1 3 4 3
2 1 2 3
I want to know the nth largest values and if there were ties, so:
row 0 - 1st largest is 2 with a tie, 2nd largest is 1, no third largest,
row 1 - 1st largest is 4, 2nd largest is 3 with a tie, no third largest,
row 2 - 1st largest is 3, 2nd largest is 2, 3rd largest is 1.
Expected output as requested:
A B C max1 max2 max3 tie1 tie2 tie3
0 1 2 2 2 1 NaN 1 0 NaN
1 3 4 3 4 3 NaN 0 1 NaN
2 1 2 3 3 2 1.0 0 0 0.0
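
For reference, the example frame can be built like this (a small sketch so the snippets below are reproducible):
import pandas as pd
df = pd.DataFrame({'A': [1, 3, 1], 'B': [2, 4, 2], 'C': [2, 3, 3]})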

Use:
#top N values
N = 3
#reshape to MultiIndex Series and counts values per index, sorting
df1 = (df.stack()
         .groupby(level=0).value_counts()
         .sort_index(ascending=[True, False])
         .reset_index(level=1))
#counter level to MultiIndex for new columns names
s = df1.groupby(level=0).cumcount().add(1)
df1 = df1.set_index(s, append=True)
#filter top N rows
df1 = df1[s.le(N).values]
df1.columns = ['max', 'tie']
#subtract 1 for correct tie
df1['tie'] -= 1
#reshape to df with MultiIndex
df1 = df1.unstack()
#flatten columns names
df1.columns = df1.columns.map(lambda x: f'{x[0]}{x[1]}')
#add to original
df2 = df.join(df1)
print (df2)
A B C max1 max2 max3 tie1 tie2 tie3
0 1 2 2 2.0 1.0 NaN 1.0 0.0 NaN
1 3 4 3 4.0 3.0 NaN 0.0 1.0 NaN
2 1 2 3 3.0 2.0 1.0 0.0 0.0 0.0

for i in range(len(df)):
    l = df.loc[i].tolist()
    d = {key: l.count(key) for key in l}
The dictionary now contains the distinct values of the row and how many times each one appears. You can then print the top-n values of the dictionary easily, for example as sketched below.
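A small sketch of that last step, assuming d and i come from the loop above (the variable names here are just illustrative):
top3 = sorted(d, reverse=True)[:3]           # the 3 largest distinct values of the row
for rank, value in enumerate(top3, start=1):
    tie = d[value] - 1                       # 0 means no tie, 1 means one duplicate, ...
    print(f'row {i}: rank {rank} value {value} (ties: {tie})')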

Related

Sum one dataframe based on value of other dataframe in same index/row

I would like to sum values of a dataframe conditionally, based on the values of a different dataframe. Say for example I have two dataframes:
df1 = pd.DataFrame(data = [[1,-1,5],[2,1,1],[3,0,0]],index=[0,1,2],columns = [0,1,2])
index 0 1 2
-----------------
0 1 -1 5
1 2 1 1
2 3 0 0
df2 = pd.DataFrame(data = [[1,1,3],[1,1,2],[0,2,1]],index=[0,1,2],columns = [0,1,2])
index 0 1 2
-----------------
0 1 1 3
1 1 1 2
2 0 2 1
Now what I would like is, for example, if a value in df1 equals 1, to sum the values of df2 at those same locations.
In this example, if the condition is 1, then the sum over df2 would be 4. If the condition were 0, the result would be 3.
Another option with Pandas' query:
df2.query("@df1==1").sum().sum()
# 4
You can use a mask with where:
df2.where(df1.eq(1)).to_numpy().sum()
# or
# df2.where(df1.eq(1)).sum().sum()
output: 4.0
intermediate:
df2.where(df1.eq(1))
0 1 2
0 1.0 NaN NaN
1 NaN 1.0 2.0
2 NaN NaN NaN
Assuming that one wants to store the result in the variable value, there are various options to achieve that. Two of them are shown below.
Option 1
One can simply do the following
value = df2[df1 == 1].sum().sum()
[Out]: 4.0 # numpy.float64
# or
value = sum(df2[df1 == 1].sum())
[Out]: 4.0 # float
Option 2
Using pandas.DataFrame.where
value = df2.where(df1 == 1, 0).sum().sum()
[Out]: 4 # numpy.int64
# or
value = sum(df2.where(df1 == 1, 0).sum())
[Out]: 4 # int
Notes:
Both df2[df1 == 1] and df2.where(df1 == 1) give the following output (df2.where(df1 == 1, 0) keeps the same positions, but filled with 0 instead of NaN)
0 1 2
0 1.0 NaN NaN
1 NaN 1.0 2.0
2 NaN NaN NaN
Depending on the desired output (float, int, numpy.float64,...) one method might be better than the other.
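If a plain Python number is needed no matter which option is used, a further small sketch (same df1 and df2 as above) is to cast the result explicitly:
value = float(df2.where(df1 == 1).sum().sum())   # 4.0 as a built-in float
value = int(df2.where(df1 == 1, 0).sum().sum())  # 4 as a built-in int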

Pandas groupby multiple columns to compare values

My df looks like this: (There are dozens of other columns in the df but these are the three I am focused on)
Param Value Limit
A 1.50 1
B 2.50 1
C 2.00 2
D 2.00 2.5
E 1.50 2
I am trying to use pandas to calculate how many [Value] entries are less than [Limit] per [Param], hoping to get a list like this:
Param Count
A 1
B 1
C 1
D 0
E 0
I've tried with a few methods, the first being
value_count = df.loc[df['Value'] < df['Limit']].count()
but this just gives the full count per column in the df.
I've also tried the groupby function, which I think could be the right idea, by creating a subset of the df with the chosen columns:
df_below_limit = df[df['Value'] < df['Limit']]
df_below_limit.groupby('Param')['Value'].count()
This is nearly what I want, but it drops the Params that have no rows left after the filter, and I need those too (with a count of 0). I am not sure how to go about getting the list as I need it.
Assuming you want the count per Param, you can use:
out = df['Value'].ge(df['Limit']).groupby(df['Param']).sum()
output:
Param
A 1
B 2
C 1
D 0
E 0
dtype: int64
used input (with a duplicated row "B" for the example):
Param Value Limit
0 A 1.5 1.0
1 B 2.5 1.0
2 B 2.5 1.0
3 C 2.0 2.0
4 D 2.0 2.5
5 E 1.5 2.0
As a DataFrame:
df['Value'].ge(df['Limit']).groupby(df['Param']).sum().reset_index(name='Count')
# or
df['Value'].ge(df['Limit']).groupby(df['Param']).agg(Count='sum').reset_index()
output:
Param Count
0 A 1
1 B 2
2 C 1
3 D 0
4 E 0
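An alternative small sketch with the same ge logic, using assign so the count gets a named column directly (the column name Count is just what the question asked for):
out = (df.assign(Count=df['Value'].ge(df['Limit']))
         .groupby('Param', as_index=False)['Count']
         .sum())
This yields the same Param/Count frame as the reset_index variants above.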

pandas groupby with condition depending on same column

I have a dataframe with ID, date and number columns and would like to create a new column that takes the mean of all numbers for this specific ID BUT only includes the numbers in the mean where date is smaller than the date of this row. How would I do this?
df = (pd.DataFrame({'ID': ['1','1','1','1','2','2'], 'number': ['1','4','1','4','2','5'],
                    'date': ['2021-10-19','2021-10-16','2021-10-16','2021-10-15','2021-10-19','2021-10-10']})
      .assign(date = lambda x: pd.to_datetime(x.date))
      .assign(mean_no_from_previous_dts = lambda x: x[x.date<??].groupby('ID').number.transform('mean'))
      )
This is what I would like to get as output:
ID number date mean_no_from_previous_dts
0 1 1 2021-10-19 3.0 = mean(4+1+4)
1 1 4 2021-10-16 2.5 = mean(4+1)
2 1 1 2021-10-16 4.0 = mean(4+4)
3 1 4 2021-10-15 0.0 = 0 (as it's the first entry for this date and ID - this number doesn't matter, it can be something else)
4 2 2 2021-10-19 5.0 = mean(5)
5 2 5 2021-10-10 0.0 = 0 (as it's the first entry for this date and ID)
So, for example, the first entry of the column mean_no_from_previous_dts is the mean of (4+1+4): the first 4 comes from the number column of the 2nd row, because 2021-10-16 (date in the 2nd row) is smaller than 2021-10-19 (date in the 1st row). The 1 comes from the 3rd row, because 2021-10-16 is smaller than 2021-10-19. The second 4 comes from the 4th row, because 2021-10-15 is smaller than 2021-10-19. This is for ID = 1; the same applies for ID = 2.
Here is a solution with numpy broadcasting per group:
import numpy as np
import pandas as pd

df = (pd.DataFrame({'ID': ['1','1','1','1','2','2'], 'number': ['1','4','1','4','2','5'],
                    'date': ['2021-10-19','2021-10-16','2021-10-16','2021-10-15','2021-10-19','2021-10-10']})
      .assign(date = lambda x: pd.to_datetime(x.date), number = lambda x: x['number'].astype(int))
      )

def f(x):
    arr = x['date'].to_numpy()
    #mask of dates lower than or equal to the current row's date
    m = arr <= arr[:, None]
    #exclude the row itself - set the diagonal to False
    np.fill_diagonal(m, False)
    #set the excluded values to NaN and get the mean without NaNs
    m = np.nanmean(np.where(m, x['number'].to_numpy(), np.nan).astype(float), axis=1)
    #assign to new column
    x['no_of_previous_dts'] = m
    return x

#rows with no previous date per group get NaN, filled with 0
df = df.groupby('ID').apply(f).fillna({'no_of_previous_dts': 0})
print (df)
ID number date no_of_previous_dts
0 1 1 2021-10-19 3.0
1 1 4 2021-10-16 2.5
2 1 1 2021-10-16 4.0
3 1 4 2021-10-15 0.0
4 2 2 2021-10-19 5.0
5 2 5 2021-10-10 0.0
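If the broadcasting is hard to follow, here is a slower but more explicit sketch of the same "date lower than or equal, excluding the row itself" logic, assuming df was built with an integer number column as above (the column name no_of_previous_dts_loop is just illustrative):
results = pd.Series(index=df.index, dtype=float)
for _, grp in df.groupby('ID'):
    for idx, row in grp.iterrows():
        #all other rows of the same ID whose date is <= this row's date
        mask = (grp['date'] <= row['date']) & (grp.index != idx)
        results[idx] = grp.loc[mask, 'number'].mean()
#rows with no such previous date give NaN, filled with 0 as above
df['no_of_previous_dts_loop'] = results.fillna(0)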

Get nth row of groups and fill with 'None' if row is missing

I have a df:
a b c
1 2 3 6
2 2 5 7
3 4 6 8
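For reference, a small sketch to build this frame (index 1-3 as shown):
import pandas as pd
df = pd.DataFrame({'a': [2, 2, 4], 'b': [3, 5, 6], 'c': [6, 7, 8]}, index=[1, 2, 3])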
I want every nth row of groupby a:
w=df.groupby('a').nth(0) #first row
x=df.groupby('a').nth(1) #second row
The second group of the df has no second row; in this case I want to have 'None' values.
[In:] df.groupby('a').nth(1)
[Out:]
a b c
1 2 5 7
2 None None None
Or maybe simpler:
The df has 1-4 rows within each group. If a group has fewer than 4 rows, I want to extend the group so that it has 4 rows, and fill the missing rows with 'None'. Afterwards, if I pick the nth row of each group, I have the desired output.
If you are just interested in a specific nth row but some groups do not have enough rows, you can consider using reindex with the unique values from column a, like:
print (df.groupby('a').nth(1).reindex(df['a'].unique()).reset_index())
a b c
0 2 5.0 7.0
1 4 NaN NaN
One way is to assign a count/rank column and reindex/stack:
n = 2
(df.assign(rank=df.groupby('a').cumcount())
   .query('rank < @n')
   .set_index(['a', 'rank'])
   .unstack('rank')
   .stack('rank', dropna=False)
   .reset_index()
   .drop('rank', axis=1)
)
Output:
a b c
0 2 3.0 6.0
1 2 5.0 7.0
2 4 6.0 8.0
3 4 NaN NaN
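For the "extend every group to 4 rows" formulation of the question, a small sketch (the names pos, full_index and padded are just illustrative) could be:
n_rows = 4
tmp = df.assign(pos=df.groupby('a').cumcount()).set_index(['a', 'pos'])
full_index = pd.MultiIndex.from_product([df['a'].unique(), range(n_rows)],
                                        names=['a', 'pos'])
padded = tmp.reindex(full_index).reset_index()
#every group of a now has exactly 4 rows; the padded ones are NaN
nth = padded[padded['pos'] == 1].drop(columns='pos')
Selecting padded[padded['pos'] == n] then gives the nth row of every group, with NaN rows for groups that were too short.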

Compare corresponding columns with each other and store the result in a new column

I have data which I pivoted using the pivot_table method; now the data looks like this:
rule_id a b c
50211 8 0 0
50249 16 0 3
50378 0 2 0
50402 12 9 6
I have set 'rule_id' as the index. Now I compared each column to its corresponding next column and created another column with the result. The idea is: if the first column has a value other than 0 and the second column (to which the first column is compared) has 0, then 100 should go into the newly created column; if the situation is the reverse, then 'Null' should go in. If both columns have 0, 'Null' should also go in. For the last column, if its value is 0 then 'Null' should go in, otherwise 100. But if both compared columns have values other than 0 (like in the last row of my data), then the comparison should be, for columns a and b:
value_of_b / value_of_a * 50 + 50
and for columns b and c:
value_of_c / value_of_b * 25 + 25
and similarly, if there are more columns, the multiplication and addition value should be 12.5, and so on.
I was able to achieve all of the above apart from the last part, which is the division and multiplication. I used this code:
m = df.eq(df.shift(-1, axis=1))
arr = np.select([df == 0, m], [np.nan, df], 1*100)
df2 = pd.DataFrame(arr, index=df.index).rename(columns=lambda x: f'comp{x+1}')
df3 = df.join(df2)
df is the dataframe which stores my pivoted table data which I mentioned at the start. After using this code my data looks like this:
rule_id a b c comp1 comp2 comp3
50211 8 0 0 100 NaN NaN
50249 16 0 3 100 NaN 100
50378 0 2 0 NaN 100 NaN
50402 12 9 6 100 100 100
But I want the data to look like this:
rule_id a b c comp1 comp2 comp3
50211 8 0 0 100 NaN NaN
50249 16 0 3 100 NaN 100
50378 0 2 0 NaN 100 NaN
50402 12 9 6 87.5 41.67 100
If you guys can help me get the desired data, I would greatly appreciate it.
The problem is that the coefficient to use when building a new compx column does not depend only on the column position. In each row it is reset to its maximum of 50 after every 0 value, and is half of the previous one after a non-zero value. Such resettable series are hard to vectorize in pandas, especially along rows. Here I would build a companion dataframe holding only those coefficients, and compute them directly from the underlying arrays as efficiently as possible. The code could be:
import numpy as np

# transpose the dataframe to process columns instead of rows
coeff = df.T
# compute the coefficients
for name, s in coeff.items():
    top = 100                # start at 100
    r = []
    for i, v in enumerate(s):
        if v == 0:           # reset to 100 on a 0 value
            top = 100
        else:
            top = top / 2    # else halve the previous value
        r.append(top)
    coeff.loc[:, name] = r   # set the whole column in one operation
# transpose back to have a companion dataframe for df
coeff = coeff.T

# build a new column from 2 consecutive ones, using the coeff dataframe
def build_comp(col1, col2, i):
    df['comp{}'.format(i)] = np.where(df[col1] == 0, np.nan,
                                      np.where(df[col2] == 0, 100,
                                               df[col2] / df[col1] * coeff[col1]
                                               + coeff[col1]))

old = df.columns[0]          # store the name of the first column
# enumerate all the columns (except the first one)
for i, col in enumerate(df.columns[1:], 1):
    build_comp(old, col, i)
    old = col                # keep the current column name for the next iteration
# special processing for the last comp column
df['comp{}'.format(i+1)] = np.where(df[col] == 0, np.nan, 100)
With this initial dataframe:
date 2019-04-25 15:08:23 2019-04-25 16:14:14 2019-04-25 16:29:05 2019-04-25 16:36:32
rule_id
50402 0 0 9 0
51121 0 1 0 0
51147 0 1 0 0
51183 2 0 0 0
51283 0 12 9 6
51684 0 1 0 0
52035 0 4 3 2
it gives as expected:
date 2019-04-25 15:08:23 2019-04-25 16:14:14 2019-04-25 16:29:05 2019-04-25 16:36:32 comp1 comp2 comp3 comp4
rule_id
50402 0 0 9 0 NaN NaN 100.000000 NaN
51121 0 1 0 0 NaN 100.0 NaN NaN
51147 0 1 0 0 NaN 100.0 NaN NaN
51183 2 0 0 0 100.0 NaN NaN NaN
51283 0 12 9 6 NaN 87.5 41.666667 100.0
51684 0 1 0 0 NaN 100.0 NaN NaN
52035 0 4 3 2 NaN 87.5 41.666667 100.0
Ok, I think you can iterate over your dataframe df and use some if-else to get the desired output.
for i in range(len(df.index)):
    if df.iloc[i, 1] != 0 and df.iloc[i, 2] == 0:  # columns start from index 0
        df.loc[i, 'colname'] = 'whatever you want' # so rule_id is column 0
    elif ...:
        ...
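A minimal sketch completing that idea for the small a/b/c table from the question; the fixed coefficients 50 and 25 follow the rules stated in the question for three columns (for more columns, the resettable coefficients from the answer above are needed):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [8, 16, 0, 12], 'b': [0, 0, 2, 9], 'c': [0, 3, 0, 6]},
                  index=[50211, 50249, 50378, 50402])
df.index.name = 'rule_id'
comp1, comp2, comp3 = [], [], []
for _, row in df.iterrows():
    # comp1: compare a with b, coefficient 50
    if row['a'] == 0:
        comp1.append(np.nan)
    elif row['b'] == 0:
        comp1.append(100)
    else:
        comp1.append(row['b'] / row['a'] * 50 + 50)
    # comp2: compare b with c, coefficient 25
    if row['b'] == 0:
        comp2.append(np.nan)
    elif row['c'] == 0:
        comp2.append(100)
    else:
        comp2.append(row['c'] / row['b'] * 25 + 25)
    # comp3: the last column stands alone
    comp3.append(np.nan if row['c'] == 0 else 100)
df['comp1'], df['comp2'], df['comp3'] = comp1, comp2, comp3
This reproduces the desired comp1-comp3 values for the four rule_id rows shown above.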
