Difference in score to next rank - Python

I have a dataframe
Group Score Rank
1 0 3
1 4 1
1 2 2
2 3 2
2 1 3
2 7 1
I need to take the difference between the score of each rank and the score of the next rank within each group. For example, in group 1, rank(1) - rank(2) = 4 - 2 = 2.
Expected output:
Group Score Rank Difference
1 0 3 0
1 4 1 2
1 2 2 2
2 3 2 2
2 1 3 0
2 7 1 4

You can try:
df = df.sort_values(['Group', 'Rank'], ascending=[True, False])
df['Difference'] = df.groupby('Group')['Score'].diff().fillna(0).astype(int)
OUTPUT:
Group Score Rank Difference
0 1 0 3 0
2 1 2 2 2
1 1 4 1 2
4 2 1 3 0
3 2 3 2 2
5 2 7 1 4
NOTE: The result is sorted by Group ascending and Rank descending; sort again or use sort_index if you need the original row order.
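The sort-and-diff approach can be verified end-to-end; a minimal sketch reconstructing the sample frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Group': [1, 1, 1, 2, 2, 2],
    'Score': [0, 4, 2, 3, 1, 7],
    'Rank':  [3, 1, 2, 2, 3, 1],
})

# Sort descending by Rank within each group so that diff() computes
# score(rank k) - score(rank k+1) on the rank-k row.
df = df.sort_values(['Group', 'Rank'], ascending=[True, False])
df['Difference'] = df.groupby('Group')['Score'].diff().fillna(0).astype(int)
print(df)
```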

I think you can create a new column holding the value of the next rank using shift() and then calculate the difference. See the following code:
# Sort the dataframe
df = df.sort_values(['Group','Rank']).reset_index(drop=True)
# Shift up values by one row within a group
df['Score_next'] = df.groupby('Group')['Score'].shift(-1).fillna(0)
# Calculate the difference
df['Difference'] = df['Score'] - df['Score_next']
Here is the result:
print(df)
Group Score Rank Score_next Difference
0 1 4 1 2.0 2.0
1 1 2 2 0.0 2.0
2 1 0 3 0.0 0.0
3 2 7 1 3.0 4.0
4 2 3 2 1.0 2.0
5 2 1 3 0.0 1.0

Python Pandas DataFrames compare with next rows

I have dataframe like this.
col1
0 1
1 3
2 3
3 1
4 2
5 3
6 2
7 2
I want to create a column out by comparing each row with the next. If row 0 is less than row 1, out is 1; if row 1 is not less than row 2, out is 0; and so on, as in this sample.
col1 out
0 1 1 # 1<3 = 1
1 3 0 # 3<3 = 0
2 3 0 # 3<1 = 0
3 1 1 # 1<2 = 1
4 2 1 # 2<3 = 1
5 3 0 # 3<2 = 0
6 2 0 # 2<2 = 0
7 2 -
I tried this code:
def comp_out(a):
    return np.concatenate(([1], a[1:] > a[2:]))
df['out'] = comp_out(df.col1.values)
It shows an error like this:
ValueError: operands could not be broadcast together with shapes (11,) (10,)
Let's use shift to "shift" the column up so that each row is aligned with the next one, then use lt to compare less-than, and astype to convert the booleans to 1/0:
df['out'] = df['col1'].lt(df['col1'].shift(-1)).astype(int)
col1 out
0 1 1
1 3 0
2 3 0
3 1 1
4 2 1
5 3 0
6 2 0
7 2 0
If the last row should stay unset (it has no successor to compare against), we can drop the final comparison with iloc before assigning, which leaves NaN there:
df['out'] = df['col1'].lt(df['col1'].shift(-1)).iloc[:-1].astype(int)
df:
col1 out
0 1 1.0
1 3 0.0
2 3 0.0
3 1 1.0
4 2 1.0
5 3 0.0
6 2 0.0
7 2 NaN
If we want to use the original function, we should make both slices the same length by slicing off the last value, and pad the result with NaN:
def comp_out(a):
    return np.concatenate([a[:-1] < a[1:], [np.nan]])
df['out'] = comp_out(df['col1'].to_numpy())
df:
col1 out
0 1 1.0
1 3 0.0
2 3 0.0
3 1 1.0
4 2 1.0
5 3 0.0
6 2 0.0
7 2 NaN
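Putting the shift/lt one-liner together as a runnable check on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 3, 1, 2, 3, 2, 2]})

# shift(-1) aligns each row with its successor; lt() compares element-wise,
# and the trailing row (no successor) compares False -> 0.
df['out'] = df['col1'].lt(df['col1'].shift(-1)).astype(int)
print(df)
```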

Pandas Insert a new row after every nth row

I have a dataframe that looks like below:
L_Type L_ID C_Type E_Code
0 1 1 9
0 1 2 9
0 1 3 9
0 1 4 9
0 2 1 2
0 2 2 2
0 2 3 2
0 2 4 2
0 3 1 3
0 3 2 3
0 3 3 3
0 3 4 3
I need to insert a new row after every 4th row, incrementing the value in the third column (C_Type) by 1 while keeping the first two columns the same and leaving the last column empty, like the table below:
L_Type L_ID C_Type E_Code
0 1 1 9
0 1 2 9
0 1 3 9
0 1 4 9
0 1 5
0 2 1 2
0 2 2 2
0 2 3 2
0 2 4 2
0 2 5
0 3 1 3
0 3 2 3
0 3 3 3
0 3 4 3
0 3 5
I have searched other threads but could not figure out the exact solution:
How to insert n DataFrame to another every nth row in Pandas?
Insert new rows in pandas dataframe
You can select rows by slicing, add 1 to column C_Type, and add 0.5 to the index so the inserted rows sort directly after their source rows (the default quicksort used by DataFrame.sort_index is not stable, so distinct index values guarantee correct placement). Last, join the frames with concat, sort the index, and create a default index with DataFrame.reset_index and drop=True:
df['C_Type'] = df['C_Type'].astype(int)
df2 = (df.iloc[3::4]
         .assign(C_Type=lambda x: x['C_Type'] + 1, E_Code=np.nan)
         .rename(lambda x: x + .5))
df1 = pd.concat([df, df2], sort=False).sort_index().reset_index(drop=True)
print(df1)
L_Type L_ID C_Type E_Code
0 0 1 1 9.0
1 0 1 2 9.0
2 0 1 3 9.0
3 0 1 4 9.0
4 0 1 5 NaN
5 0 2 1 2.0
6 0 2 2 2.0
7 0 2 3 2.0
8 0 2 4 2.0
9 0 2 5 NaN
10 0 3 1 3.0
11 0 3 2 3.0
12 0 3 3 3.0
13 0 3 4 3.0
14 0 3 5 NaN
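The slicing/half-index trick can be checked end-to-end with a minimal sketch reconstructing the sample frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'L_Type': 0,
    'L_ID':   np.repeat([1, 2, 3], 4),
    'C_Type': np.tile([1, 2, 3, 4], 3),
    'E_Code': np.repeat([9, 2, 3], 4),
})

# Take every 4th row, bump C_Type, blank E_Code, and shift the index by 0.5
# so each new row sorts just after its source row.
df2 = (df.iloc[3::4]
         .assign(C_Type=lambda x: x['C_Type'] + 1, E_Code=np.nan)
         .rename(lambda x: x + .5))

out = pd.concat([df, df2]).sort_index().reset_index(drop=True)
print(out)
```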

Identify first non-zero element within group composed of multiple columns in pandas

I have a dataframe that looks like the following. The rightmost column is my desired column:
Group1 Group2 Value Target_Column
1 3 0 0
1 3 1 1
1 4 1 1
1 4 1 0
2 5 5 5
2 5 1 0
2 6 0 0
2 6 1 1
2 6 9 0
How do I identify the first non-zero value in a group that is made up of two columns(Group1 & Group2) and then create a column that shows the first non-zero value and shows all else as zeroes?
This question is very similar to one posed earlier here:
Identify first non-zero element within a group in pandas
but that solution gives an error on groups based on multiple columns.
I have tried:
import pandas as pd
dt = pd.DataFrame({'Group1': [1,1,1,1,2,2,2,2,2], 'Group2': [3,3,4,4,5,5,6,6,6], 'Value': [0,1,1,1,5,1,0,1,9]})
dt['Newcol']=0
dt.loc[dt.Value.ne(0).groupby(dt['Group1','Group2']).idxmax(),'Newcol']=dt.Value
Setup
df['flag'] = df.Value.ne(0)
Using numpy.where and assign:
df.assign(
    target=np.where(df.index.isin(df.groupby(['Group1', 'Group2']).flag.idxmax()),
                    df.Value, 0)
).drop(columns='flag')
Using loc and assign:
df.assign(
    target=df.loc[df.groupby(['Group1', 'Group2']).flag.idxmax(), 'Value']
).fillna(0).astype(int).drop(columns='flag')
Both produce:
Group1 Group2 Value target
0 1 3 0 0
1 1 3 1 1
2 1 4 1 1
3 1 4 1 0
4 2 5 5 5
5 2 5 1 0
6 2 6 0 0
7 2 6 1 1
8 2 6 9 0
The result may differ when a group contains ties for the first non-zero value, since it is ambiguous which of the equal rows you need. Using user3483203's setup:
df['flag'] = df.Value.ne(0)
df['Target'] = df.sort_values('flag', ascending=False).drop_duplicates(['Group1', 'Group2']).Value
df['Target'].fillna(0, inplace=True)
df
Out[20]:
Group1 Group2 Value Target_Column Target
0 1 3 0 0 0.0
1 1 3 1 1 1.0
2 1 4 1 1 1.0
3 1 4 1 0 0.0
4 2 5 5 5 5.0
5 2 5 1 0 0.0
6 2 6 0 0 0.0
7 2 6 1 1 1.0
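A compact, runnable variant of the idxmax idea (Target is the output column name used here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Group1': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'Group2': [3, 3, 4, 4, 5, 5, 6, 6, 6],
                   'Value':  [0, 1, 1, 1, 5, 1, 0, 1, 9]})

# idxmax on the boolean mask returns the index of the first True
# (i.e. the first non-zero Value) in each (Group1, Group2) group.
first_nz = df['Value'].ne(0).groupby([df['Group1'], df['Group2']]).idxmax()
df['Target'] = np.where(df.index.isin(first_nz), df['Value'], 0)
print(df)
```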

Pandas assign the groupby sum value to the last row in the original table

For example, I have a table
A
id price sum
1 2 0
1 6 0
1 4 0
2 2 0
2 10 0
2 1 0
2 5 0
3 1 0
3 5 0
What I want is like (the last row of sum should be the sum of price of a group)
id price sum
1 2 0
1 6 0
1 4 12
2 2 0
2 10 0
2 1 0
2 5 18
3 1 0
3 5 6
What I can do is find out the sum using
A['price'].groupby(A['id']).transform('sum')
However I don't know how to assign this to the sum column (last row).
Thanks
Use last_valid_index to locate the rows to fill:
g = df.groupby('id')
l = pd.DataFrame.last_valid_index
df.loc[g.apply(l), 'sum'] = g.price.sum().values
df
id price sum
0 1 2 0
1 1 6 0
2 1 4 12
3 2 2 0
4 2 10 0
5 2 1 0
6 2 5 18
7 3 1 0
8 3 5 6
You could do this:
df.assign(sum=df.groupby('id')['price'].transform('sum').drop_duplicates(keep='last')).fillna(0)
OR
df['sum'] = (df.groupby('id')['price']
               .transform('sum')
               .mask(df.id.duplicated(keep='last'), 0))
Output:
id price sum
0 1 2 0.0
1 1 6 0.0
2 1 4 12.0
3 2 2 0.0
4 2 10 0.0
5 2 1 0.0
6 2 5 18.0
7 3 1 0.0
8 3 5 6.0
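The transform-plus-mask idea can be verified with a short runnable sketch (np.where stands in for mask here; the behavior is the same on this data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id':    [1, 1, 1, 2, 2, 2, 2, 3, 3],
                   'price': [2, 6, 4, 2, 10, 1, 5, 1, 5]})

# Broadcast each group's total to every row, then zero out every row
# except the last occurrence of each id.
sums = df.groupby('id')['price'].transform('sum')
df['sum'] = np.where(df['id'].duplicated(keep='last'), 0, sums)
print(df)
```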

Pandas group operation on columns

I have a grouped pandas groupby object.
dis type id date qty
1 1 10 2017-01-01 1
1 1 10 2017-01-01 0
1 1 10 2017-01-02 4.5
1 2 11 2017-04-03 1
1 2 11 2017-04-03 2
1 2 11 2017-04-03 0
1 2 11 2017-04-05 0
I want to apply some operations to this groupby object.
I want to add a new column total_order that counts the number of orders on a particular date for a particular material.
A column zero_qty that counts the number of zero-quantity orders on a particular date for a particular material.
Change the date column to the number of days between consecutive orders for a particular material; the first order becomes 0.
The final dataframe should like something like this:
dis type id date qty total_order zero_qty
1 1 10 0 1 2 1
1 1 10 0 0 2 1
1 1 10 1 4.5 1 1
1 2 11 0 1 3 2
1 2 11 0 2 3 2
1 2 11 0 0 3 2
1 2 11 2 0 1 1
I think you need transform to count the size of each group for total_order, then count the number of zeros in qty, and last get the difference in days with diff plus fillna:
Notice: the difference needs sorted rows; sort_values handles that if necessary:
df = df.sort_values(['dis','type','id','date'])
g = df.groupby(['dis','type','id','date'])
df['total_order'] = g['id'].transform('size')
df['zero_qty'] = g['qty'].transform(lambda x: (x == 0).sum()).astype(int)
df['date'] = df.groupby(['dis','type','id'])['date'].diff().fillna(0).dt.days
print (df)
dis type id date qty total_order zero_qty
0 1 1 10 0 1.0 2 1
1 1 1 10 0 0.0 2 1
2 1 1 10 1 4.5 1 0
3 1 2 11 0 1.0 3 1
4 1 2 11 0 2.0 3 1
5 1 2 11 0 0.0 3 1
6 1 2 11 2 0.0 1 1
Another solution: instead of multiple transform calls, use apply with a custom function:
df = df.sort_values(['dis','type','id','date'])
def f(x):
    x['total_order'] = len(x)
    x['zero_qty'] = int(x['qty'].eq(0).sum())
    return x
df = df.groupby(['dis','type','id','date']).apply(f)
df['date'] = df.groupby(['dis','type','id'])['date'].diff().fillna(0).dt.days
print (df)
dis type id date qty total_order zero_qty
0 1 1 10 0 1.0 2 1
1 1 1 10 0 0.0 2 1
2 1 1 10 1 4.5 1 0
3 1 2 11 0 1.0 3 1
4 1 2 11 0 2.0 3 1
5 1 2 11 0 0.0 3 1
6 1 2 11 2 0.0 1 1
EDIT:
The last line can be rewritten too if you need to process more columns:
def f2(x):
    # add other processing here
    x['date'] = x['date'].diff().fillna(0).dt.days
    return x
df = df.groupby(['dis','type','id']).apply(f2)
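The transform-based answer can be checked end-to-end; a minimal sketch reconstructing the sample frame (date is assumed to be datetime64, and fillna takes a zero Timedelta, which newer pandas requires for timedelta columns):

```python
import pandas as pd

df = pd.DataFrame({
    'dis':  [1, 1, 1, 1, 1, 1, 1],
    'type': [1, 1, 1, 2, 2, 2, 2],
    'id':   [10, 10, 10, 11, 11, 11, 11],
    'date': pd.to_datetime(['2017-01-01', '2017-01-01', '2017-01-02',
                            '2017-04-03', '2017-04-03', '2017-04-03',
                            '2017-04-05']),
    'qty':  [1, 0, 4.5, 1, 2, 0, 0],
})

df = df.sort_values(['dis', 'type', 'id', 'date'])
g = df.groupby(['dis', 'type', 'id', 'date'])

# Orders per (dis, type, id, date) key, and zero-qty orders per the same key.
df['total_order'] = g['id'].transform('size')
df['zero_qty'] = g['qty'].transform(lambda x: x.eq(0).sum()).astype(int)

# Days between consecutive orders per material; the first order becomes 0.
df['date'] = (df.groupby(['dis', 'type', 'id'])['date']
                .diff().fillna(pd.Timedelta(0)).dt.days)
print(df)
```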
