Pandas group operation on columns - python

I have a grouped pandas groupby object.
dis type id date qty
1 1 10 2017-01-01 1
1 1 10 2017-01-01 0
1 1 10 2017-01-02 4.5
1 2 11 2017-04-03 1
1 2 11 2017-04-03 2
1 2 11 2017-04-03 0
1 2 11 2017-04-05 0
I want to apply some operations on this groupby object.
I want to add a new column total_order that calculates the number of orders on a particular date for a particular material
A column zero_qty that calculates the number of zero orders for a particular date for a particular material
change the date column to make it calculate the number of days between each subsequent order for a particular material. The first order becomes 0.
The final dataframe should like something like this:
dis type id date qty total_order zero_qty
1 1 10 0 1 2 1
1 1 10 0 0 2 1
1 1 10 1 4.5 1 1
1 2 11 0 1 3 2
1 2 11 0 2 3 2
1 2 11 0 0 3 2
1 2 11 2 0 1 1

I think you need transform for count size of groups to total_order, then count number of zeros in qty and last get difference by diff with fillna and days:
Notice - for difference need sorted columns, sort_values do it if necessary:
df = df.sort_values(['dis','type','id','date'])
g = df.groupby(['dis','type','id','date'])
df['total_order'] = g['id'].transform('size')
df['zero_qty'] = g['qty'].transform(lambda x: (x == 0).sum()).astype(int)
df['date'] = df.groupby(['dis','type','id'])['date'].diff().fillna(0).dt.days
print (df)
dis type id date qty total_order zero_qty
0 1 1 10 0 1.0 2 1
1 1 1 10 0 0.0 2 1
2 1 1 10 1 4.5 1 0
3 1 2 11 0 1.0 3 1
4 1 2 11 0 2.0 3 1
5 1 2 11 0 0.0 3 1
6 1 2 11 2 0.0 1 1
Another solution instead multiple transform use apply with custom function:
df = df.sort_values(['dis','type','id','date'])
def f(x):
x['total_order'] = len(x)
x['zero_qty'] = x['qty'].eq(0).sum().astype(int)
return x
df = df.groupby(['dis','type','id','date']).apply(f)
df['date'] = df.groupby(['dis','type','id'])['date'].diff().fillna(0).dt.days
print (df)
dis type id date qty total_order zero_qty
0 1 1 10 0 1.0 2 1
1 1 1 10 0 0.0 2 1
2 1 1 10 1 4.5 1 0
3 1 2 11 0 1.0 3 1
4 1 2 11 0 2.0 3 1
5 1 2 11 0 0.0 3 1
6 1 2 11 2 0.0 1 1
EDIT:
Last row can be rewrite too if need process more columns:
def f2(x):
#add another code
x['date'] = x['date'].diff().fillna(0).dt.days
return x
df = df.groupby(['dis','type','id']).apply(f2)

Related

How can I calculate the sum of 3 values from each number in a pandas dataframe including the first number?

I have a dataframe as below:
Datetime Data Fn
0 18747.385417 11275.0 0
1 18747.388889 8872.0 1
2 18747.392361 7050.0 0
3 18747.395833 8240.0 1
4 18747.399306 5158.0 1
5 18747.402778 3926.0 0
6 18747.406250 4043.0 0
7 18747.409722 2752.0 1
8 18747.420139 3502.0 1
9 18747.423611 4026.0 1
I want to calculate the sum of 3 values from each number in Fn and put the value in Sum.
My expected result is this:
Datetime Data Fn Sum
0 18747.385417 11275.0 0 0
1 18747.388889 8872.0 1 0
2 18747.392361 7050.0 0 1
3 18747.395833 8240.0 1 2
4 18747.399306 5158.0 1 2
5 18747.402778 3926.0 0 2
6 18747.406250 4043.0 0 1
7 18747.409722 2752.0 1 1
8 18747.420139 3502.0 1 2
9 18747.423611 4026.0 1 3
df['Sum'] = df.Fn.rolling(3).sum().fillna(0)
Output:
Datetime Data Fn Sum
0 18747.385417 11275.0 0 0.0
1 18747.388889 8872.0 1 0.0
2 18747.392361 7050.0 0 1.0
3 18747.395833 8240.0 1 2.0
4 18747.399306 5158.0 1 2.0
5 18747.402778 3926.0 0 2.0
6 18747.406250 4043.0 0 1.0
7 18747.409722 2752.0 1 1.0
8 18747.420139 3502.0 1 2.0
9 18747.423611 4026.0 1 3.0

Difference in score to next rank

I have a dataframe
Group Score Rank
1 0 3
1 4 1
1 2 2
2 3 2
2 1 3
2 7 1
I have to take the difference of the score in next rank within each group. For example, in group 1 rank(1) - rank(2) = 4 - 2
Expected output:
Group Score Rank Difference
1 0 3 0
1 4 1 2
1 2 2 2
2 3 2 2
2 1 3 0
2 7 1 4
you can try:
df = df.sort_values(['Group', 'Rank'],ascending = [True,False])
df['Difference'] =df.groupby('Group', as_index=False)['Score'].transform('diff').fillna(0).astype(int)
OUTPUT:
Group Score Rank Difference
0 1 0 3 0
2 1 2 2 2
1 1 4 1 2
4 2 1 3 0
3 2 3 2 2
5 2 7 1 4
NOTE: The result is sorted based on the rank column.
I think you can create a new column for the values in the next rank by using the shift() and then calculate the difference. You can see the following codes:
# Sort the dataframe
df = df.sort_values(['Group','Rank']).reset_index(drop=True)
# Shift up values by one row within a group
df['Score_next'] = df.groupby('Group')['Score'].shift(-1).fillna(0)
# Calculate the difference
df['Difference'] = df['Score'] - df['Score_next']
Here is the result:
print(df)
Group Score Rank Score_next Difference
0 1 4 1 2.0 2.0
1 1 2 2 0.0 2.0
2 1 0 3 0.0 0.0
3 2 7 1 3.0 4.0
4 2 3 2 1.0 2.0
5 2 1 3 0.0 1.0

Transform cell values as column headers and fill it with 1 if matching in python

I have a dataframe:
df
ID 0 1 2 3 4 ....
1 10 20 5 1 2 ....
2 3 4 NaN 10 1 ....
And I need to transpose the cell values of the column 0,1,2,3,4... to the column headers, and fill it for the Id's with 1 if the cell value is present for the respective ID.
Desired Output:
ID 1 2 3 4 5 ... 10 20 ..
1 1 1 0 0 1 ... 1 1 ..
2 1 0 1 1 0 ... 1 0 ..
Note that some entries can be NaN.
How can I get the desired output?
Use DataFrame.set_index with DataFrame.stack for remove missing values, then create indicators by get_dummies and return 1/0 by max by first level, last convert columns to integers:
df1 = (pd.get_dummies(df.set_index('ID').stack())
.max(level=0)
.rename(columns=int)
.reset_index())
print (df1)
ID 1 2 3 4 5 10 20
0 1 1 1 0 0 1 1 1
1 2 1 0 1 1 0 1 0
EDIT:
print (df)
ID 0 1 2 3 4 5
0 1 10 20 5.0 1 2 5
1 2 3 4 NaN 10 1 2
If use max then always in output are 0/1 values (check 5 column):
df1 = (pd.get_dummies(df.set_index('ID').stack())
.max(level=0)
.rename(columns=int)
.reset_index())
print (df1)
ID 1 2 3 4 5 10 20
0 1 1 1 0 0 1 1 1
1 2 1 1 1 1 0 1 0
But if use sum it count values (check 5 column):
df2 = (pd.get_dummies(df.set_index('ID').stack())
.sum(level=0)
.rename(columns=int)
.reset_index())
print (df2)
ID 1 2 3 4 5 10 20
0 1 1 1 0 0 2 1 1
1 2 1 1 1 1 0 1 0
Another way using melt and pd.crosstab
df1 = df.melt('ID')
df_final = pd.crosstab(index=df1.ID, columns=df1.value).reset_index()
Out[673]:
value ID 1.0 2.0 3.0 4.0 5.0 10.0 20.0
0 1 1 1 0 0 1 1 1
1 2 1 0 1 1 0 1 0
Note: default counting of pd.crosstab uses frequency. Therefore, duplicate values will count as their frequencies. If you want only 1/0 indicator, just chain ge(1) and astype as follows
pd.crosstab(index=df1.ID, columns=df1.value).ge(1).astype(int).reset_index()

Pandas Insert a new row after every nth row

I have a dataframe that looks like below:
**L_Type L_ID C_Type E_Code**
0 1 1 9
0 1 2 9
0 1 3 9
0 1 4 9
0 2 1 2
0 2 2 2
0 2 3 2
0 2 4 2
0 3 1 3
0 3 2 3
0 3 3 3
0 3 4 3
I need to insert a new row after every 4 row and increment the value in third column (C_Type) by 01 like below table while keeping the values same as first two columns and does not want any value in last column:
L_Type L_ID C_Type E_Code
0 1 1 9
0 1 2 9
0 1 3 9
0 1 4 9
0 1 5
0 2 1 2
0 2 2 2
0 2 3 2
0 2 4 2
0 2 5
0 3 1 3
0 3 2 3
0 3 3 3
0 3 4 3
0 3 5
I have searched other threads but could not figure out the exact solution:
How to insert n DataFrame to another every nth row in Pandas?
Insert new rows in pandas dataframe
You can seelct rows by slicing, add 1 to column C_Type and 0.5 to index, for 100% sorrect slicing, because default method of sorting in DataFrame.sort_index is quicksort. Last join together, sort index and create default by concat with DataFrame.reset_index and drop=True:
df['C_Type'] = df['C_Type'].astype(int)
df2 = (df.iloc[3::4]
.assign(C_Type = lambda x: x['C_Type'] + 1, E_Code = np.nan)
.rename(lambda x: x + .5))
df1 = pd.concat([df, df2], sort=False).sort_index().reset_index(drop=True)
print (df1)
L_Type L_ID C_Type E_Code
0 0 1 1 9.0
1 0 1 2 9.0
2 0 1 3 9.0
3 0 1 4 9.0
4 0 1 5 NaN
5 0 2 1 2.0
6 0 2 2 2.0
7 0 2 3 2.0
8 0 2 4 2.0
9 0 2 5 NaN
10 0 3 1 3.0
11 0 3 2 3.0
12 0 3 3 3.0
13 0 3 4 3.0
14 0 3 5 NaN

Pandas assign the groupby sum value to the last row in the original table

For example, I have a table
A
id price sum
1 2 0
1 6 0
1 4 0
2 2 0
2 10 0
2 1 0
2 5 0
3 1 0
3 5 0
What I want is like (the last row of sum should be the sum of price of a group)
id price sum
1 2 0
1 6 0
1 4 12
2 2 0
2 10 0
2 1 0
2 5 18
3 1 0
3 5 6
What I can do is find out the sum using
A['price'].groupby(A['id']).transform('sum')
However I don't know how to assign this to the sum column (last row).
Thanks
Use last_valid_index to locate rows to fill
g = df.groupby('id')
l = pd.DataFrame.last_valid_index
df.loc[g.apply(l), 'sum'] = g.price.sum().values
df
id price sum
0 1 2 0
1 1 6 0
2 1 4 12
3 2 2 0
4 2 10 0
5 2 1 0
6 2 5 18
7 3 1 0
8 3 5 6
You could do this:
df.assign(sum=df.groupby('id')['price'].transform('sum').drop_duplicates(keep='last')).fillna(0)
OR
df['sum'] = (df.groupby('id')['price']
.transform('sum')
.mask(df.id.duplicated(keep='last'), 0))
Output:
id price sum
0 1 2 0.0
1 1 6 0.0
2 1 4 12.0
3 2 2 0.0
4 2 10 0.0
5 2 1 0.0
6 2 5 18.0
7 3 1 0.0
8 3 5 6.0

Categories