Cumulative count in a pandas df - python

I am trying to export a cumulative count based on two columns in a pandas df.
An example is the df below. I'm trying to export a count based on Value and Count: whenever Count increases, I want to attribute that increase to the Value in the same row.
import pandas as pd
d = ({
'Value' : ['A','A','B','C','D','A','B','A'],
'Count' : [0,1,1,2,3,3,4,5],
})
df = pd.DataFrame(d)
I have used this:
for val in ['A','B','C','D']:
    cond = df.Value.eq(val) & df.Count.eq(int)
    df.loc[cond, 'Count_' + val] = cond[cond].cumsum()
If I alter int to a specific number it will return the count. But I need this to read any number as the Count column keeps increasing.
My intended output is:
Value Count A_Count B_Count C_Count D_Count
0 A 0 0 0 0 0
1 A 1 1 0 0 0
2 B 1 1 0 0 0
3 C 2 1 0 1 0
4 D 3 1 0 1 1
5 A 3 1 0 1 1
6 B 4 1 1 1 1
7 A 5 2 1 1 1
The count increases on the second row, so Value A gets 1. It increases again on row 4, which is the first time for Value C, so C gets 1. The same applies to rows 5 and 7. The count increases again on row 8, so A becomes 2.

You could use str.get_dummies together with diff and cumsum:
In [262]: df['Value'].str.get_dummies().multiply(df['Count'].diff().gt(0), axis=0).cumsum()
Out[262]:
A B C D
0 0 0 0 0
1 1 0 0 0
2 1 0 0 0
3 1 0 1 0
4 1 0 1 1
5 1 0 1 1
6 1 1 1 1
7 2 1 1 1
Joined back onto the original df with a _Count suffix, that is:
In [266]: df.join(df['Value'].str.get_dummies()
.multiply(df['Count'].diff().gt(0), axis=0)
.cumsum().add_suffix('_Count'))
Out[266]:
Value Count A_Count B_Count C_Count D_Count
0 A 0 0 0 0 0
1 A 1 1 0 0 0
2 B 1 1 0 0 0
3 C 2 1 0 1 0
4 D 3 1 0 1 1
5 A 3 1 0 1 1
6 B 4 1 1 1 1
7 A 5 2 1 1 1
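The one-liner above chains three steps. A minimal sketch that separates them, using the question's data (the intermediate names `dummies`, `increased` and `counts` are illustrative and not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'Value': ['A', 'A', 'B', 'C', 'D', 'A', 'B', 'A'],
    'Count': [0, 1, 1, 2, 3, 3, 4, 5],
})

# One indicator column per distinct Value (1 where the row holds that value).
dummies = df['Value'].str.get_dummies()

# True on rows where Count went up relative to the previous row.
# The first row has no predecessor, so diff() is NaN and gt(0) is False.
increased = df['Count'].diff().gt(0)

# Zero out rows without an increase, then accumulate per column.
counts = dummies.multiply(increased, axis=0).cumsum().add_suffix('_Count')

result = df.join(counts)
print(result)
```

Because the increase is detected with diff(), no specific Count value has to be hard-coded, which is what the loop in the question was missing.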

Related

Splitting a non delimited column and create an additional column to count which number value

I have a problem in which I want to take Table 1 and turn it into Table 2 using Python.
Does anybody have any ideas? I've tried to split the Value column from table 1 but run into issues in that each value is a different length, hence I can't always define how much to split it.
Equally I have not been able to think through how to create a new column that counts the position that value was in the string.
Table 1, before:
ID  Value
1   000000S
2   000FY
Table 2, after:
ID  Position  Value
1   1         0
1   2         0
1   3         0
1   4         0
1   5         0
1   6         0
1   7         S
2   1         0
2   2         0
2   3         0
2   4         F
2   5         Y
You can split the string to individual characters and explode:
out = (df
       .assign(Value=df['Value'].apply(list))
       .explode('Value')
      )
output:
ID Value
0 1 0
0 1 0
0 1 0
0 1 0
0 1 0
0 1 0
0 1 S
1 2 0
1 2 0
1 2 0
1 2 F
1 2 Y
Given:
ID Value
0 1 000000S
1 2 000FY
Doing:
df.Value = df.Value.apply(list)
df = df.explode('Value')
df['Position'] = df.groupby('ID').cumcount() + 1
Output:
ID Value Position
0 1 0 1
0 1 0 2
0 1 0 3
0 1 0 4
0 1 0 5
0 1 0 6
0 1 S 7
1 2 0 1
1 2 0 2
1 2 0 3
1 2 F 4
1 2 Y 5
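Putting the two answers together, a self-contained sketch that starts from Table 1 and ends with the column order of Table 2. It assumes each ID appears only once in the input, so that grouping by ID numbers the characters of a single string; the `reset_index` and the final column reordering are additions for tidiness, not part of the original answers:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'Value': ['000000S', '000FY']})

out = (
    df.assign(Value=df['Value'].apply(list))  # '000FY' -> ['0', '0', '0', 'F', 'Y']
      .explode('Value')                       # one row per character
      .reset_index(drop=True)                 # drop the repeated original index
)

# Number the characters within each ID, starting at 1.
out['Position'] = out.groupby('ID').cumcount() + 1
out = out[['ID', 'Position', 'Value']]
print(out)
```

Since `apply(list)` splits a string into single characters, the differing lengths of the values do not matter.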

How do I create a column whose values count the 1s in each row that appear for the first time in their own column?

How do I do this operation using pandas?
Initial Df:
A B C D
0 0 1 0 0
1 0 1 0 0
2 0 0 1 1
3 0 1 0 1
4 1 1 0 0
5 1 1 1 0
Final Df:
A B C D Param
0 0 1 0 0 1
1 0 1 0 0 0
2 0 0 1 1 2
3 0 1 0 1 0
4 1 1 0 0 1
5 1 1 1 0 0
Basically, Param is the number of 1s in that row that appear for the first time in their own column.
Example:
index 0 : the 1 in column B appears for the first time, hence Param = 1
index 1 : none of the 1s appears for the first time in its own column, hence Param = 0
index 2 : the 1s in columns C and D appear for the first time in their columns, hence Param = 2
index 3 : none of the 1s appears for the first time in its own column, hence Param = 0
index 4 : the 1 in column A appears for the first time in its column, hence Param = 1
index 5 : none of the 1s appears for the first time in its own column, hence Param = 0
I would use idxmax and value_counts:
df['Param']=df.idxmax().value_counts().reindex(df.index,fill_value=0)
df
A B C D Param
0 0 1 0 0 1
1 0 1 0 0 0
2 0 0 1 1 2
3 0 1 0 1 0
4 1 1 0 0 1
5 1 1 1 0 0
You can check for duplicated values, multiply with df and sum:
df['Param'] = df.apply(lambda x: ~x.duplicated()).mul(df).sum(1)
Output:
A B C D Param
0 0 1 0 0 1
1 0 1 0 0 0
2 0 0 1 1 2
3 0 1 0 1 0
4 1 1 0 0 1
5 1 1 1 0 0
Assuming these are integers, you can use cumsum() twice to isolate the first occurrence of 1.
df2 = (df.cumsum() > 0).cumsum() == 1
df['Param'] = df2.sum(axis = 1)
print(df)
If df elements are strings, you should first convert them to integers.
df = df.astype(int)
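A step-by-step sketch of the double-cumsum idea on the question's data. The names `seen` and `first_one` are illustrative; the result is kept in a separate Series so that re-running the code does not pick up an existing Param column:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0, 0, 0, 0, 1, 1],
    'B': [1, 1, 0, 1, 1, 1],
    'C': [0, 0, 1, 0, 0, 1],
    'D': [0, 0, 1, 1, 0, 0],
})

# The running total per column is positive from the first 1 onward...
seen = df.cumsum() > 0
# ...so counting those True values equals exactly 1 on the row of the first 1.
first_one = seen.cumsum() == 1

# Number of "first 1s" in each row.
param = first_one.sum(axis=1)
print(df.assign(Param=param))
```

One difference worth knowing: for a column that contains no 1 at all, this approach contributes nothing, whereas idxmax returns the first index label for an all-zero column and would credit that row.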

Using previous row value while creating a new column

I have a df in python that looks something like this:
'A'
0
1
0
0
1
1
1
1
0
I want to create another column that adds cumulative 1's from column A, and starts over if the value in column A becomes 0 again. So desired output:
'A' 'B'
0 0
1 1
0 0
0 0
1 1
1 2
1 3
1 4
0 0
This is what I am trying, but it's just replicating column A:
df.B[df.A ==0] = 0
df.B[df.A !=0] = df.A + df.B.shift(1)
Let us do cumsum with groupby cumcount
df['B']=(df.groupby(df.A.eq(0).cumsum()).cumcount()).where(df.A==1,0)
Out[81]:
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 0
dtype: int64
Use shift with ne and groupby.cumsum:
df['B'] = df.groupby(df['A'].shift().ne(df['A']).cumsum())['A'].cumsum()
print(df)
A B
0 0 0
1 1 1
2 0 0
3 0 0
4 1 1
5 1 2
6 1 3
7 1 4
8 0 0
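The second approach can be unpacked as follows. This is a minimal sketch on the question's data; `run_start` and `run_id` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0, 0, 1, 1, 1, 1, 0]})

# A new run starts wherever A differs from the previous row.
run_start = df['A'].ne(df['A'].shift())
# The cumulative sum of the start flags gives every run its own id.
run_id = run_start.cumsum()

# Summing A within a run counts consecutive 1s; runs of 0 stay at 0.
df['B'] = df.groupby(run_id)['A'].cumsum()
print(df)
```

Grouping by the run id is what makes the counter restart: each block of consecutive equal values is accumulated on its own.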

pandas calculate difference based on indicators grouped by a column with duplicated grouped pair

Here is an example.
a b k c
0 0 0 0
0 1 1 0
0 2 0 0
0 3 0 0
0 4 1 0
0 5 0 0
0 0 0 1
0 1 1 1
0 2 0 1
0 3 0 1
0 4 1 1
0 5 0 1
1 0 0 0
1 1 1 0
1 2 0 0
1 3 1 0
1 4 0 0
1 0 0 1
1 1 1 1
1 2 0 1
1 3 1 1
1 4 0 1
Here, "a" is the user id, "b" is time, "c" is the product and "k" is a binary indicator flag. Within each c, "b" is always consecutive, and the flag "k" is the same for a given (a, b) pair, which means it is independent of "c". What I want to get is this:
a b k c diff_b
0 0 0 0 nan
0 1 1 0 nan
0 2 0 0 1
0 3 0 0 2
0 4 1 0 3
0 5 0 0 1
0 0 0 1 nan
0 1 1 1 nan
0 2 0 1 1
0 3 0 1 2
0 4 1 1 3
0 5 0 1 1
1 0 0 0 nan
1 1 1 0 nan
1 2 0 0 1
1 3 1 0 2
1 4 0 0 1
1 0 0 1 nan
1 1 1 1 nan
1 2 0 1 1
1 3 1 1 2
1 4 0 1 1
So, diff_b is a time difference variable. It shows the duration between the current time point and the last time point with an action. If there has been no action before, it returns nan. This diff_b is grouped by a: it is calculated independently for each user and, for the same user, independently for each product.
Thank you.
You just need to add c to the grouping keys in the second step:
df['New'] = df.b.loc[df.k == 1]  # keep b only on rows where k equals 1
# forward-fill within each (a, c) group, then shift so a row never sees its own action
df['New'] = df.groupby(['a', 'c'])['New'].transform(lambda x: x.ffill().shift())
df['diff_b'] = df.b - df['New']
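A self-contained sketch of the same idea on the first user of the question's table (a = 0, both products). It uses `Series.where` and `transform` rather than a helper column, and `last_action` is an illustrative name:

```python
import pandas as pd

# One user (a=0), two products (c=0 and c=1), taken from the question's table.
df = pd.DataFrame({
    'a': [0] * 12,
    'b': [0, 1, 2, 3, 4, 5] * 2,
    'k': [0, 1, 0, 0, 1, 0] * 2,
    'c': [0] * 6 + [1] * 6,
})

# Time of the action on rows where k == 1, NaN elsewhere.
last_action = df['b'].where(df['k'].eq(1))

# Within each (a, c) group: carry the latest action time forward, then shift
# one row so the current row only sees actions strictly before it.
last_action = last_action.groupby([df['a'], df['c']]).transform(
    lambda s: s.ffill().shift()
)

df['diff_b'] = df['b'] - last_action
print(df)
```

The shift is what makes a row with k == 1 report the distance to the previous action rather than 0.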

Find first row with condition after each row satisfying another condition

In pandas I have the following data frame:
a b
0 0
1 1
2 1
0 0
1 0
2 1
Now I want to do the following:
Create a new column c, and for each row where a = 0 fill c with 1. Then c should keep being filled with 1s up to and including the first following row where b = 1 (this is where I'm stuck), so the output should look like this:
a b c
0 0 1
1 1 1
2 1 0
0 0 1
1 0 1
2 1 1
Thanks!
It seems you need:
df['c'] = df.groupby(df.a.eq(0).cumsum())['b'].cumsum().le(1).astype(int)
print (df)
a b c
0 0 0 1
1 1 1 1
2 2 1 0
3 0 0 1
4 1 0 1
5 2 1 1
Detail:
print (df.a.eq(0).cumsum())
0 1
1 1
2 1
3 2
4 2
5 2
Name: a, dtype: int32
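Note that `cumsum().le(1)` also keeps 1 on rows with b = 0 that come after the first b = 1 in a block (for b = 0, 1, 0, 1 it yields 1, 1, 1, 0). If c should be 1 only up to and including the first b = 1, one possible variant, sketched here under that reading of the question, counts the b = 1 rows strictly before the current row. The names `block`, `hits` and `hits_before` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 0, 1, 2], 'b': [0, 1, 1, 0, 0, 1]})

# Each row with a == 0 opens a new block.
block = df['a'].eq(0).cumsum()

# Number of b == 1 rows seen so far inside the block, current row included.
hits = df.groupby(block)['b'].cumsum()
# Same count excluding the current row (0 on the first row of each block).
hits_before = hits.groupby(block).shift(fill_value=0)

# Keep 1 up to and including the first b == 1 of each block.
df['c'] = hits_before.eq(0).astype(int)
print(df)
```

On the question's data both versions give the same column c; they only differ when a block continues with b = 0 after its first b = 1.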
