Cumulative count of values with grouping using Pandas - python

I have the following DataFrame:
>>> df = pd.DataFrame(data={
...     'type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
...     'value': [0, 2, 3, 4, 0, 3, 2, 3, 0]})
>>> df
type value
0 A 0
1 A 2
2 A 3
3 B 4
4 B 0
5 B 3
6 C 2
7 C 3
8 C 0
What I need to accomplish is the following: for each type, keep a cumulative count of the non-zero values, restarting the count each time a 0 value is encountered.
type value cumcount
0 A 0 NaN
1 A 2 1
2 A 3 2
3 B 4 1
4 B 0 NaN
5 B 3 1
6 C 2 1
7 C 3 2
8 C 0 NaN

The idea is to create consecutive groups, keep only the non-zero rows, and finally assign the result to a new column using the same mask:
m = df['value'].eq(0)
g = m.ne(m.shift()).cumsum()[~m]
df.loc[~m, 'new'] = df.groupby(['type',g]).cumcount().add(1)
print (df)
type value new
0 A 0 NaN
1 A 2 1.0
2 A 3 2.0
3 B 4 1.0
4 B 0 NaN
5 B 3 1.0
6 C 2 1.0
7 C 3 2.0
8 C 0 NaN
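To make the trick concrete, here is a small sketch (not part of the original answer) that prints the intermediate helpers: m flags the zero rows, and g labels each consecutive run of non-zero values, so the later groupby(['type', g]).cumcount() restarts after every zero.
m = df['value'].eq(0)             # True where value == 0
g = m.ne(m.shift()).cumsum()      # new label at every 0 / non-0 boundary
print(g.tolist())                 # [1, 2, 2, 2, 3, 4, 4, 4, 5]
print(g[~m].tolist())             # [2, 2, 2, 4, 4, 4] -> run labels of the non-zero rows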
For pandas 0.24+ it is possible to use the nullable integer data type:
df['new'] = df['new'].astype('Int64')
print (df)
type value new
0 A 0 NaN
1 A 2 1
2 A 3 2
3 B 4 1
4 B 0 NaN
5 B 3 1
6 C 2 1
7 C 3 2
8 C 0 NaN

How to check if a value in a dataframe's column is in another dataframe's column with group by

What I want to do is:
1- Group the dataframe by two columns
2- From each group, check if the values of a column are not in another column of the group.
x = pd.DataFrame({'x': [1,1,1,1,1,1,2], 'y': [4,4,4,5,5,5,4], 'z':['a', 'b', 'c', 'a', 'b', 'c', 'a'], 's':['a', 'a', 'b', 'a', 'a', 'a', 'b']})
x:
x y z s
0 1 4 a a
1 1 4 b a
2 1 4 c b
3 1 5 a a
4 1 5 b a
5 1 5 c a
6 2 4 a b
What I would like to check is whether the values of column z are not in column s, with the dataframe grouped by x and y.
For example, in the following group (x=1 and y=4):
x y z s
0 1 4 a a
1 1 4 b a
2 1 4 c b
The result will be the third row:
x y z s
0 1 4 c b
I have tried something like this but it gets stuck:
x= x.groupby(['x', 'y'])[(~x.z.isin(x.s)).index]
Any suggestions?
Thanks in advance!
Left merge:
m = x.merge(x, left_on=['x','y','z'],
            right_on=['x','y','s'],
            how='left', suffixes=['','_'])
You would see:
x y z s z_ s_
0 1 4 a a a a
1 1 4 a a b a
2 1 4 b a c b
3 1 4 c b NaN NaN
4 1 5 a a a a
5 1 5 a a b a
6 1 5 a a c a
7 1 5 b a NaN NaN
8 1 5 c a NaN NaN
9 2 4 a b NaN NaN
Then the rows you want are those where s_ is NaN, so:
m.loc[m['s_'].isna(), x.columns]
Output:
x y z s
3 1 4 c b
7 1 5 b a
8 1 5 c a
9 2 4 a b
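An equivalent way, shown here only as a sketch on the same x, is to let merge itself flag the unmatched rows with indicator=True instead of testing s_ for NaN:
m = x.merge(x, left_on=['x','y','z'],
            right_on=['x','y','s'],
            how='left', suffixes=['','_'],
            indicator=True)
print(m.loc[m['_merge'].eq('left_only'), x.columns])   # same four rows as above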
Option 2: do an apply with isin on groupby:
(x.groupby(['x','y'])
   .apply(lambda d: d[~d['z'].isin(d['s'])])
   .reset_index(level=['x','y'], drop=True))
Output:
x y z s
2 1 4 c b
4 1 5 b a
5 1 5 c a
6 2 4 a b

Python Pandas shift by given value in cell within groupby

Given the following dataframe
df = pd.DataFrame(data={'name': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
'lag': [1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
'value': range(10)})
print(df)
lag name value
0 1 a 0
1 1 a 1
2 1 a 2
3 2 b 3
4 2 b 4
5 2 b 5
6 2 b 6
7 2 c 7
8 2 c 8
9 2 c 9
I am trying to shift the values in column value to obtain the column expected_value, which is value grouped by column name and shifted by lag rows. I was thinking of using something like df['expected_value'] = df.groupby(['name', 'lag']).shift(), but I am not sure how to pass lag to the shift() function.
print(df)
lag name value expected_value
0 1 a 0 nan
1 1 a 1 0.0000
2 1 a 2 1.0000
3 2 b 3 nan
4 2 b 4 nan
5 2 b 5 3.0000
6 2 b 6 4.0000
7 2 c 7 nan
8 2 c 8 nan
9 2 c 9 7.0000
You can use GroupBy.transform here.
df.assign(expected_value=df.groupby(['name', 'lag'])['value']
            .transform(lambda x: x.shift(x.name[1])))
name lag value expected_value
0 a 1 0 NaN
1 a 1 1 0.0
2 a 1 2 1.0
3 b 2 3 NaN
4 b 2 4 NaN
5 b 2 5 3.0
6 b 2 6 4.0
7 c 2 7 NaN
8 c 2 8 NaN
9 c 2 9 7.0
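Why x.shift(x.name[1]) works: when transform is given a plain Python function, each group is passed as a Series whose .name attribute is that group's key, here the tuple (name, lag), so x.name[1] is the lag for that group. A quick sketch of the group keys involved (assuming the df above):
for key, grp in df.groupby(['name', 'lag'])['value']:
    print(key)   # ('a', 1), ('b', 2), ('c', 2) -- the same tuples the lambda sees as x.name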
You can do it with an apply:
df['new_val'] = (df.groupby('name')
                   .apply(lambda x: x['value'].shift(x['lag'].iloc[0]))
                   .reset_index('name', drop=True))
Output:
name lag value new_val
0 a 1 0 NaN
1 a 1 1 0.0
2 a 1 2 1.0
3 b 2 3 NaN
4 b 2 4 NaN
5 b 2 5 3.0
6 b 2 6 4.0
7 c 2 7 NaN
8 c 2 8 NaN
9 c 2 9 7.0
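Since lag is constant within each name, a loop over the distinct lag values would also work and avoids apply; a sketch under that assumption:
import numpy as np

df['new_val'] = np.nan
for lag_val, sub in df.groupby('lag'):
    # shift each name's values by that block's lag
    df.loc[sub.index, 'new_val'] = sub.groupby('name')['value'].shift(lag_val)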

Pandas cumulative count on new value

I have a data frame like the below one.
df = pd.DataFrame()
df['col_1'] = [1, 1, 1, 2, 2, 2, 3, 3, 3]
df['col_2'] = ['A', 'B', 'B', 'A', 'B', 'C', 'A', 'A', 'B']
df
col_1 col_2
0 1 A
1 1 B
2 1 B
3 2 A
4 2 B
5 2 C
6 3 A
7 3 A
8 3 B
I need to group by col_1 and, within each group, increment a cumulative count whenever a new value appears in col_2, as in the data frame below.
col_1 col_2 col_3
0 1 A 1
1 1 B 2
2 1 B 2
3 2 A 1
4 2 B 2
5 2 C 3
6 3 A 1
7 3 A 1
8 3 B 2
I could do this using lists and a dictionary, but I couldn't find a way using pandas built-in functions.
Use factorize with a lambda function in GroupBy.transform:
df['col_3'] = df.groupby('col_1')['col_2'].transform(lambda x: pd.factorize(x)[0]+1)
print (df)
col_1 col_2 col_3
0 1 A 1
1 1 B 2
2 1 B 2
3 2 A 1
4 2 B 2
5 2 C 3
6 3 A 1
7 3 A 1
8 3 B 2
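For reference, factorize encodes values by order of first appearance within whatever it is given, which is exactly the per-group counter needed here; a standalone sketch of its behaviour:
import pandas as pd

codes, uniques = pd.factorize(['A', 'B', 'B', 'A', 'C'])
print(codes)     # [0 1 1 0 2] -> adding 1 gives the "new value" labels
print(uniques)   # ['A' 'B' 'C']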

How to reassign the value of a column that has repeated values if it exists for some value?

I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'codes': [1, 2, 3, 4, 1, 2, 1, 2, 1, 2], 'results': ['a', 'b', 'c', 'd', None, None, None, None, None, None]})
I need to produce the following:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
It is guaranteed that if the value of results is not None for a value in codes, it will be unique; that is, there won't be two rows with the same code but different results.
You can do it with merge:
df[['codes']].reset_index().merge(df.dropna()).set_index('index').sort_index()
Out[571]:
codes results
index
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
Or map
df['results']=df.codes.map(df.set_index('codes').dropna()['results'])
df
Out[574]:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
Or groupby + ffill
df['results']=df.groupby('codes').results.ffill()
df
Out[577]:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
Or reindex | .loc
df.set_index('codes').dropna().reindex(df.codes).reset_index()
Out[589]:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
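Another option, sketched here and relying on the stated guarantee that each code maps to a single result, is to fill only the missing rows with a map built from the non-null pairs:
mapping = df.dropna().drop_duplicates('codes').set_index('codes')['results']
df['results'] = df['results'].fillna(df['codes'].map(mapping))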

Using groupby with expanding and a custom function

I have a dataframe that consists of truthIds and trackIds:
truthId = ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'C', 'B', 'A', 'A', 'C', 'C']
trackId = [1, 1, 2, 2, 3, 4, 5, 3, 2, 1, 5, 4, 6]
df = pd.DataFrame({'truthId': truthId, 'trackId': trackId})
trackId truthId
0 1 A
1 1 A
2 2 B
3 2 B
4 3 C
5 4 C
6 5 A
7 3 C
8 2 B
9 1 A
10 5 A
11 4 C
12 6 C
I wish to add a column that calculates, for each unique truthId, the length of the set of unique trackIds that have previously (i.e. from the top of the data to that row) been associated with it:
truthId trackId unique_Ids
0 A 1 1
1 A 1 1
2 B 2 1
3 B 2 1
4 C 3 1
5 C 4 2
6 A 5 2
7 C 3 2
8 B 2 1
9 A 1 2
10 A 5 2
11 C 4 2
12 C 6 3
I am very close to accomplishing this. I can use:
df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})
Which produces the following output:
trackId
truthId
A 0 1.0
1 1.0
6 2.0
9 2.0
10 2.0
B 2 1.0
3 1.0
8 1.0
C 4 1.0
5 2.0
7 2.0
11 2.0
12 3.0
This is consistent with the documentation.
However, it throws an error when I attempt to assign this output to a new column:
df['unique_Ids'] = df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})
I have used this workflow before and ideally the new column is put back into the original DataFrame with no issues (i.e. Split-Apply-Combine). How can I get it to work?
You need reset_index:
df['Your'] = (df.groupby('truthId').expanding()
                .agg({'trackId': lambda x: len(set(x))})
                .reset_index(level=0, drop=True))
df
Out[1162]:
trackId truthId Your
0 1 A 1.0
1 1 A 1.0
2 2 B 1.0
3 2 B 1.0
4 3 C 1.0
5 4 C 2.0
6 5 A 2.0
7 3 C 2.0
8 2 B 1.0
9 1 A 2.0
10 5 A 2.0
11 4 C 2.0
12 6 C 3.0
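An alternative that skips the expanding/reset_index round trip (a sketch on the same df): within each truthId, a trackId enlarges the set only the first time it appears, so the running count of first appearances equals the size of the set seen so far, and the result stays integer:
df['unique_Ids'] = (df.groupby('truthId')['trackId']
                      .transform(lambda s: (~s.duplicated()).cumsum()))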
