Counting distinct, until a certain condition based on another row is met - python

I have the following df
Original df
Step | CampaignSource | UserId
1 Banana Jeff
1 Banana John
2 Banana Jefferson
3 Website Nunes
4 Banana Jeff
5 Attendance Nunes
6 Attendance Antonio
7 Banana Antonio
8 Website Joseph
9 Attendance Joseph
9 Attendance Joseph
Desired output
Steps | CampaignSource | CountedDistinctUserid
1 Website 2 (Because of different userids)
2 Banana 1
3 Banana 1
4 Website 1
5 Banana 1
6 Attendance 1
7 Attendance 1
8 Attendance 1
9 Attendance 1 (but I want 2 here, even though the user ids are identical, because it is the 9th step)
What I want to do is impose a condition: if the step column, which holds strings, equals '9', I want to count the user ids as non-distinct. Any ideas on how I could do that? I tried applying a function but could not make it work.
What I am currently doing:
df[['Steps','UserId','CampaignSource']].groupby(['Steps','CampaignSource'],as_index=False,dropna=False).nunique()

You can group by "Step" and use a condition on the group name:
df.groupby('Step')['UserId'].apply(lambda g: g.nunique() if g.name<9 else g.count())
output:
Step
1 2
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 2
Name: UserId, dtype: int64
As DataFrame:
(df.groupby('Step', as_index=False)
   .agg(CampaignSource=('CampaignSource', 'first'),
        # count over UserId, not CampaignSource
        CountedDistinctUserid=('UserId', lambda g: g.nunique() if g.name<9 else g.count())
       )
)
output:
Step CampaignSource CountedDistinctUserid
0 1 Banana 2
1 2 Banana 1
2 3 Website 1
3 4 Banana 1
4 5 Attendance 1
5 6 Attendance 1
6 7 Banana 1
7 8 Website 1
8 9 Banana 2
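
A vectorized alternative, sketched under assumptions: the sample frame is rebuilt here with Step stored as integers (if the column holds strings, compare against '9' instead), and the helper column names distinct and total are made up for illustration. It computes both counts per group and then picks one with np.where, so it does not depend on the group name being visible inside a lambda.

```python
import numpy as np
import pandas as pd

# Sample data from the question (Step assumed to be integer here).
df = pd.DataFrame({
    "Step": [1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9],
    "CampaignSource": ["Banana", "Banana", "Banana", "Website", "Banana",
                       "Attendance", "Attendance", "Banana", "Website",
                       "Attendance", "Attendance"],
    "UserId": ["Jeff", "John", "Jefferson", "Nunes", "Jeff", "Nunes",
               "Antonio", "Antonio", "Joseph", "Joseph", "Joseph"],
})

# Compute both the distinct count and the plain row count per group.
stats = (df.groupby(["Step", "CampaignSource"], as_index=False)
           .agg(distinct=("UserId", "nunique"), total=("UserId", "size")))

# Use the plain count for step 9 and the distinct count everywhere else.
stats["CountedDistinctUserid"] = np.where(
    stats["Step"].eq(9), stats["total"], stats["distinct"]
)
result = stats[["Step", "CampaignSource", "CountedDistinctUserid"]]
print(result)
```

Keeping both counts in one aggregation makes it easy to change the condition later, for example to a list of steps with stats["Step"].isin([...]).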

You can apply different functions to different groups depending on whether a condition matches.
out = (df[['Steps','UserId','CampaignSource']]
       .groupby(['Steps','CampaignSource'], as_index=False, dropna=False)
       .apply(lambda g: g.assign(CountedDistinctUserid=(
           [len(g)]*len(g)
           if g['Steps'].eq(9).all()
           else [g['UserId'].nunique()]*len(g)))))
print(out)
Steps UserId CampaignSource CountedDistinctUserid
0 1 Jeff Banana 2
1 1 John Banana 2
2 2 Jefferson Banana 1
3 3 Nunes Website 1
4 4 Jeff Banana 1
5 5 Nunes Attendance 1
6 6 Antonio Attendance 1
7 7 Antonio Banana 1
8 8 Joseph Website 1
9 9 Joseph Attendance 2
10 9 Joseph Attendance 2

Related

How to pivot a DataFrame creating new columns, considering the max item repeated

I have the following pd.DataFrame:
Index ID Name Date Days
1 1 Josh 5-1-20 10
2 1 Josh 9-1-20 10
3 1 Josh 19-1-20 6
4 2 Mike 1-1-20 10
5 3 George 1-4-20 10
6 4 Rose 1-2-20 10
7 4 Rose 11-5-20 5
8 5 Mark 1-9-20 10
9 6 Joe 1-4-21 10
10 7 Jill 1-1-21 10
I need to build a DataFrame in which the ID is not repeated. To do that, I want to create new columns (Date and Days), sized for the ID with the most repetitions (3 in this case).
The desired output is the following DataFrame:
Index ID Name Date 1 Date 2 Date 3 Days1 Days2 Days3
1 1 Josh 5-1-20 9-1-20 19-1-20 10 10 6
2 2 Mike 1-1-20 10
3 3 George 1-4-20 10
4 4 Rose 1-2-20 11-5-20 10 5
5 5 Mark 1-9-20 10
6 6 Joe 1-4-21 10
7 7 Jill 1-1-21 10
Try:
df_out = df.set_index(['ID','Name',df.groupby('ID').cumcount()+1]).unstack()
df_out.columns = [f'{i} {j}' for i, j in df_out.columns]
df_out.fillna('').reset_index()
Output:
ID Name Index 1 Index 2 Index 3 Date 1 Date 2 Date 3 Days 1 Days 2 Days 3
0 1 Josh 1.0 2.0 3.0 5-1-20 9-1-20 19-1-20 10.0 10.0 6.0
1 2 Mike 4.0 1-1-20 10.0
2 3 George 5.0 1-4-20 10.0
3 4 Rose 6.0 7.0 1-2-20 11-5-20 10.0 5.0
4 5 Mark 8.0 1-9-20 10.0
5 6 Joe 9.0 1-4-21 10.0
6 7 Jill 10.0 1-1-21 10.0
Here is a solution using pivot with a helper column:
df2 = (df
.assign(col=df.groupby('ID').cumcount().add(1).astype(str))
.pivot(index=['ID','Name'], columns='col', values=['Date', 'Days'])
.fillna('')
)
df2.columns = df2.columns.map('_'.join)
df2.reset_index()
Output:
ID Name Date_1 Date_2 Date_3 Days_1 Days_2 Days_3
0 1 Josh 5-1-20 9-1-20 19-1-20 10 10 6
1 2 Mike 1-1-20 10
2 3 George 1-4-20 10
3 4 Rose 1-2-20 11-5-20 10 5
4 5 Mark 1-9-20 10
5 6 Joe 1-4-21 10
6 7 Jill 1-1-21 10
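
Both answers rely on groupby('ID').cumcount() to number the rows within each ID. A minimal sketch of that step, assuming a shortened version of the sample data without the Index column (the variable names occurrence and wide are chosen here for illustration):

```python
import pandas as pd

# Shortened sample: three rows for ID 1, one for ID 2, two for ID 4.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 4, 4],
    "Name": ["Josh", "Josh", "Josh", "Mike", "Rose", "Rose"],
    "Date": ["5-1-20", "9-1-20", "19-1-20", "1-1-20", "1-2-20", "11-5-20"],
    "Days": [10, 10, 6, 10, 10, 5],
})

# Number each row within its ID: 1 for the first occurrence, 2 for the second, ...
occurrence = df.groupby("ID").cumcount() + 1
print(occurrence.tolist())

# Move the occurrence number into the columns, then flatten the column labels.
wide = df.set_index(["ID", "Name", occurrence]).unstack()
wide.columns = [f"{col} {n}" for col, n in wide.columns]
wide = wide.reset_index()
print(wide)
```

Dropping the Index column before reshaping keeps it from being spread into Index 1, Index 2, Index 3 as in the first output above.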

Pandas: Create column with rolling sum of previous n rows of another column for within the same id/group

Sample dataset:
id fruit
0 7 NaN
1 7 apple
2 7 NaN
3 7 mango
4 7 apple
5 7 potato
6 3 berry
7 3 olive
8 3 olive
9 3 grape
10 3 NaN
11 3 mango
12 3 potato
In the fruit column, NaN and potato count as 0; every other string counts as 1. I want to generate a new column sum_last3 where each row holds the sum over the previous 3 rows (inclusive) of the fruit column. When a new id appears, the sum should restart from the beginning.
Output I want:
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1
My Code:
df['sum_last5'] = (df['fruit'].ne('potato') & df['fruit'].notna())
.groupby('id',sort=False, as_index=False)['fruit']
.rolling(min_periods=1, window=3).sum().astype(int).values
You can modify your code slightly, as follows:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
.groupby(df['id'],sort=False)
.rolling(min_periods=1, window=3).sum().astype(int)
.droplevel(0)
)
or use .values as in your code (this relies on the rows of each id being contiguous, as they are in the sample):
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
.groupby(df['id'],sort=False)
.rolling(min_periods=1, window=3).sum().astype(int)
.values
)
Your code is close; you just need to change id to df['id'] in the .groupby() call. The object calling .groupby() is now a boolean Series rather than df itself, so .groupby() cannot resolve the column label 'id' on its own and needs the column passed explicitly as df['id'].
Also remove as_index=False, since this parameter applies to a DataFrame groupby rather than to a (boolean) Series here.
Result:
print(df)
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1
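
As a variation, groupby(...).transform keeps the original row index, so no droplevel or .values step is needed. This is a sketch rather than the answer's method; the sample frame is rebuilt here and the name is_fruit is chosen for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [7] * 6 + [3] * 7,
    "fruit": [np.nan, "apple", np.nan, "mango", "apple", "potato",
              "berry", "olive", "olive", "grape", np.nan, "mango", "potato"],
})

# 1 for every real fruit, 0 for NaN and "potato".
is_fruit = (df["fruit"].ne("potato") & df["fruit"].notna()).astype(int)

# Rolling sum over the last 3 rows within each id; transform returns a
# Series aligned with the original index, so it can be assigned directly.
df["sum_last3"] = (
    is_fruit.groupby(df["id"], sort=False)
    .transform(lambda s: s.rolling(window=3, min_periods=1).sum())
    .astype(int)
)
print(df)
```

Because the result is aligned by index, this form should also behave correctly when rows of the same id are not adjacent.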

Redefining a pandas dataframe based on its group

I am using this dataframe
source fruit 2019 2020 2021
0 a apple 3 1 1
1 a banana 4 3 5
2 a orange 2 2 2
3 b apple 3 4 5
4 b banana 4 5 2
5 b orange 1 6 4
I want to refine it like this
source fruit 2019 2020 2021
0 a total 9 6 8
1 a seeds 5 3 3
2 a banana 4 3 5
3 b total 8 15 11
4 b seeds 4 10 9
5 b banana 4 5 2
total is sum of all fruits in that year for each source.
seeds is the sum of fruits containing seeds for each year for each source.
I tried
Appending new empty rows : Insert a new row after every nth row & Insert row at any position
But wasn't getting the expected result.
What would be the best way to get the desired output?
TRY:
df1 = df.groupby('source', as_index=False).sum().assign(fruit = 'total')
seeds = ['orange','apple']
df2 = df.loc[df['fruit'].isin(seeds)].groupby('source', as_index=False).sum().assign(fruit = 'seeds')
final_df = pd.concat([df.loc[~df['fruit'].isin(seeds)], df1,df2])
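
The concat above stacks the banana rows first, then all totals, then all seeds rows. A possible follow-up, sketched with the sample data rebuilt and the year columns assumed to be strings, puts total, seeds and the remaining fruit together per source by concatenating in the wanted order and sorting stably on source:

```python
import pandas as pd

# Sample data from the question (year labels assumed to be strings).
df = pd.DataFrame({
    "source": ["a", "a", "a", "b", "b", "b"],
    "fruit": ["apple", "banana", "orange"] * 2,
    "2019": [3, 4, 2, 3, 4, 1],
    "2020": [1, 3, 2, 4, 5, 6],
    "2021": [1, 5, 2, 5, 2, 4],
})
years = ["2019", "2020", "2021"]
seeds = ["orange", "apple"]

# Per-source sum of every fruit, and of the seeded fruits only.
total = df.groupby("source", as_index=False)[years].sum().assign(fruit="total")
seeded = (df[df["fruit"].isin(seeds)]
          .groupby("source", as_index=False)[years].sum()
          .assign(fruit="seeds"))
rest = df[~df["fruit"].isin(seeds)]

# Concatenate in the wanted within-source order; a stable sort on source
# then keeps total, seeds, other fruit together for each source.
final_df = (pd.concat([total, seeded, rest])
            .sort_values("source", kind="stable")
            .reset_index(drop=True)[["source", "fruit"] + years])
print(final_df)
```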

groupby and withhold information of one column based on value of another column

I have the following DataFrame consisting of the columns id, brand and count
Id brand count
1 Audi 3
2 BWM 5
2 FORD 3
3 AUDI 7
4 BMW 2
5 Audi 4
5 FORD 3
I would like to group by id and keep, for each id, only the brand that has the highest count.
So in the end I would like to have the following:
id brand
1 AUDI
2 BMW
3 AUDI
4 BMW
5 AUDI
I have something like this but that obviously is not working. So what would be the correct function or syntax to accomplish that? Thanks!
data.groupby('id')['brand'].where(max('count'))
IIUC use groupby.idxmax and loc:
df.loc[df.groupby('Id')['count'].idxmax()]
[out]
Id brand count
0 1 Audi 3
1 2 BWM 5
3 3 AUDI 7
4 4 BMW 2
5 5 Audi 4
IIUC
df=df.sort_values(['Id','count']).drop_duplicates('Id',keep='last')
Out[249]:
Id brand count
0 1 Audi 3
1 2 BWM 5
3 3 AUDI 7
4 4 BMW 2
5 5 Audi 4
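
To get closer to the two-column result in the question, the idxmax approach can be combined with a column selection. A minimal sketch, with the sample data rebuilt exactly as posted (including the BWM spelling) and an optional upper-casing step that is an assumption about the wanted output:

```python
import pandas as pd

# Sample data as posted in the question.
df = pd.DataFrame({
    "Id": [1, 2, 2, 3, 4, 5, 5],
    "brand": ["Audi", "BWM", "FORD", "AUDI", "BMW", "Audi", "FORD"],
    "count": [3, 5, 3, 7, 2, 4, 3],
})

# idxmax gives, per Id, the index label of the row with the highest count.
best_rows = df.groupby("Id")["count"].idxmax()

# Keep only Id and brand for those rows; upper-case the brand (optional).
top = (df.loc[best_rows, ["Id", "brand"]]
         .assign(brand=lambda d: d["brand"].str.upper())
         .reset_index(drop=True))
print(top)
```

When two brands tie for the highest count within an Id, idxmax keeps the first one it encounters.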

Converting Repeated Names in Dataframe to Single Values

Looking for this:
Anthony now equals 1
John now equals 2
Smith now equals 3
and so on, with each name keeping the same number every time it is repeated. The result should look like this:
1
1
2
2
3
3
The code is fairly long, but here is the spot where I need to convert the names to numbers:
LM = frame[['Name','COMMENT']] -> Name currently holds the names of characters in a movie, and I want to convert it to numbers so that I can run an SVM model with 'Name' as the response variable.
IIUC, you need to look at pd.factorize, or convert the name to pd.Categorical and use its codes.
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame({'Name':np.random.choice(['John','Smith','Anthony'],10)})
df['Name_Code'] = pd.factorize(df.Name)[0] + 1
df
Output:
Name Name_Code
0 Anthony 1
1 Smith 2
2 Anthony 1
3 Anthony 1
4 John 3
5 Anthony 1
6 Anthony 1
7 Smith 2
8 Anthony 1
9 Smith 2
OR
df['Name_Cat_Code'] = pd.Categorical(df.Name).codes + 1
Output:
Name Name_Code Name_Cat_Code
0 Anthony 1 1
1 Smith 2 3
2 Anthony 1 1
3 Anthony 1 1
4 John 3 2
5 Anthony 1 1
6 Anthony 1 1
7 Smith 2 3
8 Anthony 1 1
9 Smith 2 3
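
The two outputs differ because pd.factorize numbers the names in order of first appearance, while pd.Categorical sorts the categories first. A small sketch on a hand-written Series (the four names below are illustrative, not the random sample above):

```python
import pandas as pd

names = pd.Series(["Anthony", "Smith", "Anthony", "John"])

# factorize: codes follow the order in which each name first appears.
codes, uniques = pd.factorize(names)
appearance_codes = (codes + 1).tolist()

# factorize with sort=True: codes follow the sorted order of the names.
sorted_codes = (pd.factorize(names, sort=True)[0] + 1).tolist()

# Categorical: categories are sorted, so the codes match sort=True.
categorical_codes = (pd.Categorical(names).codes + 1).tolist()

print(appearance_codes, sorted_codes, categorical_codes)
```

Either numbering works as a label for a classifier; what matters is using the same mapping when converting predictions back to names.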
