How to group by multiple columns - Python

I want to group my dataframe by different columns based on UserId, Date, and category (frequency of use per day), compute the max duration per category and the part of the day when it is most used, and finally store the result in a .csv file.
name       duration  UserId  category       part_of_day  Date
Settings   3.436     1       System tool    evening      2020-09-10
Calendar   2.167     1       Calendar       night        2020-09-11
Calendar   5.705     1       Calendar       night        2020-09-11
Messages   7.907     1       Phone_and_SMS  night        2020-09-11
Instagram  50.285    9       Social         night        2020-09-28
Drive      30.260    9       Productivity   night        2020-09-28
df.groupby(["UserId", "Date", "category"])["category"].count()
My code's result is:
UserId  Date        category
1       2020-09-10  System tool          1
        2020-09-11  Calendar             8
                    Clock                2
                    Communication       86
                    Health & Fitness     5
But I want this result:
UserId  Date        category      count(category)  max-duration
1       2020-09-10  System tool   1                3
        2020-09-11  Calendar      2                5
2       2020-09-28  Social        1                50
                    Productivity  1                30
How can I do that? I cannot find a solution that gives the wanted result.

From your question, it looks like you'd like a table with each combination and its count. For this, you might consider using the as_index parameter in groupby (together with size(), which returns a flat frame when as_index=False in pandas >= 1.1):
df.groupby(["UserId", "Date", "category"], as_index=False).size()
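On the six sample rows shown above, this would give a flat frame along these lines (the sample contains UserIds 1 and 9):
   UserId        Date       category  size
0       1  2020-09-10    System tool     1
1       1  2020-09-11       Calendar     2
2       1  2020-09-11  Phone_and_SMS     1
3       9  2020-09-28   Productivity     1
4       9  2020-09-28         Social     1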

It looks like you want to calculate statistics for each group.
grouped = df.groupby(["UserId", "Date", "category"])
# count the rows per group and take the longest duration in each group
result = grouped.agg({'category': 'count', 'duration': 'max'})
result.columns = ['group_count', 'duration_max']
result = result.reset_index()
result
   UserId        Date       category  group_count  duration_max
0       1  2020-09-10    System tool            1         3.436
1       1  2020-09-11       Calendar            2         5.705
2       1  2020-09-11  Phone_and_SMS            1         7.907
3       9  2020-09-28   Productivity            1        30.260
4       9  2020-09-28         Social            1        50.285
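The question also asks for the part of the day each category is used most, and for storing the result in a .csv file. A minimal sketch continuing from result above (the new column name and the filename are placeholders, not from the original post):
# most frequent part_of_day per group; mode() returns the most common
# value(s), and iat[0] picks the first one in case of a tie
result["top_part_of_day"] = grouped["part_of_day"].agg(lambda s: s.mode().iat[0]).values
# store the summary table in a .csv file (placeholder filename)
result.to_csv("usage_summary.csv", index=False)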

You can take advantage of pandas.DataFrame.groupby and pandas.DataFrame.aggregate with named aggregation (available since pandas 0.25, so no separate pandas.DataFrame.rename call is needed) to generate your desired output in one line:
code:
import pandas as pd

df = pd.DataFrame({'name': ['Settings', 'Calendar', 'Calendar', 'Messages', 'Instagram', 'Drive'],
                   'duration': [3.436, 2.167, 5.705, 7.907, 50.285, 30.260],
                   'UserId': [1, 1, 1, 1, 2, 2],
                   'category': ['System_tool', 'Calendar', 'Calendar', 'Phone_and_SMS', 'Social', 'Productivity'],
                   'part_of_day': ['evening', 'night', 'night', 'night', 'night', 'night'],
                   'Date': ['2020-09-10', '2020-09-11', '2020-09-11', '2020-09-11', '2020-09-28', '2020-09-28']})

df.groupby(['UserId', 'Date', 'category']).aggregate(count_cat=('category', 'count'),
                                                     max_duration=('duration', 'max'))
out:
                                 count_cat  max_duration
UserId Date       category
1      2020-09-10 System_tool            1         3.436
       2020-09-11 Calendar               2         5.705
                  Phone_and_SMS          1         7.907
2      2020-09-28 Productivity           1        30.260
                  Social                 1        50.285

Related

Pandas: Groupby and sum customer profit, for every 6 months, starting from each user's first transaction

I have a dataset like this:
Customer ID  Date        Profit
1            4/13/2018   10.00
1            4/26/2018   13.27
1            10/23/2018  15.00
2            1/1/2017    7.39
2            7/5/2017    9.99
2            7/7/2017    10.01
3            5/4/2019    30.30
I'd like to group by and sum profit for every 6 months, starting at each user's first transaction.
The output ideally should look like this:
Customer ID  Date        Profit
1            4/13/2018   23.27
1            10/13/2018  15.00
2            1/1/2017    7.39
2            7/1/2017    20.00
3            5/4/2019    30.30
The closest I've been able to get on this problem is by using:
df.groupby(['Customer ID',pd.Grouper(key='Date', freq='6M', closed='left')])['Profit'].sum().reset_index()
But that doesn't seem to start summing on a user's first transaction day.
If changing the dates is not possible (e.g. customer 2's second period showing 7/1/2017 rather than 7/5/2017), then at least summing the profit so that it's based on each user's own 6-month purchase journey would be extremely helpful. Thank you!
I can get you bins starting on the first of the month, until you find a more exact solution.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = (
df
.set_index("Date")
.groupby(["Customer ID"])
.Profit
.resample("6MS")
.sum()
.reset_index(name="Profit")
)
print(df)
Customer ID Date Profit
0 1 2018-04-01 23.27
1 1 2018-10-01 15.00
2 2 2017-01-01 7.39
3 2 2017-07-01 20.00
4 3 2019-05-01 30.30
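If the bins should be anchored exactly at each user's first transaction rather than at month starts, resample's origin parameter may help (pandas >= 1.1; origin only takes effect for fixed frequencies, so "180D" stands in for 6 months here). A sketch worth trying, not a verified solution:
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
# origin="start" anchors each group's bins at that group's first timestamp;
# "180D" approximates six months because month offsets don't support origin
out = (
    df
    .set_index("Date")
    .groupby("Customer ID")
    .Profit
    .resample("180D", origin="start")
    .sum()
    .reset_index(name="Profit")
)
print(out)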

Pandas Multilevel Dataframe stack columns next to each other

I have a Dataframe in the following format:
id       employee  date        week  result
1234565  Max       2022-07-04  27    Project 1
                               27.1  Customer 1
                               27.2  100%
                               27.3  Work
1245513  Susanne   2022-07-04  27    Project 2
                               27.1  Customer 2
                               27.2  100%
                               27.3  In progress
What I want to achieve is the following format:
id       employee  date        week  result     customer    availability  status
1234565  Max       2022-07-04  27    Project 1  Customer 1  100%          Work
1245513  Susanne   2022-07-04  27    Project 2  Customer 2  100%          In progress
The id, employee, date and week columns are the index, so I have a multilevel index.
I have tried several things but nothing really brings the expected result...
So basically I want to unpivot the result.
You can do this (you need pandas >= 1.3.0, since explode over several columns was added in that version):
cols = ['result', 'customer', 'availability', 'status']
# target column order: the non-week index levels, the four value columns, then week
new_cols = df.index.droplevel('week').names + cols + ['week']
# collapse duplicate index entries by collecting their values into lists
df = df.groupby(df.index.names).agg(list)
# remember the first (integer) week number of each id/employee/date group
weeks = df.reset_index('week').groupby(df.index.droplevel('week').names)['week'].first()
# move the week level into columns (one per 27.x sub-row), then re-attach the week number
df = df.unstack().droplevel('week', axis=1).assign(week=weeks).reset_index()
df.columns = new_cols
# the cells still hold one-element lists; explode unwraps them back to scalars
df = df.explode(cols)
print(df):
id employee date result customer availability \
0 1234565 Max 2022-07-04 Project 1 Customer 1 100%
1 1245513 Susanne 2022-07-04 Project 2 Customer 2 100%
status week
0 Work 27.0
1 In progress 27.0

How to represent each user by a unique row (Python)?

I have data like this:
UserId Date Part_of_day Apps Category Frequency Duration_ToT
1 2020-09-10 evening Settings System tool 1 3.436
1 2020-09-11 afternoon Calendar Calendar 5 9.965
1 2020-09-11 afternoon Contacts Phone_and_SMS 7 2.606
2 2020-09-11 afternoon Facebook Social 15 50.799
2 2020-09-11 afternoon clock System tool 2 5.223
3 2020-11-18 morning Contacts Phone_and_SMS 3 1.726
3 2020-11-18 morning Google Productivity 1 4.147
3 2020-11-18 morning Instagram Social 1 0.501
.......................................
67 2020-11-18 morning Truecaller Communication 1 1.246
67 2020-11-18 night Instagram Social 3 58.02
I'm trying to reduce the dimensionality of my dataframe to prepare the entries for k-means.
I'd like to ask if it's possible to represent each user by one row. What do you think of embeddings?
How can I do this, please? I can't find any solution.
This depends on how you want to aggregate the values. Here is a small example of how to do it with groupby and agg.
First I create some sample data.
import pandas as pd
import random
df = pd.DataFrame({
    "id": [int(i/3) for i in range(20)],
    "val1": [random.random() for _ in range(20)],
    "val2": [str(int(random.random()*100)) for _ in range(20)]
})
>>> df.head()
id val1 val2
0 0 0.174553 49
1 0 0.724547 95
2 0 0.369883 3
3 1 0.243191 64
4 1 0.575982 16
>>> df.dtypes
id int64
val1 float64
val2 object
dtype: object
Then we group by the id and aggregate the values according to the functions you specify in the dictionary you pass to agg. In this example I sum up the float values and join the strings with an underscore separator. You could e.g. also pass the list function to store the values in a list.
>>> df.groupby("id").agg({"val1": sum, "val2": "__".join})
val1 val2
id
0 1.268984 49__95__3
1 0.856992 64__16__54
2 2.186370 30__59__21
3 1.486925 29__47__77
4 1.523898 19__78__99
5 0.855413 59__74__73
6 0.201787 63__33
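As mentioned above, you could also pass the list function to keep all of a group's values instead of reducing them:
# each cell now holds the group's values as a list,
# e.g. [0.174553, 0.724547, 0.369883] for id 0's val1
df.groupby("id").agg({"val1": list, "val2": list})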
EDIT regarding the comment "But how can we make val2 contain the top 5 applications according to the duration of the application?":
The agg method is restricted in the sense that you cannot access other attributes while aggregating. To do that, you should use the apply method: you pass it a function that processes the whole group and returns a row as a Series object.
In this example I still use the sum for val1, but for val2 I return the val2 of the row with the highest val1. This should make clear how to make the aggregation depend on other attributes.
def apply_func(group):
    return pd.Series({
        "id": group["id"].iat[0],
        "val1": group["val1"].sum(),
        # val2 of the row where val1 is largest within the group
        "val2": group["val2"].iat[group["val1"].argmax()]
    })
>>> df.groupby("id").apply(apply_func)
id val1 val2
id
0 0 1.749955 95
1 1 0.344372 65
2 2 2.019035 70
3 3 2.444691 36
4 4 2.573576 92
5 5 1.453769 72
6 6 1.811516 94
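Back to the data in the question, the same apply idea can collect the top 5 apps per user by duration. A sketch using the question's column names (UserId, Apps, Duration_ToT); the helper name and the join format are made up for illustration:
def top5_apps(group):
    # total duration per app within this user's rows, then keep the top five
    top = group.groupby("Apps")["Duration_ToT"].sum().nlargest(5)
    return pd.Series({
        "total_duration": group["Duration_ToT"].sum(),
        "top_apps": "__".join(top.index),
    })

df.groupby("UserId").apply(top5_apps).reset_index()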

How to suppress a pandas dataframe?

I have this data frame:
                      age     Income   Job yrs
Churn Own Home
0     0         39.497576  42.540247  7.293301
      1         42.667392  58.975215  8.346974
1     0         44.499774  45.054619  7.806146
      1         47.615546  60.187945  8.525210
Born from this line of code:
gb = df3.groupby(['Churn', 'Own Home'])[['age', 'Income', 'Job yrs']].mean()
I want to "suppress" or unstack this data frame so that it looks like this:
Churn Own Home age Income Job yrs
0 0 0 39.49 42.54 7.29
1 0 1 42.66 58.97 8.34
2 1 0 44.49 45.05 7.80
3 1 1 47.87 60.18 8.52
I have tried using both .stack() and .unstack() with no luck; I also wasn't able to find anything online about this. Any help is greatly appreciated.
Your DataFrame has a MultiIndex that you can revert to a single index using:
gb.reset_index(level=[0,1])
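Applied to the frame above, this gives back ordinary columns:
   Churn  Own Home        age     Income   Job yrs
0      0         0  39.497576  42.540247  7.293301
1      0         1  42.667392  58.975215  8.346974
2      1         0  44.499774  45.054619  7.806146
3      1         1  47.615546  60.187945  8.525210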

How to get weekly averages for column values and week number for the corresponding year based on daily data records with pandas

I'm still learning Python and would like to ask for your help with the following problem:
I have a csv file with daily data and I'm looking for a solution to sum it per calendar week. For the mockup data below, the rows stretch over two weeks (week 14, the current week, and week 13, the past week). Now I need to find a way to group rows per calendar week, recognize which year they belong to, and calculate the week sum and week average. In the example input there are only two different IDs; however, in the actual data file I expect many more.
input.csv
id date activeMembers
1 2020-03-30 10
2 2020-03-30 1
1 2020-03-29 5
2 2020-03-29 6
1 2020-03-28 0
2 2020-03-28 15
1 2020-03-27 32
2 2020-03-27 10
1 2020-03-26 9
2 2020-03-26 3
1 2020-03-25 0
2 2020-03-25 0
1 2020-03-24 0
2 2020-03-24 65
1 2020-03-23 22
2 2020-03-23 12
...
desired output.csv
id week WeeklyActiveMembersSum WeeklyAverageActiveMembers
1 202014 10 1.4
2 202014 1 0.1
1 202013 68 9.7
2 202013 111 15.9
My goal is to:
import pandas as pd
df = pd.read_csv('path/to/my/input.csv')
Here I'd need to group by the 'id' and 'date' columns (per calendar week - not sure if this is possible) and create a 'week' column with the week number, then sum the 'activeMembers' values for the particular week, save the result as the 'WeeklyActiveMembersSum' column in my output file, and finally calculate 'WeeklyAverageActiveMembers' for the particular week. I was experimenting with groupby and isin but no luck so far... would I have to go with something similar to this:
df.groupby('id', as_index=False).agg({'date': 'max',
                                      'activeMembers': 'sum'})
and finally save all as output.csv:
df.to_csv('path/to/my/output.csv', index=False)
Thanks in advance!
It seems I'm getting a different week setting than you do:
# convert the date column to datetime type first
df['date'] = pd.to_datetime(df['date'])
(df.groupby(['id',df.date.dt.strftime('%Y%W')], sort=False)
.activeMembers.agg([('Sum','sum'),('Average','mean')])
.add_prefix('activeMembers')
.reset_index()
)
Output:
id date activeMembersSum activeMembersAverage
0 1 202013 10 10.000000
1 2 202013 1 1.000000
2 1 202012 68 9.714286
3 2 202012 111 15.857143
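If the week numbers in the desired output (202013/202014) follow the ISO calendar, dt.isocalendar() (pandas >= 1.1) may line up better than strftime('%W'). A sketch built on the question's column names, after the to_datetime conversion above, including the requested csv export:
iso = df["date"].dt.isocalendar()
# ISO year + zero-padded ISO week, e.g. "202014" for 2020-03-30
df["week"] = iso["year"].astype(str) + iso["week"].astype(str).str.zfill(2)
out = (df.groupby(["id", "week"], sort=False)
         .activeMembers.agg([("Sum", "sum"), ("Average", "mean")])
         .add_prefix("WeeklyActiveMembers")
         .reset_index())
out.to_csv("output.csv", index=False)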
