Pandas Multilevel Dataframe stack columns next to each other - python

I have a Dataframe in the following format:
id       employee  date        week  result
1234565  Max       2022-07-04  27    Project 1
                               27.1  Customer 1
                               27.2  100%
                               27.3  Work
1245513  Susanne   2022-07-04  27    Project 2
                               27.1  Customer 2
                               27.2  100%
                               27.3  In progress
What I want to achieve is the following format:
id       employee  date        week  result     customer    availability  status
1234565  Max       2022-07-04  27    Project 1  Customer 1  100%          Work
1245513  Susanne   2022-07-04  27    Project 2  Customer 2  100%          In progress
The id, employee, date and week columns are the index, so I have a multilevel index.
I have tried several things, but nothing really produces the expected result...
So basically I want to unpivot the result.

You can do this (you need pandas >= 1.3.0):
cols = ['result', 'customer', 'availability', 'status']
# final column order: the non-week index levels, the four value columns, then week
new_cols = df.index.droplevel('week').names + cols + ['week']

# one row per full (id, employee, date, week) combination, values collected into lists
df = df.groupby(df.index.names).agg(list)

# remember the first (integer) week of each (id, employee, date) group
weeks = df.reset_index('week').groupby(df.index.droplevel('week').names)['week'].first()

# spread the week sub-levels into columns, re-attach the week, and flatten the index
df = df.unstack().droplevel('week', axis=1).assign(week=weeks).reset_index()
df.columns = new_cols
df = df.explode(cols)
print(df)
        id employee        date     result    customer availability       status  week
0  1234565      Max  2022-07-04  Project 1  Customer 1         100%         Work  27.0
1  1245513  Susanne  2022-07-04  Project 2  Customer 2         100%  In progress  27.0
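If the four sub-rows always appear in the fixed order result, customer, availability, status within each week group, a shorter route is possible. This is only a sketch under that assumption (it also assumes the week level is numeric; the field helper column is introduced here purely for illustration):
labels = ['result', 'customer', 'availability', 'status']

tmp = df.reset_index()
# truncate 27.1, 27.2, ... down to the integer week
tmp['week'] = tmp['week'].astype(float).astype(int)
# label each row within its (id, week) group: 0 -> result, 1 -> customer, ...
tmp['field'] = tmp.groupby(['id', 'week']).cumcount().map(dict(enumerate(labels)))
out = (tmp.pivot_table(index=['id', 'employee', 'date', 'week'],
                       columns='field', values='result', aggfunc='first')
          .reset_index()[['id', 'employee', 'date', 'week'] + labels])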

Related

Pandas: Groupby and sum customer profit, for every 6 months, starting from users first transaction

I have a dataset like this:
Customer ID  Date        Profit
1            4/13/2018   10.00
1            4/26/2018   13.27
1            10/23/2018  15.00
2            1/1/2017    7.39
2            7/5/2017    9.99
2            7/7/2017    10.01
3            5/4/2019    30.30
I'd like to group by customer and sum profit over every 6 months, starting at each user's first transaction.
The output ideally should look like this:
Customer ID  Date        Profit
1            4/13/2018   23.27
1            10/13/2018  15.00
2            1/1/2017    7.39
2            7/1/2017    20.00
3            5/4/2019    30.30
The closest I've seemed to get on this problem is by using:
df.groupby(['Customer ID', pd.Grouper(key='Date', freq='6M', closed='left')])['Profit'].sum().reset_index()
But that doesn't start summing on a user's first transaction day.
If changing the dates is not possible (e.g. customer 2's date being 7/1/2017 instead of 7/5/2017), then at least summing the profit so that it's based on each user's own 6-month purchase journey would be extremely helpful. Thank you!
I can get you the first of the month until you find a more perfect solution.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = (
df
.set_index("Date")
.groupby(["Customer ID"])
.Profit
.resample("6MS")
.sum()
.reset_index(name="Profit")
)
print(df)
   Customer ID       Date  Profit
0            1 2018-04-01   23.27
1            1 2018-10-01   15.00
2            2 2017-01-01    7.39
3            2 2017-07-01   20.00
4            3 2019-05-01   30.30
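If you want the bins anchored at each user's actual first transaction rather than at month starts, here is a sketch of one way to do it: count the calendar months between each row and that user's first purchase, and bucket rows into 6-month bins from that offset (the bin column is a scratch name introduced for illustration):
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")  # skip if already datetime

# each user's first transaction date
first = df.groupby("Customer ID")["Date"].transform("min")

# whole calendar months between a row's date and that user's first purchase
month_offset = ((df["Date"].dt.year - first.dt.year) * 12
                + (df["Date"].dt.month - first.dt.month))

# 6-month buckets counted from the first transaction
out = (df.assign(bin=month_offset // 6)
         .groupby(["Customer ID", "bin"])
         .agg(Date=("Date", "min"), Profit=("Profit", "sum"))
         .reset_index()
         .drop(columns="bin"))
print(out)
For the sample data, this puts customer 1's two April purchases in one bucket (23.27) and the October purchase in the next, matching the desired output up to the exact dates shown.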

how to group by multiple columns

I want to group my dataframe by UserId, Date and category, computing the frequency of use per day, the max duration per category, and the part of the day when each category is most used, and finally store the result in a .csv file.
name       duration  UserId  category       part_of_day  Date
Settings   3.436     1       System tool    evening      2020-09-10
Calendar   2.167     1       Calendar       night        2020-09-11
Calendar   5.705     1       Calendar       night        2020-09-11
Messages   7.907     1       Phone_and_SMS  night        2020-09-11
Instagram  50.285    9       Social         night        2020-09-28
Drive      30.260    9       Productivity   night        2020-09-28
df.groupby(["UserId", "Date","category"])["category"].count()
My code's result is:
UserId  Date        category
1       2020-09-10  System tool          1
        2020-09-11  Calendar             8
                    Clock                2
                    Communication       86
                    Health & Fitness     5
But I want this result:
UserId  Date        category      count(category)  max-duration
1       2020-09-10  System tool   1                3
        2020-09-11  Calendar      2                5
2       2020-09-28  Social        1                50
        2020-09-28  Productivity  1                30
How can I do that? I cannot find any solution that gives the wanted result.
From your question, it looks like you'd like to make a table with each combination and the count. For this, you might consider using the as_index parameter of groupby:
df.groupby(["UserId", "Date"], as_index=False).category.count()
It looks like you might be wanting to calculate statistics for each group.
grouped = df.groupby(["UserId", "Date","category"])
result = grouped.agg({'category': 'count', 'duration': 'max'})
result.columns = ['group_count','duration_max']
result = result.reset_index()
result
   UserId        Date       category  group_count  duration_max
0       1  2020-09-10    System tool            1         3.436
1       1  2020-09-11       Calendar            2         5.705
2       1  2020-09-11  Phone_and_SMS            1         7.907
3       9  2020-09-28   Productivity            1        30.260
4       9  2020-09-28         Social            1        50.285
You can take advantage of pandas.DataFrame.groupby and pandas.DataFrame.aggregate with named aggregation to generate your desired output in one line:
Code:
import pandas as pd

df = pd.DataFrame({'name': ['Settings', 'Calendar', 'Calendar', 'Messages', 'Instagram', 'Drive'],
                   'duration': [3.436, 2.167, 5.705, 7.907, 50.285, 30.260],
                   'UserId': [1, 1, 1, 1, 2, 2],
                   'category': ['System_tool', 'Calendar', 'Calendar', 'Phone_and_SMS', 'Social', 'Productivity'],
                   'part_of_day': ['evening', 'night', 'night', 'night', 'night', 'night'],
                   'Date': ['2020-09-10', '2020-09-11', '2020-09-11', '2020-09-11', '2020-09-28', '2020-09-28']})

df.groupby(['UserId', 'Date', 'category']).aggregate(count_cat=('category', 'count'),
                                                     max_duration=('duration', 'max'))
Out:
                                 count_cat  max_duration
UserId Date       category
1      2020-09-10 System_tool            1         3.436
       2020-09-11 Calendar               2         5.705
                  Phone_and_SMS          1         7.907
2      2020-09-28 Productivity           1        30.260
                  Social                 1        50.285
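Since the question also asks to store the result in a .csv file, a minimal follow-up sketch (the filename usage_summary.csv is hypothetical):
result = df.groupby(['UserId', 'Date', 'category']).aggregate(count_cat=('category', 'count'),
                                                              max_duration=('duration', 'max'))
# flatten the MultiIndex back into columns before writing
result.reset_index().to_csv('usage_summary.csv', index=False)  # hypothetical output filename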

How to suppress a pandas dataframe?

I have this data frame:
                      age     Income   Job yrs
Churn Own Home
0     0         39.497576  42.540247  7.293301
      1         42.667392  58.975215  8.346974
1     0         44.499774  45.054619  7.806146
      1         47.615546  60.187945  8.525210
Born from this line of code:
gb = df3.groupby(['Churn', 'Own Home'])[['age', 'Income', 'Job yrs']].mean()
I want to "suppress" or unstack this data frame so that it looks like this:
   Churn  Own Home    age  Income  Job yrs
0      0         0  39.49   42.54     7.29
1      0         1  42.66   58.97     8.34
2      1         0  44.49   45.05     7.80
3      1         1  47.61   60.18     8.52
I have tried using both .stack() and .unstack() with no luck, also I was not able to find anything online talking about this. Any help is greatly appreciated.
Your DataFrame has a MultiIndex, which you can revert to a single index using:
gb.reset_index(level=[0,1])
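Alternatively, a sketch that avoids creating the MultiIndex in the first place (using the same df3 and column names from the question) is to pass as_index=False to groupby:
gb = (df3.groupby(['Churn', 'Own Home'], as_index=False)[['age', 'Income', 'Job yrs']]
         .mean()
         .round(2))  # round to two decimals to match the desired display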

Pandas pivot table: Aggregate function by count of a particular string

I am trying to analyse a DataFrame which contains the Date as the index, and Name and Message as columns.
df.head() returns:
            Name     Message
Date
2020-01-01  Tom      image omitted
2020-01-01  Michael  image omitted
2020-01-02  James    image Happy new year you wonderfully awfully people...
2020-01-02  James    I was waiting for you image
2020-01-02  James    QB whisperer image
This is the pivot table I was trying to build from the initial df, with the aggfunc being a count of occurrences of a word (e.g. 'image'):
df_s = df.pivot_table(values='Message',index='Date',columns='Name',aggfunc=(lambda x: x.value_counts()['image']))
Which ideally would show, as an example:
Name        Tom  Michael  James
Date
2020-01-01    1        1      0
2020-01-02    0        0      3
For instance, I've built another pivot table using
df_m = df.pivot_table(values='Message', index='Date', columns='Name', aggfunc=lambda x: len(x.unique()))
which aggregates based on the number of unique messages per day, and that returns the table fine.
Thanks in advance
Use Series.str.count to count the matches, add the result as a new column with DataFrame.assign, and then pivot with aggfunc='sum':
df_m = (df.reset_index()
          .assign(count=df['Message'].str.count('image'))
          .pivot_table(index='Date',
                       columns='Name',
                       values='count',
                       aggfunc='sum',
                       fill_value=0))
print (df_m)
Name        James  Michael  Tom
Date
2020-01-01      0        1    1
2020-01-02      3        0    0
This is just for the fun of it, an alternative to the same answer; it is a play on the different options pandas provides:
# or df1.groupby(['Date', 'Name']) if the index has a name
res = (df1.groupby([df1.index, df1.Name])
          .Message.agg(','.join)
          .str.count('image')
          .unstack(fill_value=0))
res
res
Name        James  Michael  Tom
Date
2020-01-01      0        1    1
2020-01-02      3        0    0
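Yet another option in the same spirit, sketched here with the same df1 and column names, is pd.crosstab with the per-message counts as values:
res = (pd.crosstab(index=df1.index,
                   columns=df1['Name'],
                   values=df1['Message'].str.count('image'),
                   aggfunc='sum')
         .fillna(0)      # (Date, Name) pairs with no messages become NaN
         .astype(int))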

Python selecting row from second dataframe based on complex criteria

I have two dataframes, one with some purchasing data, and one with a weekly calendar, e.g.
df1:
purchased_at  product_id  cost
01-01-2017    1           £10
01-01-2017    2           £8
09-01-2017    1           £10
18-01-2017    3           £12
df2:
week_no  week_start  week_end
1        31-12-2016  06-01-2017
2        07-01-2017  13-01-2017
3        14-01-2017  20-01-2017
I want to use data from the two to add a 'week_no' column to df1, which is selected from df2 based on where the 'purchased_at' date in df1 falls between the 'week_start' and 'week_end' dates in df2, i.e.
df1:
purchased_at  product_id  cost  week_no
01-01-2017    1           £10   1
01-01-2017    2           £8    1
09-01-2017    1           £10   2
18-01-2017    3           £12   3
I've searched but I've not been able to find an example where the data is being pulled from a second dataframe using comparisons between the two, and I've been unable to correctly apply any examples I've found, e.g.
df1.loc[(df1['purchased_at'] < df2['week_end']) &
        (df1['purchased_at'] > df2['week_start']), df2['week_no']]
was unsuccessful, raising the ValueError 'can only compare identically-labeled Series objects'.
Could anyone help with this problem? I'm open to suggestions if there is a better way to achieve the same outcome.
Edit: adding further detail of df1
df1 full dataframe headers
purchased_at  purchase_id  product_id  product_name  transaction_id  account_number  cost
01-01-2017    1            1           A             1               AA001           £10
01-01-2017    2            2           B             1               AA001           £8
02-01-2017    3            1           A             2               AA008           £10
03-01-2017    4            3           C             3               AB040           £12
...
09-01-2017    12           1           A             10              AB102           £10
09-01-2017    13           2           B             11              AB102           £8
...
18-01-2017    20           3           C             15              AA001           £12
So the purchase_id increases incrementally with each row, the product_id and product_name have a 1:1 relationship, the transaction_id also increases incrementally, but there can be multiple purchases within a transaction.
If your dataframes are not too big, you can use this trick.
Do a full Cartesian product join of all records to all records:
df_out = pd.merge(df1.assign(key=1),df2.assign(key=1),on='key')
Next, filter out the records that do not match the criteria, in this case where purchased_at is not between week_start and week_end:
(df_out.query('week_start < purchased_at < week_end')
       .drop(['key', 'week_start', 'week_end'], axis=1))
Output:
   purchased_at  product_id  cost  week_no
0    2017-01-01           1   £10        1
3    2017-01-01           2    £8        1
7    2017-01-09           1   £10        2
11   2017-01-18           3   £12        3
If you do have large dataframes, then you can use this numpy method, as proposed by PiRSquared.
import numpy as np

a = df1.purchased_at.values
bh = df2.week_end.values
bl = df2.week_start.values

# indices of every (df1 row, df2 row) pair where the purchase date
# falls inside the week interval
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(
    np.column_stack([df1.values[i], df2.values[j]]),
    columns=df1.columns.append(df2.columns)
).drop(['week_start', 'week_end'], axis=1)
Output:
  purchased_at         product_id  cost  week_no
0 2017-01-01 00:00:00  1           £10   1
1 2017-01-01 00:00:00  2           £8    1
2 2017-01-09 00:00:00  1           £10   2
3 2017-01-18 00:00:00  3           £12   3
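On recent pandas versions, an IntervalIndex lookup is another compact option. A sketch, assuming the weeks don't overlap, every purchase falls inside some week, and the date columns are already datetimes:
# one interval per calendar week, inclusive on both ends
intervals = pd.IntervalIndex.from_arrays(df2['week_start'], df2['week_end'], closed='both')

# get_indexer returns the position of each purchase's week (-1 if no week matches)
df1['week_no'] = df2['week_no'].to_numpy()[intervals.get_indexer(df1['purchased_at'])]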
You could just use datetime's strftime() to extract the week number from the date. If you want to keep counting the weeks upwards, you need to define a "zero year" as the start of your time series and offset the week_no accordingly:
import pandas as pd

data = {'purchased_at': ['01-01-2017', '01-01-2017', '09-01-2017', '18-01-2017'],
        'product_id': [1, 2, 1, 3],
        'cost': ['£10', '£8', '£10', '£12']}
df = pd.DataFrame(data, columns=['purchased_at', 'product_id', 'cost'])

def getWeekNo(date, year0):
    datetime = pd.to_datetime(date, dayfirst=True)
    year = int(datetime.strftime('%Y'))
    weekNo = int(datetime.strftime('%U'))
    return weekNo + 52 * (year - year0)

df['week_no'] = df.purchased_at.apply(lambda x: getWeekNo(x, 2017))
Here, I use pd.to_datetime() to convert the date string from df into a datetime object. strftime('%Y') returns the year and strftime('%U') the week (with the first week of a year starting on its first Sunday; if weeks should start on Monday, use '%W' instead).
This way, you don't need to maintain a separate DataFrame only for week numbers.
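As a side note, pandas can also produce ISO week numbers directly; a sketch (Series.dt.isocalendar() needs pandas >= 1.1, and ISO weeks always start on Monday):
df['purchased_at'] = pd.to_datetime(df['purchased_at'], dayfirst=True)
# isocalendar() returns a frame with 'year', 'week' and 'day' columns
df['week_no'] = df['purchased_at'].dt.isocalendar().week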
