Pandas pivot table: Aggregate function by count of a particular string - python

I am trying to analyse a DataFrame which contains the Date as the index, and Name and Message as columns.
df.head() returns:
            Name     Message
Date
2020-01-01  Tom      image omitted
2020-01-01  Michael  image omitted
2020-01-02  James    image Happy new year you wonderfully awfully people...
2020-01-02  James    I was waiting for you image
2020-01-02  James    QB whisperer image
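For reference, the frame can be rebuilt roughly like this (message text approximated from the head() output above):

import pandas as pd

df = pd.DataFrame(
    {'Name': ['Tom', 'Michael', 'James', 'James', 'James'],
     'Message': ['image omitted',
                 'image omitted',
                 'image Happy new year you wonderfully awfully people...',
                 'I was waiting for you image',
                 'QB whisperer image']},
    index=pd.to_datetime(['2020-01-01', '2020-01-01',
                          '2020-01-02', '2020-01-02', '2020-01-02']))
df.index.name = 'Date'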
This is the pivot table I was trying to build from the initial df, with the aggfunc being the count of occurrences of a particular word (e.g. image):
df_s = df.pivot_table(values='Message',index='Date',columns='Name',aggfunc=(lambda x: x.value_counts()['image']))
Which ideally would show, as an example:
Name        Tom  Michael  James
Date
2020-01-01    1        1      0
2020-01-02    0        0      3
For instance, I've done another df.pivot_table using
df_m = df.pivot_table(values='Message',index='Date',columns='Name',aggfunc=lambda x: len(x.unique()))
Which aggregates based on the number of unique messages per day, and this returns the table fine.
Thanks in advance

Use Series.str.count to count the matches, add the result as a new column with DataFrame.assign, and then pivot with aggfunc='sum':
df_m = (df.reset_index()
          .assign(count=df['Message'].str.count('image'))
          .pivot_table(index='Date',
                       columns='Name',
                       values='count',
                       aggfunc='sum',
                       fill_value=0))
print(df_m)
Name        James  Michael  Tom
Date
2020-01-01      0        1    1
2020-01-02      3        0    0
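The same count can also be computed directly in the aggfunc, without the helper column; a rough equivalent sketch (fill_value covers Date/Name combinations with no messages):

df_s = df.pivot_table(values='Message', index='Date', columns='Name',
                      aggfunc=lambda x: x.str.count('image').sum(),
                      fill_value=0)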

This is for the fun of it, and an alternative to the same answer. It is just a play on the different options Pandas provides:
# or df.groupby(['Date', 'Name']) if the index has a name
res = (df.groupby([df.index, df.Name])
         .Message.agg(','.join)
         .str.count('image')
         .unstack(fill_value=0)
      )
res
Name        James  Michael  Tom
Date
2020-01-01      0        1    1
2020-01-02      3        0    0

Related

Pandas Multilevel Dataframe stack columns next to each other

I have a Dataframe in the following format:
                                        result
id       employee  date        week
1234565  Max       2022-07-04  27      Project 1
                               27.1    Customer 1
                               27.2    100%
                               27.3    Work
1245513  Susanne   2022-07-04  27      Project 2
                               27.1    Customer 2
                               27.2    100%
                               27.3    In progress
What I want to achieve is the following format:
id       employee  date        week  result     customer    availability  status
1234565  Max       2022-07-04  27    Project 1  Customer 1  100%          Work
1245513  Susanne   2022-07-04  27    Project 2  Customer 2  100%          In progress
The id, employee, date and week columns are the index, so I have a multilevel index.
I have tried several things but nothing really brings the expected result...
So basically I want to unpivot the result.
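For reference, the sample above can be reconstructed roughly like this (the week sub-rows as floats, as displayed):

import pandas as pd

df = pd.DataFrame({
    'id':       [1234565] * 4 + [1245513] * 4,
    'employee': ['Max'] * 4 + ['Susanne'] * 4,
    'date':     ['2022-07-04'] * 8,
    'week':     [27, 27.1, 27.2, 27.3] * 2,
    'result':   ['Project 1', 'Customer 1', '100%', 'Work',
                 'Project 2', 'Customer 2', '100%', 'In progress'],
}).set_index(['id', 'employee', 'date', 'week'])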
You can do this (you need pandas version >= 1.3.0):
cols = ['result', 'customer', 'availability', 'status']
new_cols = df.index.droplevel('week').names + cols + ['week']
df = df.groupby(df.index.names).agg(list)
weeks = df.reset_index('week').groupby(df.index.droplevel('week').names)['week'].first()
df = df.unstack().droplevel('week',axis=1).assign(week=weeks).reset_index()
df.columns = new_cols
df = df.explode(cols)
print(df):
        id employee        date     result    customer availability       status  week
0  1234565      Max  2022-07-04  Project 1  Customer 1         100%         Work  27.0
1  1245513  Susanne  2022-07-04  Project 2  Customer 2         100%  In progress  27.0

How to expand a string into multiple rows in a dataframe?

I want to split a string into multiple rows.
df.assign(MODEL_ABC = df['MODEL_ABC'].str.split('_').explode('MODEL_ABC'))
My output:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A 75.0 25.0
If I run the split on the column by itself I get the values below, but not the entire dataframe:
A
B
This is my dataframe df:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A_B 75.0 25.0
Expected output:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A 75.0 25.0
1 2018 First B 75.0 25.0
You can do the following: start by splitting the column into lists, then explode it to create multiple rows:
df['MODEL_ABC'] = df['MODEL_ABC'].str.split('_')
df = df.explode('MODEL_ABC')
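The key difference from the attempt in the question is that explode is called on the whole DataFrame, so the other columns are repeated for each split value. A compact chained version of the same fix (the reset_index is optional and just assumes the old row numbers don't matter):

df = (df.assign(MODEL_ABC=df['MODEL_ABC'].str.split('_'))
        .explode('MODEL_ABC')
        .reset_index(drop=True))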

For a given key, how do I test for overlapping date ranges in Pandas?

I'm working with a data frame in which people can appear with multiple roles, and I need to devise a test to see whether, for a given person, any of their date ranges overlap:
import pandas as pd
records = pd.DataFrame({'name': ['Tom', 'Harry', 'Jack', 'Matt', 'Harry', 'Matt'],
                        'job code': [101, 101, 301, 101, 401, 102],
                        'start date': ['1/1/20', '1/1/20', '1/1/20', '1/1/20', '5/1/20', '6/15/20'],
                        'end date': ['12/31/20', '4/30/20', '12/31/20', '11/30/20', '12/31/20', '12/31/20']})
From this dataset you can see at a glance that everyone is fine except for Matt - he has job dates that overlap which is not allowed. How can I test for this in pandas, checking that each unique name does not have any overlap and flagging the entries that do?
Thanks!
The criterion for two date ranges overlapping would be
max(start_dates) <= min(end_dates)
So you could merge and query:
(records.merge(records, on='name')
        .loc[lambda x: x['job code_x'] != x['job code_y']]
        .loc[lambda x: x.filter(like='start date').max(1) <= x.filter(like='end date').min(1)]
        ['name'].unique()
)
Output:
['Matt']
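The question also asks to flag the offending entries, not just the names; a rough extension of the same merge idea (this sketch assumes the date columns have already been parsed with pd.to_datetime, as the next answer does, so the comparisons are chronological rather than string-based):

# pair every job with every other job held by the same person
pairs = records.merge(records, on='name', suffixes=('', '_other'))
pairs = pairs[pairs['job code'] != pairs['job code_other']]
# two closed date ranges overlap when each one starts before the other ends
overlapping = pairs[(pairs['start date'] <= pairs['end date_other'])
                    & (pairs['start date_other'] <= pairs['end date'])]
records['has_overlap'] = records.set_index(['name', 'job code']).index.isin(
    overlapping.set_index(['name', 'job code']).index)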
The first thing I would do is convert to datetime objects:
records['start date']=pd.to_datetime(records['start date'])
records['end date']=pd.to_datetime(records['end date'])
Then I can work with these rather than strings:
import datetime as dt
# I sort based on name and start date:
records2=records.sort_values(['name', 'start date'])
I then create a new column, which compares the start date to the end date, and returns True if a job overlaps with the subsequent job (False otherwise). This is more specific than what you asked, as it gets to the job level, but you could change this to be True if any jobs overlap for a person.
records2['overlap']=(records2['end date']-records2['start date'].shift(-1).where(records2['name'].eq(records2['name'].shift(-1))))>dt.timedelta(0)
records2
Which returns:
name job code start date end date overlap
1 Harry 101 2020-01-01 2020-04-30 False
4 Harry 401 2020-05-01 2020-12-31 False
2 Jack 301 2020-01-01 2020-12-31 False
3 Matt 101 2020-01-01 2020-11-30 True
5 Matt 102 2020-06-15 2020-12-31 False
0 Tom 101 2020-01-01 2020-12-31 False
This is a helpful question for using shift in conjunction with groups, and there are some nice and different ways to do this. I pulled from the second answer.
If you're interested in how many times each person has an overlap, you can use the following code to create a dataframe with that information:
df = records2.groupby('name')[['job code', 'overlap']].sum()
df
Which returns:
job code overlap
name
Harry 502 0
Jack 301 0
Matt 203 1
Tom 101 0

Pandas Expanding Mean with Group By and before current row date

I have a Pandas dataframe as follows
df = pd.DataFrame([['John', '1/1/2017', '10'],
                   ['John', '2/2/2017', '15'],
                   ['John', '2/2/2017', '20'],
                   ['John', '3/3/2017', '30'],
                   ['Sue', '1/1/2017', '10'],
                   ['Sue', '2/2/2017', '15'],
                   ['Sue', '3/2/2017', '20'],
                   ['Sue', '3/3/2017', '7'],
                   ['Sue', '4/4/2017', '20']],
                  columns=['Customer', 'Deposit_Date', 'DPD'])
What is the best way to calculate the PreviousMean column in the screenshot below?
The column is the year-to-date average of DPD for that customer, i.e. it includes all DPDs up to, but not including, rows that match the current deposit date. If no previous records exist then it's null or 0.
(Screenshot omitted.)
Notes:
the data is grouped by Customer Name and expanding over Deposit Dates
within each group, the expanding mean is calculated using only values from the previous rows.
at the start of each new customer the mean is 0 or alternatively null as there are no previous records on which to form the mean
the data frame is ordered by Customer Name and Deposit_Date
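Note also that in the snippet above DPD is created as strings and Deposit_Date as plain text; the answers below treat them as numbers and dates, so presumably they are converted first, e.g.:

df['DPD'] = pd.to_numeric(df['DPD'])
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])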
Instead of grouping & expanding the mean, filter the dataframe on these conditions and calculate the mean of DPD:
Customer == current row's Customer
Deposit_Date < current row's Deposit_Date
Use df.apply to perform this operation for all rows in the dataframe:
df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
    axis=1)
outputs:
Customer Deposit_Date DPD PreviousMean
0 John 2017-01-01 10 NaN
1 John 2017-02-02 15 10.0
2 John 2017-02-02 20 10.0
3 John 2017-03-03 30 15.0
4 Sue 2017-01-01 10 NaN
5 Sue 2017-02-02 15 10.0
6 Sue 2017-03-02 20 12.5
7 Sue 2017-03-03 7 15.0
8 Sue 2017-04-04 20 13.0
Here's one way to exclude repeated days from mean calculation:
import numpy as np

# create helper series which is True for repeated days (every occurrence after the first)
s = df.groupby(['Customer Name', 'Deposit_Date']).cumcount() > 0
# NaN out the repeats so they don't contribute to the mean
df['DPD2'] = np.where(s, np.nan, df['DPD'])
# expanding mean per customer (pd.expanding_mean was removed in newer pandas)
df['CumMean'] = df.groupby(['Customer Name'])['DPD2'].transform(lambda x: x.expanding().mean())
# drop helper column
df = df.drop(columns='DPD2')
print(df)
Customer Name Deposit_Date DPD CumMean
0 John 01/01/2017 10 10.0
1 John 01/01/2017 10 10.0
2 John 02/02/2017 20 15.0
3 John 03/03/2017 30 20.0
4 Sue 01/01/2017 10 10.0
5 Sue 01/01/2017 10 10.0
6 Sue 02/02/2017 20 15.0
7 Sue 03/03/2017 30 20.0
Ok here is the best solution I've come up with thus far.
The trick is to first create an aggregated table at the customer & deposit date level containing a shifted mean. To calculate this mean you have to calculate the sum and the count first.
s=df.groupby(['Customer Name','Deposit_Date'],as_index=False)[['DPD']].agg(['count','sum'])
s.columns = [' '.join(col) for col in s.columns]
s.reset_index(inplace=True)
s['DPD_CumSum']=s.groupby(['Customer Name'])[['DPD sum']].cumsum()
s['DPD_CumCount']=s.groupby(['Customer Name'])[['DPD count']].cumsum()
s['DPD_CumMean']=s['DPD_CumSum']/ s['DPD_CumCount']
s['DPD_PrevMean']=s.groupby(['Customer Name'])['DPD_CumMean'].shift(1)
df=df.merge(s[['Customer Name','Deposit_Date','DPD_PrevMean']],how='left',on=['Customer Name','Deposit_Date'])
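The same cumulative-sum / cumulative-count idea can be written a bit more compactly; a rough sketch, assuming the same 'Customer Name' and 'Deposit_Date' columns used in this answer, a numeric DPD, and dates that sort chronologically:

# per-day sums and counts, accumulated within each customer, then shifted back one deposit date
day = df.groupby(['Customer Name', 'Deposit_Date'])['DPD'].agg(['sum', 'count'])
cum = day.groupby(level='Customer Name').cumsum()
prev = (cum['sum'] / cum['count']).groupby(level='Customer Name').shift(1)
df = df.merge(prev.rename('DPD_PrevMean').reset_index(),
              how='left', on=['Customer Name', 'Deposit_Date'])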

How to change value of second duplicate in row

I have not been able to find an answer to the following online. Your valuable help would be much appreciated.
I have a DataFrame like this with 20k rows:
ID Date Country
2008-0001 2008-01-02 India
2008-0001 2008-01-02 France
2008-0002 2008-01-03 USA
I want to take all the duplicates in ID, such as rows 1 and 2, and change the second occurrence to one above the highest existing number after the dash.
So for instance, because there is already 2008-0002 (assume that 0002 is the highest number after the dash in that column for that year), I want to increment to one above it, so the duplicated 2008-0001 would become 2008-0003.
I can identify and drop duplicates using the following code
drop_duplicate_df = train_df.drop_duplicates(['ID'])
but this is not what I need.
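For reference, the sample can be rebuilt roughly like this (the question's snippet calls it train_df; the answers below call it df):

import pandas as pd

df = pd.DataFrame({'ID': ['2008-0001', '2008-0001', '2008-0002'],
                   'Date': ['2008-01-02', '2008-01-02', '2008-01-03'],
                   'Country': ['India', 'France', 'USA']})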
I believe this will get it done:
# split ID into its numeric parts first (year and sequence), which the rest of the code relies on
df[['ID1', 'ID2']] = df['ID'].str.split('-', expand=True).astype(int)

isdup = df.duplicated(subset=['ID1', 'ID2'])
dups, uniques = df[isdup], df[~isdup]
ids = ['ID1', 'ID2']
for i, row in dups.iterrows():
    # bump the sequence number until it no longer collides with an existing (year, sequence) pair
    while (row[ids] == uniques[ids]).all(axis=1).any():
        row.loc['ID2'] += 1
    uniques = pd.concat([uniques, row.to_frame().T])  # .append() was removed in pandas 2.0
id1 = uniques.ID1.astype(str)
id2 = uniques.ID2.astype(str).str.zfill(4)
uniques.loc[:, 'ID'] = id1 + '-' + id2
uniques.drop(['ID1', 'ID2'], axis=1, inplace=True)
print(uniques.sort_index())
ID Date Country
0 2008-0001 2008-01-02 India
1 2008-0003 2008-01-02 France
2 2008-0002 2008-01-03 USA
The below works with the sample data, and assumes you have data for several years that you want to relabel all according to the same logic:
df.Date = pd.to_datetime(df.Date)  # to datetime to extract years
years = df.groupby(df.Date.dt.year)  # analysis per year
new_df = pd.DataFrame()
for year, data in years:
    data.loc[data.duplicated(subset='ID'), 'ID'] = '{0}-{1}'.format(
        year, str(int(df.ID.max().split('-')[1]) + 1).zfill(4))
    new_df = pd.concat([new_df, data])
to get:
ID Date Country
0 2008-0001 2008-01-02 India
1 2008-0003 2008-01-02 France
2 2008-0002 2008-01-03 USA
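Both answers give every duplicate within a year the same new number; if several duplicates can occur per year, a rough extension of the same idea would hand out consecutive numbers instead (a sketch, not hardened against edge cases such as running past 9999):

# split ID into its year and sequence parts
df[['year', 'seq']] = df['ID'].str.split('-', expand=True)
df['seq'] = df['seq'].astype(int)

for year, grp in df.groupby('year'):
    next_seq = grp['seq'].max() + 1  # start above that year's current maximum
    for idx in grp.index[grp.duplicated(subset=['year', 'seq'])]:
        df.loc[idx, 'ID'] = '{0}-{1:04d}'.format(year, next_seq)
        next_seq += 1

df = df.drop(columns=['year', 'seq'])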
