I have a dataframe. Rows are unique persons and columns are various action types taken. I need the data restructured to show the individual events by row. Here is my current and desired format, as well as the approach I've been trying to implement.
current = pd.DataFrame({'name': {0: 'ross', 1: 'allen', 2: 'jon'},'action a': {0:'2017-10-04', 1:'2017-10-04', 2:'2017-10-04'},'action b': {0:'2017-10-05', 1:'2017-10-05', 2:'2017-10-05'},'action c': {0:'2017-10-06', 1:'2017-10-06', 2:'2017-10-06'}})
desired = pd.DataFrame({'name':['ross','ross','ross','allen','allen','allen','jon','jon','jon'],'action':['action a','action b','action c','action a','action b','action c','action a','action b','action c'],'date':['2017-10-04','2017-10-05','2017-10-06','2017-10-04','2017-10-05','2017-10-06','2017-10-04','2017-10-05','2017-10-06']})
Use df.melt (v0.20+):
df
action a action b action c name
0 2017-10-04 2017-10-05 2017-10-06 ross
1 2017-10-04 2017-10-05 2017-10-06 allen
2 2017-10-04 2017-10-05 2017-10-06 jon
df = df.melt('name').sort_values('name')
df.columns = ['name', 'action', 'date']
df
name action date
1 allen action a 2017-10-04
4 allen action b 2017-10-05
7 allen action c 2017-10-06
2 jon action a 2017-10-04
5 jon action b 2017-10-05
8 jon action c 2017-10-06
0 ross action a 2017-10-04
3 ross action b 2017-10-05
6 ross action c 2017-10-06
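You can also skip the rename step by passing var_name and value_name straight to melt; a minimal sketch (assuming pandas >= 0.20, same as above):
# same reshape in one call; var_name/value_name name the unpivoted columns directly
out = current.melt(id_vars='name', var_name='action', value_name='date').sort_values('name')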
I have two DataFrames: one with student details and another with student attendance records.
details_df
name roll start_day last_day
0 anthony 9 2020-09-08 2020-09-28
1 paul 6 2020-09-01 2020-09-15
2 marcus 10 2020-08-08 2020-09-08
attendance_df
name roll status day
0 anthony 9 absent 2020-07-25
1 anthony 9 present 2020-09-15
2 anthony 9 absent 2020-09-25
3 paul 6 present 2020-09-02
4 marcus 10 present 2020-07-01
5 marcus 10 present 2020-08-17
I am trying to get status=absent True/False for each user between start_day and last_day.
For example, user anthony has two records in attendance_df between start_day and last_day, out of three total records.
If either of those two records has status=absent, then mark that user as True.
Expected Output
name roll absent
0 anthony 9 True
1 paul 6 False
2 marcus 10 False
I have tried converting details_df into a list and then looping over attendance_df, but is there a more efficient way?
You need to do a merge (i.e. a join operation) and filter the rows for which the column day is between start_day and last_day, then a group-by + apply (i.e. a grouped aggregation):
merged_df = attendance_df.merge(details_df, on=['name', 'roll'])
df = (merged_df[merged_df.day.between(merged_df.start_day, merged_df.last_day)]
      .groupby(['name', 'roll'])
      .apply(lambda x: (x.status == 'absent').any())
      .reset_index())
df.columns = ['name', 'roll', 'absent']
To get:
df
name roll absent
0 anthony 9 True
1 marcus 10 False
2 paul 6 False
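One caveat (my addition, not part of the original answer): in the sample frames the date columns are plain strings, so between compares them lexicographically. That happens to work for ISO-formatted dates, but converting to real datetimes first is safer. A minimal sketch:
import pandas as pd

# parse the string dates before merging and filtering
attendance_df['day'] = pd.to_datetime(attendance_df['day'])
details_df[['start_day', 'last_day']] = details_df[['start_day', 'last_day']].apply(pd.to_datetime)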
Merge, groupby() and find any days that are between start and last using a lambda function
df2 = pd.merge(attendance_df, details_df, how='left', on=['name', 'roll'])
df2.groupby(['name', 'roll']).apply(
    lambda x: x['day'].between(x['start_day'], x['last_day']).any()
).to_frame('absent')
absent
name roll
anthony 9 True
marcus 10 True
paul 6 True
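Note that the lambda above only tests whether any day falls inside the window, which is why every group comes back True. To reproduce the expected output you would also need to check the status; a sketch of that extra condition (my addition, not part of the original answer):
df2.groupby(['name', 'roll']).apply(
    lambda x: (x['day'].between(x['start_day'], x['last_day'])
               & (x['status'] == 'absent')).any()
).to_frame('absent')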
What is the fastest way to iterate through the rest of a dataframe, given rows matching some specific values?
For example, let's say I have a dataframe with 'Date', 'Name' and 'Movie'. There could be many users and movies. I want all the people named John who have seen the same movie that someone named Alicia saw before.
Input dataframe could be :
date name movie
0 2018-01-16 10:33:59 Alicia Titanic
1 2018-01-17 08:49:13 Chandler Avatar
2 2018-01-18 09:29:09 Luigi Glass
3 2018-01-19 09:45:27 Alicia Die Hard
4 2018-01-20 10:08:05 Bouchra Pulp Fiction
5 2018-01-26 10:21:47 Bariza Glass
6 2018-01-27 10:15:32 Peggy Bumbleblee
7 2018-01-20 10:08:05 John Titanic
8 2018-01-26 10:21:47 Bariza Glass
9 2018-01-27 10:15:32 John Titanic
The result should be :
date name movie
0 2018-01-16 10:33:59 Alicia Titanic
7 2018-01-20 10:08:05 John Titanic
9 2018-01-27 10:15:32 John Titanic
For the moment I am doing the following:
alicias = df[df['Name'] == 'Alicia']
df_res = pd.DataFrame(columns=df.columns)
for i in alicias.index:
    df_res = df_res.append(alicias.loc[i], sort=False)
    df_johns = df[(df['Date'] > alicias['Date'][i])
                  & (df['Name'] == 'John')
                  & (df['Movie'] == alicias['Movie'][i])]
    df_res = df_res.append(df_johns, sort=False)
It works, but it is very slow. I could also use a groupby, which is much faster, but I want the result to keep the initial row (the row with 'Alicia' in the example), and I can't find a way to do this with a groupby. Any help?
Here's a way to do it. Say you have the following dataframe:
date user movie
0 2018-01-02 Alicia Titanic
1 2018-01-13 John Titanic
2 2018-01-22 John Titanic
3 2018-04-02 John Avatar
4 2018-04-05 Alicia Avatar
5 2018-05-19 John Avatar
IIUC the correct solution should not contain row 3, as Alicia had not seen Avatar yet. So you could do:
df[df.user.eq('Alicia').groupby(df.movie).cumsum()]
date user movie
0 2018-01-02 Alicia Titanic
1 2018-01-13 John Titanic
2 2018-01-22 John Titanic
4 2018-04-05 Alicia Avatar
5 2018-05-19 John Avatar
Explanation:
The following returns True where the user is Alicia:
df.user.eq('Alicia')
0 True
1 False
2 False
3 False
4 True
5 False
Name: user, dtype: bool
What you could do now is group by the movies and apply a cumsum over each group, so that within a movie every row from the first True onward becomes True:
0 True
1 True
2 True
3 False
4 True
5 True
Name: user, dtype: bool
Finally, use boolean indexing on the original dataframe to select the rows of interest.
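A minimal sketch of the same approach with the mask pulled out into its own variable, plus an explicit astype(bool) in case the grouped cumsum comes back as integers on your pandas version (an assumption on my part, not something the original answer requires):
# rows become True from Alicia's first viewing of each movie onward
mask = df.user.eq('Alicia').groupby(df.movie).cumsum().astype(bool)
result = df[mask]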
I have 2 DataFrames:
PROJECT1
key name deadline delivered
0 AA1 Tom 01/05/2018 02/05/2018
1 AA2 Sue 01/05/2018 30/04/2018
2 AA4 Jack 01/05/2018 04/05/2018
PROJECT2
key name deadline delivered
0 AA1 Tom 01/05/2018 30/04/2018
1 AA2 Sue 01/05/2018 30/04/2018
2 AA3 Jim 01/05/2018 03/05/2018
Is it possible to create a column in PROJECT2 named 'In PROJECT1' and apply a condition like this?
pseudo-code:
for row in PROJECT2:
    if in the same row, based on the key column, PROJECT1['delivered'] >= PROJECT2['deadline']:
        PROJECT2['In PROJECT1'] = 'project delivered before deadline'
    else:
        'Project delayed'
expected result
key name deadline delivered In PROJECT1
0 AA1 Tom 01/05/2018 30/04/2018 Project delayed
1 AA2 Sue 01/05/2018 30/04/2018 project delivered before deadline
2 AA3 Jim 01/05/2018 03/05/2018 NaN
I'm not sure how to approach it (iterrows(), a for loop, df.loc[conditions], np.where(), or perhaps I need to define some kind of function to use in df.apply()); any help is highly appreciated.
You can use numpy.select to build the new column from a list of conditions and values.
Note I believe you have your desired criteria reversed, i.e. delivered before deadline should give "project delivered before deadline" rather than vice versa.
import numpy as np
# convert series to datetime if necessary
for col in ['deadline', 'delivered']:
    df1[col] = pd.to_datetime(df1[col], dayfirst=True)
for col in ['deadline', 'delivered']:
    df2[col] = pd.to_datetime(df2[col], dayfirst=True)
# create series mapping key to delivered date in df1
s = df1.set_index('key')['delivered']
# define conditions and values
conditions = [~df2['key'].isin(s.index), df2['key'].map(s) <= df2['deadline']]
values = [np.nan, 'project delivered before deadline']
# apply conditions and values, with fallback value
df2['In Project1'] = np.select(conditions, values, 'Project delayed')
print(df2)
key name deadline delivered In Project1
0 AA1 Tom 2018-05-01 2018-04-30 Project delayed
1 AA2 Sue 2018-05-01 2018-04-30 project delivered before deadline
2 AA3 Jim 2018-05-01 2018-05-03 nan
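One quirk visible in this output (my observation): np.select promotes the mixed list of choices to a string dtype, so the missing value is stored as the literal string 'nan' rather than a real NaN. A small sketch to convert it back afterwards:
# replace the literal 'nan' strings produced by np.select with true missing values
df2['In Project1'] = df2['In Project1'].replace('nan', np.nan)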
Here is an alternate way, joining both datasets. This avoids any need for a loop and will be faster.
## join the two data sets
# p1 = Project 1
# p2 = Project 2
p3 = p2.merge(p1.loc[:,['key','delivered']], on='key',how='left', suffixes=['_p2','_p1'])
p3['In PROJECT1'] = np.where((p3['delivered_p1'] >= p3['delivered_p2']),'project delivered before deadline','Project delayed')
# handle cases with NA
set_to_na = p3[['delivered_p1', 'delivered_p2']].isnull().any(axis=1)
p3.loc[set_to_na, 'In PROJECT1'] = np.nan
## remove unwanted columns and rename
p3.drop('delivered_p1', axis=1, inplace=True)
p3.rename(columns={'delivered_p2':'delivered'}, inplace=True)
print(p3)
key name deadline delivered In PROJECT1
0 AA1 Tom 01/05/2018 30/04/2018 Project delayed
1 AA2 Sue 01/05/2018 30/04/2018 project delivered before deadline
2 AA3 Jim 01/05/2018 03/05/2018 NaN
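A caveat on this version (my addition): deadline and delivered are dd/mm/yyyy strings here, so the >= comparison is lexicographic rather than chronological. A sketch that parses them before building p3, assuming day-first formatting:
import pandas as pd

for frame in (p1, p2):
    for col in ['deadline', 'delivered']:
        frame[col] = pd.to_datetime(frame[col], dayfirst=True)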
I have a Pandas dataframe as follows
df = pd.DataFrame([['John', '1/1/2017','10'],
['John', '2/2/2017','15'],
['John', '2/2/2017','20'],
['John', '3/3/2017','30'],
['Sue', '1/1/2017','10'],
['Sue', '2/2/2017','15'],
['Sue', '3/2/2017','20'],
['Sue', '3/3/2017','7'],
['Sue', '4/4/2017','20']
],
columns=['Customer', 'Deposit_Date','DPD'])
What is the best way to calculate the PreviousMean column described below?
The column is the year to date average of DPD for that customer. I.e. Includes all DPDs up to but not including rows that match the current deposit date. If no previous records existed then it's null or 0.
Notes:
the data is grouped by Customer Name and expanding over Deposit Dates
within each group, the expanding mean is calculated using only values from the previous rows.
at the start of each new customer the mean is 0 or alternatively null as there are no previous records on which to form the mean
the data frame is ordered by Customer Name and Deposit_Date
Instead of grouping & expanding the mean, filter the dataframe on these conditions and calculate the mean of DPD:
Customer == current row's Customer
Deposit_Date < current row's Deposit_Date
Use df.apply to perform this operation for all rows in the dataframe:
df['PreviousMean'] = df.apply(
lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
axis=1)
outputs:
Customer Deposit_Date DPD PreviousMean
0 John 2017-01-01 10 NaN
1 John 2017-02-02 15 10.0
2 John 2017-02-02 20 10.0
3 John 2017-03-03 30 15.0
4 Sue 2017-01-01 10 NaN
5 Sue 2017-02-02 15 10.0
6 Sue 2017-03-02 20 12.5
7 Sue 2017-03-03 7 15.0
8 Sue 2017-04-04 20 13.0
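One detail worth noting (my addition): the question's constructor builds DPD as strings, so .mean() needs numeric data. A sketch to convert the columns before running the apply above:
import pandas as pd

df['DPD'] = pd.to_numeric(df['DPD'])
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])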
Here's one way to exclude repeated days from mean calculation:
# create helper series which is NaN for repeated days, DPD otherwise
s = df.groupby(['Customer Name', 'Deposit_Date']).cumcount() >= 1
df['DPD2'] = np.where(s, np.nan, df['DPD'])
# apply an expanding mean per customer (pd.expanding_mean has since been removed from pandas)
df['CumMean'] = df.groupby('Customer Name')['DPD2'].transform(lambda x: x.expanding().mean())
# drop helper column
df = df.drop('DPD2', axis=1)
print(df)
Customer Name Deposit_Date DPD CumMean
0 John 01/01/2017 10 10.0
1 John 01/01/2017 10 10.0
2 John 02/02/2017 20 15.0
3 John 03/03/2017 30 20.0
4 Sue 01/01/2017 10 10.0
5 Sue 01/01/2017 10 10.0
6 Sue 02/02/2017 20 15.0
7 Sue 03/03/2017 30 20.0
OK, here is the best solution I've come up with thus far.
The trick is to first create an aggregated table at the customer & deposit-date level containing a shifted mean. To calculate this mean you have to compute the sum and the count first.
s = df.groupby(['Customer Name', 'Deposit_Date'], as_index=False)[['DPD']].agg(['count', 'sum'])
s.columns = [' '.join(col) for col in s.columns]
s.reset_index(inplace=True)
s['DPD_CumSum'] = s.groupby('Customer Name')['DPD sum'].cumsum()
s['DPD_CumCount'] = s.groupby('Customer Name')['DPD count'].cumsum()
s['DPD_CumMean'] = s['DPD_CumSum'] / s['DPD_CumCount']
s['DPD_PrevMean'] = s.groupby('Customer Name')['DPD_CumMean'].shift(1)
df = df.merge(s[['Customer Name', 'Deposit_Date', 'DPD_PrevMean']], how='left', on=['Customer Name', 'Deposit_Date'])
I had posted this question and need to expand on the application. I now need to get the Nth max date for each Vendor:
#import pandas as pd
#df = pd.read_clipboard()
#df['Insert_Date'] = pd.to_datetime(df['Insert_Date'])
# used in example below
#df2 = df.sort_values(['Vendor','Insert_Date']).drop_duplicates(['Vendor'],keep='last')
Vendor Insert_Date Total
Steph 2017-10-25 2
Matt 2017-10-31 13
Chris 2017-11-03 3
Steve 2017-10-23 11
Chris 2017-10-27 3
Steve 2017-11-01 11
If I needed to get the 2nd max date, the expected output would be:
Vendor Insert_Date Total
Steph 2017-10-25 2
Steve 2017-10-23 11
Matt 2017-10-31 13
Chris 2017-10-27 3
I can easily get the 2nd max date by using df2 from the example above: df.loc[~df.index.isin(df2.index)]. But if I need to get the 50th max value, that is a lot of dataframe building to use isin()...
I have also tried df.groupby('Vendor')['Insert_Date'].nlargest(N_HERE), which gets me close, but I then need to get the Nth value for each Vendor.
I have also tried filtering out the df by Vendor:
df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)
but if I try to get the second record with df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)[2] it returns Timestamp('2017-11-03 00:00:00'). Instead I need to use df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)[1:2]. Why must I use list slicing here and not simply [2]?
In summary: how do I return the Nth largest date by Vendor?
I might've misunderstood your initial problem. You can sort on Insert_Date, and then use groupby + apply in this manner:
n = 9
df.sort_values('Insert_Date')\
.groupby('Vendor', as_index=False).apply(lambda x: x.iloc[-n])
For your example data, it seems n = 0 does the trick.
df.sort_values('Insert_Date')\
.groupby('Vendor', as_index=False).apply(lambda x: x.iloc[0])
Vendor Insert_Date Total
0 Chris 2017-10-27 3
1 Matt 2017-10-31 13
2 Steph 2017-10-25 2
3 Steve 2017-10-23 11
Beware, this code will throw errors if your Vendor groups are smaller in size than n.
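A sketch of one way to guard against that, falling back to the group's oldest row when a Vendor has fewer than n rows (my assumption about the desired fallback, chosen to match the expected output above):
n = 2  # the nth largest date to keep
(df.sort_values('Insert_Date')
   .groupby('Vendor', as_index=False)
   .apply(lambda x: x.iloc[-n] if len(x) >= n else x.iloc[0]))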
I will use head (you can pick the top n; here I am using 2) and then drop_duplicates by Vendor, keeping the last:
(df.sort_values('Insert_Date', ascending=False)
   .groupby('Vendor')
   .head(2)
   .drop_duplicates('Vendor', keep='last')
   .sort_index())
Vendor Insert_Date Total
0 Steph 2017-10-25 2
1 Matt 2017-10-31 13
3 Steve 2017-10-23 11
4 Chris 2017-10-27 3
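The same pattern generalizes to any n; a sketch for n = 50 (a group with fewer than n rows simply falls back to its oldest date):
n = 50
(df.sort_values('Insert_Date', ascending=False)
   .groupby('Vendor')
   .head(n)
   .drop_duplicates('Vendor', keep='last')
   .sort_index())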
I like @COLDSPEED's answer as it's more direct. Here is one using nlargest, which involves an intermediate step of creating an nth_largest column:
n = 2
df1['nth_largest'] = df1.groupby('Vendor').Insert_Date.transform(lambda x: x.nlargest(n).min())
df1.drop_duplicates(subset = ['Vendor', 'nth_largest']).drop('Insert_Date', axis = 1)
Vendor Total nth_largest
0 Steph 2 2017-10-25
1 Matt 13 2017-10-31
2 Chris 3 2017-10-27
3 Steve 11 2017-10-23
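If you want the result in the same shape as the expected output (Vendor, Insert_Date, Total), a small follow-up sketch that swaps the helper column back in (my addition):
out = (df1.drop_duplicates(subset=['Vendor', 'nth_largest'])
          .drop('Insert_Date', axis=1)
          .rename(columns={'nth_largest': 'Insert_Date'}))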