I'm trying to add a calculated column to a dataframe based on a condition that involves another dataframe.
Example:
I have a dataframe Users that contains:
Out[4]:
UserID Active BaseSalaryCOP BaseSalaryUSD FromDate ToDate
0 557058:36103848-2606-4d87-9af8-b0498f1c6713 True 9405749 2475.20 05/11/2020 05/11/2021
1 557058:36103848-2606-4d87-9af8-b0498f1c6713 True 3831329 1008.24 05/11/2020 04/11/2021
2 557058:7df66ef4-b04d-4ce9-9cdc-55751909a61e True 3775657 993.59 05/11/2020 05/11/2021
3 557058:b0a4e46c-9bfe-439e-ae6e-500e3c2a87e2 True 9542508 2511.19 05/11/2020 05/11/2021
4 557058:b25dbdb2-aa23-4706-9e50-90b2f66b60a5 True 8994035 2366.85 05/11/2020 05/11/2021
And I have another dataframe called Rate that contains the UserID.
I want to add a calculated column containing BaseSalaryUSD divided by 18 where the UserID matches and the ToDate matches as well.
Something like: if the Date matches ToDate and the UserID matches, then add a new column that contains Users["BaseSalaryUSD"] / 18:
Out[5]:
AccountID Date rate
0 557058:36103848-2606-4d87-9af8-b0498f1c6713 04/21/2021 137.51
2 557058:7df66ef4-b04d-4ce9-9cdc-55751909a61e 05/11/2021 55.19
3 557058:b0a4e46c-9bfe-439e-ae6e-500e3c2a87e2 05/11/2021 139.51
4 557058:b25dbdb2-aa23-4706-9e50-90b2f66b60a5 05/11/2021 131.49
Any idea?
Thanks
Use an outer join of both DataFrames, then filter with Series.between and divide the column with Series.div:
# parse the date columns so the range comparison works on datetimes
Rate['Date'] = pd.to_datetime(Rate['Date'])
Users['FromDate'] = pd.to_datetime(Users['FromDate'])
Users['ToDate'] = pd.to_datetime(Users['ToDate'])
# align the key names, join, and keep rows whose Date falls inside the salary period
df = Users.merge(Rate.rename(columns={'AccountID':'UserID'}), on='UserID', how='outer')
df = df[df['Date'].between(df['FromDate'], df['ToDate'])]
df['new'] = df['BaseSalaryUSD'].div(18)
print (df)
UserID Active BaseSalaryCOP \
0 557058:36103848-2606-4d87-9af8-b0498f1c6713 True 9405749
2 557058:7df66ef4-b04d-4ce9-9cdc-55751909a61e True 3775657
3 557058:b0a4e46c-9bfe-439e-ae6e-500e3c2a87e2 True 9542508
4 557058:b25dbdb2-aa23-4706-9e50-90b2f66b60a5 True 8994035
BaseSalaryUSD FromDate ToDate Date rate new
0 2475.20 2020-05-11 2021-05-11 2021-04-21 137.51 137.511111
2 993.59 2020-05-11 2021-05-11 2021-05-11 55.19 55.199444
3 2511.19 2020-05-11 2021-05-11 2021-05-11 139.51 139.510556
4 2366.85 2020-05-11 2021-05-11 2021-05-11 131.49 131.491667
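If the requirement is an exact match on ToDate (rather than any date inside the FromDate/ToDate range), a minimal sketch of the same idea, assuming the column names shown above, is to merge on both keys and divide afterwards:
# exact match on both UserID and the ToDate/Date pair
exact = Users.merge(Rate.rename(columns={'AccountID': 'UserID', 'Date': 'ToDate'}),
                    on=['UserID', 'ToDate'], how='inner')
exact['new'] = exact['BaseSalaryUSD'].div(18)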
Related
I have two df's: one with student details and another with student attendance records.
details_df
name roll start_day last_day
0 anthony 9 2020-09-08 2020-09-28
1 paul 6 2020-09-01 2020-09-15
2 marcus 10 2020-08-08 2020-09-08
attendance_df
name roll status day
0 anthony 9 absent 2020-07-25
1 anthony 9 present 2020-09-15
2 anthony 9 absent 2020-09-25
3 paul 6 present 2020-09-02
4 marcus 10 present 2020-07-01
5 marcus 10 present 2020-08-17
I am trying to get a status=absent True/False flag for each user between start_day and last_day.
Ex: user anthony has two records in attendance_df between start_day and last_day, out of 3 total records.
If any of those two records has status=absent, then mark that user as True.
Expected Output
name roll absent
0 anthony 9 True
1 paul 6 False
2 marcus 10 False
I have tried turning details_df into a list and then looping over attendance_df, but is there a more efficient way?
You need to do a merge (i.e. a join operation) and filter to the rows where the column day is between start_day and last_day. Then do a group-by + apply (i.e. a grouped aggregation operation):
merged_df = attendance_df.merge(details_df, on=['name', 'roll'])
df = (merged_df[merged_df.day.between(merged_df.start_day, merged_df.last_day)]
      .groupby(['name', 'roll'])
      .apply(lambda x: (x.status == 'absent').any())
      .reset_index())
df.columns = ['name', 'roll', 'absent']
To get:
df
name roll absent
0 anthony 9 True
1 marcus 10 False
2 paul 6 False
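If you prefer to avoid apply, a sketch of the same logic with plain boolean masks and a grouped any (assuming the merged_df built above):
# rows that fall inside each student's date range and are marked absent
in_range = merged_df['day'].between(merged_df['start_day'], merged_df['last_day'])
absent_in_range = merged_df['status'].eq('absent') & in_range
df = (absent_in_range.groupby([merged_df['name'], merged_df['roll']])
                     .any()
                     .rename('absent')
                     .reset_index())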
Merge, then groupby() and use a lambda function to check whether any record between start_day and last_day has status absent:
df2 = pd.merge(attendance_df, details_df, how='left', on=['name', 'roll'])
df2.groupby(['name', 'roll']).apply(lambda x: (x['status'].eq('absent')
    & x['day'].between(x['start_day'], x['last_day'])).any()).to_frame('absent')
              absent
name    roll
anthony 9       True
marcus  10     False
paul    6      False
What is the fastest way to iterate through the rest of a dataframe given rows matching some specific values?
For example, let's say I have a dataframe with 'Date', 'Name' and 'Movie'. There could be many users and movies. I want all the rows for a person named John who has seen the same movie as someone named Alicia has seen before.
Input dataframe could be :
date name movie
0 2018-01-16 10:33:59 Alicia Titanic
1 2018-01-17 08:49:13 Chandler Avatar
2 2018-01-18 09:29:09 Luigi Glass
3 2018-01-19 09:45:27 Alicia Die Hard
4 2018-01-20 10:08:05 Bouchra Pulp Fiction
5 2018-01-26 10:21:47 Bariza Glass
6 2018-01-27 10:15:32 Peggy Bumbleblee
7 2018-01-20 10:08:05 John Titanic
8 2018-01-26 10:21:47 Bariza Glass
9 2018-01-27 10:15:32 John Titanic
The result should be :
date name movie
0 2018-01-16 10:33:59 Alicia Titanic
7 2018-01-20 10:08:05 John Titanic
9 2018-01-27 10:15:32 John Titanic
For the moment I am doing the following:
alicias = df[df['Name'] == 'Alicia']
df_res = pd.DataFrame(columns=df.columns)
for i in alicias.index:
    df_res = df_res.append(alicias.loc[i], sort=False)
    df_johns = df[(df['Date'] > alicias['Date'][i])
                  & (df['Name'] == 'John')
                  & (df['Movie'] == alicias['Movie'][i])]
    df_res = df_res.append(df_johns, sort=False)
It works but this is very slow. I could also use a groupby which is much faster but I want the result to keep the initial row (the row with 'Alicia' in the example), and I can't find a way to do this with a groupby.
Any help?
Here's a way to do it. Say you have the following dataframe:
date user movie
0 2018-01-02 Alicia Titanic
1 2018-01-13 John Titanic
2 2018-01-22 John Titanic
3 2018-04-02 John Avatar
4 2018-04-05 Alicia Avatar
5 2018-05-19 John Avatar
IIUC the correct solution should not contain row 3, as Alicia had not seen Avatar yet. So you could do:
df[df.user.eq('Alicia').groupby(df.movie).cumsum().astype(bool)]
date user movie
0 2018-01-02 Alicia Titanic
1 2018-01-13 John Titanic
2 2018-01-22 John Titanic
4 2018-04-05 Alicia Avatar
5 2018-05-19 John Avatar
Explanation:
The following returns True where the user is Alicia:
df.user.eq('Alicia')
0 True
1 False
2 False
3 False
4 True
5 False
Name: user, dtype: bool
What you can do now is group by the movies and apply a cumsum on each group (casting back to bool), so that rows at or after the first True also become True:
0 True
1 True
2 True
3 False
4 True
5 True
Name: user, dtype: bool
Finally, use boolean indexing on the original dataframe to select the rows of interest.
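Note that this relies on the rows being in chronological order, since the cumulative sum only marks rows that come after Alicia's first viewing. If the frame isn't already sorted, a small sketch (assuming the column is named date as above) is to sort first:
# make sure rows are in chronological order before building the mask
df = df.sort_values('date').reset_index(drop=True)
result = df[df.user.eq('Alicia').groupby(df.movie).cumsum().astype(bool)]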
What I want to do is change the values in a column into booleans.
What I am looking at: I have a dataset of artists with a column named "Death Year".
That column contains either the death year or NaN, which I changed to "Alive". I want a column where a death year becomes False and "Alive" becomes True. The dtype for this column is object.
Reproducible Example:
df = pd.DataFrame({'DeathYear':[2005,2003,np.nan,1993]})
DeathYear
0 2005.0
1 2003.0
2 NaN
3 1993.0
which you turned into
df['DeathYear'] = df['DeathYear'].fillna('Alive')
DeathYear
0 2005
1 2003
2 Alive
3 1993
You can just use
df['BoolDeathYear'] = df['DeathYear'] == 'Alive'
DeathYear BoolDeathYear
0 2005 False
1 2003 False
2 Alive True
3 1993 False
Notice that, if your final goal is to have the bool column, you don't have to fill the NaNs at all.
You can just do
df['BoolDeathYear'] = df['DeathYear'].isnull()
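Putting it together, a minimal runnable sketch using just the sample frame from above, with both variants shown:
import numpy as np
import pandas as pd

df = pd.DataFrame({'DeathYear': [2005, 2003, np.nan, 1993]})

# flag straight from the raw NaNs, no fillna needed
df['BoolDeathYear'] = df['DeathYear'].isnull()

# equivalent if the NaNs were already replaced with the string 'Alive'
# df['BoolDeathYear'] = df['DeathYear'].fillna('Alive').eq('Alive')

print(df)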
Here's a sample dataset I've created for this question:
data1 = pd.DataFrame([['1','303','3/7/2016'],
                      ['4','404','6/23/2011'],
                      ['7','101','3/7/2016'],
                      ['1','303','5/6/2017']],
                     columns=["code", "ticket #", "CB date"])
data1['CB date'] = pd.to_datetime(data1['CB date'])
data2 = pd.DataFrame([['1','303','2/5/2016'],
                      ['4','404','6/23/2011'],
                      ['7','101','3/17/2016'],
                      ['1','303','4/6/2017']],
                     columns=["code", "ticket #", "audit date"])
data2['audit date'] = pd.to_datetime(data2['audit date'])
print(data1)
print(data2)
code ticket # CB date
0 1 303 2016-03-07
1 4 404 2011-06-23
2 7 101 2016-03-07
3 1 303 2017-05-06
code ticket # audit date
0 1 303 2016-02-05
1 4 404 2011-06-23
2 7 101 2016-03-17
3 1 303 2017-04-06
I want to merge the two df's, and make sure that the CB dates are always on or after Audit dates:
data_all = data1.merge(data2, how='inner', on=['code', 'ticket #'])
data_all = data_all[data_all['audit date'] <= data_all['CB date']]
print(data_all)
code ticket # CB date audit date
0 1 303 2016-03-07 2016-02-05
2 1 303 2017-05-06 2016-02-05
3 1 303 2017-05-06 2017-04-06
4 4 404 2011-06-23 2011-06-23
However, I only want to keep the row with the earliest CB date after each audit date. So in the above output, the row with index 2 shouldn't be there, because the first two rows (index 0 and 2) both have the same audit date 2016-02-05, and I only want to keep index 0 since its CB date is much closer to 2016-02-05.
Desired output:
code ticket # CB date audit date
0 1 303 2016-03-07 2016-02-05
3 1 303 2017-05-06 2017-04-06
4 4 404 2011-06-23 2011-06-23
I know that in SQL I'd have to group by code & ticket # & audit date first, then order CB date in ascending order and take the item with rank = 1 in each group; but how can I do this in Python/pandas?
I read other posts here but I am still not getting it, so would really appreciate some advice here.
A few posts I read include:
Pandas Groupy take only the first N Groups
Pandas: select the first couple of rows in each group
I'd do this with an optional sort_values call and a drop_duplicates call.
data_all.sort_values(data_all.columns.tolist())\
        .drop_duplicates(subset=['CB date'], keep='first')
code ticket # CB date audit date
0 1 303 2016-03-07 2016-02-05
2 1 303 2017-05-06 2016-02-05
4 4 404 2011-06-23 2011-06-23
I say the sort_values call is optional here, since your data appears to be sorted already. If it isn't, make sure that's part of your solution.
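If the goal is exactly the desired output above (one row per code / ticket # / audit date, keeping the earliest CB date), a sketch of the groupby-style idea the question describes, assuming the data_all frame from the merge above:
# keep the earliest CB date within each (code, ticket #, audit date) group
data_min = (data_all.sort_values('CB date')
                    .drop_duplicates(subset=['code', 'ticket #', 'audit date'], keep='first')
                    .sort_index())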
I have df:
ID,"address","used_at","active_seconds","pageviews"
71ecd2aa165114e5ee292131f1167d8c,"auto.drom.ru",2014-05-17 10:58:59,166,2
71ecd2aa165114e5ee292131f1167d8c,"auto.drom.ru",2016-07-17 17:34:07,92,4
70150aba267f671045f147767251d169,"avito.ru/*/avtomobili",2014-06-15 11:52:09,837,40
bc779f542049bcabb9e68518a215814e,"auto.yandex.ru",2014-01-16 22:23:56,8,1
bc779f542049bcabb9e68518a215814e,"avito.ru/*/avtomobili",2014-01-18 14:38:33,313,5
bc779f542049bcabb9e68518a215814e,"avito.ru/*/avtomobili",2016-07-18 18:12:07,20,1
I need to delete all rows where used_at is later than 2016-06-30. How can I do that?
Use dt.date with boolean indexing (used_at needs to be a datetime column):
print (df.used_at.dt.date > pd.to_datetime('2016-06-30').date())
0 False
1 True
2 False
3 False
4 False
5 True
Name: used_at, dtype: bool
print (df[df.used_at.dt.date > pd.to_datetime('2016-06-30').date()])
ID address \
1 71ecd2aa165114e5ee292131f1167d8c auto.drom.ru
5 bc779f542049bcabb9e68518a215814e avito.ru/*/avtomobili
used_at active_seconds pageviews
1 2016-07-17 17:34:07 92 4
5 2016-07-18 18:12:07 20 1
Or you can build the date from its year, month and day components (pd.datetime is deprecated/removed in recent pandas versions, so use the standard library instead):
from datetime import date

print (df[df.used_at.dt.date > date(2016, 6, 30)])
ID address \
1 71ecd2aa165114e5ee292131f1167d8c auto.drom.ru
5 bc779f542049bcabb9e68518a215814e avito.ru/*/avtomobili
used_at active_seconds pageviews
1 2016-07-17 17:34:07 92 4
5 2016-07-18 18:12:07 20 1
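To actually drop those rows (rather than just display them), keep the complement of the mask; a minimal sketch, where the to_datetime call is only needed if used_at is still a plain string column:
# parse the timestamps (they come in as strings from the CSV), then keep only
# the rows on or before the cutoff date
df['used_at'] = pd.to_datetime(df['used_at'])
df = df[df['used_at'].dt.date <= pd.to_datetime('2016-06-30').date()]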