What is the fastest way to iterate through the rest of a dataframe given rows matching some specific values?
For example, let's say I have a dataframe with 'Date', 'Name' and 'Movie'. There could be many users and movies. I want every row where a person named John has seen the same movie that someone named Alicia has seen before.
The input dataframe could be:
date name movie
0 2018-01-16 10:33:59 Alicia Titanic
1 2018-01-17 08:49:13 Chandler Avatar
2 2018-01-18 09:29:09 Luigi Glass
3 2018-01-19 09:45:27 Alicia Die Hard
4 2018-01-20 10:08:05 Bouchra Pulp Fiction
5 2018-01-26 10:21:47 Bariza Glass
6 2018-01-27 10:15:32 Peggy Bumblebee
7 2018-01-20 10:08:05 John Titanic
8 2018-01-26 10:21:47 Bariza Glass
9 2018-01-27 10:15:32 John Titanic
The result should be:
date name movie
0 2018-01-16 10:33:59 Alicia Titanic
7 2018-01-20 10:08:05 John Titanic
9 2018-01-27 10:15:32 John Titanic
For the moment I am doing the following:
alicias = df[df['Name'] == 'Alicia']
df_res = pd.DataFrame(columns=df.columns)
for i in alicias.index:
    df_res = df_res.append(alicias.loc[i], sort=False)
    df_johns = df[(df['Date'] > alicias['Date'][i])
                  & (df['Name'] == 'John')
                  & (df['Movie'] == alicias['Movie'][i])]
    df_res = df_res.append(df_johns, sort=False)
It works, but it is very slow. I could also use a groupby, which is much faster, but I want the result to keep the initial row (the row with 'Alicia' in the example), and I can't find a way to do that with a groupby.
Any help?
Here's a way to do it. Say you have the following dataframe:
date user movie
0 2018-01-02 Alicia Titanic
1 2018-01-13 John Titanic
2 2018-01-22 John Titanic
3 2018-04-02 John Avatar
4 2018-04-05 Alicia Avatar
5 2018-05-19 John Avatar
IIUC the correct solution should not contain row 3, as Alicia had not seen Avatar yet. So you could do:
df[df.user.eq('Alicia').groupby(df.movie).cumsum().astype(bool)]
date user movie
0 2018-01-02 Alicia Titanic
1 2018-01-13 John Titanic
2 2018-01-22 John Titanic
4 2018-04-05 Alicia Avatar
5 2018-05-19 John Avatar
Explanation:
The following returns True where the user is Alicia:
df.user.eq('Alicia')
0 True
1 False
2 False
3 False
4 True
5 False
Name: user, dtype: bool
Now group this mask by movie and take a cumulative sum: within each movie the count is non-zero from Alicia's first viewing onward, so casting back to bool gives the final mask:
0 True
1 True
2 True
3 False
4 True
5 True
Name: user, dtype: bool
Finally, use boolean indexing on the original dataframe to select the rows of interest.
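The question as originally posed only wants Alicia's rows plus John's later viewings, so the mask above can be combined with a name filter. A minimal sketch of that, assuming the lower-case user/movie column names of this example and that the frame is sorted by date:
import pandas as pd

# John's rows for movies Alicia had already seen at that point
johns = df[df.user.eq('Alicia').groupby(df.movie).cumsum().astype(bool)
           & df.user.eq('John')]

# plus Alicia's own earlier rows for those same movies
alicias = df[df.user.eq('Alicia') & df.movie.isin(johns.movie)]

result = pd.concat([alicias, johns]).sort_index()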
I'm trying to add a calculated column to a dataframe based on a condition that involves another dataframe.
Example:
I have a dataframe Users that contains:
Out[4]:
UserID Active BaseSalaryCOP BaseSalaryUSD FromDate ToDate
0 557058:36103848-2606-4d87-9af8-b0498f1c6713 True 9405749 2475.20 05/11/2020 05/11/2021
1 557058:36103848-2606-4d87-9af8-b0498f1c6713 True 3831329 1008.24 05/11/2020 04/11/2021
2 557058:7df66ef4-b04d-4ce9-9cdc-55751909a61e True 3775657 993.59 05/11/2020 05/11/2021
3 557058:b0a4e46c-9bfe-439e-ae6e-500e3c2a87e2 True 9542508 2511.19 05/11/2020 05/11/2021
4 557058:b25dbdb2-aa23-4706-9e50-90b2f66b60a5 True 8994035 2366.85 05/11/2020 05/11/2021
And I have another dataframe called Rate that contains the UserID.
I want to add a calculated column with BaseSalaryUSD divided by 18 where the UserID matches and the ToDate matches as well.
Something like this (if the date matches ToDate and the UserID matches, then add a new column that contains Users["BaseSalaryUSD"] / 18):
Out[5]:
AccountID Date rate
0 557058:36103848-2606-4d87-9af8-b0498f1c6713 04/21/2021 137.51
2 557058:7df66ef4-b04d-4ce9-9cdc-55751909a61e 05/11/2021 55.19
3 557058:b0a4e46c-9bfe-439e-ae6e-500e3c2a87e2 05/11/2021 139.51
4 557058:b25dbdb2-aa23-4706-9e50-90b2f66b60a5 05/11/2021 131.49
Any idea?
Thanks
Use an outer join on both DataFrames, then filter with Series.between and divide the column with Series.div:
Rate['Date'] = pd.to_datetime(Rate['Date'])
Users['FromDate'] = pd.to_datetime(Users['FromDate'])
Users['ToDate'] = pd.to_datetime(Users['ToDate'])
df = Users.merge(Rate.rename(columns={'AccountID':'UserID'}), on='UserID', how='outer')
df = df[df['Date'].between(df['FromDate'], df['ToDate'])]
df['new'] = df['BaseSalaryUSD'].div(18)
print (df)
UserID Active BaseSalaryCOP \
0 557058:36103848-2606-4d87-9af8-b0498f1c6713 True 9405749
2 557058:7df66ef4-b04d-4ce9-9cdc-55751909a61e True 3775657
3 557058:b0a4e46c-9bfe-439e-ae6e-500e3c2a87e2 True 9542508
4 557058:b25dbdb2-aa23-4706-9e50-90b2f66b60a5 True 8994035
BaseSalaryUSD FromDate ToDate Date rate new
0 2475.20 2020-05-11 2021-05-11 2021-04-21 137.51 137.511111
2 993.59 2020-05-11 2021-05-11 2021-05-11 55.19 55.199444
3 2511.19 2020-05-11 2021-05-11 2021-05-11 139.51 139.510556
4 2366.85 2020-05-11 2021-05-11 2021-05-11 131.49 131.491667
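If only the shape shown in the question's Out[5] is needed (AccountID, Date and the divided value), a follow-up selection could look like the sketch below; the column names simply follow the question's example output:
# keep just the three requested columns and round the divided salary
out = (df[['UserID', 'Date', 'new']]
         .rename(columns={'UserID': 'AccountID', 'new': 'rate'})
         .round({'rate': 2}))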
I have two DataFrames: one with student details and another with student attendance records.
details_df
name roll start_day last_day
0 anthony 9 2020-09-08 2020-09-28
1 paul 6 2020-09-01 2020-09-15
2 marcus 10 2020-08-08 2020-09-08
attendance_df
name roll status day
0 anthony 9 absent 2020-07-25
1 anthony 9 present 2020-09-15
2 anthony 9 absent 2020-09-25
3 paul 6 present 2020-09-02
4 marcus 10 present 2020-07-01
5 marcus 10 present 2020-08-17
I am trying to get status=absent as True/False for each user between start_day and last_day.
For example, user anthony has two records in attendance_df between start_day and last_day, out of 3 records in total.
If status=absent in either of those two records, then mark that user as True.
Expected Output
name roll absent
0 anthony 9 True
1 paul 6 False
2 marcus 10 False
I have tried turning details_df into a list and then looping over attendance_df, but is there a more efficient way?
You need to do a merge (i.e. a join operation) and filter the rows for which the column day is between start_day and last_day. Then do a group-by + apply (i.e. a grouped aggregation):
merged_df = attendance_df.merge(details_df, on=['name', 'roll'])
df = (merged_df[merged_df.day.between(merged_df.start_day, merged_df.last_day)]
.groupby(['name', 'roll'])
.apply(lambda x: (x.status == 'absent').any())
.reset_index())
df.columns = ['name', 'roll', 'absent']
To get:
df
name roll absent
0 anthony 9 True
1 marcus 10 False
2 paul 6 False
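One caveat: a student with no attendance rows inside their start_day/last_day window drops out of the groupby above. A hedged sketch to keep such students with absent=False is to merge the result back onto details_df:
# left-merge onto the full student list so missing students reappear,
# then fill their flag with False
out = details_df[['name', 'roll']].merge(df, on=['name', 'roll'], how='left')
out['absent'] = out['absent'].fillna(False)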
Merge, then groupby() and use a lambda to check whether status is 'absent' on any day that falls between start_day and last_day:
df2 = pd.merge(attendance_df, details_df, how='left', on=['name', 'roll'])
df2.groupby(['name', 'roll']).apply(
    lambda x: x.loc[x['day'].between(x['start_day'], x['last_day']),
                    'status'].eq('absent').any()).to_frame('absent')
              absent
name    roll
anthony 9       True
marcus  10     False
paul    6      False
I need some tips to make a calculation.
I have a DataFrame that looks like the following:
text_id user date important_words
1 John 2018-01-01 {cat, dog, puppy}
1 John 2018-02-01 {cat, dog}
2 Anne 2018-01-01 {flower, sun}
3 John 2018-03-01 {water, blue}
3 Marie 2018-05-01 {water, blue, ocean}
3 Kate 2018-08-01 {island, sand, towel}
4 Max 2018-01-01 {hot, cold}
4 Ethan 2018-06-01 {hot, warm}
5 Marie 2019-01-01 {boo}
In the given dataframe:
The text_id column is the id of a text: each different id is a different text.
The user column is the name of the user who edited the text (adding and erasing important words).
The date column is the moment the edit was made (note that the edits of each text are sorted chronologically).
Finally, the important_words column is the set of important words present in the text after the user's edit.
I need to calculate how many words were added by each user on each edition of a page.
The expected output here would be:
text_id user date important_words added_words
1 John 2018-01-01 {cat, dog, puppy} 3
1 John 2018-02-01 {cat, dog} 0
2 Anne 2018-01-01 {flower, sun} 2
3 John 2018-03-01 {water, blue} 2
3 Marie 2018-05-01 {water, blue, ocean} 1
3 Kate 2018-08-01 {island, sand, towel} 3
4 Max 2018-01-01 {hot, cold} 2
4 Ethan 2018-06-01 {hot, warm} 1
5 Marie 2019-01-01 {boo} 1
Note that the first edit of a text is its creation, so in that case the number of added words is always the size of the important_words set.
Any tips on what would be the fastest way to compute the added_words column will be highly appreciated.
Note that the important_words column contains a set, so calculating the difference between two consecutive editions should be easy.
Hard to think about, but interesting :-) I am using get_dummies, then we just keep the first 1 value per column (i.e. per word) and sum across each row:
s=df.important_words.map(','.join).str.get_dummies(sep=',')
s.mask(s==0).cumsum().eq(1).sum(1)
Out[247]:
0 3
1 0
2 2
3 2
4 1
5 3
6 2
7 1
8 1
dtype: int64
df['val']=s.mask(s==0).cumsum().eq(1).sum(1)
Update: to handle several texts independently, group by text_id before the cumsum:
s=df.important_words.map(','.join).str.get_dummies(sep=',')
s.mask(s==0).groupby(df['text_id']).cumsum().eq(1).sum(1)
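Since the column already holds Python sets (as the question points out), a set-difference approach is another option. A sketch, assuming each row's important_words really is a set and the rows of each text are ordered by date:
# previous edit's word set within the same text (NaN for the first edit)
prev = df.groupby('text_id')['important_words'].shift()

# words present now but not in the previous edit; the first edit adds everything
df['added_words'] = [
    len(cur - prev_set) if isinstance(prev_set, set) else len(cur)
    for cur, prev_set in zip(df['important_words'], prev)
]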
I had posted this question and need to expand on the application. I now need to get the Nth max date for each Vendor:
#import pandas as pd
#df = pd.read_clipboard()
#df['Insert_Date'] = pd.to_datetime(df['Insert_Date'])
# used in example below
#df2 = df.sort_values(['Vendor','Insert_Date']).drop_duplicates(['Vendor'],keep='last')
Vendor Insert_Date Total
Steph 2017-10-25 2
Matt 2017-10-31 13
Chris 2017-11-03 3
Steve 2017-10-23 11
Chris 2017-10-27 3
Steve 2017-11-01 11
If I needed to get the 2nd max date expected output would be:
Vendor Insert_Date Total
Steph 2017-10-25 2
Steve 2017-10-23 11
Matt 2017-10-31 13
Chris 2017-10-27 3
I can easily get the 2nd max date by using df2 in the example, df.loc[~df.index.isin(df2.index)], but if I need to get the 50th max value, that is a lot of dataframe building to use isin()...
I have also tried df.groupby('Vendor')['Insert_Date'].nlargest(N_HERE), which gets me close, but I then need to get the Nth value for each Vendor.
I have also tried filtering out the df by Vendor:
df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)
but if I try to get the second record with df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)[2] it returns Timestamp('2017-11-03 00:00:00'). Instead I need to use df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)[1:2]. Why must I use list slicing here and not simply [2]?
In summary: how do I return the Nth largest date by Vendor?
I might've misunderstood your initial problem. You can sort on Insert_Date, and then use groupby + apply in this manner:
n = 9
df.sort_values('Insert_Date')\
  .groupby('Vendor', as_index=False).apply(lambda x: x.iloc[-n])
For your example data, it seems n = 0 does the trick.
df.sort_values('Insert_Date')\
  .groupby('Vendor', as_index=False).apply(lambda x: x.iloc[0])
Vendor Insert_Date Total
0 Chris 2017-10-27 3
1 Matt 2017-10-31 13
2 Steph 2017-10-25 2
3 Steve 2017-10-23 11
Beware, this code will throw errors if your Vendor groups are smaller in size than n.
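A guarded variant of the same idea (a sketch, not from the original answer) falls back to a vendor's oldest row when the group has fewer than n + 1 records, so small groups do not raise an IndexError:
n = 1  # 0-based offset from the latest date, i.e. n = 1 -> 2nd max
(df.sort_values('Insert_Date')
   .groupby('Vendor', as_index=False)
   .apply(lambda x: x.iloc[-(n + 1)] if len(x) > n else x.iloc[0]))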
I am using head (you can pick the top n; here I am using 2) and then drop_duplicates keeping the last row per Vendor:
df.sort_values('Insert_Date',ascending=False).groupby('Vendor').\
head(2).drop_duplicates('Vendor',keep='last').sort_index()
Out[609]:
Vendor Insert_Date Total
0 Steph 2017-10-25 2
1 Matt 2017-10-31 13
3 Steve 2017-10-23 11
4 Chris 2017-10-27 3
I like @COLDSPEED's answer as it's more direct. Here is one using nlargest, which involves an intermediate step of creating an nth_largest column:
n = 2
df1['nth_largest'] = df1.groupby('Vendor').Insert_Date.transform(lambda x: x.nlargest(n).min())
df1.drop_duplicates(subset = ['Vendor', 'nth_largest']).drop('Insert_Date', axis = 1)
Vendor Total nth_largest
0 Steph 2 2017-10-25
1 Matt 13 2017-10-31
2 Chris 3 2017-10-27
3 Steve 11 2017-10-23
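If the original Insert_Date column should be kept (as in the desired output), one could instead keep the rows whose date equals the per-vendor nth largest. A sketch building on the same transform idea:
n = 2
nth = df.groupby('Vendor')['Insert_Date'].transform(lambda x: x.nlargest(n).min())
# note: a vendor with duplicate dates equal to the nth largest would keep several rows
result = df[df['Insert_Date'].eq(nth)]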
Suppose we take a pandas dataframe...
name age family
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
Then do a groupby() ...
group_df = df.groupby('family')
Then do some aggregate/summarize operation (in my example, my function name_join aggregates the names):
def name_join(list_names, concat='-'):
    return concat.join(list_names)

group_df = group_df.aggregate({'name': name_join, 'age': 'mean'})
The grouped summarized output is thus:
age name
family
1 23 john-jason-jane
2 28 jack-james
Question:
Is there a quick, efficient way to get to the following from the aggregated table?
name age family
0 john 23 1
1 jason 23 1
2 jane 23 1
3 jack 28 2
4 james 28 2
(Note: the age column values are just examples, I don't care for the information I am losing after averaging in this specific example)
The way I thought I could do it does not look too efficient:
create empty dataframe
from every line in group_df, separate the names
return a dataframe with as many rows as there are names in the starting row
append the output to the empty dataframe
The rough equivalent is .reset_index(), but it may not be helpful to think of it as the "opposite" of groupby().
You are splitting a string into pieces, and maintaining each piece's association with 'family'. This old answer of mine does the job.
Just set 'family' as the index column first, refer to the link above, and then reset_index() at the end to get your desired result.
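With current pandas, the same split-and-reassociate idea can be written with str.split plus explode. A sketch built on the aggregated group_df from the question (family as the index, joined names in the name column):
# split the joined names back into lists, give each name its own row,
# and turn the 'family' index back into a column
out = (group_df.assign(name=group_df['name'].str.split('-'))
               .explode('name')
               .reset_index())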
It turns out that DataFrame.groupby() returns an object with the original data stored in its .obj attribute. So ungrouping is just pulling out the original data.
group_df = df.groupby('family')
group_df.obj
Example
>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
order_date category_2 value
1 2011-02-01 Cross Country Race 324400.0
2 2011-03-01 Cross Country Race 142000.0
3 2011-04-01 Cross Country Race 498580.0
4 2011-05-01 Cross Country Race 220310.0
5 2011-06-01 Cross Country Race 364420.0
.. ... ... ...
535 2015-08-01 Triathalon 39200.0
536 2015-09-01 Triathalon 75600.0
537 2015-10-01 Triathalon 58600.0
538 2015-11-01 Triathalon 70050.0
539 2015-12-01 Triathalon 38600.0
[531 rows x 3 columns]
Here's a complete example that recovers the original dataframe from the grouped object
import pandas

def name_join(list_names, concat='-'):
    return concat.join(list_names)

print('create dataframe\n')
df = pandas.DataFrame({'name': ['john', 'jason', 'jane', 'jack', 'james'],
                       'age': [1, 36, 32, 26, 30],
                       'family': [1, 1, 1, 2, 2]})
df.index.name = 'indexer'
print(df)

print('create group_by object')
group_obj_df = df.groupby('family')
print(group_obj_df)

print('\nrecover grouped df')
group_joined_df = group_obj_df.aggregate({'name': name_join, 'age': 'mean'})
group_joined_df
create dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
create group_by object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbfdd9dd048>
recover grouped df
name age
family
1 john-jason-jane 23
2 jack-james 28
print('\nRecover the original dataframe')
print(pandas.concat([group_obj_df.get_group(key) for key in group_obj_df.groups]))
Recover the original dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
There are a few ways to undo DataFrame.groupby; one way is df.groupby(...).filter(lambda x: True), which simply returns the original DataFrame.
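For example, a quick check reusing the df/family setup from the answer above:
grouped = df.groupby('family')
restored = grouped.filter(lambda x: True)  # every group passes, so all rows come back

# same rows, same order, same index (assuming no NaN group keys, which groupby drops)
assert restored.equals(df)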