Suppose I have the following data frame below:
userid recorddate
0 tom 2018-06-12
1 nick 2019-06-01
2 tom 2018-02-12
3 nick 2019-06-02
How would I go about determining and pulling the value for the earliest recorddate for each user, i.e. 2018-02-12 for tom and 2019-06-01 for nick?
In addition, what if I added a parameter such as the earliest recorddate that is greater than 2019-01-01?
Here is a solution with loc:
df['recorddate'] = pd.to_datetime(df['recorddate'])
date = pd.to_datetime("2019-01-01")
df.loc[df['recorddate']>date]
Output will be:
userid recorddate
1 nick 2019-06-01
3 nick 2019-06-02
You can swap the greater-than sign for an equals or less-than sign to get a different result.
Cheers
Everything will be easier if you convert your date strings into datetime objects. Once that's done, you can sort them and then take the first record per userid. Additionally, you can filter the dataframe by passing a date string in your conditional and then proceed the same way.
df['recorddate'] = pd.to_datetime(df['recorddate'])
df.sort_values(by='recorddate', inplace=True)
df.groupby('userid').first()
output
recorddate
userid
nick 2019-06-01
tom 2018-02-12
or
df[df['recorddate']>'2019-01-01'].groupby('userid').first()
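If you only need the dates themselves rather than the whole rows, `groupby` with `min` gives them directly. A minimal sketch using the sample frame from the question:

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    'userid': ['tom', 'nick', 'tom', 'nick'],
    'recorddate': ['2018-06-12', '2019-06-01', '2018-02-12', '2019-06-02'],
})
df['recorddate'] = pd.to_datetime(df['recorddate'])

# Earliest recorddate per user
earliest = df.groupby('userid')['recorddate'].min()

# Earliest recorddate per user, restricted to dates after a cutoff
cutoff = pd.Timestamp('2019-01-01')
earliest_after = df[df['recorddate'] > cutoff].groupby('userid')['recorddate'].min()
```

Note that users with no dates past the cutoff (tom here) simply drop out of the filtered result.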
I'm working with a data frame in which people can appear with multiple roles, and I need to devise a test that checks, for a given person, whether any of their job dates overlap:
import pandas as pd
records = pd.DataFrame({'name': ['Tom', 'Harry', 'Jack', 'Matt', 'Harry', 'Matt'],
                        'job code': [101, 101, 301, 101, 401, 102],
                        'start date': ['1/1/20', '1/1/20', '1/1/20', '1/1/20', '5/1/20', '6/15/20'],
                        'end date': ['12/31/20', '4/30/20', '12/31/20', '11/30/20', '12/31/20', '12/31/20']})
From this dataset you can see at a glance that everyone is fine except for Matt - he has job dates that overlap which is not allowed. How can I test for this in pandas, checking that each unique name does not have any overlap and flagging the entries that do?
Thanks!
The criterion for two date ranges overlapping is
max(start_date_1, start_date_2) <= min(end_date_1, end_date_2)
So you could merge and query:
(records.merge(records, on='name')
.loc[lambda x: x['job code_x'] != x['job code_y']]
.loc[lambda x: x.filter(like='start date').max(1) <= x.filter(like='end date').min(1)]
['name'].unique()
)
Output:
['Matt']
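For reference, here is a self-contained sketch of the merge idea, with the dates converted to datetimes first so the row-wise max/min compares real dates rather than strings:

```python
import pandas as pd

records = pd.DataFrame({
    'name': ['Tom', 'Harry', 'Jack', 'Matt', 'Harry', 'Matt'],
    'job code': [101, 101, 301, 101, 401, 102],
    'start date': ['1/1/20', '1/1/20', '1/1/20', '1/1/20', '5/1/20', '6/15/20'],
    'end date': ['12/31/20', '4/30/20', '12/31/20', '11/30/20', '12/31/20', '12/31/20'],
})
for col in ['start date', 'end date']:
    records[col] = pd.to_datetime(records[col])

# Self-join on name pairs each person's jobs against each other,
# then keep pairs whose latest start is on or before the earliest end
overlappers = (records.merge(records, on='name')
               .loc[lambda x: x['job code_x'] != x['job code_y']]
               .loc[lambda x: x.filter(like='start date').max(axis=1)
                              <= x.filter(like='end date').min(axis=1)]
               ['name'].unique())
```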
The first thing I would do is convert to datetime objects:
records['start date'] = pd.to_datetime(records['start date'])
records['end date'] = pd.to_datetime(records['end date'])
Then I can work with these rather than strings:
import datetime as dt
# I sort based on name and start date:
records2=records.sort_values(['name', 'start date'])
I then create a new column, which compares the start date to the end date, and returns True if a job overlaps with the subsequent job (False otherwise). This is more specific than what you asked, as it gets to the job level, but you could change this to be True if any jobs overlap for a person.
records2['overlap'] = (records2['end date'] - records2['start date'].shift(-1)
                       .where(records2['name'].eq(records2['name'].shift(-1)))) > dt.timedelta(0)
records2
Which returns:
name job code start date end date overlap
1 Harry 101 2020-01-01 2020-04-30 False
4 Harry 401 2020-05-01 2020-12-31 False
2 Jack 301 2020-01-01 2020-12-31 False
3 Matt 101 2020-01-01 2020-11-30 True
5 Matt 102 2020-06-15 2020-12-31 False
0 Tom 101 2020-01-01 2020-12-31 False
This is a helpful question for using shift in conjunction with groups, and there are some nice and different ways to do this. I pulled from the second answer.
If you're interested in how many times each person has an overlap, you can use the following code to create a dataframe with that information:
df = records2.groupby('name')[['job code', 'overlap']].sum()
df
Which returns:
job code overlap
name
Harry 502 0
Jack 301 0
Matt 203 1
Tom 101 0
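As mentioned above, the per-job flag can also be collapsed into a single True/False per person. A self-contained sketch with the same sample data (using pd.Timedelta in place of the datetime module):

```python
import pandas as pd

records = pd.DataFrame({
    'name': ['Tom', 'Harry', 'Jack', 'Matt', 'Harry', 'Matt'],
    'job code': [101, 101, 301, 101, 401, 102],
    'start date': ['1/1/20', '1/1/20', '1/1/20', '1/1/20', '5/1/20', '6/15/20'],
    'end date': ['12/31/20', '4/30/20', '12/31/20', '11/30/20', '12/31/20', '12/31/20'],
})
records['start date'] = pd.to_datetime(records['start date'])
records['end date'] = pd.to_datetime(records['end date'])

records2 = records.sort_values(['name', 'start date'])

# Flag a job whose end date falls after the next job's start date (same person only)
next_start = records2['start date'].shift(-1).where(
    records2['name'].eq(records2['name'].shift(-1)))
records2['overlap'] = (records2['end date'] - next_start) > pd.Timedelta(0)

# Collapse to one True/False per person
has_overlap = records2.groupby('name')['overlap'].any()
```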
I have a dataframe, df, where I want to insert a new column named Date in a specific format.
df:
Name ID
Kelly A
John B
Desired output:
Date Name ID
2019-10-01 Kelly A
2019-10-01 John B
This is what I am doing:
df['2019-10-01'] = date
I am still researching this. Any insight is helpful
Try with
df['date'] = '2019-10-01'
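If you also want Date to be the first column, as in the desired output, df.insert places it there directly. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Kelly', 'John'], 'ID': ['A', 'B']})

# Insert the constant date at position 0, matching the desired layout
df.insert(0, 'Date', pd.to_datetime('2019-10-01'))
```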
My dataframe is this:
Date Name Type Description Number
2020-07-24 John Doe Type1 NaN NaN
2020-08-10 Jo Doehn Type1 NaN NaN
2020-08-15 John Doe Type1 NaN NaN
2020-09-10 John Doe Type2 NaN NaN
2020-11-24 John Doe Type1 NaN NaN
I want the Number column to hold the instance number within a 60-day period. So for entry 1, the Number should just be 1 since it's the first instance; same with entry 2 since it's a different name. Entry 3, however, should have 2 in the Number column since it's the second instance of John Doe and Type1 in the 60-day period starting 7/24 (the first instance date). Entry 4 would be 1 as well since the Type is different. Entry 5 would also be 1 since it's outside the 60-day period from 7/24. However, any entries after this with John Doe, Type1 would fall into a new 60-day period starting 11/24.
Sorry, I know this is a pretty loaded question with a lot of aspects to it, but I'm trying to get up to speed on dataframes again and I'm not sure where to begin.
As a starting point, you could create a pivot table. (The assign statement just creates a temporary column of ones, to support counting.) In the example below, each row is a date, and each column is a (name, type) pair.
Then, use the resample function (to get one row for every calendar day), and the rolling function (to sum the numbers in the 60-day window).
x = (df.assign(temp = 1)
.pivot_table(index='date',
columns=['name', 'type'],
values='temp',
aggfunc='count',
fill_value=0)
)
x.resample('1d').sum().rolling(60, min_periods=1).sum()
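A runnable sketch of this idea on the sample data (assuming lowercase column names; min_periods=1 keeps the first 59 days from coming back as NaN):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-07-24', '2020-08-10', '2020-08-15',
                            '2020-09-10', '2020-11-24']),
    'name': ['John Doe', 'Jo Doehn', 'John Doe', 'John Doe', 'John Doe'],
    'type': ['Type1', 'Type1', 'Type1', 'Type2', 'Type1'],
})

# One column per (name, type) pair, one row per date, counting occurrences
x = (df.assign(temp=1)
       .pivot_table(index='date', columns=['name', 'type'],
                    values='temp', aggfunc='count', fill_value=0))

# One row per calendar day, then a 60-day rolling total per pair
daily = x.resample('1D').sum()
window = daily.rolling(60, min_periods=1).sum()
```

Looking up a date and (name, type) pair in `window` then gives the instance count within the trailing 60-day window; for example, John Doe / Type1 is at 2 on 2020-08-15 and back to 1 on 2020-11-24.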
I am trying to identify only first orders of unique "items" purchased by "test" customers in a simplified sample dataframe from the dataframe created below:
df=pd.DataFrame({"cust": ['A55', 'A55', 'A55', 'B080', 'B080', 'D900', 'D900', 'D900', 'D900', 'C019', 'C019', 'Z09c', 'A987', 'A987', 'A987'],
"date":['01/11/2016', '01/11/2016', '01/11/2016', '08/17/2016', '6/17/2016','03/01/2016',
'04/30/2016', '05/16/2016','09/27/2016', '04/20/2016','04/29/2016', '07/07/2016', '1/29/2016', '10/17/2016', '11/11/2016' ],
"item": ['A10BABA', 'A10BABA', 'A10DBDB', 'A9GABA', 'A11AD', 'G198A', 'G198A', 'F673', 'A11BB', 'CBA1', 'CBA1', 'DA21',
'BG10A', 'CG10BA', 'BG10A']
})
df.date = pd.to_datetime(df.date)
df = df.sort_values(["cust", "date"], ascending = True)
The desired output would look as shown in the picture: all unique items ordered by date of purchase in a new column called "cust_item_rank", with any repeated (duplicated) orders of the same item by the same user removed.
To clarify further, items purchased on the same date by the same user should share the same rank, as shown in the picture for customer A55 (A10BABA and A10DBDB are both ranked 1).
I have spent a fair bit of time using a combination of group by and/or rank operations but unsuccessful thus far. As an example:
df["cust_item_rank"] = df.groupby("cust")["date"]["item"].rank(ascending = 1, method = "min")
Yields an error (Exception: Column(s) date already selected).
Can somebody please guide me to the desired solution here?
# Remove duplicates
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
.reset_index(drop=True))
df2['cust_item_rank'] = df2.groupby('cust').cumcount().add(1)
df2
cust date item cust_item_rank
0 A55 2016-01-11 A10BABA 1
1 A55 2016-01-11 A10DBDB 2
2 A987 2016-01-29 BG10A 1
3 A987 2016-10-17 CG10BA 2
4 B080 2016-06-17 A11AD 1
5 B080 2016-08-17 A9GABA 2
6 C019 2016-04-20 CBA1 1
7 D900 2016-03-01 G198A 1
8 D900 2016-05-16 F673 2
9 D900 2016-09-27 A11BB 3
10 Z09c 2016-07-07 DA21 1
To solve this question, I built upon the excellent initial answer by cs95 and called on the rank function in pandas as follows:
#remove duplicates as recommended by cs95
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
.reset_index(drop=True))
#rank by date after grouping by customer
df2["cust_item_rank"]= df2.groupby(["cust"])["date"].rank(ascending=1,method='dense').astype(int)
This resulted in the desired output (shown in the picture).
It appears that this problem can be solved using either the "min" or "dense" ranking method, but I chose "dense" to avoid skipping any ranks.
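Since the desired output is only shown as a picture above, here is a self-contained sketch of the dense-rank approach. It uses drop_duplicates as a shorthand for the de-duplication step, which is equivalent once the frame is sorted by customer and date:

```python
import pandas as pd

df = pd.DataFrame({
    'cust': ['A55', 'A55', 'A55', 'B080', 'B080', 'D900', 'D900', 'D900',
             'D900', 'C019', 'C019', 'Z09c', 'A987', 'A987', 'A987'],
    'date': ['01/11/2016', '01/11/2016', '01/11/2016', '08/17/2016',
             '6/17/2016', '03/01/2016', '04/30/2016', '05/16/2016',
             '09/27/2016', '04/20/2016', '04/29/2016', '07/07/2016',
             '1/29/2016', '10/17/2016', '11/11/2016'],
    'item': ['A10BABA', 'A10BABA', 'A10DBDB', 'A9GABA', 'A11AD', 'G198A',
             'G198A', 'F673', 'A11BB', 'CBA1', 'CBA1', 'DA21', 'BG10A',
             'CG10BA', 'BG10A'],
})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['cust', 'date'])

# Drop repeat purchases of the same item by the same customer
df2 = df.drop_duplicates(['cust', 'item']).reset_index(drop=True)

# Dense rank by date within each customer: same-day purchases share a rank
df2['cust_item_rank'] = (df2.groupby('cust')['date']
                            .rank(ascending=True, method='dense').astype(int))
```

Customer A55's two items, bought on the same date, both end up with rank 1, which is the tie behavior the question asks for.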
I want to add the number of months in columny to the date in columnx, producing results like this:
columnx columny results
2019-02-15 2 2019-04-15
2019-05-08 1 2019-06-08
It should not change the day of the month: 15 should stay 15 and 8 should stay 8. In the case of 31 becoming 30 (or vice versa), that's okay. Most importantly, I don't want to use .apply(). Thanks!
This should solve it.
Please check that columnx is in datetime format, then run the line below.
df["results"] = df["columnx"] + df["columny"].astype('timedelta64[M]')
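A caveat: the 'timedelta64[M]' cast adds a mean month length (about 30.44 days) rather than a calendar month, so the day of the month is not guaranteed to survive, and this int-to-month cast was removed in pandas 2.x. A sketch of an alternative using pd.DateOffset, which adds true calendar months and clips month-ends (the question rules out .apply; a plain comprehension is still row-wise, but avoids that call):

```python
import pandas as pd

df = pd.DataFrame({
    'columnx': pd.to_datetime(['2019-02-15', '2019-05-08']),
    'columny': [2, 1],
})

# Add calendar months while preserving the day of the month;
# DateOffset clips month-ends (e.g. Jan 31 + 1 month -> Feb 28)
df['results'] = [d + pd.DateOffset(months=m)
                 for d, m in zip(df['columnx'], df['columny'])]
```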