I have the following dataset of students taking multiple SAT exams:
import datetime
import numpy as np
import pandas as pd
df = pd.DataFrame({'student': 'A A A A A B B B C'.split(),
'exam_date':[datetime.datetime(2013,4,1),datetime.datetime(2013,6,1),
datetime.datetime(2013,7,1),datetime.datetime(2013,10,2),
datetime.datetime(2014,1,1),datetime.datetime(2013,11,2),
datetime.datetime(2014,2,2),datetime.datetime(2014,5,2),
datetime.datetime(2014,5,2)]})
print(df)
student exam_date
0 A 2013-04-01
1 A 2013-06-01
2 A 2013-07-01
3 A 2013-10-02
4 A 2014-01-01
5 B 2013-11-02
6 B 2014-02-02
7 B 2014-05-02
8 C 2014-05-02
I want to create a new column diff with the difference between successive exam dates for each individual student, and then filter on a threshold, e.g. 75 days. A student who doesn't have two successive dates should be dropped.
I am trying the following script to create the new column:
df = df.sort_values(['student', 'exam_date']).reset_index(drop=True)
df['diff'] = df.groupby('student')['exam_date'].diff() / np.timedelta64(1, 'D')
print(df)
student exam_date diff
0 A 2013-04-01 NaN
1 A 2013-06-01 61.0
2 A 2013-07-01 30.0
3 A 2013-10-02 93.0
4 A 2014-01-01 91.0
5 B 2013-11-02 NaN
6 B 2014-02-02 92.0
7 B 2014-05-02 89.0
8 C 2014-05-02 NaN
Then I'm using query to filter the value and get the output:
df_new = df.query('diff <= 75')
print(df_new)
student exam_date diff
1 A 2013-06-01 61.0
2 A 2013-07-01 30.0
This is correctly selecting the student A and removing the students B and C. However, I'm missing the earliest date for the student A.
Using df[df['student'].isin(studentList)] gives me the desired result, but it's too much work.
Is there any better way of getting the desired output, maybe using diff() and le()? Any suggestions would be appreciated. Thanks!
You want to filter students, but you are filtering exam records.
Once you have df_new, take its set of students and use it to select rows from df:
df[df.student.isin(df_new.student.unique())]
and you'll get:
student exam_date diff
0 A 2013-04-01 NaN
1 A 2013-06-01 61.0
2 A 2013-07-01 30.0
3 A 2013-10-02 93.0
4 A 2014-01-01 91.0
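If the intermediate df_new is not needed, the same student-level filter can probably be written in one step with a group-wise transform. This is only a minimal sketch, assuming the goal is to keep every row of any student who has at least one gap of 75 days or less; the names keep and result are illustrative and not from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    "student": "A A A A A B B B C".split(),
    "exam_date": pd.to_datetime([
        "2013-04-01", "2013-06-01", "2013-07-01", "2013-10-02", "2014-01-01",
        "2013-11-02", "2014-02-02", "2014-05-02", "2014-05-02",
    ]),
})

# Sort so that diff() compares chronologically adjacent exams per student.
df = df.sort_values(["student", "exam_date"])
df["diff"] = df.groupby("student")["exam_date"].diff().dt.days

# Broadcast each student's smallest gap to all of that student's rows.
# A student with a single exam has only NaN, and NaN <= 75 is False.
keep = df.groupby("student")["diff"].transform("min").le(75)
result = df[keep]
print(result)
```

Because the comparison is done on the per-student minimum, a student with only one exam (all NaN) is dropped without any extra handling.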
I have the following dataset of students taking multiple SAT exams:
import datetime
import pandas as pd
df = pd.DataFrame({'student': 'A A A A A B B B'.split(),
'exam_date':[datetime.datetime(2013,4,1),datetime.datetime(2013,6,1),
datetime.datetime(2013,8,1),datetime.datetime(2013,10,2),
datetime.datetime(2014,1,1),datetime.datetime(2013,11,2),
datetime.datetime(2014,2,2),datetime.datetime(2014,5,2)]})
print(df)
student exam_date
0 A 2013-04-01
1 A 2013-06-01
2 A 2013-08-01
3 A 2013-10-02
4 A 2014-01-01
5 B 2013-11-02
6 B 2014-02-02
7 B 2014-05-02
I want to make a dataset of each student with their first exam date, second exam date, and so on.
I am trying groupby and min to get the 1st date, but not sure about the subsequent dates.
# Find earliest time
df.groupby('student')['exam_date'].agg('min').reset_index()
I tried rank to get the desired result, but it seems like too much work.
# Rank
df['rank'] = df.groupby('student')['exam_date'].rank(ascending=True)
print(df)
student exam_date rank
0 A 2013-04-01 1.0
1 A 2013-06-01 2.0
2 A 2013-08-01 3.0
3 A 2013-10-02 4.0
4 A 2014-01-01 5.0
5 B 2013-11-02 1.0
6 B 2014-02-02 2.0
7 B 2014-05-02 3.0
Is there any better way of getting the desired output? Any suggestions would be appreciated. Thanks!
Desired Output:
student exam_01 exam_02 exam_03 exam_04
0 A 2013-04-01 2013-06-01 2013-08-01 2013-10-02
1 B 2013-11-02 2014-02-02 2014-05-02 NA
You can use groupby + cumcount to generate a helper column that numbers each student's exams, then pivot on it.
NB: this assumes the dates are already sorted within each student; if not, use sort_values first.
(df.assign(id=df.groupby('student').cumcount().add(1))
.pivot(index='student', columns='id', values='exam_date')
.add_prefix('exam_')
)
Output:
id exam_1 exam_2 exam_3 exam_4 exam_5
student
A 2013-04-01 2013-06-01 2013-08-01 2013-10-02 2014-01-01
B 2013-11-02 2014-02-02 2014-05-02 NaT NaT
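For completeness, here is a self-contained sketch of the same cumcount + pivot idea that also sorts first and zero-pads the column names to match the exam_01 style in the question. The helper column name n and the result name wide are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "student": "A A A A A B B B".split(),
    "exam_date": pd.to_datetime([
        "2013-04-01", "2013-06-01", "2013-08-01", "2013-10-02", "2014-01-01",
        "2013-11-02", "2014-02-02", "2014-05-02",
    ]),
})

# Sort so that cumcount numbers the exams in chronological order.
df = df.sort_values(["student", "exam_date"])

wide = (
    df.assign(n=df.groupby("student").cumcount().add(1))  # 1, 2, 3, ... per student
      .pivot(index="student", columns="n", values="exam_date")
)

# Rename the integer columns 1, 2, ... to exam_01, exam_02, ...
wide.columns = [f"exam_{int(n):02d}" for n in wide.columns]
wide = wide.reset_index()
print(wide)
```

Students with fewer exams than the maximum get NaT in the trailing columns.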
I have this data frame
import pandas as pd
df = pd.DataFrame({'COTA':['A','A','A','A','A','B','B','B','B'],
'Date':['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020','20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
'Mark':[1,2,3,4,5,1,2,3,3]
})
print(df)
Based on this data frame, I want the Mark from the previous year. I managed to get the maximum for each COTA, but I want the last one. I used .max() and thought I could get it with .last(), but it didn't work.
Here is my code:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['COTA', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['COTA', 'LastYear'])
print (df)
COTA Date Mark LastYear Max_MarkLastYear
0 A 2021-10-14 1 2021-12-31 5.0
1 A 2020-10-19 2 2020-12-31 3.0
2 A 2019-10-29 3 2019-12-31 NaN
3 A 2021-09-30 4 2021-12-31 5.0
4 A 2020-09-20 5 2020-12-31 3.0
5 B 2021-10-20 1 2021-12-31 3.0
6 B 2020-10-29 2 2020-12-31 3.0
7 B 2019-10-15 3 2019-12-31 NaN
8 B 2020-09-10 3 2020-12-31 3.0
How do I create a new column with the last value of the previous year?
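One possible reading of "last" is the Mark on the latest date within the previous calendar year. Under that assumption, a rough sketch is to sort by date before taking last() per group, then shift the year by one and merge back. The column names Year and Last_MarkLastYear are illustrative, and the dates are assumed to be day-first.

```python
import pandas as pd

df = pd.DataFrame({
    "COTA": ["A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "Date": ["14/10/2021", "19/10/2020", "29/10/2019", "30/09/2021", "20/09/2020",
             "20/10/2021", "29/10/2020", "15/10/2019", "10/09/2020"],
    "Mark": [1, 2, 3, 4, 5, 1, 2, 3, 3],
})

# Assumption: dates are written day/month/year.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Year"] = df["Date"].dt.year

# last() returns the last row in group order, so sort chronologically first.
last_by_year = (
    df.sort_values("Date")
      .groupby(["COTA", "Year"])["Mark"]
      .last()
      .rename("Last_MarkLastYear")
      .reset_index()
)

# Shift forward one year so that the value for year Y lines up with rows dated Y + 1.
last_by_year["Year"] += 1

out = df.merge(last_by_year, on=["COTA", "Year"], how="left")
print(out)
```

Without the sort, last() follows the original row order rather than the dates, which may be why it did not appear to work.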
I would like to compare a column of one df with a column in a different df. The columns are timestamp and holiday date. I'd like to create a dummy variable that is 1 if the timestamp in df1 matches one of the dates in df2, and 0 otherwise.
For example, df1:
timestamp weight(kg)
0 2016-03-04 4.0
1 2015-02-15 5.0
2 2019-05-04 5.0
3 2018-12-25 29.0
4 2020-01-01 58.0
For example, df2:
holiday
0 2016-12-25
1 2017-01-01
2 2019-05-01
3 2018-12-26
4 2020-05-26
Ideal output:
timestamp weight(kg) holiday
0 2016-03-04 4.0 0
1 2015-02-15 5.0 0
2 2019-05-04 5.0 0
3 2018-12-25 29.0 1
4 2020-01-01 58.0 1
I have tried writing a function but it is taking very long to calculate:
def add_holiday(x):
    hols_df = hols.apply(lambda y: y['holiday_dt'] if
                         x['timestamp'] == y['holiday_dt']
                         else None, axis=1)
    hols_df = hols_df.dropna(axis=0, how='all')
    if hols_df.empty:
        hols_df = np.nan
    else:
        hols_df = hols_df.to_string(index=False)
    return hols_df

#df_hols['holidays'] = df_hols.apply(add_holiday, axis=1)
Perhaps, there is a simpler way to do so or the function is not exactly well-written. Any help will be appreciated.
Use Series.isin and convert the boolean mask to 1/0 with Series.astype:
df1['holiday'] = df1['timestamp'].isin(df2['holiday']).astype(int)
Or with numpy.where:
df1['holiday'] = np.where(df1['timestamp'].isin(df2['holiday']), 1, 0)
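Note that isin compares full dates, and on the sample frames in the question no timestamp equals a holiday exactly, so the exact match gives all zeros there. The "ideal output" looks as if it matches on month and day regardless of year. That is only a guess at the intent, but if it is right, a sketch could compare month-day strings instead. The variable names exact and month_day are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2016-03-04", "2015-02-15", "2019-05-04", "2018-12-25", "2020-01-01"]),
    "weight(kg)": [4.0, 5.0, 5.0, 29.0, 58.0],
})
df2 = pd.DataFrame({
    "holiday": pd.to_datetime(
        ["2016-12-25", "2017-01-01", "2019-05-01", "2018-12-26", "2020-05-26"]),
})

# Exact date match, as in the answer above.
exact = df1["timestamp"].isin(df2["holiday"]).astype(int)

# Assumed alternative: match on month and day only, ignoring the year.
month_day = (
    df1["timestamp"].dt.strftime("%m-%d")
       .isin(df2["holiday"].dt.strftime("%m-%d"))
       .astype(int)
)

df1["holiday"] = month_day
print(df1)
```

Both variants are vectorised, so either should be far faster than the row-wise apply in the question.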
customer_id Order_date
1 2015-01-16
1 2015-01-19
2 2014-12-21
2 2015-01-10
1 2015-01-10
3 2018-01-18
3 2017-03-04
4 2019-11-05
4 2010-01-01
3 2019-02-03
Let's say I have data like the above.
For an ecommerce firm, some people buy regularly, some buy once a year, some buy once a month, etc. I need to find the number of days between consecutive transactions for each customer.
This will be a dynamic list, since some people will have transacted a thousand times, some once, some ten times, etc. Any ideas on how to achieve this?
Output needed:
customer_id Order_date_Difference_in_days
1 6,3 # 6 days between the first two dates (2015-01-10 and 2015-01-16), 3 days between the next two (2015-01-16 and 2015-01-19)
2 20
3 320,381
4 3595
Basically these are the differences between dates after sorting them first for each customer id
You can also use the below for the current output:
m=(df.assign(Diff=df.sort_values(['customer_id','Order_date'])
.groupby('customer_id')['Order_date'].diff().dt.days).dropna())
m=m.assign(Diff=m['Diff'].astype(str)).groupby('customer_id')['Diff'].agg(','.join)
customer_id
1 6.0,3.0
2 20.0
3 320.0,381.0
4 3595.0
Name: Diff, dtype: object
First, make sure Order_date is a proper datetime by calling df['Order_date'] = pd.to_datetime(df['Order_date']).
Then sort the data by customer id and order date, and take the difference within each customer:
df.sort_values(['customer_id','Order_date'],inplace=True)
df["days"] = df.groupby("customer_id")["Order_date"].diff() / np.timedelta64(1, "D")
print(df)
customer_id Order_date days
4 1 2015-01-10 NaN
0 1 2015-01-16 6.0
1 1 2015-01-19 3.0
2 2 2014-12-21 NaN
3 2 2015-01-10 20.0
6 3 2017-03-04 NaN
5 3 2018-01-18 320.0
9 3 2019-02-03 381.0
8 4 2010-01-01 NaN
7 4 2019-11-05 3595.0
Then you can do a simple agg, but you'll need to convert the values to strings.
df.dropna().groupby("customer_id")["days"].agg(
lambda x: ",".join(x.astype(str))
).to_frame()
days
customer_id
1 6.0,3.0
2 20.0
3 320.0,381.0
4 3595.0
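Both answers produce float strings such as 6.0,3.0, while the requested output uses whole numbers. A small sketch that casts each gap to int before joining is shown below; the column name gap and the result name out are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 1, 3, 3, 4, 4, 3],
    "Order_date": pd.to_datetime([
        "2015-01-16", "2015-01-19", "2014-12-21", "2015-01-10", "2015-01-10",
        "2018-01-18", "2017-03-04", "2019-11-05", "2010-01-01", "2019-02-03",
    ]),
})

# Sort so that diff() measures the gap to the previous order of the same customer.
df = df.sort_values(["customer_id", "Order_date"])
df["gap"] = df.groupby("customer_id")["Order_date"].diff().dt.days

# Drop each customer's first order (no previous order, so NaN),
# then join the integer day counts with commas.
out = (
    df.dropna(subset=["gap"])
      .groupby("customer_id")["gap"]
      .agg(lambda s: ",".join(str(int(d)) for d in s))
)
print(out)
```

Customers with a single order have no gaps and therefore do not appear in the result.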
For a given pandas data frame called full_df which looks like
index id timestamp data
------- ---- ------------ ------
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-05-01 9.0
The start and end dates (and the time delta between start and end) are varying.
But I need an id-wise resampled version (added rows marked with *)
index id timestamp data
------- ---- ------------ ------ ----
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-03-01 NaN *
4 1 2017-04-01 13.0
5 2 2017-02-01 1.0
6 2 2017-03-01 2.0
7 2 2017-04-01 NaN *
8 2 2017-05-01 9.0
Because the dataset is very large, I was wondering if there is a more efficient way of doing so than:
1. Do full_df.groupby('id')
2. For each group df, do
   df.index = pd.DatetimeIndex(df['timestamp'])
   all_days = pd.date_range(df.index.min(), df.index.max(), freq='MS')
   df = df.reindex(all_days)
3. Combine all groups again with a new index
That's time consuming and not very elegant. Any ideas?
Using resample
In [1175]: (df.set_index('timestamp').groupby('id')['data'].resample('MS').asfreq()
              .reset_index())
Out[1175]:
id timestamp data
0 1 2017-01-01 10.0
1 1 2017-02-01 11.0
2 1 2017-03-01 NaN
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-04-01 NaN
7 2 2017-05-01 9.0
Details
In [1176]: df
Out[1176]:
index id timestamp data
0 1 1 2017-01-01 10.0
1 2 1 2017-02-01 11.0
2 3 1 2017-04-01 13.0
3 4 2 2017-02-01 1.0
4 5 2 2017-03-01 2.0
5 6 2 2017-05-01 9.0
In [1177]: df.dtypes
Out[1177]:
index int64
id int64
timestamp datetime64[ns]
data float64
dtype: object
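For readers who want to run the resample approach end to end, here is a self-contained sketch. It builds the sample frame from the question and selects the data column before resampling, so no helper columns need to be dropped afterwards. Leaving out the question's index column is a simplification made here, not part of the original answer.

```python
import pandas as pd

full_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2017-01-01", "2017-02-01", "2017-04-01",
        "2017-02-01", "2017-03-01", "2017-05-01",
    ]),
    "data": [10.0, 11.0, 13.0, 1.0, 2.0, 9.0],
})

# Resample each id separately on month-start frequency. asfreq() inserts
# NaN for months missing between that id's own first and last timestamp.
out = (
    full_df.set_index("timestamp")
           .groupby("id")["data"]
           .resample("MS")
           .asfreq()
           .reset_index()
)
print(out)
```

Each id keeps its own start and end date, so the result does not grow when ids cover very different periods.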
Edit to add: this approach uses the min/max dates of full_df as a whole, not of each id's df. If there is wide variation in start/end dates between IDs, this will unfortunately inflate the dataframe, and @JohnGalt's method is better. Nevertheless, I'll leave this here as an alternative approach, as it ought to be faster than groupby/resample for cases where it is appropriate.
I think the most efficient approach is likely going to be with stack/unstack or melt/pivot.
You could do something like this, for example:
full_df.set_index(['timestamp','id']).unstack('id').stack('id',dropna=False)
index data
timestamp id
2017-01-01 1 1.0 10.0
2 NaN NaN
2017-02-01 1 2.0 11.0
2 4.0 1.0
2017-03-01 1 NaN NaN
2 5.0 2.0
2017-04-01 1 3.0 13.0
2 NaN NaN
2017-05-01 1 NaN NaN
2 6.0 9.0
Just add reset_index().set_index('id') if you want it to display more like how you have it above. Note in particular the use of dropna=False with stack which preserves the NaN placeholders. Without that, the stack/unstack method just leaves you back where you started.
This method automatically includes the min & max dates, and all dates present for at least one id. If there are interior timestamps missing for every id, then you need to add a resample like this:
full_df.set_index(['timestamp','id']).unstack('id')\
.resample('MS').mean()\
.stack('id',dropna=False)
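The melt/pivot variant mentioned earlier in this answer can be sketched as follows. This is an illustrative alternative rather than the answer's exact code: it pivots to one column per id, uses asfreq('MS') to fill any month missing for every id, and melts back to long form. The names wide and long_df are assumptions.

```python
import pandas as pd

full_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2017-01-01", "2017-02-01", "2017-04-01",
        "2017-02-01", "2017-03-01", "2017-05-01",
    ]),
    "data": [10.0, 11.0, 13.0, 1.0, 2.0, 9.0],
})

# One row per timestamp, one column per id; asfreq adds any month
# that is absent for all ids between the overall min and max date.
wide = full_df.pivot(index="timestamp", columns="id", values="data").asfreq("MS")

# Back to long form, keeping the NaN placeholders.
long_df = (
    wide.reset_index()
        .melt(id_vars="timestamp", var_name="id", value_name="data")
        .sort_values(["id", "timestamp"])
        .reset_index(drop=True)
)
print(long_df)
```

As with stack/unstack, every id is expanded to the overall date range, so this gives 10 rows here rather than the 8 produced by the per-id resample.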