I have usage data per customer, collected monthly over several years, shaped as ~(6000, 60).
Sample dataframe:
import pandas as pd
df = pd.DataFrame({'id': ['user_1', 'user_2'], 'access_type': ['mobile', 'desktop'], '2018-09-01 00:00:00': [7,5], '2018-10-01 00:00:00':[1,3], '2018-11-01 00:00:00':[0,10]})
id access_type 2018-09-01 00:00:00 2018-10-01 00:00:00 2018-11-01 00:00:00
0 user_1 mobile 7 1 0
1 user_2 desktop 5 3 10
How do I convert the ~40 date columns to a datetime index, or another format that allows selecting/slicing the required periods of time by date?
Use DataFrame.melt with DataFrame.set_index:
df2 = (df.melt(['id','access_type'], var_name='date')
.assign(date = lambda x: pd.to_datetime(x['date']))
.set_index('date'))
print (df2)
id access_type value
date
2018-09-01 user_1 mobile 7
2018-09-01 user_2 desktop 5
2018-10-01 user_1 mobile 1
2018-10-01 user_2 desktop 3
2018-11-01 user_1 mobile 0
2018-11-01 user_2 desktop 10
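With the DatetimeIndex in place, periods can be selected by partial string indexing; a small sketch using df2 from above:
#all rows for October 2018
print (df2.loc['2018-10'])
#or a range of months, September through October 2018
print (df2.loc['2018-09':'2018-10'])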
If you need a MultiIndex, use set_index with DataFrame.stack:
s = (df.set_index(['id','access_type'])
.stack()
.rename(index = lambda x: pd.to_datetime(x), level=2))
print (s)
Or:
s = (df.melt(['id','access_type'], var_name='date')
.assign(date = lambda x: pd.to_datetime(x['date']))
.set_index(['id','access_type','date'])['value'])
print (s)
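For the MultiIndex output, a single date can then be selected with Series.xs, e.g. with the melt-based s where the date level is named (a small sketch):
#all values for 2018-10-01 across ids and access types
print (s.xs(pd.Timestamp('2018-10-01'), level='date'))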
I'd like to add a column that is 1 if date_ is more than 12 months after buy_date, else 0.
example df
customer_id date_ buy_date
34555 2019-01-01 2017-02-01
24252 2019-01-01 2018-02-10
96477 2019-01-01 2017-02-18
output df
customer_id date_ buy_date buy_date>_than_12_months
34555 2019-01-01 2017-02-01 1
24252 2019-01-01 2018-02-10 0
96477 2019-01-01 2017-02-18 1
Based on what I understand, you can try adding a year to buy_date, subtracting that from date_, and then checking whether the resulting number of days is positive or negative.
df['buy_date>_than_12_months'] = ((df['date_'] -
(df['buy_date']+pd.offsets.DateOffset(years=1)))
.dt.days.gt(0).astype(int))
print(df)
customer_id date_ buy_date buy_date>_than_12_months
0 34555 2019-01-01 2017-02-01 1
1 24252 2019-01-01 2018-02-10 0
2 96477 2019-01-01 2017-02-18 1
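Equivalently, because the data here is date-only, you can compare the datetimes directly instead of going through .dt.days; a small sketch of the same check:
#1 if date_ falls after buy_date + 1 calendar year, else 0
df['buy_date>_than_12_months'] = (df['date_'].gt(df['buy_date'] + pd.offsets.DateOffset(years=1))
                                             .astype(int))
A second answer rebuilds the example frame from scratch and uses a list comprehension with a timedelta threshold instead: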
import pandas as pd
import numpy as np
values = {'customer_id': [34555,24252,96477],
'date_': ['2019-01-01','2019-01-01','2019-01-01'],
'buy_date': ['2017-02-01','2018-02-10','2017-02-18'],
}
df = pd.DataFrame(values, columns = ['customer_id', 'date_', 'buy_date'])
df['date_'] = pd.to_datetime(df['date_'], format='%Y-%m-%d')
df['buy_date'] = pd.to_datetime(df['buy_date'], format='%Y-%m-%d')
print(df['date_'] - df['buy_date'])
# year-unit timedelta64 cannot be compared with nanosecond-based timedeltas in recent
# pandas/numpy, so 12 months is approximated as 365 days; range(len(df)) avoids hard-coding the row count
df['buy_date>_than_12_months'] = pd.Series([1 if ((df['date_'] - df['buy_date'])[i] > np.timedelta64(365, 'D')) else 0 for i in range(len(df))])
print (df)
In continuation of this question:
I have the following DF:
group_id timestamp
A 2020-09-29 06:00:00 UTC
A 2020-09-29 08:00:00 UTC
A 2020-09-30 09:00:00 UTC
B 2020-09-01 04:00:00 UTC
B 2020-09-01 06:00:00 UTC
I would like to count the deltas between records, aggregated over all groups, without counting deltas between records of different groups. Result for the above example:
delta count
2 2
27 1
Explanation: In group A the deltas are
06:00:00 -> 08:00:00 (2 hours)
08:00:00 -> 09:00:00 on the next day (27 hours from the first event)
And in group B:
04:00:00 -> 06:00:00 (2 hours)
How can I achieve this using Python Pandas?
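For reference, the example frame can be rebuilt like this (a small sketch; the trailing UTC makes the parsed timestamps timezone-aware):
import pandas as pd

df = pd.DataFrame({'group_id': ['A', 'A', 'A', 'B', 'B'],
                   'timestamp': ['2020-09-29 06:00:00 UTC', '2020-09-29 08:00:00 UTC',
                                 '2020-09-30 09:00:00 UTC', '2020-09-01 04:00:00 UTC',
                                 '2020-09-01 06:00:00 UTC']})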
The first idea is to use a custom lambda function with Series.cumsum for the cumulative sum:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df1 = (df.groupby("group_id")['timestamp']
.apply(lambda x: x.diff().dt.total_seconds().cumsum())
.div(3600)
.value_counts()
.rename_axis('delta')
.reset_index(name='count')
)
print (df1)
delta count
0 2.0 2
1 27.0 1
Or add another groupby with GroupBy.cumsum:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df1 = (df.groupby("group_id")['timestamp']
.diff()
.dt.total_seconds()
.div(3600)
.groupby(df['group_id'])
.cumsum()
.value_counts()
.rename_axis('delta')
.reset_index(name='count')
)
print (df1)
delta count
0 2.0 2
1 27.0 1
Another idea is to subtract the first value per group, via GroupBy.transform with GroupBy.first; to remove the first row of each group (which would otherwise contribute a delta of 0), a filter by Series.duplicated is added:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df1 = (df['timestamp'].sub(df.groupby("group_id")['timestamp'].transform('first'))
.loc[df['group_id'].duplicated()]
.dt.total_seconds()
.div(3600)
.value_counts()
.rename_axis('delta')
.reset_index(name='count')
)
print (df1)
delta count
0 2.0 2
1 27.0 1
I have a DataFrame with one or more entries per member. I want to select all entries for a member and copy them into an empty dataframe (below).
n member_id signup_date cancel_date checkout_date
1 669991797608307338 2014-10-22 2015-04-03 2014-10-27
2 669991797608307338 2014-10-22 2015-04-03 NaT
3 669991797608307338 2014-10-22 2015-04-03 NaT
4 669991797608307338 2014-10-22 2015-04-03 NaT
5 669991797608307338 2014-10-22 2015-04-03 NaT
261 -216296171696241227 2018-04-30 NaT NaT
262 740140472387380715 2018-04-30 NaT NaT
263 -973878985384418370 2018-04-30 NaT NaT
264 -600987750910073333 2018-04-30 NaT NaT
265 -926101607852327555 2018-04-30 NaT NaT
... and copy the entries into a dataframe for each member_id.
index = pd.date_range('2014-10-22', end='2018-04-30')
columns = ['signup','checkout','cancel']
df2 = pd.DataFrame(index=index, columns=columns)
df2 = df2.fillna(0)
(index) signup checkout cancel
2014-10-22 0 0 0
2014-10-23 0 0 0
2014-10-24 0 0 0
2014-10-25 0 0 0
2014-10-26 0 0 0
What function / method is the most efficient to use to select by member_id?
E.g. if signup_date = 2014-10-22, there should be a 1 in the signup column of that member's copy of the dataframe on the 2014-10-22 row. If checkout_date = 2014-10-27, a 1 should be in the checkout column on the 2014-10-27 row.
I have a very complicated solution. I create a list of tuples ("member_id", df2_like):
drng = pd.date_range('2014-10-22', '2018-04-30')
lrslt = [(member, pd.DataFrame({"signup": drng.isin(grp.signup_date),
                                "cancel": drng.isin(grp.cancel_date),
                                "checkout": drng.isin(grp.checkout_date)},
                               index=drng).astype(int))
         for member, grp in df.groupby("member_id")]
Edit: Extending the (member,member_df) tuples in "lrslt":
new_lrslt= [ (member,mdf,mdf.resample("Y").sum(),mdf.resample("W").sum()) for (member,mdf) in lrslt ]
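If you need to look up a single member afterwards, the (member_id, frame) tuples convert naturally to a dict; a small usage sketch:
frames_by_member = dict(lrslt)
some_member = df['member_id'].iloc[0]   #any id present in the original frame
print (frames_by_member[some_member].head())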
You can apply a function to df2 that, for each date, counts the number of unique member_id values in df, for each of the signup, cancel and checkout columns.
Something like this:
df2["signup"] = df2.apply(lambda x: df.where(df["signup_date"] == x.name).member_id.nunique(), axis=1)
In this lambda expression, we filter df where signup_date equals the DatetimeIndex value of df2 (accessed via x.name) and count the distinct members.
You can do the same for the two other columns:
df2["cancel"] = df2.apply(lambda x: df.where(df["cancel_date"] == x.name).member_id.nunique(), axis=1)
df2["checkout"] = df2.apply(lambda x: df.where(df["checkout_date"] == x.name).member_id.nunique(), axis=1)
Output: (with your input example)
signup checkout cancel
2014-10-22 1 0 0
2014-10-23 0 0 0
... ... ... ...
2014-10-27 0 1 0
... ... ... ...
2015-04-03 0 0 1
2018-04-30 5 0 0
NOTE: While this gives correct results, I'm not sure about the performance on large dataframes.
EDIT:
To do the same as above but for every member_id, you can loop through the original DataFrame, like this:
df_list = list()
for member_id in df["member_id"].unique():
    d = df2.copy()
    d["signup"] = d.apply(lambda x: df.where((df["signup_date"] == x.name) & (df["member_id"] == member_id)).member_id.nunique(), axis=1)
    d["cancel"] = d.apply(lambda x: df.where((df["cancel_date"] == x.name) & (df["member_id"] == member_id)).member_id.nunique(), axis=1)
    d["checkout"] = d.apply(lambda x: df.where((df["checkout_date"] == x.name) & (df["member_id"] == member_id)).member_id.nunique(), axis=1)
    df_list.append({"member_id": member_id, "statistics_by_date": d})
It gives a list of: {"member_id": <string>, "statistics_by_date": <DataFrame>}
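To pull the frame for one member back out of df_list, a dict comprehension keyed by member_id works; a small usage sketch (the id and slice dates are taken from the sample above):
stats_by_member = {entry["member_id"]: entry["statistics_by_date"] for entry in df_list}
print (stats_by_member[669991797608307338].loc['2014-10-22':'2014-10-27'])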
df = pd.DataFrame({
'subject_id':[1,1,2,2],
'time_1':['2173/04/11 12:35:00','2173/04/12 12:50:00','2173/04/11 12:59:00','2173/04/12 13:14:00'],
'time_2':['2173/04/12 16:35:00','2173/04/13 18:50:00','2173/04/13 22:59:00','2173/04/21 17:14:00'],
'val' :[5,5,40,40],
'iid' :[12,12,12,12]
})
df['time_1'] = pd.to_datetime(df['time_1'])
df['time_2'] = pd.to_datetime(df['time_2'])
df['day'] = df['time_1'].dt.day
Currently my dataframe is the output of the code above.
I would like to replace the time component in the time_1 column with 00:00:00 and in the time_2 column with 23:59:00.
This is what I tried, but it doesn't work:
df.groupby(df['subject_id'])['time_1'].apply(lambda x: pd.datetime.strftime(x, "%H:%M:%S") == "00:00:00") #approach 1
df.groupby(df['subject_id'])['time_1'].apply(lambda x: pd.pd.Timestamp(hour = '00', second = '00')) #approach 2
I expect my output dataframe to look like the one shown below.
In pandas, if all datetimes in a column have a 00:00:00 time component, the time is not displayed.
Use Series.dt.floor or Series.dt.normalize to remove the times, and for the second column add a DateOffset:
df['time_1'] = pd.to_datetime(df['time_1']).dt.floor('d')
#alternative
#df['time_1'] = pd.to_datetime(df['time_1']).dt.normalize()
df['time_2']=pd.to_datetime(df['time_2']).dt.floor('d') + pd.DateOffset(hours=23, minutes=59)
df['day'] = df['time_1'].dt.day
print (df)
subject_id time_1 time_2 val iid day
0 1 2173-04-11 2173-04-12 23:59:00 5 12 11
1 1 2173-04-12 2173-04-13 23:59:00 5 12 12
2 2 2173-04-11 2173-04-13 23:59:00 40 12 11
3 2 2173-04-12 2173-04-21 23:59:00 40 12 12
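An equivalent sketch uses dt.normalize for both columns, with the 23:59 offset written as a Timedelta instead of a DateOffset:
df['time_1'] = pd.to_datetime(df['time_1']).dt.normalize()
df['time_2'] = pd.to_datetime(df['time_2']).dt.normalize() + pd.Timedelta(hours=23, minutes=59)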
I have a data set where the same CATEGORY can appear in multiple rows. I want to compare the two date columns within the same CATEGORY.
I want to see whether DATE1 is greater than the DATE2 values of the same CATEGORY, and find the earliest DATE it is greater than.
I'm trying this, but I'm not getting the results that I am looking for:
df['test'] = np.where(m['DATE1'] < df['DATE2'], Y, N)
CATEGORY DATE1 DATE2 GREATERTHAN GREATERDATE
0 23 2015-01-18 2015-01-15 Y 2015-01-10
1 11 2015-02-18 2015-02-19 N 0
2 23 2015-03-18 2015-01-10 Y 2015-01-10
3 11 2015-04-18 2015-08-18 Y 2015-02-19
4 23 2015-05-18 2015-02-21 Y 2015-01-10
5 11 2015-06-18 2015-08-18 Y 2015-02-19
6 15 2015-07-18 2015-02-18 0 0
df['DATE1'] = pd.to_datetime(df['DATE1'])
df['DATE2'] = pd.to_datetime(df['DATE2'])
df['GREATERTHAN'] = np.where(df['DATE1'] > df['DATE2'], 'Y', 'N')
## Getting the earliest date for which data is available, per category
## Series.append was removed in pandas 2.0, so concatenate the two date columns with pd.concat
earliest_dates = df.groupby(['CATEGORY']).apply(lambda x: pd.concat([x['DATE1'], x['DATE2']]).min()).to_frame('EARLIEST_DATE')
## Merging to get the earliest date column per category
df = df.merge(earliest_dates, left_on='CATEGORY', right_index=True, how='left')