I have the following data frame
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({
'Tech en Innovation Fonds': {0: '63.57', 1: '63.57', 2: '63.57', 3: '63.57', 4: '61.03', 5: '61.03', 6: 61.03},
'Aandelen Index Fonds': {0: '80.22', 1: '80.22', 2: '80.22', 3: '80.22', 4: '79.85', 5: '79.85', 6: 79.85},
'Behoudend Mix Fonds': {0: '44.80', 1: '44.8', 2: '44.8', 3: '44.8', 4: '44.8', 5: '44.8', 6: 44.8},
'Neutraal Mix Fonds': {0: '50.43', 1: '50.43', 2: '50.43', 3: '50.43', 4: '50.37', 5: '50.37', 6: 50.37},
'Dynamisch Mix Fonds': {0: '70.20', 1: '70.2', 2: '70.2', 3: '70.2', 4: '70.04', 5: '70.04', 6: 70.04},
'Risicomijdende Strategie': {0: '46.03', 1: '46.03', 2: '46.03', 3: '46.03', 4: '46.08', 5: '46.08', 6: 46.08},
'Tactische Strategie': {0: '48.69', 1: '48.69', 2: '48.69', 3: '48.69', 4: '48.62', 5: '48.62', 6: 48.62},
'Aandelen Groei Strategie': {0: '52.91', 1: '52.91', 2: '52.91', 3: '52.91', 4: '52.77', 5: '52.77', 6: 52.77},
'Datum': {0: Timestamp('2022-07-08 18:00:00'), 1: Timestamp('2022-07-11 19:42:55'), 2: Timestamp('2022-07-12 09:12:09'), 3: Timestamp('2022-07-12 09:29:53'), 4: Timestamp('2022-07-12 15:24:46'), 5: Timestamp('2022-07-12 15:30:02'), 6: Timestamp('2022-07-12 15:59:31')}})
I scrape these from a website several times a day.
I am looking for a way to clean the dataframe so that for each day only the latest entry is kept.
So for this dataframe, 2022-07-12 has 5 entries, but I only want to keep the last one, i.e. 2022-07-12 15:59:31.
The entries for the previous days were already cleaned up manually :-(
I intend to do this once a month, so each day will have several entries.
I already tried
dfclean=df.sort_values('Datum').drop_duplicates('Datum', keep='last')
But that gives me all the records back, because the times differ.
Does anyone have an idea how to do this?
If the data is sorted by date, use a groupby.last:
df.groupby(df['Datum'].dt.date, as_index=False).last()
else:
df.loc[df.groupby(df['Datum'].dt.date)['Datum'].idxmax()]
Output:
Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds \
0 63.57 80.22 44.80
1 63.57 80.22 44.8
2 61.03 79.85 44.8
Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie \
0 50.43 70.20 46.03
1 50.43 70.2 46.03
2 50.37 70.04 46.08
Tactische Strategie Aandelen Groei Strategie Datum
0 48.69 52.91 2022-07-08 18:00:00
1 48.69 52.91 2022-07-11 19:42:55
2 48.62 52.77 2022-07-12 15:59:31
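A small follow-up of my own on the first option: groupby.last only picks the latest scrape here because the frame is already in chronological order, so sorting explicitly first is a bit safer. A minimal sketch:
# Sketch: sort first so .last() really returns the latest scrape of each day
df_sorted = df.sort_values('Datum')
dfclean = df_sorted.groupby(df_sorted['Datum'].dt.date, as_index=False).last()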
You can use .max() with datetime columns like this:
dfclean = df.loc[
(df['Datum'].dt.date < df['Datum'].max().date()) |
(df['Datum'] == df['Datum'].max())
]
Output:
Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds \
0 63.57 80.22 44.80
1 63.57 80.22 44.8
6 61.03 79.85 44.8
Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie \
0 50.43 70.20 46.03
1 50.43 70.2 46.03
6 50.37 70.04 46.08
Tactische Strategie Aandelen Groei Strategie Datum
0 48.69 52.91 2022-07-08 18:00:00
1 48.69 52.91 2022-07-11 19:42:55
6 48.62 52.77 2022-07-12 15:59:31
Below is a working example, where I keep only the date part of the timestamp to filter the dataframe:
df['Datum_Date'] = df['Datum'].dt.date
dfclean = df.sort_values('Datum_Date').drop_duplicates('Datum_Date', keep='last')
dfclean = dfclean.drop(columns='Datum_Date')
Does this get you what you need?
df['Day'] = df['Datum'].dt.day
df.loc[df.groupby('Day')['Datum'].idxmax()]
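One caveat with the snippet above (my own note): dt.day is just the day-of-month number, so once more than one month is scraped, 12 July and 12 August would land in the same group. Grouping on the full calendar date avoids that; a small sketch:
# Sketch: group on the full date rather than the day-of-month number
df['Day'] = df['Datum'].dt.date
dfclean = df.loc[df.groupby('Day')['Datum'].idxmax()].drop(columns='Day')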
Dataset is something like this (there will be duplicate rows in the original):
Code:
import pandas as pd
df_in = pd.DataFrame({'email_ID': {0: 'sachinlaltaprayoohoo',
1: 'sachinlaltaprayoohoo',
2: 'sachinlaltaprayoohoo',
3: 'sachinlaltaprayoohoo',
4: 'sachinlaltaprayoohoo',
5: 'sachinlaltaprayoohoo',
6: 'sheldon.yokoohoo',
7: 'sheldon.yokoohoo',
8: 'sheldon.yokoohoo',
9: 'sheldon.yokoohoo',
10: 'sheldon.yokoohoo',
11: 'sheldon.yokoohoo'},
'time_stamp': {0: '2021-09-10 09:01:56.340259',
1: '2021-09-10 09:01:56.672814',
2: '2021-09-10 09:01:57.471423',
3: '2021-09-10 09:01:57.480891',
4: '2021-09-10 09:01:57.484644',
5: '2021-09-10 09:01:57.984644',
6: '2021-09-10 09:01:56.340259',
7: '2021-09-10 09:01:56.672814',
8: '2021-09-10 09:01:57.471423',
9: '2021-09-10 09:01:57.480891',
10: '2021-09-10 09:01:57.484644',
11: '2021-09-10 09:01:57.984644'},
'screen': {0: 'rewardapp.SplashActivity',
1: 'i1',
2: 'rewardapp.Signup_in',
3: 'rewardapp.PaymentFinalConfirmationActivity',
4: 'rewardapp.Signup_in',
5: 'i1',
6: 'rewardapp.SplashActivity',
7: 'i1',
8: 'rewardapp.Signup_in',
9: 'i1',
10: 'rewardapp.Signup_in',
11: 'rewardapp.PaymentFinalConfirmationActivity'}})
df_in['time_stamp'] = df_in['time_stamp'].astype('datetime64[ns]')
df_in
Output should be this:
Code:
import pandas as pd
df_out = pd.DataFrame({'email_ID': {0: 'sachinlaltaprayoohoo',
1: 'sachinlaltaprayoohoo',
2: 'sachinlaltaprayoohoo',
3: 'sachinlaltaprayoohoo',
4: 'sachinlaltaprayoohoo',
5: 'sachinlaltaprayoohoo',
6: 'sheldon.yokoohoo',
7: 'sheldon.yokoohoo',
8: 'sheldon.yokoohoo',
9: 'sheldon.yokoohoo',
10: 'sheldon.yokoohoo',
11: 'sheldon.yokoohoo'},
'time_stamp': {0: '2021-09-10 09:01:56.340259',
1: '2021-09-10 09:01:56.672814',
2: '2021-09-10 09:01:57.471423',
3: '2021-09-10 09:01:57.480891',
4: '2021-09-10 09:01:57.484644',
5: '2021-09-10 09:01:57.984644',
6: '2021-09-10 09:01:56.340259',
7: '2021-09-10 09:01:56.672814',
8: '2021-09-10 09:01:57.471423',
9: '2021-09-10 09:01:57.480891',
10: '2021-09-10 09:01:57.484644',
11: '2021-09-10 09:01:57.984644'},
'screen': {0: 'rewardapp.SplashActivity',
1: 'i1',
2: 'rewardapp.Signup_in',
3: 'rewardapp.PaymentFinalConfirmationActivity',
4: 'rewardapp.Signup_in',
5: 'i1',
6: 'rewardapp.SplashActivity',
7: 'i1',
8: 'rewardapp.Signup_in',
9: 'i1',
10: 'rewardapp.Signup_in',
11: 'rewardapp.PaymentFinalConfirmationActivity'},
'series1': {0: 0,
1: 1,
2: 2,
3: 3,
4: 0,
5: 1,
6: 0,
7: 1,
8: 2,
9: 3,
10: 4,
11: 5},
'series2': {0: 0,
1: 0,
2: 0,
3: 0,
4: 1,
5: 1,
6: 2,
7: 2,
8: 2,
9: 2,
10: 2,
11: 2}})
df_out['time_stamp'] = df_out['time_stamp'].astype('datetime64[ns]')
df_out
The 'series1' column counts up row by row as 0, 1, 2, and so on, but resets to 0 when:
the 'email_ID' column value changes, or
the previous row's 'screen' value == 'rewardapp.PaymentFinalConfirmationActivity'.
The 'series2' column starts at 0 and increments by 1 whenever 'series1' resets.
My progress:
series1 = [0]
x = 0
for index in df_in[1:].index:
    if ((df_in._get_value(index - 1, 'email_ID')) == df_in._get_value(index, 'email_ID')) and (df_in._get_value(index - 1, 'screen') != 'rewardapp.PaymentFinalConfirmationActivity'):
        x += 1
        series1.append(x)
    else:
        x = 0
        series1.append(x)
df_in['series1'] = series1
df_in
series2 = [0]
x = 0
for index in df_in[1:].index:
    if df_in._get_value(index, 'series1') - df_in._get_value(index - 1, 'series1') == 1:
        series2.append(x)
    else:
        x += 1
        series2.append(x)
df_in['series2'] = series2
df_in
I think the code above is working; I'll test the answers and select the best in a few hours, thank you.
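As a quick sanity check of my own (assuming df_in has the series1/series2 columns from the loop above and df_out is the expected frame defined earlier), the two results can be compared directly:
# Should print True twice if the loop reproduces the expected columns
print(df_in['series1'].equals(df_out['series1']))
print(df_in['series2'].equals(df_out['series2']))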
Let's try
m = (df_in['email_ID'].ne(df_in['email_ID'].shift().bfill()) |
df_in['screen'].shift().eq('rewardapp.PaymentFinalConfirmationActivity'))
df_in['series1'] = df_in.groupby(m.cumsum()).cumcount()
df_in['series2'] = m.cumsum()
print(df_in)
email_ID time_stamp screen series1 series2
0 sachinlaltaprayoohoo 2021-09-10 09:01:56.340259 rewardapp.SplashActivity 0 0
1 sachinlaltaprayoohoo 2021-09-10 09:01:56.672814 i1 1 0
2 sachinlaltaprayoohoo 2021-09-10 09:01:57.471423 rewardapp.Signup_in 2 0
3 sachinlaltaprayoohoo 2021-09-10 09:01:57.480891 rewardapp.PaymentFinalConfirmationActivity 3 0
4 sachinlaltaprayoohoo 2021-09-10 09:01:57.484644 rewardapp.Signup_in 0 1
5 sachinlaltaprayoohoo 2021-09-10 09:01:57.984644 i1 1 1
6 sheldon.yokoohoo 2021-09-10 09:01:56.340259 rewardapp.SplashActivity 0 2
7 sheldon.yokoohoo 2021-09-10 09:01:56.672814 i1 1 2
8 sheldon.yokoohoo 2021-09-10 09:01:57.471423 rewardapp.Signup_in 2 2
9 sheldon.yokoohoo 2021-09-10 09:01:57.480891 i1 3 2
10 sheldon.yokoohoo 2021-09-10 09:01:57.484644 rewardapp.Signup_in 4 2
11 sheldon.yokoohoo 2021-09-10 09:01:57.984644 rewardapp.PaymentFinalConfirmationActivity 5 2
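A short note on why this works (my reading of the code above): m flags every row where a new run should start (the email changes, or the previous screen was the payment confirmation), and m.cumsum() turns those flags into group ids, so cumcount restarts at each flagged row. A tiny illustration with made-up values:
import pandas as pd

flags = pd.Series([False, False, True, False, True, False])  # True marks a reset point
group_ids = flags.cumsum()  # 0 0 1 1 2 2
print(pd.Series(range(6)).groupby(group_ids).cumcount().tolist())  # [0, 1, 0, 1, 0, 1]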
You can use:
import numpy as np

m = df_in['screen']=='rewardapp.PaymentFinalConfirmationActivity'
df_in['pf'] = np.where(m, 1, np.nan)
df_in.loc[m, 'pf'] = df_in.loc[m, 'pf'].cumsum()
grouper = df_in.groupby('email_ID')['pf'].bfill()
df_in['series1'] = df_in.groupby(grouper).cumcount()
df_in['series2'] = df_in.groupby(grouper.fillna(0), sort=False).ngroup()
df_in.drop('pf', axis=1, inplace=True)
print(df_in)
email_ID time_stamp \
0 sachinlaltaprayoohoo 2021-09-10 09:01:56.340259
1 sachinlaltaprayoohoo 2021-09-10 09:01:56.672814
2 sachinlaltaprayoohoo 2021-09-10 09:01:57.471423
3 sachinlaltaprayoohoo 2021-09-10 09:01:57.480891
4 sachinlaltaprayoohoo 2021-09-10 09:01:57.484644
5 sachinlaltaprayoohoo 2021-09-10 09:01:57.984644
6 sheldon.yokoohoo 2021-09-10 09:01:56.340259
7 sheldon.yokoohoo 2021-09-10 09:01:56.672814
8 sheldon.yokoohoo 2021-09-10 09:01:57.471423
9 sheldon.yokoohoo 2021-09-10 09:01:57.480891
10 sheldon.yokoohoo 2021-09-10 09:01:57.484644
11 sheldon.yokoohoo 2021-09-10 09:01:57.984644
screen series1 series2
0 rewardapp.SplashActivity 0 0
1 i1 1 0
2 rewardapp.Signup_in 2 0
3 rewardapp.PaymentFinalConfirmationActivity 3 0
4 rewardapp.Signup_in 0 1
5 i1 1 1
6 rewardapp.SplashActivity 0 2
7 i1 1 2
8 rewardapp.Signup_in 2 2
9 i1 3 2
10 rewardapp.Signup_in 4 2
11 rewardapp.PaymentFinalConfirmationActivity 5 2
Explanation:
First locate the rows where 'screen' is 'PaymentFinalConfirmationActivity' and then use cumsum() to identify their numbers.
This is accomplished by:
df_in['pf'] = np.where(m, 1, np.nan)
df_in.loc[m, 'pf'] = df_in.loc[m, 'pf'].cumsum()
Then use bfill to backfill the NaN values with the number of the next confirmation row, doing it per email_ID so that the rows leading up to each confirmation end up in the same group. This is accomplished by:
grouper = df_in.groupby('email_ID')['pf'].bfill()
Then it is straightforward to see that once you define a grouper, you can use cumcount to get the series1 column. This is done by:
df_in['series1'] = df_in.groupby(grouper).cumcount()
Then get series2 column by using ngroup(). But make sure the groupby is done with sort=False. Done by:
df_in['series2'] = df_in.groupby(grouper.fillna(0), sort=False).ngroup()
Finally drop the unwanted column pf.
df_in.drop('pf', axis=1, inplace=True)
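If it helps, the intermediate key can be inspected as well (a small check of my own, assuming grouper from the snippet above is still in scope):
# Inspect the backfilled key each row was grouped by, next to the screen column
print(pd.concat([df_in['screen'], grouper.rename('grouper')], axis=1))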
I would like to create a 2nd column based on the maximum date by month in the 1st column, but I'm having trouble identifying that maximum date by month (first step below).
I'm trying to do a groupby, but I'm getting a ValueError: Cannot index with multidimensional key.
I believe the steps are:
Within the datadate column, identify the maximum date in each month, e.g. 1/29/1993, 2/11/1993, 3/29/1993, etc.
For the datadate row that equals the maximum date of its month, put the last possible date of that month in a new column called last_day_in_month, e.g. 1/31/1993, 2/28/1993, 3/31/1993, etc. For all other rows, where datadate != maximum date of the month, put False.
Sample Data and Ideal Output:
{'tic': {0: 'SPY', 1: 'SPY', 2: 'SPY', 3: 'SPY', 4: 'SPY', 5: 'SPY', 6: 'SPY', 7: 'SPY', 8: 'SPY', 9: 'SPY'}, 'cusip': {0: '78462F103', 1: '78462F103', 2: '78462F103', 3: '78462F103', 4: '78462F103', 5: '78462F103', 6: '78462F103', 7: '78462F103', 8: '78462F103', 9: '78462F103'}, 'datadate': {0: '1993-01-29', 1: '1993-02-01', 2: '1993-02-02', 3: '1993-02-03', 4: '1993-02-04', 5: '1993-02-05', 6: '1993-02-08', 7: '1993-02-09', 8: '1993-02-10', 9: '1993-02-11'}, 'prccd': {0: 43.938, 1: 44.25, 2: 44.34375, 3: 44.8125, 4: 45.0, 5: 44.96875, 6: 44.96875, 7: 44.65625, 8: 44.71875, 9: 44.9375}, 'next_year': {0: '1994-01-25', 1: '1994-01-26', 2: '1994-01-27', 3: '1994-01-28', 4: '1994-01-31', 5: '1994-02-01', 6: '1994-02-02', 7: '1994-02-03', 8: '1994-02-04', 9: '1994-02-07'}, 'next_year_px': {0: 47.1875, 1: 47.3125, 2: 47.75, 3: 47.875, 4: 48.21875, 5: 47.96875, 6: 48.28125, 7: 48.0625, 8: 46.96875, 9: 47.1875}, 'one_yr_chg': {0: 0.073956484136738, 1: 0.0692090395480226, 2: 0.076814658210007, 3: 0.0683403068340306, 4: 0.0715277777777777, 5: 0.0667129951355107, 6: 0.0736622654621264, 7: 0.0762771168649405, 8: 0.050314465408805, 9: 0.0500695410292072}, 'daily_chg': {0: nan, 1: 0.0071009149255769, 2: 0.0021186440677967, 3: 0.0105708245243127, 4: 0.0041841004184099, 5: -0.0006944444444444, 6: 0.0, 7: -0.0069492703266157, 8: 0.0013995801259623, 9: 0.004891684136967}, 'last_day_in_month': {0: '1993-01-31', 1: 'False', 2: 'False', 3: 'False', 4: 'False', 5: 'False', 6: 'False', 7: 'False', 8: 'False', 9: '1993-02-28'}}
Group by month and use idxmax to find the maximum date in each month; use to_period and to_timestamp to get the last day of each month.
datetime = pd.to_datetime(df.datadate)
max_day_indx = datetime.groupby(datetime.dt.month).idxmax()
df['last_day_in_month'] = False
df.loc[max_day_indx, 'last_day_in_month'] = datetime[max_day_indx].dt.to_period('M').dt.to_timestamp('M').dt.strftime('%Y-%m-%d')
print(df)
tic cusip datadate prccd next_year next_year_px one_yr_chg \
0 SPY 78462F103 1993-01-29 43.93800 1994-01-25 47.18750 0.073956
1 SPY 78462F103 1993-02-01 44.25000 1994-01-26 47.31250 0.069209
2 SPY 78462F103 1993-02-02 44.34375 1994-01-27 47.75000 0.076815
3 SPY 78462F103 1993-02-03 44.81250 1994-01-28 47.87500 0.068340
4 SPY 78462F103 1993-02-04 45.00000 1994-01-31 48.21875 0.071528
5 SPY 78462F103 1993-02-05 44.96875 1994-02-01 47.96875 0.066713
6 SPY 78462F103 1993-02-08 44.96875 1994-02-02 48.28125 0.073662
7 SPY 78462F103 1993-02-09 44.65625 1994-02-03 48.06250 0.076277
8 SPY 78462F103 1993-02-10 44.71875 1994-02-04 46.96875 0.050314
9 SPY 78462F103 1993-02-11 44.93750 1994-02-07 47.18750 0.050070
daily_chg last_day_in_month
0 NaN 1993-01-31
1 0.007101 False
2 0.002119 False
3 0.010571 False
4 0.004184 False
5 -0.000694 False
6 0.000000 False
7 -0.006949 False
8 0.001400 False
9 0.004892 1993-02-28
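A caveat of my own, not part of the answer: grouping on dt.month alone would merge the same month from different years. If the data spans several years, grouping on a year-month period should be safer; a sketch under that assumption:
# Sketch: group on the year-month period instead of the bare month number
datetime = pd.to_datetime(df.datadate)
max_day_indx = datetime.groupby(datetime.dt.to_period('M')).idxmax()
df['last_day_in_month'] = False
df.loc[max_day_indx, 'last_day_in_month'] = datetime[max_day_indx].dt.to_period('M').dt.to_timestamp('M').dt.strftime('%Y-%m-%d')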
I have the following dataframe:
account_id contract_id type date activated
0 1 AAA Downgrade 2021-01-05
1 1 ADS Original 2020-12-12
2 1 ADGD Upgrade 2021-02-03
3 1 BB Winback 2021-05-08
4 1 CC Upgrade 2021-06-01
5 2 HHA Original 2021-03-05
6 2 HAKD Downgrade 2021-03-06
7 3 HADSA Original 2021-05-01
I want the following output:
account_id contract_id type date activated Renewal Order
0 1 ADS Original 2020-12-12 Original
1 1 AAA Downgrade 2021-01-05 1st
2 1 ADGD Upgrade 2021-02-03 2nd
3 1 BB Winback 2021-05-08 Original
4 1 CC Upgrade 2021-06-01 1st
5 2 HHA Original 2021-03-05 Original
6 2 HAKD Downgrade 2021-03-06 1st
7 3 HADSA Original 2021-05-01 Original
The column I want to create is "Renewal Order". Each account can have multiple contracts. The condition is based on each account (account_id), the type (only when it is either "Original" or "Winback"), and the order in which the contracts are activated ("date activated"). The first contract (or the one tagged as "Original" under the "type" column) is identified as "Original", while the succeeding contracts are "1st", "2nd", and so on. The order resets when a contract is tagged as "Winback" under the "type" column, i.e. that contract is again identified as "Original" and the succeeding contracts as "1st", "2nd", and so on (refer to contract_id BB).
I tried the following code, but it does not account for the "Winback" condition:
def format_order(n):
    if n == 0:
        return 'Original'
    suffix = ['th', 'st', 'nd', 'rd', 'th'][min(n % 10, 4)]
    if 11 <= (n % 100) <= 13:
        suffix = 'th'
    return str(n) + suffix
df = df.sort_values(['account_id', 'date activated']).reset_index(drop=True)
# apply
df['Renewal Order'] = df.groupby('account_id').cumcount().apply(format_order)
Here's the dictionary of the original dataframe:
{'account_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3},
'contract_id': {0: 'AAA',
1: 'ADS',
2: 'ADGD',
3: 'BB',
4: 'CC',
5: 'HHA',
6: 'HAKD',
7: 'HADSA'},
'type': {0: 'Downgrade',
1: 'Original',
2: 'Upgrade',
3: 'Winback',
4: 'Upgrade',
5: 'Original',
6: 'Downgrade',
7: 'Original'},
'date activated': {0: Timestamp('2021-01-05 00:00:00'),
1: Timestamp('2020-12-12 00:00:00'),
2: Timestamp('2021-02-03 00:00:00'),
3: Timestamp('2021-05-08 00:00:00'),
4: Timestamp('2021-06-01 00:00:00'),
5: Timestamp('2021-03-05 00:00:00'),
6: Timestamp('2021-03-06 00:00:00'),
7: Timestamp('2021-05-01 00:00:00')}}
Here's the dictionary for the result:
{'account_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3},
'contract_id': {0: 'ADS',
1: 'AAA',
2: 'ADGD',
3: 'BB',
4: 'CC',
5: 'HHA',
6: 'HAKD',
7: 'HADSA'},
'type': {0: 'Original',
1: 'Downgrade',
2: 'Upgrade',
3: 'Winback',
4: 'Upgrade',
5: 'Original',
6: 'Downgrade',
7: 'Original'},
'date activated': {0: Timestamp('2020-12-12 00:00:00'),
1: Timestamp('2021-01-05 00:00:00'),
2: Timestamp('2021-02-03 00:00:00'),
3: Timestamp('2021-05-08 00:00:00'),
4: Timestamp('2021-06-01 00:00:00'),
5: Timestamp('2021-03-05 00:00:00'),
6: Timestamp('2021-03-06 00:00:00'),
7: Timestamp('2021-05-01 00:00:00')},
'Renewal Order': {0: 'Original',
1: '1st',
2: '2nd',
3: 'Original',
4: '1st',
5: 'Original',
6: '1st',
7: 'Original'}}
Let us just change the cumcount result
s = df.groupby('account_id').cumcount()
s[df.type=='Winback'] = 0
df['Renewal Order'] = s.apply(format_order)
Using @BENY's solution:
df = df.sort_values(['account_id', 'date activated']).reset_index(drop=True)
s = df.groupby(['account_id',
(df['type'] == 'Winback').cumsum()
]).cumcount()
df['Renewal Order'] = s.apply(format_order)
Output:
account_id contract_id type date activated Renewal Order
0 1 ADS Original 2020-12-12 Original
1 1 AAA Downgrade 2021-01-05 1st
2 1 ADGD Upgrade 2021-02-03 2nd
3 1 BB Winback 2021-05-08 Original
4 1 CC Upgrade 2021-06-01 1st
5 2 HHA Original 2021-03-05 Original
6 2 HAKD Downgrade 2021-03-06 1st
7 3 HADSA Original 2021-05-01 Original
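The reason the count restarts at "Winback" rows (my reading of the code above): (df['type'] == 'Winback').cumsum() increases by one at every Winback, so each Winback row opens a new group within its account and cumcount starts again from 0 there. A tiny illustration with made-up values:
import pandas as pd

types = pd.Series(['Original', 'Downgrade', 'Winback', 'Upgrade'])
print((types == 'Winback').cumsum().tolist())  # [0, 0, 1, 1] -> the Winback row starts group 1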
I have a table of subscription start and end dates for several customers and various products. I want to get a single value for each customer's length of subscription with the company (regardless of product), but customers can start and stop subscriptions for different products at different times, and I don't want to double count the time periods of overlapping product subscriptions. How can I calculate this?
A sample data frame:
a = pd.DataFrame( {'index': {0: 9123, 1: 9919, 2: 191, 3: 8892, 4: 8528, 5: 8893, 6: 9124, 7: 192, 8: 8928, 9: 8602, 10: 9629}, 'user_id': {0: 163486, 1: 163486, 2: 163486, 3: 163486, 4: 163486, 5: 163486, 6: 163486, 7: 163486, 8: 545619, 9: 545619, 10: 545619}, 'prod_id': {0: 110, 1: 507, 2: 511, 3: 488, 4: 506, 5: 488, 6: 110, 7: 511, 8: 488, 9: 506, 10: 508}, 'created_at': {0: Timestamp('2016-08-13 11:38:21.706000'), 1: Timestamp('2016-08-13 11:38:21.712000'), 2: Timestamp('2016-08-13 11:38:21.719000'), 3: Timestamp('2016-08-21 15:29:02.863000'), 4: Timestamp('2016-08-21 15:29:02.877000'), 5: Timestamp('2017-01-25 00:26:24.096000'), 6: Timestamp('2017-01-25 00:27:00.205000'), 7: Timestamp('2017-01-25 00:27:00.212000'), 8: Timestamp('2016-08-10 13:55:15.608000'), 9: Timestamp('2016-08-10 13:55:15.623000'), 10: Timestamp('2016-08-10 13:55:15.636000')}, 'removed_at': {0: Timestamp('2017-01-25 00:27:00.220000'), 1: Timestamp('2017-01-25 00:27:00.231000'), 2: Timestamp('2017-01-25 00:27:00.240000'), 3: Timestamp('2017-01-25 00:26:24.108000'), 4: Timestamp('2017-01-25 00:26:24.123000'), 5: NaT, 6: NaT, 7: NaT, 8: Timestamp('2017-02-01 15:52:32.951000'), 9: Timestamp('2017-02-01 15:52:32.968000'), 10: Timestamp('2017-02-01 15:52:32.980000')}, 'length_of_sub': {0: Timedelta('164 days 12:48:38.514000'), 1: Timedelta('164 days 12:48:38.519000'), 2: Timedelta('164 days 12:48:38.521000'), 3: Timedelta('156 days 08:57:21.245000'), 4: Timedelta('156 days 08:57:21.246000'), 5: NaT, 6: NaT, 7: NaT, 8: Timedelta('175 days 01:57:17.343000'), 9: Timedelta('175 days 01:57:17.345000'), 10: Timedelta('175 days 01:57:17.344000')}} )
will yield this:
index user_id prod_id created_at \
0 9123 163486 110 2016-08-13 11:38:21.706
1 9919 163486 507 2016-08-13 11:38:21.712
2 191 163486 511 2016-08-13 11:38:21.719
3 8892 163486 488 2016-08-21 15:29:02.863
4 8528 163486 506 2016-08-21 15:29:02.877
5 8893 163486 488 2017-01-25 00:26:24.096
6 9124 163486 110 2017-01-25 00:27:00.205
7 192 163486 511 2017-01-25 00:27:00.212
8 8928 545619 488 2016-08-10 13:55:15.608
9 8602 545619 506 2016-08-10 13:55:15.623
10 9629 545619 508 2016-08-10 13:55:15.636
removed_at length_of_sub
0 2017-01-25 00:27:00.220 164 days 12:48:38.514000
1 2017-01-25 00:27:00.231 164 days 12:48:38.519000
2 2017-01-25 00:27:00.240 164 days 12:48:38.521000
3 2017-01-25 00:26:24.108 156 days 08:57:21.245000
4 2017-01-25 00:26:24.123 156 days 08:57:21.246000
5 NaT NaT
6 NaT NaT
7 NaT NaT
8 2017-02-01 15:52:32.951 175 days 01:57:17.343000
9 2017-02-01 15:52:32.968 175 days 01:57:17.345000
10 2017-02-01 15:52:32.980 175 days 01:57:17.344000
I expect the output to be a data frame with user_id as the index and a length_of_sub column that gets the value 175 days for user 545619 and 164 days for user 163486. I don't think it's a simple maximum though, since technically the created_at/removed_at periods of different products can overlap only partially.
I also want to exclude periods where they aren't subscribed to anything at all.
Does anyone know how I can write a function that can be passed to .apply that will calculate the actual length_of_sub for a given user?
The approach I took is to treat each created_at and removed_at as a separate event. As I iterate through the sorted set of created_at/removed_at events, I accumulate into a variable named has_sub a 1 if the event is a created_at and a -1 if it is a removed_at. While this variable is greater than 0, we have an active subscription.
def count_sub_time(d):
    m = {'created_at': 1, 'removed_at': -1}
    d = d.rename(columns=m).stack().sort_values()
    has_sub = 0
    start_sub = pd.NaT
    count = pd.Timedelta(0)
    for (_, s), t in d.items():
        if has_sub == 0 and s == 1:
            start_sub = t
        elif has_sub == 1 and s == -1:
            count += t - start_sub
        has_sub += s
    return count
b = a.set_index('user_id')[['created_at', 'removed_at']]
b.dropna().groupby(level=0).apply(count_sub_time)
user_id
163486 164 days 12:48:38.534000
545619 175 days 01:57:17.372000
dtype: timedelta64[ns]
You could probably sharpen this up a bit, but the logic is there.
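As a small usage follow-up (assuming a and count_sub_time as defined above), the result can be renamed into the length_of_sub column the question asks for:
b = a.set_index('user_id')[['created_at', 'removed_at']]
result = b.dropna().groupby(level=0).apply(count_sub_time).rename('length_of_sub').to_frame()
print(result)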
You can do this with a couple of groupby statements (instead of "apply") to get the answer you requested:
start = a.groupby('user_id')['created_at'].min()
end = a.groupby('user_id')['removed_at'].max()
diff = (end - start).dt.days.rename('length_of_sub').to_frame()
print(diff)
length_of_sub
user_id
163486 164
545619 175
I'm assuming you don't care about periods where a given customer might have a gap where they stopped subscribing to anything at all in between other subscriptions.
My dataframe looks like:
a = pd.DataFrame({'clicks': {0: 4020, 1: 3718, 2: 2700, 3: 3867, 4: 4018, 5: 4760, 6: 4029},
                  'date': {0: '23-02-2016', 1: '24-02-2016', 2: '11/2/2016', 3: '12/2/2016', 4: '13-02-2016', 5: '14-02-2016', 6: '15-02-2016'}})
The rows have two different date formats.
The format I need is:
a = pd.DataFrame({'clicks': {0: 4020, 1: 3718, 2: 2700, 3: 3867, 4: 4018, 5: 4760, 6: 4029},
                  'date': {0: '2/23/2016', 1: '2/24/2016', 2: '2/11/2016', 3: '2/12/2016', 4: '2/13/2016', 5: '2/14/2016', 6: '2/15/2016'}})
So far I managed to open the csv in Excel as text data (UTF-8 format) and choose an MDY format for the date column. Then I apply:
a['date'] = a['date'].apply(lambda x: datetime.strptime(x,'%m/%d/%Y'))
How can I efficiently do that in Pandas?
You can convert to datetime using to_datetime and then call dt.strftime to get it in the format you want:
In [21]:
a['date'] = pd.to_datetime(a['date'], dayfirst=True).dt.strftime('%m/%d/%Y')
a
Out[21]:
clicks date
0 4020 02/23/2016
1 3718 02/24/2016
2 2700 02/11/2016
3 3867 02/12/2016
4 4018 02/13/2016
5 4760 02/14/2016
6 4029 02/15/2016
If the column is already datetime dtype, then you can skip the to_datetime step.
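If you would rather handle the two known formats explicitly than rely on inference (a sketch of my own, assuming the day-first reading shown above is the intended one), each format can be parsed separately and the results combined:
import pandas as pd

# Sketch: parse each known input format explicitly, then fill the gaps from the other pass
d1 = pd.to_datetime(a['date'], format='%d-%m-%Y', errors='coerce')  # e.g. '23-02-2016'
d2 = pd.to_datetime(a['date'], format='%d/%m/%Y', errors='coerce')  # e.g. '11/2/2016'
a['date'] = d1.fillna(d2).dt.strftime('%m/%d/%Y')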