Cohort analysis by day - python

I'm using some code to create a cohort analysis based on customer retention. As the code stands, the analysis is grouped by day, but I need the output grouped by month. In the output attached below, the index is a day and it should be a month, and the 1, 2, 3, 4 columns should each represent one day of the month, so the output columns should run from 1 to 31.
.csv file sample:
timestamp user_hash
0 2019-02-01 NaN
1 2019-02-01 e044a52188dbccd71428
2 2019-02-01 NaN C0D1A22B-9ECB-4DEF
3 2019-02-01 d7260762b66295fbf9e5
4 2019-02-01 d7260762b66295fbf9e5
Actual output sample:
CohortIndex 1 2 3 4
CohortMonth
2019-02-01 399.0 202.0 160.0 117.0
2019-02-02 215.0 109.0 89.0 61.0
2019-02-03 146.0 79.0 62.0 50.0
2019-02-04 175.0 67.0 50.0 32.0
2019-02-05 179.0 52.0 39.0 32.0
2019-02-06 137.0 31.0 29.0 16.0
2019-02-07 139.0 42.0 33.0 25.0
2019-02-08 143.0 35.0 32.0 24.0
2019-02-09 105.0 31.0 23.0 12.0
The code used is the following:
import pandas as pd
import datetime as dt

df_events = pd.read_csv('.../events.csv')

# convert the object column to datetime and remove the time from the column
df_events['timestamp'] = pd.to_datetime(df_events['timestamp'].str.strip(), format='%d/%m/%y %H:%M')
df_events['timestamp'] = df_events['timestamp'].dt.date

# drop NaN rows from the user_hash column
clean_data = df_events[df_events['user_hash'].notna()]

# function to check if we have NaN where we shouldn't
def missing_data(x):
    return x.isna().sum()

# uses datetime to get the month from a timestamp and strip the day/time
def get_month(x):
    return dt.datetime(x.year, x.month, 1)  # year, month, day fixed to 1

# create a new column
clean_data['LoginMonth'] = clean_data['timestamp'].apply(get_month)

# create a new column called CohortMonth from the groupby
clean_data['CohortMonth'] = clean_data.groupby('user_hash')['timestamp'].transform('min')

# create the cohort
def get_date(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day

# create two variables, one for the year and one for the month. The function returns
# three values, so _ discards the day
login_year, login_month, _ = get_date(clean_data, 'LoginMonth')
clean_data['CohortMonth'] = pd.to_datetime(clean_data['CohortMonth'])
cohort_year, cohort_month, _ = get_date(clean_data, 'CohortMonth')
year_diff = login_year - cohort_year
month_diff = login_month - cohort_month
clean_data['CohortIndex'] = year_diff*12 + month_diff + 1

# create the cohort analysis data table
cohort_data = clean_data.groupby(['CohortMonth', 'CohortIndex'])['user_hash'].apply(pd.Series.nunique).reset_index()
cohort_count = cohort_data.pivot_table(index='CohortMonth',
                                       columns='CohortIndex',
                                       values='user_hash')
Thanks!
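One possible adaptation (a sketch under assumptions, not a tested answer): build CohortMonth as the calendar month of each user's first event and make CohortIndex the day number since that month started, so the pivoted columns run 1, 2, 3, ... up to 31. The 31-day cutoff and this reading of the 1-31 columns are assumptions about the desired output:

import pandas as pd

df_events = pd.read_csv('.../events.csv')
df_events['timestamp'] = pd.to_datetime(df_events['timestamp'].str.strip(), format='%d/%m/%y %H:%M')
clean_data = df_events[df_events['user_hash'].notna()].copy()

# cohort = calendar month of the user's first event
clean_data['CohortMonth'] = (clean_data.groupby('user_hash')['timestamp']
                                       .transform('min')
                                       .dt.to_period('M')
                                       .dt.to_timestamp())

# index = day number since the cohort month started (1-based)
clean_data['CohortIndex'] = (clean_data['timestamp'].dt.normalize() - clean_data['CohortMonth']).dt.days + 1

# unique users per cohort month and day index, pivoted to one column per day
cohort_count = (clean_data.groupby(['CohortMonth', 'CohortIndex'])['user_hash']
                          .nunique()
                          .unstack('CohortIndex'))

# keep only the first 31 day-columns if that is the desired window (assumption)
cohort_count = cohort_count.loc[:, cohort_count.columns <= 31]

If the index should instead count days since each user's own first event rather than since the cohort month start, subtract the per-user minimum timestamp in the CohortIndex line instead of CohortMonth.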

Related

Generate weeks from column with dates

I have a large dataset containing a date column with dates from 2019 onwards. I want to generate, in a separate column, the week numbers that those dates fall into.
Here is what the date column looks like:
import pandas as pd
data = {'date': ['2019-09-10', 'NaN', '2019-10-07', '2019-11-04', '2019-11-28',
'2019-12-02', '2020-01-24', '2020-01-29', '2020-02-05',
'2020-02-12', '2020-02-14', '2020-02-24', '2020-03-11',
'2020-03-16', '2020-03-17', '2020-03-18', '2021-09-14',
'2021-09-30', '2021-10-07', '2021-10-08', '2021-10-12',
'2021-10-14', '2021-10-15', '2021-10-19', '2021-10-21',
'2021-10-26', '2021-10-28', '2021-10-29', '2021-11-02',
'2021-11-15', '2021-11-16', '2021-12-01', '2021-12-07',
'2021-12-09', '2021-12-10', '2021-12-14', '2021-12-15',
'2022-01-13', '2022-01-14', '2022-01-21', '2022-01-24',
'2022-01-25', '2022-01-27', '2022-01-31', '2022-02-01',
'2022-02-10', '2022-02-11', '2022-02-16', '2022-02-24']}
df = pd.DataFrame(data)
Now, starting from the first day the data was collected, I want to count 7 days using the date column and make that a week. For example, if the first week contains the first 7 days, I create a column and call it week one. I want to repeat the process until the last week the data was collected.
It might be a good idea to sort the dates in order from the first date to the current one.
I have tried this, but it's not generating weeks in order; it actually produces repeated week numbers.
pd.to_datetime(df['date'], errors='coerce').dt.week
My intention is: starting from the first date the data was collected, count 7 days and store that as week one, then continue incrementally until the last week, say week number 66.
Here is the expected column of weeks created from the date column:
import pandas as pd
week_df = {'weeks': ['1', '2', "3", "5", '6']}
df_weeks = pd.DataFrame(week_df)
IIUC use:
df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].sub(df['date'].iat[0]).dt.days // 7 + 1
print (df.head(10))
date week
0 2019-09-10 1.0
1 NaT NaN
2 2019-10-07 4.0
3 2019-11-04 8.0
4 2019-11-28 12.0
5 2019-12-02 12.0
6 2020-01-24 20.0
7 2020-01-29 21.0
8 2020-02-05 22.0
9 2020-02-12 23.0
You have more than 66 weeks here, so either you want the real week count since the beginning or you want a dummy week rank. See below for both solutions:
# convert to week period
s = pd.to_datetime(df['date']).dt.to_period('W')
# get real week number
df['week'] = s.sub(s.iloc[0]).dropna().apply(lambda x: x.n).add(1)
# get dummy week rank
df['week2'] = s.rank(method='dense')
output:
date week week2
0 2019-09-10 1.0 1.0
1 NaN NaN NaN
2 2019-10-07 5.0 2.0
3 2019-11-04 9.0 3.0
4 2019-11-28 12.0 4.0
5 2019-12-02 13.0 5.0
6 2020-01-24 20.0 6.0
7 2020-01-29 21.0 7.0
8 2020-02-05 22.0 8.0
9 2020-02-12 23.0 9.0
10 2020-02-14 23.0 9.0
11 2020-02-24 25.0 10.0
12 2020-03-11 27.0 11.0
13 2020-03-16 28.0 12.0
14 2020-03-17 28.0 12.0
15 2020-03-18 28.0 12.0
16 2021-09-14 106.0 13.0
17 2021-09-30 108.0 14.0
18 2021-10-07 109.0 15.0
19 2021-10-08 109.0 15.0
...
42 2022-01-27 125.0 26.0
43 2022-01-31 126.0 27.0
44 2022-02-01 126.0 27.0
45 2022-02-10 127.0 28.0
46 2022-02-11 127.0 28.0
47 2022-02-16 128.0 29.0
48 2022-02-24 129.0 30.0

Get average monthly value by dividing by its monthly row count

I have the following dataframe:
created_time shares_count
2021-07-01 250.0
2021-07-31 501.0
2021-08-02 48.0
2021-08-05 300.0
2021-08-07 200.0
2021-09-06 28.0
2021-09-08 100.0
2021-09-25 100.0
2021-09-30 200.0
I did the monthly grouping like this:
df_groupby_monthly = df.groupby(pd.Grouper(key='created_time',freq='M')).sum()
df_groupby_monthly
Now how do I get the average of these 'shares_count' values by dividing by the number of rows in each month?
For example: the 07th month has 2 rows, so the average should be 751.0/2 = 375.5; the 08th month has 3 rows, so the average should be 548.0/3 = 182.666; and the 09th month has 4 rows, so the average should be 428.0/4 = 107.0.
How do I get a final output like this?
created_time shares_count
2021-07-31 375.5
2021-08-31 182.666
2021-09-30 107.0
I have tried the following:
df.groupby(pd.Grouper(key='created_time',freq='M')).apply(lambda x: x['shares_count'].sum()/len(x))
This works when there is only one column, but it's hard to extend to multiple columns.
df['created_time'] = pd.to_datetime(df['created_time'])
output = df.groupby(df['created_time'].dt.to_period('M')).mean().round(2).reset_index()
output
###
created_time shares_count
0 2021-07 375.50
1 2021-08 182.67
2 2021-09 107.00
Use this code:
df=df.groupby(pd.Grouper(key='created_time',freq='M')).agg({'shares_count':['sum', 'count']}).reset_index()
df['ss']=df[('shares_count','sum')]/df[('shares_count','count')]
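If the goal is the exact two-column output shown in the question, the MultiIndex columns left by agg can be flattened afterwards; a small follow-up sketch (the flattened column names are assumptions):

# flatten the MultiIndex columns produced by agg and keep only the monthly average
df.columns = ['created_time', 'sum', 'count', 'shares_count']
df = df[['created_time', 'shares_count']]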

How to create a new column with the last value of the previous year

I have this data frame
import pandas as pd
df = pd.DataFrame({'COTA':['A','A','A','A','A','B','B','B','B'],
'Date':['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020','20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
'Mark':[1,2,3,4,5,1,2,3,3]
})
print(df)
Based on this data frame, I want the Mark from the previous year. I managed to get the maximum Mark per COTA, but I want the last one; I used .max() and thought I could get it with .last(), but it didn't work.
Here is an example of my code:
df['Date'] = pd.to_datetime(df['Date'])
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['COTA', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['COTA', 'LastYear'])
print (df)
COTA Date Mark LastYear Max_MarkLastYear
0 A 2021-10-14 1 2021-12-31 5.0
1 A 2020-10-19 2 2020-12-31 3.0
2 A 2019-10-29 3 2019-12-31 NaN
3 A 2021-09-30 4 2021-12-31 5.0
4 A 2020-09-20 5 2020-12-31 3.0
5 B 2021-10-20 1 2021-12-31 3.0
6 B 2020-10-29 2 2020-12-31 3.0
7 B 2019-10-15 3 2019-12-31 NaN
8 B 2020-10-09 3 2020-12-31 3.0
How do I create a new column with the last value of the previous year?
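One possible sketch, keeping the structure of the code above but sorting by Date and taking the last Mark of each group instead of the maximum (dayfirst=True is an assumption for the dd/mm/yyyy strings; the Last_MarkLastYear name is made up for illustration):

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
# sort by Date so .last() picks the most recent Mark of each (COTA, year)
s1 = df.sort_values('Date').groupby(['COTA', 'LastYear'])['Mark'].last()
# shift the index one year forward so each row looks up the previous year's value
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Last_MarkLastYear'), on=['COTA', 'LastYear'])
print(df)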

Python Pandas - Replace NaN values of a column with respect to another column using interpolate()

I am facing a problem dealing with NaN values in the Temperature column with respect to the City column, using interpolate().
The df is:
import pandas as pd

data = {
    'City': ['Greenville','Charlotte', 'Los Gatos','Greenville','Carson City','Greenville','Greenville','Charlotte','Carson City',
             'Greenville','Charlotte','Fort Lauderdale', 'Rifle', 'Los Gatos','Fort Lauderdale'],
    'Rec_times': ['2019-05-21 08:29:55','2019-01-27 17:43:09','2020-12-13 21:53:00','2019-07-17 11:43:09','2018-04-17 16:51:23',
                  '2019-10-07 13:28:09','2020-01-07 11:38:10','2019-11-03 07:13:09','2020-11-19 10:45:23','2020-10-07 15:48:19','2020-10-07 10:53:09',
                  '2017-08-31 17:40:49','2016-08-31 17:40:49','2021-11-13 20:13:10','2016-08-31 19:43:29'],
    'Temperature': [30,45,26,33,50,None,29,None,48,32,47,33,None,None,28],
    'Pressure': [30,None,26,43,50,36,29,None,48,32,None,35,23,49,None]
}
df = pd.DataFrame(data)
df
Output:
City Rec_times Temperature Pressure
0 Greenville 2019-05-21 08:29:55 30.0 30.0
1 Charlotte 2019-01-27 17:43:09 45.0 NaN
2 Los Gatos 2020-12-13 21:53:00 26.0 26.0
3 Greenville 2019-07-17 11:43:09 33.0 43.0
4 Carson City 2018-04-17 16:51:23 50.0 50.0
5 Greenville 2019-10-07 13:28:09 NaN 36.0
6 Greenville 2020-01-07 11:38:10 29.0 29.0
7 Charlotte 2019-11-03 07:13:09 NaN NaN
8 Carson City 2020-11-19 10:45:23 48.0 48.0
9 Greenville 2020-10-07 15:48:19 32.0 32.0
10 Charlotte 2020-10-07 10:53:09 47.0 NaN
11 Fort Lauderdale 2017-08-31 17:40:49 33.0 35.0
12 Rifle 2016-08-31 17:40:49 NaN 23.0
13 Los Gatos 2021-11-13 20:13:10 NaN 49.0
14 Fort Lauderdale 2016-08-31 19:43:29 28.0 NaN
I want to deal with the NaN values in the Temperature column by grouping the records by City and using interpolate(method='time').
Ex:
Consider the city 'Greenville': it has 5 temperatures (30, 33, NaN, 29 and 32) recorded at different times. The NaN value in Temperature is replaced by grouping the records by City and using interpolate(method='time').
Note: If you know any other optimal method to replace NaN in Temperature, you can post it as an 'Other solution'.
Use a lambda function with a DatetimeIndex created by DataFrame.set_index, together with GroupBy.transform:
df["Rec_times"] = pd.to_datetime(df["Rec_times"])
df['Temperature'] = (df.set_index('Rec_times')
                       .groupby("City")['Temperature']
                       .transform(lambda x: x.interpolate(method='time'))
                       .to_numpy())
One possible idea for replacing missing values after interpolate is to replace them by the mean of all values like:
df.Temperature = df.Temperature.fillna(df.Temperature.mean())
My understanding is that you want to replace the NaN values in the Temperature column by an interpolation of the temperature in that specific city.
I would have to think about a more sophisticated solution. But here is a simple hack:
df["Rec_times"] = pd.to_datetime(df["Rec_times"])  # .interpolate(method='time') requires datetime
df["idx"] = df.index  # to restore the original ordering later
df_new = pd.DataFrame()  # will hold the new data
for (city, group) in df.groupby("City"):
    group = group.set_index("Rec_times", drop=False)
    df_new = pd.concat((df_new, group.interpolate(method='time')))
df_new = df_new.set_index("idx").sort_index()  # restore the original ordering
df_new
Note that interpolation for Rifle will yield NaN given there is only one data point which is NaN.
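If those leftover NaNs also need a value, one hedged fallback (mirroring the mean-fill idea from the earlier answer, assuming a global mean is acceptable) is:

# fill groups that interpolation could not resolve (e.g. Rifle) with the overall mean
df_new['Temperature'] = df_new['Temperature'].fillna(df_new['Temperature'].mean())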

Get summary data columns in new pandas dataframe from existing dataframe based on other column-ID

I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID number, and it has surface and volume values for each room in the apartment. What I want is a dataframe that summarizes this and gives me the total surface and volume per apartment. There are two conditions for the original dataframe:
- the dataframe can contain empty cells
- when the values of surface or volume are equal for all of the rows within an ID (so all the same values for the same ID), the data (surface, volume) is not summed; instead one value/row is passed to the new summary column (example: 'ID 4'), as this could be a mistake in the original dataframe where the total surface/volume was inserted for all the rooms by the government employee
Initial dataframe 'data':
print(data)
ID Surface Volume
0 2 10.0 25.0
1 2 12.0 30.0
2 2 24.0 60.0
3 2 8.0 20.0
4 4 84.0 200.0
5 4 84.0 200.0
6 4 84.0 200.0
7 52 NaN NaN
8 52 96.0 240.0
9 95 8.0 20.0
10 95 6.0 15.0
11 95 12.0 30.0
12 95 30.0 75.0
13 95 12.0 30.0
Desired output from 'df':
print(df)
ID Surface Volume
0 2 54.0 135.0
1 4 84.0 200.0 #-> as the values are the same for each row of this ID in the original data, the sum is not taken, but only one of the rows is passed (see the second condition)
2 52 96.0 240.0
3 95 68.0 170.0
Tried code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [2,4,52,95]})

data = pd.DataFrame({"ID": [2,2,2,2,4,4,4,52,52,95,95,95,95,95],
"Surface": [10,12,24,8,84,84,84,np.nan,96,8,6,12,30,12],
"Volume": [25,30,60,20,200,200,200,np.nan,240,20,15,30,75,30]})

print(data)

#Tried something, but no idea how to do this actually:
df["Surface"] = data.groupby("ID").agg(sum)
df["Volume"] = data.groupby("ID").agg(sum)
print(df)
Two conditions are needed here: first, test whether the values are unique per group for each column separately, using GroupBy.transform with DataFrameGroupBy.nunique and comparing to 1 with eq; the second condition uses DataFrame.duplicated on each column together with the ID column.
Chain both masks with & for bitwise AND, replace the matched values with NaN using DataFrame.mask, and finally aggregate with sum:
cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)
ID Surface Volume
0 2 54.0 135.0
1 4 84.0 200.0
2 52 96.0 240.0
3 95 68.0 170.0
If you need new columns filled with the aggregated sum values, use GroupBy.transform:
cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
data[cols] = data[cols].mask(m1 & m2).groupby(data["ID"]).transform('sum')
print(data)
ID Surface Volume
0 2 54.0 135.0
1 2 54.0 135.0
2 2 54.0 135.0
3 2 54.0 135.0
4 4 84.0 200.0
5 4 84.0 200.0
6 4 84.0 200.0
7 52 96.0 240.0
8 52 96.0 240.0
9 95 68.0 170.0
10 95 68.0 170.0
11 95 68.0 170.0
12 95 68.0 170.0
13 95 68.0 170.0
