I have the following dataframe:
created_time shares_count
2021-07-01 250.0
2021-07-31 501.0
2021-08-02 48.0
2021-08-05 300.0
2021-08-07 200.0
2021-09-06 28.0
2021-09-08 100.0
2021-09-25 100.0
2021-09-30 200.0
I did the grouping as monthly, like this:
df_groupby_monthly = df.groupby(pd.Grouper(key='created_time',freq='M')).sum()
df_groupby_monthly
Now, how do I get the monthly average of shares_count by dividing each monthly sum by the number of rows in that month?
ex: the 07th month has 2 rows, so the average should be 751.0/2 = 375.5; the 08th month has 3 rows, so the average should be 548.0/3 = 182.666; and the 09th month has 4 rows, so the average should be 428.0/4 = 107.0
How do I get a final output like this?
created_time shares_count
2021-07-31 375.5
2021-08-31 182.666
2021-09-30 107.0
I have tried the following:
df.groupby(pd.Grouper(key='created_time',freq='M')).apply(lambda x: x['shares_count'].sum()/len(x))
This works when there is only one column, but it is hard to extend to multiple columns.
df['created_time'] = pd.to_datetime(df['created_time'])
output = df.groupby(df['created_time'].dt.to_period('M')).mean().round(2).reset_index()
output
###
created_time shares_count
0 2021-07 375.50
1 2021-08 182.67
2 2021-09 107.00
Use this code:
df=df.groupby(pd.Grouper(key='created_time',freq='M')).agg({'shares_count':['sum', 'count']}).reset_index()
df['ss']=df[('shares_count','sum')]/df[('shares_count','count')]
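For reference, a minimal sketch of the same idea computed directly with mean(), which also keeps the month-end dates from the Grouper (assuming created_time is already a datetime column):
df_monthly_avg = (df.groupby(pd.Grouper(key='created_time', freq='M'))['shares_count']
                    .mean()
                    .reset_index())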
I have the following dataset of students taking multiple SAT exams:
import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({'student': 'A A A A A B B B C'.split(),
                   'exam_date': [datetime.datetime(2013, 4, 1), datetime.datetime(2013, 6, 1),
                                 datetime.datetime(2013, 7, 1), datetime.datetime(2013, 10, 2),
                                 datetime.datetime(2014, 1, 1), datetime.datetime(2013, 11, 2),
                                 datetime.datetime(2014, 2, 2), datetime.datetime(2014, 5, 2),
                                 datetime.datetime(2014, 5, 2)]})
print(df)
student exam_date
0 A 2013-04-01
1 A 2013-06-01
2 A 2013-07-01
3 A 2013-10-02
4 A 2014-01-01
5 B 2013-11-02
6 B 2014-02-02
7 B 2014-05-02
8 C 2014-05-02
I want to create a new column diff with the difference of two successive exam dates for each individual student, and then filter the value with a particular threshold, i.e. 75 days. If the student doesn't have two successive dates, we need to drop that student.
I am trying the following script to create the new column:
df['exam_date'] = df.groupby('student')['exam_date'].apply(lambda x: x.sort_values())
df['diff'] = df.groupby('student')['exam_date'].diff() / np.timedelta64(1, 'D')
print(df)
student exam_date diff
0 A 2013-04-01 NaN
1 A 2013-06-01 61.0
2 A 2013-07-01 30.0
3 A 2013-10-02 93.0
4 A 2014-01-01 91.0
5 B 2013-11-02 NaN
6 B 2014-02-02 92.0
7 B 2014-05-02 89.0
8 C 2014-05-02 NaN
Then I'm using query to filter the value and get the output:
df_new = df.query('diff <= 75')
print(df_new)
student exam_date diff
1 A 2013-06-01 61.0
2 A 2013-07-01 30.0
This is correctly selecting the student A and removing the students B and C. However, I'm missing the earliest date for the student A.
Though by using df[df['student'].isin(studentList)] I'm getting the desired result, it's too much work.
Is there any better way of getting the desired output, maybe using diff() and le()? Any suggestions would be appreciated. Thanks!
What you want is to filter students, but you are filtering exam records.
After you get df_new, just find the set of students and use it to select from df:
df[df.student.isin(df_new.student.unique())]
and you'll get:
student exam_date diff
0 A 2013-04-01 NaN
1 A 2013-06-01 61.0
2 A 2013-07-01 30.0
3 A 2013-10-02 93.0
4 A 2014-01-01 91.0
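An alternative one-step sketch using groupby().filter() together with diff() and le(), assuming exam_date is already sorted within each student (as in the sample):
df['diff'] = df.groupby('student')['exam_date'].diff().dt.days
df_new = df.groupby('student').filter(lambda g: g['diff'].le(75).any())
This keeps every exam record of any student who has at least one gap of 75 days or less, and drops students B and C entirely.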
I'm using code to create a cohort analysis based on customer retention. As the code stands, the analysis is based on days, but I need the output based on months. In the output attached below, you will see that the index is the day, when it should be the month. The 1, 2, 3, 4 columns should each represent one day of a month, so I need the output columns to run from 1 to 31.
.csv file sample:
timestamp user_hash
0 2019-02-01 NaN
1 2019-02-01 e044a52188dbccd71428
2 2019-02-01 C0D1A22B-9ECB-4DEF
3 2019-02-01 d7260762b66295fbf9e5
4 2019-02-01 d7260762b66295fbf9e5
Actual output sample:
CohortIndex 1 2 3 4
CohortMonth
2019-02-01 399.0 202.0 160.0 117.0
2019-02-02 215.0 109.0 89.0 61.0
2019-02-03 146.0 79.0 62.0 50.0
2019-02-04 175.0 67.0 50.0 32.0
2019-02-05 179.0 52.0 39.0 32.0
2019-02-06 137.0 31.0 29.0 16.0
2019-02-07 139.0 42.0 33.0 25.0
2019-02-08 143.0 35.0 32.0 24.0
2019-02-09 105.0 31.0 23.0 12.0
The code used is the following:
import pandas as pd
import datetime as dt
df_events = pd.read_csv('.../events.csv')
#convert object column to datetime and remove the time from the column
df_events['timestamp'] = pd.to_datetime(df_events['timestamp'].str.strip(), format='%d/%m/%y %H:%M' )
df_events['timestamp'] = df_events['timestamp'].dt.date
#drop NaN from user_hash column
clean_data = df_events[df_events['user_hash'].notna()]
#function to check if we have NaN where we shouldn't
def missing_data(x):
return x.isna().sum()
#uses the datetime function to get the month from a datetime stamp and strip the time
def get_month(x):
return dt.datetime(x.year,x.month,1)#year, month, increment of day
#create a new column
clean_data['LoginMonth'] = clean_data['timestamp'].apply(get_month)
#create new columns called CohortMonth and groupby information
clean_data['CohortMonth'] = clean_data.groupby('user_hash')['timestamp'].transform('min')
#create the cohort
def get_date(df,column):
year = df[column].dt.year
month = df[column].dt.month
day = df[column].dt.day
return year, month, day
#create two variables, one for the year and one for the month. As the function returns three values, we use _ to discard the day
login_year,login_month, _ = get_date(clean_data,'LoginMonth')
clean_data['CohortMonth'] = pd.to_datetime(clean_data['CohortMonth'])
cohort_year,cohort_month, _ = get_date(clean_data,'CohortMonth')
year_diff = login_year - cohort_year
month_diff = login_month - cohort_month
clean_data['CohortIndex'] = year_diff*12 + month_diff +1
#create cohort analysis data table
cohort_data = clean_data.groupby(['CohortMonth','CohortIndex'])['user_hash'].apply(pd.Series.nunique).reset_index()
cohort_count = cohort_data.pivot_table(index='CohortMonth',
columns='CohortIndex',
values='user_hash')
Thanks!
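One possible partial adjustment (a sketch that only addresses the monthly index, not the 1-31 day columns): truncating CohortMonth to the first day of its month with the get_month helper defined above makes the pivot_table index one row per month instead of one row per day.
clean_data['CohortMonth'] = pd.to_datetime(clean_data['CohortMonth']).apply(get_month)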
I am doing a classification problem in which I am trying to predict whether a car will be refuelled the following day.
The data consists of a date, an ID for every car, and the distance to the destination.
What I want is a variable that is lagged by 3 days, not by 3 rows per car_ID, since not every car_ID is present on every day. Therefore, the lag should be based on the date and not on the rows.
If there are fewer than 3 days of history, the result should be -1.
Currently, I have this piece of code, which is intended to lag every row by 3 days:
data['distance_to_destination'].groupby(data['car_ID']).shift(3).tolist()
But this only lags by the number of rows, not by the number of days.
What I want to achieve is the column "lag_dtd_3":
date car_ID distance_to_destination lag_dtd_3
01/01/2019 1 100 -1
01/01/2019 2 200 -1
02/01/2019 1 80 -1
02/01/2019 2 170 -1
02/01/2019 3 500 -1
03/01/2019 2 120 -1
05/01/2019 1 25 80
05/01/2019 2 75 170
06/01/2019 1 20 -1
06/01/2019 2 30 120
06/01/2019 3 120 -1
One solution to lag the information by 3 days is to shift the index instead of shifting the rows.
pivot = data.pivot(columns='car_ID')                    # date index, one (value, car_ID) column per car
shifted = pivot.copy()
shifted.index = shifted.index + pd.DateOffset(days=3)   # here I lag the index instead of shifting the values
shifted.columns = shifted.columns.set_levels(['lag_dtd_3'], level=0)   # rename the value level
output = pd.concat([pivot, shifted], axis=1).stack('car_ID').reset_index('car_ID')
output['lag_dtd_3'] = output['lag_dtd_3'].fillna(-1)    # no record 3 days earlier -> -1
output = output.dropna()                                # drop dates where the car has no current record
Output:
car_ID distance_to_destination lag_dtd_3
date
2019-01-01 1 100.0 -1.0
2019-01-01 2 200.0 -1.0
2019-01-02 1 80.0 -1.0
2019-01-02 2 170.0 -1.0
2019-01-02 3 500.0 -1.0
2019-01-03 2 120.0 -1.0
2019-01-05 1 25.0 80.0
2019-01-05 2 75.0 170.0
2019-01-06 1 20.0 -1.0
2019-01-06 2 30.0 120.0
2019-01-06 3 120.0 -1.0
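An alternative sketch that lags by calendar date via a self-merge on (car_ID, date), assuming date is a datetime64 column (not the index) and the column names match the sample above:
lagged = data[['date', 'car_ID', 'distance_to_destination']].copy()
lagged['date'] = lagged['date'] + pd.Timedelta(days=3)                 # make each value visible 3 days later
lagged = lagged.rename(columns={'distance_to_destination': 'lag_dtd_3'})
output = data.merge(lagged, on=['date', 'car_ID'], how='left')
output['lag_dtd_3'] = output['lag_dtd_3'].fillna(-1)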
This is best explained through an example.
I have the following dataframe (each row can be thought of as a transaction):
DATE AMOUNT
2017-01-29 10
2017-01-30 20
2017-01-31 30
2017-02-01 40
2017-02-02 50
2017-02-03 60
I would like to compute a 2-day rolling sum but only for rows in February.
Code snippet I have currently:
df.set_index('DATE',inplace=True)
res=df.rolling('2d')['AMOUNT'].sum()
which gives:
AMOUNT
2017-01-29 10
2017-01-30 30
2017-01-31 50
2017-02-01 70
2017-02-02 90
2017-02-03 110
but I really only need the output for the last 3 rows; the operations on the first 3 rows are unnecessary. When the dataframe is huge, this wastes a lot of computation. How do I compute the rolling sum only for the last 3 rows (other than computing the rolling sum for all rows and then filtering afterwards)?
*I cannot simply pre-filter the dataframe to February either, because the January 'look-back' period would be lost and the first rolling sums would be wrong.
You can use timedelta to filter your df and keep the last day of January.
import datetime
dateStart = datetime.date(2017, 2, 1) - datetime.timedelta(days=1)
dateEnd = datetime.date(2017, 2, 3)
df.loc[dateStart:dateEnd]
Then you can do your rolling operation and drop the first line (which is 2017-01-31):
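A sketch of the full flow described above, assuming DATE is the (sorted) DatetimeIndex:
window = df.loc['2017-01-31':'2017-02-03']             # keep one look-back day before February
res = window.rolling('2d')['AMOUNT'].sum().iloc[1:]    # drop the 2017-01-31 helper row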
You can compute the rolling sum only for the last rows by using tail(4):
res = df.tail(4).rolling('2d')['AMOUNT'].sum()
Output:
DATE
2017-01-31 NaN
2017-02-01 70.0
2017-02-02 90.0
2017-02-03 110.0
Name: AMOUNT, dtype: float64
If you want to merge those values back, excluding 2017-01-31, you can do the following:
df.loc[res.index[1:], 'AMOUNT'] = res.tail(3)
Output:
AMOUNT
DATE
2017-01-29 10.0
2017-01-30 20.0
2017-01-31 30.0
2017-02-01 70.0
2017-02-02 90.0
2017-02-03 110.0
I have a large dataset I'm trying to manipulate for further analysis. Below is what the relevant parts of the dataframe would look like.
Loan Closing Balance Date
1 175,000 2010-10-31
1 150,000 2010-11-30
1 125,000 2010-12-31
2 275,000 2010-10-31
2 250,000 2010-11-30
2 225,000 2010-12-31
3 375,000 2010-10-31
3 350,000 2010-11-30
3 320,000 2010-12-31
I would like to create a new column called Opening Balance which is basically the Closing Balance for the previous month's month end, so for the second row, Opening Balance would just be equal to 175,000, which is the Closing Balance for the first row.
As the dataset starts at 2010-10-31, I won't be able to look up a balance for 2010-09-30, so for any row with a date of 2010-10-31, I want to make the Opening Balance for that observation equal to the Closing Balance.
Here's what it should look like:
Loan Closing Balance Date Opening Balance
1 175,000 2010-10-31 175,000
1 150,000 2010-11-30 175,000
1 125,000 2010-12-31 150,000
2 275,000 2010-10-31 275,000
2 250,000 2010-11-30 275,000
2 225,000 2010-12-31 250,000
3 375,000 2010-10-31 375,000
3 350,000 2010-11-30 375,000
3 320,000 2010-12-31 350,000
In Excel I would normally do a compound INDEX MATCH with an EOMONTH function thrown in, but I'm not quite sure how to do this in Python (still very new to it).
Any help appreciated.
Thanks, I've tried the approach suggested by Santhosh and end up getting the following:
Closing Balance_x Date_x Closing Balance_y
0 175000 2010-09-30 150000.0
1 175000 2010-09-30 250000.0
2 175000 2010-09-30 350000.0
3 150000 2010-10-31 125000.0
4 150000 2010-10-31 225000.0
5 150000 2010-10-31 320000.0
6 125000 2010-11-30 NaN
7 275000 2010-09-30 150000.0
8 275000 2010-09-30 250000.0
9 275000 2010-09-30 350000.0
10 250000 2010-10-31 125000.0
11 250000 2010-10-31 225000.0
12 250000 2010-10-31 320000.0
13 225000 2010-11-30 NaN
14 375000 2010-09-30 150000.0
15 375000 2010-09-30 250000.0
16 375000 2010-09-30 350000.0
17 350000 2010-10-31 125000.0
18 350000 2010-10-31 225000.0
19 350000 2010-10-31 320000.0
20 320000 2010-11-30 NaN
I then amended that code to do a merge based on Date/pDate:
final_df = pd.merge(df, df, how="left", left_on=['Date'], right_on=['pDate'])
Loan Closing Balance_x Date_x Opening Balance
0 1 175000 2010-09-30 150000.0
1 1 150000 2010-10-31 125000.0
2 1 125000 2010-11-30 NaN
3 2 275000 2010-09-30 250000.0
4 2 250000 2010-10-31 225000.0
5 2 225000 2010-11-30 NaN
6 3 375000 2010-09-30 350000.0
7 3 350000 2010-10-31 320000.0
8 3 320000 2010-11-30 NaN
Now in this case I'm not sure why I get NaN on every November observation. The Opening Balance for Loan 1 in November should be 150,000. The October Opening Balance should be 175,000. And the September Opening Balance should just default to the September Closing Balance, since I do not have an August Closing Balance to refer to.
Update
I think I resolved the issue; I changed the merge code to:
final_df = pd.merge(df, df, how="left", left_on=['Loan','pDate'], right_on=['Loan','Date'])
This still gets me NaN for September observations but that is fine as I can do a manual replace of those values.
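For reference, a sketch of that replacement done in one step instead of manually (assuming the default _x/_y merge suffixes, so Closing Balance_y holds the matched previous month's closing balance):
final_df['Opening Balance'] = final_df['Closing Balance_y'].fillna(final_df['Closing Balance_x'])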
I suggest you add another column that holds Date minus one month and then join on the date fields to get the opening balance.
df["cmonth"] = df.Date.apply(lambda x: x.year*100+x.month)
df["pDate"] = df.Date.apply(lambda x: (x - pd.DateOffset(months=1)))
df["pmonth"] = df.pDate.apply(lambda x: x.year*100+x.month)
final_df = pd.merge(df, df, how="left", left_on="cmonth", right_on="pmonth")
print(final_df[["close_x", "Date_x", "close_y"]])
#close_y is your opening balance
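A shorter alternative sketch that avoids the self-merge entirely, assuming the frame is sorted by Loan and Date and that Closing Balance is numeric:
df = df.sort_values(['Loan', 'Date'])
df['Opening Balance'] = (df.groupby('Loan')['Closing Balance']
                           .shift(1)                        # previous month-end's closing balance per loan
                           .fillna(df['Closing Balance']))  # first month defaults to its own closing balance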