Apologies for the title; I am not sure of the right way to describe my question in a single sentence. I have a dataframe which looks like this:
date numbers
1/1/2019 5
2/1/2019 3
3/1/2019 6
4/1/2019 3
5/1/2019 1
6/1/2019 4
I want to aggregate over specified intervals, with overlapping windows.
The final dataframe should look like this:
for n = 2
date numbers
2/1/2019 8 (sum of 1/1/2019 and 2/1/2019)
3/1/2019 9 (sum of 2/1/2019 and 3/1/2019)
4/1/2019 9 (sum of 3/1/2019 and 4/1/2019)
5/1/2019 4
6/1/2019 5
What I tried is the approach from the link Take the sum of every N rows in a pandas series.
The problem is that the sums there are over disjoint pairs, (1/1/2019, 2/1/2019), (3/1/2019, 4/1/2019), (5/1/2019, 6/1/2019), so the windows do not overlap.
Please advise.
We can use rolling:
df.set_index('date').rolling(2).sum()
numbers
date
2019-01-01 NaN
2019-02-01 8.0
2019-03-01 9.0
2019-04-01 9.0
2019-05-01 4.0
2019-06-01 5.0
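Note that rolling(2) leaves NaN in the first row, because that window contains only one observation. To match the expected output exactly you can append .dropna(); alternatively, rolling(2, min_periods=1) keeps a partial sum for the first date:
df.set_index('date').rolling(2).sum().dropna()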
I have the following dataframe:
created_time shares_count
2021-07-01 250.0
2021-07-31 501.0
2021-08-02 48.0
2021-08-05 300.0
2021-08-07 200.0
2021-09-06 28.0
2021-09-08 100.0
2021-09-25 100.0
2021-09-30 200.0
I did the monthly grouping like this:
df_groupby_monthly = df.groupby(pd.Grouper(key='created_time',freq='M')).sum()
df_groupby_monthly
Now, how do I get the average of these shares_count values by dividing each monthly sum by the number of rows in that month?
For example: the 07th month has 2 rows, so the average should be 751.0/2 = 375.5; the 08th month has 3 rows, so 548.0/3 = 182.666; and the 09th month has 4 rows, so 428.0/4 = 107.0.
How do I get a final output like this?
created_time shares_count
2021-07-31 375.5
2021-08-31 182.666
2021-09-30 107.00
I have tried the following:
df.groupby(pd.Grouper(key='created_time',freq='M')).apply(lambda x: x['shares_count'].sum()/len(x))
This works when there is only one column, but it is hard to extend to multiple columns.
df['created_time'] = pd.to_datetime(df['created_time'])
output = df.groupby(df['created_time'].dt.to_period('M')).mean().round(2).reset_index()
output
created_time shares_count
0 2021-07 375.50
1 2021-08 182.67
2 2021-09 107.00
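For reference, an equivalent one-liner (not from the answer above), assuming created_time has already been converted with pd.to_datetime: grouping with pd.Grouper as in the question returns the month-end dates (2021-07-31, ...) the question asks for:
df.groupby(pd.Grouper(key='created_time', freq='M'))['shares_count'].mean()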
Use this code:
df = df.groupby(pd.Grouper(key='created_time', freq='M')).agg({'shares_count': ['sum', 'count']}).reset_index()
df['ss'] = df[('shares_count', 'sum')] / df[('shares_count', 'count')]
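A variant sketch, assuming pandas 0.25 or later: named aggregation avoids the MultiIndex columns altogether (total, n, and mean are illustrative names, not from the original answer):
out = df.groupby(pd.Grouper(key='created_time', freq='M')).agg(
    total=('shares_count', 'sum'),
    n=('shares_count', 'count'),
)
out['mean'] = out['total'] / out['n']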
I have the following dataset of students taking multiple SAT exams:
import datetime
import pandas as pd

df = pd.DataFrame({'student': 'A A A A A B B B C'.split(),
                   'exam_date': [datetime.datetime(2013, 4, 1), datetime.datetime(2013, 6, 1),
                                 datetime.datetime(2013, 7, 1), datetime.datetime(2013, 10, 2),
                                 datetime.datetime(2014, 1, 1), datetime.datetime(2013, 11, 2),
                                 datetime.datetime(2014, 2, 2), datetime.datetime(2014, 5, 2),
                                 datetime.datetime(2014, 5, 2)]})
print(df)
student exam_date
0 A 2013-04-01
1 A 2013-06-01
2 A 2013-07-01
3 A 2013-10-02
4 A 2014-01-01
5 B 2013-11-02
6 B 2014-02-02
7 B 2014-05-02
8 C 2014-05-02
I want to create a new column diff with the difference between successive exam dates for each individual student, and then filter the values against a particular threshold, e.g. 75 days. If a student doesn't have two successive dates, we need to drop that student.
I am trying the following script to create the new column:
import numpy as np

df['exam_date'] = df.groupby('student')['exam_date'].apply(lambda x: x.sort_values())
df['diff'] = df.groupby('student')['exam_date'].diff() / np.timedelta64(1, 'D')
print(df)
student exam_date diff
0 A 2013-04-01 NaN
1 A 2013-06-01 61.0
2 A 2013-07-01 30.0
3 A 2013-10-02 93.0
4 A 2014-01-01 91.0
5 B 2013-11-02 NaN
6 B 2014-02-02 92.0
7 B 2014-05-02 89.0
8 C 2014-05-02 NaN
Then I'm using query to filter on the values and get the output:
df_new = df.query('diff <= 75')
print(df_new)
student exam_date diff
1 A 2013-06-01 61.0
2 A 2013-07-01 30.0
This correctly selects student A and removes students B and C. However, I'm missing the earliest date for student A.
Using df[df['student'].isin(studentList)] I get the desired result, but it's too much work.
Is there any better way of getting the desired output, maybe using diff() and le()? Any suggestions would be appreciated. Thanks!
What you want is to filter students, but you are filtering exam records.
After you get df_new, just find the set of students and use it to select from df:
df[df.student.isin(df_new.student.unique())]
and you'll get:
student exam_date diff
0 A 2013-04-01 NaN
1 A 2013-06-01 61.0
2 A 2013-07-01 30.0
3 A 2013-10-02 93.0
4 A 2014-01-01 91.0
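An alternative sketch that skips the intermediate df_new, assuming exam_date is a datetime column: groupby().filter() keeps every row of each student who has at least one gap of 75 days or less:
out = df.sort_values(['student', 'exam_date']).groupby('student').filter(
    lambda g: g['exam_date'].diff().dt.days.le(75).any()  # NaN compares False, so single-exam students drop out
)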
Let's say I have data like this:
customer_id Order_date
1 2015-01-16
1 2015-01-19
2 2014-12-21
2 2015-01-10
1 2015-01-10
3 2018-01-18
3 2017-03-04
4 2019-11-05
4 2010-01-01
3 2019-02-03
Basically, for an e-commerce firm, some people buy regularly, some buy once a year, some buy once a month, and so on. I need to find the difference in days between consecutive transactions for each customer.
The result will be a list of varying length, since some people have transacted a thousand times, some once, some ten times, etc. Any ideas on how to achieve this?
Output needed:
customer_id Order_date_Difference_in_days
1 6,3 # diff between the first two dates, 2015-01-10 and 2015-01-16,
 # is 6 days; diff between the next two consecutive dates,
 # 2015-01-16 and 2015-01-19, is 3 days
2 20
3 320,381
4 3595
Basically, these are the differences between consecutive dates, after first sorting the dates for each customer id.
You can also use the following for the desired output:
m = (df.assign(Diff=df.sort_values(['customer_id', 'Order_date'])
               .groupby('customer_id')['Order_date'].diff().dt.days)
       .dropna())
m = m.assign(Diff=m['Diff'].astype(str)).groupby('customer_id')['Diff'].agg(','.join)
customer_id
1 6.0,3.0
2 20.0
3 320.0,381.0
4 3595.0
Name: Diff, dtype: object
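If you want the output to match the question's 6,3 format exactly (no trailing .0), a small tweak, assuming the day counts are whole numbers: cast to int before converting to string when building the joined column:
m = m.assign(Diff=m['Diff'].astype(int).astype(str)).groupby('customer_id')['Diff'].agg(','.join)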
First, ensure the dates are proper datetimes, then sort the data by customer id and order date:
import numpy as np
import pandas as pd

df['Order_date'] = pd.to_datetime(df['Order_date'])
df.sort_values(['customer_id', 'Order_date'], inplace=True)
df["days"] = df.groupby("customer_id")["Order_date"].apply(
    lambda x: (x - x.shift()) / np.timedelta64(1, "D")
)
print(df)
customer_id Order_date days
4 1 2015-01-10 NaN
0 1 2015-01-16 6.0
1 1 2015-01-19 3.0
2 2 2014-12-21 NaN
3 2 2015-01-10 20.0
6 3 2017-03-04 NaN
5 3 2018-01-18 320.0
9 3 2019-02-03 381.0
8 4 2010-01-01 NaN
7 4 2019-11-05 3595.0
Then you can do a simple agg, but you'll need to convert the values into strings:
df.dropna().groupby("customer_id")["days"].agg(
lambda x: ",".join(x.astype(str))
).to_frame()
days
customer_id
1 6.0,3.0
2 20.0
3 320.0,381.0
4 3595.0
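As a side note (a simplification, not part of the answer above): since Order_date is a datetime column, the shift-and-divide step can be written more directly with diff():
df['days'] = df.groupby('customer_id')['Order_date'].diff().dt.days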
This is my dataframe:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
# the date field holds datetime.datetime values
account_id amount
date
2018-01-01 1 100.0
2018-01-01 1 50.0
2018-06-01 1 200.0
2018-07-01 2 100.0
2018-10-01 2 200.0
Problem description
How can I "pad" my dataframe with leading and trailing "empty" dates? I have tried to reindex on a date_range and a period_range, and I have tried to merge another index. I have tried all sorts of things all day, and I have read a lot of the docs.
I have a simple dataframe with columns transaction_date, transaction_amount, and transaction_account. I want to group this dataframe so that it is grouped by account at the first level, and then by year, and then by month. Then I want a column for each month, with the sum of that month's transaction amount value.
This seems like it should be something that is easy to do.
Expected Output
This is the closest I have gotten:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
df = df.groupby(['account_id', df.index.year, df.index.month])
df = df.resample('M').sum().fillna(0)
print(df)
account_id amount
account_id date date date
1 2018 1 2018-01-31 2 150.0
6 2018-06-30 1 200.0
2 2018 7 2018-07-31 2 100.0
10 2018-10-31 2 200.0
And this is what I want to achieve (basically reindex the data by date_range(start='2018-01-01', periods=12, freq='M')).
(Ideally I would want the months transposed across the top as columns, per year.)
amount
account_id Year Month
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
....
12 200.0
2 2018 1 NaN
....
7 100.0
....
10 200.0
....
12 NaN
One way is to reindex:
s = df.groupby([df['account_id'], df.index.year, df.index.month]).sum()
idx = pd.MultiIndex.from_product([s.index.levels[0], s.index.levels[1], list(range(1, 13))])
s = s.reindex(idx)
s
Out[287]:
amount
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
2 2018 1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 100.0
8 NaN
9 NaN
10 200.0
11 NaN
12 NaN
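To get the layout the question describes as ideal (months across the top as columns), a follow-up sketch on the s built above: unstack the innermost index level:
s['amount'].unstack(level=-1)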
I have a large dataset I'm trying to manipulate for further analysis. Below is what the relevant part of the dataframe looks like:
Loan Closing Balance Date
1 175,000 2010-10-31
1 150,000 2010-11-30
1 125,000 2010-12-31
2 275,000 2010-10-31
2 250,000 2010-11-30
2 225,000 2010-12-31
3 375,000 2010-10-31
3 350,000 2010-11-30
3 320,000 2010-12-31
I would like to create a new column called Opening Balance, which is basically the Closing Balance for the previous month end; so for the second row, the Opening Balance would just be 175,000, the Closing Balance from the first row.
As the dataset starts at 2010-10-31, I won't be able to look up a balance for 2010-09-30, so for any row dated 2010-10-31 I want to set that row's Opening Balance equal to its Closing Balance.
Here's what it should look like:
Loan Closing Balance Date Opening Balance
1 175,000 2010-10-31 175,000
1 150,000 2010-11-30 175,000
1 125,000 2010-12-31 150,000
2 275,000 2010-10-31 275,000
2 250,000 2010-11-30 275,000
2 225,000 2010-12-31 250,000
3 375,000 2010-10-31 375,000
3 350,000 2010-11-30 375,000
3 320,000 2010-12-31 350,000
In Excel I would normally do a compound INDEX/MATCH with an EOMONTH function thrown in, but I am not quite sure how to do this in Python (still very new to it).
Any help appreciated.
I've tried the approach suggested by Santhosh and I end up getting the following:
Closing Balance_x Date_x Closing Balance_y
0 175000 2010-09-30 150000.0
1 175000 2010-09-30 250000.0
2 175000 2010-09-30 350000.0
3 150000 2010-10-31 125000.0
4 150000 2010-10-31 225000.0
5 150000 2010-10-31 320000.0
6 125000 2010-11-30 NaN
7 275000 2010-09-30 150000.0
8 275000 2010-09-30 250000.0
9 275000 2010-09-30 350000.0
10 250000 2010-10-31 125000.0
11 250000 2010-10-31 225000.0
12 250000 2010-10-31 320000.0
13 225000 2010-11-30 NaN
14 375000 2010-09-30 150000.0
15 375000 2010-09-30 250000.0
16 375000 2010-09-30 350000.0
17 350000 2010-10-31 125000.0
18 350000 2010-10-31 225000.0
19 350000 2010-10-31 320000.0
20 320000 2010-11-30 NaN
I then amended that code to do a merge based on the Loan ID and Date/pDate:
final_df = pd.merge(df, df, how="left", left_on=['Date'], right_on=['pDate'])
Loan Closing Balance_x Date_x Opening Balance
0 1 175000 2010-09-30 150000.0
1 1 150000 2010-10-31 125000.0
2 1 125000 2010-11-30 NaN
3 2 275000 2010-09-30 250000.0
4 2 250000 2010-10-31 225000.0
5 2 225000 2010-11-30 NaN
6 3 375000 2010-09-30 350000.0
7 3 350000 2010-10-31 320000.0
8 3 320000 2010-11-30 NaN
Now in this case I'm not sure why I get NaN for every November observation. The Opening Balance for Loan 1 in November should be 150,000, and the October Opening Balance should be 175,000. The September Opening Balance should just default to the September Closing Balance, since there is no August Closing Balance to refer to.
Update
I think I resolved the issue; I changed the merge code to:
final_df = pd.merge(df, df, how="left", left_on=['Loan','pDate'], right_on=['Loan','Date'])
This still gets me NaN for the September observations, but that is fine, as I can replace those values manually.
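That manual replace can be a simple fillna, assuming the default _x/_y merge suffixes, where Closing Balance_y is the opening balance and Closing Balance_x is the row's own closing balance:
final_df['Closing Balance_y'] = final_df['Closing Balance_y'].fillna(final_df['Closing Balance_x'])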
I suggest you add another column that holds Date minus one month, and then join on the date fields to get the opening balance:
df["cmonth"] = df.Date.apply(lambda x: x.year * 100 + x.month)
df["pDate"] = df.Date.apply(lambda x: x - pd.DateOffset(months=1))
df["pmonth"] = df.pDate.apply(lambda x: x.year * 100 + x.month)
final_df = pd.merge(df, df, how="left", left_on="cmonth", right_on="pmonth")
print(final_df[["Closing Balance_x", "Date_x", "Closing Balance_y"]])
# Closing Balance_y is your opening balance
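An alternative sketch that avoids the self-merge entirely (not the answer's method): a per-loan shift() pulls the previous row's Closing Balance, and fillna covers the first month, assuming the frame is sorted by Loan and Date and Closing Balance is numeric:
df = df.sort_values(['Loan', 'Date'])
df['Opening Balance'] = df.groupby('Loan')['Closing Balance'].shift()  # previous month's closing
df['Opening Balance'] = df['Opening Balance'].fillna(df['Closing Balance'])  # first month: opening = closing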