I have the following dataframe:
date money
0 2018-01-01 20
1 2018-01-05 30
2 2018-02-15 7
3 2019-03-17 150
4 2018-01-05 15
...
2530 2019-03-17 350
And I need:
[(2018-01-01,20),(2018-01-05,65),(2018-02-15,72),...,(2019-03-17,572)]
So I need to do a cumulative sum of money over all days.
So far I have tried many things, and the closest I think I've got is:
graph_df.date = pd.to_datetime(graph_df.date)
temporary = graph_df.groupby('date').money.sum()
temporary = temporary.groupby(temporary.index.to_period('date')).cumsum().reset_index()
But this gives me ValueError: Invalid frequency: date
Could anyone help please?
Thanks
I don't think you need the second groupby. You can simply add a column with the cumulative sum.
This does the trick for me:
import pandas as pd
df = pd.DataFrame({'date': ['01-01-2019','04-06-2019', '07-06-2019'], 'money': [12,15,19]})
df['date'] = pd.to_datetime(df['date']) # this is not strictly needed
tmp = df.groupby('date')['money'].sum().reset_index()
tmp['money_sum'] = tmp['money'].cumsum()
Converting the date column to an actual date is not needed for this to work.
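If the final goal is the list of (date, cumulative sum) tuples shown in the question, one option (just a sketch, reusing the tmp frame built above) is:
result = list(tmp[['date', 'money_sum']].itertuples(index=False, name=None))  # plain (date, running total) tuples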
list(map(tuple, df.groupby('date', as_index=False)['money'].sum().values))
Edit:
df = pd.DataFrame({'date': ['2018-01-01', '2018-01-05', '2018-02-15', '2019-03-17', '2018-01-05'],
'money': [20, 30, 7, 150, 15]})
#df['date'] = pd.to_datetime(df['date'])
#df = df.sort_values(by='date')
temporary = df.groupby('date', as_index=False)['money'].sum()
temporary['money_cum'] = temporary['money'].cumsum()
Result:
>>> list(map(tuple, temporary[['date', 'money_cum']].values))
[('2018-01-01', 20),
('2018-01-05', 65),
('2018-02-15', 72),
('2019-03-17', 222)]
You can try combining df['money'].cumsum() with df.groupby('date').tail(1):
Example data frame:
df
date money
0 01/01/2018 20
1 05/01/2018 30
2 15/02/2018 7
3 17/03/2019 150
4 05/01/2018 15
5 17/03/2019 550
6 15/02/2018 13
df['cumsum'] = df.money.cumsum()
list(zip(df.groupby('date').tail(1)['date'], df.groupby('date').tail(1)['cumsum']))
[('01/01/2018', 20),
('05/01/2018', 222),
('17/03/2019', 772),
('15/02/2018', 785)]
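Note that cumsum() follows the current row order, so with day-first date strings like these the result above is not in chronological order. A small sketch (assuming the same df) that parses and sorts the dates first, after which the same zip over groupby('date').tail(1) yields a chronologically ordered running total:
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df = df.sort_values('date').reset_index(drop=True)  # chronological row order
df['cumsum'] = df.money.cumsum()                    # running total now follows the calendar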
This is what my data looks like:
month total_mobile_subscription
0 1997-01 414000
1 1997-02 423000
2 1997-03 431000
3 1997-04 479000
4 1997-05 510000
.. ... ...
279 2020-04 9056300
280 2020-05 8928800
281 2020-06 8860000
282 2020-07 8768500
283 2020-08 8659000
[284 rows x 2 columns]
Basically, I'm trying to change this into a dataset grouped by year, with the value being the mean of the total mobile subscriptions for each year.
I am not sure what to do as I am still learning this.
import pandas as pd

df = pd.DataFrame({
    'year': ['1997-01', '1997-02', '1997-03', '1998-01', '1998-02', '1998-03'],
    'sale': [500, 1000, 1500, 2000, 1000, 400]
})
# strip the zero-padded month suffix ('-01' .. '-12') so only the year remains
for a in range(1, 13):
    x = '-' + str(a).zfill(2)
    df['year'] = df['year'].str.replace(x, '', regex=False)
df2 = df.groupby('year')['sale'].mean()
print(df2)
Convert the values of month to datetimes and aggregate by year:
y = pd.to_datetime(df['month']).dt.year.rename('year')
df1 = df.groupby(y)['total_mobile_subscription'].mean().reset_index()
Or convert the column in place first:
df['month'] = pd.to_datetime(df['month'])
df1 = (df.groupby(df['month'].dt.year.rename('year'))['total_mobile_subscription']
.mean().reset_index())
Or aggregate by the first 4 characters of the month column:
df2 = (df.groupby(df['month'].str[:4].rename('year'))['total_mobile_subscription']
.mean().reset_index())
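A quick self-contained check of the last variant, with made-up numbers purely for illustration:
import pandas as pd

df = pd.DataFrame({'month': ['1997-01', '1997-02', '1998-01', '1998-02'],
                   'total_mobile_subscription': [414000, 423000, 500000, 520000]})
df2 = (df.groupby(df['month'].str[:4].rename('year'))['total_mobile_subscription']
         .mean().reset_index())
print(df2)
#    year  total_mobile_subscription
# 0  1997                   418500.0
# 1  1998                   510000.0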
The Excel function SUMIFS supports calculation based on multiple criteria, including date inequalities, with the following argument pattern:
values_to_sum, criteria_range_1, condition_1, ..., criteria_range_n, condition_n
Example
Input - tips per person per day, multiple entries per person per day allowed
date person tip
02/03/2022 X 10
05/03/2022 X 30
05/03/2022 Y 20
08/03/2022 X 12
08/03/2022 X 8
Output - sum per selected person per day
date X_sum_per_day
01/03/2022 0
02/03/2022 10
03/03/2022 0
04/03/2022 0
05/03/2022 30
06/03/2022 0
07/03/2022 0
08/03/2022 20
09/03/2022 0
10/03/2022 0
Can this be implemented in pandas and calculated as a series for an input range of days? The cumulative version would presumably just be an application of cumsum(), but the initial sum based on multiple criteria is tricky, especially if it is to stay concise.
Code
import pandas as pd

df = pd.DataFrame({'date': ['02-03-2022 00:00:00',
                            '05-03-2022 00:00:00',
                            '05-03-2022 00:00:00',
                            '08-03-2022 00:00:00',
                            '08-03-2022 00:00:00'],
                   'person': ['X', 'X', 'Y', 'X', 'X'],
                   'tip': [10, 30, 20, 12, 8]},
                  index=[0, 1, 2, 3, 4])
# parse the strings so they line up with the datetimes produced by date_range below
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df2 = pd.DataFrame({'date': pd.date_range(start='2022-03-01', end='2022-03-10')})
temp = df[df['person'] == 'X'].groupby('date')['tip'].sum().reset_index()
df2['X_sum'] = df2['date'].map(temp.set_index('date')['tip']).fillna(0)
The above seems kinda hacky and not as simple to reason about as Excel SUMIFS. Additional conditions would also be a hassle (e.g. sum where country = X, company = Y, person = Z).
Any idea for alternative implementation?
IIUC, you want to filter on person X, then group by day and sum the tips, and finally reindex to fill in the missing days:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
out = (df[df['person'].eq('X')]
         .groupby('date')['tip'].sum()
         .reindex(pd.date_range(start='2022-03-01', end='2022-03-10'),
                  fill_value=0)
         .reset_index()
       )
output:
index tip
0 2022-03-01 0
1 2022-03-02 10
2 2022-03-03 0
3 2022-03-04 0
4 2022-03-05 30
5 2022-03-06 0
6 2022-03-07 0
7 2022-03-08 20
8 2022-03-09 0
9 2022-03-10 0
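For additional SUMIFS-style criteria, the same pattern extends by combining boolean masks with & before the groupby. A sketch only: the country and company columns are hypothetical, mirroring the "country = X, company = Y, person = Z" case from the question.
# 'country' and 'company' are hypothetical columns, not present in the sample df above
mask = df['person'].eq('X') & df['country'].eq('UK') & df['company'].eq('Acme')
out = (df[mask]
       .groupby('date')['tip'].sum()
       .reindex(pd.date_range('2022-03-01', '2022-03-10'), fill_value=0)
       .reset_index())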
So I'm fairly new to pandas and I've run into a problem that I'm not able to fix.
I have the following dataframe:
import pandas as pd
df = pd.DataFrame({
'Day': ['2018-12-31', '2019-01-07'],
'Product_Finished': [1000, 2000],
'Product_Tested': [50, 10]})
df['Day'] = pd.to_datetime(df['Day'], format='%Y-%m-%d')
df
I would like to add rows to my dataframe based on the column 'Day', ideally adding all the other days of the week while keeping the rest of the columns at the same value. The output should look something like this:
Day Product_Finished Product_Tested
0 2018-12-31 1000 50
1 2019-01-01 1000 50
2 2019-01-02 1000 50
3 2019-01-03 1000 50
4 2019-01-04 1000 50
5 2019-01-05 1000 50
6 2019-01-06 1000 50
7 2019-01-07 2000 10
8 2019-01-08 2000 10
9 2019-01-09 2000 10
10 2019-01-10 2000 10
11 2019-01-11 2000 10
12 2019-01-12 2000 10
13 2019-01-13 2000 10
Any tips would be greatly appreciated, thank you in advance!
You can achieve this by first creating a new DataFrame that contains the desired date range using pandas.date_range.
Then use pandas.merge_asof, which for each date in that range picks the last row of the original frame whose Day is on or before it (direction='backward', the default).
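A minimal sketch of that approach, assuming the df from the question (with Day already converted to datetime):
# full daily range covering both weeks
days = pd.DataFrame({'Day': pd.date_range('2018-12-31', '2019-01-13', freq='D')})
# both frames must be sorted on the merge key; each day gets the last row at or before it
out = pd.merge_asof(days, df.sort_values('Day'), on='Day', direction='backward')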
You can resample by reindexing to a daily date range and forward-filling:
import datetime
import pandas as pd
df = pd.DataFrame({
'Day': ['2018-12-31', '2019-01-07'],
'Product_Finished': [1000, 2000],
'Product_Tested': [50, 10]})
df['Day'] = pd.to_datetime(df['Day'], format='%Y-%m-%d')
df.set_index('Day',inplace=True)
df_Date=pd.date_range(start=df.index.min(), end=(df.index.max()+ datetime.timedelta(days=6)), freq='D')
df=df.reindex(df_Date,method='ffill',fill_value=None)
df.reset_index(inplace=True)
I have a dataframe that is being read from database records and looks like this:
date added total
2020-09-14 5 5
2020-09-15 4 9
2020-09-16 2 11
I need to be able to resample by different periods and this is what I am using:
df = pd.DataFrame.from_records(raw_data, index='date')
df.index = pd.to_datetime(df.index)
# let's say I want yearly sample, then I would do
df = df.fillna(0).resample('Y').sum()
This almost works, but it is summing the total column as well, which I don't want. I need the total column to keep the value it has on the last sampled date within each period, like this:
# What I need
date added total
2020 11 11
# What I'm getting
date added total
2020 11 25
You can do this by aggregating different columns differently. Here you want the sum() aggregator for the added column, but max() for total (since total is a running count, its maximum within a period equals its value on the last date).
df = pd.DataFrame({'date':[20200914, 20200915, 20200916, 20210101, 20210102],
'added':[5, 4, 2, 1, 6],
'total':[5, 9, 11, 1, 7]})
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df_res = df.resample('Y', on='date').agg({'added':'sum', 'total':'max'})
And the result is:
df_res
added total
date
2020-12-31 11 11
2021-12-31 7 7
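If total could ever dip within a period, 'last' is the more literal aggregator than 'max' (a sketch, assuming the rows are already in date order):
df_res = df.resample('Y', on='date').agg({'added': 'sum', 'total': 'last'})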
My dataset has dates in the European format, and I'm struggling to convert them into the correct format before I pass them through pd.to_datetime, so for all days below 12 the month and day get switched.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force to_datetime to acknowledge that the input is formatted as dd/mm/yyyy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Add an explicit format:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
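A quick check on strings shaped like the sample above (reconstructed here just for illustration):
s = pd.Series(['31/03/2018', '30/04/2018', '28/02/2018'])
pd.to_datetime(s, format='%d/%m/%Y')
# 0   2018-03-31
# 1   2018-04-30
# 2   2018-02-28
# dtype: datetime64[ns]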
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
{'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30