A similar question has been asked for cumsum and grouping but it didn't solve my case.
I have a financial balance sheet spanning many years and need to sum all values up to and including each year.
This is my reproducible set:
import pandas as pd

df = pd.DataFrame(
    {"Amount": [265.95, 2250.00, -260.00, -2255.95, 120],
     "Year": [2018, 2018, 2018, 2019, 2019]})
The result I want is the following:
Year Amount
2017 0
2018 2255.95
2019 120.00
2020 120.00
Effectively, this is a loop going from the lowest year in my whole set to the highest:
...
df[df.Year<=2017].Amount.sum()
df[df.Year<=2018].Amount.sum()
df[df.Year<=2019].Amount.sum()
df[df.Year<=2020].Amount.sum()
...
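For reference, here is a runnable version of that loop as a minimal sketch (the dict comprehension and the 2017-2020 bounds are only illustrative):
import pandas as pd

df = pd.DataFrame(
    {"Amount": [265.95, 2250.00, -260.00, -2255.95, 120],
     "Year": [2018, 2018, 2018, 2019, 2019]})

# one cumulative total per year; years with no rows sum an empty Series to 0.0
totals = {year: df[df.Year <= year].Amount.sum() for year in range(2017, 2021)}
print(totals)
# {2017: 0.0, 2018: 2255.95, 2019: 120.0, 2020: 120.0} (up to float rounding)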
First aggregate the sum per year, then apply Series.cumsum, then Series.reindex over all the years you need with forward filling of missing values, and finally replace the remaining leading missing values with 0:
years = range(2017, 2021)
df1 = (df.groupby('Year')['Amount']
         .sum()
         .cumsum()
         .reindex(years, method='ffill')
         .fillna(0)
         .reset_index())
print(df1)
Year Amount
0 2017 0.00
1 2018 2255.95
2 2019 120.00
3 2020 120.00
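To see why the forward fill and the fillna(0) are both needed, look at the intermediate cumulative series before the reindex (a quick check with the same df; values shown rounded):
print(df.groupby('Year')['Amount'].sum().cumsum())
# Year
# 2018    2255.95
# 2019     120.00
# Name: Amount, dtype: float64
reindex(years, method='ffill') carries the 2019 total forward into 2020, while the leading 2017 slot has nothing to fill from, becomes NaN, and is set to 0 by fillna.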
The data have reported values for January 2009 through January 2019. I need to compute the total number of passengers Passenger_Count per month. The dataframe should have 121 entries (10 years * 12 months, plus 1 for January 2019). The range should go from 2009 to 2019.
I have been doing:
df.groupby(['ReportPeriod'])['Passenger_Count'].sum()
But it doesn't give me the right result.
You can do:
df['ReportPeriod'] = pd.to_datetime(df['ReportPeriod'])
# group by year and month so each calendar month gets a single row
out = df.groupby(df['ReportPeriod'].dt.strftime('%Y-%m'))['Passenger_Count'].sum()
Try this:
df.index = pd.to_datetime(df["ReportPeriod"], format="%m/%d/%Y")
# bin the datetime index into month-end groups and sum the numeric columns
df = df.groupby(pd.Grouper(freq="M")).sum()
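An equivalent and arguably more idiomatic spelling uses resample on the datetime index. A self-contained sketch with made-up numbers (the column names follow the question; note that newer pandas versions spell the month-end frequency "ME" instead of "M"):
import pandas as pd

df = pd.DataFrame({
    "ReportPeriod": ["01/01/2019", "01/15/2019", "02/03/2019"],
    "Passenger_Count": [100, 150, 200],
})
df.index = pd.to_datetime(df["ReportPeriod"], format="%m/%d/%Y")

# resample is shorthand for groupby(pd.Grouper(freq=...)) on a datetime index
monthly = df.resample("M")["Passenger_Count"].sum()
print(monthly)
# 2019-01-31    250
# 2019-02-28    200
# Freq: M, Name: Passenger_Count, dtype: int64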
I want to add a new column which is the result of an operation between a cell value and a value grouped by some other column.
Consider I have the following dataframe.
Month     Apples Sold
October   3
October   4
October   5
November  1
November  5
So in the example dataframe I want to do an operation between Apples Sold and avg(Apples Sold) grouped by month. As an example, let's say that x = apples_sold and y = avg(apples_sold) for the month; I want to compute (x - y) / y and name this column Variation_month:
Month     Apples Sold  Variation_month
October   3            -0.25
October   4            0
October   5            0.25
November  1            -0.67
November  5            0.67
Now I know that this is possible by grouping the dataframe, then joining on month, then doing the operation:
# Generating the example
import pandas as pd
df = pd.DataFrame({'Month': ['October', 'October', 'October', 'November', 'November'],
                   'Apples Sold': [3, 4, 5, 1, 5]})
# Doing the operation
df = pd.merge(df, df.groupby('Month').mean().reset_index(), on='Month', suffixes=('', '_mean'))
df['Variation'] = (df['Apples Sold'] - df['Apples Sold_mean']) / df['Apples Sold_mean']
# Cleaning the mess
df.drop('Apples Sold_mean', axis=1, inplace=True)
df
However, if the dataframe is big, this strategy can be very slow because of the pd.merge line. Is there a way to do these operations in an optimized way (maybe avoiding the table join or using another library)?
We can use groupby + transform to broadcast the mean values per month, thereby avoiding the intermediate merge step:
avg = df.groupby('Month')['Apples Sold'].transform('mean')
df['Variation_month'] = df['Apples Sold'].sub(avg).div(avg).round(2)
Month Apples Sold Variation_month
0 October 3 -0.25
1 October 4 0.00
2 October 5 0.25
3 November 1 -0.67
4 November 5 0.67
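The reason this works without a join: transform returns a Series aligned to the original row index, one value per row rather than one per group. A quick check:
print(df.groupby('Month')['Apples Sold'].transform('mean'))
# 0    4.0
# 1    4.0
# 2    4.0
# 3    3.0
# 4    3.0
# Name: Apples Sold, dtype: float64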
I have a DataFrame like this:
Year Month Day Rain (mm)
2021 1 1 15
2021 1 2 NaN
2021 1 3 12
And so on (there are multiple years). I have used the pivot_table function to convert the DataFrame into this:
Year 2021 2020 2019 2018 2017
Month Day
1 1 15
2 NaN
3 12
I used:
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
values='Rain (mm)', aggfunc='first')
Now I would like to replace all NaN values and also possible -1 values with zeros from every column (by columns I mean years) but I have not been able to do so. I have tried:
df = df.fillna(0)
And also:
df.loc[df['Rain (mm)'] == np.nan, 'Rain (mm)'] = 0
But neither works: no error message or exception, the dataframe just remains unchanged. What am I doing wrong? Any advice is highly appreciated.
I think the problem is that the NaN values are actually strings, so fillna cannot replace them. First convert the values to numeric:
df['Rain (mm)'] = pd.to_numeric(df['Rain (mm)'], errors='coerce')
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
values='Rain (mm)', aggfunc='first').fillna(0)
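The question also mentions possible -1 values; those can be mapped to zero with a chained replace (a sketch, assuming -1 only ever appears as a sentinel, never as a real measurement):
df = df.replace(-1, 0)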
I have a table as below.
Month,Count,Parameter
March 2015,1,40
March 2015,1,10
March 2015,1,1
March 2015,1,25
March 2015,1,50
April 2015,1,15
April 2015,1,1
April 2015,1,1
April 2015,1,15
April 2015,1,15
I need to create a new table from above as shown below.
Unique Month,Total Count,<=30
March 2015,5,3
April 2015,5,5
The logic for the new table is as follows. The "Unique Month" column holds the unique months from the original table and needs to be sorted. "Total Count" is the sum of the "Count" column from the original table for that particular month. "<=30" is the count of rows with Parameter <= 30 for that particular month.
Is there an easy way to do this in dataframes?
Thanks in advance.
IIUC, just check for Parameter <= 30 and then groupby:
(df.assign(le_30=df.Parameter.le(30))
.groupby('Month', as_index=False) # pass sort=False if needed
[['Count','le_30']].sum()
)
Or
(df.Parameter.le(30)
.groupby(df['Month']) # pass sort=False if needed
.agg(['count','sum'])
)
Output:
Month Count le_30
0 April 2015 5 5.0
1 March 2015 5 3.0
Update: as commented above, adding sort=False to groupby will respect your original sorting of Month. For example:
(df.Parameter.le(30)
.groupby(df['Month'], sort=False)
.agg(['count','sum'])
.reset_index()
)
Output:
Month count sum
0 March 2015 5 3.0
1 April 2015 5 5.0
I have a dataframe that has records from 2011 to 2018. One of the columns has the drop_off_date, which is the date when the customer left the rewards program. I want to count, for each month between 2011 and 2018, how many people dropped off during that month. So for that 96-month period, I want the count of people who dropped off, using the drop_off_date column.
I changed the column to datetime and I know I can use the .agg and .count methods, but I am not sure how to count per month. I honestly do not know what the next step would be.
Example of the data:
Record ID | store ID | drop_off_date
a1274c212| 12876| 2011-01-27
a1534c543| 12877| 2011-02-23
a1232c952| 12877| 2018-12-02
The result should look like this:
Month: | #of dropoffs:
Jan 2011 | 15
........
Dec 2018 | 6
What I suggest is to work directly with the strings in the drop_off_date column and strip them so that only the year and month remain:
# assumes the dates are stored as strings like '2011-01-27'; drop the trailing '-DD'
df['drop_off_ym'] = df.drop_off_date.apply(lambda x: x[:-3])
Then apply a groupby on the newly created column, followed by a count():
df_counts_by_month = df.groupby('drop_off_ym')['store ID'].count()
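If the column has instead already been converted to datetime (as the question states), an equivalent approach groups by a monthly period, which avoids the string slicing and sorts chronologically (a sketch):
# assumes df['drop_off_date'] is datetime64
df_counts_by_month = df.groupby(df['drop_off_date'].dt.to_period('M'))['store ID'].count()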
Using your data, I'm assuming your date has been cast to a datetime value with errors='coerce' to handle outliers.
You should then drop any NAs so you're only dealing with customers who actually dropped off.
You can do this in a multitude of ways; I would do a simple df.dropna(subset=['drop_off_date']).
print(df)
Record ID store ID drop_off_date
0 a1274c212 12876 2011-01-27
1 a1534c543 12877 2011-02-23
2 a1232c952 12877 2018-12-02
Let's create a month column to use as an aggregation key:
df['Month'] = df['drop_off_date'].dt.strftime('%b')
Then we can do a simple groupby with a count on Record ID (assuming you only want to count unique IDs):
df1 = df.groupby(df['Month'])['Record ID'].count().reset_index()
print(df1)
Month Record ID
0 Dec 1
1 Feb 1
2 Jan 1
EDIT: to account for years, first let's create a year helper column:
df['Year'] = df['drop_off_date'].dt.year
df1 = df.groupby(['Month', 'Year'])['Record ID'].count().reset_index()
print(df1)
Month Year Record ID
0 Dec 2018 1
1 Feb 2011 1
2 Jan 2011 1
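Note that grouping on the month-name string sorts alphabetically (Dec, Feb, Jan). If you want labels like the desired output (Jan 2011 ... Dec 2018) in chronological order, you can group on a monthly period and format it afterwards (a sketch, reusing the to_period idea from the other answer):
monthly = df.groupby(df['drop_off_date'].dt.to_period('M'))['Record ID'].count()
monthly.index = monthly.index.strftime('%b %Y')  # PeriodIndex supports strftime
print(monthly)
# Jan 2011    1
# Feb 2011    1
# Dec 2018    1
# Name: Record ID, dtype: int64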