Dataframe groupby to new dataframe - python

I have a table as below.
Month,Count,Parameter
March 2015,1,40
March 2015,1,10
March 2015,1,1
March 2015,1,25
March 2015,1,50
April 2015,1,15
April 2015,1,1
April 2015,1,1
April 2015,1,15
April 2015,1,15
I need to create a new table from above as shown below.
Unique Month,Total Count,<=30
March 2015,5,3
April 2015,5,5
The logic for the new table is as follows. The "Unique Month" column holds each unique month from the original table and needs to be sorted. "Total Count" is the sum of the "Count" column from the original table for that particular month. The "<=30" column is the number of rows with "Parameter <= 30" for that particular month.
Is there an easy way to do this in dataframes?
Thanks in advance.

IIUC, just check for Parameter <= 30 and then groupby:
(df.assign(le_30=df.Parameter.le(30))
.groupby('Month', as_index=False) # pass sort=False if needed
[['Count','le_30']].sum()
)
Or
(df.Parameter.le(30)
.groupby(df['Month']) # pass sort=False if needed
.agg(['count','sum'])
)
Output:
Month Count le_30
0 April 2015 5 5.0
1 March 2015 5 3.0
Update: as commented above, adding sort=False to groupby will respect your original sorting of Month. For example:
(df.Parameter.le(30)
.groupby(df['Month'], sort=False)
.agg(['count','sum'])
.reset_index()
)
Output:
Month count sum
0 March 2015 5 3.0
1 April 2015 5 5.0
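For reference, here is a self-contained sketch that rebuilds the sample table from the question and produces the requested column names. The use of named aggregation and the chronological sort through pd.to_datetime with format="%B %Y" are my own choices, not part of the answer above, and they assume English month names.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Month": ["March 2015"] * 5 + ["April 2015"] * 5,
    "Count": [1] * 10,
    "Parameter": [40, 10, 1, 25, 50, 15, 1, 1, 15, 15],
})

result = (
    df.assign(le_30=df["Parameter"].le(30))          # True where Parameter <= 30
      .groupby("Month", sort=False, as_index=False)
      .agg(**{"Total Count": ("Count", "sum"), "<=30": ("le_30", "sum")})
      .rename(columns={"Month": "Unique Month"})
)

# Sort chronologically rather than alphabetically (assumes "Month Year" strings).
result = result.sort_values(
    "Unique Month",
    key=lambda s: pd.to_datetime(s, format="%B %Y"),
    ignore_index=True,
)
print(result)
```

Summing the boolean le_30 column counts the True values, which is why it gives the number of rows with Parameter <= 30.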

Related

How to group by month and year from a specific range?

The data have reported values for January 2006 through January 2019. I need to compute the total number of passengers (Passenger_Count) per month. The dataframe should have 121 entries (10 years * 12 months, plus 1 for January 2019). The range should go from 2009 to 2019.
I have been doing:
df.groupby(['ReportPeriod'])['Passenger_Count'].sum()
But it doesn't give me the right result, it gives
You can do
df['ReportPeriod'] = pd.to_datetime(df['ReportPeriod'])
out = df.groupby(df['ReportPeriod'].dt.strftime('%Y-%m'))['Passenger_Count'].sum()  # '%Y-%m' groups by month; adding '-%d' would group by day
Try this:
df.index = pd.to_datetime(df["ReportPeriod"], format="%m/%d/%Y")
df = df.groupby(pd.Grouper(freq="M")).sum()
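Neither answer restricts the data to the 2009 to 2019 range the question asks for. A minimal sketch of one way to do both steps follows; the sample rows are made up, and the month/day/year date format is an assumption carried over from the second answer.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
df = pd.DataFrame({
    "ReportPeriod": ["01/01/2006", "01/01/2009", "01/01/2009", "02/01/2009", "01/01/2019"],
    "Passenger_Count": [999, 100, 50, 70, 30],
})
df["ReportPeriod"] = pd.to_datetime(df["ReportPeriod"], format="%m/%d/%Y")

# Keep January 2009 through January 2019 (both ends inclusive).
start, end = pd.Timestamp("2009-01-01"), pd.Timestamp("2019-01-31")
in_range = df[df["ReportPeriod"].between(start, end)]

# Group on a year-month period so that each month of each year is its own group.
monthly = (
    in_range.groupby(in_range["ReportPeriod"].dt.to_period("M"))["Passenger_Count"]
    .sum()
)
print(monthly)
```

Months with no rows at all will not appear in the result; reindexing against pd.period_range would be one way to add them with a zero total.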

I have multiIndexes for my dataframe, how do I calculate the sum for one level?

Hi everyone, I want to calculate the sum of Violent_type count according to year. For example, calculating total count of violent_type for year 2013, which is 18728+121662+1035. But I don't know how to select the data when there are multiIndexes. Any advice will be appreciated. Thanks.
The level argument in pandas.DataFrame.groupby() is what you are looking for.
level int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
To answer your question, you only need:
df.groupby(level=[0, 1]).sum()
# or
df.groupby(level=['district', 'year']).sum()
To see the effect
import pandas as pd
iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame(list(range(0, len(index))), index=index, columns=['count'])
'''
print(df)
                            count
district year Violent_type
001      2013 Dangerous         0
              Non-Violent       1
              Violent           2
         2014 Dangerous         3
              Non-Violent       4
              Violent           5
SST      2013 Dangerous         6
              Non-Violent       7
              Violent           8
         2014 Dangerous         9
              Non-Violent      10
              Violent          11
'''
print(df.groupby(level=[0, 1]).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
print(df.groupby(level=['district', 'year']).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
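If the total is wanted per year across all districts, or for one specific year, the same level argument works with a single level. This is a sketch on the toy frame from this answer, not on the asker's real data; whether the districts should be combined is an assumption.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["001", "SST"], [2013, 2014], ["Dangerous", "Non-Violent", "Violent"]],
    names=["district", "year", "Violent_type"],
)
df = pd.DataFrame({"count": list(range(len(index)))}, index=index)

# Sum over every district and type, one row per year.
per_year = df.groupby(level="year")["count"].sum()

# Select only the 2013 rows with a cross-section, then add them up.
total_2013 = df.xs(2013, level="year")["count"].sum()

print(per_year)
print(total_2013)
```

Here xs picks the rows for one value of one index level, which is the "select the data" step the question asks about.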

Cumulative sum of all previous values

A similar question has been asked for cumsum and grouping but it didn't solve my case.
I have a financial balance sheet of a lot of years and need to sum all previous values by year.
This is my reproducible set:
df=pd.DataFrame(
{"Amount": [265.95,2250.00,-260.00,-2255.95,120],
"Year": [2018,2018,2018,2019,2019]})
The result I want is the following:
Year Amount
2017 0
2018 2255.95
2019 120.00
2020 120.00
So effectively I need a loop going from the lowest year in my whole set to the highest year in my set:
...
df[df.Year<=2017].Amount.sum()
df[df.Year<=2018].Amount.sum()
df[df.Year<=2019].Amount.sum()
df[df.Year<=2020].Amount.sum()
...
First aggregate with sum per year, then apply Series.cumsum, then use Series.reindex with forward filling to cover all possible years, and finally replace the leading missing values with 0:
years = range(2017, 2021)
df1 = (df.groupby('Year')['Amount']
.sum()
.cumsum()
.reindex(years, method='ffill')
.fillna(0)
.reset_index())
print (df1)
Year Amount
0 2017 0.00
1 2018 2255.95
2 2019 120.00
3 2020 120.00
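As a sanity check, the chained version can be compared with the filter-and-sum loop from the question. A small sketch, assuming the same year range of 2017 to 2020:

```python
import pandas as pd

df = pd.DataFrame({
    "Amount": [265.95, 2250.00, -260.00, -2255.95, 120],
    "Year": [2018, 2018, 2018, 2019, 2019],
})
years = range(2017, 2021)

# Vectorized: per-year totals, running total, then fill the years without data.
cumulative = (
    df.groupby("Year")["Amount"].sum().cumsum()
      .reindex(years, method="ffill")
      .fillna(0)
)

# Brute force: the loop from the question, one filtered sum per year.
brute_force = pd.Series(
    [df.loc[df["Year"] <= y, "Amount"].sum() for y in years],
    index=years,
)

# The two agree up to floating-point rounding.
max_diff = (cumulative - brute_force).abs().max()
print(max_diff < 1e-9)
```

The forward fill carries the last known balance into years without transactions (2020 here), while fillna(0) only affects years before the first transaction (2017 here).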

How to count the number of dropoffs per month for dataframe column

I have a dataframe that has records from 2011 to 2018. One of the columns is drop_off_date, which is the date when the customer left the rewards program. I want to count, for each month between 2011 and 2018, how many people dropped off during that month. So for the 84-month period, I want the count of people who dropped off, using the drop_off_date column.
I changed the column to datetime and I know I can use the .agg and .count methods, but I am not sure how to count per month. I honestly do not know what the next step would be.
Example of the data:
Record ID | store ID | drop_off_date
a1274c212| 12876| 2011-01-27
a1534c543| 12877| 2011-02-23
a1232c952| 12877| 2018-12-02
The result should look like this:
Month: | #of dropoffs:
Jan 2011 | 15
........
Dec 2018 | 6
What I suggest is to work directly with the strings in the column drop_off_date and to strip the day so that only the year and month remain (this assumes the column still holds strings such as '2011-01-27', not datetime values):
df['drop_off_ym'] = df.drop_off_date.apply(lambda x: x[:-3])
Then you apply a groupby on the newly created column and then a count():
df_counts_by_month = df.groupby('drop_off_ym')['store ID'].count()
Using your data, I'm assuming drop_off_date has been cast to a datetime value with errors='coerce' to handle invalid entries.
You should then drop any NAs so that you're only dealing with customers who dropped off.
You can do this in a multitude of ways; I would do a simple df = df.dropna(subset=['drop_off_date'])
print(df)
Record ID store ID drop_off_date
0 a1274c212 12876 2011-01-27
1 a1534c543 12877 2011-02-23
2 a1232c952 12877 2018-12-02
Let's create a month column to use as the grouping key:
df['Month'] = df['drop_off_date'].dt.strftime('%b')
Then we can do a simple groupby and count the Record ID column (note that count() counts non-null rows; use nunique() if you only want to count unique IDs):
df1 = df.groupby(df['Month'])['Record ID'].count().reset_index()
print(df1)
Month Record ID
0 Dec 1
1 Feb 1
2 Jan 1
EDIT: To account for years, first let's create a year helper column:
df['Year'] = df['drop_off_date'].dt.year
df1 = df.groupby(['Month','Year'])['Record ID'].count().reset_index()
print(df1)
Month Year Record ID
0 Dec 2018 1
1 Feb 2011 1
2 Jan 2011 1
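Both answers list only the months that actually occur, in alphabetical rather than calendar order. If every month in the range should appear in order, including months with zero dropoffs, one possible sketch is to group on a monthly period and reindex. The January 2011 to December 2018 range is my reading of the question, and the three rows are the sample data from it.

```python
import pandas as pd

df = pd.DataFrame({
    "Record ID": ["a1274c212", "a1534c543", "a1232c952"],
    "store ID": [12876, 12877, 12877],
    "drop_off_date": ["2011-01-27", "2011-02-23", "2018-12-02"],
})
df["drop_off_date"] = pd.to_datetime(df["drop_off_date"], errors="coerce")

# Keep only customers who actually dropped off.
dropped = df.dropna(subset=["drop_off_date"])

# Every month in the assumed range, in calendar order.
all_months = pd.period_range("2011-01", "2018-12", freq="M")

counts = (
    dropped.groupby(dropped["drop_off_date"].dt.to_period("M"))["Record ID"]
    .count()
    .reindex(all_months, fill_value=0)   # months without dropoffs become 0
)

# Format the labels like "Jan 2011" only after sorting is settled.
counts.index = counts.index.strftime("%b %Y")
print(counts.head())
```

Grouping on a period keeps the year and month together in one sortable key, so no separate Month and Year helper columns are needed.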

Map dataframe column value by another column's value

My dataframe has a month column with values that repeat as Apr, Apr.1, Apr.2 etc. because there is no year column. I added a year column based on the month value using a for loop as shown below, but I'd like to find a more efficient way to do this:
Products['Year'] = '2015'
for i in range(0, len(Products.Month)):
    if '.1' in Products['Month'][i]:
        Products['Year'][i] = '2016'
    elif '.2' in Products['Month'][i]:
        Products['Year'][i] = '2017'
You can use .str to treat the whole column as strings and split at the dot.
Then apply a function that takes the split result and turns the numeric suffix into a new year value if possible.
Starting dataframe:
Month
0 Apr
1 Apr.1
2 Apr.2
Solution:
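def get_year(entry):
    # entry is the list produced by str.split('.'), e.g. ['Apr', '1']
    value = 2015
    try:
        value += int(entry[-1])
    except ValueError:
        # no numeric suffix (e.g. ['Apr']), so keep the base year
        pass
    return str(value)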
df['Year'] = df.Month.str.split('.').apply(get_year)
Now df is:
Month Year
0 Apr 2015
1 Apr.1 2016
2 Apr.2 2017
You can use pd.to_numeric after splitting and add 2015, i.e.
df['new'] = pd.to_numeric(df['Month'].str.split('.').str[-1],errors='coerce').fillna(0) + 2015
# Sample DataFrame from Mike Muller's answer
Month Year new
0 Apr 2015 2015.0
1 Apr.1 2016 2016.0
2 Apr.2 2017 2017.0
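The 'new' column above comes out as floats. If string years like those in the question are wanted, a variant using str.extract could look like the sketch below; the base year 2015 comes from the question, and the extra May rows are made up for illustration.

```python
import pandas as pd

# 'Apr' rows come from the answers above; the 'May' rows are hypothetical.
df = pd.DataFrame({"Month": ["Apr", "Apr.1", "Apr.2", "May", "May.1"]})

# Capture the digits after the final dot; rows without a suffix give NaN.
suffix = df["Month"].str.extract(r"\.(\d+)$", expand=False)

# Missing suffix means offset 0, i.e. the base year 2015.
offset = pd.to_numeric(suffix).fillna(0).astype(int)
df["Year"] = (2015 + offset).astype(str)
print(df)
```

This avoids the Python-level loop and the chained assignment Products['Year'][i] = ..., which pandas may warn about.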
