How to groupby a dataframe by month while keeping other string columns? - python

A sample of my dataframe is as follows:
|Date_Closed|Owner|Case_Closed_Count|
|2022-07-19|JH|1|
|2022-07-18|JH|2|
|2022-07-17|JH|5|
|2022-07-19|DT|3|
|2022-07-15|DT|1|
|2022-07-01|DT|1|
|2022-06-30|JW|30|
|2022-06-28|JH|2|
My goal is to get a sum of case count per owner per month, which looks like:
|Month|Owner|Case_Closed_Count|
|2022-07|JH|8|
|2022-07|DT|5|
|2022-06|JW|30|
|2022-06|JH|2|
Here is the code I got so far:
df['Date_Closed'] = pd.to_datetime(df['Date_Closed'])
month = df.Date_Closed.dt.to_period("M")
G = df.groupby(month).agg({'Case_Closed_Count':'sum'})
With the code above, I manage to get the case closed count grouped by month, but how do I keep the Owner column?

Here is one way to do it:
df['Date_Closed'] = pd.to_datetime(df['Date_Closed'])
df.groupby([df['Date_Closed'].dt.strftime('%Y-%m'), 'Owner']).sum().reset_index()
Date_Closed Owner Case_Closed_Count
0 2022-06 JH 2
1 2022-06 JW 30
2 2022-07 DT 5
3 2022-07 JH 8
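If you would rather keep a real Period dtype (as in the question's to_period attempt) instead of a formatted string, here is a minimal sketch of the same idea, shown on a trimmed copy of the sample data:
import pandas as pd

df = pd.DataFrame({
    'Date_Closed': ['2022-07-19', '2022-07-18', '2022-06-30', '2022-06-28'],
    'Owner': ['JH', 'JH', 'JW', 'JH'],
    'Case_Closed_Count': [1, 2, 30, 2],
})
df['Date_Closed'] = pd.to_datetime(df['Date_Closed'])
# group by the month period plus the Owner column, then flatten back to columns
out = (df.groupby([df['Date_Closed'].dt.to_period('M'), 'Owner'])['Case_Closed_Count']
         .sum()
         .reset_index()
         .rename(columns={'Date_Closed': 'Month'}))
The Month column then holds Period values such as 2022-07, which sort chronologically.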

Related

Pandas groupby n rows starting from bottom of df

I have the following df:
Week Sales
1 10
2 15
3 10
4 20
5 20
6 10
7 15
8 10
I would like to group every 3 weeks and sum up sales. I want to start with the bottom 3 weeks. If there are fewer than 3 weeks left at the top, as in this example, these weeks should be ignored. Desired output is this:
Week Sales
5-3 50
8-6 35
I tried this on my original df:
df.reset_index(drop=True).groupby(by=lambda x: x/N, axis=0).sum()
but this solution does not start from the bottom rows.
Can anyone point me into the right direction here? Thanks!
You can try inverting the data with .iloc[::-1]:
import numpy as np

N = 3
(df.iloc[::-1].groupby(np.arange(len(df)) // N)
   .agg({'Week': lambda x: f'{x.iloc[0]}-{x.iloc[-1]}',
         'Sales': 'sum'})
)
Output:
Week Sales
0 8-6 35
1 5-3 50
2 2-1 25
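The question also asks to ignore the leftover chunk at the top (weeks 1-2 here); a minimal sketch under the same approach, keeping only complete groups of N rows before aggregating:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Week': range(1, 9),
                   'Sales': [10, 15, 10, 20, 20, 10, 15, 10]})
N = 3
rev = df.iloc[::-1].reset_index(drop=True)
rev['grp'] = np.arange(len(rev)) // N
# drop any trailing group with fewer than N weeks, then aggregate as above
out = (rev.groupby('grp').filter(lambda g: len(g) == N)
          .groupby('grp')
          .agg(Week=('Week', lambda x: f'{x.iloc[0]}-{x.iloc[-1]}'),
               Sales=('Sales', 'sum')))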
When dealing with period aggregation, I usually use .resample, as it is flexible in binning data with different time periods:
import io
from datetime import timedelta
import pandas as pd
dataf = pd.read_csv(io.StringIO("""Week Sales
1 10
2 15
3 10
4 20
5 20
6 10
7 15
8 10"""), sep='\s+',).astype(int)
# reverse the data and turn the integer weeks into actual timedeltas
dataf = dataf.iloc[::-1]
dataf['Week'] = dataf['Week'].map(lambda x: timedelta(weeks=x))
# set the time object as the index for resampling
dataf = dataf.set_index('Week')
# now we resample
dataf.resample('21d').sum() # 21days
Note: the resulting bin labels are misleading, and setting kind='period' raises an error.

how to create month and year columns using regex and pandas

Hello Stack Overflow community,
I've got this DataFrame:
code sum of August
AA 1000
BB 4000
CC 72262
So there are two columns: ['code', 'sum of August'].
I have to convert this DataFrame to have the columns ['month', 'year', 'code', 'sum of August']:
month year code sum of August
8 2020 AA 1000
8 2020 BB 4000
8 2020 CC 72262
The ['sum of August'] column is sometimes named just ['August'] or ['august']; likewise, it can be ['sum of November'], ['November'], or ['november'].
I thought of using regex to extract the month name and convert it to a month number.
Can anyone please help me with this?
Thanks in advance!
You can do the following:
month = {1: 'january',
         2: 'february',
         3: 'march',
         4: 'april',
         5: 'may',
         6: 'june',
         7: 'july',
         8: 'august',
         9: 'september',
         10: 'october',
         11: 'november',
         12: 'december'}
Let's say your data frame is called df. Then you can create the column month automatically using the following:
df['month']=[i for i,j in month.items() if j in str.lower(" ".join(df.columns))][0]
code sum of August month
0 AA 1000 8
1 BB 4000 8
2 CC 72262 8
That is: if a month's name appears anywhere in the column names, return that month's number.
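The one-liner is fairly dense; an equivalent spelled out step by step (same logic, no behavior change):
# all column names joined and lowercased, e.g. 'code sum of august'
cols = " ".join(df.columns).lower()
# the first month number whose name appears in the column names
df['month'] = next(num for num, name in month.items() if name in cols)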
It looks like you're trying to convert month names to their numbers, and the columns can be uppercase or lowercase.
This might work:
months = ['january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december']
monthNum = []  # using a list here just so the example runs
sumOfMonths = ['sum of august', 'sum of NovemBer']  # just to show functionality
for sumOfMonth in sumOfMonths:
    for idx, month in enumerate(months):
        if month in sumOfMonth.lower():  # the column name contains one of the month keywords
            monthNum.append(str(idx + 1))  # the month number is the list index + 1
I hope this helps! This isn't exactly what you'd run as-is: fill in your own variables, and swap out append() if you're not collecting the results in a list.
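Since the question explicitly mentions regex, here is a hedged regex-based sketch; it assumes English month names, and since the source of the year isn't stated, it simply uses the current year for the 'year' column:
import re
import calendar
from datetime import date

def month_number(col_name):
    # pattern like 'January|February|...|December'
    pattern = '|'.join(calendar.month_name[1:])
    match = re.search(pattern, col_name, flags=re.IGNORECASE)
    if match is None:
        return None
    return list(calendar.month_name).index(match.group(0).capitalize())

# assumes df has some column like 'sum of August', 'august' or 'November'
df['month'] = next(m for m in map(month_number, df.columns) if m is not None)
df['year'] = date.today().year  # assumption: 2020 in the expected output is just the current year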

Get the average mean of entries per month with datetime in Pandas

I have a large df with many entries per month. I would like to see the average number of entries per month, to check, for example, whether some months normally have more entries. (Ideally I'd like to plot this with a line for the overall mean to compare against, but that is maybe a later question.)
My df is something like this:
ufo=pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo['Time']=pd.to_datetime(ufo.Time)
Where the head looks like this (screenshot omitted).
So if I'd like to see whether there are more ufo sightings in the summer, for example, how would I go about it?
I have tried:
ufo.groupby(ufo.Time.dt.month).mean()
But it only works if I am calculating a numerical value. If I use count() instead, I get the total number of entries for each month across all years, not the average.
EDIT: To clarify, I would like to have the mean of entries - ufo-sightings - per month.
You could do something like this:
# count how many years of records each month spans
def total_month(x):
    return x.max().year - x.min().year + 1

new_df = ufo.groupby(ufo.Time.dt.month).Time.agg(['size', total_month])
new_df['mean_count'] = new_df['size'] / new_df['total_month']
Output:
size total_month mean_count
Time
1 862 57 15.122807
2 817 70 11.671429
3 1096 55 19.927273
4 1045 68 15.367647
5 1168 53 22.037736
6 3059 71 43.084507
7 2345 65 36.076923
8 1948 64 30.437500
9 1635 67 24.402985
10 1723 65 26.507692
11 1509 50 30.180000
12 1034 56 18.464286
I think this is what you are looking for; please ask for clarification if it isn't quite what you need.
# Add a new column instance, this adds a value to each instance of ufo sighting
ufo['instance'] = 1
# set index to time, this makes df a time series df and then you can apply pandas time series functions.
ufo.set_index(ufo['Time'], drop=True, inplace=True)
# create another df by resampling the original df and counting the instance column by Month ('M' is resample by month)
ufo2 = pd.DataFrame(ufo['instance'].resample('M').count())
# just to find month of resampled observation
ufo2['Time'] = pd.to_datetime(ufo2.index.values)
ufo2['month'] = ufo2['Time'].apply(lambda x: x.month)
and finally you can groupby month :)
ufo2.groupby(by='month').mean()
and this is what the output looks like:
month mean_instance
1 12.314286
2 11.671429
3 15.657143
4 14.928571
5 16.685714
6 43.084507
7 33.028169
8 27.436620
9 23.028169
10 24.267606
11 21.253521
12 14.563380
Do you mean you want to group your data by month? I think we can do this:
ufo['month'] = ufo['Time'].apply(lambda t: t.month)
ufo['year'] = ufo['Time'].apply(lambda t: t.year)
In this way, you will have 'year' and 'month' to group your data.
ufo_2 = ufo.groupby(['year', 'month'])['place_holder'].mean()
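To get from there to the average number of entries per calendar month, one possible continuation (a sketch, building on the year/month columns created above):
# count sightings in each actual (year, month) pair...
counts = ufo.groupby(['year', 'month']).size()
# ...then average those counts across years for each calendar month
mean_per_month = counts.groupby(level='month').mean()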

Adding missing value in Pandas dataframe

I have a dataframe of the following structure (shown here as space-separated values):
day date hour cnt
Friday 9/15/2017 0 3
Friday 9/15/2017 1 5
Friday 9/15/2017 2 8
Friday 9/15/2017 3 6
...........................
Friday 9/15/2017 10
...........................
Saturday 9/16/2017 21 5
Saturday 9/16/2017 22 4
Some of the date values have data for every hour (0-23).
However, some of the date values can have missing hours. In the example, for 9/15/2017 data, there are no records for hour values from 9 to 13. For all these missing records, I need to add a new record with a cnt value (last column) of zero.
How do I achieve this in Python?
Provided you use a pandas.DataFrame, you may use the fillna() method:
df['cnt'].fillna(value=0)
Note that fillna() only fills cells that already exist as NaN; if the hour rows are missing entirely, they need to be created first (see the answers below).
Example:
Consider data:
one two three
a NaN 1.2 -0.355322
c NaN 3.3 0.983801
e 0.01 4 -0.712964
You may fill NaN using fillna():
data.fillna(0)
one two three
a 0 1.2 -0.355322
c 0 3.3 0.983801
e 0.01 4 -0.712964
You can generate a DatetimeIndex and use the resample method:
#suppose your dataframe is named df:
idx = pd.DatetimeIndex(pd.to_datetime(df['date']).add(pd.to_timedelta(df['hour'], unit='h')))
df.index = idx
df_filled = df[['cnt']].resample('1H').sum().fillna(0).astype(int)
df_filled['day'] = df_filled.index.strftime('%A')
df_filled['date'] = df_filled.index.strftime('%-m/%-d/%Y')
df_filled['hour'] = df_filled.index.strftime('%-H')
or you can do the pivot and unpivot trick:
df_filled = df.pivot(values='cnt',index='date',columns='hour').fillna(0).unstack()
df_filled = df_filled.reset_index().sort_values(by=['date','hour'])
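Both answers above materialize the missing hours through the index. Another common pattern, sketched here under the assumption that every date in the data should cover hours 0-23, is to reindex against a full (date, hour) grid:
import pandas as pd

# assumption: every date present in df should have all 24 hours
full = pd.MultiIndex.from_product([df['date'].unique(), range(24)],
                                  names=['date', 'hour'])
df_filled = (df.set_index(['date', 'hour'])['cnt']
               .reindex(full, fill_value=0)
               .reset_index())
# the 'day' column can be recomputed from 'date' afterwards if needed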

Iterate over dates in a Pandas Dataframe to get the count of a different column per week

I am a Java developer finding it a bit tricky to switch to Python and pandas. I'm trying to iterate over the dates of a Pandas DataFrame, which looks like this:
sender_user_id created
0 1 2016-12-19 07:36:07.816676
1 33 2016-12-19 07:56:07.816676
2 1 2016-12-19 08:14:07.816676
3 15 2016-12-19 08:34:07.816676
What I am trying to get is a dataframe that gives me a count of the total number of transactions that have occurred per week. From the forums I have only been able to find syntax for 'for loops' that iterate over indexes. Basically I need a result dataframe like the one below: the value field contains the count of sender_user_id, and the date needs to show the starting date of each week.
date value
0 2016-12-09 20
1 2016-12-16 36
2 2016-12-23 56
3 2016-12-30 32
Thanks in advance for the help.
I think you need to resample by week and aggregate with size:
#cast to datetime if necessary
df.created = pd.to_datetime(df.created)
print (df.resample('W', on='created').size().reset_index(name='value'))
created value
0 2016-12-25 4
If you need another offset:
df.created = pd.to_datetime(df.created)
print (df.resample('W-FRI', on='created').size().reset_index(name='value'))
created value
0 2016-12-23 4
If you need the number of unique values per week, aggregate with nunique:
df.created = pd.to_datetime(df.created)
print (df.resample('W-FRI', on='created')['sender_user_id'].nunique()
.reset_index(name='value'))
created value
0 2016-12-23 3
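Note that the question asked for the starting date of each week, while resample('W') labels each bin by its end date by default. A small tweak, assuming bins labelled by their start is what's wanted, is to pass label='left' and closed='left':
df.created = pd.to_datetime(df.created)
print (df.resample('W', on='created', label='left', closed='left')
         .size().reset_index(name='value'))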
