Get the average number of entries per month with datetime in Pandas - python

I have a large df with many entries per month. I would like to see the average number of entries per month, for example to see whether some months normally have more entries than others. (Ideally I'd like to plot this with a line for the overall mean to compare against, but that is maybe a later question.)
My df is something like this:
ufo = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo['Time'] = pd.to_datetime(ufo.Time)
Each row is one sighting, with its timestamp in the Time column.
So if I'd like to see whether there are, for example, more ufo sightings in the summer, how would I go about it?
I have tried:
ufo.groupby(ufo.Time.dt.month).mean()
But that only works when calculating a numerical value. If I use count() instead I get the total number of entries for each month across all years, not the average.
EDIT: To clarify, I would like the mean number of entries (ufo sightings) per month.

You could do something like this:
# span of years covered by the records for each month
def total_month(x):
    return x.max().year - x.min().year + 1

new_df = ufo.groupby(ufo.Time.dt.month).Time.agg(['size', total_month])
new_df['mean_count'] = new_df['size'] / new_df['total_month']
Output:
size total_month mean_count
Time
1 862 57 15.122807
2 817 70 11.671429
3 1096 55 19.927273
4 1045 68 15.367647
5 1168 53 22.037736
6 3059 71 43.084507
7 2345 65 36.076923
8 1948 64 30.437500
9 1635 67 24.402985
10 1723 65 26.507692
11 1509 50 30.180000
12 1034 56 18.464286
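The question also mentioned plotting this against a line for the overall mean. Here is a minimal sketch building on new_df above; the matplotlib usage is an illustration, not part of the original answer:
# plot the monthly means with a horizontal line at the overall mean
import matplotlib.pyplot as plt

ax = new_df['mean_count'].plot(kind='bar')
ax.axhline(new_df['mean_count'].mean(), color='red', linestyle='--', label='overall mean')
ax.set_xlabel('month')
ax.set_ylabel('mean sightings')
ax.legend()
plt.show()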

I think this is what you are looking for; please ask for clarification if it isn't what you need.
# add a column 'instance' that gives each ufo sighting a count of 1
ufo['instance'] = 1
# set the index to Time; this makes df a time series, so pandas time-series functions apply
ufo.set_index(ufo['Time'], drop=True, inplace=True)
# resample the original df by month ('M') and count the instance column
ufo2 = pd.DataFrame(ufo['instance'].resample('M').count())
# recover the month of each resampled observation
ufo2['Time'] = pd.to_datetime(ufo2.index.values)
ufo2['month'] = ufo2['Time'].apply(lambda x: x.month)
and finally you can groupby month :)
ufo2.groupby(by='month').mean()
and this is the output:
       mean_instance
month
1          12.314286
2          11.671429
3          15.657143
4          14.928571
5          16.685714
6          43.084507
7          33.028169
8          27.436620
9          23.028169
10         24.267606
11         21.253521
12         14.563380
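One version note: in recent pandas (2.2+) the 'M' resample alias is deprecated in favor of 'ME' (month end), so on a new install the resample line above would be written as:
# pandas >= 2.2: 'ME' (month end) replaces the deprecated 'M' alias
ufo2 = pd.DataFrame(ufo['instance'].resample('ME').count())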

Do you mean you want to group your data by month? I think we can do this:
ufo['month'] = ufo['Time'].apply(lambda t: t.month)
ufo['year'] = ufo['Time'].apply(lambda t: t.year)
In this way, you will have 'year' and 'month' to group your data.
ufo_2 = ufo.groupby(['year', 'month'])['place_holder'].mean()
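To get from there to the mean number of sightings per month (what the question asks), a sketch is to count rows per (year, month) with size() and then average those counts per month. Note this averages only over the (year, month) pairs that actually appear, so it can differ slightly from the first answer's division by the year span:
# count sightings per (year, month), then average those counts for each month
monthly_counts = ufo.groupby(['year', 'month']).size()
mean_per_month = monthly_counts.groupby(level='month').mean()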

Related

How to groupby a dataframe by month while keeping other string columns?

A sample of my dataframe is as follows:
| Date_Closed | Owner | Case_Closed_Count |
|-------------|-------|-------------------|
| 2022-07-19  | JH    | 1                 |
| 2022-07-18  | JH    | 2                 |
| 2022-07-17  | JH    | 5                 |
| 2022-07-19  | DT    | 3                 |
| 2022-07-15  | DT    | 1                 |
| 2022-07-01  | DT    | 1                 |
| 2022-06-30  | JW    | 30                |
| 2022-06-28  | JH    | 2                 |
My goal is to get a sum of case count per owner per month, which looks like:
| Month   | Owner | Case_Closed_Count |
|---------|-------|-------------------|
| 2022-07 | JH    | 8                 |
| 2022-07 | DT    | 5                 |
| 2022-06 | JW    | 30                |
| 2022-06 | JH    | 2                 |
Here is the code I got so far:
df['Date_Closed'] = pd.to_datetime(df['Date_Closed'])
month = df.Date_Closed.dt.to_period('M')
G = df.groupby(month).agg({'Case_Closed_Count': 'sum'})
With the code above I manage to get the case closed count grouped by month, but how do I keep the owner column?
Here is one way to do it:
df['Date_Closed'] = pd.to_datetime(df['Date_Closed'])
df.groupby([df['Date_Closed'].dt.strftime('%Y-%m'), 'Owner']).sum().reset_index()
Date_Closed Owner Case_Closed_Count
0 2022-06 JH 2
1 2022-06 JW 30
2 2022-07 DT 5
3 2022-07 JH 8
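Since the question already reached for dt.to_period("M"), an equivalent sketch keeps that approach and simply adds Owner as a second grouping key:
# group by calendar month (as a Period) and owner, then sum the counts
df['Date_Closed'] = pd.to_datetime(df['Date_Closed'])
out = (df.groupby([df['Date_Closed'].dt.to_period('M'), 'Owner'])['Case_Closed_Count']
         .sum()
         .reset_index())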

Pandas groupby n rows starting from bottom of df

I have the following df:
Week Sales
1 10
2 15
3 10
4 20
5 20
6 10
7 15
8 10
I would like to group every 3 weeks and sum up sales. I want to start with the bottom 3 weeks. If there are fewer than 3 weeks left at the top, as in this example, those weeks should be ignored. Desired output is this:
Week Sales
5-3 50
8-6 35
I tried this on my original df: df.reset_index(drop=True).groupby(by=lambda x: x // N, axis=0).sum()
but this solution does not start from the bottom rows.
Can anyone point me in the right direction here? Thanks!
You can try inverting the data with .iloc[::-1]:
import numpy as np

N = 3
(df.iloc[::-1].groupby(np.arange(len(df)) // N)
   .agg({'Week': lambda x: f'{x.iloc[0]}-{x.iloc[-1]}',
         'Sales': 'sum'})
)
Output:
Week Sales
0 8-6 35
1 5-3 50
2 2-1 25
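To also drop the incomplete 2-1 group, as in the desired output, one option (a sketch under the same N=3 setup) is to mask out groups whose size is less than N:
import numpy as np

N = 3
rev = df.iloc[::-1]
key = np.arange(len(rev)) // N
out = rev.groupby(key).agg({'Week': lambda x: f'{x.iloc[0]}-{x.iloc[-1]}',
                            'Sales': 'sum'})
out = out[rev.groupby(key).size() == N]  # keep only complete groups of N weeks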
When dealing with period aggregation, I usually use .resample, as it is flexible in binning data over different time periods:
import io
from datetime import timedelta
import pandas as pd
dataf = pd.read_csv(io.StringIO("""Week Sales
1 10
2 15
3 10
4 20
5 20
6 10
7 15
8 10"""), sep='\s+',).astype(int)
# reverse data and transform int weeks to actual date time
dataf = dataf.iloc[::-1]
dataf['Week'] = dataf['Week'].map(lambda x: timedelta(weeks=x))
# set date object to index for resampling
dataf = dataf.set_index('Week')
# now we resample
dataf.resample('21d').sum()  # 21 days = 3 weeks
Note: the resulting bin labels are misleading (they are timedelta bin edges, not week ranges), and setting kind='period' raises an error.

Pandas monthly rolling operation

I ended up figuring it out while writing out this question so I'll just post anyway and answer my own question in case someone else needs a little help.
Problem
Suppose we have a DataFrame, df, containing this data.
import pandas as pd
from io import StringIO
data = StringIO(
"""\
date spendings category
2014-03-25 10 A
2014-04-05 20 A
2014-04-15 10 A
2014-04-25 10 B
2014-05-05 10 B
2014-05-15 10 A
2014-05-25 10 A
"""
)
df = pd.read_csv(data, sep=r"\s+", parse_dates=True, index_col="date")
Goal
For each row, sum the spendings over every row that is within one month of it, ideally using DataFrame.rolling as it's a very clean syntax.
What I have tried
df = df.rolling("M").sum()
But this throws an exception
ValueError: <MonthEnd> is a non-fixed frequency
version: pandas==0.19.2
Use the "D" offset rather than "M" and specifically use "30D" for 30 days or approximately one month.
df = df.rolling("30D").sum()
Initially, I intuitively jumped to using "M" as I figured it stands for one month, but now it's clear why that doesn't work.
To address why you cannot use things like "AS" or "Y": the "Y" offset is not "a year", it actually references YearEnd (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases), so the rolling function does not get a fixed window (e.g. you get a 365-day window if your index falls on Jan 1, and a 1-day window if it falls on Dec 31).
The proposed solution (offset by 30D) works if you do not need strict calendar months. Alternatively, you can iterate over your date index and slice with an offset to get more precise control over your sum.
If you have to do it in one line (separated for readability):
df['Sum'] = [
df.loc[
edt - pd.tseries.offsets.DateOffset(months=1):edt, 'spendings'
].sum() for edt in df.index
]
spendings category Sum
date
2014-03-25 10 A 10
2014-04-05 20 A 30
2014-04-15 10 A 40
2014-04-25 10 B 50
2014-05-05 10 B 50
2014-05-15 10 A 40
2014-05-25 10 A 40
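On newer pandas (>= 1.1) there is also pandas.api.indexers.VariableOffsetWindowIndexer, which lets rolling use a non-fixed offset. Whether DateOffset(months=1) is accepted there is an assumption to verify, since the documented examples use offsets like BusinessDay:
# sketch only: variable-width rolling window from a non-fixed offset;
# DateOffset(months=1) working here is an assumption, not a documented guarantee
from pandas.api.indexers import VariableOffsetWindowIndexer

indexer = VariableOffsetWindowIndexer(index=df.index, offset=pd.DateOffset(months=1))
df['Sum'] = df['spendings'].rolling(indexer).sum()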

Iterate over dates in a Pandas Dataframe to get the count of a different column per week

I am a Java developer finding it a bit tricky to switch to Python and Pandas. I'm trying to iterate over the dates of a Pandas Dataframe, which looks like the one below:
sender_user_id created
0 1 2016-12-19 07:36:07.816676
1 33 2016-12-19 07:56:07.816676
2 1 2016-12-19 08:14:07.816676
3 15 2016-12-19 08:34:07.816676
What I am trying to get is a dataframe which gives me a count of the total number of transactions that have occurred per week. From the forums I have only been able to find syntax for 'for loops' that iterate over indexes. Basically I need a result dataframe which looks like this, where the value field contains the count of sender_user_id and the date shows the starting date of each week:
date value
0 2016-12-09 20
1 2016-12-16 36
2 2016-12-23 56
3 2016-12-30 32
Thanks in advance for the help.
I think you need to resample by week and aggregate with size:
#cast to datetime if necessary
df.created = pd.to_datetime(df.created)
print (df.resample('W', on='created').size().reset_index(name='value'))
created value
0 2016-12-25 4
If you need another offset:
df.created = pd.to_datetime(df.created)
print (df.resample('W-FRI', on='created').size().reset_index(name='value'))
created value
0 2016-12-23 4
If you need the number of unique values per week, aggregate with nunique:
df.created = pd.to_datetime(df.created)
print (df.resample('W-FRI', on='created')['sender_user_id'].nunique()
.reset_index(name='value'))
created value
0 2016-12-23 3
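The question also asked for the starting date of each week rather than the end. A small sketch, assuming the default 'W' (bins ending on Sunday; adjust the anchor, e.g. 'W-FRI', to match your week definition), is to subtract six days from the resampled labels:
# relabel each weekly bin by its start date instead of its end date
weekly = df.resample('W', on='created').size().reset_index(name='value')
weekly['created'] = weekly['created'] - pd.Timedelta(days=6)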

data cleaning a python dataframe

I have a Python dataframe with 1408 lines of data. My goal is to compare the largest and smallest numbers associated with a given weekday during one week to the next week's number on the same day of the week on which the prior largest/smallest occurred. Essentially, I want to look at quintiles (since there are 5 days in a business week), ranks 1 and 5, and see how they change from week to week, building a cdf of the numbers associated with each weekday.
To clean the data, I need to remove 18 weeks in total: every week in the dataframe associated with a holiday, plus the entire week following the holiday.
After this, I think I should insert a column in the dataframe that labels all my data with Monday through Friday for all the dates in the file (there are 6 years of data). The reason for labeling M-F is so that I can sort the numbers associated with each day of the week in ascending order, and query on the day of the week.
Methodological suggestions on either 1. or 2. or both would be immensely appreciated.
Thank you!
#2 seems like it's best tackled with a combination of df.groupby() and apply() on the resulting Groupby object. Perhaps an example is the best way to explain.
Given a dataframe:
In [53]: df
Out[53]:
Value
2012-08-01 61
2012-08-02 52
2012-08-03 89
2012-08-06 44
2012-08-07 35
2012-08-08 98
2012-08-09 64
2012-08-10 48
2012-08-13 100
2012-08-14 95
2012-08-15 14
2012-08-16 55
2012-08-17 58
2012-08-20 11
2012-08-21 28
2012-08-22 95
2012-08-23 18
2012-08-24 81
2012-08-27 27
2012-08-28 81
2012-08-29 28
2012-08-30 16
2012-08-31 50
In [54]: def rankdays(df):
   .....:     if len(df) != 5:
   .....:         return pandas.Series()
   .....:     return pandas.Series(df.Value.rank(), index=df.index.weekday)
   .....:
In [52]: df.groupby(lambda x: x.week).apply(rankdays).unstack()
Out[52]:
0 1 2 3 4
32 2 1 5 4 3
33 5 4 1 2 3
34 1 3 5 2 4
35 2 5 3 1 4
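For #1, here is a rough sketch, assuming you supply the holiday dates yourself (the holidays list below is hypothetical): collect the ISO (year, week) pairs of each holiday week plus the following week, and mask those rows out:
import pandas as pd

# hypothetical holiday list; replace with the real dates for your 6 years
holidays = pd.to_datetime(['2012-09-03', '2012-11-22'])

bad_weeks = set()
for d in holidays:
    y, w, _ = d.isocalendar()
    bad_weeks.add((y, w))
    bad_weeks.add((y, w + 1))  # naive "week after"; ignores year-end rollover

iso = df.index.isocalendar()  # DataFrame of ISO year/week/day, pandas >= 1.1
mask = [pair not in bad_weeks for pair in zip(iso.year, iso.week)]
clean = df[mask]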
