pandas Groupby MonthStart with two business days offset - python

I have a DataFrame that is indexed by date and has daily data.
I wish to group and aggregate this data by calendar month start minus two business days. My idea is to use groupby with a MonthBegin offset minus a 2-day BDay offset to achieve this.
When I try to run the code
import pandas as pd
import pandas.tseries.offsets as of
days = of.MonthBegin() - of.BDay(2)
g = df.groupby(pd.Grouper(freq=days, level='Date')).sum()
I get an error
TypeError: Argument 'other' has incorrect type (expected
datetime.datetime, got BusinessDay)
Perhaps I need to use the rollback method on MonthBegin, but when I try
days = of.MonthBegin()
days.rollback(of.BDay(2))
g_df = df.groupby(pd.Grouper(freq=days, level='Date')).sum()
TypeError: Cannot convert input [<2 * BusinessDays>] of type to Timestamp
Does anyone have any ideas on how to correctly use the offsets to group by MonthBegin - 2 * BDay?

It is hard to tell what you want to achieve without any of your data, but here is how you could do it:
df = pd.DataFrame({"dates": ["2018-01-02", "2018-01-03", "2018-02-02", "2018-01-04"],
"vals": [10, 20, 10, 5]})
df.groupby((pd.to_datetime(df.dates) - of.MonthBegin() - of.BDay(2)).dt.month).vals.sum()
Output:
dates
1 10
12 35
Name: vals, dtype: int64
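Note that grouping on .dt.month alone collapses the same month across different years. A minimal sketch (assuming the same dates/vals columns and the pd/of imports from the question) that keeps the year by grouping on a monthly period of the shifted dates instead:
# shift each date back by MonthBegin and two business days, then group by year-month period
shifted = pd.to_datetime(df.dates) - of.MonthBegin() - of.BDay(2)
df.groupby(shifted.dt.to_period('M')).vals.sum()
Here each group label is a year-month period such as 2017-12, so December 2017 and December 2018 would not be merged into one bucket.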

Related

Problem using Groupby in Python for date time. How to make a bar plot with Month/Year?

I have the following data set:
df
OrderDate Total_Charged
7/9/2017 5
7/9/2017 5
7/20/2017 10
8/20/2017 6
9/20/2019 1
...
I want to make a bar plot with month/year on the X-axis and the total charged per month/year, i.e. summed over month and year. First, I want to group by month and year, and then make the plot. However, I get an error on the first step:
df["OrderDate"]=pd.to_datetime(df['OrderDate'])
monthly_orders=df.groupby([(df.index.year),(df.index.month)]).sum()["Total_Charged"]
I got the following error:
AttributeError: 'RangeIndex' object has no attribute 'year'
What am I doing wrong (what does the error mean)? How can I fix it?
Not sure why you're grouping by the index there. If you want to group by year and month respectively, you could do the following:
df["OrderDate"]=pd.to_datetime(df['OrderDate'])
df.groupby([df.OrderDate.dt.year, df.OrderDate.dt.month]).sum().plot.bar()
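Alternatively, if you prefer to keep the index-based grouping from the question (the error occurs because the default RangeIndex has no year attribute), a minimal sketch that first makes OrderDate the index:
# parse the dates and move them into the index so df.index.year / .month work
df["OrderDate"] = pd.to_datetime(df["OrderDate"])
df = df.set_index("OrderDate")
monthly_orders = df.groupby([df.index.year, df.index.month])["Total_Charged"].sum()
monthly_orders.plot.bar()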
pandas.DataFrame.resample
This is a versatile option that easily implements aggregation over various time ranges (e.g. weekly, daily, quarterly, etc.).
Code:
A more expansive dataset:
This code block sets up the sample dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
# list of dates
first_date = datetime(2017, 1, 1)
last_date = datetime(2019, 9, 20)
x = 4
list_of_dates = [date for date in np.arange(first_date, last_date, timedelta(days=x)).astype(datetime)]
df = pd.DataFrame({'OrderDate': list_of_dates,
                   'Total_Charged': [np.random.randint(10) for _ in range(len(list_of_dates))]})
Using resample for Monthly Sum:
This requires a DatetimeIndex:
df.OrderDate = pd.to_datetime(df.OrderDate)
df.set_index('OrderDate', inplace=True)
monthly_sums = df.resample('M').sum()
monthly_sums.plot.bar(figsize=(8, 6))
plt.show()
An example with Quarterly Avg:
This shows the versatility of resample compared to groupby;
quarterly aggregation would not be as easily implemented with groupby.
quarterly_avg = df.resample('Q').mean()
quarterly_avg.plot.bar(figsize=(8, 6))
plt.show()
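For completeness, the same pattern covers the other frequencies mentioned above; for example, a weekly sum is a one-liner with the 'W' alias (a sketch using the same DataFrame):
weekly_sums = df.resample('W').sum()
weekly_sums.plot.bar(figsize=(8, 6))
plt.show()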

Convert Q1-Q4 period strings to datetime using pandas

Here is a sample with date format:
data = pd.DataFrame({'Quarter':['Q1_01','Q2_01', 'Q3_01', 'Q4_01', 'Q1_02','Q2_02']
, 'Sale' :[10, 20, 30, 40, 50, 60]})
print(data)
# Quarter Sale
#0 Q1_01 10
#1 Q2_01 20
#2 Q3_01 30
#3 Q4_01 40
#4 Q1_02 50
#5 Q2_02 60
print(data.dtypes)
# Quarter object
# Sale int64
I would like to convert the Quarter column into a pandas datetime format like
'Jan-2001' or '01-2001' that can be used in fbProphet for time series analysis.
I tried using strptime but got an error: TypeError: strptime() argument 1 must be str, not Series
from datetime import datetime
data['Quarter'] = datetime.strptime(data['Quarter'], 'Q%q_%y')
What is the cause of the error? Is there a better solution?
It helps to know the format to_datetime needs in order to parse quarters (it is along the lines of YYYY-QX), so we start with replace, then to_datetime, and finally strftime:
u = data.Quarter.str.replace(r'(Q\d)_(\d+)', r'20\2-\1', regex=True)
pd.to_datetime(u).dt.strftime('%b-%Y')
0 Jan-2001
1 Apr-2001
2 Jul-2001
3 Oct-2001
4 Jan-2002
5 Apr-2002
Name: Quarter, dtype: object
The month represents the start of its respective quarter.
If the dates can range across the 90s and the 2000s, then let's try something different:
df = pd.DataFrame({'Quarter':['Q1_98','Q2_99', 'Q3_01', 'Q4_01', 'Q1_02','Q2_02']})
dt = pd.to_datetime(df.Quarter.str.replace(r'(Q\d)_(\d+)', r'\2-\1', regex=True))
(dt.where(dt <= pd.to_datetime('today'), dt - pd.DateOffset(years=100))
.dt.strftime('%b-%Y'))
0 Jan-1998
1 Apr-1999
2 Jul-2001
3 Oct-2001
4 Jan-2002
5 Apr-2002
Name: Quarter, dtype: object
pd.to_datetime auto-parses "98" as "2098", so we do a little fix to subtract 100 years from dates later than "today's date".
This hack will stop working in a few decades. Ye pandas gods, have mercy on my soul :-)
Another option is parsing to PeriodIndex:
(pd.PeriodIndex(data.Quarter.str.replace(r'(Q\d)_(\d+)', r'20\2-\1', regex=True), freq='Q')
.strftime('%b-%Y'))
# Index(['Mar-2001', 'Jun-2001', 'Sep-2001',
# 'Dec-2001', 'Mar-2002', 'Jun-2002'], dtype='object')
Here, the months printed out are at the ends of their respective quarters. You decide what to use.
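If an actual datetime column is needed (for example a ds column to feed fbProphet), a minimal sketch (assuming the original data frame from the question) that converts the periods to quarter-start timestamps:
periods = pd.PeriodIndex(data.Quarter.str.replace(r'(Q\d)_(\d+)', r'20\2-\1', regex=True), freq='Q')
data['ds'] = periods.to_timestamp()  # quarter-start dates, e.g. 2001-01-01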

Python datetime delta format

I am attempting to find records in my dataframe that are 30 days old or older. I pretty much have everything working, but I need to correct the format of the Age column. Almost everything in the program is stuff I found on Stack Overflow, but I can't figure out how to change the format of the timedelta that is returned.
import pandas as pd
import datetime as dt
file_name = '/Aging_SRs.xls'
sheet = 'All'
df = pd.read_excel(io=file_name, sheet_name=sheet)
df.rename(columns={'SR Create Date': 'Create_Date', 'SR Number': 'SR'}, inplace=True)
tday = dt.date.today()
tdelta = dt.timedelta(days=30)
aged = tday - tdelta
df = df.loc[df.Create_Date <= aged, :]
# Sets the SR as the index.
df = df.set_index('SR', drop = True)
# Created the Age column.
df.insert(2, 'Age', 0)
# Calculates the days between the Create Date and Today.
df['Age'] = df['Create_Date'].subtract(tday)
The calculation in the last line above gives me the result, but it looks like -197 days +09:39:12 and I need it to just be a positive number 197. I have also tried to search using the python, pandas, and datetime keywords.
df.rename(columns={'Create_Date': 'SR Create Date'}, inplace=True)
writer = pd.ExcelWriter('output_test.xlsx')
df.to_excel(writer)
writer.save()
I can't see your example data, but IIUC and you're just trying to get the absolute value of the number of days of a timedelta, this should work:
df['Age'] = abs(df['Create_Date'].subtract(tday).dt.days)
Explanation:
Given a dataframe with a timedelta column:
>>> df
delta
0 26523 days 01:57:59
1 -1601 days +01:57:59
You can extract just the number of days as an int using dt.days:
>>> df['delta'].dt.days
0 26523
1 -1601
Name: delta, dtype: int64
Then, all you need to do is wrap that in a call to abs to get the absolute value of that int:
>>> abs(df.delta.dt.days)
0 26523
1 1601
Name: delta, dtype: int64
Here is what I worked out for basically the same issue.
# create timestamp for today, normalize to 00:00:00
today = pd.to_datetime('today').normalize()
# match timezone with datetimes in df so subtraction works
today = today.tz_localize(df['posted'].dt.tz)
# create 'age' column for days old
df['age'] = (today - df['posted']).dt.days
Pretty much the same as the answer above, but without the call to abs().

pandas difference between 2 dates

I am trying to find the day difference between today and the dates in my dataframe.
Below is my conversion of dates in my dataframe
df['Date']=pd.to_datetime(df['Date'])
Below is my code to get today
today1=dt.datetime.today().strftime('%Y-%m-%d')
today1=pd.to_datetime(today1)
Both are converted with pandas.to_datetime, but when I do the subtraction, the error below comes out.
ValueError: Cannot add integral value to Timestamp without offset.
Can someone help to advise? Thanks!
This is a simple example of how you can do this:
import pandas as pd
import datetime as dt
First, you have to get today.
today1=dt.datetime.today().strftime('%Y-%m-%d')
today1=pd.to_datetime(today1)
Then, you can construct the data frame:
df = pd.DataFrame({'Date': '2016-11-24 11:03:10.050000', 'today1': today1}, index=[0])
In this example I just have 2 columns, each with one value.
Next, you should check the data types:
print(df.dtypes)
Date datetime64[ns]
today1 datetime64[ns]
If both data types are datetime64[ns], you can then subtract df.Date from df.today1.
print(df.today1 - df.Date)
The output:
0 19 days 12:56:49.950000
dtype: timedelta64[ns]
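Since the question asks for the difference in days specifically, the resulting timedelta can be reduced to an integer with .dt.days (a small sketch using the same frame):
day_diff = (df.today1 - df.Date).dt.days
print(day_diff)
# 0    19
# dtype: int64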

Selecting Data from Last Week in Python

I have a large database and I am looking to read only the last week of data for my Python code.
My first problem is that the column with the received date and time is not in a format pandas recognizes as datetime. My input (Column 15) looks like this:
recvd_dttm
1/1/2015 5:18:32 AM
1/1/2015 6:48:23 AM
1/1/2015 13:49:12 PM
From the Time Series / Date functionality in the pandas library, I am looking at basing my code on the Week() function shown in the example below:
In [87]: d
Out[87]: datetime.datetime(2008, 8, 18, 9, 0)
In [88]: d - Week()
Out[88]: Timestamp('2008-08-11 09:00:00')
I have tried ordering the date this way:
df =pd.read_csv('MYDATA.csv')
orderdate = datetime.datetime.strptime(df['recvd_dttm'], '%m/%d/%Y').strftime('%Y %m %d')
However, I am getting this error:
TypeError: must be string, not Series
Does anyone know a simpler way to do this, or how to fix this error?
Edit: The dates are not necessarily in order. Also, sometimes there is a faulty entry in the database, like a date of 9/03/2015 (in the future) that someone mistyped. I need to be able to ignore those.
import datetime as dt
# convert strings to datetimes
df['recvd_dttm'] = pd.to_datetime(df['recvd_dttm'])
# get first and last datetime for final week of data
range_max = df['recvd_dttm'].max()
range_min = range_max - dt.timedelta(days=7)
# take slice with final week of data
sliced_df = df[(df['recvd_dttm'] >= range_min) &
(df['recvd_dttm'] <= range_max)]
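To address the edit about mistyped future dates, one option is to anchor the week to today rather than the data's maximum, so future entries fall outside the slice (a sketch, assuming the same column name):
now = pd.Timestamp.now()
week_ago = now - pd.Timedelta(days=7)
recent_df = df[(df['recvd_dttm'] >= week_ago) & (df['recvd_dttm'] <= now)]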
You can also convert the dates by iterating over them in a list comprehension:
orderdate = [datetime.datetime.strptime(ttm, '%m/%d/%Y').strftime('%Y %m %d') for ttm in list(df['recvd_dttm'])]
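Note that the sample strings also contain a time component, which the '%m/%d/%Y' format alone would not consume. A sketch (the format string is an assumption based on the sample rows) using pd.to_datetime with an explicit format and errors='coerce', so rows that do not match become NaT and can be dropped:
df['recvd_dttm'] = pd.to_datetime(df['recvd_dttm'], format='%m/%d/%Y %I:%M:%S %p', errors='coerce')
df = df.dropna(subset=['recvd_dttm'])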
